Neural Computation, Volume 12, Number 1

CONTENTS

Articles
Correctness of Local Probability Propagation in Graphical Models with Loops. Yair Weiss. 1
Population Dynamics of Spiking Neurons: Fast Transients, Asynchronous States, and Locking. Wulfram Gerstner. 43
Dynamics of Strongly Coupled Spiking Neurons. Paul C. Bressloff and S. Coombes. 91

Notes
On Connectedness: A Solution Based on Oscillatory Correlation. DeLiang L. Wang. 131
Practical Identifiability of Finite Mixtures of Multivariate Bernoulli Distributions. Miguel Á. Carreira-Perpiñán and Steve Renals. 141

Letters
The Effects of Pair-wise and Higher-order Correlations on the Firing Rate of a Postsynaptic Neuron. S. M. Bohte, H. Spekreijse, and P. R. Roelfsema. 153
Effects of Spike Timing on Winner-Take-All Competition in Model Cortical Circuits. Erik D. Lumer. 181
Model Dependence in Quantification of Spike Interdependence by Joint Peri-Stimulus Time Histogram. Hiroyuki Ito and Satoshi Tsuji. 195
Reinforcement Learning in Continuous Time and Space. Kenji Doya. 219
ARTICLE
Communicated by Stuart Russell
Correctness of Local Probability Propagation in Graphical Models with Loops

Yair Weiss
Department of Brain and Cognitive Sciences, MIT, Cambridge, MA 02139, U.S.A.
Graphical models, such as Bayesian networks and Markov networks, represent joint distributions over a set of variables by means of a graph. When the graph is singly connected, local propagation rules of the sort proposed by Pearl (1988) are guaranteed to converge to the correct posterior probabilities. Recently a number of researchers have empirically demonstrated good performance of these same local propagation schemes on graphs with loops, but a theoretical understanding of this performance has yet to be achieved. For graphical models with a single loop, we derive an analytical relationship between the probabilities computed using local propagation and the correct marginals. Using this relationship we show a category of graphical models with loops for which local propagation gives rise to provably optimal maximum a posteriori assignments (although the computed marginals will be incorrect). We also show how nodes can use local information in the messages they receive in order to correct their computed marginals. We discuss how these results can be extended to graphical models with multiple loops and show simulation results suggesting that some properties of propagation on single-loop graphs may hold for a larger class of graphs. Specifically we discuss the implication of our results for understanding a class of recently proposed error-correcting codes known as turbo codes.

1 Introduction

Problems involving probabilistic belief propagation arise in a wide variety of applications, including error-correcting codes, speech recognition, and medical diagnosis. Typically, a probability distribution is assumed over a set of variables, and the task is to infer the values of the unobserved variables given the observed ones. The assumed probability distribution is described using a graphical model (Lauritzen, 1996); the qualitative aspects of the distribution are specified by a graph structure.
Figure 1 shows two examples of such graphical models: a Bayesian network (Pearl, 1988; Jensen, 1996) (see Figure 1a) and Markov networks (Pearl, 1988; Geman & Geman, 1984) (see Figures 1b–c).

Neural Computation 12, 1–41 (2000)
© 2000 Massachusetts Institute of Technology
Figure 1: Examples of graphical models. Nodes represent variables, and the qualitative aspects of the joint probability function are represented by properties of the graph. Shaded nodes represent observed variables. (a) Singly connected Bayesian network. (b) Singly connected Markov network (A and F are observed). (c) Markov network with a loop. This article analyzes the behavior of local propagation rules in graphical models with a loop.
If the graph is singly connected (there is only one path between any two given nodes), then there exist efficient local message-passing schemes to calculate the posterior probability of an unobserved variable given the observed variables. Pearl (1988) derived such a scheme for singly connected Bayesian networks and showed that this "belief propagation" algorithm is guaranteed to converge to the correct posterior probabilities (or "beliefs"). However, as Pearl noted, the same algorithm will not give the correct beliefs for multiply connected networks:

When loops are present, the network is no longer singly connected and local propagation schemes will invariably run into trouble. ... If we ignore the existence of loops and permit the nodes to continue communicating with each other as if the network were singly connected, messages may circulate indefinitely around the loops and the process may not converge to a stable equilibrium. ... Such oscillations do not normally occur in probabilistic networks ... which tend to bring all messages to some stable equilibrium as time goes on. However, this asymptotic equilibrium is not coherent, in the sense that it does not represent the posterior probabilities of all nodes of the network. (p. 195)

Despite these reservations, Pearl advocated the use of belief propagation in loopy networks as an approximation scheme (J. Pearl, personal communication), and one of the exercises in Pearl (1988) investigates the quality of the approximation when it is applied to a particular loopy belief network.
Several groups (Frey, 1998; MacKay & Neal, 1995; Weiss, 1996) have recently reported excellent experimental results by running algorithms equivalent to Pearl's algorithm on networks with loops. Perhaps the most dramatic instance of this performance is in an error-correcting code scheme known as turbo codes (Berrou, Glavieux, & Thitimajshima, 1993), described as "the most exciting and potentially important development in coding theory in many years" (McEliece, Rodemich, & Cheng, 1995); turbo codes have recently been shown to use an algorithm equivalent to belief propagation in a network with loops (Wiberg, 1996; Kschischang & Frey, 1998; McEliece, MacKay, & Cheng, 1998). Although there is widespread agreement in the coding community that these codes "represent a genuine, and perhaps historic, breakthrough" (McEliece et al., 1995), a theoretical understanding of their performance has yet to be achieved.

Here we lay a foundation for an understanding of belief propagation in networks with loops. For networks with a single loop, we derive an analytic expression relating the steady-state beliefs to the correct posterior probabilities. We discuss how these results can be extended to networks with multiple loops and show simulation results for networks with multiple loops. Specifically, we discuss the implication of our results for understanding the performance of turbo codes.

2 Formulation: Bayes Nets, Markov Nets, and Belief Propagation

Bayesian networks and Markov networks are graphs that represent the qualitative nature of a distribution over a set of variables; the structure of the graph implies a product form for the distribution. See Pearl (1988, chap. 3) for a more exhaustive treatment. A Bayesian network is a directed acyclic graph in which the nodes represent variables, the arcs signify the existence of direct influences among the linked variables, and the strength of these influences is expressed by forward conditional probabilities.
Using the chain rule, the full probability distribution over the variables can be expressed as a product of conditional and prior probabilities. Thus, for example, for the network in Figure 1a, we can write:

P(A, B, C, D, E, F) = P(A) P(B | A) P(C) P(D | B, C) P(E | C).   (2.1)
A Markov network is an undirected graph in which the nodes represent variables and arcs represent compatibility constraints between them. Assuming that all probabilities are nonzero, the Hammersley-Clifford theorem (Pearl, 1988) guarantees that the probability distribution will factorize into a product of functions of the maximal cliques of the graph. Thus, for example, in the network in Figure 1c we have:

P(A, B, C, D) = (1/Z) Ψ(A, B) Ψ(A, C) Ψ(C, D) Ψ(D, B).   (2.2)
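To make the factorization concrete, the following sketch tabulates the joint of equation 2.2 by exhaustive enumeration over the four-cycle of Figure 1c. The potentials are random positive tables; all names and values here are hypothetical, chosen only for illustration.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
n_states = 2
# One random positive pairwise potential per edge of the loop A-B, A-C, C-D, D-B.
edges = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "B")]
psi = {e: rng.uniform(0.1, 1.0, size=(n_states, n_states)) for e in edges}

def unnormalized(a, b, c, d):
    """Product of the four clique potentials for one joint state (a, b, c, d)."""
    return (psi[("A", "B")][a, b] * psi[("A", "C")][a, c]
            * psi[("C", "D")][c, d] * psi[("D", "B")][d, b])

states = list(itertools.product(range(n_states), repeat=4))
Z = sum(unnormalized(*s) for s in states)          # partition function
P = {s: unnormalized(*s) / Z for s in states}      # the joint of equation 2.2
print(abs(sum(P.values()) - 1.0) < 1e-12)
```

Dividing by Z is exactly the 1/Z factor in equation 2.2; the enumeration is exponential in the number of variables, which is what the message-passing schemes below avoid.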
Different communities tend to prefer different graphical models (see Smyth, 1997, for a recent review). Directed graphs are more common in artificial intelligence, medical diagnosis, and statistics, and undirected graphs are more common in image processing, statistical physics, and error-correcting codes. In both cases, however, a full specification of the probability distribution involves specifying the graph and numerical values for the conditional probabilities (in Bayes nets) or the clique potentials (in Markov nets). Given a full specification of the probability, the inference task in graphical models is simply to infer the values of the unobserved variables given the observed ones. Figure 1b shows an example. The observed nodes A, F are represented by shaded circles in the graph. We use O to denote the set of observed variables. There are actually three distinct subtasks for the inference problem:

Marginalization: Calculating the marginal probability of a variable given the observed variables, for example, P(C = c | O).

Maximum a posteriori (MAP) assignment: Finding assignments to the unobserved variables that are most probable given the observed variables, for example, finding (b, c, d, e) such that P(B = b, C = c, D = d, E = e | O) is maximized.

Maximum marginal (MM) assignment: Finding assignments to the unobserved variables that maximize the marginal probability of the assignment, for example, finding (b, c, d, e) such that P(B = b | O), P(C = c | O), P(D = d | O), P(E = e | O) are all maximized.

2.1 Message-Passing Algorithms for Graphical Models. For singly connected graphical models, there exist various message-passing schemes for performing inference (see Smyth, Heckerman, & Jordan, 1997, for a review). Here we present such a scheme for pairwise Markov nets, ones in which the maximal cliques are pairs of units. This choice is not as restrictive as it may seem.
First, any singly connected Markov net is a pairwise Markov network, and a Markov network with larger cliques can be converted into a pairwise Markov network by merging larger cliques into cluster nodes. Furthermore, as we show in the appendix, any Bayesian network can be converted into a pairwise Markov net. When this conversion is performed, the update rules given here reduce to Pearl's original algorithm for inference in Bayesian networks. The main advantage of the pairwise Markov net formulation is that it enables us to write the message-passing scheme in terms of matrix and vector operations, and this makes the subsequent analysis simpler.

To derive the message-passing algorithm, consider first the naive way of performing inference: by exhaustive enumeration. Referring again to
Figure 1b, calculating the marginal of C merely involves computing:

P(C = c | A = a, F = f) = α Σ_{b,d,e} P(a, b, c, d, e, f),   (2.3)

where α is a normalizing factor. This exhaustive enumeration is, of course, exponential in the number of unobserved nodes, but for the particular factorized distributions of Markov nets, we can write:

P(C = c | A = a, F = f) = α Σ_b Σ_d Σ_e Ψ(a, b) Ψ(b, c) Ψ(c, d) Ψ(d, f) Ψ(d, e)   (2.4)
  = α (Σ_b Ψ(a, b) Ψ(b, c)) (Σ_d Ψ(c, d) Ψ(d, f) Σ_e Ψ(d, e)).   (2.5)
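The saving can be checked numerically. The sketch below uses random potentials for the structure of Figure 1b (three states per variable; all names and values hypothetical) and compares the brute-force sum of equation 2.3 with the factored form of equation 2.5:

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
S = 3  # states per variable (arbitrary choice)
# Potentials for Figure 1b: chain A-B-C-D, with D-E and D-F hanging off D.
psi_ab = rng.uniform(size=(S, S)); psi_bc = rng.uniform(size=(S, S))
psi_cd = rng.uniform(size=(S, S)); psi_df = rng.uniform(size=(S, S))
psi_de = rng.uniform(size=(S, S))
a_obs, f_obs = 0, 2  # observed states of A and F

# Brute force: sum the full product over b, d, e (equation 2.3).
brute = np.zeros(S)
for c in range(S):
    for b, d, e in itertools.product(range(S), repeat=3):
        brute[c] += (psi_ab[a_obs, b] * psi_bc[b, c] * psi_cd[c, d]
                     * psi_df[d, f_obs] * psi_de[d, e])
brute /= brute.sum()

# Factored form (equation 2.5): push each sum inside the product.
left = psi_ab[a_obs, :] @ psi_bc                      # sum over b
right = psi_cd @ (psi_df[:, f_obs] * psi_de.sum(1))   # sums over d and e
factored = left * right
factored /= factored.sum()
print(np.allclose(brute, factored))
```

The two results agree exactly; the factored version replaces one sum over S³ joint states with three sums over S states each.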
Note that the exponential enumeration is now converted into a series of enumerations over each variable separately and that many of the terms (e.g., Σ_b Ψ(a, b) Ψ(b, c)) are equivalent to matrix multiplication. This forms the basis for the message-passing algorithm. The messages that nodes transmit to each other are vectors, and we denote by v_{XY} the message that node X sends to node Y. We define the transition matrix of an edge in the graph by

M_{XY}(i, j) ≝ Ψ(X = i, Y = j).   (2.6)

Note that the matrix going in one direction on an edge is equal by definition to the transpose of the matrix going in the other direction: M_{XY} = M_{YX}^T. We denote by b_X the belief vector at node X. Following the terminology of Pearl (1988), we call the message-passing scheme for calculating marginals belief update. In belief update, the message that node X sends to node Y is updated as follows:

1. Combine all messages coming into X except for that coming from Y into a vector v. The combination is done by multiplying all the message vectors element by element.
2. Multiply v by the matrix M_{XY} corresponding to the link from X to Y.
3. Normalize the product M_{XY} v so it sums to 1. The normalized vector is sent to Y.

The belief vector for a node X is obtained by combining all incoming messages to X (again by multiplying the message vectors element by element) and normalizing.
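A direct transcription of this scheme for a small chain reproduces the exact marginals. The sketch below is not the article's own code; it uses random potentials, a parallel update schedule, and hypothetical names throughout.

```python
import itertools

import numpy as np

rng = np.random.default_rng(2)
N, S = 4, 3                                           # chain of 4 hidden nodes, 3 states each
M = [rng.uniform(size=(S, S)) for _ in range(N - 1)]  # M[i][x_i, x_{i+1}]: edge potential
ev = [rng.uniform(size=S) for _ in range(N)]          # constant message from each observed node

def send(i, j, v):
    """Multiply v by the transition matrix of the edge from node i to neighbor j."""
    return M[i].T @ v if j == i + 1 else M[j] @ v

# Belief update: all messages start at (1, ..., 1); nodes update in parallel.
msg = {(i, j): np.ones(S) for i in range(N) for j in (i - 1, i + 1) if 0 <= j < N}
for _ in range(N):                                    # a chain settles within N parallel sweeps
    new = {}
    for (i, j) in msg:
        v = ev[i].copy()
        for k in (i - 1, i + 1):                      # combine incoming messages except from j
            if 0 <= k < N and k != j:
                v = v * msg[(k, i)]
        out = send(i, j, v)
        new[(i, j)] = out / out.sum()                 # normalize for numerical stability
    msg = new

beliefs = []
for i in range(N):   # belief: evidence times all incoming messages, normalized
    b = ev[i].copy()
    for k in (i - 1, i + 1):
        if 0 <= k < N:
            b = b * msg[(k, i)]
    beliefs.append(b / b.sum())

# Exact posterior marginals by exhaustive enumeration, for comparison.
joint = np.zeros((S,) * N)
for idx in itertools.product(range(S), repeat=N):
    w = np.prod([ev[i][idx[i]] for i in range(N)])
    w *= np.prod([M[i][idx[i], idx[i + 1]] for i in range(N - 1)])
    joint[idx] = w
joint /= joint.sum()
print(np.allclose(beliefs[0], joint.sum(axis=(1, 2, 3))))
```

Because the chain is singly connected, the beliefs match the brute-force marginals at every node, as the convergence result below states.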
By introducing a symbol ⊙ for the componentwise multiplication operator, the updates can be rewritten:

v_{XY} ← α M_{XY} ⊙_{Z ∈ N(X)∖Y} v_{ZX}   (2.7)
b_X ← α ⊙_{Z ∈ N(X)} v_{ZX},   (2.8)

where z = x ⊙ y ⇔ z(i) = x(i) y(i). Throughout this article, α v denotes normalizing the components of v so they sum to 1. The procedure is initialized with all message vectors set to (1, 1, …, 1). Observed nodes do not receive messages, and they always transmit the same vector: if X is observed to be in state x, then v_{XY}(i) = Ψ(X = x, Y = i).

The normalization of v_{XY} in equation 2.7 is not theoretically necessary; whether the messages are normalized or not, the belief vector b_X will be identical. Thus one could use an equivalent algorithm in which the messages are not normalized. However, as Pearl (1988) has pointed out, normalizing the messages avoids numerical underflow and adds to the stability of the algorithm. Equation 2.7 does not specify the order in which the messages are updated. For simplicity, we assume throughout this article that all nodes simultaneously update their messages in parallel.

For a singly connected network, the belief update equations (equations 2.7–2.8) solve two of the three inference tasks mentioned earlier. It can be shown that the belief vectors reach a steady state after a number of iterations equal to the length of the longest path in the graph. Furthermore, at convergence, the belief vectors are equal to the posterior marginal probabilities. Thus for singly connected networks, b_X(j) = P(X = j | O). We define the belief update (BU) assignment by assigning each variable X the state j that maximizes b_X(j). Since b_X converges to the correct posteriors for a singly connected network, the BU assignment is guaranteed to give the MM assignment. Thus two of the three inference tasks mentioned earlier can be solved using the belief update procedure.

To calculate the MAP assignment, however, calculating the posterior marginals is insufficient. In order to calculate the MAP assignment, the nodes need to calculate the posterior when the other unobserved nodes are maximized rather than summed.
In Figure 1b this corresponds to calculating

P*(C = c | A = a, F = f) = α max_{b,d,e} P(a, b, c, d, e, f)   (2.9)

(we use the notation P* to denote the fact that P* does not correspond to a marginal probability). In complete analogy to equation 2.4, the maximization over the unobserved variables can be replaced by a series of maximizations over individual variables. This leads to an alternative message-passing scheme that we call, following Pearl (1988), belief revision.
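On a singly connected chain, this replacement can be sanity-checked by swapping each matrix product for its max analog and comparing the resulting assignment with the brute-force MAP. The sketch below reuses the chain setup from earlier; potentials and names are hypothetical.

```python
import itertools

import numpy as np

rng = np.random.default_rng(3)
N, S = 4, 3                                           # a singly connected chain
M = [rng.uniform(size=(S, S)) for _ in range(N - 1)]  # M[i][x_i, x_{i+1}]: edge potential
ev = [rng.uniform(size=S) for _ in range(N)]          # local-evidence vectors

def max_apply(mat, v):
    """Matrix 'multiplication' with the sum replaced by a maximum."""
    return (mat * v).max(axis=1)

def send(i, j, v):
    mat = M[i].T if j == i + 1 else M[j]              # orient the edge potential toward j
    return max_apply(mat, v)

msg = {(i, j): np.ones(S) for i in range(N) for j in (i - 1, i + 1) if 0 <= j < N}
for _ in range(N):
    new = {}
    for (i, j) in msg:
        v = ev[i].copy()
        for k in (i - 1, i + 1):
            if 0 <= k < N and k != j:
                v = v * msg[(k, i)]
        out = send(i, j, v)
        new[(i, j)] = out / out.sum()
    msg = new

assignment = []
for i in range(N):
    b = ev[i].copy()
    for k in (i - 1, i + 1):
        if 0 <= k < N:
            b = b * msg[(k, i)]
    assignment.append(int(np.argmax(b)))

# Brute-force MAP assignment for comparison.
best, best_w = None, -1.0
for idx in itertools.product(range(S), repeat=N):
    w = np.prod([ev[i][idx[i]] for i in range(N)])
    w *= np.prod([M[i][idx[i], idx[i + 1]] for i in range(N - 1)])
    if w > best_w:
        best, best_w = idx, w
print(tuple(assignment) == best)
```

With continuous random potentials the MAP assignment is unique with probability one, so the per-node argmax of the max-product beliefs recovers it exactly.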
The procedure is almost identical to that described earlier for belief update. The only difference is that the matrix multiplication in equation 2.7 needs to be replaced with a nonlinear matrix operator. For a matrix M and a vector x, we define the operator M̃ such that y = M̃ x if

y(i) = max_j M(i, j) x(j).   (2.10)

Note that the M̃ operator is similar to regular matrix multiplication, with the sum replaced with the maximum operator. The belief revision update rules are

v_{XY} ← α M̃_{XY} ⊙_{Z ∈ N(X)∖Y} v_{ZX}   (2.11)
b_X ← α ⊙_{Z ∈ N(X)} v_{ZX},   (2.12)

with initialization identical to belief update. Again, it can be shown that for singly connected networks, the belief vector b_X(i) at a node will converge to the modified belief P*(X = i). We define the belief revision (BR) assignment as assigning each node X the value j that maximizes b_X(j) after belief revision has converged. From the definition of P*, it follows that the BR assignment is equal to the MAP assignment for singly connected networks. Thus, all three inference tasks can be accomplished by local message passing.

Update rules such as equations 2.7–2.8 and equations 2.11–2.12 have appeared in many areas of applied mathematics and optimization. In the hidden Markov model literature, the belief update equations (2.7–2.8) are equivalent to the forward-backward algorithm, and the BR equations (2.11–2.12) are equivalent to the Viterbi algorithm (Rabiner, 1989). In a general optimization context, equations 2.11–2.12 are a distributed implementation of standard dynamic programming (Bertsekas, 1987). If the nodes represent continuous gaussian random variables, equations 2.7–2.8 are equivalent to the Kalman filter and optimal smoothing (Gelb, 1974). In error-correcting codes, both belief revision and belief update can be thought of as special cases of a general class of decoding algorithms for codes defined on graphs (Kschischang & Frey, 1998; Forney, 1997; Aji & McEliece, in press).

In nearly all the contexts surveyed above, belief propagation has been analyzed only for singly connected graphs. Note, however, that these procedures are perfectly well defined for any pairwise Markov network. This raises questions, including: How far is the steady-state belief from the correct posterior when the update rules (equations 2.7–2.8) are applied in a loopy network?
Figure 2: Intuition behind loopy belief propagation. In singly connected networks (a), the local update rules avoid double counting of evidence. In multiply connected networks (b), double counting is unavoidable. However, as we show in the text, certain structures lead to equal double counting.
What are the conditions under which the BU assignment equals the MM assignment when the update rules are applied in a loopy network? What are the conditions under which the BR assignment equals the MAP assignment when the update rules are applied in a loopy network? Answering these questions is the goal of this article.

3 Intuition: Why Does Loopy Propagation Work?

Before launching into the details of the analysis, let us consider the examples in Figure 2. In both graphs, there is an observed node connected to every unobserved node; we will refer to the observed node connected to X as the local evidence at X. Intuitively, the message-passing algorithm can be thought of as a way of communicating local evidence between nodes such that all nodes calculate their beliefs given all the evidence.

As Pearl (1988) has pointed out, in order for a message-passing scheme to be successful, it needs to avoid double counting: a situation in which the same evidence is passed around the network multiple times and mistaken for new evidence. In a singly connected network (such as Figure 2a) this is accomplished by update rules such as those in equation 2.7. Thus in Figure 2a, node B will receive from A a message that involves the local evidence at A and send that information to C. The message it sends to A will involve the local evidence at C but not the local evidence at A. Thus, A never receives its own evidence back again, and double counting is avoided.
In a loopy graph (such as Figure 2b), double counting cannot be avoided. Thus, B will send A's evidence to C, but in the next iteration, C will send that same information back to A. Thus, it seems that belief propagation in such a net will invariably give the wrong answer. How, then, can we explain the good performance reported experimentally? Intuitively, the explanation is that double counting may still lead to correct inference if all evidence is double counted in equal amounts. Because nodes mistake existing evidence for new evidence, they are overly confident in their beliefs regarding which values should be assigned to the unobserved variables. However, if all evidence is equally double counted, the assignment, although based on overly confident beliefs, may still be correct. Indeed, as we show in this article, this intuition can be formalized for belief revision: for all networks with a single loop, the BR assignment gives the maximum a posteriori (MAP) assignment even though the numerical values of the beliefs are wrong.

The notion of equal double counting can be formalized by means of what we call the unwrapped network corresponding to a loopy network. The unwrapped network is a singly connected network constructed such that performing belief propagation in the unwrapped network is equivalent to performing belief propagation in the loopy network. The exact construction method of the unwrapped network is discussed in section 5, but the basic idea is to replicate the nodes and the transition matrices as shown in the examples in Figure 3. As we show (see also Weiss, 1996; MacKay & Neal, 1995; Wiberg, 1996; Frey, Koetter, & Vardy, 1998), for any number of iterations of belief propagation in the loopy network, there exists an unwrapped network such that the final messages received by a node in the unwrapped network are equivalent to those that would be received by a corresponding node in the loopy network. This concept is illustrated in Figure 3.
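The equivalence can be verified numerically. In the sketch below (a three-node loop with random potentials; all names and values hypothetical), the belief at a node after t parallel iterations of loopy propagation matches the exact marginal at the center of an unwrapped chain of 2t + 1 replicas:

```python
import itertools

import numpy as np

rng = np.random.default_rng(7)
N, S, t = 3, 2, 5                                   # loop size, states, iterations
psi = [rng.uniform(size=(S, S)) for _ in range(N)]  # psi[i][x_i, x_{i+1 mod N}]
d = [rng.uniform(size=S) for _ in range(N)]         # local-evidence vectors

# t parallel iterations of loopy belief update; messages initialized to ones.
cw = [np.ones(S) for _ in range(N)]   # cw[i]: message from node i to node i+1
ccw = [np.ones(S) for _ in range(N)]  # ccw[i]: message from node i to node i-1
for _ in range(t):
    new_cw = [psi[i].T @ (d[i] * cw[(i - 1) % N]) for i in range(N)]
    new_ccw = [psi[(i - 1) % N] @ (d[i] * ccw[(i + 1) % N]) for i in range(N)]
    cw = [v / v.sum() for v in new_cw]
    ccw = [v / v.sum() for v in new_ccw]
loopy = d[0] * cw[N - 1] * ccw[1]
loopy /= loopy.sum()

# Exact marginal at the center of the unwrapped chain: 2t+1 replicas of the
# loop's nodes, with evidence and edge potentials copied along the chain.
nodes = [k % N for k in range(-t, t + 1)]
L = len(nodes)
marg = np.zeros(S)
for idx in itertools.product(range(S), repeat=L):
    w = np.prod([d[nodes[k]][idx[k]] for k in range(L)])
    w *= np.prod([psi[nodes[k]][idx[k], idx[k + 1]] for k in range(L - 1)])
    marg[idx[t]] += w
marg /= marg.sum()
print(np.allclose(loopy, marg))
```

The all-ones message initialization plays the role of the boundary of the unwrapped network: the leaves of the chain transmit evidence-only messages, exactly what the loopy nodes send on the first iteration.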
The messages received by node B after one, two, and three iterations of propagation in the loopy network are identical to the final messages received by node B in the unwrapped networks shown in Figure 3.

It seems that all we have done is to convert a finite, loopy problem into an infinite network without loops. What have we gained? The importance of the unwrapped network is that since it is singly connected, belief propagation on it is guaranteed to give the correct beliefs. Thus, belief propagation on the loopy network gives the correct answer for the unwrapped network. The usefulness of this estimate now depends on the similarity between the probability distribution induced by the unwrapped problem and the original loopy problem. In subsequent sections we formally compare the probability distribution induced by the loopy network to that induced by the original problem. Roughly speaking, if we denote the original probability induced on the hidden nodes in Figure 1b by

P(a, b, c) = α e^{−J(a,b,c)},   (3.1)
Figure 3: (a) Simple loopy network. (b–d) Unwrapped networks corresponding to the loopy network. The unwrapped networks are constructed by replicating the evidence and the transition matrices while preserving the local connectivity of the loopy network. They are constructed so that the messages received by node B after t iterations in the loopy network are equivalent to those that would be received by B in the unwrapped network. The unwrapped networks for the first three iterations are shown.
then the MAP assignment for the unwrapped problem maximizes

P̃(a, b, c) = α e^{−kJ(a,b,c)}   (3.2)

for some positive constant k. That is, if we are considering the probability of an assignment, the unwrapped problem induces a probability that is a monotonic transformation of the true probability. In statistical physics terms, the unwrapped network has the same energy function but at a different temperature. This is the intuitive reason that belief revision, which finds the MAP assignment in the unwrapped network, also finds the MAP assignment in the loopy network. However, belief update, which finds marginal distributions of particular nodes, may not find the MM assignment; the marginals of P̃ are not necessarily a monotonic transformation of the marginals of P.

Since every iteration of loopy propagation gives the correct beliefs for a different problem, it is not immediately clear why this scheme should ever converge. Note, however, that the unwrapped network of iteration t + 1 is simply the unwrapped network of size t with an additional finite number of nodes added at the boundary. Thus, loopy belief propagation will converge when the addition of these nodes at the boundary does not alter the posterior probability of the node in the center. In other words, convergence is equivalent to the independence of center nodes and boundary nodes in the unwrapped network. In the subsequent sections, we formalize this intuition.

4 Belief Update in Networks with a Single Loop

Consider a network composed of N unobserved nodes arranged in a loop and N observed nodes, one attached to each unobserved node (see Figure 4). We denote by U_1, U_2, …, U_N the unobserved variables and by O_1, O_2, …, O_N the corresponding observed variables. In this section we derive a relationship between the belief at a given node, b_{U_1}, and the correct posterior probability, which we denote by p_{U_1}. First, let us find what the messages received by node U_1 converge to.
By the belief update equations (see equation 2.7), the message that U_N sends to U_1 depends on the message U_N receives from U_{N−1}:

v_{U_N U_1} ← α M_{U_N U_1} (v_{O_N U_N} ⊙ v_{U_{N−1} U_N}),   (4.1)

and similarly the message U_{N−1} sends to U_N depends on the message U_{N−1} receives from U_{N−2}:

v_{U_{N−1} U_N} ← α M_{U_{N−1} U_N} (v_{O_{N−1} U_{N−1}} ⊙ v_{U_{N−2} U_{N−1}}).   (4.2)

We can continue expressing each message in terms of the one received from the neighbor until we go back in the loop to U_1:

v_{U_1 U_2} ← α M_{U_1 U_2} (v_{O_1 U_1} ⊙ v_{U_N U_1}).   (4.3)
Figure 4: Simple network with a single loop. We label the unobserved nodes U_1, …, U_N and the observed nodes O_1, …, O_N. We derive an analytical relationship between the belief vector at each node in the loop (e.g., b_{U_1}) and its correct marginal probability (e.g., p_{U_1}).
Thus, the message U_N sends to U_1 at a given time step depends on the message U_N sent to U_1 at a previous time step (N time steps ago):

v_{U_N U_1}^{(t+N)} = α C_{N1} v_{U_N U_1}^{(t)},   (4.4)

where the matrix C_{N1} summarizes all the operations performed on the message as it is passed around the loop,

C_{N1} = M_{U_N U_1} D_N M_{U_{N−1} U_N} D_{N−1} ⋯ M_{U_1 U_2} D_1,   (4.5)

and we define the matrix D_i to be a diagonal matrix whose elements are the constant messages sent from observed node O_i to unobserved node U_i. A diagonal matrix is used because it enables us to rewrite the componentwise multiplication of two vectors as a multiplication of a matrix and a vector: for any vector x, D_i x = v_{O_i U_i} ⊙ x. Using the matrix C_{N1} we can formally state the claims of this section.

Claims. Consider a pairwise Markov network consisting of a single loop of unobserved variables {U_i}_{i=1}^N, each of which is connected to an observed variable O_i. Define C_{N1} as in equation 4.5, where D_i is a diagonal matrix whose entries are the messages v_{O_i U_i}. If all elements of C_{N1} are nonzero, then:

1. v_{U_N U_1} converges to the principal eigenvector of C_{N1}.

2. v_{U_2 U_1} converges to the principal eigenvector of D_1^{−1} C_{N1}^T D_1.

3. The convergence rate of the messages is governed by the ratio of the largest eigenvalue of C_{N1} to the second-largest eigenvalue.
4. The diagonal elements of C_{N1} give the correct posteriors: p_{U_1}(i) = α C_{N1}(i, i).

5. The steady-state belief b_{U_1} is related to the correct posterior marginal p_{U_1} by b_{U_1} = β p_{U_1} + (1 − β) q_{U_1}, where β is the ratio of the largest eigenvalue of C_{N1} to the sum of all eigenvalues and q_{U_1} depends on the eigenvectors of C_{N1}.

The first claim follows from the recursion (see equation 4.4). Recursions of this type form the basis of the power method for finding eigenvalues of matrices (Strang, 1986). The method is based on the following lemma (Strang, 1986).

Power method lemma. Let C be a matrix with eigenvalues λ_1, …, λ_n sorted by decreasing magnitude and eigenvectors u_1, …, u_n. If |λ_1| > |λ_2|, then the recursion x^{(t+1)} = α C x^{(t)} converges to a multiple of u_1 from any initial vector x^{(0)} = Σ_i β_i u_i such that β_1 ≠ 0. The convergence factor is r = |λ_2 / λ_1| (the distance to the steady-state vector decreases by a factor of r at every iteration).

Proof. If v is a fixed point of the recursion (see equation 4.4), then v = α C_{N1} v, or C_{N1} v = (1/α) v, so that v is an eigenvector of C_{N1}. The power method lemma guarantees that in general the method will converge to the principal eigenvector, the one with the largest eigenvalue. It is not trivial to check for a given matrix and initial condition whether the assumptions of the power method lemma are satisfied. A sufficient (but not necessary) condition is that all elements of the matrix C and the initial vector x^{(0)} be positive. The Perron-Frobenius theorem (Minc, 1988) guarantees that for such matrices λ_1 > λ_2 and that C will have only one eigenvector whose elements are nonnegative: u_1. Thus for C with positive elements, the iterations will converge to α u_1 for any initial vector x^{(0)} with positive elements. If the matrix contains nonnegative values, conditions for guaranteed convergence from an arbitrary initial condition are more complicated (Minc, 1988).

In the context of graphical models, the matrix C_{N1} is a product of matrices with nonnegative elements (the elements are either conditional probabilities or potential functions), but it may contain zeros. Indeed, it is possible to construct examples of graphical models for which the messages do not converge. These examples, however, are extremely rare (e.g., when all the transition matrices are deterministic). To summarize, if the matrix C_{N1} has only positive elements, then v_{U_N U_1} converges to the principal eigenvector of C_{N1}.
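Claims 1 and 4 are straightforward to check numerically. The sketch below (a four-node loop with random positive potentials; indexing and names are hypothetical, with the loop matrix built for node 0 rather than U_1) forms the loop matrix as an explicit product, power-iterates it, and compares its diagonal with brute-force marginals:

```python
import itertools

import numpy as np

rng = np.random.default_rng(4)
N, S = 4, 3
psi = [rng.uniform(size=(S, S)) for _ in range(N)]  # psi[i][x_i, x_{i+1 mod N}]
d = [rng.uniform(size=S) for _ in range(N)]         # messages from each observed node

# C: the matrix applied to a message on one full trip around the loop back to node 0.
C = np.eye(S)
for i in range(N):
    C = psi[i].T @ np.diag(d[i]) @ C

# Claim 4: the normalized diagonal of C equals the true posterior at node 0.
joint = np.zeros((S,) * N)
for idx in itertools.product(range(S), repeat=N):
    w = np.prod([d[i][idx[i]] for i in range(N)])
    w *= np.prod([psi[i][idx[i], idx[(i + 1) % N]] for i in range(N)])
    joint[idx] = w
marg0 = joint.sum(axis=(1, 2, 3))
marg0 /= marg0.sum()
print(np.allclose(marg0, np.diag(C) / np.trace(C)))

# Claim 1: iterating the message recursion (the power method) converges to the
# principal eigenvector of C; by claim 3 the rate is |lambda_2 / lambda_1|.
v = np.ones(S)
for _ in range(200):
    v = C @ v
    v /= v.sum()
lam, V = np.linalg.eig(C)
u1 = np.real(V[:, np.argmax(np.abs(lam))])
u1 /= u1.sum()
print(np.allclose(v, u1, atol=1e-8))
```

Since every factor of C is positive, the Perron-Frobenius conditions above hold and the power iteration converges from the all-ones initialization.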
Proof. To prove the second claim, we define

C_{21} = M_{U_2 U_1} D_2 M_{U_3 U_2} D_3 ⋯ M_{U_1 U_N} D_1,   (4.6)
and in complete analogy to the previous proof, this defines a recursion for v_{U_2 U_1}; therefore the steady-state message is the principal eigenvector of C_{21}. Furthermore, C_{21} can be expressed as

C_{21} = D_1^{−1} C_{N1}^T D_1,   (4.7)

where we have used the fact that for any two nodes X, Y, M_{XY} = M_{YX}^T.

Proof. The third claim, regarding the convergence rate of the messages, also follows from the power method lemma. Thus the convergence rate of v_{U_N U_1} is governed by the ratio of the largest eigenvalue to the second-largest eigenvalue of C_{N1}, and that of v_{U_2 U_1} by the ratio of eigenvalues of C_{21}. Furthermore, by equation 4.7, C_{21} and C_{N1} have the same eigenvalues.

Proof. To prove the fourth claim, we denote by e_i the vector that is zero everywhere except for a 1 at the ith component; then:

p_{U_1}(i) = α Σ_{U_2,…,U_N} P(U_1 = i, U_2, …, U_N)   (4.8)
  = α Σ_{U_2,…,U_N} Ψ(i, U_2) Ψ(i, O_1) Ψ(U_2, U_3) Ψ(U_2, O_2) ⋯ Ψ(U_N, i) Ψ(U_N, O_N)   (4.9)
  = α Σ_{U_2} Ψ(i, U_2) Ψ(U_2, O_2) Σ_{U_3} Ψ(U_2, U_3) ⋯ Σ_{U_N} Ψ(U_N, O_N) Ψ(U_N, i) Ψ(i, O_1)   (4.10)
  = α e_i^T M_{U_2 U_1} D_2 M_{U_3 U_2} D_3 ⋯ M_{U_1 U_N} D_1 e_i   (4.11)
  = α e_i^T C_{21} e_i   (4.12)
  = α C_{21}(i, i).   (4.13)

Note that we can also calculate the normalizing factor α in terms of the matrix C_{21}; thus:

p_{U_1}(i) = C_{21}(i, i) / trace(C_{21}).   (4.14)
Again, by equation 4.7, C_{21} and C_{N1} have the same diagonal elements, so p_{U_1}(i) = α C_{N1}(i, i).

Proof. To prove the fifth claim, we denote by {w_i, λ_i} the eigenvectors and eigenvalues of the matrix C_{N1} and by {s_i, λ_i} the eigenvectors and eigenvalues of the matrix C_{21}. Now, recall from the belief update rules (equation 2.7) that:

b_{U_1} = α v_{U_2 U_1} ⊙ v_{U_N U_1} ⊙ v_{O_1 U_1}.   (4.15)
Using the rst two claims, we can rewrite this in terms of the diagonal matrix D1 and the principal eigenvectors of C21 and CN1 , bEU1 = aD1 sE1 ¯ wE1 ,
(4.16)
or, in component notation, E 1 (i). bEU1 ( i) = aD1 (i, i)sE1 ( i) w
(4.17)
However, to obtain an expression analogous to equation 4.14, we want to rewrite b_{U_1} solely in terms of C_{21}. First we write C_{21} = S Λ S^{-1}, where the columns of S contain the eigenvectors of C_{21} and the diagonal elements of Λ contain the corresponding eigenvalues. We order S such that the principal eigenvector is in the first column; thus S(i, 1) = s_1(i). Surprisingly, the first row of S^{-1} is related to the product D_1 w_1, where w_1 is the principal eigenvector of C_{N1}:

C_{21} = S Λ S^{-1}   (4.18)
C_{N1} = D_1^{-1} C_{21}^T D_1 = D_1^{-1} (S^{-1})^T Λ S^T D_1.   (4.19)

Thus the first row of S^{-1} D_1^{-1} gives the principal eigenvector of C_{N1}, or

D_1(i, i) w_1(i) = c S^{-1}(1, i),   (4.20)

where the constant c is independent of i. Substituting into equation 4.17 gives:

b_{U_1}(i) = α S(i, 1) S^{-1}(1, i),   (4.21)

where α is again a normalizing constant. In fact, this constant is equal to unity, since for any invertible matrix S:

Σ_i S(i, 1) S^{-1}(1, i) = (S^{-1} S)(1, 1) = 1.   (4.22)
Finally, we can express the relationship between p_{U_1} and b_{U_1}:

p_{U_1}(i) = e_i^T C_{21} e_i / trace(C_{21})   (4.23)
 = e_i^T S Λ S^{-1} e_i / Σ_j λ_j   (4.24)
 = Σ_j S(i, j) λ_j S^{-1}(j, i) / Σ_j λ_j   (4.25)
 = [λ_1 b_{U_1}(i) + Σ_{j≥2} S(i, j) λ_j S^{-1}(j, i)] / Σ_j λ_j.   (4.26)
Thus p_{U_1} can be written as a weighted average of b_{U_1} and a second term, which we will denote by q_{U_1}:

p_{U_1} = (λ_1 / Σ_j λ_j) b_{U_1} + (1 − λ_1 / Σ_j λ_j) q_{U_1},   (4.27)

with:

q_{U_1}(i) = Σ_{j≥2} S(i, j) λ_j S^{-1}(j, i) / Σ_{j≥2} λ_j.   (4.28)

The weight given to b_{U_1} is proportional to the maximum eigenvalue λ_1. Thus the error in loopy belief propagation is small when the maximum eigenvalue dominates the eigenvalue spectrum:

p_{U_1} − b_{U_1} = (1 − λ_1 / Σ_j λ_j)(q_{U_1} − b_{U_1}).   (4.29)
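The decomposition above is easy to check numerically. The sketch below is ours, not the paper's: a random positive matrix stands in for C_{21}, and we verify that the exact marginal C_{21}(i, i)/trace(C_{21}) equals the weighted average of equation 4.27 built from the eigenvalues and eigenvectors:

```python
import numpy as np

# A random positive matrix stands in for C21; any matrix with a
# dominant positive eigenvalue will do (Perron-Frobenius).
rng = np.random.default_rng(0)
k = 3
C = rng.uniform(0.1, 1.0, (k, k))

lam, S = np.linalg.eig(C)
order = np.argsort(-np.abs(lam))        # principal eigenvector first
lam, S = lam[order], S[:, order]
Sinv = np.linalg.inv(S)

p = np.diag(C) / np.trace(C)            # exact marginal, equation 4.14
b = S[:, 0] * Sinv[0, :]                # loopy belief, equation 4.21; sums to 1

# Second term q (equation 4.28) and the weighted average (equation 4.27).
q = sum(lam[j] * S[:, j] * Sinv[j, :] for j in range(1, k)) / lam[1:].sum()
w = lam[0] / lam.sum()
assert np.allclose(p, np.real(w * b + (1 - w) * q))
```

The identity holds to machine precision for any diagonalizable matrix, since Σ_j λ_j S(i, j) S^{-1}(j, i) is just the (i, i) entry of S Λ S^{-1}.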
Note again the importance of the ratio between the subdominant eigenvalue and the dominant one. When this ratio is small, loopy belief propagation converges rapidly, and furthermore the approximation error is small.

The preceding discussion has made no assumption about the range of values the variables in the network can take. If we now assume that the variables are binary, we can show that the belief update assignment is guaranteed to give the maximum marginal assignment.

Claim. Consider a pairwise Markov network consisting of a single loop of unobserved variables U_i, each connected to an observed variable O_i. Assume the unobserved variables are binary; then the belief update assignment gives the maximum marginal assignment.

Proof.
Note that in this case:

p_{U_1}(i) = [λ_1 S(i, 1) S^{-1}(1, i) + λ_2 S(i, 2) S^{-1}(2, i)] / (λ_1 + λ_2)   (4.30)
 = [λ_1 b_{U_1}(i) + λ_2 (1 − b_{U_1}(i))] / (λ_1 + λ_2).   (4.31)

Thus:

p_{U_1}(i) − p_{U_1}(j) = [(λ_1 − λ_2) / (λ_1 + λ_2)] (b_{U_1}(i) − b_{U_1}(j)).   (4.32)

Thus p_{U_1}(i) − p_{U_1}(j) will be positive if and only if b_{U_1}(i) − b_{U_1}(j) is positive (since λ_1 > λ_2, λ_1 > 0, and trace(C_{21}) > 0). In other words, the calculated beliefs may be wrong but are guaranteed to be on the correct side of 0.5.
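For the binary case, the sign relationship of equation 4.32 can be checked the same way; the 2 × 2 matrix below is again a random stand-in for C_{21}, not derived from a particular loop:

```python
import numpy as np

# Random positive 2x2 stand-in for C21 in the binary case.
rng = np.random.default_rng(3)
C = rng.uniform(0.1, 1.0, (2, 2))
lam, S = np.linalg.eig(C)
order = np.argsort(-np.abs(lam))
lam, S = lam[order], S[:, order]
Sinv = np.linalg.inv(S)

p = np.diag(C) / np.trace(C)            # correct marginal, equation 4.14
b = S[:, 0] * Sinv[0, :]                # steady-state belief, equation 4.21

# Equation 4.32: the gap p(0) - p(1) is a positive multiple of b(0) - b(1),
# so the belief is always on the correct side of 0.5.
scale = (lam[0] - lam[1]) / (lam[0] + lam[1])
assert np.allclose(p[0] - p[1], scale * (b[0] - b[1]))
assert (p[0] - 0.5) * (b[0] - 0.5) > 0
```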
4.1 Correcting the Beliefs Using Locally Available Information. Equation 4.26 suggests that if node U_1 were able to estimate the higher-order eigenvalues and eigenvectors of the matrix C_{21} from the messages it receives, it would be able to correct the belief vector. Indeed, there exist several numerical algorithms that estimate all eigenvectors of a matrix using recursions similar to equation 4.4 (Strang, 1986).

Claim. Consider a pairwise Markov network consisting of a single loop of unobserved binary variables {U_i}_{i=1}^N, each of which is connected to an observed variable O_i. For node U_i, define the temporal difference d_i^{(t)} = v^{(t)}_{U_2 U_1} − v^{(t−N)}_{U_2 U_1} and the ratio r as the solution to d_i^{(t)} = r d_i^{(t−N)}. Then

p_{U_i} = (1 / (1 + r)) b_{U_i} + (r / (1 + r))(1 − b_{U_i}).

The proof is based on the following lemma:

Temporal difference lemma. Let C be a matrix with eigenvalues λ_1, ..., λ_n, sorted by decreasing magnitude, and eigenvectors u_1, ..., u_n. Assume that the eigenvectors are normalized so they sum to 1. Define the recursion x^{(t+1)} = α C x^{(t)} and the temporal difference d^{(t)} = x^{(t)} − x^{(t−1)}. Assume that the conditions of the power method lemma hold and, in addition, |λ_2| > |λ_3|. Then d^{(t)} approaches a multiple of u_2 − u_1, and furthermore d^{(t)} approaches (λ_2/λ_1) d^{(t−1)}.
Proof. This lemma can be proved by linearizing the relationship between x^{(t+1)} and x^{(t)} around the steady-state value x^∞ and then applying the power method lemma.

The temporal difference lemma guarantees that each component of d in the claim will decrease geometrically in magnitude, and the rate of decrease gives the ratio r = λ_2/λ_1. Substituting r = λ_2/λ_1 into equation 4.31 gives:

p_{U_1} = (1 / (1 + r)) b_{U_1} + (r / (1 + r))(1 − b_{U_1}).   (4.33)
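The whole scheme can be sketched as follows. This is our own illustrative code, not the paper's: loopy belief update runs on a binary four-node loop with the transition matrix of equation 4.34 and random stand-in evidence, r is estimated from temporal differences of one message taken N iterations apart, and equation 4.33 is applied at every node:

```python
import numpy as np
from itertools import product

N = 4
M = np.array([[0.9, 0.1],
              [0.1, 0.9]])              # transition matrix of equation 4.34
rng = np.random.default_rng(1)
evid = rng.uniform(0.1, 0.9, (N, 2))    # illustrative local evidence

# Correct marginals p by exhaustive enumeration over the 2^N states.
p = np.zeros((N, 2))
for u in product(range(2), repeat=N):
    w = np.prod([M[u[i], u[(i + 1) % N]] * evid[i, u[i]] for i in range(N)])
    for i in range(N):
        p[i, u[i]] += w
p /= p.sum(axis=1, keepdims=True)

# Loopy belief update: cw[i] is the message node i sends to node i+1,
# ccw[i] the message node i sends to node i-1 (indices mod N).
cw, ccw = np.ones((N, 2)), np.ones((N, 2))
hist = [cw[0].copy()]                   # track one message, as in Figure 5c
for _ in range(60):
    cw = np.stack([M.T @ (evid[i] * cw[(i - 1) % N]) for i in range(N)])
    ccw = np.stack([M @ (evid[i] * ccw[(i + 1) % N]) for i in range(N)])
    cw /= cw.sum(axis=1, keepdims=True)
    ccw /= ccw.sum(axis=1, keepdims=True)
    hist.append(cw[0].copy())

# Steady-state beliefs: local evidence times the two incoming messages.
b = np.stack([evid[i] * cw[(i - 1) % N] * ccw[(i + 1) % N] for i in range(N)])
b /= b.sum(axis=1, keepdims=True)

# Estimate r from temporal differences N iterations apart, at the latest
# time where the differences are still well above roundoff.
t = max(s for s in range(2 * N, len(hist))
        if abs(hist[s - N][0] - hist[s - 2 * N][0]) > 1e-9)
r = (hist[t][0] - hist[t - N][0]) / (hist[t - N][0] - hist[t - 2 * N][0])

# Correction of equation 4.33, applied at every node in the loop.
p_corr = b / (1 + r) + (r / (1 + r)) * (1 - b)
```

With this strong coupling, b is overconfident but on the correct side of 0.5 at every node, and p_corr agrees with the enumerated marginals up to the accuracy of the estimated ratio.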
More complicated correction schemes are possible. In the case of ternary nodes, the nodes can use the direction of the vector d to calculate the second eigenvector and obtain a better estimate of the correct probability. In general, all eigenvectors and eigenvalues can be calculated by performing operations on the stream of messages a node receives.

4.2 Illustrative Examples. Figure 5a shows a network of four nodes connected in a single loop. Figure 5b shows the local probabilities, that is, the probability of a node's being in state 1 or 2 given only its local evidence. The transition matrices were set to:

M = ( 0.9  0.1
      0.1  0.9 ).   (4.34)
Figure 5: (a) Simple loop structure. (b) Local evidence probabilities (v_{O_i U_i}) used in the simulations. (c) Differences between messages four iterations apart (|v^{(t)}_{U_2 U_1} − v^{(t−4)}_{U_2 U_1}|). Note the geometric convergence. (d) Log of the magnitude of the differences plotted in (c). The slope of this line determines the ratio of the second and first eigenvalues.
Figure 5c shows the convergence rate of one of the messages as a function of iteration. We plot ‖d^{(t)}‖ as a function of iteration, and the magnitude can be seen to decrease geometrically (linearly in the log plot of Figure 5d). The rate at which the components of d^{(t)} decrease gives the value of r that the nodes can use to correct the beliefs.

Figure 6a shows the correct posteriors (calculated by exhaustive enumeration), and Figure 6b the steady-state beliefs estimated by loopy propagation. The estimated beliefs are on the correct side of 0.5, as guaranteed by the theory, but are numerically wrong. In particular, they are overly confident (closer to 1 and 0, and further from 0.5). Intuitively, this is because the loopy network counts each piece of evidence multiple times rather than once, leading to an overconfident estimate. More formally, it is a result of λ_2 being positive (see equation 4.31). Figure 6c shows the estimated beliefs after the correction of equation 4.33 has been applied; they are identical to the correct beliefs up to machine precision.

Figure 6: (a) Correct marginals (calculated by exhaustive enumeration) for the data in Figure 5. (b) Beliefs estimated at convergence by the loopy belief propagation algorithm. The estimated beliefs are on the correct side of 0.5 but overly confident. (c) Estimated beliefs after correction based on the convergence rates of the messages (see equation 4.33). The beliefs are identical to the correct beliefs up to machine precision.

Figure 7 illustrates the relationship between the asymptotic convergence ratio |r| = ‖d^{(t+4)}‖ / ‖d^{(t)}‖ and the approximation error ‖b_X − p_X‖ at a given node. Each data point represents the results on a simple loop network (see Figure 5a) with randomly selected transition matrices. Consistent with equation 4.26, the error is small when convergence is rapid and large when convergence is slow.

Figure 7: Maximal error between the beliefs estimated using loopy belief update and the true beliefs, plotted as a function of the convergence ratio. As predicted by the theory, the error is small when convergence is rapid.

4.3 Networks Containing a Single Loop and Additional Trees. The results of the previous section can be extended to networks that contain a single loop and additional trees. An example is shown in Figure 8a. These networks consist of arbitrary combinations of chains and trees attached to a single loop. The nodes in these networks can be divided into two categories: those in the loop and those not in the loop. For example, in Figure 8, nodes 1–4 are part of the loop, while nodes 5–7 are not.

Let us focus first on nodes inside the loop. If all we are interested in is obtaining the correct beliefs for these nodes, then the situation is equivalent to a graph that has only the loop and evidence nodes. That is, after a finite number of iterations, the messages coming into the loop nodes from the nonloop nodes will converge to a steady-state value. These messages can be thought of as messages from observed nodes, and the analysis of the previous section holds.

For nodes outside the loop, however, the relationship derived in the previous section between p_X and b_X does not hold. In particular, there is no
guarantee that the BU assignment will give the MM assignment, even in the case of binary nodes. Consider node 5 in the figure. Its posterior probability can be factored into three independent sources: the probability given the evidence at node 6, the probability given the evidence at node 7, and the probability given the evidence in the loop. The messages that node 5 receives from nodes 6 and 7 in the belief update algorithm will correctly give the probability given the evidence at these nodes. However, the message it receives from node 4 will be based on the messages passed around the loop multiple times; that is, evidence in the loop will be counted more times than evidence outside the loop.

Nevertheless, if nodes in the loop use the convergence rate to correct their beliefs, they can also correct the messages they send to nodes outside the loop in a similar way. One way to do this is to calculate outgoing messages v_{XY} by dividing the belief vector b_X componentwise by the vector v_{YX} and multiplying the result by M_{XY}. If b_X is as defined by equation 2.8, this procedure gives exactly the same messages as equation 2.7. However, if b_X represents the corrected beliefs, then the outgoing messages will also be corrected, and the beliefs in all nodes will be correct.

Figure 8: (a) A network including a tree and a loop. (b) The probability of each node given its local evidence used in the simulations.

To illustrate these ideas, consider Figures 8 and 9. Figure 8b shows the local probabilities for each of the seven nodes. The transition matrix was identical to that used in the previous example (see equation 4.34). Figure 9a shows the correct posterior probabilities calculated using exhaustive enumeration, and Figure 9b the beliefs estimated using regular belief update on the loopy network. Note that the beliefs for the nodes inside the loop are on the correct side of 0.5 but overly confident, whereas the belief of node 6 is not on the correct side of 0.5. Finally, Figure 9c shows the results of the corrected belief propagation algorithm, obtained by using equation 4.33 for nodes in the loop. The results are identical to the correct posteriors up to machine precision.

Figure 9: (a) Correct posterior probabilities calculated using exhaustive enumeration. (b) Beliefs estimated using regular belief update on the loopy network. The beliefs for the nodes inside the loop are on the correct side of 0.5 but overly confident; the belief of node 6 is not on the correct side of 0.5. (c) Results of the corrected belief propagation algorithm, modified by using equation 4.33 for nodes in the loop. The results are identical to the correct posteriors up to machine precision.

5 Belief Revision in Networks with a Single Loop

For belief revision the situation is simpler than for belief update. In belief update we had to distinguish between binary and nonbinary nodes, and between networks consisting of a single loop and networks with a loop
and additional trees. For belief revision we can prove the optimality of the BR assignment for any net with a single loop.

Claim. Consider a pairwise Markov net with a single loop. If all potentials Ψ are positive and belief revision converges, then the BR assignment gives the MAP assignment.

The proof uses the notion of the "unwrapped" network introduced in section 3. Recall that for every loopy net, node X, and iteration t, there exists an unwrapped network such that the messages received by node X in the loopy net after t iterations are identical to the final messages received by the corresponding node in the unwrapped network after convergence. This means that the BR assignment at node X after t iterations corresponds to the MAP assignment for the corresponding X in the unwrapped network. It does not, however, tell us anything directly about the MAP assignment for other replicas of X in the unwrapped network. When belief revision converges, however, the BR assignment at X gives us the MAP assignment at nearly all replicas of X in the unwrapped network. This is shown in the following two lemmas.

Unwrapped network lemma. For every node X in a loopy network and iteration time t, we can construct an unwrapped network such that all messages v^{(t)}_{YX} that X receives from a neighboring node Y at time t in the loopy network are equal to the final messages v_{YX} that X receives from the corresponding node in the unwrapped network.

Proof. The proof is by construction. We construct an unwrapped tree by setting X to be the root node and then iterating the following procedure t times: Find all leaves of the tree (nodes that have no children). For each leaf, find all k nodes in the loopy graph that neighbor the node corresponding to this leaf. Add k − 1 nodes as children to each leaf, corresponding to all neighbors except the parent node. The transition matrices are set to be identical to those in the loopy network.
Likewise, if a node X in the loopy network is observed to be in state i, then all replicas of X are also set to be observed in state i. Thus, any node in the interior of the unwrapped tree has the same neighbors as the corresponding node in the loopy network and the same transition matrices. It is easy to verify that the messages that the root node receives in the unwrapped tree are identical to those received by the corresponding node in the loopy network after t iterations.
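The construction in the proof can be sketched in a few lines. The function name and the adjacency-list representation below are our own choices, not the paper's:

```python
def unwrap(neighbors, root, t):
    """Unwrap a loopy graph around `root` for t iterations.

    `neighbors` maps each node of the loopy graph to its neighbor list.
    Returns a list of (node_id, original_node, parent_id) triples, where
    node_id equals the triple's index in the list.
    """
    tree = [(0, root, None)]
    leaves = [0]
    for _ in range(t):
        new_leaves = []
        for leaf in leaves:
            _, orig, parent = tree[leaf]
            parent_orig = None if parent is None else tree[parent][1]
            for nb in neighbors[orig]:
                if nb == parent_orig:
                    continue        # add all neighbors except the parent
                tree.append((len(tree), nb, leaf))
                new_leaves.append(len(tree) - 1)
        leaves = new_leaves
    return tree

# A single loop of four nodes (compare Figure 10): unwrapping yields an
# ever-longer chain rooted at the replica of node 0.
ring = {0: [1, 3], 1: [2, 0], 2: [3, 1], 3: [0, 2]}
chain = unwrap(ring, 0, 3)      # 7 replicas: the root plus 3 on each side
```

For a single loop, each iteration extends the chain by one replica on each side, matching the unwrapped networks of Figures 10 and 11.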
Figure 10: Illustration of the procedure for generating an unwrapped network. At every iteration, we examine all leaf nodes in the tree and find the neighbors of the corresponding node in the loopy network. We add nodes as children to each leaf, corresponding to all neighbors except the parent node.
Figure 10 illustrates the construction of the unwrapped network for a network with a single loop. Although we have used a tree structure to generate the unwrapped network, for the case of a single-loop network the unwrapped network is equivalent to an increasingly long chain (corresponding to the loop nodes) with additional tree nodes connected to the chain at the corresponding nodes. Figure 11a shows the same unwrapped network as in Figure 10 but drawn as a long chain. We define the end point nodes of the unwrapped network as the two loop nodes that are most distant from the root. Thus, in Figure 11a, U_2'' and U_3'' are end point nodes. Any unwrapped network defines a joint probability distribution, and we can therefore define the MAP assignment of the unwrapped network. Although the unwrapped network is constructed so that all replicas of X have
identical neighbors, there is no constraint that their MAP assignments be identical. Thus in Figure 11a, the MAP assignment in the unwrapped network requires assigning values to 11 variables. There is no guarantee that replicas of the same node (e.g., U_1, U_1') will have the same assignment. However, as we show in the following lemma, if belief revision converges, the MAP assignment for the unwrapped network is guaranteed to be semiperiodic: all replicas of a given node in the unwrapped network will have the same MAP assignment as long as they are far away from the end point nodes.

Periodic assignment lemma. Assume all messages in a network with a single loop converge to their steady-state values after t_c iterations: v^{(t)}_{XY} = v^∞_{XY} for t > t_c. If Y is a node in the unwrapped network such that the distance of Y to both end points is greater than t_c, then the MAP assignment at Y is equal to the BR assignment of the corresponding node in the loopy network.

Proof. We first show that all messages transmitted in the unwrapped network are equal to the steady-state messages in the loopy network. To show this, we divide the nodes in the unwrapped network into two categories: chain nodes, which form part of the infinite chain (corresponding to nodes in the loop in the loopy network), and nonchain nodes (corresponding to nodes outside the loop in the loopy network). This gives four types of messages in the unwrapped network. Leftward and rightward messages are the two directions of propagation inside the chain; outward messages are propagated in the nonchain nodes away from the chain nodes, and inward messages go toward the chain nodes. Thus in Figure 11a, v_{U_2 U_1} is a rightward message, v_{U_3 U_1} is a leftward message, v_{U_3 U_4} is an outward message, and v_{U_4 U_3} is an inward message.

The inward messages depend only on other inward messages and the evidence at the nonchain nodes. Thus, they are identical at all replicas of the nonchain nodes.
Since the transition matrices and evidence are replicated from the loopy network, the inward messages are identical to the corresponding messages in the loopy network.

The leftward messages depend only on other leftward messages to the right of them and on the inward messages. By construction of the unwrapped network, if X is a node in the unwrapped network whose distance from the rightmost end point is d and Y is its neighbor on the left, then v_{XY} = v^{(d)}_{XY}; that is, the message is identical to the message that X sends to Y in the loopy network at time d. Since d > t_c, v_{XY} = v^∞_{XY}. The proof for the rightward messages is analogous to that for the leftward messages.

Finally, the outward messages that a chain node sends to a nonchain node depend only on the leftward and rightward messages. Thus if both the leftward and rightward messages that it receives are equal to the steady-state messages in the loopy network, so will be the outward message it sends out.
Figure 11: (a) For networks with a single loop, the unwrapped network is equivalent to an increasingly long chain (corresponding to the loop nodes) and additional trees connected at the corresponding nodes. (b) When belief revision converges, the MAP assignment of the unwrapped network contains periodic replicas of the MAP assignment in the loopy network.
If X is a node in the unwrapped tree whose distance from the end points is more than t_c, all incoming messages to X are identical to the steady-state messages at the corresponding node in the loopy network. Therefore, the belief vector at X is identical to the corresponding steady-state belief in the loopy network, and the MAP assignment at X is equal to the BR assignment at the corresponding node.

The periodic assignment lemma guarantees that if belief revision converges, then the MAP assignment in the unwrapped tree will have replicas of the BR assignment for all nodes whose distance from the end points is greater than t_c. The importance of this result is that we can now relate optimality in the unwrapped tree to optimality in the loopy network.

It is easier here to work with log probabilities than with probabilities. We define J(X, Y) = −log Ψ(X, Y). For any Markov network with observed variables, we refer to the network cost function J as the negative log posterior of the unobserved variables. Thus in Figure 10,

J({u_i}) = J(u_1, u_2) + J(u_2, u_3) + J(u_3, u_1) + J(u_3, u_4) + J(u_4, o_4),   (5.1)

where o_4 is the observed value of O_4.
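As a concrete rendering of equation 5.1 (with arbitrary stand-in potential tables, not the paper's values):

```python
import numpy as np

# Pairwise potentials for the graph of Figure 10: a loop u1-u2-u3 plus
# a tree edge u3-u4 and an observation o4. Values are arbitrary stand-ins.
rng = np.random.default_rng(4)
edges = [("u1", "u2"), ("u2", "u3"), ("u3", "u1"), ("u3", "u4"), ("u4", "o4")]
psi = {e: rng.uniform(0.1, 1.0, (2, 2)) for e in edges}

def J(edge, a, b):
    """Link cost J(X, Y) = -log Psi(X, Y)."""
    return -np.log(psi[edge][a, b])

def network_cost(u1, u2, u3, u4, o4):
    """Negative log posterior of equation 5.1 (up to the normalizer)."""
    return (J(("u1", "u2"), u1, u2) + J(("u2", "u3"), u2, u3)
            + J(("u3", "u1"), u3, u1) + J(("u3", "u4"), u3, u4)
            + J(("u4", "o4"), u4, o4))
```

The MAP assignment is then the minimizer of this cost over {u_i}, with o_4 clamped to its observed value.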
Optimality relationship lemma. Assume that belief revision converges in a network with a single loop, and let {u*_i} be the BR assignment in the loopy network. Assume furthermore that all potentials Ψ(X, Y) are nonzero, and define J as the negative log posterior of the loopy network. For any n, {u*_i} must minimize

J_n({u_i}) = n J({u_i}) + J̃,   (5.2)

where J̃ is finite and independent of n.

Proof. To prove this, note that for any n there exists a t such that the unwrapped network at iteration t contains a subnetwork with n copies of all links in the loopy network. Furthermore, by choosing a sufficiently large t, the distance of all nodes in the subnetwork from the end points can be made sufficiently large. By the periodic assignment lemma, the MAP assignment will have replicas of the BR assignment for all nodes within the subnetwork. Furthermore, by the definition of the MAP assignment, the assignment within the subnetwork maximizes the posterior probability when nodes outside the subnetwork are clamped to their MAP assignment values. We denote by J_n the cost function that the assignment of the subnetwork minimizes. The cost function decomposes into a sum of pairwise costs, one for each link in the subnetwork. For a periodic assignment, the cost function can be rewritten as

J_n({u_i}) = n J({u_i}) + J̃,   (5.3)

where the J̃ term captures the existence of links between the boundaries of the chain and its observed neighbors. This term is independent of n (the distance of the neighboring nodes to the end points is also greater than t_c). Since periodic copies of {u*_i} form the MAP assignment for the subnetwork, {u*_i} must minimize J_n.

Figure 12 illustrates the optimality relationship lemma for n = 2. The subnetwork encircled by the dotted line includes two copies of all links in the original loopy network. Therefore the subnetwork cost J_n is given by

J_n({u_i}) = 2 J({u_i}) + J̃({u_i}),   (5.4)

with J̃({u_i}) the sum of the two link costs that connect the subnetwork to the larger unwrapped network: J(u*_2, u_1) + J(u_1, u*_3).

Figure 12: Proof of the optimality relationship lemma for networks with a single loop. For any n there exists a t such that the unwrapped network of size t includes a subnetwork that contains exactly n copies of all links in the original loopy network.

Proof. The optimality relationship lemma essentially proves the main claim of this section. Since the BR assignment minimizes J_n for arbitrarily large n and J̃ is independent of n, the BR assignment must minimize J and hence is the MAP assignment for the loopy network.

To illustrate this analysis, Figure 13 shows two network structures for which there is no guarantee that the BU assignment will coincide with the MM assignment: the loop structure has ternary nodes, while the added tree on the right means that the BU assignment, even for binary variables, will not necessarily give the MM assignment. However, by the claim proved above, the BR assignment should always coincide with the MAP assignment. We selected 5000 random transition matrices for the two structures and calculated the correct beliefs using exhaustive enumeration. We then counted the number of correct assignments using belief update and belief revision by comparing the BU assignment to the MM assignment and the BR assignment to the MAP assignment. We considered assignments only when belief revision converged (about 95% of the simulations). The results are shown in Figure 13. In both of these structures, the BU assignment is not guaranteed to give the MM assignment, and indeed the wrong assignment is sometimes observed. However, an incorrect BR assignment is never observed.

6 Discussion: Single Loop

The analysis enables us to categorize a given network with a loop into one of two categories: (1) structures for which both belief update and belief revision are guaranteed to give the correct assignment or (2) structures for which only belief revision is guaranteed to give the correct assignment.
Figure 13: (a–b) Two structures for which the BU assignment need not coincide with the MM assignment, but the BR assignment is guaranteed to give the MAP assignment. (a) Simple loop structure with ternary nodes. (b) Loop adjoined to a tree. (c–d) Number of correct assignments using belief update and belief revision in 5000 networks of these structures with randomly generated transition matrices. Consistent with the analysis, the BU assignment may be incorrect, but an incorrect BR assignment is never observed.
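A scaled-down version of the Figure 13a experiment can be sketched as follows (our own code: 100 trials of a ternary four-node loop rather than the paper's 5000 runs). Belief revision is implemented as loopy max-product, and trials in which it does not converge are skipped, as in the paper:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
N, k, trials = 4, 3, 100
converged = br_wrong = 0
for _ in range(trials):
    Ms = rng.uniform(0.1, 1.0, (N, k, k))   # Ms[i]: potential between i, i+1
    ev = rng.uniform(0.1, 1.0, (N, k))      # local evidence
    # Exact MAP assignment by exhaustive enumeration.
    best = max(product(range(k), repeat=N),
               key=lambda u: np.prod([Ms[i][u[i], u[(i + 1) % N]] * ev[i, u[i]]
                                      for i in range(N)]))
    # Loopy belief revision: max-product messages in both directions.
    cw, ccw = np.ones((N, k)), np.ones((N, k))
    conv = False
    for _ in range(500):
        old_cw, old_ccw = cw, ccw
        cw = np.stack([(Ms[i] * (ev[i] * cw[(i - 1) % N])[:, None]).max(axis=0)
                       for i in range(N)])
        ccw = np.stack([(Ms[(i - 1) % N] * (ev[i] * ccw[(i + 1) % N])[None, :]).max(axis=1)
                        for i in range(N)])
        cw /= cw.sum(axis=1, keepdims=True)
        ccw /= ccw.sum(axis=1, keepdims=True)
        if np.allclose(cw, old_cw, rtol=0, atol=1e-10) and \
           np.allclose(ccw, old_ccw, rtol=0, atol=1e-10):
            conv = True
            break
    if not conv:
        continue                            # skip non-converged runs
    converged += 1
    bel = np.stack([ev[i] * cw[(i - 1) % N] * ccw[(i + 1) % N] for i in range(N)])
    if tuple(bel.argmax(axis=1)) != best:
        br_wrong += 1
```

Consistent with the claim of section 5, converged runs should never produce an incorrect BR assignment, while an analogous max-sum comparison of BU against MM assignments does occasionally fail.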
The crucial element determining the performance of belief update is the eigenspectrum of the matrix C_{N1}. Note that this matrix depends on both the transition matrices and the observed data. Thus belief update may give very accurate estimates for certain observed data, even in structures for which it is not guaranteed to give the correct assignment.

We have primarily dealt here with discrete state spaces, but the analysis can also be carried out for continuous spaces. For the case of gaussian posterior probabilities, it is easy to show that the belief update assignment will give the correct MAP assignment for all gaussian Markov networks with a single
loop (this is due to the fact that belief update and belief revision are identical for a gaussian). We emphasize again that our choice of Markov nets with pairwise potentials as the primary object of analysis was based mainly on notational convenience. Owing to the equivalence between Pearl's algorithm on Bayesian nets and our update rules on Markov nets, our results also hold for applications of Pearl's original update rules on Bayesian nets with a single loop. In particular, this means that one can perform exact inference on a Bayesian net with a single loop by running Pearl's propagation algorithm and then applying a correction analogous to equation 4.33. More generally, if the Bayesian network can be converted into a Markov network with a single loop, our results show how to perform exact inference. Exact inference in networks containing a single loop, however, can also be performed using cut-set conditioning (Pearl, 1988). Our goal in analyzing belief propagation in networks with a single loop is primarily to understand loopy belief propagation in general, with single-loop networks being the simplest special case.

Independent of our work, several groups working in the context of error-correcting codes have recently obtained results on probability propagation in networks with a single loop. Aji, Horn, and McEliece (1998) have shown that iterative decoding (equivalent to belief update) of a single-loop code will converge to a correct decoding for binary nodes. They also showed that the messages converge to the principal eigenvectors of a matrix analogous to our matrix C_{21}.
Forney, Kschischang, and Marcus (1998) have also shown the convergence of the messages to the principal eigenvectors and have additionally analyzed the “max-sum” decoding algorithm (analogous to belief revision) on a network consisting of a single loop, showing that it will converge to the dominant pseudocodeword—a periodic assignment on the unwrapped network such that the average cost per period is minimal. To the best of our knowledge, the analytical relationship between the correct beliefs and the steady-state ones, the analysis of networks including a single loop and arbitrary trees, and the correction of beliefs using locally available information have not been discussed elsewhere. One of the important distinctions that arise from analyzing networks containing a single loop and arbitrary trees is the difference in performance between belief update and belief revision when both converge. The BR assignment is always guaranteed to give the MAP assignment, while the BU assignment is guaranteed to give the MM assignment only when there are no additional trees attached to the loop and the nodes are binary. As we show in subsequent sections, this difference in performance is also apparent in networks with multiple loops. 7 Networks with Multiple Loops The basic intuition that explains why belief propagation works in networks with a single loop is the notion of equal double counting. Essentially, while
evidence is double counted, all evidence is double counted equally, as can be seen in the unwrapped network, and hence belief revision is guaranteed to give the correct assignment. Furthermore, by using recursion expressions for the messages, we could quantify the "amount" of double counting and derive a relationship between the estimated beliefs and the true ones.

For networks with multiple loops, the situation is more complex. In particular, the relationship between the messages sent at time t and at time t + n is not as simple as in the single-loop case, and tools such as the power method lemma or the temporal difference lemma are of no immediate use. However, the notion of the unwrapped tree applies equally well to networks with multiple loops. Figure 14 shows an example. The unwrapped network lemma of the previous section holds; that is, the messages received by node A in the loopy network after four iterations are identical to the final messages received by node A in the unwrapped network. Furthermore, a version of the periodic assignment lemma of the previous section holds for any network: if belief revision converges, then the MAP assignment for arbitrarily large unwrapped networks contains periodic replicas of the BR assignment.

Figure 14: (Top) A Markov network with multiple loops. (Bottom) Unwrapped network corresponding to this structure. Note that all nodes and connections in the loopy structure appear an equal number of times in the unwrapped network, except the boundary nodes A.

Thus for a network with multiple loops, if BR converges, one can construct arbitrarily large problems such that periodic copies of the BR assignment give the optimal assignment for that problem. For example, for the network in Figure 14, the periodic assignment lemma guarantees that if {u*_i} is the BR assignment, then it minimizes

J_n({u_i}) = n J({u_i}) + J̃_n   (7.1)

for arbitrarily large n. As in the single-loop case, J̃_n is a boundary cost reflecting the constraints imposed on the end points of the unwrapped network. Note that unlike the single-loop case, the boundary cost here may increase with increasing n.

Because the boundary cost may grow with n, we cannot automatically assume that an assignment that minimizes J_n also minimizes J; that is, the assignment may be optimal only because of the boundary cost. We have not been able to determine analytically the conditions under which the steady-state assignment is due to J̃ as opposed to J. However, the fact that equation 7.1 holds for arbitrarily large subnetworks (such that the boundary can be made arbitrarily far from the center node) leads us to hypothesize that for positive potentials Ψ, loopy belief revision will give the correct assignment for the network in Figure 14.

Although we have not been able to prove this hypothesis, extensive simulation results are consistent with it. We performed loopy belief revision on the structure shown in Figure 14 using randomly selected transition matrices and local evidence probabilities. We repeated the experiment 10,000 times and counted the number of correct assignments. As in the previous section, a correct assignment for belief revision is one that is identical to the MAP assignment, and a correct assignment for belief update is one that is identical to the MM assignment. Again, we considered only runs in which belief revision converged. We never observed convergence of belief revision to an incorrect assignment for this structure, while belief update gave 247 incorrect assignments (an error rate of 2.47%). Since these rates are based on 10,000 trials, the difference is highly significant. While by no means a proof, these
simulation results suggest that some of our analytic results for single-loop networks may also hold for a subclass of multiple loop networks. Trying to extend the proofs for a broader class of networks is a topic of current research. 7.1 Turbo Codes. Turbo codes are a class of error-correcting codes whose decoding algorithm has recently been shown to be equivalent to belief propagation on a network with loops. To make the connection to the update rules (equations 2.7–2.8) explicit, we briey review this equivalence. We use the formulation of turbo decoding presented in (McEliece et al., E is encoded using two codes, each of 1995). An unknown binary vector U which is easy to decode by itself, and the results are transmitted over a noisy channel. The task of turbo decoding is to infer the values of every bit E from two noisy versions Y E 1 , YE 2 that are received after transmission. of U E (which we assume here is The decoder knows the prior distribution over U E = uE ). uniform) and the two conditional probabilities pi ( yE | uE ) = P (YE i = yE | U Since both noisy versions are assumed to be conditionally independent, the marginal probability ratio is simply P E E E E E U( i)= 1 P( Y1 | U ) P ( Y 2 | U ) P(U ( i) = 1) U: = P . (7.2) E )P (YE2 | U E) P(U ( i) = 0) P( YE1 | U E U: U( i)= 0
In the typical error-correcting code application, the sum in equation 7.2 is intractable. Rather, the turbo decoder approximates the ratio by the iterative algorithm outlined in Figure 15. The algorithm consists of two turbo decision modules, each of which considers only one "noisy version" $\vec Y_i$. Each module receives a likelihood ratio vector as input and outputs a modified likelihood ratio vector that takes into account the observed $\vec Y_i$. If $\vec T = \mathrm{TDM}_1(\vec S)$, then

$$\vec T(i) = \frac{\sum_{\vec u\,:\,\vec u(i)=1} p_1(\vec y_1 \mid \vec u) \prod_{j \neq i} \vec S(j)^{\vec u(j)}}{\sum_{\vec u\,:\,\vec u(i)=0} p_1(\vec y_1 \mid \vec u) \prod_{j \neq i} \vec S(j)^{\vec u(j)}}. \qquad (7.3)$$

The output of module 2 is identical, with $\vec y_1$ replaced by $\vec y_2$. Note that each module still requires summing over all $\vec u$, but the codes are constructed so that the $p_i(\vec y \mid \vec u)$ have a factorized structure that enables efficient calculation of $\mathrm{TDM}_i(\vec S)$. The turbo decoding iterations can be written as

$$\vec T \leftarrow \mathrm{TDM}_1(\vec S) \qquad (7.4)$$
$$\vec S \leftarrow \mathrm{TDM}_2(\vec T), \qquad (7.5)$$
and the likelihood ratio is approximated as

$$\vec R = \vec T \odot \vec S. \qquad (7.6)$$
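The iteration in equations 7.3 through 7.6 can be sketched for toy problems, again representing each likelihood as a dictionary over bit tuples. This is an illustrative reimplementation, not the factorized decoder used in practice: real turbo codes exploit the structure of $p_i(\vec y \mid \vec u)$ rather than enumerating all $2^n$ vectors.

```python
import itertools

def tdm(p, S, n):
    """Turbo decision module (equation 7.3): given a likelihood-ratio vector S,
    return the modified ratio vector T.  p maps each length-n bit tuple u to
    p(y|u) for one noisy version."""
    T = []
    for i in range(n):
        num = den = 0.0
        for u in itertools.product((0, 1), repeat=n):
            w = p[u]
            for j in range(n):
                if j != i and u[j] == 1:
                    w *= S[j]          # factor S(j)**u(j) for j != i
            if u[i] == 1:
                num += w
            else:
                den += w
        T.append(num / den)
    return T

def turbo_decode(p1, p2, n, iters=20):
    """Iterate T <- TDM1(S), S <- TDM2(T) (equations 7.4-7.5) and return the
    approximate likelihood ratios R(i) = T(i) * S(i) (equation 7.6)."""
    S = [1.0] * n
    for _ in range(iters):
        T = tdm(p1, S, n)
        S = tdm(p2, T, n)
    return [t * s for t, s in zip(T, S)]
```

As a sanity check, if one of the two likelihoods is flat (carries no information), the extrinsic factors cancel and the fixed point reproduces the exact marginal ratios from the other likelihood alone.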
Figure 15: (a) The turbo decoding algorithm consists of two turbo decision modules such that the output of each module serves as the input to the other one. The posterior likelihood ratio is approximated by multiplying the outputs S, T at convergence. (b) The turbo decoding algorithm as belief update in a loopy Markov network. Each unknown bit is represented by a node Ui. The belief of a node is approximated by multiplying the two incoming messages (corresponding to S and T). When the link matrices are set as explained in the text, the update rules for the messages (equations 2.7–2.8) reduce to the turbo decoding algorithm.
To see the equivalence of this scheme to belief update in Markov nets, consider Figure 15b. We assume that the unknown vector $\vec U$ is of length 3 and represent the value of the ith bit with a node $U_i$ in the graph. Each $U_i$ is connected to exactly two nodes, $X_1, X_2$, that are in turn connected to the received "noisy versions" $Y_1, Y_2$. By the belief update equations (see equation 2.8), the belief of node $U_i$ is equal to the pointwise multiplication of the messages it receives from $X_1$ and $X_2$. Note the similarity to equation 7.6, where the likelihood ratio for $U(i)$ is obtained by multiplying $S(i)$ and $T(i)$. Indeed, we now show that when the potentials in the graph in Figure 15b are set in a particular form, $S(i)$ is equivalent to the message that $X_1$ sends to $U_i$ and $T(i)$ is equivalent to the message that $X_2$ sends to $U_i$. The nodes $X_i$ are meant to represent copies of the unknown binary message $\vec U$; hence in the example (of message length 3), $X_i$ can take on eight possible values. We set $\Psi(\vec x_i, u_j) = 1$ if $\vec x_i(j) = u_j$ and zero otherwise. Furthermore, we set $\Psi(\vec x_i, \vec y_i) = p_i(\vec y \mid \vec x)$. When the potentials are set in this form, the belief update equations (see equation 2.7) give the turbo decoding algorithm. Denote the message $\vec v_{X_2 U_i}$ sent by node $X_2$ to $U_i$ by the vector $\alpha(T_i, 1)$. The message $\vec v_{U_i X_1}$ that $U_i$ sends
to $X_1$ is a vector of length 8. It is obtained by taking the message $U_i$ receives from $X_2$ and multiplying it by the transition matrix $M_{U_i X_1}$:

$$\vec v_{U_i X_1}(x_1) = \alpha \sum_{u_i} \Psi(x_1, u_i)\, \vec v_{X_2 U_i}(u_i) \qquad (7.7)$$
$$= \alpha\, T_i^{\,x_1(i)}. \qquad (7.8)$$
Now, the message $\vec v_{X_1 U_i}$ sent from $X_1$ to $U_i$ is a vector of length 2 that is obtained by multiplying all incoming messages except the one coming from $U_i$ and then passing the result through the transition matrix $M_{X_1 U_i}$:

$$\vec v_{X_1 U_i}(u_i) = \sum_{x_1} \Psi(x_1, u_i)\, p_1(y_1 \mid x_1) \prod_{j \neq i} T_j^{\,x_1(j)}. \qquad (7.9)$$
If we denote by $\vec S(i)$ the ratio of the components of $\vec v_{X_1 U_i}$, then equation 7.9 is equivalent to equation 7.3. Thus the belief update rules in equations 2.7–2.8 give the turbo decoding algorithm. McEliece et al. (1995) showed several examples where the turbo decoding algorithm converges to an incorrect decoding. Since the turbo decoding algorithm is equivalent to belief update, we wanted to see how well a decoding algorithm equivalent to belief revision would perform. We randomly selected the probability distributions $p_1(u)$ and $p_2(u)$ and calculated the MAP assignment (often referred to as the ML decoding in coding applications) and the MM assignment (referred to as the optimal bitwise decoding, or OBD, in coding applications). We compared these assignments to the BU and BR assignments. As in previous sections, the BU assignment was judged correct if it agreed with the MM assignment, and the BR assignment was judged correct if it agreed with the MAP assignment. We repeated the experiment for 10,000 randomly selected probability distributions. Only trials in which belief revision converged were considered. Results are shown in Figure 16. As observed by McEliece et al. (1995), turbo decoding is quite likely to converge to an incorrect decoding (16% of the time). In contrast, belief revision converges to an incorrect answer in less than 0.1% of the runs. Note that this difference is based on 10,000 trials and is highly significant. However, the results considered only cases where belief revision converged, and in general belief revision converged less frequently than did belief update. A turbo decoding algorithm that used maximization rather than summation in equation 7.3 was presented in Benedetto, Montorsi, Divsalar, and Pollara (1996), where the maximization was presented as an approximation to the full sum in equation 7.3. The decoding performance of the approximate algorithm was shown to be slightly worse than that of standard turbo decoding.
In our simulations, however, we considered only decodings where belief revision converged, whereas in Benedetto et al. (1996), the decoding
Figure 16: Results of belief revision and update on the turbo code problem.
obtained after a fixed number of iterations was used, regardless of whether the algorithm converged. Our motivation in considering only decodings at convergence was based on our analysis of single-loop networks. The simulation results suggest that turbo codes exhibit a similar difference in performance between belief update and belief revision. From a practical point of view, of course, errors due to failures in convergence are as troubling as errors due to a convergence to a wrong answer. Thus understanding the behavior of the decoding algorithms when they do not converge is an important question for further research. Forney et al. (1998) have obtained some encouraging results in this regard. Unlike the networks considered in the previous sections, the turbo code Markov net has deterministic links (between the $X_i$ and $U_j$), and hence the log of the potential functions $\Psi$ may be infinite. We believe this may be the reason for the existence of a few incorrect decodings with belief revision. The deterministic links make it difficult to relate optimality in the unwrapped tree to optimality in the original loopy network.

7.2 Discussion. Turbo codes are not the only error-correcting code whose decoding is equivalent to belief propagation. Indeed, one of the first derivations of an algorithm equivalent to belief propagation appears in Gallager (1963) as a means of decoding codes known as low-density parity check codes. Gallager's decoding algorithm is equivalent to probability propagation on a network with loops, and Gallager discussed the fact that his decoding algorithm is not guaranteed to give the optimal decoding because it considers replicas of the same evidence as if they were independent. Gallager also showed that his algorithm will give the MAP decoding for a computation tree, equivalent to our unwrapped network. He noted that if the number of iterations is small compared to the size of the loops in the graph, the computation tree will not contain replicas of the same evidence. He presented a method of constructing codes such that the size of the loops in the network will be large. Indeed, if double counting is to be avoided, we would expect performance of belief propagation in loopy nets to be best when the size of the loops is large or (equivalently) the number of iterations is small. Turbo codes, however, contradict both of these expectations. The Markov network corresponding to turbo decoding (see Figure 15b) contains many loops of size 4. Furthermore, it has been well established (McEliece et al., 1996; Benedetto et al., 1996) that turbo decoding performance improves as the number of iterations is increased. The performance of turbo codes is consistent, however, with the expectation based on the notion of equal double counting emphasized throughout this article. As we saw in the analysis of networks with a single loop, the BR assignment at convergence is guaranteed to give the MAP assignment regardless of the size of the loop. Furthermore, as we discussed in section 3, for any finite number of iterations, the unwrapped network will not contain an equal number of replicas of all nodes and connections. It is only as the size of the unwrapped network approaches infinity that the influence of the boundary nodes can be neglected. Thus, additional iterations are expected, based on our analysis, to increase the decoding performance. Despite the importance of turbo codes, one should use caution in generalizing from their performance to general loopy belief propagation. In the typical turbo decoding setting, the probabilities $P(\vec y_i \mid \vec u)$ in equation 7.3 are highly peaked around the correct $\vec u$.
Thus the sum in equation 7.3 is typically dominated by a single term, and therefore belief revision (in which the sum is approximated by its largest value) and belief update behave quite similarly. As we have shown throughout this article, this is not the case for general loopy belief propagation, nor is it the case for turbo decoding when the probabilities $P(\vec y_i \mid \vec u)$ are chosen randomly in simulations. We believe that additional insights into turbo decoding will be obtained by comparing the performance of the algorithm at different probability regimes.

8 Conclusion

As many authors have observed, in order for belief propagation to be exact, we must find a way to avoid double counting. Since double counting is unavoidable in networks with loops, belief propagation in such networks may seem to be a bad idea. The excellent performance of turbo codes motivates a closer look at belief propagation in loopy networks. Here we have shown a class of networks for which, even though the evidence is double counted,
all evidence is equally double counted. For networks with a single loop, we have obtained an analytical expression relating the beliefs estimated by loopy belief propagation to the correct beliefs. This allows us to find structures in which belief update may give the wrong beliefs, but the BU assignment is guaranteed to give the correct MM assignment. We have also shown that for networks with a single loop, the BR assignment is guaranteed to give the MAP assignment. For networks with multiple loops we have presented simulation results suggesting that some of the results we have obtained analytically for single-loop networks may hold for a large class of networks. These results are an encouraging first step toward understanding belief propagation in networks with loops.

Appendix: Converting a Bayesian Net to a Pairwise Markov Net

There are various ways to convert a Bayesian network into a Markov network that represents the same probability distribution. Indeed, one of the most popular inference algorithms for Bayesian networks is to convert them into a Markov network known as the join tree or the junction tree (Pearl, 1988; Jensen, 1996) and perform inference on the junction tree. Here we present a method to convert a Bayesian network into a Markov net with pairwise potentials, such that the update rules described in the article reduce to Pearl's algorithm on the original Bayesian network. In particular, if the Bayesian network has loops in it, so will the corresponding Markov network constructed using this method. This should be contrasted with the junction tree construction in Jensen (1996), which constructs a singly connected Markov network for any Bayesian network. For concreteness, consider the Bayesian network in Figure 8a. The joint probability is given by

$$P(ABCDEF) = P(A)\, P(B \mid A)\, P(C \mid A)\, P(D \mid BC)\, P(F \mid C)\, P(E \mid D). \qquad (A.1)$$
The procedure is simple. For any node that has multiple parents, we create a compound node into which the common parents are clustered. This compound node is then connected to the individual parent nodes as well as the child. For all nodes that have single parents, we drop the arrows and leave the topology unchanged. Figure 17 shows an example. Since D has two parents B and C, we create a compound node $Z = (B', C')$ and connect it to D, B, and C. Having created a new graph, we need to set the pairwise potentials so that the probability distributions are identical. The potentials are the conditional probabilities of children given parents, with the exception of the potentials between the parent nodes and a compound node. These are set to one if the nodes have a consistent estimate of the parent node and zero otherwise. Thus in the example, if $Z = (B', C')$ then $\Psi(BZ) = 1$ if $B' = B$ and zero otherwise.
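The clustering procedure can be verified numerically on the example network: because the potentials between the parent nodes and the compound node are consistency indicators, summing the pairwise product over Z collapses to the single term $Z = (B, C)$, recovering the Bayes net joint. Below is a self-contained sketch with randomly drawn binary CPTs; the helper names `cpt` and `p` are hypothetical, chosen only for this illustration.

```python
import itertools
import random

random.seed(0)

def cpt(n_parents):
    """Random conditional probability table P(child = 1 | parents)."""
    return {pa: random.random() for pa in itertools.product((0, 1), repeat=n_parents)}

def p(table, child, *parents):
    """Look up P(child | parents) for binary variables."""
    q = table[parents]
    return q if child == 1 else 1.0 - q

# CPTs for the example network: A -> B, A -> C, (B, C) -> D, C -> F, D -> E.
tA = random.random()
tB, tC, tF, tE = cpt(1), cpt(1), cpt(1), cpt(1)
tD = cpt(2)

for a, b, c, d, e, f in itertools.product((0, 1), repeat=6):
    # Joint of the Bayesian network (equation A.1).
    bayes = (p({(): tA}, a) * p(tB, b, a) * p(tC, c, a)
             * p(tD, d, b, c) * p(tF, f, c) * p(tE, e, d))
    # Pairwise Markov net with compound node Z = (B', C'): the indicator
    # potentials Psi(ZB), Psi(ZC) kill every term except Z = (b, c).
    z = (b, c)
    markov = (p({(): tA}, a) * p(tB, b, a)   # Psi(AB) = P(A) P(B|A)
              * p(tC, c, a)                  # Psi(AC) = P(C|A)
              * p(tF, f, c)                  # Psi(FC) = P(F|C)
              * p(tD, d, *z)                 # Psi(ZD) = P(D|B'C')
              * p(tE, e, d))                 # Psi(DE) = P(E|D)
    assert abs(bayes - markov) < 1e-12
```

The assertion holds for every joint configuration, confirming that the pairwise construction represents the same distribution.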
Figure 17: Any Bayesian network can be converted into a pairwise Markov network by adding cluster nodes for all parents that share a common child. (a) Bayesian network. (b) Corresponding pairwise Markov network. A cluster node for (B, C) has been added. When the potentials are set in the way detailed in the text, the update rules presented in this article reduce to Pearl's propagation rules in the original Bayesian network.
To complete the example, we set $\Psi(AB) = P(A) P(B \mid A)$, $\Psi(AC) = P(C \mid A)$, $\Psi(FC) = P(F \mid C)$, $\Psi(ZD) = P(D \mid B'C')$, and $\Psi(DE) = P(E \mid D)$; then the joint probability of the Markov net is

$$P(ABCDEFZ) = \Psi(AB)\, \Psi(AC)\, \Psi(FC)\, \Psi(ZD)\, \Psi(DE)\, \Psi(ZB)\, \Psi(ZC), \qquad (A.2)$$

which contains only pairwise potentials but is identical to the probability distribution of the original Bayes net (see equation A.1). We now describe the equivalence between the messages passed in the Markov net and those that are passed in the Bayes net using Pearl's algorithm. If X is a single parent of Y, then the message X sends to Y in the Bayes net is given by $\pi_Y(x)$. The message that X sends to Y in the Markov net is given by $\vec v_{XY}$. The messages are related by multiplication by the link matrix: $\vec v_{XY} = \alpha M_{XY}\, \pi_Y(x)$. Thus, in Pearl's formulation, the matrix multiplication is performed at the child node, while in the algorithm described here, it is performed at the parent node. The message that Y sends to X in Pearl's algorithm is denoted by $\lambda_Y(x)$ and is equal to $\vec v_{YX}$ in the Markov net algorithm up to normalization. In Pearl's algorithm the $\pi$ messages are normalized,
but the $\lambda$ ones are not, and in the algorithm presented here, all messages are normalized. If X is one of many parents of Y, then in the Markov net, X is connected to Y by means of a cluster node Z. As before, $\vec v_{XZ} = \alpha M_{XZ}\, \pi_Y(x)$, but since the $M_{XZ}$ matrix contains only ones and zeros, the message in the Markov net is simply a replication of elements of the Bayesian net message. As before, $\lambda_Y(x)$ is given by $\alpha \vec v_{ZX}$. Every message in Pearl's algorithm in the Bayes net can be identified with an equivalent message in the Markov net using the algorithm presented here. Furthermore, by substituting the equivalent messages into the update rules presented here, one can derive Pearl's update rules.

Acknowledgments

This work was supported by NEI R01 EY11005 to E. H. Adelson. I thank the anonymous reviewers, E. Adelson, M. Jordan, P. Dayan, M. Meila, Q. Morris, and J. Tenenbaum for helpful discussions and comments. After a preliminary version of this work appeared (Weiss, 1997), I had the pleasure of several discussions with D. Forney and F. Kschischang regarding related work they have been pursuing independently. I thank them for their comments and suggestions.

References

Aji, S., Horn, G., & McEliece, R. (1998). On the convergence of iterative decoding on graphs with a single cycle. In Proc. 1998 ISIT.
Aji, S. M., & McEliece, R. (in press). The generalized distributive law. IEEE Trans. Inform. Theory. Available at: http://www.systems.caltech.edu/EE/Faculty/rjm.
Benedetto, S., Montorsi, G., Divsalar, D., & Pollara, F. (1996). Soft-output decoding algorithms in iterative decoding of turbo codes (Jet Propulsion Laboratory, Telecommunications and Data Acquisition Progress Rep. 42-124). Pasadena, CA: California Institute of Technology.
Berrou, C., Glavieux, A., & Thitimajshima, P. (1993). Near Shannon limit error-correcting coding and decoding: Turbo codes. In Proc. IEEE International Communications Conference '93.
Bertsekas, D. P. (1987). Dynamic programming: Deterministic and stochastic models. Englewood Cliffs, NJ: Prentice Hall.
Forney, G. (1997). On iterative decoding and the two-way algorithm. In Intl. Symp. Turbo Codes and Related Topics (Brest, France).
Forney, G., Kschischang, F., & Marcus, B. (1998). Iterative decoding of tail-biting trellises. Presented at the 1998 Information Theory Workshop, San Diego.
Frey, B. J. (1998). Graphical models for pattern classification, data compression and channel coding. Cambridge, MA: MIT Press.
Frey, B. J., Koetter, R., & Vardy, A. (1998). Skewness and pseudocodewords in iterative decoding. In Proc. 1998 ISIT.
Gallager, R. (1963). Low density parity check codes. Cambridge, MA: MIT Press.
Gelb, A. (Ed.). (1974). Applied optimal estimation. Cambridge, MA: MIT Press.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6), 721–741.
Jensen, F. (1996). An introduction to Bayesian networks. Berlin: Springer-Verlag.
Kschischang, F. R., & Frey, B. J. (1998). Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communication, 16(2), 219–230.
Lauritzen, S. (1996). Graphical models. New York: Oxford University Press.
MacKay, D., & Neal, R. M. (1995). Good error-correcting codes based on very sparse matrices. In Cryptography and Coding: 5th IMA Conference. Lecture Notes in Computer Science No. 1025 (pp. 100–111). Berlin: Springer-Verlag.
McEliece, R., MacKay, D., & Cheng, J. (1998). Turbo decoding as an instance of Pearl's "belief propagation" algorithm. IEEE Journal on Selected Areas in Communication, 16(2), 140–152.
McEliece, R., Rodemich, E., & Cheng, J. (1995). The Turbo decision algorithm. In Proc. 33rd Allerton Conference on Communications, Control and Computing (pp. 366–379). Monticello, IL.
Minc, H. (1988). Nonnegative matrices. New York: Wiley.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2), 257–286.
Smyth, P. (1997). Belief networks, hidden Markov models, and Markov random fields: A unifying view. Pattern Recognition, 18(11), 1261–1268.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9(2), 227–269.
Strang, G. (1986). Introduction to applied mathematics. Wellesley, MA: Wellesley-Cambridge.
Weiss, Y. (1996). Interpreting images by propagating Bayesian beliefs. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Weiss, Y. (1997). Belief propagation and revision in networks with loops (Tech. Rep. No. 1616). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Wiberg, N. (1996). Codes and decoding on general graphs. Unpublished doctoral dissertation, University of Linköping, Sweden.

Received November 11, 1997; accepted November 25, 1998.
Communicated by Carl van Vreeswijk
ARTICLE
Population Dynamics of Spiking Neurons: Fast Transients, Asynchronous States, and Locking Wulfram Gerstner Center for Neuromimetic Systems, Swiss Federal Institute of Technology, EPFL-DI, CH-1015 Lausanne, Switzerland
An integral equation describing the time evolution of the population activity in a homogeneous pool of spiking neurons of the integrate-and-fire type is discussed. It is analytically shown that transients from a state of incoherent firing can be immediate. The stability of incoherent firing is analyzed in terms of the noise level and transmission delay, and a bifurcation diagram is derived. The response of a population of noisy integrate-and-fire neurons to an input current of small amplitude is calculated and characterized by a linear filter L. The stability of perfectly synchronized "locked" solutions is analyzed.

1 Introduction

In many areas of the brain, neurons are organized in populations of units with similar properties. The most prominent examples are probably columns in the somatosensory and visual cortex (Hubel & Wiesel, 1962; Mountcastle, 1957) and pools of motor neurons (Kandel & Schwartz, 1991). In such a situation, we may want to describe the mean activity of the neuronal population rather than the spiking of individual neurons. In a population of N neurons, we can formally determine the proportion of active neurons by counting the number of spikes $n_{\mathrm{act}}(t;\, t + \Delta t)$ in a small time interval $\Delta t$ and dividing by N. Division by $\Delta t$ yields the activity

$$A(t) = \lim_{\Delta t \to 0} \frac{1}{\Delta t}\, \frac{n_{\mathrm{act}}(t;\, t + \Delta t)}{N}. \qquad (1.1)$$
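As a sketch, the population activity of equation 1.1 can be estimated from a pooled spike list with a finite bin width $\Delta t$; the function and argument names below are illustrative, not from the original:

```python
def population_activity(spike_times, n_neurons, t, dt):
    """Empirical population activity of equation 1.1: the number of spikes
    n_act emitted by the whole population in [t, t + dt), divided by the
    number of neurons N and by the bin width dt.

    spike_times: iterable pooling the spike times of all N neurons."""
    n_act = sum(1 for s in spike_times if t <= s < t + dt)
    return n_act / (dt * n_neurons)
```

In practice $\Delta t$ stays finite, so the estimate trades temporal resolution against counting noise; the limit in equation 1.1 is well defined only for $N \to \infty$.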
Although equation 1.1 has units of a rate, the concept of a population average is quite distinct from the definition of a firing rate via a temporal average. Temporal averaging over many spikes of a single neuron is a concept that works well if the input is constant or changes on a time scale that is slow with respect to the size of the temporal averaging window. Sensory input in a real-world scenario, however, is never constant. Moreover, reaction times are often short, which indicates that neurons do not have the time for temporal averaging (Thorpe, Fize, & Marlot, 1996).

Neural Computation 12, 43–89 (2000) © 2000 Massachusetts Institute of Technology

The population activity, on the other hand, may react quickly to changes in the input
Figure 1: Population of neurons (schematic). All neurons receive the same input $I^{\mathrm{ext}}(t)$ (left), which results in a time-dependent population activity $A(t)$ (right).
(Knight, 1972a; Treves, 1992; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996). In this article we study the properties of a large and homogeneous population of spiking neurons. All neurons are identical, receive the same input $I^{\mathrm{ext}}(t)$, and are mutually coupled by synapses of uniform strength (see Figure 1). Can we write down a dynamic equation in continuous time that describes the evolution of the population activity? This problem comes up in various models and has been studied by several researchers (Knight, 1972a; Wilson & Cowan, 1972, 1973; Amari, 1974; Gerstner & van Hemmen, 1992, 1993, 1994; Bauer & Pawelzik, 1993; Gerstner, 1995; Pinto, Brumberg, Simons, & Ermentrout, 1996; Senn et al., 1996; Eggert & van Hemmen, 1997; Pham, Pakdamen, Champagnat, & Vibert, 1998). A frequently adopted solution is a differential equation (Wilson & Cowan, 1972),

$$\tau \frac{dA(t)}{dt} = -A(t) + g\!\left( \int_0^\infty \left[ \epsilon(s)\, A(t-s) + \tilde\epsilon(s)\, I^{\mathrm{ext}}(t-s) \right] ds \right), \qquad (1.2)$$

where neurons are coupled to each other with a delay kernel $\epsilon$ and receive external input (or input from other populations) filtered by a kernel $\tilde\epsilon$; $g(\cdot)$ is the transfer function of the population and $\tau$ is a time constant. Equation 1.2 is equivalent to the equations of graded-response neurons (Cowan, 1968; Cohen & Grossberg, 1983; Hopfield, 1984). Equation 1.2 introduces a time constant $\tau$ that is basically ad hoc. Wilson and Cowan (1972, 1973) have derived equation 1.2 from an integral equation,

$$A(t) = \left[ 1 - \int_0^{\delta^{\mathrm{abs}}} A(t-s)\, ds \right] g\!\left( \int_0^\infty \left[ \epsilon(s)\, A(t-s) + \tilde\epsilon(s)\, I^{\mathrm{ext}}(t-s) \right] ds \right), \qquad (1.3)$$
which is valid for stochastic neurons with an absolute refractory period of length $\delta^{\mathrm{abs}}$. The term in the square brackets accounts for the fact that neurons that have fired between $t - \delta^{\mathrm{abs}}$ and $t$ are in the refractory state and
cannot fire. $g(\cdot)$ is the neuronal transfer function. Wilson and Cowan subsequently used a method of time-coarse graining to transform equation 1.3 into the differential equation 1.2. The time constant $\tau$, which appears on the left-hand side of equation 1.2, is related to the window of time-coarse graining. It has been argued that $\tau$ in equation 1.2 is the membrane time constant of the neurons, an assertion that may be criticized since it holds for slowly changing activities only (Wilson & Cowan, 1972, 1973; Abbott & van Vreeswijk, 1993; Gerstner, 1995). The activity is, however, not always slowly changing. It has been shown previously that the population activity can react quasi-instantaneously to abrupt changes in the input (Treves, 1992; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996). Transients in networks of nonleaky integrate-and-fire neurons can be very short (Hopfield & Herz, 1995). A population of leaky integrate-and-fire units can follow a periodic drive up to very high frequencies (Knight, 1972a, 1972b). Moreover, homogeneous networks may be in an oscillatory state where all neurons fire at exactly the same time (Mirollo & Strogatz, 1990; Gerstner & van Hemmen, 1993; Gerstner, van Hemmen, & Cowan, 1996; Somers & Kopell, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995; Terman & Wang, 1995). In this case, the population activity changes rapidly between zero and a very high activity. Rapid transients and perfectly synchronized oscillations are inconsistent with a differential equation of the form 1.2. In this article we analyze the population dynamics and derive a dynamic equation that is exact in the limit of a large number of neurons. The relevant equation is a generalization of the integral equations of Wilson and Cowan (1972) and Knight (1972a) and has been discussed previously in Gerstner and van Hemmen (1994) and Gerstner (1995).
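For intuition, the differential form, equation 1.2, is easy to integrate numerically. The sketch below assumes exponential kernels and a sigmoidal transfer function; all function names and parameter values are illustrative, and this coarse-grained dynamics is exactly the kind that, as argued above, cannot capture rapid transients:

```python
import math

def simulate_rate_model(i_ext, tau=10.0, tau_s=5.0, j0=0.5, dt=0.1, t_max=200.0):
    """Forward-Euler sketch of the rate equation 1.2.

    Both kernels are taken as the same normalized exponential
    (1/tau_s) exp(-s/tau_s) with coupling j0, so the filtered input h obeys
    tau_s dh/dt = -h + j0 * A + i_ext(t).  Parameter values are illustrative.
    """
    def g(h):                      # sigmoidal population transfer function
        return 1.0 / (1.0 + math.exp(-4.0 * (h - 0.5)))

    activity, h = 0.0, 0.0
    trace = []
    for step in range(int(t_max / dt)):
        t = step * dt
        h += dt * (-h + j0 * activity + i_ext(t)) / tau_s
        activity += dt * (-activity + g(h)) / tau
        trace.append(activity)
    return trace
```

With a constant input the activity relaxes smoothly to a fixed point on the time scale $\tau$; there is no mechanism in this equation for an instantaneous response to a step in the input.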
The dynamic properties of the population activity can be analyzed directly on the level of the integral equation; there is no need to transform it into a differential equation. The population equation allows us to discuss the following four questions from a unified point of view: How does a population of neurons react to a fast change in the input? We show that during the initial phase of the transient, the population activity reacts instantaneously (Treves, 1992; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996). Specifically, the initial phase of the transient directly reflects the form $\tilde\epsilon$ of the postsynaptic potential. The end of the transient, on the other hand, can be rather slow and depends on the noise level and the coupling. In the special case of noiseless uncoupled integrate-and-fire neurons, a new periodic state is reached as soon as every neuron has fired once. What are the conditions for exact synchrony in the firing of all neurons? We show that neurons are "locked" together if firing occurs while the input potential is rising. The locking theorem in Gerstner et al. (1996) follows naturally from the integral equation.
What are the conditions to make the neurons fire in an optimally asynchronous manner? We discuss why, without noise, asynchronous firing is almost always unstable (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Gerstner, 1995). We show that with a proper choice of the delays and a sufficient amount of noise, the state of asynchronous firing can be stabilized (Gerstner, 1995). A bifurcation diagram is determined in terms of the noise level and the transmission delay. What is the response of an asynchronously firing population to an arbitrary time-dependent input current? We show that at least in the low-noise regime, the population activity responds faithfully even to high-frequency components of the input current (Knight, 1972a, 1972b). The cutoff frequency of the system is therefore given by the time constant of the synaptic current rather than by the membrane time constant. For some noise models, the cutoff frequency in the high-noise regime is further reduced. Theories of population activity have a fairly long history (Wilson & Cowan, 1972, 1973; Knight, 1972a; Amari, 1974; Feldman & Cowan, 1975). Some researchers have formulated population equations as maps in discrete time for homogeneous (Gerstner & van Hemmen, 1992; Bauer & Pawelzik, 1993) or inhomogeneous populations (Senn et al., 1996; Pham et al., 1998). Others have formulated continuity equations for the probability distribution of the membrane potential with the threshold as an absorbing boundary (Abbott & van Vreeswijk, 1993; Treves, 1993; Brunel & Hakim, 1998; Nykamp, Tranchina, Shapley, & McLoughlin, 1998). The population equations discussed in this article are related to the integral equations of Wilson and Cowan (1972) and Knight (1972a). Section 3 will review the population equations in continuous time developed in detail in Gerstner and van Hemmen (1992, 1994) and Gerstner (1995).
Sections 4 through 7 will apply the population equations to the questions of locking, fast transients, signal transmission, and asynchronous firing.

2 Model

2.1 Deterministic Threshold Model. Model neurons are described by the spike response model, a variant of the integrate-and-fire model (Gerstner, 1995; Gerstner et al., 1996). A neuron i fires if its membrane potential $u_i$ hits the threshold $\vartheta$. The membrane potential is of the form

$$u_i(t) = \eta(t - \hat{t}_i) + h_{\mathrm{PSP}}(t \mid \hat{t}_i), \qquad (2.1)$$

where $\hat{t}_i$ is the most recent firing time of neuron i, and

$$h_{\mathrm{PSP}}(t \mid \hat{t}_i) = \sum_{j \in \Gamma_i} \sum_{t_j^{(f)}} w_{ij}\, \epsilon(t - \hat{t}_i,\, t - t_j^{(f)}) + J^{\mathrm{ext}} \int_0^\infty \tilde\epsilon(t - \hat{t}_i, s)\, I^{\mathrm{ext}}(t-s)\, ds \qquad (2.2)$$
is the postsynaptic potential caused by firings $t_j^{(f)}$ of presynaptic neurons $j \in \Gamma_i$ or by external input $I^{\mathrm{ext}}(t)$. The kernel $\eta(\cdot)$ in equation 2.1 is a (negative) contribution due to refractoriness. Alternatively, we could incorporate refractoriness into the threshold and define a dynamic threshold $\vartheta - \eta(t - \hat{t}_i)$. Since we are interested in a homogeneous population of neurons, the kernels $\eta$, $\epsilon$, and $\tilde\epsilon$ have no indices i, j, and all neurons receive the same input $I^{\mathrm{ext}}(t)$. Moreover, the interaction strength between the neurons is uniform,

$$w_{ij} = \frac{J_0}{N}, \qquad (2.3)$$
where $J_0$ is a parameter. The interaction strength scales with one over the number N of neurons so that the total input remains finite in the limit $N \to \infty$. The theoretical approach developed in this article is valid for arbitrary response kernels $\epsilon$ and $\tilde\epsilon$ and for a broad variety of refractory kernels $\eta$. Occasionally, however, we may want to specify the response kernels. For example, with an appropriate choice of the kernels $\epsilon$, $\tilde\epsilon$, and $\eta$, equation 2.1 gives an excellent approximation to the Hodgkin-Huxley model with time-dependent input (Kistler, Gerstner, & van Hemmen, 1997). Kernels can also be adjusted to experimental data (Stevens & Zador, 1998). With a different choice of kernels, equation 2.1 can be mapped exactly to various versions of the integrate-and-fire model (Gerstner, 1995). Although most of the theory developed in this article is general, we will use two specific models for an illustration of the results.

2.1.1 Simple Spike Response Model (SRM0). In the first model, we assume that the kernels $\epsilon$ and $\tilde\epsilon$ do not depend on their first argument: $\epsilon(t - \hat{t}, s) = \epsilon_0(s)$ and $\tilde\epsilon(t - \hat{t}, s) = \tilde\epsilon_0(s)$. This is the simplest instance of a spike response model (Gerstner & van Hemmen, 1992) and will be called SRM0. In this case, the postsynaptic potential is independent of the last firing time of the neuron. We set $h_{\mathrm{PSP}}(t \mid \hat{t}) = h(t)$ with

$$h(t) = \sum_{j \in \Gamma_i} \sum_{t_j^{(f)}} w_{ij}\, \epsilon_0(t - t_j^{(f)}) + J^{\mathrm{ext}} \int_0^\infty \tilde\epsilon_0(s)\, I^{\mathrm{ext}}(t-s)\, ds. \qquad (2.4)$$
We will refer to $h(t)$ as the input potential. In simulations we need to specify the kernels. We take
$$\epsilon_0(s) = \frac{s - \Delta^{\mathrm{ax}}}{\tau^2} \exp\left(-\frac{s - \Delta^{\mathrm{ax}}}{\tau}\right) H(s - \Delta^{\mathrm{ax}}), \tag{2.5}$$
where $\Delta^{\mathrm{ax}}$ is the axonal delay and $\tau$ is a membrane time constant. The normalization is $\int \epsilon_0(s)\, ds = 1$. As always, $H(s)$ is the Heaviside step function
with $H(s) = 0$ for $s \le 0$ and $H(s) = 1$ for $s > 0$. For external input we use in simulations
$$\tilde{\epsilon}_0(s) = \frac{1}{\tau} \exp\left(-\frac{s}{\tau}\right) H(s) \tag{2.6}$$
with the same time constant $\tau$ as in equation 2.5. After each spike the membrane potential is reset by adding a refractory kernel,
$$\eta(s) = -\tilde{\eta}_0 \exp(-s/\tau)\, H(s). \tag{2.7}$$
In simulations of SRM0 neurons we set $\tilde{\eta}_0 = 1$. The spike response model with the specific kernels defined in equations 2.5 through 2.7 can be considered an approximation to the integrate-and-fire model. The approximation is based on a truncation and is valid if the typical interspike interval $T_0$ is much larger than the time constant $\tau$ (Gerstner et al., 1996); the error of the approximation is of the order $\exp(-T_0/\tau)$. For the numerical analysis later on, we have adjusted the threshold $\vartheta$ so that the mean interspike interval is exactly $T_0 = 2\tau$. In this case, we may expect that SRM0 gives a reasonable, albeit not perfect, approximation to the integrate-and-fire model. A major advantage of SRM0 neurons (compared to integrate-and-fire neurons) is that most of the results of sections 3 through 7 can be formulated in a mathematically transparent form, which allows an easy interpretation.

2.1.2 Integrate-and-Fire Model (IF).
With a different choice of the kernels $\epsilon$ and $\tilde{\epsilon}$ in equation 2.2, it is possible to construct a perfect mapping between equation 2.1 and the standard integrate-and-fire model (Gerstner, 1995). Between two firings, the membrane potential of an integrate-and-fire (IF) unit $i$ changes according to
$$\tau_m \frac{du_i}{dt} = -u_i + \sum_j \sum_{t_j^{(f)}} w_{ij}\, \alpha\bigl(t - t_j^{(f)}\bigr) + J^{\mathrm{ext}} I^{\mathrm{ext}}(t), \tag{2.8}$$
where $\tau_m$ is the membrane time constant, $w_{ij}$ is the coupling strength, and $\alpha(\cdot)$ is the time course of the synaptic current. If $u_i$ hits a threshold $\vartheta$, the membrane potential is reset to $u_{\mathrm{reset}} < \vartheta$. The spike response equation, 2.1, can be considered an integrated version of equation 2.8. The value $u_{\mathrm{reset}}$ comes in as an initial condition for the integration of equation 2.8 and leads to a kernel $\eta$ as in equation 2.7 with $\tau = \tau_m$ and $\tilde{\eta}_0 = -u_{\mathrm{reset}}$. For $u_{\mathrm{reset}} = 0$, the $\eta$-kernel vanishes. Integration of the synaptic current terms in equation 2.8 yields
$$\epsilon(x, s) = \int_{s-x}^{s} \alpha(s')\, e^{-(s - s')/\tau_m}\, ds' = \epsilon_0(s) - \epsilon_0(s - x)\, e^{-x/\tau_m}, \tag{2.9}$$
where $\epsilon_0(s) = \int_0^s ds'\, \alpha(s') \exp[-(s - s')/\tau_m]$. The response kernel defined in equation 2.5 corresponds to a synaptic current $\alpha(s) = \tau_s^{-1} \exp(-s/\tau_s)\, H(s)$ with time constant $\tau_s = \tau_m = \tau$ and an additional axonal transmission delay $\Delta^{\mathrm{ax}}$. Similarly, $\tilde{\epsilon}(x, s) = \tilde{\epsilon}_0(s) - \tilde{\epsilon}_0(s - x)\, e^{-x/\tau_m}$ with $\tilde{\epsilon}_0$ given by equation 2.6 and $\tau_m = \tau$. Using the expressions for $\tilde{\epsilon}$ and $\epsilon$ with $\tau_m = \tau$, the postsynaptic potential for IF units can be expressed as
$$h_{\mathrm{PSP}}(t \mid \hat{t}) = h(t) - h(\hat{t})\, e^{-(t - \hat{t})/\tau}, \tag{2.10}$$
where the input potential $h(t)$ is given by equation 2.4. We emphasize that the specific choice of the kernel $\epsilon_0$ in equation 2.5 corresponds to a synaptic current with fast rise time. The theoretical framework of this article, however, is more general and applies equally well to slowly rising synaptic currents.

2.2 Noise. The neuron model discussed so far is purely deterministic. Given a spike at $\hat{t}$ and for a known postsynaptic potential $h_{\mathrm{PSP}}(t \mid \hat{t})$, we can calculate the interval until the next spike from the threshold condition,
$$T(\hat{t}) = \min\{(t - \hat{t}) \mid u(t) = \vartheta;\ t > \hat{t}\}. \tag{2.11}$$
In words, the interval $T$ is given by the first threshold crossing after the spike at $\hat{t}$. In the presence of noise, the exact firing time and, hence, the exact interval are unknown. We may, however, determine the probability density $P_h(t \mid \hat{t})$ that firing occurs around time $t$, given the postsynaptic potential $h_{\mathrm{PSP}}(t' \mid \hat{t})$ for $\hat{t} < t' \le t$ and knowing the last firing time $\hat{t}$. The distribution $P_h(t \mid \hat{t})$ will play an important role for the population equations in section 3. In the noiseless case, the interval distribution reduces to a Dirac $\delta$-function,
$$P_h(t \mid \hat{t}) = \delta[t - \hat{t} - T(\hat{t})], \tag{2.12}$$
where $T(\hat{t})$ is the next interval given a spike at $\hat{t}$, determined implicitly by equation 2.11. To calculate $P_h(t \mid \hat{t})$ in the noisy case, we have to decide on a noise model. There are at least three different ways to include noise in spiking neurons. A popular procedure is to add white noise on the right-hand side of the differential equation of the IF neuron; below this is called noise model C. Calculation of $P_h(t \mid \hat{t})$ then requires the solution of a first passage time problem, which is known to be hard (Tuckwell, 1988, 1989). It is therefore convenient to work with one of the following two noise models: threshold noise (noise model A) or reset noise (noise model B).

2.2.1 Noisy Threshold (Noise Model A). In this first noise model, we assume that the neuron can fire even though the formal threshold $\vartheta$ has not
been reached yet or may stay quiescent even though the membrane potential is above threshold. To do this consistently, we introduce an "escape rate" $\rho$, which depends on the distance between the momentary value of the membrane potential and the threshold,
$$\rho(t) = f[u(t) - \vartheta]. \tag{2.13}$$
In the mathematical literature, the quantity $\rho$ would be called a stochastic intensity. We require $f(x) \to 0$ for $x \to -\infty$; otherwise the choice of the function $f$ is arbitrary. Plausible assumptions are an exponential dependence, $\rho = \rho_0 \exp[\beta(u - \vartheta)]$; a gaussian dependence, $\rho = \rho_0 \exp[-\beta(u - \vartheta)^2]$; a step function, $\rho = \rho_0\, H(u - \vartheta)$; or a piecewise linear function, $\rho = \rho_0 (u - \vartheta)\, H(u - \vartheta)$. Here $\beta$ and $\rho_0$ are parameters. Note that the escape rate $\rho$ is implicitly time dependent, since the membrane potential $u(t) = \eta(t - \hat{t}) + h_{\mathrm{PSP}}(t \mid \hat{t})$ varies over time. In addition, we may also include an explicit time dependence, for example, to account for a reduced spiking probability immediately after the spike at $\hat{t}$.

Let us now calculate $P_h(t \mid \hat{t})$, the probability density of having a spike at $t$, given that the last spike occurred at $\hat{t}$ and in the presence of a postsynaptic potential $h_{\mathrm{PSP}}(t \mid \hat{t})$ for $t > \hat{t}$. At each moment of time, the value $u(t)$ of the membrane potential determines the escape rate $\rho(t) = f[u(t) - \vartheta]$. In order to emit the next spike at $t$, the neuron has to survive the interval $(\hat{t}, t)$ without firing and then fire at $t$. Given the escape rate $\rho(t)$, the probability of survival from $\hat{t}$ to $t$ without firing is
$$S_h(t \mid \hat{t}) = \exp\left(-\int_{\hat{t}}^{t} \rho(t')\, dt'\right). \tag{2.14}$$
The probability density of firing at time $t$ is $\rho(t)$; thus with equation 2.14 we have
$$P_h(t \mid \hat{t}) = \rho(t) \exp\left(-\int_{\hat{t}}^{t} \rho(t')\, dt'\right), \tag{2.15}$$
which is the desired result. More detailed derivations of equation 2.15 can be found in Gerstner and van Hemmen (1994) and in the appendix of Wilson and Cowan (1972) (see also Cox & Lewis, 1966).

2.2.2 Noisy Reset (Noise Model B). In this noise model, firing is given by the exact threshold condition $u(t) = \vartheta$. Noise is included in the formulation of reset and refractoriness. A convenient way of doing this is by replacing $\tilde{\eta}_0$ in equation 2.7 by
$$\tilde{\eta}_0 \longrightarrow \eta_0(r) = \tilde{\eta}_0\, e^{r/\tau}, \tag{2.16}$$
where $r$ is a random variable with zero mean. In the language of the IF neuron, we can describe the effect of $r$ as a stochastic component in the value of the reset potential. After each spike, a new value of $r$ is drawn randomly from a gaussian distribution $G_\sigma(r)$ with variance $\sigma \ll \tau$. Let us write $T(\hat{t}, r)$ for the next interval of a neuron that has fired at $\hat{t}$ and was reset with a stochastic value $r$. Since $r$ is drawn at random, the interval distribution is
$$P_h(t \mid \hat{t}) = \int dr\, \delta[t - \hat{t} - T(\hat{t}, r)]\, G_\sigma(r). \tag{2.17}$$
In order to keep the arguments simple, we assume in the following that the typical interval $T_0(\hat{t}) = T(\hat{t}, 0)$ is significantly larger than the width $\sigma$ of the distribution.

In the case of the simple spike response model SRM0, a reset according to equation 2.16 with $r \ne 0$ shifts the refractory kernel horizontally along the time axis. To see this, let us consider a neuron that has fired its last spike at $\hat{t}$ and has been reset with variable $r$. Its refractory kernel is $\eta(t) = -\tilde{\eta}_0 \exp[-(t - \hat{t} - r)/\tau]$. Thus the neuron acts like a noiseless neuron that has fired its last spike at $t' = \hat{t} + r$. We therefore have $T(\hat{t}, r) = r + T_0(\hat{t} + r)$, where $T_0(t')$ is the interval of a noiseless neuron that has fired at $t'$. For a constant input potential $h(t) = h_0$, the interval distribution of a SRM0 neuron is therefore a gaussian distribution centered at the noise-free interval $T_0$, $P_{h_0}(t \mid \hat{t}) = G_{\hat{\sigma}}(t - \hat{t} - T_0)$ with variance $\hat{\sigma} = \sigma$. Thus, the gaussian distribution $G_\sigma(r)$ of the noise variable $r$ maps directly to a gaussian distribution of the intervals around the mean $T_0$.

In the case of the IF neuron, the argument is slightly more involved. For $\sigma \ll T_0(\hat{t})$ we may expand the interval $T(\hat{t}, r) = T_0(\hat{t}) + r\, T_1(\hat{t}) + O(r^2)$ in equation 2.17 to linear order in $r$. From the threshold condition, equation 2.11, we find $T_1(\hat{t}) = \eta'/u'$, where $\eta'$ denotes the derivative of $\eta(s)$ at $s = T_0(\hat{t})$ and $u' = du/dt$ is to be evaluated at $t = \hat{t} + T_0(\hat{t})$.
The integration on the right-hand side of equation 2.17 can now be performed and yields
$$P_h(t \mid \hat{t}) = G_{\hat{\sigma}}[t - \hat{t} - T_0(\hat{t})] \tag{2.18}$$
with variance $\hat{\sigma} = \sigma\, T_1(\hat{t})$. Hence the interval distribution is a gaussian around the noise-free interval. For a constant input potential $h_0$, we find from equation 2.10 that $T_1 = \tilde{\eta}_0/(\tilde{\eta}_0 + h_0)$, which may be used for the calculation of $\hat{\sigma}$.

2.2.3 Noisy Integration (Noise Model C). A noise term is added on the right-hand side of equation 2.8 (Tuckwell, 1988, 1989),
$$\tau_m \frac{du_i}{dt} = -u_i + \sum_j \sum_{t_j^{(f)}} w_{ij}\, \alpha\bigl(t - t_j^{(f)}\bigr) + J^{\mathrm{ext}} I^{\mathrm{ext}}(t) + \xi_i(t), \tag{2.19}$$
where $\xi_i$ represents gaussian white noise with zero mean and correlation $\langle \xi_i(t)\, \xi_i(t') \rangle = D\, \tau_m\, \delta(t - t')$. The noise term can be motivated either by random input from background neurons that are not modeled explicitly (Stein, 1967) or by random openings of membrane channels. Noise causes the actual trajectory to drift away from the noise-free reference trajectory. In the absence of a threshold, the actual trajectories would have a gaussian distribution around the reference trajectory with variance $\sigma_u^2 = (D/2)\,[1 - \exp(-2t/\tau_m)]$ (Tuckwell, 1989). The threshold acts as an absorbing boundary, and $P_h(t \mid \hat{t})$ is the distribution of first passage times. In general, the first passage time problem is hard to solve. Even for constant driving current, the interval distribution is not symmetric around its mean, and higher moments are important. If, however, the level of noise is low and the reference trajectory hits the threshold with finite slope $u'$, then a gaussian approximation of the distribution is reasonable (Tuckwell, 1989). In this case, the interval distribution is the same as for noise model B with $\sigma = \sigma_u/u'$. On the other hand, if the reference trajectory stays significantly below threshold, then noise model C may be approximated by a noisy threshold (noise model A) with escape rate $\rho \propto \rho_0 \exp[-\beta(u - \vartheta)^2]$ with parameters $\rho_0 \approx 1/\tau_m$ and $\beta = D^{-1}$ (Plesser & Gerstner, 1999). To avoid the complications of noise model C, we work in the following mainly with noise models A and B, except for some statements concerning C. A disadvantage of noise model B is that it generates spikes only for suprathreshold stimuli and cannot describe noise-activated spiking with subthreshold input. On the other hand, an indirect justification of noise model B could come from the theory of excitable membranes (Gutkin & Ermentrout, 1998).

3 Population Activity Equation

We consider a homogeneous and fully connected network of spiking neurons in the limit of $N \to \infty$.
We aim for a dynamic equation that describes the evolution of the population activity over time. The total number of spikes $n_{\mathrm{act}}$ emitted by all neurons in the population is
$$n_{\mathrm{act}}(t;\, t + \Delta t) = \sum_{i=1}^{N} \sum_{t_i^{(f)}} \int_{t}^{t + \Delta t} \delta\bigl(t' - t_i^{(f)}\bigr)\, dt'. \tag{3.1}$$
The population activity has already been defined in equation 1.1. Using equation 3.1 in equation 1.1, we may write
$$A(t)\, dt = \frac{1}{N} \sum_{i=1}^{N} \sum_{t_i^{(f)}} \delta\bigl(t - t_i^{(f)}\bigr)\, dt. \tag{3.2}$$
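Equation 3.2 suggests a direct empirical estimator: in a simulation, $A(t)$ is the fraction of neurons that fire in a short bin $[t, t + \Delta t)$, divided by $\Delta t$. A minimal sketch (not the article's code; the spike data and bin width are illustrative):

```python
import numpy as np

def population_activity(spike_times, n_neurons, t_max, dt=0.5):
    """Binned estimate of A(t) = n_act(t; t+dt) / (N * dt), equations 3.1-3.2.

    spike_times: flat array of spike times pooled over all neurons (ms).
    Returns bin start times and the activity estimate per bin
    (spikes per ms per neuron).
    """
    edges = np.arange(0.0, t_max + dt, dt)
    counts, _ = np.histogram(spike_times, bins=edges)
    return edges[:-1], counts / (n_neurons * dt)

# Illustrative data: 100 neurons firing regularly with period T = 10 ms,
# phases spread uniformly (incoherent firing), observed for 100 ms.
rng = np.random.default_rng(0)
phases = rng.uniform(0.0, 10.0, size=100)
spikes = np.concatenate([np.arange(p, 100.0, 10.0) for p in phases])
t, A = population_activity(spikes, n_neurons=100, t_max=100.0, dt=2.0)
print(A.mean())  # for incoherent periodic firing, A stays near 1/T = 0.1
```

For such an incoherent state the estimate fluctuates around the mean rate $1/T$, which is exactly the constant-activity scenario assumed at the start of section 5.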
The definition of $A$ allows us to rewrite the postsynaptic potential $h_{\mathrm{PSP}}$ in equation 2.2 in the form
$$h_{\mathrm{PSP}}(t \mid \hat{t}) = J_0 \int_0^\infty \epsilon(t - \hat{t}, s)\, A(t - s)\, ds + J^{\mathrm{ext}} \int_0^\infty \tilde{\epsilon}(t - \hat{t}, s)\, I^{\mathrm{ext}}(t - s)\, ds. \tag{3.3}$$
Thus, given the activity $A(t')$ for $t' < t$, we can determine the potential $h_{\mathrm{PSP}}(t)$ of a neuron that has fired its last spike at $\hat{t}$. What we need is another equation that allows us to determine the present activity $A(t)$ given the past. The equation for the activity dynamics will be derived from three observations:

1. The model neurons are supposed to show no adaptation. According to equation 2.1, the state of neuron $i$ depends explicitly on only the most recent firing time $\hat{t}_i$ (and, of course, on the input $h_{\mathrm{PSP}}$), but not on the firing times of earlier spikes of neuron $i$. This allows us to work in the framework of a (nonstationary) renewal theory (Cox, 1962; Gerstner, 1995).

2. In the limit of $N$ to infinity, only the expectation value of a random variable matters. The exact form of a probability distribution is irrelevant, and we can work directly with the mean. Full connectivity and $N \to \infty$ is the limit where mean-field theory becomes exact.

3. The total number of neurons in the population remains constant. We may exploit this fact to derive a conservation law.
3.1 Integral Equation for the Dynamics. Because of the first observation above, the probability density for a spike at $t$ depends only on $\hat{t}$ and on $h_{\mathrm{PSP}}(t \mid \hat{t})$ for $t > \hat{t}$. We have already used this knowledge in our notation $P_h(t \mid \hat{t})$ for the probability density of firing. The subscript $h$ stands for the dependence on $h_{\mathrm{PSP}}(t \mid \hat{t})$. Integration of the probability density over time, $\int_{\hat{t}}^{t} P_h(s \mid \hat{t})\, ds$, gives the probability that a neuron that has fired at $\hat{t}$ fires its next spike at some arbitrary time between $\hat{t}$ and $t$. Consequently, we can define a survival probability,
$$S_h(t \mid \hat{t}) = 1 - \int_{\hat{t}}^{t} P_h(s \mid \hat{t})\, ds, \tag{3.4}$$
that is, the probability that a neuron that is under the influence of $h_{\mathrm{PSP}}$ and has fired its last spike at $\hat{t}$ survives without firing up to time $t$. We now turn to a homogeneous population of neurons in the limit of $N \to \infty$. We consider the network state at time $t$. Because of the second observation, the proportion of neurons that have fired their last spike between $t_0$ and $t$ (and have not fired since) is exactly
$$\int_{t_0}^{t} S_h(t \mid \hat{t})\, A(\hat{t})\, d\hat{t}. \tag{3.5}$$
All neurons have fired at some point in the past;¹ thus,
$$\int_{-\infty}^{t} S_h(t \mid \hat{t})\, A(\hat{t})\, d\hat{t} = 1. \tag{3.6}$$
Because of the third observation, the normalization equation 3.6 must hold at any arbitrary time $t$. Taking the derivative of equation 3.6 with respect to $t$ yields the activity dynamics,
$$A(t) = \int_{-\infty}^{t} P_h(t \mid \hat{t})\, A(\hat{t})\, d\hat{t}. \tag{3.7}$$
A different derivation of equation 3.7 is given in Gerstner and van Hemmen (1992, 1994) and Gerstner (1993, 1995). The population equation 3.7, with $h_{\mathrm{PSP}}$ given by equation 3.3, is the starting point for the discussions in the following sections. Intuitively, equation 3.7 is easy to understand. The kernel $P_h(t \mid \hat{t})$ is the probability density that the next spike of a neuron under the influence of a potential $h_{\mathrm{PSP}}(t \mid \hat{t})$ occurs at time $t$, given that its last spike was at $\hat{t}$. The number of neurons that have fired at $\hat{t}$ is proportional to $A(\hat{t})$, and the integral runs over all the past. An important remark concerns the proper normalization of the activity. Since equation 3.7 is defined as the derivative of equation 3.6, the integration constant on the right-hand side of the equation is lost. The system of equations 3.7 and 3.3 is therefore invariant under a rescaling of the activity $A \to c\,A$ and $J_0 \to c^{-1} J_0$ with some constant $c$. To get the correct normalization, we have to go back to equation 3.6. In the following sections the population equations in the form of equations 3.7 or 3.6 are analyzed for various scenarios. Equation 3.7 with the kernel $P_h(t \mid \hat{t})$ calculated for the potential $h_{\mathrm{PSP}}$ given by equation 3.3 defines the dynamics in a homogeneous network of spiking neurons with short-term memory. We remark that although equation 3.7 looks linear, it is in fact a highly nonlinear equation because the kernel $P_h(t \mid \hat{t})$ depends nonlinearly on $h_{\mathrm{PSP}}$, and $h_{\mathrm{PSP}}$ in turn depends again on the activity via equation 3.3. (See noise model A in section 2.2 for an explicit example of the kernel $P_h(t \mid \hat{t})$.)
¹ Neurons that have never fired before are assigned a formal firing time $\hat{t} = -\infty$.
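The update structure of equation 3.7 can be simulated directly once a noise model fixes $P_h(t \mid \hat{t})$. The following sketch (ours, not the article's code; all parameter values are assumptions) discretizes time and tracks the fraction of neurons at each "age" since their last spike, for an uncoupled SRM0 population with the exponential escape rate of noise model A (equations 2.13 through 2.15):

```python
import numpy as np

# Illustrative parameters (assumed): uncoupled SRM0 population, constant
# input potential h0, exponential escape rate (noise model A).
dt, tau, eta0, h0, theta = 0.1, 4.0, 1.0, 0.2, 0.0
rho0, beta = 0.5, 4.0
K = 2000                              # age bins: 200 ms of spiking history

ages = np.arange(K) * dt
u = h0 - eta0 * np.exp(-ages / tau)   # u = h0 + eta(t - t_hat), equation 2.7
hazard = rho0 * np.exp(beta * (u - theta))   # escape rate, equation 2.13
p_fire = 1.0 - np.exp(-hazard * dt)   # firing prob per step, from eq. 2.15

# n[k]: fraction of neurons whose last spike lies k*dt in the past.
n = np.zeros(K)
n[0] = 1.0                            # start all neurons as having just fired

A = []
for _ in range(5000):
    fired = np.sum(n * p_fire)        # fraction firing in this time step
    survivors = n * (1.0 - p_fire)
    n[1:] = survivors[:-1]            # everyone else ages by one bin
    n[-1] += survivors[-1]            # oldest bin collects the tail
    n[0] = fired                      # fired fraction restarts at age zero
    A.append(fired / dt)              # population activity, equation 3.7

print(abs(n.sum() - 1.0) < 1e-9, A[-1] > 0)
```

Because the fired fraction is re-injected at age zero, the normalization of equation 3.6 is conserved exactly at every step; coupling could be added by feeding the convolution of $A$ with $\epsilon_0$ back into the input potential, as in equation 3.3.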
3.2 Relation to the Wilson-Cowan Integral Equation. The Wilson-Cowan integral equation, 1.3, is a special case of the population dynamics, equation 3.7. We assume neurons of the SRM0 type, $h_{\mathrm{PSP}}(t \mid \hat{t}) = h(t)$, and consider noise model A. The interval distribution is equation 2.15 with instantaneous rate $\rho(t) = f[u(t)]$. We may identify $f$ with the gain function $g$ of the Wilson-Cowan model. As shown in appendix A, for a refractory kernel $\eta$ with absolute refractoriness only, equation 3.7 reduces strictly to the Wilson-Cowan equation, 1.3. Moreover, the framework of the population equation 3.7 also works for relative refractoriness. In the appendix it is shown that equation 3.7 for SRM0 neurons with noise model A is formally equivalent to the population equations for neurons with relative refractoriness derived in the appendix of Wilson and Cowan (1972). There is, however, an important difference in the interpretation of the equations. Wilson and Cowan motivated their gain function $g$ by a distribution of threshold values $\vartheta$ in an inhomogeneous population. In this case, the population equation 1.3 is an approximation, since correlations are neglected. In general, it matters whether the neurons in the refractory state are those with high or low threshold (Wilson & Cowan, 1972). In our interpretation, $g$ is the instantaneous escape rate due to a noisy threshold in a homogeneous population. In this interpretation, equation 1.3 is the exact equation for neurons with absolute refractoriness.

4 Noise-Free Population Dynamics

4.1 General Results. We start with a discussion of equation 3.7 in the noise-free case. As we have seen in equation 2.12, the probability density $P_h(t \mid \hat{t})$ reduces in the limit of no noise to a Dirac $\delta$-function. We use equation 2.12 in 3.7 and find
$$A(t) = \int_{-\infty}^{t} \delta\bigl(t - \hat{t} - T(\hat{t})\bigr)\, A(\hat{t})\, d\hat{t}. \tag{4.1}$$
Here $T(\hat{t})$ is the interval given implicitly by the threshold condition, equation 2.11. Note that $T(\hat{t})$ is the interval starting at $\hat{t}$ and looking forward toward the next spike. The integration over the $\delta$-function in equation 4.1 can be done. Since $T$ in the argument of the $\delta$-function depends on $\hat{t}$, the result is
$$A(t) = \left[1 + \frac{\partial_t h + \partial_{\hat{t}} h}{\eta' - \partial_{\hat{t}} h}\right]_+ A[t - T_b(t)], \tag{4.2}$$
where $T_b(t)$ is the backward interval given a spike at time $t$. The partial derivatives in equation 4.2, $\partial_t h = dh_{\mathrm{PSP}}(t \mid \hat{t})/dt$ and $\partial_{\hat{t}} h = dh_{\mathrm{PSP}}(t \mid \hat{t})/d\hat{t}$, are to be evaluated at $\hat{t} = t - T_b(t)$. The derivative $\eta' = d\eta/ds$ is to be evaluated at
Figure 2: A change in the input potential $h$ with positive slope $h' > 0$ (dashed line, bottom) shifts neuronal firing times closer together (middle). As a result, the activity $A(t)$ (solid line, top) is higher at $t = \hat{t} + T(\hat{t})$ than it was at time $\hat{t}$ (schematic diagram).
$s = T_b(t)$. Note that the expression in the square brackets can also be written as $u'/(\eta' - \partial_{\hat{t}} h)$ with $u' = du/dt$ evaluated at $t$. Since the threshold must be reached from below, a spike at time $t$ is possible only if $u' > 0$. For $u' < 0$, the backward interval does not exist, and the activity vanishes. We have taken care of this fact by the notation $[x]_+ = x$ for $x \ge 0$ and zero otherwise. Equation 4.2 is the first major result of our analysis. In order to facilitate the discussion, we define a variable,
$$C(t) = \frac{\partial_t h + \partial_{\hat{t}} h}{\eta' - \partial_{\hat{t}} h}. \tag{4.3}$$
Let us evaluate $C$ for SRM0 and IF neurons. For SRM0 neurons, all terms in equation 4.2 can be calculated explicitly. Since $\partial_{\hat{t}} h$ vanishes, equation 4.3 reduces to
$$C_{\mathrm{SRM}}(t) = \frac{h'(t)}{\eta'[T_b(t)]}. \tag{4.4}$$
The backward interval $T_b$ exists if $h(t) > \vartheta$ and is found from the threshold condition, equation 2.11, viz. $T_b(t) = \tau \ln\{\tilde{\eta}_0/[h(t) - \vartheta]\}$. With equation 4.4, the factor in square brackets in equation 4.2 becomes $[1 + h'/\eta']_+$. Its meaning is illustrated in Figure 2. A neuron that has fired at $\hat{t}$ will fire again at $t = \hat{t} + T(\hat{t})$. Another neuron that has fired slightly later at $\hat{t} + \delta t'$ fires its next spike at $t + \delta t$. If the input potential is constant between $t$ and $t + \delta t$, then
$\delta t = \delta t'$. If, however, $h$ increases between $t$ and $t + \delta t$, as is the case in Figure 2, then the firing time difference is reduced. The compression of firing time differences is directly related to an increase in the activity $A$. To see this, we note that all neurons that fire between $\hat{t}$ and $\hat{t} + \delta t'$ must fire again between $t$ and $t + \delta t$. This is due to the fact that the network is homogeneous and the mapping $\hat{t} \to t = \hat{t} + T(\hat{t})$ is monotonic. If firing time differences are compressed, the population activity increases. For a SRM0 neuron with a kernel $\eta$ as in equation 2.7, $\eta'(s) > 0$ holds for all $s > 0$. An input with $h' > 0$ implies, because of equation 4.2, an increase of the activity: $h' > 0 \iff A(t) > A(t - T_b)$. For IF neurons we have a related result. To evaluate $C$, we use equation 2.10 and find
$$C_{\mathrm{IF}}(t) = \tau\, \frac{h'(t)\, e^{T_b/\tau} - h'(t - T_b)}{\tilde{\eta}_0 + h(t - T_b) + \tau\, h'(t - T_b)}. \tag{4.5}$$
The backward interval $T_b(t)$ is given implicitly by the threshold condition $\vartheta = h(t) - [\tilde{\eta}_0 + h(t - T_b)] \exp(-T_b/\tau)$. It is shown in appendix B that the denominator on the right-hand side of equation 4.5 is always positive. Therefore, a compression of firing times, and hence an increase of the activity, occurs whenever the numerator of the equation is positive: $h'(t) - h'(t - T_b) \exp(-T_b/\tau) > 0 \iff A(t) > A(t - T_b)$. This is a rather general property of the noise-free IF dynamics that will be exploited in the following sections. For general spike response neurons defined by equation 2.1, $\partial_t h + \partial_{\hat{t}} h > 0$ is a necessary condition for an increase of the activity, but it is not sufficient: $A(t) > A(t - T_b) \Longrightarrow \partial_t h + \partial_{\hat{t}} h > 0$. Details are discussed in appendix B.

4.2 Application to Locking. We will show in this subsection that the locking theorem developed in Gerstner et al. (1996) follows directly from the noise-free population equation 4.2. We consider a population that is already close to perfect synchrony and fires nearly regularly with period $T$. In order to keep the arguments transparent, let us assume that the population activity for times $t < T/2$ can be approximated by a sequence of square pulses,
$$A(t) = \sum_{n=-m}^{0} \frac{1}{2\delta_n}\, H(t - nT + \delta_n)\, H(nT + \delta_n - t), \tag{4.6}$$
where $m$ is some small integer, say, $m = 3$, and $H(\cdot)$ denotes the Heaviside step function with $H(s) = 1$ for $s > 0$ and $H(s) = 0$ for $s \le 0$. The $\delta_n$ are assumed to be small, $\delta_n \ll T$, where $T$ is the period. We want to check whether the "Ansatz" (see equation 4.6) is consistent with the noise-free population dynamics (see equation 4.2). More precisely, we ask, Is there some period $T$ and some sequence $\delta_n$ for $n = -m, -m+1, \ldots, 0, 1, 2, \ldots$ so that the sequence of square pulses described by equation 4.6 continues for $t > T/2$? We will determine $T$ and the sequence $\delta_n$
self-consistently. If we find $\delta_n \to 0$ for $n \to \infty$, then the square pulses contract to $\delta$-functions, and we say that a synchronous $T$-periodic oscillation is a stable solution of the population dynamics. As a first step, we determine the potential $h_{\mathrm{PSP}}(t \mid \hat{t})$. Given $h_{\mathrm{PSP}}$, we can calculate the period $T$ from the threshold condition and also the derivatives $\partial_t h$ and $\partial_{\hat{t}} h$ needed in equation 4.2. To get $h_{\mathrm{PSP}}$, we put equation 4.6 in 3.3. We assume $\delta_n \ll T$ and integrate. To first order in $\delta_n$, we find
$$h_{\mathrm{PSP}}(t \mid \hat{t}) = \sum_{n=0}^{n_{\max}} J_0\, \epsilon(t - \hat{t},\, t + nT) + O\bigl(\delta_n^2\bigr), \tag{4.7}$$
where $-\delta_0 \le \hat{t} \le \delta_0$ is the last firing time of the neuron under consideration. The sum runs over all pulses back in the past. Since $\epsilon(t - \hat{t}, s)$ as a function of $s$ decays quickly for $s \gg T$, it is usually sufficient to keep only a finite number of terms (e.g., $n_{\max} = 1$ or 2). As the next step we determine the period $T$. To do so, we consider a neuron in the center of the square pulse that has fired its last spike at $\hat{t} = 0$. Since we consider noiseless neurons, the relative order of firing of the neurons cannot change. To make the Ansatz, equation 4.6, consistent, the next spike of this neuron must therefore occur at $t = T$, in the center of the next square pulse. We use $\hat{t} = 0$ in the threshold condition, equation 2.11, which yields
$$T = \min\left\{ t \,\Big|\, \eta(t) + J_0 \sum_{n=0}^{n_{\max}} \epsilon(t,\, t + nT) = \vartheta \right\}. \tag{4.8}$$
If a synchronized solution exists, equation 4.8 defines the period. In the population equation, 4.2, we need the derivative of $h_{\mathrm{PSP}}$,
$$\partial_t h + \partial_{\hat{t}} h = J_0 \sum_{n=0}^{n_{\max}} \frac{d}{ds}\, \epsilon(x, s)\Big|_{x=T,\, s=nT}. \tag{4.9}$$
According to equation 4.2, the new value of the activity at time $t = T$ is the old value multiplied by the factor in the square brackets. A necessary condition for an increase of the activity from one cycle to the next is that the derivative defined by the right-hand side of equation 4.9 is positive, which is the essence of the locking theorem (Gerstner et al., 1996). To illustrate the idea, let us study two specific cases: SRM0 neurons and IF neurons. For SRM0 neurons we have $\epsilon(x, s) = \epsilon_0(s)$, hence $\partial_{\hat{t}} h = 0$ and $h_{\mathrm{PSP}}(t \mid \hat{t}) = h(t) = J_0 \sum_n \epsilon_0(t + nT)$. With the $\eta$ kernel defined by equation 2.7, we have $\eta'(T) > 0$ whatever $T$. Thus
$$h'(T) = J_0 \sum_{n=1}^{n_{\max}+1} \epsilon_0'(nT) > 0 \quad\Longleftrightarrow\quad A(T) > A(0). \tag{4.10}$$
Figure 3: A sequence of activity pulses (top) contracts to $\delta$-pulses if firing always occurs when the input potential $h$ (dashed line, bottom) is rising. Numerical integration of the population equation, 3.7, for SRM0 neurons with inhibitory interaction $J_0 = -0.1$ and kernel $\epsilon_0$ (see equation 2.5) with delay $\Delta^{\mathrm{ax}} = 2$ ms. There is no noise ($\sigma = 0$). The activity was initialized with a square pulse $A(t) = 1$ kHz for $-1$ ms $< t < 0$ and integrated with a step size of 0.05 ms.
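The self-consistency argument behind equations 4.8 and 4.10 can be checked numerically. The sketch below (ours, with assumed parameter values in the spirit of Figure 3, not the article's code) iterates the period $T$ of equation 4.8 to a fixed point for SRM0 neurons with the kernels of equations 2.5 and 2.7 and then tests the locking condition $h'(T) > 0$:

```python
import numpy as np

# Illustrative SRM0 locking check (parameters are assumptions).
tau, eta0, delta_ax = 4.0, 1.0, 2.0     # kernel constants, equations 2.5/2.7
J0, theta, n_max = -0.1, -0.15, 5       # inhibitory coupling, threshold

def eps0(s):
    """Response kernel, equation 2.5."""
    x = np.asarray(s) - delta_ax
    return np.where(x > 0, x / tau**2 * np.exp(-np.clip(x, 0, None) / tau), 0.0)

def eps0_prime(s):
    """Derivative of eps0, used in the locking condition, equation 4.10."""
    x = np.asarray(s) - delta_ax
    return np.where(x > 0, (1 - x / tau) / tau**2
                    * np.exp(-np.clip(x, 0, None) / tau), 0.0)

# Equation 4.8: for a neuron firing at t_hat = 0,
# u(t) = eta(t) + J0 * sum_n eps0(t + n*T); T is the first threshold
# crossing, and T re-enters the sum, so we iterate to a fixed point.
ts = np.arange(0.05, 40.0, 0.001)
T = 2.0 * tau                            # initial guess
for _ in range(20):
    u = -eta0 * np.exp(-ts / tau) \
        + J0 * eps0(ts[:, None] + np.arange(n_max + 1)[None, :] * T).sum(axis=1)
    T = ts[np.argmax(u >= theta)]        # first crossing of theta

# Equation 4.10: the synchronous solution is stable if h'(T) > 0.
h_prime = J0 * eps0_prime(np.arange(1, n_max + 2) * T).sum()
print(T, h_prime > 0)
```

With these inhibitory, delayed couplings the derivatives $\epsilon_0'(nT)$ are negative at the self-consistent period, so $J_0 < 0$ makes $h'(T)$ positive: firing falls on the rising flank of $h$, consistent with the locking behavior seen in Figure 3.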
For IF neurons we could go through an analogous argument to show that equation 4.10 holds. Instead we may verify its validity directly from equation 4.5 with $h'(t) = h'(t - T_b)$ due to periodicity. Therefore the amplitude of the synchronous pulse grows only if $h'(T) > 0$. The growth of amplitude corresponds to a compression of the width of the pulse. It can be shown that the corner neurons, which have fired at time $\pm\delta_0$, fire their next spike at $\delta_1 = \delta_0\, [1 + C]^{-1}$ with $C$ defined in equation 4.3. Thus the square pulse remains normalized, as it should be. By iteration of the argument for $t = nT$ with $n = 2, 3, 4, \ldots$, we see that (for IF or SRM0 neurons) the sequence $\delta_n$ converges to zero, and the square pulses approach a Dirac $\delta$-pulse under the condition that $h'(T) = J_0 \sum_n \epsilon_0'(nT) > 0$. In words, the $T$-periodic synchronized solution with $T$ given by equation 4.8 is stable if the input potential $h$ at the moment of firing is rising (Gerstner et al., 1996).

In order for the sequence of square pulses to be an exact solution of the population equation, we must require that the factor $C$ defined in equation 4.3 remain constant over the width of a pulse. The derivatives of equation 4.7, however, do depend on $t$. As a consequence, the form of the pulse changes over time, as is visible in Figure 3. The activity as a function of time was obtained by a numerical integration of the population equation with a square pulse as the initial condition for a network of SRM0 neurons coupled via equation 2.5 with weak inhibitory coupling $J_0 = -0.1$ and delay $\Delta^{\mathrm{ax}} = 2$ ms. For this set of parameters, $h'(T) > 0$ and locking is possible.

The framework of the population equation also allows us to extend the locking argument to noisy SRM0 neurons. At each cycle, the pulse of synchronous activity is compressed due to locking if $h'(T) > 0$. At the same time it is smeared out because of noise. For SRM0 neurons with gaussian noise
Figure 4: Synchronous activity in the presence of noise. Simulation of a population of 1000 neurons with inhibitory coupling ($J_0 = -1$, $\Delta^{\mathrm{ax}} = 2$ ms) and noise model B. (a) Low noise level ($\sigma = 0.25$). (b) For larger noise ($\sigma = 0.5$), the periodic pulses become broader.
in the reset (noise model B), we may use equation 2.17 for the probability density in the population equation, 3.7,
$$A(t) = \int_{-\infty}^{t} d\hat{t} \int_{-\infty}^{\infty} dr\, \delta[t - \hat{t} - T(\hat{t}, r)]\, G_\sigma(r)\, A(\hat{t}), \tag{4.11}$$
and search for periodic solutions. We insert $T(\hat{t}, r) = r + T_0(\hat{t} + r)$, where $T_0(t')$ is the forward interval of a noiseless neuron that has fired its last spike at $t'$. The integration over $\hat{t}$ can be done and yields
$$A(t) = \left[1 + \frac{h'}{\eta'}\right] \int_{-\infty}^{\infty} dr\, G_\sigma(r)\, A[t - T_b(t) - r], \tag{4.12}$$
where $T_b$ is the backward interval. The factor $[1 + (h'/\eta')]$ arises from the integration over the $\delta$-function just as in the noiseless case; see equations 4.2 and 4.4. The integral over $r$ leads to a broadening of the pulse, the factor $[1 + (h'/\eta')]$ to a compression. As shown in appendix C, a limit cycle of equation 4.11 consisting of a periodic sequence of gaussian pulses exists if the noise amplitude $\sigma$ is small and $(h'/\eta') > 0$. The width of the activity pulses in the limit cycle is $d$, where $d = \sigma\, \bigl[2\, h'/\eta' + \bigl(h'/\eta'\bigr)^2\bigr]^{-1/2}$. As in the noise-free situation in equation 4.10, $h'$ and $\eta'$ have to be evaluated at $T$. A simulation of locking with noise is shown in Figure 4. The network of SRM0 neurons has inhibitory connections ($J_0 = -1$) and is coupled via the response kernel $\epsilon_0$, equation 2.5, with a transmission delay of $\Delta^{\mathrm{ax}} = 2$ ms. Doubling the noise level $\sigma$ leads to activity pulses with twice the width.

5 Transients

In this section we study the response of the population activity to a rapid change in the input. To keep the arguments as simple as possible, we consider an input that has a constant value $I_0$ for $t < t_0$ and then changes abruptly to a new value $I_0 + \Delta I$. Thus,
$$I^{\mathrm{ext}}(t) = \begin{cases} I_0 & \text{for } t \le t_0 \\ I_0 + \Delta I & \text{for } t > t_0. \end{cases} \tag{5.1}$$
For the sake of simplicity, we assume a population of independent IF or SRM0 neurons ($J_0 = 0$). For $t \le t_0$, all neurons receive a constant input potential $h_0 = J^{\mathrm{ext}} I_0$, since $\int \tilde{\epsilon}_0(s)\, ds = 1$. For $t > t_0$, the input potential, equation 2.4, changes due to the additional current $\Delta I$. Thus
$$h(t) = \begin{cases} h_0 & \text{for } t \le t_0 \\ h_0 + J^{\mathrm{ext}}\, \Delta I \displaystyle\int_0^{t - t_0} \tilde{\epsilon}_0(s)\, ds & \text{for } t > t_0. \end{cases} \tag{5.2}$$
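For the exponential kernel $\tilde{\epsilon}_0$ of equation 2.6, the integral in equation 5.2 evaluates to $h(t) = h_0 + J^{\mathrm{ext}} \Delta I\, [1 - e^{-(t - t_0)/\tau}]$ for $t > t_0$, that is, the input potential responds like a low-pass filter. A quick numerical check (a sketch with illustrative parameter values, not the article's code):

```python
import numpy as np

# Illustrative parameters: step current dI switched on at t0.
tau, J_ext, h0, dI, t0 = 4.0, 1.0, 0.5, 0.3, 10.0

def h_step(t):
    """Input potential after a current step: equation 5.2 with kernel 2.6,
    evaluated by numerical integration of eps0_tilde."""
    s = np.linspace(0.0, max(t - t0, 0.0), 20001)
    integral = np.trapz((1.0 / tau) * np.exp(-s / tau), s)
    return h0 + J_ext * dI * integral

# Compare with the closed form h0 + J_ext*dI*(1 - exp(-(t-t0)/tau)).
for t in [10.0, 12.0, 18.0, 30.0]:
    closed = h0 + J_ext * dI * (1.0 - np.exp(-(t - t0) / tau)) if t > t0 else h0
    print(t, h_step(t), closed)
```

Although $h(t)$ itself rises only gradually with time constant $\tau$, the population activity will turn out to respond to the step instantaneously (equation 5.5).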
Given the input potential $h(t)$ and the last firing time $\hat{t}$, we can calculate for any given neuron its momentary membrane potential $u(t)$. But what is the time course of the population activity? In order to analyze the situation, let us suppose that for $t < t_0$, the network is in a state of incoherent firing. In other words, all neurons fire at the same mean firing rate, but their spikes are not synchronized. Rather, the firing times are maximally spread out over time. Such a state has been called the splay phase (Chow, 1998). In the limit of $N \to \infty$, the population activity is then a constant, $A(t) = A_0$. In section 7 we will study the conditions under which a state of incoherent firing can be a stable state of the system. Here we just assume that we can set the parameters so that the network fires asynchronously and with constant activity.

5.1 Transients in a Noise-Free Network. In the noiseless case, neurons that receive a constant input $I_0$ fire regularly with some period $T_0$. For $t < t_0$, the mean activity is simply $A_0 = 1/T_0$. The reason is that, for a constant activity, averaging over time and averaging over the population must be the same. We now apply the equation of the noise-free population dynamics, equation 4.2. To do so, we have to calculate the factor $C(t)$ defined in equation 4.3. For IF neurons, $C$ is given by equation 4.5 and for SRM0 neurons by equation 4.4. In both cases we need the derivative of equation 5.2:
$$h'(t) = \begin{cases} 0 & \text{for } t \le t_0 \\ J^{\mathrm{ext}}\, \Delta I\, \tilde{\epsilon}_0(t - t_0) & \text{for } t > t_0. \end{cases} \tag{5.3}$$
For t < t0, we have h′ = 0 and thus, from equation 4.2, A(t) = A(t − T0), as it should be for a constant activity A0. Let us now consider a neuron that has fired exactly at t0. Its next spike occurs at t0 + T, where T is given by
the threshold condition ui(t0 + T) = ϑ. If the change ΔI at t0 is small, we expect that the interspike interval T changes only slightly. In order to keep the argument transparent, we make a zero-order approximation and set exp(−T/τ) ≈ exp(−T0/τ), hence η′(T) ≈ η′(T0). The population equation, 4.2, then yields

    A(t) = [1 + a ε̃0(t − t0)] A0        for t0 < t < t0 + T,        (5.4)
with a constant a = J^ext ΔI/η′(T0) for SRM0 neurons and a = J^ext ΔI/[η′(T0) + h0 exp(−T0/τ)] for IF neurons. Thus, the time course of the initial transient reflects the time course ε̃0 of the postsynaptic potential caused by the external input. So far, the results are general. We now specify the response kernel ε̃0. With equation 2.6, the response of the input potential to the step current is h(t) = h0 + J^ext ΔI [1 − exp(−(t − t0)/τ)] for t > t0. Thus, the potential has the characteristics of a low-pass filter. The population activity, however, reacts instantaneously to the step current. We insert equation 2.6 in 5.4 and find

    A(t) = A0 + ΔA (1/τ) exp(−(t − t0)/τ) H(t − t0)        for t0 < t < t0 + T,        (5.5)
where ΔA = A0 a and H(·) is the Heaviside step function. Thus there is an immediate response at t = t0. The simulation in Figure 5 clearly exhibits the rapid initial response of the population. It is also confirmed by a numerical integration of the noise-free population dynamics, equation 4.2. As an aside we note that a dynamic rate model of the form of equation 1.2, defined by a differential equation with time constant τ, would not be able to capture the initial phase of the transient. The discussion up to now has focused on the initial phase of the transient. For t > t0 + T every neuron has fired once, and the activity on the right-hand side of equation 4.2 can no longer be considered constant. What happens? Let us treat the IF model first. With the response kernel ε̃0 defined in equation 2.6, we find for t > t0 + T the relation h′(t − T) = h′(t) exp(T/τ). Hence C^IF defined in equation 4.5 vanishes for t > t0 + T. Therefore A(t) = A(t − T), where T is the new backward interval consistent with the constant stimulation current I0 + ΔI. Thus the population activity is in a new periodic state as soon as every neuron has fired once. This is what we should expect since the neurons are independent, and as soon as each has been reset once after t0, they all feel the same constant input current. The picture is different for SRM0 neurons. Here the new periodic state is reached only asymptotically (see Figure 5). The sequence of peaks in the activity that occur at times t_k ≈ t0 + k T0 is given by the iteration A(t_{k+1}) = A(t_k)[1 + h′(t_{k+1})/η′(T0)]. Since h′(t) ∝ exp(−(t − t0)/τ), the new limit cycle is approached asymptotically with time constant τ. For SRM0 neurons, the
Figure 5: (Top) Response of the population activity to a step current for low noise (σ = 0.01 ms). Solid line: Simulation of a population of 1000 neurons. Dashed line: Numerical integration of the population equation, 3.7. (a) IF neurons. (b) SRM0 neurons. (Bottom) Step current input I^ext (solid line) and input potential h(t) (dashed line). Parameters: Time constant τ = 4 ms with response kernel ε̃0 defined in equation 2.6; input step of 0.05 units at t = 100 ms. The threshold ϑ was chosen so that the mean interval before the step is 8 ms, corresponding to a mean activity of 0.125 kHz. There are no interactions, J = 0; integration time step, 0.05 ms. The result of the simulation has been smoothed with a running average over 0.2 ms.
sequence of activity peaks therefore follows roughly the same exponential time course as the input potential h.

5.2 Transients with Noise. So far we have considered noiseless neurons. In the presence of noise, subsequent pulses get smaller and broader, and the network approaches a new incoherent state (cf. Figure 6a). If, in
Figure 6: Transients for SRM0 neurons with noise model B. (a) Same as in Figure 5b, but at a noise level of σ = 2 and J = 0. (b) Noisy network (σ = 2) with inhibitory interactions defined by equation 2.5 with Δ^ax = 0.5 ms and J = −2. In both cases the results of a simulation of 1000 SRM0 neurons (solid line) are compared with a numerical integration (dashed line) of equation 3.7. All other parameters are as in Figure 5.
addition, there is also some interaction J0 ≠ 0 between the neurons, then the value of the new stationary state is shifted. A particularly interesting case is inhibitory interaction, J0 < 0. In this case the new value of the stationary activity for t > t0 is only slightly higher than that for t < t0. For noise model B, the fast transient with a sharp first peak is clearly visible and marks the moment of input switching (see Figure 6b). Fast switching has previously been seen in heterogeneous networks with balanced excitation and inhibition (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996) and has also been discussed in the context of associative memory (Treves, 1992; Treves, Rolls, & Simmen, 1997). Here it is demonstrated and analyzed for homogeneous networks. For a preliminary analysis of the noisy case, we take SRM0 neurons with noise model B and start from equation 4.12. To simplify the expression, we write A(t) = A0 + ΔA(t) and expand equation 4.12 to first order in ΔA. The result is

    ΔA(t) = ∫_{−∞}^{∞} G_σ(r) ΔA(t − T0 − r) dr + h′(t) A0/η′(T0).        (5.6)
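Equation 5.6 can be iterated forward in time to make the two contributions visible. The sketch below is an illustrative discretization, not the integration scheme used in the paper; it assumes η′(T0) = 1 (arbitrary units), the exponential kernel of equation 2.6 for h′(t), and illustrative values for T0, σ, and ΔI. The initial transient equals the drive h′(t) A0/η′(T0), while the echo roughly one period later has been broadened and reduced by the gaussian convolution.

```python
import numpy as np

# Forward iteration of equation 5.6 (illustrative parameters, arbitrary units).
# Assumptions: eta'(T0) = 1, h'(t) = J_ext*dI*exp(-(t-t0)/tau)/tau (equation 2.6).
dt, tau, T0, sigma = 0.05, 4.0, 8.0, 1.0        # ms
t0, J_ext, dI, A0, eta_p = 20.0, 1.0, 0.5, 0.125, 1.0
t = np.arange(0.0, 60.0, dt)

# second term of equation 5.6: h'(t) * A0 / eta'(T0)
drive = np.where(t > t0, J_ext * dI * np.exp(-(t - t0) / tau) / tau, 0.0) * A0 / eta_p

# gaussian kernel G_sigma(r), truncated at +-4 sigma (tails are negligible)
r = np.arange(-4.0 * sigma, 4.0 * sigma + dt, dt)
G = np.exp(-r**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

dA = np.zeros_like(t)
for i in range(len(t)):
    # indices of dA(t - T0 - r); for |r| < T0 these lie strictly in the past
    j = i - np.round((T0 + r) / dt).astype(int)
    ok = j >= 0
    dA[i] = np.sum(G[ok] * dA[j[ok]]) * dt + drive[i]

i0 = int(round(t0 / dt))
first_peak = dA[i0:i0 + int(4.0 / dt)].max()                    # sharp onset at t0
echo_peak = dA[i0 + int(5.0 / dt):i0 + int(12.0 / dt)].max()    # smoothed echo near t0 + T0
```

The first peak is set entirely by the second term (the convolution term cannot contribute before one period has elapsed), which is the statement made in the next paragraph.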
The result for IF neurons with noise model B is analogous to equation 5.6 and is derived in appendix D. The first term in equation 5.6 accounts for changes that are due to previous variations of the activity. During the initial phase of the transient (for t0 < t < t0 + T0 − σ), it plays no role and can be neglected. Thus the initial phase is determined by the second term. As in the noiseless case, the initial transient is proportional to the derivative of h. After this initial phase, the convolution with the gaussian (the first term on the right-hand side of equation 5.6) comes into play and leads to a rapid smoothing (see Figure 6). The initial transient, however, is sharp. We may wonder whether we can understand fast switching intuitively. Before the abrupt change, the input was stationary and the population in a state of incoherent firing, defined as a state where neuronal firing times are spread out maximally. Thus some of the neurons fire, others are in the refractory period, and again others approach the threshold. There is always a group of neurons whose potential is just below threshold. An increase in the input causes those neurons to fire immediately, and this accounts for the strong population response during the initial phase of the transient. The above consideration also holds for noise model C (noisy integration) in the limit of low noise and suprathreshold inputs. In order to understand why the derivative of h comes into play, consider a finite step in the input potential, h1(t) = Δh H(t − t0). All neurons i that are hovering below threshold, so that their potential ui(t0) is between ϑ − Δh and ϑ, will be put above threshold and fire synchronously at t0. Thus, a step in the potential causes a δ-pulse in the activity, ΔA(t) ∝ δ(t − t0) ∝ h′(t). In Figure 7a we have used a current step (see equation 5.1), the same step current as in Figure 5. The
Figure 7: Response of a network of 1000 neurons without interaction (J = 0) to step current input as in Figure 5, but for (a) noise model C (noisy integration) and (b) noise model A (noisy threshold). For low noise, the transition is sharp (top row), whereas for high noise it is rather smooth (bottom row). In all cases, threshold and noise level were adjusted so that the mean interval before the current step is 8 ms so as to allow a comparison with Figure 5. (Top) Noise level chosen so that the variance of the interval distribution is 1 ms. (Bottom) Variance about 4 ms.
response at low noise (top) has roughly the form ΔA(t) ∝ h′(t) ∝ ε̃0(t − t0), as expected. The rapid transient is slightly less pronounced than for noise model B, but nevertheless clearly visible; compare Figures 6 and 7. As the amplitude of the noise grows, the transient becomes less sharp. Thus there is a transition from a regime where the transient is proportional to h′ (see Figure 7a, top) to another regime where the transient is proportional to h (see Figure 7a, bottom). What are the reasons for the change of behavior? The above simple argument based on a potential step Δh > 0 holds only for a finite step size, which is at least of the order of the noise amplitude √D. In noise model C, the threshold acts as an absorbing boundary. Therefore the density of neurons with potential ui vanishes for ui → ϑ. Hence, for Δh → 0, the proportion of neurons that are instantaneously put across threshold is 0. In a stationary state, the boundary layer with low density is of the order √D (Brunel & Hakim, 1999). A potential step Δh > √D puts a significant proportion of neurons above threshold and leads to a δ-pulse in the activity. Thus the result that the response is proportional to the derivative of the potential is essentially valid in the low-noise regime. On the other hand, noise model C can also be used in the regime of large noise and subthreshold input. In the high-noise regime, a step in the potential raises the instantaneous rate of the neurons but does not force them to fire immediately. The response to a current step is therefore smooth and
follows the potential h(t) (see Figure 7a, bottom). A comparison of Figures 7a and 7b shows that noise model A exhibits a similar transition from sharp to smooth responses with increasing noise level. It can be shown that, at least in the subthreshold regime, noise model C can be well approximated by noise model A (Plesser & Gerstner, 1999; see also Collins, Carson, Capela, & Imhoff, 1996). Analytical arguments for transients with noise model A are presented below.

5.3 Theory of Transients with Noise. To formalize the above ideas on transient responses in the presence of noise, we start from the population equation, 3.7, and write it in the form

    0 = (d/dt) ∫_{−∞}^{t} S_h(t | t̂) A(t̂) dt̂.        (5.7)
Equation 3.7 is the derivative of the conservation law, equation 3.6, and this is made explicit by our notation in equation 5.7. We consider a small perturbation of the stationary state, A(t) = A0 + ΔA(t) and h_PSP(t | t̂) = h0(t | t̂) + h1(t | t̂), and expand equation 5.7 to linear order in ΔA or h1:

    0 = (d/dt) ∫_{−∞}^{t} S0(t − t̂) ΔA(t̂) dt̂
        + A0 (d/dt) { ∫_{−∞}^{t} dt̂ ∫_{−∞}^{t} dt1 [∂S_h(t | t̂)/∂h1(t1 | t̂)]|_{h1=0} h1(t1 | t̂) }.        (5.8)

We have used the notation S0(t − t̂) = S_{h0}(t | t̂) for the survivor function of the asynchronous firing state. For the derivative in the first term in equation 5.8, we use (d/dt) S0(t − t̂) = −P_{h0}(t | t̂) and S0(0) = 1. To make some progress in the treatment of the second term in equation 5.8, it is helpful to take either the SRM0 or the IF model. For SRM0 neurons, we may drop the t̂ dependence of the potential and set h1(t1 | t̂) = h1(t1), which allows us to pull the h1(t1) in front of the integral over t̂ and write equation 5.8 in the form

    ΔA(t) = ∫_{−∞}^{t} P_{h0}(t | t̂) ΔA(t̂) dt̂ + A0 (d/dt) { ∫_{0}^{∞} L(x) h1(t − x) dx },        (5.9)
with a filter L = L^SRM defined in the top row of Table 1. For IF neurons we set h1(t | t̂) = h1(t) − h1(t̂) exp[−(t − t̂)/τ]. After some rearrangements of the terms, equation 5.8 becomes identical to 5.9 with a filter L = L^IF (see Table 1). The first term on the right-hand side of equation 5.9 is of the same form as the population equation 3.7 and describes how perturbations ΔA(t̂) in
Table 1: Filter L(x) for Integrate-and-Fire or SRM0 Neurons (Upper Index IF or SRM, Respectively).

    Def  L^SRM(x)   = −∫_x^∞ dξ ∂S(ξ | 0)/∂h1(ξ − x)
         L^IF(x)    = L^SRM(x) + ∫_0^x dξ e^{−ξ/τ} ∂S(x | 0)/∂h1(ξ)
    A    L^SRM_A(x) = ∫_x^∞ dξ f′[u(ξ − x)] S0(ξ)
         L^IF_A(x)  = L^SRM_A(x) − S0(x) ∫_0^x dξ e^{−ξ/τ} f′[u(ξ)]
    B    L^SRM_B(x) = δ(x)/η′
         L^IF_B(x)  = [δ(x) − G_σ̂(x − T0) e^{−T0/τ}]/u′

Notes: f′ = df/du is the derivative of the escape function f(u); S0(s) = S_{h0}(s | 0) is the survivor function in the incoherent state. For noise model B, the width of the gaussian is σ̂ = σ_{T1} = σ η̃′/(η̃′ + h′), and the derivatives are η′ = dη/ds|_{s=T0} and u′ = η′ + τ^{−1} h0 exp(−T0/τ). In B, the approximation exp(σ̂/τ) ≈ 1 has been used.
the past influence the current activity ΔA(t). The second term gives an additional contribution, which is proportional to the derivative of a filtered version of the potential. As we have seen, for SRM0 neurons with noise model B, we find L(x) = L^SRM_B(x) = (1/η′) δ(x). Thus for noise model B the filter is a δ-function, and the second term in equation 5.9 is proportional to h′. On the other hand, it is shown in appendix E that for noise model A, L can often be approximated by a low-pass filter,

    L^SRM_A(x) = a r e^{−r x} H(x),        (5.10)
where a is a constant and r is a measure of the noise. (The result for IF neurons is given in appendix E.) The noise-free threshold process can be retrieved from equation 5.10 for r → ∞. In this limit L^SRM_A(x) = a δ(x), and the initial transient is proportional to h′, as discussed above. For small r, however, the behavior is different. We use equation 5.10 and rewrite the last term in equation 5.9 in the form

    (d/dt) ∫_0^∞ L^SRM_A(x) h1(t − x) dx = a r [h1(t) − h̄1(t)],        (5.11)
where h̄1(t) = ∫_0^∞ r exp(−r x) h1(t − x) dx is a running average. Thus the activity responds to the temporal contrast h1(t) − h̄1(t). At high noise levels, r is small. During the initial phase of the transient (t − t0 < r^{−1}), we may set h̄1(t) ≈ 0. Thus, we find for noise model A in the large-noise limit ΔA(t) ∝ h(t). This is
exactly the result that would be expected for a simple rate model. For a simulation of noise model A, see Figure 7b. The generalization of equation 5.11 to IF neurons is straightforward (cf. appendix E). In all cases, the initial phase of the transient is proportional to a linearly filtered version of the potential, ΔA ∝ (d/dt){L ∗ h}. In the limit of low noise, the choice of noise model is irrelevant; the transient response is proportional to the derivative of the potential, ΔA ∝ h′. If the level of noise is increased, noise model B retains its sharp transients, whereas A and C turn to a different regime where the transients follow h rather than h′. The transition from one regime to the other is most easily studied for noise model A. The filter L is essentially a low-pass filter whose time constant increases with the noise level. The results have an interesting relation to experimental input-output measurements on motoneurons (Fetz & Gustafsson, 1983; Poliakov, Powers, & Binder, 1997). In the low-noise regime (where the type of noise model is irrelevant), the response to a synaptic input current pulse is proportional to the derivative of the postsynaptic potential (Fetz & Gustafsson, 1983), as predicted by earlier theories (Knox, 1974). On the other hand, we have seen that for noise models A and C, the behavior changes at higher noise levels and the response follows h rather than h′. At intermediate noise levels, the response to a synaptic input is therefore proportional to some mixture of the postsynaptic potential and its derivative (Fetz & Gustafsson, 1983; Poliakov et al., 1997). For noise model B, the initial response to a current step is ΔA ∝ ε̃0(t − t0), independent of the noise level. Note that if the rise time of the postsynaptic potential ε̃0 is slow (as is the case for slow synaptic channels), the transient will be slow as well.
In other words, the response time is limited by the time constant of the synaptic current rather than by the membrane time constant. In the following section, this statement will be made more precise.

6 Signal Transmission

Our considerations regarding step current input can be generalized to an arbitrary input current I^ext(t). We study a population of independent IF or SRM0 neurons. For each neuron, the membrane potential can be calculated from the input potential h(t) = J^ext ∫_0^∞ ε̃0(s) I^ext(t − s) ds and the last firing time t̂ (cf. equations 2.1 and 2.10). We assume that the population is close to a state of asynchronous firing: A(t) = A0 + ΔA(t). The linear response of the population to the change in the potential is given by equation 5.9. The Fourier transform is

    Â(ω) = iω [J^ext A0 L̂(ω) ε̂̃(ω)/(1 − P̂(ω))] Î(ω).        (6.1)
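Equation 6.1 uses the Fourier transform of the response kernel. For the exponential kernel ε̃0(s) = (1/τ) exp(−s/τ) of equation 2.6, this transform is [1 + iωτ]^{−1}, as quoted below Figure 8. A minimal numerical check, with an illustrative τ:

```python
import numpy as np

# Check: the Fourier transform of eps0_tilde(s) = exp(-s/tau)/tau
# equals 1/(1 + i*omega*tau).  tau is an illustrative value.
tau = 4.0                        # ms
ds = 1e-3
s = np.arange(0.0, 80.0, ds)     # 20 time constants; truncation error is negligible
kernel = np.exp(-s / tau) / tau

max_err = 0.0
for f_khz in (0.009, 0.111, 1.0):            # test frequencies in kHz
    w = 2.0 * np.pi * f_khz                  # rad/ms
    vals = kernel * np.exp(-1j * w * s)
    ft_num = (np.sum(vals) - 0.5 * (vals[0] + vals[-1])) * ds   # trapezoid rule
    ft_ana = 1.0 / (1.0 + 1j * w * tau)
    max_err = max(max_err, abs(ft_num - ft_ana))
```

Note that |ε̂̃(ω)| falls off as 1/(ωτ): the input potential h itself is strongly low-pass filtered, which makes the flat high-frequency gain derived below all the more striking.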
Figure 8: (a) Signal gain S(f) = |Â(2πf)/Î(2πf)| as a function of the frequency f for IF neurons with time constant τ = 4 ms and noise model B and a mean interspike interval of 8 ms. Resonances are at multiples of the single-neuron frequency of 125 Hz. Noise level σ = 4 ms (long-dashed line); σ = 2 ms (solid line); σ = 0.75 ms (short-dashed line); J^ext = 1. (b) Response of the population activity (top) of SRM0 neurons with noise model B to a time-dependent current (bottom). The current is a superposition of four sine waves at 9, 47, 111, and 1000 Hz. The simulation of a population of 4000 neurons (solid line, top) is compared with the numerical integration (dashed line) of the population equation, 3.7. Note that even the 1 kHz component of the signal is well transmitted. Parameters: Response kernel ε̃0 (see equation 2.6) with time constant τ = 4 ms. The threshold is ϑ = −0.135 so that the mean activity is A = 125 Hz; noise σ = 2 ms; J0 = 0.
Hats denote transformed quantities; ε̂̃(ω) = [1 + iωτ]^{−1} is the Fourier transform of the kernel ε̃0 defined in equation 2.6, P̂(ω) is the Fourier transform of the interval distribution, Î(ω) is the Fourier transform of the current I^ext, and L̂(ω) is the transform of the linear filter L. To be specific, we consider IF neurons with noise model B. For gaussian reset we find P̂(ω) = exp{−½ σ̂² ω² − iωT0}. The filter L may be read off Table 1 and gives L̂(ω) = {1 − exp[−½ σ̂² ω² − iωT0 − T0/τ]}/u′, where u′ is to be evaluated at T0. In Figure 8a we have plotted the gain S = |Â(ω)/Î(ω)| as a function of the frequency f = ω/(2π). For a medium noise level of σ = 2 ms, the signal gain has a single resonance at f = 1/T0 = 125 Hz. For lower noise, further resonances at multiples of 125 Hz appear. For a variant of noise model B, a result closely related to equation 6.1 has been derived by Knight (1972a). The result for SRM0 neurons with noise model B is obtained by neglecting a term ∝ exp(−T0/τ). Independent of the noise level, we obtain for IF neurons for ω → 0 the result S(0) = J^ext A0 [1 − exp(−T0/τ)]/(u′ T0). Most interesting is the behavior in the high-frequency limit. For ω → ∞ we find S → J^ext A0/(u′ τ), hence

    S(∞)/S(0) = (T0/τ) [1 − e^{−T0/τ}]^{−1}.        (6.2)
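A short numerical sketch of the gain curve, using the explicit forms of P̂, L̂, and ε̂̃ just given. The choice u′ = 1 fixes arbitrary units (it is not a value from the paper); the other values follow Figure 8a. The ratio S(∞)/S(0) reproduces equation 6.2, and the resonance at f = 1/T0 = 125 Hz appears as a gain maximum.

```python
import numpy as np

# Gain S(f) = |A_hat/I_hat| from equation 6.1 for IF neurons with noise model B.
# u' = 1 is an arbitrary choice of units; other parameters as in Figure 8a.
tau, T0, sig = 4.0, 8.0, 2.0          # ms
J_ext, A0, u_p = 1.0, 1.0 / T0, 1.0

def S(f_khz):
    w = 2.0 * np.pi * f_khz           # rad/ms
    P_hat = np.exp(-0.5 * sig**2 * w**2 - 1j * w * T0)
    L_hat = (1.0 - np.exp(-0.5 * sig**2 * w**2 - 1j * w * T0 - T0 / tau)) / u_p
    eps_hat = 1.0 / (1.0 + 1j * w * tau)
    return np.abs(1j * w * J_ext * A0 * L_hat * eps_hat / (1.0 - P_hat))

S_low = S(1e-4)      # ~ S(0)   = J_ext*A0*(1 - exp(-T0/tau))/(u_p*T0)
S_high = S(1e3)      # ~ S(inf) = J_ext*A0/(u_p*tau)
ratio = S_high / S_low
```

The 1/(ωτ) decay of ε̂̃ is exactly cancelled by the factor iω in front, which is the "threshold as differentiator" argument made in the next paragraph.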
We emphasize that the high-frequency components of the current are not attenuated by the population activity despite the integration at the level of the individual neurons. The reason is that the threshold process acts as a differentiator and reverses the low-pass filtering of the integration. In fact, equation 6.2 shows that high frequencies can be transmitted more effectively than low frequencies. The good transmission characteristics of spiking neurons at high frequencies have been studied by Knight (1972a) and confirmed experimentally (Knight, 1972b). In biology there is, of course, also a time constant of synaptic channels that leads to a frequency cutoff for the input current that may enter the cell. In this sense, it is the time constant of the synaptic current that determines the cutoff frequency of the population; the membrane time constant is of minor influence. Treves (1993) has reached a similar conclusion. So far we have discussed results of the linearized theory (see equations 5.6 and 6.1). The behavior of the full nonlinear system is shown in Figure 8b. A population of unconnected SRM0 neurons is stimulated by a time-dependent input current that was generated as a superposition of four sinusoidal components with frequencies of 9, 47, 111, and 1000 Hz, chosen arbitrarily. The activity equation, 3.7, has been integrated with time steps of 0.05 ms, and the results are compared with those of a simulation of a population of 4000 neurons. The 1 kHz component of the signal I(t) is clearly reflected in the population activity A(t). Theory and simulation are in excellent agreement. We have seen that noise model B was rather exceptional in that the transient remained sharp even in the limit of high noise. To study the relevance of the noise model, we return to equation 6.1. The signal gain S(ω) = |Â(ω)/Î(ω)| is proportional to L̂(ω).
If the filter L(x) is broad, its Fourier transform falls off to zero at high frequencies, and so does the signal gain S(ω). In Figure 9 we have plotted the filter L^IF_A and the signal gain S for IF neurons with noise model A at different noise levels. At low noise, the result for noise model A is similar to that of noise model B (compare Figures 9b and 8a) except for a drop of the gain at high frequencies. Increasing the noise level, however, lowers the signal gain of the system. For high noise (the long-dashed line in Figure 9) the signal gain at 1000 Hz is 10 times lower than the gain at zero frequency. Note that for noise model A, the gain at zero frequency changes with the level of noise.

7 Incoherent Firing

In the preceding sections it was shown that a population of neurons may react rapidly to changes in the input if the network is in a state of incoherent firing. For a constant input current, incoherent firing may be defined as a macroscopic firing state with constant activity A(t) = A0. In other words, incoherent firing corresponds to a fixed point of the population dynamics.
Figure 9: The filter L^IF_A(x) (a) and the signal gain (b) for IF neurons with noise model A and escape rate r = r0 [u − ϑ] H(u − ϑ). For low noise (short-dashed line, r0 = 20, which corresponds to a variance of the interval distribution of σ = 0.75 ms), the filter L^IF_A has a small width. For high noise (long-dashed line; r0 = 2.5; variance σ = 4 ms), the filter L^IF_A is broad. Solid line: r0 = 5.5, which corresponds to an interval distribution with σ = 2 ms. The value of the threshold has been adjusted so that the mean interval is always 8 ms. At low noise, the signal gain shows several resonances at multiples of the single-neuron frequency of 125 Hz. For high frequencies, the signal gain declines due to the low-pass characteristics of the filter L^IF_A. In (a) the slightly negative valley around T0 = 8 ms is typical for IF neurons and would be absent for SRM0 neurons.
In this section we study the existence and stability of incoherent firing states.

7.1 Determination of the Activity. We search for a fixed point A(t) = A0 of the population dynamics. Given constant A0 and constant external input I0, the input potential of IF or SRM0 neurons is also constant,

    h(t) = h0 = J0 A0 + J^ext I0,        (7.1)
where we have used the normalization ∫ ε0(s) ds = 1 = ∫ ε̃0(s) ds. In a noise-free situation, neurons driven by h0 fire regularly with a period T0, which may be determined directly from the threshold condition. The population activity is then A0 = 1/T0. In the noisy case, the activity is A0 = 1/⟨T⟩, where ⟨T⟩ = ∫_0^∞ s P_{h0}(t̂ + s | t̂) ds is the mean interval length (Gerstner, 1995). The statement follows from the normalization, equation 3.6. For constant input potential h0, the survivor function and the interval distribution cannot depend explicitly on the absolute time, but only on the time difference t − t̂. Hence we may set S0(s) = S_{h0}(t̂ + s | t̂) and P0(s) = P_{h0}(t̂ + s | t̂). The normalization reduces to

    1 = A0 ∫_0^∞ S0(s) ds = A0 ∫_0^∞ s P0(s) ds = A0 ⟨T⟩.        (7.2)

The second equality sign follows from integration by parts using (d/ds) S0(s) = −P0(s).
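The chain of equalities in equation 7.2 is easy to verify numerically for noise model B, where P0 is a gaussian of width σ centered at T0 (the values below are illustrative):

```python
import numpy as np

# Numerical check of equation 7.2 for a gaussian interval distribution
# (noise model B); T0 and sigma are illustrative values.
T0, sig = 8.0, 1.0                 # ms
ds = 0.01
s = np.arange(0.0, 40.0, ds)
P0 = np.exp(-(s - T0)**2 / (2.0 * sig**2)) / (sig * np.sqrt(2.0 * np.pi))

# survivor function S0(s) = integral of P0 from s to infinity
S0 = 1.0 - np.cumsum(P0) * ds

mean_T = np.sum(s * P0) * ds       # <T>
area_S0 = np.sum(S0) * ds          # integral of S0(s) ds
A0 = 1.0 / mean_T
```

Both integrals come out close to T0 = ⟨T⟩ = 8 ms, so A0 ∫ S0(s) ds ≈ 1 as equation 7.2 requires.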
For noise model B, P0(s) is a gaussian, P0(s) = G_σ̂(s − T0), where σ̂ is the width of the distribution and T0 is the noise-free interval in the presence of the input potential h0. Thus, for this noise model, A0 = 1/⟨T⟩ = 1/T0, as in the noise-free case. This is a nice property of noise model B. For noise models A and C, the mean interval, and hence A0, changes with the level of noise.

7.2 Stability of Asynchronous Firing. In this subsection, the stability of incoherent firing is analyzed. We assume that for t > 0, the activity is subject to a small perturbation around the fixed point at A0,

    A(t) = A0 + Â1 e^{iωt + λt},        (7.3)
with Â1 ≪ A0. The perturbation in the activity induces a perturbation in the postsynaptic potential,

    h(t) = h0 + h1 e^{iωt + λt},        (7.4)
with h0 = J0 ε̂(0) A0 and h1 = J0 ε̂(ω − iλ) Â1, where

    ε̂(ω − iλ) = |ε̂(ω − iλ)| e^{−iψ(ω−iλ)} = ∫_0^∞ ε0(s) e^{−iωs − λs} ds        (7.5)
is a Laplace transform of ε0. The Fourier transform is defined by equation 7.5 with λ = 0. The angle ψ(·) is the phase shift between h and A. The change in the potential causes some of the neurons to fire earlier (when the change in h is positive) and others to fire later (when the change is negative). The perturbation may therefore build up (λ > 0, the incoherent state is unstable) or decay back to zero (λ < 0, the incoherent state is stable). At the transition between the regions of stability and instability, the amplitude of the perturbation remains constant (λ = 0, marginal stability of the incoherent state). These transition points, defined by λ = 0, are determined now. We use equations 7.3 and 7.4 on the right-hand side of the linearized population equation, 5.9. After cancellation of a common factor Â1 exp(iωt), the result can be written in the form

    1 − P̂(ω) = iω J0 A0 ε̂(ω) L̂(ω).        (7.6)
P̂(ω) is the Fourier transform of the interval distribution P_{h0}(t | t̂), and L̂(ω) is the transform of the filter L in Table 1. Equation 7.6 defines the bifurcation points where the incoherent firing state loses its stability toward an oscillation with frequency ω. So far the result is completely general. We now specify the model. For SRM0 neurons with gaussian noise in the reset (noise model B), the interval
Figure 10: Amplitude condition for the bifurcation of an oscillatory solution. The feedback gain S_f is plotted as a function of the normalized frequency ωT0 for two different values of the noise, σ = 1 ms (solid line) and σ = 0.1 ms (dashed line). Given a feedback with appropriate phase, instabilities of the incoherent firing state are possible for frequencies where S_f > 1. For low noise, S_f crosses unity (dotted horizontal line) at frequencies ω ≈ ωn = n 2π/T0. For σ = 1 ms there is a single instability region for ωT0 ≈ 1. The feedback gain is defined by the right-hand side of equation 7.7. For the plot we have set T0 = 2τ.
distribution is a gaussian centered at T0 and the filter is a δ-function, L(x) = δ(x)/η′. Hence equation 7.6 is of the form

    1 = iω J0 A0 ε̂(ω) / {η′ [1 − Ĝσ(ω) e^{−iωT0}]},        (7.7)
where Ĝσ(ω) = exp{−½ σ² ω²} is the Fourier transform of a gaussian with width σ. We note that equation 7.7 could have been obtained directly from the signal gain, equation 6.1, after replacement of the input term J^ext ε̂̃(ω) Î(ω) → J0 ε̂(ω) Â1 and evaluation of P̂(ω) and L̂(ω) for noise model B. In the following we analyze solutions of equation 7.7. To do so we split the equation into two conditions: one for the absolute value and one for the phase. Let us start with the condition for the absolute values. A necessary condition for a solution of equation 7.7 is a feedback gain S_f of unity, where S_f is defined as the absolute value of the right-hand side of the equation. In Figure 10 we have plotted S_f as a function of ωT0. The left-hand side of equation 7.7 corresponds to a horizontal line at unity. Solutions of equation 7.7 may exist only for frequencies ω ≈ ωn = n 2π/T0 with integer n, where T0 = 1/A0 is the typical interspike interval of the neurons. A0 is determined by equation 7.2, with h0 given by equation 7.1. Thus T0 = 1/A0 includes the effect of the feedback; it is not the intrinsic period of the free neuron, which could be quite different. A solution with ω ≈ ω1 implies that the period of the population activity is identical to the period of individual neurons driven by an input potential h0. Higher harmonics correspond to cluster states: each neuron fires with
a mean period of T0, but neurons tend to fire in groups, so that the activity oscillates several times faster. We see from Figure 10 that the harmonics are relevant for low noise only. At high noise levels, even the solution ω ≈ ω1 disappears. For σ → 0 the absolute value of the denominator of equation 7.7 is 2|sin(ωT0/2)|, and solutions may occur for all higher harmonics. Figure 10 specifies the amplitude condition for the solution of equation 7.7. We now turn to the phase condition. In order to work out the phase condition, we need the Fourier transform of the response kernel ε0 defined in equation 2.5. It is given by equation 7.5 with amplitude |ε̂(ω)| = (1 + ω²τ²)^{−1} and phase

    ψ(ω) = ω Δ^ax + 2 arctan(ωτ).        (7.8)
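The amplitude condition can be sketched numerically by combining the right-hand side of equation 7.7 with a kernel transform consistent with equation 7.8, ε̂(ω) = e^{−iωΔ^ax}/(1 + iωτ)². In the sketch below, η′ = 1 fixes arbitrary units (an assumption, not a value from the paper), and T0 = 2τ as in Figure 10. Low noise produces sharp resonances at the harmonics ωn = n 2π/T0, and raising σ suppresses the higher ones:

```python
import numpy as np

# Feedback gain S_f (absolute value of the right-hand side of equation 7.7),
# with eps_hat(w) = exp(-i*w*D_ax)/(1 + i*w*tau)**2, consistent with equation 7.8.
# eta' = 1 is an arbitrary choice of units; T0 = 2*tau as in Figure 10.
tau = 4.0; T0 = 2.0 * tau; A0 = 1.0 / T0; J0 = 1.0; eta_p = 1.0; D_ax = 2.0  # ms

def S_f(w, sig):
    eps_hat = np.exp(-1j * w * D_ax) / (1.0 + 1j * w * tau)**2
    G_hat = np.exp(-0.5 * sig**2 * w**2)
    return np.abs(1j * w * J0 * A0 * eps_hat
                  / (eta_p * (1.0 - G_hat * np.exp(-1j * w * T0))))

w1 = 2.0 * np.pi / T0        # fundamental (125 Hz for T0 = 8 ms)
w2 = 2.0 * w1                # second harmonic

gain_res = S_f(w1, sig=0.1)          # on resonance, low noise
gain_off = S_f(1.3 * w1, sig=0.1)    # off resonance, low noise
gain_low_w2 = S_f(w2, sig=0.1)       # second harmonic, low noise
gain_high_w2 = S_f(w2, sig=1.0)      # second harmonic, high noise
```

Since S_f is an absolute value, the delay Δ^ax drops out of the amplitude condition; it enters only through the phase condition.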
We note that a change in the delay Δ^ax affects only the phase of the Fourier transform, not the amplitude. By changing the transmission delay, we can therefore systematically shift the phase of the numerator on the right-hand side of equation 7.7, a convenient property that we will exploit below. Since we know that solutions are possible only for ω ≈ ωn, we set ω = ωn + νn. The numerical solutions of equation 7.7 for different values of the delay Δ^ax and different levels of the noise σ are shown in the bifurcation diagram in the center of Figure 11. Neurons interact with strength J0 = 1. The value of the threshold ϑ was adjusted so that the mean interval in the presence of coupling was T0 = 2τ. (For this choice of parameters, the interval of the free neuron would be about twice as long.) In the simulations shown in the four insets in Figure 11, we have taken τ = 4 ms and a total number of N = 1000 neurons.

7.3 Bifurcation Diagram. Let us consider a network with interaction delay Δ^ax = 2 ms. This corresponds to an x-value of Δ^ax/T0 = 0.25 in Figure 11. The phase diagram predicts that at a noise level of σ = 0.5 ms, the network is in a state of asynchronous firing. The activity during a simulation run is shown in the inset in the upper right-hand corner. It confirms that the activity fluctuates around a constant value of A0 = 1/T0 = 0.125 kHz. If the noise level of the network is significantly reduced, the system crosses the short-dashed line, the boundary at which the constant-activity state becomes unstable with respect to an oscillation with ω ≈ 3(2π/T0). A simulation of a network at a noise level of σ = 0.1 ms but otherwise the same parameters as before confirms that the population activity exhibits an oscillation with period T^osc ≈ T0/3 ≈ 2.6 ms. We now return to our original parameters of σ = 0.5 ms and Δ^ax = 2 ms and reduce the axonal transmission delay while keeping the noise level fixed.
This correponds to a horizontal move across the phase diagram in Figure 11. At some point, the system crosses the solid line, which marks the transition to an instability with frequency v1 = 2p / T0 . Again this is conrmed by the simulation results shown in the inset in the upper left
Figure 11: Stability diagram (center) as a function of noise σ (y-axis) and delay Δ^ax (x-axis). The diagram shows the borders of the stability region with respect to ω₁, . . . , ω₄. For high values of the noise, the asynchronous firing state is always stable. If the noise is reduced, the asynchronous state becomes unstable with respect to an oscillation with frequency ω₁ (solid border lines), ω₂ (long-dashed border lines), ω₃ (short-dashed border lines), or ω₄ (long-short-dashed border lines). Four insets show typical patterns of the activity as a function of time, taken from a simulation with 1000 neurons. Parameters: σ = 0.5 ms and Δ^ax = 0.2 ms (top left); σ = 0.5 ms and Δ^ax = 2.0 ms (top right); σ = 0.1 ms and Δ^ax = 0.2 ms (bottom left); σ = 0.1 ms and Δ^ax = 2.0 ms (bottom right). Since the pattern repeats along the x-axis with period T₀, we have plotted a normalized delay Δ^ax/T₀.
corner. If we now decrease the noise level, the oscillation becomes more pronounced (see the bottom right of Figure 11). In the limit of low noise, the incoherent network state is unstable for virtually all values of the delay. The region of the phase diagram in Figure 11
around Δ^ax/T₀ ≈ 0.1 looks stable but hides instabilities with respect to the higher harmonics ω₆ and ω₅, which are not shown. We emphasize that the specific location of the stability borders depends on the form of the postsynaptic response function ε. The qualitative features of the phase diagram in Figure 11 are generic and hold for all kinds of response kernels. The numerical results apply to the response kernel ε₀(s) defined in equation 2.5, which corresponds to a synaptic current α(s) with zero rise time (cf. equations 2.9 and 2.8). What happens if α is a double exponential with rise time τ_rise and decay time τ_syn? In this case, the right-hand side of equation 7.7 has an additional factor [1 + iωτ_rise]⁻¹ that leads to two changes. First, due to the reduced amplitude of the feedback, instabilities with frequencies ω > τ_rise⁻¹ are suppressed. The tongues for the higher harmonics are therefore smaller. Second, the phase of the feedback changes. Thus, all tongues of frequency ωₙ are moved horizontally along the x-axis by an amount Δ/T₀ = −arctan(ωₙ τ_rise)/(2πn). What happens if the excitatory interaction is replaced by inhibitory coupling? A change in the sign of the interaction corresponds to a phase shift of π. For each harmonic, the region along the delay axis where the incoherent state is unstable for excitatory coupling (cf. Figure 11) becomes stable for inhibition, and vice versa. In other words, we simply have to shift the instability tongues for each frequency ωₙ horizontally by an amount Δ/T₀ = 1/(2n). Otherwise the pattern remains the same.

8 Discussion

8.1 Assumptions. We have discussed an integral equation for the population dynamics. The validity of the population equations relies on three assumptions: (1) a homogeneous population of (2) an infinite number of neurons that show (3) no adaptation. It is clear that there are no large and completely homogeneous populations in biology.
The population equations may nevertheless be a useful starting point for a theory of heterogeneous populations (Tsodyks, Mitkov, & Sompolinsky, 1993; Senn et al., 1996; Chow, 1998; Pham et al., 1998; Brunel & Hakim, 1998). We expect that most of the results carry over to nonhomogeneous networks. In a sense, noise model B can be considered an annealed version of a heterogeneous model where the reset value varies from one neuron to the next. The treatment of heterogeneity as noise (reset values are randomly chosen after each reset rather than only once at the beginning) neglects, however, correlations that would be present in a truly heterogeneous model. To replace a heterogeneous model with a noisy version of a homogeneous model is somewhat ad hoc but common practice in the literature. Wilson and Cowan (1972), for example, motivated the sigmoidal threshold function in their model by a heterogeneous system where the threshold value varies from one neuron to the next. In order to analyze the system, they neglect correlations and replace the heterogeneous system by a homogeneous population with noisy threshold (noise model A). A more recent example of the replacement of a heterogeneous system by a noisy homogeneous system can be found in Pham et al. (1998).

A generalization of equations 3.3 and 3.7 to a network subdivided into several pools is possible. Within each pool, neurons are homogeneous. The activity of a pool m is A_m(t). A neuron in pool m receives input from all neurons in pool n with strength w_mn = J_mn/N_n. The postsynaptic potential of neurons in pool m is then given by a straightforward generalization of equation 3.3 (Gerstner, 1995),

h_m(t|\hat{t}) = \sum_n J_{mn} \int_0^\infty \epsilon(t - \hat{t}, s)\, A_n(t - s)\, ds.

The population equation 3.7 remains unchanged but has to be applied to each pool activity A_m separately. The pools are coupled via the potential h_m, which determines the kernel P_{h_m}(t|t̂). A transition from discrete pools to a continuous population is possible (Gerstner, 1995) and might be useful for modeling of primary visual cortex.

The second condition is the limit of a large network. For N → ∞ the population activity shows no fluctuations, and this fact has been used for the derivation of the population equation. For systems of finite size, fluctuations are important since they limit the amount of information that can be transmitted by the population activity. For populations without internal coupling (J₀ = 0), fluctuations can be calculated directly from the interval distribution P_h(t|t̂). For networks with internal coupling, an exact treatment of finite-size effects is difficult. For noise model A, first attempts toward a description of the fluctuations have been made (Spiridon, Chow, & Gerstner, 1998). For noise model C, finite-size effects in the low-connectivity limit are treated in Brunel and Hakim (1999).

The limit of no adaptation seems to be valid for fast spiking neurons (Connors & Gutnick, 1990).
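The pool-coupled potential given above, h_m(t|t̂) = Σₙ J_mn ∫₀^∞ ε(t − t̂, s) Aₙ(t − s) ds, can be sketched with a discretized convolution. Everything below is an illustrative assumption: an SRM₀-type kernel that depends only on s (a normalized α-function consistent with the amplitude in equation 7.8), a 2 × 2 coupling matrix, and prescribed pool activities.

```python
import numpy as np

# Sketch of the pool-coupled input potential (SRM0 case, so the kernel
# depends only on s): h_m(t) = sum_n J_mn * integral_0^inf eps(s) A_n(t-s) ds.
# Kernel, coupling matrix, and pool activities are illustrative assumptions.
dt = 1e-4                                   # time step (s)
tau = 4e-3                                  # membrane time constant (s)
s = np.arange(0, 10 * tau, dt)
eps = (s / tau**2) * np.exp(-s / tau)       # alpha kernel, normalized to unit area

J = np.array([[1.0, -0.5],                  # J_mn: pool 0 excites itself and
              [0.8,  0.0]])                 # is inhibited by pool 1, etc.
t = np.arange(0, 50e-3, dt)
A = np.vstack([np.full_like(t, 125.0),                       # A_0: constant (Hz)
               62.5 * (1 + np.sin(2 * np.pi * 250.0 * t))])  # A_1: oscillating

h = np.zeros_like(A)
for m in range(2):
    for n in range(2):
        # causal convolution of pool-n activity with the kernel
        h[m] += J[m, n] * np.convolve(A[n], eps)[:len(t)] * dt
print(h[:, -1])     # input potentials h_m at the end of the window
```

With a unit-area kernel, a pool with constant activity A contributes J_mn · A to the potential once the simulated window exceeds the kernel support, which makes the sketch easy to sanity-check.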
Most cortical neurons, however, show adaptation. A generalization of the population equations to neuron models with adaptation does not seem straightforward. From the modeling point of view, all IF neurons are in the class of nonadaptive neurons, since the membrane potential is reset (and the past forgotten) after each output spike. Both leaky and nonleaky IF neurons can therefore be mapped to a response kernel description with no dependence on earlier output spikes (Gerstner, 1995). To a very good approximation, the Hodgkin-Huxley model can also be mapped to a response kernel description (Kistler et al., 1997). The condition of short memory (that is, no adaptation) leads to the class of renewal models (Perkel, Gerstein, & Moore, 1967; Stein, 1967; Cox, 1962), and this is where the integral equation applies (cf. Gerstner, 1995).

8.2 Signal Transmission by Incoherent Firing. We have mainly focused on the state of incoherent firing in the low-noise regime. This state may be particularly interesting for information transmission, since the system
can respond rapidly to changes in the input current. For noise model B, the signal gain, defined as the amplitude of the population activity divided by that of the input current, shows no cutoff at high frequencies (Knight, 1972a). The effective cutoff frequency of the system is therefore given by the input current. Changes in the input current are, of course, limited by the opening and closing times of synaptic channels. Each presynaptic spike evokes a postsynaptic current pulse of finite width. The time course of the current pulse determines the response time of the system (Treves, 1993). These insights may have important implications for modeling as well as for interpretations of experiments. It is often thought that the response time of neurons is directly related to the membrane time constant τ_m. This idea has been criticized repeatedly since in some cases neurons do respond much faster (Treves, 1992; Koch, Rapp, & Segev, 1996). In neural network modeling, a description of the population activity by a differential equation of the form 1.2 is common practice. Our results suggest that the time constant τ on the left-hand side of equation 1.2 should be of the order of the duration of the synaptic current pulse. We have seen that the Wilson-Cowan integral equation 1.3, and hence the differential equation 1.2, describes neurons with absolute refractoriness and noise model A; the gain function g in equation 1.3 may be identified with the escape rate f. For neurons with absolute refractoriness and noise model A, the filter L is a low-pass filter with cutoff frequency f(h₀) (cf. appendix E). If the mean activity of the system is low, then the cutoff frequency is low, and during a transient the activity follows the input potential h. The transient is therefore slow. On the other hand, if the rate f(h₀) is high, then the activity follows the derivative h′, and the transient is fast. Thus time-coarse graining would be possible with a time constant smaller than f(h₀)⁻¹.
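The scaling of the transient speed with f(h₀) can be illustrated with a simple first-order low-pass filter whose cutoff rate plays the role of f(h₀). This is a sketch of the qualitative argument only, not the full filter L derived in appendix E; the rates and the step drive are illustrative.

```python
import numpy as np

# First-order low-pass: dA/dt = f_h0 * (drive(t) - A). A step in the drive is
# followed with time constant 1/f_h0, so a high rate f(h0) gives a fast transient.
def step_response(f_h0, t_end=0.1, dt=1e-4, t_step=0.01):
    t = np.arange(0.0, t_end, dt)
    drive = np.where(t > t_step, 1.0, 0.0)     # unit step at t_step
    A = np.zeros_like(t)
    for i in range(1, len(t)):                 # forward Euler integration
        A[i] = A[i - 1] + dt * f_h0 * (drive[i - 1] - A[i - 1])
    return t, A

for rate in (20.0, 200.0):                     # low versus high f(h0), in Hz
    t, A = step_response(rate)
    t63 = t[np.argmax(A >= 0.63)] - 0.01       # 63% rise time from step onset
    print(f"f(h0) = {rate:5.0f} Hz -> 63% rise time {t63 * 1e3:.2f} ms (~ 1/f(h0))")
```

The measured rise time is close to 1/f(h₀) in both cases, consistent with the statement that a high escape rate makes time-coarse graining with a short time constant possible.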
We must keep in mind that even if we take a short integration time constant, the differential equation 1.2 is always a description of noise model A. We hope that the treatment presented in this article helps to clarify some important points. The limitations of the differential equation 1.2 are already mentioned in Wilson and Cowan (1972) and have been discussed in Abbott and van Vreeswijk (1993), Treves (1993), and Gerstner (1995). For noise model A, the integral equation can be replaced by a system of differential equations (Eggert & van Hemmen, 1997) or by partial differential equations (Gerstner & van Hemmen, 1992). For noise model C, partial differential equations for the distribution of the membrane potential u have been developed (Abbott & van Vreeswijk, 1993; Brunel & Hakim, 1998) and are an alternative to the approach by integral equations.

Since we consider incoherent firing to be a state with useful signal transmission properties, we must be concerned about the stability of such a state. Incoherent firing can be stabilized by a suitable choice of time constants, transmission delay, and noise (Abbott & van Vreeswijk, 1993; Gerstner &
van Hemmen, 1993, 1994; Gerstner, 1995). For low noise, incoherent states are unstable for nearly all choices of parameters. Instabilities lead toward oscillations, either with a period comparable to the typical interval of the neuronal spike trains or much faster (higher harmonics) (Golomb, Hansel, Shraiman, & Sompolinsky, 1992; Gerstner & van Hemmen, 1993; Ernst, Pawelzik, & Geisel, 1994; Golomb & Rinzel, 1994). The harmonics have also been called cluster states since neurons spontaneously split into groups of neurons that fire approximately together. Higher harmonics can be easily suppressed by noise. The instability of asynchronous firing has previously been studied in phase models (Kuramoto, 1975, 1984; Winfree, 1980; Ermentrout, 1981; Strogatz & Mirollo, 1991; Golomb et al., 1992). Phase models can describe arbitrary nonlinear oscillator systems in the limit of weak coupling (Ermentrout, 1981). In contrast to phase models, the analytical framework presented in this article does not rely on a weak coupling assumption but works just as well for strong coupling. The mean interspike interval T₀ in the asynchronous or synchronous state is determined self-consistently and includes the effects of the feedback from other neurons. In the absence of coupling, the model neurons may be silent or exhibit a Poisson-like rather than an oscillatory firing behavior. Thus neurons are not necessarily oscillators. To analyze the stability in the Poisson regime, we simply have to choose the appropriate interval distribution P_{h₀}(t|t̂), determine its Fourier transform P̂(ω) and the filter L̂(ω), and put the expressions into the bifurcation equation 7.6. The instability of asynchronous firing in pulse-coupled neuron models has been studied in the noise-free case (Abbott & van Vreeswijk, 1993; Gerstner & van Hemmen, 1993; Gerstner, 1995). The effect of small inhomogeneities has been studied by Tsodyks et al.
(1993) and a network with several populations and types of ion currents by Treves (1993). Noise in the differential equation of delayless IF neurons was treated by Kuramoto (1991) and Abbott and van Vreeswijk (1993). An analysis similar to that of Abbott and van Vreeswijk (1993), but for heterogeneous networks with low connectivity, has been done by Brunel and Hakim (1999). The effect of threshold noise (noise model A) in the spike response model including delays has been investigated by Gerstner (1995). For the analysis in section 7, we have assumed noise in the reset (noise model B). This noise model is probably not the most realistic one, but it is particularly convenient to work with since the mean interspike interval is independent of the noise level. For other noise models, a bifurcation diagram as in Figure 11 will look more complicated since the mean interval depends on the noise level. A way out would be to readjust the threshold for each noise level so as to keep the mean interval constant.

If the network is firing incoherently, then the population activity responds immediately to an abrupt change in the input (Treves, 1992; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996; Horn & Levanda, 1998). There is no integration delay. In the case of incoherent firing, there are always some neurons close to threshold, and this is why the system as a whole can respond immediately. This property suggests that a population of neurons may transmit information fast and reliably. Fast information processing seems to be a necessary requirement for biological nervous systems if reaction time experiments are to be accounted for (Thorpe et al., 1996).

Appendix A: Wilson-Cowan Integral Equations

We apply the population equation 3.7 to SRM₀ neurons with noise model A. The escape rate is a sigmoidal function f[u] and ϑ = 0. The neuron model is specified by a refractory function η in three pieces. First, during an absolute refractory time 0 < s ≤ δ^abs, we formally set η(s) to −∞. Second, during the relative refractory period δ^abs < s < δ^rel, η may take some arbitrary values η(s). Third, η(s) = 0 for s ≥ δ^rel. Given η, it seems natural to split the integral in the activity equation 3.7 into three pieces:

A(t) = \int_{-\infty}^{t-\delta^{rel}} P_h(t|\hat{t})\, A(\hat{t})\, d\hat{t} + \int_{t-\delta^{rel}}^{t-\delta^{abs}} P_h(t|\hat{t})\, A(\hat{t})\, d\hat{t} + \int_{t-\delta^{abs}}^{t} P_h(t|\hat{t})\, A(\hat{t})\, d\hat{t}.   (A.1)
The interval distribution P_h(t|t̂) for noise model A is given by equation 2.15 and is repeated here for convenience:

P_h(t|\hat{t}) = f[h(t) + \eta(t - \hat{t})] \exp\left\{ -\int_{\hat{t}}^{t} f[h(t') + \eta(t' - \hat{t})]\, dt' \right\}.   (A.2)
The last term in equation A.1 vanishes since spiking is impossible during the absolute refractory time, f[−∞] = 0. In the first term we can move a factor f[h(t) + η(t − t̂)] = f[h(t)] in front of the integral since η vanishes for t − t̂ > δ^rel. The first term in equation A.1 can therefore be rewritten as

f[h(t)] \int_{-\infty}^{t-\delta^{rel}} S_h(t|\hat{t})\, A(\hat{t})\, d\hat{t},   (A.3)
where we have used the definition (see equation 2.14) of the survivor function in noise model A. The integral in equation A.3 can be treated further if we use the normalization given in equation 3.6. The result of these manipulations is

A(t) = f[h(t)] \left[ 1 - \int_{t-\delta^{rel}}^{t} S_h(t|\hat{t})\, A(\hat{t})\, d\hat{t} \right] + \int_{t-\delta^{rel}}^{t-\delta^{abs}} P_h(t|\hat{t})\, A(\hat{t})\, d\hat{t}.   (A.4)
Except for a slight change of notation,² equation A.4 is identical to the integral equation derived by Wilson and Cowan (1972). It applies to neurons with relative refractoriness of finite duration. For δ^rel → ∞ the expression in the square brackets vanishes, and we are back to equation 3.7. If there is no relative refractoriness but absolute refractoriness only, we may set δ^rel = δ^abs. The last term in equation A.4 then vanishes. Furthermore, during the absolute refractory period, we have a survival probability S_h(t|t̂) = 1 since the neurons cannot fire. This yields

A(t) = f[h(t)] \left\{ 1 - \int_{t-\delta^{abs}}^{t} A(t')\, dt' \right\},   (A.5)
which, after the substitution f → g, is exactly the Wilson-Cowan integral equation 1.3.

Appendix B: Compression of Firing Times

We prove the following generalization of the locking theorem:

1. ∂_t h + ∂_{t̂} h > 0 is a necessary condition for a compression of firing times.

2. For IF neurons, the condition ∂_t h + ∂_{t̂} h > 0 is also sufficient.
Proof. In view of equations 4.2 and 4.3, we need to show that ∂_t h + ∂_{t̂} h < 0 ⟹ C < 0. We will use the fact that u′ = η′ + ∂_t h > 0 at the moment of firing. The assertion follows from

C = \frac{\partial_t h + \partial_{\hat{t}} h}{\eta' - \partial_{\hat{t}} h} = \frac{\partial_t h + \partial_{\hat{t}} h}{u' - [\partial_t h + \partial_{\hat{t}} h]}.   (B.1)
We need to show that for IF neurons the denominator in equation B.1 is always positive. In view of equation 4.5, this is equivalent to the statement η̃₀ + h(t − T_b) + τ h′(t − T_b) > 0, where t − T_b = t̂ is the last firing time. Hence u′(t̂) > 0 and u(t̂) = ϑ. The differential equation for the IF model, 2.8, is τ u′ = −u + input, where input comprises synaptic and external currents. The input potential h(t) is a solution to the differential equation; hence input = h + τ h′ and

0 < \tau u'(\hat{t}) = -\vartheta + \text{input} < -u_{reset} + h + \tau h',   (B.2)

where we have used u_reset < ϑ. The assertion follows since η̃₀ = −u_reset.

² In Wilson and Cowan (1972), f is called S, S_h(t|t̂) A(t̂) is called R(t, t̂), and P_h(t|t̂) A(t̂) is S R(t, t̂).
Appendix C: Locking in the Presence of Noise for SRM₀ Neurons

We start from equation 4.12 and search for periodic pulse-type solutions. We assume that the pulses are gaussians with width δ and repeat with period T: A(t) = Σₙ G_δ(t − nT). The pulse width δ will be determined self-consistently from equation 4.12. The integral over r in that equation can be performed and yields a gaussian with width σ̃ = [δ² + σ²]^{1/2}. Equation 4.12 becomes

\sum_n G_\delta(t - nT) = \left[ 1 + \frac{h'(t)}{\eta'(T)} \right] \sum_n G_{\tilde\sigma}[t - T_b(t) - nT],   (C.1)
where T_b(t) = τ ln{η̃₀/[h(t) − ϑ]} is the interspike interval looking backward in time. Let us work out the self-consistency condition and focus on the pulse around t ≈ 0. It corresponds to the n = 0 term on the left-hand side, which must equal the n = −1 term on the right-hand side of equation C.1. We assume that the pulse width is small, δ ≪ T, and expand T_b(t) to linear order around T_b(0) = T. This yields

t - T_b(t) = t \left[ 1 + \frac{h'(0)}{\eta'(T)} \right] - T.   (C.2)
The expansion is valid if h′(t) varies slowly over the width δ of the pulse. We use equation C.2 in the argument of the gaussian on the right-hand side of equation C.1. Since we have assumed that h′ varies slowly, the factor h′(t) in equation C.1 may be replaced by h′(0). In the following we suppress the arguments and write simply h′ and η′. The result is

G_\delta(t) = \left( 1 + \frac{h'}{\eta'} \right) G_{\tilde\sigma}\!\left[ t \left( 1 + \frac{h'}{\eta'} \right) \right].   (C.3)
The gaussian on the left-hand side of equation C.3 must have the same width as the gaussian on the right-hand side. The condition is δ = σ̃/[1 + h′/η′] with σ̃ = [δ² + σ²]^{1/2}. A simple algebraic transformation yields an explicit expression for the pulse width,

\delta = \sigma \left[ 2 \left( h'/\eta' \right) + \left( h'/\eta' \right)^2 \right]^{-1/2},   (C.4)
where δ is the width of the pulse and σ is the strength of the noise.

Appendix D: The Filter L(x) for Noise Model B

We calculate the linear response of IF neurons with noise model B to an input ΔI. We start from equation 4.11 and formally integrate over the variable t̂.
This yields

A(t) = \int dr\, G_\sigma(r) \left[ 1 + \frac{d}{d\hat{t}} T(\hat{t}, r) \right]^{-1} A[t - T_b(t, r)],   (D.1)
where T_b(t, r) is the backward interval of neurons that have last been reset with value r. The derivative has to be evaluated at t̂ = t − T_b(t, r). In order to calculate the derivative needed in equation D.1, we write the membrane potential in the form

u(t) = -\tilde\eta_0\, e^{r/\tau} \exp\!\left( -\frac{t-\hat{t}}{\tau} \right) + h(t) - h(\hat{t}) \exp\!\left( -\frac{t-\hat{t}}{\tau} \right),   (D.2)

where we have used equations 2.1, 2.4, 2.10, and 2.16. Neurons that fire at t have u(t) = ϑ; hence

\vartheta = -\tilde\eta_0\, e^{r/\tau} \exp\!\left[ -\frac{T(\hat{t}, r)}{\tau} \right] + h[\hat{t} + T(\hat{t}, r)] - h(\hat{t}) \exp\!\left[ -\frac{T(\hat{t}, r)}{\tau} \right].   (D.3)
We take the derivative of equation D.3 with respect to t̂ and find

\frac{d}{d\hat{t}} T(\hat{t}, r) = -\frac{1}{u'} \left\{ h'(t) - h'[t - T_b(t, r)]\, e^{-T_b(t, r)/\tau} \right\},   (D.4)
where u′ is the derivative of equation D.2. We now linearize equation D.1. The variation ΔI in the input current causes a perturbation Δh(t) of the input potential (see equation 2.4) around h₀, which in turn evokes a change ΔA(t) in the activity. We set h(t) = h₀ + Δh(t) and A(t) = A₀ + ΔA(t) (where ΔA is of the order Δh) and expand both sides of equation D.1 to first order in ΔA or Δh. Note that the zero-order term of (d/dt̂)T(t̂, r) vanishes (see equation D.4). The result of the linearization is

\Delta A(t) = \int G_\sigma(r)\, \Delta A[t - T(r)]\, dr - A_0 \int G_\sigma(r)\, \frac{d}{d\hat{t}} T(\hat{t}, r)\, dr,   (D.5)

where T(r) = T₀ + T₁ r is the interval of a neuron in the asynchronous state (no perturbation) that has been reset with value r. T₁ = η̃₀/(η̃₀ + h₀) has been introduced after equation 2.18. (d/dt̂)T(t̂, r) is given by equation D.4 with u′ = τ⁻¹ [η̃₀ exp(r/τ) + h₀] exp[−T(r)/τ]. To simplify the expression, u′ may be approximated by its value for r = 0 and moved in front of the integral. Similarly, we use exp[−T_b(t̂, r)/τ] ≈ exp(−T₀/τ). The final result is

\Delta A(t) = \int G_{\hat\sigma}(r)\, \Delta A(t - T_0 - r)\, dr + \frac{A_0}{u'} \left[ h'(t) - e^{-T_0/\tau} \int G_{\hat\sigma}(r)\, h'(t - T_0 - r)\, dr \right],   (D.6)
where σ̂ = T₁ σ. The first term contains the convolution of ΔA with the (unperturbed) interval distribution. The second term can be written as a filter L^IF_B applied to h′ (see Table 1 and equation 5.9).

Appendix E: The Filter L(x) for Noise Model A

In noise model A, we have from equation 2.14,

S_h(t|\hat{t}) = \exp\left\{ -\int_{\hat{t}}^{t} f[\eta(t' - \hat{t}) + h_{PSP}(t'|\hat{t})]\, dt' \right\},   (E.1)

where f[u] is the instantaneous escape rate across the noisy threshold. We write h_PSP(t|t̂) = h₀(t − t̂) + h₁(t|t̂). Taking the derivative with respect to h₁ yields

\left. \frac{\partial S_h(t|\hat{t})}{\partial h_1(t_1|\hat{t})} \right|_{h_1 = 0} = -\Theta(t_1 - \hat{t})\, \Theta(t - t_1)\, f'[\eta(t_1 - \hat{t}) + h_0(t_1 - \hat{t})]\, S_0(t - \hat{t}),   (E.2)

where S₀(t − t̂) = S_{h₀}(t|t̂) and f′ = df/du. For SRM₀ neurons, we have h₀(t − t̂) ≡ h₀ and h₁(t|t̂) = h₁(t), independent of t̂. The filter L is therefore

L^{SRM}_A(t - t_1) = \Theta(t - t_1) \int_{-\infty}^{t_1} d\hat{t}\, f'[\eta(t_1 - \hat{t}) + h_0]\, S_0(t - \hat{t}),   (E.3)
as noted in Table 1. It is now shown that equation E.3 can be reduced to

L^{SRM}_A(x) = a\, \rho\, e^{-\rho x}\, \Theta(x),   (E.4)

in two limiting cases: (1) SRM₀ neurons and noise model A with instantaneous rate f(u) = ρ Θ(u − ϑ) and arbitrary η and ε kernels, and (2) SRM₀ neurons and noise model A with arbitrary f(u) for neurons with absolute refractoriness, η(s) = −∞ for 0 < s < δ^abs and zero otherwise. Furthermore, it is shown that for IF neurons with step-function threshold,
L^{IF}_A(x) = L^{SRM}_A(x) - e^{-T_0/\tau}\, L^{SRM}_A(x - T_0).   (E.5)
E.1 Step Function Threshold. We take f(u) = ρ Θ(u − ϑ). For ρ → ∞, neurons fire immediately as soon as u(t) > ϑ, and we are back to the noise-free sharp threshold. For finite ρ, neurons respond stochastically with time constant ρ⁻¹. Let us denote by T₀ the time between the last firing time t̂ and the formal threshold crossing, T₀ = min{t − t̂ | η(t − t̂) + h₀ = ϑ}. The derivative of f is a δ-function,

f'[\eta(t - \hat{t}) + h_0] = \rho\, \delta[\eta(t - \hat{t}) + h_0 - \vartheta] = \frac{\rho}{\eta'}\, \delta(t - \hat{t} - T_0),   (E.6)
where η′ = (d/ds) η|_{T₀}. The survivor function S₀(s) is unity for s < T₀ and S₀(s) = exp[−ρ(s − T₀)] for s > T₀. Integration of equation E.3 yields

L_A(t - t_1) = \frac{1}{\eta'}\, \Theta(t - t_1)\, \rho \exp[-\rho(t - t_1)],   (E.7)

as claimed above. For the filter L^IF_A we also have to evaluate the integral

S_0(x) \int_0^x d\xi\, e^{-\xi/\tau}\, f'[u(\xi)] = S_0(x)\, \Theta(x - T_0)\, \frac{\rho}{\eta'}\, e^{-T_0/\tau},   (E.8)
which confirms the assertion.

E.2 Absolute Refractoriness. We take an arbitrary escape rate f(u) ≥ 0 with lim_{u→−∞} f(u) = 0 = lim_{u→−∞} f′(u). Absolute refractoriness is defined by a refractory kernel η(s) = −∞ for 0 < s < δ^abs and zero otherwise. This yields f[η(t − t̂) + h₀] = f(h₀) Θ(t − t̂ − δ^abs) and hence

f'[\eta(t - \hat{t}) + h_0] = f'(h_0)\, \Theta(t - \hat{t} - \delta^{abs}).   (E.9)

The survivor function S₀(s) is unity for s < δ^abs and decays as exp[−f(h₀)(s − δ^abs)] for s > δ^abs. Integration of equation E.3 yields

L_A(t - t_1) = \Theta(t - t_1)\, \frac{f'(h_0)}{f(h_0)}\, \exp[-f(h_0)(t - t_1)].   (E.10)
Note that for neurons with absolute refractoriness, the transition to the noiseless case is not meaningful. As shown in appendix A, absolute refractoriness leads to the Wilson-Cowan integral equation 1.3. Thus L_A = L_WC defined in equation E.10 is the filter relating to equation 1.3. It is a low-pass filter with cutoff frequency f(h₀), which depends on the input potential h₀.

Acknowledgments

I thank Carson C. Chow for many discussions, in particular for the idea of formulating the population equation as a conservation law. Thanks are due to the referees for numerous constructive comments. Carl van Vreeswijk has helped to improve the article through his many insightful suggestions.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in a network of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amari, S. (1974). A method of statistical neurodynamics. Kybernetik, 14, 201–215.
Bauer, H. U., & Pawelzik, K. (1993). Alternating oscillatory and stochastic dynamics in a model for a neuronal assembly. Physica D, 69, 380–393.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Preprint submitted to Neural Computation.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Trans. on Systems, Man, and Cybernetics, 13, 815–823.
Collins, J. J., Carson, C. C., Capela, A. C., & Imhoff, T. T. (1996). Aperiodic stochastic resonance. Physical Review E, 54, 5575–5584.
Connors, B. W., & Gutnick, M. J. (1990). Intrinsic firing patterns of diverse cortical neurons. Trends in Neurosci., 13, 99–104.
Cowan, J. D. (1968). Statistical mechanics of nervous nets. In E. R. Caianello (Ed.), Proceedings of the 1967 NATO Conference on Neural Networks. Berlin: Springer-Verlag.
Cox, D. R. (1962). Renewal theory. London: Methuen.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Eggert, J., & van Hemmen, J. L. (1997). Derivation of pool dynamics from microscopic neuronal models. In W. Gerstner et al. (Eds.), Artificial neural networks—ICANN'97 (pp. 109–114). Heidelberg: Springer-Verlag.
Ermentrout, G. B. (1981). N:M phase-locking in weakly coupled oscillators. J. Math. Biology, 12, 327–342.
Ernst, U., Pawelzik, K., & Geisel, T. (1994). Multiple phase clustering of globally coupled neurons with delay. In M. Marinaro & P. G. Morasso (Eds.), Proceedings of the ICANN'94 (pp. 1063–1065). Berlin: Springer-Verlag.
Feldman, J. L., & Cowan, J. D. (1975). Large-scale activity in neural nets I: Theory with application to motoneuron pool responses. Biol. Cybern., 17, 29–38.
Fetz, E. E., & Gustafsson, B. (1983).
Relation between shapes of post-synaptic potentials and changes in firing probability of cat motoneurones. J. Physiol., 341, 387–410.
Gerstner, W. (1993). Kodierung und Signalübertragung in Neuronalen Systemen: Assoziative Netzwerke mit stochastisch feuernden Neuronen. Frankfurt/Main: Harri Deutsch Verlag.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51(1), 738–758.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., & van Hemmen, J. L. (1993). Coherence and incoherence in a globally coupled ensemble of pulse-emitting units. Phys. Rev. Lett., 71(3), 312–315.
Gerstner, W., & van Hemmen, J. L. (1994). Coding and information processing in neural networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 1–93). New York: Springer-Verlag.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal
locking. Neural Comput., 8, 1689–1712.
Golomb, D., Hansel, D., Shraiman, B., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Phys. Rev. A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Gutkin, B., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine inter-spike interval variability: A link between spike generation mechanism and cortical spike train statistics. Neural Comput., 10, 1047–1065.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hopfield, J. J. (1984). Neurons with graded response have computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hopfield, J. J., & Herz, A. V. M. (1995). Rapid local synchronization of action potentials: Towards computation with coupled integrate-and-fire networks. Proc. Natl. Acad. Sci. USA, 92, 6655.
Horn, D., & Levanda, S. (1998). Fast temporal encoding and decoding with spiking neurons. Neural Comput., 10, 1705–1720.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. (London), 160, 106–154.
Kandel, E. C., & Schwartz, J. H. (1991). Principles of neural science. New York: Elsevier.
Kistler, W., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Knight, B. W. (1972a). Dynamics of encoding in a population of neurons. J. Gen. Physiology, 59, 734–766.
Knight, B. W. (1972b). The relationship between the firing rate of a single neuron and the level of activity in a population of neurons. J. Gen. Physiology, 59, 767–788.
Knox, C. K. (1974). Cross-correlation functions for a neuronal model. Biophysical J., 14, 567–582.
Koch, C., Rapp, M., & Segev, I. (1996). A brief history of time constants. Cerebral Cortex, 6, 93–101.
Kuramoto, Y. (1975). Self-entrainment of a population of coupled nonlinear oscillators. In H. Araki (Ed.), International Symposium on Mathematical Problems in Theoretical Physics (pp. 420–422). Berlin: Springer-Verlag.
Kuramoto, Y. (1984). Chemical oscillations, waves, and turbulence. Berlin: Springer-Verlag.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Mountcastle, V. B. (1957). Modality and topographic properties of single neurons of cat's somatosensory cortex. J. Neurophysiol., 20, 408–434.
Nykamp, D., Tranchina, D., Shapley, R., & McLoughlin, D. (1998). A population density approach facilitates large-scale modeling of neural networks. To
appear in Journal of Computational Neuroscience.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. I. The single spike train. Biophys. J., 7, 391–418.
Pham, J., Pakdamen, K., Champagnat, J., & Vibert, J.-R. (1998). Activity in sparsely connected excitatory neural networks: Effect of connectivity. Neural Networks, 11, 415–438.
Pinto, D. J., Brumberg, J. C., Simons, D. J., & Ermentrout, G. B. (1996). A quantitative population model of whisker barrels: Re-examining the Wilson-Cowan equations. J. Comput. Neurosci., 3, 247–264.
Plesser, H. E., & Gerstner, W. (1999). Noise in integrate-and-fire neurons: From stochastic input to escape rates. To appear in Neural Computation.
Poliakov, A. V., Powers, R. K., & Binder, M. D. (1997). Functional identification of the input-output transforms of motoneurons in the rat and the cat. J. Physiology, 504(2), 401–424.
Senn, W., Wyler, K., Streit, J., Larkum, M., Lüscher, H.-R., Mey, H., Müller, L., Steinhauser, D., Vogt, K., & Wannier, T. (1996). Dynamics of random neural network with synaptic depression. Neural Networks, 9, 575–588.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cybern., 68, 393–407.
Spiridon, M., Chow, C. C., & Gerstner, W. (1998). Frequency spectrum of coupled stochastic neurons with refractoriness. In L. Niklasson et al. (Eds.), ICANN'98 (pp. 337–342). Berlin: Springer-Verlag.
Stein, R. B. (1967). Some models of neuronal variability. Biophys. J., 7, 37–68.
Stevens, C. F., & Zador, A. (1998). Novel integrate-and-fire-like model of repetitive firing in cortical neurons. In Proceedings of the Fifth Joint Symposium on Neural Computation, University of California at San Diego.
Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635.
Terman, D., & Wang, D. (1995). Global competition and local cooperation in a network of neural oscillators.
Physica D, 81, 148–176. Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522. Treves, A. (1992). Local neocortical processing: A time for recognition, Int. J. Neural Systems, 3(Supp.), 115–119. Treves, A. (1993). Mean-eld analysis of neuronal spike dynamics. Network, 4, 259–284. Treves, A., Rolls, E. T., & Simmen, M. (1997). Time for retrieval in recurrent associative memories. Physica D, 107, 392–400. Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in inhomogeneous networks of oscillators with pulse interaction. Phys. Rev. Lett., 71, 1281–1283. Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical networks. Network, 6, 111–124. Tuckwell, H. C. (1988). Introduction to theoretic neurobiology. Cambridge: Cambridge University Press. Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia:
Population Dynamics of Spiking Neurons
89
SIAM. van Vreeswijk, C., Abbott, L. E., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural ring. Journal of Computational Neuroscience, 1, 313–321. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical J., 12, 1–24. Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80. Winfree, A. T. (1980). The geometry of biological time. Berlin: Springer-Verlag. Received March 23, 1998; accepted January 13, 1999.
ARTICLE
Communicated by Carl van Vreeswijk
Dynamics of Strongly Coupled Spiking Neurons

Paul C. Bressloff and S. Coombes
Nonlinear and Complex Systems Group, Department of Mathematical Sciences, Loughborough University, Loughborough, Leicestershire LE11 3TU, U.K.
We present a dynamical theory of integrate-and-fire neurons with strong synaptic coupling. We show how phase-locked states that are stable in the weak coupling regime can destabilize as the coupling is increased, leading to states characterized by spatiotemporal variations in the interspike intervals (ISIs). The dynamics is compared with that of a corresponding network of analog neurons in which the outputs of the neurons are taken to be mean firing rates. A fundamental result is that for slow interactions, there is good agreement between the two models (on an appropriately defined timescale). Various examples of desynchronization in the strong coupling regime are presented. First, a globally coupled network of identical neurons with strong inhibitory coupling is shown to exhibit oscillator death in which some of the neurons suppress the activity of others. However, the stability of the synchronous state persists for very large networks and fast synapses. Second, an asymmetric network with a mixture of excitation and inhibition is shown to exhibit periodic bursting patterns. Finally, a one-dimensional network of neurons with long-range interactions is shown to desynchronize to a state with a spatially periodic pattern of mean firing rates across the network. This is modulated by deterministic fluctuations of the instantaneous firing rate whose size is an increasing function of the speed of synaptic response.

1 Introduction

There are two basic classes of neurodynamical model that are distinguished by their representation of neuronal output activity (see Abbott, 1994, and references therein). The first considers the output of a neuron to be a mean firing rate that either specifies the number of spikes emitted in some fixed time window (Hopfield, 1984; Amit & Tsodyks, 1991; Ermentrout, 1994) or corresponds to the probability of firing within some population (Wilson & Cowan, 1972; Gerstner, 1995).
Recently, however, a number of experiments on sensory neurons have shown that the precise timing of spikes may be significant in neuronal information processing, and this has led to renewed interest in the second class of neurodynamical models based on spiking neurons.

Neural Computation 12, 91–129 (2000) © 2000 Massachusetts Institute of Technology

For example, spike train recordings from H1, a motion-sensitive neuron in the fly visual system, exhibit variability of response to constant stimuli but a high degree of reproducibility for more natural dynamic stimuli (Strong, Koberle, van Steveninck, & Bialek, 1998). This reproducibility provides an enhanced capacity for carrying information (Rieke, Warland, & van Steveninck, 1996). Similar findings have been obtained in retinal ganglion cells of the tiger salamander and rabbit (Berry, Warland, & Meister, 1997). It is also well known that precise spike timing is essential for sound localization in the auditory system of animals such as the barn owl (Carr & Konishi, 1990). Recent advances in computational neuroscience support the notion that synapses are capable of supporting computations based on highly structured temporal codes (Mainen & Sejnowski, 1995; Gerstner, Kreiter, Markram, & Herz, 1997). Moreover, this mode of computation suggests how only one type of neuroarchitecture can support the processing of several very different sensory modalities (Hopfield, 1995). Such codes are undoubtedly important in sensory and cognitive processing. The simplest and most popular example of a spiking neuron is the so-called integrate-and-fire (IF) model (Keener, Hoppensteadt, & Rinzel, 1981; Tuckwell, 1988) and its generalizations (Gerstner, 1995). The state of an IF neuron changes discontinuously (resets) whenever it crosses some threshold and fires, so a complete description in terms of smooth differential equations is no longer possible. This type of model can be derived systematically from more detailed Hodgkin-Huxley equations describing the process of action potential generation (Abbott & Kepler, 1990; Kistler, Gerstner, & van Hemmen, 1997). There are major differences in the analytical treatments of firing-rate and spiking neural network models.
A common starting point for the former is to consider conditions under which destabilization of a homogeneous low-activity state occurs, leading to the formation of a state with inhomogeneous and/or time-dependent firing rates (see, for example, Wilson & Cowan, 1973; Ermentrout & Cowan, 1979; Atiya & Baldi, 1989; Li & Hopfield, 1989; Ermentrout, 1998a). On the other hand, most of the work on IF network dynamics has been concerned with the existence and stability of phase-locked solutions, in which the neurons fire at a fixed common frequency. Analysis of globally coupled IF oscillators in terms of return maps has shown that synchronization almost always occurs in the presence of instantaneous excitatory interactions (Mirollo & Strogatz, 1990). Subsequently this result has been extended to take into account the effects of inhomogeneities and various synaptic and axonal delays (Treves, 1993; Tsodyks, Mitkov, & Sompolinsky, 1993; Abbott & van Vreeswijk, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Ernst, Pawelzik, & Geisel, 1995; Hansel, Mato, & Meunier, 1995; Coombes & Lord, 1997; Bressloff, Coombes, & De Souza, 1997; Bressloff & Coombes, 1998, 1999). One finds that for small transmission delays, inhibitory (excitatory) synapses tend to synchronize if the rise time of a synapse is longer (shorter) than the duration of an action potential. Increasing the transmission delay leads to alternating bands of stability and instability of the synchronous state. Analogous results have been obtained in the spike response model (Gerstner, Ritz, & van Hemmen, 1993; Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996). Traveling waves of synchronized activity have been investigated in finite chains of IF oscillators modeling locomotion in simple vertebrates (Bressloff & Coombes, 1998), and in two-dimensional networks where spirals and target patterns are also observed (Chu, Milton, & Cowan, 1994; Kistler, Seitz, & van Hemmen, 1998). In this article we present a theory of spike train dynamics in IF networks that bridges the gap between firing-rate and spiking models. In particular, we show that synaptic interactions that are synchronizing in the weak coupling regime can become desynchronizing for sufficiently strong coupling. The resulting dynamics is compared with the behavior of a corresponding analog model in which the outputs of the neurons are taken to be mean firing rates. A basic result of our work is that for slow interactions, there is good agreement between the two models (on an appropriately defined timescale). On the other hand, discrepancies can arise for fast synapses where IF neurons may remain phase locked. We take as our starting point the nonlinear mapping of the neuronal firing times and show how a bifurcation analysis of this map can serve as a basis for understanding the extremely rich dynamical structure seen in networks of spiking neurons. Explicit criteria for the stability of phase-locked solutions are derived in both the weak and strong coupling regimes by considering the propagation of perturbations of the firing times throughout a network. In the strong coupling case, the analysis predicts regions in parameter space where instabilities in the firing times may cause transitions to non-phase-locked states. Numerical simulations are used to establish that in these regions, the full nonlinear firing map can support several distinct types of behavior.
For small networks, these include mode-locked bursting states, where packets of spikes may be generated, separated by periods of inactivity, and inhomogeneous states in which some of the oscillators become inactive. For large global inhibitory networks with vanishing mean perturbations of the firing times, we show that such bifurcations can be suppressed, in agreement with the mode-locking theorem of Gerstner et al. (1996). As a final example of the importance of strong coupling instabilities, we consider a ring of IF neurons with a Mexican hat interaction function. A discrete Turing-Hopf bifurcation of the firing times from a synchronous state to a state with periodic or quasi-periodic variations of the interspike intervals (ISIs) on closed orbits is shown to occur in the strong coupling regime. Further, it is shown how the separation of these orbits in phase space results in a spatially periodic pattern of mean firing rate across the network.

2 Integrate-and-Fire Model of a Spiking Neuron

The standard dynamical system for describing a neuron with spatially constant membrane potential V is based on conservation of electric charge, so
that

C dV/dt = −F + I_s + I,   (2.1)
where C is the cell capacitance, F is the membrane current, I_s the sum of synaptic currents entering the cell, and I describes any external currents. In the Hodgkin-Huxley model the membrane current arises mainly through the conduction of sodium and potassium ions through voltage-dependent channels in the membrane. The contribution from other ionic currents is assumed to obey Ohm's law. In fact F is considered to be a function of V and of three time-dependent and voltage-dependent conductance variables m, n, and h,

F(V, m, n, h) = g_L(V − V_L) + g_K n⁴(V − V_K) + g_Na h m³(V − V_Na),   (2.2)

where g_L, g_K, and g_Na are constants and V_L, V_K, and V_Na represent the constant membrane reversal potentials associated with the leakage, potassium, and sodium channels, respectively. The conductance variables m, n, and h take values between 0 and 1 and approach the asymptotic values m_∞(V), n_∞(V), and h_∞(V) with time constants τ_m(V), τ_n(V), and τ_h(V), respectively. The conductance-based Hodgkin-Huxley equations depend on four dynamical variables. A reduction of this number is often desirable in order to facilitate any mathematical analysis and to ease the computational burden for simulations of large networks. A systematic approach for doing this involves the use of equivalent potentials (Abbott & Kepler, 1990; Kepler, Abbott, & Marder, 1992). Following this approach, one can derive an IF model, which provides a caricature of the capacitative nature of cell membrane at the expense of a detailed model of the refractory process. The IF model satisfies equation 2.1 with F = F(V), together with the condition that whenever the neuron reaches a threshold h, it fires, and V is immediately reset to the resting potential V₀. The dynamics of the membrane conductances m, n, h is eliminated completely.
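As an illustration, the membrane current of equation 2.2 can be evaluated directly. The sketch below is not code from the article; the conductance and reversal-potential values are the standard Hodgkin-Huxley squid-axon constants, supplied here purely so the example runs.

```python
# Hodgkin-Huxley membrane current F(V, m, n, h) of equation 2.2.
# Parameter values are the standard HH squid-axon constants
# (conductances in mS/cm^2, potentials in mV), assumed for illustration.
G_L, G_K, G_NA = 0.3, 36.0, 120.0
V_L, V_K, V_NA = -54.4, -77.0, 50.0

def membrane_current(V, m, n, h):
    """Ionic current: leak + potassium (n^4) + sodium (h m^3) terms."""
    return (G_L * (V - V_L)
            + G_K * n**4 * (V - V_K)
            + G_NA * h * m**3 * (V - V_NA))

# With all gates closed (m = n = 0) and V at the leak reversal potential,
# every term vanishes.
print(membrane_current(-54.4, 0.0, 0.0, 1.0))  # prints 0.0
```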
Using a curve-fitting procedure, it is possible to approximate F(V) by a cubic, F(V) = a(V − V₀)(V − V₁)(V − h), where the constants a, V₀,₁, and h can be determined explicitly from the reduction of the underlying Hodgkin-Huxley equations. At a synapse, presynaptic firing results in the release of neurotransmitters that causes a change in the membrane conductance of the postsynaptic neuron. This postsynaptic current may be written

I_s = g_s s(V_s − V),   (2.3)

where V is the voltage of the postsynaptic neuron, V_s is the membrane reversal potential, and g_s is a constant. The variable s corresponds to the probability that a synaptic receptor channel is in an open conducting state.
This probability depends on the presence and concentration of neurotransmitter released by the presynaptic neuron. Under certain assumptions, it may be shown that a second-order Markov scheme for the synaptic channel kinetics describes the so-called alpha function response commonly used in synaptic modeling (Destexhe, Mainen, & Sejnowski, 1994): the synaptic response to an incoming spike at time t₀ is

s(t) = s₀ (t − t₀) e^{−α(t−t₀)},   t > t₀,   (2.4)

for constants s₀, α. Here α determines the inverse rise time for synaptic response. If we now combine equations 2.1 and 2.3 under the approximation F = F(V), we obtain an equation of the form (after setting C = 1)

dV/dt = −F(V) + I + g_s Σ_m s(t − t_m)[V_s − V].   (2.5)
We are assuming that the neuron receives a sequence of action potential spikes at times {t_m} and each spike generates a synaptic response according to equation 2.4. The neuron itself fires a spike whenever V(t) reaches the threshold h, and V is immediately reset to the resting potential, V₀. Denoting the firing times of the neuron by {T^m}, we can view the neuron as a device that maps {t_m} → {T^m}. We introduce two additional simplifications. First, we neglect shunting effects by setting V_s − V ≈ V_s; the possible nonlinear effects of shunting are discussed by Abbott (1991). Second, we reduce F(V) further by considering the linear approximation F(V) = β(V − V₀) (Abbott & Kepler, 1990). This linear IF model of a spiking neuron will be used in our subsequent analysis of network dynamics (see sections 3 and 4). However, before proceeding further, we briefly mention an alternative formulation of spiking neurons based on the so-called spike response model. The IF model assumes that it is the capacitative nature of the cell that, in conjunction with a simple thresholding process, dominates the production of spikes. The spike response (SR) model (Gerstner & van Hemmen, 1994; Gerstner, 1995; Gerstner et al., 1996) is a more general framework that can accommodate the apparent reduced excitability (or increased threshold) of a neuron after the emission of a spike. Spike reception and spike generation are combined with the use of two separate response functions. The first, h_s(t), describes the postsynaptic response to an incoming spike in a similar fashion to the IF model, whereas the second, h_r(t), mimics the effect of refractoriness. The refractory function h_r(t) can in principle be related to the detailed dynamics underlying the description of ionic channels. In practice, an idealized functional form is often used, although numerical fits to the Hodgkin-Huxley equations during the spiking process are also possible (Kistler et al., 1997).
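To make the firing-time map concrete, the following minimal sketch (illustrative code, not from the article) Euler-integrates the linear IF model dV/dt = −V + I with threshold h = 1 and reset to 0; for constant bias I = 2 and no synaptic input, the interspike interval should approach the free period T = ln[I/(I − 1)] = ln 2.

```python
import math

def if_firing_times(I=2.0, h=1.0, t_end=5.0, dt=1e-4):
    """Euler-integrate dV/dt = -V + I and record the firing times {T^m}.

    On reaching the threshold h, V is immediately reset to 0
    (the resting potential), mimicking the IF reset condition.
    """
    V, t, spikes = 0.0, 0.0, []
    while t < t_end:
        V += dt * (-V + I)
        t += dt
        if V >= h:
            spikes.append(t)
            V = 0.0
    return spikes

spikes = if_firing_times()
isis = [b - a for a, b in zip(spikes, spikes[1:])]
print(isis[-1], math.log(2.0))  # the ISI is close to the exact period ln 2
```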
In more detail, a sequence of incoming spikes {t_m} evokes a postsynaptic potential in the neuron via V^s(t) = Σ_m h_s(t − t_m), where h_s(t) incorporates details of axonal, synaptic, and dendritic processing. The total membrane potential of the neuron is taken to be V(t) = V^s(t) + V^r(t), where V^r(t) = Σ_m h_r(t − T^m) and {T^m} is the sequence of output firing times. Whenever V(t) reaches some threshold, the neuron fires a spike, and at the same time a negative contribution h_r is added to V so as to approximate the reduced excitability seen after firing. Since the reset condition of the IF model is equivalent to a sequence of current pulses, −h Σ_m δ(t − T^m), the linear IF model is a special case of the SR model. That is, if I = 0, F(V) = V, and we neglect shunting, then we can integrate equation 2.5 to obtain the equivalent formulation

V(t) = Σ_m h_r(t − T^m) + Σ_m h_s(t − t_m),   (2.6)

where

h_r(t) = −h e^{−t},   h_s(t) = ∫₀^t e^{−(t−t′)} s(t′) dt′,   t > 0,   (2.7)
and there is no reset. Most work to date on the analysis of the SR model has been based on the study of large networks using dynamical mean field theory (Gerstner & van Hemmen, 1994; Gerstner, 1995), which can be viewed as a generalization of the population averaging approach of Wilson and Cowan (1972). This is different from the approach we take here, which is principally concerned with the dynamics of finite networks of spiking neurons. However, we will link the two in section 4.3, where we discuss large, globally coupled networks and the mode-locking theorem of Gerstner et al. (1996).

3 Averaging Methods for Analyzing IF Network Dynamics

We now consider a network of IF neurons that interact via synapses by transmitting spike trains to one another. Let V_i(t) denote the state of the ith neuron at time t, i = 1, …, N, where N is the total number of neurons in the network. Suppose that the variables V_i(t) evolve according to the equations (cf. equation 2.5)

dV_i(t)/dt = −V_i(t)/τ_d + I_i + X_i(t),   (3.1)
where I_i is a constant external bias, τ_d is the membrane time constant, and X_i(t) is the total synaptic current into the cell. We shall fix the units of time by setting τ_d = 1; typical values for τ_d are in the range 5–20 msec. Equation 3.1 is supplemented by the reset condition that whenever V_i = h, neuron i fires and is reset to V_i = 0. We shall set the threshold h = 1. The synaptic current X_i(t) is generated by the arrival of spikes from neuron j and takes the explicit form

X_i(t) = ε Σ_{j=1}^{N} W_ij Σ_{m=−∞}^{∞} J(t − T_j^m),   (3.2)
where ε W_ij is the effective synaptic weight of the connection from the jth to the ith neuron, J(t) determines the time course of the postsynaptic response to a single spike with J(t) = 0 for t < 0, and T_j^m denotes the sequence of firing times of the jth neuron with m running through the integers. We have introduced a coupling constant ε characterizing the overall strength of synaptic interactions. From the discussion in section 2, a biologically motivated choice for J(t) is

J(t) = s(t − τ_a) H(t − τ_a),   s(t) = α² t e^{−αt},   (3.3)
where s(t) is an alpha function (with unit normalization), τ_a is a discrete axonal transmission delay, and H is the unit step function, H(t) = 1 if t > 0 and zero otherwise. The maximum synaptic response then occurs at a nonzero delay t = τ_a + α⁻¹. In a companion paper, the effects of dendritic interactions (both passive and active) will be analyzed by taking J(t) to be the Green's function of some cable equation (Bressloff, 1999). Although equations 3.1 and 3.2 are considerably simpler than those describing a network of Hodgkin-Huxley neurons, the analysis of the resulting dynamics is still a nontrivial problem since one has to handle both the presence of delays and discontinuities arising from reset. Most work to date has therefore been based on some form of approximation scheme. We shall describe two such schemes: phase reduction in the weak coupling regime and time averaging with slow synapses. In section 4 we shall then carry out a direct analysis of the IF network dynamics without recourse to such approximations.

3.1 Weak Coupling: Phase-Oscillator Models. We first show how in the weak coupling limit, equation 3.1 with reset can be reduced to a phase-oscillator equation. For concreteness, assume that I_i = I > 1 for all i = 1, …, N, so that in the absence of any coupling (ε = 0), each neuron acts as a regular oscillator by firing spikes with a constant period T = ln[I/(I − 1)]. Following van Vreeswijk et al. (1994), we introduce the phase variable ψ_i(t) according to

ψ_i(t) + t/T = Ψ(V_i(t)) ≡ (1/T) ∫₀^{V_i(t)} dx/F(x)  (mod 1),   (3.4)
where F(x) = I − x for the linear IF model. (One can also apply this transform to the nonlinear IF model discussed in section 2 with F(x) now given by a cubic.) Under such a transformation, equation 3.1 becomes

ψ̇_i(t) = R_T(ψ_i(t) + t/T) X_i(t),   (3.5)

with

R_T(θ) ≡ 1/(T F[Ψ⁻¹(θ)]) = (1 − e^{−T}) e^{Tθ}/T   (3.6)
for 0 ≤ θ < 1 and R_T(θ + k) = R_T(θ) for all integers k. In the absence of any coupling, the phase variable ψ_i(t) is constant in time, and all oscillators fire with their natural periods T. For weak coupling, each oscillator still approximately fires at the unperturbed rate, but now the phases slowly drift according to equation 3.5 since X_i(t) = O(ε). To first order in ε, we can take the firing times to be T_j^n = (n − ψ_j(t))T. Under this approximation, equations 3.2 and 3.5 lead to the following equation for the shifted phases θ_i(t) = ψ_i(t) + t/T:

dθ_i/dt = 1/T + ε Σ_{j=1}^{N} W_ij R_T(θ_i) P_T(θ_j) + O(ε²),   (3.7)
where P_T(θ + 1) = P_T(θ) for all θ and

P_T(θ) = Σ_{m=−∞}^{∞} J((θ + m)T),   0 ≤ θ < 1.   (3.8)
The summation over m in equation 3.8 is easily performed for J(t) satisfying equation 3.3 and gives P_T(θ) = Ĵ_T(θ − τ_a/T), where Ĵ_T(θ) is a periodic function of θ with

Ĵ_T(θ) = [α² e^{−αθT}/(1 − e^{−αT})] (θT + T e^{−αT}/(1 − e^{−αT})),   0 ≤ θ < 1.   (3.9)
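As a quick numerical sanity check (a sketch, not code from the article), the closed form 3.9 can be compared against a direct truncation of the sum in equation 3.8, using the alpha function of equation 3.3 with τ_a = 0:

```python
import math

def J(t, alpha):
    """Alpha-function kernel of equation 3.3 with zero axonal delay."""
    return alpha**2 * t * math.exp(-alpha * t) if t > 0 else 0.0

def P_direct(theta, T, alpha, M=200):
    """Equation 3.8: sum J((theta + m) T) over m; J vanishes for m < 0."""
    return sum(J((theta + m) * T, alpha) for m in range(M))

def P_closed(theta, T, alpha):
    """Equation 3.9: closed-form periodic kernel J_T(theta)."""
    x = math.exp(-alpha * T)
    return (alpha**2 * math.exp(-alpha * theta * T) / (1 - x)
            * (theta * T + T * x / (1 - x)))

T, alpha = math.log(2.0), 2.0  # period for I = 2, moderate rise time
for theta in (0.1, 0.5, 0.9):
    assert math.isclose(P_direct(theta, T, alpha), P_closed(theta, T, alpha),
                        rel_tol=1e-9)
```

The geometric decay e^{−αmT} of successive terms makes the truncated sum converge rapidly, so the two evaluations agree to high precision.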
The function R_T may be interpreted as the phase-response curve (PRC) of an individual IF oscillator, and P_T is the corresponding pulselike function that contains all details concerning the synaptic interactions. Note that phase equations of the form 3.7 can also be derived for a system of weakly coupled limit cycle oscillators based on more detailed biophysical models such as Hodgkin-Huxley neurons (Ermentrout & Kopell, 1984; Kuramoto, 1984; Hansel et al., 1995). As discussed in some detail by Hansel et al. (1995), linear IF oscillators have a type I PRC, which means that an instantaneous excitatory stimulus always advances its phase (R_T(θ) is positive for all θ). On the other hand, Hodgkin-Huxley neurons are of type II since a stimulus can either advance or retard the phase depending on the point on the cycle
at which the stimulus is applied (R_T(θ) takes on both positive and negative values over the domain θ ∈ [0, 1]). Equation 3.7 can be simplified further by averaging over the natural period T. This leads to an equation of the form

dθ_i/dt = ω + ε Σ_{j=1}^{N} W_ij H_T(θ_j − θ_i) + O(ε²),   (3.10)

with ω = 1/T and

H_T(φ) = ∫₀¹ R_T(θ) P_T(θ + φ) dθ.   (3.11)
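The averaged interaction function is straightforward to evaluate numerically. The sketch below (an illustration under assumed parameter values, not code from the article) builds H_T from equations 3.6, 3.9, and 3.11 with τ_a = 0 and confirms that H_T(φ) > 0 for all φ when the kernel J(t) is positive:

```python
import math

def R(theta, T):
    """Phase-response curve of equation 3.6."""
    return (1 - math.exp(-T)) * math.exp(T * theta) / T

def P(theta, T, alpha):
    """Periodic kernel of equation 3.9 (tau_a = 0), extended periodically."""
    theta %= 1.0
    x = math.exp(-alpha * T)
    return (alpha**2 * math.exp(-alpha * theta * T) / (1 - x)
            * (theta * T + T * x / (1 - x)))

def H(phi, T, alpha, n=2000):
    """Equation 3.11, evaluated by the midpoint rule on [0, 1]."""
    return sum(R((k + 0.5) / n, T) * P((k + 0.5) / n + phi, T, alpha)
               for k in range(n)) / n

T, alpha = math.log(2.0), 2.0
# Positive PRC times positive kernel: the integrand, hence H, is positive.
assert all(H(phi, T, alpha) > 0 for phi in (0.0, 0.25, 0.5, 0.75))
```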
For positive kernels J(t), the interaction function H_T(φ) is positive for all φ since IF oscillators are of type I. However, it is possible to mimic an effective type II interaction function by introducing a combination of excitatory and inhibitory interactions between the IF neurons in the definition of J(t) (Bressloff & Coombes, 1998a). To illustrate this idea, suppose that we decompose J(t) as

J(t) = J₊(t) − J₋(t),   J_±(t) = s_±(t − τ_±) H(t − τ_±),   s_±(t) = α_±² t e^{−α_± t},   (3.12)
where J_±(t) represent alpha functions for excitatory (+) and inhibitory (−) synapses. Assuming that the inhibitory pathways are delayed with respect to the excitatory ones (τ₋ > τ₊), the effective interaction function of an IF oscillator can be approximately sinusoidal. This is illustrated in Figure 1. It is not clear that such a combination of synapses is directly realized in cortical microcircuits. However, it is known that recurrent excitatory connections also stimulate inhibitory interneurons, and this might lead to an effective delay kernel of the form 3.12 at the population level. We define a phase-locked solution of equation 3.10 to be of the form θ_i(t) = φ_i + Ωt, where φ_i is a constant phase and Ω = 1/T + O(ε) is the collective frequency of the coupled oscillators. Substitution of this solution into equation 3.10 and working to O(ε) leads to the fixed-point equations

Ω = 1/T + ε Σ_{j=1}^{N} W_ij H_T(φ_j − φ_i).   (3.13)
After choosing some reference oscillator, the N equations 3.13 determine the collective period Ω and N − 1 relative phases, with the latter independent of ε. In order to analyze the local stability of a phase-locked solution Φ = (φ₁, …, φ_N), we linearize equation 3.10 by setting θ_i(t) = φ_i + Ωt + θ̃_i(t) and expanding to first order in θ̃_i:

dθ̃_i/dt = ε Σ_{j=1}^{N} Ĥ_ij(Φ) θ̃_j,   (3.14)
Figure 1: (Top) Delay kernel J(t) and (bottom) associated interaction function H_T(φ) for a combination of excitatory and delayed inhibitory synaptic interactions given by equation 3.12. Here α_± = 10, τ₊ = 0, τ₋ = 0.6, and T = ln 2.
where

Ĥ_ij(Φ) = H_ij(Φ) − δ_{i,j} Σ_{k=1}^{N} H_ik(Φ),   H_ij(Φ) = W_ij H_T′(φ_j − φ_i),   (3.15)

and H_T′(φ) = dH_T(φ)/dφ. One of the eigenvalues of the Jacobian Ĥ is always zero, and the corresponding eigenvector points in the direction of the flow, that is, (1, 1, …, 1). The phase-locked solution will be stable provided that all other eigenvalues have a negative real part (Ermentrout, 1985). The existence and stability of phase-locked solutions in the case of a symmetric pair of excitatory or inhibitory IF neurons with synaptic interactions have been studied in some detail by van Vreeswijk et al. (1994). In this case, N = 2 and W₁₁ = W₂₂ = 0, W₁₂ = W₂₁ = 1. Equation 3.13 then shows that the allowed solutions for the relative phase φ = φ₂ − φ₁ are given by the zeros of the odd interaction function H_T⁻(φ) = H_T(φ) − H_T(−φ). The underlying symmetry of the pair of neurons guarantees the existence of the in-phase or synchronous solution φ = 0 and the antiphase or antisynchronous solution φ = 1/2. Suppose that τ_a = 0. For small α, one finds that the in-phase and antiphase solutions are the only phase-locked solutions. However, as α is increased, a critical value α_c is reached, where
Figure 2: (a) Relative phase φ = φ₂ − φ₁ for a pair of IF oscillators with symmetric inhibitory coupling as a function of α with I = 2. In each case the antiphase state undergoes a bifurcation at a critical value α = α_c, where it becomes stable, and two additional unstable solutions φ, 1 − φ are created. The synchronous state remains stable for all α. (b) Relative phase φ versus discrete delay τ_a for a pair of pulse-coupled IF oscillators with I = 2 and α = 2.
there is a bifurcation of the antiphase solution, leading to the creation of two partially synchronized states φ and 1 − φ, with 0 < φ < 1/2 and φ → 0 as α → ∞ (see Figure 2a and van Vreeswijk et al., 1994). As shown by Coombes and Lord (1997), variation of τ_a for fixed α produces a checkerboard pattern of alternating stable and unstable solution branches (see Figure 2b) that can overlap to produce multistable solutions. Using equation 3.15, one obtains the following necessary and sufficient condition for local asymptotic stability of a phase-locked state in the weak coupling regime:

ε dH_T⁻(φ)/dφ > 0.   (3.16)
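Condition 3.16 can be probed numerically for the synchronous state φ = 0 of the symmetric pair. The sketch below (assumed parameter values, not code from the article) estimates dH_T⁻/dφ at φ = 0 by a central difference, with τ_a = 0; a negative slope means the condition holds only for ε < 0, that is, synchrony is stable for inhibitory coupling:

```python
import math

def R(theta, T):
    """PRC of equation 3.6."""
    return (1 - math.exp(-T)) * math.exp(T * theta) / T

def P(theta, T, alpha):
    """Periodic kernel of equation 3.9 (tau_a = 0), extended periodically."""
    theta %= 1.0
    x = math.exp(-alpha * T)
    return (alpha**2 * math.exp(-alpha * theta * T) / (1 - x)
            * (theta * T + T * x / (1 - x)))

def H(phi, T, alpha, n=4000):
    """Averaged interaction function, equation 3.11 (midpoint rule)."""
    return sum(R((k + 0.5) / n, T) * P((k + 0.5) / n + phi, T, alpha)
               for k in range(n)) / n

def H_odd_slope(T, alpha, d=1e-3):
    """Central-difference estimate of dH_T^-(phi)/dphi at phi = 0."""
    Hm = lambda phi: H(phi, T, alpha) - H(-phi, T, alpha)
    return (Hm(d) - Hm(-d)) / (2 * d)

T, alpha = math.log(2.0), 2.0  # I = 2 gives the free period T = ln 2
slope = H_odd_slope(T, alpha)
# slope < 0: condition 3.16 holds at phi = 0 only when eps < 0 (inhibition).
assert slope < 0
```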
One finds that for τ_a = 0 and inhibitory coupling (ε < 0), the synchronous state is stable for all 0 < α < ∞. Moreover, the antiphase solution φ = 1/2 is unstable for α < α_c, but it gains stability when α > α_c with the creation of two unstable partially synchronized states. The stability properties of
Figure 3: Stability of the synchronous solution φ = 0 as a function of α⁻¹ and τ_a for weak excitatory coupling with I = 2. Stable and unstable regions are denoted by s and u, respectively. The stability diagrams are periodic in τ_a with period ln[I/(I − 1)]. (This periodicity would be distorted in the strong coupling regime since the collective period of oscillation depends on the strength of coupling.)
all solutions are reversed in the case of excitatory coupling (ε > 0) so that the synchronous state is now unstable for all α. If the discrete delay τ_a is increased from zero, then alternating bands of stability and instability are created. An example of such regions is displayed in Figure 3. The corresponding stability diagram for the antisynchronous state may be obtained by shifting the delay according to τ_a → τ_a + T/2.

3.2 Slow Synapses: Analog Models. A second approximation scheme for analyzing IF network dynamics is to assume that the synaptic interactions are sufficiently slow so that the output of a neuron can be characterized reasonably well by a mean (time-averaged) firing rate (see, for example, Amit & Tsodyks, 1991; Ermentrout, 1994). Therefore, let us consider the case in which J(t) is given by the alpha function (see equation 3.3) with a synaptic rise time α⁻¹ significantly longer than all other timescales in the system. Suppose that the total synaptic current X_i(t) to neuron i is described by a slowly varying function of time t. If the neuronal dynamics is fast relative to α⁻¹, then the actual firing rate E_i(t) of a neuron will quickly relax to approximately its steady-state value, that is,

E_i(t) = f(X_i(t) + I_i),   (3.17)
where, from equation 3.1, the firing-rate function f is of the form

f(X) = {ln[X/(X − 1)]}⁻¹ H(X − 1).   (3.18)
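For reference, the steady-state rate function 3.18 is straightforward to implement. This short sketch (illustrative, not from the article) also checks that f(I) is the reciprocal of the free firing period T = ln[I/(I − 1)]:

```python
import math

def rate(X):
    """Steady-state firing rate, equation 3.18: f(X) = 1/ln[X/(X-1)] for X > 1,
    and zero below the threshold X = 1 (the step function H(X - 1))."""
    return 1.0 / math.log(X / (X - 1.0)) if X > 1.0 else 0.0

I = 2.0
T = math.log(I / (I - 1.0))          # free period of the IF oscillator
assert math.isclose(rate(I), 1.0 / T)  # f(I) = 1/T = 1/ln 2
assert rate(0.5) == 0.0                # below threshold the neuron is silent
```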
(Note that we have ignored the effects of the absolute refractory period, which is reasonable when the system is operating well below its optimal firing rate.) Equation 3.17 relates the dynamics of the firing rate directly to the stimulus dynamics X_i(t) through the steady-state response function. In effect, the use of a slowly varying kernel J(t) allows a consistent definition of the firing rate so that a dynamical network model can be based on the steady-state properties of an isolated neuron. Within the above approximation, we can derive a closed set of equations for the synaptic currents X_i(t). This is achieved by rewriting equation 3.2 as a pair of differential equations that generate the alpha function J(t) of equation 3.3, and replacing the output spike train of a neuron by the firing-rate function (see equation 3.18):

α⁻¹ dX_i(t)/dt + X_i(t) = Y_i(t),   (3.19)

α⁻¹ dY_i(t)/dt + Y_i(t) = ε Σ_{j=1}^{N} W_ij E_j(t − τ_a).   (3.20)
Here Y_i(t) is an auxiliary current. A common starting point for the analysis of analog models is to consider conditions under which destabilization of a homogeneous low-activity state occurs, leading to the formation of a state with inhomogeneous and/or time-dependent firing rates (see Ermentrout, 1998a, for a review). To simplify our analysis, we shall impose the following condition on the external bias I_i:

I = I_i + ε f(I) Σ_{j=1}^{N} W_ij   (3.21)

for some fixed I > 1. Then for sufficiently weak coupling, |ε| ≪ 1, the analog model has a single stable fixed point given by Y_i = X_i = ε f(I) Σ_j W_ij, such that the firing rates are kept at the value f(I). Suppose that we linearize equations 3.19 and 3.20 about this fixed point and substitute into the linearized equations a solution of the form (X_k(t), Y_k(t)) = e^{λt}(X_k, Y_k). This leads to the eigenvalue equation

λ_p^±/α = −1 ± √(ε f′(I) ν_p) e^{−λ_p^± τ_a/2},   p = 1, …, N,   (3.22)
whereºp , p = 1, . . . , N, are the eigenvalues of the weight matrix W. The xed point will be asymptotically stable if and only if Rel pt < 0 for all p. As |2 | is increased from zero, an instability may occur in at least two distinct ways. If a single real eigenvalue l crosses the origin in the complex l -plane, then a static bifurcation can occur, leading to the emergence of additional xedpoint solutions that correspond to inhomogeneous ring rates. For example,
104
Paul C. Bressloff and S. Coombes
if ε > 0 and W has real eigenvalues $\nu_1 > \nu_2 > \cdots > \nu_N$ with $\nu_1 > 0$, then a static bifurcation will occur at the critical coupling $\varepsilon_c$ for which $\lambda_1^{+} = 0$, that is, $1 = \varepsilon_c f'(I)\,\nu_1$. On the other hand, if a pair of complex conjugate eigenvalues $\lambda = \lambda_R \pm i\lambda_I$ crosses the imaginary axis ($\lambda_R = 0$) from left to right in the complex plane, then a Hopf bifurcation can occur, leading to the formation of periodic solutions, that is, time-dependent firing rates. For example, suppose that $\tau_a = 0$ and W has a pair of complex eigenvalues $(\nu, \nu^{*})$ with $\nu = re^{i\theta}$ and 0 < θ < π. Denote the corresponding solutions of equation 3.22 by $(\lambda^{\pm}, \lambda^{*\pm})$. Assuming that all other eigenvalues λ have negative real part, a Hopf bifurcation will occur at the critical coupling $\varepsilon_c$ for which $\mathrm{Re}\,\lambda^{+} = 0$, that is, $1 = \varepsilon_c f'(I)\, r \cos^{2}(\theta/2)$. An alternative way of generating oscillatory solutions is to have nonzero delays $\tau_a$ (Marcus & Westervelt, 1989). Note that the basic stability results are independent of the inverse rise time α. To illustrate, consider a symmetric pair of analog neurons with inhibitory coupling, ε < 0, and $W_{11} = W_{22} = 0$, $W_{12} = W_{21} = 1$. The weight matrix W has eigenvalues ±1 and eigenmodes (1, ±1). Let $x_i = X_i - \varepsilon f(I)$ so that the fixed-point equations become $x_1 = \varepsilon f(x_2 + I) - \varepsilon f(I)$ and $x_2 = \varepsilon f(x_1 + I) - \varepsilon f(I)$. One solution is the homogeneous fixed point $x_i = 0$. A full bifurcation diagram showing the location of the fixed points $x_1$ as a function of |ε| is shown in Figure 4 (top). The homogeneous fixed point $x_i = 0$ is stable for sufficiently small coupling |ε| but destabilizes at the critical point $|\varepsilon| = \varepsilon_c$ with $\varepsilon_c = 1/f'(I)$, where it coalesces with two unstable fixed points. For $|\varepsilon| > \varepsilon_c$, the unstable fixed point at the origin coexists with two stable fixed points (arising from saddle-node bifurcations).
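The critical coupling $\varepsilon_c = 1/f'(I)$ for the symmetric inhibitory pair can be checked numerically. The sketch below assumes the steady-state IF rate function $f(I) = 1/\ln[I/(I-1)]$ for I > 1 (consistent with f(I) = 1/T and T = ln[I/(I − 1)] quoted later in section 4.2; equation 3.18 itself is not reproduced in this excerpt). It estimates f'(I) by central differences and verifies that the linearization of the fixed-point equations about $x_i = 0$ loses stability at $|\varepsilon| = \varepsilon_c$:

```python
import numpy as np

def f(I):
    """Steady-state IF firing rate, f(I) = 1/ln[I/(I-1)] for I > 1 (cf. eq. 3.18)."""
    return 1.0 / np.log(I / (I - 1.0)) if I > 1.0 else 0.0

def fprime(I, h=1e-6):
    """Central-difference estimate of f'(I)."""
    return (f(I + h) - f(I - h)) / (2.0 * h)

def spectral_radius(eps, I=2.0):
    """Spectral radius of the Jacobian of x1 = eps[f(x2+I)-f(I)], x2 = eps[f(x1+I)-f(I)]
    at the homogeneous fixed point x1 = x2 = 0; its eigenvalues are +/- eps * f'(I)."""
    return abs(eps) * fprime(I)

I = 2.0
eps_c = 1.0 / fprime(I)      # predicted critical coupling; equals 2 ln(2)^2 ~ 0.96 at I = 2
print(eps_c)
print(spectral_radius(-0.5 * eps_c, I) < 1.0)   # homogeneous state stable below eps_c
print(spectral_radius(-1.5 * eps_c, I) > 1.0)   # and unstable beyond it
```

For I = 2 this gives $\varepsilon_c = 2\ln^2 2 \approx 0.96$, consistent with the saddle-node structure of Figure 4 (top).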
Just beyond the bifurcation point, the system jumps from a homogeneous state to a state in which one neuron is active with a constant firing rate $f(I - \varepsilon f(I))$ and the other is passive with zero firing rate. Note that a pair of excitatory analog neurons would bifurcate into another homogeneous state. For example, since $I_i$ has to decrease to keep the firing rate constant when ε > 0 (see equation 3.21), for strong enough coupling $I_i < 1$, so that the quiescent state is also a valid solution. A well-known result from the analysis of analog neurons is that an excitatory-inhibitory pair can undergo a Hopf bifurcation to an oscillatory solution (Atiya & Baldi, 1989). As an illustration, consider the case ε > 0, $\tau_a = 0$, and $W_{11} = W_{22} = 0$, $W_{12} = -2$, $W_{21} = 1$. Here neuron 2 inhibits neuron 1, whereas neuron 1 excites neuron 2. The eigenvalues of the weight matrix are $\pm i\sqrt{2}$, so that a Hopf bifurcation arises as ε is increased. (See the discussion below equation 3.22.) Let $x_1 = X_1 + 2\varepsilon f(I)$ and $x_2 = X_2 - \varepsilon f(I)$ so that there exists a fixed point at $x_i = 0$. The bifurcation diagram for the amplitude $x_1$ as a function of ε is shown in Figure 4 (bottom). It can be seen that the system undergoes a so-called subcritical Hopf bifurcation in which the homogeneous fixed point $x_i = 0$ becomes unstable and the system jumps to a coexisting stable limit cycle, signaling a solution with periodic firing rates.
Dynamics of Strongly Coupled Spiking Neurons
Figure 4: (Top) Bifurcation diagram for a pair of analog neurons with symmetric inhibitory coupling and external input I = 2. (Bottom) Subcritical Hopf bifurcation for a pair of analog oscillators without self-interactions. α = 0.5, I = 2, $W_{11} = W_{22} = 0$, $W_{21} = 1$, and $W_{12} = -2$. Open circles denote the amplitude of the limit cycle emerging from the Hopf bifurcation point ε ≈ 2.0, and s (u) stands for stable (unstable) dynamics.
This form of jump is often referred to as a hard excitation. The system also exhibits hysteresis. It is interesting to note that if the firing-rate function f(X) of equation 3.18 were taken to be the usual smooth sigmoid function, then, for the given weights, the Hopf bifurcation would be supercritical in the sense that the limit cycle would grow smoothly from the unstable fixed point so that there is no jump phenomenon or hysteresis. This is called a soft excitation. (See Atiya & Baldi, 1989.) Another important point is that if the analog model were described by a first-order equation rather than a second-order equation (as given by equations 3.19 and 3.20), then it would be necessary to introduce additional self-interactions ($W_{11}, W_{22} \neq 0$) in order for a Hopf bifurcation to occur. This is difficult to justify on neurobiological grounds. (A first-order equation would be obtained if the delay kernel were taken to be an exponential function rather than the alpha function 3.3.) We shall return to this issue in section 4.6.

4 Spike Train Dynamics in the Strong-Coupling Regime

Recall from section 3 that networks of weakly coupled IF neurons can have stable phase-locked solutions in which all the neurons have the same constant ISI. On the other hand, a strongly coupled network of analog neurons can bifurcate from a stable homogeneous state with identical time-independent firing rates to a state with inhomogeneous and/or time-varying firing rates. (Note that a homogeneous state of the analog model does not distinguish between different phase-locked solutions, since all phase information is lost during time averaging.) This suggests that there should exist some mechanism for destabilizing phase-locked solutions of the IF model in the strong-coupling regime. Here we identify such a mechanism based on a discrete Hopf bifurcation of the firing times. This induces inhomogeneous and periodic variations in the ISIs, which supports a variety of complex dynamics, including oscillator death (section 4.3), bursting (section 4.4), and pattern formation (section 4.5).

4.1 Phase Locking for Arbitrary Coupling. An important simplifying aspect of the dynamics of pulse-coupled (linear) IF neurons is that it is possible to derive phase-locking equations without the assumption of weak coupling (van Vreeswijk et al., 1994; Bressloff et al., 1997; Bressloff & Coombes, 1999). This can be achieved by solving equation 3.1 directly under the ansatz that the firing times are of the form $T_j^n = (n - \phi_j)T$ for some self-consistent period T and constant phases $\phi_j$. Integrating equation 3.1 over the interval $t \in (-T\phi_i, T - T\phi_i)$ and incorporating the reset condition by setting $U_i(-\phi_i T) = 0$ and $U_i(T - \phi_i T) = 1$ leads to the result

$$1 = (1 - e^{-T})\, I_i + \varepsilon \sum_{j=1}^{N} W_{ij}\, K_T(\phi_j - \phi_i), \tag{4.1}$$
where

$$K_T(\phi) = e^{-T} \int_0^T e^{t} \sum_{m=-\infty}^{\infty} J(t + (m + \phi)T)\, dt = \left( \frac{T^2 e^{-T}}{1 - e^{-T}} \right) H_T(\phi). \tag{4.2}$$
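The kernel $K_T(\phi)$ can be evaluated numerically by truncating the sum over m. A minimal sketch, assuming the alpha function takes the common form $J(t) = \alpha^2 t e^{-\alpha t}$ for t > 0 (equation 3.3 is not reproduced in this excerpt) and zero axonal delay:

```python
import numpy as np

def alpha_kernel(t, a):
    """Alpha function J(t) = a^2 t e^{-a t} for t > 0, zero otherwise (cf. eq. 3.3)."""
    tp = np.maximum(t, 0.0)
    return a * a * tp * np.exp(-a * tp) * (t > 0)

def K_T(phi, T, a, m_max=50, n=4000):
    """Trapezoidal quadrature of the first expression in eq. 4.2:
    K_T(phi) = e^{-T} int_0^T e^t sum_m J(t + (m + phi) T) dt."""
    t = np.linspace(0.0, T, n)
    s = np.zeros_like(t)
    for m in range(-m_max, m_max + 1):   # terms with negative argument vanish
        s += alpha_kernel(t + (m + phi) * T, a)
    integrand = np.exp(t) * s
    dt = t[1] - t[0]
    return np.exp(-T) * dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))

T = np.log(2.0)              # collective period for I = 2
print(K_T(0.0, T, 2.0))      # positive, since J >= 0
print(abs(K_T(0.3, T, 2.0) - K_T(1.3, T, 2.0)))   # K_T is periodic with period 1
```

Central differences of K_T in φ then give (up to the positive normalization in its definition) the derivative that enters the stability conditions of section 4.2.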
Equation 4.1 has an identical structure to that of equation 3.13 and involves the same phase interaction function (up to a multiplicative factor). Indeed, in the weak-coupling regime with $I_i = I$ for all i, equation 4.1 reduces to equation 3.13 with 1/T = Ω. This means that techniques previously developed for studying phase locking in weakly coupled oscillator networks can be extended to strongly coupled IF networks. For example, in the case of a ring of identical IF oscillators with symmetric coupling, group-theoretic methods can be used to classify all phase-locked solutions and construct bifurcation diagrams showing how new solution branches emerge via spontaneous symmetry breaking (Bressloff et al., 1997; Bressloff & Coombes, 1999). Furthermore, in the case of a finite chain of IF oscillators with a gradient of external inputs and sinusoidal-like phase interaction functions of the form shown in Figure 1, one can establish the existence of “traveling wave” solutions in which the phase varies monotonically along the chain (except in some narrow boundary layer); such systems can be used to model locomotion in simple vertebrates (Bressloff & Coombes, 1998a).

There are, however, a number of significant differences between phase-locking equations 3.13 and 4.1. First, equation 4.1 is exact, whereas equation 3.13 is valid only to O(ε), since it is derived under the assumption of weak coupling. Second, the collective period of oscillations T must be determined self-consistently in equation 4.1. Assume for the moment that T is given. Suppose that we choose i = 1 as a reference oscillator and subtract from equation 4.1 for i = 2, ..., N the corresponding equation for i = 1. This leads to N − 1 fixed-point equations for the N − 1 relative phases $\hat{\phi}_j = \phi_j - \phi_1$, j = 2, ..., N, which for $I_i = I$ take the form

$$0 = \sum_{j=1}^{N} W_{ij}\, K_T(\hat{\phi}_j - \hat{\phi}_i) - \sum_{j=1}^{N} W_{1j}\, K_T(\hat{\phi}_j),$$
where $\hat{\phi}_1 \equiv 0$. The resulting solutions for $\hat{\phi}_j$, j = 2, ..., N, are functions of T, which can then be substituted back into equation 4.1 for i = 1 to give an implicit self-consistency condition for T. The analysis is considerably simpler in the weak-coupling regime, since the relative phases are then functions of the natural period T. The third difference between weak and strong coupling is that although the equations for phase locking in the two models are formally the same, the underlying dynamical systems are distinct, thus leading to differences in their stability properties. For example, in the special case of a pair of IF neurons, a return map argument can be used to show that equation 3.16, with the natural period replaced by the collective period T, is a necessary condition for the stability of a phase-locked state for any ε (van Vreeswijk et al., 1994). However, as we shall establish below, it is no longer a sufficient condition for stability in the strong-coupling regime. (See also Chow, 1998.)

4.2 Desynchronization in the Strong Coupling Regime. In order to investigate the linear stability of phase-locked solutions of equation 4.1, we consider perturbations $\delta_i^n$ of the firing times (van Vreeswijk, 1996; Gerstner et al., 1996; Bressloff & Coombes, 1999). That is, we set $T_i^n = (n - \phi_i)T + \delta_i^n$ in equation 3.2 and then integrate equation 3.1 from $T_i^n$ to $T_i^{n+1}$ using the reset condition. This leads to a mapping of the firing times that can be expanded to first order in the perturbations (Bressloff & Coombes, 1999):
$$\left[ I_i - 1 + \varepsilon \sum_{j=1}^{N} W_{ij}\, P_T(\phi_j - \phi_i) \right] \left( \delta_i^{n+1} - \delta_i^n \right) = \varepsilon \sum_{j=1}^{N} W_{ij} \sum_{m=-\infty}^{\infty} G_m(\phi_j - \phi_i, T) \left[ \delta_j^{n-m} - \delta_i^n \right], \tag{4.3}$$
where $P_T$ satisfies equation 3.8 and

$$G_m(\phi, T) = \frac{e^{-T}}{T} \int_0^T e^{t}\, J'(t + (m + \phi)T)\, dt. \tag{4.4}$$
The linear delay-difference equation 4.3 has solutions of the form $\delta_j^n = e^{n\lambda}\, \delta_j$ with $0 \leq \mathrm{Im}(\lambda) < 2\pi$. The eigenvalues λ and eigenvectors $(\delta_1, \ldots, \delta_N)$ satisfy the equation

$$\left[ I_i - 1 + \varepsilon \sum_{j=1}^{N} W_{ij}\, P_T(\phi_j - \phi_i) \right] (e^{\lambda} - 1)\, \delta_i = \varepsilon \sum_{j=1}^{N} W_{ij}\, G_T(\phi_j - \phi_i, \lambda)\, \delta_j - \varepsilon \sum_{j=1}^{N} W_{ij}\, G_T(\phi_j - \phi_i, 0)\, \delta_i, \tag{4.5}$$
with

$$G_T(\phi, \lambda) = \sum_{m=-\infty}^{\infty} e^{-m\lambda}\, G_m(\phi, T). \tag{4.6}$$
One solution to equation 4.5 is λ = 0 with $\delta_i = \delta$ for all i = 1, ..., N. This reflects the invariance of the dynamics with respect to uniform phase shifts in the firing times, $T_i^n \to T_i^n + \delta$. Thus the condition for linear stability of a phase-locked state is that all remaining solutions λ of equation 4.5 have negative real part. This ensures that $\delta_j^n \to 0$ as $n \to \infty$ and, hence, that the phase-locked solution is asymptotically stable. By performing an expansion in powers of the coupling ε, it can be established that for sufficiently small coupling, equation 4.5 yields a stability condition that is equivalent to the one based on the Jacobian of the phase-averaged model, equations 3.14 and 3.15. (See Bressloff & Coombes, 1999.) We wish to determine whether this stability condition breaks down as |ε| is increased (with the sign of ε fixed). For concreteness, we shall focus on the stability of synchronous solutions. In order to ensure that such solutions exist, we impose the condition

$$I_i = I \left[ 1 - \varepsilon K_T(0) \sum_{j=1}^{N} W_{ij} \right], \qquad i = 1, \ldots, N, \tag{4.7}$$

for some fixed I > 1 and T = ln[I/(I − 1)]. The condition on $I_i$ for the IF model plays an analogous role to equation 3.21 for the analog model in section 3.2. The synchronous state $\phi_i = \phi$ for all i and arbitrary φ is then a solution of equation 4.1 with collective period T. By fixing T we can make
a more direct comparison with the analog model, whose fixed point at the origin will have a firing rate f(I) = 1/T in equation 3.18. Define the matrix $\hat{W}$ according to

$$\hat{W}_{ij} = W_{ij} - \delta_{i,j} \sum_{k=1}^{N} W_{ik}. \tag{4.8}$$
For sufficiently weak coupling, equations 3.14, 3.15, and 4.2 imply that the synchronous state is linearly stable if and only if

$$\varepsilon K_T'(0)\, \mathrm{Re}\, \hat{\nu}_p < 0, \qquad p = 1, \ldots, N - 1, \tag{4.9}$$

where $\hat{\nu}_p$, p = 1, ..., N, are the eigenvalues of the matrix $\hat{W}$, with $\hat{\nu}_N = 0$, and $K_T'(\theta) = T^{-1}\, dK_T(\theta)/d\theta$. In the particular case of a symmetric pair of neurons, equation 4.9 reduces to the condition $\varepsilon K_T'(0) > 0$, which is equivalent to equation 3.16 for φ = 0. (The case φ ≠ 0 is obtained in exactly the same fashion.) It follows that the stability diagram of Figure 3 displays the sign of $K_T'(0)$ as a function of α and $\tau_a$. For example, in the case of zero axonal delays ($\tau_a = 0$), Figure 3 implies that $K_T'(0) < 0$ for all finite α and T, and equation 4.9 reduces to the stability condition $\varepsilon\, \mathrm{Re}\, \hat{\nu}_p > 0$ for all p = 1, ..., N − 1.

The stability of the synchronous state for weak coupling implies that the zero eigenvalue is nondegenerate and all other eigenvalues λ satisfying equation 4.5 have negative real parts. As the coupling strength |ε| is increased from zero, destabilization of the synchronous state will be signaled by one or more complex conjugate pairs of eigenvalues crossing the imaginary axis in the complex λ-plane from left to right. (It is simple to establish that the synchronous state cannot destabilize due to a real eigenvalue's crossing the origin by setting λ = 0 in equation 4.5.) We proceed, therefore, by substituting $\phi_i = \phi$, i = 1, ..., N, and imposing the condition 4.7. Equation 4.5 reduces to the form

$$\left[ \left( e^{\lambda} - 1 \right) \left( I - 1 + \varepsilon C_i I K_T'(0) \right) + \varepsilon C_i K_T'(0) \right] \delta_i = \varepsilon\, G_T(0, \lambda) \sum_{j=1}^{N} W_{ij}\, \delta_j, \tag{4.10}$$

where $C_i = \sum_{j=1}^{N} W_{ij}$. We have used the identities $P_T(0) - I K_T(0) = I K_T'(0)$ and $G_T(0, 0) = K_T'(0)$. We then substitute λ = iβ into equation 4.10 and look for the smallest value of the coupling strength, $|\varepsilon| = \varepsilon_c$, for which a real solution β exists. This determines the Hopf bifurcation point at which the synchronous state becomes unstable. (In the special case β = π this reduces to a period-doubling bifurcation.) Note that when $C_i$ is independent of i, equation 4.10 can be simplified by choosing $\delta_i$ to lie along one of the eigenvectors of the weight matrix W.

We shall explore the nature of the Hopf instability through a number of specific examples. We shall also establish that for slow synapses, the strong-coupling behavior of the IF model is consistent with that of the mean firing-rate model of section 3.2 (on an appropriately defined timescale). In order to compare the two types of model, it will be useful to introduce a few definitions. For a network of N IF neurons labeled i = 1, ..., N, let us define the long-term firing rate of a neuron to be $a_i = \Delta_i^{-1}$, where $\Delta_i$ is the mean ISI,

$$\Delta_i = \lim_{M \to \infty} \frac{1}{2M + 1} \sum_{m=-M}^{M} \Delta_i^m, \tag{4.11}$$

with $\Delta_i^m = T_i^{m+1} - T_i^m$. A necessary requirement for good agreement between the analog and IF models is that the mean firing rates $a_i$ of the IF model match the corresponding firing rates of the analog model. In general, one would also expect to observe fluctuations in the ISIs of the IF network that are not resolved by the analog model. A Hopf bifurcation for maps (also known as a Neimark-Sacker bifurcation) is usually associated with the formation of periodic (or quasiperiodic) orbits, which in the current context corresponds to periodic variations in the ISIs. In cases where the analog model bifurcates to a state with time-independent firing rates, we expect the periodic orbits of the ISIs to be small relative to the firing period, at least for slow synapses; that is, $|\Delta_i^m - \Delta_i| / \Delta_i \ll 1$ for all i, m. (See the example of pattern formation in section 4.5, in particular, Figure 12.) On the other hand, in cases where an analog network destabilizes to a state with time-varying firing rates, we expect the periodic orbits of the ISIs in the IF model to be relatively large. Under such circumstances, we can introduce a sliding window of width 2P + 1 for the IF model in order to define a short-term average firing rate of the form $a_i(m) = \Delta_i(m)^{-1}$, where

$$\Delta_i(m) = \frac{1}{2P + 1} \sum_{p=-P}^{P} \Delta_i^{m+p}.$$
One can then see if there is a good match between the time-dependent firing rates of the two models for an appropriately chosen value of P. Actually, in practice it is simpler to compare variations in the firing rate of an analog network directly with variations in the ISIs of the corresponding IF network (see Figure 8).

4.3 Oscillator Death in a Globally Coupled Inhibitory Network. As our first example illustrating the Hopf instability identified in section 4.2, we consider a network of N identical IF oscillators with all-to-all inhibitory coupling and no self-interactions. This type of architecture has been used, for example, to model the reticular thalamic nucleus (RTN), which is thought to act as a pacemaker for synchronous spindle oscillations observed during sleep or anesthesia (Wang & Rinzel, 1992; Golomb & Rinzel, 1994). In the
biophysical model of RTN developed by Wang and Rinzel (1992), neural oscillations are sustained by postinhibitory rebound, a phenomenon whereby a neuron can fire after being hyperpolarized over an extended period and then released. This should be contrasted with our simple IF model, in which oscillations are sustained by an external bias. We shall show that for slow synapses, desynchronization via a Hopf bifurcation in the firing times occurs, leading to oscillator death in the strong-coupling regime, that is, certain cells suppress the activity of others. (See also the recent study of mutually inhibitory Hodgkin-Huxley neurons by White, Chow, Ritt, Soto-Treviño, & Kopell, 1998.)

The weight matrix W of a globally coupled inhibitory network with ε < 0 is given by $W_{ii} = 0$ and $W_{ij} = 1/(N - 1)$, so that $C_i = 1$ for all i = 1, ..., N. It follows that W has a nondegenerate eigenvalue $\nu_+ = 1$ with corresponding eigenvector (1, 1, ..., 1) and an (N − 1)-fold degenerate eigenvalue $\nu_- = -1/(N - 1)$. The eigenvalues of the matrix $\hat{W}$, equation 4.8, are $\hat{\nu}_+ = 0$ and $\hat{\nu}_- = -N/(N - 1)$, so that the synchronous state is stable in the weak-coupling regime provided that $\varepsilon K_T'(0) > 0$ (see equation 4.9). This is certainly true for zero axonal delays (see Figure 3). Note that the underlying permutation symmetry of the system means that there are additional phase-locked states in which the neurons are divided into two or more clusters of synchronized cells (Golomb & Rinzel, 1994). We shall investigate the stability of the synchronous state as |ε| is increased. Take $(\delta_1, \ldots, \delta_N)$ in equation 4.10 to be an eigenvector corresponding to either $\nu = \nu_+$ or $\nu = \nu_-$ and set λ = iβ. Equating real and imaginary parts then leads to the pair of equations

$$\varepsilon K_T'(0) + [\cos(\beta) - 1]\left[ I - 1 + \varepsilon I K_T'(0) \right] = \varepsilon\, C(\beta)\, \nu,$$
$$\sin(\beta)\left[ I - 1 + \varepsilon I K_T'(0) \right] = -\varepsilon\, S(\beta)\, \nu, \tag{4.12}$$

where $C(\beta) = \mathrm{Re}\, G_T(0, i\beta)$ and $S(\beta) = -\mathrm{Im}\, G_T(0, i\beta)$. In Figure 5 we plot the solutions ε of equation 4.12 as a function of the inverse rise time α for a pair of inhibitory neurons (N = 2) with T = ln 2 and $\tau_a = 0$. The solid (dashed) solution branch corresponds to the eigenvalue ν = −1 (ν = 1). The lower branch determines the critical coupling $|\varepsilon| = \varepsilon_c(\alpha)$ for a Hopf instability. An important result that emerges from this figure is that there exists a critical value $\alpha_0$ of the inverse rise time beyond which a Hopf bifurcation cannot occur. That is, if $\alpha > \alpha_0$, then the synchronous state remains stable for arbitrarily large inhibitory coupling. On the other hand, for $\alpha < \alpha_0$, destabilization of the synchronous state occurs as the coupling strength ε crosses the solid curve of Figure 5 from below. This signals the activation of the eigenmode $(\delta_1, \delta_2) = (1, -1)$, which suggests that an inhomogeneous state will be generated. Indeed, direct numerical simulation of the IF model shows that just after destabilization of the synchronous state ($|\varepsilon| > \varepsilon_c$), the system jumps to an inhomogeneous state consisting of one active neuron and one passive neuron. We conclude
Figure 5: Region of stability for a synchronized pair of identical IF neurons with inhibitory coupling and collective period T = ln 2. The solid curve $|\varepsilon| = \varepsilon_c(\alpha)$ denotes the boundary of the stability region, which is obtained by solving equation 4.12 for ν = −1 as a function of α with $\tau_a = 0$. Crossing the boundary from below signals excitation of the linear mode (1, −1), leading to a stable state in which one neuron becomes quiescent (oscillator death). For $\alpha > \alpha_0$, the synchronous state is stable for all ε. The dashed curve denotes the solution of equation 4.12 for ν = 1.
that for sufficiently slow synapses, a pair of IF neurons displays similar behavior to a corresponding pair of analog neurons in the strong-coupling regime (see section 3.2 and Figure 4 (top)). A rough order estimate for the critical inverse rise time $\alpha_0$ is $\alpha_0^{-1} \approx T$, where T is the collective period of oscillations before destabilization. Such a result is not surprising, since one would expect that for a reasonable match between the IF and (time-averaged) analog models to occur, the IF neurons should sample incoming spike trains over a sliding window that is not too small. The width of such a sliding window is determined by the rise time $\alpha^{-1}$.

The above result appears to be quite general. For example, suppose that we have a nonzero discrete delay $\tau_a$ such that the synchronous state is unstable and the antiphase state is stable for a given α. The latter state also destabilizes via a Hopf bifurcation in the firing times for small α, leading to oscillator death. The result extends to larger values of N, for which $\nu_- = -1/(N - 1)$ in equation 4.12. Here desynchronization of a phase-locked state leads to clusters of active and passive neurons. Interestingly, the critical value $\alpha_0$ decreases with N, indicating the greater tendency for phase locking to occur in large, globally coupled networks. We plot $\varepsilon_c$ as a function of α for a range of values of the network size N in Figure 6, where it can be seen that $\alpha_0(N) \to 0$ as $N \to \infty$. This implies that for large networks, the neurons remain synchronized for arbitrarily large coupling, even for slow synapses. Indeed, since real neurons typically have an inverse rise time α of the order 2 or larger, it follows that for realistic synaptic time constants, the network can never experience oscillator death for N larger than 5 or so. (There is one subtlety to be noted here. The dashed solution curve shown
Figure 6: Plot of the critical coupling $\varepsilon_c$ as a function of α for various network sizes N. The critical inverse rise time $\alpha_0(N)$ is seen to be a decreasing function of N, with $\alpha_0(N) \to 0$ as $N \to \infty$.
in Figure 5 corresponds to excitation of the uniform mode (1, 1, ..., 1) and is independent of N. As N increases, it is crossed by the curve $\varepsilon_c(\alpha)$, so that for a certain range of values of α, it is possible for the synchronous state to destabilize due to excitation of the uniform mode. The neurons in the new state will still be synchronized, but the spike trains will no longer have a simple periodic structure.)

The persistence of synchrony in large networks is consistent with the mode-locking theorem obtained by Gerstner et al. (1996) in their analysis of the spike response model. We shall briefly discuss their result within the context of the IF model. For the given globally coupled network, the linearized map of the firing times, equation 4.3, becomes

$$\left[ I - 1 + \varepsilon I K_T'(0) \right] \left( \delta_i^{n+1} - \delta_i^n \right) = \varepsilon \sum_{m=-\infty}^{\infty} G_m(0, T) \left[ \frac{1}{N - 1} \sum_{j \neq i} \delta_j^{n-m} - \delta_i^n \right]. \tag{4.13}$$

The major assumption underlying the analysis of Gerstner et al. (1996) is that for large N, the mean perturbation $\langle \delta^m \rangle = \sum_{j \neq i} \delta_j^m / (N - 1) \approx 0$ for all $m \geq 0$, say. Equation 4.13 then simplifies to the one-dimensional, first-order mapping

$$\delta_i^{n+1} = \frac{I - 1 + \varepsilon (I - 1) K_T'(0)}{I - 1 + \varepsilon I K_T'(0)}\, \delta_i^n \equiv \kappa_T\, \delta_i^n. \tag{4.14}$$
The synchronous (coherent) state will be stable if and only if $|\kappa_T| < 1$. It is instructive to establish explicitly that equation 4.14 is equivalent to the
corresponding result derived for the spike-response model (see equation 4.8 of Gerstner et al., 1996), which in our notation can be written as

$$\delta_i^{n+1} = \frac{\sum_{l \geq 1} h_r'(lT)\, \delta_i^{n+1-l}}{\sum_{l \geq 1} h_r'(lT) + \varepsilon \sum_{l \geq 1} h_s'(lT)}. \tag{4.15}$$

Here $h_r'(t) = dh_r(t)/dt$. First, using equations 2.7, 3.3, and 4.2, it can be shown that $\sum_{l \geq 1} h_r'(lT) = e^{-T}/(1 - e^{-T}) = I - 1$ and $\sum_{l \geq 1} h_s'(lT) = I K_T'(0)$. Setting $A(T) = I - 1 + \varepsilon I K_T'(0)$, we can then rewrite equation 4.15 as $A(T)\, \delta_i^{n+1} = \sum_{l \geq 1} e^{-lT}\, \delta_i^{n+1-l}$. It follows that $A(T)\, \delta_i^{n+1} - e^{-T} A(T)\, \delta_i^n = e^{-T} \delta_i^n$, which is identical to equation 4.14, since $e^{-T} = (I - 1)/I$. Equation 4.15 or equation 4.14 implies that the synchronous state is stable in the large-N limit provided that $\varepsilon \sum_{l \geq 1} h_s'(lT) > 0$, that is, $\varepsilon K_T'(0) > 0$ (cf. equation 3.16). This is the essence of the mode-locking theorem of Gerstner et al. (1996). Our analysis also highlights one of the simplifying features of the IF model with alpha function response kernels: the summations over l in equation 4.15 can be performed explicitly. It would be of interest, however, to investigate to what extent the results of this article carry over to more general choices of $h_r$ and $h_s$ in the construction of the spike-response model. We expect that the inclusion of details concerning refractoriness will not alter the basic picture, for oscillator death is also found in a pair of mutually inhibitory Hodgkin-Huxley neurons, as shown by White et al. (1998) in the case of synapses with first-order channel kinetics, and also confirmed numerically by ourselves in the case of second-order kinetics. Finally, in other related work, van Vreeswijk (1996) has shown how a network of IF neurons with global excitatory coupling can destabilize from an asynchronous state via a Hopf bifurcation in the firing times. However, this leads to small-amplitude quasiperiodic variations in the ISIs of the oscillators (the ISIs lie on relatively small, closed orbits).
This can be understood by looking at a corresponding network of excitatory analog neurons, which can bifurcate only to another homogeneous time-independent state.

4.4 Bursting in an Excitatory-Inhibitory Pair of IF Neurons. Now let us consider an excitatory-inhibitory pair of IF neurons with ε > 0, $W_{11} = W_{22} = 0$, $W_{12} = -2$, $W_{21} = 1$, and zero axonal delays, $\tau_a = 0$. The analog version of this network, studied in section 3.2, was shown to exhibit periodic variations in the mean firing rate in the strong-coupling regime. It can be established from equation 4.9 that the synchronous state is stable for sufficiently weak coupling. In order to investigate Hopf instabilities in the strong-coupling regime, we set λ = iβ in equation 4.10 and solve for (ε, β) as a function of α. For each α, the smallest solution $\varepsilon = \varepsilon_c$ determines the Hopf bifurcation point, which leads to a stability diagram of the form shown in Figure 7. In contrast to the previous example, a critical coupling for a Hopf bifurcation in the firing times exists for all α. Direct numerical solution of the system shows that beyond the Hopf bifurcation point, the
Figure 7: Region of stability for an excitatory-inhibitory pair of IF neurons without self-interactions, ε > 0 and $W_{11} = W_{22} = 0$, $W_{21} = 1$, $W_{12} = -2$. The collective period of synchronized oscillations is taken to be T = ln 2. Solid and dashed curves denote solutions of equation 4.10 for ε with λ = iβ, β real. Crossing the solid boundary of the stability region from below signals destabilization of the synchronous state, leading to the formation of a periodic bursting state.
system jumps to a state in which the two neurons exhibit periodic bursting patterns (see Figure 8). This can be understood in terms of mode locking associated with periodic variations of the ISIs on closed attracting orbits (see Figure 9). Suppose that the kth oscillator has a periodic solution of length $M_k$ so that $\Delta_k^{n + p M_k} = \Delta_k^n$ for all integers p. If $\Delta_k^1 \gg \Delta_k^n$ for all n = 2, ..., $M_k$, say, then the resulting spike train exhibits bursting, with the interburst interval equal to $\Delta_k^1$ and the number of spikes per burst equal to $M_k$. Although both oscillators have different interburst intervals ($\Delta_1^1 \neq \Delta_2^1$) and numbers of spikes per burst ($M_1 \neq M_2$), their spike trains have the same total period, that is, $\sum_{n=1}^{M_1} \Delta_1^n = \sum_{n=1}^{M_2} \Delta_2^n$. Due to the periodicity of the activity, the ISIs fall on only a number of discrete points on the orbit, which is radically different from the case found in van Vreeswijk (1996), where the whole curve is visited due to the quasiperiodic nature of the firing. (Quasiperiodicity is also observed in the pattern formation example presented in section 4.5.) The variation of the ISIs $\Delta_i^n$ with n is compared directly with the corresponding variation of the firing rates of the analog model in Figure 8. Good agreement can be seen between the two models in the case of small α (see Figure 8a), but discrepancies between the two arise as α increases (see Figure 8b). As in the case of oscillator death, the occurrence of bursting through a Hopf bifurcation in the firing times appears to be quite general. For example, suppose that we have a nonzero discrete delay $\tau_a$ such that the synchronous state is unstable and the antiphase state is stable for a given α. We find that the latter state also destabilizes via a Hopf bifurcation in the firing times for small α, leading to a periodic bursting state.
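The jump to bursting can be explored by direct simulation. The sketch below is a forward-Euler integrator written under the assumption that equation 3.1 has the standard leaky IF form $dU_i/dt = I_i - U_i + X_i(t)$ with threshold 1 and reset 0 (the precise form is not reproduced in this excerpt); the alpha-function synapses are realized through the equivalent pair of first-order ODEs (cf. equations 3.19-3.20, with the firing-rate function replaced by the actual spike train):

```python
import numpy as np

def simulate_if_network(W, eps, alpha, I_ext, t_max=10.0, dt=1e-4):
    """Forward-Euler integration of N leaky IF neurons with alpha-function synapses.
    A spike of neuron j kicks the auxiliary current Y_k by alpha*eps*W[k, j], which
    generates X_k(t) = eps * W[k, j] * alpha^2 (t - T_j) exp(-alpha (t - T_j))."""
    W = np.asarray(W, dtype=float)
    N = W.shape[0]
    U = np.zeros(N)                     # membrane potentials (threshold 1, reset 0)
    X = np.zeros(N)                     # synaptic currents
    Y = np.zeros(N)                     # auxiliary currents
    spikes = [[] for _ in range(N)]
    for n in range(int(t_max / dt)):
        U += dt * (I_ext - U + X)
        X += dt * alpha * (Y - X)
        Y += dt * alpha * (-Y)
        for i in np.flatnonzero(U >= 1.0):
            U[i] = 0.0
            spikes[i].append(n * dt)
            Y += alpha * eps * W[:, i]  # deliver the spike to all targets
    return [np.array(s) for s in spikes]

# Uncoupled check: with I = 2 each neuron fires regularly with ISI close to ln 2
spk = simulate_if_network([[0.0, -2.0], [1.0, 0.0]], 0.0, 2.0, np.full(2, 2.0))
print(np.diff(spk[0]).mean())
```

Raising eps well beyond the stability boundary of Figure 7 (with the bias corrected as in equation 4.7) should then produce Figure 8-style bursting; the uncoupled check above only validates the integrator.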
Figure 8: Spike train dynamics for a pair of IF neurons with both excitatory and inhibitory coupling, corresponding to points A and B in Figure 7, respectively. The firing times of the two oscillators are represented with lines of different heights (marked with a +). Smooth curves represent the variation of the firing rate in the analog version of the model.
Figure 9: A plot of the ISIs $(\Delta_k^{n-1}, \Delta_k^n)$ of the spike trains shown in Figure 8a, illustrating how they lie on closed periodic orbits. Points on an orbit are connected by lines. Each triangular region is associated with only one of the neurons, highlighting the difference in interburst intervals (see also Figure 8a). The inset is a blowup of orbit points for one of the neurons within a burst.
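Return maps of this kind are easy to generate from a list of spike times. A small helper (hypothetical, not from the paper) that turns a spike train into the $(\Delta^{n-1}, \Delta^n)$ pairs plotted above:

```python
import numpy as np

def isi_return_map(spike_times):
    """Map a sorted array of spike times to consecutive-ISI pairs (ISI[n-1], ISI[n])."""
    isi = np.diff(np.asarray(spike_times, dtype=float))
    return np.column_stack((isi[:-1], isi[1:]))

# A spike train with a period-2 ISI pattern (0.1, 0.5, 0.1, 0.5, ...) visits exactly
# two points of the return map, as in the mode-locked bursting regime of Figure 9.
times = np.cumsum([0.0, 0.1, 0.5, 0.1, 0.5, 0.1, 0.5])
pts = isi_return_map(times)
print(set(map(tuple, np.round(pts, 6))))
```

A quasiperiodic train (as in van Vreeswijk, 1996) would instead fill out a closed curve of such points.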
Figure 10: (a) Example of a Mexican hat weight distribution W(k) and (b) its eigenvalue distribution ν(p).
Our analysis of a pair of excitatory-inhibitory IF neurons suggests that bursting can arise at a network level due to strong synaptic coupling, without the need for additional slow ionic currents (see Wang & Rinzel, 1995). An alternative mechanism for generating bursts in networks of interacting neurons is based on electrical (diffusive) coupling between cells. This has been studied in small networks of leaky-integrator neurons with postinhibitory rebound (Mulloney, Perkel, & Budelli, 1981) and in large networks of Hodgkin-Huxley neurons (Han, Kurrer, & Kuramoto, 1995). Our analysis also shows that bursting can occur in the absence of the self-interaction terms considered in our previous work (Bressloff & Coombes, 1998b); it is hard to justify the presence of such terms on neurophysiological grounds. (See section 4.6.)

4.5 Pattern Formation in a One-Dimensional Network. As our final example of strong-coupling Hopf instabilities in IF networks, we consider a ring of N = 2M + 1 neurons evolving according to equations 3.1 and 3.2 with ε > 0, $\tau_a = 0$, and a weight matrix $W_{ij} = W(i - j)$, where

$$W(k) = A_1 \exp\left(-k^2 / (2\sigma_1^2)\right) - A_2 \exp\left(-k^2 / (2\sigma_2^2)\right), \qquad 0 < k \leq M, \tag{4.16}$$

with W(k) = 0 for |k| > M and W(0) = 0 (no self-interactions). We choose $A_1 > A_2$ and $\sigma_1 < \sigma_2$; W(k) then represents a lattice version of a Mexican hat interaction function in which there is competition between short-range excitation and long-range inhibition. A continuum version of W(k) is plotted in Figure 10a for illustration. This type of architecture is well known to support spatial pattern formation in analog neural systems (Ermentrout & Cowan, 1979; Ermentrout, 1998a). We shall show that similar behavior occurs in the IF model.

For the given weight matrix W and homogeneous external inputs $I_i = I$, i = 1, ..., N, the network is translationally invariant. This means that one class of solutions to the phase-locking equations 4.1 consists of “traveling waves” of the form $\phi_k = qk/N$ for integers q = 0, ..., N − 1. This type of phase-locked solution also arises in studies of the spike-response model
118
Paul C. Bressloff and S. Coombes
(Kistler et al., 1998) and in studies of weakly coupled phase oscillators (Crook, Ermentrout, Vanier, & Bower, 1997; Bressloff & Coombes, 1997). We shall concentrate on the synchronous solution q = 0. For convenience, we shall assume that there is an equal balance between excitation and inhibition by fixing A₁, A₂ in equation 4.16 so that C_i ≡ Σ_{j=1}^{N} W_ij = Σ_{k=−M}^{M} W(k) = 0 for all i. (This condition is not necessary for pattern formation to occur.) The collective period of synchronous oscillations is then T = ln[I/(I − 1)] (see equation 4.7). In order to investigate the linear stability of the synchronous state, we need to determine the eigenvalues of the weight matrix W. These are given by

ν(p) = 2 Σ_{k=1}^{M} W(k) cos(pk),  p = 0, 2π/N, ..., 2π(N − 1)/N,  (4.17)

with ν(p) > ν(0) = 0 for all p ≠ 0. The corresponding eigenvectors are δ_k = e^{ikp}. Equation 4.9 then shows that the synchronous state is stable in the weak coupling regime, since K′_T(0) < 0 for all α (when τ_a = 0) and ν̂(p) = ν(p). Let ±p_max be the wave numbers at which ν(p) attains its maxima. (This is illustrated for the continuum Mexican hat function in Figure 10b.) Take δ_k = e^{ikp} and C_i = 0, and set λ = iβ in equation 4.10. Equating real and imaginary parts leads to the pair of equations

[cos(β) − 1][I − 1] = ε̂ C(β),
sin(β)[I − 1] = −ε̂ S(β),  (4.18)

where ε̂ = ε ν_max and ν_max = ν(p_max). We find that there exist nontrivial solutions of equation 4.18 for all values of α (see Figure 11). It follows that a Hopf bifurcation occurs due to activation of the eigenmodes δ_k = e^{±i p_max k}. Since p_max ≠ 0, the instability will involve spatially periodic perturbations of the firing times, and this leads to the formation of regular activity patterns across the network, as illustrated in Figure 12. When the network undergoes a Hopf bifurcation, it leads to the creation of periodic orbits for the ISIs. Define the long-term average firing rate according to a_k = ⟨Δ_k⟩⁻¹, where ⟨Δ_k⟩ is defined in equation 4.11. It follows that spatial (k-dependent) variations in the firing rates a_k can occur if there is a corresponding spatial distribution of the orbits in phase-space. Figure 12 shows that the orbits are indeed separated from each other in phase-space (although they remain small relative to the natural timescales of the system). The resulting long-term average behavior of the system is characterized by spatially regular patterns of output activity, as shown in the inset of Figure 12. These patterns are found to be consistent with those observed in the corresponding analog version of the model presented in section 3.2, and such agreement holds over a wide range of values of α that includes both fast and slow synapses. However, the IF model has additional fine structure
Figure 11: Region of stability for a ring of N = 51 IF neurons having the weight distribution W(k) of equation 4.16 with σ₁ = 2.1, σ₂ = 3.5, A₁ = 1.77, and Σ_k W(k) = 0. The collective period of synchronized oscillations is taken to be T = ln 1.5. Solid and dashed curves denote solutions of equation 4.18 for ε̂. Crossing the solid boundary of the stability region from below signals destabilization of the synchronous state, leading to the formation of spatially periodic patterns of activity, as shown in Figure 12.
associated with the dynamics on the periodic orbits of Figure 12, which is not resolved by the analog model. This is illustrated in the insets of Figure 13, where we plot temporal variations in the inverse ISI (or instantaneous firing rate) of a single oscillator for two different values of α. It can be seen that there are deterministic fluctuations in the mean firing rate. In order to characterize the size of these fluctuations, we define a deterministic coefficient of variation CV(k) for a neuron k according to

CV(k) = sqrt(⟨(Δ_k − ⟨Δ_k⟩)²⟩) / ⟨Δ_k⟩,  (4.19)

with averages ⟨·⟩ defined by equation 4.12 for some sliding window of width P. The CV for a single neuron is plotted as a function of α in Figure 13. This shows that the relative size of the deterministic fluctuations in the mean firing rate is an increasing function of α. For slow synapses (α → 0), the CV is very small, indicating an excellent match between the IF and analog models. However, the fluctuations become more significant when the synapses are fast. For IF networks with a type of disordered Mexican hat interaction, instantaneous synapses, and stochastic external input, it is also known that a further component of the CV arises from the amplification of correlated fluctuations (Usher, Stemmler, Koch, & Olami, 1994). Interestingly, Softky and Koch (1992) have shown that stochastic input alone cannot account for the high ISI variability exhibited by cortical neurons in awake monkeys.
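The deterministic CV of equation 4.19 is simple to compute from an ISI sequence. The following is our own minimal sketch (a synthetic ISI sequence stands in for the simulated orbits; the window width P matches the sliding-window average of equation 4.12):

```python
import math

# Sketch: deterministic coefficient of variation of equation 4.19,
# CV = sqrt(<(D - <D>)^2>) / <D>, with averages over a sliding
# window of width P. The ISI sequences here are synthetic.
def deterministic_cv(isis, P):
    window = isis[-P:]                     # sliding window of width P
    mean = sum(window) / len(window)
    var = sum((d - mean) ** 2 for d in window) / len(window)
    return math.sqrt(var) / mean

isis = [1.0, 3.0] * 50                     # ISIs hopping between two values
cv = deterministic_cv(isis, P=100)
# mean = 2, variance = 1, so CV = 0.5 for this period-two sequence

constant_cv = deterministic_cv([0.7] * 100, P=100)   # fixed point: CV ~ 0
```

A strictly periodic firing pattern (all ISIs equal) gives CV = 0, so any nonzero value measures the purely deterministic fluctuations discussed above.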
Figure 12: Separation of the ISI orbits in phase-space for a ring of N = 51 IF neurons corresponding to point A in Figure 11 (α = 2, ε = 0.4). The attractor of the embedded ISI with coordinates (Δ_k^{n−1}, Δ_k^n) is shown for all N neurons. (Inset) Regular spatial variations in the long-term average firing rate a_k (dashed curve) are in good agreement with the corresponding activity pattern (solid curve) found in the analog version of the network constructed in section 3.2, with a_k now determined by the mean firing-rate function (see equation 3.18).
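The linear-stability computation behind Figures 11 and 12 rests on the weight spectrum of equation 4.17. A minimal numerical sketch (our own illustration; the parameter values are arbitrary, with A₂ chosen to enforce the balance condition Σ_k W(k) = 0):

```python
import math

# Sketch: Mexican hat weights of equation 4.16 and eigenvalues of
# equation 4.17 for a ring of N = 2M + 1 IF neurons.
M = 25
N = 2 * M + 1
A1, s1, s2 = 1.0, 2.1, 3.5             # illustrative values, sigma1 < sigma2

g1 = sum(math.exp(-k**2 / (2 * s1**2)) for k in range(1, M + 1))
g2 = sum(math.exp(-k**2 / (2 * s2**2)) for k in range(1, M + 1))
A2 = A1 * g1 / g2                      # balance: sum_k W(k) = 0, so nu(0) = 0

def W(k):
    if k == 0 or abs(k) > M:           # no self-interactions, finite range
        return 0.0
    return (A1 * math.exp(-k**2 / (2 * s1**2))
            - A2 * math.exp(-k**2 / (2 * s2**2)))

# nu(p) = 2 * sum_{k=1}^{M} W(k) cos(p k),  p = 2*pi*m/N, m = 0..N-1
nu = [2 * sum(W(k) * math.cos(p * k) for k in range(1, M + 1))
      for p in (2 * math.pi * m / N for m in range(N))]

p_max = max(range(N), key=lambda m: nu[m])
# nu(0) vanishes by construction; the spectrum peaks at a nonzero wave
# number, which sets the spatial period of the emerging activity pattern.
```

The index `p_max` plays the role of the wave number p_max in the text: the Hopf instability activates the eigenmodes e^{±i p_max k}.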
One potential application of the above analysis is to the study of an IF model of orientation selectivity in simple cells of cat visual cortex. Such a model has been investigated numerically by Somers, Nelson, and Sur (1995), who consider a ring of IF neurons with k labeling the orientation preference θ = πk/N of a given neuron. The neurons interact via synapses with spike-evoked conductance changes described by alpha functions and with a pattern of connectivity consisting of short-range excitation and long-range inhibition analogous to the Mexican hat function of Figure 10. Numerical investigation showed how the recurrent interactions lead to a sharpening of orientation tuning curves. Note that the model of Somers et al. (1995) also included shunting terms and time-dependent thresholds; it should be possible to extend the analysis of our article to account for such features.

4.6 Bursting and Oscillator Death Revisited. In section 3.2 we pointed out that in the case of a first-order analog model, an excitatory-inhibitory pair of neurons without self-interactions cannot undergo a Hopf bifurcation. This suggests that the bursting behavior studied in section 4.4 might
Figure 13: Plot of the coefficient of variation CV for a single IF oscillator as a function of the inverse rise time α. All other parameter values are as in Figure 12. For each α, the oscillator is chosen to be the one with the maximum mean firing rate. (Insets) Time series showing the variation of the inverse ISI (Δⁿ)⁻¹ with n for two particular values of α.
be sensitive to the rise time of the synaptic response even in the case of slow synaptic decay. In order to explore the distinct roles played by the rates of rise and fall of synaptic interactions, we generalize the alpha function of equation 3.3 by considering the difference of exponentials

J(t) = [α₁α₂/(α₂ − α₁)] (e^{−α₁ t} − e^{−α₂ t}) H(t).  (4.20)
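Setting J′(t) = 0 for α₁ ≠ α₂ gives the time to peak t_p = [α₁ − α₂]⁻¹ ln(α₁/α₂). A quick numerical sanity check of equation 4.20 (our own sketch; the rate values are arbitrary):

```python
import math

# Sketch: difference-of-exponentials synaptic kernel of equation 4.20
# (H is the Heaviside step) and its analytic time to peak.
def J(t, a1, a2):
    if t < 0:                                  # H(t) = 0 for t < 0
        return 0.0
    return (a1 * a2 / (a2 - a1)) * (math.exp(-a1 * t) - math.exp(-a2 * t))

a1, a2 = 2.0, 0.2                              # fast rise, slow decay
tp = math.log(a1 / a2) / (a1 - a2)             # analytic time to peak

# A brute-force search on a fine grid should agree with the formula.
grid = [i * 1e-3 for i in range(20000)]
t_star = max(grid, key=lambda t: J(t, a1, a2))
```

The same check works with the roles of α₁ and α₂ reversed, since J(t) is symmetric under exchanging the two rates.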
In the limit α₂ → α₁ → α, this reduces to the alpha function (see equation 3.3). If α₁ > α₂, then the asymptotic behavior of a synapse is approximately given by e^{−α₂ t}, and the time to peak is t_p = [α₁ − α₂]⁻¹ ln(α₁/α₂). The roles of α₁ and α₂ are reversed if α₂ > α₁. In Figure 14a, we plot the critical coupling for a discrete Hopf bifurcation in the firing times for the excitatory-inhibitory pair previously analyzed in section 4.4. Now, however, the delay distribution J(t) is taken to be given by equation 4.20, with α₂ = 0.2 and α₁ a free parameter. Since the synaptic decay time is relatively slow, we expect there to be a good match between the IF and analog models. This is indeed found to be the case, as illustrated in Figure 14a, which shows that the critical coupling for bursting in the analog and IF models is in very close agreement. More surprising, it is seen that bursting persists even for large values of α₁, that is, for fast synaptic rise times. (Bursting still occurs even when α₁ is of order 100 or more.) One might still expect the frequency of the oscillations to approach zero
Figure 14: Critical coupling for a Hopf bifurcation in the firing times as a function of the rise time α₁ for fixed decay time α₂ = 0.2. (a) Excitatory-inhibitory pair with ε > 0, W₁₁ = W₂₂ = 0, W₂₁ = 1, and W₁₂ = −2. Bursting can occur even for fast rise times. The corresponding critical coupling for the analog model is shown as a dashed curve. (b) Inhibitory pair of IF neurons with ε < 0, W₁₁ = W₂₂ = 0, W₂₁ = W₁₂ = 1. The critical coupling is only weakly dependent on α₁.
in the limit α₁ → ∞. However, this is not found to be the case, which can be understood from the fact that the Hopf bifurcations are subcritical (cf. Figure 4b). Finally, in Figure 14b we plot the critical coupling for oscillator death in an inhibitory pair of IF neurons. There is only a weak dependence on α₁, which is consistent with the analog model.

5 Discussion

We have presented a dynamical theory of spiking neurons that bridges the gap between weakly coupled phase oscillator models and strongly coupled firing-rate (analog) models. The basic results of our article can be summarized as follows:

1. A fundamental mechanism for the desynchronization of IF neurons with strong synaptic coupling is a discrete Hopf bifurcation in the firing times. This typically leads to non-phase-locked behavior characterized by periodic or quasiperiodic variations of the ISIs on closed orbits. In the case of slow synapses, the coarse-grained dynamics can
be represented very well by a corresponding analog model in which the output of a neuron is taken to be a mean firing rate.

2. A homogeneous network of N synchronized IF neurons with global inhibitory coupling can destabilize in the strong coupling regime to a state in which some of the neurons become inactive (oscillator death). Thus the stability criterion obtained by van Vreeswijk et al. (1994) for N = 2 is valid only in the weak coupling regime. There exists a critical value of the inverse rise time, α₀(N), such that synchrony persists for arbitrarily large coupling when α > α₀(N) (fast synapses). Moreover, α₀(N) → 0 as N → ∞. This establishes that the stability condition specified in the mode-locking theorem of Gerstner et al. (1996) is valid in the thermodynamic limit but no longer holds for finite networks with slow synapses.

3. Rhythmic bursting patterns can occur in asymmetric IF networks with a mixture of inhibitory and excitatory synaptic coupling. The ISIs evolve on periodic orbits that are large relative to the mean ISI within a burst. There is good agreement between the temporal variation of the ISIs in the IF model and the variation of the firing rate in the corresponding analog model.

4. A network of IF neurons with long-range interactions can desynchronize to a state with a regular spatial variation in the mean firing rate that is in good agreement with the corresponding analog model. The ISIs evolve on closed orbits that are well separated in phase-space. The fine temporal structure of the IF model leads to deterministic fluctuations of the activity patterns whose coefficient of variation is an increasing function of the inverse rise time α. In other words, fast synapses lead to higher levels of deterministic noise.

An important question that we hope to address elsewhere concerns the extent to which the complex behavior exhibited by the IF networks studied in this article is mirrored by more biophysically detailed models of spiking neurons.
Preliminary numerical studies of Hodgkin-Huxley neurons suggest a strong connection between the two classes of model (see also White et al., 1998). The advantage of working with the IF model is that it allows precise analytical statements to be made. Moreover, there are a number of ways of extending our analysis to take into account additional aspects of neuronal processing. The effects of diffusion along the dendritic tree of a neuron can be incorporated into the model by taking J(t) in equation 3.2 to be the Green's function of the corresponding cable equation. It is also possible to take into account active membrane properties under the assumption that variations of the dendritic membrane potential are small, so that the channel kinetics can be linearized along the lines developed by Koch (1984). The resulting membrane impedance of the dendrites displays resonant-like behavior due to the
additional presence of inductances. This leads to a corresponding form of resonant-like synchronization. In other words, there is a strongly enhanced (reduced) tendency to synchronize for excitatory (inhibitory) coupling when the resonant frequency of the dendritic membrane approximately matches the frequency of the oscillators. Active dendrites can also desynchronize homogeneous networks of IF oscillators with strong excitatory or inhibitory coupling, leading to periodic bursting patterns (Bressloff, 1999). The response of a network to a sinusoidal external stimulus may be studied by taking I_i = A + B sin(ωt) in equation 3.1, where A, B are constants. Since the resulting dynamical system is now nonautonomous, it is no longer possible to specify phase locking in terms of N − 1 relative phases and a self-consistent collective period (see equation 4.1). However, one can derive conditions under which the network is entrained to the external stimulus. For example, in the case of 1:1 mode locking, the firing times of the IF neurons are of the form T_j^m = (m − φ_j)T, with T = 2π/ω. The absolute phases φ_j are determined from the set of equations

1 = (1 − e^{−T})A + (1 − e^{−T})F(φ_i) + ε Σ_{j=1}^{N} W_ij K_T(φ_j − φ_i),  (5.1)
with K_T given by equation 4.2 and F(φ) = B[sin(2πφ) + ω cos(2πφ)]/(1 + ω²). It is also possible to investigate the stability of such solutions by a relatively straightforward generalization of the analysis presented in section 4.2. Another interesting extension is to p:q mode locking. For example, suppose that the neurons have T-periodic firing patterns in which the jth neuron fires q_j times over an interval T. In order to generate such solutions, it is necessary to take the firing times to have the general form (Coombes & Bressloff, 1999)

T_j^n = T([n/q_j] − φ_{j, n mod q_j}),  (5.2)

where [·] denotes the integer part. This generates a set of Σ_{j=1}^{N} q_j closed equations for the absolute phases φ_{j, n mod q_j}. One can use these equations to determine the borders in parameter space where the ansatz 5.2 for mode locking fails. The set of such borders will define an Arnold tongue diagram, with overlapping tongues indicating possible regions of multistability and chaos. Similar techniques can be used to study the bursting solutions found in section 4.4. The existence of chaotic solutions can be studied using an extension of the notion of a Lyapunov exponent that takes into account the presence of discontinuities in the dynamics due to reset (Coombes & Bressloff, 1999). Incidentally, the construction of a Lyapunov exponent also allows one to understand the spike-time reliability of an IF neuron in response to small, aperiodic signals, as recently observed by Hunter, Milton, Thomas, and Cowan (1998).
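A sinusoidally forced IF neuron of the kind considered here (equation 3.1 with I(t) = A + B sin(ωt), unit threshold, reset to zero) can be simulated directly. The following is our own hedged sketch, not code from the article; the parameters are arbitrary, chosen so that I(t) > 1 at all times and every ISI is therefore bracketed by the constant-input extremes.

```python
import math

# Sketch: single forced IF neuron dU/dt = -U + A + B*sin(w*t), threshold 1,
# reset to 0. With I(t) in [A-B, A+B] and A-B > 1 the neuron always fires,
# and every ISI lies between the constant-input extremes
# ln((A+B)/(A+B-1)) and ln((A-B)/(A-B-1)).
A, B, w = 2.0, 0.5, 2 * math.pi / 0.7
dt = 1e-4
U, t = 0.0, 0.0
spikes = []
while t < 50.0:
    U += dt * (-U + A + B * math.sin(w * t))   # Euler step
    t += dt
    if U >= 1.0:                               # threshold crossing
        spikes.append(t)
        U = 0.0                                # reset

isis = [t1 - t0 for t0, t1 in zip(spikes, spikes[1:])]
lo = math.log((A + B) / (A + B - 1))           # fastest possible ISI
hi = math.log((A - B) / (A - B - 1))           # slowest possible ISI
```

Whether the resulting spike train is 1:1 locked, p:q locked, or quasiperiodic depends on the parameters; testing the locking ansatz of equation 5.2 against such simulations is exactly the kind of check the text describes.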
Another important issue concerns the effects of noise. We expect that sufficiently small levels of noise will not alter the basic results of this article. On the other hand, the presence of noise can lead to new and interesting phenomena, as illustrated by the numerical studies of pattern formation by Usher et al. (1994) and Usher, Stemmler, and Olami (1995). They considered a two-dimensional network of IF neurons with short-range excitation and long-range inhibition, fast synapses, and a stochastic external input. They observed patterns of activity in the mean firing rate that are two-dimensional versions of the patterns shown in the inset of Figure 12. It was found that in a certain parameter regime, these patterns were metastable in the presence of noise and tended to diffuse through the network. This macroscopic dynamics resulted in 1/f power spectra and power-law distributions of the ISIs on the microscopic scale of a single neuron, and led to a large coefficient of variation, CV > 1. We are currently carrying out a detailed study of pattern formation in two-dimensional IF networks using the methods presented in our article and hope to gain further insight into the observations of Usher et al. (1994, 1995), in particular, the interplay between noise and the underlying deterministic dynamics. One way of including noise in the IF model or the spike-response model is by introducing a probability P of firing during an infinitesimal time dt: P(U; dt) = τ⁻¹(U) dt. To mimic more realistic models, the response time τ(U) is chosen to be large if U < θ and to vanish if U ≫ θ. This is achieved by writing τ(U) = τ₀ e^{−β(U−θ)}, where τ₀ is the response time at threshold and β determines the level of noise. The deterministic case is recovered in the limit β → ∞. In the case of large, homogeneous networks, one can use a mean-field approach to reduce the dynamics to an analytically tractable form (see Gerstner, 1995).
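The escape-rate noise model just described can be written down in a few lines. A minimal sketch (our own illustration; the values of τ₀, θ, and β are arbitrary):

```python
import math

# Sketch: escape-rate ("noisy threshold") model. The neuron fires in an
# infinitesimal interval dt with probability dt / tau(U), where
# tau(U) = tau0 * exp(-beta * (U - theta)).
tau0, theta = 1.0, 1.0

def hazard(U, beta):
    """Instantaneous firing rate 1/tau(U)."""
    return math.exp(beta * (U - theta)) / tau0

# At threshold the rate is 1/tau0, independent of beta.
r_at_threshold = hazard(theta, beta=5.0)

# The rate grows with U, and increasing beta sharpens the threshold:
# below theta the rate vanishes as beta -> infinity (deterministic limit).
r_below_soft  = hazard(0.9, beta=5.0)
r_below_sharp = hazard(0.9, beta=50.0)
```

In a simulation one would draw a uniform random number each time step and fire whenever it falls below `dt * hazard(U, beta)`.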
Note that another consequence of noise is to smooth out the firing-rate function (see equation 3.18) in the analog version of the model. A final aspect we would like to highlight here is the distinction between excitable and oscillatory networks. In this article we have focused on the latter case by taking I_i > 1 in equation 3.1, so that the network may be viewed as a coupled system of oscillators with natural periods T_i = ln[I_i/(I_i − 1)]. However, it is also possible for spontaneous network oscillations to occur in the excitable regime, whereby I_i < 1 for all i = 1, ..., N. All of the basic analytical results presented in section 4 can be carried over to the excitable regime, in particular, the phase-locking equation 4.1 and the eigenvalue equation 4.5. On the other hand, the weak coupling theory of section 3.1 is no longer applicable, since a finite level of coupling is required to maintain oscillatory behavior. There have also been a number of studies of waves in excitable spiking networks in both one dimension (Ermentrout & Rinzel, 1981; Ermentrout, 1998b) and two dimensions (Chu et al., 1994; Kistler et al., 1998). We are currently carrying out a detailed comparison of wave phenomena in excitable and oscillatory IF networks using some of the ideas presented in this article.
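The natural period T = ln[I/(I − 1)] quoted above follows from integrating dU/dt = I − U from reset (U = 0) to threshold (U = 1), since U(t) = I(1 − e^{−t}). A direct numerical check (our own sketch, with an arbitrary choice of I > 1):

```python
import math

# Sketch: oscillatory regime of a single IF neuron, dU/dt = I - U with I > 1,
# reset U -> 0 at threshold U = 1. The trajectory U(t) = I*(1 - exp(-t))
# reaches threshold at T = ln(I / (I - 1)).
I = 2.0
T_analytic = math.log(I / (I - 1))      # = ln 2 for I = 2

dt = 1e-5
U, t = 0.0, 0.0
while U < 1.0:                          # Euler-integrate one interspike interval
    U += dt * (I - U)
    t += dt
T_numeric = t
```

For I ≤ 1 the loop would never terminate, because U saturates below threshold: this is precisely the excitable regime, where firing requires a finite level of synaptic coupling.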
Acknowledgments

We thank Jack Cowan at the University of Chicago for helpful discussions during the completion of this work. We are also grateful for the constructive comments of the referees and the suggestion to include the example in section 4.6. The research was supported by grant number GR/K86220 from the EPSRC (UK).

References

Abbott, L. F. (1991). Realistic synaptic inputs for model neural networks. Network, 2, 245–258.
Abbott, L. F. (1994). Decoding neuronal firing and modeling neural networks. Quart. Rev. Biophys., 27, 291–331.
Abbott, L. F., & Kepler, T. B. (1990). Model neurons: From Hodgkin-Huxley to Hopfield. In L. Garrido (Ed.), Statistical mechanics of neural networks. Berlin: Springer-Verlag.
Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in neural networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates I: Substrate—spikes, rates and neuronal gain. Network, 2, 259–274.
Atiya, A., & Baldi, P. (1989). Oscillations and synchronization in neural networks: An exploration of the labeling hypothesis. Int. J. Neural Syst., 1, 103–124.
Berry, M. J., Warland, D. K., & Meister, M. (1997). The structure and precision of retinal spike trains. Proc. Natl. Acad. Sci. USA, 94, 5411–5416.
Bressloff, P. C. (1999). Resonant-like synchronization and bursting in a model of pulse-coupled neurons with active dendrites. J. Comp. Neurosci., 6, 237–249.
Bressloff, P. C., & Coombes, S. (1997). Physics of the extended neuron. Int. J. Mod. Phys., 11, 2343–2392.
Bressloff, P. C., & Coombes, S. (1998a). Travelling waves in chains of pulse-coupled oscillators. Phys. Rev. Lett., 80, 4185–4188.
Bressloff, P. C., & Coombes, S. (1998b). Desynchronization, mode-locking and bursting in strongly coupled IF oscillators. Phys. Rev. Lett., 81, 2168–2171.
Bressloff, P. C., & Coombes, S. (1999). Symmetry and phase-locking in a ring of pulse-coupled oscillators with distributed delays. Physica D, 126, 99–122.
Bressloff, P. C., Coombes, S., & De Souza, B. (1997). Dynamics of a ring of pulse-coupled oscillators: A group theoretic approach. Phys. Rev. Lett., 79, 2791–2794.
Carr, C. E., & Konishi, M. (1990). A circuit for detection of interaural differences in the brain stem of the barn owl. J. Neurosci., 10, 3227–3246.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chu, P. H., Milton, J. G., & Cowan, J. D. (1994). Connectivity and the dynamics of integrate-and-fire neural networks. Int. J. Bif. Chaos, 4, 237–243.
Coombes, S., & Bressloff, P. C. (1999). Mode-locking and Arnold tongues in periodically forced integrate-and-fire oscillators. Phys. Rev. E, 60, 2086–2096.
Coombes, S., & Lord, G. J. (1997). Desynchronisation of pulse-coupled integrate-and-fire neurons. Phys. Rev. E, 55, 2104–2107.
Crook, S. M., Ermentrout, G. B., Vanier, M. C., & Bower, J. M. (1997). The role of axonal delay in the synchronization of networks of coupled cortical oscillators. J. Comp. Neurosci., 4, 161–172.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci., 1, 195–231.
Ermentrout, G. B. (1985). The behavior of rings of coupled oscillators. J. Mathemat. Biol., 23, 55–74.
Ermentrout, G. B. (1994). Reduction of conductance-based models with slow synapses to neural nets. Neural Computation, 6, 679–695.
Ermentrout, G. B. (1998a). Neural networks as spatial pattern forming systems. Rep. Prog. Phys., 61, 353–430.
Ermentrout, G. B. (1998b). The analysis of synaptically generated traveling waves. J. Comp. Neurosci., 5, 191–208.
Ermentrout, G. B., & Cowan, J. D. (1979). A mathematical theory of visual hallucination patterns. Biol. Cybern., 34, 137–150.
Ermentrout, G. B., & Kopell, N. (1984). Frequency plateaus in a chain of weakly coupled oscillators. SIAM J. Math. Anal., 15, 215–237.
Ermentrout, G. B., & Rinzel, J. (1981). Waves in a simple, excitable or oscillatory, reaction-diffusion model. J. Math. Biol., 11, 269–294.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Phys. Rev. Lett., 74(9), 1570–1573.
Gerstner, W. (1995). Time structure of the activity in neural-network models. Phys. Rev. E, 51(1), 738–758.
Gerstner, W., Kreiter, A. K., Markram, H., & Herz, A. V. M. (1997). Neural codes: Firing rates and beyond. Proc. Natl. Acad. Sci. USA, 94, 12740–12741.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). A biologically motivated and analytically soluble model of collective oscillations in the cortex I: Theory of weak locking. Biol. Cybern., 68, 363–374.
Gerstner, W., & van Hemmen, J. L. (1994). Coding and information processing in neural networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 1–93). Berlin: Springer-Verlag.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Han, S. K., Kurrer, C., & Kuramoto, Y. (1995). Dephasing and bursting in coupled neural oscillators. Phys. Rev. Lett., 75, 3190–3193.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Hunter, J. D., Milton, J. G., Thomas, P. J., & Cowan, J. D. (1998). A resonance effect for neural spike time reliability. J. Neurophysiol., 80, 1427–1438.
Keener, J. P., Hoppensteadt, F. C., & Rinzel, J. (1981). Integrate-and-fire models of nerve membrane response to oscillatory input. SIAM J. Appl. Math., 41(3), 503–517.
Kepler, T. B., Abbott, L. F., & Marder, E. (1992). Reduction of conductance-based neuron models. Biological Cybernetics, 66, 381–387.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Kistler, W. M., Seitz, R., & van Hemmen, J. L. (1998). Modeling collective excitations in cortical tissue. Physica D, 114, 273–295.
Koch, C. (1984). Cable theory in neurons with active, linearized membranes. Biol. Cybern., 50, 15–33.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. Berlin: Springer-Verlag.
Li, Z., & Hopfield, J. J. (1989). Modeling the olfactory bulb and its neural oscillatory processings. Biol. Cybern., 61, 379–392.
Mainen, Z., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Marcus, C. M., & Westervelt, R. M. (1989). Stability of analog neural networks with delay. Phys. Rev. A, 39, 347–365.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronisation of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1662.
Mulloney, B., Perkel, D. H., & Budelli, R. W. (1981). Motor-pattern production: Interaction of chemical and electrical synapses. Brain Research, 229, 25–33.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1996). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Computation, 4, 643–645.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains. Phys. Rev. Lett., 80, 197–200.
Treves, A. (1993). Mean-field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Phys. Rev. Lett., 71, 1280–1283.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 1). Cambridge: Cambridge University Press.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comput., 6, 795–835.
Usher, M., Stemmler, M., & Olami, Z. (1995). Dynamic pattern formation leads to 1/f noise in neural populations. Phys. Rev. Lett., 74, 326–329.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54(5), 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comp. Neurosci., 1, 313–321.
Wang, X.-J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comput., 4, 89–97.
Wang, X.-J., & Rinzel, J. (1995). Oscillatory and bursting properties of neurons. In M. Arbib (Ed.), Handbook of brain theory and neural networks (pp. 686–691). Cambridge, MA: MIT Press.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.

Received August 27, 1998; accepted January 27, 1999.
NOTE
Communicated by Bard Ermentrout
On Connectedness: A Solution Based on Oscillatory Correlation DeLiang L. Wang Department of Computer and Information Science and Center for Cognitive Science, The Ohio State University, Columbus, OH 43210-1277, U.S.A.
A long-standing problem in neural computation has been the problem of connectedness, first identified by Minsky and Papert (1969). This problem served as the cornerstone for them to establish analytically that perceptrons are fundamentally limited in computing geometrical (topological) properties. A solution to this problem is offered by a different class of neural networks: oscillator networks. To solve the problem, the representation of oscillatory correlation is employed, whereby one pattern is represented as a synchronized block of oscillators and different patterns are represented by distinct blocks that desynchronize from each other. Oscillatory correlation emerges from LEGION (locally excitatory globally inhibitory oscillator network), whose architecture consists of local excitation and global inhibition among neural oscillators. It is further shown that these oscillator networks exhibit sensitivity to topological structure, which may lay a neurocomputational foundation for explaining the psychophysical phenomenon of topological perception.

Thirty years ago, Minsky and Papert (1969), in their milestone book, pointed out fundamental limitations of perceptrons. They proved that all topologically invariant predicates except one are not of finite order. In particular, the topological predicate of connectedness was used as the cornerstone of their mathematical analysis. To understand the implications of this famous (negative) result, let us first explain what it means. A perceptron (Rosenblatt, 1962) may be viewed as a binary classification device that computes a predicate. As shown in Figure 1, the class of perceptrons that Minsky and Papert studied consists of a binary input layer R (symbolizing the retina), a layer of binary feature-detecting units, and a response unit that signifies the classification outcome. The feature detectors sense a specific area of R, and the response unit is a threshold unit operating on a weighted sum of all the units in the detector layer.
The order of a predicate is the smallest number k for which one can compute the predicate with feature detectors that sense no more than k units of R. The Minsky-Papert connectedness theorem states that to tell whether an input pattern is connected requires arbitrarily large orders as R grows in size. In fact, the order increases at least as fast as |R|^{1/2}.

Neural Computation 12, 131–139 (2000)
© 2000 Massachusetts Institute of Technology

Figure 1: Diagram of a perceptron. A feature detector senses a specific area of the input layer, R, and the response unit computes whether the weighted sum of all the detectors is above a certain threshold.

Note that for any fixed R, the theorem does not prevent a theoretical solution to the problem by a perceptron. With a discrete R, there is a finite (but, except for one-dimensional R, exponentially large)¹ number of connected patterns, and one can trivially find a perceptron whose feature detectors detect individual connected patterns. However, problems that are not of finite order require nonlocal (global) feature detectors, and too many of them to be feasible (Minsky & Papert, 1969). In a sense, their result is a negative statement about the scalability of such machinery. Perceptrons are best known for their amazing ability to learn. What is powerful about the Minsky-Papert theory is that it identifies limitations on the kinds of patterns that perceptrons are capable of recognizing, regardless of whatever learning rule is employed. Much of the Minsky-Papert analysis concerns single-layer (simple) perceptrons, and the following critical question arises: Is the limitation applicable to multilayer perceptrons with backpropagation learning, which represents the most influential development of modern neural networks? General analytical statements are not available. As far as connectedness is concerned, Minsky and Papert (1988, p. 52) claimed that multilayer perceptrons are no more powerful. For multilayer perceptrons, order is of less concern because hidden units in typical architectures receive input from all of the input units. The central questions are, How many hidden units are required? and How long is the training process? If the number or the duration increases very quickly as R grows in size, one must consider the problem not solvable in practice. For two-dimensional (2D) R, the number of connected patterns grows exponentially with respect to |R|. Thus, for R not too small, a practical training process
¹ I am grateful to N. Robertson for discussions that led to this conclusion.
can employ only a tiny percentage of all possible training samples, and it is hard to expect that a multilayer perceptron could successfully generalize from so few training samples. Given this consideration, and given that no success has been reported on recognizing connectedness, it is reasonable to project, on the basis of the result for simple perceptrons, that the limitation exists for multilayer perceptrons as well (see also Minsky & Papert, 1988).

Here we point out that a new class of neural networks, neural oscillator networks, provides a solution to the problem of connectedness. The basic unit of an oscillator network is a neural oscillator, and with an additional degree of freedom, the phase of an oscillator, one can speak of synchronization and desynchronization in an oscillator network. In particular, Terman and Wang (1995; Wang & Terman, 1997) proposed and analyzed LEGION (locally excitatory globally inhibitory oscillator network). Representation in these networks is based on oscillatory correlation, a special form of von der Malsburg's temporal correlation theory (von der Malsburg, 1981; von der Malsburg & Schneider, 1986). More specifically, a pattern is represented by a synchronized group of oscillators, and different patterns are represented by distinct oscillator groups that desynchronize from each other. A LEGION network consists of three parts: (1) its basic unit, a relaxation oscillator with two time scales; (2) local excitatory coupling among oscillators, which leads to rapid synchronization within the group corresponding to one pattern; and (3) a global inhibitor, which desynchronizes different groups of oscillators. Formally, a single oscillator i in LEGION is defined by a pair of variables: an excitatory variable x_i and an inhibitory variable y_i. Oscillator i receives external stimulation I_i and overall coupling S_i from the network. The definition of oscillator i includes a relaxation parameter ε, which induces the two time scales that characterize typical relaxation oscillations.
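The two-time-scale behavior of a single oscillator can be illustrated with a minimal numerical sketch. The ODE form below (cubic x-nullcline, sigmoid y-nullcline) is the conventional Terman-Wang relaxation oscillator, used here as an assumption; the precise LEGION definition, including the coupling term S_i, noise, and the global inhibitor, is in Terman & Wang (1995). The parameter values ε = 0.02, γ = 6.0, β = 0.1, and I = ±... are chosen to match those listed later in the Figure 2 caption, while the Euler scheme, step size, and duration are our illustrative choices:

```python
import math

def relaxation_oscillator(I, eps=0.02, gamma=6.0, beta=0.1,
                          dt=0.01, T=600.0, x0=0.0, y0=0.0):
    """Euler-integrate a single Terman-Wang-style relaxation oscillator:
    dx/dt = 3x - x^3 + 2 - y + I
    dy/dt = eps * (gamma * (1 + tanh(x / beta)) - y)
    Returns the x trace. Coupling and noise are omitted in this sketch."""
    x, y, xs = x0, y0, []
    for _ in range(int(T / dt)):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I
        dy = eps * (gamma * (1.0 + math.tanh(x / beta)) - y)
        x, y = x + dt * dx, y + dt * dy
        xs.append(x)
    return xs

def jumps_to_active(xs, theta_x=-0.5):
    """Count upward crossings of theta_x, i.e., jumps to the active phase."""
    return sum(1 for a, b in zip(xs, xs[1:]) if a < theta_x <= b)
```

Under these assumptions, a stimulated oscillator (I = 0.2) keeps jumping between the silent phase (x near the left branch of the cubic) and the active phase (x near the right branch), while an unstimulated one (I = −0.02) settles on the silent branch after at most one excursion, in line with the simulations described next.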
The limit cycle of a relaxation oscillation quickly alternates (or jumps) between an active phase of relatively high x values and a silent phase of relatively low x values. To study connectedness, we employ a 2D LEGION network with four-nearest-neighbor coupling and a global inhibitor z (see Terman & Wang, 1995, and Wang & Terman, 1997, for precise definitions of LEGION). Given the Terman-Wang analysis of LEGION, it is not difficult to use LEGION to detect connectedness. Before describing how to do it, we show the response of a 30×30 LEGION network to two figures: one connected and one disconnected. The connected figure is the cup shown in Figure 2A, while the disconnected one is the image of the word "cup" shown in Figure 2B. The LEGION network is solved using a Runge-Kutta method. The oscillators of the network start with random phases. Figure 2C displays the temporal activity of all the stimulated oscillators (I_i > 0) for the connected cup image. Unstimulated oscillators (I_i < 0) are omitted from the display because they do not oscillate. The oscillators corresponding to each pattern are combined in the display and thus appear as a single oscillator when they are in synchrony. The upper panel shows the oscillator block corresponding to the cup, and the middle panel shows the activity of the global inhibitor. Synchrony occurs in the first cycle of oscillations. The case of the disconnected cup is shown in Figure 2D, where the upper three traces show the three blocks corresponding to the three patterns, respectively. From Figure 2D, it is clear that synchronization within each block and desynchronization between different blocks are achieved within the first two cycles. These simulations illustrate clearly that after a few cycles, all the oscillators corresponding to one (connected) pattern are synchronized, whereas those corresponding to different patterns are desynchronized.

When a synchronized block jumps to the active phase, the global inhibitor is triggered and rises from 0 toward 1 on the fast time scale. Within an oscillation period, the global inhibitor is triggered as many times as there are patterns in the input figure. Thus, one can find how many patterns are in the input image by comparing the response of any stimulated oscillator with that of the global inhibitor. If their frequencies of oscillation are the same, then the input figure contains one pattern, and thus the figure is connected. Otherwise the input figure contains more than one pattern, and the figure is disconnected. More specifically, we compute the accumulated activity of the global inhibitor over an oscillation period τ as ∫_{T−τ}^{T} z dt, where T denotes the current time. The corresponding average accumulated activity of a stimulated oscillator is given by

    Σ_i ∫_{T−τ}^{T} H(x_i − θ_x) dt / Σ_i H(I_i).
Here H stands for the Heaviside step function, and θ_x is chosen between the x values that correspond to the silent phase and those that correspond to the active phase. Note that the denominator indicates how many oscillators are stimulated. Regarding the period τ of a stimulated oscillator: in the singular limit ε → 0, it can be expressed precisely in terms of the parameters of a single oscillator (see equation 10 in Linsay & Wang, 1998). When ε > 0, as in the previous simulations, the calculated period is not precise. But, as will be clear later, we do not need precise values of τ in order to detect connectedness. To summarize, the connectedness predicate is calculated as

    ∫_{T−τ}^{T} z dt / [ Σ_i ∫_{T−τ}^{T} H(x_i − θ_x) dt / Σ_i H(I_i) ] < θ.        (1)

Figure 2: (A) A connected cup image as mapped to a 30×30 LEGION network. (B) A disconnected image with three patterns forming the word "cup," as mapped to a 30×30 network. (C) Results for the connected cup image. (D) Results for the disconnected cup image. In C and D, the upper traces give the combined x activities of the oscillator blocks indicated by their respective labels. The next-to-bottom trace gives the activity of the global inhibitor, and the bottom one shows the temporal activity of the RHS of equation 1, together with θ. The oscillator activity is normalized in the figure. The simulation in each case takes 7500 integration steps. The parameter values are (see Wang & Terman, 1997, for their meanings): ε = 0.02, β = 0.1, γ = 6.0, ρ = 0.02, θ_x = −0.5, θ_z = 0.1, w = 3.0, W_z = 1.0, and W_T = 8.0 (weights are identical before dynamic normalization); I = 0.2 for a stimulated oscillator and I = −0.02 otherwise.
The threshold θ should be chosen between 1 and 2, for the following reason. Given a positive ε and system noise, synchrony within each pattern is not perfect (see Figure 2). Since z is triggered as soon as a single oscillator jumps to the active phase, the active phase of the inhibitor is slightly wider than that of a stimulated oscillator. Thus, for a connected figure, the right-hand side (RHS) of equation 1 is slightly greater than 1 but certainly less than 2. The two bottom traces in Figures 2C and 2D show the RHS value of equation 1 for these two cases. For the parameter values used in the simulations, τ = 5.27. A threshold θ = 1.6 is used in Figure 2, and this is sufficient to detect connectedness. Beyond a short initial interval corresponding to the process of synchronization and desynchronization, equation 1 correctly reveals the connectedness predicate. More generally, the RHS of equation 1 reveals the number of connected components in the input figure. With this formula, the LEGION network can be used to count how many patterns are in a picture. Note that equation 1 does not consider the degenerate case of no stimulus.

What if there are numerous patterns in a picture? As analyzed in Wang and Terman (1997), for a fixed set of parameters, the LEGION network exhibits a fixed capacity in segmentation. If the number of patterns in an input picture exceeds the capacity, the network separates the picture into as many segments as the capacity. This observation, together with the results on the speed of synchronization and desynchronization (Terman & Wang, 1995; Wang & Terman, 1997), suggests that no matter how numerous the patterns in the input are, one needs to wait at most as many cycles as the capacity before connectedness can be correctly detected. With any capacity greater than 1, the connectedness predicate of equation 1 is not affected when numerous patterns appear in the input, because it is an assertion of whether the figure contains just one pattern.
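The counting behavior of equation 1 can be checked on idealized square-wave traces standing in for the thresholded activities H(x_i − θ_x) and the inhibitor z after synchronization is complete. The trace shapes, block sizes, sampling grid, and helper names below are illustrative assumptions; only the ratio being computed comes from equation 1:

```python
def rhs_equation1(osc_traces, z_trace):
    """Discrete-time analogue of the RHS of equation 1: accumulated inhibitor
    activity divided by the average accumulated activity of the oscillators
    (all oscillators passed in are assumed stimulated)."""
    avg_osc = sum(sum(tr) for tr in osc_traces) / len(osc_traces)
    return sum(z_trace) / avg_osc

def square_wave(period, on_start, on_len, n):
    """Binary trace that is 1 on [on_start, on_start + on_len) in each period."""
    return [1 if on_start <= t % period < on_start + on_len else 0
            for t in range(n)]

n, period = 500, 100
block_a = square_wave(period, 0, 10, n)    # one synchronized block
block_b = square_wave(period, 50, 10, n)   # a second, desynchronized block

# Connected figure: all oscillators in one block; z follows the block.
one_pattern = rhs_equation1([block_a] * 5, block_a)             # -> 1.0

# Disconnected figure: two blocks; z is triggered whenever either is active.
z = [max(a, b) for a, b in zip(block_a, block_b)]
two_patterns = rhs_equation1([block_a] * 3 + [block_b] * 2, z)  # -> 2.0
```

With the threshold θ = 1.6 used in Figure 2, the first case satisfies the predicate (1.0 < 1.6, connected) and the second does not (2.0 ≥ 1.6, disconnected). In the full network the first ratio is slightly above 1 rather than exactly 1, because the inhibitor's active phase is slightly wider than an oscillator's.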
Thus, we obtain an upper bound on how long the system takes to compute the connectedness predicate. We emphasize that this detection of connectedness is analytically established, regardless of the specific shapes, sizes, or arrangements of the patterns in a picture.

Connectedness is not only computationally interesting but also of fundamental significance to perception. It is one of the basic perceptual grouping principles (Rock & Palmer, 1990). According to Palmer and Rock (1994),
Figure 3: Three pairs of stimuli used in Chen’s experiments (adapted from Chen, 1982).
connectedness accounts for the first level of organization. Furthermore, based on a series of psychophysical experiments, Chen (1982, 1989) observed that the human visual system is sensitive to topological properties of stimuli and that topological perception constitutes a basic and early part of perceptual organization. Figure 3 illustrates a typical experiment with three pairs of stimuli, which manifest topological perception. Chen found that humans are more accurate at discriminating the annulus-disk pair under rapid display conditions. Note that the pairs in Figures 3A and 3B are topologically equivalent, whereas the annulus-disk pair is topologically distinct. Based on the difficulty neural networks have with connectedness, he issued the challenge that topological perception remains a mystery for current neural network models (Chen, 1989).

Given that all six patterns in Figure 3 are connected, connectedness alone cannot explain the experimental result. However, we observe that although every pattern has an open "outside," only the annulus has an inside hole. With this observation, we can assume that a hole is a perceptually distinct pattern. Thus, the annulus consists of two connected patterns (or three, if one considers the open outside as a separate region), whereas the other five stimuli each contain one pattern. When the stimuli in Figure 3 are presented to a LEGION network, all of them result in one segment, except for the annulus, which results in two segments.² The qualitative difference in the number of segmented patterns emerging from LEGION provides a foundation for explaining topological perception. Our explanation applies equally well to cases involving more than one hole (Chen, 1990). The fact that LEGION exhibits a fixed segmentation capacity leads to the following novel and testable prediction: topology-based discrimination occurs only up to a certain number of holes. This prediction differs from that of Chen's (1990) topological hypothesis, which is rooted in mathematical topology.

² For simulations involving holes, as in Figure 3C, oscillators corresponding to holes are also stimulated, and excitatory connections are formed only between neighbors with the same pixel level (black or white).

To conclude, we have found a solution to the problem of detecting connectedness. The solution is given by a class of neural oscillator networks, which differ fundamentally from layered perceptrons. These networks are a kind of recurrent neural network, and they constitute dynamical systems owing to their continuous-time definitions. As such, oscillators in a network are effectively of unbounded order, which provides a necessary means of addressing the problem of connectedness. Furthermore, with their ability to exhibit sensitivity to topological structure, these oscillator networks provide a foundation for explaining psychophysical data on topological perception, and they have therefore answered another challenge to neural networks. With their success in solving these difficult problems, as well as their unique capability in scene analysis (Wang & Terman, 1997) and their biological plausibility (Singer & Gray, 1995), oscillator networks may suggest a promising direction for future research in neural computation.

Acknowledgments

Thanks to B. Chandrasekaran, J. Josephson, and R. Lewis for useful discussions and for reading an earlier draft of this article. This work was supported in part by an NSF grant (IRI-9423312) and an ONR Young Investigator Award (N00014-96-1-0676).

References

Chen, L. (1982). Topological structure in visual perception. Science, 218, 699–700.
Chen, L. (1989). Topological perception: A challenge to computational approaches to vision. In R. Pfeifer, Z. Schreter, F. Fogelman-Soulié, & L. Steels (Eds.), Connectionism in perspective (pp. 317–329). Amsterdam: Elsevier.
Chen, L. (1990). Holes and wholes: A reply to Rubin and Kanwisher. Percept. Psychophys., 47, 47–53.
Linsay, P. S., & Wang, D. L. (1998). Fast numerical integration of relaxation oscillator networks based on singular limit solutions. IEEE Trans. Neural Net., 9, 523–532.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Minsky, M. L., & Papert, S. A. (1988). Perceptrons (Expanded ed.). Cambridge, MA: MIT Press.
Palmer, S., & Rock, I. (1994). Rethinking perceptual organization: The role of uniform connectedness. Psychol. Bull. Rev., 1, 29–55.
Rock, I., & Palmer, S. (1990). The legacy of Gestalt psychology. Sci. Am., 263, 84–90.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C., & Schneider, W. (1986). A neural cocktail-party processor. Biol. Cybern., 54, 29–40.
Wang, D. L., & Terman, D. (1997). Image segmentation based on oscillatory correlation. Neural Comp., 9, 805–836. (For errata see Neural Comp., 9, 1623–1626, 1997.)
Received June 24, 1998; accepted January 22, 1999.
Communicated by Peter Dayan

NOTE

Practical Identifiability of Finite Mixtures of Multivariate Bernoulli Distributions

Miguel Á. Carreira-Perpiñán and Steve Renals
Department of Computer Science, University of Sheffield, Sheffield S1 4DP, U.K.
The class of finite mixtures of multivariate Bernoulli distributions is known to be nonidentifiable; that is, different values of the mixture parameters can correspond to exactly the same probability distribution. In principle, this would mean that sample estimates using this model could give rise to different interpretations. We give empirical support to the fact that estimation of this class of mixtures can still produce meaningful results in practice, thus lessening the importance of the identifiability problem. We also show that the expectation-maximization algorithm is guaranteed to converge to a proper maximum likelihood estimate, owing to a property of the log-likelihood surface. Experiments with synthetic data sets show that an original generating distribution can be estimated from a sample. Experiments with an electropalatography data set reveal important structure in the data.

1 Introduction

Finite mixtures of multivariate Bernoulli distributions have been extensively used in diverse fields (such as bacterial taxonomy) to model a population of binary, multivariate measurements in terms of a few latent classes (see Everitt & Hand, 1981; Gyllenberg, Koski, Reilink, & Verlaan, 1994, and references therein). However, it has recently been proved that this class of mixture models is nonidentifiable (Gyllenberg et al., 1994), which potentially undermines the interpretation of sample estimates, since the same sample could be equally attributed to a number of different estimates. We consider a finite mixture distribution (Everitt & Hand, 1981) defined on the D-dimensional binary space {0, 1}^D:

    p(t) = Σ_{m=1}^{M} π_m p(t|m),
where M is the number of components, t = (t_1, ..., t_D)^T is a binary D-dimensional vector, the π_m = p(m) are the mixture proportions, and p(t|m) is a multivariate Bernoulli distribution with parameters (called prototypes) p_m = (p_{m1}, ..., p_{mD})^T; that is,

    p(t|m) = Π_{d=1}^{D} p_{md}^{t_d} (1 − p_{md})^{1−t_d}.

Neural Computation 12, 141–152 (2000)
© 2000 Massachusetts Institute of Technology
These parameters must obey the following constraints:

    Σ_{m=1}^{M} π_m = 1;    π_m ∈ (0, 1)  for all m = 1, ..., M;
    p_{md} ∈ [0, 1]  for all m = 1, ..., M, d = 1, ..., D.
It has been proved (Gyllenberg et al., 1994) that this class of mixtures is nontrivially nonidentifiable¹ for all dimensions D. This means that there are many combinations of values of the parameter tuple Θ = {M, {π_m, p_m}_{m=1}^{M}} that produce an identical distribution p(t): there exist at least two tuples Θ, Θ′ for which p(t|Θ) = p(t|Θ′) for all t ∈ {0, 1}^D. For example, it can be readily verified that the four mixtures given by the following parameter tuples represent the same distribution (here D = 3):

    Θ:   M = 1,  {π_1 = 1,   p_1 = (1/2, 1/2, 1/2)^T}
    Θ′:  M = 2,  {π_1 = 1/2, p_1 = (1/2, 0, 1/2)^T},  {π_2 = 1/2, p_2 = (1/2, 1, 1/2)^T}
    Θ″:  M = 2,  {π_1 = 1/4, p_1 = (1/2, 0, 1/2)^T},  {π_2 = 3/4, p_2 = (1/2, 2/3, 1/2)^T}
    Θ‴:  M = 2,  {π_1 = 1/4, p_1 = (1, 1/2, 1/2)^T},  {π_2 = 3/4, p_2 = (1/3, 1/2, 1/2)^T}

However, this does not mean that for every parameter tuple Θ there must exist at least one different Θ′ representing the same distribution. Identifiability is a property of the class of mixtures rather than of a particular parameter tuple. Hence, in principle there are many tuples in parameter space that are completely equivalent but would give rise to different interpretations. This may seem an insurmountable difficulty for parameter estimation, but our practical studies have produced promising results. We will show that, given a sample from a mixture of multivariate Bernoulli distributions, maximum likelihood estimates of the parameters, obtained by an expectation-maximization (EM) algorithm, can still be interpretable. Before giving some experimental results to support this claim, we give some properties of the log-likelihood surface of mixtures of multivariate Bernoulli distributions and introduce an EM algorithm for them.

¹ As opposed to trivial nonidentifiability, which arises from permutations of the mixture components or from coincident component distributions p(t|m) for several components.
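The equivalence of these four tuples is easy to verify numerically by enumerating all of {0, 1}³; each mixture assigns probability 1/8 to every vector. The helper function below is our own naming, not part of the note:

```python
from itertools import product

def mixture_prob(t, components):
    """p(t) for a finite Bernoulli mixture given as (pi_m, p_m) pairs."""
    total = 0.0
    for pi_m, p_m in components:
        lik = pi_m
        for t_d, p_d in zip(t, p_m):
            lik *= p_d if t_d else 1.0 - p_d
        total += lik
    return total

# The four parameter tuples from the text (D = 3).
Theta  = [(1.0,  (0.5, 0.5, 0.5))]
Theta1 = [(0.5,  (0.5, 0.0, 0.5)), (0.5,  (0.5, 1.0, 0.5))]
Theta2 = [(0.25, (0.5, 0.0, 0.5)), (0.75, (0.5, 2 / 3, 0.5))]
Theta3 = [(0.25, (1.0, 0.5, 0.5)), (0.75, (1 / 3, 0.5, 0.5))]

# All four tuples assign probability 1/8 to every t in {0,1}^3.
for t in product((0, 1), repeat=3):
    assert all(abs(mixture_prob(t, th) - 0.125) < 1e-12
               for th in (Theta, Theta1, Theta2, Theta3))
```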
2 Maximum Likelihood Parameter Estimation

We assume a fixed number of components M. Maximum likelihood estimation can be achieved by an EM algorithm. Define π = (π_1, ..., π_M)^T and P = (p_1, ..., p_M). The log-likelihood of the parameters {π, P} given a sample {t_n}_{n=1}^{N} is

    l(π, P) = Σ_{n=1}^{N} log p(t_n; π, P)
            = Σ_{n=1}^{N} log ( Σ_{m=1}^{M} π_m Π_{d=1}^{D} p_{md}^{t_{nd}} (1 − p_{md})^{1−t_{nd}} )        (2.1)
and its gradient is easily seen to be

    ∂l/∂π_m = (1/π_m) Σ_{n=1}^{N} p(m|t_n; π, P) − N,        m = 1, ..., M        (2.2)

    ∂l/∂p_{md} = (1/(p_{md}(1 − p_{md}))) Σ_{n=1}^{N} p(m|t_n; π, P)(t_{nd} − p_{md}),        m = 1, ..., M,  d = 1, ..., D,        (2.3)

where

    p(m|t_n; π, P) = p(t_n|m; π, P) p(m) / Σ_{m′=1}^{M} p(t_n|m′; π, P) p(m′)
                   = π_m Π_{d=1}^{D} p_{md}^{t_{nd}} (1 − p_{md})^{1−t_{nd}} / Σ_{m′=1}^{M} π_{m′} Π_{d=1}^{D} p_{m′d}^{t_{nd}} (1 − p_{m′d})^{1−t_{nd}}        (2.4)
is the posterior probability (or responsibility) that component m generated data point t_n. The term −N in equation 2.2 results from the constraint Σ_{m=1}^{M} π_m = 1, introduced into the log-likelihood via a Lagrange multiplier. Derivation of the EM algorithm for finite mixtures of multivariate Bernoulli distributions is straightforward and can be found elsewhere (Everitt & Hand, 1981; Wolfe, 1970). Here we give the basic equations:

E-step: Computation of the responsibilities p(m|t_n; π^{(t)}, P^{(t)}) using equation 2.4 from the current parameter estimates {π^{(t)}, P^{(t)}} at iteration t.

M-step: Reestimation of {π^{(t+1)}, P^{(t+1)}}:

    π_m^{(t+1)} = (1/N) Σ_{n=1}^{N} p(m|t_n; π^{(t)}, P^{(t)})
    p_m^{(t+1)} = (1/(N π_m^{(t+1)})) Σ_{n=1}^{N} p(m|t_n; π^{(t)}, P^{(t)}) t_n.        (2.5)
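A compact implementation of this iteration is sketched below; the function names and the use of plain Python lists are our own choices, but the updates follow equations 2.4 and 2.5 directly. One consequence worth noting (derived in section 3) is that after any single M-step the mixture mean Σ_m π_m p_m equals the sample mean, since the responsibilities for each data point sum to 1.

```python
import random

def em_step(T, pi, P):
    """One EM iteration for a Bernoulli mixture.
    E-step: responsibilities via equation 2.4; M-step: equation 2.5."""
    N, M, D = len(T), len(pi), len(T[0])
    R = []                                   # R[n][m] = p(m | t_n)
    for t in T:
        w = []
        for m in range(M):
            lik = pi[m]
            for d in range(D):
                lik *= P[m][d] if t[d] else 1.0 - P[m][d]
            w.append(lik)
        s = sum(w)
        R.append([wm / s for wm in w])
    new_pi = [sum(R[n][m] for n in range(N)) / N for m in range(M)]
    new_P = [[sum(R[n][m] * T[n][d] for n in range(N)) / (N * new_pi[m])
              for d in range(D)] for m in range(M)]
    return new_pi, new_P

def fit(T, M, iters=50, seed=0):
    """Run EM from the starting point used in section 4: pi_m = 1/M and
    random prototypes in [1/4, 3/4]."""
    rng = random.Random(seed)
    pi = [1.0 / M] * M
    P = [[rng.uniform(0.25, 0.75) for _ in range(len(T[0]))] for _ in range(M)]
    for _ in range(iters):
        pi, P = em_step(T, pi, P)
    return pi, P
```

On a sample from a well-separated mixture, `fit` typically recovers the generating prototypes (cf. section 4.1). Note also that starting with identical prototypes for all components (e.g., p_{md} = 1/2) makes all responsibilities equal to π_m, so one EM step collapses every prototype onto the sample mean, the trivial mixture discussed in section 3.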
The sequence of parameters obtained for t = 0, 1, 2, ... by iterating between the E- and M-steps from any starting point {π^{(0)}, P^{(0)}} produces a monotonically increasing sequence of values of the log-likelihood. A common problem of estimation in mixture distributions is that of singularities, that is, points in parameter space whose log-likelihood tends to positive infinity (e.g., a mixture of gaussians in which one of the components is located on a data point and its variance tends to zero, thereby becoming a Dirac delta). Such singularities are undesirable because they give rise to degenerate distributions. Fortunately, the log-likelihood surface of a finite mixture of multivariate Bernoulli distributions has no singularities of value +∞ (although it does have singularities of value −∞). The reason is that both the log-likelihood (see equation 2.1) and its gradient (see equations 2.2 and 2.3) are bounded above in the whole parameter space, including its boundaries.² This means that estimation by the above EM algorithm from any nonpathological starting point, which is always possible by choosing p_{md} in (0, 1), will always lead to a proper stationary point of the log-likelihood.

3 Stationary Points of the Log-Likelihood

We note that the second derivatives of the log-likelihood with respect to the mixing proportions are always negative (for clarity, we omit the dependence on the parameters):

    ∂²l/∂π_m ∂π_{m′} = −(1/(π_m π_{m′})) Σ_{n=1}^{N} p(m|t_n) p(m′|t_n) ≤ 0.
Therefore the Hessian of the log-likelihood has negative numbers on its diagonal, and it cannot be positive definite. Hence, no stationary point of the log-likelihood is a minimum. Note that this is a general property of finite mixtures.

At any stationary point of the log-likelihood, equations 2.5 hold, so that we have

    E_{p(t)}{t} = Σ_{m=1}^{M} π_m E_{p(t|m)}{t} = Σ_{m=1}^{M} π_m p_m
               = (1/N) Σ_{m=1}^{M} Σ_{n=1}^{N} p(m|t_n) t_n = (1/N) Σ_{n=1}^{N} t_n = t̄,

² When p_{md} → 0 for some m, d, the log-likelihood gradient in equation 2.3 remains bounded above because, for each n, either t_{nd} − p_{md} → k_1 p_{md} (if t_{nd} = 0) or p(m|t_n) ∝ p(t_n|m) → k_2 p_{md} (if t_{nd} = 1), where k_1 and k_2 are constants. The same is true in the case p_{md} → 1. Therefore the log-likelihood is differentiable for p_{md} ∈ [0, 1], except in pathological situations where |p_{md} − t_{nd}| = 1 for all m and fixed n. In these cases p(t_n|m) = 0 for all m, and l({π_m, p_m}_{m=1}^{M}) → −∞. Since the EM algorithm always climbs the log-likelihood surface, it will not be attracted by such singularities.
and so the mean of the mixture coincides with the sample mean at any stationary point. The converse does not hold in general. The point in parameter space where p_m = t̄ for all m = 1, ..., M and any distribution of the mixing proportions π_m is a stationary point of the log-likelihood, because p(t_n|m) = Π_{d=1}^{D} t̄_d^{t_{nd}} (1 − t̄_d)^{1−t_{nd}} is independent of m, and therefore p(m|t_n) = π_m and the gradient is zero there. This point is equivalent to a single multivariate Bernoulli distribution and should be avoided, because experience shows that its log-likelihood,

    l({π_m, p_m = t̄}_{m=1}^{M}) = log Π_{d=1}^{D} t̄_d^{N t̄_d} (1 − t̄_d)^{N(1−t̄_d)},

is much smaller than that of other local maxima (an intuitive fact, since the mixture is trivial). Observe that a starting point of the EM algorithm in which p_m is the same for all components (e.g., the apparently innocuous starting point p_{md} = 1/2 for all m and d) will lead to the mentioned trivial mixture after one EM iteration, for any original distribution of the π_m. Our experiments showed that random starting points in (0, 1) were less prone to leading to trivial mixtures.

4 Experimental Results

4.1 Synthetic Data. We generated N = 10,000 vectors in a binary space of D = 16 dimensions from a fixed mixture of M = 8 16-variate Bernoulli distributions, whose parameters are shown in Figure 1. We call these the original parameters and denote them with an "o" superscript (e.g., p_m^o). From the sample alone, 10 maximum likelihood estimates were found for mixtures of M = 4, M = 8, and M = 10 components. We used the above EM algorithm with random starting values of the parameters p_{md} in the range [1/4, 3/4], stopping it when the relative change in log-likelihood was smaller than 10^{−6}. The starting values for the π_m parameters were fixed to 1/M, thus giving each component the same weight at the beginning. The results were as follows:³

Using the original number of components (M = 8), EM found the original parameters (both p_m and π_m) 9 out of 10 times; the normalized distance between the original and the estimated parameters was smaller than 0.0013 in those cases, and the log-likelihood was −94,990 (to four significant digits). However, the remaining estimate was a suboptimal maximum of the log-likelihood in which two of the components had the same p_m parameter (p_5 ≈ p_8 ≈ p_5^o) and a log-likelihood of −95,201. The difference between the log-likelihood values of the two kinds of estimates was 0.2%.

Using fewer components than necessary (M = 4) produced estimates in which each prototype p_m was approximately either one of the original prototypes or a linear combination of several of the original prototypes (normalized distance smaller than 0.0023). Figure 2 shows the situation for a particular estimate in which p_m ≈ p_m^o for m = 1, 2, 3 and p_4 ≈ (2/3) p_4^o + (1/3) p_5^o. As in the previous case, estimates having different prototypes also had very close log-likelihood values (differing by 0.5%).

³ We quantify the distance between two vectors p, q in the D-dimensional rectangle [0, 1]^D with the normalized undirected distance (1/D)‖p − q‖₂², where ‖·‖₂ is the Euclidean norm. This distance is a real number in [0, 1] that averages 1/6 for two uniformly random vectors. Observe that if p_d = q_d + δ for d = 1, ..., D, then (1/D)‖p − q‖₂² = δ², independent of D.

Figure 1: Parameters of the mixture of M = 8 16-variate Bernoulli distributions used to generate the sample in section 4.1. For m = 1, ..., 8, each 16-dimensional p_m vector is represented as a 4 × 4 image in which the area of each pixel is proportional to the value of its associated p_{md} parameter; for example, for the leftmost image, p_{1d} is 0.8 for d ≤ 4 (row 1) and 0.2 for d ≥ 5 (rows 2–4).
Figure 2: Parameters of a mixture of M = 4 16-variate Bernoulli distributions estimated by maximum likelihood from the sample. Observe that p_1, p_2, p_3, and p_4 coincide very closely with p_1^o, p_2^o, p_3^o, and (2/3) p_4^o + (1/3) p_5^o, respectively.
Using more components than necessary (M = 10) always produced the eight original p_m^o vectors (normalized distance smaller than 0.0022) plus two extra ones, typically either repeated instances of some of the original ones or linear combinations of them. The log-likelihood values of the estimates did not differ from one another by more than 0.02%, indicating that once the eight original prototypes are found, the remaining ones are largely irrelevant and reflect peculiarities of the sample used. Experiments performed with other synthetic data sets produced the same results.

We propose the following interpretation of the experimental facts. Given a large sample generated from a known mixture of multivariate Bernoulli distributions, let us construct another mixture in this way. First, freely pick the number of components M; then choose its prototypes p_1, ..., p_M either as some of the original ones or as linear combinations of them. We claim that for certain values of the mixing proportions, such a point in parameter space is very close to a maximum of the log-likelihood surface. However, we do not have theoretical support for this, and we do not have a valid interpretation of the values of the mixing proportions.

These results also suggest a procedure to follow when estimating an unknown mixture of multivariate Bernoulli distributions from a sample. Choose freely a number of components M and, using EM from random starting points, find several (say, 10) maximum likelihood estimates for it. Inspect the prototypes obtained. If they look the same for every estimate, then M is probably the right number of components, and the estimate is very close to the true generating distribution. If a fixed group of prototypes appears in each estimate, and the rest of the prototypes are repetitions of those in the group, then M is probably too big; reduce it and start again.
However, if there are different prototypes in different estimates, then M is probably too small; increase it and start again. 4.2 EPG Data. The technique of electropalatography (Hardcastle, Jones, Knight, Trudgeon, & Calder, 1989) records the presence or absence of contact between the tongue and the hard palate in a number of xed locations of the latter and at xed intervals during continuous speech. The result is a stream of two-dimensional binary patterns, or electropalatograms (EPGs), which can be used in speech therapy and assessment. We used a subset of electropalatography data from the EUR-ACCOR database (Marchal & Hardcastle, 1993) containing 11,852 different 62-dimensional vectors (EPGs), obtained from 12 different utterances by a native English speaker. We estimated the density of its distribution in 62-dimensional space using a nite mixture of multivariate Bernoulli distributions with M = 6 components. A number of estimates were found with the EM algorithm. As with the synthetic data set, the starting parameter £ ¤ values were 1/ M for the p m parameters and a random number in 14 , 34 for the parameters pmd , and EM
Miguel Á. Carreira-Perpiñán and Steve Renals
Figure 3: The nine prototypes p1 , . . . , p9 for a mixture of M = 9 multivariate Bernoulli distributions trained with the EPG data set. Each pm vector, consisting of D = 62 values in the range [0, 1], is customarily displayed as an image resembling the palate. The top row (alveolar part) contains parameters pm1 to pm6 from left to right, row 2 contains pm7 to pm14 , and so on until the bottom row (velar part), always from left to right. The area of each pixel is proportional to the value of its associated pmd parameter.
was stopped when the relative change in log-likelihood was smaller than 10^-6. Examination of the parameter values at this point showed the following. Several different kinds of estimates were found, each characterized by a subset of the prototypes shown in Figure 3 and by specific values for the mixing proportions. Prototypes p1, p2, p5, and p9 appeared in almost every estimate (the normalized distance between corresponding prototypes did not exceed 0.02), and p3 and p8 were very common too. The log-likelihoods of these mixtures were quite close, varying from −138,341 for the combination {p1, p2, p5, p7, p8, p9} to −141,697 for the combination {p1, p2, p3, p5, p6, p9}. Occasionally a prototype was found that could be expressed as a linear combination of some of the prototypes of Figure 3, for example, (1/2)p8 + (1/2)p9. We also found estimates for mixtures of M components, where M varied from 1 to 15. This made apparent the fact that for M > 9, some prototypes appeared several times (with slight differences) in the same mixture, which therefore becomes trivial. This suggested selecting M = 9 as the optimum
Practical Identifiability of Finite Bernoulli Mixtures
Figure 4: Log-likelihood of the mixture of multivariate Bernoulli distributions model for the EPG data as a function of the number of components M. Each value plotted corresponds to an average over 10 independent estimates.
number of components for this data set,4 with typical values for the optimal parameters (both the mixing proportions and the pm) given in Figure 3. Considering a sequence of mixture estimates from M = 1 to M = 15, prototypes with high mixing proportions tended to appear early in the sequence, although at times somewhat distorted owing to interference with other prototypes. For example, prototype p1 in Figure 3 was present in all mixtures, while p6 starts to appear only (very infrequently) for M ≥ 6. However, prototype p9 appears very frequently for M ≥ 4. Also, the log-likelihood for the data set considered increases with M and reaches a plateau for M ≈ 9 (see Figure 4). An interesting fact is that these prototypes are highly interpretable, corresponding to physically feasible EPGs and in fact assimilable to well-known quasi-static patterns in EPG studies (e.g., velar, alveolar; Hardcastle et al., 1989), thus revealing important structural patterns in the data. These prototypes are similar to those produced by other methods, in particular latent variable models (Carreira-Perpiñán & Renals, 1998).
4. An alternative way to select the critical M is to examine the log-likelihood curves for a validation set.
5 Conclusion

We have shown that for the class of finite mixtures of multivariate Bernoulli distributions, the EM algorithm always converges to a proper stationary point of the log-likelihood, provided it is not started from a pathological point in parameter space (which is always avoidable). The reasons are the absence of singularities of value +∞ in the log-likelihood surface and the fact that the EM algorithm always climbs that surface. We have given empirical evidence that for this particular class of finite mixture models, sensible and interpretable maximum likelihood estimates can be found even though that class of mixtures is not identifiable. This suggests that the lack of identifiability might not be important from the practical point of view in some cases. Let us analyze the possible reasons in some detail. First, we note that the nonidentifiability result is not surprising if one considers that a distribution defined over a discrete domain can be specified by a finite number of equations, one for each point of the domain. In our case, this means 2^D − 1 equations (plus another one, linearly dependent on them because all the equations add to one). Since the mixture has MD + M − 1 free parameters (one of the mixing proportions being linearly dependent on the others), making M larger than 2^D/(D + 1) would yield an underdetermined system of equations with multiple solutions. However, even for small dimensions D, this number is extremely large, and the number of components employed in practice will be much smaller. We restrict the rest of this discussion to the situation where the maximum number of components is much smaller than 2^D/(D + 1), which yields an overdetermined system (where multiple solutions can still exist). One reason for the apparent sparseness of the nonidentifiability effect may be a low population of equivalent parameter tuples. Let us call two different parameter tuples Q, Q′ equivalent if they produce the same distribution.
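The counting argument above can be checked directly; `components_bound` is a hypothetical helper name introduced here for illustration.

```python
# The mixture has M*D + M - 1 free parameters, while a distribution on
# {0,1}^D is fixed by 2^D - 1 independent equations; equating the two,
# the system becomes underdetermined once M exceeds 2^D / (D + 1).
def components_bound(D):
    return 2 ** D / (D + 1)

print(components_bound(3))    # 2.0: already M = 3 is underdetermined for D = 3
print(components_bound(62))   # ~7e16: far beyond any practical number of components
```

For the EPG data (D = 62, M = 9), the number of free parameters is minuscule compared with the number of constraints, so the system is heavily overdetermined, as the text notes.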
Although we know that for any dimension D there exist equivalent parameter tuples, this does not mean that for each parameter tuple Q there will exist an equivalent, different one Q′ (always for the case M ≪ 2^D/(D + 1)). For example, it is easily seen that for the mixture in D dimensions with parameters given by
Q: M = 2, {π1 ∈ (0, 1), p1 = (1 1 . . . 1)^T}, {1 − π1, p2 = (0 0 . . . 0)^T},

which produces a distribution

p(t) = π1 if t = (1 1 . . . 1)^T; 1 − π1 if t = (0 0 . . . 0)^T; 0 otherwise,
there does not exist any equivalent parameter tuple Q′ for M < 2^D/(D + 1) (disregarding, as usual, permutations of mixture components and coincident component distributions). Thus, the actual practical problem of identifiability is which parameter tuples Q have equivalent parameter tuples, or, considering the partition of the space of parameter tuples into equivalence classes (each class consisting of all mutually equivalent parameter tuples), what the cardinality of each class is. This is a difficult question to answer analytically in general, but the experimental results suggest that nontrivial equivalence classes (consisting of more than one element) may be rare, perhaps pathological. Another reason for the possibility of estimating a sensible parameter tuple seems to be that every estimate contains some of the original prototypes and that the original number of components may be selected by inspection of a collection of maximum likelihood estimates obtained with the EM algorithm (as we did in section 4.1). Note that the likelihood function associated with a small sample (small compared with the total number of possible different vectors, usually 2^D) need not be maximized by the original mixture. Finally, a situation reciprocal to the one described here may also be possible, as the following example in D = 1 dimension shows. Consider the class of mixtures of normal distributions, which is known to be identifiable for all dimensions (Everitt & Hand, 1981). It can be readily verified that the two-component equiprobable mixture with component means at ±1 and standard deviations of 1.5 is virtually equal to a normal distribution of mean 0 and standard deviation 1.85 (e.g., the distance between the two densities in the L2-norm sense is 50 times smaller than the L2 norm of either of them). A sample of enormous size would be necessary to tell one mixture from the other in terms of likelihood of the parameters.
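The near-equality claimed for this one-dimensional example can be verified numerically, using the standard closed form ∫N(x; a, s²)N(x; b, t²)dx = N(a; b, s² + t²) for Gaussian inner products; the function names below are ours.

```python
import math

def gauss(x, mu, var):
    # value of the normal density N(x; mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def inner(comp_p, comp_q):
    # <p, q> for two Gaussian mixtures given as (weight, mean, variance) triples,
    # using: integral of N(x; a, s2) * N(x; b, t2) dx = N(a; b, s2 + t2)
    return sum(wp * wq * gauss(mp, mq, vp + vq)
               for wp, mp, vp in comp_p for wq, mq, vq in comp_q)

mixture = [(0.5, -1.0, 1.5 ** 2), (0.5, 1.0, 1.5 ** 2)]   # means at +-1, sd 1.5
single = [(1.0, 0.0, 1.85 ** 2)]                           # N(0, 1.85^2)

dist2 = inner(mixture, mixture) - 2 * inner(mixture, single) + inner(single, single)
ratio = math.sqrt(dist2) / math.sqrt(inner(mixture, mixture))
print(ratio)  # about 0.02, i.e. the L2 distance is roughly 50x smaller than ||p||
```

The closed-form inner product avoids any numerical quadrature, so the check is exact up to floating-point precision.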
Thus, theoretical identifiability does not guarantee practical identifiability, and interpretation problems may still arise.
Acknowledgments

This work was supported by ESPRIT Long Term Research Project SPRACH (20077) and by a scholarship from the Spanish Ministry of Education and Science. M. A. C.-P. acknowledges helpful comments from Zoubin Ghahramani.

References

Carreira-Perpiñán, M. Á., & Renals, S. (1998). Dimensionality reduction of electropalatographic data using latent variable models. Speech Communication, 26, 259–282.
Everitt, B. S., & Hand, D. J. (1981). Finite mixture distributions. London: Chapman & Hall.
Gyllenberg, M., Koski, T., Reilink, E., & Verlaan, M. (1994). Non-uniqueness in probabilistic numerical identification of bacteria. J. Appl. Prob., 31, 542–548.
Hardcastle, W. J., Jones, W., Knight, C., Trudgeon, A., & Calder, G. (1989). New developments in electropalatography: A state-of-the-art report. J. Clinical Linguistics and Phonetics, 3, 1–38.
Marchal, A., & Hardcastle, W. J. (1993). ACCOR: Instrumentation and database for the cross-language study of coarticulation. Language and Speech, 36, 137–153.
Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329–350.

Received October 21, 1997; accepted July 2, 1998.
LETTER
Communicated by Michael Shadlen
The Effects of Pair-wise and Higher-order Correlations on the Firing Rate of a Postsynaptic Neuron

S. M. Bohte
Graduate School Neurosciences Amsterdam, AMC, Department of Visual System Analysis, University of Amsterdam, P.O. Box 12011, 1100 AA Amsterdam, and Netherlands Center for Mathematics and Computer Science (CWI), Amsterdam, The Netherlands
H. Spekreijse P. R. Roelfsema Graduate School Neurosciences Amsterdam, AMC, Department of Visual System Analysis, University of Amsterdam, P.O. Box 12011, 1100 AA Amsterdam, The Netherlands
Coincident firing of neurons projecting to a common target cell is likely to raise the probability of firing of this postsynaptic cell. Therefore, synchronized firing constitutes a significant event for postsynaptic neurons and is likely to play a role in neuronal information processing. Physiological data on synchronized firing in cortical networks are based primarily on paired recordings and cross-correlation analysis. However, pair-wise correlations among all inputs onto a postsynaptic neuron do not uniquely determine the distribution of simultaneous postsynaptic events. We develop a framework in order to calculate the amount of synchronous firing that, based on maximum entropy, should exist in a homogeneous neural network in which the neurons have known pair-wise correlations and higher-order structure is absent. According to the distribution of maximal entropy, synchronous events in which a large proportion of the neurons participates should exist even in the case of weak pair-wise correlations. Network simulations also exhibit these highly synchronous events in the case of weak pair-wise correlations. If such a group of neurons provides input to a common postsynaptic target, these network bursts may enhance the impact of this input, especially in the case of a high postsynaptic threshold. The proportion of neurons participating in synchronous bursts can be approximated by our method under restricted conditions. When these conditions are not fulfilled, the spike trains have less than maximal entropy, which is indicative of the presence of higher-order structure. In this situation, the degree of synchronicity cannot be derived from the pair-wise correlations.

1 Introduction

The occurrence of correlations in the spike trains of neurons responding to the same object has raised considerable excitement during the past decade

Neural Computation 12, 153–179 (2000)
© 2000 Massachusetts Institute of Technology
(review by Singer & Gray, 1995). Correlations between pairs of neurons are thought to reflect a high degree of synchronous firing within a larger assembly of neurons (Singer, 1995; Engel, König, Kreiter, Schillen, & Singer, 1992) and can have a high temporal precision, in the range of a few milliseconds (Eckhorn et al., 1988; Gray, König, Engel, & Singer, 1989; Roelfsema, Engel, König, & Singer, 1997; Alonso, Usrey, & Reid, 1996; Abeles, Bergman, Margalit, & Vaadia, 1993; Gray & Singer, 1989). von der Malsburg (1981) suggested that assemblies of neurons might convey additional information by firing in synchrony, since synchrony could be instrumental in forming relationships between the members of such an assembly. However, the possible relevance of fine temporal structure in spike trains opposes another widespread belief. The irregular timing of cortical action potentials is quite often attributed to stochastic forces acting on the neuron (Bair, Koch, Newsome, & Britten, 1994; Shadlen & Newsome, 1994). In such a stochastic model, the information is thought to be conveyed to the next processing stage (cortical layer) by pools of neurons using a noisy rate code. Each individual neuron is considered to be a slow, unreliable information processor, reflecting changes in its receptive field by modulating its average firing rate. Only by pooling the information from a larger number of neurons can a reliable rate code be obtained. Obviously, this scheme does not need precise timing of the individual spikes to convey much information. These two opposing views on the role of temporal structure in neuronal information processing are the subject of considerable debate (König, Engel, & Singer, 1996; Shadlen & Newsome, 1995; Softky & Koch, 1993). This debate has focused on two important questions: Is the cortical neuron a coincidence detector (on the millisecond timescale), and how much coincident input is there?
The first question refers to the relevance of synchronous presynaptic spikes. It has been suggested that synchronous input induces a higher firing rate in the postsynaptic target cell. Does this assumption hold, especially on a millisecond timescale? This question has been amply recognized, and several studies have attempted to answer it. Shadlen and Newsome (1995) argue that, based on physiological considerations, a cortical neuron is not capable of detecting very tightly synchronized input. However, others have argued that cortical neurons might have a high sensitivity to the synchronicity of their input (Softky, 1995; König et al., 1996). Softky (1995) pointed out that the biological data leave too many parameters undetermined to draw any definite conclusions on biological properties that distinguish the various models. Two further studies on the impact of synchronized input on a postsynaptic target reinforce this observation. Using detailed models of groups of neurons, Bernander, Koch, and Usher (1994) and Murthy and Fetz (1994) studied the impact of coincident input on the firing rate of a postsynaptic neuron. Their conclusions are similar to Softky's: within the biologically plausible parameter ranges, synchrony may either increase or decrease the firing rate of postsynaptic neurons.
Figure 1: (A) Three neurons with correlated activity. The pair-wise correlation coefficients are rAB, rAC, and rBC. (B) Examples of different spike configurations in windows of the same size. The horizontal lines represent the spike trains, and each tick denotes a spike. Spikes occurring within a time window (dotted box) are considered to be coincident. (C) Three examples of correlated spike trains. Pairs of neurons in the three panels have the same pair-wise correlation. Spike doublets are shown as unfilled arrows and triplets as filled arrows. The number of triplets differs from panel to panel.
In this study we attempt to shed more light on the second question: How much synchrony is there? In general, it is implicitly assumed that pair-wise correlations provide a good estimate of the amount of synchrony in a pool of neurons from which recordings are obtained. However, to date there are no direct electrophysiological measurements of large, synchronous pools of cortical neurons. Most of the physiological data on neuronal synchronization so far have been obtained using cross-correlation techniques (with the notable exception of the work of Abeles et al., 1993). These techniques merely provide information on the probability of events in which a pair of neurons fires at the same time (that is, within some time window). Unfortunately, pair-wise correlations provide only an indirect estimate of the probability of higher-order events, like the coincident firing of, say, 5 or 50 neurons. Even when the pair-wise correlations between all neurons of a network are fixed, the probability of these higher-order events remains undetermined, as is illustrated in Figure 1.
Pair-wise correlation is defined as the difference between the probability that two neurons fire simultaneously and the product of their firing rates (the coincidence rate dictated by chance). This value is the same for any pair of neurons in the three panels of Figure 1C. However, the number of spike triplets differs considerably from panel to panel. Given this example, it is quite clear that the number of neurons that fire within some time window is not exclusively determined by the pair-wise correlation coefficient. And although this is an artificial example, it is already quite difficult to determine how many of these triplets can be attributed to higher-order correlation and how many result from two doublets that happen to occur at the same time. In a first attempt to quantify the incidence of higher-order correlations, Martignon, von Hasseln, Grün, Aertsen, and Palm (1995) analyzed data from six cortical neurons. Unfortunately, we found that their methods cannot be used for the analysis of large numbers of neurons (as will be discussed). The main goal of this article is to study the relationship between pair-wise correlations and the amount of synchronicity in a pool of neurons and to determine the impact of the synchronous events on a postsynaptic target cell.

2 Solution of the Three-Neuron Problem

To illustrate the general methodology used for estimating the probability of higher-order events, we examine the three-neuron network of Figure 1. We wish to calculate the probability that a triplet (or an N cluster, where N equals 3) occurs within a given time window. The null hypothesis is that no structure is present in the spike trains other than the pair-wise correlations. That is, all triplets should be due to the occurrence of two doublets at the same time, by chance forming a triplet. Csiszár (1975) has proved the unique existence of a distribution with just this property.
The basic approach for calculating the probability distribution involves maximizing the informational entropy of the data while preserving the pair-wise correlations and the firing rates (see also Martignon et al., 1995). This informational entropy is a measure of the "order" in the data: the more structured the data, the lower the entropy. By measuring the pair-wise correlations and the firing rates, a certain degree of order is fixed. Taking these constraints into account, maximizing the entropy will minimize all higher-order correlations, since higher-order correlations would add structure to the distribution, further lowering the entropy. Therefore, maximal entropy implies minimal higher-order correlations. Our aim is to obtain the distribution of N clusters that has maximal entropy. We will illustrate the procedure by computing this distribution for three connected neurons. The neurons are labeled A, B, and C; their firing probabilities are f1A, f1B, and f1C; and the pair-wise correlation coefficients are denoted as rAB, rBC, and rAC (f1i denotes the probability that neuron i fires in a particular time bin and depends on the firing rate and the size of the time bins; the
rationale of the suffix 1 will become clear below). Given these coefficients as constraints, probabilities for all 2^n possible events can be calculated. For example, G010 designates the probability of the event in which B is firing and A and C are silent (see Figure 1B). There are eight probabilities to solve for. Since firing probabilities and pair-wise correlation coefficients are fixed, seven equations can easily be obtained. The sum of the probabilities of all possible events equals one:

G000 + G001 + G010 + G011 + G100 + G101 + G110 + G111 = 1.    (2.1)
The firing probability of neuron A is:

G100 + G110 + G101 + G111 = f1A.    (2.2–2.4)
(The equations for f1B and f1C are derived analogously.) Let f2AB denote the probability that A and B fire simultaneously (the suffix now indicates a cluster of size 2). f2AB is determined by the correlation coefficient rAB, since

rAB = (f2AB − f1A f1B) / sqrt((f1A − f1A²)(f1B − f1B²)).    (2.5)
This implies that rAB determines f2AB, and since f1A and f1B are fixed, this yields:

G110 + G111 = f2AB,    (2.6–2.9)
and corresponding equations are derived for rAC and rBC. These seven equations (2.1–2.4 and 2.6–2.9) leave one free parameter (G111, for example), which determines the entropy of the probability distribution. By calculating the value at which the distribution has maximal entropy, we fix this last free parameter. The entropy function is defined as

S ≡ −Σi Gi ln(Gi),    (2.10)

where the sum runs over all possible spike configurations.
The entropy is maximized by solving the following equation, which has a unique solution (see the appendix):

∂S/∂G111 = 0.    (2.11)
Using this equation, the probability distribution of configurations with the desired zero higher-order correlation can easily be calculated (see also Martignon et al., 1995).
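A minimal numerical sketch of this three-neuron construction, assuming the parameterization of equations 2.1–2.9 and exploiting the fact that the entropy is concave in the single free parameter G111 (so a simple ternary search can replace a closed-form solution of equation 2.11); the function name `maxent_three` is ours:

```python
import math

def maxent_three(fA, fB, fC, rAB, rAC, rBC):
    """Max-entropy joint distribution of three binary neurons given
    firing probabilities and pair-wise correlation coefficients."""
    def pair(f1, f2, r):
        # invert eq. 2.5: joint firing probability from the correlation
        return r * math.sqrt(f1 * (1 - f1) * f2 * (1 - f2)) + f1 * f2
    fAB, fAC, fBC = pair(fA, fB, rAB), pair(fA, fC, rAC), pair(fB, fC, rBC)

    def dist(t):
        # all eight probabilities expressed via the free parameter G111 = t
        return {
            '111': t, '110': fAB - t, '101': fAC - t, '011': fBC - t,
            '100': fA - fAB - fAC + t, '010': fB - fAB - fBC + t,
            '001': fC - fAC - fBC + t,
            '000': 1 - fA - fB - fC + fAB + fAC + fBC - t,
        }

    def entropy(t):
        G = dist(t).values()
        if any(g <= 0 for g in G):
            return -math.inf          # outside the feasible region
        return -sum(g * math.log(g) for g in G)

    # the entropy is concave in t, so a ternary search finds the maximum
    lo, hi = 0.0, min(fAB, fAC, fBC)
    for _ in range(100):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if entropy(m1) < entropy(m2):
            lo = m1
        else:
            hi = m2
    return dist((lo + hi) / 2)

G = maxent_three(0.2, 0.2, 0.2, 0.1, 0.1, 0.1)
print(G['111'])   # triplet probability under zero higher-order correlation
```

Because all eight probabilities are affine in G111, the second derivative of the entropy is strictly negative on the feasible interval, which is what licenses the ternary search.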
3 Calculating the Distribution with N Identical Neurons

For more than three neurons, the equations can no longer be solved easily by analytical means. For instance, for four neurons, the same calculations yield the entropy as a function of five free parameters. Calculating the maximum of this function is already quite complicated. Extending this to N neurons, only 1 + N(N + 1)/2 equations are determined by the pair-wise correlations and firing probabilities, with 2^N parameters to solve for. To overcome this problem, an iterative algorithm for calculating the maximal entropy distribution has been provided by Gokhale and Kullback (1978). Unfortunately, this algorithm has a serious drawback: the number of calculations increases exponentially with the number of neurons. If one is interested in the behavior of many correlated neurons, computational restrictions prohibit this method of calculating the distribution for more than 20 neurons on a workstation, putting more realistic numbers of neurons out of reach of even the fastest supercomputers. To overcome the problem of increasing computation time, we made the additional assumption that all neurons have identical properties, which implies a major degeneration of the probability space. Since any two neurons are equal, the same will hold for the probability of all permutations of spike configurations. In addition, the N firing probabilities and the N(N − 1)/2 pair-wise correlation coefficients will also be equal. The distribution of spike configurations is described by N + 1 variables, D0 through DN, where Di denotes the probability of a particular spike configuration in which exactly i neurons fire. In the case of N = 7: D1 = G1000000 = G0100000 = · · · = G0000001; D2 = G1100000 = G1010000 = G0010010, etc. Since all permutations of i spiking and N − i silent neurons have the same probability Di, the requirement that all probabilities add up to 1 now reads

1 = Σ_{i=0}^{N} C(N, i) · Di,    (3.1)

where C(N, i) denotes the binomial coefficient.
The firing probability of a cell equals (see the appendix)

f1 = Σ_{i=1}^{N} C(N − 1, i − 1) · Di.    (3.2)
Under the assumption of identical neurons, f1B equals f1A in equation 2.5. Rewriting equation 2.5, replacing f1A and f1B by f1 and f2AB by f2, yields

r = (f2 − f1²) / (f1(1 − f1)).    (3.3)
Thus, f2 can once again be calculated from the firing probability of a neuron and the correlation. f2, the probability that any two particular neurons fire at the same time, equals

f2 = Σ_{i=2}^{N} C(N − 2, i − 2) · Di.    (3.4)
By maximizing the entropy S, we derive the remaining N − 2 equations. The entropy is defined as

S = −Σ_{i=0}^{N} C(N, i) Di ln(Di).    (3.5)
Maximization of entropy yields

∂S/∂Di = 0    for i = 3, . . . , N.    (3.6)
A proof of a unique solution of this set of equations is included in the appendix. After some calculations, the N − 2 equations turn out to have the form

ln(Di) = (1/2)(i − 1)(i − 2) ln(D0) − i(i − 2) ln(D1) + (1/2) i(i − 1) ln(D2)    for i = 3, . . . , N.    (3.7)
Inserting these N − 2 equations into equations 3.1, 3.2, and 3.4, we can numerically solve for D0, D1, and D2 using the Newton-Raphson method for nonlinear systems of equations (Press, Flannery, Teukolsky, & Vetterling, 1986). The maximal entropy distribution, thus defined, depends on only two parameters: the firing probability f1 and the pair-wise correlation r. Figure 2 illustrates the shape of the N-cluster distribution with maximal entropy for a network of 150 neurons and its dependence on f1 and r. Figure 2 shows the probability Pi that a particular number of neurons fire simultaneously, where Pi equals the sum of the probabilities of all configurations in which exactly i neurons fire:

Pi = C(N, i) Di.    (3.8)
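The Newton-Raphson solution described here can be sketched as follows. The implementation details (finite-difference Jacobian, step damping, the small example values, and the function name) are ours, not taken from the article:

```python
import math

def maxent_clusters(N, f1, r, iters=60):
    """Maximal-entropy cluster distribution for N identical neurons with
    firing probability f1 and pair-wise correlation r (eqs. 3.1-3.7).
    Returns P[i] = C(N, i) * D_i, the probability that exactly i fire."""
    f2 = r * f1 * (1.0 - f1) + f1 ** 2        # invert eq. 3.3

    def D(i, a, b, c):                        # log-linear form of eq. 3.7
        return math.exp(0.5 * (i - 1) * (i - 2) * a
                        - i * (i - 2) * b
                        + 0.5 * i * (i - 1) * c)

    def residuals(x):                         # constraints 3.1, 3.2 and 3.4
        a, b, c = x
        return [
            sum(math.comb(N, i) * D(i, a, b, c) for i in range(N + 1)) - 1.0,
            sum(math.comb(N - 1, i - 1) * D(i, a, b, c) for i in range(1, N + 1)) - f1,
            sum(math.comb(N - 2, i - 2) * D(i, a, b, c) for i in range(2, N + 1)) - f2,
        ]

    # start from the independent solution D_i = f1^i (1 - f1)^(N - i)
    a = N * math.log(1.0 - f1)
    logit = math.log(f1 / (1.0 - f1))
    x = [a, a + logit, a + 2.0 * logit]
    for _ in range(iters):                    # damped Newton-Raphson steps
        F = residuals(x)
        h = 1e-7
        J = [[(residuals([x[k] + (h if k == j else 0.0) for k in range(3)])[i]
               - F[i]) / h for j in range(3)] for i in range(3)]
        # solve J dx = -F by Gaussian elimination with partial pivoting
        M = [J[i] + [-F[i]] for i in range(3)]
        for col in range(3):
            piv = max(range(col, 3), key=lambda row: abs(M[row][col]))
            M[col], M[piv] = M[piv], M[col]
            for row in range(col + 1, 3):
                fac = M[row][col] / M[col][col]
                M[row] = [u - fac * v for u, v in zip(M[row], M[col])]
        dx = [0.0, 0.0, 0.0]
        for i in (2, 1, 0):
            dx[i] = (M[i][3] - sum(M[i][j] * dx[j] for j in range(i + 1, 3))) / M[i][i]
        damp = max(1.0, max(abs(d) for d in dx) / 0.5)   # limit the step size
        x = [xi + di / damp for xi, di in zip(x, dx)]
    a, b, c = x
    return [math.comb(N, i) * D(i, a, b, c) for i in range(N + 1)]

# Example on a small network; larger N (e.g. 150, as in Figure 2) works the same.
P = maxent_clusters(10, 0.1, 0.05)
```

Solving for only (ln D0, ln D1, ln D2) is what makes the method tractable for large N, since equation 3.7 generates all remaining Di from these three values.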
For small values of r, the distribution of N clusters approaches a binomial distribution. This determines the first peak of the N-cluster distribution,
corresponding to small cluster sizes. For larger values of r, a second peak appears in the distribution, the amplitude of which grows with increasing r. A second influence of increasing r is a divergence of the two peaks. This divergence can be seen most clearly in the contour plots of Figures 2D–F. Indeed, in the limiting case of r = 1, the first peak approaches a cluster size of 0 and the second peak a cluster size of 150, since all neurons fire at exactly the same time. An increase in f1 shifts the first peak of the distribution to larger values, as is predicted by the binomial distribution (see Figures 2A–C). Remarkably, an increase in f1 is also associated with a shift of the second peak, to smaller values. Thus, the maximal entropy distribution predicts that low firing probabilities are associated with sparse but highly synchronized bursts. With a higher firing probability, synchronous bursts occur more frequently but comprise fewer spikes.

4 An Artificial Neural Network

We now compare N-cluster distributions obtained from simulations of an artificial neural network to the maximal entropy distribution. Any difference between the two distributions can then be attributed to higher-order correlations. The network used is based on a network described by Deppisch et al. (1993), which, in turn, is based on the work of MacGregor and Oliver (1974). This network was chosen since it could easily be adapted to consist of identical neurons.

4.1 The Neuron Model. The neuron model involves four variables: the membrane potential E(t), a potassium current g(t), the spiking threshold h(t), and the neuronal output o(t). The dynamics of neuron i are described by four coupled equations:

τE dEi(t)/dt = −(Ei(t) − E0) − (gi(t) − g0)(Ei(t) − Ek) + ηi(t) − [Σj wij · oj(t − tij)] (Ei(t) − Eex)    (4.1)

τh dhi(t)/dt = −(hi(t) − h0) + c(Ei(t) − E0),    0 ≤ c ≤ 1    (4.2)

τg dgi(t)/dt = −(gi(t) − g0) + τg b oi(t)    (4.3)

oi(t) = 1 if Ei(t) > hi(t), 0 otherwise.    (4.4)
Figure 2: (A–C) Probability of N clusters (p) as a function of the pair-wise correlation r, for three firing probabilities, f1 = 0.05 (A), 0.146 (B), and 0.225 (C). Probabilities are clipped at a value of 10^-4. (D–F) Contour plots of the same data show that the separation between the two peaks becomes larger with increasing r. Contour levels indicate probabilities of 0.001, 0.01, and 0.1.
Without input, the membrane potential Ei (t) is driven towards its resting value E0 with a time constant tE . An inux of potassium drives the potential toward the potassium equilibrium potential Ek . Excitatory input moves the potential toward the ionic equilibrium value Eex . In the simulations, the
values used by Deppisch et al. (1993) were adopted: E0 = 0, τE = 2.5 msec, Ek = −1, Eex = 7. The membrane time constant τE may appear to be rather small, although some have argued that it may be within the biologically plausible range (König et al., 1996; Bernander, Douglas, Martin, & Koch, 1991). In the discussion we will address the dependence of our results on this particular choice. Input to the cells was strictly excitatory, with synaptic delay tij = 0.5 msec. External input is provided by external stochastically spiking units and is treated equivalently to internal input. Internal noise is added by the additive noise term ηi(t) in equation 4.1, with a standard deviation of 0.06 · Ei(t). Equation 4.2 describes the slow adaptation of the threshold hi to the membrane potential, modeling an adaptation of the neuron to excitation (h0 = 1, τh = 10 msec, c = 0.3). Equation 4.3 governs the potassium current decay dynamics, which is driven toward g0 = 0 with time constant τg (5 msec). In the case of a spike, the potassium current rises by an amount b (4.0) corresponding to the outward potassium current. A spike is generated each time the membrane potential exceeds the threshold (see equation 4.4), after which the neuron is in a refractory state for another 1.5 msec. In the actual numerical implementation of the neuron model, a discrete-time analog of equations 4.1–4.4 was used. These equations were iterated with time steps corresponding to 0.5 msec of real time. Each neuron was connected to every other neuron with a fixed homogeneous synaptic weight wij = wint/N > 0. Here, wint denotes the sum of the strengths of all synapses from within the network impinging on a single cell. External input consisted of N independent stochastic elements generating spikes with p = 0.05 per msec (50 Hz), connected in a one-to-one fashion with the network neurons with a constant weight wext = 0.8. The parameter wint was varied during the simulation.
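A discrete-time sketch of this network is given below. The Euler-style update, the handling of the noise term, and the treatment of the one-step synaptic delay are our assumptions rather than the authors' exact implementation; the parameter values are the ones quoted in the text.

```python
import random

# Discrete-time sketch of the network of eqs. 4.1-4.4 (dt = 0.5 msec).
random.seed(0)
N = 150
dt = 0.5                                    # msec per time step
tau_E, tau_h, tau_g = 2.5, 10.0, 5.0
E0, Ek, Eex = 0.0, -1.0, 7.0
h0, c, g0, b = 1.0, 0.3, 0.0, 4.0
w_int, w_ext = 3.875, 0.8                   # intermediate-mode internal weight
p_ext = 0.05 * dt                           # 50-Hz external units, per step

E = [E0] * N
h = [h0] * N
g = [g0] * N
refrac = [0] * N                            # remaining refractory steps

def step(prev_o):
    rec = (w_int / N) * sum(prev_o)         # all-to-all input, one-step delay
    new_o = [0] * N
    for i in range(N):
        ext = w_ext if random.random() < p_ext else 0.0
        eta = random.gauss(0.0, 0.06 * abs(E[i]))    # additive noise term
        E[i] += dt / tau_E * (-(E[i] - E0) - (g[i] - g0) * (E[i] - Ek)
                              + eta - (rec + ext) * (E[i] - Eex))
        h[i] += dt / tau_h * (-(h[i] - h0) + c * (E[i] - E0))
        spike = 0
        if refrac[i] > 0:
            refrac[i] -= 1
        elif E[i] > h[i]:
            spike = 1
            refrac[i] = 3                   # 1.5-msec refractory period
        g[i] += dt / tau_g * (-(g[i] - g0) + tau_g * b * spike)
        new_o[i] = spike
    return new_o

o = [0] * N
counts = []                                 # summed network activity per step
for _ in range(2000):                       # one second of simulated time
    o = step(o)
    counts.append(sum(o))
```

Histogramming `counts` over 2-msec windows is the kind of raw material from which the N-cluster distributions of the following sections are built.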
Increasing the value of the weights drives the network from a stochastic mode into an oscillatory mode (Deppisch et al., 1993). Somewhere within this range of synaptic weights, the network alternates between the stochastic and oscillatory modes. All results are based on simulations of a network of N = 150 neurons, with the exception of the results in section 4.5. This was a compromise between a realistic number of neurons and a network consideration: the absence of inhibitory neurons makes a network very quickly prone to saturation, a state in which the constant excitation induces a very sharp oscillation. A further increase in the number of neurons also narrows the weight range in which the network is in the alternating state. The network output was obtained by recording the output of all 150 neurons. A sample of the output of 50 neurons is shown in Figure 3.

4.2 Network Simulations. In order to vary the average pair-wise correlation, simulations were performed with different values of the synaptic weight wint. As was noted by Deppisch et al. (1993), the network exhibits three distinct modes, which depend on the value of the synaptic weight.
Effects of Pair-wise and Higher-order Correlations
163
Figure 3: (Upper panel) Activity of 50 of the 150 neurons. Time is displayed on the horizontal axis (sampling time, 0.5 msec). The vertical ticks indicate action potentials. Oscillatory episodes alternate with periods of more random activity. (Lower panel) Summed activity of all 150 neurons.
The first, at low values of the synaptic weights, is the stochastic mode, in which the average correlation between pairs of neurons is near zero. At the other end of the scale is the mode with high values of the internal weight. In this mode, the network activity is highly oscillatory. In between these extremes are synaptic weight values for which the network exhibits episodes in which many neurons fire synchronously, alternating with episodes in which neurons are synchronized to a lesser degree. Typical activity of the neurons in these three network modes is shown in Figures 4A–C. Also plotted are the cross-correlations between a pair of neurons in the network (see Figures 4D–F). As observed by Deppisch et al. (1993), these cross-correlograms show qualitative resemblance to data obtained from electrophysiological recordings (e.g., König et al., 1995). Remarkably, the network with intermediate synaptic strength (w = 3.875, r = 0.03, for 2-msec bin size) exhibits occasional population bursts, which barely show up in the correlation function (see Figures 4B and 4E). Thus, a network state in which a number of highly synchronous population bursts occur can be associated with correlation functions indicative of weak pair-wise coupling. When investigating the relationship between occurrences of N clusters and the pair-wise correlation, an additional assumption has to be made with regard to the maximal time difference between two spikes that are considered to be synchronous. As a first approach, we used a time window of 2 msec. The maximal firing rate of the neurons is determined by the refractory period (1.5 msec) and the duration of an action potential (0.5 msec). During network bursts, neurons reach this maximal firing rate, and a single spike occurs in each 2-msec bin. The effect of changing the bin width on the distribution of cluster sizes will be investigated in section 4.4.

Figure 4: Network activity for three values of the synaptic weight, wint. (A–C) Summed activity of the 150 neurons. The synaptic weights were 3.5 (A, r = 0.003), 3.875 (B, r = 0.03), and 4.375 (C, r = 0.20). Pair-wise correlations were determined for coincidences within a window of 2 msec. (D–F) Cross-correlation functions of two randomly selected neurons for the same synaptic weights as on the left. Even for a low pair-wise correlation (r = 0.03), highly synchronous network bursts occur, but these barely show up in the cross-correlogram.
Figure 5 shows the relationship between the occurrence of N clusters and the pair-wise correlation. Plotted is the probability Pi (see also equation 3.8) of a particular number of coincident spikes, for simulations with different pair-wise correlations (different values of wint). As can be seen in Figure 5A, most 2-msec bins are occupied by N clusters containing a fairly low number of spikes. This corresponds to the stochastic activity between bursts, and the probability of these events is approximated by a binomial distribution. Synchronous bursts are represented by the second peak in the probability distribution, as can be seen more clearly in the logarithmic plot of Figure 5B. Between these two peaks is a plateau of time bins in which an intermediate number of neurons fire. These events are caused by time bins that are aligned on the onset or end of a burst. Increases in the pair-wise correlation are associated with a slightly smaller first peak and an enhanced second peak. For very large, and biologically implausible, pair-wise correlations, a third peak containing intermediate cluster sizes is visible, which can be attributed to the onset and offset of bursts. Let us now consider the impact of this distribution on a postsynaptic cell receiving input from all network neurons. For such a cell, the exact number of synchronous spikes (the quantity plotted in Figures 5A and 5B) is not important. Rather, for a postsynaptic neuron with a firing threshold of m excitatory postsynaptic potentials (EPSPs), the incidence of m or more coincident spikes is a much more significant quantity, since this determines its firing rate to a large extent. Therefore, we computed a cumulative probability distribution: the probability that m or more neurons fire synchronously. The cumulative probability distributions (the cumulatives of Figure 5B) are plotted in Figure 5C.
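In implementation terms, the distribution of cluster sizes and its cumulative can be obtained directly from a binary spike raster. The following is a minimal sketch; the raster layout (one tuple of 0/1 values per neuron per time bin) and the helper names are assumptions, not the authors' code:

```python
from collections import Counter

def cluster_distribution(raster):
    """P_i: probability that exactly i neurons fire in a time bin.

    raster: sequence of time bins, each a tuple of 0/1 values per neuron.
    """
    counts = Counter(sum(time_bin) for time_bin in raster)
    n_bins = len(raster)
    return {i: c / n_bins for i, c in counts.items()}

def cumulative_distribution(p_i, n_neurons):
    """P(cluster size >= m): the firing probability of a zero-memory
    postsynaptic neuron with a threshold of m coincident EPSPs."""
    return {m: sum(p for i, p in p_i.items() if i >= m)
            for m in range(n_neurons + 1)}

# Toy raster of 4 neurons: three bins of stochastic activity,
# one synchronous population burst.
raster = [(1, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0), (1, 1, 1, 1)]
p_i = cluster_distribution(raster)        # {1: 0.75, 4: 0.25}
cum = cumulative_distribution(p_i, 4)
# A neuron with threshold m = 2 fires only on the burst: cum[2] == 0.25
```
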
Suppose that the membrane potential of a postsynaptic neuron, which receives input from all cells in our network, equals the resting potential at time t. The probability that this neuron fires at time t + 1 is equal to the probability of h′ or more coincident spikes, where h′ is the ratio between the postsynaptic threshold (h) and the EPSP amplitude. In other words, the curves in Figure 5C illustrate the relation between the threshold and the firing probability of a neuron that receives input from all neurons in the network, in the case of a very short membrane time constant. The curve for a very low pair-wise correlation coefficient (r = 0.003) shows the rapidly declining probability of higher-order events. The curves for network activity with a larger pair-wise correlation exhibit a plateau before their decline at very high numbers of synchronized spikes. This plateau results from the second peak in the N-cluster probability distribution. It is remarkable that even small changes in the strength of the pair-wise correlations, which are not physiologically implausible (e.g., Gray et al., 1989; Livingstone, 1996), have a strong effect on the firing probability in the case of an intermediate threshold h′. Indeed, the firing probability of a neuron with a threshold of, say, 50 EPSPs is raised by an order of magnitude due to an increase in the pair-wise correlation as small as 0.08 (compare the distributions for r = 0.03 and r = 0.12).
The question remains how closely the firing rate of a postsynaptic neuron with a nonzero membrane time constant is approximated by the cumulative probability distribution. There would obviously be a one-to-one relationship for a postsynaptic neuron that has "zero" memory—that is, a neuron that is influenced only by the input in the previous time step (the bin size for which spikes are considered synchronous, as discussed above). However, typical neurons have parameters with larger time constants, including the refractory period and the time constant of the membrane. In order to assess the impact of this "nonzero memory," an additional, 151st neuron was included in the network. This cell was identical to the other 150 neurons from which it received input, but it did not project back to them. The pair-wise correlation was fixed at 0.12, and simulations were performed with different values of the postsynaptic threshold h′ by varying wi,151 (1 ≤ i ≤ 150), the strength of the synapses projecting onto neuron 151. The dependence of the firing probability on h′ (= h / wi,151) is shown in Figure 5D. Superimposed on this graph is the distribution of N clusters. As expected, some minor differences between the firing probability and the cumulative probability distribution are observed. Most of these differences can be explained by the fact that the nonzero average activity in a pool of neurons keeps the membrane potential of individual neurons at a somewhat higher level than the resting potential. This lowers the average number of coincident spikes the neuron requires to cross threshold. On the whole, however, the firing probability of the postsynaptic neuron with a nonzero membrane time constant is predicted with good accuracy by the cumulative probability distribution. A remarkable feature of Figure 5D is that the firing probability of a neuron with a threshold of, for example, 40 does not differ much from that of a neuron with a much higher threshold (e.g., 130).
Most of the spikes of a postsynaptic cell with a threshold larger than 40 are triggered by synchronous bursts in which the majority of neurons participate.
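The effect of a nonzero membrane time constant can be illustrated with a simple leaky integrator reading out the per-bin coincidence counts. This is a generic sketch, not the neuron model used in the simulations, and all parameter values here are illustrative assumptions:

```python
import math

def leaky_readout(coincidence_counts, epsp=1.0, theta=50.0,
                  tau_m=5.0, dt=2.0, refractory=1):
    """Membrane potential decays with time constant tau_m (msec) between
    bins of width dt (msec); each bin adds epsp times the number of
    coincident presynaptic spikes. Returns the bins in which the readout
    neuron fired (threshold h' = theta / epsp coincident EPSPs)."""
    decay = math.exp(-dt / tau_m)
    v, spikes, ref = 0.0, [], 0
    for t, n_spikes in enumerate(coincidence_counts):
        if ref > 0:                 # absolute refractory period (in bins)
            ref -= 1
            v = 0.0
            continue
        v = v * decay + epsp * n_spikes
        if v >= theta:
            spikes.append(t)
            v, ref = 0.0, refractory
    return spikes

# Stochastic background (~10 coincident spikes per bin) never reaches
# the threshold of 50 EPSPs, but a population burst of 60 does:
counts = [10, 12, 9, 60, 11, 10]
# leaky_readout(counts) -> [3]
```

With a short tau_m relative to dt the leak discards most of the accumulated background, which is why the zero-memory cumulative distribution remains a good predictor of the firing probability.
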
Figure 5: Facing page. (A,B) Probability distribution of N clusters, plotted on a normal scale (A) and a logarithmic scale (B). (C) Cumulative probability of N clusters. Abscissa, cluster size (h′); ordinate, probability of a cluster with a size that is equal to or larger than h′. (D) Comparison of the cumulative N-cluster distribution (solid line) to the firing probability of a postsynaptic neuron that receives input from all 150 neurons in the network. The threshold h′ is defined as h / wi,151. (E) Dependence of the firing probability of a neuron receiving input from all 150 neurons on its relative threshold h′ and the pair-wise correlation in the network. Different curves correspond to different values of h′. Triangles indicate the dependence of the average firing probability f1 on the pair-wise correlation. (F) Same as E, but the firing probability of the postsynaptic cell and f1 are plotted as a function of wint.
Using this interpretation of the cumulative probability distribution, the effect of an increase in the pair-wise correlation in the network on postsynaptic neurons was investigated by varying the synaptic weight (wint). Figure 5E shows the relationship between the average pair-wise correlation and the firing probability of a hypothetical postsynaptic neuron with threshold h′. Calculated were the firing probabilities for thresholds h′ = 30, 50, and 100. It can be seen that in the biologically relevant range of pair-wise correlations (r = 0–0.2), there is a monotonic relationship between the postsynaptic firing probability and the pair-wise correlation. For values of r below 0.2, changes in the pair-wise correlation are associated with a relatively large increase in the firing probability of the postsynaptic neuron. Importantly, higher values of r are also associated with an enhanced firing rate of the presynaptic network neurons (see Figure 5E). This increase in activity undoubtedly contributes to the enhanced probability of higher-order events. However, it should be noted that the probability of higher-order events exhibits a steeper dependence on r than the firing probability of the network neurons (see Figure 5E). Figure 5F illustrates the dependence of the firing probability of the postsynaptic neuron on wint, the parameter that was actually varied during the simulations.

4.3 Estimation of the N-Cluster Distribution by Entropy Maximization. In most electrophysiological experiments, only data on firing probabilities and pair-wise correlations are available. Using the mathematical framework developed in section 3, we investigated whether the observed relationship between the probability of N clusters and the correlation coefficient could have been predicted from these two types of measurements. For the network behavior at different values of wint (3.5, 3.875, and 4.125), distributions of N clusters were calculated that maximized entropy under the constraints of the observed firing probability and pair-wise correlations. A comparison between the distributions based on the maximal entropy calculation and the experimental distributions is shown in Figures 6A–C. The distribution that maximizes the entropy deviates somewhat from the observed distribution for larger values of r. The maximal entropy calculation underestimates the incidence of clusters between 60 and 100 spikes, which occur during the start and end of population bursts, as was discussed above.
In addition, the second peak in the maximal entropy distribution is located at a cluster size of 130, whereas the actual location of this peak is 150: typically, all neurons fire within a 2-msec window during a network burst. In order to estimate the effect of these discrepancies on the firing probability of a neuron receiving input from the network, the cumulative probability distributions are plotted in Figures 6D–F. It can be seen that the underestimation of the incidence of intermediate cluster sizes by the maximal entropy calculation is compensated by the overestimation of the incidence of clusters with a size between 100 and 140. For values of h′ smaller than 130, the largest deviation is about a factor of 2. This approximation is reasonable, given that the maximal entropy calculation depends on only two parameters estimated from the network activity: the firing probability and the pair-wise correlation. Nevertheless, large deviations occur for values of h′ larger than 130. However, such high threshold values are physiologically implausible, because they would imply that a postsynaptic neuron fires only when almost every input is active within a narrow time window.
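The entropy maximization reduces (see the appendix) to three nonlinear equations in D0, D1, and D2, with the remaining pattern probabilities fixed by ln(Di) = ½(i−1)(i−2) ln(D0) − i(i−2) ln(D1) + ½ i(i−1) ln(D2) for i ≥ 3, solved by Newton–Raphson iteration. The original computation was done in Maple; the following pure-Python reimplementation, including its numerical Jacobian and starting guess, is a sketch under those assumptions:

```python
from math import comb, exp, log

def maxent_distribution(n, f1, f2, iters=50):
    """D[i] = probability of one specific pattern with i active neurons,
    maximizing entropy subject to normalization (eq. 3.1), firing
    probability f1 (eq. 3.2), and pair firing probability f2 (eq. 3.4)."""
    def full_d(y):                        # y = (ln D0, ln D1, ln D2)
        d = [0.0] * (n + 1)
        d[0], d[1], d[2] = exp(y[0]), exp(y[1]), exp(y[2])
        for i in range(3, n + 1):         # eq. 3.7
            d[i] = exp(0.5*(i-1)*(i-2)*y[0] - i*(i-2)*y[1]
                       + 0.5*i*(i-1)*y[2])
        return d

    def residual(y):                      # constraint violations
        d = full_d(y)
        return [sum(comb(n, i) * d[i] for i in range(n + 1)) - 1.0,
                sum(comb(n-1, i-1) * d[i] for i in range(1, n + 1)) - f1,
                sum(comb(n-2, i-2) * d[i] for i in range(2, n + 1)) - f2]

    # Start from the independent (binomial) solution with p = f1.
    p, q = f1, 1.0 - f1
    y = [n*log(q), log(p) + (n-1)*log(q), 2*log(p) + (n-2)*log(q)]
    for _ in range(iters):
        r = residual(y)
        if max(abs(v) for v in r) < 1e-12:
            break
        h = 1e-7                          # numerical Jacobian
        J = []
        for k in range(3):
            yk = list(y); yk[k] += h
            rk = residual(yk)
            J.append([(rk[j] - r[j]) / h for j in range(3)])
        a = [[J[k][j] for k in range(3)] for j in range(3)]  # a[j][k] = dr_j/dy_k
        def det3(m):
            return (m[0][0]*(m[1][1]*m[2][2] - m[1][2]*m[2][1])
                    - m[0][1]*(m[1][0]*m[2][2] - m[1][2]*m[2][0])
                    + m[0][2]*(m[1][0]*m[2][1] - m[1][1]*m[2][0]))
        det = det3(a)
        for k in range(3):                # Cramer's rule for the 3x3 step
            ak = [row[:] for row in a]
            for j in range(3):
                ak[j][k] = -r[j]
            y[k] += det3(ak) / det
    return full_d(y)
```

Note that for the network sizes used here (n = 150), binomial coefficients such as comb(150, 75) exceed the dynamic range of double precision while the Di are correspondingly tiny, so a practical implementation would carry the sums out in log space.
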
Figure 6: Comparison of the probability of N clusters in the network simulation and their probability based on maximal entropy. (A–C) Probabilities of N clusters in the simulation are shown as solid curves for three values of the synaptic weights. Dashed lines: the corresponding maximal entropy probability distributions with the same firing probability and pair-wise correlation. (D–F) Relationship between firing probability (ordinate) and threshold (abscissa) of a neuron with an integration window of 2 msec that receives input from all 150 neurons of the network (cumulatives of A–C).
4.4 Effects of Varying Bin Width on the Maximal Entropy Distribution. The results of the maximal entropy calculation described so far were obtained with a bin width of 2 msec, which equals the minimal interspike interval. Figure 7 shows the dependence of the N-cluster distribution and the maximal entropy estimation on the bin width. Spike trains obtained with a synaptic weight (wij) of 4.125 were rebinned, using bin widths of 0.5, 1.0, 2.0, and 4.0 msec. The rebinning process influenced the pair-wise correlation and f1, the probability of firing in a time bin (see equation 3.2). Smaller bin widths reduce f1. This results in a leftward shift in the location of the first peak in the distribution of cluster sizes, which represents stochastic activity between bursts (compare Figures 7A and 7B to Figure 7C). A second effect of reducing the bin width is the disappearance of clusters with sizes larger than 110. This disappearance is related to the refractory period, which prohibits neurons from firing in consecutive bins during population bursts. Spikes fired by different neurons during these bursts are therefore divided between successive time bins, and no bins remain in which all neurons fire simultaneously. This modification of the distribution of cluster sizes is not captured by the maximal entropy estimation, which underestimates the incidence of clusters with sizes between 20 and 110 and overestimates the incidence of clusters with a size larger than 130 (see Figures 7A–B). In other words, the refractory period adds structure to the spike trains, which therefore have less than maximal entropy. When a bin width is used that is longer than the refractory period, this additional structure is lost, and the maximal entropy calculation may provide a reasonable estimate of the distribution of cluster sizes. For bin sizes longer than the minimal interspike interval, there are time windows during which individual cells fire more than a single spike.
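The rebinning can be sketched as follows, counting a neuron as active in a merged bin if it fired at least once there; the raster layout (one tuple of 0/1 values per neuron per bin) is an assumption:

```python
def rebin(raster, factor):
    """Merge `factor` consecutive time bins; a neuron is 'on' in the
    merged bin if it fired in any of the constituent bins."""
    merged = []
    for start in range(0, len(raster) - factor + 1, factor):
        window = raster[start:start + factor]
        merged.append(tuple(int(any(col)) for col in zip(*window)))
    return merged

# A 0.5-msec raster for 3 neurons, rebinned to 2 msec (factor 4):
raster = [(1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 0, 0),
          (0, 0, 1), (0, 0, 0), (0, 0, 0), (0, 0, 1)]
rebin(raster, 4)    # [(1, 1, 0), (0, 0, 1)]
```

Note that the two spikes of the first neuron in the first window collapse into a single "on" state, which illustrates the loss of spikes in the rebinning process mentioned in the text.
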
It is possible, in principle, to adapt the maximal entropy estimation to this situation. One approach, in the case of a bin size that may include two spikes, is to compute the probability that N neurons fire once and M neurons fire twice in a bin, for each combination of N and M. Unfortunately, this increases the number of variables that have to be calculated from 151 (D0 to D150) to more than 10,000. The computational requirements increase further if three or more spikes can occur in a single bin. Therefore, we took an alternative approach in which the state of a neuron was labeled “off” in the case of no spike and “on” in the case of one or more spikes within a time bin. This keeps the computational requirements within bounds, but at the cost of losing spikes in the rebinning process. Figure 7D compares the distribution of cluster sizes to the maximal entropy distribution for a bin width of 4 msec. The experimental distribution is largely described by a broad first peak, which is shifted to the right, and a very narrow second peak at a cluster size of 150. An increase in f1 also shifts the first peak of the maximal entropy distribution to the right. However, larger values of f1 are also associated with a leftward shift of the second peak in the distribution of maximal entropy, as was discussed in section 3. This causes large deviations for cluster sizes between 60 and 120, the incidence of which is overestimated by the maximal entropy estimation. Moreover, the narrow peak at a cluster size of 150 is absent in the distribution of maximal entropy. In summary, reasonable estimates of the N-cluster distribution are obtained only for a bin width that is equal to the minimal interspike interval. In the discussion, we will address the question of whether these results can be generalized to other network architectures.

Figure 7: (A–D) Comparison of the maximal entropy distribution to the distribution of N clusters in the simulation for bin sizes ranging from 0.5 msec to 4 msec. The quality of the maximal entropy approximation depends on the bin size, that is, on what is considered coincident. In our simulations, a bin size of 2 msec yields the best approximation (C). For smaller bin sizes (A: 0.5 msec, B: 1 msec), the actual and predicted distributions deviate considerably. (E–H) Impact of these deviations on a postsynaptic neuron with threshold h′.

4.5 Effects of Network Scaling. In order to investigate how the network behavior and the goodness of fit of the maximal entropy distribution depend on network size, simulations were run with networks composed of 75, 150, and 200 neurons. Figure 8A shows the cumulative distribution of N clusters for a network of 75 neurons with a wint of 3.55, which exhibited a pair-wise correlation of 0.07. It should be noted that wint represents the sum of the synaptic weights wij of all inputs converging onto a single neuron (section 4.1). When the network size is doubled, each cell receives input from twice as many neurons, and the strength of the individual synapses wij was reduced accordingly, in order to maintain a constant value of wint. Nevertheless, a doubling of the network size resulted in a clear leftward shift of the distribution of N clusters and a reduction of the pair-wise correlation to 0.003. This indicates that a constant value of wint is not sufficient to guarantee qualitatively similar network behavior when the network size is increased. Indeed, a larger number of synapses with reduced weight impinging on a network unit results in a reduction of the fluctuations in the input, as long as the network is in the stochastic mode. This reduces the probability of bursts in the network (Tsodyks & Sejnowski, 1995).
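This scaling argument can be made quantitative. Treating each of K inputs as an independent event of size w occurring with probability s·f per bin (s the release probability, f the firing probability), the summed input has mean K·w·s·f and variance K·w²·s·f·(1 − s·f). A worked sketch, with parameter values that are illustrative assumptions rather than the simulation's exact numbers:

```python
def input_stats(K, w, s, f):
    """Mean and variance of the summed synaptic input per time bin,
    treating each input as an independent Bernoulli(s*f) event of size w."""
    p = s * f
    return K * w * p, K * w * w * p * (1 - p)

m0, v0 = input_stats(K=75, w=0.047, s=1.0, f=0.05)     # reference network
# Doubling K while halving w preserves the mean but halves the variance:
m1, v1 = input_stats(K=150, w=0.0235, s=1.0, f=0.05)
# Doubling K while halving the release probability s preserves the mean
# and keeps the variance approximately constant:
m2, v2 = input_stats(K=150, w=0.047, s=0.5, f=0.05)
```

The first scheme shrinks the input fluctuations (and hence the burst probability), while the second, following Tsodyks and Sejnowski (1995), preserves them.
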
Tsodyks and Sejnowski (1995) have suggested that the variance in the input to network units may be kept approximately constant by reducing the release probability of the synapses, rather than their strength wij, when the network size is increased. Figure 8C shows the cumulative distribution of N clusters for a network of 150 neurons, in which the effective input strength was kept constant by reducing the release probability to 50%. The pair-wise correlation was 0.12, and the cumulative distribution of N clusters was qualitatively similar to that of the smaller network, in accordance with Tsodyks and Sejnowski (1995). Figure 8D shows a similar result for a network composed of 200 neurons. In all cases the estimate based on maximal entropy was reasonable (dashed lines in Figures 8A–D), which indicates that the quality of the maximal entropy calculation does not depend critically on the size of the network.

Figure 8: Effect of scaling the network size. (A) Continuous line: probability of N clusters with a minimal size of h′, as a function of h′, for a network with 75 neurons. Dashed line: distribution of maximal entropy with the same firing probability and pair-wise correlation. Parameters of the simulation: wij = 4.73 × 10^-2, r = 0.07. (B) N-cluster distribution for a larger network with 150 neurons. The total input converging on a neuron, wint, was kept constant by reducing wij to 2.37 × 10^-2. Nevertheless, the distribution of N clusters was shifted to the left, and the pair-wise correlation was reduced to 0.003. (C) Same as B, but the effective wint was kept constant by reducing the release probability to 50% rather than by reducing wij. The synaptic strength wij was 4.73 × 10^-2, and r = 0.12. (D) N-cluster distribution for a network of 200 neurons, with an r of 0.15. The synaptic weight wij was identical to that in A, but the release probability was reduced to 37.5%. Note the similarity of the distributions in C and D to the distribution in A.

5 Discussion

In most physiological studies on the synchronization behavior of cortical neurons, recordings are obtained from pairs of neurons or pairs of cell clusters. Our results illustrate that these data can supply only limited information about the probability of higher-order events, for example, the probability that 30 or more neurons that project to a target neuron fire simultaneously. As an approximation for the probability of higher-order events, we used the distribution of maximal entropy. This method provides the most unstructured distribution, given the constraints supplied by the firing probabilities and the pair-wise correlations. The N-cluster distribution with maximal entropy for a homogeneous network without higher-order correlations exhibits two peaks. The first peak represents stochastic activity and resembles a binomial distribution. This peak also occurs in the absence of correlations. If the firing probability is moderate (< 0.25), a second peak occurs at a relatively large cluster size. The magnitude of this second peak depends on the strength of the pair-wise correlation. Thus, an absence of higher-order correlations dictates that the network should generate coincidences in a number of tightly synchronized network bursts. The N-cluster distribution observed in the simulations also exhibited two peaks, although the position of the second peak was different from that in the maximal entropy distribution. Questions about the generality of these results, in particular about their dependence on the details of the network implementation, will have to await further experimentation. Previous studies on the impact of correlations among neurons that project to a common postsynaptic target have used N-cluster distributions with a drastically different shape (Bernander et al., 1994; Murthy & Fetz, 1994). In these studies, correlations were introduced in the input by forcing a subset of the presynaptic neurons to fire in perfect synchrony, but independently of the other presynaptic neurons. Since the resulting N-cluster distribution is relatively devoid of highly synchronous events and has less than maximal entropy, the generality of the results obtained in these earlier studies may also be limited. It is obvious that network bursts in which a large number of neurons participate are rather effective in driving a postsynaptic neuron, especially in the case of a high postsynaptic threshold. Indeed, a recent study (Alonso et al., 1996) uncovered tight correlations, with a peak width in the cross-correlogram of less than 1 msec, among pairs of neurons in the lateral geniculate nucleus (LGN).
In cases in which the LGN neurons projected to a common cortical neuron, the impact of synchronous events was stronger than that of asynchronous spikes. We observed an orderly relationship between the incidence of highly synchronous network bursts and the pair-wise correlation. Relatively small increases in the pair-wise correlation, which are not physiologically implausible, may raise the incidence of highly synchronous events, and thereby the firing rate of a postsynaptic cell, by an order of magnitude (see Figures 5C and 5E). We will now discuss the limitations of the maximal entropy estimation and the ways in which these limitations may be overcome by future studies. First, the quality of the approximation by the distribution of maximal entropy exhibited a strong dependence on the choice of the bin size. Reasonable results were obtained only for bin sizes larger than the refractory period of the neurons. If the bin size was smaller than the refractory period, structure was added to the N-cluster distribution, which therefore had less than maximal entropy. This limitation is a direct consequence of the way in which the entropy was defined. The entropy was determined by the distribution of N clusters within individual time bins and was independent of the order of N clusters in successive bins. The equivalent situation for a cross-correlation study would be to calculate only the center bin of the cross-correlation functions, that is, only the probability that two neurons fire at exactly the same time. A possible extension of the method would be to reformulate the entropy so as to include second-order correlations with a time delay (i.e., the noncentral bins of the auto- and cross-correlation functions) in the calculation. However, such an extension is likely to result in a substantial increase in the number of variables and equations. Second, the quality of the approximation by the maximal entropy distribution was also degraded if a bin width was chosen that was larger than the minimal interspike interval (larger than 2 msec in our simulations). In this situation, multiple spikes of a single neuron occurred within a single time bin. The membrane time constant of the postsynaptic neuron provides a natural temporal window over which input is integrated, that is, the natural coincidence window. Currently the value of the effective membrane time constant in cortical neurons is a topic of considerable debate (Bernander et al., 1991; König et al., 1996; Shadlen & Newsome, 1995). The coincidence window of 2 msec that we used is presumably at the lower end of the physiologically plausible range. However, it seems likely that an increase in the width of the bin in which spikes are considered coincident to 5 or even 10 msec would be unproblematic in the case of cortical neurons. Typical peak firing rates of cortical neurons are around 100 Hz, which implies that the probability of multiple spikes within a 5- or 10-msec bin will still be rather low. Alternatively, the calculation of the maximal entropy may be adapted to allow for multiple spikes in a single bin, as was discussed in section 4.4. If a very large bin size is chosen, the distribution of the number of spikes fired by a single neuron in a bin may approach a normal distribution.
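The claim about 5- or 10-msec windows can be checked with a small back-of-the-envelope estimate, under the (assumed) simplification that spike counts are Poisson at the peak rate:

```python
from math import exp

def p_multiple(rate_hz, window_ms):
    """P(two or more spikes) in one window, assuming Poisson firing."""
    lam = rate_hz * window_ms / 1000.0
    return 1.0 - exp(-lam) * (1.0 + lam)

p_multiple(100, 5)     # ~0.090
p_multiple(100, 10)    # ~0.264
```

At 100 Hz the expected count in a 5-msec bin is 0.5, so fewer than one bin in ten contains a second spike; at 10 msec the fraction grows but most bins still contain at most one spike.
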
In this case it is relatively easy to calculate the distribution of N clusters if the only correlations are of second order. Third, the method by which we derived the distribution of maximal entropy is valid only for a homogeneous network. For a nonhomogeneous network, the number of variables grows exponentially with the number of neurons (see section 3). We have incorporated a single inhibitory neuron in our network, which received input from all excitatory cells and provided strong inhibitory feedback (data not shown). The inhibitory neuron added structure to the spike trains by curtailing population bursts. We did not explicitly incorporate the firing pattern of the inhibitory neuron in our calculations, which caused a considerable discrepancy between the maximal entropy distribution and the actual distribution of N clusters. Taken together, these limitations imply that it will be rather problematic to predict the probability of highly synchronous events from firing rates and pair-wise correlations in physiological experiments. In a physiological study on higher-order events among neurons of the frontal cortex, Martignon et al. (1995) obtained discrepancies between the probability of actually occurring events and their probability as predicted by entropy maximization. Unfortunately, even in this study, in which simultaneous recordings from six neurons were analyzed, sampling problems occurred that were caused by the exponentially growing number of spike configurations and the limited recording time available during such experiments. Another way in which the probability of highly synchronous events might be estimated is by direct, in vivo measurement of the distribution of the postsynaptic potential. A comparison of the postsynaptic potential to the size of individual EPSPs could provide valuable insight into the degree of synchrony among neurons that project to a common target cell.

Appendix: Maximizing the Entropy in a Homogeneous Network

For N neurons, three equations are derived from the pair-wise correlation, the firing probability, and the fact that all probabilities should add up to 1. First, the probabilities should add up to 1:

$$1 = \sum_{i=0}^{n} \binom{n}{i} D_i. \tag{3.1}$$

Second, the firing probability $f_1$ is fixed. For example, in the case that N = 7, it is easy to see that $f_1$ equals

$$f_1 = \sum_{i=0}^{1}\sum_{j=0}^{1}\sum_{k=0}^{1}\sum_{l=0}^{1}\sum_{m=0}^{1}\sum_{n=0}^{1} G_{1ijklmn} = \sum_{i=1}^{7} \binom{6}{i-1} D_i. \tag{A.1}$$

In the general case of N = n, this reads

$$f_1 = \sum_{i=1}^{n} \binom{n-1}{i-1} D_i. \tag{3.2}$$

Third, the pair-wise correlations are fixed, which implies that the probability that two neurons fire simultaneously is also fixed. Analogously to equation 3.2, we find for $f_2$:

$$f_2 = \sum_{i=2}^{n} \binom{n-2}{i-2} D_i. \tag{3.4}$$

These equations can be rewritten as

$$D_0 = 1 - \sum_{i=1}^{n} \binom{n}{i} D_i = 1 - nD_1 - \binom{n}{2} D_2 - \sum_{i=3}^{n} \binom{n}{i} D_i,$$

$$D_1 = f_1 - \sum_{i=2}^{n} \binom{n-1}{i-1} D_i = f_1 - (n-1)D_2 - \sum_{i=3}^{n} \binom{n-1}{i-1} D_i,$$

$$D_2 = f_2 - \sum_{i=3}^{n} \binom{n-2}{i-2} D_i. \tag{A.2}$$

Solving for $D_0$, $D_1$, and $D_2$:

$$D_0 = 1 - nf_1 + n(n-1)f_2 - \binom{n}{2} f_2 + \left[\binom{n}{2} - n(n-1)\right] \sum_{i=3}^{n} \binom{n-2}{i-2} D_i + n \sum_{i=3}^{n} \binom{n-1}{i-1} D_i - \sum_{i=3}^{n} \binom{n}{i} D_i,$$

$$D_1 = f_1 - (n-1)f_2 + (n-1)\sum_{i=3}^{n} \binom{n-2}{i-2} D_i - \sum_{i=3}^{n} \binom{n-1}{i-1} D_i,$$

$$D_2 = f_2 - \sum_{i=3}^{n} \binom{n-2}{i-2} D_i, \tag{A.3a–c}$$

with the entropy defined as

$$S = -\sum_{i=0}^{n} \binom{n}{i} D_i \ln(D_i). \tag{3.5}$$

This is a concave function (which is easily verified), and therefore it has a single, unique maximum. To obtain the maximum, we calculate the derivative with respect to $D_i$:

$$\frac{\partial S}{\partial D_i} = 0 \quad \text{for } i = 3, \ldots, n. \tag{3.6}$$

After substituting equations A.3a–c into equation 3.5, this derivative reads

$$\frac{\partial S}{\partial D_i} = -\left\{\left[\binom{n}{2} - n(n-1)\right]\binom{n-2}{i-2} + n\binom{n-1}{i-1} - \binom{n}{i}\right\}\ln(D_0) - n\left\{(n-1)\binom{n-2}{i-2} - \binom{n-1}{i-1}\right\}\ln(D_1) + \binom{n}{2}\binom{n-2}{i-2}\ln(D_2) - \binom{n}{i}\ln(D_i) = 0. \tag{A.4}$$

We note that equations A.3a–c are linear in $D_i$ and that a concave function remains concave on a linear subspace. In other words, equations A.3a–c and 3.6, taken together, have a unique solution. Equation A.4 can be rewritten as

$$\ln(D_i) = \frac{1}{2}(i-1)(i-2)\ln(D_0) - i(i-2)\ln(D_1) + \frac{1}{2}\,i(i-1)\ln(D_2) \quad \text{for } i = 3, \ldots, n. \tag{3.7}$$

Inserting these values for $D_i$ into equations A.3a–c yields three equations with three unknowns and a unique solution, which is found using the Newton-Raphson method as described by Press et al. (1986). Use of this algorithm in Maple V v4.0 allowed us to solve the equations for up to 150 neurons in about 2 hours on a 150-MHz Pentium.

Acknowledgments

We thank Holger Bosch for helpful comments and a critical reading of the manuscript for this article. The research of P.R.R. was funded by a fellowship of the Royal Netherlands Academy of Arts and Sciences.

References

Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1658.
Alonso, J.-M., Usrey, W. M., & Reid, R. C. (1996). Precisely correlated firing in cells of the lateral geniculate nucleus. Nature, 383, 815–819.
Bair, W., Koch, C., Newsome, W., & Britten, K. (1994). Power spectrum analysis of bursting cells in area MT in the behaving monkey. J. Neurosci., 14, 2870–2892.
Bernander, Ö., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proc. Natl. Acad. Sci. USA, 88, 11569–11573.
Bernander, Ö., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Comp., 6, 622–641.
Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3, 146–158.
Deppisch, J., Bauer, H.-U., Schillen, T., König, P., Pawelzik, K., & Geisel, T. (1993). Alternating oscillatory and stochastic states in a network of spiking neurons. Network, 4, 243–257.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Biol. Cybern., 60, 121–130.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci., 15, 218–226.
Gokhale, D. V., & Kullback, S. (1978). The information in contingency tables. New York: Dekker.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
König, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19, 130–137.
Livingstone, M. S. (1996). Oscillatory firing and interneuronal correlations in squirrel monkey striate cortex. J. Neurophysiol., 75, 2467–2485.
MacGregor, R. J., & Oliver, R. M. (1974). A model for repetitive firing in neurons. Kybernetik, 16, 53–64.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biol. Cybern., 73, 69–81.
Murthy, V. N., & Fetz, E. E. (1994). Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Comp., 6, 1111–1126.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1986). Numerical recipes: The art of scientific computing. Cambridge: Cambridge University Press.
Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Visuomotor integration is associated with zero time-lag synchronization among cortical areas. Nature, 385, 157–161.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Curr. Opin. Neurobiol., 5, 248–250.
Singer, W. (1995). Development and plasticity of cortical processing architectures. Science, 270, 758–764.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Softky, W. R. (1995). Simple codes versus efficient codes. Curr. Opin. Neurobiol., 5, 239–247.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Tsodyks, M. V., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Neural Comp., 6, 111–124.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.

Received May 29, 1998; accepted January 19, 1999.
LETTER
Communicated by Peter König
Effects of Spike Timing on Winner-Take-All Competition in Model Cortical Circuits

Erik D. Lumer
Wellcome Department of Cognitive Neurology, London, WC1N 3BG, U.K.
Synaptic interactions in cortical circuits involve strong recurrent excitation between nearby neurons and lateral inhibition that is more widely spread. This architecture is commonly thought to promote a winner-take-all competition, in which a small fraction of neuronal responses is selected for further processing. Here I report that such a competition is remarkably sensitive to the timing of neuronal action potentials. This is shown using simulations of model neurons and synaptic connections representing a patch of cortical tissue. In the simulations, uncorrelated discharge among neuronal units results in patterns of response dominance and suppression, that is, in a winner-take-all competition. Synchronization of firing, however, prevents such competition. These results demonstrate a novel property of recurrent cortical-like circuits, suggesting that the temporal patterning of cortical activity may play an important part in selection among stimuli competing for the control of attention and motor action.

1 Introduction

The brain constantly receives a welter of sensory stimuli, only a small fraction of which seems to reach awareness or influence behavior. Consequently, in the central nervous system, mechanisms must exist to filter out unwanted activity. This selectivity typically operates at the level of perceptual groups rather than simple features, biasing processing in favor of objects that are perceptually salient or currently relevant to behavior (Humphreys, Romani, Olson, Riddoch, & Duncan, 1994). Such a selectivity probably reflects contention for limited processing capacity and the control of motor action. Thus, it is central to the organization of adaptive responses. Neurophysiological experiments in primates suggest that this selection process involves a suppression of neuronal responses to irrelevant stimuli (Moran & Desimone, 1985; Motter, 1993; Treue & Maunsell, 1996; Chelazzi, Miller, Duncan, & Desimone, 1993; Schall & Sanes, 1993).
Neural Computation 12, 181–194 (2000)
© 2000 Massachusetts Institute of Technology

For example, attention-dependent modulation of discharge rates has been observed in the visual cortex of monkeys trained to attend to stimuli at one location and ignore stimuli placed at another location (Moran & Desimone, 1985; Motter, 1993). Similarly, responses of direction-selective neurons in middle temporal visual areas are transiently suppressed when attention is diverted away from their activating stimuli (Treue & Maunsell, 1996). This modulation of neuronal firing is commonly thought to reflect the effect of attentional biases on intrinsic mechanisms of competition within circumscribed cortical areas, mediated by mutual inhibition (Kastner, De Weerd, Desimone, & Ungerleider, 1998; Chelazzi et al., 1993; Schall & Sanes, 1993). How such mechanisms of competition interact with those underlying basic perceptual processes, including segmentation and grouping, and top-down attentional biases, however, remains uncertain.

Here, I describe a novel property of cortical circuits that may in part mediate such an interaction. Using simulations of spiking neurons grouped in recurrent networks that conform to the basic principles of organization of local cortical circuitry, I show that the competition that arises naturally in such networks exhibits a remarkable sensitivity to the relative timing of action potentials. Specifically, synchronization of neuronal activity can switch the network dynamics from a winner-take-all regime, in which a small subset of neurons can sustain elevated firing rates, to one in which most cells are active concurrently. Hence, this effect links local mechanisms of response selection to temporal aspects of cortical activity that have been previously implicated in perceptual organization (Milner, 1974; von der Malsburg, 1981; Gray, König, Engel, & Singer, 1989; Singer, 1993).

Previous modeling studies have suggested that the combined influences of recurrent excitation between nearby cells and more widely spread inhibition within cortical areas provide an intrinsic mechanism by which unwanted signals could be suppressed (Stemmler, Usher, & Niebur, 1995; Salinas & Abbott, 1996). When stimuli activating different groups of neurons are presented simultaneously, the recurrent connections generate a nonlinear behavior, yielding neuronal responses that are restricted to the cell group with strongest input.
This winner-take-all competition reflects a synaptic organization that is common to cortical circuits (Kisvárday & Eysel, 1993; Douglas, Koch, Mahowald, Martin, & Suarez, 1995). Such a mechanism has been proposed as a basic computational strategy for learning and action selection (Hertz, Krogh, & Palmer, 1991), and underlies several models of brain function, ranging from the development of cortical maps (von der Malsburg, 1973) to the selection of targets during saccadic eye movements (Salinas & Abbott, 1996) and visual search (Koch & Ullman, 1985), and to the suppression of redundancies in visual scenes (Stemmler et al., 1995). However, most models of winner-take-all mechanisms in neural circuits focus on the modulation of mean discharge rates due to competitive interactions. Accordingly, these simulation studies often fail to model individual neurons or their precise spike timing. Yet recent evidence in animals suggests that the temporal coordination of neuronal activity may play an important role in neural network operations. In particular, synchronization of action potentials on a millisecond timescale can occur selectively between cortical neurons that signal related features (Gray et al., 1989). It
has been suggested that such synchronization may serve to integrate distributed neural events into coherent percepts (Milner, 1974; von der Malsburg, 1981; Singer, 1993). Moreover, it has been shown, in both the hippocampus and neocortex, that networks of recurrently connected inhibitory neurons can both suppress neuronal responses and affect their temporal patterning (Cobb, Buhl, Halasy, Paulsen, & Somogyi, 1995; Whittington, Traub, & Jefferys, 1995). It remains unclear, however, whether the temporal coordination of neuronal activity can influence the ability of recurrent circuits to induce response suppression. To investigate this possibility, I model neural network interactions in a small patch of cortex in which the relative spike timing between competing units is varied systematically.

2 Methods

The model consists of populations of excitatory and inhibitory cells that are grouped in local recurrent circuits similar to those in previous simulation studies of neocortex (Douglas et al., 1995; Somers, Nelson, & Sur, 1995; Stemmler et al., 1995; Salinas & Abbott, 1996). Synaptic interactions within a group are both excitatory and inhibitory, but are restricted to inhibition between cell groups (see Figure 1). This synaptic organization mimics that found in a local sector of cortex, for example, a hypercolumn of V1, in which massively recurrent excitatory connections dominate nearby cells, whereas inhibitory interactions extend over larger distances. Given the topographic nature of cortical maps, this lateral inhibition interconnects adjacent groups of cells with dissimilar response specificity, for example, with an orthogonal orientation preference in V1 (Kisvárday & Eysel, 1993). In most simulations presented in section 3, two cell groups coupled through mutual inhibition are considered (target groups). Simulations involving three competing target groups are also reported.
Target cells are driven by a population of neurons with similar intrinsic connectivity (input groups). In the model, each input group either projects to a single target group or provides common (divergent) activation to several target groups. Each group of cells comprises 800 excitatory units and 200 inhibitory units. Units of each type receive 80 excitatory connections and either 20 (input groups) or 12 (target groups) inhibitory connections from cells within the same group. Intergroup connections consist of 12 inhibitory contacts per cell between target groups. Input groups contribute 40 excitatory projections onto each unit in the target layer, with the proportion of common input varied systematically in the simulations. These specifications yield a ratio of excitatory to inhibitory units, as well as proportions of excitatory and inhibitory synapses, that are consistent with anatomical data (Beaulieu, Kisvárday, Somogyi, Cynader, & Cowey, 1992). Individual connections are established at random within and between interconnected populations of cells. These connections incorporate transmission delays with a mean of 2 ms.
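As a concrete illustration, the wiring statistics above (800 excitatory and 200 inhibitory units per group, fixed per-unit contact counts, 2-ms mean delays) can be instantiated by random sampling. The sketch below is my own illustration, not the author's code; the exponential form of the delay distribution and all names are assumptions (the text specifies only the mean delay).

```python
import random

def build_group(n_exc=800, n_inh=200, n_exc_in=80, n_inh_in=12,
                mean_delay_ms=2.0, rng=None):
    """Randomly wire one cell group: every unit receives a fixed number of
    excitatory and inhibitory contacts drawn from units of the same group.
    Delay distribution is assumed exponential with the stated 2-ms mean."""
    rng = rng or random.Random(0)
    exc_ids = list(range(n_exc))
    inh_ids = list(range(n_exc, n_exc + n_inh))
    connections = []  # (presynaptic id, postsynaptic id, sign, delay in ms)
    for post in range(n_exc + n_inh):
        for pre in rng.sample(exc_ids, n_exc_in):
            connections.append((pre, post, +1, rng.expovariate(1.0 / mean_delay_ms)))
        for pre in rng.sample(inh_ids, n_inh_in):
            connections.append((pre, post, -1, rng.expovariate(1.0 / mean_delay_ms)))
    return connections

# A target group: 1000 units, each with 80 excitatory and 12 inhibitory inputs.
target_group = build_group()
assert len(target_group) == (800 + 200) * (80 + 12)
```

An input group would use `n_inh_in=20`, and intergroup inhibition would be wired analogously across two such groups.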
Figure 1: Model architecture. Spiking units are grouped in cortical-like recurrent circuits (boxes), each comprising 800 excitatory (Ex) and 200 inhibitory (Inh) cells. Input groups can provide either independent or common drive to two target groups (1 and 2) that compete through mutual inhibition. Excitatory and inhibitory connections are represented by lines with arrowheads and round terminals, respectively.
Neurons are modeled as single-voltage compartments with realistic membrane time constants and firing characteristics. Specifically, each cell's membrane potential obeys the equation

τm dV/dt = −V + E0 − gEx(t)(V − EEx) − gInh(t)(V − EInh),

where τm is the membrane time constant, E0 is the cell's resting potential, gEx and gInh are, respectively, time-dependent excitatory and inhibitory synaptic conductances expressed in units of a leak conductance, and EEx and EInh are the reversal potentials of the corresponding channels. Membrane time constants are set to 16 ms for excitatory cortical units and 8 ms for inhibitory units. These passive membrane time constants are consistent with empirical data (Connors, Gutnick, & Prince, 1982; Bloomfield, Hamos, & Sherman, 1987). E0 is −60 mV for both cell types, in agreement with in vivo measurements (Ferster & Jagadeesh, 1992). EEx and EInh are 0 mV and −70 mV, respectively. Spiking is modeled by a standard integrate-and-fire mechanism, in which units that reach a threshold are reset to the potassium reversal potential (−90 mV). Thresholds are −51 mV for excitatory units and −52 mV for inhibitory units. These threshold values fall in the in vivo range (Baranyi, Szente, & Woody, 1993). The choice of different firing thresholds for these two cell types makes it easier to reproduce experimental levels of spontaneous activity, as described below. Absolute refractory periods are 1.5 ms. Relative refractory periods are generated in excitatory units by adding 10 mV to their firing threshold after each spike; relaxation to baseline occurs exponentially with a 10-ms time constant. Similar cellular parameters have been justified in previous simulations of striate cortex (e.g., Somers et al., 1995; Lumer, Edelman, & Tononi, 1997). For simplicity, spike rate adaptation is not incorporated in excitatory units. Preliminary simulations showed that this choice does not affect the processes studied in this article.

When a unit fires, a spiking event is relayed to its postsynaptic targets. Synaptic activation is modeled as a transient conductance change, using dual exponential functions defined as (Wilson & Bower, 1989):

g(t) = gpeak N (e^(−t/τ1) − e^(−t/τ2)),

where τ1 and τ2 are the rising and decay time constants, N is a normalizing factor, and gpeak is the peak conductance elicited by a single spike. In the model, synapses provide fast inhibition and excitation, with time constants of activation consistent with empirical data in vitro (Otis & Mody, 1992; Stern, Edwards, & Sakmann, 1992) (see Table 1).

Table 1: Synaptic Channel Parameters.

Receptor | gpeak (units of gleak) | τ1 (ms) | τ2 (ms) | E (mV)
AMPA     | 0.05                   | 0.5     | 2.4     | 0
GABAa    | 0.1                    | 1       | 7       | −70

At rest, peak postsynaptic potentials (PSPs) elicited by a single spike are 0.5 and 0.25 mV for excitatory and inhibitory synaptic channels, respectively. These PSPs are consistent with the higher end of experimental values (Mason, Nicholl, & Stratford, 1991; Komatsu, Nakajima, Toyama, & Fetz, 1988); large, single-event PSPs must be used in the model because it includes fewer synapses per cell compared with in vivo connectivity. Background synaptic activity is simulated by a random (Poisson distributed) activation of both types of synaptic channels in each unit. This results in a balance of mean excitation and inhibition and a spontaneous discharge of individual units at a rate of about 3 Hz for excitatory units and 15 Hz for inhibitory units. In addition to this background activity, extrinsic activation of the network is achieved by applying a stochastic (Poisson distributed, λ = 0.4 kHz) synaptic excitation independently to each cell in the input groups. This activation occurs with an amplitude equivalent to five synchronous excitatory postsynaptic potentials (EPSPs) per synaptic event.

To clarify the influence of relative spike timing on competition, the temporal coordination between neuronal responses in the two target groups is manipulated selectively. This is implemented by varying systematically the proportion of common versus independent activation that each target group receives. It must be emphasized that this manipulation of feedforward connectivity is not meant to represent any physiological mechanism, but rather to provide a simple control over the degree of coherence between competing populations of neurons.
The question of what mechanisms can
produce dynamic changes in neuronal synchronization is beyond the scope of this study. Suffice it to note that changes in relative spike timing do take place in a task- and stimulus-dependent manner in vivo (Gray et al., 1989; Roelfsema, Engel, König, & Singer, 1997).

A number of aggregate measures are used to characterize the model's behavior. The average membrane potential is computed within each cell group. This variable, which is reminiscent of a local field potential, serves as a useful indicator of within-group coherence. Instantaneous population-averaged firing rates are computed by cumulating all of the spikes emitted by excitatory units within a given cell group and a sliding 100-ms time window and by normalizing for cell number and window size. Finally, the competition index of two groups of neurons is defined as the mean absolute value of the difference between their instantaneous firing rates divided by the highest instantaneous rate among the two cell groups. This index varies between 0 for noncompeting groups of cells and 1 for groups of neurons with mutually exclusive patterns of activity (i.e., winner-take-all regime). Thus, it provides a useful measure of the degree of competition in the target layer.

3 Results

Mutual inhibition in cortical circuits can both suppress neuronal responses and affect their temporal patterning by inducing synchronous oscillations. This is illustrated in Figure 2A, in which the population-averaged membrane potential in input and target groups is plotted, along with single-unit activity from representative target cells. As shown by the autocorrelation function, input populations exhibit weak rhythmic activity in the 60-Hz range (see Figure 2B). By contrast, target populations exhibit strong coherent oscillations with a frequency at around 30 Hz (see Figure 2C). The population oscillations can be shown to arise as an emergent property of recurrent synaptic coupling.
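The instantaneous rate and competition index defined in section 2 can be computed directly. The sketch below is my own illustration under one reading of the CI definition (the normalization by the larger rate is applied per time sample before averaging); function names are mine, not the author's.

```python
def population_rate(spike_times_ms, n_cells, t_ms, window_ms=100.0):
    """Instantaneous population-averaged firing rate (Hz): spikes emitted by
    the group in a sliding window ending at t, normalized by cell number
    and window size."""
    n_spikes = sum(1 for s in spike_times_ms if t_ms - window_ms < s <= t_ms)
    return n_spikes / n_cells / (window_ms / 1000.0)

def competition_index(rates1, rates2, eps=1e-12):
    """Mean of |r1 - r2| / max(r1, r2) over paired rate samples: 0 for
    noncompeting (identical) activity, 1 for mutually exclusive activity."""
    terms = [abs(a - b) / max(a, b, eps) for a, b in zip(rates1, rates2)]
    return sum(terms) / len(terms)

# Mutually exclusive firing (winner-take-all) gives CI = 1;
# identical firing gives CI = 0.
assert competition_index([40.0, 0.0, 35.0], [0.0, 45.0, 0.0]) == 1.0
assert competition_index([30.0, 30.0], [30.0, 30.0]) == 0.0
```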
This is because individual inhibitory neurons target several excitatory cells, forcing them to pause in concert; the silencing of excitatory neurons in turn causes a drop in inhibition, resulting in a depolarizing rebound and a synchronous volley of action potentials. The sequence repeats itself periodically at a pace that depends in part on the time constants of synaptic inhibition. Because this aspect of the neural dynamics has been investigated in detail in previous studies (Bush & Douglas, 1991; Whittington et al., 1995; Lumer et al., 1997), it is not further characterized here.

The simulations also reveal that changes in spike coordination modify the effects of lateral inhibition that is not matched by excitation. We illustrate this by comparing the responses in a model with two target groups when they sample their inputs from either a common neuronal pool or independent groups of cells. Although the mean activation of each target group remains identical in both cases, the two conditions yield different correlation patterns. In the first instance, the common input provided to the target populations causes them to fire in synchrony. With independent inputs, however, the
Figure 2: Coherent activity in target groups. (A) The three upper plots show the population-averaged membrane potential of excitatory units in an input group (In) and in each target group (Pop1 and Pop2) over a period of 250 ms before and 750 ms during application of extrinsic activation to the model circuit. The remaining plots show the corresponding single-unit activity for representative cells in each target group. The arrow at the bottom marks the stimulation onset. Note the oscillations in the population-averaged responses, despite somewhat variable interspike intervals evidenced by the single-unit traces. (B) Autocorrelogram of the mean membrane potential in the input group reveals a weak oscillation at around 60 Hz. (C) Autocorrelogram of the mean membrane potential in one target group reveals synchronous oscillations at around 30 Hz.
target populations exhibit little correlation in the timing of their action potentials. As shown in Figure 3A, these two input conditions yield markedly different behavior in the target layer. Independent activation of the target groups results in a winner-take-all competition, in which one group of cells exhibits elevated firing rates, whereas responses in the other neuronal group are suppressed. This contrasts with the absence of response suppression in the target layer when target groups are driven by a common (synchronizing) input. Thus, synchronization of firing prevents competition in recurrent neural circuits, despite the presence of strong mutual inhibition.

The influence exerted by spike coordination on winner-take-all mechanisms operates very rapidly. This is demonstrated in Figure 3B, in which a histogram of the mean absolute difference in firing rate between the two target groups is shown immediately before and after a change from independent to common afferent input. The histogram reveals that a significant modulation of differential firing levels is detectable within the first 50 ms following the change in spike coordination.

To quantify the sensitivity of neural competition to the degree of zero-lag synchronization, I compute the competition index (CI), as defined in section 2, between two target groups for different proportions of common
Figure 3: Temporal control of winner-take-all mechanism. (A) Instantaneous firing rates, as defined in section 2, are plotted for excitatory units in two competing target groups, using black and gray traces, respectively. The inputs to the two target groups are changed from common (sync) to separate (async) once every 500 ms. Uncorrelated inputs yield a pattern of response dominance and suppression, which is abolished when common inputs are used instead. Input conditions are shown below the activity traces. (B) Differential discharge rate histogram for the target groups during successive suppressed and coordinated time intervals (bin = 50 ms).
input. As shown in Figure 4A, the degree of competition between target groups rises rapidly as the proportion of common input, and hence of their spike correlation, decreases. Because competition in the model depends on the presence of strong mutual inhibition between target groups, the influence of spike coordination can be interpreted as a modulation of the effective connectivity between these groups (Gerstein & Perkel, 1969; Friston, Frith, Liddle, & Frackowiak, 1993). This view is consistent with the results of simulations in which, for each setting of input sources, we determine the strength of intergroup synaptic inhibition that yields an equivalent degree of competition in the target layer (in this instance, CI = 0.4). This functionally equivalent inhibitory strength is plotted as a function of the proportion of common input provided to the target populations (see Figure 4B). To a first approximation, it establishes a boundary between regions of the parameter space yielding, on the one hand, a winner-take-all regime (above the boundary) and, on the other, a regime in which both target groups are activated simultaneously (below the boundary). Accordingly, physiological factors that transiently alter the efficacy of inhibitory postsynaptic potentials (e.g., neuromodulation) or input coherence (e.g., stimulus configuration) could switch the network dynamics between these two regimes.

Finally, it must be noted that similar effects are seen when more than two
Figure 4: Sensitivity of competition to spike timing. (A) The competition index (CI) is plotted for increasing proportions of common input to the two target groups. The CI for uncorrelated inputs to the two groups corresponds to the leftmost point in the figure. (B) The plot shows the peak conductance of intergroup inhibitory synapses yielding a constant CI (0.4), as a function of the proportion of common input to the two target groups. This plot separates the parameter space into a region of increased competition (above) and one of coordination (below).
competing groups are considered. This is illustrated in simulations of three competing populations of neurons that are driven by independent input groups (see Figure 5A) or by inputs that are common to two (see Figure 5C) or all of the target groups (see Figure 5B). As shown in the figures, target groups that receive a common input, and hence fire synchronously, do not suppress each other's activity. This contrasts with the winner-take-all competition between populations with uncorrelated synaptic inputs. These simulations illustrate how dynamic changes in input correlation can reconfigure the competition among target cells; neurons can switch their membership from one coactivated assembly to another on the basis of changes in relative spike timing. Thus, these results are consistent with the notion that the same group of neurons can be involved in the representation of multiple objects. At any one moment, the specific binding of neuronal activity to single objects can be determined by a buildup of synchronization patterns and selection of neuronal responses that conform to such patterns.
Figure 5: Instantaneous firing rates for three competing groups. (A) Independent activation of each group yields a winner-take-all regime, in which a single group maintains elevated firing rates at any one moment. (B) Competition is prevented by the introduction of common inputs to all target populations. (C) Common input to two groups prevents them from competing with each other but not with a third uncorrelated group. Activity in the third group is suppressed in this example, but it could also dominate competition in some runs.
4 Discussion

A prominent feature of the local circuitry in neocortex is its topographic organization in clusters, with massively recurrent excitation between neighboring cells and lateral inhibition that is more widely spread. Recent simulation studies suggest that this architecture yields stereotyped responses, in which weak afferent inputs undergo nonlinear amplification due to intracortical recurrent excitation and the strongest inputs are differentially activated (i.e., selected) as a result of lateral inhibitory interactions. The models show that such response properties can account for the emergence of direction and orientation selectivity in striate cortex and for the multiplicative gain modulation of parietal neurons by eye and head position signals (Douglas et al., 1995; Somers et al., 1995; Salinas & Abbott, 1996).

The simulations here demonstrate that the basic design principles of cortical architectures can also produce stereotyped temporal patterns of neuronal activity, in particular synchronous and oscillatory population responses in the high-frequency range (20–70 Hz). More important, it has been shown that the winner-take-all competition that arises naturally in such circuits exhibits a remarkable sensitivity to the temporal coordination of neuronal action potentials. Specifically, the onset of a winner-take-all behavior can be prevented by a zero-lag synchronization of discharges among pools of neurons competing through mutual inhibition. Transitions
from competitive to cooperative firing regimes occur very rapidly, within roughly 50 ms of a change from uncorrelated to synchronous population activity. This novel network property reflects the sensitivity of reciprocal inhibitory interactions to the relative timing of action potentials. Indeed, the simulations reported in this article show that the effective connectivity, of an inhibitory nature, mediating the lateral interactions responsible for competition is decreased in the context of synchronization. This is an interesting and counterintuitive observation; increased functional connectivity (defined as the correlation between neurophysiological events) is, in this instance, accompanied by a decrease in (inhibitory) effective connectivity (defined as the influence one neuronal system exerts over another).

The effects of relative spike timing on competition described here are robust to changes in model parameters. In particular, these effects are maintained when the total number or relative proportion of excitatory and inhibitory units in the model is varied by a twofold factor with respect to the values used in the reported simulations. The network organization underlying these effects, a center of recurrent excitation with an inhibitory surround, does not take into account the long-range horizontal connections between pyramidal cells in upper and lower cortical laminae. However, these distal horizontal connections are organized in discrete patches placed outside the afferent region of inhibitory cells and restricted to cells of similar response specificity (Lund, Yoshioka, & Levitt, 1993). Moreover, physiological evidence suggests that these long-range connections have a modulatory rather than a driving influence on their neuronal targets (Hirsch & Gilbert, 1991). Thus, to a first approximation, these nonlocal connections are not relevant for the local competition mechanisms considered in this article.
These results may help in understanding how the brain selects among the welter of stimuli constantly competing for the control of attention and motor action. Physiologically, selective sensory processing is often accompanied by a suppression of neuronal responses to contextually irrelevant stimuli (Moran & Desimone, 1985; Motter, 1993; Treue & Maunsell, 1996). A candidate mechanism for the selective suppression of neuronal responses is mutual inhibition between cortical cells responsive to relevant and irrelevant stimuli (Chelazzi et al., 1993; Schall & Sanes, 1993). Because selection typically operates at the level of perceptual groups rather than simple features, however, such competition must be confined between cells belonging to different object representations. The results here suggest a mechanism whereby this specificity of competitive interactions can be achieved dynamically. It exploits the fact that the relative timing of action potentials in sensory cortices can vary in a context-dependent manner, with synchronization of activity arising selectively among cells representing related features of an object (Gray et al., 1989; Singer, 1993). Thus, according to the effects reported in this article, the neuronal responses associated with various attributes of an object emerge with mutual immunity from competition by virtue of their synchronization. By contrast, the patterns of activity associated with different objects or alternative stimulus interpretations may be uncorrelated and hence subject to competition. It has recently been proposed that such effects may underlie the competition between alternative stimulus representations during binocular rivalry (Lumer, 1998). More generally, they link, within a common computational framework, early processes of perceptual organization based on spike timing and subsequent selection among perceptual groups competing for the control of behavior. These newly described effects of spike coordination are functionally distinct from the ability of neuronal synchronization to augment the responses of target cells due to maximal temporal summation of synaptic events (Abeles, 1982).

5 Conclusion

What emerges from these simulations is the notion of a parameter space for cortical operation that spans two domains. One domain is capable of supporting activity in multiple populations of neurons for a short period of time, after which one population is selected by competitive lateral interactions. In the other domain, neuronal responses in multiple populations can coexist, provided that they are expressed synchronously. The parameters defining this space can be divided into those that dynamically modulate synaptic efficacy (e.g., classical neuromodulation and voltage-dependent interactions) and the temporal patterning of activity among the interacting populations. I have focused on the latter here and have demonstrated how temporal coordination can move the network dynamics from one domain to the other. The role of temporal coordination in facilitating or attenuating competitive interactions in cortical circuits may represent a fundamental component of perceptual synthesis.

Acknowledgments

I thank Karl Friston and Richard Frackowiak for their encouragement and for helpful discussions and comments on the manuscript for this article. This research was supported by the Wellcome Trust.

References

Abeles, M. (1982).
Role of cortical neuron: Integrator or coincidence detector? Isr J Med Sci, 18, 83–92.
Baranyi, A., Szente, M., & Woody, C. (1993). Electrophysiological characterization of different types of neurons recorded in vivo in motor cortex of the cat. II. Membrane parameters, action potentials, current-induced voltage responses and electrotonic structures. J Neurophysiol, 69, 1865–1879.
Beaulieu, C., Kisvárday, Z., Somogyi, P., Cynader, M., & Cowey, A. (1992). Quantitative distribution of GABA-immunonegative neurons and synapses in the monkey striate cortex (area 17). Cerebral Cortex, 2, 295–309.
Bloomfield, S. A., Hamos, J. E., & Sherman, S. M. (1987). Passive cable properties and morphological correlates of neurones in the lateral geniculate nucleus of the cat. J Physiol, 383, 653–692.
Bush, P., & Douglas, R. (1991). Synchronization of bursting action potentials in a model network of neocortical neurons. Neural Comp, 3, 19–30.
Chellazzi, L., Miller, E. K., Duncan, J., & Desimone, R. (1993). A neural basis for visual search in inferior temporal cortex. Nature, 363, 345–347.
Cobb, S. R., Buhl, E., Halasy, K., Paulsen, O., & Somogyi, P. (1995). Synchronization of neuronal activity in hippocampus by individual GABAergic interneurons. Nature, 378, 75–78.
Connors, B., Gutnick, N. I., & Prince, D. (1982). Electrophysiological properties of neocortical neurons in vitro. J Neurophysiol, 48, 1302–1320.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Ferster, D., & Jagadeesh, B. (1992). EPSP-IPSP interactions in cat visual cortex studied with in vivo whole-cell patch recording. J Neurosci, 12, 1262–1274.
Friston, K. J., Frith, C. D., Little, P. F., & Frackowiak, R. S. J. (1993). Functional connectivity: The principal component analysis of large (PET) data sets. J Cereb Blood Flow Metab, 13, 5–14.
Gerstein, G. L., & Perkel, D. H. (1969). Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science, 164, 828–830.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hirsh, J., & Gilbert, C. (1991). Synaptic physiology of horizontal connections in the cat visual cortex. J Neurosci, 11, 1800–1809.
Humphreys, G. W., Romani, C., Olson, A., Riddoch, M. J., & Duncan, J. (1994). Non-spatial extinction following lesions of the parietal lobe in humans. Nature, 372, 357–359.
Kastner, S., De Weerd, P., Desimone, R., & Ungerleider, L. G. (1998). Mechanisms of directed attention in the human extrastriate cortex as revealed by functional MRI. Science, 282, 108–111.
Kisvárday, Z. F., & Eysel, U. T. (1993). Functional and structural topography of horizontal inhibitory connections in cat visual cortex. Europ J Neurosci, 5, 1558–1572.
Koch, C., & Ullman, S. (1985). Shifts in selective attention: Towards an underlying neural circuitry. Human Neurobiology, 4, 219–227.
Komatsu, Y., Nakajima, S., Toyama, K., & Fetz, E. (1988). Intracortical connectivity revealed by spike-triggered averaging in slice preparation of cat visual cortex. Brain Res, 442, 359–362.
Lumer, E. D. (1998). A neural model of binocular integration and rivalry based on the coordination of action-potential timing in primary visual cortex. Cerebral Cortex, 8, 553–561.
Lumer, E. D., Edelman, G. M., & Tononi, G. (1997). Neural dynamics in a model of the thalamocortical system. I. Layers, loops, and the emergence of fast
synchronous rhythms. Cerebral Cortex, 7, 207–227.
Lund, J. S., Yoshioka, T., & Levitt, J. B. (1993). Comparison of intrinsic connectivity in different areas of macaque monkey cerebral cortex. Cerebral Cortex, 3, 148–162.
Mason, A., Nicholl, A., & Stratford, K. (1991). Synaptic transmission between individual pyramidal neurons of the rat visual cortex in vitro. J Neurosci, 11, 72–84.
Milner, P. M. (1974). A model of visual shape recognition. Psychol Rev, 81, 521–535.
Moran, J., & Desimone, R. (1985). Selective attention gates visual processing in the extrastriate cortex. Science, 229, 782–784.
Motter, B. C. (1993). Focal attention produces spatially selective processing in visual cortical areas V1, V2, and V4 in the presence of competing stimuli. J Neurophysiol, 70, 909–919.
Otis, T. S., & Mody, I. (1992). Differential activation of GABAA and GABAB receptors by spontaneously released transmitter. J Neurophysiol, 67, 227–235.
Roelfsema, P. R., Engel, A. K., König, P., & Singer, W. (1997). Visuomotor integration is associated with zero-lag synchronization among cortical areas. Nature, 385, 157–161.
Salinas, E., & Abbott, L. F. (1996). A model of multiplicative neural responses in parietal cortex. Proc Natl Acad Sci USA, 93, 11956–11961.
Schall, J. D., & Sanes, D. P. (1993). Neural basis of saccade target selection in frontal eye field during visual search. Nature, 366, 467–469.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing. Annu Rev Physiol, 55, 349–374.
Somers, D., Nelson, S., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J Neurosci, 15, 5448–5465.
Stemmler, M., Usher, M., & Niebur, E. (1995). Lateral interactions in primary visual cortex: A model bridging physiology and psychophysics. Science, 269, 1877–1880.
Stern, P., Edwards, F. A., & Sakmann, B. (1992). Fast and slow components of unitary EPSCs on stellate cells elicited by focal stimulation in slices of rat visual cortex. J Physiol (London), 449, 247–278.
Treue, S., & Maunsell, J. H. R. (1996). Attentional modulation of visual motion processing in cortical areas MT and MST. Nature, 382, 539–541.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in primary visual cortex. Kybernetik, 14, 85–100.
von der Malsburg, C. (1981). The correlation theory of the brain (Internal Rep. No. 81-2). Göttingen: Max Planck Institute for Biophysical Chemistry.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Wilson, M. A., & Bower, J. M. (1989). In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 291–333). Cambridge, MA: MIT Press.

Received April 21, 1998; accepted February 17, 1999.
LETTER
Communicated by George Gerstein
Model Dependence in Quantification of Spike Interdependence by Joint Peri-Stimulus Time Histogram

Hiroyuki Ito
Satoshi Tsuji
Department of Information and Communication Sciences, Faculty of Engineering, Kyoto Sangyo University, Kita-ku, Kyoto 603-8555, Japan, and CREST, Japan Science and Technology
Multineuronal recordings have enabled us to examine context-dependent changes in the relationship between the activities of multiple cells. The joint peri-stimulus time histogram (JPSTH) is a much-used method for investigating the dynamics of the interdependence of spike events between pairs of cells. Its results are often taken as an estimate of interaction strength between cells, independent of modulations in the cells' firing rates. We evaluate the adequacy of this estimate by examining the mathematical structure of how the JPSTH quantifies an interaction strength after excluding the contribution of firing rates. We introduce a simple probabilistic model of interacting point processes to generate simulated spike data and show that the normalized JPSTH incorrectly infers the temporal structure of variations in the interaction parameter strength. This occurs because, in our model, the correct normalization of firing-rate contributions differs from that used in Aertsen, Gerstein, Habib, and Palm's (1989) effective connectivity model. This demonstrates that firing-rate modulations cannot be corrected for in a model-independent manner, and therefore the effective connectivity does not represent a universal characteristic that is independent of modulation of the firing rates. Aertsen et al.'s (1989) effective connectivity may still be used in the analysis of experimental data, provided we are aware that this is simply one of many ways of describing the structure of interdependence. We also discuss some measure-independent characteristics of the structure of interdependence.

1 Introduction

Recently, a great amount of attention has been focused on the understanding of network mechanisms of information processing in the brain. The advancement of multineuronal recording techniques has enabled us to record simultaneously the spike activities of multiple single units from spatially distant sites as well as from local cortical sites.
Neural Computation 12, 195–217 (2000) © 2000 Massachusetts Institute of Technology

The availability of multiple neuron spike data has led to a novel coding paradigm called the relational
code. In this paradigm, the information or function is not necessarily localized at the single-cell level. This information or function is dynamically encoded by the context of the relationship among spike activities of multiple cells. Such a relationship can be either a coherence in the mean firing rates on a psychological timescale (Hebb, 1949) or a fine temporal interdependence among spike events (Abeles, 1982a, 1991; von der Malsburg, 1981, 1986; Gerstein, Bedenbaugh, & Aertsen, 1989; Engel, König, Kreiter, Schillen, & Singer, 1992; Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996). In order to understand the network mechanism involved in information processing, we investigate the relationship among multiple cells and discuss its relation within the context of information processing (stimulus presentation or task performance). Thus, we need a powerful analytical tool that visualizes the spatiotemporal statistical structure in the activities of the neuronal network based on the spike data of multiple single units. Traditionally, we have computed peri-stimulus time histograms (PSTH) for all of the recorded cells to examine coherence in the firing rates. A cross-correlogram has then been applied to a pair of cells to detect the temporal interdependence among spike activities. However, the cross-correlogram has a limitation in analyzing a rapid change of the structure of interdependence due to averaging over the sample interval. Recent experimental reports of a time-domain analysis suggest that the spike interdependence between two cells evolves in a nonstationary way even within the trial duration of a few seconds (Gray, Engel, König, & Singer, 1992; Vaadia et al., 1995). Aertsen, Gerstein, Habib, and Palm (1989) introduced a novel statistical analytical method for the spike trains of a pair of cells, the joint peri-stimulus time histogram (JPSTH). The JPSTH was designed to quantify the nonstationary modulation of the spike interdependence within the trial duration.
Currently, the JPSTH is considered to be the most advanced tool for the statistical analysis of spike interdependence. The number of experimental reports applying the JPSTH has been increasing, reflecting growing interest in the dynamical structure of interdependence within a cell assembly. Here we examine the mathematical structure of how the JPSTH quantifies a degree of interdependence after excluding the contribution of firing rates. In section 2, we provide a basic description of the JPSTH method. Then, following the introduction of two different ways of using the JPSTH, we discuss how Aertsen et al. (1989) defined the effective connectivity to quantify the degree of interdependence. We present a statistical evaluation of the JPSTH in section 3. We introduce a simple probabilistic model of point processes where the interaction strength is characterized by a parameter different from the effective connectivity. Through the normalization procedure of the JPSTH of the simulated spike data, we examine how the temporal structure of interdependence described by our interaction parameter is related to that described by the effective connectivity. We discuss the results of
the simulations in section 4. We summarize the detailed mathematical derivations of the formulas used in the text in the appendix. In this article, we consistently use the term interdependence instead of the commonly used correlation, because the latter is often confused with the correlation coefficient, which is one particular measure of interdependence. The essence of this study has been presented elsewhere (Ito & Tsuji, 1997; Ito, 1998).

2 JPSTH

2.1 Method. By simultaneously recording the spike activities of two cells, cell 1 and cell 2, for a trial duration $T$ over $K$ trials under the same experimental condition, we get $K$ pairs of spike trains. Here we assume that the spike trains of each trial are time locked to some trigger event such as the stimulus onset. We divide the trial duration into multiple bins with a small interval $\Delta$ so that there is at most a single spike within any bin for each trial. Let $n_i^{(k)}(u)$ be the spike count of cell $i$ ($i = 1, 2$) in the $u$th bin of the $k$th trial. The variable $n_i^{(k)}(u)$ can therefore take only the values 0 or 1. The average spike count,

$$\langle n_i(u)\rangle = \frac{1}{K}\sum_{k=1}^{K} n_i^{(k)}(u), \tag{2.1}$$

represents the probability of the occurrence of a spike event at cell $i$ at bin $u$. The histogram of $\langle n_i(u)\rangle$ over the index $u$ is the PSTH. In the computation of the raw-JPSTH, a square panel with sides of the trial duration $T$ is prepared, and time axes are set along the bottom and left sides. The two time axes are locked to the stimulus onset at the bottom-left corner. The square panel is subdivided into small compartments of size $\Delta$, each of which is specified by a pair of integer indices $(u, v)$. To each compartment we assign the matrix element that quantifies the probability of the occurrence of a joint event, in which cell 1 has a spike event at bin $u$ and cell 2 has one at bin $v$,

$$\langle n_{12}(u,v)\rangle = \frac{1}{K}\sum_{k=1}^{K} n_{12}^{(k)}(u,v), \tag{2.2}$$

where

$$n_{12}^{(k)}(u,v) = n_1^{(k)}(u)\, n_2^{(k)}(v). \tag{2.3}$$
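The quantities in equations 2.1 through 2.3 are simple trial averages over binned spike counts. A minimal sketch of their computation (not the authors' code; the $K$-trials-by-$U$-bins binary array layout and the function names are my own assumptions):

```python
import numpy as np

def psth(spikes):
    # spikes: (K trials, U bins) binary array for one cell.
    # Column means give <n_i(u)>, eq. 2.1.
    return spikes.mean(axis=0)

def raw_jpsth(spikes1, spikes2):
    # <n12(u, v)>: fraction of trials with a cell-1 spike in bin u
    # AND a cell-2 spike in bin v (eqs. 2.2 and 2.3).
    K = spikes1.shape[0]
    return spikes1.T @ spikes2 / K  # (U, U) matrix indexed by (u, v)

# toy data: K = 2 trials, U = 3 bins
s1 = np.array([[1, 0, 1], [1, 1, 0]])
s2 = np.array([[0, 1, 1], [1, 1, 0]])
J = raw_jpsth(s1, s2)
```

The diagonal `J[u, u]` is the PST coincidence histogram before smoothing, and sums along paradiagonals estimate the cross-correlogram.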
An example of the raw-JPSTH is shown in the left panel in Figure 2a. The magnitude of each matrix element is represented by a color scale. At the bottom of the square, the PSTH of cell 1 is plotted upside down. The PSTH of cell 2 is plotted on the left axis.
The matrix elements on the main diagonal represent the coincident spike events of the two cells. These elements are used for the PST coincidence histogram, which is plotted along the diagonal in the right panel of Figure 2a. The summation within paradiagonal bins, on the other hand, is used to estimate the ordinary cross-correlogram, which is plotted in the upper-right corner of Figure 2a. The central peak of the cross-correlogram represents the cumulative frequency of coincident spike events over the entire trial duration. The PST coincidence histogram visualizes fine temporal modulation of the frequency of the coincident events within the trial duration, that is, the time domains of more frequent coincident events. In the raw-JPSTH, the statistical structure in the spike data is represented by three quantities: $\langle n_1(u)\rangle$, $\langle n_2(v)\rangle$, and $\langle n_{12}(u,v)\rangle$. The goal of the JPSTH is to provide a statistical measure extracting the net degree of interdependence from the joint probability $\langle n_{12}(u,v)\rangle$ after excluding the contribution of firing rates. The element in the predicted-JPSTH is given by

$$\langle \text{predicted-JPSTH}\rangle = \langle n_1(u)\rangle\,\langle n_2(v)\rangle, \tag{2.4}$$

which represents the probability of the accidentally correlated events that appear even when the two spike trains are statistically independent (the null hypothesis). The departure of the test data from the null hypothesis is estimated by the covariance, which is the subtraction of the predicted-JPSTH from the raw-JPSTH at each matrix element,

$$\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle = \langle n_{12}(u,v)\rangle - \langle n_1(u)\rangle\,\langle n_2(v)\rangle. \tag{2.5}$$
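In the same spirit as the raw-JPSTH, the predicted-JPSTH of equation 2.4 is the outer product of the two PSTHs, and the covariance of equation 2.5 is the elementwise difference. A hedged sketch (array layout and function names are my own assumptions):

```python
import numpy as np

def predicted_jpsth(spikes1, spikes2):
    # <n1(u)><n2(v)>: independence null hypothesis, eq. 2.4
    return np.outer(spikes1.mean(axis=0), spikes2.mean(axis=0))

def jpsth_covariance(spikes1, spikes2):
    # (raw - predicted)-JPSTH, eq. 2.5
    K = spikes1.shape[0]
    raw = spikes1.T @ spikes2 / K
    return raw - predicted_jpsth(spikes1, spikes2)

s1 = np.array([[1, 0], [1, 1]])
s2 = np.array([[1, 0], [0, 1]])
C = jpsth_covariance(s1, s2)  # near zero where the trains look independent
```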
In general, there exist modulations of firing rates within the trial duration. For a comparison of the net degree of spike interdependence over the different compartments, we must further normalize the covariance at each compartment by a factor depending on the firing rates of the two cells. This normalization procedure is the critical issue in the formulation of the JPSTH. The main concern of this study is how to introduce an adequate normalization.

2.2 Standpoint of Analysis. There are two different ways of using the JPSTH to extract the degree of interdependence. In the first way, we consider the JPSTH as an abstract statistical measure (statistic) of the spike interdependence between the two cells. The element in the raw-JPSTH is transformed into some statistical measure, such as the correlation coefficient. The measure is defined purely for statistical purposes and has nothing to do with the mechanism of the point processes generating the spike data. In this sense, this approach is essentially model free. The choice of a particular measure of interdependence is not unique, and various statistical measures with different characteristics (e.g., efficacy, contribution, correlation density, correlation coefficient) have been introduced (Brillinger, 1975; Palm, Aertsen, & Gerstein, 1988; Aertsen et al., 1989).
The second way starts from an explicit definition of a simple probabilistic model of the point processes, characterized by the firing rates of the two cells and a parameter controlling the interaction strength between the spike events at the two cells. We call this parameter the interaction mechanism strength. We first solve a forward problem to obtain the relationship between these model parameters and the covariance of the spike data generated by this model. With this relation, the normalization of the covariance is defined so as to reproduce the model parameter of interaction mechanism strength. The JPSTH thus works as a tool to facilitate the inference of the interaction mechanism strength from the given spike data. In this model-based approach, the model is introduced as a conventional tool (an equivalent circuit) used for our description of the structure of spike interdependence. We attempt to interpret the structure of interdependence in the data through its correspondence to phenomena in the particular model. Here again, the choices of the model and of the parameter of interaction mechanism strength are not unique. Aertsen et al. (1989) took an approach somewhat intermediate between the two. They first defined a specific probabilistic spike model (Aertsen & Gerstein, 1985). The model was then used to generate spike data in which temporal modulations of both the firing rates and the interaction mechanism strength were intermingled. Subsequently, they searched for a way to normalize the JPSTH so that the modulation profile of the interaction mechanism strength of their model was qualitatively reproduced. After various normalization trials, they finally adopted the one that divides the (raw − predicted)-JPSTH (covariance) by the product of the standard deviations of the firing rates of the two cells. This normalization procedure coincides with the transformation of the raw-JPSTH into the correlation coefficient.
Their choice of normalization was made through empirical trial and error, not by solving the forward problem of their original spike model. Aertsen et al. (1989) calibrated their normalization with positively interdependent spike data from two cells firing at the same low rate. As we will show in section 3, the interaction mechanism strength of their original model is well approximated by the correlation coefficient only in such a limited parameter region. Aertsen et al. (1989) called the elements of their normalized-JPSTH the effective connectivity between the two cells. Logically, there are two possible interpretations of the effective connectivity. In the model-free approach, the effective connectivity is nothing but a correlation coefficient. It is an abstract statistical measure of interdependence and has nothing to do with any model of point processes. In the model-based approach, the effective connectivity, in the full parameter regime, represents a parameter of interaction mechanism strength of some model other than the original model of Aertsen and Gerstein (1985). The model, which is defined implicitly by their normalization, and its interaction mechanism strength (the effective connectivity) coincide with those of the original model setup only in the limited parameter
Figure 1: Algorithm for imposing an interaction mechanism strength $a$ on independent spike trains with firing rates $r_1$ and $r_2$.
region. It is important to keep the two standpoints clearly separate to avoid confusion in our discussion. Aertsen et al. (1989) stressed a model-based approach by regarding the effective connectivity as "a minimum neuronal circuit model that would reproduce such correlation measurement." In the following study, we also apply the model-based approach to examine the characteristics of the effective connectivity.

3 Statistical Evaluation of the JPSTH

Our main concern with the JPSTH is how well the effective connectivity estimates the temporal structure of the interaction parameter in the spike data. Its results are often taken as a universal estimate of interaction strength, independent of modulations in the firing rates. In order to evaluate the adequacy of this estimate, we introduce another spike model in which the interaction mechanism strength is characterized by a parameter different from the effective connectivity. We then examine how the temporal structure of interdependence described by our parameter is related to that described by the effective connectivity.

3.1 Model. We introduce a simple model of two cells. The generation of interdependent spike trains by the model is schematically depicted in Figure 1. First, Poisson random spike trains are generated for cell 1 and cell 2 with mean firing rates of $r_1$ and $r_2$, respectively ($0 \le r_1, r_2 \le 1$).¹ The two spike trains have no significant interdependence beyond the accidental component. A positive interdependence is then imposed

¹ In actual numerical simulations, the trial duration is divided into multiple bins of width $\Delta$. Initially we assign the value 0 to each bin, representing the absence of a spike event. The value is replaced by 1 when the algorithm imposes a spike event at the corresponding bin. Thus, spike timing is quantized in units of $\Delta$. The mean firing rate $r$ is the probability of finding a spike event in each bin.
on the spike trains by the following algorithm. Let $t_1(i)$ be the occurrence time of the $i$th spike event in the spike train of cell 1. Consider the event at time $t_1(i) + \tau$ in the spike train of cell 2, where $\tau$ is the delay interval of the interdependent spike events. When there is already a spike event at that time, leave it unaffected. When there is no spike, however, add a spike event with probability $a$ ($0 \le a \le 1$). In Figure 1, the inserted spike events are represented by thick lines. Repeat this procedure for all spike events in the spike train of cell 1. As a result of these procedures, we can impose spike interdependence of the specified strength with time delay $\tau$. Pairs of spike trains with the same statistical properties are generated over multiple trials for statistical averaging. A negative interdependence between the two spike trains can be imposed by reversing the above procedure: when cell 2 has a spike event at time $t_1(i) + \tau$, remove it with probability $|a|$, where $-1 \le a \le 0$; when there is no spike event at that instant, do nothing. The degree of interdependence is quantified by the parameter of interaction mechanism strength $a$ in our model. Our model is identical to that of Aertsen and Gerstein (1985), except that they did not consider the case in which cell 2 already has a spike event when the interdependent spike is inserted. Thus, the two models become truly identical when the firing rate of cell 2 is low.

3.2 Simulation. We show a typical example of the JPSTH analysis of the simulated spike trains. Figure 2a shows the raw-JPSTH of spike trains with a trial duration of 300 msec. As seen in the two PSTHs, both cells had a lower firing rate ($r_1 = r_2 = 0.14$) in the first and last 100-msec intervals of the trial duration. The firing rates of the two cells increased to a higher value ($r_1 = r_2 = 0.32$) in the intermediate 100-msec interval.
The interaction mechanism strength, however, was kept at a constant negative value $a = -0.8$, and the time delay $\tau$ was set to 0 msec throughout the trial duration. Thus, cell 2 had coincident spike firings with cell 1 much less frequently than the purely accidental component. In the computation of the PST coincidence histograms in Figures 2a–c, we smoothed the diagonal matrix elements along the diagonal by a moving average using a Gaussian filter with $\sigma = 2$ msec. Figures 2b and 2c, respectively, show the predicted-JPSTH and the (raw − predicted)-JPSTH. The degree of departure from the null hypothesis (covariance) in the PST coincidence histogram in Figure 2c remains apparently influenced by the nonstationary modulation of the firing rates, even though the interaction mechanism strength was kept constant throughout the trial duration.

3.3 Normalization. We attempt to separate the modulation in the (raw − predicted)-JPSTH into two components: the modulation derived purely from the firing rates and that of the interaction mechanism strength in the individual pairs of spikes of the two cells. This procedure requires normalization of the matrix elements of the (raw − predicted)-JPSTH by a factor depending
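The generation procedure of section 3.1 can be sketched in a few lines (a simplified illustration with my own function names and a fixed seed, not the authors' code; the lag `tau` is in bins):

```python
import numpy as np

rng = np.random.default_rng(0)

def interacting_pair(r1, r2, a, n_bins, tau=0):
    # Independent Bernoulli trains, then spikes inserted into (a > 0)
    # or removed from (a < 0) cell 2 at lag tau after each cell-1 spike.
    s1 = (rng.random(n_bins) < r1).astype(int)
    s2 = (rng.random(n_bins) < r2).astype(int)
    for i in np.flatnonzero(s1):
        j = i + tau
        if 0 <= j < n_bins:
            if a > 0 and s2[j] == 0:
                s2[j] = int(rng.random() < a)    # insert with probability a
            elif a < 0 and s2[j] == 1:
                s2[j] = int(rng.random() >= -a)  # remove with probability |a|
    return s1, s2

# one trial with the parameters of the low-rate epochs described above
s1, s2 = interacting_pair(0.14, 0.14, -0.8, 300)
```

Averaging many such trials at $\Delta = 1$ msec would reproduce the setting of Figure 2.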
Figure 2: JPSTH of simulated spike data. (a) Raw-JPSTH. (b) Predicted-JPSTH. (c) (Raw − predicted)-JPSTH. The color-code scales of the JPSTH, the PST coincidence histogram, and the cross-correlogram differ among the three plots. Minimum and maximum values of the matrix elements in the three JPSTHs are (a) 0.00, 0.220; (b) 0.00, 0.176; (c) −0.104, 0.124. The firing rates $r_1$ and $r_2$ of the two cells had a nonstationary modulation (0.14 → 0.32 → 0.14). The interaction mechanism strength was constant ($a = -0.8$, $\tau = 0$ msec). PST coincidence histograms were smoothed using a Gaussian filter with $\sigma = 2$ msec. ($T = 300$ msec, $\Delta = 1$ msec, $K = 50$.)
on the firing rates of the two cells. First, we apply the normalization procedure of Aertsen et al. (1989), which transforms each matrix element into the effective connectivity:

$$\langle\text{normalized-JPSTH}\rangle = \frac{\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle}{\sqrt{r_1(1-r_1)\,r_2'(1-r_2')}}, \tag{3.1}$$
where $r_1 = \langle n_1(u)\rangle$ and $r_2' = \langle n_2(v)\rangle$ are the firing rates of cell 1 at the $u$th bin and of cell 2 at the $v$th bin, respectively. The prime is added to $r_2'$ so that it is not confused with $r_2$, the firing rate of the initial spike train. Figure 3a shows the normalized-JPSTH computed by equation 3.1. We omit the display of the matrix elements. It is evident that this normalization does not work with our data: the inferred interaction mechanism strength in the PST coincidence histogram still shows an apparent modulation by the firing rates. This discrepancy derives from the fact that our interaction mechanism strength, the parameter $a$, is different from that in the model setup of Aertsen et al. (1989), the effective connectivity. Recall that Aertsen et al. (1989) introduced the effective connectivity as a parameter of interaction mechanism strength in their implicit model setup. The temporal structure of modulations in the model-based parameter differs when we apply a different assumption about the model and the interaction mechanism strength that generated the data. In order to find the proper normalization of the JPSTH that reproduces our model parameter, we need to know the relationship between the (raw − predicted)-JPSTH (covariance) and the model parameters, that is, how sensitively the raw-JPSTH departs from the predicted-JPSTH when an interaction mechanism strength $a$ is imposed on two independent spike trains having firing rates $r_1$ and $r_2$. This relationship is obtained only from the forward problem of the spike model. Thus, the computation of the normalized-JPSTH inevitably becomes model dependent. As summarized in the appendix, straightforward calculations give the following expression for our spike model:

$$\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle = r_1(1-r_1)(1-r_2)\,a \quad\text{(for positive interdependence)}. \tag{3.2}$$

By adding the interdependent spikes, the firing rate of cell 2 is increased to

$$r_2' = r_2 + r_1(1-r_2)\,a. \tag{3.3}$$

Note that only the quantities $r_1$, $r_2'$, and $\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle$ are available in the analysis of experimental data. We need to compute the two unknown parameters, $r_2$ and $a$, by solving the simultaneous equations 3.2 and 3.3. Substitution of equation 3.3 into equation 3.2 leads to

$$\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle = \frac{a}{1-a r_1}\, r_1(1-r_1)(1-r_2'). \tag{3.4}$$
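Equation 3.4 is linear in $a$ once rearranged, so the interaction strength can be recovered from the observables $r_1$, $r_2'$, and the covariance. A round-trip check against the forward problem (a sketch; the solver name is mine):

```python
def solve_a_positive(C, r1, r2p):
    # C = a * r1*(1 - r1)*(1 - r2p) / (1 - a*r1)   (eq. 3.4)
    # => C - C*a*r1 = a * r1*(1 - r1)*(1 - r2p), solved linearly for a
    return C / (r1 * (1 - r1) * (1 - r2p) + C * r1)

# forward problem (eqs. 3.2 and 3.3) with an arbitrary parameter choice
r1, r2, a = 0.2, 0.3, 0.5
C = r1 * (1 - r1) * (1 - r2) * a      # eq. 3.2
r2p = r2 + r1 * (1 - r2) * a          # eq. 3.3
a_hat = solve_a_positive(C, r1, r2p)  # recovers a = 0.5
```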
Figure 3: Normalized-JPSTH of simulated spike data, applying (a) the form of Aertsen et al. (1989) and (b) our form. The scales of the PST coincidence histogram and the cross-correlogram differ between the two plots. The display of matrix elements is omitted, and the PST coincidence histograms are smoothed using a Gaussian filter with $\sigma = 2$ msec.
The normalization of the (raw − predicted)-JPSTH is attained by solving the above algebraic equation for $a$. For negative interdependence ($-1 \le a \le 0$), we get

$$\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle = r_1(1-r_1)\,r_2\,a \quad\text{(for negative interdependence)}. \tag{3.5}$$

Substitution of the relationship

$$r_2' = r_2 + r_1 r_2\, a \tag{3.6}$$

results in

$$\langle(\text{raw}-\text{predicted})\text{-JPSTH}\rangle = \frac{a}{1+a r_1}\, r_1(1-r_1)\,r_2'. \tag{3.7}$$
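The negative branch inverts in the same way; a companion round-trip check against equations 3.5 through 3.7 (again a sketch, not the authors' code):

```python
def solve_a_negative(C, r1, r2p):
    # C = a * r1*(1 - r1)*r2p / (1 + a*r1)   (eq. 3.7)
    # => C + C*a*r1 = a * r1*(1 - r1)*r2p, solved linearly for a
    return C / (r1 * (1 - r1) * r2p - C * r1)

r1, r2, a = 0.2, 0.3, -0.8
C = r1 * (1 - r1) * r2 * a            # eq. 3.5
r2p = r2 + r1 * r2 * a                # eq. 3.6
a_hat = solve_a_negative(C, r1, r2p)  # recovers a = -0.8
```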
The normalized-JPSTH obtained by applying equation 3.7 is shown in Figure 3b. Now the PST coincidence histogram shows an almost constant interaction mechanism strength throughout the trial duration. The mean value averaged over the whole trial duration, −0.78, is in good quantitative agreement with the value used to generate the original spike data, −0.8.

3.4 Comparison to Aertsen et al.'s (1989) Normalization. In our model-based formulation, the firing rates before adding the interdependent spikes, $r_1$ and $r_2$, and the interaction mechanism strength $a$ are the independent variables, which we can set arbitrarily. From equations 3.2 and 3.5, the (raw − predicted)-JPSTH is expressed in a simpler form by using these variables. In Aertsen et al.'s (1989) formulation with an implicit model setup, on the other hand, the firing rates of the interdependent spike data, $r_1$ and $r_2'$, and the effective connectivity are the independent variables. Their normalization factor is given by

$$(\text{Aertsen et al.'s (1989) normalization}) = \sqrt{r_1(1-r_1)\,r_2'(1-r_2')}, \tag{3.8}$$

for both positive and negative interdependence. The different independent variables of the two formulations prevent a straightforward comparison. From equations 3.4 and 3.7, we can obtain the corresponding factors in our formulation:

$$(\text{Our normalization for positive interdependence}) = \frac{r_1(1-r_1)(1-r_2')}{1-a r_1}, \tag{3.9}$$

$$(\text{Our normalization for negative interdependence}) = \frac{r_1(1-r_1)\,r_2'}{1+a r_1}. \tag{3.10}$$

In this form, the normalization factor itself depends on the interaction mechanism strength $a$. This is not a problem when we regard the parameter $a$ as the solution of equation 3.4 or 3.7 rather than as an unknown variable.
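The factors in equations 3.8 through 3.10 are easy to compare numerically. For example, at $r_1 = r_2' = r$, the positive-case factor of equation 3.9 with $a = 1$ reduces to $r(1-r)$, which equals equation 3.8. A small sketch (function names are mine):

```python
import math

def aertsen_factor(r1, r2p):
    # eq. 3.8: sqrt(r1(1 - r1) r2'(1 - r2'))
    return math.sqrt(r1 * (1 - r1) * r2p * (1 - r2p))

def our_factor(r1, r2p, a):
    # eq. 3.9 for a >= 0, eq. 3.10 for a < 0
    if a >= 0:
        return r1 * (1 - r1) * (1 - r2p) / (1 - a * r1)
    return r1 * (1 - r1) * r2p / (1 + a * r1)

r = 0.05
equal_at_a1 = abs(our_factor(r, r, 1.0) - aertsen_factor(r, r)) < 1e-12
```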
Figure 4: Dependence of the normalization factor on the firing rate $r$ with (a) positive interdependence and (b) negative interdependence. The firing rates $r_1$ and $r_2'$ of the two cells are constrained to equal $r$. Graphs are plotted for different values of $a$: (a) 0.0, 0.2, 0.4, 0.6, 0.8, 1.0; (b) 0.0, −0.2, −0.4, −0.6, −0.8, −1.0. (Thick lines represent the normalization factor of Aertsen et al. (1989); the two panels have different ordinate scales.)
To show the difference between the normalization factors of the two formulations, we plot them as functions of the firing rate r in Figure 4, for positive interdependence and negative interdependence, respectively. Here we consider only the special case r1 = r2′ = r for simplicity. In each panel, the graphs are plotted for different values of a, and the thick line represents the normalization factor of Aertsen et al. (1989), that is, r(1 − r). Note that the ordinate has a different scale in the two panels. Since the (raw − predicted)-JPSTH is an estimate of the covariance between the probabilities of a spike event at the two cells, it becomes zero when the cells either
always fire (r = 1) or never fire (r = 0), and so does the normalization.² For positive interdependence, the graphs are rather similar at lower firing rates (r ≤ 0.1). Our normalization coincides with that of Aertsen et al. (1989) when a = 1. For negative interdependence, on the other hand, they are very different even at lower firing rates. Therefore, the difference between our normalized-JPSTH and that of Aertsen et al. (1989) becomes more prominent with negative interdependence. In the general parameter region of lower firing rates (r1, r2′ ≪ 1), the normalization factors are approximated as

(Aertsen et al.'s (1989) normalization) ≈ sqrt(r1 r2′),  (3.11)

and

(Our normalization for positive interdependence) ≈ r1,  (3.12)

(Our normalization for negative interdependence) ≈ r1 r2′.  (3.13)
Our spike model is identical to that of Aertsen and Gerstein (1985) in this parameter region. Thus, we see that Aertsen et al.'s (1989) normalization (effective connectivity) can reproduce the interaction mechanism strength of their original spike model only when r1 = r2′ ≪ 1 for positive interdependence. This was exactly the parameter region that they used for simulating spike data to calibrate their normalization procedure (Aertsen et al., 1989). For negative interdependence, their normalization always underestimates the interaction mechanism strength of the original model. This asymmetry between positive and negative interdependence is an intrinsic property of the point processes of the spike events (Palm et al., 1988) and was the main conclusion of the study by Aertsen and Gerstein (1985) on the cross-correlogram.

3.5 Simulation of Nonstationary Interdependent Structure. We examined another example in which the interaction mechanism strength showed a nonstationary modulation but the firing rates of the two cells were kept constant. The initial Poisson random spike trains were generated with a firing rate of r1 = r2 = 0.32 throughout the trial duration of 300 msec. Then, in the intermediate 100-msec interval, positively interdependent spike events were added to the spike train of cell 2 by imposing a strength of a = 0.3. The time delay was set at τ = 0 msec. The interdependent spikes were not added in the first and last 100-msec intervals, where a = 0. Only the

² When a = −1, our normalization factor has a singularity at r = 1, which derives from our enforcement of its expression by r2′. Since r1 = 1 and a = −1, every initial spike event in cell 2 should be removed by the algorithm. This is impossible under the current constraint of r2′ = 1.
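The generation procedure just described (independent Bernoulli spike trains, followed by conditional insertion of spikes into cell 2 with a bin-dependent strength a(u)) can be sketched as follows; the function name and the use of Python's random module are my own, not from the article:

```python
import random

def interdependent_trains(r1, r2, a, T, tau=0, seed=0):
    """Generate one trial of two Bernoulli spike trains of length T bins,
    then impose positive interdependence of strength a (a list, one value
    per bin): whenever cell 1 fires at bin u and cell 2 is silent at bin
    u + tau, a spike is added to cell 2 with probability a[u]."""
    rng = random.Random(seed)
    n1 = [int(rng.random() < r1) for _ in range(T)]
    n2 = [int(rng.random() < r2) for _ in range(T)]
    for u in range(T):
        v = u + tau
        if v < T and n1[u] == 1 and n2[v] == 0 and rng.random() < a[u]:
            n2[v] = 1
    return n1, n2

# Strength profile a: 0 -> 0.3 -> 0 over a 300-bin trial, as in Figure 5
a = [0.0] * 100 + [0.3] * 100 + [0.0] * 100
n1, n2 = interdependent_trains(0.32, 0.32, a, 300)
```

Averaging the coincidence counts over many such trials, bin by bin, yields the raw-JPSTH diagonal from which Figure 5 is computed.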
Figure 5: Normalized-JPSTH of simulated spike data applying our normalization. The data have stationary firing rates for the two cells (r1 = r2 = 0.32) and a nonstationary modulation of the interaction mechanism strength within the trial duration (a: 0 → 0.3 → 0, τ = 0 msec). The display of the matrix elements is omitted, and the PST coincidence histogram is smoothed using a gaussian filter with σ = 2 msec. (T = 300 msec, Δ = 1 msec, K = 50.)
PST coincidence histogram of the normalized-JPSTH applying our normalization is shown in Figure 5, demonstrating that the nonstationary structure of interdependence is faithfully reproduced. The mean value of the interaction mechanism strength averaged over the intermediate 100-msec interval is 0.32, close to the original value.

4 Discussion

4.1 Model Dependence. We could correctly normalize the JPSTH in Figure 3b so as to reproduce our model parameter. This was because we knew that the given spike data were initially generated by the same model. However, if we start our analysis only from the spike data, without knowing the spike model, as in the analysis of experimental data, the model selected is no longer unique. There are other possible models that generate spike data with the same statistical structure as the given data. The parameter of interaction mechanism strength quantifying the degree of interdependence could be defined differently in each model, so we would need a different normalization procedure to reproduce that parameter. Thus, when we apply a different assumption about the model or mechanism that led to the data, the temporal structure of modulations in that model-based parameter (interaction mechanism strength) is different. The quantification of the degree
of interdependence through the inference of a model-based parameter is an ill-posed inverse problem without a unique solution. Aertsen et al. (1989) explicitly cautioned against the misunderstanding that the temporal modulation of the effective connectivity represents a universal characteristic that is independent of the modulation of the firing rates. However, this misunderstanding still seems to be common. Like our parameter a, the effective connectivity is not a universal measure of spike interdependence. It is just one of many possible ways of describing the degree of interdependence.

4.2 Logic in the Determination of Measures of Interdependence. Up to this point, we have retained a model-based approach in order to examine the characteristics of the effective connectivity. On the other hand, as explained in section 2, a variety of model-free measures of interdependence have been suggested. Although each abstract statistical measure is mathematically well defined, there is nonuniqueness in the choice of a particular measure adopted for characterizing the structure of interdependence in given spike data. Of course, when model-free measures with different characteristics (e.g., correlation density and correlation coefficient) are applied to the same data, we get qualitatively different modulation profiles of the degree of interdependence (Aertsen et al., 1989). Those descriptions can be used to study different aspects of the structure of spike interdependence. The logical structure of this nonuniqueness can be clearly understood by applying an elementary concept borrowed from information geometry (Amari, 1985, 1998). At each compartment (u, v) in the JPSTH, a set of joint probabilities {p_ij} = (p00, p01, p10, p11) can be computed from the spike data. The quantity p_ij represents the joint probability of the event that the spike count at the uth bin of cell 1 is i (0 or 1) and that at the vth bin of cell 2 is j.
A particular set of probabilities {p_ij} can be regarded as a point in the space of probability distributions under the constraint p00 + p01 + p10 + p11 = 1. Characterization of the statistical structure of the set {p_ij} is performed by a transformation into a new coordinate set composed of statistical measures. Initially, the firing rates of the two cells, r1 = p10 + p11 and r2′ = p01 + p11, are selected as universal measures. Then the set {p_ij} is represented by the new coordinate set (r1, r2′, p11), which is exactly the raw-JPSTH. In this article, we formulated our measure of interdependence through the normalization of the covariance, p11 − r1 r2′. However, the measure ξ can be introduced in a more general form by imposing a particular relation, ξ = ξ(r1, r2′, p11), satisfying the mathematical condition of a one-to-one mapping.³ Then the set {p_ij} can be represented by the coordinate

³ Actually, the measure proposed by Amari (1998) in the framework of information geometry has the form ξ = log (p00 p11 / (p01 p10)).
system (r1, r2′, ξ), which corresponds to the normalized-JPSTH. Just as there is nonuniqueness in the selection of a coordinate system in Euclidean space, there are infinitely many possibilities in the selection of such a measure of interdependence. Any measure is equally adequate as long as the coordinate system (r1, r2′, ξ) uniquely indicates the point {p_ij}. The value of the coordinate ξ (degree of interdependence) has only a relative meaning with respect to the selected coordinate system. We can discuss the variations of the degree of interdependence within the trial duration by applying a particular measure. However, the description does not have a universal meaning independent of the choice of measure. The interrelationship between the descriptions given by different measures can be examined, as we did with the numerical simulation in the last section. Although the measure of interdependence is a model parameter of interaction mechanism strength in the model-based approach, it is an abstract statistical measure (statistic) in the model-free approach. However, the two approaches share the same logical structure: to determine a measure ξ, we need to impose some constraint or criterion on its value, leading to a particular relation ξ = ξ(r1, r2′, p11). In the model-based approach, the introduction of a particular model of the point processes imposes a specific parameter of interaction mechanism strength ξ. Aertsen et al. (1989) introduced an implicit model and defined their model-based measure (effective connectivity) under the condition that, in a limited parameter regime, it approximated the interaction mechanism strength of another explicit model they had proposed previously. We also introduced a similar explicit model and formulated a JPSTH normalization procedure that reproduces our parameter of interaction mechanism strength in the entire parameter regime.
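The coordinate transformation {p_ij} → (r1, r2′, p11) and the information-geometric measure ξ from footnote 3 can be written out concretely. A minimal sketch (the function names are mine, and the probabilities are passed as a dict keyed by (i, j)):

```python
import math

def coordinates(p):
    """Map joint probabilities {p_ij} at one JPSTH compartment to the
    raw-JPSTH coordinate set (r1, r2', p11)."""
    r1 = p[(1, 0)] + p[(1, 1)]
    r2p = p[(0, 1)] + p[(1, 1)]
    return r1, r2p, p[(1, 1)]

def amari_xi(p):
    """Amari's (1998) measure: xi = log(p00 * p11 / (p01 * p10))."""
    return math.log(p[(0, 0)] * p[(1, 1)] / (p[(0, 1)] * p[(1, 0)]))

# Under statistical independence (p_ij = product of marginals), xi is 0:
p_indep = {(0, 0): 0.42, (0, 1): 0.28, (1, 0): 0.18, (1, 1): 0.12}
print(coordinates(p_indep), amari_xi(p_indep))
```

Any other one-to-one relation ξ = ξ(r1, r2′, p11) would serve equally well as the third coordinate; this particular choice is singled out only by the distance structure Amari imposes on the probability space.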
In the model-free approach, on the other hand, the selection of a particular measure should be guided by purely statistical considerations (Brillinger, 1975; Palm et al., 1988). Among the various statistical measures having a symmetrical dependence on the firing rates of the two cells, the correlation coefficient is most often used because of its established popularity. Amari (1998) suggested that a measure can be derived in a less subjective manner in the mathematical framework of information geometry by imposing a particular measure of distance between points in the space of probability distributions {p_ij}.

4.3 Universal Characteristics of the Structure of Interdependence. Although the modulation profile of the degree of interdependence is itself measure dependent, we can discuss universal characteristics in special cases. The first case is when there is a significant modulation in the degree of interdependence within the trial duration, although the firing rates of the two cells show little modulation. We simulated such a case in Figure 5. Another case is when the modulation profile of the degree of interdependence shows a significant change between different contexts of the experiment (different stimuli or tasks), but the PSTH of each cell remains unchanged.
In these cases, at least the fact that there exists a change in the degree of interdependence is a measure-independent characteristic. An example was introduced by Vaadia et al. (1995) in recordings of two cells in the prefrontal cortex of a monkey. In both of the two paradigms of a delayed localization task, there was very little modulation of the firing rate of either cell throughout the trial duration. Nevertheless, the JPSTH with Aertsen et al.'s (1989) normalization revealed a nonstationary modulation of the effective connectivity on the order of a few tens of milliseconds. Interestingly, the modulation showed significantly different profiles in the two paradigms. The modulation of the firing rates and its influence on the degree of interdependence have the same timescale, as seen in Figure 3. Thus, if there is a statistically significant modulation of the degree of interdependence at a different (possibly faster) timescale than that of the firing rates, it would be an indication of a universal statistical characteristic of the structure of interdependence. In any measure of interdependence, statistical independence between the two spike trains is assigned a null value. Hence the sign of the interdependence (positive or negative) has a universal meaning. By adopting any measure of interdependence, we can statistically test whether the data depart significantly from the null hypothesis of statistical independence. Thus, the standard cross-correlogram and unitary event analysis (Riehle, Grün, Diesmann, & Aertsen, 1997), by which we look for significantly interdependent events, are basically model-free analyses. However, the same difficulty as with the JPSTH arises whenever we attempt to normalize the correlogram for a comparison of the degree of interdependence between data with different firing rates (Abeles, 1982b).
Nevertheless, even if we apply a different normalization procedure, it simply changes the overall scale of the frequencies in the histogram, so the structure of the cross-correlogram remains qualitatively the same. Different normalization procedures in the JPSTH, however, can lead to qualitatively different results.

4.4 Future Problems. We have avoided introducing a large-scale model composed of physiologically plausible units, such as a detailed compartmental model. Our simple probabilistic model was nothing more than an abstract algorithm for imposing spike interdependence. In addition, the model parameters used in the simulations were not necessarily within the physiologically relevant range of the actual cortical system, which has low firing rates and weak neuronal connectivity. This is because we introduced our model as a convenient tool for describing the structure of interdependence rather than as a simulator of realistic cortical dynamics. The following two tasks can be carried out independently: characterizing the statistical structure of spike data, and solving the inverse problem of determining what actual neuronal network mechanism realizes the observed statistical characteristics.
For future research, it would be valuable to simulate an in virtu experiment (Aertsen, 1988; Aertsen, Erb, & Palm, 1994) using a physiologically plausible network model on a larger scale. We could numerically simulate the dynamics of the network and perform a virtual recording of the spike activities of only two cells. Then we would infer the effective connectivity by the JPSTH, as we do for actual experimental data. In contrast to an actual experiment, all numerical data would be available on the activities of the other, "nonrecorded" cells. We could then discuss how the global activities of the network influence the temporal modulation of the degree of interdependence between the two cells. Such knowledge is important for the inference of network activities based on real experimental data, where the activities of the other, nonrecorded cells are unknown. It would also be valuable to carry out systematic evaluations of which measures are relevant for detecting particular network events. Melssen and Epping (1987) investigated the adequacy of cross-correlation procedures for detecting and estimating neuronal connectivity by computer simulations of small networks composed of fairly realistic model neurons. They demonstrated that besides the strength of connectivity and the firing rate, the neurons' internal state—the nonlinear spike-generating characteristics—also influenced the visibility and detectability of neuronal connectivity. In their in virtu experiment, they examined how nonlinear superposition of stimulus-driven activities degraded the reliability of estimates of neuronal connectivity. Their results suggested that there is no universal procedure for estimating neuronal connectivity equally adequately across all modes of neuronal activity. This conclusion is also supported by our current study. Throughout this study, we have assumed that the bin width of the JPSTH is small enough that no bin ever holds more than one spike event per trial.
However, in many analyses of experimental data, this assumption is not satisfied. In actual recordings, the firing rates are low, and neurons are recorded during only a finite number of trials; neurons nevertheless sometimes fire in bursts. Therefore, if bins small enough to guarantee the above condition were used, most bins would be empty, and the structure in the JPSTH would be very scattered and difficult to visualize. Broader bins are used to smooth the scattered structure in the JPSTH. We stress that our conclusions about the JPSTH are of a broader nature that does not depend on this assumption about the bin width.

5 Conclusion

We examined the mathematical structure of how the JPSTH quantifies the degree of interdependence between the spike events of two cells after excluding the contribution of the firing rates. We introduced a simple probabilistic model of interacting point processes to generate simulated spike data and showed that the normalized-JPSTH incorrectly inferred the temporal structure of variations in the interaction mechanism strength. This occurred
because, in our model, the correct normalization of the firing-rate contributions was different from that used in Aertsen et al.'s (1989) "effective connectivity" model. This demonstrates that firing-rate modulations cannot be corrected for in a model-independent manner. Finally, our purpose is not to encourage the use of our normalization form instead of Aertsen et al.'s (1989) standard normalization. It is to raise awareness that no model parameter of spike interdependence—neither Aertsen et al.'s (1989) effective connectivity nor ours—represents a universal characteristic that is independent of modulations of the firing rates. Aertsen et al.'s (1989) effective connectivity may still be used in the analysis of experimental data, provided there is an awareness that it is simply one of many ways of describing the structure of interdependence. We recommend first characterizing the statistical structure of interdependence by applying an a priori model-free measure (e.g., the correlation coefficient). Its results can then be used to infer the interaction mechanism strength of a particular model after assumptions about the interaction mechanism are made.

Appendix: Expression of the (raw − predicted)-JPSTH
The proper normalization of the (raw − predicted)-JPSTH can be obtained from the relationship between the JPSTH and the model parameters.

A.1 Positive Interdependence. We consider the case where the algorithm imposes spike interdependence with a strength a and a delay τ on two independent random spike trains with mean firing rates r1 and r2. Owing to the constraint that a temporal bin contains no more than one spike event, the mean firing rate represents the probability of finding a spike event in each bin. Let n_i(u) be a variable representing the occurrence of a spike event in cell i (i = 1, 2) at the uth bin. It takes the value 1 or 0 depending on whether there is a spike. The matrix element of the raw-JPSTH at (u, u + τ) is given by the probability P[n1(u) = 1 ∧ n2(u + τ) = 1], where P[A ∧ B] represents the joint probability of the two events A and B. This is computed by

P[n1(u) = 1 ∧ n2(u + τ) = 1] = P[n1(u) = 1] P[n2(u + τ) = 1 | n1(u) = 1],  (A.1)

where P[A | B] represents the conditional probability of event A given event B. In the spike trains before adding interdependent spikes, the two events n1(u) = 1 and n2(u + τ) = 1 are independent, so that P[n1(u) = 1] = r1 and P[n2(u + τ) = 1 | n1(u) = 1] = P[n2(u + τ) = 1] = r2. Then,

P[n1(u) = 1 ∧ n2(u + τ) = 1] = r1 r2   (before imposing interdependence).
On the other hand, after the algorithm adds interdependent spikes, there are two possibilities for the occurrence of the event n2(u + τ) = 1 given n1(u) = 1:

1. n2(u + τ) = 1 already in the initial spike train.

2. n2(u + τ) = 0 in the initial spike train; then a spike is added with probability a.

The conditional probability of the first event is r2 and that of the second event is (1 − r2) a. Since the two events are exclusive, we get

P[n2(u + τ) = 1 | n1(u) = 1] = r2 + (1 − r2) a.  (A.2)

Substitution of this result into equation A.1 gives the matrix element of the raw-JPSTH,

P[n1(u) = 1 ∧ n2(u + τ) = 1] = r1 {r2 + (1 − r2) a},  (A.3)  ⟨raw-JPSTH⟩.
By adding interdependent spikes, the algorithm increases the firing rate of cell 2, which can be computed from the relation

P[n2(u + τ) = 1] = P[n1(u) = 0] P[n2(u + τ) = 1 | n1(u) = 0] + P[n1(u) = 1] P[n2(u + τ) = 1 | n1(u) = 1].

Substituting P[n2(u + τ) = 1 | n1(u) = 0] = r2 and equation A.2 into this relation yields

P[n2(u + τ) = 1] = (1 − r1) r2 + r1 {r2 + (1 − r2) a} = r2 + r1 (1 − r2) a.

The matrix element of the predicted-JPSTH is given by the probability of an accidental joint event between two independent spike trains with rates r1 and r2′ = r2 + r1 (1 − r2) a, that is,

P[n1(u) = 1 ∧ n2(u + τ) = 1] = r1 r2′ = r1 {r2 + r1 (1 − r2) a},  (A.4)  ⟨predicted-JPSTH⟩.

Finally, the (raw − predicted)-JPSTH is given by subtracting equation A.4 from equation A.3,

δP[n1(u) = 1 ∧ n2(u + τ) = 1] = r1 (1 − r1)(1 − r2) a,  ⟨(raw − predicted)-JPSTH⟩.
A.2 Negative Interdependence. After the algorithm removes interdependent spikes from the spike train of cell 2, there is a unique possibility for the occurrence of the event n2(u + τ) = 1 given n1(u) = 1: n2(u + τ) = 1 in the initial spike train, and the spike is not removed, which happens with probability 1 − |a|, where −1 ≤ a ≤ 0. The probability of this event is given by

P[n2(u + τ) = 1 | n1(u) = 1] = r2 (1 − |a|).  (A.5)

Substitution of this result into equation A.1 gives the matrix element of the raw-JPSTH,

P[n1(u) = 1 ∧ n2(u + τ) = 1] = r1 r2 (1 − |a|),  (A.6)  ⟨raw-JPSTH⟩.

The firing rate of cell 2 is decreased by the removal of interdependent spikes, so that

P[n2(u + τ) = 1] = (1 − r1) r2 + r1 r2 (1 − |a|) = r2 − r1 r2 |a|.

Then the matrix element of the predicted-JPSTH is given by

P[n1(u) = 1 ∧ n2(u + τ) = 1] = r1 (r2 − r1 r2 |a|),  (A.7)  ⟨predicted-JPSTH⟩.

Subtraction of equation A.7 from equation A.6 gives the final result,

δP[n1(u) = 1 ∧ n2(u + τ) = 1] = −r1 (1 − r1) r2 |a|,  ⟨(raw − predicted)-JPSTH⟩.
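The appendix results can be checked numerically. The sketch below restates equations A.3 through A.7 as a function, and the usage confirms the positive-interdependence raw-JPSTH element by direct Monte Carlo simulation; the function name and simulation setup are my own:

```python
import random

def jpsth_elements(r1, r2, a):
    """Raw-, predicted-, and (raw - predicted)-JPSTH matrix elements for the
    interdependence model: equations A.3/A.4 for a >= 0, A.6/A.7 for a < 0."""
    if a >= 0:
        raw = r1 * (r2 + (1 - r2) * a)       # equation A.3
        r2p = r2 + r1 * (1 - r2) * a         # rate of cell 2 after adding spikes
    else:
        raw = r1 * r2 * (1 - abs(a))         # equation A.6
        r2p = r2 - r1 * r2 * abs(a)          # rate of cell 2 after removing spikes
    predicted = r1 * r2p                     # equations A.4 and A.7
    return raw, predicted, raw - predicted

# Monte Carlo check of the raw-JPSTH element for a = 0.5 at delay tau = 0:
# cell 1 fires with prob r1; given a cell-1 spike, cell 2 either already
# fired (prob r2) or receives an added spike (prob a).
rng = random.Random(1)
r1, r2, a, N = 0.3, 0.2, 0.5, 100000
hits = sum(1 for _ in range(N)
           if rng.random() < r1 and (rng.random() < r2 or rng.random() < a))
raw, predicted, diff = jpsth_elements(r1, r2, a)
# hits / N ≈ raw = r1 * (r2 + (1 - r2) * a) = 0.18
```

The returned difference reproduces the closed forms r1(1 − r1)(1 − r2) a and −r1(1 − r1) r2 |a| derived above.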
Acknowledgments

We thank S. Amari for his important comments. We also express our thanks to M. Abeles for his comments and encouragement. We thank L. Kay for critical readings of the manuscript. Finally, we are grateful to the anonymous referees for numerous instructive comments that helped us to improve the focus of the article.

References

Abeles, M. (1982a). Local cortical circuits. Berlin: Springer-Verlag.
Abeles, M. (1982b). Quantification, smoothing, and confidence limits for single-units' histograms. Journal of Neuroscience Methods, 5, 317–325.
Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A. (1988). Discourse on current issues in multi-neuron studies. In W. von Seelen, G. Shaw, & U. M. Leinhos (Eds.), Organization of neural networks (pp. 87–107). Weinheim: VHC Verlag.
Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128.
Aertsen, A. M. H. J., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross-correlation. Brain Research, 340, 341–354.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." Journal of Neurophysiology, 61, 900–918.
Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (1998). Information geometry of neuro-manifolds. In S. Usui & T. Omori (Eds.), Proceedings of ICONIP'98 (pp. 591–593). Amsterdam: IOS Press.
Brillinger, D. R. (1975). The identification of point process systems. Ann. Prob., 3, 909–929.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends in Neuroscience, 15, 218–226.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—Theoretical possibility of spatio-temporal coding in the cortex. Neural Networks, 9(8), 1303–1350.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Transactions in Biomedical Engineering, 36, 4–14.
Gray, C. M., Engel, A. K., König, P., & Singer, W. (1992). Synchronization of oscillatory neuronal responses in cat striate cortex: Temporal properties. Visual Neurosciences, 8, 337–347.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Ito, H. (1998). Need for development of statistical analysis tools of dynamical neuronal code—Statistical evaluation of joint-PSTH. In S. Usui & T. Omori (Eds.), Proceedings of ICONIP'98 (pp. 175–178). Amsterdam: IOS Press.
Ito, H., & Tsuji, S. (1997). Statistical evaluation of joint peri-stimulus time histogram for analysis of dynamical neuronal code. Soc. Neurosci. Abstr., 23, 1400.
Melssen, W. J., & Epping, W. J. M. (1987). Detection and estimation of neural connectivity based on crosscorrelation analysis. Biol. Cybern., 57, 403–414.
Palm, G., Aertsen, A. M. H. J., & Gerstein, G. L. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differently involved in motor cortical function. Science, 278, 1950–1953.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. M. H. J. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
von der Malsburg, C. (1981). The correlation theory of brain function (Internal Rep. No. 81-2). Göttingen: Max-Planck-Institute for Biophysical Chemistry.
von der Malsburg, C. (1986). Am I thinking assemblies? In G. Palm & A. Aertsen (Eds.), Brain theory (pp. 161–176). Berlin: Springer-Verlag.

Received March 30, 1998; accepted February 4, 1999.
LETTER
Communicated by Peter Dayan
Reinforcement Learning in Continuous Time and Space Kenji Doya ATR Human Information Processing Research Laboratories, Soraku, Kyoto 619-0288, Japan
This article presents a reinforcement learning framework for continuous-time dynamical systems without a priori discretization of time, state, and action. Based on the Hamilton-Jacobi-Bellman (HJB) equation for infinite-horizon, discounted reward problems, we derive algorithms for estimating value functions and improving policies with the use of function approximators. The process of value function estimation is formulated as the minimization of a continuous-time form of the temporal difference (TD) error. Update methods based on the backward Euler approximation and exponential eligibility traces are derived, and their correspondences with the conventional residual gradient, TD(0), and TD(λ) algorithms are shown. For policy improvement, two methods are formulated: a continuous actor-critic method and a value-gradient-based greedy policy. As a special case of the latter, a nonlinear feedback control law using the value gradient and a model of the input gain is derived. Advantage updating, a model-free algorithm derived previously, is also formulated in the HJB-based framework. The performance of the proposed algorithms is first tested in a nonlinear control task of swinging up a pendulum with limited torque. The simulations show that (1) the task is accomplished by the continuous actor-critic method in a number of trials several times fewer than by the conventional discrete actor-critic method; (2) among the continuous policy update methods, the value-gradient-based policy with a known or learned dynamic model performs several times better than the actor-critic method; and (3) a value function update using exponential eligibility traces is more efficient and stable than one based on the Euler approximation. The algorithms are then tested in a higher-dimensional task: cart-pole swing-up. This task is accomplished in several hundred trials using the value-gradient-based policy with a learned dynamic model.
Neural Computation 12, 219–245 (2000)  © 2000 Massachusetts Institute of Technology

1 Introduction

The temporal difference (TD) family of reinforcement learning (RL) algorithms (Barto, Sutton, & Anderson, 1983; Sutton, 1988; Sutton & Barto, 1998) provides an effective approach to control and decision problems for which optimal solutions are analytically unavailable or difficult to obtain. A number of successful applications to large-scale problems, such as board games (Tesauro, 1994), dispatch problems (Crites & Barto, 1996; Zhang & Dietterich, 1996; Singh & Bertsekas, 1997), and robot navigation (Mataric, 1994) have been reported (see, e.g., Kaelbling, Littman, & Moore, 1996, and Sutton & Barto, 1998, for reviews). The progress of RL research so far, however, has been mostly constrained to formulations in which discrete actions are taken in discrete time steps based on observation of the discrete state of the system. Many interesting real-world control tasks, such as driving a car or riding a snowboard, require smooth, continuous actions taken in response to high-dimensional, real-valued sensory input. In applications of RL to continuous problems, the most common approach has been first to discretize time, state, and action and then to apply an RL algorithm for a discrete stochastic system. However, this discretization approach has the following drawbacks:

When a coarse discretization is used, the control output is not smooth, resulting in poor performance.

When a fine discretization is used, the number of states and the number of iteration steps become huge, which necessitates not only large memory storage but also many learning trials.

In order to keep the number of states manageable, an elaborate partitioning of the variables has to be found using prior knowledge.

Efforts have been made to eliminate some of these difficulties by using appropriate function approximators (Gordon, 1996; Sutton, 1996; Tsitsiklis & Van Roy, 1997), adaptive state partitioning and aggregation methods (Moore, 1994; Singh, Jaakkola, & Jordan, 1995; Asada, Noda, & Hosoda, 1996; Pareigis, 1998), and multiple timescale methods (Sutton, 1995). In this article, we consider an alternative approach in which learning algorithms are formulated for continuous-time dynamical systems without resorting to the explicit discretization of time, state, and action.
The continuous framework has the following possible advantages:

A smooth control performance can be achieved.

An efficient control policy can be derived using the gradient of the value function (Werbos, 1990).

There is no need to guess how to partition the state, action, and time; it is the task of the function approximation and numerical integration algorithms to find the right granularity.

There have been several attempts at extending RL algorithms to continuous cases. Bradtke (1993) showed convergence results for Q-learning algorithms for discrete-time, continuous-state systems with linear dynamics and quadratic costs. Bradtke and Duff (1995) derived a TD algorithm for continuous-time, discrete-state systems (semi-Markov decision problems). Baird (1993) proposed the "advantage updating" method by extending Q-learning to continuous-time, continuous-state problems. When we consider optimization problems in continuous-time systems, the Hamilton-Jacobi-Bellman (HJB) equation, the continuous-time counterpart of the Bellman equation for discrete-time systems, provides a sound theoretical basis (see, e.g., Bertsekas, 1995, and Fleming & Soner, 1993). Methods for learning the optimal value function that satisfies the HJB equation have been studied using a grid-based discretization of space and time (Peterson, 1993), and convergence proofs have been given for grid sizes taken to zero (Munos, 1997; Munos & Bourgine, 1998). However, the direct implementation of such methods is impractical in a high-dimensional state-space. An HJB-based method that uses function approximators was presented by Dayan and Singh (1996). They proposed learning the gradients of the value function without learning the value function itself, but the method is applicable only to nondiscounted reward problems. This article presents a set of RL algorithms for nonlinear dynamical systems based on the HJB equation for infinite-horizon, discounted-reward problems. A series of simulations are devised to evaluate their effectiveness when used with continuous function approximators. We first consider methods for learning the value function on the basis of minimizing a continuous-time form of the TD error. The update algorithms are derived either using a single step or exponentially weighted eligibility traces. The relationships of these algorithms to the residual gradient (Baird, 1995), TD(0), and TD(λ) algorithms (Sutton, 1988) for discrete cases are also shown.
Next, we formulate methods for improving the policy using the value function: the continuous actor-critic method and a value-gradient-based policy. Specifically, when a model is available for the input gain of the system dynamics, we derive a closed-form feedback policy that is suitable for real-time implementation. Its relationship with "advantage updating" (Baird, 1993) is also discussed. The performance of the proposed methods is first evaluated in nonlinear control tasks of swinging up a pendulum with limited torque (Atkeson, 1994; Doya, 1996), using normalized gaussian basis function networks for representing the value function, the policy, and the model. We test (1) the performance of the discrete actor-critic, continuous actor-critic, and value-gradient-based methods; (2) the performance of the value function update methods; and (3) the effects of the learning parameters, including the action cost, exploration noise, and landscape of the reward function. Then we test the algorithms in a more challenging task, cart-pole swing-up (Doya, 1997), in which the state-space is higher-dimensional and the system input gain is state dependent.
2 The Optimal Value Function for a Discounted Reward Task

In this article, we consider the continuous-time deterministic system

    \dot{x}(t) = f(x(t), u(t)),    (2.1)

where x ∈ X ⊂ R^n is the state and u ∈ U ⊂ R^m is the action (control input). We denote the immediate reward for the state and the action as

    r(t) = r(x(t), u(t)).    (2.2)

Our goal is to find a policy (control law),

    u(t) = \mu(x(t)),    (2.3)

that maximizes the cumulative future rewards,

    V^\mu(x(t)) = \int_t^\infty e^{-(s-t)/\tau} r(x(s), u(s)) \, ds,    (2.4)

for any initial state x(t). Note that x(s) and u(s) (t ≤ s < ∞) follow the system dynamics 2.1 and the policy 2.3. V^μ(x) is called the value function of the state x, and τ is the time constant for discounting future rewards. An important feature of this infinite-horizon formulation is that the value function and the optimal policy do not depend explicitly on time, which is convenient in estimating them using function approximators. The discounted reward makes it unnecessary to assume that the state is attracted to a zero-reward state.

The value function V* for the optimal policy μ* is defined as

    V^*(x(t)) = \max_{u[t,\infty)} \left[ \int_t^\infty e^{-(s-t)/\tau} r(x(s), u(s)) \, ds \right],    (2.5)

where u[t, ∞) denotes the time course u(s) ∈ U for t ≤ s < ∞. According to the principle of optimality, the condition for the optimal value function at time t is given by

    \frac{1}{\tau} V^*(x(t)) = \max_{u(t) \in U} \left[ r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t)) \right],    (2.6)

which is a discounted version of the HJB equation (see appendix A). The optimal policy is given by the action that maximizes the right-hand side of the HJB equation:

    u(t) = \mu^*(x(t)) = \arg\max_{u \in U} \left[ r(x(t), u) + \frac{\partial V^*(x)}{\partial x} f(x(t), u) \right].    (2.7)
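As a quick numerical illustration of the discounted value definition (equation 2.4), the sketch below integrates the discounted reward along a trajectory of a toy system of our own choosing; the system dx/dt = -x with reward r(x) = x is an assumption for illustration, not from the article, and its value has the closed form V(x0) = x0 τ/(1 + τ):

```python
import numpy as np

# Numerical check of the discounted value definition (eq. 2.4). The system
# dx/dt = -x with reward r(x) = x is our own toy choice: along x(s) = x0 e^{-s},
# V(x0) = x0 * tau / (1 + tau) in closed form.
def discounted_value(x0, tau=1.0, dt=1e-3, horizon=20.0):
    t = np.arange(0.0, horizon, dt)
    x = x0 * np.exp(-t)                       # closed-form trajectory of dx/dt = -x
    return np.sum(np.exp(-t / tau) * x) * dt  # Riemann sum of eq. 2.4

v = discounted_value(2.0, tau=1.0)
print(round(v, 2))   # → 1.0, matching x0 * tau / (1 + tau) = 2 * 1/2
```

The truncation horizon is safe here because the integrand decays exponentially in s.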
Reinforcement learning can be formulated as the process of bringing the current policy μ and its value function estimate V closer to the optimal policy μ* and the optimal value function V*. It generally involves two components:

1. Estimate the value function V based on the current policy μ.
2. Improve the policy μ by making it greedy with respect to the current estimate of the value function V.

We will consider the algorithms for these two processes in the following two sections.

3 Learning the Value Function

For the learning of the value function in a continuous state-space, it is mandatory to use some form of function approximator. We denote the current estimate of the value function as

    V^\mu(x(t)) \simeq V(x(t); w),    (3.1)

where w is a parameter of the function approximator, or simply, V(t). In the framework of TD learning, the estimate of the value function is updated using a self-consistency condition that is local in time and space. This is given by differentiating definition 2.4 by t as

    \dot{V}^\mu(x(t)) = \frac{1}{\tau} V^\mu(x(t)) - r(t).    (3.2)

This should hold for any policy, including the optimal policy given by equation 2.7. If the current estimate V of the value function is perfect, it should satisfy the consistency condition \dot{V}(t) = \frac{1}{\tau} V(t) - r(t). If this condition is not satisfied, the prediction should be adjusted to decrease the inconsistency,

    \delta(t) \equiv r(t) - \frac{1}{\tau} V(t) + \dot{V}(t).    (3.3)

This is the continuous-time counterpart of the TD error (Barto, Sutton, & Anderson, 1983; Sutton, 1988).

3.1 Updating the Level and the Slope. In order to bring the TD error 3.3 to zero, we can tune the level of the value function V(t), its time derivative \dot{V}(t), or both, as illustrated in Figures 1A-C. Now we consider the objective function (Baird, 1993),

    E(t) = \frac{1}{2} |\delta(t)|^2.    (3.4)
Figure 1: Possible updates for the value function estimate \hat{V}(t) for an instantaneous TD error \delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t). (A) A positive TD error at t = t_0 can be corrected by (B) an increase in V(t), or (C) a decrease in the time derivative \dot{V}(t), or (D) an exponentially weighted increase in V(t) (t < t_0).
From definition 3.3 and the chain rule \dot{V}(t) = \frac{\partial V}{\partial x} \dot{x}(t), the gradient of the objective function with respect to a parameter w_i is given by

    \frac{\partial E(t)}{\partial w_i} = \delta(t) \frac{\partial}{\partial w_i} \left[ r(t) - \frac{1}{\tau} V(t) + \dot{V}(t) \right]
                                       = \delta(t) \left[ -\frac{1}{\tau} \frac{\partial V(x; w)}{\partial w_i} + \frac{\partial}{\partial w_i} \left( \frac{\partial V(x; w)}{\partial x} \dot{x}(t) \right) \right].

Therefore, the gradient descent algorithm is given by

    \dot{w}_i = -\eta \frac{\partial E}{\partial w_i} = \eta \delta(t) \left[ \frac{1}{\tau} \frac{\partial V(x; w)}{\partial w_i} - \frac{\partial}{\partial w_i} \left( \frac{\partial V(x; w)}{\partial x} \dot{x}(t) \right) \right],    (3.5)

where η is the learning rate. A potential problem with this update algorithm is its symmetry in time. Since the boundary condition for the value function is given at t → ∞, it would be more appropriate to update the past estimates without affecting the future estimates. Below, we consider methods for implementing this backup of TD errors.

3.2 Backward Euler Differentiation: Residual Gradient and TD(0). One way of implementing the backup of TD errors is to use the backward Euler
approximation of the time derivative \dot{V}(t). By substituting \dot{V}(t) = (V(t) - V(t - \Delta t)) / \Delta t into equation 3.3, we have

    \delta(t) = r(t) + \frac{1}{\Delta t} \left[ \left( 1 - \frac{\Delta t}{\tau} \right) V(t) - V(t - \Delta t) \right].    (3.6)

Then the gradient of the squared TD error (see equation 3.4) with respect to the parameter w_i is given by

    \frac{\partial E(t)}{\partial w_i} = \delta(t) \frac{1}{\Delta t} \left[ \left( 1 - \frac{\Delta t}{\tau} \right) \frac{\partial V(x(t); w)}{\partial w_i} - \frac{\partial V(x(t - \Delta t); w)}{\partial w_i} \right].

A straightforward gradient descent algorithm is given by

    \dot{w}_i = \eta \delta(t) \left[ -\left( 1 - \frac{\Delta t}{\tau} \right) \frac{\partial V(x(t); w)}{\partial w_i} + \frac{\partial V(x(t - \Delta t); w)}{\partial w_i} \right].    (3.7)

An alternative way is to update only V(t - \Delta t) without explicitly changing V(t) by

    \dot{w}_i = \eta \delta(t) \frac{\partial V(x(t - \Delta t); w)}{\partial w_i}.    (3.8)

The Euler-discretized TD error, equation 3.6, coincides with the conventional TD error,

    \delta_t = r_t + \gamma V_t - V_{t-1},

by taking the discount factor \gamma = 1 - \Delta t / \tau \simeq e^{-\Delta t / \tau} and rescaling the values as V_t = \frac{1}{\Delta t} V(t). The update schemes, equations 3.7 and 3.8, correspond to the residual-gradient (Baird, 1995; Harmon, Baird, & Klopf, 1996) and TD(0) algorithms, respectively. Note that the time step \Delta t of the Euler differentiation does not have to be equal to the control cycle of the physical system.

3.3 Exponential Eligibility Trace: TD(λ). Now let us consider how an instantaneous TD error should be corrected by a change in the value V as a function of time. Suppose an impulse of reward is given at time t = t_0. Then, from definition 2.4, the corresponding temporal profile of the value function is

    V^\mu(t) = \begin{cases} e^{-(t_0 - t)/\tau} & t \le t_0, \\ 0 & t > t_0. \end{cases}

Because the value function is linear with respect to the reward, the desired correction of the value function for an instantaneous TD error \delta(t_0) is

    \hat{V}(t) = \begin{cases} \delta(t_0) \, e^{-(t_0 - t)/\tau} & t \le t_0, \\ 0 & t > t_0, \end{cases}
as illustrated in Figure 1D. Therefore, the update of w_i given \delta(t_0) should be made as

    \dot{w}_i = \eta \int_{-\infty}^{t_0} \hat{V}(t) \frac{\partial V(x(t); w)}{\partial w_i} \, dt = \eta \delta(t_0) \int_{-\infty}^{t_0} e^{-(t_0 - t)/\tau} \frac{\partial V(x(t); w)}{\partial w_i} \, dt.    (3.9)
We can consider the exponentially weighted integral of the derivatives as the eligibility trace e_i for the parameter w_i. Then a class of learning algorithms is derived as

    \dot{w}_i = \eta \delta(t) e_i(t), \qquad \dot{e}_i(t) = -\frac{1}{\kappa} e_i(t) + \frac{\partial V(x(t); w)}{\partial w_i},    (3.10)

where 0 < κ ≤ τ is the time constant of the eligibility trace. If we discretize equation 3.10 with time step Δt, it coincides with the eligibility trace update in TD(λ) (see, e.g., Sutton & Barto, 1998),

    e_i(t + \Delta t) = \lambda \gamma \, e_i(t) + \frac{\partial V_t}{\partial w_i},

with \lambda = \frac{1 - \Delta t / \kappa}{1 - \Delta t / \tau}.
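Discretized with step Δt as above, the pair of updates in equation 3.10 can be sketched in a few lines. The one-state, constant-reward problem, the single constant basis function, and all parameter values below are our own toy choices for illustration; with r = 1 and τ = 1 the estimate should approach the true discounted value τr = 1:

```python
# Sketch of continuous TD(lambda) with an exponential eligibility trace
# (eqs. 3.3 and 3.10), discretized with step dt: e <- (1 - dt/kappa) e + dV/dw.
# Toy setting (our own choice): constant reward r = 1 and a single constant
# basis function, so V(x; w) = w and the true value is tau * r = 1.
tau, kappa, dt, eta = 1.0, 0.5, 0.02, 0.01
r = 1.0
w = 0.0        # value estimate V(t) = w
e = 0.0        # eligibility trace for w
V_prev = 0.0
for _ in range(20000):
    V = w
    V_dot = (V - V_prev) / dt          # backward Euler estimate of dV/dt
    delta = r - V / tau + V_dot        # continuous TD error (eq. 3.3)
    e = (1.0 - dt / kappa) * e + 1.0   # dV/dw = 1 for the constant basis
    w += eta * delta * e * dt          # w_dot = eta * delta * e (eq. 3.10)
    V_prev = V
print(round(w, 2))   # → 1.0, the true discounted value tau * r
```

The learning rate is kept small so that the \dot{V} feedback through the backward Euler estimate stays stable.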
4 Improving the Policy

Now we consider ways of improving the policy u(t) = μ(x(t)) using its associated value function V(x). One way is to improve the policy stochastically using the actor-critic method, in which the TD error is used as the effective reinforcement signal. Another way is to take a greedy policy with respect to the current value function,

    u(t) = \mu(x(t)) = \arg\max_{u \in U} \left[ r(x(t), u) + \frac{\partial V(x)}{\partial x} f(x(t), u) \right],    (4.1)

using knowledge about the reward and the system dynamics.

4.1 Continuous Actor-Critic. First, we derive a continuous version of the actor-critic method (Barto et al., 1983). By comparing equations 3.3 and 4.1, we can see that the TD error is maximized by the greedy action u(t). Accordingly, in the actor-critic method, the TD error is used as the reinforcement signal for policy improvement. We consider the policy implemented by the actor as

    u(t) = s\left( A(x(t); w^A) + \sigma n(t) \right),    (4.2)
where A(x(t); w^A) ∈ R^m is a function approximator with a parameter vector w^A, n(t) ∈ R^m is noise, and s() is a monotonically increasing output function. The parameters are updated by the stochastic real-valued (SRV) unit algorithm (Gullapalli, 1990) as

    \dot{w}_i^A = \eta^A \delta(t) \, n(t) \, \frac{\partial A(x(t); w^A)}{\partial w_i^A}.    (4.3)
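The SRV update of equation 4.3 correlates the TD error with the exploration noise. A minimal single-state sketch follows; the one-state reward r(u) = -(u - 2)^2, the linear (non-saturating) output, and all parameter values are our own toy choices for illustration, not the article's tasks:

```python
import numpy as np

# Minimal single-state sketch of the continuous actor-critic (eqs. 4.2, 4.3):
# the TD error, used as the reinforcement signal, is correlated with the
# exploration noise n(t) to update the actor. The one-state reward
# r(u) = -(u - 2)^2 is a toy choice; the optimal action is u = 2.
rng = np.random.default_rng(0)
tau, dt, eta, eta_A, sigma = 1.0, 0.02, 0.5, 0.5, 0.5
w_V, w_A = 0.0, 0.0                   # critic value and actor output (one state)
for _ in range(50000):
    n = rng.standard_normal()
    u = w_A + sigma * n               # stochastic policy (eq. 4.2, linear output)
    r = -(u - 2.0) ** 2
    delta = r - w_V / tau             # TD error, eq. 3.3 with dV/dt = 0
    w_V += eta * delta * dt           # critic update (dV/dw_V = 1)
    w_A += eta_A * delta * n * dt     # SRV actor update (eq. 4.3, dA/dw_A = 1)
# w_A settles near the optimal action u = 2; w_V near tau * E[r].
```

Because higher-than-predicted reward (positive δ) reinforces the noise direction that produced it, the actor climbs the reward in expectation.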
4.2 Value-Gradient-Based Policy. In discrete problems, a greedy policy can be found by a one-ply search for an action that maximizes the sum of the immediate reward and the value of the next state. In the continuous case, the right-hand side of equation 4.1 has to be maximized over a continuous set of actions at every instant, which in general can be computationally expensive. However, when the reward r(x, u) is concave with respect to the action u and the system dynamics f(x, u) is linear with respect to the action u, the optimization problem in equation 4.1 has a unique solution, and we can derive a closed-form expression of the greedy policy.

Here, we assume that the reward r(x, u) can be separated into two parts: the reward for the state, R(x), which is given by the environment and unknown, and the cost for the action, S(u), which can be chosen as a part of the learning strategy. We specifically consider the case

    r(x, u) = R(x) - \sum_{j=1}^{m} S_j(u_j),    (4.4)

where S_j() is a cost function for the action variable u_j. In this case, the condition for the greedy action, equation 4.1, is given by

    -S_j'(u_j) + \frac{\partial V(x)}{\partial x} \frac{\partial f(x, u)}{\partial u_j} = 0 \qquad (j = 1, \ldots, m),

where \frac{\partial f(x,u)}{\partial u_j} is the jth column vector of the n × m input gain matrix \frac{\partial f(x,u)}{\partial u} of the system dynamics. We now assume that the input gain \frac{\partial f(x,u)}{\partial u} does not depend on u, that is, the system is linear with respect to the input, and that the action cost function S_j() is convex. Then the above equation has a unique solution,

    u_j = S_j'^{-1}\left( \frac{\partial V(x)}{\partial x} \frac{\partial f(x, u)}{\partial u_j} \right),

where S_j'() is a monotonic function. Accordingly, the greedy policy is represented in vector notation as

    u = S'^{-1}\left( \frac{\partial f(x, u)}{\partial u}^T \frac{\partial V(x)}{\partial x}^T \right),    (4.5)
where \frac{\partial V(x)}{\partial x}^T represents the steepest ascent direction of the value function, which is then transformed by the "transpose" model \frac{\partial f(x,u)}{\partial u}^T into a direction in the action space, and the actual amplitude of the action is determined by the gain function S'^{-1}().

Note that the gradient \frac{\partial V(x)}{\partial x} can be calculated by backpropagation when the value function is represented by a multilayer network. The assumption of linearity with respect to the input is valid in most Newtonian mechanical systems (e.g., the acceleration is proportional to the force), and the gain matrix \frac{\partial f(x,u)}{\partial u} can be calculated from the inertia matrix. When the dynamics is linear and the reward is quadratic, the value function is also quadratic, and equation 4.5 coincides with the optimal feedback law for a linear quadratic regulator (LQR; see, e.g., Bertsekas, 1995).

4.2.1 Feedback Control with a Sigmoid Output Function. A common constraint in control tasks is that the amplitude of the action, such as the force or torque, is bounded. Such a constraint can be incorporated into the above policy with an appropriate choice of the action cost. Suppose that the amplitude of the action is limited as |u_j| ≤ u_j^max (j = 1, ..., m). We then define the action cost as

    S_j(u_j) = c_j \int_0^{u_j} s^{-1}\left( \frac{u}{u_j^{max}} \right) du,    (4.6)

where s() is a sigmoid function that saturates as s(±∞) = ±1. In this case, the greedy feedback policy, equation 4.5, results in feedback control with a sigmoid output function,

    u_j = u_j^{max} \, s\left( \frac{1}{c_j} \frac{\partial f(x, u)}{\partial u_j}^T \frac{\partial V(x)}{\partial x}^T \right).    (4.7)

In the limit of c_j → 0, the policy becomes a "bang-bang" control law,

    u_j = u_j^{max} \, \mathrm{sign}\left[ \frac{\partial f(x, u)}{\partial u_j}^T \frac{\partial V(x)}{\partial x}^T \right].    (4.8)
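The feedback law of equation 4.7 can be sketched in a few lines. The scalar integrator dynamics dx/dt = u and the fixed quadratic value function V(x) = -x^2 below are illustrative assumptions of our own (in the article, the value function is learned):

```python
import numpy as np

# Sketch of the value-gradient-based feedback policy (eq. 4.7) for a toy
# scalar integrator dx/dt = u with an assumed value function V(x) = -x^2
# (our own choice, peaked at the target x = 0). Here df/du = 1.
def sigmoid(z):
    # s(z) = (2/pi) * arctan((pi/2) z), the saturating output used in the paper
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * z)

def greedy_action(x, u_max=5.0, c=0.1):
    dV_dx = -2.0 * x          # gradient of the assumed value function
    gain = 1.0                # df/du for dx/dt = u
    return u_max * sigmoid((1.0 / c) * gain * dV_dx)

# The policy saturates far from the value peak and pushes the state toward it.
x = 2.0
for _ in range(2000):
    x += greedy_action(x) * 0.01      # Euler step of dx/dt = u
print(round(x, 3))   # → 0.0
```

With a small action-cost coefficient c the controller behaves almost like the bang-bang law of equation 4.8 away from the target, while remaining smooth near it.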
4.3 Advantage Updating. When the model of the dynamics is not available, as in Q-learning (Watkins, 1989), we can select a greedy action by directly learning the term to be maximized in the HJB equation,

    r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t)).

This idea has been implemented in the advantage updating method (Baird, 1993; Harmon et al., 1996), in which both the value function V(x) and the
advantage function A(x, u) are updated. The optimal advantage function A*(x, u) is represented in the current HJB formulation as

    A^*(x, u) = r(x, u) - \frac{1}{\tau} V^*(x) + \frac{\partial V^*(x)}{\partial x} f(x, u),    (4.9)

which takes the maximum value of zero for the optimal action u. The advantage function A(x, u) is updated by

    A(x, u) \leftarrow \max_u [A(x, u)] + r(x, u) - \frac{1}{\tau} V(x) + \dot{V}(x) = \max_u [A(x, u)] + \delta(t)    (4.10)
under the constraint \max_u [A(x, u)] = 0.

The main difference between advantage updating and the value-gradient-based policy described above is that while the value V and the advantage A are updated in the former, the value V and the model f are updated and their derivatives are used in the latter. When the input gain model \frac{\partial f(x,u)}{\partial u} is known or easy to learn, the use of the closed-form policy, equation 4.5, in the latter approach is advantageous because it simplifies the process of maximizing the right-hand side of the HJB equation.

5 Simulations

We tested the performance of the continuous RL algorithms in two nonlinear control tasks: a pendulum swing-up task (n = 2, m = 1) and a cart-pole swing-up task (n = 4, m = 1). In each of these tasks, we compared the performance of three control schemes:

1. Actor-critic: control by equation 4.2 and learning by equation 4.3.
2. Value-gradient-based policy, equation 4.7, with an exact gain matrix.
3. Value-gradient-based policy, equation 4.7, with concurrent learning of the input gain matrix.

The value functions were updated using the exponential eligibility trace (see equation 3.10) except in the experiments of Figure 6. Both the value and policy functions were implemented by normalized gaussian networks, as described in appendix B. A sigmoid output function s(x) = \frac{2}{\pi} \arctan\left( \frac{\pi}{2} x \right) (Hopfield, 1984) was used in equations 4.2 and 4.7.

In order to promote exploration, we incorporated a noise term σ n(t) in both policies, equations 4.2 and 4.7 (see equations B.2 and B.3 in appendix B). We used low-pass filtered noise \tau_n \dot{n}(t) = -n(t) + N(t), where N(t) denotes normal gaussian noise. The size of the perturbation σ was tapered off as the performance improved (Gullapalli, 1990). We took the modulation scheme
Figure 2: Control of a pendulum with limited torque. The dynamics were given by \dot{\theta} = \omega and m l^2 \dot{\omega} = -\mu \omega + m g l \sin\theta + u. The physical parameters were m = l = 1, g = 9.8, \mu = 0.01, and u^{max} = 5.0. The learning parameters were \tau = 1.0, \kappa = 0.1, c = 0.1, \tau_n = 1.0, \sigma_0 = 0.5, V_0 = 0, V_1 = 1, \eta = 1.0, \eta^A = 5.0, and \eta^M = 10.0 in the following simulations unless otherwise specified.
\sigma = \sigma_0 \min\left[ 1, \max\left[ 0, \frac{V_1 - V(t)}{V_1 - V_0} \right] \right], where V_0 and V_1 are the minimal and maximal levels of the expected reward. The physical systems were simulated by a fourth-order Runge-Kutta method, and the learning dynamics was simulated by the Euler method, both with a time step of 0.02 second.

5.1 Pendulum Swing-Up with Limited Torque. First, we tested the continuous-time RL algorithms in the task of swinging up a pendulum with limited torque (see Figure 2) (Atkeson, 1994; Doya, 1996). The control of this one-degree-of-freedom system is nontrivial if the maximal output torque u^{max} is smaller than the maximal load torque mgl. The controller has to swing the pendulum several times to build up momentum and also has to decelerate the pendulum early enough to prevent it from falling over.

The reward was given by the height of the tip of the pendulum, R(x) = \cos\theta. The policy and value functions were implemented by normalized gaussian networks with 15 × 15 basis functions to cover the two-dimensional state-space x = (\theta, \omega). In modeling the system dynamics, 15 × 15 × 2 bases were used for the state-action space (\theta, \omega, u). Each trial was started from an initial state x(0) = (\theta(0), 0), where \theta(0) was selected randomly in [-\pi, \pi]. A trial lasted for 20 seconds unless the pendulum was over-rotated (|\theta| > 5\pi). Upon such a failure, the trial was terminated with a reward r(t) = -1 for 1 second.

As a measure of the swing-up performance, we defined the time in which the pendulum stayed up (|\theta| < \pi/4) as t_up. A trial was regarded as "successful" when t_up > 10 seconds. We used the number of trials made before achieving 10 successful trials as the measure of the learning speed.
Figure 3: Landscape of the value function V(\theta, \omega) for the pendulum swing-up task. The white line shows an example of a swing-up trajectory. The state-space was a cylinder with \theta = \pm\pi connected. The 15 × 15 centers of the normalized gaussian basis functions were located on a uniform grid covering the area [-\pi, \pi] \times [-5\pi/4, 5\pi/4].
Figure 3 illustrates the landscape of the value function and a typical swing-up trajectory under the value-gradient-based policy. The trajectory starts from the bottom of the basin, which corresponds to the pendulum hanging downward, and spirals up the hill along the ridge of the value function until it reaches the peak, which corresponds to the pendulum standing upward.

5.1.1 Actor-Critic, Value Gradient, and Physical Model. We first compared the performance of the three continuous RL algorithms with the discrete actor-critic algorithm (Barto et al., 1983). Figure 4 shows the time course of learning in five simulation runs, and Figure 5 shows the average number of trials needed until 10 successful swing-ups. The discrete actor-critic algorithm took about five times more trials than the continuous actor-critic. Note that the continuous algorithms were simulated with the same time step as the discrete algorithm. Consequently, the performance difference was due to better spatial generalization by the normalized gaussian networks. Whereas the continuous algorithms performed well with 15 × 15 basis functions, the discrete algorithm did not achieve the task using 15 × 15, 20 × 20, or 25 × 25 grids. The result shown here was obtained with a 30 × 30 grid discretization of the state.
Figure 4: Comparison of the time course of learning with different control schemes: (A) discrete actor-critic, (B) continuous actor-critic, (C) value-gradient-based policy with an exact model, (D) value-gradient-based policy with a learned model (note the different scales). t_up: time in which the pendulum stayed up. In the discrete actor-critic, the state-space was evenly discretized into 30 × 30 boxes and the action was binary (u = ±u^max). The learning parameters were \gamma = 0.98, \lambda = 0.8, \eta = 1.0, and \eta^A = 0.1.

Figure 5: Comparison of learning speeds with the discrete and continuous actor-critic and the value-gradient-based policies with exact and learned physical models. The ordinate is the number of trials made until 10 successful swing-ups.
Among the continuous algorithms, the learning was fastest with the value-gradient-based policy using an exact input gain. Concurrent learning of the input gain model resulted in slower learning. The actor-critic was the slowest. This was due to more effective exploitation of the value function by the gradient-based policy (see equation 4.7) compared to the stochastically improved policy (see equation 4.2) in the actor-critic.

5.1.2 Methods of Value Function Update. Next, we compared the value function update algorithms (see equations 3.7, 3.8, and 3.10) using the greedy policy with an exact gain model (see Figure 6). Although the three algorithms attained comparable performance with the optimal settings for \Delta t and \kappa, the method with the exponential eligibility trace performed well over the widest range of the time constant and the learning rate. We also tested the purely symmetric update method (see equation 3.5), but its performance was very unstable even when the learning rates for the value and its gradient were tuned carefully.

5.1.3 Action Cost, Graded Reward, and Exploration. We then tested how the performance depended on the action cost c, the shape of the reward function R(x), and the size of the exploration noise \sigma_0. Figure 7A compares the performance with different action costs c = 0, 0.01, 0.1, and 1.0. The learning was slower with a large cost for the torque (c = 1) because of the weak output in the early stage. The results with bang-bang control (c = 0) tended to be less consistent than those with sigmoid control with small costs (c = 0.01, 0.1).

Figure 7B summarizes the effects of the reward function and exploration noise. When the binary reward function

    R(x) = \begin{cases} 1 & |\theta| < \pi/4, \\ 0 & \text{otherwise} \end{cases}

was used instead of \cos\theta, the task was more difficult to learn. However, a better performance was observed with the use of the negative binary reward function

    R(x) = \begin{cases} 0 & |\theta| < \pi/4, \\ -1 & \text{otherwise.} \end{cases}

The difference was more drastic with a fixed initial state x(0) = (\pi, 0) and no noise (\sigma = 0), for which no success was achieved with the positive binary reward. The better performance with the negative reward was due to the initialization of the value function as V(x) = 0. As the value function near \theta = \pi is learned as V(x) \simeq -1, the value-gradient-based policy drives the state to unexplored areas, which are assigned higher values V(x) \simeq 0 by default.
Figure 6: Comparison of different value function update methods with different settings of the time constants. (A) Residual gradient (see equation 3.7). (B) Single-step eligibility trace (see equation 3.8). (C) Exponential eligibility trace (see equation 3.10). The learning rate \eta was roughly optimized for each method and each setting of the time step \Delta t of the Euler approximation or the time constant \kappa of the eligibility trace. The performance was very unstable with \Delta t = 1.0 in the discretization-based methods.
Figure 7: Effects of the parameters of the policy. (A) Control-cost coefficient c. (B) Reward function R(x) and perturbation size \sigma_0.
5.2 Cart-Pole Swing-Up Task. Next, we tested the learning schemes in the more challenging task of cart-pole swing-up (see Figure 8), which is a strongly nonlinear extension of the common cart-pole balancing task (Barto et al., 1983). The physical parameters of the cart pole were the same as in Barto et al. (1983), but the pole had to be swung up from an arbitrary angle and balanced. Marked differences from the previous task were that the dimension of the state-space was higher and the input gain was state dependent. The state vector was x = (x, v, \theta, \omega), where x and v are the position and the velocity of the cart. The value and policy functions were implemented
Figure 8: Examples of cart-pole swing-up trajectories. The arrows indicate the initial position of the pole. (A) A typical swing-up from the bottom position. (B) When a small perturbation v > 0 is given, the cart moves to the right and keeps the pole upright. (C) When a larger perturbation is given, the cart initially tries to keep the pole upright, but then brakes to avoid collision with the end of the track and swings the pole up on the left side. The learning parameters were \tau = 1.0, \kappa = 0.5, c = 0.01, \tau_n = 0.5, \sigma_0 = 0.5, V_0 = 0, V_1 = 1, \eta = 5.0, \eta^A = 10.0, and \eta^M = 10.0.
by normalized gaussian networks with 7 × 7 × 15 × 15 bases. A 2 × 2 × 4 × 2 × 2 basis network was used for modeling the system dynamics. The reward was given by R(x) = (\cos\theta - 1)/2. When the cart bumped into the end of the track or when the pole over-rotated (|\theta| > 5\pi), a terminal reward r(t) = -1 was given for 0.5 second. Otherwise, a trial lasted for 20 seconds.

Figure 8 illustrates the control performance after 1000 learning trials with the greedy policy using the value gradient and a learned input gain model. Figure 9A shows the value function in the 4D state-space x = (x, v, \theta, \omega). Each of the 3 × 3 squares represents a subspace (\theta, \omega) with different values of (x, v). We can see the ridge of the value function, which is similar to the one seen in the pendulum swing-up task (see Figure 3). Also note the lower values when x and v are both positive or both negative, which signal the danger of bumping into the end of the track.

Figure 9B shows the most critical component of the input gain vector, \partial\dot{\omega}/\partial u. This gain represents how the force applied to the cart is transformed into the angular acceleration of the pole. The gain model successfully captured the change of sign between the upward (|\theta| < \pi/2) and the downward (|\theta| > \pi/2) orientation of the pole.

Figure 10 compares the number of trials necessary for 100 successful swing-ups. The value-gradient-based greedy policies performed about three times faster than the actor-critic. The performances with the exact and learned input gains were comparable in this case. This was because the learning of the physical model was relatively easy compared to the learning of the value function.

6 Discussion

The results of the simulations can be summarized as follows:

1. The swing-up task was accomplished by the continuous actor-critic in a number of trials several times fewer than by the conventional discrete actor-critic (see Figures 4 and 5).

2. Among the continuous methods, the value-gradient-based policy with a known or learned dynamic model performed significantly better than the actor-critic (see Figures 4, 5, and 10).

3. The value function update methods using exponential eligibility traces were more efficient and stable than the methods based on Euler approximation (see Figure 6).

4. Reward-related parameters, such as the landscape and the baseline level of the reward function, greatly affect the speed of learning (see Figure 7).

5. The value-gradient-based method worked well even when the input gain was state dependent (see Figures 9 and 10).
Figure 9: (A) Landscape of the value function for the cart-pole swing-up task. (B) Learned gain model \partial\dot{\omega}/\partial u.
Figure 10: Comparison of the number of trials until 100 successful swing-ups with the actor-critic and the value-gradient-based policies with exact and learned physical models.
Among the three major RL methods (the actor-critic, Q-learning, and model-based look-ahead), only Q-learning had previously been extended to the continuous-time case, as advantage updating (Baird, 1993). This article presents continuous-time counterparts of all three methods based on the HJB equation, 2.6, and therefore provides a more complete repertoire of continuous RL methods.

A major contribution of this article is the derivation of the closed-form policy (see equation 4.5) using the value gradient and the dynamic model. One critical issue in advantage updating is the need for finding the maximum of the advantage function on every control cycle, which can be computationally expensive except in special cases like linear quadratic problems (Harmon et al., 1996). As illustrated by the simulations, the value-gradient-based policy (see equation 4.5) can be applied to a broad class of physical control problems using a priori or learned models of the system dynamics.

The usefulness of value gradients in RL was considered by Werbos (1990) for discrete-time cases. The use of value gradients was also proposed by Dayan and Singh (1996), with the motivation of eliminating the need for updating the value function in advantage updating. Their method, however, in which the value gradients \partial V / \partial x are updated without updating the value function V(x) itself, is applicable only to nondiscounted problems.

When the system or the policy is stochastic, HJB equation 2.6 will include second-order partial derivatives of the value function,

    \frac{1}{\tau} V^*(x(t)) = \max_{u(t) \in U} \left[ r(x(t), u(t)) + \frac{\partial V^*(x)}{\partial x} f(x(t), u(t)) + \mathrm{tr}\left( \frac{\partial^2 V^*(x)}{\partial x^2} C \right) \right],    (6.1)
where C is the covariance matrix of the system noise (see, e.g., Fleming & Soner, 1993). In our simulations, the methods based on the deterministic HJB equation 2.6 worked well, although we incorporated noise terms in the policies to promote exploration. One reason is that the noise was small enough that the contribution of the second-order term was minor. Another reason could be that the second-order term has a smoothing effect on the value function, and this was implicitly achieved by the use of the smooth function approximator. This point needs further investigation.

The convergence properties of HJB-based RL algorithms were recently shown for deterministic (Munos, 1997) and stochastic (Munos & Bourgine, 1998) cases using a grid-based discretization of space and time. However, the convergence properties of continuous RL algorithms combined with function approximators remain to be studied. When a continuous RL algorithm is numerically implemented with a finite time step, as shown in sections 3.2 and 3.3, it becomes equivalent to a discrete-time TD algorithm, for which some convergence properties have been shown with the use of function approximators (Gordon, 1995; Tsitsiklis & Van Roy, 1997). For example, the convergence of TD algorithms has been shown with the use of a linear function approximator and on-line sampling (Tsitsiklis & Van Roy, 1997), which was the case in our simulations. However, this result considers value function approximation for a given policy and does not guarantee the convergence of the entire RL process to a satisfactory solution. For example, in our swing-up tasks, the learning sometimes got stuck in a locally optimal solution of endless rotation of the pendulum when a penalty for over-rotation was not given. The use of fixed smooth basis functions also has a limitation in that steep cliffs in the value or policy functions cannot be achieved.
Despite some negative didactic examples (Tsitsiklis & Van Roy, 1997), methods that dynamically allocate or reshape basis functions have been successfully used with continuous RL algorithms, for example, in a swing-up task (Schaal, 1997) and in a stand-up task for a three-link robot (Morimoto & Doya, 1998). Elucidation of the conditions under which the proposed continuous RL algorithms work successfully—for example, the properties of the function approximators and the methods for exploration—remains the subject of future empirical and theoretical studies.

Appendix A: HJB Equation for Discounted Reward

According to the optimality principle, we divide the integral in equation 2.5 into two parts, $[t, t+\Delta t]$ and $[t+\Delta t, \infty)$, and then solve a short-term optimization problem,
$$V^*(x(t)) = \max_{u[t,\,t+\Delta t]} \left[ \int_t^{t+\Delta t} e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds \;+\; e^{-\frac{\Delta t}{\tau}}\, V^*(x(t+\Delta t)) \right]. \tag{A.1}$$
For a small $\Delta t$, the first term is approximated as $r(x(t), u(t))\,\Delta t + o(\Delta t)$, and the second term is Taylor expanded as
$$V^*(x(t+\Delta t)) = V^*(x(t)) + \frac{\partial V^*}{\partial x}\, f(x(t), u(t))\,\Delta t + o(\Delta t).$$
By substituting these into equation A.1 and collecting $V^*(x(t))$ on the left-hand side, we have an optimality condition for $[t, t+\Delta t]$ as
$$\left(1 - e^{-\frac{\Delta t}{\tau}}\right) V^*(x(t)) = \max_{u[t,\,t+\Delta t]} \left[ r(x(t), u(t))\,\Delta t + e^{-\frac{\Delta t}{\tau}}\, \frac{\partial V^*}{\partial x}\, f(x(t), u(t))\,\Delta t + o(\Delta t) \right]. \tag{A.2}$$
By dividing both sides by $\Delta t$ and taking $\Delta t$ to zero, we have the condition for the optimal value function,
$$\frac{1}{\tau}\, V^*(x(t)) = \max_{u(t) \in U} \left[ r(x(t), u(t)) + \frac{\partial V^*}{\partial x}\, f(x(t), u(t)) \right]. \tag{A.3}$$
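As a quick numerical sanity check on the passage from equation A.2 to A.3 (our own illustration, not part of the original derivation): after dividing by Δt, the coefficient (1 − e^(−Δt/τ))/Δt tends to 1/τ and the discount factor e^(−Δt/τ) tends to 1 as Δt → 0.

```python
import math

tau = 0.5  # hypothetical time constant
for dt in (0.1, 0.01, 0.001):
    coeff = (1 - math.exp(-dt / tau)) / dt  # approaches 1/tau = 2.0
    decay = math.exp(-dt / tau)             # approaches 1.0
    print(f"dt={dt}: coeff={coeff:.4f}, decay={decay:.4f}")
```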
Appendix B: Normalized Gaussian Network

A value function is represented by
$$V(x; w) = \sum_{k=1}^{K} w_k b_k(x), \tag{B.1}$$
where
$$b_k(x) = \frac{a_k(x)}{\sum_{l=1}^{K} a_l(x)}, \qquad a_k(x) = e^{-\left\| s_k^T (x - c_k) \right\|^2}.$$
The vectors $c_k$ and $s_k$ define the center and the size of the $k$th basis function. Note that the basis functions located on the ends of the grids are extended like sigmoid functions by the effect of normalization. In the current simulations, the centers are fixed in a grid, which is analogous to the “boxes” approach (Barto et al., 1983) often used in discrete RL. Grid allocation of the basis functions enables efficient calculation of their activation as the outer product of the activation vectors for individual input variables.

In the actor-critic method, the policy is implemented as
$$u(t) = u^{\max}\, s\!\left( \sum_k w_k^A b_k(x(t)) + \sigma n(t) \right), \tag{B.2}$$
where $s$ is a component-wise sigmoid function and $n(t)$ is the noise. In the value-gradient-based methods, the policy is given by
$$u(t) = u^{\max}\, s\!\left( \frac{1}{c}\, \frac{\partial f(x, u)}{\partial u}^{T} \sum_k w_k \frac{\partial b_k(x)}{\partial x}^{T} + \sigma n(t) \right). \tag{B.3}$$

To implement the input gain model, a network is trained to predict the time derivative of the state from $x$ and $u$,
$$\dot{x}(t) \simeq \hat{f}(x, u) = \sum_k w_k^M b_k(x(t), u(t)). \tag{B.4}$$
The weights are updated by
$$\dot{w}_k^M(t) = \eta \left( \dot{x}(t) - \hat{f}(x(t), u(t)) \right) b_k(x(t), u(t)), \tag{B.5}$$
and the input gain of the system dynamics is given by
$$\frac{\partial f(x, u)}{\partial u} \simeq \sum_k w_k^M \left. \frac{\partial b_k(x, u)}{\partial u} \right|_{u=0}. \tag{B.6}$$
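The normalized Gaussian network of equations B.1 and B.2 can be sketched as follows. This is a minimal illustration under our own assumptions: a hypothetical one-dimensional grid of K = 5 centers, and the sigmoid s(·) taken to be tanh.

```python
import numpy as np

def basis(x, centers, s):
    """Normalized Gaussian activations b_k(x) of equation B.1.
    a_k(x) = exp(-||s_k^T (x - c_k)||^2), normalized over the K bases."""
    a = np.exp(-np.sum((s * (x - centers)) ** 2, axis=1))
    return a / a.sum()

def value(x, w, centers, s):
    """V(x; w) = sum_k w_k b_k(x), equation B.1."""
    return w @ basis(x, centers, s)

def actor_output(x, wA, centers, s, u_max=1.0, sigma=0.1, noise=0.0):
    """Actor-critic policy of equation B.2, with tanh as the sigmoid."""
    return u_max * np.tanh(wA @ basis(x, centers, s) + sigma * noise)

# Hypothetical 1-D grid of centers on [-1, 1] with uniform inverse widths
centers = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
s = np.full_like(centers, 2.0)
x = np.array([0.3])
```

Because the activations are normalized to sum to one, a constant weight vector yields a constant value function, and the edge bases extend like sigmoids, as noted in the text.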
Acknowledgments

I thank Mitsuo Kawato, Stefan Schaal, Chris Atkeson, and Jun Morimoto for their helpful discussions.

References

Asada, M., Noda, S., & Hosoda, K. (1996). Action-based sensor space categorization for robot learning. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 1502–1509).
Atkeson, C. G. (1994). Using local trajectory optimizers to speed up global optimization in dynamic programming. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 663–670). San Mateo, CA: Morgan Kaufmann.
Baird, L. C. (1993). Advantage updating (Tech. Rep. No. WL-TR-93-1146). Wright Laboratory, Wright-Patterson Air Force Base, OH.
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In A. Prieditis & S. Russell (Eds.), Machine learning: Proceedings of the Twelfth International Conference. San Mateo, CA: Morgan Kaufmann.
Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834–846.
Bertsekas, D. P. (1995). Dynamic programming and optimal control. Cambridge, MA: MIT Press.
Bradtke, S. J. (1993). Reinforcement learning applied to linear quadratic regulation. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 295–302). San Mateo, CA: Morgan Kaufmann.
Bradtke, S. J., & Duff, M. O. (1995). Reinforcement learning methods for continuous-time Markov decision problems. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 393–400). Cambridge, MA: MIT Press.
Crites, R. H., & Barto, A. G. (1996). Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1017–1023). Cambridge, MA: MIT Press.
Dayan, P., & Singh, S. P. (1996). Improving policies without measuring merits. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1059–1065). Cambridge, MA: MIT Press.
Doya, K. (1996). Temporal difference learning in continuous time and space. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1073–1079). Cambridge, MA: MIT Press.
Doya, K. (1997). Efficient nonlinear control with actor-tutor architecture. In M. C. Mozer, M. I. Jordan, & T. P. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 1012–1018). Cambridge, MA: MIT Press.
Fleming, W. H., & Soner, H. M. (1993). Controlled Markov processes and viscosity solutions. New York: Springer-Verlag.
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In A. Prieditis & S. Russell (Eds.), Machine learning: Proceedings of the Twelfth International Conference. San Mateo, CA: Morgan Kaufmann.
Gordon, G. J. (1996). Stable fitted reinforcement learning. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1052–1058). Cambridge, MA: MIT Press.
Gullapalli, V. (1990). A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks, 3, 671–692.
Harmon, M. E., Baird, L. C., III, & Klopf, A. H. (1996). Reinforcement learning applied to a differential game. Adaptive Behavior, 4, 3–28.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81, 3088–3092.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Mataric, M. J. (1994). Reward functions for accelerated learning. In W. W. Cohen & H. Hirsh (Eds.), Proceedings of the 11th International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.
Moore, A. W. (1994). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 711–718). San Mateo, CA: Morgan Kaufmann.
Morimoto, J., & Doya, K. (1998). Reinforcement learning of dynamic motor sequence: Learning to stand up. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems, 3, 1721–1726.
Munos, R. (1997). A convergent reinforcement learning algorithm in the continuous case based on a finite difference method. In Proceedings of International Joint Conference on Artificial Intelligence (pp. 826–831).
Munos, R., & Bourgine, P. (1998). Reinforcement learning for continuous stochastic control problems. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 1029–1035). Cambridge, MA: MIT Press.
Pareigis, S. (1998). Adaptive choice of grid and time in reinforcement learning. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 1036–1042). Cambridge, MA: MIT Press.
Peterson, J. K. (1993). On-line estimation of the optimal value function: HJB-estimators. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems, 5 (pp. 319–326). San Mateo, CA: Morgan Kaufmann.
Schaal, S. (1997). Learning from demonstration. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 1040–1046). Cambridge, MA: MIT Press.
Singh, S., & Bertsekas, D. (1997). Reinforcement learning for dynamic channel allocation in cellular telephone systems. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 974–980). Cambridge, MA: MIT Press.
Singh, S. P., Jaakkola, T., & Jordan, M. I. (1995). Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 361–368). Cambridge, MA: MIT Press.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S. (1995). TD models: Modeling the world at a mixture of time scales. In Proceedings of the 12th International Conference on Machine Learning (pp. 531–539). San Mateo, CA: Morgan Kaufmann.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 1038–1044). Cambridge, MA: MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42, 674–690.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University.
Werbos, P. J. (1990). A menu of designs for reinforcement learning over time. In W. T. Miller, R. S. Sutton, & P. J. Werbos (Eds.), Neural networks for control (pp. 67–95). Cambridge, MA: MIT Press.
Zhang, W., & Dietterich, T. G. (1996). High-performance job-shop scheduling with a time-delay TD(λ) network. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.

Received January 18, 1998; accepted January 13, 1999.
Neural Computation, Volume 12, Number 2
CONTENTS Article
Minimizing Binding Errors Using Learned Conjunctive Features Bartlett W. Mel and József Fiser
247
Notes
Relationship Between Phase and Energy Methods for Disparity Computation Ning Qian and Samuel Mikaelian
279
N-tuple Network, CART, and Bagging Aleksander Kołcz
293
Improving the Practice of Classifier Performance Assessment N. M. Adams and D. J. Hand
305
Letters
Do Simple Cells in Primary Visual Cortex Form a Tight Frame? Emilio Salinas and L. F. Abbott
313
Learning Overcomplete Representations Michael S. Lewicki and Terrence J. Sejnowski
337
Noise in Integrate-and-Fire Neurons: From Stochastic Input to Escape Rates Hans E. Plesser and Wulfram Gerstner
367
Modeling Synaptic Plasticity in Conjunction with the Timing of Pre- and Postsynaptic Action Potentials Werner M. Kistler and J. Leo van Hemmen
385
On-line EM Algorithm for the Normalized Gaussian Network Masa-aki Sato and Shin Ishii
407
A General Probability Estimation Approach for Neural Computation Maxim Khaikine and Klaus Holthausen
433
On the Synthesis of Brain-State-in-a-Box Neural Models with Application to Associative Memory Fation Sevrani and Kennichi Abe
451
ARTICLE
Communicated by Shimon Edelman
Minimizing Binding Errors Using Learned Conjunctive Features

Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.

József Fiser
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627-0268, U.S.A.

We have studied some of the design trade-offs governing visual representations based on spatially invariant conjunctive feature detectors, with an emphasis on the susceptibility of such systems to false-positive recognition errors—Malsburg's classical binding problem. We begin by deriving an analytical model that makes explicit how recognition performance is affected by the number of objects that must be distinguished, the number of features included in the representation, the complexity of individual objects, and the clutter load, that is, the amount of visual material in the field of view in which multiple objects must be simultaneously recognized, independent of pose, and without explicit segmentation. Using the domain of text to model object recognition in cluttered scenes, we show that with corrections for the nonuniform probability and nonindependence of text features, the analytical model achieves good fits to measured recognition rates in simulations involving a wide range of clutter loads, word sizes, and feature counts. We then introduce a greedy algorithm for feature learning, derived from the analytical model, which grows a representation by choosing those conjunctive features that are most likely to distinguish objects from the cluttered backgrounds in which they are embedded. We show that the representations produced by this algorithm are compact, decorrelated, and heavily weighted toward features of low conjunctive order.
Our results provide a more quantitative basis for understanding when spatially invariant conjunctive features can support unambiguous perception in multiobject scenes, and lead to several insights regarding the properties of visual representations optimized for specific recognition tasks.

1 Introduction
Neural Computation 12, 247–278 (2000) © 2000 Massachusetts Institute of Technology

The problem of classifying objects in visual scenes has remained a scientific and a technical holy grail for several decades. In spite of intensive research in
the field of computer vision and a large body of empirical data from the fields of visual psychophysics and neurophysiology, the recognition competence of a two-year-old human child remains unexplained as a theoretical matter and well beyond the technical state of the art. This fact is surprising given that (1) the remarkable speed of recognition in the primate brain allows for only very brief processing times at each stage in a pipeline containing only a few stages (Potter, 1976; Oram & Perrett, 1992; Heller, Hertz, Kjær, & Richmond, 1995; Thorpe, Fize, & Marlot, 1996; Fize, Boulanouar, Ranjeva, Fabre-Thorpe, & Thorpe, 1998), (2) the computations that can be performed within each neural processing stage are strongly constrained by the structure of the underlying neural tissue, about which a great deal is known (Hubel & Wiesel, 1968; Szentagothai, 1977; Jones, 1981; Gilbert, 1983; Van Essen, 1985; Douglas & Martin, 1998), (3) the response properties of neurons in each of the relevant brain areas have been well studied, and evolve from stage to stage in systematic ways (Oram & Perrett, 1994; Logothetis & Sheinberg, 1996; Tanaka, 1996), and (4) computer systems may already be powerful enough to emulate the functions of these neural processing stages, were it only known what exactly to do. A number of neuromorphic approaches to visual recognition have been proposed over the years (Pitts & McCulloch, 1947; Fukushima, Miyake, & Ito, 1983; Sandon & Uhr, 1988; Zemel, Mozer, & Hinton, 1990; Le Cun et al., 1990; Mozer, 1991; Swain & Ballard, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Schiele & Crowley, 1996; Mel, 1997; Weng, Ahuja, & Huang, 1997; Lang & Seitz, 1997; Wallis & Rolls, 1997; Edelman & Duvdevani-Bar, 1997).
Many of these neurally inspired systems involve constructing banks of feature detectors, often called receptive fields (RFs), each of which is sensitive to some localized spatial configuration of image cues but invariant to one or more spatial transformations of its preferred stimulus—typically including invariance to translation, rotation, scale, or spatial distortion, as specified by the task.¹ Since the goal to extract useful invariant features is common to most conventional approaches to computer vision as well, the “neuralness” of a recognition system lies primarily in its emphasis on feedforward, hierarchical construction of the invariant features, where the computations are usually restricted to simple spatial conjunction and disjunction operations (e.g., Fukushima et al., 1983), and/or the inclusion of a relatively large number of invariant features in the visual representation (see Mozer, 1991; Califano & Mohan, 1994; Mel, 1997)—anywhere from hundreds to millions. Neural approaches typically also involve some form of learning, whether supervised by category labels at the output layer (Le Cun et al., 1990), supervised directly within

¹ A number of the above-mentioned systems have emphasized other mechanisms instead of or in addition to straightforward image filtering operations (Zemel et al., 1990; Mozer, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Edelman & Duvdevani-Bar, 1997).
intermediate network layers (Fukushima et al., 1983), or involving purely unsupervised learning principles that home in on features that occur frequently in target objects (Fukushima et al., 1983; Zemel et al., 1990; Wallis & Rolls, 1997; Weng et al., 1997). While architectures of this general type have performed well in a variety of difficult, though limited, recognition problems, it has yet to be proved that a few stages of simple feedforward filtering operations can explain the remarkable recognition and classification capacities of the human visual system, which is commonly confronted with scenes acquired from varying viewpoints, containing multiple objects drawn from thousands of categories, and appearing in an infinite panoply of spatial configurations. In fact, there are reasons for pessimism.

1.1 The Binding Problem. One of the most persistent worries concerning this class of architecture is due to Malsburg (1981), who noted the potential for ambiguity in visual representations constructed from spatially invariant filters. A spatial binding problem arises, for example, when each detector in the visual representation reports only the presence of some elemental object attribute but not its spatial location (or, more generally, its pose). Under these unfavorable conditions, an object is hallucinated (i.e., detected when not actually present) whenever all of its elemental features are present in the visual field, even when embedded piecemeal in improper configurations within a scattered coalition of distractor objects (see Figure 1A). This type of failure mode for a visual representation emanates from the dual pressures to cope with uncontrolled variation in object pose, which forces the use of detectors with excessive spatial invariances, and the necessity to process multiple objects simultaneously, which overstimulates the visual representation and increases the probability of ambiguous perception.
One approach to the binding problem is to build a separate, full-order conjunctive detector for every possible view (e.g., spatial pose, state of occlusion, distortion, degradation) of every possible object, and then for each object provide a huge disjunction that pools over all of the views of that object. Malsburg (1981) pointed out that while the binding problem could thus be solved, the remedy is a false one, since it leads to a combinatorial explosion in the number of needed visual processing operations (see Figure 1B). Other courses of action include (1) strategies for image preprocessing designed to segment scenes into regions containing individual objects, thus reducing the clutter load confronting the visual representation, and (2) explicit normalization procedures to reduce pose uncertainty (e.g., centering, scaling, warping), thereby reducing the invariance load that must be sustained by the individual receptive fields. Both strategies reduce the probability that any given object's feature set will be activated by a spurious collection of features drawn from other objects.

Figure 1: (A) A binding problem occurs when an object's representation can be falsely activated by multiple (or inappropriate) other objects. (B) A combinatorial explosion arises when all possible views of all possible objects are represented by explicit conjunctions. (C) Compromise representation containing an intermediate number of receptive fields that bind image features to intermediate order.

1.2 Wickelsystems. Preprocessing strategies aside, various workers have explored the middle ground between “binding breakdown” and “combinatorial catastrophe,” where a visual representation contains an intermediate number of detectors, each of which binds together an intermediate-sized subset of feature elements in their proper relations before integrating out the irrelevant spatial transformations (see Mozer, 1991, for a thoughtful advocacy of this position). Under this compromise, each detector becomes a spatially invariant template for a specific minipattern, which is more complex than an elemental feature, such as an oriented edge, but simpler than a full-blown view of an object (see Figure 1C). A set of spatially invariant features of intermediate complexity can be likened to a box of jumbled jigsaw puzzle pieces: although the final shape of the composite object—the puzzle—is not explicitly contained in this peculiar “representation,” it is nonetheless unambiguously specified: the collection of parts (features) can interlock in only one way. An important early version of this idea is due to Wickelgren (1969), who proposed a scheme for representing the phonological structure of spoken language involving units sensitive to contiguous triples of phonological features but not to the absolute position of the tuples in the speech stream. Though not phrased in this language, “Wickelfeatures” were essentially translation-invariant detectors of phonological minipatterns of intermediate complexity, analyzing the one-dimensional speech signal. Representational schemes inspired by Wickelgren's proposal have since cropped up in other influential models of language processing (McClelland & Rumelhart, 1981; Mozer, 1991). The use of invariant conjunctive feature schemes to address representational problems in fields other than vision highlights the fact that the binding problem is not an inherently visuospatial one. Rather, it can arise whenever the set of features representing an object (visual, auditory, olfactory, etc.)
can be fully, and thus inappropriately, activated by input “scenes” that do not contain the object. In general, a binding problem has the potential to exist whenever a representation lacks features that conjoin an object's elemental attributes (e.g., parts, spatial relations, surface properties) to sufficiently high conjunctive order. On the solution side, binding ambiguity can be mitigated as in the scenario of Figure 1 by building conjunctive features that are increasingly object specific, until the probability of false-positive activation for any object is driven below an acceptable threshold. Given that this conjunctive order-boosting prescription to reduce ambiguity is algorithmically trivial, it shifts emphasis away from the question as to how to eliminate binding ambiguity, toward the question as to whether, for a given recognition problem, ambiguity can be eliminated at an acceptable cost.

1.3 A Missing Theory. In spite of the importance of spatially invariant conjunctive visual representations as models of neural function, and the demonstrated practical successes of computer vision systems constructed along these lines, many of the design trade-offs that govern the performance of visual “Wickelsystems” in response to cluttered input images have remained poorly understood. For example, it is unknown in general how recognition performance depends on (1) the number of object categories that must be distinguished, (2) the similarity structure of the object category distribution (i.e., whether object categories are intrinsically very similar or very different from each other), (3) the featural complexity of individual objects, (4) the number and conjunctive order of features included in the representation, (5) the clutter load (i.e., the amount of visual material in the field of view from which multiple objects must be recognized without explicit segmentation), and (6) the invariance load (i.e., the set of spatial transformations that do not affect the identities of objects, and that must be ignored by each individual detector). In a previous analysis that touched on some of these issues, Califano and Mohan (1994) showed that large, parameterized families of complex invariant features (e.g., containing 10^6 elements or more) could indeed support recognition of multiple objects in complex scenes without prior segmentation or discrimination among a large number of highly similar objects in a spatially invariant fashion. However, these authors were primarily concerned with the mathematical advantages of large versus small feature sets in a generic sense, and thus did not test their analytical predictions using a large object database and varying object complexity, category structure, clutter load, and other factors. Beyond our lack of understanding regarding the trade-offs influencing system performance, the issue as to which invariant conjunctive features, and of what complexity, should be incorporated into a visual representation through learning has also not been well worked out as a conceptual matter.
Previous approaches to feature learning in neuromorphic visual systems have invoked (1) generic gradient-based supervised learning principles (e.g., Le Cun et al., 1990) that offer no direct insight into the qualities of a good representation, or (2) straightforward unsupervised learning principles (e.g., Wallis & Rolls, 1997), which tend to discover frequently occurring features or principal components rather than those needed to distinguish objects from cluttered backgrounds on a task-specific basis. Network approaches to learning have also typically operated within a fixed network architecture, which largely predefines and homogenizes the complexity of top-level features to be used for recognition, though optimal representations may actually require features drawn from a range of complexities and could vary in their composition on a per-object basis. All this suggests that further work on the principles of feature learning is needed. In this article, we describe our work to test a simple analytical model that captures several trade-offs governing the performance of visual recognition systems based on spatially invariant conjunctive features. In addition, we introduce a supervised greedy algorithm for feature learning that grows a visual representation in such a way as to minimize false-positive recognition errors. Finally, we consider some of the surprising properties of “good” representations and the implications of our results for more realistic visual recognition problems.
2 Methods

2.1 Text World as a Model for Object Recognition. The studies described here were carried out in text world, a domain with many of the complexities of vision in general (see Figure 2): target objects are numerous (more than 40,000 words in the database), are highly nonuniform in their prior probabilities (relative probabilities range from 1 to 400,000), are constructed from a set of relatively few underlying parts (26 letters), and individually can contain from 1 to 20 parts. Furthermore, the parts from which words are built are highly nonuniform in their relative frequencies and contain strong statistical dependencies. Finally, word objects occur in highly cluttered visual environments—embedded in input arrays in which many other words are simultaneously present—but must nonetheless be recognized regardless of position in the visual field. Although text world lacks several additional complexities characteristic of real object vision (see section 6), it nonetheless provides a rich but tractable domain in which to quantify binding trade-offs and to facilitate the testing of analytical predictions and the development of feature learning algorithms. An important caveat, however, is that the use of text as a substrate for developing and testing our analytical model does not imply that our conclusions bear directly on any aspect of human reading behavior. Two databases were used in the studies. The word database contained 44,414 entries, representing all lowercase punctuation-free words and their relative frequencies found in 5 million words in the Wall Street Journal (WSJ) (available online from the Association for Computational Linguistics at http://morph.ldc.upenn.edu/). The English text database consisted of approximately 1 million words drawn from a variety of sources (Kučera & Francis, 1967).

2.2 Recognition Using n-Grams.
A natural class of visual features in text world is the position-invariant n-gram, defined here as a binary detector that responds when a specific spatial configuration of n letters is found anywhere (one or more times) in the input array.² The value of n is termed the conjunctive or binding order of the n-gram, and the diameter of the n-gram is the maximum span between characters specified within it. For example, “ th**” is a 3-gram of diameter 5, since it specifies the relative locations of three characters (including the space character but not the wild card characters *) and spans a field five characters in width. The concept of an n-gram is used also in computational linguistics, though usually referring to n-tuples of consecutive words (Charniak, 1993).
2 An n-gram could also be multivalued, that is, respond in a graded fashion depending on the number of occurrences of the target feature in the input; the binary version was used here to facilitate analysis.
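The binary n-gram detector just defined can be sketched directly (a minimal illustration; the function name is ours):

```python
def ngram_fires(pattern, text):
    """Binary position-invariant n-gram detector: returns 1 if the pattern
    occurs anywhere (one or more times) in the input array, else 0.
    '*' is a wild card matching any single character; all other characters,
    including the space character, must match exactly."""
    n = len(pattern)  # the diameter of the n-gram
    for i in range(len(text) - n + 1):
        if all(p == '*' or p == c for p, c in zip(pattern, text[i:i + n])):
            return 1
    return 0
```

The conjunctive (binding) order is the number of non-wild-card characters in the pattern; for “ th**” it is 3 (space, t, h), while the diameter is 5.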
[Figure 2: five panels. (A) Non-Uniform Word Frequencies; (B) Non-Uniform Word Sizes; (C) Non-Uniform Letter Frequencies; (D) Non-Uniform 2-gram Frequencies; (E) Statistical Dependencies Among Adjacent Letters.]

Figure 2: Text world was used as a surrogate for visual recognition in multiobject scenes. (A) Word frequencies in the 5-million-word Wall Street Journal (WSJ) corpus are highly nonuniform, following an approximately exponential decay according to Zipf's law. (B) Words range from 1 to 20 characters in length; 7-letter words are most numerous. (C, D) Relative frequencies of individual letters or adjacent letter pairs are highly nonuniform. Ordinate values represent the number of times the feature was found in the WSJ corpus. The dashed rectangle in D represents a simplifying assumption regarding the 2-gram frequency distribution used for quantitative predictions below. (E) The parts of words show strong statistical dependencies. Columns show information gain about the identity of the adjacent letter (or space character) to the left or right, given the presence of the letter indicated on the x-axis. The dashed horizontal line indicates perfect knowledge—log₂(27) = 4.75 bits.
In relation to conventional visual domains, n-grams may be viewed as complex visual feature detectors, each of which responds to a specific conjunction of subparts in a specific spatial configuration, but with built-in invariance to a set of spatial transformations predefined by the recognition task. In relation to a shape-based visual domain, a 2-gram is analogous to a corner detector that signals the co-occurrence of a vertical and a horizontal edge, with appropriate spatial offsets, anywhere within an extended region of the image. Let R be a representation containing a set of position-invariant n-grams analyzing the input array and W be a set of word detectors analyzing R, with one detector for every entry in the word database. Each word detector receives input from some or all of the n-gram units in R that are contained in the respective word. However, not all of a word's n-grams are necessarily contained in R, and not all of the word's n-grams that are contained in R necessarily provide input to the word's detector. For example, the representation might contain four spatially invariant features R = {a, c, o, s}, while the detector for the word cows receives input from features c and s, but not from the o feature, which is present in R, or from the w feature, which is not contained in R. Word detectors are binary-valued conjunctions, responding 1 when all of their inputs are active and 0 otherwise. The collection of n-grams feeding a word detector is hereafter referred to as the word's minimal feature set. A correct recognition is scored when an input array of characters is presented and each word detector is activated if and only if its associated word is present in the input. Any word detector activated when its corresponding word is not actually present in the input array is counted as a hallucination, and the recognition trial is scored as an error.
Recognition errors for a particular representation R are reported as the percentage of input arrays drawn randomly from the English text database that generated one or more word hallucinations.
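The recognition scheme just described can be sketched as follows (a hypothetical minimal implementation, not the authors' code; the helper names, the restriction to 1- and 2-grams, and the use of a plain character string as the "input array" are illustrative assumptions):

```python
def ngrams_of(text, orders=(1, 2)):
    """All contiguous n-grams of the given orders present in `text`."""
    return {text[i:i + n]
            for n in orders
            for i in range(len(text) - n + 1)}

def score_trial(input_array, words, minimal_sets, R):
    """Return the set of hallucinated words for one input array.

    `R` is the set of n-gram features in the representation, and
    `minimal_sets[word]` is the subset of R feeding that word's detector.
    """
    active = ngrams_of(input_array) & R     # features of R lit by the input
    present = set(input_array.split())      # words actually in the input
    hallucinated = set()
    for word in words:
        fires = minimal_sets[word] <= active   # binary conjunction of inputs
        if fires and word not in present:
            hallucinated.add(word)
    return hallucinated
```

With R = {a, c, o, s} and the cows detector fed by {c, s}, as in the example above, the input "socks" activates c and s and so hallucinates cows.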
3 Results

3.1 Word-Word Collisions in a Simple 1-Gram Representation. We begin by considering the case without visual clutter and ask how often pairs of isolated words collide with each other within a binary 1-gram representation, where each word is represented as an unordered set of individual letters (with repeats ignored). This is the situation in which the binding problem is most evident, as schematized in Figure 1A. Results for the approximately 2 billion distinct pairs of words in the database are summarized in Table 1. About half of the 44,414 words contained in the database collide with at least one other word (e.g., cerebrum ↔ cucumber, calculation ↔ unconstitutional). The largest cohort of 1-gram-identical words contained 28 members.
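The word-word collision test can be sketched as follows (our reconstruction, not the authors' code; the signature functions mirror the three comparison rules discussed in this section):

```python
from collections import defaultdict

def collision_cohorts(words, signature):
    """Group words by their representation signature; any cohort with more
    than one member is a set of mutually colliding (ambiguous) words."""
    cohorts = defaultdict(list)
    for w in words:
        cohorts[signature(w)].append(w)
    return [sorted(c) for c in cohorts.values() if len(c) > 1]

def set_1gram(w):        # binary 1-grams: letters present, repeats ignored
    return frozenset(w)

def multiset_1gram(w):   # multivalued 1-grams: repeats counted
    return tuple(sorted(w))

def adj_2gram(w):        # binary adjacent 2-grams
    return frozenset(w[i:i + 2] for i in range(len(w) - 1))

print(collision_cohorts(["traces", "crates", "caters", "dog"], set_1gram))
# → [['caters', 'crates', 'traces']]
```

Running the same grouping under `multiset_1gram` or `adj_2gram` reproduces the progressively smaller ambiguity counts reported below.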
256
Bartlett W. Mel and J´ozsef Fiser
Table 1: Performance of a Binary 1-Gram Representation in Word-Word Comparisons.

Quantity                                   Value         Comment
Number of 1-grams in R                     27            a, b, ..., z, space
Number of distinct words                   44,414        From 5 million words in the WSJ
Word-word comparisons                      ~2 billion
Number of ambiguous words                  24,488        analogies ↔ gasoline; suction ↔ continuous;
                                                         scientists ↔ nicest
Largest self-similar cohort                28 words      stare, arrest, tears, restates, reassert,
                                                         rarest, easter, ...
Baseline entropy in word-frequency         9.76 bits     < log2(44,414) = 15.4 bits
  distribution
Residual uncertainty about identity of     1.4 bits      Narrows field to ~3 possible words
  a randomly drawn word, knowing its
  1-grams
As an aside, we noted that if the two words compared were also required to contain the same number of letters, the number of ambiguously represented words fell to 11,377, and if the words were compared within a multivalued 1-gram representation (where words matched only if they contained the identical multiset of letters, i.e., in which repeats were taken into account), the ambiguous word count fell further to 5,483, or about 12% of English words according to this sample. The largest self-similar cohort in this case contained only 5 words (traces, reacts, crates, caters, caster). Although a 1-gram representation contains no explicit part-part relations and fails to pin down uniquely the identity of more than half of the English words in the database, it nonetheless provides most of the information needed to distinguish individually presented words. The baseline entropy in the WSJ word-frequency distribution is 9.76 bits, significantly less than the 15.4 = log2(44,414) bits for an equiprobable word set. The average residual uncertainty about the identity of a word drawn from the WSJ word-frequency distribution, given its 1-gram representation, is only 1.4 bits, meaning that for individually presented words, the 1-gram representation narrows the set of possibilities to about three words on average.

3.2 Word-Word Collisions in an Adjacent 2-Gram Representation.
Since many word objects have identical 1-gram representations even when full multisets of letters are considered, we examined the collision rate when an adjacent binary-valued 2-gram representation was used (i.e., coding the set of all adjacent letter pairs, including the space character). This representation is significant in that adjacent 2-grams are the simplest features that explicitly encode spatial relations between parts. The results of this test are summarized in Table 2. There were only 46 word-pair collisions in this case, comprising 57 distinct words. Forty-two of the 46 problematic word pairs differed only in the length of a string of repeated letters (e.g., ahhh ↔ ahhhh ↔ ahhhhh ..., and similarly for ohhh, ummm, zzz, wheee, whirrr) or fell into sets representing varying spellings or misspellings of the same word also involving repeated letters (e.g., tepees ↔ teepees, llama ↔ lllama). Only four pairs of distinct words collided that had different morphologies in the linguistic sense (see Table 2). When words were constrained to be the same length or to contain the identical multiset of adjacent 2-grams, only a single collision persisted: intended ↔ indented. At the level of word-word comparisons, therefore, the binding of adjacent letter pairs is sufficient practically to eliminate the ambiguity in a large database of English words. We noted empirically that the inclusion of fewer than 20 additional nonadjacent 2-grams in the representation entirely eliminated the word-word ambiguity in this database, without recourse to multiset comparisons or word-length constraints.

Table 2: Performance of a Binary Adjacent 2-Gram Representation in Word-Word Comparisons.

Quantity                              Value         Comment
Number of adjacent 2-grams            729           [aa], [ab], ..., [a_], ..., [_z], [__]
                                                    (where _ is the space character)
Number of words                       44,414
Word-word comparisons                 ~2 billion
Number of ambiguous words             57            ohhh, ahhh, shhh, hmmm, whoosh, whirr, ...
Largest self-similar cohort           5 words       ohhh ↔ ohhhh ↔ ohhhhh ↔ ohhhhhh ↔ ohhhhhhh
Ambiguous word pairs that are         4 pairs       asses ↔ assess; possess ↔ possesses;
  linguistically distinct                           seamstress ↔ seamstresses; intended ↔ indented
Words with identical adjacent         2 words       intended ↔ indented
  2-gram multisets

4 An Analytical Model
The results of these word-word comparisons suggest that under some circumstances, a modest number of low-order n-grams can disambiguate a large, complex object set. However, one of the most dire challenges to a spatially invariant feature representation, object clutter, remains to be dealt with. To this end, we developed a simple probabilistic model to predict recognition performance when a cluttered input array containing multiple objects is presented to a bank of feature detectors. To permit analysis of this problem, we make two simplifying assumptions. First, we assume that the features contained in R are statistically independent. This assumption is false for English n-grams: for example, [th] predicts that [he] will follow about one in three times, whereas the baseline rate for [he] is only 0.5%. Second, we assume that all the features in R are activated with equal probability. This is also false in English: for example, [ed] occurs approximately 90,000 times more often than [mj] (as in ramjet). Proceeding nonetheless under these two assumptions, it is straightforward to calculate the probability that no object in the database is hallucinated in response to a randomly drawn input image. We first calculate the probability o that all of the features feeding a given object detector are activated by an arbitrary input image:
$$ o \;=\; \frac{\binom{d-w}{c-w}}{\binom{d}{c}} \;=\; \frac{(d-w)!\,c!}{(c-w)!\,d!} \;\approx\; \left(\frac{c}{d}\right)^{w}, \tag{4.1} $$
where d is the total number of feature detectors in R, w is the size of the object's minimal feature set (i.e., the number of features required by the target object (word) detector), and c is the number of features activated by a "cluttered" multiobject input image. The leftmost combinatorial ratio term in equation 4.1 counts the number of ways of choosing c from the set of d detectors, including all w belonging to the target object, as a fraction of the total number of ways of choosing c of the d detectors in any fashion. The approximation in equation 4.1 holds when c and d are large compared to w, as is generally the case. Equation 4.1 is, incidentally, the value of a "hypergeometric" random variable in the special case where all of the target elements are chosen; this distribution is used, for example, to calculate the probability of picking a winning lottery ticket. The probability that an object is hallucinated (i.e., perceived but not actually present) is given by

$$ h_i \;=\; \frac{o_i - q_i}{1 - q_i}, \tag{4.2} $$
where q_i is the probability that the object does in fact appear in the input image, which by definition means it cannot be hallucinated. According to Zipf's law, however, which tells us that most objects in the world are quite uncommon (see Figure 2A), we may in some cases make the further simplification that q_i ≪ o_i, giving h_i ≈ o_i. We adopt this assumption in the following. (This simplification would result also from the assumption that inputs consist of object-like noise structured to trigger false-positive perceptions.) We may now write the probability of veridical perception, the probability that no object in the database is hallucinated, as

$$ p_v \;=\; (1 - h)^{N} \;\approx\; \left[\, 1 - \left(\frac{c}{d}\right)^{w} \right]^{N}, \tag{4.3} $$
where N is the number of objects in the database. This expression assumes a homogeneous object database, where all N objects require exactly w features in R. Given the independence assumption, the expression for a heterogeneous object database consists of a product of object-specific probabilities p_v = ∏_i (1 − (c/d)^{w_i}), where w_i is the number of features required by object i. Note that objects (words) containing the same number of parts (letters) do not necessarily activate the same number of features (n-grams) in R, depending on both the object and the composition of R.

From a first-order expansion of equation 4.3 around (c/d)^w = 0, which gives p_v = 1 − N(c/d)^w, it may be seen that perception is veridical (i.e., p_v ≈ 1) when (c/d)^w ≪ 1/N. Thus, recognition errors are expected to be small when (1) a cluttered input scene activates only a small fraction of the detectors in R, (2) individual objects depend on as many detectors in R as possible, and (3) the number of objects to be discriminated is not too large. The first two effects generate pressure to grow large representations, since if d is scaled up and c and w scale with it (as would be expected for statistically independent features), recognition performance improves rapidly. Conveniently, the size of the representation can always be increased by including new features of higher binding order or larger diameters, or both.

It is also interesting to note that item 1 in the previous paragraph appears to militate for a sparse visual representation, while equation 4.2 militates for a dense representation. This apparent contradiction reflects the fundamental trade-off between the need for individual objects to have large minimal feature sets in order to ward off hallucinations and the need for input scenes containing many objects to activate as few features as possible, again in order to minimize hallucinations.
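Equations 4.1 and 4.3 can be checked numerically; the following sketch (with illustrative parameter values, not figures from the paper) compares the exact hypergeometric probability against its (c/d)^w approximation:

```python
from math import comb

def o_exact(d, c, w):
    """Eq. 4.1, exact: probability that all w features of a target object
    lie among the c activated of d detectors (a hypergeometric special case)."""
    return comb(d - w, c - w) / comb(d, c)

def o_approx(d, c, w):
    """Eq. 4.1, approximation valid when c and d are large compared to w."""
    return (c / d) ** w

def p_veridical(d, c, w, N):
    """Eq. 4.3: probability that none of N homogeneous objects is
    hallucinated, under the q << o simplification (h ~= o)."""
    return (1.0 - o_approx(d, c, w)) ** N
```

For small arguments the two forms of o diverge (e.g., d = 4, c = 2, w = 2 gives 1/6 exactly versus 1/4 approximately), but they agree closely once c and d dwarf w, as assumed in the text.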
This trade-off is not captured within the standard conception of a feature space, since the notion of "similarity as distance" in a feature space does not naturally extend to the situation in which inputs represent multiobject scenes and require multiple output labels.

4.1 Fitting the Model to Recognition Performance in Text World. We ran a series of experiments to test the analytical model. Input arrays varying in width from 10 to 50 characters were drawn at random from the text database and presented to R. The representation contained either all adjacent 2-grams, called R{2} to signify a diameter of 2, or the set of all 2-grams of diameter 2 or 3, called R{2,3}, where R{2,3} was twice the size of R{2}.
In individual runs, the word database was restricted to words of a single length: two-letter words (N = 253), five-letter words (N = 4024), or seven-letter words (N = 7159), in order for the value of w to be kept relatively constant within a trial.3 For each word size and input size, and for each of the two representations, average values for w and c were measured from samples drawn from the word or text database. Not surprisingly, given the assumption violations noted above (feature independence and equiprobability), predicting recognition performance using these measured average values led to gross underestimates of the hallucination errors actually generated by the representation. Two kinds of corrections to the model were made. The first correction involved estimating a better value for d, the number of n-grams contained in the representation, since some of the detectors nominally present in R{2} and R{2,3} were infrequently, if ever, actually activated. The value for d was thus corrected downward to reflect the number of detectors that would account for more than 99.9% of the probability density over the detector set: for R{2}, the representation size dropped from d = 729 to d′ = 409, while for R{2,3}, d was cut from 1,458 to 948. The second correction was based on the fact that the n-grams activated by English words and phrases are highly statistically redundant, meaning that a set of d′ n-grams has many fewer than d′ underlying degrees of freedom. We therefore introduced a redundancy factor r, which allowed us to "pretend" that the representation contained only d′/r virtual n-grams, and that a word or an input image activated only w/r or c/r virtual n-grams in the representation, respectively. (Note that since c and d appear only as a ratio in the approximation of equation 4.3, their r-corrections cancel out.)
By systematically adjusting the redundancy factor depending on test condition (ranging from 1.60 to 3.45), qualitatively good fits to the empirical data could be generated, as shown in Figure 3. The redundancy factor was generally larger for long words than for short words, reflecting the fact that long English words have more internal predictability (i.e., fill a smaller proportion of the space of long strings than short words fill in the space of short strings). The r-factor was also larger for R{2,3} than for R{2}, reflecting the fact that the increased feature overlap due to the inclusion of larger feature diameters in R{2,3} leads to stronger interfeature correlations, and hence fewer underlying degrees of freedom. Thus, with the inclusion of correction factors for nonuniform activation levels and nonindependence of the features in R, the remarkably simple relation in equation 4.3 captures the main trends that describe recognition performance in scenarios with database sizes ranging from 253 to 7159 words, words containing between
3 Since R{2} and R{2,3} contained all n-grams of their respective sizes, the value of w could vary from word to word only to the extent that words contained one or more repeated n-grams.
Figure 3: Fits of equation 4.3 to recognition performance in text world. (A) The representation, designated R{2}, contained the complete set of 729 adjacent 2-grams (including the space character). The x-axis shows the width of a text window presented to the n-gram representation, and the y-axis shows the probability that the input generates one or more hallucinated words. Analytical predictions are shown as dashed lines using the redundancy factors specified in the legend. (B) Same as A, with results for R{2,3}.
2 and 7 letters, input arrays containing from 10 to 50 characters, and representations varying in size by a factor of 2. We did not attempt to fit larger ranges of these experimental variables. Regarding the sensitivity of the model fits to the choice of d′ and r, we noted first that fits were not strongly affected by the cumulative probability cutoff used to choose d′: qualitatively reasonable fits could also be generated using a cutoff value of 95%, for example, for which the d′ values for R{2} and R{2,3} shrank to 220 and 504, respectively. In contrast, we noted that a change of one part per thousand in the value of r could result in a significant change in the shape of a curve generated by equation 4.3, and with it a significant degradation of the quality of a fit. However, the systematic increase in the optimal value of r with increasing word size and representation size indicates that this redundancy correction factor is not an entirely free parameter. It is also worth emphasizing that a redundancy "factor," which uniformly scales w, c, and d, is a very crude model of the complex statistical dependencies that exist among English n-grams; the fact that any set of small r values can be found that tracks empirical recognition curves over wide variations in the several other task variables is therefore significant.

5 Greedy Feature Learning
Several lessons from the analysis and experiments are that (1) larger representations can lead to better recognition performance, (2) only those features that are actually used should be included in a representation, (3) considerable statistical redundancy exists in a fully enumerated (or randomly drawn) set of conjunctive features, which should be suppressed if possible, and (4) features of low binding order, while potentially adequate to represent isolated objects, are insufficient for veridical perception of scenes containing multiple objects. In addition, we infer that different objects are likely to have different representational requirements depending on their complexity and that the composition of an ideal representation is likely to depend heavily on the particular recognition task. Taken together, these points suggest that a representation should be learned that includes higher-order features only as needed. In contrast, blind enumeration of an ever-increasing number of higher-order features is an expensive and relatively inefficient way to proceed. Standard network approaches to supervised learning could in principle address all of the points raised above, including finding appropriate weights for features of various orders depending on their utility and helping to decorrelate the input representation. However, the potential need to acquire features of high order, so high that enumeration and testing of the full cohort of high-order features is impractical, compels the use of some form of explicit search as part of the learning process. The use of greedy, incremental, or heuristic techniques to grow a neural representation through increasing nonlinear orders is not new (e.g., Barron, Mucciardi, Cook, Craig, & Barron, 1984; Fahlman & Lebiere, 1990), though such techniques have been relatively little discussed in the context of vision. To cope with the many constraints that must be satisfied by a good visual representation, we developed a greedy supervised algorithm for feature learning that (1) selects the order in which features should be added to the representation, and (2) selects which features added to the representation should contribute to the minimal feature sets of each object in the database. By differentiating the log probability u_v = log(p_v) = log ∏_i (1 − h_i) with respect to the hallucination probability of the ith object, we find that

$$ \frac{du_v}{dh_i} \;=\; \frac{-1}{1 - h_i}, \tag{5.1} $$
indicating that the largest multiplicative increments to p_v are realized by squashing the largest values of h_i = (c/d)^{w_i}. Practically, this is achieved by incrementing the w-values (i.e., growing the minimal feature sets) of the most frequently hallucinated objects. The following learning algorithm aims to do this in an efficient way:

1. Start with a small bootstrap representation. (We typically used 50 2-grams found to be useful in pilot experiments.)

2. Present a large number of text "images" to the representation.

3. Collect hallucinated words: any word that was ever detected but was not actually present in an input image.

4. For each hallucinated word and the input array that generated it, increment a global frequency table for every n-gram (up to a maximum order of 5 and diameter of 6) that is (i) contained in the hallucinated word, (ii) not contained in the offending input, and (iii) not currently included in the representation.

5. Choose the most frequent n-gram in this table, of any order and diameter, and add it to the current representation.

6. Build a connection from the newly added n-gram to any word detector involved in a successful vote for this feature in step 4. Inclusion of this n-gram in these words' minimal feature sets will eliminate the largest number of word hallucination events encountered in the set of training images.

7. Go to step 2.

We ran the algorithm using input arrays 50 characters in width (often containing 10 or more words) and plotted the number of hallucinated words (see Figure 4A) and the proportion of input arrays causing hallucinations (see Figure 4B) as a function of increasing representation size during learning. The periodic ratcheting in the performance curves (most visible in the lower curve of Figure 4B) was due to staging of the training process for efficiency
reasons: batches of 1000 input arrays were processed until a fixed hallucination error criterion of 1% was reached, at which point a new batch of 1000 random input arrays was drawn, and so on. The envelope of the oscillating performance curves provides an estimate of the true performance curve that would result from an infinitely large training corpus. Learning was terminated when the average initial error rate for three consecutive new training batches fell below 1%. Given that this termination condition was typically reached in fewer than 50 training epochs (consisting of 50,000 50-character input images), no more than half the text database was visited by random drawings of training-testing sets. We found that for 50-character input arrays, the hallucination error rate fell below 1% after the inclusion of 2287 n-grams in R. These results were typical. The inferior performance of a simple unsupervised algorithm, which adds n-grams to the representation in order of their raw frequencies in English, is shown in Figures 4A and 4B for comparison (upper dashed curves). The counts of total words hallucinated (see Figure 4A), with initial values indicating many tens of hallucinated words per input, can be seen to fall far faster than the hallucination error rate (see Figure 4B), since a sentence was scored as an error until its very last hallucinated word was quashed.
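One round of steps 2 through 6 can be sketched as follows (our reconstruction under simplifying assumptions, not the authors' code: the n-gram extractor and word detector are passed in as functions, and wildcard n-grams and the diameter limit are omitted):

```python
from collections import Counter

def greedy_round(inputs, minimal_sets, R, ngrams_of, detect):
    """One round of greedy feature learning: find the n-gram that would fix
    the most hallucinations and wire it into the representation R."""
    votes = Counter()   # candidate n-gram -> number of supporting votes
    voters = {}         # candidate n-gram -> words that voted for it
    for array in inputs:
        present = set(array.split())
        for word in detect(array, minimal_sets, R):
            if word in present:
                continue                    # correct detection, not an error
            # Step 4: candidates are contained in the hallucinated word,
            # absent from the offending input, and not already in R.
            for g in ngrams_of(word) - ngrams_of(array) - R:
                votes[g] += 1
                voters.setdefault(g, set()).add(word)
    if not votes:
        return None                         # no hallucinations this round
    best, _ = votes.most_common(1)[0]       # step 5: most frequent candidate
    R.add(best)
    for word in voters[best]:               # step 6: grow minimal feature sets
        minimal_sets[word].add(best)
    return best
```

Adding the winning n-gram to each voter's minimal feature set guarantees, by construction, that the specific hallucination events that voted for it can no longer occur.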
Figure 4: Facing page. Results of the greedy n-gram learning algorithm using input arrays 50 characters in width and a 50-element bootstrap representation. Training was staged in batches of 1000 input arrays until the steady-state hallucination error rate fell below 1%. (A) Growth of the representation is plotted along the x-axis, and the number of words hallucinated at least once in the current 1000-input training batch is plotted on the y-axis. The drop is far faster for the greedy algorithm (lower curve) than for a frequency-based unsupervised algorithm (upper curve). The unsupervised bootstrap representation also included the 27 1-grams (i.e., individual letters plus the space character) that were excluded from the bootstrap representation used in the greedy learning run. Without these 1-grams, performance of the unsupervised algorithm was even more severely hampered; with these 1-grams incorporated in the greedy learning runs, the w-values of every object were unduly inflated. (B) Same as (A), but the y-axis shows the proportion of input arrays producing at least one hallucination error. Jaggies occur at transitions to new training batches each time the 1% error criterion was reached. Asymptote is reached in this run after the inclusion of 2287 n-grams. (C) A breakdown of the learned representation showing relative numbers of n-grams of each order and diameter. (D) Column height shows the minimal feature set size after learning, that is, the average number of n-grams feeding word conjunctions for words of various lengths. The dark lower portion of the columns shows initial counts based on full connectivity to the bootstrap representation; the light upper portions show the increment in w due to learning. Nearly constant w values after learning reflect the tendency of the algorithm to grow the smallest minimal feature sets. The dashed line corresponds to one n-gram per letter in the word.
One of the most striking outcomes of the greedy learning process is that the learned representation is heavily weighted toward low-order features, as shown in Figure 4C. Thus, while the learned representation includes n-grams up to order 5, the number of higher-order features required to remedy perceptual errors falls off precipitously with increasing order, far faster than the explosive growth in the number of available features at each order. Quantitatively, the ratios of n-grams included in R to n-grams contained in the word database for orders 2, 3, 4, and 5 were 0.42, 0.0095, 0.00017, and 0.0000055, respectively. The rapidly diminishing ratios of useful features to available features at higher orders confirm the impracticality of learning on a fully enumerated high-order feature set and illustrate the natural bias in the learning procedure to choose the lowest-order features that "get the job done." The dominance of low-order features is of considerable practical significance, since higher-order features are more expensive to compute, are less robust to image degradation, operate with lower duty cycles, and provide a more limited basis for generalization. The relatively broad distribution of
[Figure 4 appears here. Panel titles: (A) Fewer Words Are Hallucinated as Representation Grows; (B) Fewer Inputs Cause Hallucinations as Representation Grows; (C) Sizes of N-grams in Final Learned Representation; (D) Number of N-grams Feeding Words of Varying Lengths.]
diameters of the learned feature set is also shown in Figure 4C for n-grams of each order. In spite of its relatively small size, it is likely that the n-gram representation produced by the greedy algorithm could be significantly compacted without loss of recognition performance, by alternately pruning the least useful n-grams from R (those whose removal leads to minimal increments in error rate) and retraining back to criterion. However, such a scheme was not tried. After learning to the 1% error criterion, a 50-character input array activated an average of 168 (7.3%) of the 2287 detectors in R, while the average minimal feature set size for an individual word was 11.4. By comparison, the initial c/d ratio for 50-character inputs projected onto the 50-element bootstrap representation was a much less favorable 44%. Thus, a primary outcome of the learning algorithm in this case involving heavy clutter is a larger, more sparsely activated representation, which minimizes collisions between input images and the feature sets associated with individual target words. Further, the estimated r value from this run was 1.95, indicating significantly less redundancy in the learned representation than was estimated using seven-letter words (r = 3.45) in the fully enumerated 2-gram representations reported in Figure 3; this is in spite of the fact that the learned representation contained more than twice the number of active features as the fully enumerated 2-gram representation. The total height of each column in Figure 4D shows the w value, the size of the minimal feature set after learning, for words of various lengths. The dark lower segment of each column counts initial connections to the word conjunctions from the 50-element bootstrap representation, while the average number of new connections gained during learning for words of each length is given by the height of the gray upper segment.
Given the algorithm's tendency to expand the smallest minimal feature sets, the w values for short words grew proportionally much more than those for longer words, producing a final distribution of w values that was remarkably independent of word length for all but the shortest words. The w values for one-, two-, and three-letter words were lower either because they reached their maximum possible values (4 for one-letter words) or because these words contained rarer features, on average, owing to a relatively high concentration in the database of unpronounceable abbreviations, acronyms, and so forth. The dashed line indicates a minimal feature set size equal to the number of letters contained in the word; the longest words can be seen to depend on substantially fewer n-grams than they contain letters. The pressures influencing both the choice of n-grams for inclusion in R and the assignment of n-grams to the minimal feature sets of individual words were very different from those associated with a frequency-based learning scheme. Thus, the features included in R were not necessarily the most common ones, and the most commonly occurring English words were not necessarily the most heavily represented.
Table 3: First 10 n-Grams to Be Added to R During a Learning Run with 50-Character Input Arrays, Initialized with a 200-Element Bootstrap Representation.

Order of Inclusion    n-Gram      Relative Frequency    Cumulative %
1                     [z ]        417                   8
2                     [ j]        14,167                55
3                     [k* ]       39,590                71
4                     [x ]        6090                  42
5                     [ki]        20,558                62
6                     [ q]        8771                  47
7                     [ *v]       24,278                64
8                     [ **z]      3184                  31
9                     [o*i]       37,562                70
10                    [p*** ]     55,567                76

Note: The " " character represents the space character, and "*" matches any character. A value of k in the Cumulative % column indicates that the total probability summed over all less common features is k%.
First, in lieu of choosing the most commonly occurring n-grams, the greedy algorithm prefers features that are relatively rare in English text as a whole but relatively common in those words most likely to be hallucinated: short words and words containing relatively common English structure. The paradoxical need to find features that distinguish objects from common backgrounds, but that do so as often as possible, leads to a representation containing features that occur with moderate frequency. Table 3 shows the first 10 n-grams added to a 200-element bootstrap representation during learning in one experimental run. The features' relative frequencies in English are shown; they jump about erratically but are mostly drawn from the middle ranges of the cumulative probability distribution for English n-grams. Most of these first 10 features include a space character, indicating that failure to represent elemental features properly in relation to object boundaries is a major contributor to hallucination errors in this domain. Specifically, we found that hallucination errors are frequently caused by the presence in the input of multiple "distractor" objects that together contain most or all of the features of a target object, for example, in the sense that national and vacation together contain nearly all the features of the word nation. To distinguish nation from the superposition of these two distractors requires the inclusion in the representation of a 2-gram such as [n***** ], whose diameter of 7 exceeds the maximum value of 6 arbitrarily established in the present experiments. Having established that the features included in R are not keyed in a simple way to their relative frequencies in English, we also observed that no simple relation exists between word frequency and the per-word representational cost, that is, the minimal feature set size established during
Table 4: Correlation Between Word Frequency in English and Minimal Feature Set Size After Learning Was Only 0.1.

Word       Relative Frequency    Minimal Feature Set Size^a
resting    100                   25
leading    315                   24
buffalo    188                   8
unhappy    144                   7
thereat    2                     25
fording    1                     23
bedknob    1                     7
jackdaw    1                     6

a. Number of n-grams in R that feed the word's conjunction, that is, which must be present in order for the word to be detected. The table shows examples of common and uncommon words with large and small minimal feature sets.
learning. In fact, if attention is restricted to the set of all words of a given length, with nominally equivalent representational requirements, the correlation between the relative frequency of a word in English and its post-learning minimal feature set size was only 0.1. Thus, some common words require a conjunction of many n-grams in R while others require few, and some rare words require many n-grams while others require few (see Table 4).

5.1 Sensitivity of the Learned Representation to Object Category Structure. Words are unlike common objects in that every word forms its own
category, which must be distinguished from all other words. In text world, therefore, the critical issue of generalization to novel class exemplars cannot be directly studied. To address this shortcoming partially and assess the impact of broader category structures on the composition of the n-gram representation, we modified the learning algorithm so that a word was considered to be hallucinated only if no similar—rather than identical—word was present in the input when the target word was declared recognized. In this case, similar was defined as different up to any single letter replacement—for example, dog = {fog, dig, . . .}, but dog ≠ {fig, flog, dogs, . . .}. The resulting category structure imposed on words was thus a heavily overlapping one in which most words were similar to several other words.4

4 This learning procedure, combined with the assumption that every word is represented by only a simple conjunction of features in R (disjunctions of multiple prototypes per word were not supported), cannot guarantee that every word detector is activated
Minimizing Binding Errors
[Figure 5 is a bar chart, "Decrease in Representational Costs for Broader, Overlapping Object Categories": number of n-grams (0–1400) by n-gram order (2–5), for tolerance = 0 versus tolerance = 1.]

Figure 5: Breakdown of representations produced by two learning runs differing in breadth of category structure. Fifty-character input arrays were used as before to train to an asymptotic 1% error criterion, this time beginning with a 200-element bootstrap representation. The run with tolerance = 0 was as in Figure 4, where a hallucination was scored whenever a word's minimal feature set was activated but the word was not identically present in the input. In the run with tolerance = 1, a hallucination was not scored if any similar word was present in the input, up to a single character replacement. The latter run, with its more lenient category structure, led to a representation that was significantly smaller, since it was not required to discriminate as frequently among highly similar objects.
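For concreteness, the tolerance = 1 criterion ("different up to any single letter replacement") amounts to a same-length Hamming-distance test. A minimal sketch, with the function name our own:

```python
def similar(a: str, b: str, tolerance: int = 1) -> bool:
    """True if the two words have equal length and differ in at most
    `tolerance` letter positions (single-letter replacements)."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= tolerance

# Examples from the text: dog = {fog, dig, ...}, but dog is not similar
# to words that require an insertion or two replacements (fig, flog, dogs).
```

Insertions and deletions change word length and therefore never count as similar under this reading; only substitutions do.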
The main result of using this more tolerant similarity metric during learning is that fewer word hallucinations are encountered per input image, leading to a final representation that is smaller by more than half (see Figure 5). In short, the demands on a conjunctive n-gram representation are lessened when the object category structure is broadened, since fewer distinctions among objects must be made.

5.2 Scaling of Learned Representation with Input Clutter. As is confirmed by the results of Figure 3, one of the most pernicious threats to a spatially invariant representation is that of clutter, defined here as the quantity of input material that must be simultaneously processed by the representa-

4 (continued) in response to any of the legal variants of the word. Rather, the procedure allows a word detector to be activated by any of the legal variants without penalty.
Table 5: Measured Average Values of c, d, and w and Calculated Values of r in a Series of Three Learning Runs with Input Clutter Load Increased from 25 to 75 Characters, Trained to a Fixed Error Criterion of 3%.

Input Width     c      d      w      r
     25        55    816    7.6   1.45
     50       152   1768   10.7   1.85
     75       253   2634   12.8   2.11

Note: The 50-element bootstrap representation was used to initialize all three runs. Equation 4.3 was used to find the value of r for each run such that, with w → w/r, the measured values of c and d, and N = 44,414, a 3% error rate was predicted.
tion. The difficulty may be easily seen in equation 4.3, where an unopposed increase in the value of c leads to a precipitous drop in recognition performance. On the other hand, equation 4.3 also indicates that increases in c can be counteracted by appropriate increases in d and w, which conserve the value of (c/d)^w. To examine this issue, we systematically increased the clutter load in a series of three learning runs from 25 to 75 characters, training always to a fixed 3% error criterion, and recorded for each run the postlearning values of c, d, and w. Although the WSJ database contained a wide range of word sizes (from 1 to 20 characters in length), measuring a single value of w averaged over all words was justified since, as shown in Figure 4D, the learning algorithm tended to equalize the minimal feature set sizes for words independent of their lengths. Under the assumption of uniformly probable, statistically independent features in the learned representations, the value of h = (c/d)^w for all three runs should equal ≈ 7 × 10^-7, the value that predicts a 3% error rate for a 44,414-object database using equation 4.3. However, corrections (w → w/r) were needed to factor out unequal amounts of statistical redundancy in representations of various sizes, where larger representations contained more redundancy. The measured values of c, d, and w and the empirically determined r values are shown for each run in Table 5. As in the runs of Figure 3, redundancy values were again systematically greater for the larger representations generated, and the redundancy values calculated here for the learned representation were again significantly smaller than those seen for the fully enumerated 2-gram representation.

6 Discussion
Using a simple analytical model as a taking-off point and a variety of simulations in text world, we have shown that low-ambiguity perception in un-
segmented multiobject scenes can occur whenever the probability of fully activating any single object's minimal feature set, by a randomly drawn scene, is kept sufficiently low. One prescription for achieving this condition is straightforward: conjunctive features are mined from hallucinated objects until false-positive recognition errors are suppressed to criterion. Even in a complex domain containing a large number of highly similar object categories and severe background clutter, this prescription can produce a remarkably compact representation. This confirms that brute-force enumeration of large families or, worse, all high-order conjunctive features—the combinatorial explosion recognized by Malsburg—is not inevitable. The analysis and experiments reported here have provided several insights into the properties of feature sets that support hallucination-proof recognition and into learning procedures capable of building these feature sets:

1. Efficient feature sets are likely to (1) contain features that span the range from very simple to very complex, (2) contain relatively many simple features and relatively few complex features, (3) emphasize features that are only moderately common (giving a representation that is neither sparse nor dense), in response to the conflicting constraints that features should appear frequently in objects but not in backgrounds also composed of objects, and (4) in spatial domains, emphasize features that encode the relations of parts to object boundaries.

2. An efficient learning algorithm works to drive toward zero, and therefore to equalize, the false-positive recognition rates for all objects considered individually. Thus, frequently hallucinated objects—objects with few parts or common internal structure, or both—demand the most attention during learning.
Two consequences of this focus of effort on frequently hallucinated objects are that (1) the average value of w, the size of the minimal feature set required to activate an object, becomes nearly independent of the number of parts contained in the object, so that simpler objects are relatively more intensively represented than complex objects, and (2) among objects of the same part complexity, the minimal conjunctive feature sets grow largest for objects containing the most common substructures, though these are not necessarily the most common objects. A curious implication of the pressure to heavily represent objects that are frequently hallucinated is that the backgrounds in which objects are typically embedded can strongly influence the composition of the optimal feature set for recognition.

3. The demands on a visual representation are heavily dependent on the object category structure imposed by the task. Where object classes are large and diffuse, the required representation is smaller and weighted
to features of lower conjunctive order, whereas for a category structure like that of words, in which every object forms its own category that must often be distinguished from a large number of highly similar objects (e.g., cat from cats, cut, rat), the representation must be larger and depend more heavily on features of higher conjunctive order.

6.1 Further Implications of the Analytical Model. Equation 4.3 provides an explicit mathematical relation among several quantities relating to multiple-object recognition. Since the equation assumes statistical independence and uniform activation probability of the features contained in the representation, neither of which is a valid assumption in most practical situations, we were unable to predict recognition errors using measured values for d, c, and w. However, we found that error rates in each case examined could be fitted using a small correction factor for statistical redundancy, which uniformly scaled d, c, and w and whose magnitude grew systematically for larger collections of features activated by larger—more redundant—units of text. From this we conclude that equation 4.3 at least crudely captures the quantitative trade-offs governing recognition performance in receptive-field-based visual systems. One of the pithier features of equation 4.3 is that it provides a criterion for good recognition performance: (c/d)^w ≪ 1/N. The exponential dependence on the minimal feature set size, w, means that modest increases in w can counter a large increase in the number of object categories, N, or unfavorable increases in the activation density, c/d, which could be due either to an increase in the clutter load or to a collapse in the size of the representation.
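This criterion can be checked numerically against the measured values in Table 5. The following sketch assumes equation 4.3 takes the per-scene form 1 - (1 - h)^N with h = (c/d)^(w/r), using the redundancy-corrected exponent w/r:

```python
# (input width, c, d, w, r) as measured in Table 5
runs = [
    (25,  55,  816,  7.6, 1.45),
    (50, 152, 1768, 10.7, 1.85),
    (75, 253, 2634, 12.8, 2.11),
]
N = 44414  # number of words in the database

hs, errors = [], []
for _, c, d, w, r in runs:
    h = (c / d) ** (w / r)          # chance a random scene covers one object's feature set
    hs.append(h)
    errors.append(1.0 - (1.0 - h) ** N)  # chance it covers at least one of N objects
```

All three runs give h close to 7 × 10^-7 and a predicted error rate near the 3% training criterion, consistent with the values quoted in the text.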
For example, using roughly the postlearning values of c/d ≈ 0.07 and r ≈ 2 from the run of Figure 4, we find that increasing w from 10 to 12 can compensate for a more than 10-fold increase in N, or a nearly 60% increase in the c/d ratio, while maintaining the same recognition error rate. The first-order expansion of equation 4.3 also states that recognition errors grow as the wth power of c/d, where w is generally much larger than 1. A visual system constructed to minimize hallucinations therefore abhors uncompensated increases in clutter or reductions in the size of the representation. These effects are exactly those that have motivated the various compensatory strategies discussed in section 1. The abhorrence of clutter generates strong pressure to invoke any readily available segmentation strategies that limit processing to one or a small number of objects at a time. For example, cutting out just half the visual material in the input array when w/r = 5, as in the above example, knocks down the expected error rate by a factor of 32. The pressure to reduce the number of objects simultaneously presented to the visual system could account in part for the presence in biological visual systems of (1) a fovea, which leads to a huge overrepresentation of the center of fixation and marginalization of the surround, (2) covert attentional mechanisms, which selectively up- or down-modulate sensitivity to different portions of the visual field, and
(3) dynamical processes that segment "good" figures (in the Gestalt sense) from background. The second pressure involved in maintaining a low c/d ratio involves maintaining a visual representation of adequate size, as measured by d. One situation in which this is particularly difficult arises when the task mandates that objects be distinguished under excessively wide ranges of spatial variation, an invariance load that must be borne in turn by each of the individual feature detectors in the representation. For example, consider a relatively simple task in which the orientation of image features is a valid cue to object identity, such as a task in which objects usually appear in a standard orientation. In this case a bank of several orientation-specific variants of each feature can be separately maintained within the representation. In contrast, in a more difficult task that requires that objects be recognized in any orientation, each bank of orientation-specific detectors must be pooled together to create a single, orientation-invariant detector for that conjunctive feature. The inclusion of this invariance in the task definition can thus lead to an order-of-magnitude reduction in the size of the representation, which, unopposed, could produce a catastrophic breakdown of recognition according to equation 4.3. As a rule of thumb, therefore, the inclusion of an additional order-of-magnitude invariance must be countered by an order-of-magnitude increase in the number of distinct feature conjunctions included in the representation, drawn from the reservoir of higher-order features contained in the objects in question. The main cost of this approach lies in the additional hardware, requiring components that are more numerous, expensive, prone to failure, and used at a far lower rate.

6.2 Space, Shape, Invariance, and the Binding Problem.
While one emphasis of this work has been the issue of spatial invariance in a visual representation, we reiterate that the mathematical relation expressed by equation 4.3 knows nothing of space or shape or invariance, but rather views objects and scenes as collections of statistically independent features. What, then, is the role of space? From the point of view of predicting recognition performance, and with regard to the quantitative trade-offs discussed here, we can find no special status for worlds in which spatial relations among parts help to distinguish objects, since spatial relations may be viewed simply as additional sources of information regarding object identity. (Not all worlds have this property; an example is olfactory identification.) Further, the issue of spatial invariance enters the mathematical relation of equation 4.3 only indirectly, in that it specifies the degree of spatial pooling needed to construct the invariant features included in the representation; once again, the details relating to the internal construction of these features are not visible to the analysis. In this sense, the spatial binding problem, which appears at first to derive from an underspecification of the spatial relations needed to glue an object together, in fact reduces to the more mundane problem of ambiguity arising
from overstimulation of the visual representation by input images—that is, a c/d ratio that is simply too large. The hallucination of objects based on parts contained in background clutter can thus occur in any world, even one in which there is no notion of space (e.g., matching bowls of alphabet soup). However, the fact that object recognition in real-world situations typically depends on spatial relations between parts, and the fact that recognition invariances are also very often spatial in nature, together strongly influence the way in which a visual representation based on spatially invariant receptive fields can be most efficiently constructed. In particular, many economies of design can be realized through the use of hierarchy, as exemplified by the seminal architecture of Fukushima et al. (1983). The key observation is that a population of spatially invariant, higher-order conjunctive features generally shares many underlying processing operations. The present analysis is, however, mute on this issue.

6.3 Relations to "Real" Vision. The quantitative relations given by equation 4.3 cast the problem of ambiguous perception in cluttered scenes in the simple terms of set intersection probabilities: a cluttered scene activates c of the d feature detectors at random and is correctly perceived with high probability as long as the set of c activated features only rarely includes all w features associated with any one of the N known objects. The analysis is thus heavily abstracted from the details of the visual recognition problem, most obviously ignoring the particular selectivities and invariances that parameterize the detectors contained in the representation.
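This set-intersection reading is easy to simulate. The sketch below uses small illustrative values for d, c, w, and N (chosen by us for speed; they are not the values from the text world experiments) and compares the observed hallucination rate with the N(c/d)^w estimate:

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, w, N = 1000, 70, 4, 2000  # illustrative sizes, not the paper's values

# Each "object" is assigned a random minimal feature set of w of the d features.
objects = np.array([rng.choice(d, size=w, replace=False) for _ in range(N)])

trials = 1000
hallucinations = 0
for _ in range(trials):
    active = np.zeros(d, dtype=bool)
    active[rng.choice(d, size=c, replace=False)] = True  # a random scene activates c detectors
    # A hallucination occurs if the scene covers all w features of some object.
    if active[objects].all(axis=1).any():
        hallucinations += 1

observed = hallucinations / trials
predicted = N * (c / d) ** w  # equation-4.3-style estimate of errors per scene, ~0.048
```

With these numbers the observed rate should land close to the independence estimate of about 0.05, illustrating how error rates hinge on the (c/d)^w intersection probability.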
On the other hand, the analysis makes it possible to understand in a straightforward way how the simultaneous challenges of clutter and invariance conspire to create binding ambiguity, and the degree to which compensatory increases in the size of the representation, through conjunctive order boosting, can be used to suppress this ambiguity. These basic trade-offs have operated close to their predicted levels in the text world experiments carried out in this work, in spite of violations of the simplifying assumptions of feature independence and equiprobability. Since the application of our analysis to any particular recognition problem involves estimating the domain-specific quantities c, d, w, and N, we do not expect predictions relating to levels of binding ambiguity or recognition performance to carry over directly from the problem of word recognition in blocks of symbolic text to other visual recognition domains, such as viewpoint-independent recognition of real three-dimensional objects embedded in cluttered scenes. On the other hand, we expect the basic representational trade-offs to persist. The most important way in which recognition in more realistic visual domains differs from recognition in text world lies in the invariance load, which in most interesting cases extends well beyond simple translation invariance. Basic-level object naming in humans, for example, is largely invariant to
translation, rotation, scale, and various forms of distortion, occlusion, and degradation—invariances that persist even for certain classes of nonsense objects (Biederman, 1995). Following the logic of our probabilistic analysis, the daunting array of invariances that must be maintained in this and many other natural visual tasks would seem to present a huge challenge to biological visual systems. In keeping with this, performance limitations in human perception suggest that our visual systems have responded in predictable ways to the pressures of the various tasks with which they are confronted; in particular, recognition in a variety of visual domains appears to include only those invariances that are absolutely necessary and those for which simple compensatory strategies are not available. For example, (1) recognition is restricted to a highly localized foveal acceptance window for text and other detailed figures, where by "giving up" on translation invariance, this substream of the visual system frees hardware resources for the several other essential invariances that cannot be so easily compensated for (e.g., size, font, kerning, orientation, distortion); (2) face recognition operates under strong restrictions on the image-plane orientation and direction of lighting of faces (Yin, 1969; Johnston, Hill, & Carman, 1992), a reasonable compromise given that faces almost always present themselves to our visual systems upright, with positive contrast and lighting from above; and (3) unlike the case for common objects, reliable discrimination of complex three-dimensional nonsense objects (e.g., bent paper clips, crumpled pieces of paper) is restricted to a very limited range of three-dimensional orientations in the vicinity of familiar object views (Biederman & Gerhardstein, 1995), though this deficiency can be overcome in some cases by extensive training (Logothetis & Pauls, 1995).
When viewed within our framework, the exceptional difficulty of this latter task arises from the need for, or more precisely the lack of, the very large number of very-high-order conjunctive features—essentially full object views locked to specific orientations (see Figure 1B)—that are necessary to support reliable viewpoint-invariant discrimination among a large set of objects with highly overlapping part structures. In summary, the primate visual system manifests a variety of domain-specific design compromises and performance limitations, which can be interpreted as attempts to achieve simultaneously the needed perceptual invariances, acceptably low levels of binding ambiguity, and reasonable hardware costs. In considering the astounding overall performance of the human visual system in comparison to the technical state of the art, it is also worth noting that human recognition performance is based on neural representations that could contain tens of thousands of task-appropriate visual features (extrapolated from the monkey; see Tanaka, 1996) spanning a wide range of conjunctive orders and finely tuned invariances. In continuing work, we are exploring further implications of the trade-offs discussed here in relation to the neurobiological underpinnings of spatially invariant conjunctive receptive fields in visual cortex (Mel, Ruderman, & Archie, 1998) and to the development of more efficient supervised
and unsupervised hierarchical learning procedures needed to build high-performance, task-specific visual recognition machines.

Acknowledgments
Thanks to Gary Holt for many helpful comments on this work. This work was funded by the Office of Naval Research.

References

Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive learning networks: Development and applications in the United States of algorithms related to GMDH. In S. Farrow (Ed.), Self-organizing methods in modeling. New York: Marcel Dekker.

Biederman, I. (1995). Visual object recognition. In S. Kosslyn & D. Osherson (Eds.), An invitation to cognitive science (2nd ed., pp. 121–165). Cambridge, MA: MIT Press.

Biederman, I., & Gerhardstein, P. (1995). Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). J. Exp. Psychol. (Human Perception and Performance), 21, 1506–1514.

Califano, A., & Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16, 373–392.

Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.

Douglas, R., & Martin, K. (1998). Neocortex. In G. Shepherd (Ed.), The synaptic organization of the brain (pp. 567–509). Oxford: Oxford University Press.

Edelman, S., & Duvdevani-Bar, S. (1997). A model of visual recognition and categorization. Phil. Trans. R. Soc. Lond. [Biol.], 352, 1191–1202.

Fahlman, S., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.

Fize, D., Boulanouar, K., Ranjeva, J., Fabre-Thorpe, M., & Thorpe, S. (1998). Brain activity during rapid scene categorization—a study using event-related fMRI. J. Cog. Neurosci., suppl. S, 72.

Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man & Cybernetics, SMC-13, 826–834.

Gilbert, C. (1983). Microcircuitry of the visual cortex. Ann. Rev. Neurosci., 6, 217–247.

Heller, J., Hertz, J., Kjær, T., & Richmond, B. (1995).
Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.

Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243.

Hummel, J., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psych. Rev., 99, 480–517.

Johnston, A., Hill, H., & Carman, N. (1992). Recognizing faces: Effects of lighting direction, inversion, and brightness reversal. Perception, 21, 365–375.
Jones, E. (1981). Anatomy of cerebral cortex: Columnar input-output relations. In F. Schmitt, F. Worden, G. Adelman, & S. Dennis (Eds.), The organization of cerebral cortex. Cambridge, MA: MIT Press.

Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.

Lades, M., Vorbruggen, J., Buhmann, J., Lange, J., Malsburg, C., Wurtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.

Lang, G. K., & Seitz, P. (1997). Robust classification of arbitrary object classes based on hierarchical spatial feature-matching. Machine Vision and Applications, 10, 123–135.

Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., & Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. Los Alamitos, CA: IEEE Computer Society Press.

Logothetis, N., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 5, 270–288.

Logothetis, N., & Sheinberg, D. (1996). Visual object recognition. Ann. Rev. Neurosci., 19, 577–621.

Malsburg, C. (1994). The correlation theory of brain function (reprint from 1981). In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 95–119). Berlin: Springer.

McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception. Psych. Rev., 88, 375–407.

Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9, 777–804.

Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. J. Neurosci., 17, 4325–4334.

Mozer, M. (1991).
The perception of multiple objects. Cambridge, MA: MIT Press.

Oram, M., & Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68(1), 70–84.

Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972.

Pitts, W., & McCulloch, W. (1947). How we know universals: The perception of auditory and visual forms. Bull. Math. Biophys., 9, 127–147.

Potter, M. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Human Learning and Memory, 2, 509–522.

Sandon, P., & Uhr, L. (1988). An adaptive model for viewpoint-invariant object recognition. In Proc. of the 10th Ann. Conf. of the Cog. Sci. Soc. (pp. 209–215). Hillsdale, NJ: Erlbaum.

Schiele, B., & Crowley, J. (1996). Probabilistic object recognition using multidimensional receptive field histograms. In Proc. of the 13th Int. Conf. on Patt. Rec. (Vol. 2, pp. 50–54). Los Alamitos, CA: IEEE Computer Society Press.

Swain, M., & Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7, 11–32.
Szentagothai, J. (1977). The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. Lond. B, 201, 219–248.

Tanaka, K. (1996). Inferotemporal cortex and object vision. Ann. Rev. Neurosci., 19, 109–139.

Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.

Van Essen, D. (1985). Functional organization of primate visual cortex. In A. Peters & E. Jones (Eds.), Cerebral cortex (pp. 259–329). New York: Plenum Publishing.

Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Prog. Neurobiol., 51, 167–194.

Weng, J., Ahuja, N., & Huang, T. S. (1997). Learning recognition and segmentation using the cresceptron. Int. J. Comp. Vis., 25(2), 109–143.

Wickelgren, W. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev., 76, 1–15.

Yin, R. (1969). Looking at upside-down faces. J. Exp. Psychol., 81, 141–145.

Zemel, R., Mozer, M., & Hinton, G. (1990). TRAFFIC: Recognizing objects using hierarchical reference frame transformations. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 266–273). San Mateo, CA: Morgan Kaufmann.

Received September 29, 1998; accepted March 5, 1999.
NOTE
Communicated by Terence Sanger
Relationship Between Phase and Energy Methods for Disparity Computation

Ning Qian
Center for Neurobiology and Behavior, Columbia University, New York, NY 10032, U.S.A.

Samuel Mikaelian
Center for Neural Science, New York University, New York, NY 10003, U.S.A.

The phase and energy methods for computing binocular disparity maps from stereograms are motivated differently, have different physiological relevances, and involve different computational steps. Nevertheless, we demonstrate that at the final stages, where disparity values are made explicit, the simplest versions of the two methods are exactly equivalent. The equivalence also holds when the quadrature-pair construction in the energy method is replaced with a more physiologically plausible phase-averaging step. The equivalence fails, however, when the phase-difference receptive field model is replaced by the position-shift model. Additionally, intermediate results from the two methods are always quite distinct. In particular, the energy method generates a distributed disparity representation similar to that found in the visual cortex, while the phase method does not. Finally, more elaborate versions of the two methods are in general not equivalent. We also briefly compare these two methods with some other stereo models in the literature.

1 Introduction
Binocular disparity is defined as the positional difference between the two retinal projections of a given point in space. It is well known that the horizontal component of disparity provides the sensory cue for stereoscopic depth perception. Many computational models for disparity estimation have been proposed. Here we compare two computational models that appear to have the strongest biological relevance: the phase method and the energy method. The phase method (Sanger, 1988; Fleet, Jepson, & Jenkin, 1991) for disparity computation is based on the mathematical result that the displacement of a function generates a proportional phase shift in its complex Fourier transform. The binocular disparity at each location is therefore proportional to the difference of the Fourier phases of the corresponding left and right

Neural Computation 12, 279–292 (2000)
© 2000 Massachusetts Institute of Technology
image patches. Sanger (1988) used sine and cosine Gabor filters to estimate the local Fourier phases of both the left and right images, and then calculated the Fourier phase difference at each location to find the disparity. The algorithm has been tested on both random dot stereograms and natural images (Sanger, 1988). The energy method for disparity computation (Qian, 1994, 1997; Zhu & Qian, 1996; Qian & Zhu, 1997; Qian & Andersen, 1997; Fleet, Wagner, & Heeger, 1996) was derived from the energy model of binocular cell responses in the cat's primary visual cortex (Ohzawa, DeAngelis, & Freeman, 1990, 1996, 1997; Freeman & Ohzawa, 1990; DeAngelis, Ohzawa, & Freeman, 1991). Through quantitative physiological experiments, Freeman and coworkers found that a typical binocular simple cell can be described by two Gabor functions—one for its left and the other for its right receptive field. There is a relative phase difference (or positional shift, or both) between the two receptive fields that generates disparity sensitivity. The activities of such simple cells were determined by first convolving the left and right images with the left and right receptive fields, respectively, and then summing the two contributions. Freeman et al. further found that the responses of binocular complex cells in the cat's primary visual cortex can be well modeled by summing the squared outputs of a quadrature pair of simple cells. We analyzed this so-called energy model for binocular complex cells and found that a population of such complex cells can be used to extract stimulus disparity (see Qian, 1997, for a review). The resulting energy method for disparity computation has been tested on random dot stereograms (Qian, 1994; Qian & Zhu, 1997). These two methods are different in several ways.
First, although both methods use Gabor filters at the front end, the phase method applies two monocular Gabor filters separately and delays binocular comparison until the last step, when the Fourier phases of the left and right image patches are subtracted, while the energy method uses a set of binocular filters that combine contributions from the two eyes at the first step. In addition, the phase method explicitly computes and represents the complex Fourier phases of the left and right image patches, while the energy method uses the quadrature-pair construction (or the equivalent phase averaging; see section 2) at the complex cell stage to remove the simple cells' dependence on image Fourier phases. Furthermore, complex cell responses in the energy method form a distributed representation of disparity; such a representation is absent in the phase method. Despite these differences, we show that the simplest versions of the two methods are exactly equivalent at the final steps, where disparity values are made explicit. The intermediate steps, however, differ in the two methods and have different physiological implications. In addition, the two methods can be elaborated in different ways, resulting in nonequivalent algorithms.
Phase and Energy Methods for Disparity Computation
2 Results
Before we start, a few potential confusions in terminology should be clarified. First, the phase method should not be confused with the phase-difference (also called phase-parameter or phase-shift) receptive field model. The former specifies a method for disparity estimation from stereograms, while the latter is a model for describing receptive field profiles of binocular simple cells often used in the energy method. Second, the phase parameters in the phase-difference receptive field model (φ_l and φ_r in equations 2.1 and 2.2) should not be confused with the image Fourier phases. The former determine the positions of the excitatory-inhibitory bands relative to the receptive field center, while the latter are the phases of the complex Fourier transforms of retinal images. Finally, mathematically complex quantities, with real and imaginary components, should not be confused with complex cell properties.

We now demonstrate the exact equivalence of the simplest versions of the phase and the energy methods. Elaborations of the methods such as frequency pooling (for both methods), confidence measure (for the phase method), and spatial pooling (for the energy method) are ignored in this section (but see section 3). We start by reformulating the energy method (Qian, 1994), converting it into a form identical to that for the phase method (Sanger, 1988). Consider a binocular simple cell centered at x = 0, with the left and right receptive fields given by the following Gabor functions (Marcelja, 1980; Daugman, 1985; McLean & Palmer, 1989; Ohzawa et al., 1990):

$$f_l(x) = g(x)\cos(\omega x + \phi_l), \tag{2.1}$$
$$f_r(x) = g(x)\cos(\omega x + \phi_r), \tag{2.2}$$

with

$$g(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{x^2}{2\sigma^2}\right), \tag{2.3}$$
where ω is the preferred (angular) horizontal spatial frequency of the cell, σ is the gaussian width determining the receptive field size, and φ_l and φ_r are the left and right phase parameters. Its response to a stereo image pair I_l(x) and I_r(x) is given by:

$$r_s = \int_{-\infty}^{\infty} \left[ f_l(x)\, I_l(x) + f_r(x)\, I_r(x) \right] dx. \tag{2.4}$$
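Equations 2.1 through 2.4 can be evaluated directly by numerical integration. The following sketch is ours, not the authors'; the grating stimulus, grid, and all parameter values are hypothetical choices made only for illustration:

```python
import numpy as np

def gabor(x, omega, phi, sigma):
    # Gabor receptive field g(x) cos(omega*x + phi), equations 2.1-2.3
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return g * np.cos(omega * x + phi)

def simple_cell_response(Il, Ir, x, omega, phi_l, phi_r, sigma):
    # binocular simple cell of equation 2.4: integral of f_l*I_l + f_r*I_r
    dx = x[1] - x[0]
    fl = gabor(x, omega, phi_l, sigma)
    fr = gabor(x, omega, phi_r, sigma)
    return np.sum(fl * Il + fr * Ir) * dx

# hypothetical stimulus: a grating viewed with a disparity of 0.5 between the eyes
x = np.linspace(-10, 10, 4001)
Il = np.cos(x)
Ir = np.cos(x + 0.5)
rs = simple_cell_response(Il, Ir, x, omega=1.0, phi_l=0.0, phi_r=0.0, sigma=2.0)
```

Because the left and right Gabor parameters match the stimulus frequency here, r_s comes out positive; varying φ_l and φ_r traces out the cell's binocular tuning.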
(For a layer of topographically arranged cells of the same type, the convolution operation between images and receptive fields should be used instead.) The simple cell with a quadrature phase relationship to the above cell has
receptive fields of the same form but with the cosine functions replaced by sines (Pollen, 1981; Adelson & Bergen, 1985; Watson & Ahumada, 1985; Ohzawa et al., 1990; Qian, 1994). It is easy to see that the responses of these two simple cells are the real and imaginary parts of

$$R_+ \equiv \int_{-\infty}^{\infty} g(x)\, e^{i\omega x} \left[ e^{i\phi_l} I_l(x) + e^{i\phi_r} I_r(x) \right] dx. \tag{2.5}$$

According to the quadrature-pair construction, the complex cell response is given by the sum of the squared responses of these two simple cells (or, equivalently, of four half-wave rectified simple cells; Ohzawa et al., 1990), and can thus be written as

$$r_q = |R_+|^2 = \left| \int_{-\infty}^{\infty} g(x)\, e^{i\omega x} \left[ I_l(x) + e^{i\phi_-} I_r(x) \right] dx \right|^2, \tag{2.6}$$
where

$$\phi_- \equiv \phi_r - \phi_l \tag{2.7}$$

is the phase parameter difference. Therefore, complex cell responses depend only on φ_−, not on φ_r and φ_l individually. For a given binocular stimulus, complex cells with different φ_− will give different responses. These cells at a given spatial location form a distributed representation of the stimulus disparity at that location. According to the energy method (Qian, 1994), the stimulus disparity can be explicitly estimated from the distribution as:

$$D = \frac{\hat{\phi}_-}{\omega}, \tag{2.8}$$
where φ̂_− is the phase difference φ_− that maximizes the complex cell response r_q. Let

$$\int_{-\infty}^{\infty} g(x)\, I_l(x)\, e^{i\omega x}\, dx \equiv \hat{A}_l(\omega), \tag{2.9}$$
$$\int_{-\infty}^{\infty} g(x)\, I_r(x)\, e^{i\omega x}\, dx \equiv \hat{A}_r(\omega), \tag{2.10}$$
so that Â_l(ω) and Â_r(ω) are the Fourier transforms of the left and right image patches under the gaussian envelope g(x), respectively. Equation 2.6 then becomes:

$$r_q = \left| \hat{A}_l(\omega) + e^{i\phi_-} \hat{A}_r(\omega) \right|^2. \tag{2.11}$$
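As a numerical illustration of equations 2.6, 2.8, and 2.11, one can evaluate r_q for a population of model complex cells spanning φ_− and read the disparity off the peak. This sketch is our own; the band-limited stimulus and all parameters are hypothetical:

```python
import numpy as np

def complex_cell_energy(Il, Ir, x, omega, phi_minus, sigma):
    # quadrature-pair energy r_q of equation 2.6
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dx = x[1] - x[0]
    R = np.sum(g * np.exp(1j * omega * x) * (Il + np.exp(1j * phi_minus) * Ir)) * dx
    return np.abs(R) ** 2

# hypothetical band-limited stimulus: gratings near omega = 1, with the
# right image shifted by a true disparity D
x = np.linspace(-20, 20, 8001)
freqs = np.linspace(0.9, 1.1, 5)
image = lambda u: sum(np.cos(f * u) for f in freqs)
D = 0.4
Il, Ir = image(x), image(x + D)

omega, sigma = 1.0, 4.0
phis = np.linspace(-np.pi, np.pi, 721)   # population of cells with different phi_-
rq = np.array([complex_cell_energy(Il, Ir, x, omega, p, sigma) for p in phis])
D_hat = phis[np.argmax(rq)] / omega      # equation 2.8
```

The tuning r_q(φ_−) is approximately cosine shaped with a single peak near ωD, so D_hat comes out close to the true disparity of 0.4.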
To find φ̂_−, note that the maximum of r_q is attained when the two terms inside the norm of the above expression are along the same direction in the
complex plane, that is,

$$\hat{\phi}_- = \arg(\hat{A}_l) - \arg(\hat{A}_r) \tag{2.12}$$
$$= \arctan\left[ \frac{\int_{-\infty}^{\infty} I_l(x)\, g(x) \sin(\omega x)\, dx}{\int_{-\infty}^{\infty} I_l(x)\, g(x) \cos(\omega x)\, dx} \right] - \arctan\left[ \frac{\int_{-\infty}^{\infty} I_r(x)\, g(x) \sin(\omega x)\, dx}{\int_{-\infty}^{\infty} I_r(x)\, g(x) \cos(\omega x)\, dx} \right]. \tag{2.13}$$
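Equation 2.13 can likewise be evaluated with sine and cosine Gabor filters; we use np.arctan2 here so that each phase lands in the correct quadrant. Again a hypothetical sketch with a stimulus and parameters of our own choosing:

```python
import numpy as np

def fourier_phase(I, x, omega, sigma):
    # local Fourier phase under the gaussian envelope, via sine and cosine
    # Gabor filters (the bracketed terms of equation 2.13)
    g = np.exp(-x**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    dx = x[1] - x[0]
    s = np.sum(I * g * np.sin(omega * x)) * dx
    c = np.sum(I * g * np.cos(omega * x)) * dx
    return np.arctan2(s, c)

# hypothetical grating pair: the right image is the left shifted by disparity D
x = np.linspace(-20, 20, 8001)
D, omega, sigma = 0.4, 1.0, 4.0
Il = np.cos(omega * x + 0.3)
Ir = np.cos(omega * (x + D) + 0.3)

phi_hat = fourier_phase(Il, x, omega, sigma) - fourier_phase(Ir, x, omega, sigma)
D_phase = phi_hat / omega   # equations 2.13 and 2.8
```

For this stimulus, D_phase agrees with the true disparity up to discretization error, matching the energy-method estimate as the equivalence predicts.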
Equation 2.13 combined with equation 2.8 is precisely the expression for the phase method proposed by Sanger (1988). The two terms in equation 2.13 are the local complex Fourier phases of the left and right image patches, respectively, estimated through the sine and cosine Gabor filters. This establishes the equivalence between the two methods.

It is interesting to note that φ_− simply compensates for the different Fourier phases of the right and left image patches in equation 2.11. Scaling the relative contrast of the image pair while preserving the contrast polarity will not affect the disparity estimates, because only the relative amplitudes (but not phases) of Â_l and Â_r will be changed by this manipulation (Qian, 1994). When the contrast polarity of one image in the pair is reversed, however, φ_− has to be shifted by π to compensate, resulting in an inverted complex cell tuning curve. This prediction is also contained in equation 2.13 of Qian (1994) and has been verified experimentally (Ohzawa et al., 1990; Cumming & Parker, 1997; Masson, Busettini, & Miles, 1997).

We have previously discussed the similarities and differences between the energy method and the cross-correlator (Qian & Zhu, 1997). With the current formulation, it is easy to see that finding φ̂_−, which maximizes the quadrature-pair response, is equivalent to determining the Fourier phase of the cross-correlation between the left and right image patches under the gaussian envelope g(x). This is because the Fourier transform of the cross-correlation,

$$C_{lr}(x) = \int_{-\infty}^{\infty} g(x')\, I_l(x')\, g(x + x')\, I_r(x + x')\, dx', \tag{2.14}$$
is simply Â_l(ω) Â_r^*(ω), where * denotes complex conjugation. Therefore, the phase of this cross-correlator, or of its normalized version, is just the expression for φ̂_− in equation 2.12. Correlation methods usually use the peak location of C_lr(x) to estimate disparity. If C_lr(x) is sharply peaked at x_o, its Fourier phase is approximately ωx_o. It is in deviations from this approximation that the cross-correlator and the energy method can yield different estimates.

We used the quadrature-pair construction for obtaining complex cell responses above. As we pointed out previously (Qian & Zhu, 1997), a less demanding and physiologically more plausible alternative is to integrate
squared responses of many simple cells, all with the same φ_−, but with

$$\phi_+ \equiv \phi_r + \phi_l \tag{2.15}$$
uniformly distributed in the entire 2π range (see equation 2.19). The energy method with this uniform phase-averaging approach is still equivalent to the phase method, because the phase averaging gives exactly the same complex cell expression as the quadrature-pair construction. To demonstrate, define:

$$R_- \equiv R_+^* = \int_{-\infty}^{\infty} g(x)\, e^{-i\omega x} \left[ e^{-i\phi_l} I_l(x) + e^{-i\phi_r} I_r(x) \right] dx, \tag{2.16}$$
and rewrite the simple cell response as:

$$r_s = \frac{1}{2}\left( R_+ + R_- \right). \tag{2.17}$$
Since r_s is real, we have:

$$r_s^2 = |r_s|^2 = \frac{1}{4}\left( |R_+|^2 + |R_-|^2 + 2\,\mathrm{Re}\,R_+ R_-^* \right) = \frac{1}{2}\left( |R_+|^2 + \mathrm{Re}\,R_+^2 \right). \tag{2.18}$$
According to the phase-averaging approach, the complex cell response is given by the integration:

$$r_a = \frac{1}{2\pi}\int_0^{2\pi} r_s^2\, d\phi_+ = \frac{1}{2\pi}\int_0^{2\pi} \frac{1}{2}\left( |R_+|^2 + \mathrm{Re}\,R_+^2 \right) d\phi_+, \tag{2.19}$$
while φ_− is kept constant. Since |R_+|^2 is not a function of φ_+ (see equation 2.6), the averaging leaves the first term unchanged. The second term integrates to zero because:

$$\mathrm{Re}\,R_+^2 = \mathrm{Re}\left\{ \left[ \int_{-\infty}^{\infty} dx\, g(x)\, e^{i\omega x} \left( e^{-i\phi_-/2} I_l(x) + e^{i\phi_-/2} I_r(x) \right) \right]^2 e^{i\phi_+} \right\} \tag{2.20}$$

and

$$\int_0^{2\pi} e^{i\phi_+}\, d\phi_+ = 0. \tag{2.21}$$
Therefore, the phase-averaging approach generates a complex cell response proportional to that of the quadrature pair in equation 2.6,

$$r_a = \frac{1}{2}|R_+|^2 = \frac{1}{2} r_q, \tag{2.22}$$
where the proportionality constant is immaterial for disparity computation. Intuitively, one can imagine dividing the 2π range for φ_+ into many small
intervals in equation 2.19. Then each pair of intervals differing by π/2 can be considered a quadrature pair. Therefore, the above phase averaging can be viewed as averaging together a continuum of quadrature-pair responses, all with the same φ_−.

Two types of receptive field models for binocular simple cells have been proposed: the position-shift model (Bishop, Henry, & Smith, 1971) and the phase-difference model (Ohzawa et al., 1990; DeAngelis et al., 1991). The latter was used in the above derivations. Real cortical cells may use a combination of both mechanisms for coding disparity (Jacobson, Gaska, & Pollen, 1993; Zhu & Qian, 1996; Fleet et al., 1996; Anzai, Ohzawa, & Freeman, 1997). We showed previously that although there are important differences between them, under reasonable assumptions, both models (or their hybrid) can be used with the energy method for computing disparity maps from stereograms (Zhu & Qian, 1996; Qian & Zhu, 1997). However, the exact equivalence between the phase and the energy methods does not hold when the position-shift receptive field model is used in the energy method. To see this, note that according to the position-shift receptive field model, equation 2.2 should be replaced by

$$f_r(x) = f_l(x + d) = g(x + d)\cos[\omega(x + d) + \phi_l], \tag{2.23}$$
where d represents the amount of positional shift between the left and right receptive field centers. The complex cell responses in equation 2.6 should then be written as:

$$r_q = \left| \int_{-\infty}^{\infty} e^{i\omega x} \left[ g(x)\, I_l(x) + e^{i\omega d} g(x + d)\, I_r(x) \right] dx \right|^2, \tag{2.24}$$

and the stimulus disparity is given by (Zhu & Qian, 1996):

$$D = \hat{d}, \tag{2.25}$$
where d̂ is the positional shift that maximizes the complex cell response. Equation 2.24 can be rewritten as

$$r_q = \left| \hat{A}_l + e^{i\omega d} \hat{A}_r(d) \right|^2, \tag{2.26}$$

with the definitions:

$$\int_{-\infty}^{\infty} g(x)\, I_l(x)\, e^{i\omega x}\, dx \equiv \hat{A}_l, \tag{2.27}$$
$$\int_{-\infty}^{\infty} g(x + d)\, I_r(x)\, e^{i\omega x}\, dx \equiv \hat{A}_r(d). \tag{2.28}$$
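Even without a closed form, d̂ in equation 2.25 can still be obtained numerically by scanning d in equation 2.24. A sketch under hypothetical stimulus parameters of our own choosing:

```python
import numpy as np

def rq_position_shift(Il, Ir, x, omega, d, sigma):
    # complex-cell energy of equation 2.24 for the position-shift model
    dx = x[1] - x[0]
    def g(u):
        return np.exp(-u**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    R = np.sum(np.exp(1j * omega * x)
               * (g(x) * Il + np.exp(1j * omega * d) * g(x + d) * Ir)) * dx
    return np.abs(R) ** 2

# hypothetical grating pair with true disparity D
x = np.linspace(-20, 20, 4001)
omega, sigma, D = 1.0, 4.0, 0.4
Il = np.cos(omega * x + 0.3)
Ir = np.cos(omega * (x + D) + 0.3)

ds = np.linspace(-2.0, 2.0, 801)   # scan the positional shift d
rq = np.array([rq_position_shift(Il, Ir, x, omega, d, sigma) for d in ds])
d_hat = ds[np.argmax(rq)]          # equation 2.25
```

Here the disparity (0.4) is small compared with the receptive field size (σ = 4), so d̂ lands on the true value, consistent with the small-disparity approximation discussed in the text.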
It is obvious that a closed-form expression similar to equation 2.13 cannot be obtained for d̂, because Â_r is also a function of d, and the maximum of
r_q as a function of d may not occur when the two terms inside the norm of equation 2.26 have the same direction on the complex plane. Therefore, the energy method with the position-shift receptive field model is not equivalent to the phase method. However, under the oft-invoked assumption that the disparity is small compared with the receptive field size, the d-dependence of Â_r in equation 2.28 is negligible, and ωd in equation 2.26 plays the same role as φ_− in equation 2.11. Clearly, in this approximation the energy method with the position-shift model corresponds to the phase method in the same way as the energy method with the phase-difference model does.

3 Discussion
We have demonstrated that the simplest versions of the energy and the phase methods for disparity computation are exactly equivalent at the final stages, where the disparity values are made explicit. The equivalence still holds when the quadrature-pair construction in the energy method is replaced by a less demanding phase-averaging procedure. However, when the position-shift type of simple-cell receptive field model is used to replace the phase-difference type in the energy method, the two methods are no longer exactly equivalent. This result is consistent with the fact that both the phase method and the energy method combined with the phase-difference receptive field model predict a relationship between the spatial scale ω and the computed disparity range (the so-called size-disparity correlation), while the methods with the position-shift receptive field model do not make such a prediction (Sanger, 1988; Smallman & MacLeod, 1994; Qian, 1994; Zhu & Qian, 1996; Fleet et al., 1996).

Regardless of the receptive field model used, the intermediate results in the two methods are always very different. Indeed, the energy method contains simple and complex cell stages that simulate binocular interactions observed in simple and complex cortical cells, while none of the steps in the phase method resembles actual binocular cells. More importantly, the complex cells in the energy method form a distributed representation of binocular disparity at each location; such a representation is absent in the phase method. The distributed representation is useful because it could directly guide motor behavior, such as vergence eye movements (Masson et al., 1997), without first generating an explicit disparity map. In fact, the final explicit disparity extraction in both methods does not seem to happen in the brain, since "grandmother cells" for disparity coding have never been recorded from the visual cortex.
While an explicit, grandmother-cell type of representation is more convenient for us to comprehend, the brain appears to rely more on implicit, distributed representations for perception and control.

The simplest versions of the phase and the energy methods can also be elaborated in many ways. For example, both methods can be extended to combine results from different spatial scales (i.e., filters with different ω's).
Obviously, if the same extension is made after the final stages in both methods, they remain equivalent. Any elaborations at intermediate steps, or those that can be applied to only one method, will in general render the two methods nonequivalent. One example is the spatial pooling procedure in the energy method, which greatly improves the algorithm by making complex cell responses more independent of image Fourier phases (Zhu & Qian, 1996; Qian & Zhu, 1997). This procedure cannot be easily incorporated into the phase method, because image Fourier phases are precisely the useful information in the phase method. Similarly, the confidence measure in the phase method for appropriately weighting different spatial scales (Sanger, 1988) cannot be readily extended to the energy method, since the measure is derived from the Fourier amplitudes of the left and right image patches; these monocular amplitudes are not computed in the energy method, which starts with binocular filters. Of course, a different spatial pooling step could be incorporated into the phase method. For example, one could average the Fourier phases estimated from several nearby spatial locations. Similarly, a different confidence measure could be added to the energy method. One possibility would be using the normalized range of the complex cell responses,

$$\frac{r_q(\hat{\phi}_-) - r_q(\hat{\phi}_- + \pi)}{r_q(\hat{\phi}_-) + r_q(\hat{\phi}_- + \pi)} = \frac{2\,\mathrm{Re}\left( e^{-i\hat{\phi}_-}\, \hat{A}_l\, \hat{A}_r^* \right)}{|\hat{A}_l|^2 + |\hat{A}_r|^2} = \frac{2\,|\hat{A}_l|\,|\hat{A}_r|}{|\hat{A}_l|^2 + |\hat{A}_r|^2}, \tag{3.1}$$
as the confidence measure, because a larger modulation would allow a more reliable estimation of φ̂_−.

In conclusion, the equivalence between the simplest versions of the phase and energy methods demonstrated in this article indicates that the two methods can be viewed as different implementations of the same underlying mathematical principle. The differences in implementation, however, allow the methods to be elaborated, interpreted, and used in different ways by subsequent processes. Our results are reminiscent of the relationship between various motion models that have been documented in the literature (Adelson & Bergen, 1985; van Santen & Sperling, 1985; Borst & Egelhaaf, 1989; Simoncelli & Adelson, 1991; Heeger & Simoncelli, 1992).

3.1 Other Methods. One of the reviewers pointed out that disparity can also be computed by calculating the magnitude of the bandpass-filtered sum of the left and right image patches. This magnitude is equivalent to setting φ_− to zero in equation 2.6 (or setting d to zero in equation 2.24) and dropping the immaterial squaring outside the norm. For a filter well tuned to horizontal frequency ω, and for left and right image patches given by I_l = I(x) and I_r = I(x + D), this magnitude is approximately proportional to |cos(ωD/2)| |Ĩ(ω)|, where Ĩ(ω) is the Fourier transform of I(x). If the image spectrum |Ĩ(ω)| does not vary much with ω, the disparity D can be estimated by using two or more filters with different
ω. We will refer to this algorithm as the frequency method, because it applies identical filters to the left and right images and consequently has to rely on more than one frequency band for disparity computation. In essence, the frequency method fixes φ_− (or d) at zero and varies ω, while the energy method fixes ω and varies φ_− (or d). Obviously the frequency method is not equivalent to the energy (or the phase) method, because the former cannot compute disparity within a single frequency band, while the latter can. Existing psychophysical evidence argues against the frequency method in its pure form: human subjects do perceive stereoscopic depth in band-limited stereograms (Julesz, 1971). Psychophysical observations also indicate that different frequency channels interact with each other in human stereo vision (Wilson, Blake, & Halpern, 1991; Rohaly & Wilson, 1993, 1994; Smallman, 1995; Mallot, Gillner, & Arndt, 1996). It is thus possible that the brain may combine the frequency and the energy methods by using cells with different ω and different φ_− (or d) simultaneously in disparity computation. Alternatively, the brain could first apply the energy method to obtain one disparity estimate from each frequency channel and then pool the estimates across channels at a later stage (Sanger, 1988; Qian & Zhu, 1997). Further studies are needed to differentiate these two possibilities.

In addition to the phase and the energy methods, many other stereo models have been proposed over the years. Similar to the phase method, which first estimates the Fourier phases of the left and right image patches separately, most of the other models also start with a monocular detection of some matching primitives. For example, several studies (Marr & Poggio, 1976; Pollard, Mayhew, & Frisby, 1985; Prazdny, 1985; Qian & Sejnowski, 1989; Marshall, Kalarickal, & Graves, 1996) let individual dots in random-dot stereograms be the matching primitives.
Marr and Poggio (1979) later used zero crossings at different spatial scales as the primitives. For convenience, we classify these methods as initially monocular. In contrast, the energy method is initially binocular, because it starts with binocular filtering without performing any monocular feature detection. In this aspect it is similar to the cepstral method (Yeshurun & Schwartz, 1989) and the frequency method discussed above. The cepstral and the energy methods are very different in other aspects, however. For instance, the former contains a complex logarithmic operation and complex Fourier transforms, while the latter does not. (Note that the Fourier transform used in this article and elsewhere in connection with the energy method is only for mathematically analyzing the method; it is not a step in the method.) In addition, the cepstral method does not use binocular receptive fields from the physiological experiments.

The initially monocular algorithms generally contain a subsequent step that matches the monocular primitives to solve the correspondence problem. Most methods (Marr & Poggio, 1976, 1979; Pollard et al., 1985; Prazdny, 1985; Qian & Sejnowski, 1989; Marshall et al., 1996) do so by introducing explicit constraints or rules that determine correct matches between the left and right primitives among all possible matches within a certain disparity
range. These methods will be referred to as the explicit methods. In contrast, the phase method is implicit (Sanger, 1988), because it finds disparity by simply subtracting the two monocular phases without performing any explicit matching. (If the monocular Fourier amplitudes are used to calculate a confidence weighting factor [Sanger, 1988], then the matching is slightly more explicit.) The initially binocular algorithms defined above are necessarily implicit, since there are no monocular primitives for explicit matching in the first place. An advantage of the implicit methods is that they can have subpixel disparity resolution without the burden of sorting through a huge number of potential matches (Sanger, 1988; Qian & Zhu, 1997). As we discussed elsewhere (Qian, 1997), the implicit methods are also more physiologically plausible. To summarize, most stereo models are initially monocular for feature extraction and are then explicit in binocular matching. The cepstral, the energy, and the frequency methods, on the other hand, are initially binocular and implicit. The phase method is somewhere in between: it is initially monocular but implicit.

Obviously, there are many other ways of classifying stereo models. For example, some models measure and rely on disparity gradients, while others do not; some implement the uniqueness constraint through inhibitory interactions, while others emphasize facilitation and allow multiple matches of monocular primitives. The current formulation of the energy method assumes that the disparity is constant over the small image patches covered by the receptive fields (Qian, 1994; Qian & Zhu, 1997). Because of this assumption, the performance tends to degrade when there are large disparity gradients in the images, such as across disparity boundaries. However, the energy method can be extended to encompass disparity gradients.
Specifically, it can be shown that a disparity gradient of D′ (= dD(x)/dx) can be best measured by binocular cells whose left and right preferred spatial frequencies (ω_l and ω_r) are related by ω_r = (1 + D′)ω_l. Therefore, the maximum measurable disparity gradient depends on the differences between ω_l and ω_r among real binocular cells. If ω_l and ω_r are equal, then the cells' frequency tuning widths have to be at least (1 + D′)ω_l in order to detect D′ (Sanger, 1988). This extension of the energy method can be viewed as an implicit counterpart of the gradient limit rule employed explicitly by Pollard et al. (1985). It is also worth pointing out that the energy method in its current form (Qian, 1994; Qian & Zhu, 1997) already contains an implicit implementation of the uniqueness constraint (used by many explicit models in various forms). Because of the approximately cosine tuning behavior, there is only one maximum of the complex cell response as a function of φ_− within the [−π, π) range, even when there are two overlapping stimulus disparities. The stereo transparency problem could still be solved by an interdigitating representation of the two disparities by cells tuned to different spatial locations, similar to a previous motion-transparency model (Qian, Andersen, & Adelson, 1994). This approach is also related to those explicit methods
that use the uniqueness constraint to model stereo transparency (Qian & Sejnowski, 1989; Pollard & Frisby, 1990). It thus appears that despite the major differences between the explicit and implicit methods, they could share certain conceptual similarities.

Acknowledgments
This work was supported by a Sloan Research Fellowship and NIH grant MH54125, both to N. Q. We are grateful to the anonymous reviewers for their helpful comments.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2(2), 284–299.
Anzai, A., Ohzawa, I., & Freeman, R. D. (1997). Neural mechanisms underlying binocular fusion and stereopsis: Position vs. phase. Proc. Nat. Acad. Sci. USA, 94, 5438–5443.
Bishop, P. O., Henry, G. H., & Smith, C. J. (1971). Binocular interaction fields of single units in the cat striate cortex. J. Physiol., 216, 39–68.
Borst, A., & Egelhaaf, M. (1989). Principles of visual motion detection. Trends Neurosci., 12, 297–306.
Cumming, B. G., & Parker, A. J. (1997). Responses of primary visual cortical neurons to binocular disparity without depth perception. Nature, 389, 280–283.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters. J. Opt. Soc. Am. A, 2, 1160–1169.
DeAngelis, G. C., Ohzawa, I., & Freeman, R. D. (1991). Depth is encoded in the visual cortex by a specialized receptive field structure. Nature, 352, 156–159.
Fleet, D. J., Jepson, A. D., & Jenkin, M. (1991). Phase-based disparity measurement. Comp. Vis. Graphics Image Proc., 53, 198–210.
Fleet, D. J., Wagner, H., & Heeger, D. J. (1996). Encoding of binocular disparity: Energy models, position shifts and phase shifts. Vision Res., 36, 1839–1858.
Freeman, R. D., & Ohzawa, I. (1990). On the neurophysiological organization of binocular vision. Vision Res., 30, 1661–1676.
Heeger, D. J., & Simoncelli, E. P. (1992). Model of visual motion sensing. In L. Harris & M. Jenkins (Eds.), Spatial vision in humans and robots. Cambridge: Cambridge University Press.
Jacobson, L., Gaska, J. P., & Pollen, D. A. (1993). Phase, displacement and hybrid models for disparity coding. Invest. Ophthalmol. and Vis. Sci. Suppl.
(ARVO), 34, 908.
Julesz, B. (1971). Foundations of cyclopean perception. Chicago: University of Chicago Press.
Mallot, H. A., Gillner, S., & Arndt, P. A. (1996). Is correspondence search in human stereo vision a coarse-to-fine process? Biol. Cybern., 74, 95–106.
Marcelja, S. (1980). Mathematical description of the responses of simple cortical cells. J. Opt. Soc. Am. A, 70, 1297–1300.
Marr, D., & Poggio, T. (1976). Cooperative computation of stereo disparity. Science, 194, 283–287.
Marr, D., & Poggio, T. (1979). A computational theory of human stereo vision. Proc. R. Soc. Lond. B, 204, 301–328.
Marshall, J. A., Kalarickal, G. J., & Graves, E. B. (1996). Neural model of visual stereomatching: Slant, transparency, and clouds. Network: Comput. Neural Sys., 7, 635–639.
Masson, G. S., Busettini, C., & Miles, F. A. (1997). Vergence eye movements in response to binocular disparity without depth perception. Nature, 389, 283–286.
McLean, J., & Palmer, L. A. (1989). Contribution of linear spatiotemporal receptive field structure to velocity selectivity of simple cells in area 17 of cat. Vision Res., 29, 675–679.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1990). Stereoscopic depth discrimination in the visual cortex: Neurons ideally suited as disparity detectors. Science, 249, 1037–1041.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1996). Encoding of binocular disparity by simple cells in the cat's visual cortex. J. Neurophysiol., 75, 1779–1805.
Ohzawa, I., DeAngelis, G. C., & Freeman, R. D. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77, 2879–2909.
Pollard, S. B., & Frisby, J. P. (1990). Transparency and the uniqueness constraint in human and computer stereo vision. Nature, 347, 553–556.
Pollard, S. B., Mayhew, J. E., & Frisby, J. P. (1985). PMF: A stereo correspondence algorithm using a disparity gradient limit. Perception, 14, 449–470.
Pollen, D. A. (1981). Phase relationship between adjacent simple cells in the visual cortex. Science, 212, 1409–1411.
Prazdny, K. (1985). Detection of binocular disparities. Biol. Cybern., 52, 93–99.
Qian, N. (1994). Computing stereo disparity and motion with known binocular cell properties. Neural Comput., 6, 390–404.
Qian, N.
(1997). Binocular disparity and the perception of depth. Neuron, 18, 359–368.
Qian, N., & Andersen, R. A. (1997). A physiological model for motion-stereo integration and a unified explanation of Pulfrich-like phenomena. Vision Res., 37, 1683–1698.
Qian, N., Andersen, R. A., & Adelson, E. H. (1994). Transparent motion perception as detection of unbalanced motion signals III: Modeling. J. Neurosci., 14, 7381–7392.
Qian, N., & Sejnowski, T. J. (1989). Learning to solve random-dot stereograms of dense and transparent surfaces with recurrent backpropagation. In D. S. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 435–443). San Mateo, CA: Morgan Kaufmann.
Qian, N., & Zhu, Y. (1997). Physiological computation of binocular disparity. Vision Res., 37, 1811–1827.
Rohaly, A. M., & Wilson, H. R. (1993). Nature of coarse-to-fine constraints on binocular fusion. J. Opt. Soc. Am. A, 10, 2433–2441.
Rohaly, A. M., & Wilson, H. R. (1994). Disparity averaging across spatial scales. Vision Res., 34, 1315–1325.
Sanger, T. D. (1988). Stereo disparity computation using Gabor filters. Biol. Cybern., 59, 405–418.
Simoncelli, E. P., & Adelson, E. H. (1991). Relationship between gradient, spatiotemporal energy, and regression models for motion perception. Invest. Ophthalmol. and Vis. Sci. Suppl. (ARVO), 32, 893.
Smallman, H. S. (1995). Fine-to-coarse scale disambiguation in stereopsis. Vision Res., 35, 1047–1060.
Smallman, H. S., & MacLeod, D. I. (1994). Size-disparity correlation in stereopsis at contrast threshold. J. Opt. Soc. Am. A, 11, 2169–2183.
van Santen, J. P. H., & Sperling, G. (1985). Elaborated Reichardt detectors. J. Opt. Soc. Am. A, 2, 300–321.
Watson, A. B., & Ahumada, A. J. (1985). Model of human visual-motion sensing. J. Opt. Soc. Am. A, 2, 322–342.
Wilson, H. R., Blake, R., & Halpern, D. L. (1991). Coarse spatial scales constrain the range of binocular fusion on fine scales. J. Opt. Soc. Am. A, 8, 229–236.
Yeshurun, Y., & Schwartz, E. L. (1989). Cepstral filtering on a columnar image architecture: A fast algorithm for binocular stereo segmentation. IEEE Pat. Anal. Mach. Intell., 11, 759–767.
Zhu, Y., & Qian, N. (1996). Binocular receptive fields, disparity tuning, and characteristic disparity. Neural Comput., 8, 1647–1677.

Received September 30, 1998; accepted February 12, 1999.
NOTE
Communicated by Leo Breiman
N-tuple Network, CART, and Bagging

Aleksander Kołcz
Electrical and Computer Engineering Department, University of Colorado at Colorado Springs, Colorado Springs, CO 80918, U.S.A.

Similarities between bootstrap aggregation (bagging) and N-tuple sampling are explored to propose a retina-free, data-driven version of the N-tuple network, whose close analogies to aggregated regression trees, such as classification and regression trees (CART), lead to further architectural enhancements. Performance of the proposed algorithms is compared with the traditional versions of the N-tuple network and of CART on a number of regression problems. The architecture significantly outperforms conventional N-tuple networks while leading to more compact solutions and avoiding certain implementational pitfalls of the latter.
1 Introduction
Recent years have seen a reemergence of interest in the N-tuple network (NTN) architecture (Bledsoe & Browning, 1959). Despite numerous analyses (Bledsoe & Bisson, 1962; Steck, 1962; Roy & Sherman, 1967) and significant progress (Luttrell, 1992; Rohwer, 1995; Kołcz & Allinson, 1996; Jung, Krishnamoorthy, Nagy, & Shapira, 1996; Bradshaw, 1997; Rohwer & Morciniec, 1998), the computational capabilities of this architecture are still not very well understood. An NTN combines multiple models that randomly partition the input space. In this article we show that a single NTN node can be represented by a binary tree, which can benefit from learning algorithms similar to those used by classification and regression trees (CART) (Breiman, Friedman, Olshen, & Stone, 1984). This abstraction, together with analogies between N-tuple sampling and bootstrap aggregation (bagging) (Breiman, 1996a), is used to propose a data-driven method of creating a set of NTN tree-based nodes, resulting in a network related to bagged CART.

The article is organized as follows. In section 2 we demonstrate a basic similarity between N-tuple sampling and bagging and propose a retina-free version of the NTN. Section 3 introduces two tree architectures combining the attributes of CART and NTN. In section 4, numerical experiments are used to demonstrate the advantages of the proposed tree versions of the NTN when compared to the original architecture and to CART on a number of regression problems. Section 5 contains the conclusion.

Neural Computation 12, 293–304 (2000)
© 2000 Massachusetts Institute of Technology
2 Retina-Free NTN and Bagging Estimators
Assuming a binary input pattern space, usually represented in the form of a two-dimensional "retina," an NTN performs N-tuple sampling, where a set of K memory nodes, each having an N-bit-long address word, randomly distribute their address lines among the retina bits, with each address line being associated with a particular random feature of the binary input space. The NTN mapping is thus an aggregate of K random low-complexity models. In continuous variable environments, real-valued inputs are converted into a binary format suitable for N-tuple sampling. In particular, when thermometer encoding (Aleksander & Stonham, 1979) is used, an instance of N-tuple sampling becomes equivalent to placing N quantization thresholds (constrained by a uniform quantization grid in each dimension) within the range of the D input variables (Tattersall, Foster, & Johnston, 1991), leading to hyper-rectangular quantization of the input space. The network is biased by the assumption of uniformity imposed on the input space. The alternative we consider is to bypass the input-conversion stage altogether and perform N-tuple sampling by directly distributing the quantization thresholds within the ranges of the input variables, a process that can be guided by the empirical input-space distribution provided by a training set. In the light of this idea, certain similarities between the nature of NTN design and some recent advances in learning algorithms become apparent. Breiman (1996a) showed that significant improvements in network prediction performance can be gained by using the concept of bagging, where the variance component of the generalization error is reduced by averaging several variants of the network. Each variant is designed with a different version of the training set, with each version created by sampling with repetitions from the original collection.
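Thermometer encoding and random N-tuple address formation, as described above, can be sketched as follows (a minimal illustration; the function names and parameter choices are ours, not from the paper):

```python
import numpy as np

def thermometer_encode(x, lo, hi, levels=8):
    """Thermometer encoding: bit i is 1 iff x exceeds the i-th of `levels`
    uniformly spaced thresholds in [lo, hi)."""
    thresholds = np.linspace(lo, hi, levels, endpoint=False)
    return (x > thresholds).astype(int)

def ntuple_address(retina_bits, tuple_indices):
    """An N-tuple memory node reads N randomly chosen retina bits and
    concatenates them into an address for its lookup table."""
    bits = retina_bits[tuple_indices]
    return int("".join(str(b) for b in bits), 2)

rng = np.random.default_rng(0)
# A two-variable input, each thermometer-encoded with 8 levels -> 16 retina bits.
retina = np.concatenate([thermometer_encode(v, 0.0, 1.0) for v in (0.3, 0.9)])
indices = rng.choice(len(retina), size=4, replace=False)  # one random 4-tuple
addr = ntuple_address(retina, indices)                    # address in [0, 16)
```

Placing the thresholds at empirical training-point coordinates, instead of on the uniform grid used above, gives the retina-free variant the text proposes.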
To see the similarity between bagging and NTN design, let us first note that each node of an NTN represents a crude model of the desired mapping, which is highly dependent on the positioning of the N-tuple thresholds. Averaging a large number of such randomly designed nodes reduces the variance of the estimate irrespective of the data set considered. However, when N-tuple sampling is constrained to choose the quantization thresholds from within the set of training-point coordinates (according to the sampling-with-repetition regimen), the process of selecting a set of K NTN nodes becomes data driven and effectively corresponds to bagging. Although such an approach should lead to performance improvements, we propose to structure the process of training individual NTN nodes further by taking advantage of their inherent similarity to binary trees. 3 NTN Analogies to CART
CART is a well-known binary tree network based on hyper-rectangular quantization of the input space. In the regression variant, its output to an
input x is given by an arithmetic average of the training points falling into the same tree leaf as x. During the tree-growing process, which starts with a root node containing all training points, CART chooses the next node to be partitioned such that the split of the corresponding cell maximizes the decrease of the overall training-set error. A split is performed parallel to one of the axes by choosing the threshold point out of the set of coordinate values of training points falling into that node. Node splitting continues until a sufficiently large tree is created, at which point a pruning process is initiated to optimize the tree and avoid the problem of overfitting. Although a node of the NTN is created differently from CART, certain commonalities can be seen. Starting with the unpartitioned input space, the placement of the first quantization threshold of an NTN node is equivalent to splitting the space at the root node of CART. The placement of any additional threshold results in splitting all of the current leaf nodes of the tree at that threshold, instead of splitting exactly one node, as is the case in CART. Despite this important difference, we can create an equivalent tree corresponding to a single node of the NTN, considering that the order of threshold-point placement is fixed (i.e., each N-tuple is ordered). The value of N corresponds to the depth of the equivalent tree. 3.1 Bagged-Tree Implementation of an NTN. In view of the similarities between NTN and CART, a question arises whether the structured approach of tree design could be applied to the NTN in order to enhance the network performance. One obvious problem is that CART constructs essentially one tree, whereas NTN uses K nodes to achieve satisfactory performance levels. However, the correspondence between bagging and N-tuple sampling allows substitution of the collection of randomly designed N-tuple nodes for a collection of bagged trees.
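The bagged-tree substitution can be sketched generically; `fit` stands in for any base learner, and all names here are ours, not from the paper:

```python
import numpy as np

def bagged_predict(X_train, y_train, X_query, K, fit, rng):
    """Bagging: fit K models, each on a bootstrap resample of the training
    set (sampling with repetition), and average their predictions."""
    preds = []
    for _ in range(K):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # bootstrap sample
        model = fit(X_train[idx], y_train[idx])
        preds.append(model(X_query))
    return np.mean(preds, axis=0)

# Toy base learner: predicts the mean of its (resampled) training targets.
fit_mean = lambda X, y: (lambda Xq: np.full(len(Xq), y.mean()))
pred = bagged_predict(np.zeros((10, 1)), np.arange(10.0), np.zeros((3, 1)),
                      K=25, fit=fit_mean, rng=np.random.default_rng(0))
```

In the article's setting, `fit` would grow one NTN-style tree per bootstrap sample, yielding the K network nodes.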
K bootstrapped CART trees can also be interpreted as nonidentical quantizations of the input space, where each point resides in exactly K quantization cells, as in the NTN. Taking these observations into account, if a structured (i.e., data driven) node design were to be applied to the NTN, bagging could be used to derive the K network nodes. 3.2 Adapting the CART Node-Splitting Criterion. At each step in the CART tree-growing algorithm, all current tree leaves are evaluated for the possibility of a split. The leaf leading to the highest reduction in error is then chosen. The best-split selection process can be expressed as
$$(n, d, t)_{\text{best}} = \arg\max_{(n,d,t)} \left[ E(n) - \left( E(n, d, t)_l + E(n, d, t)_r \right) \right], \tag{3.1}$$

where $E(n)$ represents the error estimate at node $n$, $d$ represents an input variable ($d = 1, \ldots, D$), $t$ is a particular threshold value, and $E(n, d, t)_l$ and $E(n, d, t)_r$ correspond to the error estimates for points at node $n$ residing on both (i.e., "left" and "right") sides of the partitioning threshold $x_d = t$.
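A minimal sketch of this best-split search (helper names ours; the error estimate is taken to be the within-node sum of squared errors):

```python
import numpy as np

def sse(y):
    """Error estimate E(n): sum of squared deviations from the node mean."""
    return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

def best_split(X, y, node_idx, D):
    """Exhaustive search for the (d, t) maximizing
    E(n) - (E(n,d,t)_l + E(n,d,t)_r) at a single node, as in equation 3.1."""
    best, best_gain = None, 0.0
    e_n = sse(y[node_idx])
    for d in range(D):
        for t in np.unique(X[node_idx, d]):        # thresholds from point coordinates
            left = node_idx[X[node_idx, d] <= t]
            right = node_idx[X[node_idx, d] > t]
            if len(left) == 0 or len(right) == 0:  # split must divide the points
                continue
            gain = e_n - (sse(y[left]) + sse(y[right]))
            if gain > best_gain:
                best, best_gain = (d, float(t)), gain
    return best, best_gain

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 10.0, 10.0])
split, gain = best_split(X, y, np.arange(4), D=1)  # -> split at x_0 <= 1.0
```

The NTN variant of equation 3.2 differs only in summing this gain over all current leaves before choosing (d, t).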
Generally, splitting any node at any point will reduce the error (provided that it results in a division of points at that node), and the procedure chooses the best overall split. In the case of the NTN, the splitting occurs for all current terminal nodes. Consequently, it makes sense to choose the best split on the basis of the aggregated decrease of the training-set error, taking all nodes into account, that is,

$$(d, t)_{\text{best}} = \arg\max_{(d,t)} \sum_n \left[ E(n) - \left( E(n, d, t)_l + E(n, d, t)_r \right) \right]. \tag{3.2}$$

Note that at some nodes, the gain might be zero if all points in that node lie on the same side of the threshold. In such cases, when the split is carried out, the tree will have some empty terminal nodes, which might be split again as the process continues. To avoid this undesirable effect, splits are allowed only if the node contains more than one element and the split does not result in an empty cell. In this way no empty cells will be added during the training process. 3.3 Pruning.
3.3.1 Cost-Complexity Pruning. Since every split results in an error reduction, the greedy algorithm outlined above can, in principle, continue until each leaf contains just one element. Clearly, such a network provides a good fit to the training set but can be expected to have poor generalization properties. To overcome this effect, the CART algorithm utilizes cost-complexity pruning, where the magnitude of the training-set error is weighed against the complexity of the tree, measured in terms of the number of its terminal nodes,

$$E = E_R + \alpha \cdot \tilde{T},$$

where $E_R$ is the mean squared error over the training set and $\tilde{T}$ is the tree size. After a cross-validated estimate of $\alpha$ is obtained, the tree is pruned by merging pairs of terminal nodes, one pair at a time, until the desired tree complexity is reached. The cost-complexity pruning can be applied directly to the tree created by the NTN algorithm. The resulting tree will no longer correspond to an NTN, since a particular threshold may be implemented for some nodes but omitted for others, even though carrying out the split would not result in the creation of an empty cell. A tree grown according to the NTN strategy and pruned according to the cost-complexity approach will be called CART-NTN. An example input-space quantization characteristic of CART-NTN is illustrated in Figure 1 (left). 3.3.2 Optimal-Depth Pruning. The cost-complexity pruning makes the resulting tree analogous to CART, since the tree can have branches of different lengths not necessarily dictated by the lack of data in the terminal nodes. To keep within the NTN framework, however, each pruning step should merge all cells at the current level. A simple pruning method conforming with the NTN methodology is to choose a depth of the tree that
Figure 1: Input-space quantization due to (left) CART-NTN and (right) NTN-CART. In CART-NTN, some of the initial quantization thresholds (indicated by heavier lines) will affect the entire input domain, while others will be more local, due to the CART-like pruning procedure. In NTN-CART most of the partitioning thresholds have a global scope, just as in the NTN, but in certain local cases, the splits are not implemented (the shaded regions) due to the lack of training data.
results in the best generalization error, obtained through cross validation. Note that this type of pruning is significantly simpler than cost-complexity pruning. The order of node removal is exactly the same as the order of node generation since we remove all nodes residing at a current level in each stage. Therefore, while the original tree is grown, a cross-validated estimate of the generalization error is computed and associated with that level each time a split is made. After the original tree is grown, the depth corresponding to the minimum cross-validation error is then chosen as the optimal depth of the final tree. This value corresponds essentially to the value of N in the case of NTN. A tree grown according to the NTN strategy and pruned according to the optimal-depth approach will be called NTN-CART. An example input-space quantization typical of NTN-CART is shown in Figure 1 (right). 4 Numerical Experiments
In view of the extensions proposed, CART-NTN and NTN-CART can be considered intermediate architectures, bridging CART with NTN. Performance of CART, NTN, and the intermediate CART-NTN and NTN-CART networks was evaluated in a regression setting using one real and three simulated data sets, which have been used in Breiman (1996a). One hundred random variants of each data set were considered. For the simulated data, the different variants were generated using the formulas given below; for
the real data set, the different samples were obtained by a random split of the whole set into its training and test components. In all experiments with bagged networks, 25 bagging repetitions were used. The real-world data set Boston corresponds to the Boston Housing experiment (Belsley, Kuh, & Welsh, 1980), where a mix of 13 social and environmental variables is used to estimate real estate prices. The total data set consists of 506 elements, where in each random case 481 points were used to train a network and the remaining 25 were used for testing. The three artificial data sets are due to Friedman (1991). In each random case, a training set consisted of 200 points, and a test set contained 1000 points. The data set Friedman-1 is generated according to the formula

$$y = 10 \sin(\pi x_1 x_2) + 20 (x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon,$$

where $\epsilon \sim N(0, 1)$ and $[x_1, \ldots, x_{10}]$ is a 10-dimensional input vector whose elements represent independent variables distributed uniformly within $[0, 1]$; $\epsilon$ represents a normally distributed noise component. Note that only 5 of the 10 variables have any influence on the output. In the case of data sets Friedman-2 and Friedman-3, the variables of the four-dimensional input space are independent and distributed uniformly in the ranges $x_1 \in [0, 100]$, $x_2 \in [40\pi, 560\pi]$, $x_3 \in [0, 1]$, $x_4 \in [1, 11]$. For the data set Friedman-2, the output variable is generated according to

$$y = \sqrt{x_1^2 + \left( x_2 x_3 - 1/(x_2 x_4) \right)^2} + \epsilon \quad (\epsilon \sim N(0, 127)),$$

while for the case of the Friedman-3 data set, the output variable is given by

$$y = \arctan\!\left( \frac{x_2 x_3 - 1/(x_2 x_4)}{x_1} \right) + \epsilon \quad (\epsilon \sim N(0, 0.106)).$$

4.1 Network Implementation. CART was implemented according to the description provided in Breiman et al. (1984). The initial tree was grown by successive terminal-node splitting (see equation 3.1), while at least one tree leaf contained more than five points and a split reducing the training-set error was possible. Cost-complexity pruning reduced the number of terminal nodes once the original tree had been grown. Tenfold cross validation was used to estimate the optimum pruning parameter $\alpha$. Both CART-NTN and NTN-CART followed a similar initial tree-growing process, except that all current terminal nodes were split at each step, if possible, according to equation 3.2. In the case of CART-NTN, cost-complexity pruning identical to the one used in CART was then applied, while NTN-CART found the optimal tree depth, minimizing the estimated generalization error obtained via 10-fold cross validation. In the case of the NTN, in all experiments all inputs were mapped to the 0, ..., 255 range and thermometer encoded to form a binary retina of size $R = 256 \cdot D$, where D is the number of input variables. In one set of experiments, the number K of N-tuple memories was chosen to allow complete sampling of the retina bits for a given value of N, that is, $K \approx R/N$.
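The Friedman-1 set above, for example, can be generated as follows (a sketch; the function name and seeding are ours):

```python
import numpy as np

def friedman1(n, rng):
    """Friedman-1: y = 10 sin(pi x1 x2) + 20 (x3 - 0.5)^2 + 10 x4 + 5 x5 + e,
    e ~ N(0, 1), with 10 inputs uniform on [0, 1] (5 of them irrelevant)."""
    X = rng.uniform(0.0, 1.0, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3]
         + 5 * X[:, 4]
         + rng.normal(0.0, 1.0, size=n))
    return X, y

X_train, y_train = friedman1(200, np.random.default_rng(0))   # training set size in the text
X_test, y_test = friedman1(1000, np.random.default_rng(1))    # test set size in the text
```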
Another set of experiments used a generally larger but fixed number of N-tuple nodes (K = 500). For each variant of the data set, 25 random N-tuple retina samplings were considered, and the test-set errors obtained in each case were then averaged. The N-tuple memories were implemented using hashing (Knuth, 1973), with each memory node containing 5000 locations. In this way, large N-tuple sizes could be readily considered, and since the training-set sizes were in all cases fewer than 500, the probability of hash collisions was negligibly small. To reduce the generalization error due to empty memory locations, the NTN mapping was modeled as y = b + NTN(x), with the bias b estimated as an average of the training-set values. The network response formation was considered in two variants, one with the output response calculated according to Kołcz and Allinson (1996),

$$\text{NTN}(x) = \frac{\sum_{k=1}^{K} w_k(x)}{\sum_{k=1}^{K} c_k(x)} = \frac{\sum_{t=1}^{T} y^t \cdot k(x, x^t)}{\sum_{t=1}^{T} k(x, x^t)}, \tag{4.1}$$

where $T$ is the size of the training set, $c_k(x)$ is the counter associated with weight $w_k(x)$, $(x^t, y^t)$ represents the $t$th input-output training pair, and $k(x, x^t)$ is the equivalent kernel function of the NTN. In a one-pass training procedure, $w_k(x)$ is set to the sum of training values of points addressing that memory location, and $c_k(x)$ is set to their number. The other form of NTN response was produced according to

$$\text{NTN}(x) = \frac{1}{K} \sum_{k=1}^{K} \frac{w_k(x)}{c_k(x)}, \tag{4.2}$$
with the weights and counters set in the same way. The latter variant directly corresponds to the CART network, since the response of each memory location represents an average of the training-point values that address that location and hence fall into the corresponding quantization cell. Since equation 4.2 makes the NTN directly related to the remaining networks, it has been chosen as the basis for performance comparison. 4.2 Discussion of Results. The combined results for all networks and data sets considered are presented in Table 1. The complete results for both variants of the NTN and all values of N considered are shown in Figure 2. The accuracy of CART and CART-NTN was closely matched in all cases (with similar depths of the trees produced), which is not surprising since the same pruning algorithm was applied in both cases. The advantage of one network over the other proved to be data dependent: for the Friedman-1 and Friedman-3 experiments, CART-NTN was slightly better than CART, while for the Friedman-2 and Boston experiments, the situation was reversed. It appears that the depths of the trees produced by CART-NTN were unaffected by bagging, whereas for both CART and NTN-CART, bagging led to deeper trees.

Table 1: Combined Mean Squared Error (MSE) Results for All Data Sets.

Results for the Friedman-1 data set:

Network     MSE      MSE-bag   Decrease   ATD   ATD-bag
CART        13.1     9.0       31%        5     7
CART-NTN    12.2     8.7       29%        4     4
NTN-CART    11.4     6.9       40%        4     9
NTN         12.7     12.3      3%         25    25

Results for the Friedman-2 data set:

Network     MSE      MSE-bag   Decrease   ATD   ATD-bag
CART        36,438   24,840    32%        8     9
CART-NTN    37,930   26,297    31%        4     4
NTN-CART    36,510   24,911    32%        5     14
NTN         32,437   30,797    5%         20    20

Results for the Friedman-3 data set:

Network     MSE      MSE-bag   Decrease   ATD   ATD-bag
CART        0.0600   0.0370    38%        4     6
CART-NTN    0.0544   0.0346    36%        4     4
NTN-CART    0.0449   0.0281    37%        5     14
NTN         0.045    0.044     2%         20    20

Results for the Boston data set:

Network     MSE      MSE-bag   Decrease   ATD   ATD-bag
CART        27.8     16.2      42%        6     7
CART-NTN    31.4     16.2      48%        5     5
NTN-CART    22.6     13.1      42%        8     25
NTN         16.7     15.9      5%         40    45

Notes: MSE-bag refers to the error obtained for the bagged variants of CART, CART-NTN, and NTN-CART and for the variant of NTN with the number of N-tuple nodes fixed to K = 500 (in the MSE column, the number of nodes of an NTN is just large enough to cover the retina with N-tuples). The decrease in the test-set error due to bagging is indicated in column 4, and the average tree depths (ATD) (and the value of N for NTN) for the standard and bagged variants of the networks are given in columns 5 and 6, respectively. The NTN results correspond to the optimal values of N.

NTN-CART proved to be superior to both CART and CART-NTN for all data sets considered, except for the Friedman-2 experiment, where its performance was essentially equal to that of CART. This can probably be attributed to the optimal-depth pruning used by NTN-CART, which resulted in trees of higher complexity than those used by the other networks. This may indicate that cost-complexity pruning applied to CART and CART-NTN can prune the trees too much in certain cases. Interestingly, the distribution-based trees proposed in Breiman (1996b) also combined higher tree complexity with better prediction accuracy. All three tree networks demonstrated similar levels of sensitivity to the training data, ranging between 29% and 48% for the data sets considered, where the sensitivity is measured by the decrease of the test-set error due
[Figure 2 appears here: four panels, NTN: BOSTON (Kmin=67, Kmax=666), NTN: FRIEDMAN 1 (Kmin=52, Kmax=513), NTN: FRIEDMAN 2 (Kmin=21, Kmax=205), and NTN: FRIEDMAN 3 (Kmin=21, Kmax=205), each plotting prediction error against N for the "tree", "kernel", "tree 500", and "kernel 500" variants.]
Figure 2: Combined NTN results for the data sets considered and different values of N. The continuous graph lines correspond to NTN with a retina-size-dependent K, while the marker graphs refer to the fixed-K cases. The kernel label refers to the response of an NTN according to equation 4.1, and tree according to equation 4.2.
to bagging. This suggests that although CART-NTN and NTN-CART have more data at their disposal than CART when making node-splitting decisions, the splitting process is inherently less stable than in CART, since each split affects the whole of the input space. As far as the N-tuple networks are concerned, their performance proved to be comparable to, and in some cases better than, the performance of single-node CART, CART-NTN, and NTN-CART, but generally worse than the bagged versions of these networks. This is largely expected and can be attributed to the random nature of N-tuple sampling, which is far from optimal unless the joint distribution of the input-output space is actually uniform. Increasing the resolution of the input retina by using more than 256 quantization levels per input variable did not result in a significant
improvement of NTN performance. Conforming with the results published in Rohwer and Lamb (1993) and Kołcz and Allinson (1996), the optimal values of N tend to be rather large compared to the ones usually used in implementations of the NTN. Figure 2 confirms that by increasing the number of N-tuple nodes, the network performance is improved. The improvement is often small, which may be attributed to the random nature of N-tuple sampling, which does not take full advantage of the distribution-related information provided by the training data. The performance increase due to large K is more pronounced for the synthetic data sets, for which the input-space distribution is truly uniform. The results obtained for the NTN indicate a clear advantage of combining the individual node responses according to equation 4.2 when compared to the regression-NTN method of equation 4.1. The explanation lies in the higher bias induced by equation 4.1. Let $\{w_1, \ldots, w_K\}$ denote the sums of training values in the cells selected for a particular input. Let $\{a_1, \ldots, a_K\}$ and $\{c_1, \ldots, c_K\}$ correspond to the averages and numbers of training points in these cells, respectively. The node aggregation method analogous to bagged CART (see equation 4.2) produces the response as

$$\frac{1}{K} \sum_{k=1}^{K} \frac{w_k}{c_k} = \frac{1}{K} \sum_{k=1}^{K} a_k,$$
while the response of the regression NTN (see equation 4.1) is given by

$$\frac{\sum_{k=1}^{K} w_k}{\sum_{k=1}^{K} c_k} = \frac{\sum_{k=1}^{K} c_k \cdot a_k}{\sum_{k=1}^{K} c_k} = \sum_{k=1}^{K} \beta_k \cdot a_k,$$

where $\beta_k$ is proportional to the number of training points in the $k$th cell, and

$$\sum_{k=1}^{K} \beta_k = 1.$$
Thus cell averages generated by a larger number of points have more influence on the output in this case. But a larger number of training points falling into a cell is likely to be associated with a larger "footprint" of the cell in the input space, in which many points influencing the cell average $a_k$ are likely to be farther away from the actual input point for which the network response is computed. Consequently, a higher bias will result. For larger values of N, the size of the partitioning cells is small enough for the diluting effects of equation 4.1 to become insignificant (see Figure 2). Note that if all cells selected by x contain an identical number of points, the responses produced according to equations 4.2 and 4.1 will be identical.
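The difference between the two response rules, and their coincidence for equal cell counts, can be checked numerically (helper name and numbers ours):

```python
import numpy as np

def ntn_responses(w, c):
    """Combine K selected memory cells: w[k] is the stored sum of training
    targets, c[k] the stored count (all nonzero here). Returns the
    count-weighted response of equation 4.1 and the unweighted average of
    cell averages of equation 4.2."""
    w, c = np.asarray(w, float), np.asarray(c, float)
    kernel = w.sum() / c.sum()   # equation 4.1
    tree = np.mean(w / c)        # equation 4.2
    return kernel, tree

# Equal counts: the two rules coincide (both give 2.0 here).
k_eq, t_eq = ntn_responses([2.0, 4.0, 6.0], [2, 2, 2])
# Unequal counts: equation 4.1 is pulled toward the fuller (wider) cell.
k_ne, t_ne = ntn_responses([2.0, 40.0], [1, 10])   # 42/11 vs. 3.0
```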
5 Conclusion
The analogy between a single N-tuple memory node and a tree network, as well as between N-tuple sampling and bagging, was used to propose a tree network (in two variants: CART-NTN and NTN-CART) using the data-driven tree-growing strategy of CART and the input-space quantization structure of an NTN node. The concept of bagging was applied to create a number of such network nodes in a data-driven instead of a purely random way. A simple tree-pruning method maintaining a close similarity to the original NTN architecture has also been suggested. NTN-CART, combining all these features, provides a data-driven, tree-based version of an NTN using the concept of bagging for the generation of multiple network nodes. Experimental results show that bagged NTN-CART provides an attractive alternative to the NTN, since its data-driven approach to network design is able to match the placement of quantization thresholds to the empirical data distribution provided by the training set. The performance increase is accompanied by a smaller number of nodes, a smaller value of N, and a way of eliminating the problem of empty memory locations characteristic of N-tuple networks. At the same time, the simpler nature of the pruning proposed for NTN-CART can also have advantages for this network when compared with the traditional cost-complexity pruning of CART. References Aleksander, I., & Stonham, T. J. (1979). A guide to pattern recognition using random-access memories. IEE Proceedings—E Computers and Digital Techniques, 2(1), 29–40. Belsley, D., Kuh, E., & Welsh, R. (1980). Regression diagnostics. New York: Wiley. Bledsoe, W. W., & Bisson, C. L. (1962). Improved memory matrices for the N-tuple pattern recognition method. IRE Transactions on Electronic Computers, EC-11, 414–415. Bledsoe, W., & Browning, I. (1959). Pattern recognition and reading by machine. IRE Joint Computer Conference, pp. 225–232. Bradshaw, N. P. (1997). The effective VC dimension of the N-tuple classifier. In W.
Gerstner, A. Germond, M. Hasler, & J. D. Nicoud (Eds.), Artificial neural networks—ICANN '97: 7th International Conference Proceedings (pp. 511–516). Berlin: Springer-Verlag. Breiman, L. (1996a). Bagging predictors. Machine Learning, 24(2), 123–140. Breiman, L. (1996b). Distribution based trees are more accurate. In S. I. Amari, L. Xu, L. W. Chan, I. King, & K. S. Leung (Eds.), Progress in neural information processing: Proceedings of the International Conference on Neural Information Processing (pp. 133–138). Berlin: Springer-Verlag. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth and Brooks/Cole.
Friedman, J. H. (1991). Multivariate adaptive regression splines. Annals of Statistics, 19(1), 1–141. Jung, D. M., Krishnamoorthy, M. S., Nagy, G., & Shapira, A. (1996). N-tuple features for OCR revisited. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(7), 734–745. Knuth, D. E. (1973). The art of computer programming (Vol. 3). Reading, MA: Addison-Wesley. Kołcz, A., & Allinson, N. M. (1996). N-tuple regression network. Neural Networks, 9, 855–869. Luttrell, S. P. (1992). Gibbs distribution theory of adaptive N-tuple networks. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks 2: Proceedings of the 1992 International Conference on Artificial Neural Networks (pp. 313–316). New York: Elsevier. Rohwer, R. J. (1995). Two Bayesian treatments of the N-tuple recognition method. Proceedings of the 4th International Conference on Artificial Neural Networks (pp. 171–176). London: IEE. Rohwer, R. J., & Lamb, A. (1993). An exploration of the effect of super large N-tuples on single layer RAMnets. In N. M. Allinson (Ed.), Proceedings of Weightless Neural Network Workshop '93 (pp. 33–37). York, UK: University of York. Rohwer, R., & Morciniec, M. (1998). The theoretical and experimental status of the N-tuple classifier. Neural Networks, 11(1), 1–14. Roy, R. J., & Sherman, J. (1967). Two viewpoints of K-tuple pattern recognition. IEEE Transactions on System Science and Cybernetics, 3(2), 117–120. Steck, G. P. (1962). Stochastic model for the Bledsoe-Browning pattern recognition scheme. IRE Transactions on Electronic Computers, 11, 274–282. Tattersall, G. D., Foster, S., & Johnston, R. D. (1991). Single-layer lookup perceptrons. IEE Proc. F, 138, 46–54. Received May 1, 1998; accepted February 14, 1999.
NOTE
Communicated by Robert Tibshirani
Improving the Practice of Classifier Performance Assessment N. M. Adams D. J. Hand Department of Mathematics, Imperial College, Huxley Building, 180 Queen's Gate, London, SW7 2BZ, U.K. In this note we use examples from the literature to illustrate some poor practices in assessing the performance of supervised classification rules, and we suggest guidelines for better methodology. We also describe a new assessment criterion that is suitable for the needs of many practical problems. 1 Introduction
The construction of classification tools (Hand, 1997) in real applications can involve a sophisticated design stage, but this is often followed by a much less sophisticated, and sometimes inadequate, performance assessment stage. The consequence of poor assessment methodology is that an inferior rule may be chosen. In order to assess performance, one must first decide on the criteria for measuring it. However, in many applications, assessment criteria are chosen that do not match the problem very well. Restricting ourselves to the two-class case, two criteria are in common use. One is error rate, or misclassification rate, which assumes that misallocation costs (the respective costs, or utilities, of classifying a class 0 object as class 1, and the reverse) are equal. The other is the area under a receiver operating characteristic (ROC) curve, or the equivalent Gini index, which assumes nothing whatever is known about misallocation costs. Both of these assumptions are unrealistic: costs are unlikely to be equal, and though they are seldom known precisely, it is very rare that nothing is known about them. Using real examples from the literature, our purpose here is to describe shortcomings of routine practice in assessing classifier performance and present a set of guidelines that may be useful. In addition, we describe a new loss-based method for comparative analysis of pairs of classifiers. We begin with the confusion matrix, which shows the cross-classification of predicted class by true class:
                True
                 0   1
    Predicted 0  a   b
              1  c   d

Neural Computation 12, 305–311 (2000)
© 2000 Massachusetts Institute of Technology
where $a + b + c + d = n$ and $n$ is the total number of objects to which the classifier has been applied. In order to obtain unbiased performance estimates, one would normally apply the classifier to an independent test set or use more sophisticated methods, but our purpose here is to discuss performance assessment measures rather than issues of estimation. Suppose that class 0 represents diseased people and class 1 represents healthy people. Then $a/(a + c)$ is called sensitivity, denoted Se, and $d/(b + d)$ is called specificity, denoted Sp. If we knew that the exact cost of misclassifying a diseased individual as healthy was $c_0$, and the reverse misallocation cost was $c_1$, then the overall expected loss from using the rule would be

$$(1 - \text{Se})\,\pi_0 c_0 + (1 - \text{Sp})\,\pi_1 c_1,$$

where $\pi_i$ refers to the prior probability of class $i$. Loss is the appropriate criterion for many classification problems. Misclassification rate corresponds to equal costs. However, eliciting misallocation costs in real situations can be extremely difficult, and instead people often adopt the area under an ROC curve. An ROC curve is (normally) a plot of sensitivity on the vertical axis and (1 − specificity) on the horizontal axis. The area under this curve (denoted AUC) is a popular summary measure of performance that incorporates all possible costs, so that one does not have to choose a particular cost pair (Bradley, 1997). However, this goes to the other extreme from the explicit loss value above and assumes that nothing can be elicited about costs. This is just as unrealistic as assuming the costs are known exactly. 2 Examples from the Literature
In this section, we discuss five examples from the applications literature that could have used better assessment methodology. Our purpose is not to criticize these authors, since many of these practices seem endemic, but rather to demonstrate scope for improvement in the practice of classifier assessment. 2.1 Assuming Equal Costs. Koziol, Adams, and Frutos (1996) were interested in seeing if certain indicators were useful for classifying interstitial cystitis cases into ulcer and nonulcer categories. They compared LDA (linear discriminant analysis), logistic regression, and recursive partitioning using error rate. However, it is very rare in a medical context that the seriousness of the two kinds of misclassification is equal. Typically (though not always) allocating a case as a noncase is more serious (i.e., incurs a greater cost) than the reverse.
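The consequence of the equal-cost assumption can be made concrete with the loss expression from section 1 (a sketch; the helper name and the numbers are ours, and the priors are estimated from the matrix itself):

```python
def expected_loss(a, b, c, d, c0, c1):
    """Expected loss (1 - Se)*pi0*c0 + (1 - Sp)*pi1*c1 from a confusion
    matrix with cells a, b (predicted 0) and c, d (predicted 1), where
    class 0 is "diseased". c0 is the cost of calling a diseased case
    healthy; c1 is the cost of the reverse error."""
    n = a + b + c + d
    se = a / (a + c)                      # sensitivity
    sp = d / (b + d)                      # specificity
    pi0, pi1 = (a + c) / n, (b + d) / n   # class priors
    return (1 - se) * pi0 * c0 + (1 - sp) * pi1 * c1

# With a=90, b=20, c=10, d=80: Se = 0.9, Sp = 0.8, n = 200.
loss_equal = expected_loss(90, 20, 10, 80, c0=1.0, c1=1.0)    # error rate, 0.15
loss_unequal = expected_loss(90, 20, 10, 80, c0=5.0, c1=1.0)  # 0.35
```

Error rate is the $c_0 = c_1$ special case; with unequal costs, a rule that looks better on error rate can have the higher expected loss.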
2.2 Integrating over Costs. Page et al. (1996) discuss the use of neural networks to discriminate between the brain scans of healthy subjects and Alzheimer's patients. They conclude that for this purpose, neural nets are better than the conventional statistical method (LDA) on the basis of the AUC. The AUC summarizes classifier performance across the entire cost range. It is difficult to believe that in this problem, the entire range of possible relative costs is equally legitimate. This would mean, for example, that one was prepared to entertain the possibility that misclassifying an Alzheimer's patient as healthy was vastly more serious than the reverse, and also to entertain the possibility that misclassifying a healthy subject as suffering from Alzheimer's was vastly more serious than the reverse. It is unlikely that both cost allocations would be regarded as legitimate. It is much more likely that a narrower range would be appropriate: one type of misclassification is more serious than the other, even if one cannot say by how much. 2.3 Crossing ROC Curves. Comparing the AUCs of two classifiers can be appropriate only if one curve entirely dominates the other (one classifier is better than the other for all costs). If ROC curves cross, as they do in Page et al. (1996, p. 199), it means that different classifiers are better for different values of the costs. Summarizing over all possible costs could lead to completely incorrect conclusions. 2.4 Fixing Costs. There are very few studies in the applications literature that use explicit misallocation costs. This is because there are normally intrinsic uncertainties surrounding the outcome (just how serious is one type of misdiagnosis compared with the other?). Even in domains where one might have expected precise costs to be available, such as commercial and banking domains, they are rare.
Michie, Spiegelhalter, and Taylor (1994) give precise costs for a problem of classifying applicants for loans into good and bad classes, but in our experience these misallocation costs are seldom available (Hand & Henley, 1997). The costs depend on profitability, which includes issues such as marketing and advertising costs. Moreover, the relative misclassification costs are likely to change over time as the economy evolves. Although it is probable that a range of likely costs can be given, it is improbable that exact ones can be given.

2.5 Variability. Dybowski, Weller, Chang, and Gant (1996) discuss the decision-making process involved in admitting patients to intensive care. They conducted a study to assess the performance of various neural network models and logistic discrimination in predicting death during admission, specifically of predicting in-hospital mortality among patients with systemic inflammatory response syndrome and hemodynamic shock. Comparison
N. M. Adams and D. J. Hand
was made using both AUC and Brier score (Hand, 1997). However, no standard error is quoted for any of the summary statistics for the competing classifiers, and no significance test results are given. The chosen architecture may have had the highest AUC purely by chance.

3 Guidelines for Comparing Classifiers
Our examples illustrate that equal costs (error rate) are rarely an appropriate assumption, that assuming nothing is known about the costs is unrealistic, that assuming they are known exactly is often unrealistic, that AUC may be entirely inappropriate if ROC curves cross, and that in comparing rules, the variation in the performance statistics must be taken into account. We could easily have picked other articles to demonstrate these points. In general, the assessment criteria must be chosen to match the objectives. Methodological and simulation studies out of any explicit context may have little bearing on real problems. In particular, it is important to take into account what is known about the relative misclassification costs because this can radically influence the choice of method. In section 4 we describe a performance measure that does this, making use of what can be said about the costs (since something can always be said) even if precise values cannot be given.

4 LC Assessment Method
In this section we outline the LC assessment method (described in detail in Adams and Hand, 1999), using the German credit data from the STATLOG project (Michie, Spiegelhalter, & Taylor, 1994). These data consist of 24 variables measured on 1000 people: 700 regarded as good credit risks (class 0) and 300 regarded as bad credit risks (class 1). These data were divided into training and test sets of 800 and 200 observations, respectively. The ROC curves for linear and quadratic discriminant analysis (QDA), estimated on the training set and applied to the test set, are given in Figure 1. In terms of the shortcomings described in section 2, this comparison involves crossing ROC curves. The AUCs (and associated standard errors) for LDA and QDA are 0.8371 (0.0268) and 0.8135 (0.0288), respectively. These summaries provide no evidence to favor either classifier, although published studies may ignore the estimates of standard error. Such a comparison also involves integrating over all possible costs and ignoring estimates of variability, both shortcomings described above. In fact, the German credit data have a misallocation matrix, which states that the severity of misclassifying a bad risk is five times that of misclassifying a good risk. With this extra information, standard ROC curve analysis is certainly inappropriate. However, our experience (Hand & Henley, 1997;
[Figure 1 appears here: ROC curves plotting F̂(t|0) against F̂(t|1), both axes running from 0.0 to 1.0, with curves labeled LDA and QDA.]
Figure 1: ROC curves for LDA and QDA applied to the German credit data.
Hand & Jacka, 1998) suggests that such precision in the definition of misallocation costs for credit problems is misplaced, and we therefore specify a range of 3 to 7 for this misclassification cost, with the most probable value remaining at the quoted one of 5. It is not meaningful for a given classifier to compare the losses arising from two different cost pairs (c_0, c_1) and (c_0', c_1'). All one can do is compare the losses from different classifiers at a fixed cost pair. Since the losses due to different cost pairs cannot be compared, the only useful information in comparing different classifiers is whether one classifier has a smaller loss than the other for particular cost pairs. It follows that the values of a pair (c_0, c_1) are not important, but only the relative size of c_0 and c_1. For practical reasons outlined in Adams and Hand (1999), it is convenient to impose the constraint c_0 + c_1 = 1 and work just in terms of c_1. Domain experts can normally specify upper and lower bounds on the possible values c_1 can take (although in our experience they do this most comfortably in terms of the cost ratio c_1/c_0), and they can also specify a most likely value for c_1. We use these three values to define a belief distribution over the possible values of c_1. For reasons explained in Adams and Hand (1999), we take a triangular function with support defined by the upper and lower bounds and maximum at the most likely value. This belief function, L(c_1), tells us how probable the expert thinks it is that the different values of c_1 will occur.
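The construction just described is easy to sketch in code. The following is our own illustration, not the authors' implementation: it converts the expert's bounds on the cost ratio c_1/c_0 (3, 5, and 7 above) into bounds on c_1 under the constraint c_0 + c_1 = 1, and builds the triangular belief function L(c_1):

```python
def ratio_to_c1(r):
    # With c0 + c1 = 1 and c1/c0 = r, solving gives c1 = r / (1 + r).
    return r / (1.0 + r)

def triangular_belief(lo, mode, hi):
    """Triangular density L(c1) with support [lo, hi] and peak at mode."""
    def L(c1):
        if c1 < lo or c1 > hi:
            return 0.0
        if c1 <= mode:
            return 2.0 * (c1 - lo) / ((hi - lo) * (mode - lo))
        return 2.0 * (hi - c1) / ((hi - lo) * (hi - mode))
    return L

# Cost-ratio range 3 to 7, most likely value 5, as for the German credit data.
lo, mode, hi = (ratio_to_c1(r) for r in (3.0, 5.0, 7.0))
L = triangular_belief(lo, mode, hi)
```

Multiplying L(c_1) by the indicator function I(c_1) described below and integrating over [lo, hi] then yields the LC index.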
For each possible value of c_1, we can see which classifier incurs the smallest loss. This allows us to define an indicator function I(c_1) for each classifier that takes the value 0 if the classifier does not have the smallest loss for a given value of c_1 and the value 1 if it does. For each classifier we combine I and L as

LC = \int I(c_1) L(c_1)\, dc_1

to give an overall measure of expected performance. The best classifier is that which gives the largest value of the LC index. (Adams & Hand, 1999, are concerned with only two classifiers and work with the difference of the two LC values.)

Returning to the example, we have the predictions of LDA and QDA on the test set and the belief distribution defined earlier. The estimates in this case are LC_LDA = 1 and LC_QDA = 0. Superficially, then, this is strong evidence in favor of LDA. Estimates of standard error for the LC index are not as readily obtained as estimates of standard error for more conventional summaries like error rate or AUC. The approach we adopt in Adams and Hand (2000) is to obtain a bootstrap estimate (Efron & Tibshirani, 1993; Davison & Hinkley, 1997) of the standard error of the LC index. This involves drawing random resamples (of the same size as the test set) with replacement from the test set. The LC index for the two classifiers is computed on each resample. The standard deviation of the resampled indexes is the bootstrap estimate of standard error. Since this example involves two classifiers, the LC index is a comparative measure (complementary results are obtained for the two classifiers). This implies that it is not necessary to use a paired test: the ratio of LC index to bootstrap standard error can serve as a test statistic. In this case, the standard error (common to both classifiers) is 0.0224. Using LC analysis in this case has come to a firm conclusion using an appropriate cost interval and incorporating variability estimates.

5 Conclusion
Efforts to optimize classifier performance can be wasted by the use of an inappropriate assessment criterion. A classifier that is suboptimal with respect to the characteristics of the problem can be installed. We have presented some guidelines and a tool that should help reduce these problems.

Acknowledgments
The work of N. A. on this project was supported by grant GR/K55219 from the EPSRC, under its “Neural Networks—The Key Questions” initiative.
References

Adams, N. M., & Hand, D. J. (1999). Comparing classifiers when the misallocation costs are uncertain. Pattern Recognition, 32, 1139–1147.
Adams, N. M., & Hand, D. J. (2000). An improved method for comparing diagnostic tests. To appear in Computers in Biology and Medicine.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145–1159.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their application. Cambridge: Cambridge University Press.
Dybowski, R., Weller, P., Chang, R., & Gant, V. (1996). Prediction of outcome in critically ill patients using artificial neural network synthesized by genetic algorithm. Lancet, 347, 1146–1150.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. London: Chapman and Hall.
Hand, D. J. (1997). Construction and assessment of classification rules. Chichester: Wiley.
Hand, D. J., & Henley, W. E. (1997). Statistical classification methods in consumer credit scoring: A review. Journal of the Royal Statistical Society, Series A, 160(3), 523–541.
Hand, D. J., & Jacka, S. D. (1998). Statistics in finance. London: Arnold.
Koziol, J. A., Adams, H. P., & Frutos, A. (1996). Discrimination between the ulcerous and the nonulcerous forms of interstitial cystitis by noninvasive findings. Journal of Urology, 155(1), 89–90.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural networks and statistical classification. New York: Ellis Horwood.
Page, M. P. A., Howard, R. J., O'Brien, J. T., Buxton-Thomas, M. S., & Pickering, A. D. (1996). Use of neural networks in brain SPECT to diagnose Alzheimer's disease. Journal of Nuclear Medicine, 37(2), 195–200.

Received August 24, 1998; accepted March 1, 1999.
LETTER
Communicated by Alexandre Pouget
Do Simple Cells in Primary Visual Cortex Form a Tight Frame? Emilio Salinas Instituto de Fisiología Celular, UNAM, Ciudad Universitaria S/N, 04510 México D.F., México L. F. Abbott Volen Center for Complex Systems and Department of Biology, Brandeis University, Waltham, MA 02254-9110, U.S.A.

Sets of neuronal tuning curves, which describe the responses of neurons as functions of a stimulus, can serve as a basis for approximating other functions of stimulus parameters. In a function-approximating network, synaptic weights determined by a correlation-based Hebbian rule are closely related to the coefficients that result when a function is expanded in an orthogonal basis. Although neuronal tuning curves typically are not orthogonal functions, the relationship between function approximation and correlation-based synaptic weights can be retained if the tuning curves satisfy the conditions of a tight frame. We examine whether the spatial receptive fields of simple cells in cat and monkey primary visual cortex (V1) form a tight frame, allowing them to serve as a basis for constructing more complicated extrastriate receptive fields using correlation-based synaptic weights. Our calculations show that the set of V1 simple cell receptive fields is not tight enough to account for the acuity observed psychophysically.

1 Introduction
There are a number of parallels between representation and learning in neural networks and methods of functional expansion in mathematical analysis. Populations of neurons with responses tuned to properties of a stimulus may be well suited for approximating functions of stimulus parameters (Poggio, 1990; Poggio & Girosi, 1990; Girosi, Jones, & Poggio, 1995). If certain conditions are satisfied, such a population can act as a highly efficient substrate for function representation. Accurate representation of a given function depends critically on the synaptic weights connecting input and output neurons. These weights are often adjusted with a delta-type learning rule that is efficient but not biologically plausible. However, if the tuning curves of the input neurons form what is called a tight frame (Daubechies, Grossmann, & Meyer, 1986; Daubechies, 1990, 1992), function approximation can be achieved using a more biologically reasonable Hebbian or correlation-based rule for the synaptic weights. We investigate here whether the simple cells of primary visual cortex provide such a basis.

Neural Computation 12, 313–335 (2000)
© 2000 Massachusetts Institute of Technology

Figure 1: A feedforward network for function approximation. The input neurons at the bottom of the figure respond with firing rates that are functions of the stimulus variable x. The firing rate of input neuron j is f_j(x). The upper output neuron is driven by the input neurons through synaptic connections; w_j represents the strength, or weight, of the synapse from input neuron j to the output neuron. The task is to make the firing rate R of the output neuron approximate a given function F(x).

To introduce the hypothesis to be tested, we begin with a simple example. Consider a population of N neurons that respond to a stimulus by firing at rates that depend on a single stimulus parameter x. The firing rate of neuron j as a function of x, also known as the response tuning curve, is f_j(x). We wish to examine whether such a population can generate an approximation of another function F(x) that is not equal to any of the input tuning curves. Figure 1 shows a simple network for function approximation. In this network, a population of neurons acting as an input layer drives a single output neuron through a set of synapses. The target function F is represented by the firing rate of the output neuron. The weight of the synapse connecting input neuron j to the output neuron is denoted by w_j. Following standard neural network modeling practice, we write the firing rate R of the output neuron in response to the input x as a synaptically weighted sum of the input firing rates,

R = \sum_{j=1}^{N} w_j f_j(x).    (1.1)
If R = F(x), or at least R ≈ F(x) to a sufficiently high degree of accuracy, the tuning curve of the output neuron will provide a representation of the target function. The problem is to find a set of input tuning curves, and the synaptic weights corresponding to these tuning curves, that produces an output response close to the target function. The expansion of a function as a weighted sum of other functions is a standard problem in mathematical analysis. In the limit N → ∞, any set
of functions that provides a complete basis over the range of x values considered can be used. Completeness is not very restrictive, but an additional condition is usually applied. The weights w_j that make

F(x) = \sum_{j=1}^{\infty} w_j f_j(x)    (1.2)

can be computed easily if the basis functions f_j are orthogonal and normalized so that

\int dy\, f_i(y) f_j(y) = \delta_{ij},    (1.3)

where δ_ii = 1 for all i and δ_ij = 0 for all i ≠ j. In this case the weights are given by

w_j = \int dy\, f_j(y) F(y).    (1.4)

Equation 1.4 for the weights is closely related to forms of Hebbian and correlation-based learning commonly used in neural networks. A correlation-based rule sets the synaptic weights to values given by

w_j = \langle f_j R \rangle,    (1.5)

where the brackets indicate an average over stimuli. Equation 1.5 is equivalent to the statement that the synaptic weight is proportional to the correlation of the firing rates of the pre- and postsynaptic neurons. The idea that synaptic weights are related to correlations in pre- and postsynaptic firing is a standard hypothesis going back to the work of Hebb (1949). Often this principle is expressed in the form of a learning rule that adjusts the weights incrementally toward their desired values during a training period. We can also think of equation 1.5 as a consistency condition required for the maintenance of weights at their correct values. Suppose that a network has been set up to approximate a given function, so that R = F(x) to a high degree of accuracy. In order for the network to function properly over an extended period of time, some mechanism must maintain the weights at values that produce this desired output. If synaptic weights are maintained by a process that uses a correlation-based rule, condition 1.5 can be interpreted as a consistency condition on maintainable weights. In this interpretation, the Hebbian rule is applied continuously; there is no distinct training period. The average over stimuli in equation 1.5 is then over all stimuli typically encountered, and the output firing rate R can be approximated by the function F(x). Thus, the consistency condition is

w_j = \langle f_j(x) F(x) \rangle.    (1.6)
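Equations 1.2 through 1.4 can be checked numerically. The sketch below is our own illustration (the orthonormal cosine basis on [0, 2π] and the target function are arbitrary choices, not from the text): it computes the weights of equation 1.4 by quadrature and reconstructs the target through the weighted sum of equation 1.1.

```python
import numpy as np

def trapezoid(f, x):
    # simple trapezoidal quadrature
    return float(np.sum((f[1:] + f[:-1]) * 0.5 * np.diff(x)))

x = np.linspace(0.0, 2.0 * np.pi, 4001)
F = np.exp(np.cos(x))  # target function F(x): an arbitrary smooth, even choice

# Orthonormal cosine basis on [0, 2*pi]: f_0 = 1/sqrt(2*pi), f_j = cos(j x)/sqrt(pi)
basis = [np.full_like(x, 1.0 / np.sqrt(2.0 * np.pi))]
basis += [np.cos(j * x) / np.sqrt(np.pi) for j in range(1, 16)]

# Weights from equation 1.4: w_j = integral dy f_j(y) F(y)
w = [trapezoid(f * F, x) for f in basis]

# Output rate from equation 1.1: R = sum_j w_j f_j(x)
R = sum(wj * f for wj, f in zip(w, basis))
```

With an orthogonal basis the reconstruction error is negligible; with overlapping, non-orthogonal tuning curves the same weights would not reconstruct F exactly, which is the issue taken up in the text.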
If all stimulus values x occur with equal probability, the average over stimuli is proportional to an integral over x, and the weights given by equations 1.4 and 1.6 are proportional to each other. Thus, the Hebbian consistency condition is equivalent to the formula for the coefficients of an orthogonal expansion. The relationship between equation 1.4 and correlation-based synaptic weights suggests a close relationship between Hebbian ideas about synaptic plasticity and standard mathematical approaches to functional expansion. Unfortunately, a direct link of this type is inappropriate because most sets of neuronal tuning curves are highly overlapping and far from orthogonal. However, all is not lost. What happens if the values of the weights given by equation 1.4 are used in equation 1.1 even when the input tuning functions are not orthogonal? This question can be answered simply by substituting these weights into equation 1.1:

R = \sum_{j=1}^{N} \int dy\, f_j(y) F(y) f_j(x).    (1.7)
If we interchange the order of summation and integration and define

D(x, y) \equiv \sum_{j=1}^{N} f_j(x) f_j(y),    (1.8)
we find that the output firing rate is a convolution of the target function with D,

R = \int dy\, D(x, y) F(y).    (1.9)

By computing the D function for a given set of input tuning curves, we can determine how accurately a set of neurons can represent a function using synaptic weights given by equation 1.4. The output firing rate in equation 1.9 will be exactly equal to the target function if D(x, y) = δ(x − y), since

R = \int dy\, \delta(x - y) F(y) = F(x),    (1.10)

by the definition of the Dirac δ function. The condition D(x, y) = δ(x − y) is less restrictive than the orthogonality condition. A set of functions f_j that satisfy D(x, y) = δ(x − y) is called a tight frame (Daubechies et al., 1986; Daubechies, 1990, 1992). A finite number of well-behaved tuning curves can never generate a true δ function in equation 1.8, but they can create a good approximation to one. If the integral of D(x, y) over y is one for all values of x and if
D(x, y) is significantly different from zero only for values of y near x, the integral in equation 1.9 is approximately equal to F(x), provided that F varies more slowly than D. Thus, finite sets of tuning curves can act as approximate tight frames. The importance of D(x, y) is that, as a function of y, its width determines the resolution with which an output function can be approximated using a correlation rule like equation 1.6 for the synaptic strengths. For this reason, it is interesting to examine D functions for real neuronal tuning curves. The D function provides a useful characterization of the computational capabilities of a population of neurons for function expansion.

To apply ideas about tight frames to real neural circuits, we must find areas where function approximation may be taking place. Primary visual cortex seems to be a good candidate. Neurons in primary visual cortex provide the first cortical representation of the visual world, and their receptive fields provide a basis set from which the more complicated receptive fields of extrastriate neurons are constructed. If the tuning curves of simple cells form a tight frame, the construction of more complicated representations using an architecture like that in Figure 1 can be based on Hebbian synapses whose efficacy is maintained by correlation-based rules of the form given by equation 1.6. To investigate these ideas, we will compute the D function (and a related function K) for the receptive fields of simple cells in the primary visual cortices of cats and monkeys. This will allow us to determine how well the receptive field functions of simple cells collectively approximate a tight frame. Lee (1996) has explored the conditions under which sets of filters with properties like those of simple cells satisfy the mathematical definition of a tight frame.
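The approximate-tight-frame property is easy to demonstrate with a toy population. In the sketch below (our own illustration, with arbitrary parameters rather than measured ones), overlapping one-dimensional Gaussian tuning curves with densely spaced centers produce a D(x, y) that is sharply peaked at y = x, so the smeared reconstruction of equation 1.9 stays close to a slowly varying target:

```python
import numpy as np

sigma, spacing = 0.1, 0.02                 # tuning width and center spacing (toy values)
centers = np.arange(-1.0, 2.0 * np.pi + 1.0, spacing)
y = np.linspace(0.0, 2.0 * np.pi, 2001)
F = np.sin(y)                              # slowly varying target function

def tuning(v):
    # Gaussian tuning curves f_j evaluated at points v; rows index the neuron j
    return np.exp(-np.subtract.outer(centers, v) ** 2 / (2.0 * sigma ** 2))

x_test = np.pi / 2.0
# D(x, y) = sum_j f_j(x) f_j(y), equation 1.8, evaluated for one x and all y
Dx = tuning(np.array([x_test]))[:, 0] @ tuning(y)

def trapezoid(f, t):
    return float(np.sum((f[1:] + f[:-1]) * 0.5 * np.diff(t)))

# Equation 1.9, with D normalized to integrate to one over y
R = trapezoid(Dx * F, y) / trapezoid(Dx, y)
```

Here D(x, y) is approximately a Gaussian of width σ√2 in x − y, so a target varying on scales much larger than σ is reconstructed with only about one percent attenuation; narrower tuning curves would sharpen D further.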
We will use the experimentally measured filter distributions to determine explicitly the resolution with which simple cells can generate arbitrary visual filters for downstream neurons using correlation-based synaptic weights.

2 Function Approximation by Simple Cells
Representation of the visual world is more complicated than the simplified scenario described in section 1, so we start by recasting the basic problem in terms that are appropriate for this application. Simple cell responses can be expressed approximately through spatiotemporal linear filters acting on visual image intensity, but we will focus exclusively on the spatial aspects of these responses. If I(x) is the luminance (relative to background) of an image that appears at the point x on the retina, where x is a two-dimensional vector, the firing rate of simple cell j is approximated as

r_j = g \int dx\, f_j(x) I(x),    (2.1)
where g is an overall scale factor determining the magnitude of the firing rate. Because this expression can be positive or negative, we must interpret it as the difference between the firing rates of two neurons having receptive field filters +f_j and −f_j. The filter functions f_j for simple cells are often fit using plane waves multiplied by gaussian envelopes, known as Gabor functions (Daugman, 1985; Jones & Palmer, 1987; Webster & De Valois, 1985),

f_j(x) = \frac{1}{2\pi\sigma_j^2} \exp\left( -\frac{|x - a_j|^2}{2\sigma_j^2} \right) \cos(k_j \cdot x - \varphi_j).    (2.2)

These filters are parameterized by the receptive field center a, receptive field envelope width σ, preferred spatial frequency k ≡ |k|, preferred orientation θ (the direction perpendicular to k), and spatial phase φ. For simplicity, we restrict ourselves to circular rather than elliptical receptive field envelopes. We have verified that for values in the physiologically measured range, elliptical receptive fields give essentially the same results reported below for circular receptive fields. We have included a factor of 1/σ² in the normalization of the Gabor function so that neurons with different filters give similar maximum firing rates when the optimum images are presented with a given overall amplitude.

In the case of visual responses, the filter functions fill the role that the tuning curves played in the example given in section 1. Similarly, the target function we wish to represent in this case is a filter function. In other words, we attempt to construct a downstream neuron that is selective for some target function or target image F(x). The target response is given by

R = G \int dz\, F(z) I(z),    (2.3)

with G the scale factor for the output rate. For fixed image power, the maximum value of R is evoked when I is equal to the target function F. This output neuron is driven, through synaptic weights, by the responses of a population of simple cells, so that

R = \sum_{j=1}^{N} w_j r_j = g \sum_{j=1}^{N} w_j \int dx\, f_j(x) I(x),    (2.4)

where we have used equation 2.1. Again we impose the condition that synaptic weights are equal to the average product of pre- and postsynaptic firing rates, assuming that the postsynaptic neuron fires as prescribed by equation 2.3. The weights are then given by

w_j = \langle r_j R \rangle = Gg \int dy\, dz\, f_j(y) F(z) \langle I(y) I(z) \rangle.    (2.5)
In this case the average over stimuli, denoted by angle brackets, is an average over the ensemble of visual images. Substituting these weights into equation 2.4, we find

R = Gg^2 \int dx\, dz\, K(x, z) F(z) I(x),    (2.6)

with

K(x, z) = \int dy\, D(x, y) \langle I(y) I(z) \rangle    (2.7)

and

D(x, y) = \sum_{j=1}^{N} f_j(x) f_j(y).    (2.8)
Equation 2.6 tells us that the firing rate of the output neuron is proportional to a filtered integral of the image,

R = Gg^2 \int dx\, F_{eff}(x) I(x).    (2.9)

The effective filter F_{eff} is a smeared version of the desired filter F and is given by

F_{eff}(x) = \int dy\, K(x, y) F(y).    (2.10)
The condition for perfect function representation is then K(x, y) = δ(x − y), and the resolution with which a given filter function can be represented is related to the degree to which K approximates a δ function. To determine how well simple cells approximate a tight frame, we first compute both the D and K functions. An interesting aspect of the computation of D from equation 2.8 is that it involves not only the form of the receptive field filters of simple cells, but also the distribution of their parameters across the population of cortical neurons. Once D is determined, we compute K from equation 2.7 by including the correlation function ⟨I(y) I(z)⟩ for natural images (Field, 1987; Ruderman & Bialek, 1994; Dong & Atick, 1995). Then we discuss whether D and K behave like a δ function.

3 Calculation of D(x, y)
Equation 2.8, which defines D, is a sum over simple cell filters. To compute it, we reparameterize the filters 2.2 in terms of quantities whose distributions
have been measured. We then approximate the sum over cells by integrals over those parameters using mathematical fits to reported distributions. To simplify the calculation, we group the simple cell filters into pairs with spatial phases 90 degrees apart (quadrature pairs), corresponding to sine and cosine terms. Each pair of filters can then be described by a single complex exponential filter g_j,

g_j(x) = \frac{1}{2\pi\sigma_j^2} \exp\left( -\frac{|x - a_j|^2}{2\sigma_j^2} \right) \exp\left( i (k_j \cdot x - \varphi_j) \right),    (3.1)

where i ≡ √−1. Neurophysiological experiments have found pairs of adjacent simple cells roughly 90 degrees out of phase with each other (Pollen & Ronner, 1981), and we assume that combinations as in equation 3.1 can be formed for all cells. Using complex exponential filters, the sum that defines D becomes

D(x, y) = \sum_{j=1}^{N} g_j(x) g_j^*(y) = \sum_{j=1}^{N} \frac{1}{4\pi^2\sigma_j^4} \exp\left( -\frac{|x - a_j|^2 + |y - a_j|^2}{2\sigma_j^2} \right) \exp\left( i k_j \cdot (x - y) \right),    (3.2)
where g_j^* denotes the complex conjugate of g_j. Note that the spatial phase cancels out in the expression for D.

Simple cell receptive field filters are characterized by location, envelope width, and preferred spatial wavelength, phase, and orientation. The envelope width and preferred spatial frequency are correlated; larger receptive fields tend to have lower preferred spatial frequencies. To eliminate this dependence, we will characterize the receptive fields by bandwidth rather than by envelope width. The bandwidth is the width of the spatial frequency tuning curve measured in octaves. It is defined as b ≡ log₂(v₂/v₁), where v₂ and v₁ are the higher and lower spatial frequencies, respectively, at which the neuron gives half of its maximum response. The bandwidth is related to the envelope width σ and the preferred frequency k by

\sigma = \frac{\sqrt{2\ln(2)}}{k} \cdot \frac{2^b + 1}{2^b - 1}.    (3.3)

We use b and k as independent parameters, with the tuning curve width σ given by equation 3.3. To compute D, we use an integral over receptive field parameters as an approximation to the sum over neurons, making the replacement

\sum_{j=1}^{N} \to \int da\, dk\, db\, d\theta\, d\varphi\, \rho_a(a) \rho_k(k) \rho_b(b) \rho_\theta(\theta) \rho_\varphi(\varphi),    (3.4)
where the different ρ functions are the neuronal coverage densities (the number of neurons per unit parameter range) for the different receptive field parameters. The validity of this approximation is addressed at the end of this section. Preferred orientations and spatial phases appear to be uniformly distributed over neurons in primary visual cortex (Hubel & Wiesel, 1974; Field & Tolhurst, 1986; Bartfeld & Grinvald, 1992; Bonhoeffer & Grinvald, 1993), so we assume that ρ_θ(θ) and ρ_φ(φ) are constants. Over a small region of the cortex, ρ_a(a) can also be approximated as a constant. This means that we are considering a small enough region of the visual field so that the dependence of the cortical magnification factor on eccentricity can be ignored. Since the D function we compute will be quite narrow, this is a reasonable approximation. With these assumptions, we find

D(x, y) = \int da\, dk\, db\, d\theta\, d\varphi\, \rho_k(k) \rho_b(b) \frac{c}{\sigma^4} \exp\left( -\frac{|x - a|^2 + |y - a|^2}{2\sigma^2} \right) \exp\left( i k |x - y| \cos(\theta) \right).    (3.5)

In this and all the following expressions, c denotes a constant that absorbs all of the factors that affect the amplitude but not the functional form of D. To simplify the above expression, we have made a shift in the definition of the angle θ, which is allowed because the θ integral is over all values. The integrals over a and φ in equation 3.5 can be done immediately to obtain

D(|x - y|) = \int dk\, db\, d\theta\, \rho_k(k) \rho_b(b) \frac{c}{\sigma^2} \exp\left( -\frac{|x - y|^2}{4\sigma^2} \right) \exp\left( i k |x - y| \cos(\theta) \right).    (3.6)
Note that D depends on only the magnitude of x − y. To proceed further, we need to determine the density functions ρ_k(k) and ρ_b(b). We have done this by fitting three different data sets. We first performed fits on the data reported by Movshon, Thompson, and Tolhurst (1978) for neurons with 5 degree eccentricity or less in cat area 17. The data and fits are shown in Figure 2 (top). The mathematical fits are given by

\rho_b(b) = \frac{16.6}{\sqrt{2\pi}\, 0.44} \exp\left( -\frac{(b - 1.39)^2}{2(0.44)^2} \right)    (3.7)

and

\rho_k(k) = \frac{1.7\, k^2}{(1 + k/5.2)^5},    (3.8)
Figure 2: Distributions of optimal spatial frequencies (left) and bandwidths (right) for simple cells. The black circles correspond to experimental data, and the solid curves are mathematical fits. (Top) Cat area 17 simple cells with eccentricities smaller than 5 degrees. Data from Movshon et al. (1978). (Middle) Monkey V1 simple cells within 1.5 degrees of the fovea. Data from De Valois et al. (1982). (Bottom) Monkey V1 simple cells with eccentricities from 3 to 5 degrees (parafoveal sample). Data from De Valois et al. (1982). Notice the different scales in the frequency axes of cat and monkey data. Parameters in the fitting equations were found by minimizing the χ² statistic (Press et al., 1992). All fits had P > 0.3, except for the preferred frequency distribution in the monkey parafoveal sample (lower left), which had P > 0.01.
and we assume that ρ_k(k) = 0 for k > 20 radians per degree. We also fit two data sets of De Valois, Albrecht, and Thorell (1982) from area V1 in macaque monkeys. These correspond to a region from 0 to 1.5 degrees eccentricity
(foveal) and another from 3 to 5 degrees eccentricity (parafoveal). The fits for the region closest to the fovea are shown in Figure 2 (middle), and are given by

\rho_b(b) = \frac{31.7}{\sqrt{2\pi}\, 0.62} \exp\left( -\frac{(b - 1.49)^2}{2(0.62)^2} \right)    (3.9)

and

\rho_k(k) = \frac{0.17\, k^2}{(1 + k/22.2)^5}.    (3.10)
We assume that ρ_k(k) = 0 for k > 90 radians per degree. The fits for the parafoveal region are shown in Figure 2 (bottom) and are given by

\rho_b(b) = \frac{16.2}{\sqrt{2\pi}\, 0.49} \exp\left( -\frac{(b - 1.3)^2}{2(0.49)^2} \right)    (3.11)

and

\rho_k(k) = \frac{0.22\, k^2}{(1 + k/14.6)^5}.    (3.12)
Here we assume that ρ_k(k) = 0 for k > 50 radians per degree. The distributions we have extracted exhibit an interesting property: the bandwidth distributions from all three data sets are very similar, and the preferred spatial frequency distributions are scaled versions of each other to a high degree of accuracy. Because the receptive field width σ is inversely proportional to the preferred spatial frequency k (see equation 3.3), the integrand in equation 3.6 depends on |x − y| only through the combination k|x − y|. This means that scaling the distribution ρ_k(k) by a factor s will have the effect of scaling the argument of the resulting D function to give D(s|x − y|). As a result, we can compute the D function for any one of the distribution pairs and determine D for the others simply by scaling the spatial argument of the function. We have verified through numerical integration that the three bandwidth distributions are equivalent and that this approximation is practically exact (compare Figures 3A and 4A). We will use the experimental results for cells nearest to the fovea in the monkey. The D function we compute and show in the figures corresponds to this case, but the results for the parafoveal region of the monkey can be obtained by broadening these curves by a factor of 1.52, and for the cat data with a broadening factor of 4.27. Results for any eccentricity in either animal can similarly be obtained by scaling the curves by appropriate factors.
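The numerical integration of equation 3.6 with the foveal monkey densities (equations 3.9 and 3.10, with the cutoff at 90 radians per degree) can be sketched as follows. The grid sizes and the quadrature used for the Bessel function are our own choices, not the authors':

```python
import numpy as np

b = np.linspace(0.05, 4.0, 200)          # bandwidth grid (octaves)
k = np.linspace(0.1, 90.0, 600)          # preferred-frequency grid (radians/degree)
theta = np.linspace(0.0, np.pi, 201)

# sigma = s(b)/k, from equation 3.3
s = np.sqrt(2.0 * np.log(2.0)) * (2.0 ** b + 1.0) / (2.0 ** b - 1.0)
rho_b = 31.7 / (np.sqrt(2.0 * np.pi) * 0.62) * np.exp(-(b - 1.49) ** 2 / (2.0 * 0.62 ** 2))
rho_k = 0.17 * k ** 2 / (1.0 + k / 22.2) ** 5    # equation 3.10; k > 90 cut off by the grid

def j0(z):
    # J0(z) = (1/pi) * integral_0^pi cos(z cos(theta)) d(theta), trapezoidal rule
    dtheta = theta[1] - theta[0]
    c = np.cos(np.outer(z, np.cos(theta)))
    return (np.sum(c, axis=1) - 0.5 * (c[:, 0] + c[:, -1])) * dtheta / np.pi

def D(r):
    # Equation 3.6: the theta integral of exp(i k r cos(theta)) gives 2*pi*J0(k*r),
    # and 1/sigma^2 = (k / s(b))^2; constant prefactors drop out after normalization.
    env = np.exp(-(r * k[None, :]) ** 2 / (4.0 * s[:, None] ** 2))
    grid = (rho_b[:, None] / s[:, None] ** 2) * (rho_k * k ** 2 * j0(r * k))[None, :] * env
    return float(np.sum(grid))           # uniform grids, so a plain sum suffices

rs = np.linspace(0.0, 0.2, 201)
vals = np.array([D(r) for r in rs])
vals /= vals[0]                          # normalize D to a maximum of one
half = int(np.argmax(vals < 0.5))        # first radius where D falls below half its peak
fwhm = 2.0 * rs[half]
```

With these settings the full width at half-maximum should come out in the vicinity of the 0.06 degree quoted for Figure 3A; the parafoveal and cat widths follow from the broadening factors given above.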
Emilio Salinas and L. F. Abbott
Figure 3: D function for monkey simple cells. All D curves shown are normalized to a maximum height of one. (A) Result when the experimentally measured neuronal distributions, equations 3.9 and 3.10, are used in the calculation of D. The filled circles correspond to numerical integration of equation 3.6 for different values of |x − y|; the continuous line is a fit to the numerical results using a combination of three gaussian functions (see equation 3.13). The width of the curve at half maximum height, which determines the resolution for function approximation, is 0.06 degree. (B) Results obtained when the distributions of bandwidths and preferred frequencies are manipulated. The dotted line is the D function when only neurons near the peaks of both distributions are used; the width at half-height is 0.13 degree. The continuous line is the result when only neurons with optimal frequencies lower than 50 radians per degree are included in the calculation; the width at half-height is 0.086 degree. The dashed line is the result when only neurons with optimal frequencies higher than 50 radians per degree are included; the width at half-height is 0.042 degree. In these last two cases the full bandwidth distribution was used.
Substituting equations 3.3, 3.9, and 3.10 into 3.6, the triple integral can be computed numerically for different values of |x − y| to yield the function D. The result is shown in Figure 3A, where D has been normalized to a maximum value of one. The continuous line is a fit to the numerical result, indicated by dots. The fit was obtained by combining three gaussians,

D_{fit}(|x - y|) = 0.8590 \exp\left(-\frac{|x - y|^2}{2(0.024)^2}\right) - 0.2579 \exp\left(-\frac{|x - y|^2}{2(0.085)^2}\right) + 0.3590 \exp\left(-\frac{|x - y|^2}{2(0.059)^2}\right).   (3.13)
The figure shows that the full width at half-maximum height is 0.06 degrees. It also shows that the sidebands around the central peak are very small; the function does not oscillate. This suggests that D can approximate a δ function to a resolution of roughly 4 minutes of arc.
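As a check on these numbers, the fit of equation 3.13 can be evaluated directly and its full width at half maximum located on a fine grid; a short sketch (Python), which recovers a width close to the 0.06 degree quoted above:

```python
import numpy as np

def D_fit(r):
    # Three-gaussian fit to the D function, equation 3.13
    return (0.8590 * np.exp(-r**2 / (2 * 0.024**2))
            - 0.2579 * np.exp(-r**2 / (2 * 0.085**2))
            + 0.3590 * np.exp(-r**2 / (2 * 0.059**2)))

r = np.linspace(0.0, 0.5, 50001)    # degrees
d = D_fit(r)
half = d[0] / 2.0                   # D_fit peaks at r = 0 (value 0.9601)
# first grid point where the curve falls below half maximum
r_half = r[np.argmax(d < half)]
fwhm = 2.0 * r_half                 # full width from the half-width

print(round(fwhm, 3))   # prints 0.059, consistent with the 0.06 degree quoted
```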
Simple Cells and Tight Frames
Figure 4: D function for simple cells and retinal ganglion cells of the cat. Both curves are normalized to a maximum height of one. (A) The D function for simple cells of the cat primary visual cortex obtained by numerical integration of equation 3.6. This curve is practically indistinguishable from the result in Figure 3A scaled by a factor of 4.27. For comparison, the x-axes in this figure have been scaled precisely by this amount. The full width at half maximum height is 0.26 degree. (B) The D function for a uniformly distributed population of difference of gaussian filters resembling those of retinal ganglion cells (see the appendix). The full width at half maximum height is 0.5 degree.
In Figure 3B we examine how the D function depends on the distributions of preferred spatial frequencies and bandwidths used in the calculation. First, we investigate what happens if only neurons with parameters near the peaks of the distributions in Figure 2 are considered. We include bandwidths ranging between 1.4 and 1.6 octaves and optimal frequencies ranging between 21 and 23 radians per degree, with no neurons outside these ranges. The D function that results is shown in Figure 3B (dotted line). It has large inhibitory sidebands, and its width at half-height is more than twice the width of the original function. When the full bandwidth distribution is used but the same restricted range of preferred frequencies is included, the result is similar. The neurons at the tail of the preferred frequency distribution are important in setting the width of D, even though proportionally they represent a small fraction of the total. When the distribution is truncated, allowing only optimal frequencies below 50 radians per degree, the resulting D function has a width of 0.086 degrees. Conversely, when only optimal frequencies above 50 radians per degree are included, the resulting D function has a width of 0.042 degrees. Thus, the resolution of the full set of simple cells seems to be limited by the highest frequencies included. The 1/σ² factor that appears in equation 2.2 is partly responsible for this effect.

It is interesting to compare the D function of simple cells with that of neurons at earlier stages in the visual pathway. In the case of retinal ganglion cells, the computation is straightforward because their filters can be described by differences of gaussians (see the appendix). The D functions for simple cells and retinal ganglion cells corresponding to data collected in cats are shown in Figure 4. The full widths at half-maximum height are 0.26 and 0.5 degree, respectively. The resolution of simple cells for function approximation in the same animal is better, roughly by a factor of 2. This is related to the fact that some simple cells have high preferred spatial frequencies or, equivalently, filters that vary more rapidly than those of retinal ganglion cells.

Because the integral in equation 3.6 was computed numerically using the trapezoidal rule (Press, Flannery, Teukolsky, & Vetterling, 1992), the D function was actually computed as a sum. This is appropriate since the biological system involves individual neurons, not a continuum. The number of terms in the sum is proportional to the number of neurons with receptive fields overlapping the same spatial location. Examining how the number of terms affects the sum reveals how many neurons are needed to obtain a smooth D function. The results shown in Figure 3A were obtained with 30 steps in orientation θ (in the range 0–2π), 50 steps in preferred frequency k (in the range 0–90 radians per degree), and 50 steps in bandwidth b (in the range 0.1–3.0 octaves). With a much coarser grid of 8 orientations, 10 frequencies, and a single bandwidth (same ranges), the resulting curve was almost identical to the one in Figure 3A. In general, the distribution of bandwidths had little impact on the results. Hence, for a given point in visual space, sampling at 8 orientations and 10 frequencies per orientation is enough to construct a D function with a 0.06 degree resolution. Are the neurophysiological data consistent with these requirements?
Within the dimensions of a hypercolumn, preferred orientation changes smoothly and uniformly throughout 180 degrees (Hubel & Wiesel, 1974; Bartfeld & Grinvald, 1992; Bonhoeffer & Grinvald, 1993). In fact, in orientation maps obtained with optical recordings, 8 orientation bins in the 0–180 degree range can easily be distinguished by eye (Bartfeld & Grinvald, 1992; Bonhoeffer & Grinvald, 1993). Thus, sampling in orientation space is not a limitation. On the other hand, the 10 frequency steps do not correspond to 10 neurons, for two reasons. First, each complex filter represents a sine-cosine pair where each element is the difference between +φ_j and −φ_j, which means 4 neurons per filter. Second, ρ_k(k) is not uniform. To get at least 4 complex filters with optimal frequencies inside the bin near 90 radians per degree, the 10 frequency bins must involve a total of about 300 filters, or 1200 neurons. This means that for each point in space, at least 1200 neurons are needed in each of the orientation bins to achieve the resolution implied by the D function. De Valois et al. (1982) estimated that in monkeys, on the order of 100,000 neurons respond to stimuli at a given location in visual space. Assuming that half of those are simple cells and that frequencies are uniformly distributed in the 8 orientation bins (De Valois et al., 1982; Field & Tolhurst, 1986; Hübener, Shoham, Grinvald, & Bonhoeffer, 1997) leaves about 6000 neurons per orientation bin. This number, although only a rough figure, suggests that the sampling density in spatial frequency is sufficient to achieve the resolution implied by the D functions that we have computed.

4 Calculation of K(x, y)
To compute the function K from D, we need to include the two-point correlation function of ensembles of natural images, Q ≡ ⟨I(x)I(y)⟩. This function has been characterized in the frequency domain (Field, 1987; Ruderman & Bialek, 1994; Dong & Atick, 1995), so we will use its Fourier transform Q̃ to compute the convolution 2.7 through which the K function is defined (Bracewell, 1965). Using a minimal set of assumptions, Dong & Atick (1995) derived an analytic expression for the full spatiotemporal power spectrum of natural images and fit it to data from movies and video recordings. Taking k as the spatial frequency, in radians per degree, and ω as the temporal frequency, in radians per second, they proposed that

\tilde{Q}(k, \omega) = \frac{A}{k^{m-1} \omega^2} \int_{r_1 \omega / ck}^{r_2 \omega / ck} dv\, v P(v),   (4.1)
where A is a constant that absorbs the amplitude factors, c = 360/2π, m has been measured from snapshot images to be about 2.3, r_1 and r_2 are, respectively, the minimum and maximum viewing distances at which objects are resolved, and P(v) is the probability distribution of velocities of objects that are seen. Dong & Atick (1995) suggested that

P(v) \sim \frac{1}{(v + v_0)^n},   (4.2)
which fit their two data sets extremely well using n = 3.7, v_0 = 1.02 meters per second, r_1 equal to 2 or 4 meters, and r_2 either 23 or 40 meters (depending on the data set). The parameter v_0 is related to the mean speed by \bar{v} = v_0 / (n - 2). We will use the above expression in our calculation of K, but to investigate whether the particular choice for P(v) is critical to our results, we also consider another distribution with a different shape:

P(v) \sim v \exp(-av).   (4.3)
Because we are interested only in the spatial component of Q̃(k, ω), we need to average over ω, or consider it as an additional parameter. Given typical ocular movements, a reasonable choice is to set ω equal to the mean saccade frequency, about 4 per second, or ω = 8π radians per second. We will use this value and consider Q̃ as a function of k only.
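With these choices, equation 4.1 together with the velocity distribution 4.2 reduces to a one-dimensional integral for each spatial frequency k. A minimal numerical sketch (Python with SciPy; the overall constant A is set to 1 since only the shape of Q̃ matters here, and r_1 = 2 m, r_2 = 23 m are the values from one of the two data sets):

```python
import numpy as np
from scipy.integrate import quad

# Parameters from Dong & Atick (1995) as quoted in the text
n, m = 3.7, 2.3
v0 = 1.02               # m/s
r1, r2 = 2.0, 23.0      # m, viewing-distance range (one data set)
c = 360.0 / (2.0 * np.pi)
omega = 8.0 * np.pi     # rad/s, mean saccade frequency

def P(v):
    # Velocity distribution, equation 4.2 (unnormalized)
    return 1.0 / (v + v0)**n

def Q_tilde(k):
    # Spatial power spectrum at fixed omega, equation 4.1 with A = 1
    lo, hi = r1 * omega / (c * k), r2 * omega / (c * k)
    integral, _ = quad(lambda v: v * P(v), lo, hi)
    return integral / (k**(m - 1) * omega**2)

ks = np.array([5.0, 20.0, 80.0])            # radians per degree
qs = np.array([Q_tilde(k) for k in ks])
assert np.all(qs > 0) and np.all(np.diff(qs) < 0)   # power falls with k
```

The slow fall-off of this spectrum with k is what broadens K relative to D in the calculation that follows.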
Figure 5: K functions for monkey and cat simple cells. The curves shown correspond to numerical integration of equation 4.5 for different values of |x − y|. (A) Results for monkey simple cells based on the D_fit function of equation 3.13 and shown in Figure 3A. The thick line is the result when the following parameters were used in equations 4.1 and 4.2: n = 3.7, m = 2.3, r_1 = 0.2 m, r_2 = 100 m, v_0 = 1.02 m/s, ω = 8π rad/s. The full width at half height is 0.112 degree. The thin lines were obtained with the same parameters, except ω = 32π (narrower curve; width at half-height is 0.105 degree) and ω = 2π (wider curve; width at half-height is 0.135 degree). (B) Results for cat simple cells also based on D_fit in equation 3.13, but scaled by a factor of 4.27. The x-axis in this figure was scaled by the same amount. The width at half-height is 0.45 degree.
We first compute the Fourier transform of K. This is equal to the product of the Fourier transforms of D and Q, since K is defined by a convolution (Bracewell, 1965; Press et al., 1992). We actually use the Fourier transform of D_fit, which is simply the sum of the Fourier transforms of three gaussians. The Fourier transform of K is K̃ = D̃_fit Q̃. Transforming back to the spatial representation we have

K(x, y) = \frac{1}{4\pi^2} \int d\mathbf{k}\, \exp\left(i \mathbf{k} \cdot (x - y)\right) \tilde{D}_{fit}(k) \tilde{Q}(k).   (4.4)
Writing k in polar coordinates and shifting the angle variable, we obtain

K(|x - y|) = \frac{1}{4\pi^2} \int dk\, d\theta\, k \exp\left(i k |x - y| \cos\theta\right) \tilde{D}_{fit}(k) \tilde{Q}(k).   (4.5)
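The angular integral in equation 4.5 can be done in closed form, ∫dθ exp(ik|x − y| cos θ) = 2π J₀(k|x − y|), reducing the two-dimensional transform to a one-dimensional Hankel transform of order zero. This reduction is easy to verify numerically on a single gaussian, whose radial transform is known exactly (an illustrative check, not the paper's data):

```python
import numpy as np
from scipy.special import j0

# 2D radial Fourier transform via the order-zero Hankel transform:
#   F(k) = 2*pi * integral_0^inf  r J0(k r) f(r) dr.
# For f(r) = exp(-r^2 / (2 sigma^2)) the exact result is
#   F(k) = 2*pi*sigma^2 * exp(-sigma^2 k^2 / 2).
sigma = 0.5
r = np.linspace(0.0, 8.0, 40001)
dr = r[1] - r[0]
f = np.exp(-r**2 / (2.0 * sigma**2))

def hankel0(k):
    # integrand vanishes at both endpoints, so a plain Riemann sum
    # coincides with the trapezoidal rule here
    return 2.0 * np.pi * np.sum(r * j0(k * r) * f) * dr

for k in (0.5, 2.0, 5.0):
    exact = 2.0 * np.pi * sigma**2 * np.exp(-sigma**2 * k**2 / 2.0)
    assert abs(hankel0(k) - exact) < 1e-6
```

The same one-dimensional quadrature, with D̃_fit Q̃ in place of the gaussian, gives the K curves of Figure 5.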
Given values for the parameters of Q̃, the above integral can be computed numerically for each value of |x − y|. Figure 5 shows the results for the data collected in monkeys (see Figure 5A) and cats (see Figure 5B) integrating equation 4.1 using equation 4.2. For the monkey data, equation 3.13 was used for D_fit, and for the cat data this same curve was scaled by a factor of 4.27. Comparison between Figures 5A and 5B shows that the K functions are approximately scaled versions of each other; the K function for cats is just slightly narrower than predicted by scaling. The thick curve in Figure 5A is the result for a standard set of parameters, whose values are listed in the caption of Figure 5. The full width at half-height of the curve is 0.112 degree, which is about twice the width of the original D function. Thus, averaging over images worsens the resolution. The thin curves in the same figure were obtained by multiplying the standard value of ω by 1/4 (wider curve) and by 4 (narrower curve). This has the same effect as multiplying r_1 and r_2 or v_0 by these same factors, because these parameters multiply or divide ω. The thin curves are quite close to the thick one, which means that the parameters in equations 4.1 and 4.2 have relatively little importance in defining the shape and width of K. The precise functional form of P(v) did not seem to be crucial either. When equation 4.3 was used (parameter a was set so that the mean speed was the same as before), the results were indistinguishable from those in Figure 5. It is the tail of Q̃, which goes as 1/f^{m−1}, that makes K broader than D and defines its shape; the detailed form of Q̃ near k = 0, which is where the parameters have an influence, does not seem to matter much.

5 Assessment of Tightness
We have compared the computed D and K kernels to a δ function based only on the width of their central peaks, but the tightness of a set of basis functions can be more easily grasped by plotting the Fourier transform of its associated D function. In this way, frequency ranges in which D̃ is approximately flat can be promptly identified. This is important, because the class of functions that are band-limited to such frequency ranges will be accurately represented by the φ_j basis, that is, the φ_j functions will behave as a tight frame for the class of functions whose Fourier transform is zero outside the identified ranges. Flatness is quantified by the ratio max(D̃)/min(D̃), which is commonly used as a rigorous measure of tightness (Lee, 1996). Figure 6 shows the Fourier transforms of several D and K functions. Because K is related to D through a convolution with the correlation of natural images (see equation 2.7), in all cases K̃ is the product of D̃ times the Fourier transform of the natural image correlation, Q̃. Here D̃ reveals the tightness of the original set of functions, when no averaging over images takes place; in contrast, K̃ reveals the tightness of the set when averaging over multiple images does take place (see section 6). Figures 6E and 6F show an ideal case in which the basis set provides perfect approximation of functions with frequency components below 30 cycles per degree; the associated D̃ function is flat in this range. This corresponds, for example, to a set of wavelets (Daubechies, 1990, 1992). For this set K̃ is equal to Q̃, so Figure 6F simply shows the Fourier transform of the natural image correlation function. This function is not at all flat; although it looks fairly narrow in x space (not shown, but very similar to Figure 5A), its slowly decaying tail makes it a poor approximation to a δ function. Figures 6A and 6B show the results for monkey V1 simple cells (foveal). The K̃ function is very similar to Q̃, suggesting that the ensemble of V1 neurons is not compensating for the average over natural images. On the other hand, D̃ is flat up to about 1 cycle per degree, indicating that simple cell receptive fields provide an excellent approximation of a tight frame for functions with frequency cutoffs no higher than 1 cycle per degree. Even up to 10 cycles per degree, where simple cell responses start to attenuate significantly, D̃ varies only by a factor between 4 and 5, not too far from the factor of 2 or less that corresponds to a standard definition of a tight frame (Lee, 1996). As a comparison, Figures 6C and 6D show the results for retinal ganglion neurons from the cat (see the appendix). In this case K̃ is much flatter than D̃, albeit with a cutoff between 1 and 2 cycles per degree.

Figure 6: Fourier transforms of different D (left) and K (right) functions. In all cases the plots on the right were obtained by multiplying the plots on the left by the Fourier transform of the natural image correlation function, Q̃ (shown in F). The top row shows the results for monkey V1 foveal neurons. (A) Fourier transform of the D function shown in Figure 3A. (B) Fourier transform of the K function shown in Figure 5A. The middle row shows the results for retinal ganglion cells of the cat (see the appendix). (C) Fourier transform of the D function shown in Figure 4B. (D) Fourier transform of the K function associated with retinal ganglion cells of the cat. The bottom row shows an ideal case of a set of basis functions whose associated D function is exactly a δ function. Thus, the K̃ function is equal to Q̃. The Q̃ and D̃ curves were scaled to a maximum of one. Note that in these plots the units of spatial frequency are cycles per degree.

6 Discussion
We have evaluated the capacity of populations of simple cells in primary visual cortex to provide a basis for constructing more complicated receptive fields through Hebbian synaptic weights. There are some similarities between the conditions we have studied, in particular, the suggestion that K is proportional to a δ function, and the work of Atick and Redlich (1990, 1992) concerning ideas proposed by Barlow (1989) regarding redundancy in neuronal representations. However, there are substantial differences between our approach and theirs. Atick and Redlich (1990, 1992) imposed translational invariance so that all of the neurons they considered had the same receptive field structure. Their formulations resulted in consideration of functions such as K, including averages over natural images. Because all the receptive fields were taken to be the same, this approach generated an optimality condition on a single receptive field. Atick and Redlich (1990, 1992) computed several optimal receptive field filters, among them those that solved the equation for K close to a δ function. The case of function approximation that we consider involves cells with different receptive field structures and depends critically on the distribution of their parameters across the cortical population. Our D and K functions characterize the capacity of the ensemble of all simple cells for function approximation; they are not a property of single cells, as in the work of Atick and Redlich. It is interesting that both cases involve averages over natural images and some similar mathematical expressions, but the quantities being considered are very different: the capacity to transmit information in one case and the accuracy for function approximation in the other.

From the D̃ and K̃ curves computed (see Figure 6), we conclude that the set of monkey V1 simple cells approximates a tight frame with a resolution, given by the cutoff in D̃, on the order of 1/10 of a degree. The resolution for cats, obtained by appropriate scaling of D, is about 0.43 degrees. In both cases the D function is more similar to the ideal δ function than the K function, which includes corrections for natural image statistics. Thus, we find no indication that this basis set is counteracting the effects of averaging over many images. In contrast, sets of filters like those of retinal ganglion neurons seem to be better suited for function approximation when the effect of image correlations is included. However, their resolution is rather poor, on the order of 1 degree.

We should stress that these resolutions are meaningful only for synaptic modification processes based on correlational mechanisms. In particular, our results do not imply that simple cells have a higher overall resolution than retinal ganglion cells, an impossibility given that retinal ganglion cells provide the input to simple cells. Our results imply only that simple cells serve as a better basis from which neurons selective for arbitrary images can be constructed on the basis of linear superposition and Hebbian synaptic modification. To understand the difference between using the D and K functions to determine whether simple cell receptive fields form a tight frame, we might imagine two forms of Hebbian learning. For the type of Hebbian learning we discussed in section 1, synaptic adaptation occurs slowly while an ensemble of images is displayed. In this case, the function K sets the resolution for function approximation with Hebbian synapses. Suppose instead that Hebbian modification of synapses occurs rapidly and only when an image equal to the target function appears. In this case the process is fast enough that images different from the target function do not affect the synaptic weights. For this learning procedure, the resolution for function approximation is set by D.

Visual acuity in humans and monkeys, in image discrimination or recognition tasks that involve the activities of large numbers of neurons, is about 1 minute of arc, or 0.017 degree (Westheimer, 1980). This is about six times better than the resolution we have estimated for approximation of an arbitrary function based on D. It might seem odd that our estimation is considerably below psychophysical discrimination thresholds, but our analysis applies specifically to single neurons downstream from V1 that combine their inputs in a perfectly linear fashion.
The discrepancy indicates that the visual system does not rely exclusively on the tightness of the V1 neuronal ensemble to achieve the acuity exhibited at the behavioral level; additional mechanisms beyond linear summation are required. There are two prominent possibilities for this. First, the neural circuitry may be able to exploit some of its nonlinearities to increase accuracy. On the other hand, higher accuracy may also be achieved through a population of neurons. Considering the ubiquity of population codes throughout the cortex, the capacity of V1 simple cells to act as a basis for Hebbian-based function approximation by single postsynaptic neurons may appear substantial, only a factor of 6 below behavioral performance. However, to try to answer the question posed in the title, the frame formed by simple cells in primary visual cortex is tighter than would be expected from a random distribution of Gabor filters, but not tight enough to account for behavioral acuity.

Appendix
Here we compute the D function for retinal ganglion cells (X cells). The firing of these cells is well characterized through a linear filtering operation (afterward rectifying the result to eliminate negative rates), as in equation 2.1. In this case each filter is equal to the difference of two gaussians,

\phi_j(x) = \frac{A_1}{2\pi \sigma_1^2} \exp\left(-\frac{(x - a_j)^2}{2\sigma_1^2}\right) - \frac{A_2}{2\pi \sigma_2^2} \exp\left(-\frac{(x - a_j)^2}{2\sigma_2^2}\right).   (A.1)
We assume that the only parameter that varies across neurons is a, the receptive field location. Values for the rest of the parameters are as follows: A_1/A_2 = 17/16 (Linsenmeier, Frishman, Jakiela, & Enroth-Cugell, 1982), σ_1 = 0.17666° and σ_2 = 0.53° (Peichl & Wässle, 1979). The D function is computed as in equation 2.8, using the above filters. The sum over neurons is approximated by an integral over a, assuming that the receptive fields are uniformly distributed, so ρ(a) is a constant. The integrals can then be computed analytically, and the result is

D(|x - y|) = \frac{A_1^2}{4\pi \sigma_1^2} \exp\left(-\frac{(x - y)^2}{4\sigma_1^2}\right) + \frac{A_2^2}{4\pi \sigma_2^2} \exp\left(-\frac{(x - y)^2}{4\sigma_2^2}\right) - \frac{A_1 A_2}{\pi (\sigma_1^2 + \sigma_2^2)} \exp\left(-\frac{(x - y)^2}{2(\sigma_1^2 + \sigma_2^2)}\right).   (A.2)

The Fourier transform of this expression is

\tilde{D}(k) = \left( A_1 \exp\left(-\frac{\sigma_1^2 k^2}{2}\right) - A_2 \exp\left(-\frac{\sigma_2^2 k^2}{2}\right) \right)^2,   (A.3)

where k is in radians per degree.
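These expressions are straightforward to evaluate; a short sketch (Python) computes D from equation A.2, with the cross term normalized so that the expression is the inverse transform of equation A.3, and locates its full width at half maximum, which comes out near the 0.5 degree width quoted for retinal ganglion cells in Figure 4B:

```python
import numpy as np

# Difference-of-gaussians ensemble kernel, equation A.2
A2 = 1.0
A1 = (17.0 / 16.0) * A2          # A1/A2 = 17/16
s1, s2 = 0.17666, 0.53           # sigma_1, sigma_2 in degrees

def D(r):
    return (A1**2 / (4 * np.pi * s1**2) * np.exp(-r**2 / (4 * s1**2))
            + A2**2 / (4 * np.pi * s2**2) * np.exp(-r**2 / (4 * s2**2))
            - A1 * A2 / (np.pi * (s1**2 + s2**2))
              * np.exp(-r**2 / (2 * (s1**2 + s2**2))))

r = np.linspace(0.0, 2.0, 20001)   # degrees
d = D(r)
half = d[0] / 2.0                  # peak is at r = 0
fwhm = 2.0 * r[np.argmax(d < half)]

print(round(fwhm, 2))   # prints 0.5, matching the width quoted in Figure 4B
```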
Acknowledgments

We thank Sacha Nelson for his helpful advice. This research was supported by the Sloan Center for Theoretical Neurobiology at Brandeis University, the National Science Foundation (DMS-9503261), and the W. M. Keck Foundation.

References

Atick, J. J., & Redlich, A. N. (1990). Towards a theory of early visual processing. Neural Comput., 2, 308–320.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comput., 4, 196–210.
Barlow, H. B. (1989). Unsupervised learning. Neural Comput., 1, 295–311.
Bartfeld, E., & Grinvald, A. (1992). Relationships between orientation preference pinwheels, cytochrome oxidase blobs and ocular dominance columns in primate striate cortex. Proc. Natl. Acad. Sci. USA, 89, 11905–11909.
Bonhoeffer, T., & Grinvald, A. (1993). The layout of iso-orientation domains in area 18 of cat visual cortex: Optical imaging reveals a pinwheel-like organization. J. Neurosci., 13, 4157–4180.
Bracewell, R. (1965). The Fourier transform and its applications. New York: McGraw-Hill.
Daubechies, I. (1990). The wavelet transform, time-frequency localization and signal analysis. IEEE Trans. Information Theory, 36, 961–1004.
Daubechies, I. (1992). Ten lectures on wavelets. Philadelphia: SIAM.
Daubechies, I., Grossmann, A., & Meyer, Y. (1986). Painless nonorthogonal expansions. J. Math. Phys., 27, 1271–1283.
Daugman, J. G. (1985). Uncertainty relation for resolution in space, spatial frequency, and orientation optimization by two-dimensional visual cortical filters. J. Opt. Soc. Am., 2, 1160–1169.
De Valois, R. L., Albrecht, D. G., & Thorell, L. G. (1982). Spatial frequency selectivity of cells in macaque visual cortex. Vision Res., 22, 545–559.
Dong, D. W., & Atick, J. J. (1995). Statistics of natural time-varying images. Network, 6, 345–358.
Field, D. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4, 2379–2394.
Field, D., & Tolhurst, D. J. (1986). The structure and symmetry of simple-cell receptive-field profiles in the cat's visual cortex. Proc. R. Soc. Lond. B, 228, 379–400.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks. Neural Comput., 7, 219–269.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hubel, D. H., & Wiesel, T. N. (1974). Sequence regularity and geometry of orientation columns in the monkey striate cortex. J. Comp. Neurol., 158, 267–294.
Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17, 9270–9284.
Jones, J., & Palmer, L. (1987). An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 58, 1233–1259.
Lee, T. S. (1996). Image representation using 2D Gabor wavelets. IEEE Trans. Pattern Analysis and Machine Intelligence, 18, 959–971.
Linsenmeier, R. A., Frishman, L. J., Jakiela, H. G., & Enroth-Cugell, C. (1982). Receptive field properties of X and Y cells in the cat retina derived from contrast sensitivity measurements. Vision Res., 22, 1173–1183.
Movshon, J. A., Thompson, I. D., & Tolhurst, D. J. (1978). Spatial and temporal contrast sensitivity of neurons in area 17 and 18 of the cat's visual cortex. J. Physiol., 283, 101–120.
Peichl, L., & Wässle, H. (1979). Size, scatter and coverage of ganglion cell receptive field centers in the cat retina. J. Physiol. (Lond.), 291, 117–141.
Poggio, T. (1990). A theory of how the brain might work. Cold Spring Harbor Symposium on Quantitative Biology, 55, 899–910.
Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.
Pollen, D. A., & Ronner, S. F. (1981). Phase relationships between adjacent simple cells in the visual cortex. Science, 212, 1409–1411.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in C. New York: Cambridge University Press.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Let., 73, 814–817.
Webster, M. A., & De Valois, R. L. (1985). Relationship between spatial-frequency and orientation tuning of striate-cortex cells. J. Opt. Soc. Am. A, 2, 1124–1132.
Westheimer, G. (1980). The eye, including central nervous system control of eye movements. In V. B. Mountcastle (Ed.), Medical physiology (Vol. 1, pp. 481–503). St. Louis: Mosby.

Received June 11, 1998; accepted February 4, 1999.
LETTER
Communicated by Joachim Buhmann
Learning Overcomplete Representations

Michael S. Lewicki
Computer Science Dept. and Center for the Neural Basis of Cognition, Carnegie Mellon Univ., 115 Mellon Inst., 4400 Fifth Ave., Pittsburgh, PA 15213

Terrence J. Sejnowski
Howard Hughes Medical Institute, Computational Neurobiology Laboratory, The Salk Institute, La Jolla, CA 92037, U.S.A.

In an overcomplete basis, the number of basis vectors is greater than the dimensionality of the input, and the representation of an input is not a unique combination of basis vectors. Overcomplete representations have been advocated because they have greater robustness in the presence of noise, can be sparser, and can have greater flexibility in matching structure in the data. Overcomplete codes have also been proposed as a model of some of the response properties of neurons in primary visual cortex. Previous work has focused on finding the best representation of a signal using a fixed overcomplete basis (or dictionary). We present an algorithm for learning an overcomplete basis by viewing it as a probabilistic model of the observed data. We show that overcomplete bases can yield a better approximation of the underlying statistical distribution of the data and can thus lead to greater coding efficiency. This can be viewed as a generalization of the technique of independent component analysis and provides a method for Bayesian reconstruction of signals in the presence of noise and for blind source separation when there are more sources than mixtures.

1 Introduction
Neural Computation 12, 337–365 (2000)
© 2000 Massachusetts Institute of Technology

A common way to represent real-valued signals is with a linear superposition of basis functions. This can be an efficient way to encode a high-dimensional data space because the representation is distributed. Bases such as the Fourier or wavelet can provide a useful representation of some signals, but they are limited because they are not specialized for the signals under consideration. An alternative and potentially more general method of signal representation uses so-called overcomplete bases (also called overcomplete dictionaries), which allow a greater number of basis functions (also called dictionary elements) than samples in the input signal (Simoncelli, Freeman, Adelson, & Heeger, 1992; Mallat & Zhang, 1993; Chen, Donoho, & Saunders, 1996). Overcomplete bases are typically constructed by merging a set of complete
bases (e.g., Fourier, wavelet, and Gabor), or by adding basis functions to a complete basis (e.g., adding frequencies to a Fourier basis). Under an overcomplete basis, the decomposition of a signal is not unique, but this can offer some advantages. One is that there is greater flexibility in capturing structure in the data. Instead of a small set of general basis functions, there is a larger set of more specialized basis functions such that relatively few are required to represent any particular signal. These can form more compact representations, because each basis function can describe a significant amount of structure in the data. For example, if a signal is largely sinusoidal, it will have a compact representation in a Fourier basis. Similarly, a signal composed of chirps is naturally represented in a chirp basis. Combining both of these bases into a single overcomplete basis would allow compact representations for both types of signals (Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Chen et al., 1996). It is also possible to obtain compact representations when the overcomplete basis contains a single class of basis functions. An overcomplete Fourier basis, with more than the minimum number of sinusoids, can compactly represent signals composed of small numbers of frequencies, achieving superresolution (Chen et al., 1996). An additional advantage of some overcomplete representations is increased stability of the representation in response to small perturbations of the signal (Simoncelli et al., 1992).

Unlike the case in a complete basis, where signal decomposition is well defined and unique, finding the "best" representation in terms of an overcomplete basis is a challenging problem. It requires both an objective for decomposition and an algorithm that can achieve that objective. Decomposition can be expressed as finding a solution to

x = As,   (1.1)
where x is the signal, A is a (nonsquare) matrix of basis functions (vectors), and s is the vector of coefficients, that is, the representation of the signal. Developing efficient algorithms to solve this equation is an active area of research. One approach to removing the degeneracy in equation 1.1 is to place a constraint on s (Daubechies, 1990; Chen et al., 1996), for example, by finding s satisfying equation 1.1 with minimum L1 norm. A different approach is to construct iteratively a sparse representation of the signal (Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993). In some of these approaches, the decomposition can be a nonlinear function of the data. Although overcomplete bases can be more flexible in terms of how the signal is represented, there is no guarantee that hand-selected basis vectors will be well matched to the structure in the data. Ideally, we would like the basis itself to be adapted to the data, so that for the signal class of interest, each basis function captures a maximal amount of structure. One recent success along these lines was developed by Olshausen and Field (1996, 1997) from the viewpoint of learning sparse codes. When adapted
Learning Overcomplete Representations
to natural images, the basis functions shared many properties with neurons in primary visual cortex, suggesting that overcomplete representations might be a useful model for neural population codes. In this article, we present an algorithm for learning an overcomplete basis by viewing it as a probabilistic model of the observed data. This approach provides a natural solution to decomposition (see equation 1.1) by finding the maximum a posteriori representation of the data. The prior distribution on the basis function coefficients removes the redundancy in the representation and leads to representations that are sparse and are a nonlinear function of the data. The probabilistic approach to decomposition also leads to a natural method of denoising. From this model, we derive a simple and robust learning algorithm by maximizing the data likelihood over the basis functions. This work generalizes the algorithm of Olshausen and Field (1996) by deriving an algorithm for learning overcomplete bases from a direct approximation to the data likelihood. This allows learning for arbitrary input noise levels, allows for the objective comparison of different models, and provides a way to estimate a model's coding efficiency. We also show that overcomplete representations can provide a better and more efficient representation, because they can better approximate the underlying statistical density of the input data. This also generalizes the technique of independent component analysis (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995) and provides a method for the identification of more sources than mixtures.

2 Model
We assume that each data vector, x = (x_1, ..., x_L), can be described with an overcomplete linear basis plus additive noise,

x = As + ε,    (2.1)
where A is an L × M matrix with M > L. We assume gaussian additive noise, ε. The data likelihood is

log P(x | A, s) ∝ − (1 / 2σ^2) (x − As)^2,    (2.2)
where σ^2 is the noise variance. One criticism of overcomplete representations is that they are redundant—a given data point can have many possible representations—but this redundancy can be removed by a proper choice for the prior probability of the basis coefficients, P(s), which specifies the probability of the alternative representations. This density determines how the underlying statistical structure is modeled and the nature of the representation. Standard approaches to signal representation do not specify a prior for the coefficients, because
for most complete bases and assuming zero noise, the representation of the signal is unique. If A is invertible, the decomposition of the signal x is given by s = A^-1 x. Because A^-1 is expensive to compute, there is strong incentive to find basis matrices that are easy to invert, such as restricting the basis functions to be orthogonal or further restricting the basis functions to those for which there are fast algorithms, such as Fourier or wavelet analysis. A more general approach is afforded in the probabilistic formulation. The parameter values of the internal states are found by maximizing the posterior distribution of s,

ŝ = argmax_s P(s | x, A) = argmax_s P(x | A, s) P(s),    (2.3)
where ŝ is the most probable decomposition of the signal. This formulation of the problem offers the advantage that the model can fit more general types of distributions. For simplicity, we assume independence of the coefficients: P(s) = Π_m P(s_m). Another advantage of this formulation is that the process of finding the most probable representation determines the best representation for the noise level defined by σ. This automatically performs "denoising" in that s encodes the underlying signal without noise. The noise level can be set to arbitrary levels, including zero. It is also possible to use Bayesian methods to infer the most probable noise level, but this will not be pursued here.

2.1 The Relation Between the Prior and the Representation. Figure 1 shows how different priors induce different representations of the data. A standard choice might be a gaussian prior (equivalent to factor analysis), but this yields no advantage to having an overcomplete representation because the underlying assumption is still that the data are gaussian. An alternative choice, advocated by some authors (Field, 1994; Olshausen & Field, 1996; Chen et al., 1996), is to use priors that assume sparse representations. This is accomplished by priors that have high kurtosis, such as the Laplacian, P(s_m) ∝ exp(−θ|s_m|). Compared to a gaussian, this distribution puts greater weight on values close to zero, and as a result the representations are sparser—they have a greater number of small-valued coefficients. For overcomplete bases, we expect a priori that the representation will be sparse, because only L out of M nonzero coefficients are needed to represent arbitrary L-dimensional input patterns. In the case of zero noise and P(s) gaussian, maximizing equation 2.3 is equivalent to min_s ||s||_2 subject to x = As. The solution can be obtained with the pseudoinverse, s = A^+ x, and is a linear function of the data.
In the case of zero noise and P(s) Laplacian, maximizing equation 2.3 is equivalent to min_s ||s||_1 subject to x = As. Unlike the gaussian prior, this solution cannot be obtained by a simple linear operation.
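These two decompositions can be illustrated numerically. The sketch below, assuming an arbitrary 2×3 basis not taken from the paper, computes the minimum-L2 solution with the pseudoinverse and the minimum-L1 solution by the variable-splitting linear program described in section 3, using scipy; only the L1 solution is sparse.

```python
import numpy as np
from scipy.optimize import linprog

# An arbitrary 2x3 overcomplete basis (illustrative, not from the paper).
A = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])
L, M = A.shape
x = np.array([1.0, 2.0])

# Gaussian prior, zero noise: minimum-L2 solution via the pseudoinverse.
s_l2 = np.linalg.pinv(A) @ x

# Laplacian prior, zero noise: minimum-L1 solution via linear programming,
# min 1^T [u; v] subject to [A, -A][u; v] = x, u, v >= 0 (equation 3.2).
res = linprog(c=np.ones(2 * M),
              A_eq=np.hstack([A, -A]), b_eq=x,
              bounds=[(0, None)] * (2 * M))
s_l1 = res.x[:M] - res.x[M:]   # recover signed coefficients

# Both decompositions reconstruct the signal exactly ...
assert np.allclose(A @ s_l2, x) and np.allclose(A @ s_l1, x, atol=1e-6)
# ... but the L1 solution uses at most L of the M coefficients (sparse),
# while the L2 solution is generally fully dense.
print(np.sum(np.abs(s_l2) > 1e-8), np.sum(np.abs(s_l1) > 1e-6))
```

The solver choice and tolerances here are illustrative; any LP solver applied to the equation 3.2 construction gives the same decomposition.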
Figure 1: Different priors induce different representations. (a) The data distribution is two-dimensional and has three main arms. The overlayed axes form an overcomplete representation. (b, c) Optimal scaled basis vectors for the data point x under a gaussian and Laplacian prior, respectively. Assuming low noise, a gaussian for P(s) is equivalent to finding s with minimum L2 norm such that x = As. The solution is given by the pseudoinverse s = A^+ x and is a linear function of the data. A Laplacian prior (P(s_m) ∝ exp[−θ|s_m|]) finds s with minimum L1 norm. This is a nonlinear operation that essentially selects a subset of basis vectors to represent the data (Chen et al., 1996), so that the resulting representation is sparse. (d) A 64-sample segment of natural speech was fit to a 2× overcomplete Fourier representation (128 basis functions) (see section 7). The plot shows the rank-order distribution of the coefficients of s under a gaussian prior (dashed) and a Laplacian prior (solid). Under a gaussian prior, nearly all of the coefficients in the solution are nonzero, whereas far fewer are nonzero under a Laplacian prior.
3 Inferring the Internal State
A general approach for optimizing s in the case of finite noise (ε > 0) and nongaussian P(s) is to use the gradient of the log posterior in an optimization algorithm (Daugman, 1988; Olshausen & Field, 1996). A suitable initial condition is s = A^T x or s = A^+ x.
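As a concrete illustration (not code from the paper), the log-posterior gradient for gaussian noise and a Laplacian prior is λA^T(x − As) − θ sign(s), which can be followed by simple gradient ascent from the initial condition s = A^T x. The step size, θ, and λ below are arbitrary choices:

```python
import numpy as np

def map_coefficients(A, x, lam=100.0, theta=1.0, lr=1e-3, n_iter=2000):
    """Gradient ascent on log P(s | x, A) for a Laplacian prior.

    lam = 1/sigma^2 is the inverse noise variance; theta is the prior scale.
    """
    s = A.T @ x  # suggested initial condition
    for _ in range(n_iter):
        # Gradient of the log posterior: likelihood term plus prior term.
        grad = lam * A.T @ (x - A @ s) - theta * np.sign(s)
        s = s + lr * grad
    return s

# Illustrative 2x3 basis and signal (not from the paper).
A = np.array([[1.0, 0.0, 0.7],
              [0.0, 1.0, 0.7]])
x = np.array([1.0, 2.0])
s_hat = map_coefficients(A, x)
print(A @ s_hat)  # close to x when the noise level is low (large lambda)
```

With finite noise the reconstruction is only approximate: the fixed point balances λA^T(x − As) against the prior gradient θ sign(s), so the residual shrinks as λ grows.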
An alternative method, which can be used when the prior is Laplacian and ε = 0, is to view the problem as a linear program (Chen et al., 1996):

min c^T |s|    subject to    As = x.    (3.1)
Letting c = (1, ..., 1), the objective function in the linear program, c^T |s| = Σ_m |s_m|, corresponds to maximizing the log posterior likelihood under a Laplacian prior. This can be converted to a standard linear program (with only positive coefficients) by separating positive and negative coefficients. Making the substitutions, s ← [u; v], c ← [1; 1], and A ← [A, −A], equation 3.1 becomes

min 1^T [u; v]    subject to    [A, −A][u; v] = x,    u, v ≥ 0,    (3.2)
which replaces the basis vector matrix A with one that contains both positive and negative copies of the vectors. This separates the positive and negative coefficients of the solution s into the positive variables u and v, respectively. This can be solved efficiently and exactly with interior point linear programming methods (Chen et al., 1996). Quadratic programming approaches have also recently been suggested for similar problems (Osuna, Freund, & Girosi, 1997). We have used both the linear programming and gradient-based methods. The linear programming methods were superior for finding exact solutions in the case of zero noise. The standard implementation handles only the noiseless case but can be generalized (Chen et al., 1996). We found gradient-based methods to be faster in obtaining good approximate solutions. They also have the advantage that they can easily be adapted for more general models, such as positive noise levels or different priors.

4 Learning
To derive a learning algorithm we must first specify an appropriate objective function. A natural objective is to maximize the probability of the data given the model. For a set of K independent data vectors X = x_1, ..., x_K,

P(X | A) = Π_{k=1}^{K} P(x_k | A).    (4.1)
The probability of a single data point is computed by marginalizing over the states of the network,

P(x_k | A) = ∫ ds P(s) P(x_k | A, s).    (4.2)

This formulates the problem as one of density estimation and is equivalent to minimizing the Kullback-Leibler divergence between the model density
and the distribution of the data. If implementation-related issues such as synaptic noise are ignored, this is equivalent to the methods of redundancy reduction and maximizing the mutual information between the input and the representation (Nadal & Parga, 1994a, 1994b; Cardoso, 1997), which have been advocated by several researchers (Barlow, 1961, 1989; Hinton & Sejnowski, 1986; Daugman, 1989; Linsker, 1988; Atick, 1992).

4.1 Fitting Bases to the Data Distribution. To understand how overcomplete representations can yield better approximations to the underlying density, it is helpful to contrast different techniques for adapting a basis to a particular data set. Principal component analysis (PCA) models the data with a multivariate gaussian. This representation is commonly used to find the directions in the data with largest variation—the principal components—even if the data do not have gaussian structure. The basis functions are the eigenvectors of the covariance matrix and are restricted to be orthogonal. An extension of PCA, called independent component analysis (ICA) (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995), allows the learning of nonorthogonal bases for data with nongaussian distributions. ICA is highly effective in several applications, such as blind source separation of mixed audio signals (Jutten & Hérault, 1991; Bell & Sejnowski, 1995), decomposition of electroencephalographic (EEG) signals (Makeig, Jung, Bell, Ghahremani, & Sejnowski, 1996), and the analysis of functional magnetic resonance imaging (fMRI) data (McKeown et al., 1998). In all of these techniques, the number of basis vectors is equal to the number of inputs. Because these bases span the input space, they are complete and are sufficient to represent the data, but we will see that this representation can be limited. Figure 2 illustrates how different data densities are modeled by various approaches in a simple two-dimensional data space.
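The contrast between these bases can be made concrete. In the minimal sketch below (the mixing matrix and Laplacian sources are arbitrary illustrations), PCA extracts the eigenvectors of the covariance matrix, which are orthogonal by construction and therefore cannot align with two nonorthogonal generating directions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Nongaussian 2D data: two nonorthogonal sparse (Laplacian) directions.
S = rng.laplace(size=(2, 5000))
A_true = np.array([[1.0, 0.8],
                   [0.0, 0.6]])   # columns are the generating directions
X = A_true @ S

# PCA: eigenvectors of the data covariance matrix.
cov = np.cov(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
pc = eigvecs[:, ::-1]                    # largest-variance direction first

# The PCA basis is orthogonal by construction ...
assert np.allclose(pc.T @ pc, np.eye(2), atol=1e-10)
# ... so it cannot match both (nonorthogonal) columns of A_true,
# whereas ICA would be free to recover nonorthogonal directions.
print(pc)
```

This is the situation drawn in Figures 2a and 2b: the orthogonality constraint, not the fitting procedure, is what limits PCA on such data.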
PCA assumes gaussian structure, but if the data have nongaussian structure, the vectors can point in directions that contain very few data, and the probability density defined by the model will predict data where none occur. This means that the model will underestimate the likelihood of data in the dense regions and overestimate it in the sparse regions. By Shannon's theorem, this will limit the efficiency of the representation. ICA assumes the coefficients have nongaussian structure and allows the vectors to be nonorthogonal. In this example, the basis vectors point along high-density regions of the data space. If the density is more complicated, however, as in the case of the three-armed density, neither PCA nor ICA captures the underlying structure. Although it is possible to represent every point in the two-dimensional space with a linear combination of two vectors, the underlying density cannot be described without specifying a complicated prior on the basis function coefficients. An overcomplete basis, however, allows an efficient representation of the underlying density with simple priors on the basis function coefficients.
Figure 2: Fitting two-dimensional data with different bases. (a) PCA makes the assumption that the data have a gaussian distribution. The optimal basis vectors are orthogonal and are not efficient at representing nonorthogonal density distributions. (b, c) ICA does not require that the vectors be orthogonal and can fit more general types of densities. (c) Some data distributions, however, cannot be modeled adequately by either PCA or ICA. (d) Allowing the basis to be overcomplete allows the three-armed distribution to be fit with simple and independent distributions for the basis vector coefficients.
4.2 Approximating the Data Probability. In deriving the learning algorithm, the first problem is that, in general, the integral in equation 4.2 is intractable. In the special case of zero noise and a complete representation (i.e., A is invertible), this integral can be solved and leads to the well-known ICA algorithm (MacKay, 1996; Pearlmutter & Parra, 1997; Olshausen & Field, 1997; Cardoso, 1997). But in the case of an overcomplete basis, such a solution is not possible. Some recent approaches have tried to approximate this integral by evaluating P(s)P(x | A, s) at its maximum (Olshausen & Field, 1996, 1997), but this ignores the volume information of the posterior. This means representations that are well determined (i.e., have a sharp posterior distribution) are treated in the same way as those that are ill determined, which can introduce biases into the learning. Another problem is that this leads to a trivial solution where the magnitude of the basis functions diverges. This problem can be somewhat circumvented by adaptive normalization (Olshausen & Field, 1996), but setting the adaptation rates can be tricky in practice, and, more important,
there is no guarantee the desired objective function is the one being optimized. The approach we take here is to approximate equation 4.2 with a gaussian around the posterior mode, ŝ. This yields

log P(x | A) ≈ (L/2) log(λ/2π) + (M/2) log(2π) + log P(ŝ) − (λ/2)(x − Aŝ)^2 − (1/2) log det H,    (4.3)
where λ = 1/σ^2, and H is the Hessian of the log posterior at ŝ, H = λA^T A − ∇_s ∇_s log P(ŝ). This is a saddle-point approximation. (See the appendix for derivation details.) Care must be taken that the log prior has some curvature, because det(A^T A) = 0, arising from the fact that A^T A has the same rank as A, which is assumed to be rectangular. The accuracy of the saddle-point approximation is determined by how closely the posterior mode can be approximated by a gaussian. The approximation will be poor if there are multiple modes or there is excessive skew or kurtosis. We present methods for addressing some of these issues in section 6.1. A learning rule is obtained by differentiating log P(x | A) with respect to A (see the appendix), which leads to the following expression:

ΔA = AA^T ∇_A log P(x | A) ≈ −A(z ŝ^T + I),    (4.4)
where z_k = ∂ log P(s_k)/∂s_k. This rule contains no matrix inverses, and the vector z involves only the derivative of the log prior. In the case where A is square, this form of the rule is exactly the natural gradient ICA learning rule for the basis matrix (Amari, Cichocki, & Yang, 1996). The difference in the more general case where A is rectangular is in how the coefficients s are calculated. In the standard ICA learning algorithm (A square, zero noise), the coefficients are given by s = Wx, where W = A^-1 is the filter matrix. For the learning rule to work, the optimization of s must maximize the posterior distribution P(s | x, A) (see equation 2.3).

5 Examples
We now demonstrate the algorithm on two simple data sets. Figure 3 shows the results of fitting two different two-dimensional data sets. The learning procedure was as follows. The initial bases were random normalized vectors with the constraint that no two vectors be closer in angle than 30 degrees. Each basis was adapted to the data using the learning rule given in equation 4.4. The bases were adapted with 50 iterations of simple gradient descent with a batch size of 500 and a step size of 0.1 for the first 30 iterations, which was reduced to 0.001 over the last 20. The most probable coefficients were obtained using a publicly available interior point
Figure 3: Three examples illustrating the fitting of 2D distributions with overcomplete bases. The gray vectors show the true basis vectors used to generate the distribution. The black vectors show the learned basis vectors. For clarity, the vectors have been rescaled. The learned basis vectors can be antiparallel to the true vectors, because both positive and negative coefficients are allowed.
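Data sets of the kind shown in Figure 3 can be generated in outline as follows. The basis directions below are arbitrary illustrations, and the generalized Laplacian is sampled through a gamma transform, an implementation choice not specified in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three unit generating vectors in 2D (the angles are illustrative).
angles = np.deg2rad([0.0, 60.0, 120.0])
A = np.vstack([np.cos(angles), np.sin(angles)])   # 2 x 3 basis

# Exponential (positive-only) coefficients with unit mean, as in Figure 3a.
S = rng.exponential(scale=1.0, size=(3, 5000))
X = A @ S                                          # 2 x 5000 data set

# For a sparser set like Figure 3c, a generalized Laplacian
# P(s) ~ exp(-|s|^p) with p = 0.6: if g ~ Gamma(1/p, 1), then
# sign * g^(1/p) has the desired density.
p = 0.6
mag = rng.gamma(shape=1.0 / p, scale=1.0, size=(3, 5000)) ** (1.0 / p)
S_gl = np.sign(rng.uniform(-1.0, 1.0, size=(3, 5000))) * mag
X_gl = A @ S_gl
print(X.shape, X_gl.shape)
```

Smaller p concentrates the coefficients near zero with heavier tails, which is why the arms of the p = 0.6 data set are more distinct than those of the Laplacian (p = 1) set.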
linear programming package (Meszaros, 1997). Convergence of the learning algorithm was rapid, usually reaching the solution in fewer than 30 iterations. A solution was discarded if the magnitude of one of the basis vectors dropped to zero. This happened occasionally when one vector was "trapped" between two others that were already representing the data in that region. The first example is shown in Figure 3a. The data were generated from the true basis vectors (shown in gray) using x = As. To illustrate better the arms of the distribution, the elements of s were drawn from an exponential distribution (i.e., only positive coefficients) with unit mean. The direction of the learned vectors always matched that of the true generating vectors. The magnitude was smaller than the true basis vectors, possibly due to the approximation used for P(x | A) (see equation 4.3). Identical results were obtained when the coefficients were generated from a Laplacian prior (i.e., both positive and negative coefficients). The second example (see Figure 3b) has four arms, again generated from the true basis vectors using a Laplacian (both positive and negative coefficients) with unit variance. The learned vectors were close to the true vectors but showed a consistent bias. This might occur because the data are too dense for there to be any distinct direction or from the approximation to P(x | A). The bias can be greatly reduced if the coefficients are drawn from a distribution with greater sparsity. The example in Figure 3c uses a data set from the same underlying vectors, but generated from a generalized Laplacian distribution, P(s_m) ∝ exp(−|s_m|^p). When this distribution is fitted to wavelet subband coefficients of images, Buccigrossi and Simoncelli (1997) found that p was in the range [0.5, 1.0]. Varying this exponent varies the kurtosis of the distribution, which is one measure of sparseness. In
Figure 3c, the data were generated with p = 0.6. The arms of this data set are more distinct, and the directions match those of the generating vectors.

6 Quantifying the Efficiency of the Representation
One advantage of the probabilistic framework is that it provides a natural means for objectively comparing alternative representations and models. In this section we present two methods for comparing the coding efficiency of the learned basis functions.

6.1 Estimating Coding Cost Using P(x | A). The objective function P(x | A) for the probability of the data under the model is a natural measure for comparing different models. It is helpful to convert this value into a more intuitive form. According to Shannon's coding theorem, the probability of a code word gives the lower bound on the length of that code word, under the assumption that the model is correct. The number of bits required to encode the pattern is given by
#bits ≥ − log_2 P(x | A) − L log_2(σ_x),    (6.1)
where L is the dimensionality of the input pattern x, and σ_x specifies the quantization level of the encoding.

6.1.1 A More Accurate Approximation to P(x | A). The learning algorithm has been derived by making a gaussian approximation around the posterior maximum ŝ. In practice this is sufficiently accurate to generate a useful gradient for learning overcomplete bases. One caution, however, is that for general priors there is not necessarily a relation between the local curvature and the volume, as there is for a gaussian prior. When comparing alternative models or basis functions, we may want a more accurate approximation. One approach that works well is a method based on second differences. Rather than rely on the analytic Hessian to estimate the volume around ŝ, the volume can be estimated directly by calculating the change in the posterior, P(s)P(x | s, A), around the maximum ŝ using second differences. Computing the entire Hessian in this manner would require O(M^2) evaluations of the posterior, each of which is itself an O(M^2) operation. We can reduce the number of posterior evaluations to O(M) if we estimate the curvature only in the direction of the eigenvectors of the Hessian. To obtain an accurate volume estimate, the volume of the gaussian approximation should match as closely as possible the volume around the posterior maximum. This requires a choice of the step size along the eigenvector direction. In principle, this fit could be optimized, for example, by minimizing the cross entropy, but at significant computational expense. A relatively quick and reliable estimate can be obtained by choosing the step length, g_i, which produces a
fixed drop, Δ, in the value of the posterior log-likelihood,

g_i = argmin_{g_i} [ log P(ŝ | x, A) − log P(ŝ + g_i e_i | x, A) − Δ ],    (6.2)

where g_i is the step length along the direction of eigenvector e_i. This method of choosing g_i requires a line search, but this can typically be done with three or four function evaluations. In the examples below, we used Δ = 3.8, the optimal value for approximating a Laplacian with a gaussian. This procedure computes a new Hessian, Ĥ, which is a more accurate estimate of the posterior curvature. The estimated Hessian is then substituted into equation 4.3 to obtain a more accurate estimate of log P(x | A). Figure 4 shows cross-sections of the true posterior and the gaussian approximation using the direct Hessian and the second-differences method. The direct Hessian method produced poor approximations in directions where the curvature was sharp around ŝ. Estimating the volume with second differences produced a much better estimate along all of the eigenvector directions. There is no guarantee, however, that this is the best estimate, because there could still exist some directions that are poorly approximated. For example, the first plots in Figures 4a and 4b (labeled s − smp) show the direction between ŝ obtained using a convergence tolerance of 10^-4 and ŝ obtained using a convergence tolerance of 10^-6. This was the direction of the gradient when optimization was terminated and is likely to be in the direction of coefficients whose values are poorly determined. The maximum has not been reached, and the volume in this direction is underestimated. Overall, however, the second-differences method produces a much more accurate estimate of log P(x | A) than that using the analytic Hessian and obtains more accurate estimates of the relative coding efficiencies of different models.
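The second-difference procedure can be sketched as follows. The sketch assumes a known gaussian log posterior so the recovered curvature can be checked, and substitutes a bisection for the short line search described above:

```python
import numpy as np

def second_diff_curvature(logpost, s_hat, e, delta=3.8):
    """Estimate the curvature of -logpost along unit direction e at s_hat.

    The step g is chosen so that the log posterior drops by delta
    (equation 6.2), then a symmetric second difference is taken.
    """
    f0 = logpost(s_hat)
    # Bisection for g with logpost(s_hat) - logpost(s_hat + g e) ~ delta.
    lo, hi = 1e-6, 1.0
    while f0 - logpost(s_hat + hi * e) < delta:
        hi *= 2.0
    for _ in range(50):
        g = 0.5 * (lo + hi)
        if f0 - logpost(s_hat + g * e) < delta:
            lo = g
        else:
            hi = g
    # Symmetric second difference approximates the curvature.
    return -(logpost(s_hat + g * e) + logpost(s_hat - g * e) - 2 * f0) / g**2

# Known case: gaussian log posterior with Hessian diag(2, 5).
logpost = lambda s: -0.5 * s @ np.diag([2.0, 5.0]) @ s
s_hat = np.zeros(2)
for i, h in enumerate([2.0, 5.0]):
    c = second_diff_curvature(logpost, s_hat, np.eye(2)[i])
    assert abs(c - h) < 1e-6   # exact for a gaussian, any step length
```

For a gaussian posterior the second difference is exact regardless of the step length; the drop-based choice of g matters only when the true posterior departs from the gaussian shape, which is precisely the regime the method targets.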
The relative encoding efciency of the different representations depends on the relative difference of their predictions—how closely P( x | A ) matches that of the underlying distribution. If there is little difference between the predictive distribution, then there will be little difference in the expected coding cost. Table 1 shows the estimated coding costs of various models for the data sets used above. The uniform distribution gives an upper bound on the coding cost by assuming equal probability for all points within the range spanned by the data points. A simple bivariate gaussian model of the data points, equivalent to a code based on the principal components of the data, does signicantly better. The remaining table entries assume the model dened in equation 2.1 with a Laplacian prior on the basis coefcients. The 2 £ 2 basis matrix is
Figure 4: The solid lines in the figure show the cross-sections of a 128-dimensional posterior distribution P(s | x, A) along the directions of the eigenvectors of the Hessian matrix evaluated at ŝ (see equation 2.3). This posterior resulted from fitting a 2×-overcomplete representation to a 64-sample segment of natural speech (see section 7). The first cross-section is in the direction between ŝ obtained using a tolerance of 10^-4 and a more probable ŝ found using a tolerance of 10^-6 (labeled s − smp). The remaining cross-sections are in order of the eigenvalues, showing a sample from smallest (e128) to largest (e1). The dots show the gaussian approximation for the same cross-section. The + indicates the position of ŝ. Note that the x-axes have different scales. (a) The gaussian approximation obtained using the analytic Hessian to estimate the curvature. (b) The same cross-sections but with the gaussian approximation calculated using the second-difference method described in the text.
equivalent to the ICA solution (under the assumption of a Laplacian prior). The 2×3 and 2×4 matrices are overcomplete representations. All bases were learned with the methods discussed above. The most probable coefficients were calculated using the linear programming method, as was done during learning. Table 1 shows that the coding efficiency estimates obtained using the approximation to P(x | A) are reasonably accurate compared to the estimated
Table 1: Estimated Coding Costs Using P(x | A).

                        Bits per Pattern
Model          Figure 1a       Figure 3a       Figure 3b       Figure 3c
uniform        22.31 ± 0.17    21.65 ± 0.19    21.57 ± 0.16    26.67 ± 0.31
gaussian       18.96 ± 0.06    17.98 ± 0.07    18.55 ± 0.06    22.12 ± 0.15
A 2×2 (ICA)    18.84 ± 0.04    17.79 ± 0.07    18.38 ± 0.05    21.46 ± 0.08
A 2×3          18.59 ± 0.03    16.87 ± 0.06    18.04 ± 0.04    21.12 ± 0.08
A 2×4          —               —               18.15 ± 0.08    21.24 ± 0.08
bits per pattern for the uniform and gaussian models, for which the expected coding costs are exact. Apart from the uniform model, the differences in the predicted densities are rather subtle and do not show any large differences in coding costs. This would be expected because the densities predicted by the various models are similar. For all the data sets, using a three-vector basis (A 2×3) gives greater expected coding efficiencies, which indicates that higher degrees of overcompleteness can yield more efficient coding. For the four-vector basis (A 2×4), there is no further improvement in coding efficiency. This reflects the fact that when the prior distribution is limited to a Laplacian, there is an inherent limitation on the model's degrees of freedom; increasing the number of basis functions does not always increase the space of distributions that can be modeled.

6.2 Estimating Coding Cost Using Coefficient Entropy. The probability of the data gives a lower bound on code length but does not specify a code. An alternative method for estimating the coding efficiency is to estimate the entropy of a proposed code. For these models, a natural choice for the code is the vector of coefficients, s. In this approach, the distribution of the coefficients s (fit to a training data set) is estimated with a function f(s). The coding cost for a test data set is computed by estimating the entropy of the fitted coefficients at the quantization level needed to maintain an encoding noise level of σ_x. This has the advantage that f(s) can give a better approximation to the observed distribution of s (better than the Laplacian assumed under the model). In the examples below, we assume that all the coefficients have the same distribution, although it is straightforward to use a separate function for the distribution estimate of each coefficient. One note of caution with this technique, however, is that it does not include any cost of misfitting.
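The per-pattern entropy estimate of equation 6.3 can be sketched with a simple quantized histogram. In this sketch, add-one smoothing stands in for the kernel density estimate described below, and the Laplacian training and test coefficients are synthetic:

```python
import numpy as np

def coding_cost_bits(s_train, s_test, ds):
    """Equation 6.3: #bits >= -sum_i (n_i / N) log2 f[i] per pattern."""
    lo = min(s_train.min(), s_test.min()) - ds
    hi = max(s_train.max(), s_test.max()) + ds
    bins = np.arange(lo, hi + ds, ds)
    # f[i]: quantized density fit on training coefficients. Add-one
    # smoothing keeps empty bins from yielding infinite cost (the paper
    # uses a Laplacian kernel density estimate instead).
    counts = np.histogram(s_train, bins)[0] + 1.0
    f = counts / counts.sum()
    n = np.histogram(s_test, bins)[0]   # n_i over all test coefficients
    N = s_test.shape[1]                 # number of test patterns
    return -np.sum(n * np.log2(f)) / N

rng = np.random.default_rng(0)
s_train = rng.laplace(size=(3, 10000))  # coefficients fit to training data
s_test = rng.laplace(size=(3, 1000))    # coefficients fit to test data
bits = coding_cost_bits(s_train, s_test, ds=0.01)
print(bits)
```

Note that the training and test sets are kept separate, matching the requirement below that code word probabilities be specified a priori.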
A basis that does not fully span the data space will result in poor fits to the data but can still yield low entropy. If the bases under consideration fully span the data space, then the entropy will yield a reasonable estimate of coding cost. To compute the entropy, the precision to which each coefficient is encoded needs to be specified. Ideally, f(s) should be quantized to maintain
an encoding noise level of σ_x. In the examples below, we use the approximation δs_i ≈ σ_x / |A_i|. This relates a change in s_i to a change in the data x_i and is exact if A is orthogonal. We use the mean value of δs_i to obtain a single quantization level δs for all coefficients. The function f(s) is estimated by applying kernel density estimation (Silverman, 1986) to the distribution of coefficients fit to a training data set. We use a Laplacian kernel with a window width of 2δs. The most straightforward method of estimating the coding cost (in bits per pattern) is to sum over the individual entropies,

#bits ≥ − Σ_i (n_i / N) log_2 f[i],    (6.3)
where f(s) is quantized as f[i], the index i ranges over all the bins in the quantization, and n_i is the number of counts observed in each bin for each of the coefficients fit to a test data set consisting of N patterns. The distinction between the test and training data sets is necessary because the probabilities of the code words need to be specified a priori. Using the same data set for test and training underestimates the cost, although for large data sets, the difference is minimal. An alternative method of computing the coding cost using the entropy is to make use of the fact that for sparse-overcomplete representations, a subset of the coefficients will be zero. The coding cost can be computed by summing the entropy of the nonzero coefficients plus the cost of identifying them,

#bits ≥ − Σ_i (n_i / N) log_2 f_nz[i] + min(M − L, L) log_2(M),    (6.4)
where f_nz[i] is the quantized density estimate of the nonzero coefficients in the ith bin and the index i ranges over all the bins. This method is appropriate when the cost of identifying the nonzero coefficients is small compared to the cost of sending M − L (zero-valued) coefficients.

6.2.1 Application to Test Data. We now quantify the efficiency of various representations for the data sets shown in Figure 3 by computing the entropy of the coefficients. As before, the estimated coding cost was calculated on 1000 randomly sampled data points, using an encoding precision of 0.01. A separate training data set consisting of 10,000 data points was used to estimate the distribution of the coefficients. The entropy was estimated by computing the entropy of the nonzero coefficients (see equation 6.4), which yielded lower estimated coding costs for the overcomplete bases compared to summing the individual entropies (see equation 6.3). Table 2 shows the estimated coding costs for the same learned bases and data sets used in Table 1. The second column lists the cost of labeling the
352
Michael S. Lewicki and Terrence J. Sejnowski
Table 2: Estimated Entropy of Nonzero Coefficients.

                        Bits per Pattern
Model   Labeling Cost   Figure 1a      Figure 3a      Figure 3b      Figure 3c
A 2×2   0.0             18.88 ± 0.04   17.78 ± 0.07   18.18 ± 0.05   21.45 ± 0.09
A 2×3   1.6             18.02 ± 0.03   17.28 ± 0.05   17.75 ± 0.05   20.95 ± 0.08
A 2×4   2.0             —              —              17.88 ± 0.07   20.81 ± 0.09
nonzero coefficients. The other columns list the coding cost for the values of the nonzero coefficients. The total coding cost is the sum of these two numbers. For the 2 × 2 matrices (the ICA solution), the entropy estimate yields approximately the same coding cost as the estimate based on P(x|A), which indicates that in the complete case, the computation of P(x|A) is accurate. This entropy estimate, however, yields a higher coding cost for the overcomplete bases. The coding cost computed from P(x|A) suggests that lower average coding costs are achievable. One strategy for lowering the average coding cost would be to avoid identifying the nonzero coefficients for every pattern, for example, by grouping patterns that share common nonzero coefficients. This illustrates the advantage of contrasting the minimal coding cost estimated from the probability with that of a particular code.

7 Learning Sparse Representations of Speech
Speech data were obtained from the TIMIT database, using speech from a single speaker, speaking 10 different example sentences. Speech segments were 64 samples in duration (8 ms at the sampling frequency of 8000 Hz). No preprocessing was done. Both a complete (1×-overcomplete, or 64 basis vectors) and a 2×-overcomplete basis (128 basis vectors) were learned. The bases were initialized to the standard Fourier basis and a 2×-overcomplete Fourier basis, which was composed of twice the number of sine and cosine basis functions, linearly spaced in frequency over the Nyquist range. For larger problems, it is desirable to make the learning faster. Some simple modifications to the basic gradient descent procedure were used that produced more rapid and reliable convergence. For each gradient (see equation A.33), a step size was computed as δ = ε/a_max, where a_max is the element of the basis matrix A with the largest absolute value. The parameter ε was reduced from 0.02r to 0.001r over the first 1000 iterations and fixed at 0.001r for the remaining iterations, where r is a measure of the data range and was defined to be the standard deviation of the data. For the learning rule, we replaced the term A in equation A.39 with an approximation to λAAᵀAH⁻¹ (see equation A.36), as suggested in Lewicki
Figure 5: An example of fitting a 2×-overcomplete representation to segments of natural speech. Speech segments were 64 samples in duration (8 ms at the sampling frequency of 8000 Hz). (a) A random sample of 30 of the 128 basis vectors (each scaled to full range). (b) The power spectral densities (0 to 4000 Hz) of the corresponding basis vectors in a.
and Olshausen (1998, 1999):

−λAᵀAH⁻¹ ≈ I − BQ diag⁻¹[V + QᵀBQ]Qᵀ,   (7.1)

where B = −∇ₛ∇ₛ log P(ŝ) and Q and V are obtained from the singular value decomposition λAᵀA = QVQᵀ. This yielded solutions similar to equation 4.4 but resulted in fewer zero-length basis vectors. One hundred patterns were used to estimate each learning step. Learning was terminated after 5000 steps. The most probable basis function coefficients, ŝ, were obtained using a modified conjugate gradient routine (Press, Teukolsky, Vetterling, & Flannery, 1992). The basic routine was modified to replace the line search with an approximate Newton step. This approach resulted in a substantial speed improvement and produced much better solutions in a fixed amount of time than the standard routine. To improve speed, the convergence criterion in the conjugate gradient routine was started at a value of 10⁻² and reduced to 10⁻³ over the course of learning. Figure 5 shows a sample of the learned basis vectors from the 2×-overcomplete basis. The waveforms in Figure 5a show a random sample of the learned basis vectors; the corresponding power spectral densities (0–4000 Hz) are shown in Figure 5b. An immediate observation is that, unlike the Fourier basis vectors, many of the learned basis vectors have a broad bandwidth, and some have multiple spectral peaks. This is an indication that the basis functions contain some of the broadband and harmonic structure inherent in the speech signal. Another way of contrasting the learned representation with the Fourier representation is shown in Figure 6. This shows the log-coefficient magnitudes over the duration of a speech sentence taken from the TIMIT database for the 2×-overcomplete Fourier (middle plot) and 2×-overcomplete learned
[Figure 6 appears here: the top panel shows the amplitude envelope of the sentence "she had your dark suit in greasy wash water all year"; the middle and bottom panels plot log |s| against coefficient number (up to 128) for the Fourier basis and the learned basis, respectively.]
Figure 6: Comparison of coefficient values over the duration of a speech sample from the TIMIT database. (Top) The amplitude envelope of a sentence of speech. The vertical lines are a rough indication of the word boundaries, as listed in the database. (Middle) The log-coefficient magnitudes for a 2×-overcomplete Fourier basis over the duration of the speech example, using nonoverlapping blocks of 64 samples without windowing. (Bottom) The log-coefficient magnitudes for a 2×-overcomplete learned basis. Only the largest 10% of the coefficients (in terms of magnitude) are plotted. The basis vectors are ordered in terms of their power spectra, with higher frequencies toward the top.
representations (bottom plot). The middle plot is similar to a standard spectrogram, except that no windowing was performed and the windows were not overlapping. Figure 6 shows that the coefficient values for the learned basis are used much more uniformly than in the Fourier basis. This is consistent with the learning objective, which optimizes the basis vectors to make the coefficient values as independent as possible. One consequence is that the correlations across frequency that exist in the Fourier representation (reflecting the harmonic structure of the speech) are less apparent in the learned representation.

7.1 Comparing Efficiencies of Representations. Table 3 shows the estimated number of bits per sample to encode a speech segment to a precision of 1 bit out of 8 (σₓ = 1/256 of the amplitude range). The table shows the
Table 3: Estimated Bits per Sample for Speech Data.

             Estimation Method
Basis        −log₂ P(x|A) − L log₂(σₓ)   Total Entropy   Nonzero Entropy
1× learned   4.17 ± 0.07                 3.40 ± 0.05     3.36 ± 0.08
1× Fourier   5.39 ± 0.03                 4.29 ± 0.07     4.24 ± 0.09
2× learned   5.85 ± 0.06                 5.22 ± 0.04     3.07 ± 0.07
2× Fourier   6.87 ± 0.06                 6.57 ± 0.14     4.12 ± 0.11
estimated coding costs computed using the probability of the data (see equation 6.1) and using the entropy, computed by summing the individual entropy estimates of all coefficients (see equation 6.3) (total). Also shown is the entropy of only the nonzero coefficients. By both estimates, the complete and the 2×-overcomplete learned bases perform significantly better than the corresponding Fourier bases. This is not surprising, given the observed redundancy in the Fourier representation (see Figure 6). For all of the bases, the coding cost estimates derived from the entropy of the coefficients are lower than those derived from P(x|A). A possible reason is that the coefficients are sparser than the Laplacian prior assumed by the model. This is supported by the histogram of the coefficients (see Figure 7), which shows a density with kurtosis much higher than the assumed Laplacian.

Figure 7: Distribution of basis function coefficients for the 2× learned and Fourier bases. Note the log scale on the y-axis.

The entropy method can obtain a lower estimate because it fits the observed density of s. The last column in the table shows that although the entropy per coefficient of the 2×-overcomplete bases is less than for the complete bases, neither basis yields an overall improvement in coding efficiency. There could be several reasons for this. One likely reason is the independence assumption of the model (P(s) = Πₘ P(sₘ)). This assumes that there is no structure among the learned basis vectors. From Figure 6, however, it is apparent that there is considerably more structure left in the values of the basis vector coefficients. Possible generalizations for capturing this kind of structure are discussed below.

8 Discussion
We have presented an algorithm for learning overcomplete bases. In some cases, overcomplete representations allow a basis to better approximate the underlying statistical density of the data, which can lead to representations that better capture the underlying structure in the data and have greater coding efficiency. The probabilistic formulation of the basis inference problem offers the advantage that assumptions about the prior distribution on the basis coefficients are made explicit. Moreover, by estimating the probability of the data given the model or the entropy of the basis function coefficients, different models can be compared objectively. This algorithm generalizes ICA so that the model accounts for additive noise and allows the basis to be overcomplete. Unlike standard ICA, where the internal states are computed by inverting the basis function matrix, in this model the transformation from the data to the internal representation is nonlinear. This occurs because when the model is generalized to account for additive noise, or when the basis is allowed to be overcomplete, the internal representation is ambiguous. The internal representation is then obtained by maximizing the posterior probability of s given the assumptions of the model, which is, in general, a nonlinear operation. A further advantage of this formulation is that it is possible to "denoise" the data. The inference procedure of finding the most probable coefficients automatically separates the data into the underlying signal and the additive noise. By setting the noise to appropriate levels, the model specifies what structure in the data should be ignored and what structure should be modeled by the basis functions. The derivation given here also allows the noise level to be set to zero. In this case, the model attempts to account for all of the variability in the data. This approach to denoising has been applied successfully to natural images (Lewicki & Olshausen, 1998, 1999).
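As a concrete illustration of this denoising view, the sketch below computes a MAP estimate of the coefficients by plain gradient ascent on the log posterior, using the smoothed Laplacian prior of equation A.7. All parameter values (λ, θ, β, the step size) are illustrative assumptions, and the article itself uses a modified conjugate gradient routine rather than this simple loop.

```python
import numpy as np

def map_coefficients(x, A, lam=100.0, theta=1.0, beta=100.0,
                     n_steps=2000, lr=1e-3):
    """MAP estimate of s under gaussian likelihood with precision lam and
    the smoothed Laplacian prior of equation A.7, by plain gradient ascent
    on the log posterior. (Illustrative only: the article uses a modified
    conjugate gradient routine, and the parameter values are arbitrary.)"""
    s = A.T @ x  # initialize from the transpose projection
    for _ in range(n_steps):
        grad = lam * A.T @ (x - A @ s) - theta * np.tanh(beta * s)
        s += lr * grad
    return s

# Denoising: infer the signal part of a noisy observation.
rng = np.random.default_rng(0)
A = np.linalg.qr(rng.standard_normal((8, 8)))[0]  # complete orthonormal basis
s_true = np.zeros(8); s_true[2] = 3.0             # one active source
x = A @ s_true + 0.05 * rng.standard_normal(8)    # signal plus noise
s_hat = map_coefficients(x, A)
x_denoised = A @ s_hat                            # reconstruction of the signal
```

The sparse prior shrinks the small (noise-dominated) coefficients while leaving the large signal coefficient nearly intact, which is the separation into signal and noise described above.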
Another potential application is the blind separation of more sources than mixtures. For example, the two-dimensional examples in Figure 3 can be viewed as a source separation problem in which a number of sources are mixed onto a smaller number of channels (three to two in Figure 3a and
four to two in Figures 3b and 3c). The overcomplete bases allow the model to capture the underlying statistical structure in the data space. True source separation will be limited, however, because the sources are being mapped down to a smaller subspace and there is necessarily a loss of information. Nonetheless, it is possible to separate three speakers on two channels with good fidelity (Lee, Lewicki, Girolami, & Sejnowski, 1999). We have also shown, in the case of natural speech, that learned bases have better coding properties than commonly used representations such as the Fourier basis. In these examples, the learned basis resulted in an estimated coding efficiency (for near-lossless compression) that was about 1.4 bits per sample better than the Fourier basis. This reflects the fact that spectral representations of speech are redundant because of spectral structures such as speaker harmonics. In the complete case, the model is equivalent to ICA, but with additive noise. We emphasize that these numbers reflect only a very simple coding scheme based on 64-sample speech segments. Other coding schemes can achieve much greater compression. The analysis presented here serves only to compare the relative efficiency of a code based on the Fourier representation with that of a learned representation. The learned overcomplete representations also showed greater coding efficiency than the overcomplete Fourier representation but did not show greater coding efficiency than the complete representation. One possible reason is that the approximations are inaccurate. The approximation used to estimate the probability of the data (see equation 4.2) works well for learning the basis functions and also gives reasonable estimates when exact quantities are available. However, it is possible that the approximation becomes increasingly inaccurate for higher degrees of overcompleteness.
An open challenge for this and other inference problems is how to obtain accurate and computationally tractable approximations to the data probability. Another, perhaps more plausible, reason for the lack of coding efficiency of the overcomplete representations is that the assumptions of the model are inaccurate. An obvious one is the prior distribution for the coefficients. For example, the observed coefficient distribution for the speech data (see Figure 7) is much sparser than the Laplacian assumed by the model. Generalizing these models so that P(s) better captures this kind of structure would improve the accuracy of the model and allow it to fit a broader range of distributions. Generalizing these techniques to handle more general prior distributions, however, is not straightforward, because of the difficulties in the optimization techniques for finding the most probable coefficients and in the approximation of the data probability, log P(x|A). These are directions we are currently investigating. We also need to question the fundamental assumption that the coefficients are statistically independent. This is not true for structures such as speech, where there is a high degree of nonstationary statistical structure. This type of structure is clearly visible in the spectrogram-like plots in Figure 6. Because the coefficients are assumed to be independent, it is not possible to capture mutually exclusive relationships that might exist among different speech sounds. For example, different bases could be specialized for different types of phonemes. Capturing this type of structure requires generalizing the model so that it can represent and learn higher-order structure. In applications of overcomplete representations, it is common to assume that the coefficients are independent and have Laplacian distributions (Mallat & Zhang, 1993; Chen et al., 1996). These results show that although overcomplete representations can potentially yield many benefits, in practice these can be limited by inaccurate prior assumptions. An improved model of the data distribution will lead to greater coding efficiency and also allow for more accurate inference. The Bayesian framework presented in this article provides a direct method for evaluating the validity of these assumptions and also suggests ways in which these models might be generalized. Finally, the approach taken here to overcomplete representation may also be useful for understanding the nature of neural codes in the cerebral cortex. One million ganglion cells in the optic nerve are represented by 100 million cells in the primary visual cortex of the monkey. Thus, it is possible that at the first stage of cortical processing, the visual world is overrepresented by a large factor.

Appendix

A.1 Approximating P(x|A). The integral

P(x|A) = ∫ ds P(s) P(x|s, A)   (A.1)

can be approximated with the gaussian integral,

∫ ds f(s) ≈ f(ŝ) (2π)^(M/2) |−∇ₛ∇ₛ log f(s)|^(−1/2),   (A.2)

where |·| indicates the determinant and ∇ₛ∇ₛ indicates the Hessian, which is evaluated at the solution ŝ (see equation 2.3). Using f(s) = P(s) P(x|s, A), we have

−∇ₛ∇ₛ log [P(s) P(x|s, A)] = H(s)   (A.3)
  = −∇ₛ∇ₛ [−(λ/2)(x − As)² + log P(s)]   (A.4)
  = ∇ₛ [−λAᵀ(x − As)] − ∇ₛ∇ₛ log P(s) = λAᵀA − ∇ₛ∇ₛ log P(s).   (A.5)
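The gaussian approximation in equations A.1 and A.2 is straightforward to evaluate numerically. The sketch below is a hypothetical implementation with illustrative parameter values: it assembles the approximate log evidence from the MAP residual, the Laplacian prior normalization of equation A.6, and the log determinant of the Hessian H = λAᵀA + B, with B the diagonal prior curvature of equation A.8.

```python
import numpy as np

def log_evidence(x, A, s_hat, lam=100.0, theta=1.0, beta=100.0):
    """Gaussian (Laplace) approximation to log P(x|A), equations A.1-A.2:
    log P(x|A) ~ log P(s^) + log P(x|s^, A) + (M/2) log 2*pi - (1/2) log det H,
    with H = lam * A^T A + B and B the diagonal Hessian of the smoothed
    Laplacian prior (equation A.8). Parameter values are illustrative."""
    L, M = A.shape
    resid = x - A @ s_hat
    log_prior = M * np.log(theta) - theta * np.sum(np.abs(s_hat))   # eq. A.6
    log_like = 0.5 * L * np.log(lam / (2 * np.pi)) - 0.5 * lam * resid @ resid
    B = np.diag(theta * beta / np.cosh(beta * s_hat) ** 2)          # eq. A.8
    H = lam * A.T @ A + B
    _, logdet = np.linalg.slogdet(H)
    return log_prior + log_like + 0.5 * M * np.log(2 * np.pi) - 0.5 * logdet

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))        # overcomplete: L = 2, M = 3
x = rng.standard_normal(2)
s_hat = 0.01 * rng.standard_normal(3)  # stand-in for the true MAP solution
val = log_evidence(x, A, s_hat)
```

Note that even in the overcomplete case, where λAᵀA is rank-deficient, adding the positive diagonal B keeps H invertible, so the log determinant is well defined.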
A.2 The Hessian of the Laplacian. In the case of a Laplacian prior on sₘ,

log P(s) ≡ Σₘ log P(sₘ) = M log θ − θ Σₘ₌₁ᴹ |sₘ|.   (A.6)

Because log P(s) is piecewise linear in s, the curvature is zero, and there is a discontinuity in the derivative at zero. To approximate the volume contribution from P(s), we use

∂ log P(s)/∂sₘ ≈ −θ tanh(βsₘ),   (A.7)

which is equivalent to using the approximation P(sₘ) ∝ cosh^(−θ/β)(βsₘ). For large β, this approximates the true Laplacian prior while staying smooth around zero. This leads to the following diagonal expression for the Hessian:

∂² log P(sₘ)/∂sₘ² = −θβ sech²(βsₘ).   (A.8)
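A quick numerical check of equations A.7 and A.8 (a sketch with illustrative θ and β, not values taken from the article): the smooth prior log P(sₘ) ∝ −(θ/β) log cosh(βsₘ) has exactly the tanh gradient and sech² curvature stated above, which finite differences confirm.

```python
import numpy as np

theta, beta = 1.0, 100.0  # illustrative values

def log_prior(s):
    # Smooth surrogate: P(s) proportional to cosh^(-theta/beta)(beta*s),
    # so log P(s) = -(theta/beta) * log cosh(beta*s) + const.
    return -(theta / beta) * np.logaddexp(beta * s, -beta * s)

def grad(s):
    return -theta * np.tanh(beta * s)               # equation A.7

def hess(s):
    return -theta * beta / np.cosh(beta * s) ** 2   # equation A.8

# Finite-difference check of equations A.7 and A.8 at a test point.
s0, h = 0.003, 1e-6
fd_grad = (log_prior(s0 + h) - log_prior(s0 - h)) / (2 * h)
fd_hess = (grad(s0 + h) - grad(s0 - h)) / (2 * h)
```

Away from zero the gradient saturates at −θ sign(sₘ), recovering the true Laplacian score, while the curvature −θβ sech²(βsₘ) is concentrated near the origin.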
A.3 Derivation of the Learning Rule. For a set of independent data vectors X = x₁, ..., x_K, expand the posterior probability density by a saddle-point approximation:

log P(X|A) = log Πₖ P(xₖ|A)
  ≈ (KL/2) log(λ/2π) + (KM/2) log(2π)
    + Σₖ₌₁ᴷ [ log P(ŝₖ) − (λ/2)(xₖ − Aŝₖ)² − (1/2) log det H(ŝₖ) ].   (A.9)

A learning rule can be obtained by differentiating log P(x|A) with respect to A. Letting ∇_A = ∂/∂A and, for clarity, letting H = H(ŝ), we see that there are three main derivatives that need to be considered:

∇_A log P(x|A) = ∇_A log P(ŝ) − (λ/2) ∇_A (x − Aŝ)² − (1/2) ∇_A log det H.   (A.10)
We consider each in turn.

A.3.1 Deriving ∇_A log P(ŝ). The first term in the learning rule specifies how to change A to make the representation ŝ more probable. If we assume a prior distribution with high kurtosis, this component will change the weights to make the representation sparser.
From the chain rule,

∂ log P(ŝ)/∂A = Σₘ [∂ log P(ŝₘ)/∂ŝₘ] [∂ŝₘ/∂A],   (A.11)

assuming P(ŝ) = Πₘ P(ŝₘ). To obtain ∂ŝₘ/∂A, we need to describe ŝ as a function of A. If the basis is complete (and we assume low noise), then we can simply invert A to obtain ŝ = A⁻¹x. When A is overcomplete, however, there is no simple expression, but we can still make an approximation. For certain priors, the most probable solution, ŝ, will yield at most L nonzero elements. In effect, the procedure for computing ŝ selects a complete basis from A that best accounts for the data x. We can use this hypothetical reduced basis to derive a learning algorithm for the full basis matrix A.

Let Ã represent a reduced basis for a particular x; that is, Ã is composed of the basis vectors in A that have nonzero coefficients. More precisely, let c₁, ..., c_L be the indices of the nonzero coefficients in ŝ. If there are fewer than L nonzero coefficients, zero-valued coefficients can be included without loss of generality. Then Ã₁:L,ᵢ ≡ A₁:L,cᵢ, and we have

s̃ = Ã⁻¹(x − ε),   (A.12)

where s̃ is equal to ŝ with the M − L zero-valued elements removed, and Ã⁻¹ is obtained by removing the columns of A corresponding to the M − L zero-valued elements of ŝ. This allows the use of results obtained for the case when A is invertible. Note that the construction of the reduced basis is only a mathematical device used for this derivation; the final gradient equation does not depend on this construction. Following MacKay (1996), we have

∂s̃ₖ/∂ãᵢⱼ = (∂/∂ãᵢⱼ) Σₗ Ã⁻¹ₖₗ (xₗ − εₗ).   (A.13)

Using the identity ∂Ã⁻¹ₖₗ/∂ãᵢⱼ = −Ã⁻¹ₖᵢ Ã⁻¹ⱼₗ,

∂s̃ₖ/∂ãᵢⱼ = −Σₗ Ã⁻¹ₖᵢ Ã⁻¹ⱼₗ (xₗ − εₗ)   (A.14)
  = −Ã⁻¹ₖᵢ s̃ⱼ.   (A.15)

Letting z̃ₖ = ∂ log P(s̃ₖ)/∂s̃ₖ,

∂ log P(s̃)/∂ãᵢⱼ = −Σₖ z̃ₖ Ã⁻¹ₖᵢ s̃ⱼ.   (A.16)

Changing back to matrix notation,

∂ log P(s̃)/∂Ã = −Ã⁻ᵀ z̃ s̃ᵀ.   (A.17)

This derivative can be expressed in terms of the original (that is, nonreduced) variables. We invert the mapping used to obtain the reduced coefficients, ŝ → s̃. We have z_cₘ = z̃ₘ, with the remaining M − L values of z defined to be zero. We define the matrix W by W₁:L,cᵢ ≡ Ã⁻¹₁:L,ᵢ, with the remaining rows equal to zero. Then

∂ log P(s)/∂A = −Wᵀ z sᵀ.   (A.18)
This gives a gradient of zero for the columns of A that have zero-valued coefficients.

A.3.2 Deriving ∇_A(x − Aŝ)². The second term specifies how to change A to minimize the data misfit. Letting eₖ = [x − As]ₖ and using the results and notation from above:

(λ/2) ∂/∂aᵢⱼ Σₖ eₖ² = λeᵢsⱼ + λ Σₖ eₖ Σₗ aₖₗ (∂sₗ/∂aᵢⱼ)   (A.19)
  = λeᵢsⱼ − λ Σₖ eₖ Σₗ aₖₗ wₗᵢ sⱼ   (A.20)
  = λeᵢsⱼ − λeᵢsⱼ = 0.   (A.21)

Thus, there is no gradient component arising from the error term, as might be expected if the residual error has no structure.

A.3.3 Deriving ∇_A log det H. The third term in the learning rule specifies how to change the weights to minimize the width of the posterior distribution P(x|A) and thus increase the overall probability of the data. Using the chain rule,

∂ log det H/∂aᵢⱼ = (1/det H) Σₘₙ (∂ det H/∂Hₘₙ)(∂Hₘₙ/∂aᵢⱼ).   (A.22)
An element of H is defined by

Hₘₙ = Σₖ λ aₖₘ aₖₙ + bₘₙ   (A.23)
  = cₘₙ + bₘₙ,   (A.24)

where bₘₙ = (−∇ₛ∇ₛ log P(ŝ))ₘₙ. Using the identity ∂ det H/∂Hₘₙ = (det H) H⁻¹ₙₘ, we obtain

∂ log det H/∂aᵢⱼ = Σₘₙ H⁻¹ₙₘ (∂cₘₙ/∂aᵢⱼ + ∂bₘₙ/∂aᵢⱼ).   (A.25)
Considering the first term in the square brackets,

∂cₘₙ/∂aᵢⱼ = 0 for m, n ≠ j;  λaᵢₙ for m = j;  λaᵢₘ for n = j;  2λaᵢⱼ for m = n = j.   (A.26)

We then have

Σₘₙ H⁻¹ₙₘ ∂cₘₙ/∂aᵢⱼ = Σ_{m≠j} H⁻¹ₘⱼ λaᵢₘ + Σ_{m≠j} H⁻¹ⱼₘ λaᵢₘ + H⁻¹ⱼⱼ 2λaᵢⱼ   (A.27)
  = 2λ Σₘ H⁻¹ₘⱼ aᵢₘ,   (A.28)

using the fact that H⁻¹ₘⱼ = H⁻¹ⱼₘ due to the symmetry of the Hessian. In matrix notation this becomes

Σₘₙ H⁻¹ₙₘ ∂cₘₙ/∂A = 2λAH⁻¹.   (A.29)

Next we derive ∂bₘₘ/∂aᵢⱼ. Assuming P(ŝ) = Πₘ P(ŝₘ), we have that ∇ₛ∇ₛ log P(ŝ) is diagonal, and thus

Σₘₙ H⁻¹ₙₘ ∂bₘₙ/∂aᵢⱼ = Σₘ H⁻¹ₘₘ (∂bₘₘ/∂ŝₘ)(∂ŝₘ/∂aᵢⱼ).   (A.30)

Letting 2yₘ = H⁻¹ₘₘ ∂bₘₘ/∂ŝₘ and using the result under the reduced representation (see equation A.15),

Σₘ H⁻¹ₘₘ ∂bₘₘ/∂A = −2Wᵀ y ŝᵀ.   (A.31)
Finally, putting the results for ∂cₘₙ/∂A and ∂bₘₘ/∂A back into equation A.25,

∂ log det H/∂A = 2λAH⁻¹ − 2Wᵀ y ŝᵀ.   (A.32)

Gathering the three components of ∇_A log P(x|A) together yields the following expression for the learning rule:

∇_A log P(x|A) = −Wᵀ z ŝᵀ − λAH⁻¹ + Wᵀ y ŝᵀ.   (A.33)
A.3.4 Stabilizing and Simplifying the Learning Rule. Using the gradient given in equation A.33 directly is problematic due to the matrix inverses, which make it both impractical and unstable. This can be alleviated, however, by multiplying the gradient by an appropriate positive definite matrix (Amari et al., 1996). This rescales the components of the gradient but still preserves a direction valid for optimization. The fact that AᵀWᵀ = I for all W allows us to eliminate W from the equation for the learning rule:

AAᵀ ∇_A log P(x|A) = −A z ŝᵀ − λAAᵀAH⁻¹ + A y ŝᵀ.   (A.34)

Additional simplifications can be made. If λ is large (low noise), then the Hessian is dominated by λAᵀA, and

−λAAᵀAH⁻¹ = −A λAᵀA (λAᵀA + B)⁻¹   (A.35)
  ≈ −A.   (A.36)

It is also possible to obtain more accurate approximations of this term (Lewicki & Olshausen, 1998, 1999). The vector y hides a computation involving the inverse Hessian. If the basis vectors in A are randomly distributed, then as the dimensionality of A increases, the basis vectors become approximately orthogonal, and consequently the Hessian becomes approximately diagonal. In this case,

yₘ ≈ (1/Hₘₘ) ∂bₘₘ/∂ŝₘ   (A.37)
  = (∂bₘₘ/∂ŝₘ) / (λ Σₖ a²ₖₘ + bₘₘ).   (A.38)

Thus, if log P(s) and its derivatives are smooth, yₘ vanishes for large λ. In practice, excluding y from the learning rule yields more stable learning. We conjecture that this term can be ignored, possibly because it represents curvature components that are unrelated to the volume. We thus obtain the following expression for a learning rule:

ΔA = AAᵀ ∇_A log P(x|A) ≈ −A z ŝᵀ − A   (A.39)
  = −A (z ŝᵀ + I).   (A.40)
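The final rule in equation A.40 is simple to apply in code. The sketch below is a hypothetical helper with illustrative choices (batch averaging over patterns, the tanh score from equation A.7, and the step size η are all assumptions not specified here); in the complete, noiseless case the update is closely related to the natural-gradient ICA rule of Amari et al. (1996).

```python
import numpy as np

def learning_step(A, S_hat, theta=1.0, beta=100.0, eta=0.01):
    """One batch update of the basis using equation A.40,
    delta_A = -A (z s^T + I), averaged over the columns of S_hat
    (each column holds the MAP coefficients for one pattern).
    The tanh score (equation A.7) and step size eta are illustrative."""
    M, K = S_hat.shape
    Z = -theta * np.tanh(beta * S_hat)        # z = d log P(s)/ds, eq. A.7
    dA = -A @ (Z @ S_hat.T / K + np.eye(M))   # equation A.40, batch-averaged
    return A + eta * dA

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))               # 2x3 overcomplete basis
S = rng.laplace(size=(3, 5))                  # five coefficient vectors
A_next = learning_step(A, S)
```

The update vanishes when the empirical average of z ŝᵀ equals −I, which is the stationarity condition implied by equation A.40.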
Acknowledgments
We thank Bruno Olshausen and Tony Bell for many helpful discussions.
References

Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). San Mateo, CA: Morgan Kaufmann.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Computation in Neural Systems, 3(2), 213–251.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. A. Rosenblith (Ed.), Sensory communication (pp. 217–234). Cambridge, MA: MIT Press.
Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Bell, A. J., & Sejnowski, T. J. (1995). An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159.
Buccigrossi, R. W., & Simoncelli, E. P. (1997). Image compression via joint statistical characterization in the wavelet domain (Tech. Rep. No. 414). Philadelphia: University of Pennsylvania.
Cardoso, J.-F. (1997). Infomax and maximum likelihood for blind source separation. IEEE Signal Processing Letters, 4, 109–111.
Chen, S., Donoho, D. L., & Saunders, M. A. (1996). Atomic decomposition by basis pursuit (Technical Rep.). Stanford, CA: Department of Statistics, Stanford University.
Coifman, R. R., & Wickerhauser, M. V. (1992). Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2), 713–718.
Comon, P. (1994). Independent component analysis, a new concept. Signal Processing, 36(3), 287–314.
Daubechies, I. (1990). The wavelet transform, time-frequency localization, and signal analysis. IEEE Transactions on Information Theory, 36(5), 961–1004.
Daugman, J. G. (1988). Complete discrete 2-D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(7), 1169–1179.
Daugman, J. G. (1989). Entropy reduction and decorrelation in visual coding by oriented neural receptive fields. IEEE Trans. Bio. Eng., 36(1), 107–114.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6(4), 559–601.
Hinton, G. E., & Sejnowski, T. J. (1986). Learning and relearning in Boltzmann machines. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 282–317). Cambridge, MA: MIT Press.
Jutten, C., & Hérault, J. (1991). Blind separation of sources. 1. An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24(1), 1–10.
Lee, T.-W., Lewicki, M., Girolami, M., & Sejnowski, T. (1999). Blind source separation of more sources than mixtures using overcomplete representations. IEEE Sig. Proc. Lett., 6, 87–90.
Lewicki, M. S., & Olshausen, B. A. (1998). Inferring sparse, overcomplete image codes using an efficient coding framework. In M. Kearns, M. Jordan, & S. Solla
(Eds.), Advances in neural information processing systems, 10. San Mateo, CA: Morgan Kaufmann.
Lewicki, M. S., & Olshausen, B. A. (1999). A probabilistic framework for the adaptation and comparison of image codes. J. Opt. Soc. Am. A: Optics, Image Science, and Vision, 16(7), 1587–1601.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.
MacKay, D. J. C. (1996). Maximum likelihood and covariant algorithms for independent component analysis. Unpublished manuscript. Cambridge: University of Cambridge, Cavendish Laboratory. Available online at: ftp://wol.ra.phy.cam.ac.uk/pub/mackay/ica.ps.gz.
Makeig, S., Jung, T. P., Bell, A. J., Ghahremani, D., & Sejnowski, T. J. (1996). Blind separation of event-related brain response components. Psychophysiology, 33, S58.
Mallat, S. G., & Zhang, Z. F. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12), 3397–3415.
McKeown, M. J., Jung, T.-P., Makeig, S., Brown, G. G., Kindermann, S. S., Lee, T., & Sejnowski, T. (1998). Spatially independent activity patterns in functional magnetic resonance imaging data during the Stroop color-naming task. Proc. Natl. Acad. Sci. USA, 95, 803–810.
Meszaros, C. (1997). BPMPD: An interior point linear programming solver. Code available online at: ftp://ftp.netlib.org/opt/bpmpd.tar.gz.
Nadal, J.-P., & Parga, N. (1994a). Nonlinear neurons in the low-noise limit: A factorial code maximizes information transfer. Network, 5, 565–581.
Nadal, J.-P., & Parga, N. (1994b). Redundancy reduction and independent component analysis: Conditions on cumulants and adaptive approaches. Network, 5, 565–581.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive-field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Res., 37(23).
Osuna, E., Freund, R., & Girosi, F. (1997). An improved training algorithm for support vector machines. Proc. of IEEE NNSP'97 (pp. 24–26).
Pearlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. San Mateo, CA: Morgan Kaufmann.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.
Simoncelli, E. P., Freeman, W. T., Adelson, E. H., & Heeger, D. J. (1992). Shiftable multiscale transforms. IEEE Trans. Info. Theory, 38, 587–607.

Received February 8, 1998; accepted February 1, 1999.
LETTER
Communicated by Adi Bulsara
Noise in Integrate-and-Fire Neurons: From Stochastic Input to Escape Rates

Hans E. Plesser
MPI für Strömungsforschung, D-37073 Göttingen, Germany

Wulfram Gerstner
MANTRA Center for Neuromimetic Systems, Swiss Federal Institute of Technology, EPFL/DI, CH-1015 Lausanne, Switzerland

We analyze the effect of noise in integrate-and-fire neurons driven by time-dependent input and compare the diffusion approximation for the membrane potential to escape noise. It is shown that for time-dependent subthreshold input, diffusive noise can be replaced by escape noise with a hazard function that has a gaussian dependence on the distance between the (noise-free) membrane voltage and threshold. The approximation is improved if we add to the hazard function a probability current proportional to the derivative of the voltage. Stochastic resonance in response to periodic input occurs in both noise models and exhibits similar characteristics.
1 Introduction
Given the enormous number of degrees of freedom in single neurons, due to millions of ion channels and thousands of presynaptic signals impinging on the neuron, it seems natural to describe neuronal activity as a stochastic process (Tuckwell, 1989). Since in most neurons the output signal consists of stereotypical electrical pulses (spikes), the theory of point processes has long been employed to analyze neuronal activity (Perkel, Gerstein, & Moore, 1967; Johnson, 1996). The simplest point process model of neural responses to stimuli is that of an inhomogeneous Poisson process with an instantaneous firing rate that depends on the stimulus, and thus on time. Models of this type have been employed successfully to characterize neural activity in the auditory system (Siebert, 1970; Johnson, 1980; Gummer, 1991) and to investigate the facilitation of signal transduction by noise, a phenomenon known as stochastic resonance (Collins, Chow, Capela, & Imhoff, 1996). Stevens and Zador (1996) have recently suggested that in the limit of very low firing rates, Poisson models of neuronal activity indeed capture most of the dynamics of more complex models.

Neural Computation 12, 367–384 (2000)
© 2000 Massachusetts Institute of Technology
On the other hand, much recent research indicates that the precise timing of neuronal spikes plays a prominent role in information processing (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). Thus, spiking neuron models are required that provide more insight into the relation between the input current a neuron receives and the spike train it generates as output. In short, these model neurons generate an output spike whenever the membrane potential v, driven by an input current I, reaches a threshold H; afterward the potential v is reset to some value v_res. For a recent review, see Gerstner (1998). The archetypical spiking neuron model is the integrate-and-fire neuron introduced by Lapicque (Tuckwell, 1988). Stein (1965) first replaced a continuous input current with more realistic random sequences of excitatory and inhibitory input pulses. If each pulse is small and a large number of pulses impinges on the neuron per membrane time constant, the effective input to the neuron can be described by a time-dependent deterministic current I(t) and additive noise ξ(t), which is taken to be gaussian. In this limit, the evolution of the membrane potential from reset to threshold is a time-dependent Ornstein-Uhlenbeck process with the threshold as an absorbing boundary (Johannesma, 1968; Lánský, 1997). We will refer to this as the diffusion approximation of neuronal activity. The separation of the total input current into a deterministic part I(t) and noise is much less clear-cut in neuronal modeling than in physical processes such as Brownian motion. Recent research suggests that the variability of input spike arrival times plays a much greater role than intrinsic noise (Mainen & Sejnowski, 1995; Bryant & Segundo, 1976). Stochastic spike arrival may arise in noise-free networks with heterogeneous connections, as shown in recent theoretical studies (van Vreeswijk & Sompolinsky, 1996; Brunel & Hakim, 1999).
In these studies, balanced excitation and inhibition prepared the neuron in a state that is just subthreshold. Spikes are then triggered by the fluctuations in the input. Neurons in this regime have a large coefficient of variation, just as cortical neurons do. For a recent review, see König, Engel, & Singer, 1996. The subthreshold regime is also particularly interesting from the point of view of information transmission. Neurons optimally detect temporal information if the mean membrane potential is roughly one standard deviation of the noise below threshold (Kempter, Gerstner, van Hemmen, & Wagner, 1998). The improvement of signal transduction by noise, known as stochastic resonance, also appears to be limited to this regime (Bulsara & Zador, 1996). Therefore we will focus here on neurons driven by weakly subthreshold input, that is, those neurons that would be silent in the absence of noise. In this work, we take the diffusion approximation sketched above as the reference model for neuronal noise. A disadvantage of this model is that it is difficult to solve analytically, particularly for time-dependent input I(t). The interval distribution for a given input I(t) is mathematically equivalent
to a first-passage-time problem, which is known to be hard. Our aim in this study is to replace the stochasticity introduced by the diffusion term by a mathematically more tractable inhomogeneous point process. To this end, we investigate various escape processes across the firing threshold. An escape process is completely described by a hazard function h(t), which depends only on the momentary values of the membrane potential v(t) and the input current I(t). We want to identify that function h which reproduces as closely as possible the behavior of the diffusion model, for both periodic and aperiodic inputs. We demonstrate that the optimal hazard function not only reproduces the generation of individual spikes well, but that complex response properties such as stochastic resonance are preserved.

2 Integrate-and-Fire Neuron Models

2.1 Diffusion Model. Between two output spikes, the membrane potential of an integrate-and-fire neuron receiving input I(t) evolves according to the following Langevin equation (Tuckwell, 1989):
dv/dt = −v(t) + I(t) + σ ξ(t). (2.1)
Upon crossing the threshold, v(t) = 1, a spike is recorded and the potential reset to v = 0. ξ(t) is gaussian white noise with autocorrelation ⟨ξ(t)ξ(t′)⟩ = δ(t − t′). The diffusion constant σ is the root mean square (rms) amplitude of this background noise. Note that we have used dimensionless units above; time is measured in units of the membrane time constant, while voltage is given in terms of the threshold H. As a consequence of the diffusion approximation, the input I(t) will not contain any δ-pulses, and v(t) follows continuous trajectories from reset potential to threshold.

Let us denote the firing time of the last output spike by t*. In the noise-free case (σ = 0), the potential v(t) for t > t* depends on the input I(t′) for t* < t′ < t and can be found by integration of equation 2.1:

v₀(τ | t*, I(·)) = e^(−τ) ∫₀^τ I(t* + s) e^s ds. (2.2)

Our notation makes explicit that the noise-free trajectory can be calculated given the last firing time t* and the input I. The subscript 0 is intended to remind the reader that equation 2.2 describes the noise-free trajectory. The next output spike of the noiseless model would occur at a time t* + τ defined by the first threshold crossing,

τ = min{ τ′ > 0 | v₀(τ′ | t*, I(·)) ≥ 1 }. (2.3)
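Equations 2.2 and 2.3 are straightforward to evaluate numerically. The following sketch (function and variable names are ours, not from the paper) integrates the noise-free trajectory by the trapezoidal rule on a grid of intervals and locates the first threshold crossing:

```python
import numpy as np

def v0(tau_grid, I, t_star=0.0):
    """Noise-free membrane trajectory of equation 2.2, v0(tau | t*, I),
    via trapezoidal quadrature of I(t* + s) e^s over the interval grid."""
    s = tau_grid
    integrand = I(t_star + s) * np.exp(s)
    steps = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(s)
    cumint = np.concatenate(([0.0], np.cumsum(steps)))
    return np.exp(-s) * cumint

def first_crossing(tau_grid, v, theta=1.0):
    """First interval tau with v0 >= threshold theta (equation 2.3); None if no crossing."""
    idx = np.flatnonzero(v >= theta)
    return tau_grid[idx[0]] if idx.size else None

tau = np.linspace(0.0, 20.0, 4001)
v = v0(tau, I=lambda t: 1.2)      # constant, slightly suprathreshold current
print(first_crossing(tau, v))     # analytically -log(1 - 1/1.2) = log 6 ≈ 1.79
```

For a constant subthreshold current (I < 1) the trajectory saturates below threshold and `first_crossing` returns `None`, in line with the definition of equation 2.3.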
In the presence of noise, the actual membrane trajectory will fluctuate around the reference trajectory (see equation 2.2) with a variance of the
order of σ. We can no longer predict the exact timing of the next spike, but we may ask for the probability distribution ρ(τ | t*, I(·)) that a spike occurs at t* + τ, given that the last spike was at t* and the current is I(t′) for t* < t′ < t* + τ:

ρ(τ | t*, I(·)) Δτ = Pr{ spike in [t* + τ, t* + τ + Δτ) given the last reset at t* and input I(t* + τ′), τ′ > 0 }. (2.4)

Our notation makes explicit that no information about the past of the neuron going further back than the last firing or reset at t* is needed. We may think of equation 2.4 as a conditional interspike interval (ISI) distribution. For the diffusion noise model, analytical expressions for the interval distributions are not known, except for some rare special cases (Tuckwell, 1989). It is, however, possible to calculate these ISI distributions numerically based on Schrödinger's renewal ansatz (Plesser & Tanaka, 1997; Linz, 1985). (Source code is available from us on request.) In the following we use the diffusion model as the reference noise model.

If there were no absorbing threshold, equation 2.1 would describe free diffusion with drift. Then the pertaining Fokker-Planck equation,

∂/∂τ P_f(v, t* + τ | t*) = −∂/∂v [ (−v + I(t* + τ)) P_f(v, t* + τ | t*) ] + (σ²/2) ∂²/∂v² P_f(v, t* + τ | t*), (2.5)

with boundary conditions lim_{v→±∞} P_f(v, t* + τ | t*) = 0, has the solution

P_f(v, t* + τ | t*) = (1/√(2π γ²(τ))) exp{ −[v − v₀(τ | t*, I(·))]² / (2γ²(τ)) }, (2.6)

where γ²(τ) = (σ²/2)(1 − e^(−2τ)). As the exponential term in γ(τ) decays at twice the rate of those in v₀(τ | t*, I(·)), we may approximate for τ ≫ 1

P_f(v, t* + τ | t*) ≈ (1/(√π σ)) exp{ −[v − v₀(τ | t*, I(·))]² / σ² }. (2.7)
2.2 Escape Noise and Hazard Functions. In a noise-free model, the membrane trajectory v₀(τ | t*, I(·)) is given by equation 2.2, and the time of the next spike determined by equation 2.3. In the presence of noise, firing is no longer precise but may occur even though the noise-free trajectory has not yet reached threshold. A simple intuitive noise model can be based on the idea of an escape probability: at each moment of time, the neuron may fire with an instantaneous rate h, which depends on the momentary distance
between the noise-free trajectory v₀ and the threshold, and possibly the momentary input current as well. More generally, we may introduce a hazard function h(τ | t*, I(·)) that describes the risk of escape across the threshold and depends on the last firing time t* and the input I(t′) for t* < t′ < t* + τ. Once we know the hazard function, the ISI distribution is given by (Cox & Lewis, 1966)

ρ(τ | t*, I(·)) = h(τ | t*, I(·)) exp[ −∫₀^τ h(s | t*, I(·)) ds ]. (2.8)
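Equation 2.8 is easy to evaluate numerically for any hazard function sampled on a grid; a minimal sketch (names ours), checked on a constant hazard, for which the ISI distribution must be exponential and normalized:

```python
import numpy as np

def isi_from_hazard(tau_grid, h):
    """ISI density rho(tau) = h(tau) * exp(-int_0^tau h(s) ds) of equation 2.8.
    The survival integral is accumulated with the trapezoidal rule."""
    H = np.concatenate(([0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * np.diff(tau_grid))))
    return h * np.exp(-H)

tau = np.linspace(0.0, 50.0, 20001)
rho = isi_from_hazard(tau, np.full_like(tau, 0.2))        # constant hazard of 0.2
mass = np.sum(0.5 * (rho[1:] + rho[:-1]) * np.diff(tau))  # total probability
print(mass)   # ≈ 1: rho is a normalized density (exponential for constant hazard)
```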
The exponential term accounts for the probability that a neuron survives from t* to t* + τ without firing; the factor h(τ | t*, I(·)) gives the rate of firing at t* + τ, provided that the neuron has survived thus far.

We discuss in this work four models of simplified neuronal dynamics, all of which aim to approximate the diffusion model by an escape noise ansatz. The various models differ in the choice of the function h(τ | t*, I(·)) (see Figure 1). These hazard functions depend on the noise-free membrane potential v₀ and the input current I only through two scaled quantities. One is the momentary distance x between the membrane potential v₀ and the threshold H = 1, scaled by the noise amplitude:

x(τ | t*) = (1 − v₀(τ | t*)) / σ. (2.9)
The other is the net current charging the membrane in the noise-free case, also scaled by the noise amplitude:

Υ(τ | t*) = (−v₀(τ | t*) + I(t* + τ)) / σ = (1/σ) d/dτ v₀(τ | t*). (2.10)
The last equality follows immediately from equation 2.1 with σ = 0 and indicates that Υ can be interpreted as a relative velocity of the noise-free membrane potential v₀.

2.2.1 Arrhenius Model. As long as the membrane potential v stays sufficiently far below threshold, we may describe neuronal firing as a noise-activated process. The latter is most easily characterized by an Arrhenius ansatz for the hazard function (van Kampen, 1992):

h_Arr(τ | t*) = ω exp{ −(1 − v₀(τ | t*))² / σ² } = ω exp[−x²(τ | t*)], ω_opt = 0.95. (2.11)
Here, 1 − v₀(τ | t*) is the voltage gap that needs to be bridged to initiate a spike. Its square may be interpreted as the required activation energy
Figure 1: Hazard functions h(v) versus membrane potential v. (a) Arrhenius model (solid), error function model (dash-dotted), and Tuckwell model (dashed). (b) Arrhenius&Current model for different values of the input current: I = 0.95 (solid), 1 (dash-dotted), and 1.1 (dashed); the case I = 0 is drawn dotted for comparison. In all cases, the threshold is at v = 1 and the noise amplitude is σ = 0.1. The vertical dotted lines mark one noise amplitude below threshold.
and σ² as the energy supplied by the noise. The free parameter ω was determined by an optimization procedure described below. We found ω = ω_opt = 0.95 to be the optimal value. Note that for strongly superthreshold stimuli (v₀ ≫ 1), the hazard function vanishes exponentially. This might seem paradoxical at first, but it is of little concern as long as the input I(t) contains no δ-pulses and v₀ reaches the threshold only along continuous trajectories. Then superthreshold levels of the potential are accessible only via periods of maximum hazard, so that the neuron will usually have fired before v₀ becomes significantly superthreshold.

2.2.2 Arrhenius&Current Model. Strong, positive transients in the input current will push the membrane potential toward and across the threshold in a time that is short on the timescale of diffusion. Figuratively, the membrane potential distribution P_f is shifted as a whole. The simple Arrhenius ansatz
will not be able to reproduce these transients well. We thus consider, in addition to the diffusive current, the probability current induced by shifting the probability density at the threshold, P_f(H = 1, τ | t*), across threshold with the speed of the center of this distribution, d/dτ v₀(τ | t*) = −v₀(τ | t*) + I(t* + τ). Since there can be no drift from above threshold downward, we set all negative currents to zero and obtain the drift probability current:

J_drift(τ | t*) = [−v₀(τ | t*) + I(t* + τ)]₊ P_f(H = 1, τ | t*) = ([Υ(τ | t*)]₊ / √π) exp[−x²(τ | t*)].
P_f(v, τ | t*) is again the free gaussian distribution given by equation 2.7, and [x]₊ ≡ (x + |x|)/2. Together with the original, diffusive Arrhenius term, we obtain the hazard function:

h_AC(τ | t*) = h_Arr(τ | t*) + J_drift(τ | t*) = (ω + [Υ(τ | t*)]₊ / √π) exp[−x²(τ | t*)], ω_opt = 0.72. (2.12)
As before, ω is a free parameter with optimal value ω_opt. Below we will refer to the first term of equation 2.12 as the diffusion term and to the second one as the drift term.

2.2.3 Sigmoidal Model. Abeles (1982) has suggested that the hazard should be related to the proportion of the free membrane potential distribution P_f that is beyond the threshold, that is, h(τ | t*) ∝ ∫_H^∞ P_f(v, τ | t*) dv ∝ erfc x(τ | t*). This ansatz is questionable, as the correct potential distribution vanishes beyond the threshold. In view of the widespread use of sigmoidal activation functions in the theory of neural networks (Wilson & Cowan, 1972), we nevertheless test a sigmoidal model but provide it with two free parameters w₁ and w₂:

h_erf(τ | t*) = w₁ erfc[x(τ | t*) − w₂], w₁_opt = 0.66, w₂_opt = 0.53. (2.13)
Here erfc(x) = 1 − erf(x) is the complementary error function. Given the great similarity of the error function and the hyperbolic tangent, we have not extensively investigated the latter. Preliminary results indicate negligible differences (data not shown).

2.2.4 Tuckwell Model. Suppose that the input I(t) varies sufficiently slowly so that we may, at any point, consider the membrane potential of the neuron to be stationary. A stationary potential v₀(τ | t*) corresponds to
a constant input current I_s = v₀(τ | t*). In this case, the average firing rate, and thus the hazard, can be approximated in closed form (Tuckwell, 1989):

h_Tuck(τ | t*) = (x(τ | t*) / √π) exp[−x²(τ | t*)] θ(x(τ | t*)). (2.14)
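For reference, the four hazard functions of equations 2.11 through 2.14 can be collected in a few lines, written in terms of the scaled distance x and scaled velocity Υ (the scalar form, argument names, and default weights set to the quoted optima are our choices):

```python
import math

def h_arr(x, w=0.95):
    """Arrhenius hazard, equation 2.11."""
    return w * math.exp(-x * x)

def h_ac(x, Y, w=0.72):
    """Arrhenius&Current hazard, equation 2.12; the drift term uses [Y]_+ = max(Y, 0)."""
    return (w + max(Y, 0.0) / math.sqrt(math.pi)) * math.exp(-x * x)

def h_erf(x, w1=0.66, w2=0.53):
    """Sigmoidal (error function) hazard, equation 2.13."""
    return w1 * math.erfc(x - w2)

def h_tuck(x):
    """Tuckwell hazard, equation 2.14; the Heaviside factor zeroes it for x <= 0."""
    return (x / math.sqrt(math.pi)) * math.exp(-x * x) if x > 0.0 else 0.0
```

At threshold (x = 0) with vanishing drive (Υ = 0), the first three hazards reduce to their weight prefactors, while the Tuckwell hazard vanishes; a positive Υ raises h_AC above h_Arr, which is the drift correction at work.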
The Tuckwell model has no free parameter. The Heaviside step function θ(x) is introduced to avoid negative values of the hazard. The approximation made in computing the rate is valid only for stimuli far below threshold, that is, x(τ | t*) ≫ 1. Note that this rate is identical to the diffusive current at the location of the threshold for the threshold-free case,

J_diff(v = 1, τ | t*) = (σ²/2) ∂²/∂v² P_f(v, τ | t*) |_(v=1),

where P_f(v, τ | t*) is the free gaussian distribution (see equation 2.7).

2.3 Input. To compare the responses of integrate-and-fire neurons with various hazard functions to those with diffusion noise, we have to choose an input scenario. We will try to be as general as possible and consider time-dependent inputs with both aperiodic and periodic time course. The general form of the input is an oscillatory component of rms amplitude q fluctuating around a DC value μ,

I(t) = μ + (q√2 / √(Σ_k a_k²)) Σ_j a_j cos(ω_j t + φ_j), (2.15)
where ω_j = jΔω, with Δω the base frequency. Note that q² = lim_{T→∞} (1/T) ∫₀^T (I(t) − μ)² dt is the expected variance of the current. The membrane potential in response to the input in the absence of noise is found from equation 2.2:

v₀(τ | t*) = μ(1 − e^(−τ)) + q√2 [F(t* + τ) − e^(−τ) F(t*)],

F(t) = (1/√(Σ_k a_k²)) Σ_j (a_j / √(1 + ω_j²)) sin(ω_j t + φ_j + arccot ω_j). (2.16)
Aperiodic stimuli are characterized by a cutoff frequency Ω_c and are generated by drawing the phases φ_j at random from a uniform distribution over [0, 2π), while the amplitudes are given by

a_j = 0 for ω_j ≤ 0, a_j = 1 for 0 < ω_j ≤ Ω_c, a_j = exp(−(j − j_c)²/2) for j_c < j. (2.17)
Periodic input is taken to be cosinusoidal with frequency Ω = ω_k. We set a_j = δ_jk in equation 2.15 and obtain I(t) = μ + q√2 cos(Ωt + φ). To compare the performance of the various models across stimuli, we describe each stimulus by its relative distance from threshold,

ε = [1 − (μ + √(2⟨Δv₀²⟩))] / σ, (2.18)
where

⟨Δv₀²⟩ = lim_{T→∞} (1/T) ∫₀^T (v₀(τ | t*) − μ)² dτ = (q² / Σ_k a_k²) Σ_j a_j² / (1 + ω_j²)

is the rms amplitude of the membrane potential oscillations in the noise-free case. For periodic input, the factor of √2 in the definition of ε yields

σε = 1 − (μ + q√2 / √(1 + Ω²)) = 1 − sup_{τ≥0} v₀(τ | t*).
Thus subthreshold stimuli have ε > 0, superthreshold stimuli ε < 0. Note that for periodic stimuli ε > 0 indeed guarantees that the stimulus is subthreshold; in the absence of noise, the neuron never fires. For aperiodic input, however, the definition holds only in an rms sense, so that v₀ may occasionally cross the threshold even for ε > 0.

2.4 Choice of Stimuli. The results presented here were obtained from a set of 14,400 periodic and 14,400 aperiodic stimuli. DC inputs were in the range 0.55 ≤ μ ≤ 1.2, AC amplitudes approximately 0.1(1 − μ) < q√2 < 1.5(1 − μ), stimulus/cutoff frequencies 0.02π ≤ Ω_c ≤ 2π. For each set of these parameters, five phases were chosen randomly from [0, 2π) for periodic stimuli, and five different random stimuli were generated for the aperiodic case. Each stimulus was tested at eight noise amplitudes, randomly chosen so that the stimuli would have a roughly uniform distribution with respect to their relative distance from threshold with 0.1 < |ε| < 3. Some two-thirds of all stimuli were subthreshold.

Some stimuli were excluded from analysis. The rejection was based solely on the ISI distributions computed for the diffusion approximation. Specifically, stimuli were excluded if the firing probability was so low that their norm was insufficient (∫₀^T ρ(τ) dτ < 0.8), where T was 20 stimulus periods for periodic and 409.4 for aperiodic stimuli. Furthermore, stimuli were rejected if the numerical algorithm was unstable at the time resolution chosen. We verified for some of these cases that with appropriate discretization,
the algorithm did converge. The instabilities were caused by very steep threshold crossings of the membrane potential in combination with small noise. After defective stimuli had been excluded, we were left with 8714 periodic stimuli (of these, 4706 subthreshold) and 12,181 aperiodic stimuli (8027 subthreshold).

2.5 Parameter Optimization. The weights in the models with parameters were chosen as follows. We split the entire set of stimuli into an optimization and a validation set. For each single stimulus, we determined the weight ω or set of weights w₁, w₂ that minimized the error E defined in equation 3.1 below for this one stimulus, using MATLAB minimization routines. We repeated the procedure for all stimuli in the optimization set and constructed the distribution of weights found in this way. The weights ω_opt [respectively, w₁_opt, w₂_opt] are the medians of the weight distributions. Evaluating the error E over both the optimization and the validation set with the fixed weight ω_opt yielded identical results. Overfitting can therefore be excluded.

3 Results

3.1 ISI Distributions. Figure 2 shows the ISI distribution evoked from the diffusion model by a subthreshold aperiodic stimulus. The predicted ISI distribution for the Arrhenius&Current model (dashed) has been based on equation 2.4. To observe this ISI in an experiment, one would present the stimulus shown as the dotted line in Figure 2c repeatedly, starting over at t = 0 each time a spike has been fired. We may also think of Figure 2a as a PSTH (peri-stimulus time histogram), where each spike triggers a reset of the stimulus. To measure the agreement between this reference solution and the distributions rendered by the escape noise models, we use the relative integrated mean square error (Scott, 1992):
E = ∫₀^∞ dτ [ρ(τ | t* = 0) − ρ_esc(τ | t* = 0)]² / ∫₀^∞ dτ ρ(τ | t* = 0)². (3.1)
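Equation 3.1 reduces to two quadratures; a minimal sketch (names ours), illustrated on two exponential ISI densities:

```python
import numpy as np

def trapz(y, x):
    """Trapezoidal quadrature (kept local to avoid version-specific numpy integrators)."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def error_E(tau, rho_ref, rho_esc):
    """Relative integrated mean square error of equation 3.1."""
    return trapz((rho_ref - rho_esc) ** 2, tau) / trapz(rho_ref ** 2, tau)

tau = np.linspace(0.0, 50.0, 5001)
rho1 = 0.20 * np.exp(-0.20 * tau)   # exponential ISI density, rate 0.20
rho2 = 0.25 * np.exp(-0.25 * tau)   # a slightly different density
print(error_E(tau, rho1, rho1))     # identical densities give E = 0
print(error_E(tau, rho1, rho2))     # small positive error, here about 0.03
```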
Here, ρ(τ | t* = 0) is the ISI distribution for the diffusion model, while ρ_esc(τ | t* = 0) is the distribution obtained from any one of the escape noise models. The example in Figure 2 corresponds to an error of E = 0.026.

Figure 3 shows the cumulative distribution of the error E for all four models. The Arrhenius&Current model obviously performs significantly better than all the other models for sub- and superthreshold stimuli, both periodic and aperiodic, while the Tuckwell model does worst. The Arrhenius and the error function model are tied for second, with the Arrhenius
Figure 2: (a) Interval distribution from the diffusion model (solid) and the Arrhenius&Current model (dashed) for aperiodic input with DC offset μ = 0.85, AC amplitude q = 0.1, cutoff frequency Ω_c = π, and noise amplitude σ = 0.1. The relative distance from threshold is ε = 0.88 and the error E = 0.026. (b) Hazard as a function of time for the Arrhenius&Current model (solid). The dashed line is the diffusive component of the hazard defined in equation 2.12. (c) Input current (dotted) and noise-free membrane potential for the given stimulus (solid). The membrane potential stays below the threshold (dashed) but approaches it closely around τ ≈ 27. Note that peaks in the input current in (c) lead to sharp transients of the hazard in (b) that would not be reproduced correctly by the diffusion term alone. Thus the firing probability of the neuron is correlated with maxima of the current rather than the membrane potential, in excellent agreement with the exact diffusion model; note the close fit in (a).
Figure 3: Cumulative distribution of the relative mean square error E over all stimuli for the Arrhenius (solid), Arrhenius&Current (dash-dotted), error function (dashed), and Tuckwell (dotted) models. The stimuli were (a) subthreshold periodic, (b) superthreshold periodic, (c) subthreshold aperiodic, and (d) superthreshold aperiodic. To interpret the cumulative distributions, let us study the likelihood that the error E of our approximation is smaller than 0.1. We may read off from the graphs in, for example, (a) that for subthreshold periodic stimuli, the Tuckwell model would give a probability of E ≤ 0.1 of about 50%, whereas the Arrhenius model and the error function model give about 80%, and the Arrhenius&Current model a probability of more than 95%.
model yielding slightly smaller errors for superthreshold stimuli. To quantify the performance of the models, we give in Table 1 the median and 90th percentile of the error for all models. Note that for the Arrhenius&Current model, across all stimulus conditions, the error is E < 0.077 in 90% of all cases. We therefore conclude that this model provides an excellent approximation to the dynamics of the diffusion model of neuronal activity.

We now describe in more detail how the model error depends on the stimuli and briefly discuss why the models perform so differently. Figure 4
Table 1: Median (= 50th) and 90th Percentiles of Errors E.

                     Periodic        Aperiodic       Periodic        Aperiodic
                     Subthreshold    Subthreshold    Superthreshold  Superthreshold  Overall
                     50%    90%      50%    90%      50%    90%      50%    90%      50%    90%
Arrhenius            0.055  0.148    0.075  0.336    0.157  0.409    0.130  0.375    0.091  0.334
Arrhenius&Current    0.018  0.044    0.024  0.070    0.042  0.090    0.034  0.090    0.026  0.077
erf                  0.061  0.147    0.083  0.346    0.189  0.429    0.160  0.397    0.103  0.358
Tuckwell             0.137  0.333    0.242  0.837    0.747  0.896    0.713  0.901    0.398  0.864
shows the error of the models versus relative distance from threshold of the stimuli, for both periodic and aperiodic stimuli. All models perform better for sub- than for superthreshold stimuli, but the Arrhenius&Current model is relatively good across the entire range. Note that the Tuckwell model becomes quite good for periodic stimuli that are far below threshold (ε > 1), as is to be expected, since the Tuckwell approximation is valid in this regime. It does not fare as well for aperiodic stimuli, for they may contain intervals during which the membrane potential of a subthreshold stimulus comes close to threshold.

3.2 Stochastic Resonance. Recent years have seen mounting evidence that signal transmission in neurons can be improved by noise, an effect known as stochastic resonance (Douglass, Wilkens, Pantazelou, & Moss, 1993; Wiesenfeld & Moss, 1995; Cordo et al., 1996; Levin & Miller, 1996; Moss & Russell, 1998). The typical experimental paradigm is to stimulate a neuron sinusoidally, adding noise of varying intensity, and to measure the signal-to-noise ratio (SNR) of the spike train elicited from the neuron. If this SNR peaks for nonvanishing input noise, the system is said to exhibit stochastic resonance. (For a recent review of the field, see Gammaitoni, Hänggi, Jung, & Marchesoni, 1998.)

Plesser and Geisel (1999) have demonstrated that noise can induce a further kind of resonance in periodically stimulated integrate-and-fire neurons: the optimal SNR is attained for a particular stimulation frequency. This latter resonance arises by matching the period of stimulation to the timescale set by the membrane time constant of the neuron. We shall show here that the simplified Arrhenius&Current model reproduces this behavior well and that it is thus well suited to investigate integrate-and-fire dynamics. (For details of the methods, see Plesser & Geisel, 1999.) Briefly, one proceeds as follows to obtain the SNR in response to sinusoidal stimulation.
For periodic input I(t) = μ + q√2 cos(Ωt + φ), the ISI distributions ρ(τ | t*, I(·)) depend on input and external time only through the input phase ψ* = [Ωt* + φ] mod 2π at the time t* of the most recent spike. With respect to this spike phase, the spike train reduces to a Markov
Figure 4: Error E versus relative distance from threshold ε for the (a) Arrhenius, (b) Arrhenius&Current, (c) error function, and (d) Tuckwell models. Symbols mark the medians, and vertical bars span the 20th to 80th percentiles of 250 stimuli with neighboring ε-values. Circles indicate periodic stimuli, crosses aperiodic ones. ε > 0 corresponds to the regime of subthreshold stimuli. Note the different scales of the y-axes.
process with transition probability (or kernel)

T(ψ_k | ψ_(k−1)) = ∫₀^∞ ρ(τ | ψ_(k−1)) δ(ψ_k − [Ωτ + ψ_(k−1)] mod 2π) dτ (3.2)

between the phases of the kth and the (k − 1)st spike. The power spectral density at the stimulus frequency Ω of a δ-spike train of total length T_o is then given by

S_(T_o)(Ω) = (1/(π T_o)) ⟨ Σ_(j,k: t_j, t_k < T_o) e^(iΩ(t_j − t_k)) ⟩ ≈ (1/(π N_o ⟨τ⟩)) ⟨ Σ_(j,k=1)^(N_o) e^(i(ψ_j − ψ_k)) ⟩. (3.3)
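Equation 3.2 can be discretized by binning the phase variable; the sketch below (names and the toy phase-independent ISI density are ours, and Ω is an arbitrary choice) accumulates the ISI probability mass into phase bins and checks that the resulting matrix is a proper transition kernel:

```python
import numpy as np

def phase_kernel(rho, Omega, n_bins=64, tau_max=200.0, n_tau=200001):
    """Discretized phase-transition matrix of equation 3.2: entry [k, j] is the
    probability that a spike at phase psi_j is followed by a spike at phase psi_k.
    rho(tau, psi) must return the conditional ISI density (a toy model below)."""
    tau = np.linspace(0.0, tau_max, n_tau)
    dtau = tau[1] - tau[0]
    psi = (np.arange(n_bins) + 0.5) * 2.0 * np.pi / n_bins   # bin centers
    T = np.zeros((n_bins, n_bins))
    for j, p in enumerate(psi):
        mass = rho(tau, p) * dtau                            # probability per tau step
        k = ((Omega * tau + p) % (2.0 * np.pi)) * n_bins / (2.0 * np.pi)
        k = np.minimum(k.astype(int), n_bins - 1)            # guard the float edge
        np.add.at(T, (k, j), mass)                           # accumulate into phase bins
    return T

# toy conditional ISI density: exponential with rate 0.3, independent of phase
T = phase_kernel(lambda tau, psi: 0.3 * np.exp(-0.3 * tau), Omega=0.33 * np.pi)
print(T.sum(axis=0))   # every column sums to ≈ 1: a proper transition kernel
```

For the real problem, `rho` would be the diffusion or escape-noise ISI distribution; the stationary phase distribution then follows by iterating the kernel, which is how the right-hand side of equation 3.3 becomes computable in closed form.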
Figure 5: SNR versus noise amplitude σ for μ = 0.95, q√2 = 0.05, and Ω = 0.33π (highest SNR), Ω = π (lowest SNR), and Ω = 0.1π. Solid lines are for the diffusion model, dashed lines for the Arrhenius&Current model.
Here, t_j are the spike times and ψ_j the spike phases, and we have exploited that ψ_j − ψ_k = [Ω(t_j − t_k)] mod 2π. The mean ISI length is given by ⟨τ⟩, and N_o = ⌊T_o/⟨τ⟩⌋ is the average number of spikes in T_o. The approximation in equation 3.3 consists in averaging the number of spikes in T_o and their phases separately. Since the correlations between the spike phases ψ_j are given by equation 3.2, the right-hand side of equation 3.3 can be evaluated in closed form. Results are in excellent agreement with spectra computed via fast Fourier transform from simulated spike trains.

As an estimate of the noise background on which this signal will be transmitted, we use the power spectral density of a Poisson process with the same mean number of spikes per time as the spike train studied in equation 3.3 (see Stemmler, 1996). A Poisson process with intensity ⟨τ⟩⁻¹ has a white spectrum with power S_P = (π⟨τ⟩)⁻¹ (see Cox & Lewis, 1966). The SNR of a spike train transmitted within a fixed observation period T_o is then

SNR = S_(T_o)(Ω) / S_P ≈ (1/N_o) ⟨ Σ_(j,k=1)^(N_o) e^(i(ψ_j − ψ_k)) ⟩.

Figure 5 shows results for the integrate-and-fire model with diffusion noise and Arrhenius&Current escape noise. The agreement is good, indicating that the Arrhenius&Current model is well suited to replace diffusive noise in the integrate-and-fire neuron when investigating more intricate
problems of neuronal dynamics. Note in particular that both models show a twofold stochastic resonance. The SNR exhibits, as always in stochastic resonance, a peak as a function of the noise amplitude. The height of this peak is maximal around a stimulus frequency Ω = 0.33π and decreases for both larger (Ω = π) and smaller frequencies (Ω = 0.1π). The optimal frequency Ω = 0.33π corresponds to a period of T = 6, which should be compared to the mean ISI length ⟨τ⟩ ≈ 9.7 at the resonance. (For a more detailed discussion of the timescale matching underlying this effect, see Plesser & Geisel, 1999.)
4 Discussion
We have demonstrated that the dynamics of the integrate-and-fire neuron with diffusive noise may be well approximated by simple escape noise models and have identified the Arrhenius&Current hazard function as the optimal choice of model. Upon periodic stimulation, the SNR of the neuron's output exhibits stochastic resonance for the Arrhenius&Current noise in the same way as it does for diffusive noise. This indicates that this escape noise model renders the correlations within the spike train, induced by the time-dependent stimulus, correctly. The relative error of the model depends only weakly on the stimulus, permitting widespread use.

Of the remaining models, the pure Arrhenius model finishes as a strong second, at least for subthreshold stimuli. Compared to the error function model, it has the additional charm of mathematical simplicity. The model based on Tuckwell's approximation of the mean firing rate, in contrast, is useful only if stimuli are sufficiently subthreshold.

With this study, we have provided a tool for the efficient investigation of large networks of model neurons. In fact, escape noise models have been used in the past in the context of the spike response model (Gerstner & van Hemmen, 1992; Gerstner, 1995), and this study provides additional support for such a simplified treatment of noise. For escape noise models, signal transmission properties of spiking neurons can be studied analytically (Gerstner, forthcoming). Several issues remain to be tackled, particularly the extension of the investigation presented here to the case of colored noise. Finally, although the Arrhenius&Current model is the best model we have found so far, this is as yet no proof that it is indeed the optimal model.
Acknowledgments
We acknowledge helpful comments by two anonymous referees on an earlier version of the manuscript. H. E. P. was supported by Deutsche Forschungsgemeinschaft through SFB 185 “Nichtlineare Dynamik.”
References

Abeles, M. (1982). Role of the cortical neuron: Integrator or coincidence detector? Israel J Med Sci, 18, 83–92.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput, 11, 1621–1671.
Bryant, H. L., & Segundo, J. P. (1976). Spike initiation by transmembrane current: A white-noise analysis. J Physiol, 260, 279–314.
Bulsara, A. R., & Zador, A. (1996). Threshold detection of wideband signals: A noise-induced maximum in the mutual information. Phys Rev E, 54, R2185–R2188.
Collins, J. J., Chow, C. C., Capela, A. C., & Imhoff, T. T. (1996). Aperiodic stochastic resonance. Phys Rev E, 54, 5575–5584.
Cordo, P., Inglis, J. T., Verschueren, S., Collins, J. J., Merfeld, D. M., Rosenblum, S., Buckley, S., & Moss, F. (1996). Noise in human muscle spindles. Nature, 383, 769–770.
Cox, D. R., & Lewis, P. A. W. (1966). The statistical analysis of series of events. London: Methuen.
Douglass, J. K., Wilkens, L., Pantazelou, E., & Moss, F. (1993). Noise enhancement of information transfer in crayfish mechanoreceptors by stochastic resonance. Nature, 365, 337–340.
Gammaitoni, L., Hänggi, P., Jung, P., & Marchesoni, F. (1998). Stochastic resonance. Rev Mod Phys, 70, 223–287.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys Rev E, 51, 738–758.
Gerstner, W. (1998). Spiking neurons. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 3–54). Cambridge, MA: MIT Press.
Gerstner, W. (forthcoming). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gummer, A. W. (1991). Probability density function of successive intervals of a nonhomogeneous Poisson process under low-frequency conditions. Biol Cybern, 65, 23–30.
Johannesma, P. I. M. (1968). Diffusion models of the stochastic activity of neurons.
In E. R. Caianiello (Ed.), Neural networks (pp. 116–144). Berlin: Springer. Johnson, D. H. (1980). The relationship between spike rate and synchrony in responses of auditory-nerve bers to single tones. J Acoust Soc Am, 68, 1115– 1122. Johnson, D. H. (1996). Point process models of single-neuron discharges. J Comp Neurosci, 3, 275–299. Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998).Extracting out oscillations: Neural coincidence detection with periodic spike input. Neural Comput, 10, 1987–2017. Konig, ¨ P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci, 19(4), 130–137.
384
Hans E. Plesser and Wulfram Gerstner
L´ansky, ´ P. (1997). Sources of periodical force in noisy integrate-and-re models of neuronal dynamics. Phys Rev E, 55, 2040–2043. Levin, J. E., & Miller, J. P. (1996). Broadband neural encoding in the cricket cercal sensory system enhanced by stochastic resonance. Nature, 380, 165–168. Linz, P. (1985). Analytical and numerical methods for Volterra equations. Philadelphia: SIAM. Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing of neocortical neurons. Science, 268, 1503–1506. Moss, F., & Russell, D. F. (1998). Animal behavior enhanced by noise. Paper presented at the Computational Neuroscience Meeting ’98, Santa Barbara, CA. Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. Biophys J, 7, 391–418. Plesser, H. E., & Geisel, T. (1999). Markov analysis of stochastic resonance in a periodically driven integrate-re neuron. Phys Rev E, 59, 7008–7017. Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Phys Lett A, 225, 228–234. Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press. Scott, D. W. (1992). Multivariate density estimation. New York: Wiley. Siebert, W. M. (1970). Frequency discrimination in the auditory system: Place or periodicity mechanisms. Proc IEEE, 58, 723–730. Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys J, 5, 173–194. Stemmler, M. (1996). A single spike sufces: The simplest form of stochastic resonance in neuron models. Network, 7, 687–716. Stevens, C. F., & Zador, A. (1996). When is an integrate-and-re neuron like a Poisson neuron? In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 103–109). Cambridge, MA: MIT Press. Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 1). Cambridge, UK: Cambridge University Press. Tuckwell, H. C. (1989). 
Stochastic processes in the neurosciences. Philadelphia: SIAM. van Kampen, N. G. (1992). Stochastic processes in physics and chemistry (2nd ed.). Amsterdam: North-Holland. van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726. Wiesenfeld, K., & Moss, F. (1995). Stochastic resonance and the benets of noise: From ice ages to craysh and SQUIDs. Nature, 373, 33–36. Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys J, 12, 1–24. Received April 27, 1998; accepted March 31, 1999.
LETTER
Communicated by Misha Tsodyks
Modeling Synaptic Plasticity in Conjunction with the Timing of Pre- and Postsynaptic Action Potentials

Werner M. Kistler
J. Leo van Hemmen
Physik Department der TU München, D-85747 Garching bei München, Germany

We present a spiking neuron model that allows for an analytic calculation of the correlations between pre- and postsynaptic spikes. The neuron model is a generalization of the integrate-and-fire model and is equipped with a probabilistic spike-triggering mechanism. We show that under certain biologically plausible conditions, pre- and postsynaptic spike trains can be described simultaneously as inhomogeneous Poisson processes. Inspired by experimental findings, we develop a model for synaptic long-term plasticity that relies on the relative timing of pre- and postsynaptic action potentials. Given an input statistics, we compute the stationary synaptic weights that result from the temporal correlations between the pre- and postsynaptic spikes. By means of both analytic calculations and computer simulations, we show that such a mechanism of synaptic plasticity is able to strengthen those input synapses that convey precisely timed spikes at the expense of synapses that deliver spikes with a broad temporal distribution. This may be of vital importance for any kind of information processing based on spiking neurons and temporal coding.
1 Introduction
There is growing experimental evidence that the strength of a synapse is persistently changed according to the relative timing of the arrival of presynaptic spikes and the triggering of postsynaptic action potentials (Markram, Lübke, Frotscher, & Sakmann, 1997; Bell, Han, Sugawara, & Grant, 1997). A theoretical approach to this phenomenon is far from trivial because it involves the calculation of correlations between pre- and postsynaptic action potentials (Gerstner, Kempter, van Hemmen, & Wagner, 1996). In the context of simple neuron models such as the standard integrate-and-fire model, the determination of the firing-time distribution involves the solution of first-passage-time problems (Tuckwell, 1988). These problems are known to be hard, the more so if the neuron model is extended so as to include biologically realistic postsynaptic potentials.

Neural Computation 12, 385–405 (2000)  © 2000 Massachusetts Institute of Technology
We are going to use a neuron model, the spike-response model, which is a generalization of the integrate-and-fire model (Gerstner & van Hemmen, 1992; Gerstner, 1995; Kistler, Gerstner, & van Hemmen, 1997). This model combines analytic simplicity with the ability to give a faithful description of "biological" neurons in terms of postsynaptic potentials, afterpotentials, and other aspects (Kistler et al., 1997). In order to be able to solve the first-passage-time problem, the sharp firing threshold is replaced by a probabilistic spike-triggering mechanism (section 2). In section 3 we investigate conditions that allow for a description of the postsynaptic spike train as an inhomogeneous Poisson process, if the presynaptic spike trains are described by inhomogeneous Poisson processes too. Using this stochastic neuron model, we can analytically compute the distribution of the first postsynaptic firing time for a given ensemble of input spike trains in two limiting cases: for low and high firing threshold (section 4). Finally, in section 5 a model of synaptic plasticity is introduced and analyzed in the context of a neuron that receives input from presynaptic neurons with different temporal precision. We show analytically and by computer simulations that the mechanism of synaptic plasticity is able to distinguish between synapses that convey spikes with a narrow or a broad temporal jitter. The resulting stationary synaptic weight vector favors synapses that deliver precisely timed spikes at the expense of the other synapses so as to produce again highly precise postsynaptic spikes.

2 The Model
We will investigate a neuron model, the spike-response model, that is in many aspects similar to but much more general than the standard integrate-and-fire model (Gerstner & van Hemmen, 1992; Gerstner, 1995). The membrane potential is given by a combination of pre- and postsynaptic contributions, described by a response kernel ε that gives the form of an elementary postsynaptic potential and a kernel η that accounts for refractoriness and has the function of an afterpotential, so that

    h_i(t) = Σ_{j,f} J_{ij} ε(t − t_j^f − Δ_{ij}) − Σ_f η(t − t_i^f).  (2.1)

Here, h_i is the membrane potential of neuron i, J_{ij} is the strength of the synapse connecting neuron j to neuron i, Δ_{ij} is the corresponding axonal delay from j to i, and {t_j^f, f = 1, 2, ...} are the firing times of neuron j. Causality is respected if both ε(s) and η(s) vanish identically for s < 0. Depending on the choice for ε and η, the spike-response model can reproduce the standard integrate-and-fire model or even mimic complex Hodgkin-Huxley-type neuron models (Kistler et al., 1997). A typical choice for ε is ε(t) = (t/τ_m) exp(1 − t/τ_m) H(t), where τ_m is a membrane time constant and
Modeling Synaptic Plasticity
387
H the Heaviside step function with H(t) = 1 for t > 0 and H(t) = 0 elsewhere. As for the refractory function η, a simple exponential can be used, for example, η(t) = η_0 exp(−t/τ_ref) H(t). Equation 2.1 could also be generalized so as to account for short-term depression and facilitation if the constant coupling J_{ij} is replaced by a function of the previous spike arrival times at synapse i ↦ j (Kistler & van Hemmen, 1999). In this context, however, we will concentrate on situations where only a single spike is transmitted across each synapse, so that effects of short-term plasticity do not come into action.

The crucial point for what follows is the way in which new firing events are defined. Instead of using a deterministic threshold criterion for spike release, we describe the spike train of the postsynaptic neuron as a stochastic process with the probability density of having a spike at time t being a nonlinear function ν of the membrane potential h(t). If we neglect the effect of the afterpotential, η = 0, the spike train of neuron i is an inhomogeneous Poisson process with rate ν = ν(h_i) for every fixed set of presynaptic firing times {t_j^f, j ≠ i}.

Spike triggering in neurons is more or less a threshold process. In order to mimic this observation, we have chosen the rate function ν(h) to be

    ν(h) = ν_max H(h − ϑ),  (2.2)

where ν_max is a positive constant and ϑ the firing threshold. We note that ν_max is not the maximum firing rate of the neuron, which is given by the form of η(t), but a kind of reliability parameter. The larger ν_max, the faster the neuron will fire after the threshold ϑ has been reached from below. The role of the parameter ν_max is further discussed in section 4.

3 Noisy Input
For a fixed set of presynaptic spikes at times {t_j^f, j ≠ i} and a negligible afterpotential (η = 0), the membrane potential of neuron i at time t is uniquely determined and given by

    h_i(t) = Σ_{j,f} J_{ij} ε(t − t_j^f − Δ_{ij}).  (3.1)

In this case, the spike train of neuron i is a Poisson process with rate ν_max for those time intervals with h_i(t) > ϑ and no spikes in between. Things become more complicated if we include an afterpotential after each spike of neuron i and if the arrival times of the presynaptic spikes are no longer fixed but become random variables too. We will not discuss the first point because we are mainly interested in the first spike of neuron i after a period of inactivity. As a consequence of the second point, the membrane
potential h_i(t) is no longer a fixed function of time, but a stochastic process, because every realization of presynaptic firing times {t_j^f, j ≠ i} results in a different realization of the time course of the membrane potential h_i(t) as given through equation 3.1.

The composite process is thus doubly stochastic in the sense that in a first step, a set of input spike trains is drawn from an ensemble characterized by Poisson rates ν_i. This realization of input spike trains then determines the output rate function, which produces in a second step a specific realization of the postsynaptic spike train. The natural question is how to characterize the nature of the resulting postsynaptic spike train.

For an inhomogeneous Poisson process with rate function ν, the expectation of the number k of events in an interval [T_1, T_2] is

    E{k | ν} = ∫_{T_1}^{T_2} dt ν(t).  (3.2)

If the rate function ν is replaced by a stochastic process, we have to integrate over all possible realizations of ν (with respect to a measure μ) in order to obtain the (unconditional) expectation of the number of events,

    E{k} = ∫ dμ(ν) E{k | ν} = ⟨E{k | ν}⟩_ν = ∫_{T_1}^{T_2} dt ν̄(t),  (3.3)

with ⟨·⟩_ν = ∫ · dμ(ν) and ν̄(t) = ⟨ν(t)⟩_ν. Hence, the composite process has the same expectation of the number of events as an inhomogeneous Poisson process with rate function ν̄(t), which is just the expectation of ν(t).

Nevertheless, the composite process is not equivalent to a Poisson process. This can be seen by calculating the probabilities of observing k events within [T_1, T_2]. For a Poisson process with fixed rate function ν, this probability equals

    prob{k events in [T_1, T_2] | ν} = (1/k!) [∫_{T_1}^{T_2} dt ν(t)]^k exp[−∫_{T_1}^{T_2} dt ν(t)],  (3.4)

whereas for the composite process, we obtain

    prob{k events in [T_1, T_2]} = ⟨prob{k events in [T_1, T_2] | ν}⟩_ν
        = Σ_{n=0}^∞ [(−1)^n / (k! n!)] ⟨[∫_{T_1}^{T_2} dt ν(t)]^{n+k}⟩_ν  (3.5)
        = Σ_{n=0}^∞ [(−1)^n / (k! n!)] ∫_{T_1}^{T_2} dt_1 ... ∫_{T_1}^{T_2} dt_{n+k} ⟨ν(t_1) ... ν(t_{n+k})⟩_ν.  (3.6)
The last expression contains expectations of products of the rate function ν evaluated at times t_1, ..., t_{n+k}. Since the membrane potential h is a continuous function of time, the rate function is (at least) piecewise continuous, and ν(t_i) and ν(t_j) with t_i ≠ t_j are in general not statistically independent. The calculation of correlations such as C(t_i, t_j) := ⟨ν(t_i) ν(t_j)⟩ − ⟨ν(t_i)⟩⟨ν(t_j)⟩ is thus far from being trivial (Bartsch & van Hemmen, 1998).

In the following we approximate the composite process by an inhomogeneous Poisson process with rate function ν̄. This approximation can be justified for large numbers of overlapping excitatory postsynaptic potentials (EPSPs), when, due to the law of large numbers (Lamperti, 1966), the membrane potential as a sum of independent EPSPs, and thus the rate function ν(t), is almost always close to its expectation value ν̄(t). The approximation is equivalent to neglecting correlations in the rate function ν. That is, to good approximation (as we will see shortly), we are allowed to assume ν(t_1), ..., ν(t_{k+n}) to be independent for t_i ≠ t_j, i ≠ j. In doing so, we can replace the expectation of the product by the product of expectations under the time integrals of equation 3.5 and obtain

    prob{k events in [T_1, T_2]} = Σ_{n=0}^∞ [(−1)^n / (k! n!)] ∫_{T_1}^{T_2} dt_1 ... ∫_{T_1}^{T_2} dt_{n+k} ν̄(t_1) · ... · ν̄(t_{n+k})
        = (1/k!) [∫_{T_1}^{T_2} dt ν̄(t)]^k exp[−∫_{T_1}^{T_2} dt ν̄(t)],  (3.7)

which is the probability of observing k events for an inhomogeneous Poisson process with rate function ν̄ that is simply the expectation of the process ν. The efficacy of this approximation remains to be tested. This is done throughout the rest of this article by comparing theoretical results based on this approximation with numerical simulations.

We now calculate ν̄ for neuron i. We assume that the spike train of the presynaptic neurons, labeled j, can be described by an inhomogeneous Poisson process with rate function ν_j. A description of spike trains of cortical neurons in terms of inhomogeneous Poisson processes is appropriate for time windows below approximately 1 second (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997), which is sufficient for our purpose. With these preliminaries we can calculate the first and the second moment of the distribution of the resulting membrane potential. The average postsynaptic potential generated by a single presynaptic neuron with rate function ν_j is

    ⟨Σ_f ε(t − t_j^f)⟩ = (ν_j ∗ ε)(t).  (3.8)
Here ∗ denotes convolution, that is, (f ∗ g)(t) = ∫ dt′ f(t′) g(t − t′). The second moment is (Kempter, Gerstner, van Hemmen, & Wagner, 1998, appendix A)

    ⟨[Σ_f ε(t − t_j^f)]²⟩ = [(ν_j ∗ ε)(t)]² + (ν_j ∗ ε²)(t).  (3.9)
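Equations 3.8 and 3.9 are instances of Campbell's theorem for shot noise. As a quick numerical sanity check (not part of the original article; the rate, time constant, and kernel choices below are arbitrary), one can simulate Poisson spike trains and compare the empirical mean and variance of Σ_f ε(t − t_f) with the convolutions above:

```python
import numpy as np

# Monte Carlo check of eqs. 3.8-3.9 (Campbell's theorem): for a Poisson
# spike train with rate nu, the shot noise sum_f eps(t - t_f) has mean
# (nu * eps)(t) and variance (nu * eps^2)(t).  All parameter values are
# illustrative choices, not values from the article.
rng = np.random.default_rng(0)
nu = 20.0        # presynaptic rate (homogeneous, for simplicity)
T = 5.0          # simulation window [0, T]
t_eval = 4.0     # time at which the membrane potential is sampled
tau = 0.1        # time constant of the alpha-function kernel

def eps(s):
    """Alpha-function EPSP kernel: (s/tau) exp(1 - s/tau) H(s)."""
    return np.where(s > 0, (s / tau) * np.exp(1.0 - s / tau), 0.0)

n_trials = 20000
samples = np.empty(n_trials)
for k in range(n_trials):
    n = rng.poisson(nu * T)                 # number of spikes in [0, T]
    t_f = rng.uniform(0.0, T, size=n)       # spike times given the count
    samples[k] = eps(t_eval - t_f).sum()

# Analytic values: for a homogeneous rate the convolutions reduce to
# nu * integral(eps) and nu * integral(eps^2).
s = np.linspace(0.0, 2.0, 200001)           # eps is negligible beyond 2.0
ds = s[1] - s[0]
mean_th = nu * eps(s).sum() * ds            # = nu * tau * e for the alpha kernel
var_th = nu * (eps(s) ** 2).sum() * ds      # = nu * tau * e^2 / 4

print(samples.mean(), mean_th)
print(samples.var(), var_th)
```

The two printed pairs agree to within Monte Carlo error, which is the law-of-large-numbers regime the approximation below relies on.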
Returning to equation 3.1, we see that the expectation value h̄_i of the membrane potential and the corresponding variance σ_i² are given by

    h̄_i(t) = E{h_i(t)} = Σ_j J_{ij} (ν_j ∗ ε)(t)  (3.10)

and

    σ_i²(t) = Var{h_i(t)} = Σ_j J_{ij}² (ν_j ∗ ε²)(t).  (3.11)

Because of the central limit theorem, the membrane potential at a certain fixed time t has a gaussian distribution if there are sufficiently many EPSPs that superimpose at time t (for an estimate by the Berry-Esseen inequality, we refer to Kempter et al., 1998). Equipped with the gaussian assumption, we can calculate ν̄(t),

    ν̄(t) = ∫ dh G[h − h̄_i(t), σ_i(t)] ν(h) = (ν_max/2) {1 + erf[(h̄_i(t) − ϑ) / √(2σ_i²(t))]},  (3.12)

with G being a normalized gaussian, G(h, σ) = (2πσ²)^{−1/2} exp[−h²/(2σ²)]. We note that this is the first time that we have used explicitly the functional dependence (see equation 2.2) of the firing rate ν on the membrane potential h.

The result so far is that if the membrane potential is a sum of many EPSPs, so that we can assume h(t) to be gaussian distributed and the membrane potentials for different times to be independent, and if the effect of the afterpotential is negligible, then the postsynaptic spike train is given by an inhomogeneous Poisson process with rate function ν̄ as given by equation 3.12.

4 First Passage Time
In the previous section we discussed an approximation that allows us to replace the doubly stochastic process that describes the postsynaptic spike
train by an inhomogeneous Poisson process with a rate function ν̄ that is independent of the specific realization of the presynaptic input. This approximation allows us to calculate the distribution of the first postsynaptic spike analytically. The probability of observing no spikes in the interval [t_0, t] is

    prob{no spike in [t_0, t]} = exp[−∫_{t_0}^{t} dt′ ν̄(t′)],  (4.1)

and the density for the first firing time after time t_0 is

    p_first(t) = (d/dt) (1 − prob{no spike in [t_0, t]}) = ν̄(t) exp[−∫_{t_0}^{t} dt′ ν̄(t′)].  (4.2)
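To make equations 4.1 and 4.2 concrete, the sketch below evaluates p_first numerically for an arbitrary example rate profile ν̄(t) (a gaussian bump, chosen purely for illustration and not taken from the article) and checks it against a direct Monte Carlo simulation of an inhomogeneous Poisson process generated by thinning:

```python
import numpy as np

# Numerical illustration of eqs. 4.1-4.2: p_first(t) = nubar(t) exp(-int nubar).
# The rate profile below is an arbitrary example.
rng = np.random.default_rng(1)
t0, t1 = -5.0, 5.0
t = np.linspace(t0, t1, 4001)
dt = t[1] - t[0]

def rate(x):
    return 0.8 * np.exp(-0.5 * x ** 2)

nubar = rate(t)
cum = np.cumsum(nubar) * dt                  # int_{t0}^{t} nubar(t') dt'
p_first = nubar * np.exp(-cum)               # eq. 4.2

# eq. 4.1 evaluated at t = t1 gives the overall firing probability:
p_fire = 1.0 - np.exp(-cum[-1])

# Monte Carlo: generate inhomogeneous Poisson spikes by thinning a
# homogeneous process with rate nmax and keep each trial's first spike.
nmax = nubar.max()
n_trials = 20000
n_fired, first_times = 0, []
for _ in range(n_trials):
    n = rng.poisson(nmax * (t1 - t0))
    cand = np.sort(rng.uniform(t0, t1, size=n))
    keep = cand[rng.uniform(0.0, nmax, size=n) < rate(cand)]
    if keep.size:
        n_fired += 1
        first_times.append(keep[0])

print(p_first.sum() * dt, p_fire)            # integral of the density
print(n_fired / n_trials)                    # empirical firing probability
```

The integral of p_first equals the total firing probability of equation 4.1, and the empirical first-spike statistics match the analytic density within sampling error.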
We are going to study the first-spike distribution (first passage time) in a specific context. We consider a single neuron having N independent synapses. Each synapse i delivers a spike train determined by an inhomogeneous Poisson process with rate function ν_i(t) = (2πσ_i²)^{−1/2} exp[−t²/(2σ_i²)], so that on average, one spike arrives at each synapse with a temporal jitter σ_i around t = 0. We can calculate the averaged rate function ν̄ for this setup,

    ν̄(t) = (ν_max/2) {1 + erf[(Σ_{i=1}^{N} J_i ε̄_i(t) − ϑ) / √(2 Σ_{i=1}^{N} J_i² ε̄²_i(t))]},  (4.3)

with expectation and variance given as convolutions,

    ε̄_i(t) = (ν_i ∗ ε)(t) and ε̄²_i(t) = (ν_i ∗ ε²)(t).  (4.4)

The resulting density of the first postsynaptic spike is shown in Figure 1.

4.1 High Threshold. We calculate the density of the first postsynaptic spike explicitly for two limiting cases. First, we assume that the probability of observing at least one postsynaptic spike is very small,

    ∫ dt ν̄(t) ≪ 1.  (4.5)

In this case, the neuron will fire, if it fires at all, in the neighborhood of the maximum of the postsynaptic potential. We expand the terms involving the
Figure 1: Plots to compare the density function for the first postsynaptic spike as reconstructed from simulations (solid line) with the analytical approximation using the averaged rate function of equation 4.3 (dotted line). The dashed line represents the mean rate ν̄. (A, B) N = 100 synapses with strength J_i = 1/N receive spike trains determined by an inhomogeneous Poisson process with rate function ν_i = (2πσ²)^{−1/2} exp[−t²/(2σ²)] and σ = 1. The EPSPs are described by alpha functions (t/τ) exp(1 − t/τ) with time constant τ = 1, so that the maximum of the membrane potential would amount to h = 1 if all spikes were to arrive simultaneously. The postsynaptic response is characterized by ν_max = 1 and ϑ = 0.5 in A and ϑ = 0.75 in B. Increasing the threshold obviously improves the temporal precision of the postsynaptic response, but the overall probability of a postsynaptic spike is decreased. (C, D) The same plots as A and B but for N = 10 instead of 100 synapses. The synaptic weights have been scaled so that J_i = 1/10. The central approximation that asserts statistical independence of the membrane potential at different times produces surprisingly good results for as few as N = 10 overlapping EPSPs.
membrane potential in the neighborhood of the maximum at t_0 to leading order in t,

    (Σ_{i=1}^{N} J_i ε̄_i(t) − ϑ) / √(2 Σ_{i=1}^{N} J_i² ε̄²_i(t)) = h_0 − h_2 (t − t_0)² + O((t − t_0)³).  (4.6)

The averaged rate function ν̄ can thus be written

    ν̄(t) = (ν_max/2) {1 + erf[h_0 − h_2 (t − t_0)² + O((t − t_0)³)]}
         = ν̄_0 − ν̄_2 (t − t_0)² + O((t − t_0)³),  (4.7)
with

    ν̄_0 = (ν_max/2) [1 + erf(h_0)] and ν̄_2 = (ν_max h_2/√π) e^{−h_0²}.  (4.8)

It turns out that for −0.5 < h_0 < 0.5, the averaged rate function ν̄ can be approximated very well by the clipped parabola

    ν̄(t) ≈ ν̄_0 − ν̄_2 (t − t_0)² for |t − t_0| < √(ν̄_0/ν̄_2), and ν̄(t) ≈ 0 elsewhere.  (4.9)

Using this approximation, we can calculate the integral over ν̄ and obtain the distribution of the first postsynaptic spike,

    p_first(t) = [ν̄_0 − ν̄_2 (t − t_0)²] exp[(ν̄_2/3)(t − t_0)³ − ν̄_0 (t − t_0) − (2ν̄_0/3)√(ν̄_0/ν̄_2)] × H(√(ν̄_0/ν̄_2) − |t − t_0|).  (4.10)

For h_0 < −0.5, however, a gaussian function is the better choice,

    ν̄(t) ≈ ν̄_post G(t − t_0, σ_post),  (4.11)

with ν̄_post = √(π ν̄_0³/ν̄_2) and σ_post = √(ν̄_0/(2ν̄_2)). Since ν̄_post ≪ 1, we have

    p_first(t) ≈ ν̄(t).  (4.12)
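Equation 4.10 can be cross-checked numerically: inserting the clipped parabola of equation 4.9 into the generic first-passage formula of equation 4.2 must reproduce the closed form. A minimal sketch, with arbitrary example values for ν̄_0 and ν̄_2 (nothing here is specific to the article's figures):

```python
import numpy as np

# Verify eq. 4.10 against a direct evaluation of eq. 4.2 for the clipped
# parabola of eq. 4.9.  nu0, nu2, t0 are arbitrary example values.
nu0, nu2, t0 = 0.4, 0.9, 0.0
a = np.sqrt(nu0 / nu2)                       # half-width of the parabola's support
t = np.linspace(t0 - a, t0 + a, 200001)
dt = t[1] - t[0]
nubar = nu0 - nu2 * (t - t0) ** 2            # eq. 4.9 inside the support

# Closed form, eq. 4.10 (the Heaviside factor is 1 everywhere on this grid):
p_closed = nubar * np.exp(nu2 / 3.0 * (t - t0) ** 3
                          - nu0 * (t - t0)
                          - 2.0 * nu0 / 3.0 * a)

# Generic formula, eq. 4.2, with the integral over nubar done numerically:
p_direct = nubar * np.exp(-np.cumsum(nubar) * dt)

print(np.abs(p_closed - p_direct).max())     # discretization error only
# Total mass should equal 1 - exp(-int nubar) with int nubar = 4*nu0*a/3:
print(p_closed.sum() * dt, 1.0 - np.exp(-4.0 * nu0 * a / 3.0))
```

The residual between the two densities is pure quadrature error, and the total mass of p_first reproduces the overall firing probability of equation 4.1.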
4.2 Low Threshold. We now discuss the case where the postsynaptic neuron fires with probability near to one, that is,

    1 − ∫ dt ν̄(t) ≪ 1.  (4.13)

In this case, the neuron will fire approximately at the time t_0 when the membrane potential crosses the threshold, h(t_0) = ϑ. We again expand the terms containing the membrane potential to leading order in (t − t_0),

    (Σ_{i=1}^{N} J_i ε̄_i(t) − ϑ) / √(2 Σ_{i=1}^{N} J_i² ε̄²_i(t)) = h_1 (t − t_0) + O((t − t_0)²),  (4.14)

and assume that the averaged rate function ν̄ is already saturated outside the region where this linearization is valid. We may therefore put

    ν̄(t) = (ν_max/2) {1 + erf[h_1 (t − t_0)]},  (4.15)
Figure 2: Approximation of the averaged rate function ν̄ (left) and the density of the first postsynaptic spike p_first (right) for various threshold values. The solid line gives the numerical result; the dashed line represents the analytic approximation. (A, B) Approximation for a low threshold value (ϑ = 0.5), as in equations 4.15 and 4.16. (C, D) Approximation by a clipped parabola for ϑ = 0.65, as in equations 4.9 and 4.10. (E, F) Gaussian approximation for high threshold values (ϑ = 0.8); see equations 4.11 and 4.12. The number of synapses is N = 100; the other parameters are as in Figure 1.
and obtain for the density of the first postsynaptic spike,

    p_first(t) = ν̄(t) exp[−ν̄(t)(t − t_0) − (ν_max/(2√π h_1)) e^{−h_1²(t − t_0)²}].  (4.16)

Outside the neighborhood of the threshold crossing, that is, for |t − t_0| ≫ h_1^{−1}, p_first can be approximated by

    p_first(t) ≈ ν_max e^{−ν_max (t − t_0)} H(t − t_0), |t − t_0| ≫ h_1^{−1}.  (4.17)
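A short numerical check of this low-threshold limit (with illustrative values for ν_max and h_1; nothing here is taken from the article's figures): the closed form of equation 4.16 should approach the exponential tail of equation 4.17 once |t − t_0| ≫ h_1^{−1}:

```python
from math import erf, exp, pi, sqrt

# Check that eq. 4.16 reduces to the exponential tail of eq. 4.17 for
# t - t0 >> 1/h1.  numax and h1 are arbitrary example values.
numax, h1, t0 = 1.0, 2.0, 0.0

def nubar(t):
    # eq. 4.15: saturating error-function rate profile
    return 0.5 * numax * (1.0 + erf(h1 * (t - t0)))

def p_first(t):
    # eq. 4.16; the exponent is the closed-form integral of nubar
    # from -infinity up to t.
    return nubar(t) * exp(-nubar(t) * (t - t0)
                          - numax / (2.0 * sqrt(pi) * h1)
                          * exp(-h1 ** 2 * (t - t0) ** 2))

ratios = []
for tt in (2.0, 3.0, 4.0):                   # all satisfy |t - t0| >> 1/h1
    tail = numax * exp(-numax * (tt - t0))   # eq. 4.17
    ratios.append(p_first(tt) / tail)
print(ratios)                                # all very close to 1
```

Past the crossing region the erf factor has saturated and the gaussian term in the exponent has died away, so only the memoryless exponential decay of equation 4.17 remains.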
4.3 Optimal Threshold. As can be seen in Figure 2, the width of the first-spike distribution decreases with increasing threshold ϑ. The temporal precision of the postsynaptic spike can thus be improved by increasing
Figure 3: Reliability (A) and precision (B) of the postsynaptic firing event as a function of the threshold ϑ. The neuron receives input through N = 100 synapses; for details, see the caption to Figure 1. Reliability is defined as the overall firing probability ν̄_post = ∫ dt p_first(t). Precision is the inverse of the length of the interval containing 90% of the postsynaptic spikes, Δt = t_2 − t_1 with ∫_{−∞}^{t_1} dt p_first(t) = ∫_{t_2}^{∞} dt p_first(t) = 0.05 ν̄_post. (C) demonstrates that there is an optimal threshold in the sense that ν̄_post Δt^{−1} exhibits a maximum at ϑ ≈ 0.6. The short and the long dashed lines represent the results obtained through the high- and the low-threshold approximations, respectively; see equations 4.18–4.20.
the firing threshold ϑ. The overall firing probability of the postsynaptic neuron, however, drops to zero if the threshold exceeds the maximum value of the membrane potential. The trade-off between the precision of the postsynaptic response and the reliability, that is, the overall firing probability of the postsynaptic neuron, is illustrated in Figure 3. There is an optimal threshold where the expectation value of the precision, that is, the product of reliability and precision, reaches its maximum (Kempter et al., 1998).

A straightforward estimate demonstrates the functional dependence of precision and reliability on the various parameters. For high threshold values, the density p_first for the first postsynaptic spike can be approximated by a gaussian function with variance ν̄_0/(2ν̄_2). The precision of the firing time is thus proportional to

    Δt^{−1} ∝ √(ν̄_2/ν̄_0) = {2h_2 e^{−h_0²} / (√π [1 + erf(h_0)])}^{1/2},  (4.18)

which depends solely on the time course of the membrane potential and is
independent of ν_max. The reliability of the postsynaptic response amounts to

    ν̄_post = √(π ν̄_0³/ν̄_2) = ν_max {π^{3/2} [1 + erf(h_0)]³ / (8h_2 e^{−h_0²})}^{1/2},  (4.19)

which is proportional to ν_max. For low threshold, p_first(t) is dominated by an exponential decay ∝ e^{−ν_max t}, resulting in a constant precision,

    Δt^{−1} ≈ ν_max / 3.00,  (4.20)

and reliability close to unity.

5 Synaptic Long-Term Plasticity
There is experimental evidence (Markram et al., 1997) that the strength of synapses between cortical neurons is changed according to the relative timing of pre- and postsynaptic spikes. We develop a simple model of synaptic long-term plasticity that mimics the experimental observations. Using the densities for the first postsynaptic spike calculated in the previous section, we find analytic expressions for the stationary synaptic weights that result from the statistics of the presynaptic spike train.

5.1 Modeling Synaptic Plasticity. Very much in the spirit of the spike-response model, we describe the change of a synaptic weight as the linear response to pre- and postsynaptic spikes. Spike trains are formalized by sums of Dirac delta functions, S(t) = Σ_f δ(t − t^f), with a delta function for each firing time t^f. The most general ansatz up to and including terms bilinear in the pre- and postsynaptic spike trains for the change of the synaptic weight J is

    (d/dt) J(t) = S_pre(t) [δ_pre + ∫ dt′ κ_pre(t′) S_post(t − t′)] + S_post(t) [δ_post + ∫ dt′ κ_post(t′) S_pre(t − t′)].  (5.1)
This is a combination of changes that are induced by the presynaptic (S_pre) and the postsynaptic (S_post) spike train. Furthermore, δ_pre (δ_post) is the amount by which the synaptic weight is changed if a single presynaptic (postsynaptic) spike occurs. This non-Hebbian contribution to synaptic plasticity, which is well documented in the literature (Alonso, Curtis, & Llinás, 1990; Nelson, Fields, Yu, & Liu, 1993; Salin, Malenka, & Nicoll, 1996; Urban & Barrionuevo, 1996), is necessary to ensure that the fixed point of the synaptic
strength can be reached from very low initial values and even zero postsynaptic activity (take 0 < δ_pre ≪ 1) and to regulate postsynaptic activity (take −1 ≪ δ_post < 0). Finally, κ_pre(s) (κ_post(s)) is the additional change induced by a presynaptic (postsynaptic) spike that occurs s milliseconds after a postsynaptic (presynaptic) spike. This modification of synaptic strength depends critically on the timing of pre- and postsynaptic spikes (Dan & Poo, 1992; Bell et al., 1997; Markram et al., 1997). Causality is respected if the kernels κ vanish identically for negative arguments.

In order to restrict the synaptic weights to the interval [0, 1], we treat synaptic potentiation and depression separately and assume that synaptic weight changes due to depression are proportional to the current synaptic weight J(t), whereas changes due to potentiation are proportional to [1 − J(t)], so that

    (d/dt) J(t) = [1 − J(t)] [dJ(t)/dt]^{LTP} − J(t) [dJ(t)/dt]^{LTD}  (5.2)

with

    [dJ(t)/dt]^{LTP/D} = S_pre(t) [δ_pre^{LTP/D} + ∫ dt′ κ_pre^{LTP/D}(t′) S_post(t − t′)] + S_post(t) [δ_post^{LTP/D} + ∫ dt′ κ_post^{LTP/D}(t′) S_pre(t − t′)],  (5.3)

and δ_{pre/post}^{LTP/D}, κ_{pre/post}^{LTP/D}(t′) ≥ 0.

We are interested in the net change ⟨ΔJ⟩ of the synaptic weight integrated over some time interval and averaged over an ensemble of presynaptic spike trains,

    ⟨ΔJ⟩ = ⟨∫ dt [dJ(t)/dt]⟩.  (5.4)

If synaptic weights change only adiabatically, that is, if the change of the synaptic weight is small as compared to J and (1 − J), we have

    ⟨ΔJ⟩ = [1 − J(t)] ⟨ΔJ^{LTP}⟩ − J(t) ⟨ΔJ^{LTD}⟩,  (5.5)

with

    ⟨ΔJ^{LTP/D}⟩ = ⟨∫ dt [dJ(t)/dt]^{LTP/D}⟩ = δ_pre^{LTP/D} ν̄_pre + δ_post^{LTP/D} ν̄_post + ∫ dt ∫ dt′ ν(t; t′) κ^{LTP/D}(t; t′),  (5.6)
where ν̄_{pre,post} = ∫ dt ν_{pre,post}(t) is the expected number of pre- and postsynaptic spikes, and κ^{LTP/D}(t; t′) = κ_pre^{LTP/D}(t − t′) + κ_post^{LTP/D}(t′ − t) is the change of the synaptic weight due to a presynaptic spike at time t and a postsynaptic spike at time t′. In passing we note that ν(t; t′) dt dt′ is the joint probability of finding a presynaptic spike in the interval [t, t + dt) and a postsynaptic spike in [t′, t′ + dt′).

5.2 Stationary Weights. In the following we concentrate on calculating the stationary synaptic weights defined by

    ⟨ΔJ⟩ = 0  ⟺  J = ⟨ΔJ^{LTP}⟩ / (⟨ΔJ^{LTP}⟩ + ⟨ΔJ^{LTD}⟩).  (5.7)

To this end, we return to the situation described at the beginning of section 4, where a neuron receives input through N independent synapses, each spike train being described by an inhomogeneous Poisson process with rate function ν_i, 1 ≤ i ≤ N. We study a single synapse and assume that this synapse conveys (on average) one action potential with a gaussian distribution centered around t = 0, so that ν(t) = G(t, σ_pre). The synaptic weight of this synapse is denoted by J. Furthermore, we assume that the presynaptic volley of action potentials triggers at most one postsynaptic spike, as is the case, for instance, if postsynaptic action potentials are followed by strong hyperpolarizing afterpotentials. This assumption relieves us from the need to discuss the influence of the shape of the afterpotential η because the postsynaptic spike statistics is now completely described by the first-passage-time density p_first.

We discuss the case where the firing time of a single presynaptic neuron and the postsynaptic firing time are virtually independent. This might be considered as an approximation of the case where either the corresponding synaptic weight is small or the number of overlapping EPSPs needed to reach the firing threshold is large. The joint probability ν_i(t, t′) of the firing of the presynaptic neuron i and the first postsynaptic spike is then simply the product of the probabilities,

    ν_i(t, t′) = ν_i(t) p_first(t′).  (5.8)
We investigate a learning rule that is inspired by the observation that a synapse is weakened (depressed) if the presynaptic spike arrives a short time after the postsynaptic spike and is strengthened (potentiated) if the presynaptic spike arrives a short time before the postsynaptic spike and, thus, is contributing to the postsynaptic neuron's firing (Markram et al., 1997). We can mimic this mechanism within our formalism by setting

    κ^{LTP}(t; t′) = ε^{LTP} exp[−(t′ − t)/τ^{LTP}] H(t′ − t),  (5.9)
Figure 4: Synaptic plasticity based on the timing of pre- and postsynaptic spikes. The kernels +κ^{LTP} (solid line) and −κ^{LTD} (dashed line) describe the modification of synaptic strength as it is caused by the combination of a postsynaptic spike at time 0 and a presynaptic spike arriving at time t. See equations 5.9 and 5.10 with ε^{LTP} = ε^{LTD} = 0.1 and τ^{LTP} = τ^{LTD} = 1.
and

k^LTD(t; t′) = ε^LTD exp[−(t − t′)/τ^LTD] H(t − t′),   (5.10)
with ε^{LTP/LTD} > 0 and τ^{LTP/LTD} as time constants of the "synaptic memory" (see Figure 4). Our choice of exponentials to describe the kernels k^LTP and k^LTD is motivated both by biological plausibility and by the need to keep the mathematical analysis as simple as possible. With these prerequisites, we can calculate the mean potentiation and depression due to the correlation of pre- and postsynaptic spikes (cf. equation 5.6) for the limiting cases of high and low threshold. If we approximate the postsynaptic firing by a gaussian distribution (cf. equation 4.12),

p_first(t) = ν̄_post G(t − t_0, σ_post),   (5.11)
we find

∫dt ∫dt′ ν(t, t′) k^LTP(t; t′)
  = ε^LTP ∫_{−∞}^{∞} dt ∫ dt′ G(t, σ_pre) ν̄_post G(t′ − t_0, σ_post) exp[−(t′ − t)/τ^LTP] H(t′ − t)
  = (1/2) ν̄_post ε^LTP exp[ (1/2)(σ/τ^LTP)² − t_0/τ^LTP ] erfc[ σ/(√2 τ^LTP) − t_0/(√2 σ) ]   (5.12)

and

∫dt ∫dt′ ν(t, t′) k^LTD(t; t′)
  = (1/2) ν̄_post ε^LTD exp[ (1/2)(σ/τ^LTD)² + t_0/τ^LTD ] erfc[ σ/(√2 τ^LTD) + t_0/(√2 σ) ],   (5.13)

with σ² = σ²_pre + σ²_post.
400
Werner M. Kistler and J. Leo van Hemmen
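As a numerical sanity check on equation 5.12 (our own sketch, not part of the original article): the double integral reduces to an average of the exponential LTP kernel over the spike-time difference t′ − t, which is gaussian with mean t_0 and variance σ² = σ²_pre + σ²_post. A Monte Carlo estimate of that average can be compared against the closed-form erfc expression; the parameter values below are purely illustrative.

```python
import math
import random

def ltp_closed_form(t0, sigma, tau, eps, nu_post):
    """Right-hand side of equation 5.12."""
    return (0.5 * nu_post * eps
            * math.exp(0.5 * (sigma / tau) ** 2 - t0 / tau)
            * math.erfc(sigma / (math.sqrt(2) * tau) - t0 / (math.sqrt(2) * sigma)))

def ltp_monte_carlo(t0, sigma, tau, eps, nu_post, n=200_000, seed=0):
    """Average eps * exp(-u/tau) * H(u) over u = t' - t ~ N(t0, sigma^2)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        u = rng.gauss(t0, sigma)      # spike-time difference t' - t
        if u > 0:                     # Heaviside factor H(t' - t)
            acc += math.exp(-u / tau)
    return nu_post * eps * acc / n

closed = ltp_closed_form(t0=0.5, sigma=0.5, tau=1.0, eps=0.1, nu_post=1.0)
sampled = ltp_monte_carlo(t0=0.5, sigma=0.5, tau=1.0, eps=0.1, nu_post=1.0)
print(closed, sampled)
```

The analogous check for the LTD expression (equation 5.13) only flips the signs of the t_0 terms and the Heaviside condition.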
Figure 5: Stationary synaptic weights. (A) Three-dimensional plot of the stationary synaptic weight for high threshold as a function of σ and t_0, where σ² = σ²_pre + σ²_post is the sum of the variances of pre- and postsynaptic firing time and t_0 is the time between the arrival of the presynaptic spike and the firing of the postsynaptic action potential. (B) Contour plot of the same function as in A. (C, D) Plot of the stationary synaptic weight in the case of low threshold ϑ as a function of the standard deviation σ_pre of the presynaptic firing time and the time t_0 when the expectation of the membrane potential reaches the threshold from below. The parameters used to describe synaptic plasticity are δ_post^LTD = 0.01, δ_pre^LTP = 0.001, δ_pre^LTD = δ_post^LTP = 0, ε^LTP = ε^LTD = 0.1, τ^LTP = τ^LTD = 1; see equations 5.2, 5.9, and 5.10.

Using equation 5.7, we are thus able to calculate the stationary weight of a synapse that conveys a single action potential with a gaussian distribution with variance σ²_pre, given that the postsynaptic spike occurs at t = t_0 with variance σ²_post. This result is illustrated in Figure 5. If the variance of both pre- and postsynaptic spikes is small, that is, σ² = σ²_pre + σ²_post ≪ 1, the stationary weight is dominated by the kernels k^{LTP/LTD}: the weight is close to zero if the presynaptic spike occurs after the postsynaptic one (t_0 < 0) and close to unity if the presynaptic spike arrives a short time before the postsynaptic action potential (t_0 > 0). For large variances, this dependence on the timing of the spikes is smoothed out, and the stationary
weight is determined by the rates of pre- and postsynaptic spikes. We obtain a similar result for low threshold if we approximate the postsynaptic firing time by an exponential distribution (cf. equation 4.17),

∫dt ∫dt′ ν(t, t′) k^LTP(t; t′)
  = ε^LTP ∫_{−∞}^{∞} dt ∫_{max(t, t_0)}^{∞} dt′ G(t, σ_pre) ν_max e^{−ν_max (t′ − t_0)} e^{−(t′ − t)/τ^LTP}
  = [ε^LTP ν_max / (ν_max + 1/τ^LTP)] { (1/2) exp[ (1/2)(σ_pre/τ^LTP)² − t_0/τ^LTP ] erfc[ σ_pre/(√2 τ^LTP) − t_0/(√2 σ_pre) ]
    + (1/2) exp[ (1/2)(σ_pre ν_max)² + t_0 ν_max ] erfc[ σ_pre ν_max/√2 + t_0/(√2 σ_pre) ] },   (5.14)

and

∫dt ∫dt′ ν(t, t′) k^LTD(t; t′)
  = [ε^LTD ν_max / (ν_max − 1/τ^LTD)] { (1/2) exp[ (1/2)(σ_pre/τ^LTD)² + t_0/τ^LTD ] erfc[ σ_pre/(√2 τ^LTD) + t_0/(√2 σ_pre) ]
    − (1/2) exp[ (1/2)(σ_pre ν_max)² + t_0 ν_max ] erfc[ σ_pre ν_max/√2 + t_0/(√2 σ_pre) ] }.   (5.15)
5.3 Synaptic Weights as a Function of the Input Statistics. We saw in the previous section that the stationary synaptic weights can be calculated as a function of the parameters characterizing the distribution of the pre- and postsynaptic spikes. The synaptic weights, on the other hand, determine the distribution of the postsynaptic spike. If we are interested in the synaptic weights that are produced by given input statistics, we thus have to solve a self-consistency problem.
The self-consistency problem can be solved numerically for the limiting cases of low and high threshold, where the distribution of the postsynaptic firing time is completely characterized by only a few parameters. That is, starting from a given vector J of synaptic couplings, we calculate the parameters that characterize the postsynaptic firing time using the results of Section 4 in the limiting cases of high and low firing threshold. In a second step, we calculate the corresponding stationary synaptic weight vector J̄ as it is determined by the statistics of pre- and postsynaptic firing times (cf. Section 5.2). Finally, we solve the self-consistency equation J = J̄ using standard numerics. In the simple case where all synapses deliver spikes with common statistics, the stable solution of the self-consistency equation is unique, because the mean postsynaptic firing time is a monotone function of the synaptic weight. But if the synapses differ, say, with respect to the mean spike arrival time, the solution in general will no longer be unique. We illustrate the solution of the self-consistency problem by means of an example and consider a neuron that receives spikes via N independent synapses. Again, each presynaptic spike train is described by an inhomogeneous Poisson process with a gaussian rate G(t, σ_i) centered at t = 0. In contrast to our previous examples, the synapses differ with respect to the temporal precision σ_i with which they deliver their action potentials. Figure 6A shows the resulting synaptic weights if we start with two groups of presynaptic neurons that deliver action potentials with σ_1 = 0.1 and σ_2 = 1.0, respectively. The numbers of neurons contained in group 1 and group 2 are n_1 = 20 and n_2 = 80, respectively, but the results turned out to be insensitive to the actual numbers. (A more balanced ratio of synapses would shift the curve J_1(ϑ) slightly to the left.)
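The fixed-point structure of this self-consistency problem can be illustrated with a toy example (our own sketch, not the authors' code). The function `j_bar` below is a hypothetical stand-in for the map "synaptic weight → postsynaptic statistics → stationary weight", chosen only to be smooth, bounded in [0, 1], and monotone like the true map; damped fixed-point iteration then converges to a weight satisfying J = J̄(J).

```python
import math

def j_bar(j):
    """Hypothetical stand-in for the map J -> stationary weight J-bar(J):
    smooth, monotone, and bounded in [0, 1], as the true map is."""
    return 1.0 / (1.0 + math.exp(-4.0 * (j - 0.3)))

def solve_self_consistency(f, j0=0.5, damping=0.5, tol=1e-10, max_iter=10_000):
    """Damped fixed-point iteration J <- (1 - d) J + d f(J)."""
    j = j0
    for _ in range(max_iter):
        j_new = (1.0 - damping) * j + damping * f(j)
        if abs(j_new - j) < tol:
            return j_new
        j = j_new
    return j

j_star = solve_self_consistency(j_bar)
print(j_star, abs(j_star - j_bar(j_star)))
```

The damping factor is a standard safeguard; for a monotone map with a single crossing, the iteration converges to the unique stationary weight.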
We compare the results obtained from the unique solution of the self-consistency equation with simulations of a straightforward implementation of the synaptic dynamics given in equation 5.2. The number of iterations required for the synaptic weights to settle to their stationary values depends on the amplitude of the synaptic weight change described by δ^{LTP/LTD} and ε^{LTP/LTD} and on the amount of postsynaptic activity and, thus, on the firing threshold. With the parameters used in Figure 6, stationarity is usually reached in fewer than 1000 iterations. Finally, a brief discussion of the results shown in Figure 6 is in order. For a high firing threshold, the synaptic weights of both groups are close to their maximum value, because otherwise the neuron would not fire at all. This is due to the constant presynaptic potentiation δ_pre^LTP > 0. For very low threshold, the postsynaptic firing is triggered by the first spikes that arrive at the neuron. These spikes stem from presynaptic neurons with a broad firing-time distribution (group 2). Spikes from group 1 neurons thus arrive mostly after the postsynaptic neuron has already fired, and the corresponding synapses are depressed accordingly. This explains why synapses
Figure 6: (A) Synaptic weights for a neuron receiving input from two groups of synapses. One group (n_1 = 20) delivers precisely timed spikes (σ_pre = 0.1), and the other one (n_2 = 80) delivers spikes with a broad distribution of arrival times (σ_pre = 1.0). The synapses are subject to a mechanism that modifies the synaptic strength according to the relative timing of pre- and postsynaptic spikes; see equation 5.2. The upper trace shows the resulting synaptic weight for the group of precise synapses; the lower trace corresponds to the second group. The solid lines give the analytic result for either the low-threshold (ϑ < 0.4) or the high-threshold (ϑ > 0.4) approximation. The dashed lines show the results of a computer simulation. The parameters for the synaptic plasticity are the same as in Figure 5. (B, C, D) Precision Δt⁻¹, reliability ν̄_post, and efficiency ν̄_post/Δt as a function of the threshold ϑ for the same neuron as in A. See Figure 3 for the definition of ν̄_post and Δt.
delivering precisely timed spikes are weaker than group 2 synapses, unless the firing threshold exceeds a certain critical value. The most interesting finding is that in the intermediate range of the firing threshold, synapses that deliver precisely timed spikes are substantially stronger than synapses with poor temporal precision. Furthermore, there is an optimal threshold at ϑ ≈ 0.2, where the ratio of the synaptic strengths of precise and imprecise synapses is largest. The firing of the postsynaptic neuron is thus driven mainly by spikes from group 1 neurons, and the jitter of the postsynaptic spikes is minimal (see Figure 6B).

6 Discussion
We have presented a stochastic neuron model that allows for a description of pre- and postsynaptic spike trains by inhomogeneous Poisson processes
provided that the number of overlapping postsynaptic potentials is not too small. In view of the huge number of synapses that a single neuron carries on its dendritic tree and the number of synchronous excitatory postsynaptic potentials required to reach threshold (10^4 to 10^5 synapses and several tens of EPSPs), the approximations used seem to be justified. Since firing of a neuron is triggered by a large number of synchronous presynaptic action potentials, the postsynaptic firing time and the arrival time of a single presynaptic spike can be assumed to be independent, and the correlation of pre- and postsynaptic firing times is readily calculated. In addition to the spike-triggering mechanism, we have developed a model of synaptic long-term plasticity based on the relative timing of pre- and postsynaptic firing events. The dynamics of the synaptic weights can be analyzed in connection with the former neuron model. In particular, stationary synaptic weights can be calculated as limit states for a given ensemble of presynaptic spike trains. It turns out that this mechanism is able to tune the synaptic weights according to the temporal precision of the spikes they convey. Synapses that deliver precisely timed action potentials are favored at the expense of synapses that deliver spikes with a large temporal jitter. This learning process does not rely on an external teacher signal; it is unsupervised. Furthermore, there is only one free parameter that has to be fixed: the firing threshold ϑ. It is conceivable that a small, local network of inhibitory interneurons could be entrusted with setting the firing threshold. The result concerning the tuning of synaptic weights has two consequences that are open to experimental verification. First, it shows that neurons are able to tune their synaptic weights so as to select inputs from those presynaptic neurons that provide information that is meaningful in the sense of a temporal code.
Second, this tuning ensures the preservation of temporally encoded information, because noisy synapses are depressed and the neuron is thus able to fire its action potentials with high temporal precision. We believe that this mechanism has clear relevance to any information processing based on temporal spike coding.

Acknowledgments
W. M. K. gratefully acknowledges financial support from the Boehringer Ingelheim Fonds.

References

Alonso, A., de Curtis, M., & Llinás, R. (1990). Postsynaptic Hebbian and non-Hebbian long-term potentiation of synaptic efficacy in the entorhinal cortex in slices and in the isolated adult guinea pig brain. Proc. Natl. Acad. Sci. USA, 87, 9280–9284.
Bartsch, A. P., & van Hemmen, J. L. (1998). Correlations in networks of spiking neurons. Unpublished manuscript. Munich: Physics Department, Technical University of Munich.
Bell, C. C., Han, V. Z., Sugawara, Y., & Grant, K. (1997). Synaptic plasticity in a cerebellum-like structure depends on temporal order. Nature, 387, 278–281.
Dan, Y., & Poo, M. (1992). Hebbian depression of isolated neuromuscular synapses in vitro. Science, 256, 1570–1573.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., Kempter, R., van Hemmen, J. L., & Wagner, H. (1996). A neuronal learning rule for sub-millisecond temporal coding. Nature, 384, 76–78.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Comput., 10, 1987–2017.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Kistler, W. M., & van Hemmen, J. L. (1999). Short-term synaptic plasticity and network behavior. Neural Comput., 11, 1579–1594.
Lamperti, J. (1966). Probability: A survey of the mathematical theory. New York: Benjamin.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Nelson, P. G., Fields, R. D., Yu, C., & Liu, Y. (1993). Synapse elimination from the mouse neuromuscular junction in vitro: A non-Hebbian activity-dependent process. J. Neurobiol., 24, 1517–1530.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Salin, P. A., Malenka, R. C., & Nicoll, R. A. (1996). Cyclic AMP mediates a presynaptic form of LTP at cerebellar parallel fiber synapses. Neuron, 16, 797–803.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
Urban, N. N., & Barrionuevo, G. (1996). Induction of Hebbian and non-Hebbian mossy fiber long-term potentiation by distinct patterns of high-frequency stimulation. J. Neurosci., 16, 4293–4299.

Received August 3, 1998; accepted April 29, 1999.
LETTER
Communicated by John Platt
On-line EM Algorithm for the Normalized Gaussian Network

Masa-aki Sato
ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan

Shin Ishii
Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0101, Japan

A normalized gaussian network (NGnet) (Moody & Darken, 1989) is a network of local linear regression units. The model softly partitions the input space by normalized gaussian functions, and each local unit linearly approximates the output within its partition. In this article, we propose a new on-line EM algorithm for the NGnet, which is derived from the batch EM algorithm (Xu, Jordan, & Hinton, 1995) by introducing a discount factor. We show that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In addition, we show that the on-line EM algorithm can be considered a stochastic approximation method to find the maximum likelihood estimator. A new regularization method is proposed in order to deal with a singular input distribution. In order to manage dynamic environments, where the input-output distribution of data changes over time, unit manipulation mechanisms such as unit production, unit deletion, and unit division are also introduced, based on a probabilistic interpretation. Experimental results show that our approach is suitable for function approximation problems in dynamic environments. We also apply our on-line EM algorithm to robot dynamics problems and compare our algorithm with the mixtures-of-experts family.

1 Introduction
The main aim of this work is to develop a fast on-line learning method that can deal with dynamic environments. We use a normalized gaussian network (NGnet) (Moody & Darken, 1989) for this purpose. The NGnet is a network of local linear regression units. The model softly partitions the input space by normalized gaussian functions, and each local unit linearly approximates the output within its partition. Since the NGnet is a local model, it is possible to change the parameters of only several units in order to learn a single datum. Therefore, the learning process becomes easier than that of global models such as the multilayered perceptron. In a local model, on the other hand, the number of necessary units grows exponentially as the

Neural Computation 12, 407–432 (2000)
© 2000 Massachusetts Institute of Technology
input dimension increases if one wants to approximate the input-output relationship over the whole input space. This results in a computational explosion, which is often called the curse of dimensionality. However, actual data often lie in a lower-dimensional space than the input space, such as on attractors of dynamical systems. The NGnet, a kind of mixtures-of-experts model (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994), can be interpreted as the output of a stochastic model with hidden variables. The model parameters can be determined by the maximum likelihood estimation method. In particular, the expectation-maximization (EM) algorithm for the NGnet was derived by Xu, Jordan, and Hinton (1995). In this article, we propose a new on-line EM algorithm for the NGnet. The on-line EM algorithm is derived from the batch EM algorithm (Xu et al., 1995) by introducing a discount factor. We show that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In addition, we show that the on-line EM algorithm can be considered a stochastic approximation method to find the maximum likelihood estimator. Similar on-line EM algorithms have been proposed by Nowlan (1991) and Jordan and Jacobs (1994) for mixture models. However, there have been no convergence proofs for these on-line EM algorithms. Neal and Hinton (1998) proposed a wide variety of incremental EM algorithms and proved the convergence of their algorithms. One drawback of their algorithms is that they either need additional storage variables for all the training data or need to see all the training data at each iteration. This prevents these algorithms from being applied to real-time on-line settings where new data are supplied indefinitely. For practical applications, important modifications to the on-line EM algorithm are necessary.
A new regularization method is proposed in order to deal with a singular input distribution. In order to manage dynamic environments, where the input-output distribution of data changes over time, unit manipulation mechanisms such as unit production, unit deletion, and unit division are also introduced based on the probabilistic interpretation. The applicability of our new on-line EM algorithm is investigated using function approximation problems in three circumstances: a usual function approximation for a set of observed data; a function approximation in which the distribution of the input data changes over time (this experiment is performed to check the applicability of our approach to dynamic environments); and a function approximation in which the distribution of the input data is singular (i.e., the dimension of the input data distribution is smaller than the input space dimension). In this last case, a straightforward application of the basic on-line EM algorithm does not work, because the covariance matrices used in the NGnet become singular. Our modified on-line EM algorithm works well even in this situation. Finally, we applied our on-line EM algorithm to realistic and difficult problems in robot dynamics. The performance of our on-line EM algorithm is compared with those of the mixtures-of-experts family.
2 Normalized Gaussian Network
The NGnet model, which transforms an N-dimensional input vector x to a D-dimensional output vector y, is defined by the following equations:

y = Σ_{i=1}^{M} 𝒩_i(x) W̃_i x̃   (2.1a)

𝒩_i(x) ≡ G_i(x) / Σ_{j=1}^{M} G_j(x)   (2.1b)

G_i(x) ≡ (2π)^{−N/2} |Σ_i|^{−1/2} exp[ −(1/2)(x − μ_i)′ Σ_i^{−1} (x − μ_i) ].   (2.1c)
M denotes the number of units, and the prime (′) denotes a transpose. G_i(x) is an N-dimensional gaussian function; its center is an N-dimensional vector μ_i, and its covariance matrix is an (N × N)-dimensional matrix Σ_i. |Σ_i| is the determinant of the matrix Σ_i. 𝒩_i(x) is the ith normalized gaussian function. W̃_i ≡ (W_i, b_i) is a (D × (N + 1))-dimensional linear regression matrix, x̃ ≡ (x′, 1)′, and W̃_i x̃ ≡ W_i x + b_i. The gaussian function G_i(x) is a kind of radial basis function (Poggio & Girosi, 1990). However, the normalized gaussian functions 𝒩_i(x) (i = 1, …, M) do not have radial symmetry. They softly partition the input space into M regions. The ith unit linearly approximates its output by W̃_i x̃ within the corresponding region. An output of the NGnet is given by a summation of these outputs weighted by the normalized gaussian functions.

The NGnet (see equation 2.1) can be interpreted as a stochastic model in which a pair of an input and an output, (x, y), is a stochastic (incomplete) event. For each event, a single unit is assumed to be selected from a set of classes {i | i = 1, …, M}. The unit index i is regarded as a hidden variable. A triplet (x, y, i) is called a complete event. The stochastic model is defined by the probability distribution for a complete event (Xu et al., 1995),

P(x, y, i | θ) = (2π)^{−(D+N)/2} σ_i^{−D} |Σ_i|^{−1/2} M^{−1}
  × exp[ −(1/2)(x − μ_i)′ Σ_i^{−1} (x − μ_i) − (1/(2σ_i²)) (y − W̃_i x̃)² ],   (2.2)
where θ ≡ {μ_i, Σ_i, σ_i², W̃_i | i = 1, …, M} is a set of model parameters. From the probability distribution, 2.2, the following probabilities are obtained:

P(i | θ) = 1/M   (2.3a)

P(x | i, θ) = G_i(x)   (2.3b)

P(y | x, i, θ) = (2π)^{−D/2} σ_i^{−D} exp[ −(1/(2σ_i²)) (y − W̃_i x̃)² ].   (2.3c)
These probabilities define a stochastic data generation model. First, a unit is randomly selected with equal probability (see equation 2.3a). If the ith unit is selected, the input x is generated according to the gaussian distribution (see equation 2.3b). Given the unit index i and the input x, the output y is generated according to the gaussian distribution (see equation 2.3c), which has the mean W̃_i x̃ and the variance σ_i². When the input x is observed, the probability that the output value becomes y in this model turns out to be

P(y | x, θ) = Σ_{i=1}^{M} 𝒩_i(x) P(y | x, i, θ).   (2.4)
which is equivalent to the output of the NGnet (see equation 2.1a). The probability distribution (see equation 2.2) provides a stochastic model for the NGnet. 3 EM Algorithm
From a set of T events (observed data), ({x}, {y}) ≡ {(x(t), y(t)) | t = 1, …, T}, the model parameter θ of the stochastic model (see equation 2.2) can be determined by the maximum likelihood estimation method. In particular, the EM algorithm (Dempster, Laird, & Rubin, 1977) can be applied to models having hidden variables. The EM algorithm repeats the following E (expectation) step and M (maximization) step. Since the likelihood for the set of observations increases (or does not change) after an E- and M-step, the maximum likelihood estimator is asymptotically obtained by repeating the E- and M-steps.

E-step. Let θ̄ be the current estimator value. By using θ̄, the posterior probability that the ith unit is selected for each observation (x(t), y(t)) is calculated according to Bayes' rule:

P(i | x(t), y(t), θ̄) = P(x(t), y(t), i | θ̄) / Σ_{j=1}^{M} P(x(t), y(t), j | θ̄).   (3.1)
M-step. By using the posterior probability (see equation 3.1), the expected log-likelihood Q(θ | θ̄, {x}, {y}) for the complete events is defined by

Q(θ | θ̄, {x}, {y}) = Σ_{t=1}^{T} Σ_{i=1}^{M} P(i | x(t), y(t), θ̄) log P(x(t), y(t), i | θ).   (3.2)
On the other hand, the log-likelihood of the observed data ({x}, {y}) is given by

L(θ | {x}, {y}) = Σ_{t=1}^{T} log P(x(t), y(t) | θ) = Σ_{t=1}^{T} log( Σ_{i=1}^{M} P(x(t), y(t), i | θ) ).   (3.3)
Since an increase of Q(θ | θ̄, {x}, {y}) implies an increase of the log-likelihood L(θ | {x}, {y}) (Dempster et al., 1977), Q(θ | θ̄, {x}, {y}) is maximized with respect to the estimator θ. A solution of the necessary condition ∂Q/∂θ = 0 is given (Xu et al., 1995) by

μ_i = ⟨x⟩_i(T) / ⟨1⟩_i(T)   (3.4a)

Σ_i = ⟨(x − μ_i)(x − μ_i)′⟩_i(T) / ⟨1⟩_i(T) = ⟨xx′⟩_i(T) / ⟨1⟩_i(T) − μ_i(T) μ_i′(T)   (3.4b)

W̃_i ⟨x̃x̃′⟩_i(T) = ⟨yx̃′⟩_i(T)   (3.4c)

σ_i² = (1/D) ⟨|y − W̃_i x̃|²⟩_i(T) / ⟨1⟩_i(T) = (1/D) [ ⟨|y|²⟩_i(T) − Tr( W̃_i ⟨x̃y′⟩_i(T) ) ] / ⟨1⟩_i(T),   (3.4d)

where Tr(·) denotes a matrix trace. The symbol ⟨·⟩_i denotes a weighted mean with respect to the posterior probability (see equation 3.1), defined by

⟨f(x, y)⟩_i(T) ≡ (1/T) Σ_{t=1}^{T} f(x(t), y(t)) P(i | x(t), y(t), θ̄).   (3.5)
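One E- and M-step cycle (equations 3.1, 3.4, and 3.5) can be sketched in NumPy as follows. This is our illustration, not the authors' code: a one-dimensional toy data set (N = D = 1, M = 2 units) generated from a noisy piecewise-linear function, with no regularization, so it assumes the covariances stay non-singular.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: scalar input/output from a noisy piecewise-linear function.
T = 400
x = rng.uniform(-1, 1, T)
y = np.where(x < 0, -x, 2 * x) + 0.05 * rng.normal(size=T)

M = 2
mu = np.array([-0.5, 0.5])          # centers mu_i (N = 1)
var = np.ones(M)                    # Sigma_i (scalar here)
W = np.zeros((M, 2))                # W~_i = (w_i, b_i), D = 1
s2 = np.ones(M)                     # sigma_i^2

x_t = np.stack([x, np.ones(T)], axis=1)   # x~ = (x, 1)

for _ in range(50):
    # E-step (equation 3.1): posteriors from the complete-event model 2.2.
    logp = np.empty((T, M))
    for i in range(M):
        r = y - x_t @ W[i]
        logp[:, i] = (-0.5 * (x - mu[i]) ** 2 / var[i] - 0.5 * np.log(var[i])
                      - 0.5 * r ** 2 / s2[i] - 0.5 * np.log(s2[i]))
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)

    # M-step (equations 3.4a-d) via posterior-weighted means (equation 3.5).
    for i in range(M):
        w1 = p[:, i].mean()                                  # <1>_i(T)
        mu[i] = (p[:, i] * x).mean() / w1                    # 3.4a
        var[i] = (p[:, i] * x * x).mean() / w1 - mu[i] ** 2  # 3.4b
        A = (x_t * p[:, [i]]).T @ x_t / T                    # <x~x~'>_i(T)
        c = (x_t * p[:, [i]]).T @ y / T                      # <y x~'>_i(T)
        W[i] = np.linalg.solve(A, c)                         # 3.4c
        r = y - x_t @ W[i]
        s2[i] = (p[:, i] * r ** 2).mean() / w1               # 3.4d

print(np.sort(mu), W, s2)
```

After a few dozen iterations the two units specialize on the two linear pieces, recovering slopes near −1 and 2.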
If an infinite number of data, drawn independently according to an unknown data distribution density ρ(x, y), are given, the weighted mean (see equation 3.5) converges to the following expectation value,

⟨f(x, y)⟩_i(T) → E[f(x, y) P(i | x, y, θ̄)]_ρ   (T → ∞),   (3.6)

where E[·]_ρ denotes the expectation value with respect to the data distribution density ρ(x, y). In this case, the M-step equation (see equation 3.4) can be written as

γ_i ≡ E[P(i | x, y, θ̄)]_ρ   (3.7a)
μ_i = E[x P(i | x, y, θ̄)]_ρ / γ_i   (3.7b)

Σ_i = E[xx′ P(i | x, y, θ̄)]_ρ / γ_i − μ_i μ_i′   (3.7c)

W̃_i E[x̃x̃′ P(i | x, y, θ̄)]_ρ = E[yx̃′ P(i | x, y, θ̄)]_ρ   (3.7d)

σ_i² = (1/D) ( E[|y|² P(i | x, y, θ̄)]_ρ − Tr( W̃_i E[x̃y′ P(i | x, y, θ̄)]_ρ ) ) / γ_i,   (3.7e)

where the new parameter set, θ = {μ_i, Σ_i, W̃_i, σ_i² | i = 1, …, M}, is calculated by using the old parameter set, θ̄ = {μ̄_i, Σ̄_i, W̄̃_i, σ̄_i² | i = 1, …, M}. At an equilibrium point of this EM algorithm, the old and new parameters become the same: θ = θ̄. The equilibrium condition of the EM algorithm (equation 3.7 along with θ = θ̄) is equivalent to the maximum likelihood condition,

∂L(θ)/∂θ = 0,   (3.8)

where the log-likelihood function is given by

L(θ) = E[ log Σ_{i=1}^{M} P(x, y, i | θ) ]_ρ.   (3.9)
4 On-line EM Algorithm

4.1 Basic Algorithm. The EM algorithm introduced in the previous section is based on batch learning (Xu et al., 1995); the parameters are updated after seeing all the observed data ({x}, {y}). In this section, we derive an on-line version of the EM algorithm. Since the estimator is changed after each observation, let θ(t) be the estimator after the tth observation (x(t), y(t)). In the on-line EM algorithm, the weighted mean (see equation 3.5) is replaced by

⟨⟨f(x, y)⟩⟩_i(T) ≡ η(T) Σ_{t=1}^{T} ( Π_{s=t+1}^{T} λ(s) ) f(x(t), y(t)) P(i | x(t), y(t), θ(t − 1))   (4.1a)

η(T) ≡ ( Σ_{t=1}^{T} Π_{s=t+1}^{T} λ(s) )^{−1}.   (4.1b)

Here, the parameter λ(t) (0 ≤ λ(t) ≤ 1) is a time-dependent discount factor, which is introduced for forgetting the effect of old posterior values that employed an earlier, inaccurate estimator. η(T) is a normalization coefficient and plays a role like a learning rate.
The modified weighted mean ⟨⟨·⟩⟩_i can be obtained by the step-wise equation,

⟨⟨f(x, y)⟩⟩_i(t) = ⟨⟨f(x, y)⟩⟩_i(t − 1) + η(t) [ f(x(t), y(t)) P_i(t) − ⟨⟨f(x, y)⟩⟩_i(t − 1) ]   (4.2a)

η(t) = (1 + λ(t)/η(t − 1))^{−1},   (4.2b)

where P_i(t) ≡ P(i | x(t), y(t), θ(t − 1)). The constraint 0 ≤ λ(t) ≤ 1 gives the constraint on η(t):

1 ≥ η(t) ≥ 1/t.   (4.3)

If this constraint is satisfied, equation 4.2b can be solved for λ(t) as

λ(t) = η(t − 1)(1/η(t) − 1).   (4.4)
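With the discount factor held at λ(t) = 1, equation 4.2b gives η(t) = 1/t, and the step-wise update 4.2a is just a running average. A short sketch (ours, with arbitrary numbers standing in for f(x(t), y(t)) and the posterior P_i(t)) checking that it reproduces the batch weighted mean of equation 3.5:

```python
import random

rng = random.Random(42)
f_vals = [rng.uniform(-1, 1) for _ in range(100)]   # stand-ins for f(x(t), y(t))
post = [rng.uniform(0, 1) for _ in range(100)]      # stand-ins for P_i(t)

mean_sw = 0.0
eta_prev = None
for t, (f, p) in enumerate(zip(f_vals, post), start=1):
    lam = 1.0                                                 # lambda(t) = 1
    eta = 1.0 if eta_prev is None else 1.0 / (1.0 + lam / eta_prev)  # 4.2b
    mean_sw += eta * (f * p - mean_sw)                        # 4.2a
    eta_prev = eta

batch = sum(f * p for f, p in zip(f_vals, post)) / len(f_vals)  # equation 3.5
print(mean_sw, batch)
```

The two printed values agree up to floating-point rounding; λ(t) < 1 would instead discount old terms geometrically.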
After calculating the modified weighted means, that is, ⟨⟨1⟩⟩_i(t), ⟨⟨x⟩⟩_i(t), ⟨⟨xx′⟩⟩_i(t), ⟨⟨|y|²⟩⟩_i(t), and ⟨⟨yx̃′⟩⟩_i(t), according to equation 4.2, the new estimator θ(t) is obtained by the following equations:

μ_i(t) = ⟨⟨x⟩⟩_i(t) / ⟨⟨1⟩⟩_i(t)   (4.5a)

Σ_i^{−1}(t) = [ ⟨⟨xx′⟩⟩_i(t) / ⟨⟨1⟩⟩_i(t) − μ_i(t) μ_i′(t) ]^{−1}   (4.5b)

W̃_i(t) = ⟨⟨yx̃′⟩⟩_i(t) [ ⟨⟨x̃x̃′⟩⟩_i(t) ]^{−1}   (4.5c)

σ_i²(t) = (1/D) [ ⟨⟨|y|²⟩⟩_i(t) − Tr( W̃_i(t) ⟨⟨x̃y′⟩⟩_i(t) ) ] / ⟨⟨1⟩⟩_i(t).   (4.5d)

This defines the on-line EM algorithm. In this algorithm, calculations of the inverse matrices are necessary at each time step. A recursive formula for Σ_i^{−1} and W̃_i can be derived using a standard method (Jürgen, 1996), so there is no need to calculate the matrix inverse. Let us define the weighted inverse covariance matrix of x̃: Λ̃_i(t) ≡ (⟨⟨x̃x̃′⟩⟩_i(t))^{−1}. This quantity can be obtained by the step-wise equation:

Λ̃_i(t) = [1/(1 − η(t))] [ Λ̃_i(t − 1) − P_i(t) Λ̃_i(t − 1) x̃(t) x̃′(t) Λ̃_i(t − 1) / ( (1/η(t) − 1) + P_i(t) x̃′(t) Λ̃_i(t − 1) x̃(t) ) ].   (4.6)

Σ_i^{−1}(t) can be obtained from the following relation with Λ̃_i(t):

Λ̃_i(t) ⟨⟨1⟩⟩_i(t) =
  [ Σ_i^{−1}(t)              −Σ_i^{−1}(t) μ_i(t)
    −μ_i′(t) Σ_i^{−1}(t)     1 + μ_i′(t) Σ_i^{−1}(t) μ_i(t) ].   (4.7)
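Equation 4.6 is a Sherman–Morrison update of the inverse of A_i(t) ≡ ⟨⟨x̃x̃′⟩⟩_i(t) = (1 − η(t)) A_i(t − 1) + η(t) P_i(t) x̃(t)x̃′(t). A small NumPy check (our sketch, using random data, random posteriors, and an identity-matrix initialization so the recursion is well defined) that the recursion tracks the directly inverted matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3                                   # dimension of x~

A = np.eye(n)                           # <<x~x~'>>_i(0), initialized to I
Lam = np.linalg.inv(A)                  # Lambda~_i(0)

for t in range(1, 200):
    eta = 1.0 / (t + 1)                 # learning rate, kept < 1 so 4.6 is nonsingular
    P = rng.uniform(0.1, 1.0)           # stand-in for the posterior P_i(t)
    xt = rng.normal(size=n)             # stand-in for x~(t)

    # Direct update of the weighted mean A and (later) its inverse.
    A = (1 - eta) * A + eta * P * np.outer(xt, xt)

    # Recursive update, equation 4.6 (Sherman-Morrison form).
    Lx = Lam @ xt
    Lam = (Lam - P * np.outer(Lx, Lx) / ((1 / eta - 1) + P * xt @ Lx)) / (1 - eta)

print(np.max(np.abs(Lam - np.linalg.inv(A))))
```

The printed maximal deviation stays at floating-point noise level, confirming that each step of 4.6 costs only O(n²) instead of the O(n³) of a fresh inversion.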
The estimator for the linear regression matrix, W̃_i, is given by

W̃_i(t) = ⟨⟨yx̃′⟩⟩_i(t) Λ̃_i(t)   (4.8a)
  = W̃_i(t − 1) + η(t) P_i(t) ( y(t) − W̃_i(t − 1) x̃(t) ) x̃′(t) Λ̃_i(t).   (4.8b)

Thus, the basic on-line EM algorithm is summarized as follows. For the tth observation (x(t), y(t)), the weighted means ⟨⟨1⟩⟩_i(t), ⟨⟨x⟩⟩_i(t), ⟨⟨|y|²⟩⟩_i(t), and ⟨⟨x̃y′⟩⟩_i(t) are calculated by using the step-wise equation 4.2. Λ̃_i(t) is also calculated with the step-wise equation 4.6. Subsequently, the estimators for the model parameters are obtained by equations 4.5a, 4.5d, 4.7, and 4.8b.

4.2 Equivalence of On-line and Batch EM Algorithms. In this part, we show that the basic on-line EM algorithm defined by equations 3.1, 4.2, and 4.5 is equivalent to the batch EM algorithm defined by equations 3.1, 3.4, and 3.5, with an appropriate choice of λ(t). It is assumed that the same set of data, {(x(t), y(t)) | t = 1, …, T}, is repeatedly supplied to the learning system, that is, x(t + T) = x(t) and y(t + T) = y(t). The time t is represented by an epoch index k (= 1, 2, …) and a data number index m (= 1, …, T) within an epoch, that is, t = (k − 1)T + m. Let us assume that the model parameter θ is updated at the end of each epoch, when m = T, and denote it by θ(k). Let us suppose that λ(t) is given by

λ(t) = 0 if t = (k − 1)T + 1, and λ(t) = 1 otherwise.   (4.9)
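The effect of the schedule in equation 4.9 can be checked numerically (our sketch; the posteriors are held fixed across epochs to isolate the averaging identity, whereas in the real algorithm they would be recomputed from θ(k − 1) at each epoch):

```python
import random

rng = random.Random(7)
T = 50
data = [(rng.uniform(-1, 1), rng.uniform(0, 1)) for _ in range(T)]  # (f, P_i) stand-ins

batch = sum(f * p for f, p in data) / T        # batch weighted mean, equation 3.5

mean_sw, eta = 0.0, None
for k in range(3):                             # three epochs over the same data
    for m, (f, p) in enumerate(data, start=1):
        lam = 0.0 if m == 1 else 1.0           # equation 4.9: reset at epoch start
        eta = 1.0 if m == 1 else 1.0 / (1.0 + lam / eta)   # 4.2b, giving eta = 1/m
        mean_sw += eta * (f * p - mean_sw)     # 4.2a
    # At each epoch end, the on-line mean equals the batch mean of 3.5.
print(mean_sw, batch)
```

Setting λ = 0 at the first datum of an epoch makes η = 1, which discards all history and re-initializes the running mean, exactly as described in the text below.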
The corresponding η(t) is given by η((k − 1)T + m) = 1/m. Then the weighted mean ⟨⟨f(x, y)⟩⟩_i(t) is initialized to f(x(1), y(1)) P(i | x(1), y(1), θ(k − 1)) at the beginning of an epoch. At the end of an epoch, ⟨⟨f(x, y)⟩⟩_i(kT) = ⟨f(x, y)⟩_i(T) is satisfied for θ̄ = θ(k − 1). This shows that the on-line EM algorithm, together with λ(t) given by equation 4.9, is equivalent to the batch EM algorithm. Consequently, the estimator of the on-line EM algorithm converges to the maximum likelihood estimator. It should be noted that the step-wise equation 4.6 for Λ̃_i(t) cannot be used in this case, since the equation becomes singular for η(t) = 1.

4.3 Stochastic Approximation. If an infinite number of data, drawn independently according to the data distribution density ρ(x, y), are available, the on-line EM algorithm can be considered a stochastic approximation (Kushner & Yin, 1997) for obtaining the maximum likelihood estimator, as demonstrated below. Let w be a compact notation for the weighted means: w(t) ≡ {⟨⟨1⟩⟩_i(t), ⟨⟨x⟩⟩_i(t), ⟨⟨xx′⟩⟩_i(t), ⟨⟨yx̃′⟩⟩_i(t), ⟨⟨|y|²⟩⟩_i(t) | i = 1, …, M}. The on-line EM algorithm, equations 4.2 and 4.5, can be written in an abstract form:

δw(t) ≡ w(t) − w(t − 1) = η(t) [ F(x(t), y(t), θ(t − 1)) − w(t − 1) ]   (4.10a)
θ(t) = H(w(t)).   (4.10b)
It can be easily proved that the set of equations,

w = E[F(x, y, θ)]_ρ    (4.11a)
θ = H(w),    (4.11b)

is equivalent to the maximum likelihood condition, equation 3.8. Then the on-line EM algorithm can be written as

δw(t) = η(t) (E[F(x, y, H(w(t − 1)))]_ρ − w(t − 1)) + η(t) f(x(t), y(t), w(t − 1)),
(4.12)
where the stochastic noise term f is defined by

f(x(t), y(t), w(t − 1)) ≡ F(x(t), y(t), H(w(t − 1))) − E[F(x, y, H(w(t − 1)))]_ρ ,
(4.13)
and it satisfies

E[f(x, y, w)]_ρ = 0.    (4.14)
Equation 4.12 has the same form as the Robbins–Monro stochastic approximation (Kushner & Yin, 1997), which finds the maximum likelihood estimator given by equation 4.11. The effective learning coefficient η(t) should satisfy the conditions

η(t) → 0 (t → ∞),   Σ_{t=1}^∞ η(t) = ∞,   Σ_{t=1}^∞ η²(t) < ∞.    (4.15)

Typically, η(t), which satisfies conditions 4.3 and 4.15, is given by

η(t) → 1/(at + b)  (t → ∞)   (1 > a > 0).    (4.16)
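As a quick numerical illustration of the Robbins–Monro behavior: with a schedule of the form η(t) = 1/(at + b), the abstract update 4.12, here reduced to estimating a simple mean from noisy observations, converges to the expectation. The target distribution and the constants a, b are arbitrary demo choices:

```python
# Robbins-Monro-style illustration of equation 4.12: with eta(t) = 1/(a*t + b)
# (0 < a < 1, satisfying conditions 4.15), the update
#   dw(t) = eta(t) * (x(t) - w(t-1))
# drives w toward E[x]. Distribution and constants are arbitrary demo values.
import random

random.seed(0)
a, b = 0.5, 10.0
w = 0.0
for t in range(1, 20001):
    eta = 1.0 / (a * t + b)
    w += eta * (random.gauss(2.0, 1.0) - w)   # noisy observation with mean 2
```

The same schedule applied to the full vector w of weighted means yields the stochastic-approximation view of the on-line EM algorithm.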
The corresponding discount factor is given by

λ(t) → 1 − (1 − a)/(at + (b − a))  (t → ∞);    (4.17)

λ(t) is increased such that it approaches 1 as t → ∞. For the convergence proof of the stochastic approximation (see equation 4.12), the boundedness of the noise variance is necessary (Kushner & Yin, 1997). The noise variance is given by

E[f(x, y, w)²]_ρ = E[F(x, y, H(w))²]_ρ − E[F(x, y, H(w))]_ρ² .    (4.18)
Masa-aki Sato and Shin Ishii
Both terms on the right-hand side of equation 4.18 are finite if we assume that the data distribution density ρ(x, y) has a compact support. Consequently, the noise variance is finite under this assumption. The assumption is not very restrictive, because actual data always lie in a finite domain. It can be weakened to require only that ρ(x, y) decreases exponentially as |x| or |y| goes to infinity. It should be noted that the stochastic approximation (see equation 4.12) for finding the maximum likelihood estimator is not a stochastic gradient ascent algorithm for the log-likelihood function (see equation 3.9). The on-line EM algorithm (see equation 4.12) is faster than stochastic gradient ascent, because the M-step is solved exactly. So far we have used the on-line EM algorithm defined by equations 4.2 and 4.5, which is equivalent to the basic on-line EM algorithm defined by equations 4.2, 4.5a, 4.5d, 4.6, 4.7, and 4.8b, provided η(t) ≠ 1. Therefore, the basic on-line EM algorithm is also a stochastic approximation.

5 Regularization

5.1 Regularization of Covariance Matrix. We have assumed so far that the covariance matrix of the input data is not singular in each region; that is, the inverse matrix of every Σ_i (i = 1, …, M) exists. In actual applications of the algorithm, however, this assumption often fails. A typical case occurs when the dimension of the input data distribution is smaller than the dimension of the input space. In such a case, Σ_i⁻¹ cannot be calculated, and Λ̃_i diverges exponentially with the time t. There are several methods for dealing with this problem, such as the introduction of Bayes priors, the singular value decomposition, and ridge regression. However, these methods are not satisfactory for our purpose. Here, we propose a simple new method that performs well. Let us first consider a regularization of an (N × N)-dimensional covariance matrix, Σ, for the observed data.
Its eigenvalues and normalized eigenvectors are denoted by ξ_n (≥ 0) and ψ_n (n = 1, …, N), respectively. The set of eigenvectors, {ψ_n | n = 1, …, N}, forms an orthonormal basis. The condition number of the covariance matrix Σ is defined by
u ≡ ξ_min / ξ_max ,
(5.1)
where ξ_min and ξ_max are the minimum and maximum eigenvalues of Σ, respectively, so that 0 ≤ u ≤ 1. If u = 0, the covariance matrix Σ is singular. If u ≈ 1, Σ is regular and the calculation of its inverse is numerically stable. If 0 < u ≪ 1, Σ is regular and its inverse is given by Σ⁻¹ = Σ_{n=1}^N ξ_n⁻¹ ψ_n ψ_n′. It is dominated by ξ_min⁻¹, which may contain numerical noise. As a consequence, the inverse matrix Σ⁻¹ becomes sensitive to numerical noise.
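This sensitivity is easy to see numerically; the 2 × 2 diagonal example and the noise magnitude below are assumptions made for the demonstration:

```python
# When u = xi_min/xi_max is tiny, a small perturbation of the smallest
# eigenvalue changes the dominant term of Sigma^{-1} by a large relative
# amount, while the large-eigenvalue term is essentially unaffected.

xi_max, xi_min = 1.0, 1e-6             # condition number u = 1e-6
noise = 1e-7                            # numerical noise on xi_min

inv_exact = 1.0 / xi_min                # dominant term of Sigma^{-1}
inv_noisy = 1.0 / (xi_min + noise)

rel_err_small = abs(inv_noisy - inv_exact) / inv_exact              # ~ 9%
rel_err_large = abs(1.0 / (xi_max + noise) - 1.0 / xi_max) * xi_max  # ~ 1e-7
```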
In order to deal with this problem, we propose the following regularized covariance matrix Σ_R for the data covariance matrix Σ:

Σ_R = Σ + α Δ² I_N    (0 < α < 1)    (5.2a)
Δ² = Tr(Σ)/N,    (5.2b)

where I_N is the (N × N)-dimensional identity matrix. The data variance Δ² satisfies the inequality

ξ_max ≥ Δ² = (1/N) Σ_{n=1}^N ξ_n ≥ ξ_max / N.
(5.3)
The eigenvalues and eigenvectors of the regularized matrix Σ_R are given by (ξ_n + αΔ²) and ψ_n, respectively. From inequality 5.3, it can be proved that the condition number of the regularized matrix Σ_R satisfies the condition

u_R ≥ α / (N(1 + α)).
(5.4)
Then the condition number of Σ_R is bounded below and controlled by the small constant α. The rate of change of a regularized eigenvalue is defined by R(ξ_n) ≡ ((ξ_n + αΔ²) − ξ_n)/ξ_n = α(Δ²/ξ_n). From inequality 5.3, this ratio satisfies R(ξ_n) ≥ R(ξ_max) ≥ α/N and R(ξ_n) ≤ R(ξ_min) ≤ α/u. If all the eigenvalues of Σ are of the same order (i.e., u ≈ O(1)), the rate of change is small (i.e., R(ξ_n) ≈ O(α)), so that the effect of the regularization is negligible. If the data covariance matrix Σ is singular (i.e., u = 0), the zero eigenvalue of Σ is replaced by αΔ². Thus, the regularized matrix Σ_R becomes regular, and its condition number is bounded by equation 5.4. In general, the eigenvalues that are larger than their average are only slightly affected, while the very small eigenvalues are changed so as to be nearly equal to αΔ². If Σ = 0, Δ² becomes zero and the regularization (see equation 5.2) does not work. Therefore, if Δ² is smaller than a threshold value Δ²_min, Δ² is set to Δ²_min. This prevents Σ_R from being singular even when Σ = 0. By introducing a Bayes prior for a regularized matrix Σ_B as P(Σ_B) = (k/2)^N exp(−(k/2) Tr(Σ_B⁻¹)), one can get a similar regularization equation,

Σ_B = Σ + k I_N ,
(5.5)
where the regularization parameter k is a given constant. However, it is rather difficult to determine the value of k without knowledge of the data covariance matrix. It is especially difficult for the NGnet, since there are M independent local covariance matrices Σ_i (i = 1, …, M). The proposed method (see equation 5.2) automatically adjusts this constant by using the data variance Δ², which is easily calculated in an on-line manner.
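A minimal sketch of the proposed regularization and its condition-number bound, inequality 5.4; the 2 × 2 rank-one covariance is an arbitrary singular example:

```python
# Sketch of equation 5.2: Sigma_R = Sigma + alpha * Delta^2 * I_N with
# Delta^2 = Tr(Sigma)/N, checked against the bound of inequality 5.4,
#   u_R >= alpha / (N * (1 + alpha)).

def regularize(sigma, alpha):
    n = len(sigma)
    delta2 = sum(sigma[i][i] for i in range(n)) / n    # Delta^2 = Tr(Sigma)/N
    return [[sigma[i][j] + (alpha * delta2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

alpha, n = 0.1, 2
sigma = [[1.0, 1.0], [1.0, 1.0]]     # eigenvalues 2 and 0: singular, u = 0
sigma_r = regularize(sigma, alpha)

# Eigenvalues of Sigma_R are 2 + alpha*Delta^2 and alpha*Delta^2, with
# Delta^2 = 1 here, so the regularized condition number is:
u_r = (alpha * 1.0) / (2.0 + alpha * 1.0)
bound = alpha / (n * (1.0 + alpha))
```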
5.2 Regularization of the On-line EM Algorithm. Based on the above consideration, the ith weighted covariance matrix is redefined in our on-line EM algorithm as
Σ_i⁻¹(t) = [ ( ⟨⟨x x′⟩⟩_i(t) − μ_i(t) μ_i′(t) ⟨⟨1⟩⟩_i(t) + α ⟨⟨Δ_i²⟩⟩_i(t) I_N ) / ⟨⟨1⟩⟩_i(t) ]⁻¹ ,    (5.6)

where

⟨⟨Δ_i²⟩⟩_i(t) ≡ ⟨⟨|x − μ_i(t)|²⟩⟩_i(t)/N = ( ⟨⟨|x|²⟩⟩_i(t) − |μ_i(t)|² ⟨⟨1⟩⟩_i(t) ) / N.
(5.7)
The regularized Σ_i⁻¹ (see equation 5.6) can be obtained from relation 4.7 by using the following regularized Λ̃_i:

Λ̃_i(t) = ( ⟨⟨x̃ x̃′⟩⟩_i(t) + α ⟨⟨Δ_i²⟩⟩_i(t) Ĩ_N )⁻¹ .
(5.8)
The ((N + 1) × (N + 1))-dimensional matrix Ĩ_N is defined by

Ĩ_N ≡ [ I_N  0 ; 0  0 ] = Σ_{n=1}^N ẽ_n ẽ_n′ ,

where ẽ_n is an (N + 1)-dimensional unit vector; its nth element is equal to 1, and the other elements are equal to 0. From definition 5.7, ⟨⟨Δ_i²⟩⟩_i(t) can be written as

⟨⟨Δ_i²⟩⟩_i(t) = ⟨⟨Δ_i²⟩⟩_i(t − 1) + η(t) ( Δ_i²(t) P_i(t) − ⟨⟨Δ_i²⟩⟩_i(t − 1) ),
(5.9)
where Δ_i²(t) is given by

η(t) P_i(t) Δ_i²(t) = η(t) P_i(t) |x(t) − μ_i(t)|²/N + (1 − η(t)) |μ_i(t) − μ_i(t − 1)|² ⟨⟨1⟩⟩_i(t − 1)/N.
(5.10)
The second term on the right-hand side of equation 5.10 comes from the difference between ⟨⟨|x − μ_i(t)|²⟩⟩_i(t − 1) and ⟨⟨|x − μ_i(t − 1)|²⟩⟩_i(t − 1). Using equation 5.9, equation 5.8 can be rewritten as

Λ̃_i(t) = [ (1 − η(t)) Λ̃_i⁻¹(t − 1) + η(t) ( x̃(t) x̃′(t) + Σ_{n=1}^N ν̃_n(t) ν̃_n′(t) ) P_i(t) ]⁻¹    (5.11a)
ν̃_n(t) ≡ √α Δ_i(t) ẽ_n .    (5.11b)
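Each ν̃_n ν̃_n′ term enters equation 5.11a as a rank-one addition inside the inverse, so it can be folded into Λ̃_i by the Sherman–Morrison identity, which is the form that equation 5.12 takes. A minimal 2 × 2 check in plain Python, with all values illustrative:

```python
# Sherman-Morrison: if A = L^{-1} + c * v v', then
#   A^{-1} = L - c * (L v)(L v)' / (1 + c * v' L v),
# which is the shape of the update in equation 5.12 with c = eta(t) * P_i(t).

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def sm_update(L, v, c):
    """Return the inverse of (L^{-1} + c * v v') via a rank-one correction."""
    Lv = matvec(L, v)
    denom = 1.0 + c * sum(v[i] * Lv[i] for i in range(len(v)))
    return [[L[i][j] - c * Lv[i] * Lv[j] / denom
             for j in range(len(L))] for i in range(len(L))]

L = [[2.0, 0.0], [0.0, 0.5]]   # current Lambda-tilde (an inverse matrix)
v = [1.0, 1.0]                 # one virtual datum nu
c = 0.3                        # plays the role of eta(t) * P_i(t)

L_new = sm_update(L, v, c)

# Direct verification: (L^{-1} + c v v') * L_new should be the identity.
A = [[0.5 + c, c], [c, 2.0 + c]]        # L^{-1} = diag(0.5, 2) plus c*vv'
prod = [[sum(A[i][k] * L_new[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)]
```

Applying this correction once per virtual datum avoids ever forming or inverting the full matrix explicitly.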
As a result, the regularized Λ̃_i(t) (see equation 5.8) can be calculated as follows. For a given input datum x(t), Λ̃_i(t) is calculated using the step-wise equation 4.6. After that, Λ̃_i(t) is updated by

Λ̃_i(t) := Λ̃_i(t) − [η(t) P_i(t) Λ̃_i(t) ν̃_n(t) ν̃_n′(t) Λ̃_i(t)] / [1 + η(t) P_i(t) ν̃_n′(t) Λ̃_i(t) ν̃_n(t)],    (5.12)
using the virtual data {ν̃_n(t) | n = 1, …, N}. After the regularized Λ̃_i(t) (see equation 5.8) has been calculated, the linear regression matrix W̃_i is obtained from equation 4.8b, in which the regularized Λ̃_i(t) is used. Only the observed data {(x(t), y(t)) | t = 1, 2, …} are used in the calculation of equation 4.8b; the virtual data {ν̃_n(t) | n = 1, …, N; t = 1, 2, …}, which have been used in the calculation of the regularized Λ̃_i(t), must not be used. Therefore, equation 4.8a does not hold in our regularization method. Since the regularized Λ̃_i(t) is positive definite, the linear regression matrix W̃_i at an equilibrium point of equation 4.8b satisfies the condition

W̃_i E[x̃ x̃′ P(i | x, y, θ)]_ρ = E[y x̃′ P(i | x, y, θ)]_ρ ,    (5.13)
which is identical to the maximum likelihood equation 3.7d with θ = θ̄. Although the matrix E[x̃ x̃′ P(i | x, y, θ)]_ρ may not have an inverse, relation 5.13 still has a meaning. Therefore, our method does not introduce a bias in the estimation of W̃_i.

6 Unit Manipulation
Since the EM algorithm guarantees only local optimality, the obtained estimator depends on its initial value. The initial allocation of the units in the input space is especially important for attaining a good approximation. If the initial allocation is quite different from the input distribution, much time is needed to achieve a proper allocation. In order to overcome this difficulty, we introduce dynamic unit manipulation mechanisms, which are also effective for dealing with dynamic environments. These mechanisms are unit production, unit deletion, and unit division, and they are conducted in an on-line manner after each observation (x(t), y(t)).

Unit production. The probability P(x(t), y(t), i | θ(t − 1)) indicates how likely it is that the ith unit produced the datum (x(t), y(t)) with the parameter θ(t − 1). Let 0 < P_produce ≪ 1/M. When max_{i=1,…,M} P(x(t), y(t), i | θ(t − 1)) < P_produce, the datum is too distant from the present units to be explained by the current stochastic model. In this case, a new unit is produced to account for the new datum. The initial parameters of the new unit are
given by:

μ_{M+1} = x(t)    (6.1a)
Σ_{M+1}⁻¹ = χ_{M+1}⁻² I_N ,   χ_{M+1}² = β₁ min_{i=1,…,M} |x(t) − μ_i|² / N    (6.1b)
σ_{M+1}² = β₂ max_{i=1,…,M} σ_i²    (6.1c)
W̃_{M+1} ≡ (W_{M+1}, b_{M+1}) = (0, y(t)),    (6.1d)
where β₁ and β₂ are appropriate positive constants.

Unit deletion. The weighted mean ⟨⟨1⟩⟩_i(t), which is calculated by equation 4.2, indicates how much the ith unit has been used to account for the data until t. Let 0 < P_delete ≪ 1/M. If ⟨⟨1⟩⟩_i(t) < P_delete, the unit has rarely been used. In this case, the ith unit is deleted.

Unit division. The unit error variance σ_i²(t) (see equation 4.5d) indicates the squared error between the ith unit's prediction and the actual output. Let Δ_divide be a specified positive value. If σ_i²(t) > Δ_divide, the unit's prediction is insufficient, probably because the partition in its charge is too large for a linear approximation. In this case, the ith unit is divided into two units, and the partition in its charge is divided into two. The initial parameters of the two units are given by:

μ_i(new) = μ_i(old) + √(β₃ ξ₁) ψ₁ ,   μ_{M+1}(new) = μ_i(old) − √(β₃ ξ₁) ψ₁    (6.2a)
Σ_i⁻¹(new) = Σ_{M+1}⁻¹(new) = 4 ξ₁⁻¹ ψ₁ ψ₁′ + Σ_{n=2}^N ξ_n⁻¹ ψ_n ψ_n′    (6.2b)
σ_i²(new) = σ_{M+1}²(new) = σ_i²(old)/2    (6.2c)
W̃_i(new) = W̃_{M+1}(new) = W̃_i(old),    (6.2d)
where ξ_n and ψ_n denote the eigenvalues and eigenvectors of the covariance matrix Σ_i(old), respectively, and ξ₁ = ξ_max. β₃ is an appropriate positive constant. Although similar unit manipulation mechanisms have been proposed (Platt, 1991; Schaal & Atkeson, 1998), in our on-line EM algorithm these mechanisms are conducted with a probabilistic interpretation.

7 Experiments

7.1 Function Approximation in a Static Environment. The applicability of our algorithm is investigated using the following function (N = 2 and
Figure 1: Learning processes of the batch EM algorithm (dashed line) and the on-line EM algorithm (solid line). Five hundred training data are supplied once in each epoch. The ordinate denotes nMSE (mean squared error normalized by the variance of the desired outputs). The error is evaluated on a 41 × 41 mesh grid on the input domain.
D = 1), which was previously used by Schaal and Atkeson (1998):

y = max{ exp(−10x₁²), exp(−50x₂²), 1.25 exp(−5(x₁² + x₂²)) }   (−1 ≤ x₁, x₂ ≤ 1).
(7.1)
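A data generator matching function 7.1 can be sketched as follows; the 500-sample size and the N(0, 0.1) output noise follow the experimental setup described below, and the function and parameter names are illustrative:

```python
# Generator for the benchmark data of equation 7.1:
#   y = max(exp(-10 x1^2), exp(-50 x2^2), 1.25 exp(-5 (x1^2 + x2^2)))
# on [-1, 1]^2, with gaussian N(0, 0.1) noise added to the outputs.
import math
import random

def target(x1, x2):
    return max(math.exp(-10.0 * x1 ** 2),
               math.exp(-50.0 * x2 ** 2),
               1.25 * math.exp(-5.0 * (x1 ** 2 + x2 ** 2)))

def make_training_set(n=500, noise_sd=0.1, seed=0):
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        pairs.append(((x1, x2), target(x1, x2) + rng.gauss(0.0, noise_sd)))
    return pairs

data = make_training_set()
```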
We prepared 500 input-output pairs, {(x(t), y(t)) | t = 1, …, 500}, as a training data set by uniformly sampling the input vector, x ≡ (x₁, x₂), from its domain. A fairly large gaussian noise N(0, 0.1) is added to the outputs, where N(0, 0.1) denotes a gaussian distribution whose mean and standard deviation are 0 and 0.1, respectively. The task is for the NGnet to approximate function 7.1 from the 500 noisy data. We compared the batch EM algorithm and the on-line EM algorithm. In this experiment, the discount factor λ(t) was scheduled to approach 1 as in equation 4.17. Figure 1 shows the time series of the normalized mean squared error, nMSE, which is normalized by the variance of the desired outputs (see equation 7.1). The figure shows that both the batch EM algorithm and the on-line EM algorithm are able to approximate the target function in a small number of epochs. Although so-called overlearning can be seen in both learning algorithms, it is more noticeable in the batch EM algorithm than in the on-line EM algorithm. The number of units used in both algorithms was 50. The data distribution does not change over time in this task. In this case, it might be thought that batch learning is more suitable than on-line learning, because batch learning can process the whole data set at once. However, our on-line EM algorithm has an ability similar to the batch algorithm. In addition, the on-line EM algorithm achieves
a faster and more accurate approximation for this task than the RFWR (receptive field weighted regression) model proposed by Schaal and Atkeson (1998).

7.2 Function Approximation in Dynamic Environments. The on-line EM algorithm is effective for function approximation in a dynamic environment, where the input-output distribution changes over time. For an experiment, the distribution of the input variable x₁ in equation 7.1 was continuously changed over 500 epochs from the uniform distribution on the interval [−1, −0.2] to that on the interval [0.2, 1]. At each epoch, the input variables were generated in the current domain. In applications of the on-line EM algorithm to such dynamic environments, the learning behavior changes according to the scheduling of the discount factor λ(t). When the discount factor is relatively small, the model tends to forget the past learning result and to adapt quickly to the current input-output distribution. We show two typical behavioral patterns in the learning phase. In the first case, λ(t) is initially set at a relatively small value and is slowly increased; that is, the effective learning coefficient η(t) is relatively large throughout the experiment. The NGnet adapts to the input distribution change by relocating the units' centers. During the course of this relocation, the units' centers move to the new region, and consequently the past approximation in the old region is forgotten. Figures 2b and 2c show the centers and covariances of all the units at the 50th epoch and the 500th epoch, respectively. The NGnet adapts to the input distribution change by a drastic relocation of the units' centers. Figure 2a shows that the nMSE on the current input region does not grow large at any point during the task. In this experiment, the number of units does not change. In the second case, λ(t) is rapidly increased; that is, η(t) rapidly approaches zero.
The NGnet adapts to the input distribution change without forgetting the past approximation result. Figure 3a shows this learning process. Figure 3b shows the centers and covariances of all the units at the end of the learning task. Since the unit relocation is slow, the NGnet adapts to the input distribution change mainly by the unit production mechanism. Consequently, units remain in the region where no more input data appear, and the function approximation over the whole input domain is accurately maintained.

7.3 Singular Input Distribution. In order to evaluate our regularization method, we prepared a function approximation problem in which the input variables are linearly dependent and there is an irrelevant variable. In such a case, the basic on-line EM algorithm without the regularization method obtains a poor result, because the input distribution becomes singular. In this experiment, the output y is given by the same function as equation 7.1, while the input variables are given by x ≡ (x₁, x₂, x₃, x₄, x₅) = (x₁, x₂, (x₁ + x₂)/2, (x₁ − x₂)/2, 0.1). When the input data are generated, a gaussian noise
Figure 2: (a) Learning process of the on-line EM algorithm that quickly adapts to a dynamic environment while forgetting the past. The ordinate denotes the nMSE evaluated on the current input domain at each epoch. (b, c) Dynamic unit relocation for the dynamic environment. Each ellipse denotes the receptive field of a unit; the ellipse corresponds to the covariance (a two-dimensional display of the standard deviation). The ellipse center is set at the unit center, μ_i. Dots denote the 500 inputs in the current epoch. (b) At the 50th epoch. (c) At the 500th epoch.
N(0, 0.05) is added to x₃, x₄, and x₅. For a training set, we prepared 500 data, where x₁ and x₂ were uniformly sampled from their domain, and a gaussian noise N(0, 0.05) was added to the output y. For evaluation, a test
Figure 3: (a) Learning process of the on-line EM algorithm that slowly adapts to a dynamic environment without forgetting the past. Solid line: nMSE on the whole input domain. Dashed line: (nMSE × 10) on the current input domain at each epoch. Dash-dot line: number of units, normalized by its initial value of 30 for convenience. (b) Receptive fields after learning.
set was prepared by setting x₁ and x₂ on the 41 × 41 mesh grid; the other variables were generated according to the above prescription. The same training set was repeatedly supplied during the learning. In Figure 4a,
Figure 4: (a) Learning processes when the output is given by equation 7.1 for the input x ≡ (x₁, x₂, x₃, x₄, x₅) = (x₁, x₂, (x₁ + x₂)/2 + N(0, 0.05), (x₁ − x₂)/2 + N(0, 0.05), N(0.1, 0.05)). Learning curves for α = 0.23 (solid line), α = 0.1 (dashed), and α = 0 (dash-dot). (b) Learning curves for α = 0.023 (solid) and α = 0 (dashed). The output is given by equation 7.2. (c) Receptive fields at the 50th epoch, each of which is defined by an ellipse with axes given by the two principal components of the local covariance matrix Σ_i. The ellipse center is set at the unit center, μ_i. This figure shows a 3D view of a hemisphere.
the solid, dashed, and dash-dot lines are the learning curves for α = 0.23, α = 0.1, and α = 0, respectively. If no regularization method is employed (α = 0), the error becomes large after the early learning stage. Comparing
the cases of α = 0.23 and α = 0.1, the regularization effect seems fairly robust with respect to the value of the parameter α. Let us consider another situation, where the input data are distributed on a curved manifold. Each input datum in the three-dimensional space, (x₁, x₂, x₃), is restricted to the unit sphere, that is, (x₁, x₂, x₃) = (cos θ₁ cos θ₂, cos θ₁ sin θ₂, sin θ₁). A function defined on this unit sphere is given by

y = cos(θ₁) cos(θ₂/2) max{ exp(−10(2θ₁/π)²), exp(−50(θ₂/π)²), 1.25 exp(−5((2θ₁/π)² + (θ₂/π)²)) },    (7.2)

where the range of the spherical coordinates is given by −π/2 ≤ θ₁ ≤ π/2 and −π ≤ θ₂ < π. The output y does not include noise. Function 7.2 is similar to function 7.1, but it is changed so as to be consistent with the spherical coordinates. We prepared 2000 data points distributed uniformly on the sphere. The covariance matrix of the input data is not singular. However, the covariance matrix of each local unit is nearly degenerate, so that the calculation of the inverse covariance matrix and the linear regression matrix may include noise without the help of the regularization method. In Figure 4b, the solid and dashed lines are the learning curves for α = 0.023 and α = 0, respectively. If no regularization method is employed (α = 0), the error does not become small. Our regularization method (α = 0.023) provides a fairly good result compared with the basic method without regularization (α = 0). The condition numbers defined by equation 5.1 with and without the regularization are 0.0191 ± 0.0042 and 0.0071 ± 0.0038, respectively. The local covariance matrix Σ_i of each unit represents the local input data distribution fairly well in our regularized method. It has two principal components that span the plane tangential to the unit sphere at each local unit position. The third eigenvalue is very small and is bounded below by the regularization term. In Figure 4c, the receptive fields of the local units are shown.
The receptive fields appropriately cover the unit sphere.

7.4 Robot Dynamics. Finally, we examine realistic and difficult problems. The problems are taken from DELVE (Rasmussen et al., 1996), which is a new standard collection of neural network benchmark problems. The Pumadyn data sets are a family of data sets synthetically generated from a realistic simulation of the dynamics of a Puma 560 robot arm (Ghahramani, 1996; Corke, 1996). The Puma robot has six revolving joints. The state vector is defined by {θ_i, θ̇_i | i = 1, …, 6}, where θ_i and θ̇_i represent the angle and the angular velocity of the ith joint, respectively. The torque applied to the ith joint is denoted by τ_i. The task of the Pumadyn-8 family is to predict the third joint acceleration θ̈₃ given the eight-dimensional input vector x ≡ {θ₁, θ̇₁, θ₂, θ̇₂, θ₃, θ̇₃, τ₁, τ₂}. The other variables are fixed to zero. The input data are randomly generated by sampling the input variables uniformly from the range (|θ|/π, |θ̇|/π, |τ| ≤ 0.6). The data sets are characterized by
nonlinearity, noise level, and size of the training data. We examined six training data sets: nm64, nm256, and nm1024, which are nonlinear, medium-noise data sets consisting of 64, 256, and 1024 training data, and nh64, nh256, and nh1024, which are nonlinear, high-noise data sets consisting of 64, 256, and 1024 training data. Gaussian noise with standard deviation 0.12 or 0.4 was added to both the input and the output variables for the medium- and high-noise data sets, respectively. The test data set size was 1024 when the training data set size was 1024, and 512 in the other cases. We compared our on-line EM (OEM) algorithm with the mixtures-of-experts (ME) methods registered in DELVE (Waterhouse, 1997) and the Moody-Darken (MD) method (Moody & Darken, 1989). There are nine competitor methods: the on-line EM algorithm (OEM), the on-line EM algorithm with early stopping (OEM-ese) (Neal, 1998), mixtures-of-experts trained by ensemble learning (ME-el), hierarchical mixtures-of-experts trained by ensemble learning (HME-el), mixtures-of-experts trained by committee early stopping (ME-ese), hierarchical mixtures-of-experts trained by committee early stopping (HME-ese), hierarchical mixtures-of-experts grown via committee early stopping (HME-grow), the Moody-Darken (MD) method, and the modified Moody-Darken (MD-l) method. The committee early stopping method (Waterhouse, 1997) uses a committee of (hierarchical) ME models. The training data are randomly divided into a training set and a validation set, with two-thirds and one-third of the total data, respectively. The training is done for each member of the committee using different training and validation sets and is terminated at the minimum of the validation error. The prediction is given by the average of the outputs of all committee members. The number of committee members depends on the training time. A minimum of 3 members and a maximum of 50 can be created.
If the training time exceeds a given threshold, which is proportional to the amount of data, no more members are created. The HME-grow method constructs a hierarchical ME from the data, learning both the structure and the parameters of the model. In the growing process, one of the experts is split into two experts and a gate. The committee early stopping method is also used to control the size of the hierarchical trees. Ensemble learning estimates the posterior probability distribution of the model parameters rather than estimating the optimal parameter value. Gaussian prior distributions on the gate and expert parameters are assumed, which introduces hyperparameters. The optimal hyperparameters are estimated by an alternating minimization procedure that combines the standard EM algorithm for parameter estimation with reestimation of the hyperparameters. The algorithm is computationally highly demanding. The training of the OEM method is terminated at a prespecified number of epochs, which is 5% of the data size. OEM-ese uses a simpler version of committee early stopping proposed by Neal (1998). The training data
Table 1: Computation Time Comparison for Nine Methods.

Method      ME-el   HME-el  OEM-ese  OEM  ME-ese  HME-ese  HME-grow  MD-l  MD
Time (sec)  16,515  23,580  395      40   2129    2171     2059      16    5

Notes: The data set used was nm1024, with 1024 training data. The computer used was a Silicon Graphics Octane (CPU: R10000 175 MHz × 2).
are partitioned into four equal portions. The committee consists of four networks. Each network uses three of the four portions as a training set and the remaining portion as a validation set. The MD method uses two-stage learning. It first determines the unit centers according to the input distribution by using the k-means clustering algorithm. Subsequently, the bias coefficients b_i are determined by the least mean squared error method while the unit centers are kept fixed. The NGnet used in the original MD model has isotropic covariance matrices and no linear coefficients, that is, W_i ≡ 0. The MD-l method, in contrast, uses the NGnet with linear coefficients. The main differences among the OEM, ME, and MD methods are their criterion functions. The OEM method maximizes the log-likelihood of the joint probability distribution, Σ_{t=1}^T log P(x(t), y(t) | θ). Therefore, the units are allocated according to the input-output data distribution. The ME method maximizes the log-likelihood of the conditional probability, Σ_{t=1}^T log P(y(t) | x(t), θ). Therefore, the specialization process of each expert is driven by the output performance and does not take account of the input data distribution. The first-stage learning of the MD method can be considered as maximization of the log-likelihood of the input probability, Σ_{t=1}^T log P(x(t) | θ). The second-stage learning minimizes the mean squared error of the network output. Therefore, the optimality of the total algorithm is not guaranteed. The simulation results are summarized in Figure 5. Table 1 shows the computation time needed for training by these algorithms. The results depend very much on the training set size. The most striking differences can be seen for the nm256 data set. The best performance is achieved by (H)ME-el, followed by OEM(-ese), (H)ME-ese, HME-grow, and MD(-l), in that order.
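The two-stage Moody-Darken fitting described above can be sketched as follows; the 1-D toy data, the unit width, and the two-unit configuration are assumptions made for illustration:

```python
# Sketch of two-stage Moody-Darken learning: (1) place unit centers by
# k-means on the inputs; (2) with centers fixed, fit bias coefficients b_i
# of y(x) = sum_i b_i G_i(x) / sum_j G_j(x) by least squares.
import math

xs = [-1.0, -0.9, -1.1, 1.0, 0.9, 1.1]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

# Stage 1: k-means with two centers.
centers = [xs[0], xs[3]]
for _ in range(10):
    groups = [[], []]
    for x in xs:
        groups[0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1].append(x)
    centers = [sum(g) / len(g) for g in groups]

# Stage 2: normalized gaussian activations, then 2x2 normal equations.
def phi(x, width=0.5):
    g = [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]
    s = sum(g)
    return [gi / s for gi in g]

P = [phi(x) for x in xs]
A = [[sum(p[i] * p[j] for p in P) for j in range(2)] for i in range(2)]
r = [sum(p[i] * y for p, y in zip(P, ys)) for i in range(2)]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
bias = [(A[1][1] * r[0] - A[0][1] * r[1]) / det,
        (A[0][0] * r[1] - A[0][1] * r[0]) / det]
```

Because the centers are chosen without reference to the outputs, the overall fit is not optimal for the joint distribution, which is the point made above about the MD criterion function.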
The performances of the ME family and the OEM family are almost the same for the nm64 and nh64 data sets, while those of the MD family are slightly worse. For the nm1024 and nh1024 data sets, (H)ME-el and OEM(-ese) show comparable performances. The performances of (H)ME-ese and MD-l are slightly worse. HME-grow and MD each show poor performance. Some comments can be made regarding these results. The number of data is very small compared with the dimensionality of the input space; the noisy data are sparsely distributed in a high-dimensional space. Therefore, the
Figure 5: Learning results for the Pumadyn data sets: nm64, nm256, nm1024, nh64, nh256, and nh1024. Bars represent the expected nMSE, calculated as an average of the nMSE over different test data sets, for the nine methods shown at the bottom of the figure.
reliable estimation of the optimal parameters is very difficult. This situation favors ensemble learning. The disadvantage of ensemble learning is that it requires a huge amount of time, as shown in Table 1. The OEM method achieves a learning ability comparable to the (H)ME-el method within a significantly smaller amount of computation time for the 1024 data sets. The original MD method showed poor performance despite being very fast. The modified MD method with linear coefficients showed better performance than the original MD method; however, its performance is worse than that of the OEM methods. The performance of the MD methods depended on the number of units. In Figure 5, the results of the best MD models are
Table 2: Learning Results for Robot Trajectory Data Sets.

Method      (H)ME-el  OEM   ME-ese  HME-ese  HME-grow  MD-l  MD
nMSE (%)    —         1.8   2.3     2.8      22.2      3.9   4.4
Time (sec)  —         1085  84,626  63,610   36,780    220   130

Notes: Computation time and nMSE for six methods are shown. Ensemble learning, (H)ME-el, did not end within a week.
shown. For example, the best numbers of units for MD and MD-l are 140 and 33, respectively, for the nm1024 data set. The OEM method automatically selects the number of units. The final numbers of units for the nm1024, nm256, and nm64 data sets are 6, 5, and 1, respectively, although the initial number of units is set to 2% of the data size. We also considered a more realistic situation. For real-time robot control, the training data are generated along the robot trajectory. Therefore, we generated the training data by freely operating the Puma robot from the upright position. Due to inertia, the robot exhibited chaotic oscillations. The task was to predict the third joint acceleration θ̈₃ given the 10-dimensional input vector x ≡ {θ_i, θ̇_i | i = 1, …, 5}. Since θ₆ and θ̇₆ were almost zero throughout the trajectory, we omitted these variables, and the applied torque was zero. The size of the training data was 7000 (the ME family could not operate on more than 7000 data in our computer). The 7000 test data were generated in the neighborhood of the trajectory. The results are summarized in Table 2. The OEM method achieves a better performance than the ME family with a significantly smaller amount of time. The performance of the MD method is worse than that of the (H)ME-ese method. The computation times of the OEM, MD, and MD-l methods scale as (N + 1)²M, M³, and ((N + 1)M)³, respectively, where M is the number of units and N is the input dimension. The best performance of MD (MD-l) was obtained with 420 (50) units, while the final number of units in OEM became 230 after four epochs. When the number of units in MD (MD-l) was larger than 1000 (100), the MD(-l) method became slower than the OEM method and its performance was degraded. In summary, we confirmed that the on-line EM algorithm achieves good performance in a short amount of time even if the number of data becomes large.

8 Conclusion
In this article, we proposed a new on-line EM algorithm for the NGnet. We showed that the on-line EM algorithm is equivalent to the batch EM algorithm if a specific scheduling of the discount factor is employed. In addition, we showed that the on-line EM algorithm can be considered a stochastic approximation method for finding the maximum likelihood estimator.
On-line EM Algorithm for the Normalized Gaussian Network
A new regularization method was proposed in order to deal with singular input distributions. In order to cope with dynamic environments, unit manipulation mechanisms such as unit production, unit deletion, and unit division were also introduced, based on the probabilistic interpretation. Experimental results showed that our approach is suitable for dynamic environments in which the input-output distribution of the data changes over time. We also applied our on-line EM algorithm to robot dynamics problems. Our algorithm achieved learning ability comparable to the ME methods for a small number of data and better performance than the ME methods for a large number of data, within a significantly smaller amount of computation time. Our method can also be applied to reinforcement learning (Sato & Ishii, 1999) and the reconstruction of nonlinear dynamics (Ishii & Sato, 1998).
Acknowledgments
We thank the referees and the editor for their insightful comments, which improved the quality of this article.
References

Corke, P. I. (1996). A robotics toolbox for MATLAB. IEEE Robotics and Automation Magazine, 3, 24–32.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–22.
Ghahramani, Z. (1996). The Pumadyn dataset. Available online at: http://www.cs.toronto.edu/~delve/data/pumadyn/pumadyn.ps.gz.
Ishii, S., & Sato, M. (1998). On-line EM algorithm and reconstruction of chaotic dynamics. In T. Constantinides, S.-Y. Kung, M. Niranjan, & E. Wilson (Eds.), Neural networks for signal processing VIII (pp. 360–369). New York: IEEE.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Jürgen, S. (1996). Pattern classification: A unified view of statistical and neural approaches. New York: Wiley.
Kushner, H. J., & Yin, G. G. (1997). Stochastic approximation algorithms and applications. New York: Springer-Verlag.
Moody, J., & Darken, C. J. (1989). Fast learning in networks of locally-tuned processing units. Neural Computation, 1, 281–294.
Neal, R. M. (1998). Assessing relevance determination methods using DELVE. In C. M. Bishop (Ed.), Generalization in neural networks and machine learning (pp. 97–129). Berlin: Springer-Verlag.
Masa-aki Sato and Shin Ishii
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models (pp. 355–368). Norwell, MA: Kluwer Academic Press.
Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures (Tech. Rep. No. CS-91-126). Pittsburgh: Carnegie Mellon University.
Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213–225.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1496.
Rasmussen, C. E., Neal, R. M., Hinton, G. E., van Camp, D., Revow, M., Ghahramani, Z., Kustra, R., & Tibshirani, R. (1996). The DELVE manual. Available online at: ftp://ftp.cs.utoronto.ca/pub/neuron/delve/doc/manual.ps.gz.
Sato, M., & Ishii, S. (1999). Reinforcement learning based on on-line EM algorithm. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 1052–1058). Cambridge, MA: MIT Press.
Schaal, S., & Atkeson, C. G. (1998). Constructive incremental learning from only local information. Neural Computation, 10, 2047–2084.
Waterhouse, S. R. (1997). Classification and regression using mixtures of experts. Unpublished doctoral dissertation, Cambridge University.
Xu, L., Jordan, M. I., & Hinton, G. E. (1995). An alternative model for mixtures of experts. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 633–640). Cambridge, MA: MIT Press.

Received March 23, 1998; accepted February 17, 1999.
LETTER
Communicated by Jean-Pierre Nadal
A General Probability Estimation Approach for Neural Computation

Maxim Khaikine
Klaus Holthausen
Friedrich-Schiller-Universität Jena, Ernst-Haeckel-Haus, D-07745 Jena, Germany

We describe an analytical framework for neural systems that adapt their internal structure on the basis of subjective probabilities constructed by computation over randomly received input signals. A principled approach is provided whose key property is that it defines a probability density model, which allows the convergence of the adaptation process to be studied. In particular, the derived algorithm can be applied to approximation problems such as the estimation of probability densities or the recognition of regression functions. These approximation algorithms can be easily extended to higher-dimensional cases. Certain neural network models can be derived from our approach (e.g., topological feature maps and associative networks).

1 Introduction
The concepts of probability density estimation are relevant to many aspects of neural computation (Bishop, 1995). There are two general methods for dealing with this problem in the framework of statistical pattern recognition: nonparametric and parametric probability density estimation. Both methods are presented and discussed in detail in Bishop (1995). Nonparametric estimation (e.g., simple histogram methods) does not assume a particular functional form for the density function. A disadvantage of this method is that it requires a large amount of processing, because all data points used in training must be retained. Moreover, it is not known a priori how many data have to be collected in order to provide a good representation. The parametric approach, by contrast, assumes a specific form for the density function, which might be very different from the actual density. The chosen functional form contains a number of adjustable parameters that are optimized by fitting the model to the data set. For the purposes of neural network theory, an important advantage of this parametric approach is that the parameters can be adjusted locally. As local learning rules of this type, we will consider algorithms in which a hypothesis about the real distribution is changed each time a stimulus is received, without storing and processing of previous stimuli. These local recognition algorithms give quite

Neural Computation 12, 433–450 (2000)
© 2000 Massachusetts Institute of Technology
good results if the real probability density is not too different from some element of the family of functions chosen for the approximation. Generally, the quality of the approximation depends on the smoothness of the probability density function and the degrees of freedom inherent in the applied functional form. In this article, we develop a general approach for finding a comprehensive family of recognition algorithms in which the class of basis functions is not limited and includes all smooth functions. The focus of our work is not a particular application in neural network theory, but a principled approach based on sound theoretical concepts. It allows an ideal estimation of an arbitrary probability distribution. We will show that on the basis of these learning rules, one can solve more complicated problems (e.g., an approach to the estimation of regression functions for stimulus-response pairs with arbitrary distributions). We then discuss the general framework of our approximation theory and certain applications in neural network theory. Our findings support recent work on approximation by feedforward neural networks (Scarselli & Tsoi, 1998), since they allow the focus of investigation to be turned toward the approximation capabilities of single neurons. Furthermore, earlier general approaches that combined neural computation and approximation theory (Girosi & Poggio, 1990; Kurkova, 1992) could be revised in order to gain insight into novel architectures and optimization principles in neural computation.

2 Local Algorithms for the Approximation of Probability Densities
We consider a discrete process in which a stochastic variable x adopts one of N possible states $\{x_1, x_2, \dots, x_N\}$. This process can be described by the corresponding probabilities $\vec{P} = (p_1, p_2, \dots, p_N)$, which we will call the objective probability distribution. A priori we have no information about the objective distribution, but we can make a hypothesis $\vec{Q} = (q_1, q_2, \dots, q_N)$ about it. Every stimulus $x_k$ gives rise to a change of the hypothesis $\vec{Q}$. The corresponding learning rule can be represented by a matrix $L$ whose elements depend on the actual vector $\vec{Q}$. Thus, the adaptation of the hypothesis $\vec{Q}^t$ after a stimulus $x_k$ was received at time t can be written with the learning rule $L(\vec{Q})$ and learning rate $\varepsilon$:

$$q_n^{t+1} = q_n^t + \varepsilon\, L_{n,k}(\vec{Q}^t) \qquad \forall n.$$

We can calculate the expected (i.e., average) value of $q_n^{t+1}$ with respect to the objective probabilities $p_k$ of the stimuli $x_k$:

$$\langle q_n^{t+1} \rangle = q_n^t + \varepsilon \sum_{k=1}^{N} L_{n,k}(\vec{Q}^t)\, p_k. \tag{2.1}$$
If all elements of the matrix $L_{n,k}(\vec{Q})$ are continuous functions and the coefficient $\varepsilon$ is sufficiently small, we can pass to a description that is continuous in time. Thus, the statistical dynamics of the subjective probability vector $\vec{Q}$ is given by a set of coupled differential equations:

$$\frac{dq_n(t)}{dt} = \sum_{k=1}^{N} L_{n,k}(\vec{Q})\, p_k. \tag{2.2}$$
Which properties of the learning matrix $L(\vec{Q})$ can be derived? First, let us note that the sum of all time gradients of the probabilities $q_n$ in our hypothesis must always be null. This means:

$$\sum_{n=1}^{N} \frac{dq_n(t)}{dt} = \sum_{n=1}^{N} \sum_{k=1}^{N} L_{n,k}(\vec{Q})\, p_k = 0 \qquad \forall \vec{P}.$$

Hence, the first property of $L_{n,k}(\vec{Q})$ follows:

$$\sum_{n=1}^{N} L_{n,k}(\vec{Q}) = 0 \qquad \forall \vec{Q}, k. \tag{2.3}$$

Second, it is a plausible assumption that a correct hypothesis $\vec{Q} = \vec{P}$ represents at least a fixed point in the phase space and is not statistically changed with time (on convergence toward this point, see below). Together with equation 2.2, we obtain the second property of $L_{n,k}(\vec{Q})$:

$$\sum_{k=1}^{N} L_{n,k}(\vec{Q})\, q_k = 0 \qquad \forall n, \vec{Q}. \tag{2.4}$$
Equations 2.3 and 2.4 are obviously not sufficient to define learning rules. Nevertheless, we give an example of a possible learning matrix after simplifying the structure of $L_{n,k}(\vec{Q})$.

2.1 Pfaffelhuber's Learning Rule. Let us assume that all functions $L_{n,k}(\vec{Q})$ in the learning matrix are divided into two principally different classes of very simple form:

$$L_{n,k}(\vec{Q}) = \begin{cases} f(q_n) & \text{if } k \neq n \\ g(q_n) & \text{otherwise.} \end{cases}$$

Equations 2.3 and 2.4 can then be rewritten as follows:

$$f(q_n) \sum_{k \neq n} q_k + g(q_n)\, q_n = 0 \qquad \forall n$$
and

$$\sum_{n \neq k} f(q_n) + g(q_k) = 0 \qquad \forall k.$$

This system has only one solution if $N > 2$ and infinitely many otherwise. The general solution is $f(q_n) = -q_n C$ and $g(q_n) = C(1 - q_n)$, where C is an arbitrary constant. The sign of C determines whether the dynamics will be stable. The learning matrix (with the constant C factored out) hence is:

$$L(\vec{Q}) = C \begin{pmatrix} 1-q_1 & -q_1 & \cdots & -q_1 \\ -q_2 & 1-q_2 & \cdots & -q_2 \\ \vdots & \vdots & \ddots & \vdots \\ -q_N & -q_N & \cdots & 1-q_N \end{pmatrix}.$$

This learning rule was first suggested by Pfaffelhuber (1972), who also investigated its convergence. With this matrix, the statistical dynamics of the learning process follows from equation 2.2:

$$\frac{dq_n(t)}{dt} = C(p_n - q_n).$$

Thus, it has been shown that it is possible to formulate certain restrictions on our approximation theory from which a prominent learning rule can be derived. Here we investigate a broader class of approximation algorithms that could be of interest for both statistical pattern recognition and neural network theory.

3 General Mathematical Framework
The basic properties of the learning matrix (see equations 2.3 and 2.4) guarantee that the probabilities $q_k$ that constitute our hypothesis remain normalized and that any correctly recognized distribution $\vec{Q} = \vec{P}$ represents a fixed point of the dynamics 2.2. But the convergence toward this point was not yet investigated. In order to solve this problem, we will apply a framework that is more suitable than the notation of matrix functions.

First, we note that both vectors, the hypothesis $\vec{Q}(t)$ and the real distribution $\vec{P}$, are located on an $(N-1)$-dimensional hyperplane: a manifold $\mathcal{Q}$ in the N-dimensional vector space. The manifold $\mathcal{Q}$ is defined by the normalization condition:

$$\sum_{n=1}^{N} q_n = 1 \qquad \forall \vec{Q} \in \mathcal{Q}.$$
In the context of probability estimation, we will consider only learning processes that can be described by their dynamics on the manifold $\mathcal{Q}$. These
algorithms will be regarded as models of the real learning processes. Our aim is to define a learning rule (or, better, a family of rules) such that every start hypothesis converges statistically to the objective probability distribution $\vec{P}$ during the recognition process.

The learning matrix $L(\vec{Q})$ defines, for every vector $\vec{Q}$, a linear operator on the space $\mathbb{R}^N$ (cf. equation 2.2). For this linear operator, one can find a set of eigenvectors $\{\vec{v}_n \mid n = 1, \dots, N\}$ with corresponding eigenvalues $\{\lambda_n \mid n = 1, \dots, N\}$ so that

$$L(\vec{Q})\, \vec{v}_n = \lambda_n \vec{v}_n.$$

It is clear that if the eigenvectors and eigenvalues are given at every point of the manifold $\mathcal{Q}$, they automatically define the learning rule (i.e., the learning matrix $L(\vec{Q})$) on $\mathcal{Q}$. Furthermore, this method is very flexible and suitable for studying the global properties of the learning process. As a simplification, we consider here only the case where all $\lambda_n$ are real. This means that at every point $\vec{Q}$ of the manifold $\mathcal{Q}$, the operator $L(\vec{Q})$ has only one-dimensional eigenspaces.

Property 2.4 shows that the vector $\vec{Q}$ is an eigenvector of the operator $L(\vec{Q})$ with corresponding eigenvalue $\lambda_Q = 0$:

$$L(\vec{Q})\, \vec{Q} = 0 \cdot \vec{Q} = 0.$$

Furthermore, all other eigenvalues $\lambda_n$, $n \in \{1, \dots, N-1\}$, of the learning matrix $L(\vec{Q})$ must be different from zero, because otherwise (for example, if $\lambda_m = 0$) the current hypothesis $\vec{Q}$ would be stable for every probability distribution $\tilde{P}$ given as a normalized linear combination,

$$\tilde{P} = \frac{a \vec{Q} + b \vec{v}_m}{a + b \sum_k v_{m,k}}.$$

Here, the parameters a and b have to be chosen so as to satisfy the condition $0 \le \tilde{P}_k \le 1$. The eigenvectors $\{\vec{Q}, \vec{v}_n \mid n = 1, \dots, N-1\}$ form a complete vector system (a basis of the vector space) at every point of $\mathcal{Q}$, so that every vector $\vec{x}$ can be expanded as a linear combination:

$$\vec{x} = c_Q \vec{Q} + \sum_{n=1}^{N-1} c_n \vec{v}_n.$$

The application of the operator $L(\vec{Q})$ yields:

$$L(\vec{Q})\, \vec{x} = c_Q L(\vec{Q})\, \vec{Q} + \sum_{n=1}^{N-1} c_n L(\vec{Q})\, \vec{v}_n = \sum_{n=1}^{N-1} \lambda_n c_n \vec{v}_n.$$
The learning matrix $L(\vec{Q})$ projects every vector onto the space $\sum_n x_n = 0$. Therefore all eigenvectors $\vec{v}_n$ associated with nonzero eigenvalues must belong to this space:

$$\sum_{k=1}^{N} v_{n,k} = 0 \qquad \forall n \in \{1, \dots, N-1\}. \tag{3.1}$$
Theorem 1 introduces a broad class of learning rules for which the process 2.2 converges and which allow the study of the global properties of the corresponding matrix $L(\vec{Q})$:

Theorem 1. The hypothesis $\vec{Q}(t)$ always converges toward the objective probability distribution $\vec{P}$ during the learning process (see equation 2.2) if all $\lambda_n(\vec{Q})$ are real and positive and all eigenvectors of the learning matrix $L(\vec{Q})$ are orthogonal at every point $\vec{Q} \in \mathcal{Q}$:

$$\lambda_n(\vec{Q}) > 0, \qquad \vec{v}_n \cdot \vec{v}_m = 0 \qquad \forall n \neq m, \; n, m = 1, \dots, N-1.$$
Remark. The eigenvalues corresponding to Pfaffelhuber's (1972) learning rule (section 2.1) for the case $N = 3$ are given by the roots of the characteristic polynomial $\chi(\lambda) = -\lambda(1-\lambda)^2$. Two orthogonal eigenvectors corresponding to the eigenvalue $\lambda = +1$ are given by $\vec{v}_1 = (0, +1, -1)$ and $\vec{v}_2 = (-2, +1, +1)$.
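The remark, together with the adjoint-basis decomposition $L = V \Lambda V^{*T}$ derived at the end of this section, can be checked numerically. The sketch below is not from the original article: the point $\vec{Q}$ is chosen arbitrarily, and the Pfaffelhuber matrix is taken with $C = 1$.

```python
import numpy as np

q = np.array([0.2, 0.5, 0.3])            # an arbitrary point Q on the manifold
L = np.eye(3) - np.outer(q, np.ones(3))  # Pfaffelhuber matrix for C = 1

# Properties 2.3 and 2.4: the columns of L sum to zero, and L(Q)Q = 0.
assert np.allclose(L.sum(axis=0), 0.0)
assert np.allclose(L @ q, 0.0)

# Eigenvalues are the roots of -lambda*(1 - lambda)^2, i.e. {0, 1, 1};
# v1 and v2 from the remark are eigenvectors for lambda = 1.
v1 = np.array([0.0, 1.0, -1.0])
v2 = np.array([-2.0, 1.0, 1.0])
assert np.allclose(np.sort(np.linalg.eigvals(L).real), [0.0, 1.0, 1.0])
assert np.allclose(L @ v1, v1) and np.allclose(L @ v2, v2)

# Adjoint basis: Q* = (1, ..., 1), v_n* = (v_n - Q*(v_n . q)) / |v_n|^2,
# giving the decomposition L = V Lambda V*^T with lambda_Q = 0 for Q.
q_star = np.ones(3)
v1_star = (v1 - q_star * (v1 @ q)) / (v1 @ v1)
v2_star = (v2 - q_star * (v2 @ q)) / (v2 @ v2)
V = np.column_stack([q, v1, v2])
V_star = np.column_stack([q_star, v1_star, v2_star])
assert np.allclose(L, V @ np.diag([0.0, 1.0, 1.0]) @ V_star.T)
print("all checks passed")
```

Note that $\vec{v}_1$ and $\vec{v}_2$ both sum to zero, as equation 3.1 requires of eigenvectors with nonzero eigenvalues.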
Proof. Let us introduce the distance vector $\vec{D}(t) := \vec{P} - \vec{Q}(t)$ and choose the following linear combination for $\vec{D}(t)$:

$$\vec{D}(t) = c_Q \vec{Q}(t) + \sum_{n=1}^{N-1} c_n \vec{v}_n. \tag{3.2}$$
Because both vectors $\vec{Q}$ and $\vec{P}$ belong to the manifold $\mathcal{Q}$, the coefficient $c_Q$ in the first part of equation 3.2 must always be null, so that the normalization condition $\sum_k D_k = 0$ is always fulfilled. In order to describe the convergence, we consider the dynamics of the distance vector:

$$\frac{d \|\vec{D}(t)\|^2}{dt} = -2 \vec{D}(t) \cdot \frac{d\vec{Q}(t)}{dt} = -2 \vec{D}(t) \cdot L(\vec{Q})\, \vec{P}.$$

Using the property $L(\vec{Q})\, \vec{Q} = 0$, we obtain:

$$\frac{d \|\vec{D}(t)\|^2}{dt} = -2 \vec{D}(t) \cdot L(\vec{Q})\, \vec{D}(t) = -2 \left( \sum_{n=1}^{N-1} c_n \vec{v}_n \right) \cdot \left( \sum_{m=1}^{N-1} \lambda_m c_m \vec{v}_m \right) = -2 \sum_{n=1}^{N-1} \lambda_n c_n^2\, |\vec{v}_n|^2.$$
If all $\lambda_n(\vec{Q})$ are positive at all points $\vec{Q}$, we can find an infimum $\gamma$ so that $\lambda_n(\vec{Q}) \ge \gamma > 0$ for all $\vec{Q}$ and n. Thus we obtain a simple bound on the dynamics of $\vec{D}(t)$:

$$\frac{d \|\vec{D}(t)\|^2}{dt} \le -2\gamma \sum_{n=1}^{N-1} c_n^2\, |\vec{v}_n|^2 = -2\gamma \|\vec{D}(t)\|^2$$
and

$$\|\vec{D}(t)\|^2 \le \|\vec{D}(0)\|^2 \exp(-2\gamma t).$$

Because the parameter $\gamma$ is positive, the quantity $\|\vec{D}(t)\|^2$ converges toward null.

To emphasize the relationship of these results with the learning matrix from the previous section, we indicate briefly how a learning matrix can be derived from the fields of the eigenvectors $\{\vec{v}_n(\vec{Q}) \mid n = 1, \dots, N-1\}$ and the eigenvalues $\{\lambda_n(\vec{Q}) \mid n = 1, \dots, N-1\}$. This follows easily after adjoint vectors $\{\vec{Q}^*, \vec{v}_n^* \mid n = 1, \dots, N-1\}$ are introduced so that:

$$\vec{v}_n \cdot \vec{Q}^* = \vec{v}_n^* \cdot \vec{Q} = 0, \qquad \vec{Q} \cdot \vec{Q}^* = 1, \qquad \vec{v}_n \cdot \vec{v}_k^* = \begin{cases} 1 & \text{if } n = k \\ 0 & \text{otherwise.} \end{cases}$$

In the case of orthogonal eigenvectors $\vec{v}_n$, all adjoint vectors can be easily defined:

$$\vec{v}_n^* = \frac{\vec{v}_n - \vec{Q}^* (\vec{v}_n \cdot \vec{Q})}{|\vec{v}_n|^2}, \qquad \vec{Q}^* = (1, 1, \dots, 1)^T.$$

$L(\vec{Q})$ can thus be written as a product of three matrices:

$$L(\vec{Q}) = V(\vec{Q}) \cdot \Lambda(\vec{Q}) \cdot V^{*T}(\vec{Q}),$$

where $V(\vec{Q})$ is the matrix formed from the components of the eigenvectors, including $\vec{Q}$; $V^{*T}(\vec{Q})$ is the transpose of the matrix composed of the adjoint vectors; and $\Lambda(\vec{Q})$ is a diagonal matrix whose elements are the eigenvalues, with the eigenvalue $\lambda_Q = 0$ corresponding to $\vec{Q}$.

4 Approximation of Continuous Probability Distributions
In this section we generalize our results to the case of one continuous variable. First, we introduce and clarify the corresponding notions for our
description of learning processes. This will allow us to describe the convergence toward a continuous probability density. Finally, after a simplification, we arrive at possible learning rules for the approximation of probability densities.

Without loss of generality, we assume a discrete stochastic process in which a continuous variable x is distributed on the interval $I = [0, 2\pi]$ with a density $P(x)$. At every time t, there is a hypothesis $Q_t(x)$ about the objective distribution $P(x)$. Let us suppose that both functions are smooth on I. We note that in the space of all smooth functions, the hypothesis $Q(x)$ and the real distribution $P(x)$ always lie on a manifold $\mathcal{Q}$ that is defined by the normalization condition:

$$\int_0^{2\pi} Q_t(x)\, dx = 1.$$
The adaptation process will be triggered by random stimuli $y \in I$. We start with an arbitrary initial hypothesis $Q_t(x)$ satisfying the normalization condition. In the t-th step, a stimulus $y_t \in I$ is chosen according to the probability distribution $P(x)$. Thus, the learning process can be described as a discrete adaptation of the hypothesis:

$$Q_{t+1}(x) = Q_t(x) + \varepsilon K(x, y_t).$$
Here, the function $K(x, y)$ corresponds to the learning matrix from the previous section. Because we want the hypothesis Q always to belong to $\mathcal{Q}$, the kernel function $K(x, y)$ has to be smooth. In general, the learning rule $K(x, y)$ depends on the actual hypothesis $Q_t(x)$. To emphasize this, we write $K_Q(x, y)$. After we pass over to the continuous description, with $\varepsilon \to 0$, we obtain the dynamics of $Q(x)$:

$$\frac{dQ(x, t)}{dt} = \int_0^{2\pi} P(y)\, K_Q(x, y)\, dy. \tag{4.1}$$
In order to satisfy the normalization condition at each instant of the adaptation process, the following equation must always be fulfilled:

$$\int_0^{2\pi} K_Q(x, y)\, dx = 0 \qquad \forall y \text{ and } \forall Q(x). \tag{4.2}$$
Moreover, the correct hypothesis $Q(x, t) = P(x)$ must be a fixed point of the dynamics of our adaptation process. Thus, the second condition for the kernel $K_Q(x, y)$ is given by:

$$\int_0^{2\pi} K_Q(x, y)\, Q(y)\, dy = 0 \qquad \forall x \text{ and } \forall Q(x). \tag{4.3}$$
Equations 4.2 and 4.3 can be regarded as a continuous generalization of conditions 2.3 and 2.4 applied in the case of discrete approximations. Thus,
the goal is to find an appropriate integral operator $L(Q)$ with smooth kernel $K_Q(x, y)$ that is defined on the manifold $\mathcal{Q}$, fulfills conditions 4.2 and 4.3, and describes the dynamics of the hypothesis $Q(x, t)$,

$$\frac{dQ(x, t)}{dt} = L(Q)\, P,$$

in a way that leads to convergence toward the objective distribution $P(x)$. We apply the standard definition of convergence: a hypothesis $Q(x, t)$ converges toward the distribution $P(x)$ with time t if for every $\eta > 0$ one can find T so that for every time $\hat{t} \ge T$: $|Q(x, \hat{t}) - P(x)| \le \eta$.

By analogy with the previous section, we introduce the set of eigenvectors $\{v_{Q,n} \mid n = 0, \dots, \infty\}$ (in this case, they are real functions on I) and corresponding eigenvalues $\{\lambda_{Q,n} \mid n = 0, \dots, \infty\}$ for the linear operator $L(Q)$ at every point of the manifold $\mathcal{Q}$, and note some of their general properties. Equation 4.3 shows that every point of $\mathcal{Q}$ is an eigenvector of the operator $L(Q)$ with corresponding eigenvalue $\lambda_{Q,0} = 0$. With the same arguments as in section 3, one can see that all other eigenvalues $\{\lambda_{Q,n} \mid n = 1, \dots, \infty\}$ must be different from zero to avoid the existence of an incorrect stable hypothesis. Furthermore, all eigenvectors $v_{Q,n}(x)$ associated with nonzero eigenvalues have to belong to the class

$$\int_0^{2\pi} v_{Q,n}(x)\, dx = 0 \tag{4.4}$$
and build a basis of this space.

Following our constructions for the discrete case, we introduce a distance $D(x, t)$ between the real distribution and the hypothesis, $D(x, t) = P(x) - Q(x, t)$. The dynamics of the distance function is given by:

$$\frac{dD(x, t)}{dt} = -\frac{dQ(x, t)}{dt} = -L(Q)\, P = -L(Q)\, D = -\sum_{n=1}^{\infty} \lambda_{Q,n} c_n v_{Q,n}(x). \tag{4.5}$$
As a difference between two elements of $\mathcal{Q}$, $D(x, t)$ satisfies the condition

$$\int_0^{2\pi} D(x, t)\, dx = 0 \qquad \forall t$$

and can be expanded using the basis $\{v_{Q,n}\}$:

$$D(x, t) = \sum_{n=1}^{\infty} c_n v_{Q,n}(x).$$
And finally, for the dynamics of $\|D(x, t)\|^2$:

$$\frac{d\|D(x, t)\|^2}{dt} = -2 D(x, t)\, L(Q)\, D = -2 \left( \sum_{n=1}^{\infty} c_n v_{Q,n}(x) \right) \left( \sum_{k=1}^{\infty} \lambda_{Q,k} c_k v_{Q,k}(x) \right).$$
Up to this relation, our constructions for the continuous approximation are the same as in the discrete case. But here the implementation of an orthogonal system of eigenvectors with corresponding positive eigenvalues at every point of $\mathcal{Q}$, and hence the proof of convergence, fails. Indeed, if the kernel function $K_Q(x, y)$ of the integral operator $L(Q)$ is smooth, the operator is compact. The infinite set of nonzero eigenvalues of such an operator has to converge toward null ($\lim_{n\to\infty} \lambda_n = 0$), and thus no positive infimum can be found. Thus, according to theorem 1, we cannot obtain an estimate of the convergence speed.

In order to construct a class of possible learning rules, we introduce some simplifications. First, we make an assumption about the form of the eigenvectors of the learning operator: at every point Q of the manifold $\mathcal{Q}$, we want to construct an operator with uniform eigenvectors given as trigonometric polynomials in complex form:

$$v_n(x) = \exp(inx) \qquad n = -\infty, \dots, \infty, \; n \neq 0.$$
All of these functions are orthogonal, independent, satisfy condition 4.4, and represent, together with the trivial eigenvector $Q(x)$, a basis of the functional space. The second supposition is that all corresponding eigenvalues are positive, do not depend on the actual hypothesis, and satisfy the condition

$$\sum_{n=-\infty}^{\infty} |\lambda_n| < \infty.$$

Because all vectors $\{v_n(x)\}$ are fixed, the dynamics of $D(x)$ in equation 4.5 can be described solely by the change of the coefficients $c_n$:

$$\frac{dD(x, t)}{dt} = \sum_{n \neq 0} \frac{dc_n(t)}{dt}\, v_n(x) = -\sum_{n \neq 0} \lambda_n c_n(t)\, v_n(x)$$

and

$$\frac{dc_n(t)}{dt} = -\lambda_n c_n(t). \tag{4.6}$$
After these simplifications we can prove the following convergence theorem.
Theorem 2. If the dynamics of all Fourier coefficients of a smooth function $g(x, t)$, $x \in [0, 2\pi]$, is given by

$$\frac{dc_n(t)}{dt} = -\lambda_n c_n(t),$$

where all $\lambda_n$ are positive, then the function $g(x, t)$ converges toward null. Precisely: $\forall \eta > 0 \; \exists\, T$ such that $\forall t \ge T$:

$$|g(x, t)| \le \eta \qquad \forall x \in [0, 2\pi].$$

Proof.
The absolute value of the function $g(x, t)$ can be estimated as follows:

$$|g(x, t)| = \left| \sum_n c_n(t) \exp(inx) \right| \le \sum_n |c_n(t)|.$$

Moreover, a well-known fact from Fourier analysis is that for every smooth function on I, the relation

$$\sum_{n=-\infty}^{\infty} |c_n| < \infty$$

holds. Therefore we can find a positive number L so that

$$\sum_{n=-\infty}^{-L} |c_n(0)| + \sum_{n=L}^{\infty} |c_n(0)| \le \frac{\eta}{2}.$$
Using equation 4.6 and $\lambda_n > 0$, we obtain $|c_n(t)| = |c_n(0)| \exp(-\lambda_n t) \le |c_n(0)|$ and

$$\sum_{n=-\infty}^{-L} |c_n(t)| + \sum_{n=L}^{\infty} |c_n(t)| \le \frac{\eta}{2}.$$
We introduce a partial sum of the coefficients:

$$S_L(t) = \sum_{n=-L}^{L} |c_n(t)|.$$
From the set of positive numbers $\{\lambda_n \mid n = -L, \dots, L\}$ we can always find a minimum $\gamma > 0$ so that all $\lambda_n \ge \gamma$. Thus, the dynamics of $S_L(t)$ can be easily estimated:

$$\frac{dS_L(t)}{dt} = -\sum_{n=-L}^{L} \lambda_n |c_n(t)| \le -\gamma \sum_{n=-L}^{L} |c_n(t)| = -\gamma S_L(t).$$

This differential equation yields $S_L(t) \le S_L(0) \exp(-\gamma t)$, and for all $t \ge \frac{1}{\gamma} \ln\!\left(\frac{2 S_L(0)}{\eta}\right)$ the relation $S_L(t) \le \frac{\eta}{2}$ holds. Finally:

$$|g(x, t)| \le \sum_{n=-\infty}^{\infty} |c_n(t)| = \sum_{n=-\infty}^{-L} |c_n(t)| + \sum_{n=L}^{\infty} |c_n(t)| + S_L(t) \le \frac{\eta}{2} + \frac{\eta}{2} \le \eta.$$
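The mechanism of theorem 2, exponential decay of each coefficient combined with a summable tail, can be illustrated numerically. The sketch below is not from the article; the eigenvalues and initial coefficients are arbitrary choices satisfying the theorem's hypotheses, and a finite truncation stands in for the infinite series:

```python
import numpy as np

# Modes n = -M..M, n != 0: a finite truncation standing in for the full series.
M = 20
n = np.concatenate([np.arange(-M, 0), np.arange(1, M + 1)])

lam = np.exp(-np.abs(n) / 2.0)           # positive, summable eigenvalues (example)
c0 = 1.0 / (1.0 + n.astype(float) ** 2)  # |c_n(0)| of a smooth example function

def sup_bound(t):
    """sup_x |g(x, t)| <= sum_n |c_n(0)| exp(-lambda_n t), as in the proof."""
    return float(np.sum(c0 * np.exp(-lam * t)))

bounds = [sup_bound(t) for t in (0.0, 10.0, 100.0, 1000.0)]
print(bounds)  # strictly decreasing, but ever more slowly, since lambda_n -> 0
```

The slowing decay visible in the printed sequence is exactly the compactness obstruction discussed above: because $\lambda_n \to 0$, no uniform convergence rate exists, yet the bound still tends to zero.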
In order to formulate the concrete form of the kernel of the operator $L(Q)$ with the given eigenvectors and eigenvalues, we introduce an adjoint basis $\{Q^*(x), v_n^*(x)\}$ that fulfills the following conditions:

$$Q(x) * Q^*(x) = 1, \qquad v_n(x) * Q^*(x) = v_n^*(x) * Q(x) = 0, \qquad v_n^*(x) * v_k(x) = \begin{cases} 1 & \text{if } n = k \\ 0 & \text{otherwise.} \end{cases} \tag{4.7}$$
The multiplication here is defined via the integral:

$$f * g = \int_0^{2\pi} f(x)\, g(x)\, dx.$$
Then for an arbitrary smooth function $g(x) = \sum_{n=-\infty}^{\infty} c_n v_n(x)$, the application of the operator $L(Q)$ can be written as:

$$L(Q)\, g(x) = \sum_{n=-\infty}^{\infty} \lambda_n c_n v_n(x) = \int_0^{2\pi} \left( \sum_{n=-\infty}^{\infty} \lambda_n v_n(x)\, v_n^*(y) \right) g(y)\, dy.$$
The functions satisfying conditions 4.7 are defined as follows:

$$Q^*(x) = 1, \qquad v_n^*(x) = \frac{1}{2\pi} \left( \exp(-inx) - \int_0^{2\pi} Q(z) \exp(-inz)\, dz \right) \qquad n = -\infty, \dots, \infty, \; n \neq 0.$$
The integral kernel is a smooth function if $\sum_{n=-\infty}^{\infty} |\lambda_n| < \infty$ and is given by:

$$K_Q(x, y) = \sum_{n=-\infty,\, n \neq 0}^{\infty} \lambda_n \frac{\exp(in(x - y))}{2\pi} - \int_0^{2\pi} \left( \sum_{n=-\infty,\, n \neq 0}^{\infty} \lambda_n \frac{\exp(in(x - z))}{2\pi} \right) Q(z)\, dz$$

$$= f(x - y) - \int_0^{2\pi} Q(z)\, f(x - z)\, dz \tag{4.8}$$
with the function $f(x) := \sum_{n \neq 0} \lambda_n \frac{\exp(inx)}{2\pi}$.

In this section, a detailed analysis of the dynamics of the approximation has been developed and its convergence has been proved. In order to apply this framework to discrete operating systems (e.g., computer simulations of approximation processes), we emphasize that the basic equations of the presented learning algorithm are 4.1 and 4.8. From these, a local algorithm can be derived that is based solely on adaptations of an internal hypothesis triggered by external stimuli. The result of a computer simulation in which the objective stimulus distribution $P(x)$ is compared with the evolved hypothesis $Q(x)$ after 10,000 iterations is presented in Figure 1. The start hypothesis was a uniform distribution, and the generating function for the integral kernel (see equation 4.8) was the gaussian $f(x) = \exp(-x^2/10)$.

5 Recognition of Regression Functions
One of the most important problems in neural processing and approximation theory is the recognition of statistical dependences between variables. More precisely, if we have two random variables x and w, which are correlated and distributed on the interval $I = [0, 2\pi]$, the conditional average of w for each x defines a function $W(x)$, the so-called regression function (RF):

$$W(x) = \langle w \mid x \rangle = \int_0^{2\pi} P_v(w \mid x)\, w\, dw,$$

where $P_v(w \mid x)$ is the conditional probability density of w given the value of x. We see that the definition of the RF is not symmetrical in w and x. To emphasize the basic difference between the variables, we will call x the stimulus and w the response. The approximation of the RF can be intuitively understood as the recognition of an objective functional dependence $W(x)$ from a signal with symmetrical noise. In this section we suppose that $W(x)$ and both probability densities, $P(x)$ and $P_v(w \mid x)$, are smooth on I, and that the distribution $P(x)$ is not equal to null at any point of the interval I.
Figure 1: Simulation of a continuous density approximation for an arbitrarily chosen objective distribution $P(x)$. Ten thousand iterations of the learning rule represented by equation 4.8 for the local update of the kernel function, together with the local update of the hypothesis $Q_{t+1}(x) := Q_t(x) + \varepsilon K(x, y_t)$ (learning rate $\varepsilon = 10^{-5}$), were performed. The generating function for the kernel was the gaussian $f(x) = \exp(-x^2/10)$, which fulfills the conditions of theorem 2. As the subjective initial hypothesis $Q(x, 0)$, a uniform distribution was assumed. Due to the convergence of the algorithm, quite a good representation could be achieved.
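A simulation in the spirit of Figure 1 can be sketched on a grid. Everything concrete below (grid size, learning rate, iteration count, objective density) is an illustrative choice, not taken from the article, and the gaussian generating function is replaced by a band-limited function whose Fourier coefficients $\lambda_n = e^{-|n|/2}$ are positive by construction, so the conditions of theorem 2 hold exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discretize [0, 2*pi); every concrete choice below is illustrative.
G = 128
x = np.linspace(0.0, 2 * np.pi, G, endpoint=False)
dx = 2 * np.pi / G

# Generating function f(x) = sum_{0<|n|<=8} lambda_n exp(inx)/(2*pi)
# with lambda_n = exp(-|n|/2): positive Fourier coefficients.
f = sum(np.exp(-n / 2.0) * np.cos(n * x) for n in range(1, 9)) / np.pi
fhat = np.fft.fft(f)

# Hypothetical objective density: smooth, strictly positive, normalized.
p = (1.0 + 0.5 * np.cos(x) + 0.3 * np.sin(2 * x)) / (2 * np.pi)

Q = np.full(G, 1.0 / (2 * np.pi))          # uniform start hypothesis
eps = 1e-3
ys = rng.choice(G, size=50_000, p=p * dx)  # stimuli y_t drawn from P

init_err = np.sum((Q - p) ** 2) * dx
for y in ys:
    # Kernel 4.8: K_Q(x, y) = f(x - y) - integral Q(z) f(x - z) dz
    conv = np.real(np.fft.ifft(np.fft.fft(Q) * fhat)) * dx
    Q = Q + eps * (np.roll(f, y) - conv)
final_err = np.sum((Q - p) ** 2) * dx

print(init_err, final_err)  # the final error should be far below the initial one
```

The expected per-step update is $\varepsilon\,((P - Q) \ast f)$, so each Fourier mode of the distance decays at rate $\varepsilon \lambda_n$ per step, mirroring equation 4.6; the update also integrates to zero, so $Q$ stays normalized throughout.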
Generally there are many possibilities for applying the approach of ideal approximation described in the previous sections to the recognition of $W(x)$. Here we describe a simple one that consists of two simultaneous approximation processes and obviously converges toward the RF. As in sections 3 and 4, we introduce a hypothesis $V(x, t)$ about the regression function $W(x)$. Moreover, we consider an additional function $F(x, t)$ and the hypothesis $Q(x, t)$ about the distribution $P(x)$. The hypothesis $Q(x, t)$ is changed sequentially after each stimulus y is received, as defined by equations 4.1 and 4.8, so that

$$\lim_{t \to \infty} Q(x, t) = P(x). \tag{5.1}$$
The local dynamics of $F(x, t)$ is in general given by:

$$F(x, t + \Delta t) - F(x, t) = \varepsilon K_F(x, y_t, w_t),$$
where $K_F(x, y_t, w_t)$ is a kernel function depending on the actual value of $F(x, t)$. We note that compared with the previous section, the kernel has
three parameters: the signal coordinate x, the stimulus y, and the response w. After the transition to the continuous description, the average dynamics of $F(x, t)$ can be written as:

$$\frac{dF(x, t)}{dt} = \int_0^{2\pi} \int_0^{2\pi} P(y, w)\, K_F(x, y, w)\, dy\, dw = \int_0^{2\pi} \int_0^{2\pi} P(y)\, P_v(w \mid y)\, K_F(x, y, w)\, dy\, dw. \tag{5.2}$$
We assume the kernel function $K_F(x, y, w)$ to be given in the form

$$K_F(x, y, w) = w \tilde{f}(x - y) - \int_0^{2\pi} F(z, t)\, \tilde{f}(x - z)\, dz,$$

where $\tilde{f}(x)$ is an arbitrary function with real and positive Fourier coefficients. After some calculations we obtain:

$$\frac{dF(x, t)}{dt} = \int_0^{2\pi} W(y)\, P(y)\, \tilde{f}(x - y)\, dy - \int_0^{2\pi} F(z, t)\, \tilde{f}(x - z)\, dz.$$

This equation describes the convolution of $D(x, t) = W(x) P(x) - F(x, t)$ with the function $\tilde{f}(x)$. Hence, the dynamics of the Fourier coefficients of $D(x, t)$ satisfies equation 4.6 and, by theorem 2, $D(x, t) \to 0$. Finally:

$$\lim_{t \to \infty} F(x, t) = W(x)\, P(x). \tag{5.3}$$
The hypothesis $V(x, t)$ about the regression function $W(x)$ can be defined at every time as:

$$V(x, t) = \frac{F(x, t)}{Q(x, t)}.$$

Together with equations 5.1 and 5.3 and the condition that $P(x)$ is nowhere null, we obtain:

$$\lim_{t \to \infty} V(x, t) = \frac{\lim_{t \to \infty} F(x, t)}{\lim_{t \to \infty} Q(x, t)} = W(x). \tag{5.4}$$
In order to study the performance of this algorithm, a simultaneous approximation of the RF $W(x)$ and estimation of the objective distribution of the stimuli x were computed via numerical integration. The stimuli were distributed with the same probability density $P(x)$ as in Figure 1. Figure 2 shows a comparison between the regression function $W(x)$ and the corresponding approximation $V(x, t)$ after 10,000 learning steps. Additional white noise was distributed in the interval $[-100, 100]$. The start hypothesis about the RF was the constant null function.
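The two simultaneous adaptation processes can be sketched as follows. All concrete choices are illustrative and not from the article: the generating function is again band-limited with positive coefficients, the response noise is kept modest rather than drawn from $[-100, 100]$, and for F the generating function additionally includes the $n = 0$ mode (F need not stay normalized, so its mean must also be able to adapt):

```python
import numpy as np

rng = np.random.default_rng(0)

G = 128
x = np.linspace(0.0, 2 * np.pi, G, endpoint=False)
dx = 2 * np.pi / G

# Band-limited generating function with coefficients lambda_n = exp(-|n|/2),
# 0 < |n| <= 8; fF adds the n = 0 mode so the mean of F can be learned.
f = sum(np.exp(-n / 2.0) * np.cos(n * x) for n in range(1, 9)) / np.pi
fF = f + 1.0 / (2 * np.pi)
fhat, fFhat = np.fft.fft(f), np.fft.fft(fF)

p = (1.0 + 0.5 * np.cos(x)) / (2 * np.pi)   # stimulus density (example)
W = 1.0 + np.sin(x)                          # hypothetical regression function

def smooth(g, ghat):
    """Circular convolution: the function x -> integral g(z) f(x - z) dz."""
    return np.real(np.fft.ifft(np.fft.fft(g) * ghat)) * dx

eps = 1e-3
Q = np.full(G, 1.0 / (2 * np.pi))   # hypothesis about P(x)
F = np.zeros(G)                      # hypothesis about W(x) P(x)

for y in rng.choice(G, size=50_000, p=p * dx):
    w = W[y] + rng.uniform(-0.5, 0.5)                      # noisy response
    Q = Q + eps * (np.roll(f, y) - smooth(Q, fhat))        # density rule (4.8)
    F = F + eps * (w * np.roll(fF, y) - smooth(F, fFhat))  # kernel K_F

V = F / Q   # hypothesis about the regression function (equation 5.4)
print(float(np.max(np.abs(V - W))))
```

On average, F drifts toward $W(x)P(x)$ and Q toward $P(x)$, so the ratio V approaches W up to a stochastic-fluctuation floor set by the learning rate.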
Maxim Khaikine and Klaus Holthausen
Figure 2: Simulation of the approximation of a regression function W(x), which was arbitrarily chosen. The underlying probability density P(x) was the same as in the foregoing simulation (see Figure 1). The initial hypothesis V(x, 0) about the regression function was a constant null function. Again, an appropriate representation of the true regression function could be achieved.
The corresponding dynamics of the distance ‖W(x) − V(x, t)‖ between the regression function and the iterated hypothesis is shown in Figure 3. We performed a series of integrations with various regression functions and usually achieved satisfactory approximations; the distance converged toward zero.

6 Discussion
Neural networks can be regarded as an extension of conventional techniques in statistical pattern recognition (Bishop, 1995), and certain network models can be derived from approaches to probability density estimation. For example, a principled alternative to the widely used self-organizing map of Kohonen (1982) could be developed that has the key property of defining a probability density model (Bishop, Svensen, & Williams, 1998). Accordingly, possible generalizations and applications of the presented approach can be pointed out. Essentially, we have described a framework for learning processes in neural systems, regarded as autonomous entities that adapt their internal structure on the basis of subjective probabilities constructed from
Figure 3: The monotonic decrease of the distance ‖W(x) − V(x, t)‖ between the real regression function and the iterated hypothesis, measured during the foregoing simulation (see Figure 2). After 30,000 iterations the initial distance was reduced by a factor of approximately 300.
randomly received input signals (Jost, Holthausen, & Breidbach, 1997). The approximation algorithms can easily be extended to higher-dimensional cases: we replace all one-dimensional hypotheses with the corresponding multidimensional ones and do the same for the kernels. All functions are then represented by multidimensional Fourier series, but the main convergence properties remain. For example, density estimation for two-dimensional distributions can be performed in the following manner. We assume that the objective distribution P(x₁, x₂) and the corresponding hypothesis Q(x₁, x₂) are given. An appropriate generating function for the kernel is defined by the two-dimensional gaussian f(x₁, x₂) = exp(−(x₁² + x₂²)/σ²). The adaptation of the kernel after a stimulus (y₁, y₂) has been received is given by the two-dimensional extension of equation 4.8. Thus, our approach can be applied to the generation of topological feature maps. Because of the optimal approximation capabilities of each element for the recognition of regression functions in higher-dimensional cases, our algorithms could be a good basis for the mathematical investigation of complex structures composed of such elements: artificial neural networks. Recent work on information processing and storage at the single-cell level has revealed previously unimagined complexity (Koch, 1997). Thus, one may assume that even single neurons could perform calculations that correspond to the presented approximation model.
The theorems provided here concerning the convergence of the approximation algorithms are only preliminary. In particular, the restriction to trigonometric polynomials could presumably be dropped in the course of a more comprehensive mathematical analysis.

Acknowledgments
This research was supported by the Thüringer Ministerium für Wissenschaft, Forschung und Kultur (B 301-96028). We thank Paul Ziche for critically reading an earlier draft of the manuscript for this article. We are grateful to one of the anonymous referees, who provided several detailed suggestions to clarify our derivations.

References

Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10, 215–234.
Girosi, F., & Poggio, T. (1990). Networks with the best approximation property. Biological Cybernetics, 63, 169–176.
Jost, J., Holthausen, K., & Breidbach, O. (1997). On the mathematical foundations of a theory of neural representation. Theory in Biosciences, 116, 125–139.
Koch, C. (1997). Computation and the single neuron. Nature, 385, 207–210.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59–69.
Kurkova, V. (1992). Kolmogorov's theorem and multilayer neural networks. Neural Networks, 5, 501–506.
Pfaffelhuber, E. (1972). Learning and information theory. International Journal of Neuroscience, 3, 83–88.
Scarselli, F., & Tsoi, A. C. (1998). Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Networks, 11, 15–37.

Received April 13, 1998; accepted January 27, 1999.
LETTER
Communicated by Richard Golden
On the Synthesis of Brain-State-in-a-Box Neural Models with Application to Associative Memory

Fation Sevrani
Kennichi Abe
Department of Electrical Engineering, Faculty of Engineering, Tohoku University, Sendai 980-8579, Japan

In this article we present techniques for designing associative memories to be implemented by a class of synchronous discrete-time neural networks based on a generalization of the brain-state-in-a-box neural model. First, we address the local qualitative properties and global qualitative aspects of the class of neural networks considered. Our approach to the stability analysis of the equilibrium points of the network gives insight into the extent of the domain of attraction for the patterns to be stored as asymptotically stable equilibrium points; it is useful both in the analysis of the retrieval performance of the network and for design purposes. Using the analysis results as constraints, the design for associative memory is performed by solving a constraint optimization problem whereby each of the stored patterns is guaranteed a substantial domain of attraction. The performance of the designed network is illustrated by means of three specific examples.

1 Introduction
In this article we consider a class of synchronous discrete-time neural networks described by a difference equation of the form

x(k + 1) = g(x(k) + α(W x(k) + b)),    x(0) = x₀,    (1.1)

where x(k) ∈ R^n is the state vector at time k, α is the step size, W = [w_ij] ∈ R^{n×n} is the connection (or weight) matrix, b = [b₁, …, b_n]^T ∈ R^n is a bias vector, and g(x) = [g(x₁), …, g(x_n)]^T is a vector-valued function with

g(x_i) = 1 if x_i ≥ 1;  g(x_i) = x_i if −1 < x_i < 1;  g(x_i) = −1 if x_i ≤ −1.

Neural Computation 12, 451–472 (2000)    © 2000 Massachusetts Institute of Technology
The saturation nonlinearity g(·) has the effect of restricting the allowable values of the state vector x(k) to the closed unit hypercube H^n = [−1, 1]^n. Neural network model 1.1, which was proposed by Hui and Żak (1992) (see also Golden, 1993), can be viewed as a generalization of Anderson's brain-state-in-a-box (BSB) neural model (Anderson, Silverstein, Ritz, & Jones, 1977), in which the connection matrix W is symmetric and the bias vector b = 0. The BSB model has been used primarily to model effects and mechanisms seen in psychology and cognitive science. (For further discussion of uses of the BSB models and additional references on this subject, see Anderson, 1993.) The application we discuss here is associative memory. In the implementation of associative memories, the network is designed to store a given set of vectors (desired memory patterns) as its asymptotically stable equilibrium points. Since nearby points in the domain of attraction of each stored pattern can be considered as representing partial information (given a suitable metric), a design that ensures a large domain of attraction around each of the stored patterns will necessarily increase the capability of the network to retrieve full information from a noisy or incomplete version of it. In this regard, one is also interested in minimizing the number of extraneous memories, that is, undesirable (spurious) asymptotically stable equilibrium points of the network, which may disturb the retrieval of the desired memory patterns. Various aspects of and difficulties involved in the implementation of associative memories with different neural network models are discussed in Michel and Farrell (1990). Our emphasis here is on the presentation of analysis results and methods for the design of associative memory by means of the neural network model 1.1 based on the above framework.
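One synchronous update of model 1.1 is straightforward to state in code. The sketch below is illustrative only: the outer-product weight matrix and the toy pattern are assumptions for demonstration, not the synthesis method developed later in the article.

```python
import numpy as np

def bsb_step(x, W, b, alpha):
    # One synchronous update of model 1.1: x(k+1) = g(x(k) + alpha*(W x(k) + b)),
    # where g(.) saturates each component to the interval [-1, 1].
    return np.clip(x + alpha * (W @ x + b), -1.0, 1.0)

def bsb_run(x0, W, b, alpha=0.5, max_iter=200):
    # Iterate until the state stops changing (a fixed point) or max_iter is reached.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_next = bsb_step(x, W, b, alpha)
        if np.array_equal(x_next, x):
            return x
        x = x_next
    return x

# Toy associative recall (assumed example): store p with zero-diagonal weights.
p = np.array([1.0, -1.0, 1.0])
W = np.outer(p, p) - np.eye(3)           # symmetric, w_ii = 0
b = np.zeros(3)
recovered = bsb_run(0.4 * p, W, b)       # a degraded input is driven back to p
```

Starting from the scaled-down (degraded) pattern, the trajectory saturates back onto the stored vertex p.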
Hui and Żak (1992) and Hui, Lillo, and Żak (1993) provided an analysis of the dynamic behavior and examined the stability of neural networks of the form 1.1 under various assumptions on the connection matrix W and bias vector b. The analysis results were interpreted from the point of view of the implementation of associative memory using the model 1.1; however, no specific information was provided on how to accomplish this. In particular, Hui and Żak (1992) examined the case where the connection matrix W is strongly diagonally dominant and gave conditions that are necessary and sufficient for the network to have asymptotically stable equilibrium points at every vertex of the hypercube. This result, however, imposes severe limitations on the extent of the domain of attraction of the desired patterns to be stored as asymptotically stable equilibrium points of the network. In our approach to studying the dynamic properties of the neural network model 1.1, we do not make use of any of the assumptions in Hui and Żak (1992) and Hui et al. (1993). Instead, we show that if the matrix W has zero diagonal elements, the network can have asymptotically stable equilibria only at the vertices of the hypercube (which are normally used for storing the memory patterns), although not necessarily every vertex of
the hypercube is asymptotically stable (see also Perfetti, 1995), and that under mild assumptions all other equilibria are unstable. In the analysis of the stability properties of the network, we establish results concerning the extent of the domain of attraction of the asymptotically stable equilibrium points located at the vertices of the hypercube, which are useful in the analysis of the retrieval performance of the network. Using the provided analysis results as constraints, the synthesis problem is formulated as a constraint optimization problem, which amounts to expanding the domain of attraction around each of the stored patterns while minimizing the number of spurious asymptotically stable equilibria in the vicinity of those patterns. The proposed formulation of the synthesis problem has flexibility in dealing with certain restrictions on the interconnection structure of the network, such as symmetric, nonsymmetric, and partially prespecified or sparse interconnections. One can take advantage of sparsity constraints (expressed as prespecified zero elements in the connection matrix) both computationally, when solving the synthesis problem, and in hardware implementation, where a reduction in the number of connections is desirable. Furthermore, the proposed formulation is shown to have the advantage of practical efficiency, in the sense that there exist powerful algorithms for solving the problem. What is also important is that the design can be carried out by another neural network, similar to the combined learning-production architecture proposed by Guez, Protopopescu, and Barhen (1988). Neural network models have been applied to several classes of constraint optimization problems and have shown promise for solving such problems.
Taking advantage of the degrees of freedom and flexibility of our approach to the synthesis problem, we further investigate the effect of the dispersion of the connection weights on the error-correction capabilities of the network in the presence of random errors. In doing so, we are motivated by results on optimal associative mappings (see Kohonen, 1987). These results suggest that the error-correction properties of such mappings are closely related to a smoothness prior, in the sense that all vectors that lie in a small neighborhood of the input vectors are mapped into a correspondingly small neighborhood of the output vectors. In the case of neural network model 1.1, a small dispersion of the connection weights is considered a possible way of encoding this a priori information. The effect of connection weight dispersion on the error-correction capabilities of the network is investigated by extensive simulations and is illustrated by an example in section 4.

The article is organized as follows. Section 2 presents theoretical results concerning local qualitative properties and global qualitative aspects of the considered network. In section 3, we present our approach to the synthesis problem, along with some additional heuristic arguments aimed at providing insight into the retrieval performance of the network. In this section we also discuss techniques for solving the synthesis problem. The applicability of the results is demonstrated by means of three examples presented in section 4. Concluding remarks are given in section 5.

2 Analysis Results
In this section, we first address the local stability properties of the equilibrium points of system 1.1; next, we present results concerning the global stability properties of this system. Our main emphasis is on establishing results concerning the extent of the domain of attraction of the asymptotically stable equilibrium points located at the vertices of the hypercube H^n, expressed in a form suitable for design purposes. Recall first that a point x* ∈ H^n is an equilibrium point of system 1.1 if and only if

x* = g(x* + α(W x* + b)).    (2.1)
A systematic method for locating all equilibrium points of 1.1 can be found in Hui et al. (1993), where, under various assumptions on the connection matrix W and vector b, conditions are provided whereby the stability or instability of equilibrium points can be determined. Note that in applications of neural network model 1.1 to associative memory, asymptotically stable equilibrium points at the vertices of H^n are used to store the desired memory patterns. In this regard, we shall prove the following result (proposition 1), which excludes the existence of other possible nonbinary stable equilibria. We begin with an assumption that we shall have occasion to use in the subsequent analysis:

Assumption A1. Every principal submatrix of W of size 2 × 2 or larger, including W itself, is hyperbolic; that is, none of their eigenvalues has a zero real part.
Proposition 1. Let all diagonal elements of matrix W be zero and suppose that W satisfies assumption A1. Then, for almost all b ∈ R^n (in the sense of Lebesgue measure), (i) the set of equilibrium points of system 1.1 is finite, and (ii) any stable equilibrium point of system 1.1 is located at a vertex of the hypercube.

Proof. See the appendix.
Now, concentrating on the vertices of the hypercube, let ν ∈ {−1, 1}^n denote a vertex of H^n. Using a continuity argument (see Hui et al., 1993), it can be shown that if

ν_i (W ν + b)_i > 0,    for i = 1, …, n,    (2.2)
then ν is an asymptotically stable equilibrium point of system 1.1. Note that this stability result is local, in the sense that there is a sufficiently small open neighborhood U(ν) of ν in R^n such that any trajectory of 1.1, corresponding to an initial condition x₀ ∈ H^n, converges to ν whenever x₀ belongs to U(ν) ∩ H^n. In the subsequent analysis we are concerned with a more global stability analysis for the equilibrium points of system 1.1 located at the vertices of the hypercube. More precisely, given that a vertex ν of H^n is asymptotically stable, depending on the positive margin with which condition 2.2 is satisfied, we would like to know the extent of this stability: that is, the set of points from which every trajectory of system 1.1 will converge to ν. Let C_ρ(ν) ⊂ H^n be a bounded set defined as

C_ρ(ν) = {x ∈ H^n : ‖x − ν‖_1 ≤ 2ρ},    (2.3)
where ‖·‖_1 is the l_1 vector norm on R^n, and ρ is a positive scalar, 0 < ρ < n. We shall prove the following result:

Proposition 2. Let ν be an asymptotically stable equilibrium point and let

0 < ρ < min_i ν_i (W ν + b)_i / (2 max_j |w_ij|),    i, j = 1, …, n.    (2.4)
Then every trajectory of system 1.1 starting from a point in C_ρ(ν) cannot escape C_ρ(ν) and will converge to ν.

Before proceeding further, it will be convenient to introduce some additional notation and geometric considerations that will be useful in the proof. Let V = {ν^(0), ν^(1), …, ν^(n)} be a set of n + 1 affinely independent vectors¹ in R^n such that ν^(0) = ν and ν^(i) = ν − 2ρ ν_i e_i, where e_i, i = 1, …, n, is the ith column vector of the identity matrix. Let co(V) denote the n-dimensional simplex expressed as a convex combination of the vertices ν^(0), ν^(1), …, ν^(n), that is,

co(V) = { x : x = Σ_{i=0}^{n} λ_i ν^(i),  Σ_{i=0}^{n} λ_i = 1,  0 ≤ λ_i ∈ R for all i }.    (2.5)

Equivalently, co(V) can be expressed by a set of linear inequalities {x ∈ R^n : u_j^T x ≤ h_j, j = 1, …, n + 1}, where u_j = ν_j e_j and h_j = 1 for j = 1, …, n, and u_{n+1} = −ν, h_{n+1} = 2ρ − n. Here each inequality u_j^T x ≤ h_j, j =

¹ Recall that a set of n + 1 vectors {x^(0), …, x^(n)} is affinely independent if and only if the translated set {x^(1) − x^(0), …, x^(n) − x^(0)} is linearly independent.
1, …, n, defines a closed half-space associated with a tangent (or supporting) hyperplane to H^n at ν. Accordingly, u_{n+1}^T x ≤ h_{n+1} defines the closed half-space associated with the hyperplane facing the vertex ν. In this regard, each vector ν^(i) lies on n linearly independent supporting hyperplanes of co(V); that is, each ν^(i) satisfies exactly n linearly independent equations u_j^T ν^(i) = h_j. Now suppose that x, |x_i| ≤ 1, i = 1, …, n, satisfies u_{n+1}^T x ≤ 2ρ − n. Then u_{n+1}^T (x − ν) ≤ 2ρ or, equivalently, ‖x − ν‖_1 ≤ 2ρ. Thus, the set C_ρ(ν) can be represented as the intersection of the n-simplex spanned by the point ν together with the points ν^(1), …, ν^(n) in R^n, and the hypercube H^n. In the proof of proposition 2 we shall make use of the following lemma (a proof of the lemma is given in the appendix, following the approach of Golden, 1993).

Lemma 1. The neural network model 1.1 can be represented as an uncertain linear system described by the state equation of the form

x(k + 1) = x(k) + α Γ(k)(W x(k) + b),    (2.6)

where Γ(k) ∈ R^{n×n} is a diagonal matrix with diagonal elements 0 ≤ γ_i(k) ≤ 1.

Proof of Proposition 2. We show first that any trajectory of equation 1.1 originating in C_ρ(ν) can never leave C_ρ(ν); that is, if x₀ ∈ C_ρ(ν), then x(k) remains in C_ρ(ν) for all k > 0. Since, due to the saturating nonlinearity, the state of the system is confined to the hypercube H^n, in order to leave C_ρ(ν), any trajectory originating in C_ρ(ν) must cross the face of C_ρ(ν) opposing the vertex ν, {x ∈ C_ρ(ν) : ⟨ν, x⟩ = n − 2ρ}, which bounds the closed half-space ⟨ν, x⟩ ≥ n − 2ρ. Hence, in view of lemma 1, it suffices to show that if x ∈ C_ρ(ν), then ⟨ν, Γ(W x + b)⟩ ≥ 0. For this, observe that since C_ρ(ν) ⊆ co(V), each point x ∈ C_ρ(ν) is uniquely expressible as a convex combination of the form 2.5. Using the fact that λ₀ + ⋯ + λ_n = 1, we have

⟨ν, Γ(W x + b)⟩ = Σ_{j=0}^{n} λ_j Σ_{i=1}^{n} γ_i(·) ν_i (W ν^(j) + b)_i.    (2.7)

By noting that ν^(0) = ν and substituting ν^(j) = ν − 2ρ ν_j e_j, we have by assumption that

ν_i (W ν^(j) + b)_i = ν_i (W ν + b)_i − 2ρ w_ij ν_i ν_j > 0,    for i, j = 1, …, n.    (2.8)

Since the λ_j and γ_i(·) are nonnegative, with the λ_j adding up to 1, we conclude that ⟨ν, Γ(W x + b)⟩ ≥ 0.
To this end we have shown that C_ρ(ν) is an invariant set. Next we show that x(k) → ν as k → ∞ for any x₀ ∈ C_ρ(ν). Without loss of generality, consider the case where ν_i = 1, and let x(k) ∈ C_ρ(ν) be such that x_i(k) ≠ 1. Then the ith component of x(k) + α(W x(k) + b) is

x_i(k) + α Σ_{j=0}^{n} λ_j (W ν^(j) + b)_i > x_i(k),

where the inequality follows from equations 2.4 and 2.8. Thus x_i(k + 1) > x_i(k). The case where ν_i = −1 is similar. Continuing these arguments for x(k + 2), x(k + 3), …, it can be shown that the corresponding trajectory converges monotonically to ν as k → ∞.

The next result addresses the existence of other binary equilibrium points of system 1.1 in the immediate vicinity of C_ρ(ν).

Proposition 3. Let w_ii = 0 for i = 1, …, n. Suppose that ν is an asymptotically stable equilibrium point, and let k = ⌈min_i ν_i (W ν + b)_i / (2 max_j |w_ij|)⌉, where ⌈x⌉ denotes the smallest integer ≥ x and i, j = 1, …, n. Then none of the vertices ν^(k) at Hamming distance d_H(ν, ν^(k)) ≤ k is an equilibrium point.
Proof. By assumption there exists ρ > 0 satisfying equation 2.4 such that, from proposition 2 and in view of the definition of k, none of the vertices of the hypercube within Hamming distance k − 1 from ν, that is, d_H(ν, ν̃) ≤ k − 1, ν̃ ≠ ν, can be an equilibrium point. Now let ν^(k−1) be an arbitrary vertex with d_H(ν, ν^(k−1)) = k − 1, and let J ⊂ {1, …, n} of cardinality |J| = k − 1 be such that ν^(k−1) = ν − 2 Σ_{j∈J} ν_j e_j. Let ν^(k) = ν^(k−1) − 2 ν_i^(k−1) e_i be such that ν_i^(k) ≠ ν_i, where i ∉ J, i = 1, …, n. By assumption, we have that

ν_i (W ν^(k−1) + b)_i = ν_i (W ν + b)_i − 2 ν_i Σ_{j∈J} w_ij ν_j > 0,    for i = 1, …, n.

Since w_ii = 0 for all i, we obtain ν_i^(k) (W ν^(k) + b)_i = −ν_i (W ν^(k−1) + b)_i < 0 for i ∉ J. Hence, from equation 2.1, it follows that ν^(k) cannot be an equilibrium point.
An immediate consequence of proposition 3 is the following:

Corollary 1. Let w_ii = 0 for all i, and suppose that ν is an asymptotically stable equilibrium point. Then none of the vertices of the hypercube at Hamming distance one from ν can be an equilibrium point.
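Propositions 2 and 3 and corollary 1 can be checked numerically on a small example. The sketch below is a hedged illustration: the four-unit outer-product network is an assumption, and the vertex test used for equilibria simply restates equation 2.1 at a vertex, where no component may be pulled off its saturated value.

```python
import numpy as np
from itertools import product

def vertex_is_equilibrium(v, W, b):
    # A hypercube vertex v satisfies equation 2.1 iff v_i * (W v + b)_i >= 0 for all i.
    return bool(np.all(v * (W @ v + b) >= 0.0))

def rho_bound(v, W, b):
    # Lower bound (2.4) on the radius of the guaranteed attraction region C_rho(v).
    margins = v * (W @ v + b)                    # positive by condition 2.2
    return float(np.min(margins) / (2.0 * np.max(np.abs(W))))

# Hypothetical 4-unit network storing the vertex v (assumed example).
v = np.array([1.0, 1.0, -1.0, 1.0])
W = np.outer(v, v) - np.eye(4)                   # zero diagonal
b = np.zeros(4)
rho = rho_bound(v, W, b)                         # here rho = 3/2, so 2*rho = 3

# Proposition 2: start inside C_rho(v) (l1-distance <= 2*rho) and iterate model 1.1.
alpha = 0.1
x = v.copy()
x[0] = -x[0]                                     # flip one bit: ||x - v||_1 = 2 < 2*rho
for _ in range(500):
    x = np.clip(x + alpha * (W @ x + b), -1.0, 1.0)

# Corollary 1: no equilibrium vertex lies at Hamming distance one from v.
equilibria = [np.array(u) for u in product([-1.0, 1.0], repeat=4)
              if vertex_is_equilibrium(np.array(u), W, b)]
distances = sorted(int(np.sum(e != v)) for e in equilibria)
```

For this network the only vertex equilibria are ±v (Hamming distances 0 and 4), and the trajectory started one flipped bit away from v is driven back onto it.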
In the foregoing results, no considerations regarding the global stability of system 1.1 were made. Recall that a neural network (such as network 1.1) is called globally stable if every trajectory of the network converges to some equilibrium point; this guarantees that periodic and chaotic motions cannot exist in the network. Theorem 1 below provides conditions on the connection matrix W that guarantee the global stability of system 1.1. In the proof of theorem 1 we will make use of an additional assumption:

Assumption A2. Given assumption A1, system 1.1 has a finite number of equilibrium points.

Assumption A2 seems reasonable. In fact, if matrix W satisfies assumption A1 and, in addition, w_ii ≠ 0 for all i, then every principal submatrix of W is of full rank, and, following arguments similar to those in the proof of proposition 1 (see also Hui et al., 1993), it can be shown that system 1.1 cannot have an infinity of equilibrium points. When w_ii = 0 for all i, as shown in the proof of proposition 1, assumption A2 holds for almost all b ∈ R^n, with the possible exception of those belonging to a set of measure zero.

Theorem 1. Suppose that system 1.1 satisfies assumption A2 and that W = W^T is either positive definite or α < 2/|λ_min|, where λ_min is the smallest eigenvalue of the symmetric matrix W. Then any trajectory of 1.1 starting from an arbitrary point x₀ ≠ 0, x₀ ∈ H^n, will converge to an equilibrium point.
Proof.
See the appendix.
In practice, due to noise, every trajectory of 1.1 will converge to an asymptotically stable equilibrium point, which, from proposition 1, with the additional constraint that matrix W have all of its diagonal elements zero, will be located at a vertex of the hypercube H^n.

3 Application to Synthesis
In this section we address the problem of synthesis of associative memory using the neural network model 1.1:

Synthesis Problem. Given a set of binary vectors (desired memory patterns) {p^(1), …, p^(m)}, p^(k) ∈ {−1, 1}^n, and given the step size α > 0, find W and b such that:

1. p^(1), …, p^(m) are stored as asymptotically stable equilibrium points (memories) of system 1.1.

2. The domain of attraction of each stored pattern is as large as possible.
3. The total number of spurious asymptotically stable equilibria (extraneous stored patterns) is as small as possible.

4. System 1.1 is globally stable; that is, every trajectory of the system converges to some equilibrium point.

In regard to design criterion 4, the global stability of the network is not essential for storing the information; indeed, several existing design methods do not take the global stability of the network into consideration (see, e.g., Michel, Farrell, & Porod, 1989; Michel & Farrell, 1990). However, in terms of the reliability of retrieving the information, for example, when one expects the retrieval of a stored pattern from an input that is further away from that pattern, the global stability of the network is of great interest.

Based on the analysis results provided in the preceding section, we approach the synthesis problem in the following manner. According to proposition 2, by selecting an n × n matrix W = [w_ij] and an n-dimensional vector b = [b₁, …, b_n]^T such that for each k = 1, …, m the inequality

Σ_{j=1}^{n} w_ij p_i^(k) p_j^(k) + p_i^(k) b_i ≥ σ,    i = 1, …, n,    (3.1)

is satisfied for some given σ > 0, we are guaranteed that p^(1), …, p^(m) will be stored as memories of system 1.1. Furthermore, associated with each p^(k) there will be a stability region {x ∈ H^n : ‖x − p^(k)‖_1 < 2ρ_k}, where ρ_k = min_i p_i^(k) (W p^(k) + b)_i / (2 max_j |w_ij|), i, j = 1, …, n, such that any trajectory of the system originating within this region will converge to p^(k). Assuming that |w_ij| ≤ w̄ for all i, j, the quantity ρ_k gives a lower bound on the extent of the domain of attraction of a stored pattern,² and thus provides a reasonable measure of the error-correction capabilities of the network. From an alternative viewpoint, given a stored pattern corresponding to an asymptotically stable equilibrium point of the network, an idea of how much the domain of attraction of that pattern can be expanded is also given by the location of other asymptotically stable equilibrium points in the vicinity, that is, spurious asymptotically stable equilibrium points as well as points used for storing the (desired) memory patterns. Hence, the requirement of excluding spurious asymptotically stable equilibria, at least within some reasonable distance from each of the stored patterns, allows for a larger extent of the domain of attraction for those patterns. It is clear that such an extent is bounded from above by the locations of the stored patterns, and in general it decreases as their number increases.
² Note that the domain of attraction of the stored patterns is not known in its entirety.
In regard to the location of spurious asymptotically stable equilibrium points, from proposition 1 we have that, for system 1.1 with a zero-diagonal connection matrix W satisfying assumption A1, those points must be located at the vertices of the hypercube. By proposition 3, provided that condition 3.1 is satisfied with w_ii = 0 for all i, we are guaranteed that within a Hamming distance r_k = ⌈ρ_k⌉ from a stored pattern p^(k) there can be no equilibrium point apart from p^(k) itself. In particular (see corollary 1), none of the vertices at Hamming distance one from any of the stored patterns can be an equilibrium point, which implies that none of those vertices can be used as memory locations. However, we consider this a desirable property, since patterns that differ in only one component may be too close to allow a significant amount of error correction.

At this point some observations are in order. Suppose that for some k, 1 ≤ k ≤ m, we have

Σ_{j=1}^{n} w_ij p_i^(k) p_j^(k) + p_i^(k) b_i − 2 Σ_{j∈J} w_ij p_i^(k) p_j^(k) > 0    (3.2)

for at least one integer i ∈ J ⊂ {1, …, n}, where J is any subset of integers of cardinality |J| ≥ r_k (r_k ≥ 1). Stated otherwise, for at least one integer i ∈ J, the ith component of W ν + b, at the vertex ν that differs from p^(k) in the components p_i^(k), i ∈ J, points to p^(k). From equation 2.1, it is clear that there cannot be any equilibrium point located at such a vertex. Note that when a significant number of components of W ν + b at ν point to p^(k), we have reason to expect that a trajectory starting at ν will approach p^(k) and eventually reach that pattern. Hence, since σ in condition 3.1 provides a lower bound for the first two terms in equation 3.2, seeking to maximize σ (subject to |w_ij| ≤ w̄ for all i, j) will further reduce the number of spurious asymptotically stable equilibria and, in addition, will ensure good retrieval of the memory patterns even from locations (vertices) more distant than those for which we can guarantee perfect retrieval.

A final observation concerns the effect of the bound w̄ on the error-correction capabilities of the network in the presence of random binary errors. Suppose that we require the retrieval of a memory pattern in the presence of errors due to the flipping of some of its components. We anticipate that the dispersion of the w_ij values will affect the strength of the perturbations along all other components of that pattern: the larger the dispersion of the w_ij values, the more information is lost in the presence of those errors and, hence, the less likely it is that the memory pattern will be correctly retrieved. The combined effect of increasing the positive margin σ in equation 3.1 and the bound w̄ on the error-correction capabilities of the network will be demonstrated by the simulation results presented in section 4.
Now, putting it all together, a possible strategy to approach the synthesis problem can be formulated as follows:
Synthesis Strategy. Given a set of binary vectors {p^(1), …, p^(m)}, p^(k) ∈ {−1, 1}^n, to be stored as memories of system 1.1, and given α > 0 and w̄ > 0, find W = [w_ij] ∈ R^{n×n} and b = [b₁, …, b_n]^T ∈ R^n such that σ is maximized subject to the constraints

Σ_{j=1}^{n} w_ij p_i^(k) p_j^(k) + p_i^(k) b_i ≥ σ,    σ > 0,    i = 1, …, n,    k = 1, …, m,    (3.3)

−w̄ ≤ w_ij ≤ w̄, i ≠ j,    w_ii = 0,    i, j = 1, …, n.    (3.4)
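Without the global stability requirement, the strategy above is linear in the entries of W, b, and σ and can be sketched with an off-the-shelf LP solver. The sketch below is illustrative, not the authors' procedure: the two patterns and the bound w̄ = 1 are arbitrary choices, and the extra bound |b_i| ≤ w̄ is an assumption added here only to keep the LP bounded.

```python
import numpy as np
from scipy.optimize import linprog

patterns = [np.array([1.0, 1.0, -1.0, -1.0]),
            np.array([1.0, -1.0, 1.0, -1.0])]
n, w_bar = 4, 1.0

# Decision variables: the off-diagonal w_ij, then b_1..b_n, then sigma.
off = [(i, j) for i in range(n) for j in range(n) if i != j]
nvar = len(off) + n + 1

# Constraint 3.3 rewritten as A_ub z <= 0:
#   -sum_{j != i} w_ij p_i p_j - p_i b_i + sigma <= 0   for every pattern and every i.
A_ub, b_ub = [], []
for p in patterns:
    for i in range(n):
        row = np.zeros(nvar)
        for c, (a, j) in enumerate(off):
            if a == i:
                row[c] = -p[i] * p[j]
        row[len(off) + i] = -p[i]
        row[-1] = 1.0
        A_ub.append(row)
        b_ub.append(0.0)

c = np.zeros(nvar)
c[-1] = -1.0                                     # linprog minimizes, so maximize sigma
bounds = ([(-w_bar, w_bar)] * len(off)           # constraint 3.4 on the w_ij
          + [(-w_bar, w_bar)] * n                # assumed extra bound on b
          + [(0.0, None)])                       # sigma nonnegative
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)

W = np.zeros((n, n))                             # w_ii = 0 by construction
for cidx, (a, j) in enumerate(off):
    W[a, j] = res.x[cidx]
b_vec = res.x[len(off):len(off) + n]
sigma = res.x[-1]
```

At the optimum each pattern satisfies p_i(W p + b)_i ≥ σ > 0, so by condition 3.1 it is stored as an asymptotically stable equilibrium; enforcing symmetry and constraint 3.5 as well leads to the semidefinite formulation discussed below.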
By making use of the condition on the smallest eigenvalue λ_min(W) provided by theorem 1,³ the global stability requirement imposes the additional constraint

W = W^T,    τ I_n + W > 0,    τ = 2/α,    (3.5)
where I_n is the n × n identity matrix and the inequality sign above means that (τ I_n + W) is positive definite.

In the following we point out some computational aspects of solving the synthesis problem based on the formulated strategy. Consider first the problem of determining W and b such that σ is maximized subject to constraints 3.3 and 3.4. In this case, the synthesis problem reduces to a general linear programming (LP) problem. Provided that a solution for W and b has been found, the resulting network is not guaranteed to be globally stable; however, all given patterns are guaranteed to be stored as asymptotically stable equilibrium points of the network, each having a substantial domain of attraction and hence a significant amount of error correction. From a computational perspective, based on an LP formulation, the synthesis problem can be solved efficiently by using any of the polynomial-time algorithms for LP (see, e.g., Lustig, Marsten, & Shanno, 1994) or by employing another neural network that offers the computational advantages of massively parallel processing and fast convergence (see, e.g., Xia & Wang, 1995; Xia, 1996). Note also that restrictions on the interconnection structure that require predetermined elements of W to be zero can be incorporated simply by excluding those elements from constraints 3.3 and 3.4.

With the addition of the global stability constraint 3.5, the synthesis problem, that is, the problem of maximizing σ subject to equations 3.3 through 3.5, cannot be cast as a linear program. However, it can be cast as a semidefinite program (see Vandenberghe & Boyd, 1996) in the variables W = W^T, b, and σ. In semidefinite programming (SDP), one minimizes a linear function subject to the constraint that an affine combination of symmetric matrices is positive semidefinite,⁴ which can be regarded
³ Note that since w_ii = 0 for all i, W = W^T cannot be positive definite.
⁴ Many common convex constraints (for example, linear and quadratic inequalities, eigenvalue inequalities, and matrix norm inequalities) can be expressed in this way.
462
Fation Sevrani and Kennichi Abe
as an extension of LP in which the componentwise inequalities between vectors are replaced by matrix inequalities. Semidefinite programs are convex optimization problems and, as in LP, they can be solved in polynomial time by effective algorithms that rapidly compute the global minimum with nonheuristic stopping criteria (Nesterov & Nemirovsky, 1994).

We note here that, in terms of computational requirements, it is also possible to consider a simple heuristic algorithm for solving the synthesis problem. More specifically, given the specifications for a and w̄, one can approach the synthesis problem by finding a feasible solution for W and b subject to constraints 3.3 and 3.4 (with the additional constraint that w_ij = w_ji for i ≠ j if global stability is required) for increasing values of s, until the solution space is empty. In such a case, simple numerical relaxation techniques can be used, such as the technique used in Perfetti (1995).⁵ To check for global stability, one can then compute the smallest eigenvalue of the symmetric matrix W. Referring to equation 3.5, if λ_min < −2/a, the above procedure is repeated with a lower value for the bound w̄ until one ends up with a solution for W and b that solves the synthesis problem.⁶

We conclude this section with a final comment concerning the stability properties of the desired memory patterns, stored as asymptotically stable equilibrium points of the network, when considering inaccuracies incurred during the implementation process or a limited-precision implementation of system 1.1. Denoting the parameter inaccuracies by ΔW and Δb, respectively, and performing an analysis similar to that in Liu and Michel (1994), it can be shown that the stored patterns retain their stability properties provided that the following condition holds:⁷

‖ΔW‖₁ + ‖Δb‖₁ < s,

where ‖·‖₁ denotes the matrix norm induced by the l₁ vector norm.
Thus, increasing s will certainly increase the robustness of the stability of the stored patterns in the presence of implementation inaccuracies or parameter perturbations.
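The LP route described above can be sketched numerically. The snippet below is a minimal illustration, not the authors' program: constraints 3.3 and 3.4 are not reproduced in this section, so it assumes they take the common margin form p_i((W p)_i + b_i) ≥ s for every stored pattern and every component, together with |w_ij| ≤ w̄ and w_ii = 0. The function name `synthesize_lp` and the toy patterns are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def synthesize_lp(patterns, w_bar):
    """Maximize the margin s over symmetric, zero-diagonal W and bias b.

    Assumed storage constraint (a stand-in for constraints 3.3-3.4):
    p_i * ((W p)_i + b_i) >= s for every stored pattern p and every i.
    """
    P = np.asarray(patterns, dtype=float)
    m, n = P.shape
    rows_iu, cols_iu = np.triu_indices(n, k=1)   # free weights w_ij with i < j
    nw = len(rows_iu)
    nv = nw + n + 1                              # variables: [w..., b..., s]

    A_ub, b_ub = [], []
    for p in P:
        for i in range(n):
            row = np.zeros(nv)
            for k in range(nw):                  # -p_i * (W p)_i, W used symmetrically
                r, c = rows_iu[k], cols_iu[k]
                if r == i:
                    row[k] = -p[i] * p[c]
                elif c == i:
                    row[k] = -p[i] * p[r]
            row[nw + i] = -p[i]                  # -p_i * b_i
            row[-1] = 1.0                        # ... + s <= 0
            A_ub.append(row)
            b_ub.append(0.0)

    c = np.zeros(nv)
    c[-1] = -1.0                                 # linprog minimizes, so maximize s
    bounds = [(-w_bar, w_bar)] * nw + [(None, None)] * n + [(0, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    W = np.zeros((n, n))
    W[rows_iu, cols_iu] = res.x[:nw]
    W = W + W.T                                  # symmetric, zero diagonal by construction
    return W, res.x[nw:nw + n], res.x[-1]

# Toy usage: two patterns of length 4, bound w_bar = 0.5.
W, b, s = synthesize_lp([[1, -1, 1, -1], [1, 1, -1, -1]], w_bar=0.5)
lam_min = np.linalg.eigvalsh(W).min()
globally_stable = lam_min > -2.0                 # theorem 1 condition with a = 1
```

The eigenvalue check at the end implements the heuristic described above: if λ_min < −2/a, one would rerun the synthesis with a smaller bound w̄.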
⁵ Experimentation with the algorithm proposed in Perfetti (1995) shows that, in terms of numerical stability and number of iterations, this algorithm is (as can be expected) quite inferior to, for example, the interior-point algorithms for LPs or SDPs.
⁶ Geršgorin's theorem applied to the zero-diagonal symmetric matrix W states that all eigenvalues of W are located inside the circular disc centered at the origin with radius max_i Σ_{j=1}^{n} |w_ij|, which bounds the spectral radius of W.
⁷ Note that, when considering inaccuracies during the implementation process, one is more concerned with the parameter inaccuracies in the connection matrix W than in the bias vector b.
Synthesis of Brain-State-in-a-Box Neural Models
463
At this point it should be noted that, since a large s will usually result in a large spread or dispersion of the w_ij values, as we conjectured above and as the simulation results also show, it may cause the error-correction capability of the network to deteriorate in the presence of random binary errors. Furthermore, if we require global stability of the network, simulation results suggest that the range of feasible values of s is small in practice. There is thus a trade-off among robustness, global stability, error-correction capability, and speed of operation, which remains to be resolved by the designer.

4 Simulations
In this section, we demonstrate the applicability of the theoretical results, as well as the consistency of the arguments presented in the previous sections, by means of three specific examples.

4.1 Example. In this example we compare the retrieval performance of the neural network model synthesized with the strategy given in the previous section against the performance of the networks designed by the methods of Michel, Si, & Yen (1991) and Perfetti (1995) (the latter in the special case of system 1.1 where b = 0). For the sake of comparison with the results of Michel et al. and Perfetti, in this example we choose a = 1 and require the network to be globally stable. We consider the following set of binary patterns of length n = 10 (cf. Michel et al., 1991, example 2.1) to be stored as asymptotically stable equilibrium points of equation 1.1:

p(1) = [−1, 1, −1, 1, 1, 1, −1, 1, 1, 1]^T,
p(2) = [1, 1, −1, −1, 1, −1, 1, −1, 1, 1]^T,
p(3) = [−1, 1, 1, 1, −1, −1, 1, −1, 1, −1]^T,
p(4) = [1, 1, −1, 1, −1, 1, −1, 1, 1, 1]^T,
p(5) = [1, −1, −1, −1, 1, 1, 1, −1, −1, −1]^T,
p(6) = [−1, −1, −1, 1, 1, −1, 1, 1, −1, 1]^T.
Specifying a = 1 and w̄ = 0.5, we determine a 10 × 10 matrix W = [w_ij] and a 10-dimensional vector b by maximizing s over equations 3.3 and 3.4 with the additional symmetry constraint w_ij = w_ji for i ≠ j, which is solved by linear programming. We obtain
W =
[  0.0000  −0.0071  −0.4009  −0.4419  −0.5000   0.3199   0.1017  −0.3457  −0.0071   0.2898 ]
[ −0.0071   0.0000   0.1931  −0.0335  −0.3104   0.0000  −0.3300  −0.1920   0.4552   0.2350 ]
[ −0.4009   0.1931   0.0000   0.2238  −0.3885  −0.3046   0.1568  −0.2677   0.1931  −0.3830 ]
[ −0.4419  −0.0335   0.2238   0.0000  −0.4009   0.0253  −0.2708   0.3909  −0.0335   0.1719 ]
[ −0.5000  −0.3104  −0.3885  −0.4009   0.0000   0.0780   0.1773  −0.0968  −0.3104   0.3921 ]
[  0.3199   0.0000  −0.3046   0.0253   0.0780   0.0000  −0.5000   0.2946   0.0000  −0.5000 ]
[  0.1017  −0.3300   0.1568  −0.2708   0.1773  −0.5000   0.0000  −0.3584  −0.3300  −0.3119 ]
[ −0.3457  −0.1920  −0.2677   0.3909  −0.0968   0.2946  −0.3584   0.0000  −0.1920   0.4185 ]
[ −0.0071   0.4552   0.1931  −0.0335  −0.3104   0.0000  −0.3300  −0.1920   0.0000   0.2350 ]
[  0.2898   0.2350  −0.3830   0.1719   0.3921  −0.5000  −0.3119   0.4185   0.2350   0.0000 ]
Table 1: Simulation Results for Example 1. Number of Binary Inputs Converging to the Desired Pattern.

Pattern Number    HD = 1   HD = 2   HD = 3   HD = 4   HD ≥ 5   Total
1                     10       35       65       62       40     212
2                     10       38       69       63       78     258
3                     10       31       39       10        0      90
4                      8       33       41       25        4     111
5                     10       35       45       20        7     117
6                     10       33       51       24        6     124
Convergence
rate (%)            96.7     75.9     43.1     16.2     10.4    89.0
Table 2: Simulation Results for the Design of Michel et al. (1991, Example 2.1). Number of Binary Inputs Converging to the Desired Pattern.

Pattern Number    HD = 1   HD = 2   HD = 3   HD = 4   HD ≥ 5   Total
1                     10       25       16        4        0      55
2                     10       27       28        7        0      72
3                     10       32       34       12        2      90
4                      8       17       14        7        2      48
5                     10       30       26        5        0      71
6                      9       26       22        5        0      62
Convergence
rate (%)            95.0     58.1     19.4      3.2      0.3    38.9
and

b = [−0.1071, 0.5972, −1.4713, 0.8561, 0.4372, 0.1175, 1.1485, −0.1649, 0.5972, −0.0323]^T,
corresponding to s_max = 0.5. Also, λ_min(W) = −1.5, which, together with the fact that W is symmetric, guarantees the global stability of the synthesized network according to theorem 1. Using condition 2.3, it can be verified that system 1.1 with the W and b given above has, in addition to the desired patterns p(1), ..., p(6), three further binary patterns stored as asymptotically stable equilibrium points (spurious memories).

The simulation results summarized in Tables 1–3 compare the retrieval performance of the synthesized neural network 1.1 with the performance of the networks designed by the methods of Michel et al. (1991) and Perfetti (1995). In columns 2 to 6 of each table, the entry in a given row, corresponding to the pattern specified in the first column, shows the number of binary inputs converging to that pattern out of all possible binary inputs at Hamming distance HD = 1, ..., 4 and HD ≥ 5, used as initial conditions for the state of the network. Referring, for example, to the first
Table 3: Simulation Results for the Design of Perfetti (1995, Example 2). Number of Binary Inputs Converging to the Desired Pattern.

Pattern Number    HD = 1   HD = 2   HD = 3   HD = 4   HD ≥ 5   Total
1                     10       31       46       30        8     125
2                      7        8        5        0        0      20
3                     10       15       19        8        0      52
4                      8       31       47       34        4     124
5                     10       33       42       28        7     120
6                      6       11        7        4        0      28
Convergence
rate (%)            85.0     47.8     23.1      8.3      1.3    45.8

Note: Special case of neural network 1.1 with b = 0.
row of Table 1: for pattern p(1), 35 of the 45 possible initial conditions at Hamming distance HD = 2 from p(1) converged to p(1). The last row of each table shows the average rate of convergence to the correct stored pattern.⁸ The total in the last column of each table gives an indication of the attraction capability of each pattern.

As shown in Table 1, the synthesized network (see equation 2.1) attracted 89% of the binary inputs (there are 2^10 = 1024 possible binary inputs) to the desired patterns. The design method (eigenstructure method) of Michel et al. (1991) (example 2.1, with design parameters t1 = 1.5, t2 = 0.5) attracted 39% of the binary inputs to the desired patterns, and the method of Perfetti (1995) (example 2, with design parameter d = 0.75) attracted 46%.

At this point we make a few additional observations. As seen from Table 2, for inputs at Hamming distance HD > 1 from the desired memory patterns, the convergence rate of the network designed by the eigenstructure method drops faster than that of the two other designs. In fact, the network designed by the eigenstructure method, with the design parameters specified above, has 12 additional spurious (nonbinary) memories that affect the retrieval efficiency of the network. This is in contrast with our design, where there are only three spurious memories, as reflected in the convergence rates shown in Table 1. For the design of Perfetti (1995) there are eight spurious memories; six are the negatives of the desired patterns, which are automatically stored as asymptotically stable equilibrium points of the network.
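The counting procedure behind these retrieval experiments can be sketched as follows. The update rule below follows system 1.1 as it is used in the appendix proofs, x(k+1) = g(x(k) + a(Wx(k) + b)) with g the piecewise-linear saturation to [−1, 1]; the tiny one-pattern Hebbian-type W in the demonstration is a hypothetical stand-in for a synthesized matrix, and the helper names are ours.

```python
import itertools
import numpy as np

def bsb_step(x, W, b, a=1.0):
    # x(k+1) = g(x(k) + a*(W x(k) + b)), with g the saturation to [-1, 1]
    return np.clip(x + a * (W @ x + b), -1.0, 1.0)

def converges_to(x0, target, W, b, a=1.0, max_iter=500):
    # Iterate until a fixed point is reached, then test whether it is `target`.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = bsb_step(x, W, b, a)
        if np.array_equal(x_new, x):
            break
        x = x_new
    return np.array_equal(x, np.asarray(target, dtype=float))

def retrieval_count(pattern, W, b, hd, a=1.0):
    """How many inputs at Hamming distance `hd` from `pattern` converge back to it."""
    n = len(pattern)
    hits = 0
    for flips in itertools.combinations(range(n), hd):
        x0 = np.array(pattern, dtype=float)
        x0[list(flips)] *= -1.0       # flip `hd` components of the stored pattern
        hits += converges_to(x0, pattern, W, b, a)
    return hits

# Hypothetical demonstration: a single stored pattern with a Hebbian-type W.
p = np.array([1.0, -1.0, 1.0, -1.0])
W = np.outer(p, p) / len(p)
np.fill_diagonal(W, 0.0)              # w_ii = 0, as required by the model
b = np.zeros(len(p))
one_bit = retrieval_count(p, W, b, hd=1)
```

Running the same counts over all six stored patterns and all Hamming distances reproduces the structure of Tables 1–3.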
Next, numerous simulations suggest that the maximization of s in equation 3.3 is critical for the performance of the network, which is in line with the theoretical results and arguments presented in the previous

⁸ The convergence rate is the ratio of the number of initial conditions that converge to the desired pattern to the number of all possible initial conditions at a given Hamming distance HD from the desired pattern.
Table 4: Simulation Results for Example 2. Average Percentage of Binary Inputs Converging to the Desired Pattern.

 w̄     s_max/2w̄   HD = 1   HD = 2   HD = 3   Number of spurious states
0.2       0.60      96.83    78.40    44.21       2.35
0.4       0.60      96.83    77.61    43.32       2.45
0.6       0.60      96.83    76.71    42.11       2.50
0.8       0.59      96.83    74.78    39.87       2.40
1.0       0.57      96.42    70.17    37.76       2.40
1.2       0.52      91.50    60.47    33.63       2.80
1.4       0.46      84.58    53.82    30.39       3.40
1.6       0.41      82.25    50.82    28.55       3.60
1.8       0.37      81.00    49.71    28.11       3.90
2.0       0.34      78.83    47.45    27.77       3.90
sections. Furthermore, by considering w̄ (the upper bound on the absolute values of the elements of the connection matrix W) as a design parameter, we observe that low values of w̄, such as the one chosen in this example, improve the retrieval efficiency of the network in the presence of errors (see the following example). This approach also allows us to design a globally stable network by solving a linear programming problem, and thus opens the possibility of designing the neural network (cf. equation 1.1) by another neural network.

4.2 Example. Our objective in this example is to investigate the effect of the bound w̄ on the retrieval performance of the neural network model 1.1 and also to ascertain how typical the results of example 4.1 are. To this end, we used 20 randomly generated sets of binary patterns (with mutual Hamming distances greater than one but otherwise random) to be stored as asymptotically stable equilibrium points of the network. Each set contained m = 6 binary patterns of length n = 10. For each set, we synthesized the network (with n = 10 and a = 1) by solving a semidefinite program based on the formulation given in the preceding section, for several specific values of the bound w̄. Thus, all the designed networks are guaranteed to be globally stable.⁹ Table 4 shows the results together with the obtained optimal value of s (s_max). The first column specifies the values chosen for the bound w̄; columns 2–6 summarize the average results of the simulation tests.

The simulation results show that the number of binary inputs (applied as initial conditions for the state of the network) converging to the correct stored pattern decreases as the value of w̄ increases. This was particularly pronounced for inputs at Hamming distance HD = 1, 2, 3 from a stored pattern.
The number of binary inputs attracted from larger Hamming distances,

⁹ Note that the maximization of s in equation 3.3 subject to large values of the bound w̄ may result in connection matrices that have a large spectral radius. In such a case, without the matrix inequality constraint in equation 3.5, one cannot guarantee the global stability of the synthesized network.
HD ≥ 4, was in general unaffected (as expected) by the increase of w̄. As shown in Table 4, for w̄ less than one, the average convergence rate for inputs at HD = 2, 3 decreases steadily even though s_max/2w̄ stays essentially unchanged. The convergence rate drops faster for w̄ ≥ 1, which may be attributed to the combined effect of the increase of w̄ and the decrease of s_max/2w̄,¹⁰ as predicted by our theoretical results. In experiments with other values of n and m (not shown here), we consistently found the same trend in retrieval performance as the bound w̄ was increased. This supports our conjecture that a small dispersion of the w_ij values contributes to better error-correction capability of the network in the presence of random binary errors. In particular, we observe that the results of example 4.1 are quite typical.

4.3 Example. The purpose of this example is to examine how the retrieval performance of neural network model 1.1 is affected by the number of stored patterns. For this, 150 sets of binary patterns (n = 13) were chosen randomly, where each set consisted of k desired memory patterns, with k varying between 1 and 15 (for each value of k, there were 10 random sets). For each set we synthesized the network by solving an LP problem with specifications a = 1 and w̄ = 0.5. All the designed networks satisfied the global stability constraint 3.5. Then, for each designed network, we examined the percentage of input patterns (given as initial states of the network) at Hamming distance HD = 1, ..., 5 and HD ≥ 6 from a stored pattern that converge to that pattern. Figure 1 summarizes the average results of the simulation tests. The shape of Figure 1 is similar to the waterfall graphs in coding theory and signal processing that are used to display the degradation of system performance as the input noise increases.
From this point of view, Figure 1 shows the capability of network 1.1 to handle small signal-to-noise ratios as a function of the number of stored patterns. Compared with other existing results, Figure 1 shows that the storage capacity and retrieval efficiency of the network are quite satisfactory.

5 Conclusions
We have considered the realization of associative memories by means of a class of synchronous discrete-time neural networks that can be viewed as a generalization of Anderson's BSB neural model. We first presented an analysis of the location and qualitative properties of the equilibria of the considered network. In particular, we provided results concerning the extent of the domain of attraction of asymptotically stable equilibria, which, under

¹⁰ The decrease of s_max/2w̄ is caused by the fact that the value of s_max is bounded by the global stability constraint 3.5.
Figure 1: Average percentage of all inputs that converge from a given Hamming distance HD to a stored pattern.
appropriate assumptions, were shown to exist only at the vertices of the hypercube. Using the analysis results as constraints, the synthesis problem was formulated as a constrained optimization problem that was shown to be very flexible, offering numerous degrees of freedom. Taking advantage of this formulation, the proposed approach allows a design in which the interconnection structure is predetermined, with or without symmetry constraints, including a class of discrete cellular neural networks as a special case. In the case of a symmetric interconnection structure, it was shown that, by properly choosing an upper bound on the absolute values of the elements of the connection matrix, one can guarantee the global stability of the network while using LP techniques to solve the synthesis problem, which also highlights the possibility of carrying out the synthesis by another neural network.

In the experiments, computer simulations verified that all the desired patterns were stored as asymptotically stable equilibrium points of the designed network. Furthermore, trends observed in numerous simulations showed that the number of spurious asymptotically stable equilibrium points of the considered network is in general small. They also confirmed the existence of a substantial domain of attraction around the stored patterns, thus validating our theoretical results as well as the heuristic arguments provided in the previous sections. Compared with others reported in the literature, our results show that the retrieval efficiency of the designed network is quite satisfactory.
Appendix

Proof of Proposition 1. Let J ⊂ {1, ..., n} denote a nonempty subset of integers and let

F = {x ∈ R^n : |x_i| < 1 if i ∈ J, x_i = ±1 if i ∉ J},

with dim(F) = |J|, the cardinality of J. Let x̃ ∈ R^|J|, x̃_i = x_i, i ∈ J. Then x̃ evolves according to the linear system

x̃_i(k+1) = x̃_i(k) + a Σ_{j∈J} w_ij x̃_j(k) + a (Σ_{j∉J} w_ij u_j + b_i)   for all i ∈ J,   (A.1)
where u_j, j ∉ J, is fixed at ±1. From equation A.1, any equilibrium point of system 1.1, say x*, must satisfy the condition

W̃ x̃* = −b̃,   (A.2)

where x̃* ∈ R^|J| is the vector whose components are the unsaturated components of x*, W̃ ∈ R^{|J|×|J|} is the submatrix of W obtained by selecting the rows and columns indicated by the indices in the set J, and b̃_i = Σ_{j∉J} w_ij u_j + b_i for all i ∈ J.

Let us first consider the case |J| = 1; then F is an edge of H^n (a face of dimension one). We have the following result.

Lemma 2. Let w_ii = 0 for all i. Then for almost all b ∈ R^n (in the sense of Lebesgue measure), there is no equilibrium point of system 1.1 on an edge of the hypercube H^n.

Proof. Since w_ii = 0 for all i, by equation A.1 the condition for the existence of an equilibrium point on an edge of the hypercube reduces to Σ_{j∉J} w_ij u_j + b_i = 0. For fixed W, this equality can be satisfied only if b_i = −Σ_{j∉J} w_ij u_j. Then, since {b_i}, i = 1, ..., n, has one-dimensional measure zero, for all b ∈ R^n except those belonging to a set of measure zero, there is no equilibrium point on an edge of H^n.
Now consider 1 < |J| ≤ n. From the hyperbolicity assumption (assumption A1), any submatrix W̃ ∈ R^{|J|×|J|} is of full rank. Hence, any equation of the form A.2 has exactly one solution. Now, since the trace of W is zero (w_ii = 0 for all i), each principal submatrix of W, including W itself, has at least one eigenvalue with positive real part (and the others with negative real part). Let (I_|J| + aW̃), W̃ ∈ R^{|J|×|J|}, denote the state matrix of equation A.1. By assumption, (I_|J| + aW̃) has at least one eigenvalue with modulus greater than one. Hence any such equilibrium x̃* is unstable. Consequently, the only possible stable equilibria are located at the vertices of the hypercube.

Proof of Lemma 1. Using elementwise notation, if x_i(k+1) ≠ x_i(k), c_i(k) can be obtained as
c_i(k) = [g(x(k) + aL(x(k)))_i − x_i(k)] / [a(L(x(k)))_i],

where, for convenience, we use the notation L(x) = Wx + b. Now consider the following cases:

Case 1. |(x(k) + aL(x(k)))_i| ≤ 1. Then

c_i(k) = [(x(k) + aL(x(k)))_i − x_i(k)] / [a(L(x(k)))_i] = 1.

Case 2. (x(k) + aL(x(k)))_i > 1. Note that since |x_i(k)| ≤ 1, (L(x(k)))_i > 0. Then

c_i(k) < [(x(k) + aL(x(k)))_i − x_i(k)] / [a(L(x(k)))_i] = 1,

and also

c_i(k) = [1 − x_i(k)] / [a(L(x(k)))_i] ≥ 0.
The case when (x(k) + aL(x(k)))_i < −1 is similar.

Proof of Theorem 1. Under the given hypotheses, that is, if the connection matrix W is symmetric and either positive definite or a < 2/|λ_min|, it follows
from the energy minimization theorem (see Golden, 1986, 1993; Hui et al., 1993) that the energy function E: H^n → R, defined by

E(x) = −(1/2) x^T W x − x^T b,   (A.3)

evaluated along the trajectories of equation 1.1, is monotonically decreasing unless it encounters an equilibrium point; that is, denoting ΔE(x(k)) = E(x(k+1)) − E(x(k)), we have ΔE(x(k)) < 0 if x(k+1) ≠ x(k), and ΔE(x(k)) = 0 if and only if x(k+1) = x(k). Based on these considerations, and by making use of assumption A2, we now show that every trajectory of system 1.1 converges to some equilibrium point as k → ∞. Given a trajectory {x(k)} starting from x_0 ∈ H^n, k = 0, 1, 2, ..., let V(x) denote the positive limit set, defined as V(x) = {y : there is a sequence k_m → +∞ such that x(k_m) → y}. By definition, all the trajectories of system 1.1 are bounded. Hence, by the Bolzano-Weierstrass theorem, V(x) ≠ ∅. Let S = {x ∈ H^n : ΔE(x) = 0} and let M denote the largest invariant subset of S. By definition (cf. equation A.3), E is C^1 and bounded on H^n. Since ΔE(x(k)) ≤ 0 for all x(k) ∈ H^n, by the invariant set theorem (see, e.g., Luenberger, 1979), every trajectory of system 1.1 approaches M as k → ∞; that is, V(x) ⊂ M. Now, by assumption, the set M is the entire set of equilibrium points of 1.1, because ΔE(x(k)) = 0 if and only if x(k+1) = x(k). Furthermore, by assumption, the set of system equilibrium points is finite; that is, each equilibrium point is isolated. Since M is finite and V(x) ⊂ M, the set V(x) is finite. Thus, to prove that every trajectory of the system converges to an equilibrium point, it remains to show that V(x) contains only a single point. By contradiction, suppose that x*, y* ∈ V(x). Then, given ε > 0, there is a k_1 such that |x(k) − x*| < ε/2 for k > k_1, and a k_2 such that |x(k) − y*| < ε/2 for k > k_2. Choosing k̃ = max(k_1, k_2), we have |x* − y*| ≤ |x(k) − x*| + |x(k) − y*| < ε for all k > k̃, which contradicts the fact that the equilibrium points are isolated. This contradiction shows that every trajectory of system 1.1 converges to a limit set containing a single equilibrium point of the equation.

References

Anderson, J. (1993). The BSB model: A simple nonlinear autoassociative neural network. In M. H. Hassoun (Ed.), Associative neural memories: Theory and implementation (pp. 77–103). Oxford: Oxford University Press.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psych. Rev., 84, 413–451.
Golden, R. (1986). The brain-state-in-a-box neural model is a gradient descent algorithm. Journal Math. Psychol., 30(1), 73–80.
Golden, R. (1993). Stability and optimization analysis of the generalized brain-state-in-a-box neural network model. J. Math. Psychol., 37, 282–298.
Guez, A., Protopopescu, V., & Barhen, J. (1988). On the stability, storage capacity, and design of nonlinear continuous neural networks. IEEE Trans. Systems, Man, and Cybernetics, 18, 80–87.
Hui, S., & Żak, S. (1992). Dynamical analysis of the brain-state-in-a-box (BSB) neural models. IEEE Trans. Neural Networks, 3, 86–94.
Hui, S., Lillo, W. E., & Żak, S. H. (1993). Dynamics and stability of the brain-state-in-a-box (BSB) neural models. In M. H. Hassoun (Ed.), Associative neural memories: Theory and implementation (pp. 212–224). Oxford: Oxford University Press.
Kohonen, T. (1987). Self-organization and associative memory. Berlin: Springer-Verlag.
Liu, D., & Michel, A. (1994). Dynamical systems with saturation nonlinearities: Analysis and design. New York: Springer-Verlag.
Luenberger, D. (1979). Introduction to dynamic systems: Theory, models, and applications. New York: Wiley.
Lustig, I., Marsten, R., & Shanno, D. (1994). Interior point methods for linear programming: Computational state of the art. ORSA J. Computing, 6, 1–14.
Michel, A., & Farrell, J. (1990). Associative memories via artificial neural networks. IEEE Control Systems Magazine, 10(3), 6–17.
Michel, A., Farrell, J., & Porod, W. (1989). Qualitative analysis of neural networks. IEEE Trans. Circuits and Systems, 36, 229–243.
Michel, A., Si, J., & Yen, G. (1991). Analysis and synthesis of a class of discrete-time neural networks described on hypercubes. IEEE Trans. Neural Networks, 2, 32–46.
Nesterov, Y., & Nemirovsky, A. (1994). Interior-point polynomial methods in convex programming. Philadelphia: SIAM.
Perfetti, R. (1995). A synthesis procedure for brain-state-in-a-box neural networks. IEEE Trans. Neural Networks, 6, 1071–1080.
Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38(1), 49–95.
Xia, Y. (1996). A new neural network for solving linear programming problems and its application. IEEE Trans. Neural Networks, 7(2), 525–529.
Xia, Y., & Wang, J. (1995). Neural network for solving linear programming problems with bounded variables. IEEE Trans. Neural Networks, 6(2), 515–519.

Received June 10, 1998; accepted January 18, 1999.
Neural Computation, Volume 12, Number 3

CONTENTS

Article

Dynamics of Encoding in Neuron Populations: Some General Mathematical Features
Bruce W. Knight    473

Notes

Local and Global Gating of Synaptic Plasticity
Manuel A. Sánchez-Montañés, Paul F. M. J. Verschure, and Peter König    519

Nonlinear Autoassociation Is Not Equivalent to PCA
Nathalie Japkowicz, Stephen José Hanson, and Mark A. Gluck    531

Letters

No Free Lunch for Noise Prediction
Malik Magdon-Ismail    547

Self-Organization of Symmetry Networks: Transformation Invariance from the Spontaneous Symmetry-Breaking Mechanism
Chris J. S. Webber    565

Geometric Analysis of Population Rhythms in Synaptically Coupled Neuronal Networks
J. Rubin and D. Terman    597

An Accurate Measure of the Instantaneous Discharge Probability, with Application to Unitary Joint-Event Analysis
Quentin Pauluis and Stuart N. Baker    647

Impact of Correlated Inputs on the Output of the Integrate-and-Fire Model
Jianfeng Feng and David Brown    671

Weak, Stochastic Temporal Correlation of Large-Scale Synaptic Input Is a Major Determinant of Neuronal Bandwidth
David M. Halliday    693

Second-Order Learning Algorithm with Squared Penalty Term
Kazumi Saito and Ryohei Nakano    709
ARTICLE
Communicated by Carl van Vreeswijk
Dynamics of Encoding in Neuron Populations: Some General Mathematical Features

Bruce W. Knight
Laboratory of Biophysics, Rockefeller University, and Laboratory of Applied Mathematics, Mount Sinai Medical School, New York University, New York, NY 10021, U.S.A.

The use of a population dynamics approach promises efficient simulation of large assemblages of neurons. Depending on the issues addressed and the degree of realism incorporated in the simulated neurons, a wide range of different population dynamics formulations can be appropriate. Here we present a common mathematical structure that these various formulations share and that implies dynamical behaviors that they have in common. This underlying structure serves as a guide toward efficient means of simulation. As an example, we derive the general population firing-rate frequency response and show how it may be used effectively to address a broad range of interacting-population response and stability problems. A few specific cases will be worked out. A summary of this work appears at the end, before the appendix.

1 Population Dynamics: Preliminaries

A small patch of cortex may contain thousands of similar neurons, each communicating with hundreds or thousands of other neurons in that same patch or in other patches. By the very nature of this situation, an attempt to simulate it directly must be massive. A recently advanced and promising alternative is the simulation of the dynamical rules stating how the statistical status of the whole population evolves with time. Three recent expositions of this approach are Knight, Manin, and Sirovich (1996), Nykamp and Tranchina (1999), and Omurtag, Knight, and Sirovich (1999). Precursors with goals of a more analytic nature are Stein (1965), Johannesma (1969), Wilbur and Rinzel (1982), Kuramoto (1991), Abbott and van Vreeswijk (1993), and Gerstner (1995). Here we briefly give one example of the approach and then discuss the generality of its central features.

Models of single-neuron dynamics are typified by the equations of Hodgkin and Huxley (1952). A serviceable caricature of these equations, and also of more elaborate dynamical models (see the appendix), is the generalized integrate-and-fire equation:

dx/dt = f(x, s) + g(x, s),   s = s(t),   x = 1 resets to x = 0.   (1.1)

Neural Computation 12, 473–518 (2000)
© 2000 Massachusetts Institute of Technology
474
Bruce W. Knight
Here x may be regarded as a transmembrane voltage, which depends on the accumulation of a transmembrane current f(x, s) common to all neurons in a population; this current, in turn, depends on the membrane voltage x and also on a synaptic input signal s(t). Each neuron also feels a "noise current" (Tuckwell, 1988) g(x, s), which is statistically uncorrelated with the noise currents felt by other neurons. We may view the resetting of x = 1 to x = 0 as the occurrence of a nerve impulse. We will use equation 1.1 to illustrate particular cases of the general features.

The central structural features of the population approach emerge from the consideration of a single population of noninteracting neurons. The interesting further features that emerge for interacting subpopulations may be addressed with little further overhead in formulation. Our early discussion consequently concerns a population of noninteracting neurons. (Interactions among subpopulations and feedback interaction among the members of one population are discussed in sections 4 and 5.)

The neuron population is distributed over the voltage variable x, between 0 and 1, with a distribution function ρ(x, t), which integrates to unity and whose evolution in time must be determined by equation 1.1. The fraction of neurons whose voltage is more than a given value x changes with time in a way that depends on the probability current,

J(x, t) = f(x, s) ρ(x, t) − D(x, s) ∂ρ(x, t)/∂x
        = ( f(x, s) − D(x, s) ∂/∂x ) ρ(x, t),   (1.2)
which follows from equation 1.1. The rst term—the advection term—states that the probability density is swept along by the common velocity. The second term—the diffusion term—states that neurons are more likely to be shaken (by uctuations) from a voltage region, where more of them reside to an adjacent region, where they are less dense, than their converse likelihood of being shaken up the density gradient. Quantitatively, the diffusion coefcient D( x, s) must be derived from a more detailed specication of the noise g(x, s). Equation 1.2 is taken from Knight et al. (1996), where a brief derivation of D is given. Comparable expressions for current (specialized in different ways) are given by Abbott and van Vreeswijk (1993; equation 9.4), by Nykamp and Tranchina (1999; equations A4 and A16), and by Omurtag et al. (1999; equations 36 and 48). We note in equation 1.2 that the current is obtained from the probability density by a linear transformation on r ( x, t). From the dening property of the current, we easily show that its change with x gives the conservation equation (or continuity equation) for the local rate at which density accumulates: @r / @t D ¡@J / @x.
(1.3)
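Equations 1.2 and 1.3 can be integrated directly on a grid. The following is a minimal finite-difference sketch, not an implementation from the text: all parameter values are illustrative assumptions, the drift takes the leaky integrate-and-fire form f(x, s) = s − γx of equation 2.43 below, and the boundaries are simplified (no diffusive flux across threshold, reflection at x = 0). The probability flux crossing threshold x = 1 is the per-neuron firing rate (equation 1.5) and is reinjected at the reset point x = 0.

```python
import numpy as np

def step(rho, dx, dt, s=1.2, gamma=0.5, D=0.01):
    """One explicit time step of eqs. 1.2-1.3 with reset from x=1 to x=0."""
    m = rho.size
    x_face = np.arange(1, m) * dx                      # interior cell faces
    f_face = s - gamma * x_face                        # common drift velocity
    rho_up = np.where(f_face > 0, rho[:-1], rho[1:])   # upwind density
    # probability current at interior faces (eq. 1.2)
    J = f_face * rho_up - D * (rho[1:] - rho[:-1]) / dx
    f_top = s - gamma * 1.0                            # drift at threshold
    J_thresh = max(f_top * rho[-1], 0.0)               # flux out across x = 1
    J_full = np.concatenate(([0.0], J, [J_thresh]))    # reflecting at x = 0
    rho_new = rho - dt * np.diff(J_full) / dx          # conservation (eq. 1.3)
    rho_new[0] += dt * J_thresh / dx                   # reset: reinject at x = 0
    return rho_new, J_thresh                           # J_thresh is r(t), eq. 1.5

n = 200
dx, dt = 1.0 / n, 1e-4
rho = np.full(n, 1.0)                                  # uniform initial density
for _ in range(20000):                                 # integrate to t = 2
    rho, rate = step(rho, dx, dt)
print("total probability:", rho.sum() * dx)            # conserved at 1
print("per-neuron firing rate:", rate)
```

Because the threshold flux removed by the conservation update is reinjected at the reset point in the same step, total probability is conserved exactly, mirroring the discussion following equation 2.18.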
Dynamics of Encoding
475
(In the present context, equation 1.3 was used by Knight et al., 1996, equation 4; Abbott & van Vreeswijk, 1993, equation 3.3; Nykamp & Tranchina, 1999, equation 10.) The linear transformation from ρ to J, equation 1.2, may be substituted into the conservation equation, 1.3, to obtain the dynamical equation,

∂ρ/∂t = (∂/∂x) ( −f(x, s) + D(x, s) ∂/∂x ) ρ

(1.4)

(Knight et al., 1996, equation 8; Abbott & van Vreeswijk, 1993, equation 9.4; Omurtag et al., 1999, equation 49), which describes how ρ(x, t) evolves with the passage of time. (Such a partial differential equation for probability density is technically a Fokker-Planck equation; an extensive reference is Risken, 1996.) Equation 1.4 may be integrated over time to obtain the statistical description of the population's evolution. This also yields J(x, t) by equation 1.2, and the current across threshold yields the per-neuron firing rate:

r(t) = J(1, t)

(1.5)
(Abbott & van Vreeswijk, 1993, equation 9.9; Nykamp & Tranchina, 1999, equation 11; Omurtag et al., 1999, equation 39). We note from equation 1.5 that r(t) is linearly related to J(x, t), and so is also related linearly to ρ(x, t).
Particular neurons, or particular questions concerning their dynamical behavior, or other special circumstances may require a model more detailed, or differently expressed, than what followed from equation 1.1. For example, the Hodgkin-Huxley equations describe a neuron's momentary state in terms of four variables (v, m, h, n), including two (m, h) that determine the conductance of sodium channels and one (n) that governs potassium conductance. The population density in this case lives in a four-dimensional state-space, and the corresponding population current has four components that describe the flow of population density in the several coordinate directions in that space. Relay cells in the lateral geniculate nucleus (McCormick & Huguenard, 1992) reveal the presence of 10 conductance channel types, which specify a state-space with 18 dimensions. Some neurons have cell bodies that are not tightly coupled electrically with their extended dendrites, and such cells must be described by a state-space that has separate sets of dimensions for each of their multiple electrical compartments. Synapses that deliver finite increments of synaptic electric charge require a more detailed formulation (Nykamp & Tranchina, 1999; Omurtag et al., 1999). Synapses with slow response dynamics add synapse-dynamics dimensions to the state-space. On the other hand (as briefly discussed in section 6.1), within a state-space the neuronal dynamics places a huge constraint on which probability-density configurations can actually be reached; in a specific case (Sirovich, Knight, & Omurtag, 1999c), this allows one to replace the population density of equation 1.4 (where the density is a member of the infinite-dimensional space of functions that depend on x between 0 and 1) by a 17-entry vector, and allows the reduction of the differential operator of equation 1.4 to a 17 × 17 (s-dependent) matrix. The general structural features reviewed below carry over directly to such an efficient matrix representation of the population dynamics.
476
Bruce W. Knight

2 Population Dynamics: Common Structure
In all of the examples cited, we find the following common elements of structure:

s(t): an input signal. (2.1)

ρ: a probability density. (2.2)

In many of the above cases, ρ is a function defined over a state-space; the collection of such functions has the structure of a linear vector space (a weighted sum of two members yields another member). In this sense, in all of the above cases ρ may be regarded as a member of a linear vector space.

J: a probability current. (2.3)

Since in many cases, such as the Hodgkin-Huxley case above, J has several components, convenience dictates that we regard J as lying in a space different from that which contains ρ. However, there are natural connections:

C(s): a current linear operator (2.4)

such that

Cρ = J. (2.5)

In the case discussed above, equation 1.2 is a particular example of equation 2.5, and the linear operator that appears in equation 1.2 is the corresponding particular example of C(s).

K: a conservation operator (2.6)

such that

∂ρ/∂t = −KJ. (2.7)

In the examples above where ρ is represented in a state-space, K is the familiar divergence operator of ordinary vector analysis, which in our one-dimensional example reduces equation 2.7 to equation 1.3.

Q(s) = −KC(s): a dynamical operator. (2.8)
Substitution of equation 2.5 into 2.7 gives

∂ρ/∂t = Q(s(t))ρ: a dynamical equation for ρ. (2.9)

Integration of the dynamical equation yields ρ at all times, and thus gives a statistical description of the evolving population. In our first example, equation 2.9 becomes 1.4.

r(t): the firing rate (per neuron) of the population. (2.10)

R[J, s]: a firing-rate linear functional of J, (2.11)

a linear transformation from J to the scalar firing rate r(t) such that

r = R[J], (2.12)

which implies the linear relation,

r = R[(Cρ)], (2.13)
that yields the momentary firing rate from the momentary population density. In our illustrative case above, equation 2.12 specializes to 1.5. Other cases can be more elaborate. For example, the firing rate of a population of Hodgkin-Huxley neurons is the rate at which individual members achieve the peak values of their action potentials. At the moment of maximum voltage, the right-hand side of the Hodgkin-Huxley voltage equation equals zero, and that constraint defines a 3-surface in the four-dimensional state-space. Integration of the flux of probability current across that surface yields the (per-neuron) firing rate, and that integral is linear in J, as equation 2.11 states. Because the value of an action potential's voltage maximum is influenced by the input current s, the location of the 3-surface is s-dependent, which leads to the general s-dependence of R[J, s] in equation 2.11.
For an additional view of the mathematical structure just presented, we may imagine ourselves endowed with a computer of mythical capability and ask how we might approach the dynamical equation, 2.9. For examples where ρ is naturally represented as a probability density supported by a state-space, we may subdivide that state-space into hypercubes, each sufficiently small to ensure whatever numerical accuracy we demand. If we do this, all the numbered items above may be cast in terms of finite matrices, and several structural features become more clearly visible. For example, we may observe that the dynamical equation, 2.9, describes what technically is a Markov process (Bailey, 1964), and this ensures the existence of some items we use below.
Our informal discretization is introduced at this point not as an approximation but as an aid to uniform presentation. It leads us, by means of finite linear algebra, to some valuable general structural features that may also be obtained directly for each of the several different cases quoted above, but by means that must be customized anew for each case. In particular we note that for any specified input s, the dynamical operator Q of equation 2.9 has a set of eigenvectors φ_n, which it maps to eigenvalue multiples λ_n of themselves:

Qφ_n = λ_n φ_n. (2.14)

All of these depend on the value of the input s:

Q = Q(s),  φ_n = φ_n(s),  λ_n = λ_n(s). (2.15)
For time-independent s, equation 2.9 has a time-independent steady-state solution ρ_0(s) for which

Qρ_0 = 0; (2.16)

and time independence need not be invoked to observe that this is a distinguished eigenvector of equation 2.14 with eigenvalue λ_0 = 0. This eigenvalue, of course, is independent of s in spite of equation 2.15. However, the zeroth eigenvector ρ_0(s) does depend on the input value s.
The existence of a zero eigenvalue implies that all columns of the matrix Q have a common linear dependence expression satisfied by their elements. To determine this dependence we may exploit the conservation of probability. (Only in equations 2.17, 2.18, and 2.19 will subscripts refer to the compartments of our informal discretization.) Discretized, equation 2.9 is a column of equations, the jth of which reads:

dρ_j/dt = Σ_k Q_jk ρ_k. (2.17)

The ρ_j's are probabilities that sum to unity, and so their sum has no time dependence. Thus, if equation 2.17 is summed on j, the left side is zero:

0 = Σ_k ( Σ_j Q_jk ) ρ_k. (2.18)

Equation 2.18 is satisfied at every moment, including the moment that arbitrary initial data are imposed. Initially, we may choose any one of the
entries ρ_k to be unity, and choose the rest to be zero, so that equation 2.18 implies

Σ_j Q_jk = 0: (2.19)

every column sums to zero.
The discretized case has a natural inner product; if u and v are two column vectors, then

(v, u) (2.20)

is the sum of products of the entries of v and u if those entries are real numbers; if the entries are complex numbers, we instead take the complex conjugates of the v entries before performing the bilinear sum. (This prescription endows each of our particular cases with an inner product, a bilinear integral over state-space, if we take the small hypercubes of our discretizing construction to their infinitesimal limit.) For any Q, the inner product induces

Q*: the adjoint operator to Q, (2.21)

which, for any u, v, satisfies

(Q*v, u) = (v, Qu). (2.22)

The entries of the matrix Q are real numbers, and its matrix transpose Q* satisfies equation 2.22. The determinant of a matrix has the same value as the determinant of its transpose, so the eigenvalues of Q, which satisfy

det(λI − Q) = 0, (2.23)

are the same as those of its adjoint Q*. We may regard equation 2.23 as a polynomial in λ and must anticipate that it can have complex roots. However, λ is the only possibly complex item in equation 2.23, and so complex conjugation of that equation simply replaces λ by λ*. Thus we see λ* is also a root, and complex eigenvalues occur in conjugate pairs. The adjoint operator Q*(s) has eigenvectors φ̂_m(s), which satisfy

Q*φ̂_n = λ*_n φ̂_n. (2.24)
Here (for convenience, which soon will be evident), we have chosen a labeling that associates the label n with the eigenvector whose eigenvalue is the complex conjugate of λ_n. We now observe that

λ_n(φ̂_m, φ_n) = (φ̂_m, λ_n φ_n) = (φ̂_m, Qφ_n) = (Q*φ̂_m, φ_n) = (λ*_m φ̂_m, φ_n) = (φ̂_m, λ_m φ_n) = λ_m(φ̂_m, φ_n). (2.25)
Subtracting the last form from the first gives

(λ_n − λ_m)(φ̂_m, φ_n) = 0.
(2.26)
Since one factor or the other must vanish, we see that sets of eigenvectors, and sets of adjoint eigenvectors, which are associated with distinct eigenvalues, form biorthogonal sets. Linear algebra tells us that generically (that is, except for exceptional cases that are "unlikely" in a strict sense) there are as many distinct eigenvectors (and adjoint eigenvectors) as there are dimensions in the vector space. We may impose a normalization on these sets and get

(φ̂_m, φ_n) = δ_mn,  where δ_mn = 1 if m = n and 0 otherwise.
(2.27)
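The biorthonormalization of equation 2.27 is easy to verify numerically. The sketch below uses a small random generator-like matrix (entries are illustrative, not a neuron model) and builds the adjoint eigenvectors as the rows of the inverse of the right-eigenvector matrix:

```python
import numpy as np

# Numerical sketch of eqs. 2.19 and 2.24-2.27 for a random
# generator-like matrix Q (entries are illustrative, not a neuron model).
rng = np.random.default_rng(0)
n = 6
Q = rng.random((n, n))
Q -= np.diag(Q.sum(axis=0))            # force every column to sum to zero (eq. 2.19)
assert np.allclose(Q.sum(axis=0), 0.0)

lam, W = np.linalg.eig(Q)              # eigenvalues and right eigenvectors w_n
W_inv = np.linalg.inv(W)               # rows are the adjoint eigenvectors,
                                       # scaled so (w_hat_m, w_n) = delta_mn
for m in range(n):                     # each row is indeed an eigenvector of Q^T
    assert np.allclose(Q.T @ W_inv[m], lam[m] * W_inv[m])

# biorthonormalization of eq. 2.27 (here with numpy's bilinear product;
# the text's inner product conjugates the first factor)
assert np.allclose(W_inv @ W, np.eye(n))
print("biorthonormal eigenvector sets confirmed")
```

For a generic matrix the eigenvalues are distinct, so the inverse exists and the rows of `W_inv` supply the complete biorthonormal adjoint set at once.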
Moreover, the dimension count says that the eigenvectors are complete in the sense that any vector v may be represented as a sum of eigenvectors:

v = Σ_n a_n φ_n, (2.28)

where each a_m may be evaluated by observing that

(φ̂_m, v) = (φ̂_m, Σ_n a_n φ_n) = Σ_n a_n (φ̂_m, φ_n) = a_m, (2.29)
where the last step followed from equation 2.27.
An immediate first application is the following: suppose in the dynamical equation 2.9 that s has a fixed value and that for ρ(t) we have initial data ρ(0); then since equation 2.28 can represent any vector, we know we can represent ρ(t) as

ρ(t) = Σ_n a_n(t) φ_n, (2.30)

where

a_n(0) = (φ̂_n, ρ(0)). (2.31)

The full solution to the dynamical equation, 2.9, is given by

a_n(t) = a_n(0) e^{λ_n t}, (2.32)

whence

ρ(t) = Σ_n a_n(0) e^{λ_n t} φ_n, (2.33)
as substitution of equation 2.33 into 2.9 shows: the time differentiation, on the left of equation 2.9, produces on each term the same eigenvalue as does the action of Q on the corresponding eigenvector on the right. In general, if noise entered the formulation of the operator Q in equation 2.9 (and if s is constant), then an initial departure from equilibrium will die out, and, with the exception of λ_0 = 0, all the other eigenvalues will have real parts that are negative. (This is a general property of Markov matrices; Bailey, 1964.) At long times only the n = 0 term in equation 2.33 will survive, giving (with equation 2.31)

ρ(t → ∞) = (φ̂_0, ρ(0)) ρ_0. (2.34)
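The chain from equation 2.14 through 2.34 can be exercised end to end on a small discretized generator. The sketch below uses a drift-plus-diffusion ring of compartments; the ring geometry, compartment count, and parameter values are illustrative assumptions, not taken from the text:

```python
import numpy as np

# Sketch of eqs. 2.14-2.19 and 2.30-2.34 for a discretized drift+diffusion
# generator Q on a ring of n compartments (illustrative parameters).
n = 60
v, D = 1.0, 0.05                      # drift speed and diffusion coefficient
dx = 1.0 / n
Q = np.zeros((n, n))
for k in range(n):
    Q[(k + 1) % n, k] += v / dx + D / dx**2   # flow out of compartment k
    Q[(k - 1) % n, k] += D / dx**2
    Q[k, k] -= v / dx + 2 * D / dx**2

assert np.allclose(Q.sum(axis=0), 0.0)        # eq. 2.19: columns sum to zero
assert np.allclose(np.ones(n) @ Q, 0.0)       # constant vector is rho_hat_0

lam, W = np.linalg.eig(Q)
i0 = np.argmin(abs(lam))
assert abs(lam[i0]) < 1e-8                    # distinguished eigenvalue l_0 = 0
assert lam.real.max() < 1e-8                  # all other Re(l_n) are negative

rho0 = W[:, i0].real
rho0 = rho0 / rho0.sum()                      # equilibrium density, eq. 2.16

# eqs. 2.30-2.33: evolve a delta-function initial condition by the
# eigenexpansion, and check the long-time limit of eq. 2.34.
a0 = np.linalg.inv(W) @ np.eye(n)[:, 0]       # coefficients a_n(0), eq. 2.31
rho_t = (W @ (a0 * np.exp(lam * 10.0))).real  # eq. 2.33 at t = 10
assert np.allclose(rho_t, rho0, atol=1e-8)
print("long-time density matches the equilibrium rho_0")
```

Because all nonzero eigenvalues of this generator have negative real parts, every mode but n = 0 in equation 2.33 has decayed by t = 10, leaving the equilibrium density of equation 2.34.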
Clearly the zeroth adjoint eigenfunction is a valuable item to understand. We have seen in the discretized format that each matrix column of Q(s) must sum to zero. As Q*(s) is its transpose, every row of Q*(s) will sum to zero. By the rules of matrix action on a column vector, for a vector whose entries are all the same, the action of Q* will yield the zero vector, because every row-sum vanishes. Thus, the column vector whose entries are all the same constant is the eigenvector φ̂_0(s) of Q*(s), which has eigenvalue zero. In fact it is not a function of s, but a vector ρ̂_0 that is universal for all s (just as is its eigenvalue zero). We have asserted both that the equilibrium probability function ρ_0(s) accumulates to a total probability of unity and that it is the eigenfunction of Q(s) with eigenvalue zero. We have also chosen the biorthonormal normalization, equation 2.27. In the particular case of m = n = 0, equation 2.27 becomes

(ρ̂_0, ρ_0(s)) = 1,
(2.35)
which implies that we choose the value 1 for the constant entry of ρ̂_0. The constant-entry property of ρ̂_0 reveals a special feature of the eigenfunctions of Q(s). If, in the biorthogonality equation, 2.27, we choose m = 0 and n ≠ 0, we obtain

(ρ̂_0, φ_n(s)) = 0.
(2.36)
Because the entries of ρ̂_0 are constant, this says that the entries of φ_n sum to zero. Thus, in the eigenfunction expansion, equation 2.30, the vectors φ_n for n ≠ 0 do not contribute to the probability inventory, and we verify that the decay of those modes in equation 2.33 does not make probability vanish from the system. If the input s is constant, by equation 2.13 the steady-state firing rate is given by

r(s) = R[C(s)ρ_0(s), s],
(2.37)
which specifies the steady-state input-output relation for the neuron population. We note from equation 2.37 that already in the steady state, r responds to s in a manner that in general is nonlinear. Evidently equation 2.37 also predicts the population response to a time-dependent s(t) that changes slowly enough. Although the estimate is informal, it is fairly clear that "slowly enough" means that the characteristic change rate of s(t) should be small compared to the slowest decay rate determined by the eigenvalues in the exponents of equation 2.33; in mathematical terms,

(1/s(t)) ds(t)/dt ≪ |Re(λ_1(s(t)))|,
(2.38)
where the index 1 has been assigned to the nonzero eigenvalue whose absolute real part is smallest.
The response to a step input also may be stated in exact closed form, because the population response is determined by the linear functional R of equation 2.13. Suppose that the population density has achieved an equilibrium distribution ρ_0(s_A) at input s_A and that the input signal is jumped to a new value, s_B. We set our time origin at the time of the jump. Equations 2.13, 2.31, and 2.33 give the time course of the response:

r(t) = Σ_n e^{λ_n(s_B) t} (φ̂_n(s_B), ρ_0(s_A)) R[C(s_B)φ_n(s_B), s_B]
     = R[C(s_B)ρ_0(s_B), s_B] + Σ_{n≠0} e^{λ_n(s_B) t} (φ̂_n(s_B), ρ_0(s_A)) R[C(s_B)φ_n(s_B), s_B], (2.39)
where our special information on the zeroth eigenfunctions (equation 2.35) has been used in the second line to isolate the new steady state (compare the first term with equation 2.37). The remaining terms contribute a smooth transient to what would have been an abrupt jump if equation 2.37 had been recklessly used and 2.38 ignored. The jump response (equation 2.39) has been confirmed in the case of equation 1.4 by comparison to both a direct integration of that equation and a direct simulation involving 90,000 integrate-and-fire neurons (Knight, Omurtag, & Sirovich, 1999). The effect of our informal discretization above is to make the sum in equation 2.39 large and finite rather than infinite. This distinction is not of practical importance; Knight et al. (1999) show that the sum converges with a few terms.
An understanding of the eigenvalues clearly is valuable in the interpretation of results such as equation 2.33. While standard algorithms are conveniently available that essentially solve equation 2.23, there are several approximate procedures, and some informal arguments as well, that give valuable systematic information concerning the eigenvalues. A case that we may expect to occur frequently is that of a neuron model that incorporates
stochastic effects but quantitatively resembles a counterpart nonstochastic model, with the further feature that the steady input s_0 is fixed at a value where the nonstochastic counterpart fires repetitively on a stable limit cycle. A simple example is equation 1.1, with s_0 so chosen that f(x, s_0) is always positive and does not go to zero for any choice of x between 0 and 1; in the nonstochastic counterpart, the noise g is set to zero. In the more detailed example of the Hodgkin-Huxley equations, a steady input current in the most interesting range gives a limit cycle that is a one-dimensional closed curve in 4-space, around which a system point circulates at fixed frequency. Even in most neuron models with many electrical compartments, and hence a state-space with many dimensions, periodic firing under steady input implies a system point that pursues a closed one-dimensional limit cycle curve embedded in that many-dimensional space. For the nonstochastic counterpart, we may construct a time-dependent probability density ρ(x, t) that is degenerate in the sense that it is nonzero only on the limit cycle curve. But along that curve, at some initial moment, a variable density of individual systems may be freely chosen. After one period, any cluster of neurons in the population will return to the same point from which they started, so that the probability density ρ(x, t) is periodic in time and may be represented as a Fourier series in time:

ρ(x, t) = Σ_n e^{inω_0 t} ρ_n(x), (2.40)
where ω_0(s_0) is 2π times the fundamental firing frequency. We observe that equation 2.40 is entirely consistent with equation 2.33, with a set of eigenvalues,

λ_n(s) = inω_0(s),
(2.41)
that are pure imaginary and come in complex-conjugate pairs except for the value zero. Here the integer index ranges over −∞ < n < ∞. In equation 2.40, the ρ_n(x) may be regarded (aside from normalization) as the corresponding eigenfunctions.
The stable limit cycle is an "attractor" in the sense that a system point near it is drawn to the limit cycle with the passage of time. The introduction of stochastic effects will superimpose a random walk on the trajectory of a system point, but the excursions of that random walk will be limited by the greater attraction exerted on trajectories farther from the limit cycle. In consequence, stochastic effects will produce an equilibrium probability density that will have appreciable size in a region of state-space that we might describe as a "fattened limit cycle." We may expect that other eigenfunctions whose eigenvalues lie nearest zero (counterparts of the ρ_n(x) in 2.40, with integer n near zero) will be confined to this region as well. A probability density on that fattened limit cycle feels both advection and diffusion around the cycle. More detailed exploration suggests that, with some generality, eigenvalues whose indices lie near zero approximate the form

λ_n(s) = −n² C(s) + inω_0(s),
(2.42)
where C(s) is positive. This form was found by Abbott and van Vreeswijk (1993, equation 9.15) for the case of equation 1.4 with f and D constant; by Knight et al. (1996, equation 36) for D small and constant and with

f(x, s) = s − γx
(2.43)
(the case of the "forgetful" or "leaky" integrate-and-fire encoder); and by Sirovich, Knight, and Omurtag (1999a) for a case with finite synaptic jumps (both analytic and numerical confirmation). Further analytic evidence concerning the generality of this approximate eigenvalue rule, equation 2.42, is given in the appendix. (See also equation 3.24 and the comments following it.) The presence of diffusion tends to make the fundamental frequency ω_0 more rapid than it is in the nonstochastic counterpart case (Knight et al., 1996, equation 36).
Finally, there are neurons that, under steady input, cyclically fire bursts of nerve impulses. For such a neuron, the stable attractor is a two-torus embedded in the higher-dimensional state-space. The system motion exhibits two fundamental frequencies ("around and along the doughnut," which, respectively, characterize the mean impulse rate and the burst rate), and instead of equation 2.40, we will have a sum that features the various combination frequencies of the two fundamentals. A similar but more elaborate eigenvalue theory will follow from these considerations.

3 The Frequency Response
Much valuable structural information may be learned by studying how the neuron population responds to an input that consists of a large, steady part plus a small, superimposed part that is sinusoidal in time. A valuable slight generalization is a superimposed part that was small throughout its past history and consists of a sinusoid multiplied by an exponentially growing envelope function.
The general strategy comes in two steps. The first step is to observe that a small time-variable addition to steady input will cause a small time-dependent addition to the steady value of any system variable that ultimately depends on the input. We will regard the relatively easy (though nonlinear) problem of determining the steady values of the system variables as already solved. Large, steady values we will give the index zero; small, time-varying ones will be given the index unity. In algebraic manipulations any product of terms with index unity will be dropped as small to second order. Following our numbered expressions of section 2, the input signal,

s = s_0 + s_1(t),
(3.1)
gives rise to the probability density,

ρ = ρ_0 + ρ_1(t),
(3.2)
and the probability current,

J = J_0 + J_1(t).
(3.3)
The current linear operator, through first order, is

C_0 + C_1 = C(s_0 + s_1(t)) = C(s_0) + s_1(t) C_s,
(3.4)
where the subscript s implies differentiation with respect to s of the s-dependent linear operator C. In the illustrative case of the operator C in equation 1.2 (also see the particular example, equation 2.43), both f and D are to be differentiated with respect to s. We note in equation 3.4 that the zeroth-order terms on the left and right sides are equal and may be subtracted, leaving only the evaluation of the first-order term (C_1 = s_1 C_s). Below, the same sort of steps will be taken repeatedly. The mapping from density to current, equation 2.5, now gives a first-order part,

C_0 ρ_1 + s_1 C_s ρ_0 = J_1,
(3.5)
so that the perturbed time-dependent current arises from two pieces, the second and less obvious of which depends directly on small changes in the input. The conservation operator K is not s-dependent; as it is linear, we have

K(J_0 + J_1) = KJ_0 + KJ_1,
(3.6)
and by equation 2.7, probability conservation at first order is

∂ρ_1/∂t = −KJ_1.
(3.7)
By equation 2.8, the first-order departure (from equilibrium) of the dynamical operator Q is

Q_1 = −KC_1 = −s_1 K C_s = s_1 Q_s,
(3.8)
where the last form establishes a more convenient notation. From equation 2.9 the first-order dynamical equation is

∂ρ_1/∂t = Q_0 ρ_1 + s_1 Q_s ρ_0.
(3.9)
The final term, which depends directly on the input, involves only known quantities and makes the equation linear inhomogeneous in the first-order probability density and also linear in the first-order input. In principle these observations provide the means to solve equation 3.9 analytically for ρ_1 (for example, by the construction of a Green's function solution). By equation 2.11 the firing rate satisfies

r_0 + r_1(t) = R[J_0, s_0] + R[J_1, s_0] + s_1 R_s[J_0, s_0].
(3.10)
With the two terms from the J_1 equation, 3.5, this gives the first-order response,

r_1 = R[C_0 ρ_1, s_0] + s_1 (R[C_s ρ_0, s_0] + R_s[C_0 ρ_0, s_0])
    = R[C_0 ρ_1] + s_1 R̂[ρ_0],
(3.11)
where the second line simplifies notation a bit and reminds us that the departure in firing rate can show a direct dependence on input, as well as a response through the departure in population density ρ_1.
The second step in constructing a frequency response is to assume that the small time-dependent input is sinusoidal:

s_1(t) = s_1(0) e^{iωt}.
(3.12)
Instead we at once make the slightly more general assumption that s_1 is a sinusoid multiplied by an exponentially growing envelope:

s_1(t) = s_1(0) e^{st},  where s = α + iω;
(3.13)
here α determines the growth rate of the growing envelope exp(αt). Because all of the first-order relations above are both linear and time invariant, the other first-order dynamical quantities generically will respond sympathetically with a similar time dependence:

J_1(t) = J_1(0) e^{st},  ρ_1(t) = ρ_1(0) e^{st},  r_1(t) = r_1(0) e^{st}.
(3.14)
Here the coefficients at time zero will depend on s and are to be determined by the first-order equations. Since s is generally complex, they will be complex (and, in fact, analytic functions of s). Their complex nature will indicate how their phases in the sinusoidal cycle differ from the phase of the input. Since equation 3.13 is a specialization of the general time-dependent input perturbation above, all the linear first-order equations remain intact. However, the dynamical equation, 3.9, reduces to a simpler form:

(sI − Q_0) ρ_1 = s_1 Q_s ρ_0,
(3.15)
and this equation has the formal solution

ρ_1 = s_1 (sI − Q_0)^{−1} Q_s ρ_0.
(3.16)
Now the right-hand expression Q_s ρ_0 is to be regarded as a vector, which can be represented as a sum of eigenvectors of the operator Q_0, as in equation 2.28. We further observe that

(sI − Q_0)^{−1} φ_n = φ_n / (s − λ_n).
(3.17)
(Multiply both sides by sI − Q_0 and use the eigenvector equation 2.14; equation 3.17 becomes an identity.) This enables us explicitly to solve equation 3.16. With equations 2.28 and 2.29, we have

Q_s ρ_0 = Σ_n (φ̂_n, Q_s ρ_0) φ_n, (3.18)
which brings equation 3.16 to the form

ρ_1 = s_1 Σ_n (φ̂_n, Q_s ρ_0) φ_n / (s − λ_n).
(3.19)
If we let s = iω here, this expression tells us the frequency response of the population density to a small sinusoidal oscillation in input. If our eigenvalues are of the form expected commonly (see equation 2.42), we see that as we move the driving frequency ω, every time it passes a value nω_0 there will be a resonant response, and the eigenfunction φ_n will become prominent in the population density function. In equation 3.19 there is no resonance for s = 0 even though λ_0 = 0: the n = 0 term in the numerator vanishes, as we now show. By equation 2.16, the equilibrium distribution satisfies

0 = ∂(Qρ_0)/∂s = Q_s ρ_0 + Q ∂ρ_0/∂s.
(3.20)
Thus the n = 0 numerator in equation 3.19 is

(ρ̂_0, Q_s ρ_0) = (ρ̂_0, −Q ∂ρ_0/∂s) = (Q*ρ̂_0, −∂ρ_0/∂s) = 0,
(3.21)
the last because ρ̂_0 is the zeroth eigenfunction of Q*. Consequently we can reduce the sum in equation 3.19 to one over n ≠ 0.
The population firing-rate frequency response, or transfer function, is the (typically complex) ratio of output departure to input departure:

T_0(s) = r_1 / s_1.
(3.22)
Because r_1 and s_1 are proportional to the same time course, equations 3.13 and 3.14, their ratio is time independent; because they are linearly related, their ratio is amplitude independent: the transfer function depends on input only through the value of s. If we substitute equation 3.19 for ρ_1 into 3.11 for r_1, we see that r_1 is indeed proportional to s_1, and the transfer function, equation 3.22, becomes

T_0(s) = R̂[ρ_0] + Σ_{n≠0} [ (φ̂_n, Q_s ρ_0) / (s − λ_n) ] R[C_0 φ_n].
(3.23)
We will loosely call this the population rate frequency response, which more properly is T_0(iω), for the more special case of purely oscillatory input. We see again, in the output response, equation 3.23, the appearance of resonance when imaginary s is brought close to a complex eigenvalue λ_n. The leading term R̂ of equation 3.23 is independent of frequency and represents accurate tracking even at high frequencies. (However, in a detailed model whose state-space includes synaptic dynamics, the R̂ term goes to zero, as explicated by Gerstner, 1999, and its effect is taken over by large-eigenvalue terms in the sum.)
An example of the frequency response, equation 3.23, was evaluated by a quite different route some years ago (Knight, 1972a). In the case of equation 2.43 (the "forgetful integrate-and-fire" model), let the "capacitor discharge rate" γ be small, and also let stochastic effects be small. Then encoders that have simultaneously reset to zero will disperse only slightly before reaching threshold, and stochastic input current may be replaced by an effective stochastic threshold, which slightly disperses firing times with a small standard deviation τ. The frequency response in this model (Knight, 1972a, equation 8.22, adapted to present notation) is
¡
v0 2p s0
¢
C
X n6 D0
c (v0 / 2p s0 ) » ¼ v03 t 2 2 s ¡ ¡n C inv0 4p
¡ ¢
.
(3.24)
Comparison with equation 3.23 identifies the various terms and relates those of equation 3.23 to "back of the envelope" model parameters. Amplitude and phase dependence on s = iω were in agreement with a population frequency response measured (with ω_0 and τ also measured) in an experiment (Knight, 1972b). The form of the bracketed expression in the denominator of equation 3.24 is in agreement with the expected form of the eigenvalues in equation 2.42. (For this model, as γ and τ go to zero, the radian frequency is ω_0 = 2π s_0.) For subsequent convenience, we notationally simplify equation 3.23 to

T_0(s) = R̂ + Σ_{n≠0} b_n / (s − λ_n).
(3.25)
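The resolvent form of the perturbed density, equation 3.16, and its eigenexpansion, equation 3.19, can be cross-checked numerically on a one-dimensional model. The sketch below discretizes a leaky integrate-and-fire generator with reset (the drift f = s − γx of equation 2.43); the compartment count and all parameter values are illustrative assumptions, and s_1 is taken as unity:

```python
import numpy as np

# Sketch comparing eq. 3.16 with eq. 3.19 for a discretized leaky
# integrate-and-fire generator with reset (illustrative parameters).
def generator(s, gamma, D, n):
    dx = 1.0 / n
    x = (np.arange(n) + 0.5) * dx
    f = s - gamma * x                       # drift rates (kept positive here)
    Q = np.zeros((n, n))
    for k in range(n):
        Q[(k + 1) % n, k] += f[k] / dx      # top compartment resets to x = 0
        Q[k, k] -= f[k] / dx
    for k in range(n - 1):                  # diffusion between neighbors
        Q[k + 1, k] += D / dx**2; Q[k, k] -= D / dx**2
        Q[k, k + 1] += D / dx**2; Q[k + 1, k + 1] -= D / dx**2
    return Q

n, s0, gamma, D = 80, 1.2, 0.5, 0.01
dx = 1.0 / n
Q0 = generator(s0, gamma, D, n)
Qs = np.zeros((n, n))                       # Q_s = dQ/ds: since df/ds = 1,
for k in range(n):                          # only the advection part survives
    Qs[(k + 1) % n, k] += 1.0 / dx
    Qs[k, k] -= 1.0 / dx

lam, W = np.linalg.eig(Q0)
i0 = np.argmin(abs(lam))
rho0 = W[:, i0].real
rho0 = rho0 / rho0.sum()                    # equilibrium density, eq. 2.16

omega = 2 * np.pi * 0.5                     # probe frequency, s = i*omega
rhs = Qs @ rho0
rho1_direct = np.linalg.solve(1j * omega * np.eye(n) - Q0, rhs)   # eq. 3.16

coeff = np.linalg.inv(W) @ rhs              # (w_hat_n, Q_s rho_0), eq. 3.18
assert abs(coeff[i0]) < 1e-6                # the n = 0 term vanishes, eq. 3.21
rho1_modes = W @ (coeff / (1j * omega - lam))                     # eq. 3.19
assert np.allclose(rho1_direct, rho1_modes, atol=1e-7)
print("resolvent and eigenexpansion forms of rho_1 agree")
```

Sweeping `omega` across the imaginary parts of the slow eigenvalues reproduces the resonances discussed after equation 3.19, and the vanishing of `coeff[i0]` confirms numerically why the n = 0 term drops from equations 3.19 and 3.23.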
For a multidimensional neuron model, the expedient evaluation of the various constants in equation 3.25 (defined in more detail in equation 3.23) remains a challenge for the future. However, experience to date (Omurtag et al., 1999; Knight et al., 1999; Sirovich et al., 1999) indicates that for one-dimensional neuron models, such evaluations are quite straightforward. In the first section of the appendix, a procedure is outlined to reduce multidimensional neuron models approximately to fairly realistic one-dimensional generalized integrate-and-fire counterparts. Also, in section 6.2 a method (so far not implemented) is outlined for the efficient direct numerical evaluation of the coefficients in equations 6.14, which express the population density dynamics in a "moving basis" representation. If these equations are specialized to the sinusoidal perturbation input of equations 3.1 and 3.12, equation 3.25 emerges together with a prescription for numerical evaluation of its various constants.

4 Applications of the Frequency Response: Stability of Interacting Subpopulations and of Feedback
The deliberations above are major tools for the investigation of how interacting subpopulations of neurons behave in consort. In time-independent equilibrium the kth subpopulation will satisfy a generalized form of the population-response relation, equation 2.37:

r_k(s_0) = R_k[C_k(s_0) ρ_k0(s_0), s_0]. (4.1)

Here s_0 stands for a list of inputs that may be derived from external sources or from the outputs of other subpopulations. Equation 4.1 has been stated in a way that leaves room for differences not only in the detailed dynamics of each subpopulation but also in the way different input channels deliver synaptic signals to any particular subpopulation. Clearly the dynamical equation, 2.9, may be generalized in a manner similar to equation 4.1:

∂ρ_k/∂t = Q_k(s(t)) ρ_k.
(4.2)
Again the subscript $k$ refers to the $k$th subpopulation. The procedures of section 3 may be applied to this more general case, to find the frequency
490
Bruce W. Knight
response of the $k$th population to the $m$th input, as in equation 3.22:

$r_{k1} = T_{km}(s)\, s_{m1}$.  (4.3)
(Looking back to equations 3.16 and 3.17, we note that the resonant eigenvalues in this expression sensibly depend on the $k$th dynamical equation but not on the $m$th input source.) We still have to specify how the response of one population is to affect the input to another. We will work through some informative simple cases, but initially let us leave this relation very general and write

$s_m(t) = s^{(\mathrm{ext})}_m(t) + u_m\!\left[r(t'),\, t\right]$.  (4.4)
The first right-hand term is to be regarded as specified external input from other neuron subpopulations, which can be as numerous as the populations under our direct investigation. The second term is a general nonlinear functional, where the prime on $t'$ indicates that this functional can look back across past times $t'$ at earlier outputs of all of the different subpopulations. Thus $u_m$ is general enough to include, for example, distributed propagation time lags from other subpopulations. In the special state of time-independent equilibrium, equation 4.4 reads

$s_{m0} = s^{(\mathrm{ext})}_{m0} + u_m[r_0]$,  (4.5)
where the $s^{(\mathrm{ext})}_{m0}$ are given. These equations simply state how, through back coupling, the time-independent total inputs must depend in a specified way on the time-independent outputs. They are to be solved together with equation 4.1 for the two sets of constants $s_0$, $r_0$. Simple particular cases suggest particular simple strategies, but we note that one standard method for solving equations 4.5 and 4.1 is to replace $u_m$ by $b u_m$ and repeatedly solve while advancing $b$ from $b = 0$, by small steps, to $b = 1$. There remains the question of the stability of the feedback-coupled solution. The technical means for evaluating a transfer function, discussed in the previous section, are broadly applicable. In particular, if we are given a specified $u_m[r(t'), t]$, and if we assume that the $n$th input component has the time dependence

$r_n(t) = r_{n0} + r_{n1}(t) = r_{n0} + r_{n1}(0)\, e^{st}$,  (4.6)

with $r_{n1}$ small, and we further assume that the remaining components of $r(t)$ are at their equilibrium values, the general steps of the previous section yield

$u_m\!\left[r(t'),\, t\right] = u_m[r_0] + U_{mn}(s)\, r_{n1}$,  (4.7)
Dynamics of Encoding
491
and the transfer function $U_{mn}(s)$ is straightforward to evaluate in particular cases. If instead we assume that all the $r_n(t)$ are of the form of equation 4.6, then equation 4.7 generalizes to

$u_m\!\left[r(t'),\, t\right] = u_m[r_0] + \sum_n U_{mn}(s)\, r_{n1}$,  (4.8)

which is a component of the matrix equation

$u\!\left[r(t'),\, t\right] = u[r_0] + U(s)\, r_1$.  (4.9)
Again in matrix notation, if we assume that the external driving in equation 4.4 is of the form

$s^{(\mathrm{ext})}(t) = s^{(\mathrm{ext})0} + s^{(\mathrm{ext})1}(0)\, e^{st}$,  (4.10)

then the first-order part of the feedback equation, 4.4, becomes

$s_1 = s^{(\mathrm{ext})1} + U(s)\, r_1$.  (4.11)
The subpopulation input-output relation, equation 4.3, in matrix form becomes

$r_1 = T_0(s)\, s_1$.  (4.12)

If we now act on the previous equation with the population-response transfer function matrix of equation 4.12, we may express the first-order firing rates of all the subpopulations in terms of those of the external inputs:

$r_1 = T_0(s)\, s^{(\mathrm{ext})1} + T_0(s)\, U(s)\, r_1$.  (4.13)

This may be solved for the departures from equilibrium of the population rates by a matrix inversion:

$r_1 = \left(I - T_0(s)\, U(s)\right)^{-1} T_0(s)\, s^{(\mathrm{ext})1}$.  (4.14)
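A minimal numerical sketch of equation 4.14 for two coupled subpopulations, with a hand-rolled $2 \times 2$ complex inverse; the transfer-matrix entries are invented values at one fixed complex frequency, not quantities computed from the model:

```python
# Closed-loop response r1 = (I - T0(s) U(s))^(-1) T0(s) s_ext1 for two
# subpopulations, using an explicit 2x2 complex matrix inverse.

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_vec(A, v):
    return [sum(A[i][k] * v[k] for k in range(2)) for i in range(2)]

def mat_inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

def closed_loop(T0, U, s_ext1):
    TU = mat_mul(T0, U)
    M = [[(1 if i == j else 0) - TU[i][j] for j in range(2)] for i in range(2)]
    return mat_vec(mat_inv2(M), mat_vec(T0, s_ext1))

# Hypothetical values at one fixed complex frequency s:
T0 = [[2.0 + 0.5j, 0.0], [0.0, 1.5 - 0.3j]]   # diagonal population responses
U = [[0.0, 0.1], [0.2, 0.0]]                   # weak cross-coupling only
r1 = closed_loop(T0, U, [1.0, 0.0])
```

With the cross-coupling $U$ set to zero the result reduces to $r_1 = T_0\, s^{(\mathrm{ext})1}$; with weak coupling, the second population acquires a nonzero modulation purely through cross-talk, even though it receives no external drive.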
When this expression is written out in components, it tells us the transfer function that expresses the modulation of firing rate, near equilibrium, of any subpopulation in response to modulation of any one of the external inputs, and it includes the effects of cross-talk communicated between different subpopulations as well as the effect of any self-feedback. It is time to observe the connection between the analytic properties of the transfer function and the question of stability. In general, a modulated output $r_1$ is related to a modulated input $\hat s_1$ by a (generally complex) ratio
$T(s)$ (as in equation 3.22), which allows us to find what input is required to achieve a specified output:

$\hat s_1 = (T(s))^{-1}\, r_1$.  (4.15)

If we move $s$ on the complex plane, the required value of $\hat s_1$ changes. In particular, if we move $s$ so that

$(T(s))^{-1} = 0$,  (4.16)

we have an output in the absence of input, which is to say that a root of equation 4.16 gives us a free-running behavior of our system, with a time course proportional to $\exp(st)$. It may be technically simpler to observe that the free-running exponents of the system occur where the transfer function itself becomes infinite. A case in point is the population firing-rate response transfer function of equation 3.25, which goes to infinity at the eigenvalues $s = \lambda_n$. We note from equation 2.39 that the eigenvalues $\lambda_n$ are the free-running exponents of the system. We see from equation 4.14 that the free-running exponents for the coupled subpopulations are given by the values of $s$ where the inverse matrix goes to infinity, and these are the roots of the equation

$\det\left(I - T_0(s)\, U(s)\right) = 0$.  (4.17)
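For a single population the determinant in equation 4.17 is scalar, and its root can be located by Newton iteration. The sketch below uses a hypothetical single-pole transfer function $T_0(s) = b/(s - \lambda)$ with constant feedback $U(s) = g$, for which the root is also known in closed form ($s = \lambda + g b$), so the iteration can be checked:

```python
# Solve 1 - T0(s) * U(s) = 0 by Newton iteration (scalar case of equation 4.17).
lam = -0.2 + 6.3j      # hypothetical least-damped eigenvalue
b = 0.5 + 0.1j         # hypothetical residue
g = 0.3                # feedback gain

def f(s):
    return 1.0 - (b / (s - lam)) * g

def df(s):
    return (b * g) / (s - lam) ** 2

s = lam + 0.01         # start near the pole, where T0 is large
for _ in range(50):
    s = s - f(s) / df(s)

exact = lam + g * b    # closed-form root for this T0 and U
# Re(s) = -0.2 + 0.3 * 0.5 = -0.05 < 0: the equilibrium is still stable here.
```

The real part of the root crosses zero as $g$ grows, anticipating the kind of critical-gain estimate derived in section 5.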
This expression is a polynomial in the known transfer functions of the various component pieces of the overall coupled system. If any root of equation 4.17 has a real part that is positive, then the equilibrium determined by equations 4.1 and 4.5 is unstable. In typical cases, the features of the specific problem may be exploited to find the informative roots of equation 4.17 by means more efficient than a direct determinantal expansion. The principal virtue of equation 4.17 is to show that, in general, there is a well-defined route to a well-defined goal that answers the question about stability.

5 Stability: Specific Cases

5.1 One Population, Self-Feedback with Fixed Lag. In this case we choose the feedback functional in equation 4.4 to be the linear relation
$u\!\left[r(t'),\, t\right] = g\, r(t - \hat t)$,  (5.1)

where $g$ is the gain from output to feedback input, and $\hat t$ is the fixed lag time, so that the feedback equation, 4.4, becomes

$s(t) = s^{(\mathrm{ext})}(t) + g\, r(t - \hat t)$.  (5.2)
For time-independent equilibrium, equation 4.5 becomes

$s_0 = s^{(\mathrm{ext})0} + g\, r_0$.  (5.3)

This is to be solved together with the steady-state population-response equation, 2.37:

$r_0 = R\!\left[C(s_0)\,\rho_0(s_0),\; s_0\right] \equiv R_0(s_0)$,  (5.4)

which is a nonlinear relation between $r_0$ and $s_0$. Thus the determination of $s_0$ in terms of $s^{(\mathrm{ext})0}$ cannot be done in a simple and exact closed form. But to understand the basic facts about the static behavior of our population response, we should already have on hand the evaluation of the function $R_0(s_0)$ over a range of $s_0$ values, and so we can solve equations 5.3 and 5.4 together to express the $s_0$ versus $s^{(\mathrm{ext})0}$ relation as

$s^{(\mathrm{ext})0} = s_0 - g\, R_0(s_0)$.  (5.5)
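Equation 5.5 gives $s^{(\mathrm{ext})0}$ explicitly in terms of $s_0$, so recovering $s_0$ from a given external input is a one-dimensional root search. A sketch with an invented saturating rate function $R_0$ and a hypothetical gain small enough that the right-hand side of equation 5.5 stays monotone:

```python
import math

def R0(s0):
    # Hypothetical saturating equilibrium rate function (not the model's R0).
    return 100.0 / (1.0 + math.exp(-(s0 - 5.0)))

g = 0.02   # hypothetical gain; g * max R0' = 0.5 < 1 keeps s_ext0 monotone

def s_ext0(s0):
    """Equation 5.5: s_ext0 = s0 - g * R0(s0)."""
    return s0 - g * R0(s0)

def solve_s0(target, lo=0.0, hi=50.0, tol=1e-10):
    """Bisection for s0, valid while s_ext0 is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if s_ext0(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

s0 = solve_s0(4.0)   # total input consistent with external input 4.0
```

For larger $g$ the map in equation 5.5 loses monotonicity and some external inputs fall outside the solvable range, which is the cautionary point made in the text.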
Values of $s^{(\mathrm{ext})0}$ that are outside the range of positive values given by equation 5.5 will not allow solution of the steady-state equations 5.3 and 5.4. We make the cautionary observation that in equation 5.5, any steady-state total input $s_0$ may be excluded from the solvable region by making the positive feedback $g$ strong enough. Clearly our discussion applies to the remaining broad range of cases where the time-independent solution exists. If we let

$r(t) = r_0 + r_1(t) = r_0 + r_1(0)\, e^{st}$,  (5.6)

our specialization equation, 5.1, becomes

$u\!\left[r_0 + r_1(t'),\, t\right] = g r_0 + g r_1(0)\, e^{s(t - \hat t)} = g r_0 + g e^{-s\hat t}\, r_1(t)$,  (5.7)

so in this case the transfer function of equation 4.7 is

$U(s) = g\, e^{-s\hat t}$.  (5.8)
The first-order part of equation 5.2 is

$s_1 = s^{(\mathrm{ext})1} + U(s)\, r_1$,  (5.9)

while from equation 3.22,

$T_0(s)\, s_1 = r_1$.  (5.10)
We multiply equation 5.9 by $T_0(s)$ and find

$r_1 = \frac{T_0(s)}{1 - T_0(s)\, U(s)}\; s^{(\mathrm{ext})1}$  (5.11)

for the population rate frequency response to external input. (Note that this is the $1 \times 1$ matrix case of the matrix equation, 4.14.) By equation 4.16, the free-running response exponents of the system will be those values of $s$ for which the denominator vanishes, so that

$T_0(s)\, U(s) = 1$,  (5.12)

or in our particular case (equation 5.8),

$e^{-s\hat t}\, T_0(s) = \frac{1}{g}$,  (5.13)
a transcendental equation to be solved for $s$, where $T_0(s)$ is the explicit expression (equation 3.25). A specialized case will show some qualitative features. In the simple case of self-feedback without delay ($\hat t = 0$), equation 5.13 becomes

$T_0(s) = \frac{1}{g}$.  (5.14)
If the feedback gain $g$ is small, then a root $s$ must make $T_0(s)$ large; this will happen if $s$ lies near a pole $\lambda_n$ of $T_0(s)$. As a first approximation we may truncate the sum in equation 3.25 to the term that is large because $s$ lies near its pole:

$T_0(s) \approx \frac{b_n}{s - \lambda_n}$,  (5.15)

and equation 5.14 may now be solved to give

$s = \lambda_n + g\, b_n$.  (5.16)
If $b_n$ is real and positive (as in the explicit example of $T_0(s)$ given by equation 3.24), then for positive feedback $g$, equation 5.16 shows that the real part of $s$ is less negative than that of $\lambda_n$, and also that increasing $g$ moves the real part of $s$ in the positive direction, eventually past zero and beyond the threshold of instability. More generally, to find the border of instability, we may set the real part of $s$ to zero and take the real part of equation 5.16 to find the critical feedback value,

$g_{\mathrm{crit}} = \left|\mathrm{Re}\,\lambda_n\right| / \mathrm{Re}\, b_n$,  (5.17)
to within the approximation of equation 5.15. The system will be unstable if any root of equation 5.14 has a positive real part, so in equation 5.17 we must choose the index $n$ that makes $g_{\mathrm{crit}}$ smallest. Equation 2.42 suggests that commonly this will occur for $n = 1$. Returning to equation 5.13, for finite lag $\hat t$ the "small-$g$, single-pole" approximation, equation 5.15, still makes good sense; instead of equation 5.16, it now gives us the implicit relation

$s = \lambda_n + e^{-s\hat t}\, b_n\, g$.  (5.18)
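Equation 5.18 is transcendental, but for small $g$ the right-hand side is a contraction and the root can be found by direct fixed-point iteration starting from $\lambda_n$; the numbers below are invented for illustration:

```python
import cmath

lam = -0.2 + 6.3j   # hypothetical eigenvalue lambda_n
b = 0.5 + 0.1j      # hypothetical residue b_n
g = 0.1             # small feedback gain
lag = 0.05          # the fixed lag (the hatted t of equation 5.2)

# Iterate s <- lam + exp(-s * lag) * b * g  (equation 5.18).
s = lam
for _ in range(100):
    s = lam + cmath.exp(-s * lag) * b * g

residual = abs(s - (lam + cmath.exp(-s * lag) * b * g))
```

The contraction factor is roughly $|b\,g| \cdot \mathrm{lag}$, far below one here, so convergence is rapid; at this small gain the root keeps a negative real part, and the equilibrium remains stable.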
Again let us observe that a critical value of $g$ yields a purely imaginary $s$ at the edge of instability. At that value take the real part of equation 5.18, and instead of equation 5.17 we find

$g_{\mathrm{crit}}(\hat t) = \frac{1}{\cos(|s|\,\hat t)}\; \frac{\left|\mathrm{Re}\,\lambda_n\right|}{\mathrm{Re}\, b_n}$,  (5.19)

and (because the cosine cannot exceed unity) the presence of the lag tends to stabilize positive feedback. Equation 5.18 may be used in this way if $g_{\mathrm{crit}}(\hat t)$ is still small, which should hold if $\left|\mathrm{Re}\,\lambda_n\right|$ is small, which should be the case if the influence of stochastic effects is modest.

5.2 One Population, Distributed Feedback Lags. Suppose that instead of a single fixed lag $\hat t$ as in equation 5.1, we have a distribution of lags weighted according to a distribution function $P(t')$, which integrates to unity. In place of equation 5.1, our feedback functional now is
$u\!\left[r(t')\right] = g \int_0^{\infty} dt'\, P(t')\, r(t - t')$,  (5.20)
which is still linear in $r$, and where $g$ again is the feedback gain. Substitution of a constant $r_0$ in equation 5.20 retrieves exactly the same results as in the previous exercise, since $P(t')$ integrates to unity. If we assume a superimposed small modulation,

$r_1(t) = r_1(0)\, e^{st}$,  (5.21)

then equation 5.20 gives

$u\!\left[r_1(t'),\, t\right] = g \int_0^{\infty} dt'\, P(t')\, r_1(0)\, e^{s(t - t')} = g\, r_1(t) \int_0^{\infty} dt'\, e^{-st'} P(t') = U(s)\, r_1(t)$,  (5.22)
and the explicit feedback transfer function,

$U(s) = g \int_0^{\infty} dt'\, e^{-st'} P(t') = g\, \hat P(s)$,  (5.23)

may be substituted into equation 5.12, whose roots give us the free-running frequencies. An example of such a situation is given by Abbott and van Vreeswijk (1993). If the feedback coupling $g$ is small enough, we may again use the "small-$g$, single-pole" approximation, equation 5.15. Since in this case $s$ lies near $\lambda_n$, to lowest order in $g$ we may replace $U(s)$ by $U(\lambda_n)$, and equation 5.12 may be solved for the free-running exponents,

$s_n = \lambda_n + b_n\, U(\lambda_n) = \lambda_n + g\, b_n\, \hat P(\lambda_n)$.  (5.24)
Here the first right-hand form makes no assumption concerning the nature of the feedback transfer function $U(s)$, except that its effect should be small enough to permit the single-pole approximation. In the second right-hand form, we may exploit the fact that $P(t')$ is positive in equation 5.23. If, further, $P(t')$ is concentrated around small values of $t'$, then for small integer $|n|$, equation 5.23 tells us that $U(\lambda_n)$ will have a real part that is positive. Equation 3.24 suggests (as do some detailed analytic arguments, as well as numerical evidence; Sirovich et al., 1999a) that the real part of $b_n$ (for small $|n|$) should be positive as well. Consequently the second right-hand form of equation 5.24 indicates that, again in this case, more positive feedback leads to real parts of the $s_n$ that are less negative, and so leads to more persistent free-running modes of motion. Again, the borderline of instability may be estimated. If we assume that $s_1$ is pure imaginary and take the real part of equation 5.24, we obtain

$g_{\mathrm{crit}} = \frac{\left|\mathrm{Re}\,\lambda_1\right|}{\mathrm{Re}\left(b_1\, \hat P(\lambda_1)\right)}$,  (5.25)

which should have quantitative accuracy if $g_{\mathrm{crit}}$ is small enough to fulfill the single-pole approximation.

5.3 A Multipopulation Generalization for Small Feedback. If feedbacks are small among interacting subpopulations of neurons, the free-running exponent equation, 4.17, has an instructive and valuable approximate analytic solution that can be obtained by the single-pole approximation. The determinant of a matrix vanishes if that matrix has a zero eigenvalue, and so equation 4.17 will be solved by any value $s$ that solves the matrix equation,
which should have quantitative accuracy if gcrit is small enough to fulll the single-pole approximation. 5.3 A Multipopulation Generalization for Small Feedback. If feedbacks are small among interacting subpopulations of neurons, the freerunning exponent equation, 4.17, has an instructive and valuable approximate analytic solution that can be obtained by the single-pole approximation. The determinant of a matrix vanishes if that matrix has a zero eigenvalue, and so equation 4.17 will be solved by any value s that solves the matrix equation,
$\left(I - T_0(s)\, U(s)\right) w = 0$.  (5.26)
A typical matrix element of $T_0(s)$ is

$T_{km}(s) = \sum_{n \neq 0} \frac{b_n^{(km)}}{s - \lambda_n^{(k)}}$,  (5.27)
where the row $k$ indexes the $k$th subpopulation and the column $m$ the $m$th input, while $n$ indexes the $n$th free-running mode of one subpopulation in isolation. (The fact that a subpopulation's free-running eigenvalues depend on which subpopulation, but not on which input, is reflected in equation 5.27.) There is a pole for every eigenvalue of every subpopulation. If the matrix elements of $U$ are small, then $T_0$ must have large elements to compensate, and so, in equation 5.27, $s$ must lie near some $\lambda_n^{(k)}$, let us say for the $q$th subpopulation, and all other poles may be ignored. The matrix elements (see equation 5.27) become

$T_{km}(s) = \frac{b_n^{(qm)}}{s - \lambda_n^{(q)}} = \frac{v_m}{s - \lambda_n^{(q)}}$ if $k = q$, $= 0$ otherwise,  (5.28)

and the second right-hand form defines the elements of the specified vector:

$v\colon\; v_m = b_n^{(qm)}$.  (5.29)
Let us specify another vector:

$\hat b_q\colon\; (\hat b_q)_k = 1$ if $k = q$, $= 0$ otherwise.  (5.30)

Then (with the superscript $T$ for transpose) equation 5.28 may be expressed as the outer-product matrix,

$T_0(s) = \frac{1}{s - \lambda_n^{(q)}}\; \hat b_q\, v^{T}$.  (5.31)
If $s$ lies close to $\lambda_n^{(q)}$, then, in equation 5.26, to lowest order $U(s)$ may be replaced by $U(\lambda_n^{(q)})$, and equation 5.26 becomes

$w - \frac{1}{s - \lambda_n^{(q)}}\; \hat b_q\, v^{T}\, U\!\left(\lambda_n^{(q)}\right) w = 0$.  (5.32)

Performing the right-hand operations and multiplying by the denominator:

$\left(s - \lambda_n^{(q)}\right) w = \hat b_q \left(v,\; U\!\left(\lambda_n^{(q)}\right) w\right)$,  (5.33)
where the right-hand parenthesis is the scalar product of $v$ with $Uw$. Equation 5.33 shows us that the components of the unknown vector $w$ are proportional to the components of the known vector $\hat b_q$, so the eigenvector of equation 5.26 is

$w = \hat b_q$,  (5.34)

and we may equate the coefficients of equation 5.33 to obtain the exponent $s$:

$s = \lambda_n^{(q)} + \left(v,\; U\!\left(\lambda_n^{(q)}\right)\hat b_q\right) = \lambda_n^{(q)} + \sum_m b_n^{(qm)}\, U_{mq}\!\left(\lambda_n^{(q)}\right)$,  (5.35)
where equations 5.29 and 5.30 were used, in the last step, to evaluate the scalar product. Equation 5.35 tells us how far the exponent $s$ is moved from the nearby eigenvalue $\lambda_n^{(q)}$ by the interpopulation coupling $U_{mq}$. This subsection has shown how the single-population results of the previous subsection generalize naturally to a situation with interacting subpopulations. In a single small patch of cerebral cortex, we find subpopulations of neurons of different dynamical design, and we also find, in separated patches, subpopulations of neurons with the same dynamical design. This subsection was intended, together with the next subsection, as a guide to show how a model that is a caricature of cortex, with subpopulations that interact and have separate inputs but are of only a single dynamical type, may be generalized to the more realistic situation, which involves several different dynamical types.

5.4 Caricature of an Orientation Hypercolumn. In primary visual cortex, nearby patches of neurons typically respond best to linear visual stimuli that have similar orientations. There is neural interaction between the patches, and this interaction is weaker between pairs of patches whose preferred orientations are more dissimilar. Approximately, patches for all preferred orientations are on an equal footing, with no preferred orientation exciting a larger subpopulation of neurons than others do or differing from others in design characteristics. A set of nearby patches that includes all orientations is referred to as a hypercolumn. If a hypercolumn contained neurons of only one dynamical type, and contained N subpopulations with equally spaced preferred orientations, then the following caricature would be accurate:
A common dynamical equation,

$\frac{\partial \rho_k}{\partial t} = Q(s_k(t))\,\rho_k$,  (5.36)

for the $k$th subpopulation.
A common firing-rate rule,

$r_k(t) = R\!\left[C(s_k(t))\,\rho_k,\; s_k(t)\right]$.  (5.37)

An even-handed feedback rule,

$s_k(t) = s^{(\mathrm{ext})}_k(t) + \sum_n b_n\, u\!\left[r_{k-n}(t'),\, t\right]$.  (5.38)

Here the summed index is understood to be modulo $N$, so no subpopulation has preference. Since $u$ has been left general, we may assume that the weightings $b_n$ sum to unity:

$\sum_n b_n = 1$.  (5.39)
Let us also reasonably assume that $b_n$ is always positive and is even ($b_n = b_{-n}$). For simplicity we assume that $u$ is a linear functional, as in the earlier subsections. If we assume a time-independent baseline external input that also is even-handedly independent of $k$, then it is consistent to assume that the resulting total inputs and the subpopulation responses are independent of $k$ as well, and we retrieve, for the $N$ subpopulations, the equilibrium equations 5.3, 5.4, and 5.5, which held for a single population with feedback. Proceeding as before with the small-departure analysis, from our earlier results we easily derive from equations 5.36 and 5.37 that

$r_{k,1} = T_0(s)\, s_{k,1}$,  (5.40)

while equation 5.38 gives us

$s_{k,1} = s^{(\mathrm{ext})k,1} + U(s) \sum_n b_n\, r_{k-n,1}$.  (5.41)

To relate population response to external input, we may multiply equation 5.41 by $T_0(s)$ and use equation 5.40:

$r_{k,1} = T_0(s)\, s^{(\mathrm{ext})k,1} + T_0(s)\, U(s) \sum_n b_n\, r_{k-n,1}$.  (5.42)
This is a special case of the matrix equation, 4.13. It is much simplified because the single-population dynamics results in a population transfer function matrix that is a single scalar transfer function multiplying the identity matrix, while the feedback matrix is a single scalar function of $s$ multiplying a discrete convolution. The presence of a discrete convolution suggests that a discrete-Fourier-transform strategy should yield major simplifications, and this is indeed
the case. In fact discrete Fourier transformation of our nonlinear starting equations, 5.36–5.38, gives a reduction in numerical work by isolating a collection of "nonlinear dynamical modes," which can be much less than $N$ in number and also carry a valuable physiological interpretation. The stability question, which we address here, is expedited by a small bit of the discrete-Fourier machinery. We have seen that the free-running exponents of the coupled system may be found as the values of $s$ that produce a zero eigenvalue and so solve the matrix equation, 5.26. Corresponding to equation 5.42, that matrix eigenvalue equation specializes to

$w_k - T_0(s)\, U(s) \sum_n b_n\, w_{k-n} = 0$.  (5.43)

Because of the discrete convolution, we can anticipate that the eigenvector $w$ will be a discrete-Fourier basis vector:

$w_k = e^{iqk}, \quad \text{where } q = 2\pi j / N$,  (5.44)

and where $j$ can be any integer in the range that includes zero to $N - 1$. To verify this, substitute equation 5.44 into 5.43 and at once reduce that equation to

$1 - T_0(s)\, U(s)\, \hat b(q) = 0$,  (5.45)
where

$\hat b(q) = \sum_n b_n\, e^{-iqn}$  (5.46)

is the discrete Fourier transform of $b_n$. We note for $j = 0$ that

$\hat b(0) = 1$,  (5.47)

by equation 5.39. Because we chose the $b_n$ real, positive, and even in $n$, we see that $\hat b$ is a real number and

$\hat b(q) \le 1$.  (5.48)
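The properties in equations 5.47 and 5.48 can be checked numerically for any kernel $b_n$ that is real, positive, even, and normalized; the weights below are invented:

```python
import cmath

N = 8
# Hypothetical coupling weights b_n, even in n (indices mod N) and normalized.
raw = [4.0, 2.0, 1.0, 0.5, 0.25, 0.5, 1.0, 2.0]
b = [x / sum(raw) for x in raw]

def b_hat(j):
    """Discrete Fourier transform of equation 5.46 at q = 2*pi*j/N."""
    q = 2.0 * cmath.pi * j / N
    return sum(b[n] * cmath.exp(-1j * q * n) for n in range(N))

values = [b_hat(j) for j in range(N)]
# Equation 5.47: b_hat(0) = 1.  Equation 5.48: each b_hat(q) is real and <= 1,
# so modes with q != 0 see a weakened effective feedback coupling.
```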
Thus equation 5.45, which is to be solved for $s$, is identical in form to equation 5.12, which was for a single-neuron population with the same population dynamics and the same feedback dynamics. For $q \neq 0$, the presence of $\hat b(q)$ in equation 5.45 simply reduces the strength of the effective feedback coupling, according to equation 5.48. Thus, we anticipate that the
least stable exponent will go with $q = 0$, which gives us exactly the single-population case, equation 5.12, discussed earlier. Three further points should be made here. The first is that discrete subpopulations appeared in equations 5.36–5.38 in anticipation of numerical work to come, while a more realistic hypercolumn model would feature a continuous gradation in best-orientation sensitivity. But a free-running exponent equation, 5.45, has emerged as an effective single-population relation whose form is independent of how finely we discretize the system. We could have started equally well with a continuously graded system, and the same effective one-dimensional result would have followed. The simplification of equation 5.45 was a consequence of an underlying symmetry of the system. We could have started with the symmetry of the circle, which is what the problem gave to us, rather than reverting to the less natural symmetry of an N-sided regular polygon, which is what the discrete Fourier transform addresses. The second point is that we could have undertaken a more realistic hypercolumn model involving several different cell types, each with a population dynamics differing in its details. All the steps above would have their counterparts in this more realistic case. At the final step we would have, instead of the single equation 5.45, a matrix eigenvalue equation like 5.26. But the number of entries in the eigenvector would be not the large number of subpopulations but rather the small number of dynamically different cell types. Thus, the stability problem still is very tractable for this case. The third point is in a more speculative vein. Recent evidence (Everson et al., 1998) suggests that primary visual cortex has a single processing strategy and even-handed allocation of resources for the processing of patches of retinal image that differ not only in orientation but also in magnification scale over a fair range of magnifications, to reasonable approximation.
The combined two-parameter ("twist" and "zoom") symmetry has a set of two-parameter eigenvectors analogous to the "discrete twist" eigenvector (see equation 5.44). These mathematical constructs might not only assist more realistically detailed modeling of visual cortex, but perhaps also might provide some new insight into the nature of cortical processing.

6 Possibilities for Efficient Simulation

6.1 Principal Components. Section 1 concluded with the mention of a means to reduce the dynamical equation, 2.9, and the firing-rate equation, 2.13, to matrix equations of fairly modest size. This may be achieved, once we have on hand a sample solution of equation 2.9 in response to a span of "typical input" $s(t)$, by principal component analysis (Sirovich & Everson, 1992) (also called "singular value decomposition" or "representation by empirical eigenfunctions"), which now will be briefly described. For reasonable input $s(t)$, some dynamical equations of the form 2.9 prove to be almost unable to carry the population density function $\rho(t)$ outside a subspace of modest dimension that is embedded in its full function space. If at a time $t^*$ we have a snapshot $\rho(t^*)$ of the density function, we may create from it the singular operator

$P_{t^*} = \rho(t^*)\, \rho(t^*)^{T}$,  (6.1)
which is a projection operator in the sense that its action on any vector $u$ gives

$P_{t^*} u = \rho(t^*)\left(\rho(t^*),\, u\right)$,  (6.2)

which is a vector in the one-dimensional manifold defined by the snapshot vector $\rho(t^*)$. From equation 6.1 we immediately show that $P_{t^*}$ is self-adjoint. Because

$P_{t^*}\, \rho(t^*) = \rho(t^*)\left(\rho(t^*),\, \rho(t^*)\right)$,  (6.3)
it has only nonnegative eigenvalues and is positive-semidefinite. If we have a collection of snapshots, taken at times $t^*$, we may form the operator

$A = \sum_{t^*} P_{t^*}$,  (6.4)
which, according to equation 6.2, maps any vector into the modest-dimensional manifold accessible to the dynamical equation. The self-adjoint property and the positive-semidefinite property are inherited by $A$, and this ensures a set of orthonormal eigenvectors in the eigenvalue problem

$A w_n = \lambda_n w_n$,  (6.5)

whose eigenvectors we can number in the order of their decreasing nonnegative eigenvalues. Because the eigenvectors are orthonormal, we have

$(w_m, w_n) = \delta_{mn}$.  (6.6)
Because the effectively nonnull subspace of $A$ is of modest dimension, to good approximation we may span it with a modest subset of the eigenvectors of equation 6.5. Now we can expand a solution $\rho(t)$ as

$\rho(t) = \sum_n a_n(t)\, w_n$,  (6.7)
where the sum extends over a modest number of terms (17 terms in the quoted example). This we now can substitute into the dynamical equation, 2.9. We take the inner product of that equation with $w_m$ and (employing equation 6.6 on the left-hand side) directly conclude that

$\frac{da_m}{dt} = \sum_n \left(w_m,\; Q(s(t))\, w_n\right) a_n$,  (6.8)

which may be integrated in time for the $a_m(t)$. With these $a_m(t)$ on hand, substitution of equation 6.7 into the firing-rate equation, 2.13, explicitly evaluates the firing rate as

$r = \sum_n R\!\left[C w_n\right] a_n$.  (6.9)
We note that the $a_n$ may be regarded as the components of a vector, and this endows equations 6.8 and 6.9 with an inherited structure identical to that of equations 2.9 and 2.13. Equation 6.9 is an explicitly specified linear functional of its input vector. Equation 6.8 expresses the evolution of that vector in terms of the action on it of an $s$-dependent operator, here an $s$-dependent matrix. If the operator in equation 2.9 can be expressed as a polynomial in $s$ (as, for example, by substituting equation 2.43 into equation 1.4), then the matrix in equation 6.8 can be expressed as a corresponding polynomial in which the same set of powers of $s$ multiplies corresponding constant component matrices. These considerations translate directly into a practical and efficient computational scheme. However, this procedure requires caution at one point: the operator $Q$ has conservation of probability built into it, but its representation by a truncated matrix in equation 6.8 may slightly miss perfect conservation because of the truncation; the formerly zero eigenvalue may be slightly perturbed, and this will give to the total probability a spurious slow exponential drift as time progresses. Conservation of probability is easily restored to the matrix of equation 6.8. In discretized format, conservation for the original matrix $Q$ was assured by the fact that every column of $Q$ was orthogonal to the zeroth adjoint eigenvector $\hat\rho_0$, which had constant entries. In the present representation the $n$th component of this vector is given by

$\hat a_n = \left(w_n,\; \hat\rho_0\right)$.  (6.10)
Without truncation, the admixture of this vector in each column of the matrix in equation 6.8 would be zero. Let us take the inner product of the truncated vector of equation 6.10 with a column of the truncated matrix, to obtain the (very small) coefficient of actual admixture; we then multiply the vector (equation 6.10) by this admixture coefficient and subtract the result from the matrix column.
Conservation of probability is now fulfilled. In cases tested (Sirovich et al., 1999c), the resulting matrix applied in equations 6.8 and 6.9 gives excellent agreement with the much longer direct calculation of equations 2.9 and 2.13. The choice of 17 dimensions rendered graphical results indistinguishable from those of the longer direct calculation. (This was not quite achieved with 15 dimensions. The dimension count jumps by twos because the natural subspaces relate to consecutive pairs of complex-conjugate eigenvalues, as the next section will elucidate.)

6.2 Moving Basis. We conclude with a brief comment on a set of possibilities that might lead to extremely efficient simulation and give an additional viewpoint on mathematical structure. Once the efficient principal-components representation has been constructed, it is not a heavy computation to solve the eigenvalue problem and tabulate the results, parametrically in the input $s$, for the eigenvalues and eigenfunctions that are within the subspace that the dynamical equation is able to access. In cases examined, these have proved to be in good practical agreement with their counterparts derived from the more laborious direct calculation (Sirovich et al., 1999c). Thus, it is reasonable to envision a representation of the dynamics expressed in terms of the input-dependent dynamical eigenvectors of equations 2.14 and 2.15:
$\rho(t) = \sum_n \tilde a_n(t)\, w_n(s(t))$.  (6.11)
This is a representation in terms of a "moving basis" that at each time $t$ remains optimized for the moving present value of the input $s(t)$. (The overhead mark is to distinguish these coefficients from those of equation 6.7.) Effective convergence thus should be achieved by a sum of fewer terms in equation 6.11 than were required in equation 6.7, where the fixed basis was a compromise that had to span the first several eigenvectors of $Q(s)$ over the entire range of $s(t)$. Once the $\tilde a_n(t)$ in equation 6.11 have been determined, the population firing rate is given by the short sum

$r = \sum_n \tilde a_n\, R\!\left[C(s)\, w_n(s),\; s\right] = \sum_n \hat R_n(s)\, \tilde a_n(t)$.  (6.12)
To determine the $\tilde a_n(t)$, we may work from the differential equations that they satisfy. To do this we may take the inner product of $\rho(t)$ in equation 6.11 with the adjoint eigenvector $\hat w_m(s)$, to obtain

$\tilde a_m(t) = \left(\hat w_m(s(t)),\; \rho(t)\right)$.  (6.13)
Differentiation now yields

$\frac{d\tilde a_m}{dt} = \left(\hat w_m,\, \frac{\partial \rho}{\partial t}\right) + \left(\hat w_{m,s},\, \rho\right)\frac{ds}{dt}$
$\quad = \left(Q^{*}\hat w_m,\, \rho\right) + \frac{ds}{dt}\left(\hat w_{m,s},\, \rho\right)$
$\quad = \sum_n \left\{ \lambda^{*}_m \left(\hat w_m,\, w_n\right) + \frac{ds}{dt}\left(\hat w_{m,s},\, w_n\right) \right\} \tilde a_n$
$\quad = \lambda_m(s(t))\,\tilde a_m + \frac{ds}{dt} \sum_n M_{mn}(s(t))\,\tilde a_n$.  (6.14)

In the second line we used the dynamical equation and the adjoint operator relation; in the third we substituted equation 6.11 for $\rho$; in the fourth line we abbreviated by defining the matrix

$M_{mn}(s) = \left(\partial \hat w_m(s)/\partial s,\; w_n(s)\right)$  (6.15)

(whose $m = 0$ entries are all zero, because of the constant-entry property of $\hat\rho_0$, as discussed at equations 2.35 and 2.36). The fourth line of equation 6.14 gives an explicit system of ordinary differential equations to be integrated for the $\tilde a_m(t)$. A check on equations 6.14 is that if $s$ is constant, they may be solved analytically, and their solution is equation 2.32. Because for $m \neq 0$ the eigenvectors $w_m$ do not carry any probability (see the discussion at equation 2.36), we must have

$\tilde a_0(t) \equiv 1$  (6.16)

(by equation 6.11), and this is consistent with equation 6.14 even when $s(t)$ is time dependent, since, when $m = 0$, both $\lambda_m$ and $M_{mn}$ are zero. If $s(t)$ changes very slowly (see equation 2.38), then $\rho(t)$ will pass through a progression of steady states, and we will have

$\rho(t) = \rho_0(s(t))$.  (6.17)
The remaining members of equation 6.14 furnish a sequence of corrections to this result. We can estimate how many terms we will need in the sums in equations 6.11 and 6.14. If our eigenvalues are of the "damped limit-cycle" form (see equation 2.42), then equations 6.14 are those for a sequence of "tuned filters," each tuned to a radian frequency given by the imaginary part of the eigenvalue, and each driven by the rightmost term in the equation. We see that the pair of equations for $m$ and $-m$ are needed to furnish the part of the response spectrum in a frequency range near $m$ times the limit-cycle frequency. If we wish to compute an output response that
506
Bruce W. Knight
is accurate through frequency components that are $N$ times as high as the highest single-neuron firing rate, then the estimated number of terms in the sums in the firing-rate and dynamical equations 6.12 and 6.14 will be $2N + 1$. Thus, for example, seven terms should yield a computed output with a short-term-estimated spectrum that remains accurate through frequencies about three times the momentary single-neuron firing rate.

Direct application of the moving basis, as just discussed, to a simulation in which the value of the input $s(t)$ passes from the limit-cycle regime of an individual neuron to the regime of the stable equilibrium point, will encounter a technical nuisance. (This situation will arise in the simulation of primary visual cortex.) In the region of passage between the two regimes, each pair of early eigenvalues $\lambda_{\pm m}(s)$, at some critical value of $s$, will merge by having their imaginary parts move to zero, and further change in $s$ will give rise to two negative real eigenvalues that correspond to a faster and a slower pure relaxation. The members of equation 6.14 for $\pm m$ together may be written

$$\begin{aligned}
d\tilde{a}_{+m}/dt &= \bigl(\lambda_{+m}\cdot\tilde{a}_{+m} + 0\cdot\tilde{a}_{-m}\bigr) + \bigl(ds/dt\bigr)\,L_{+m}(\tilde{a}_{+},\tilde{a}_{-},s)\\
d\tilde{a}_{-m}/dt &= \bigl(0\cdot\tilde{a}_{+m} + \lambda_{-m}\cdot\tilde{a}_{-m}\bigr) + \bigl(ds/dt\bigr)\,L_{-m}(\tilde{a}_{+},\tilde{a}_{-},s)
\end{aligned} \tag{6.18}$$
to emphasize that here we are dealing with a $2 \times 2$ diagonal matrix. The $L_{\pm}$ terms are the remaining linear sums in equation 6.14. Clearly we can pursue equation 6.18 into the pure-relaxation regime by assigning $+m$ and $-m$, respectively, to the less and to the more negative real eigenvalues there. But the transition is not smooth at the merger value, as the motion of each eigenvalue (for example) on the complex plane takes a right-angle turn at that value of $s$. The two invariants of the matrix, its determinant $\Delta$ and its trace $T$, are related to the eigenvalues by

$$\Delta_m(s) = \lambda_{+m}(s)\cdot\lambda_{-m}(s) \tag{6.19}$$

$$T_m(s) = \lambda_{+m}(s) + \lambda_{-m}(s) \tag{6.20}$$

$$\lambda_{\pm m}(s) = \frac{1}{2}\Bigl\{T_m(s) \pm \sqrt{(T_m(s))^2 - 4\Delta_m(s)}\Bigr\}. \tag{6.21}$$
The critical condition occurs at the value of $s$ where the argument of the root passes through zero and then changes sign. We observe that the right-angle-turn behavior of the eigenvalues is not shared by either of the two invariants, and this indicates that the passage may be made smooth by a
linear transformation upon equation 6.18 that carries it to the form

$$dx_m/dt = 0\cdot x_m + 1\cdot y_m + \bigl(ds/dt\bigr)\,X_m(x,y,s) \tag{6.22}$$

$$dy_m/dt = -\Delta_m\cdot x_m + T_m\cdot y_m + \bigl(ds/dt\bigr)\,Y_m(x,y,s) \tag{6.23}$$
whose $2 \times 2$ matrix has the same determinant and trace as in equation 6.18. Equation 6.22 may be solved for $y_m$ and substituted into equation 6.23 to remove explicit dependence on $y_m$ from that equation's right-hand side. An additional differentiation of equation 6.22 then carries the form of the dynamics to that of the fairly familiar damped and forced harmonic oscillator. The critical value of $s$ corresponds to critical damping in the oscillator, which does not introduce singular behavior into the oscillator's forced motion. The moving-basis equations in the real-component form, 6.22 and 6.23, may be forward integrated across the critical-damping condition without encountering embarrassment.

The introduction of the moving basis in terms of sharpening the results of principal component analysis at the start of this subsection was convenient but inessential. The evaluated pieces that go into equations 6.12 and 6.14, which are $\hat{r}_n(s)$, $\lambda_m(s)$, and $M_{mn}(s)$, are all to be evaluated at constant input $s$. If state-space is divided up into small compartments, then the only compartments visited by a neuron at constant $s$ will be those that lie within a "fattened limit cycle" or, in the case of an $s$ so chosen that the neuron without noise would be quiescent, will lie analogously within the "stochastically fattened neighborhood of a stable equilibrium point." These small compartment subsets of the state-space offer a numerically tractable eigenvalue problem and direct evaluation of the pieces just mentioned.

The standard sinusoidal small-perturbation assumption, as stated in equations 3.1 and 3.12, can be substituted into the moving-basis dynamical equations 6.14. Since equation 6.14 contains no prior approximation, the transfer function (see equation 3.25) must follow, and does so with a little thoughtful manipulation.
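The assertion above, that the real-component form 6.22 and 6.23 may be forward integrated across the critical-damping condition without difficulty, is easy to check numerically. A free-motion sketch (forcing terms dropped; the trace and the time course of the determinant are hypothetical illustrative values):

```python
# Free motion of the transformed pair, equations 6.22 and 6.23 with the
# forcing terms dropped:
#   dx/dt = y,    dy/dt = -Delta(t) x + T y.
# Delta(t) drifts through the critical value T^2/4 (the eigenvalue
# merger), and the forward integration passes through it smoothly.

T_tr = -2.0                        # trace, held fixed (illustrative value)

def delta(t):
    return 1.5 - 0.1 * t           # crosses T_tr**2 / 4 = 1 at t = 5

x, y = 1.0, 0.0
dt = 1e-3
for i in range(10000):             # integrate over t in [0, 10]
    t = i * dt
    x, y = x + dt * y, y + dt * (-delta(t) * x + T_tr * y)
```

The individual eigenvalues take their right-angle turn at $\Delta = T^2/4$, but the coefficients $\Delta$ and $T$ themselves vary smoothly, so the trajectory simply decays through the crossing without any singular behavior.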
Hence the numerically tractable eigenvalue problem will in principle lead to a similarly tractable numerical evaluation of the various constants that appear in equation 3.25.

Equations 6.12 and 6.14 summarize the dynamics of a neuron population in a sort of compact canonical form, in the sense that, in principle, two dynamically distinct neuron designs can lead to numerically identical examples of equations 6.12 and 6.14; but if they do, then both alternative designs would play dynamically identical roles as subpopulations in a larger array of interacting neuron subpopulations. If the effect on a specific neuron population of an elsewhere-directed pharmacological agent is to be minimal, or if the effect of a base-pair change in a mouse gene is to be maximal, then this more properly could be measured by minimal or maximal numerical changes in equations 6.12 and 6.14 rather than in the underlying dynamical equations. Regarding efficient simulation, we can
envision an elaborate neuron model with multiple electrical compartments, and a highly simplified neuron of the general integrate-and-fire variety, giving rise to numerically almost identical instances of equations 6.12 and 6.14.

7 Summary
We may anticipate that a fairly broad range of useful neuron-population simulations will exhibit a common mathematical structure. The main ingredient of that structure is a dynamical equation that describes the momentary evolution of a population density. That density exists in a state-space whose points represent the possible states of a single model neuron. The population density dynamical equation follows from the postulated dynamics of the single neuron, including the consequences of an input noise uncorrelated among different neurons of the population. The population dynamics equation is nonlinear in the common inputs from other neuron populations (which nonlinearity reflects the nonlinear dynamics of the identical individual neurons in the population) but is linear in the population density. This linearity endows the population dynamics with a structure whose broad outlines are familiar from applied mathematics and which gives insight regarding the ways the neuron population will respond to different input situations.

The second major ingredient of a neuron population dynamics model is an expression that specifies its momentary per-neuron output rate of impulses. This output rate expression depends linearly on the momentary disposition of the population density. (In the sense that the impulse train from each neuron is a point process in time, the merged impulse trains from a large population of noninteracting neurons will be a time-dependent Poisson point process.) Because the population dynamical equation is linear in the population density, it may be viewed in terms of an input-dependent population dynamical linear operator that acts on the population density. At a specified input, that operator has a set of eigenfunctions in terms of which the population density can be expanded as a linear weighted sum, and a corresponding set of eigenvalues that include zero and complex conjugate pairs with negative real parts.
Under common circumstances, the eigenvalue pairs nearest zero may be approximated by a simple orderly progression, and we may anticipate that the corresponding eigenfunctions will dominate in the expansion of the population density. We may examine how such a population dynamics model responds to input that is a small sinusoidal oscillation around an operating point. This leads to an input-output frequency response that is a simple analytic function of frequency with "resonance denominators" at the complex eigenvalues that, under common circumstances, lie near integer multiples of the population's mean firing rate.
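The shape of such a frequency response can be illustrated with a toy example: a sum of resonance denominators placed at eigenvalues of the damped-limit-cycle form of equation 2.42. All numerical values here are hypothetical:

```python
import numpy as np

# Toy frequency response built from "resonance denominators" at
# eigenvalues of the damped-limit-cycle form lambda_m = -a m^2 + i m w0
# (cf. equation 2.42). The mean firing rate and the damping constant
# are assumed, purely illustrative values.
rate = 10.0                               # mean firing rate in Hz (assumed)
w0, damping = 2 * np.pi * rate, 3.0
lams = [complex(-damping * m * m, m * w0) for m in (-3, -2, -1, 1, 2, 3)]

def gain(nu_hz):
    """Magnitude of a simple analytic response with poles at the lams."""
    w = 2 * np.pi * nu_hz
    return abs(1.0 + sum(1.0 / (1j * w - lam) for lam in lams))

g_peak1, g_between = gain(10.0), gain(15.0)   # on/off the first resonance
```

The gain peaks near 10 and 20 Hz, the integer multiples of the assumed firing rate, because each denominator $1/(i\omega - \lambda_m)$ is largest in magnitude when $\omega$ sits at the imaginary part of $\lambda_m$.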
The output of one population can be used as input to another. If several neuron subpopulations are thus coupled together, commonly a set of equilibrium firing rates may be found. A frequency response function can be assigned to each input-output pair; from these we may derive an analytic function that is a polynomial combination of those frequency responses and whose complex roots specify the possible free-running behaviors near equilibrium of the coupled set of subpopulations. A root whose real part is positive indicates a free-running behavior that is a sinusoidal oscillation times an exponentially growing envelope; hence the equilibrium is unstable in this case. In particular cases this stability analysis is quite tractable. The question of the stability of a model of an orientation hypercolumn of neurons in primary visual cortex may be reduced by symmetry to the tractable stability analysis of a much smaller system, which contains only one subpopulation of each cortical cell type.

Finally, because only fairly few early eigenfunctions proved important in exploring the analytic methods discussed above, we are led to a preliminary examination of how this feature might enable extremely efficient simulations of neuron population dynamics. One such method involves examination of the solution of the full population dynamics equation in response to a sample of typical variable input. The small subspace of function space, in which the population density always resides, may be determined by principal component analysis. The dynamical equation may be partially diagonalized into this subspace, where it becomes a modest system of linear ordinary differential equations coupled by a matrix that changes in time with the changing input. A second approach, which currently is speculative but promising, is to recast the dynamical equation in terms of coefficients of a "moving basis" of the input-dependent eigenfunctions of the dynamical operator.
This approach has the potentially useful additional feature that it reduces neuron population models to a single canonical form. This should lead, for example, to a quantitative estimate of how well an elaborate neuron model may be approximated by a simpler model. We certainly can envision models of interacting cortical neuron subpopulations whose mathematical features will include some that lie beyond those presented here. The eventual accurately descriptive model may well fall in the broader set. Still, what is presented here may help furnish a bit of guidance for theoretical developments to come.

Appendix: Estimate of the Eigenvalues

A.1 The Model. The generalized integrate-and-fire model, of equation 1.1, can be made to mimic a more detailed model fairly well, for driving inputs s(t) that do not change with unnatural abruptness. In the more detailed case of the Hodgkin-Huxley neuron, for example, the voltage equation is of
the form

$$dv/dt = F(v, m, h, n, s), \tag{A.1}$$

where $m$, $h$, $n$ satisfy differential equations of their own. For a typical fixed value of $s$, these variables follow a limit cycle, and on that $s$-dependent limit cycle, the other variables can be expressed as functions of the concurrent value of $v$, whence

$$F(v, m, h, n, s) = F\bigl(v, m(v,s), h(v,s), n(v,s), s\bigr) = \hat{F}(v, s). \tag{A.2}$$

The model equation,

$$dv/dt = \hat{F}(v, s), \tag{A.3}$$
when $s$ is held constant, will integrate to yield the exact voltage time course on the limit cycle, for each $s$-value. More generally, for time-dependent $s$, the momentary exact $m$, $h$, $n$ values will not depart far from their limit-cycle values for the present $v$ and $s$, so long as $s$ does not change by a major fraction of its value during a single cycle of $v$. The other variables relate to $v$ in different ways on the fast descending side of the voltage cycle and on the much slower ascending side. To avoid double-valuedness, we may define our variable $x$ in equation 1.1 as

$$x = \frac{1}{2(v_{\max} - v_{\min})} \times \begin{cases} v_{\max} - v & \text{(for downward motion)} \\ v_{\max} - v_{\min} + (v - v_{\min}) & \text{(for upward motion).} \end{cases} \tag{A.4}$$
If $x$ is taken modulo unity, it is periodic on the limit cycle. Substitution of $v$ in terms of $x$ in equation A.3 now gives the $f(x, s)$ of equation 1.1. If we proceed in this way, the boundary condition induced on the consequent dynamical equation, 1.4, for the population density $\rho(x, t)$, is that $\rho(x, t)$ should be smoothly periodic in $x$, with period 1. The more familiar boundary condition, for integrate-and-fire dynamics with diffusion, is an absorbing (or "no-return") upper boundary. We note the same effect is built in here, as on the final fast voltage rise of the nerve impulse, forward advection overwhelms backward diffusion.

The form of equation 1.4, as a good approximation to more elaborate and realistic dynamics, may be reached by an entirely different path. Suppose that in a state-space of higher dimension, we have properly formulated the dynamical operator in diffusion approximation. Suppose further that stochastic effects are modest enough so that the "fattened limit cycle" (discussed in section 2, equations 2.40 and 2.41), which supports the population
density function, is reasonably thin. Then our partial differential equation eigenvalue problem may be separated into a preliminary transverse eigenvalue problem, which may be solved locally at each cross-section across the limit cycle, and an axial eigenvalue problem that follows from assembling the solutions to the transverse problem. (This is the same methodology that is natural in addressing wave propagation in a waveguide of variable cross-section.) The resulting axial dynamical operator inherits both advective and diffusive terms and is of the form given in equation 1.4, although in this case $f(x, s)$ includes the effects of transverse diffusion.

A.2 Small Stochastic Effects. If the diffusion term in equation 1.4 is small, we may approach the eigenvalue problem,

$$Q w_n = \lambda_n w_n, \tag{A.5}$$
by straightforward perturbation theory. In equation 1.4, let

$$Q = Q_0 + Q_1, \tag{A.6}$$

where

$$Q_0 = -\frac{\partial}{\partial x} f(x), \qquad Q_1 = \frac{\partial}{\partial x} D(x) \frac{\partial}{\partial x} \tag{A.7}$$
(we drop the $s$ dependence from the notation, as input is held fixed in the eigenvalue problem), and we assume that inclusion of the small $Q_1$ induces changes away from both the eigenvalues and eigenvectors of $Q_0$:

$$Q_0 w_n^{(0)} = \lambda_n^{(0)} w_n^{(0)} \tag{A.8}$$

$$w_n = w_n^{(0)} + w_n^{(1)} \tag{A.9}$$

$$\lambda_n = \lambda_n^{(0)} + \lambda_n^{(1)}. \tag{A.10}$$

Substitution of equations A.6, A.9, and A.10 into A.5 (and use of equation A.8) gives the first-order part of equation A.5:

$$Q_0 w_n^{(1)} + Q_1 w_n^{(0)} = \lambda_n^{(0)} w_n^{(1)} + \lambda_n^{(1)} w_n^{(0)}, \tag{A.11}$$

where we regard terms with index zero as known, and those with index one as to be found. The perturbed eigenfunction can be expanded in terms of the unperturbed eigenfunctions:

$$w_n^{(1)} = \sum_{m \neq n} a_m w_m^{(0)}. \tag{A.12}$$
It is notable that the $m = n$ term, which would include an admixture of $w_n^{(0)}$ in $w_n^{(1)}$, is absent from the sum. This term is absent because its inclusion in equation A.9 would serve simply to rescale the unperturbed eigenvector, and the eigenvector property is not affected by a scale change. Essentially equation A.11 asks, How much must the old eigenvector be redirected to be an eigenvector of the modified operator? and a simple scale change is irrelevant to answering this question. So every term of equation A.12 is biorthogonal to the adjoint eigenvector $\hat{w}_n^{(0)}$. If all we want from equation A.11 is the first-order correction to the eigenvalue, $\lambda_n^{(1)}$, we may inner-product that equation with $\hat{w}_n^{(0)}$, and by use of equations 2.22 and 2.27, we at once find

$$\lambda_n^{(1)} = \bigl(\hat{w}_n^{(0)}, Q_1 w_n^{(0)}\bigr). \tag{A.13}$$

The inner product for the state-space of equation 1.4 is

$$(v, u) = \int_0^1 dx\, v^*(x)\, u(x). \tag{A.14}$$
We observe that

$$(v, Q_0 u) = -\int_0^1 dx\, v^*(x)\, \frac{\partial}{\partial x}\bigl(f(x)\, u(x)\bigr) = \int_0^1 dx\, \Bigl(f(x)\frac{\partial}{\partial x} v(x)\Bigr)^{*} u(x) = \bigl(Q_0^* v, u\bigr), \tag{A.15}$$
where for the second line we integrated by parts and used the periodic boundary condition discussed in section A.1. By equation A.15 the zeroth-order adjoint operator is

$$Q_0^* = f(x)\,\frac{\partial}{\partial x}, \tag{A.16}$$

and the adjoint eigenvalue equation,

$$Q_0^*\, \hat{w}_n^{(0)} = \bigl(\lambda_n^{(0)}\bigr)^{*}\, \hat{w}_n^{(0)}, \tag{A.17}$$

is

$$f(x)\,\frac{d}{dx}\hat{w}_n^{(0)} = \bigl(\lambda_n^{(0)}\bigr)^{*}\, \hat{w}_n^{(0)}, \tag{A.18}$$
which (by direct substitution back into equation A.18) has the solution

$$\hat{w}_n^{(0)}(x) = \exp\Bigl\{\bigl(\lambda_n^{(0)}\bigr)^{*} \int_0^x dx'\, 1/f(x')\Bigr\}. \tag{A.19}$$
Periodicity demands that

$$\hat{w}_n^{(0)}(1) = \hat{w}_n^{(0)}(0), \tag{A.20}$$

and this requires us to choose the eigenvalues

$$\lambda_n^{(0)} = 2\pi i n \Bigl\{\int_0^1 dx\, 1/f(x)\Bigr\}^{-1} = i n \omega_0, \tag{A.21}$$
which is of a form in agreement with equation 2.41. If we now go back to equation 1.1, in the absence of noise it gives

$$dt = dx / f(x), \tag{A.22}$$

which we may immediately integrate from $x = 0$ to $x = 1$ to obtain the transit time, or reciprocal firing frequency:

$$T = \int_0^1 dx / f(x). \tag{A.23}$$

As firing period and radian frequency are related by

$$\omega_0 = 2\pi / T, \tag{A.24}$$

this is in exact agreement with the fundamental radian frequency that appears in the eigenvalue evaluation, A.21. In the same manner as what we have just done, from equation A.7, the unperturbed dynamical operator's eigenvalue equation is

$$-\frac{d}{dx}\bigl(f(x)\, w_n^{(0)}\bigr) = \lambda_n^{(0)}\, w_n^{(0)}, \tag{A.25}$$
which, as back-substitution again confirms, has the solution

$$w_n^{(0)}(x) = C\,\bigl(1/f(x)\bigr)\, \exp\Bigl\{-\lambda_n^{(0)} \int_0^x dx'\, 1/f(x')\Bigr\}. \tag{A.26}$$
If in equation A.26 we choose the constant

$$C = T^{-1}, \tag{A.27}$$

then the eigenfunctions and their adjoints are indeed biorthonormal, as equation 2.27 becomes

$$\bigl(\hat{w}_m^{(0)}, w_n^{(0)}\bigr) = C \int_0^1 \frac{dx}{f(x)}\, \exp\Bigl\{\frac{2\pi i}{T}\,(n - m)\int_0^x \frac{dx'}{f(x')}\Bigr\}. \tag{A.28}$$
In fact equation A.22 serves as a crafty substitution that reduces equation A.28 to an easy integral and confirms equation 2.27 for this case. We may now evaluate the integral, equation A.13, for the eigenvalue's perturbation due to diffusion. We first make the observation that

$$(v, Q_1 u) = \int_0^1 dx\, v^*(x)\, \frac{d}{dx}\Bigl(D(x)\frac{d}{dx}u(x)\Bigr) = -\int_0^1 dx\, \Bigl(\frac{d}{dx}v(x)\Bigr)^{*} D(x)\, \frac{d}{dx}u(x), \tag{A.29}$$
again by parts integration and periodicity. Thus equation A.13 becomes

$$\lambda_n^{(1)} = -\int_0^1 dx\, \Bigl(\frac{d}{dx}\hat{w}_n^{(0)}(x)\Bigr)^{*} D(x)\, \frac{d}{dx}\, w_n^{(0)}(x). \tag{A.30}$$
The derivative that involves $\hat{w}_n^{(0)}$ we can simplify directly from the adjoint eigenvalue equation, A.18, to obtain a term with a factor $\lambda_n^{(0)}$. To evaluate the right-most derivative in equation A.30, we may bring equation A.25 to the form

$$\frac{d w_n^{(0)}}{dx} = -\frac{\lambda_n^{(0)} + f'(x)}{f(x)}\, w_n^{(0)}. \tag{A.31}$$

From equations A.19 and A.26, we have further that

$$\hat{w}_n^{(0)*}(x)\, w_n^{(0)}(x) = \frac{1}{T}\,\frac{1}{f(x)}. \tag{A.32}$$
When these ingredients are substituted into equation A.30, we obtain the evaluated form,

$$\lambda_n^{(1)} = \bigl(\lambda_n^{(0)}\bigr)^2 \int_0^1 \frac{dx}{T}\,\frac{D(x)}{\bigl(f(x)\bigr)^3} + \lambda_n^{(0)} \int_0^1 \frac{dx}{T}\,\frac{f'(x)\, D(x)}{\bigl(f(x)\bigr)^3} = \bigl(\lambda_n^{(0)}\bigr)^2 A + \lambda_n^{(0)} B, \tag{A.33}$$

where $A$ is clearly positive and the two integrals $A$ and $B$ are independent of the eigenvalue number $n$. (However, they do depend on the input $s$, which controls the fundamental radian frequency $\omega_0$.) The eigenvalue without diffusion, $\lambda_n^{(0)}$, is given by equation A.21; substitute that into equations A.33 and A.10, which gives the eigenvalue through zeroth and first orders, and we have

$$\lambda_n = -n^2 \omega_0^2 A + i n \omega_0 (1 + B). \tag{A.34}$$
In equation A.34 (unlike in equation 2.42), $\omega_0$ is the fundamental radian firing frequency in the absence of diffusion, and $A$ and $B$, assumed small, are the two integrals that appear in equation A.33, which are linear in the strength of diffusion. The presence of diffusion modifies the fundamental firing frequency, through the term $B$ in equation A.34; and equation 2.42 is confirmed under the often reasonable assumption of small diffusion and in the context of the fairly accurate model presented here.

A.3 Rapidly Oscillating Eigenfunctions. The eigenfunction problem
given by equation 1.4,

$$\Bigl(-\frac{d}{dx}\, f(x) + \frac{d}{dx}\, D(x)\, \frac{d}{dx}\Bigr) w_n = \lambda_n w_n, \tag{A.35}$$
may be expected to yield solutions $w_n(x)$ that oscillate more and more rapidly across the range of $x$, as $|n|$ and consequently $|\lambda_n|$ become larger. This offers us another approximation procedure, the "WKB" approximation, which does not hinge on small diffusion. We assume that the solution locally looks like a sinusoidal wave, but with a wavelength and amplitude that change gradually with position:

$$w_n(x) = A(x)\, \exp\Bigl\{i \int_0^x dx'\, k(x')\Bigr\}. \tag{A.36}$$
In the very special case where $f$ and $D$ are constant, equation A.36 solves A.35 exactly with $A$ and $k$ constant, and where $k$ must satisfy

$$-i f k - D k^2 = \lambda_n. \tag{A.37}$$
If we now insist on solutions with unit period, we must have $k = 2\pi n$, whereupon equation A.37 agrees exactly with equation 2.42. For variable $f$ and $D$, substitution of equation A.36 into A.35 yields the result that, so long as $k(x)$ is very large compared to the first derivatives of $f(x)$ and $D(x)$, neglect of the relatively much smaller terms again leads to equation A.37 locally for every value of $x$, or equivalently

$$k = \frac{-i}{2D}\Bigl\{f \pm \sqrt{f^2 + 4 D \lambda_n}\Bigr\}. \tag{A.38}$$

If we let the diffusion $D$ go to zero in equation A.38 (with the choice of the minus sign), then substitution of A.38 into A.36 again gives an exact solution, as we retrieve the zeroth-order exact solution of the previous subsection, which in fact was in the form of equation A.36. Substitution of $k(x)$ from equation A.38 into A.36 yields a condition that specifies the eigenvalue once we insist that $w_n(x)$ be periodic, as in equation A.20:

$$\int_0^1 dx\, \frac{-1}{2D(x)}\Bigl\{f(x) \pm \sqrt{\bigl(f(x)\bigr)^2 + 4 D(x)\, \lambda_n}\Bigr\} = -2\pi i n. \tag{A.39}$$
This is a transcendental equation to be solved for $\lambda_n$. If $n$ is large, then clearly $|\lambda_n|$ must be chosen large to solve the equation. But if $|\lambda_n|$ is large, then we can approximate $k$ in equation A.38 by

$$k \approx \frac{-i}{2D}\Bigl\{f + 2\,\lambda_n^{1/2} D^{1/2}\Bigl(1 + \frac{1}{8}\,\frac{f^2}{D\lambda_n}\Bigr)\Bigr\} \tag{A.40}$$

(where $\lambda^{1/2}$ implies both positive and negative signs) and the eigenvalue equation, A.39, becomes

$$-i n = \int_0^1 \frac{dx}{2\pi}\,\frac{f}{2D} + \lambda_n^{1/2} \int_0^1 \frac{dx}{2\pi}\,\frac{1}{D^{1/2}} + \frac{1}{\lambda_n^{1/2}} \int_0^1 \frac{dx}{2\pi}\,\frac{1}{8}\,\frac{f^2}{D^{3/2}} = a + \lambda_n^{1/2}\Bigl(b + \frac{c}{\lambda_n}\Bigr), \tag{A.41}$$
where $a$, $b$, $c$ are the three specific integrals, which do not depend on $\lambda_n$ or $n$. Carry $a$ to the left side by subtracting it from equation A.41, and square:

$$-n^2 + 2 i n a + a^2 = \lambda_n\Bigl(b + \frac{c}{\lambda_n}\Bigr)^2 = b^2 \lambda_n + 2 b c + \frac{c^2}{\lambda_n}. \tag{A.42}$$
If $|\lambda_n|$ is large, the right-hand term with the coefficient $1/\lambda_n$ may be dropped, and the rest may be solved for $\lambda_n$:

$$\lambda_n = -\frac{1}{b^2}\, n^2 + \frac{2a}{b^2}\, i n + \Bigl(\frac{a^2}{b^2} - \frac{2c}{b}\Bigr). \tag{A.43}$$
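A quick consistency check of equation A.43, with illustrative constants: for constant $f$ and $D$ the integrals collapse to $a = f/(4\pi D)$, $b = 1/(2\pi\sqrt{D})$, $c = f^2/(16\pi D^{3/2})$, so that $1/b^2 = 4\pi^2 D$ and $2a/b^2 = 2\pi f$, and A.43 should reproduce the exact constant-coefficient eigenvalue of equation A.37 with $k = 2\pi n$:

```python
import math

# Constant-coefficient check of equation A.43 against the exact
# dispersion relation A.37 with k = 2*pi*n. f and D are arbitrary
# illustrative constants.
f, D = 1.2, 0.05

a = f / (4 * math.pi * D)                   # integral a of equation A.41
b = 1.0 / (2 * math.pi * math.sqrt(D))      # integral b
c = f * f / (16 * math.pi * D ** 1.5)       # integral c
offset = a * a / (b * b) - 2 * c / b        # f^2/(4D) - f^2/(4D) = 0

def lam_wkb(n):
    # equation A.43
    return complex(-(n * n) / (b * b) + offset, 2 * a * n / (b * b))

def lam_exact(n):
    # equation A.37 with k = 2*pi*n: lambda = -i f k - D k^2
    k = 2 * math.pi * n
    return complex(-D * k * k, -f * k)
```

The real parts agree exactly, the imaginary parts agree up to the branch (conjugate-pair) sign, and the asymptotic offset term vanishes in the constant-coefficient case.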
This result, which should hold for large $n$, is again in agreement with equation 2.42, except for the right-hand asymptotic offset term, which does not grow with $n$.

Acknowledgments
I owe a particular debt to Lawrence Sirovich. Others who contributed valuable ideas are Daniel Amit, Dmitrii Manin, and Ahmet Omurtag. This work has been supported by NIH/NEI EY11276, NIH/NIMH MH50166, and ONR N00014-96-1-0492. Publications from the Laboratory of Applied Mathematics may be found at http://camelot.mssm.edu/.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483-1490.
Bailey, N. T. J. (1964). The elements of stochastic processes with applications to the natural sciences. New York: Wiley.
Everson, R. M., Prashanth, A. K., Gabbay, M., Knight, B. W., Sirovich, L., & Kaplan, E. (1998). Representation of spatial frequency and orientation in the visual cortex. Proceedings of the National Academy of Sciences, 95, 8334-8338.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738-758.
Gerstner, W. (in press). Population dynamics of spiking neurons: Fast transients, asynchronous states and locking. Neural Computation.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500-544.
Johannesma, P. I. M. (1969). Stochastic neural activity: A theoretical investigation. Unpublished doctoral dissertation, University of Nijmegen.
Knight, B. W. (1972a). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734-766.
Knight, B. W. (1972b). The relationship between the firing rate of a single neuron and the level of activity in a population of neurons: Experimental evidence for resonant enhancement in the population response. J. Gen. Physiol., 59, 767-778.
Knight, B. W., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations. In E. C. Gerf (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in Systems Applications. Lille, France: Cité Scientifique.
Knight, B. W., Omurtag, A., & Sirovich, L. (in press). The approach of a neuron population firing rate to a new equilibrium: An exact theoretical result. Neural Computation.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15-30.
McCormick, D. A., & Huguenard, J. R. (1992). A model of the electrophysiological properties of thalamocortical relay neurons. J. Neurophysiol., 68, 1384-1400.
Nykamp, D., & Tranchina, D. (in press). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci.
Omurtag, A., Knight, B. W., & Sirovich, L. (in press). On the simulation of large populations of neurons. J. Comp. Neurosci.
Risken, H. (1996). The Fokker-Planck equation: Methods of solution and applications. Berlin: Springer-Verlag.
Sirovich, L., & Everson, R. (1992). Management and analysis of large scientific datasets. International Journal of Supercomputer Applications, 6, 50-68.
Sirovich, L., Everson, R., Kaplan, E., Knight, B. W., O'Brien, E., & Orbach, D. (1996). Modeling the functional organization of the visual cortex. Physica D, 96, 355-366.
Sirovich, L., Knight, B. W., & Omurtag, A. (in press). Dynamics of neuronal populations: The equilibrium solution. SIAM.
Sirovich, L., Knight, B. W., & Omurtag, A. (1999a). Dynamics of neuronal populations: Eigentheory. Manuscript in preparation.
Sirovich, L., Knight, B. W., & Omurtag, A. (1999b). Dynamics of neuronal populations: Stability theory. Manuscript in preparation.
Sirovich, L., Knight, B. W., & Omurtag, A. (1999c). Dynamics of neuronal populations: Low dimensional approximations. Manuscript in preparation.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173-194.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
Wilbur, W. J., & Rinzel, J. (1982). An analysis of Stein's model for stochastic neuronal excitation. Biol. Cybern., 45, 107-114.

Received January 11, 1999; accepted April 12, 1999.
NOTE
Communicated by P. Read Montague
Local and Global Gating of Synaptic Plasticity

Manuel A. Sánchez-Montañés
Institute of Neuroinformatics, ETH/University Zürich, 8057 Zürich, Switzerland, and Grupo de Neurocomputación Biológica, ETS de Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain

Paul F. M. J. Verschure
Peter König
Institute of Neuroinformatics, ETH/University Zürich, 8057 Zürich, Switzerland

Mechanisms influencing learning in neural networks are usually investigated on either a local or a global scale. The former relates to synaptic processes, the latter to unspecific modulatory systems. Here we study the interaction of a local learning rule that evaluates coincidences of pre- and postsynaptic action potentials and a global modulatory mechanism, such as the action of the basal forebrain onto cortical neurons. The simulations demonstrate that the interaction of these mechanisms leads to a learning rule supporting fast learning rates, stability, and flexibility. Furthermore, the simulations generate two experimentally testable predictions on the dependence of the backpropagating action potential on basal forebrain activity and on the relative timing of the activity of inhibitory and excitatory neurons in the neocortex.

1 Introduction
Research on neuronal networks has been very much motivated by the ability of these systems to learn from experience (Alkon et al., 1991). In these systems information is stored in the pattern of synaptic efficacies. Among the processes influencing the modification of synaptic connections, local and global mechanisms can be distinguished (Montague, Dayan, & Sejnowski, 1996). Examples of global signals involved in learning can be found in gating mechanisms (Abbott, 1990). In the brain this might be compared to the influence of modulatory signals, arising from subcortical structures, on cortical plasticity. Here, the cholinergic system of the basal forebrain projecting to the cerebral cortex is of particular interest. It is a necessary ingredient for the induction of cortical representations following monocular deprivation (Singer & Rauschecker, 1982). In addition, it may switch between storage and recall modes in the hippocampus (Hasselmo, 1993), and it gates the plasticity of receptive fields of neurons in the primary auditory cortex during
Neural Computation 12, 519-529 (2000)
© 2000 Massachusetts Institute of Technology
classical conditioning (Weinberger, 1993; Bakin & Weinberger, 1996; Kilgard & Merzenich, 1998). These results support the suggestion that modulatory substances can act as a "print now" signal gating synaptic plasticity (Singer, von Grünau, & Rauschecker, 1979). Because the influence of such a signal onto the cortical network is homogeneous and is not related to the specifics of a given stimulus, it can be called a global mechanism.

Local mechanisms in learning and memory refer back to the classical work of Hebb (1949). Since then, many variations and physiological implementations have been suggested (Stent, 1973; Sejnowski, 1977; Buonomano & Merzenich, 1998). Recently Stuart and Sakmann (1994) observed that action potentials can propagate backward into the dendritic tree. This signal provides a possible explanation for the dependence of long-term potentiation (LTP) and long-term depression (LTD) on the relative timing of pre- and postsynaptic action potentials (Gerstner, Ritz, & van Hemmen, 1993; Markram, Lübke, Frotscher, & Sakmann, 1997). The changes in synaptic efficacy are specific to the neurons involved and may be described by rules local in space and time.

In spite of the enormous amount of work on local and global mechanisms of synaptic plasticity, their relationship has been little investigated, and, implicitly, their action is assumed to be independent of each other. Here, we investigate an alternative model, where the global and local mechanisms are intimately related. The global mechanism influences the fraction of action potentials that successfully invade the dendritic tree of the stimulated neurons, a signal exploited by the local process. We demonstrate that such an integrated mechanism allows the combination of the positive properties of its constituents, such as stability, specificity, and flexibility, in one learning rule.

2 Methods
We studied local and global mechanisms of synaptic plasticity in a neuronal network consisting of excitatory and inhibitory cortical neurons, subcortical sensory input, and input from the basal forebrain. The sensory afferents target the excitatory neurons and define their receptive fields. Initially, the synaptic efficacy of these connections is randomly distributed; it is then updated according to the learning rule described below. The excitatory cortical neurons project to inhibitory interneurons, which in turn project back to the excitatory neurons, forming a negative feedback loop. The simulated basal forebrain afferents target the inhibitory cortical neurons and have an inhibitory effect (Freund & Gulyás, 1991; Freund & Meskenaite, 1992). All neurons are simulated as integrate-and-fire units (for quantitative details, see the appendix).

The synaptic efficacy of the afferent sensory projections to the excitatory neurons evolves according to a modification of a recently proposed learning rule that uses a backpropagating action potential (Körding & König,
Local and Global Gating of Synaptic Plasticity
521
1998). First, when a “backpropagating” action potential arrives at a synapse simultaneously (i.e., within a small symmetrical temporal window) with an action potential in the presynaptic afferent ber the efcacy of the respective synapse is increased (Gerstner et al., 1993; Markram et al., 1997; Magee, Hoffman, Colbert, & Johnston, 1998; Bi & Poo, 1998). Furthermore, we studied the effects of using an asymmetrical temporal window, thus including a dependency of the synaptic modications not only on the absolute delay of pre- and postsynaptic signals, but also on the temporal order of their arrival, which matches the physiological results more faithfully. Second, activation of inhibitory synapses located at dendritic sites proximal to the soma tree may attenuate the retrograde propagation of the action potential in the dendritic tree (Spruston, Schiller, Stuart, & Sakmann, 1995; Tsubokawa & Ross, 1996). In this case the efcacies of activated synapses are decreased. In addition, we implemented heterosynaptic LTD in the simulation: synaptic efcacy is decreased in case of postsynaptic activity without coincident presynaptic activity. Thus, in this learning rule, the changes of synaptic efcacy are crucially dependent on the temporal dynamics of the neuronal network, in particular on the relative timing of the excitatory and inhibitory inputs to the cortical neurons. We used two kinds of stimulation protocols to investigate the interaction of global and local mechanisms of synaptic plasticity in analogy to experiments on auditory conditioning (Kilgard & Merzenich, 1998). In the rst protocol, a pseudo-random sequence of 500 exemplars, repeatedly selected from a pool of 10 different stimuli, was presented alone. In the second protocol, one of the stimuli was paired with activity of the basal forebrain. For both protocols, we analyzed 40 simulations, each starting with random initial connections between sensory and excitatory populations. 
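The learning rule just described can be condensed into a few lines of code. The following is a minimal illustrative sketch, not the authors' simulator: the functional form of the update (proportional to t0 / (t0 + |tpost − tpre|)) is taken from the appendix, while the amplitude values and function names are assumptions chosen for illustration.

```python
# Illustrative sketch of the plasticity rule (parameter values are assumptions;
# the paper's appendix gives the functional form and t0 = 10 ms).
T0 = 10.0             # coincidence timescale t0 in ms (Markram et al., 1997)
A_LTP = 0.05          # a_LTP > 0: potentiation on coincident pre/post spikes
A_LTD = -0.05         # a_LTD < 0: depression when the backpropagating AP is attenuated
A_HETERO_LTD = -0.01  # a_heteroLTD < 0: heterosynaptic depression of inactive synapses

def weight_update(t_pre, t_post, bap_invades, presynaptically_active):
    """Change in synaptic efficacy for one postsynaptic spike.

    t_pre, t_post          -- spike times in ms (t_pre may be None if unused)
    bap_invades            -- True if the backpropagating action potential reached
                              the synapse (i.e., was not attenuated by inhibition)
    presynaptically_active -- True if the afferent fired near the postsynaptic spike
    """
    if not presynaptically_active:
        return A_HETERO_LTD                    # heterosynaptic LTD (eq. A.3)
    window = T0 / (T0 + abs(t_post - t_pre))   # symmetric temporal window
    if bap_invades:
        return A_LTP * window                  # potentiation (eq. A.1)
    return A_LTD * window                      # depression (eq. A.2)

# Perfect coincidence with a successful backpropagating AP gives maximal LTP:
dv = weight_update(t_pre=12.0, t_post=12.0, bap_invades=True,
                   presynaptically_active=True)
print(dv)  # 0.05
```

Note how the update depends only on the absolute timing difference; the sequence-sensitive control rule discussed later would additionally multiply by the sign of (tpost − tpre).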
To control for dependencies on the details of the implementation, we conducted control runs using neurons without a refractory period. All simulations were performed using IQR 427 (Verschure, 1997). 3 Results
As a first step, we investigated the relationship between the formation of representations of input stimuli and the frequency of their occurrence. Two of the stimuli were presented four times more often than the other eight (probability of occurrence of stimuli 1 and 2: 0.2500 versus stimuli 3 to 10: 0.0625). After about 200 stimulus presentations, the network had converged and stabilized. The resulting specificity of the excitatory cortical units showed a bimodal distribution. Neurons either received comparable input upon presentation of any stimulus, and thus were completely untuned (44.8%), or they responded highly specifically to only 1 of the 10 stimuli (50.5%). Few neurons showed intermediate degrees of specificity (4.7%). This is in good accord with our previous results (Körding & König, 1998). Next, we investigated the distribution of stimulus specificity for the whole population of neurons. Figure 1A shows that the number of neurons responding to the stimuli presented often (stimuli 1 and 2, red and orange, respectively) is similar to the size of the representation of stimuli presented rarely (stimuli 3–10, yellow to violet). Compiling results from 40 runs, we did not find a significant influence of presentation frequency on the size of the representation (3.0 ± 0.4 versus 3.0 ± 0.6, mean ± standard deviation; Figure 1C, green curve). Allowing a stronger competition between different stimuli for representing neurons, by reducing the size of the network, leads to a weak dependence of the size of their representation on presentation frequency (data not shown). Thus, the local mechanism alone leads to the formation of stable and specific representations of stimuli, which are largely independent of the frequency of stimulus presentation. Next, we investigated the role of the global mechanism. The activation of the nucleus basalis neuron leads to a long-lasting hyperpolarization of the inhibitory neurons. This necessitates a longer integration period for the inhibitory neurons until the threshold is reached. Nevertheless, with the chosen set of parameters, the inhibitory neurons do not fire in bursts, and those excitatory inputs, which without nucleus basalis activation were falling into the refractory period, now contribute to the activation of the inhibitory neurons. Thus, the mean activity of the inhibitory neurons does not change. This is consistent with experimental results, which found highly complex effects of basal forebrain stimulation and direct ACh application that cannot be characterized by straightforward changes in mean firing rates of cortical neurons (Sato et al., 1987; Francesconi, Müller, & Singer, 1988; Murphy & Sillito, 1991; Jimenez-Capdeville, Dykes, & Myasnikov, 1997).
The main effect of nucleus basalis stimulation in the model is a delay of the inhibitory activity relative to the excitatory activity in the cortical circuit by about 3.3 milliseconds (see Figure 1B). As a result, a much larger fraction of the action potentials generated by the excitatory neurons invade the dendritic tree (21 ± 16% and 87 ± 17% without and with nucleus basalis activation, respectively). In this case the number of neurons representing the stimulus that is paired with nucleus basalis activation (stimulus 8, mid-blue) increases about 10-fold (see Figure 1A, lower panel). The representations of the other stimuli, however, remain constant. Again, compiling results from a total of 40 runs, we did not find a significant difference between the sizes of the representations of stimuli shown at different frequencies (3.0 ± 0.4 versus 2.9 ± 0.6, high and low rate, respectively). However, the representation of the paired stimulus was significantly increased (31.7 ± 19.8; see Figure 1C, blue curve). Thus, the global mechanism biases the size of the representation toward the paired stimulus, while allowing the map to preserve a stable representation of all stimulus patterns. As a control, a modified learning rule was studied. In case the backpropagating action potential successfully invades the dendritic tree, the sign of the change of synaptic efficacy is dependent on the sequence of arrival of pre- and postsynaptic action potentials. An interesting difference was found.
Local and Global Gating of Synaptic Plasticity
Figure 1: (A) The tuning of the 100 input neurons is shown in the upper panel using a false color code. Low frequencies correspond to reddish colors, high frequencies to violet, and intermediate frequencies follow a rainbow pattern. The distribution of the preferred frequency of the cortical excitatory units after training, without nucleus basalis activity, is shown in the middle panel using the same color code. Hue indicates the preferred stimulus frequency, while saturation indicates the tuning strength. Each square corresponds to one neuron. For better visibility, they are arranged in five rows in order of increasing preferred stimulus frequency. Thus, the graph might be compared to a top view onto the cortex as used by Kilgard and Merzenich (1998). In the lower panel, the tuning of cortical excitatory neurons is shown, while one stimulus (8, mid-blue) was paired with nucleus basalis activity. (B) Temporal relation of the activity of cortical inhibitory neurons in relation to activity of excitatory neurons with (blue) and without (green) simultaneous nucleus basalis activity. (C) Size of the representation of all 10 stimuli with (blue) or without (green) nucleus basalis activity paired with stimulus 8. The error bars indicate standard error of the mean.
M. A. Sánchez-Montañés, P. F. M. J. Verschure, and P. König
Those cortical neurons representing a particular stimulus did not contact all of the respective input units equally; rather, a symmetry breaking occurred, and each cortical unit connected only to a subset of afferents sufficient for its activation. This can be interpreted as an effect where a neuron cuts down to a set of inputs sufficient for its activation. With respect to the central question of the study, however, no difference was found. Each stimulus was learned by about three to four neurons, and the size of the representation was independent of presentation frequency. Stimulation of the basal forebrain unit led to a huge increase of the representation of the paired stimulus, as before. Thus, in the investigated system, the symmetry or asymmetry of the temporal window for coincidence detection of pre- and postsynaptic action potentials is not relevant for the size of cortical representations.

4 Discussion
In this study, we investigated the interaction of local and global mechanisms regulating synaptic efficacy. This relates to previous work in the auditory system of the rat and guinea pig (Weinberger, 1993; Kilgard & Merzenich, 1998). In these experiments, the animals were exposed to acoustic stimuli either with or without paired electrical stimulation of the basal forebrain. The frequent presentation of a stimulus on its own did not lead to an enlargement of its representation in primary auditory cortex. In contrast, presentation of an auditory stimulus paired with electrical stimulation of the basal forebrain induced a dramatic increase of the cortical area devoted to that particular stimulus. This matches well the behavior of the model presented here. By making the propagation of action potentials into the dendritic tree subject to attenuation by inhibitory afferents, the local learning rule implements an inhibitory mechanism (Körding & König, 1998). However, this mechanism acts on the change of synaptic efficacy, and not directly on the neuronal activity level. Recent physiological data show that in primary visual cortex, optimally activated neurons tend to fire before suboptimally activated neurons (König, Engel, Roelfsema, & Singer, 1995). Therefore, the optimally activated neurons can block synaptic plasticity in the suboptimally activated ones. As a consequence, this mechanism allows neurons that are optimally tuned to a stimulus to prevent the recruitment of more and more neurons for its representation. The first protocol demonstrates that the local learning rule leads to a size of neuronal representations that is independent of the stimulus presentation frequency. Nevertheless, particular stimuli, whether rarely or frequently presented, can be emphasized and allocated an increased cortical representation through the global mechanism, as shown in the second protocol.
In our model, the substantial GABAergic projection originating in the basal forebrain and terminating on cortical inhibitory neurons (Freund & Gulyas, 1991; Freund & Meskenaite, 1992) increases the proportion of successfully backpropagating action potentials in the cortical excitatory neurons.
Interestingly, the much better investigated basal forebrain cholinergic projection (Singer & Rauschecker, 1982) increases the fraction of backpropagating action potentials in cortical neurons (Tsubokawa & Ross, 1997). Thus, these two subcortical projections may act synergistically, enhancing each other's effect. Of course, these findings do not preclude additional actions of acetylcholine, for example, the modulation of effective intracortical connectivity (Verschure & König, 1999). In the definition of the model, several design decisions had to be made. Increasing the number of parameters in a model to make it more biologically plausible may obscure basic principles. Therefore, we decided not to address the two-dimensional topographic arrangement of feature preferences. These seem to involve independent mechanisms, and several good reviews are available on this topic (e.g., Erwin, Obermayer, & Schulten, 1995). Most notably, the learning rule evaluates only the absolute difference of the arrival times of pre- and postsynaptic action potentials. In contrast to the experimental results (Markram, Lubke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998), the sequence of their arrival is not taken into account. This simplification was made for several reasons. We did not want to create the impression that the observed effects depend on this particular feature. Indeed, in control experiments using a modified learning rule sensitive to the sequence of pre- and postsynaptic action potentials, qualitatively identical results were obtained. The requirement for temporal contiguity is sufficient to generate the observed effects. This is not surprising, as the activity statistics of the input units follow a Poisson distribution and thus did not contain additional information. In a complete large-scale model, including feedback of the "cortical" neurons onto units representing neurons in the thalamus, new and interesting properties might emerge.
These are, however, subject to current investigation and outside the scope of this article. In the model, the action of the basal forebrain is mediated by a subtle influence on the temporal relation of inhibitory and excitatory activity in the cortex, influencing the backpropagation of action potentials. This gives rise to two specific predictions. First, using classical electrophysiological techniques to measure the activity of single excitatory and inhibitory neurons, the spike-triggered average of local field potentials can be determined. This measure gives an indication of the relative timing of the recorded neurons with respect to the population activity. Therefore, it is a sensitive indicator for the predicted delay of inhibitory activity during basal forebrain stimulation. Second, the fraction of retrogradely propagating action potentials (and their resulting effects on the calcium concentration) should be increased when the basal forebrain is stimulated. This can be assessed in vivo using the recently introduced two-photon imaging technique (Denk et al., 1994). Thus, a test of these two specific predictions of the proposed mechanisms seems possible with currently available techniques. By allowing a global mechanism to affect synaptic plasticity via a local learning rule, a single integrated mechanism can be defined that combines continuous learning, stability, specificity, and flexibility. The global component, which was previously interpreted as a "print now" signal, might actually be best described as a "print bold" signal.

Appendix
The network contains an input layer, a layer of excitatory cells, a layer of inhibitory cells, and one unit representing the activity of the nucleus basalis. The input layer consists of 100 neurons and projects in an all-to-all manner to the set of excitatory cortical neurons. This connectivity is initially random, with weights homogeneously distributed between 0.3 and 0.7 in units of the postsynaptic threshold. Hence, the receptive fields of the excitatory neurons are initially unspecific. The input neurons are activated for 30 ms and generate spikes following a Poisson process. For physiological realism, we implemented a refractory period for these neurons (5 ms). As a control, simulations without such a refractory period were performed. The spatial distribution of their firing rate follows a gaussian shape, with a peak of 100 Hz and a dispersion of 3 units. The excitatory neurons are modeled as integrate-and-fire neurons with a time constant of 9.5 ms. The synapses of the projection between input and excitatory neurons are modified according to the learning rule described below. During all the simulations, the weights are kept in the 0–1 range. The set of excitatory units projects in a one-to-one manner to the inhibitory neurons. This projection is subject to a transmission delay of 2 ms. The inhibitory units project all-to-all back to the excitatory neurons. Here as well, a transmission delay of 2 ms is taken into account. The dynamics of the inhibitory neurons are identical to those of the excitatory neurons described above. In addition to excitatory input, they receive inhibitory afferents from the basal forebrain. The learning rule is a modification of one proposed in Körding and König (1998). The change of synaptic efficacy is contingent on the relative timing of pre- and postsynaptic action potentials. If these two signals are coincident on a timescale t0 = 10 ms (Markram et al., 1997), the synaptic efficacy is increased (see equation A.1, aLTP > 0).
In case the action potential is attenuated by inhibitory input within 3 ms after its initiation while propagating through the dendritic tree, the synaptic efficacy is decreased (see equation A.2, aLTD < 0). The relative timing is evaluated as before. Furthermore, the introduction of heterosynaptic depression leads to a weakening of inactive synapses in the case of postsynaptic activity (see equation A.3, aheteroLTD < 0):

$$\Delta v = a_{\mathrm{LTP}} \, \frac{t_0}{t_0 + |t_{\mathrm{post}} - t_{\mathrm{pre}}|} \tag{A.1}$$

$$\Delta v = a_{\mathrm{LTD}} \, \frac{t_0}{t_0 + |t_{\mathrm{post}} - t_{\mathrm{pre}}|} \tag{A.2}$$

$$\Delta v = a_{\mathrm{heteroLTD}} \tag{A.3}$$

with t0 = 10 ms and tpost and tpre the timing of the postsynaptic and presynaptic action potentials, respectively. For the control simulation we used a modified learning rule, where the sign of the synaptic modification depends on the sequence of pre- and postsynaptic action potentials:

$$\Delta v = a_{\mathrm{LTP}} \, \frac{\operatorname{sign}(t_{\mathrm{post}} - t_{\mathrm{pre}}) \, t_0}{t_0 + |t_{\mathrm{post}} - t_{\mathrm{pre}}|}$$
with parameters as before. For each stimulus presentation, every synapse is updated only once, even if there are several presynaptic spikes generated by the presynaptic neuron. Acknowledgments
We are grateful to Konrad Körding and Mike Merzenich for valuable discussions of the previous work on the learning rule and the experimental data, and Daniel Kiper for comments on a previous version of the manuscript. We are happy to acknowledge the support of SPP Neuroinformatics (grants 5002–44888/2&3 to P. F. M. J. V.), SNF (grant 31-51059.97, awarded to P. K.), and an FPU grant from MEC (M. A. S.-M., Spain).

References

Abbott, L. F. (1990). Learning in neural network memories. Network, 1, 105–122.
Alkon, D. L., Amaral, D. G., Bear, M. F., Black, J., Carew, T. J., Cohen, N. J., Disterhoft, J. F., Eichenbaum, H., Golski, S., Gorman, L. K., Lynch, G., McNaughton, B. L., Mishkin, M., Moyer, J. R., Olds, J. L., Olton, D. S., Otto, T., Squire, L. R., Staubli, U., Thompson, L. T., & Wible, C. (1991). Learning and memory. Brain Res. Rev., 16, 193–220.
Bakin, J. S., & Weinberger, N. M. (1996). Induction of a physiological memory in the cerebral cortex by stimulation of the nucleus basalis. Proc. Natl. Acad. Sci. USA, 93, 11219–11224.
Bi, G. Q., & Poo, M. M. (1998). Synaptic modifications in cultured hippocampal neurons: Dependence on spike timing, synaptic strength, and postsynaptic cell type. J. Neurosci., 18, 10464–10472.
Buonomano, D. V., & Merzenich, M. M. (1998). Cortical plasticity: From synapses to maps. Annu. Rev. Neurosci., 21, 149–186.
Denk, W., Delaney, K. R., Gelperin, A., Kleinfeld, D., Strowbridge, B. W., Tank, D. W., & Yuste, R. (1994). Anatomical and functional imaging of neurons using 2-photon laser scanning microscopy. J. Neurosci. Methods, 54, 151–162.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comput., 7, 425–468.
Francesconi, W., Müller, C. M., & Singer, W. (1988). Cholinergic mechanisms in the reticular control of transmission in the cat lateral geniculate nucleus. J. Neurophysiol., 59, 1690–1718.
Freund, T. F., & Gulyas, A. I. (1991). GABAergic interneurons containing calbindin D28K or somatostatin are major targets of GABAergic basal forebrain afferents in the rat neocortex. J. Comp. Neurol., 314, 187–199.
Freund, T. F., & Meskenaite, V. (1992). Gamma-aminobutyric acid-containing basal forebrain neurons innervate inhibitory interneurons in the neocortex. Proc. Natl. Acad. Sci. USA, 89, 738–742.
Gerstner, W., Ritz, R., & van Hemmen, J. L. (1993). Why spikes? Hebbian learning and retrieval of time-resolved excitation patterns. Biol. Cybern., 69, 503–515.
Hasselmo, M. E. (1993). Acetylcholine and learning in a cortical associative memory. Neural Comput., 5, 32–44.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Jimenez-Capdeville, M. E., Dykes, R. W., & Myasnikov, A. A. (1997). Differential control of cortical activity by the basal forebrain in rats: A role for both cholinergic and inhibitory influences. J. Comp. Neurol., 381, 53–67.
Kilgard, M. P., & Merzenich, M. M. (1998). Cortical map reorganization enabled by nucleus basalis activity. Science, 279, 1714–1718.
König, P., Engel, A. K., Roelfsema, P. R., & Singer, W. (1995). How precise is neuronal synchronization? Neural Comput., 7, 469–485.
Körding, K. P., & König, P. (1998). A single learning rule for the formation of efficient representations and for the segmentation of multiple stimuli. Soc. Neurosci. Abst., 24, 323.16.
Magee, J., Hoffman, D., Colbert, C., & Johnston, D. (1998). Electrical and calcium signaling in dendrites of hippocampal pyramidal neurons. Annu. Rev. Physiol., 60, 327–346.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Montague, P. R., Dayan, P., & Sejnowski, T. J. (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J. Neurosci., 16, 1936–1947.
Murphy, P. C., & Sillito, A. M. (1991). Cholinergic enhancement of direction selectivity in the visual cortex of the cat. Neuroscience, 40, 13–20.
Sato, H., Hata, Y., Masui, H., & Tsumoto, T. (1987). A functional role of cholinergic innervation to neurons in the cat visual cortex. J. Neurophysiol., 58, 765–780.
Sejnowski, T. J. (1977). Storing covariance with nonlinearly interacting neurons. J. Math. Biology, 4, 303–321.
Singer, W., & Rauschecker, J. P. (1982). Central core control of developmental plasticity in the kitten visual cortex. II: Electrical activation of mesencephalic and diencephalic projections. Exp. Brain Res., 47, 223–233.
Singer, W., von Grünau, M. W., & Rauschecker, J. P. (1979). Requirements for the disruption of binocularity in the visual cortex of strabismic kittens. Brain Res., 171, 536–540.
Spruston, N., Schiller, Y., Stuart, G., & Sakmann, B. (1995). Activity-dependent action potential invasion and calcium influx into hippocampal CA1 dendrites. Science, 268, 297–300.
Stent, G. S. (1973). A physiological mechanism for Hebb's postulate of learning. Proc. Natl. Acad. Sci. USA, 70, 997–1001.
Stuart, G. J., & Sakmann, B. (1994). Active propagation of somatic action potentials into neocortical pyramidal cell dendrites. Nature, 367, 69–72.
Tsubokawa, H., & Ross, W. N. (1996). IPSPs modulate spike backpropagation and associated [Ca2+]i changes in the dendrites of hippocampal CA1 pyramidal neurons. J. Neurophysiol., 76, 2896–2906.
Tsubokawa, H., & Ross, W. N. (1997). Muscarinic modulation of spike backpropagation in the apical dendrites of hippocampal CA1 pyramidal neurons. J. Neurosci., 17, 5782–5791.
Verschure, P. F. M. J. (1997). Xmorph: A software tool for the synthesis and analysis of neural systems. Technical Report ETH-UZ, Institute of Neuroinformatics.
Verschure, P. F. M. J., & König, P. (1999). On the role of biophysical properties of cortical neurons in binding and segmentation of visual scenes. Neural Comp., 11, 1113–1138.
Weinberger, N. M. (1993). Learning-induced changes of auditory receptive fields. Curr. Opin. Neurobiol., 3, 570–577.

Received March 11, 1999; accepted August 4, 1999.
NOTE
Communicated by Garrison Cottrell
Nonlinear Autoassociation Is Not Equivalent to PCA

Nathalie Japkowicz
Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada, B3H 1W5

Stephen José Hanson
Department of Psychology, Rutgers University, Newark, NJ 07102, U.S.A.

Mark A. Gluck
Center for Molecular and Behavioral Neuroscience, Rutgers University, Newark, NJ 07102, U.S.A.

A common misperception within the neural network community is that even with nonlinearities in their hidden layer, autoassociators trained with backpropagation are equivalent to linear methods such as principal component analysis (PCA). Our purpose is to demonstrate that nonlinear autoassociators actually behave differently from linear methods and that they can outperform these methods when used for latent extraction, projection, and classification. While linear autoassociators emulate PCA, and thus exhibit a flat or unimodal reconstruction error surface, autoassociators with nonlinearities in their hidden layer learn domains by building reconstruction error surfaces that, depending on the task, contain multiple local valleys. This interpolation bias allows nonlinear autoassociators to represent appropriate classifications of nonlinear multimodal domains, in contrast to linear autoassociators, which are inappropriate for such tasks. In fact, autoassociators with hidden unit nonlinearities can be shown to perform nonlinear classification and nonlinear recognition.
1 Introduction
Neural Computation 12, 531–545 (2000). © 2000 Massachusetts Institute of Technology.

An autoassociator is a feedforward connectionist device whose goal is to reconstruct the input at the output layer. When used with a hidden layer smaller than the input or output layer and linear activations only, the autoassociator implements a compression scheme that was shown to be equivalent to principal component analysis (PCA)—also known as singular value decomposition—by Baldi and Hornik (1989). Another interesting and related result was obtained by Bourlard and Kamp (1988), who claim, "For autoassociation with linear output units, the optimal weight values can be derived by standard linear algebra, consisting essentially in singular value decomposition (SVD) and making thus the nonlinear functions at the hidden layer completely unnecessary." They backed their claim using theoretical considerations, while Cottrell and Munro (1988) arrived at the same conclusion independently, using an experimental methodology. Paradoxically, nonlinear autoassociators are also famous for their capacity to solve the encoder problem (Rumelhart, Hinton, & Williams, 1986), a problem that cannot be solved by PCA because of the singularity of the principal components.1 Moreover, recent research suggests that nonlinear autoassociators are well suited for classification of certain types of nonlinearly separable and multimodal domains (Japkowicz, Myers, & Gluck, 1995; Petsche et al., 1996), another task for which PCA and linear autoassociation are not appropriate because of their linear restriction. In spite of the obvious contradiction exhibited by these facts, there has been little effort at explaining the source of this inconsistency. The purpose of this article is to demonstrate experimentally that Bourlard and Kamp's (1988) claim does not necessarily hold and to analyze the differences between PCA and nonlinear autoassociators when it does not. In particular, we test the performance of nonlinear autoassociators relative to PCA and linear autoassociation on a nonlinear multimodal classification problem inspired by the projected topology of a real-world domain. After showing that there is indeed a difference in the classification accuracy obtained by the linear and nonlinear devices on this domain, we explain this difference by comparing the reconstruction error surfaces constructed by each system over the test domain.
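The equivalence claimed by Baldi and Hornik (1989) can be illustrated numerically: for centered data, the best linear bottleneck of size k reconstructs by projecting onto the top-k principal subspace, so no other rank-k linear map achieves lower reconstruction error. The following NumPy sketch is an illustration of that property, not code from the article; the random data and the choice k = 3 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X -= X.mean(axis=0)               # center the data
k = 3                             # size of the linear "hidden layer" (bottleneck)

# PCA via SVD: rows of Vt[:k] span the top-k principal subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:k].T @ Vt[:k]             # rank-k projector a linear autoassociator implements
err_pca = np.sum((X - X @ P) ** 2)

# Any other rank-k linear projection (here: onto a random subspace) does no better.
Q, _ = np.linalg.qr(rng.normal(size=(10, k)))
P_rand = Q @ Q.T
err_rand = np.sum((X - X @ P_rand) ** 2)

print(err_pca <= err_rand)  # True: PCA gives the optimal linear bottleneck
```

This optimality of the principal subspace is exactly why a purely linear autoassociator has nothing to gain over PCA; the question raised below is whether the same holds once the hidden units are nonlinear.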
More specifically, our experiments demonstrate that nonlinear sigmoidal single-hidden-layer autoassociators may sometimes interpolate sets of points differently from PCA. In particular, we show that on the test domain considered, while PCA (or equivalently, linear autoassociation) exhibits flat or unimodal reconstruction error surfaces, nonlinear sigmoidal autoassociators build reconstruction error surfaces that contain multiple local valleys. In fact, the reconstruction error surfaces built by the sigmoidal autoassociators are closely related to the reconstruction error surface built by a radial basis function autoassociator, thus suggesting that when the sigmoidal autoassociator does not operate like PCA—computing a globalized solution to the interpolation problem—then it operates like a radial basis function autoassociator—computing a localized solution to the interpolation problem.2 We begin our study by discussing the limitations of Bourlard and Kamp's (1988) claim. We then demonstrate experimentally that these limitations can indeed be exceeded by standard problems within the context of classification, and we analyze these results, concluding with some speculations as to why sigmoidal autoassociators sometimes do and sometimes do not operate like PCA.

1 Kruglyak (1990) and Phatak, Choi, and Koren (1993) demonstrate that for any N, the N-input, N-output encoder problem can, actually, be represented by an autoassociator of only two nonlinear hidden units. The fact that large encodings can actually be learned by two-hidden-unit autoassociators was demonstrated by Zipser (1989), who showed that such devices are capable of learning how to encode the position of a spot (or cluster) of normally distributed points appearing at random locations on a 10 × 10 squared array.

2 Limitation of the Conventional Wisdom
When considering the results of Baldi and Hornik (1989), Bourlard and Kamp (1988), and Cottrell and Munro (1988) and the evidence that nonlinear autoassociators can solve problems that cannot be solved by PCA (Rumelhart et al., 1986; Japkowicz et al., 1995; Petsche et al., 1996), an important question comes to mind: Does the Bourlard and Kamp (1988) claim hold in all cases, or does it depend on certain assumptions concerning the underlying domain or the autoassociator? A careful look at the discussion laid out in Bourlard and Kamp (1988) reveals that the claim actually does depend on a particular condition. More specifically, let F be the nonlinear function present at the output of the hidden units of a nonlinear autoassociator (F is the function that makes the autoassociator nonlinear). The assumption Bourlard and Kamp (1988) make in order to carry out their demonstration is that for small values of x, F(x) can be approximated as closely as desired by the linear part of its power series expansion. However, this means that x = h, the vector of presynaptic hidden unit activations (i.e., the hidden unit activations prior to their transformation by F), must be in the linear range of the squashing function. This suggests that nonlinear autoassociators do not necessarily emulate PCA when the net inputs are outside the linear range of the squashing function. The remainder of the article demonstrates experimentally that there are indeed differences between the linear and nonlinear schemes and that these differences have practical consequences for the task of classification.

3 Experiments
We conducted two sets of experiments for this article.
2 Although both the sigmoidal and radial basis function autoassociators are nonlinear, all further references to nonlinear autoassociators correspond to references to sigmoidal autoassociators.
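The linear-range condition discussed in section 2 is easy to check directly: around zero the logistic function is well approximated by the linear part of its power series, σ(x) ≈ 1/2 + x/4, but the approximation breaks down once |x| leaves the linear range. A small illustration (the test points 0.1 and 4.0 are arbitrary choices):

```python
import math

def logistic(x):
    """The usual logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-x))

def linear_part(x):
    """Linear part of the power series of the logistic around 0."""
    return 0.5 + x / 4.0

# Inside the linear range, the hidden nonlinearity is effectively absent
# (the regime in which Bourlard and Kamp's argument applies):
small = abs(logistic(0.1) - linear_part(0.1))
# Outside it, the approximation fails, and PCA-equivalence need not hold:
large = abs(logistic(4.0) - linear_part(4.0))
print(small < 1e-3, large > 0.4)  # True True
```

Whether a trained autoassociator behaves like PCA thus hinges on whether its net inputs stay in this near-linear regime.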
534
N. Japkowicz, S. J. Hanson, and M. A. Gluck
3.1 Specification of the Devices and Description of the Task.
3.1.1 Devices. The devices we used in our experiments are PCA and three autoassociators. From an analytical point of view, the four devices differ: two of them are purely linear, and the other two have nonlinear capabilities. The three autoassociators are connectionist methods that differ from one another with regard to the type of hidden and output layer they use. In more detail, we compare a one-hidden-layer autoassociator whose hidden and output layers are both linear (L-L); a one-hidden-layer autoassociator whose hidden layer is nonlinear but whose output layer is linear (NL-L); and a one-hidden-layer autoassociator whose hidden layer and output layer are both nonlinear (NL-NL). In each of these devices, the function used in the nonlinear units is the usual logistic function.

3.1.2 Task. The operation of the four devices was contrasted on the task of classification. A formal definition of the classification problem typically involves an input vector x and a discrete response vector y, which are such that the pair (x, y) belongs to some unknown joint probability distribution, P. The goal of classification is to induce a function f(x) from a set (x1, y1), (x2, y2), ..., (xN, yN) of training examples, so that f(x) predicts y. Here we assume that y is a binary vector. Although classification by neural networks is typically performed using a discrimination network,3 it can also be performed using a recognition approach implemented by PCA or the autoassociator. This approach is discussed in Oja (1983) for PCA and in Hanson and Kegl (1987) for autoassociators and operates as follows.4 During the training phase, the autoassociator is taught how to reconstruct examples of the concept. Once training is completed, classification is performed on a new vector xTest by comparing its reconstruction error5 to a threshold, and assigning it to the conceptual class if the reconstruction error is smaller than this threshold and to the other class otherwise.
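The recognition-based decision rule just described can be sketched in code. In this illustrative sketch, PCA stands in for the trained device (the article notes the method is analogous for autoassociators and PCA); the data, the bottleneck size, and the threshold value are assumptions chosen for illustration, not taken from the experiments below.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Conceptual class" training data: the device learns to reconstruct it.
X_train = rng.normal(loc=0.0, scale=0.5, size=(300, 5))
mu = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
P = Vt[:2].T @ Vt[:2]              # rank-2 linear reconstruction map (the "device")

def reconstruction_error(x):
    """Squared reconstruction error of a single test vector."""
    centered = x - mu
    return float(np.sum((centered - centered @ P) ** 2))

def classify(x, threshold):
    # Assign to the conceptual class iff the device reconstructs x well.
    return reconstruction_error(x) < threshold

threshold = 2.0                    # illustrative; chosen on held-out data in practice
print(classify(mu, threshold), classify(mu + 100.0, threshold))  # True False
```

A point near the training distribution reconstructs almost perfectly and is accepted; a point far from it leaves a large residual and is rejected, which is precisely the recognition scheme described above.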
The idea behind this recognition-based classification scheme is that since the autoassociator is trained to compress and decompress examples of the conceptual class only,

3 Using such a scheme, y_i, the response variable associated with input vector x_i, takes on (in the simplest case) a value of 1 or 0, which is interpreted to mean that x_i belongs or does not belong to the conceptual class of the problem. After training the network to approximate the function f(x) from which the training examples are believed to have been generated, it is expected that if the training set was appropriately designed, the network will be able to compute the appropriate label (1 or 0) for any input vector of size d, the size of the input layer, even if that vector did not appear in the training set.
4 We describe the method for autoassociators, taking into consideration that it is similar for PCA.
5 The reconstruction error corresponds to Σ_{j=1}^{d} [x_Test^j − g_j(x_Test)]², where g is the function implemented by the autoassociator, and x_Test^j and g_j(x_Test) represent input unit j and output unit j of the network, respectively.
Nonlinear Autoassociation
535
Figure 1: Transition from the sonar detection domain to the artificial nonlinear domain.
when tested on a novel data point, it will compress and decompress it appropriately if this example belongs to the conceptual class, but it will not do so appropriately if the example does not belong to the conceptual class.

3.2 Specification of the Test Domain. In order to compare and explain the classification performance of the four devices considered in this study, two experiments were conducted for each classifier on a two-dimensional artificial domain. This problem was inspired by real-world data, as shown in Figure 1, which illustrates the transition from real-world to artificial data. Specifically, Figure 1a displays a two-dimensional representation of sonar target recognition data available from the University of California at Irvine Repository for Machine Learning. These data were compressed from 60
536
N. Japkowicz, S. J. Hanson, and M. A. Gluck
features to 2 using PCA. In this plot, black triangles and white circles correspond to actual points (mines and rocks, respectively), and data clusters are indicated by polygons. Figure 1b shows the same two-dimensional representation of sonar target recognition data, but this time with means replacing the clusters.6 Although this figure is just a two-dimensional projection of the original data set, it clearly indicates the typical kind of problem that the autoassociator has to deal with: multimodality. We chose a simple two-dimensional abstraction of the multimodality problem for benchmarking purposes and for explaining the capabilities of autoassociators. This artificial domain, illustrated in Figure 1c, consists of four conceptual data clusters with means represented by black triangles and nine counterconceptual data clusters with means represented by white circles. In more detail, the conceptual clusters are located at points (.2, .2), (.2, .8), (.8, .2), and (.8, .8), respectively; the counterconceptual clusters are located at points (.1, .1), (.1, .9), (.2, .5), (.5, .2), (.5, .5), (.5, .8), (.8, .5), (.9, .1), and (.9, .9), respectively. Each cluster is composed of 50 points normally distributed around these means with variance σ² = .01. The counterconceptual clusters are further divided into internal and external clusters. Internal counterconceptual clusters correspond to the counterconceptual data located inside the convex hull defined by the conceptual data, while external counterconceptual clusters correspond to the counterconceptual data located outside this convex hull.

3.3 Experiment Set 1. The first experiment we conducted consisted of comparing the classification accuracy of the four systems on the test domain of Figure 1c.
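The artificial domain described above can be generated directly from its stated parameters (cluster means, 50 points per cluster, variance .01); this is our own sketch, not the authors' code:

```python
import random

# Cluster means from section 3.2; standard deviation 0.1 gives variance 0.01.
CONCEPT_MEANS = [(.2, .2), (.2, .8), (.8, .2), (.8, .8)]
COUNTER_MEANS = [(.1, .1), (.1, .9), (.2, .5), (.5, .2), (.5, .5),
                 (.5, .8), (.8, .5), (.9, .1), (.9, .9)]

def make_domain(points_per_cluster=50, sigma=0.1, seed=0):
    """Generate the artificial test domain as (point, label) pairs:
    label 1 for conceptual clusters, 0 for counterconceptual ones."""
    rng = random.Random(seed)
    data = []
    for label, means in ((1, CONCEPT_MEANS), (0, COUNTER_MEANS)):
        for mx, my in means:
            for _ in range(points_per_cluster):
                data.append(((rng.gauss(mx, sigma), rng.gauss(my, sigma)), label))
    return data
```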
After having been tuned to their optimal settings, the four devices were trained on conceptual data, and a threshold was established by fitting the reconstruction errors obtained on additional conceptual data to a gaussian and setting a boundary corresponding to a prespecified confidence interval. Subsequently, the systems were tested on a testing set containing both instances and counterinstances of the concept.
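The gaussian-fit threshold can be sketched as follows; the one-sided z value of 1.88 for a roughly 97% bound is our assumption, since the article states only that a 97% confidence interval was used:

```python
import statistics

def gaussian_threshold(validation_errors, z=1.88):
    """Fit a gaussian to reconstruction errors on held-out conceptual data
    and return an upper decision boundary.  z = 1.88 (a one-sided ~97%
    normal quantile) is an assumed value, not stated in the article."""
    mu = statistics.fmean(validation_errors)
    sd = statistics.stdev(validation_errors)
    return mu + z * sd
```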
3.3.1 Tuning the Devices. The number of hidden units that the nonlinear connectionist systems used was determined by running each network with 1, 2, 4, 8, 16, 32, and 64 hidden units on five cross-validation sets containing 25 points per cluster. It was concluded that in order to reach good classification performance, the two nonlinear autoassociators had to be trained with 16 hidden units. It was also shown that the networks had converged by epoch 2000, which was thus selected as a stopping point. Optimization in this experiment was performed using the backpropagation procedure with standard learning rate and momentum of 0.05 and 0.9, respectively. For PCA, since the test domain is two-dimensional, only one or

6 Clusters of fewer than three points were ignored.
[Bar chart: average percent error over five trials (0–40) for PCA, L-L, NL-L, and NL-NL, with positive and negative misclassification plotted separately.]
Figure 2: Classification error rate of the four classification schemes. Positive misclassification rates are indicated in gray; negative ones are indicated in black.
two principal components could be used. It was determined that one principal component yields a better classification rate than two, and therefore the PCA experiment was conducted using a single principal component. Linear autoassociation also performed better with a single hidden unit than with two or more; therefore, it too was tested with a single hidden unit. The stopping criterion, learning rate, and momentum that the linear autoassociator used were the same as those used by the nonlinear devices. In the four systems, the threshold was set so as to allow for a 97% confidence interval.

3.3.2 Accuracy Rates. The results we obtained in this experiment are presented in Figure 2. This figure plots the average classification error rates over five trials obtained by the four systems on the test domain of Figure 1c. The graph shows that PCA and L-L obtained very large classification error rates, whereas the two nonlinear autoassociators obtained low error rates. This demonstrates that while PCA and L-L can technically be used for classification, they are not that useful on the problem of Figure 1c, whereas both
NL-L and NL-NL yield acceptable classification rates on that problem.7 This result thus confirms that there can be a difference in the computations performed by PCA (or linear autoassociation) and nonlinear autoassociation and that this difference has practical consequences for the task of classification. In order to analyze this difference in more depth, we conducted further experiments.

3.4 Experiment Set 2. These experiments sought to explain the results obtained in the previous set of experiments by computing and plotting the reconstruction error surfaces constructed by PCA and the three connectionist systems after being trained on the conceptual data (the data summarized by black triangles in Figure 1c). In other words, we were interested in finding out how the interpolation strategies used by the various devices considered in this study differ from one another.
3.4.1 Tuning the Devices. For experiment set 2, two different hidden unit settings were tested: expansion and compression. This was done so as to find out whether the systems operate qualitatively differently when used in a nonbottleneck or in a bottleneck fashion. In the expansion setting, the same 16-hidden-unit setting was used for the nonlinear systems as in the first experiment, while PCA and the linear autoassociator were tested with two principal components or hidden units. In other words, the optimal nonbottleneck setting for the nonlinear systems was compared to the only nonbottleneck setting possible for the linear systems.8 All the other parameters remained the same as in experiment set 1, except for the learning rate of L-L, which had to be increased from 0.05 to 0.1 for full convergence to take place. In the compression setting, the number of principal components and hidden units was restricted to one for all the systems, since it is the only bottleneck setting available given that the test domain is two-dimensional. All the other parameters remained the same as in experiment 1.

3.4.2 Expansion Results. The results we obtained using the expansion setting of experiment set 2 are presented in Figure 3, which displays three-dimensional plots of the error ratio surfaces constructed by NL-L and NL-NL, respectively. The results for PCA and linear autoassociation are not displayed since they exhibit a flat reconstruction error surface. In the graphs of Figures 3a and 3b, the plots are drawn along the x-, y-, and z-axes, where x corresponds to the x-axis of the input domain, y to its y-axis, and z is the
7 Nonlinear autoassociation-based classification was previously used in practical settings involving a CH-46 helicopter gearbox and motor monitoring (Japkowicz et al., 1995; Petsche et al., 1996) and yielded accurate results.
8 For the linear autoassociator, hidden layers larger than two are actually possible, but they are equivalent to hidden layers of size two.
Figure 3: Reconstruction errors of (a) NL-L and (b) NL-NL, with 16 hidden units, on the artificial domain generated from Figure 1c.
error ratio at every point considered in the space, such that

z(x, y) = (1/c)[(A(x) − x)² + (A(y) − y)²].    (3.1)
In the formula, x and y represent the two dimensions of each data point, A(·) is the function realized by the trained autoassociators, and c is a model-dependent scale factor used for plotting purposes. The plots of Figures 3a and 3b are particularly helpful for understanding the nature of the solutions computed by the nonlinear autoassociators and for contrasting them with the solution obtained by the linear autoassociator and PCA. In particular, they show that these error surfaces are qualitatively different from the ones computed by PCA and the linear autoassociator: while the nonlinear autoassociators build multiple-local-valley representations of the underlying domain (see Figures 3a and 3b), PCA and the linear autoassociator learn how to reconstruct the domain perfectly and exhibit a flat reconstruction error surface. This suggests that the solutions computed by autoassociators with nonlinearities in their hidden layer use a multimodal interpolation bias that the linear methods do not use.9 Our results of Figure 3 can be used to explain the nonlinear autoassociators' (NL-L and NL-NL) results of Figure 2, which show that nonlinear autoassociators are capable of classifying the nonlinearly separable and multimodal domain of Figure 1c. Indeed, since the reconstruction error surfaces of the nonlinear autoassociators include several local valleys centered at each of the positive clusters (see Figures 3a and 3b), the error ratios of the counterconceptual data can all be appropriately high relative to their conceptual counterparts, whether they are located in the interior, on the sides, or in the exterior of the square underlying the conceptual components. Thus, all the counterconceptual data get appropriately differentiated with respect to the conceptual data. The classification results obtained by PCA and linear autoassociation in Figure 2 will be explained next, since they were obtained with bottleneck devices.

3.4.3 Compression Results. The results obtained in the previous set of experiments demonstrate that nonlinear autoassociators are not equivalent to linear autoassociators or PCA in the case where the number of hidden units exceeds the number of input units or where the number of principal components is equal to the domain dimensionality. Autoassociators, however, are most typically used as compression devices with a number of hidden units smaller than the number of input units. We now discuss whether the same result also holds in this case. Figures 4a, 4b, and 4c display the reconstruction error surfaces obtained by PCA or L-L (Figure 4a), NL-L (Figure 4b), and NL-NL (Figure 4c) using a single principal component or hidden unit. These reconstruction errors were also obtained using equation 3.1.

The reconstruction error surfaces obtained by these devices show that none is appropriate for full classification of the test domain of Figure 1c, since each can only classify all (or most) of the data (whether conceptual or counterconceptual) along one diagonal of the input space as conceptual and all (or most) of the other data (again, whether conceptual or counterconceptual) as counterconceptual. In other words, because they do not interpolate the training set fully, these devices are not useful for classification, and this explains the high classification error rate obtained by the linear systems in Figure 2.10 Nevertheless, these results are interesting since, as in the nonbottleneck case, they do illustrate the difference between the linear and the nonlinear schemes. Indeed, while the linear systems display an undistorted surface with no saddle points (see Figure 4a), the nonlinear ones have saddle

9 As expected from Baldi and Hornik (1989), our results show that the linear autoassociator is capable of emulating PCA. However, in order for the two paradigms to be equivalent, the linear autoassociator had to be trained with a learning rate of 0.1 (instead of the learning rate of 0.05 used by the nonlinear systems). When a learning rate of 0.05 was used, the linear autoassociator got stuck in a saddle point (see Baldi & Hornik, 1989) and returned a constant output corresponding to the mean of the four training clusters. This observation is interesting in its own right because it underlines the difference between linear and nonlinear autoassociators even prior to the linear autoassociator's full convergence. Indeed, it suggests that the linear autoassociator tackles the reconstruction problem globally, using a unimodal bias, whereas the nonlinear autoassociators use local strategies characterized by their multimodal interpolation biases.
10 While for the test domain used in this study, classification can be achieved only by using a number of hidden units greater than the number of input units, in the work of Japkowicz et al. (1995) and Petsche et al. (1996), the number of hidden units is smaller than the number of input units. This suggests that the hidden-to-input unit ratio does not have any bearing on the behavior of nonlinear autoassociators in classification tasks, since what really matters is their capacity to interpolate the training set, regardless of how many hidden units that may take.
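The error-ratio surface of equation 3.1 can be evaluated pointwise; a minimal sketch, where A is a per-coordinate reconstruction function (hypothetical interface) and c the plotting scale factor:

```python
def error_ratio(x, y, A, c=1.0):
    """Equation 3.1: z(x, y) = ((A(x) - x)**2 + (A(y) - y)**2) / c,
    where A stands in for the trained autoassociator's reconstruction map
    and c is a model-dependent scale factor used for plotting."""
    return ((A(x) - x) ** 2 + (A(y) - y) ** 2) / c
```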
Figure 4: Reconstruction errors of (a) PCA or L-L (with one principal component or one hidden unit), (b) NL-L, and (c) NL-NL (with one hidden unit) on the artificial domain generated from Figure 1c.
points (see Figures 4b and 4c). Thus, the compression results demonstrate that nonlinear autoassociation is not equivalent to linear autoassociation or PCA even when the number of hidden units or principal components is smaller than the number of input units.

4 Discussion
We now analyze the results we obtained and explain why our observations depart from the expectations derived from the past literature. In addition, we compare the results of the nonlinear autoassociators to the results of another well-known paradigm, radial basis functions, and speculate on what nonlinear autoassociators compute and why they do so.

4.1 Nonlinear Autoassociation Versus PCA and Linear Autoassociation. The difference between the linear and nonlinear systems can be explained by the fact that for the particular test domain considered,11 Bourlard and Kamp's (1988) assumption has been violated. Indeed, their discussion was shown to hold only in the case where the inputs to the nonlinear activation function of the hidden units remain in the linear range of the squashing function. An observation of the presynaptic activations of the 16 hidden units of NL-NL reveals that, on average, these values are equal to μ = 0.6953 with variance σ² = 0.1919. Similarly, for NL-L, we found that μ = .3222 and σ² = .0548. Although for input values in the [−1, 1] interval the sigmoid function is close to linear, our results seem to suggest that a small violation of the Bourlard and Kamp (1988) assumption can make a big difference in certain contexts. While this difference might be marginal and thus overlooked when considering the hidden layer alone, it is magnified in the output layer, as suggested by the plots of Figures 3a and 3b, which do not display a flat surface of the sort computed by PCA or linear autoassociation. This remark also holds for the encoder problem of Rumelhart et al. (1986), which can be solved by an autoassociator with nonlinear hidden units but not by a linear one. In this problem, nonlinearities in the hidden layer are also shown to have an impact.

4.2 What Do Nonlinear Autoassociators Compute? A Comparison with RBF Autoassociators. We will now attempt to establish the nature of the
computations taking place in the nonlinear autoassociators by comparing their reconstruction error surfaces to the reconstruction error surface obtained by a radial basis function (RBF) autoassociator with gaussian activations in the hidden units and linear output activations. The reconstruction error surface obtained by such a device on the domain of Figure 1c is plotted in Figure 5. The particular RBF network that yielded this figure has a capacity of four hidden units, and the variance of the gaussian functions is fixed and set to 2.8. The reconstruction error surface of this autoassociator was computed in the same fashion as those of the other autoassociators, using equation 3.1. The similarity between the reconstruction error surfaces obtained by the nonbottleneck nonlinear autoassociators (especially NL-L, which also has linear output activations) and the RBF autoassociator suggests that, like the RBF network, the nonlinear autoassociators use their 16 hidden units to put centers down over the four clusters representing the training data and evaluate the distance of the testing data to these centers.12 This result, together with the results of Bourlard and Kamp (1988), suggests that nonlinear autoassociators can resort to one of two modes of operation: either they operate like PCA, seeking a globalized solution to the reconstruction problem they are trying to solve, or they operate like an RBF autoassociator, seeking a localized solution.
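The localized, centers-plus-distances behavior attributed above to the RBF-like mode can be caricatured as a nearest-center novelty score; this proxy is our own illustration, not the trained network:

```python
import math

# The four conceptual cluster centers of the artificial domain (Figure 1c).
CENTERS = [(.2, .2), (.2, .8), (.8, .2), (.8, .8)]

def novelty_score(point, centers=CENTERS):
    """Distance of a test point to the nearest cluster center -- a crude
    stand-in for the localized, RBF-like behavior described in the text."""
    return min(math.dist(point, c) for c in centers)
```

Points near a conceptual cluster score close to zero, while counterconceptual points (for example, the internal cluster at (.5, .5)) score higher, mirroring the local valleys of Figures 3a and 3b.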
11 For other classes of domains, this might not be the case.
12 We thank Gary Cottrell for suggesting this comparison.
Figure 5: Reconstruction errors of the RBF autoassociator on the artificial domain generated from Figure 1c. The network has a capacity of four hidden units, and the variance of its gaussian functions is fixed and set at 2.8.
4.3 On the Duality of Nonlinear Autoassociators: Our Speculation.
The question we now ask is: When does a sigmoidal autoassociator resort to one or the other of the two modes of operation described in the previous section? Because of the current lack of tools available for fully analyzing feedforward neural networks, we can only speculate as to when each phenomenon will take place. In particular, we suggest that the sigmoidal autoassociator will make use of its nonlinearities when the data set it attempts to reconstruct is multimodal and the different clusters that compose it are very spread out.13 Indeed, if the data set is unimodal, or if it is multimodal but can easily be confused with unimodal data, then every point in the training set activates hidden unit values located in the same vicinity that increase or decrease monotonically. Such data can thus be approximated linearly, in the same fashion as they would be by a linear autoassociator. On the other hand, if the data set is multimodal and the different clusters constituting it are very spread out, then approximation of their hidden unit activation values by a linear function will fail because of the discontinuities introduced by the multimodal nature of the domain. In such cases, the sigmoidal part of the activation function is very useful, since it permits the local processing of separate clusters in the same fashion as in RBF autoassociators. Indeed, by using a sigmoidal function, a hidden unit can map all the data contained in different clusters to approximately the same default activation value (of 0 or 1), while it can map the data in some other clusters, which fall in the linear part of the sigmoidal function, to more interesting values. Each hidden unit can then use the same strategy for different clusters, and thus a

13 Support for this speculation can be found in Japkowicz (1999), which shows a clear decrease in classification accuracy by NL-NL as the conceptual clusters of the domain in Figure 1c are moved closer to one another.
global solution can be found by the overall network by compounding the localized solutions computed by each hidden unit. Because in the encoder problem the autoassociator is expected to reconstruct data points located far away from each other (they are located at extreme corners of the input space), this speculation also explains why the nonlinear autoassociator resorts, in that case, to a localized solution not available to PCA.
5 Summary
The purpose of this article was to address the paradox raised by the claim of Bourlard and Kamp (1988) that, in autoassociation, nonlinearities in the hidden units are completely unnecessary, given that nonlinear autoassociators are known to perform tasks that cannot be solved by PCA or linear autoassociation. We began by demonstrating that although both PCA (or linear autoassociation) and nonlinear sigmoidal autoassociation can be used for classification, the linear systems are not appropriate for certain types of multimodal and nonlinear domains, whereas the nonlinear systems are capable of classifying such domains accurately. An explanation of this result is provided by the observation of the interpolation surfaces constructed by the linear and nonlinear systems when trained on the conceptual examples of our classification test domain. The difference in the interpolation and, as a result, in the classification scheme used by the linear and nonlinear devices thus contradicts the claim by Bourlard and Kamp (1988) initially considered. This contradiction is resolved by extracting the assumption on which the claim is based and showing that this assumption is not necessarily always verified. In a last step, we note that when the nonlinear autoassociator does not operate in a linear fashion, it emulates an RBF autoassociator and thus computes a localized solution to the problem it attempts to solve. Speculations as to when nonlinear autoassociators resort to a linear operative mode and when they do not are then formulated and conclude the article.
Acknowledgments
We express our gratitude to Gary Cottrell and an anonymous reviewer for their extremely helpful comments. We also thank Inna Stainvas for useful discussions about autoassociators and for her careful reading of a version of the manuscript for this article. This research was supported by the Office of Naval Research through grant N0014-88-K-0112 to M. A. G. We are also grateful to the School of Mathematical Science of Tel-Aviv University for its financial support and to the Cognitive Science Laboratory of Princeton University for the use of computing facilities.
References

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53–58.
Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59, 291–294.
Cottrell, G. W., & Munro, P. (1988). Principal component analysis of images via back propagation. In Proceedings of the Society of Photo-Optical Instrumentation Engineers. Cambridge, MA.
Hanson, S. J., & Kegl, J. (1987). PARSNIP: A connectionist network that learns natural language grammar from exposure to natural language sentences. In Proceedings of the Ninth Annual Conference on Cognitive Science.
Japkowicz, N. (1999). Concept-learning in the absence of counter-examples: An autoassociation-based approach to classification (Technical Report DCS-TR-390). Dept. of Computer Science, Rutgers University.
Japkowicz, N., Myers, C., & Gluck, M. A. (1995). A novelty detection approach to classification. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 518–523).
Kruglyak, L. (1990). How to solve the N bit encoder problem with just two hidden units. Neural Computation, 2, 399–401.
Oja, E. (1983). Subspace methods of pattern recognition. New York: Wiley.
Petsche, T., Marcantonio, A., Darken, C., Hanson, S. J., Kuhn, G. M., & Santoso, I. (1996). A neural network autoassociator for induction motor failure prediction. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Phatak, D. S., Choi, H., & Koren, I. (1993). Construction of minimal N-2-N encoders for any N. Neural Computation, 5, 783–794.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (pp. 318–364). Cambridge, MA: MIT Press.
Zipser, D. (1989). Programming neural nets to do spatial computations. In N. Sharkey (Ed.), Models of cognition: A review of cognitive science. Norwood, NJ: Ablex.

Received August 8, 1997; accepted May 14, 1999.
LETTER
Communicated by David Wolpert
No Free Lunch for Noise Prediction

Malik Magdon-Ismail
California Institute of Technology, Pasadena, CA 91125, U.S.A.

No-free-lunch theorems have shown that learning algorithms cannot be universally good. We show that no free lunch exists for noise prediction as well. We show that when the noise is additive and the prior over target functions is uniform, a prior on the noise distribution cannot be updated, in the Bayesian sense, from any finite data set. We emphasize the importance of a prior over the target function in order to justify superior performance for learning systems.

1 Introduction
Already established in the learning theory literature are a set of "no-free-lunch" (NFL) theorems for various types of learning systems (Wolpert, 1996a, 1996b; Zhu, 1996; Cataltepe, Abu-Mostafa, & Magdon-Ismail, 1999) and for optimization methods (Wolpert & Macready, 1997). The essential content of these theorems is that no learning algorithm can be universally good. A learning algorithm that performs exceptionally well in certain situations will perform comparably poorly in other situations. Thus, in particular, random algorithms that always yield average performance, cross validation, bootstrapping, and anti-cross validation are all, in some sense, on an equal footing. Stopping early at a higher training error than the minimum achievable gains nothing if the hypotheses yielding that training error are chosen with equal probability. A search technique that performs well on one cost function will perform poorly on another cost function. These existing results demonstrate that for certain problems, given a certain criterion of performance, it is not possible to do well consistently in the most general setting. Thus, one needs to make some assumptions about the underlying structure of the problem to be able to claim any kind of superior performance. Here we look at the problem of inferring the noise distribution from a data set. The existing NFL results look at the problem of learning a dependency (not necessarily deterministic) from a finite data set and making statements about out-of-sample performance. The problem we treat here is to deduce properties of the noise distribution (for example, the noise variance) from a finite data set that consists of input-output pairs. The outputs have a deterministic dependence on the inputs (the target function) and a noisy component. One approaches the problem with some prior belief about the noise distribution, and one hopes that the data will enable a fine-tuning of that prior

Neural Computation 12, 547–564 (2000)
© 2000 Massachusetts Institute of Technology
548
Malik Magdon-Ismail
Figure 1: Two sample noisy data sets. The sets were created by taking two functions and adding noise to them. The x-values are the same in both cases.
belief to the extent that more precise statements can be made about the properties of the noise. An accurate estimate of the noise distribution is useful because the noise sets a fundamental limit on the performance of a learning system (Cortes, Jackel, & Chiang, 1994). Noise may play other roles as well: for example, in financial markets, the noise level is itself a tradable quantity. Our criterion for selection of a noise distribution will be to maximize its posterior probability. We will show that having a uniform prior on the possible realizations of the target function leads to no update for a well-behaved prior on the noise distributions (in the Bayesian sense) from any finite data set. Thus the posterior over noise distributions will equal the prior, and we may as well not even have looked at the data. One can compare this to the analogous NFL result of Wolpert: if the prior over target functions is uniform, then the out-of-sample error is constant, and thus one need not have even looked at the data before picking a hypothesis function. Let us consider the question posed in Figure 1. Which data set has more noise? Figure 2 shows the same data, but this time with the function that created
No Free Lunch for Noise Prediction
549
Figure 2: Noisy data and the functions that generated them. The same noisy data sets that were shown in Figure 1 and the functions that were used to produce them. In (b), the noise variance is roughly six times larger.
the data. Now which data set has more noise? The task is considerably easier given this extra information. We will show that some extra information of this form is essential. This article is organized as follows. We start with an example in section 2, where we attempt to predict a noise variance when learning using linear models. In this section we show that one method for predicting the noise variance using a bootstrap technique does not work in the naive form presented. As a result, one might be tempted to spend a lot of effort trying to find a more sophisticated technique that might work. Showing that this would be fruitless (for this specific case and for the more general case) is the purpose of the remainder of the article. In section 3, we set up the problem of noise prediction in general; section 4 presents the main results, beginning with the simple case of Boolean functions and continuing to the more general case. We show that information about the noise distribution cannot be obtained from a finite data set if the target distribution is assumed uniform. We conclude with section 5, where we also discuss the implications of the NFL theorems.
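The central claim can be checked numerically in the Boolean case (our own illustration, assuming additive XOR noise and each input observed once, not code from the article): under a uniform prior over target functions, the marginal likelihood of the data is the same for every noise level, so the posterior over noise levels never moves off the prior.

```python
from itertools import product

def noise_posterior(data, flip_probs):
    """Posterior over the flip probability p of additive Boolean noise
    (y = f(x) XOR noise), under a uniform prior over Boolean target
    functions f and a uniform prior over the candidate values of p.
    Assumes each input appears at most once in `data`."""
    xs = sorted({x for x, _ in data})
    marginals = []
    for p in flip_probs:
        total = 0.0
        for f_vals in product((0, 1), repeat=len(xs)):  # all 2^n targets
            f = dict(zip(xs, f_vals))
            like = 1.0
            for x, y in data:
                like *= (1 - p) if y == f[x] else p
            total += like
        marginals.append(total / 2 ** len(xs))  # average over targets
    s = sum(marginals)
    return [m / s for m in marginals]
```

Running this on any data set with distinct inputs returns identical posterior weight for every candidate noise level, which is exactly the no-update result stated above.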
2 An Example Using Linear Models
In this section we illustrate the conclusions of this article for the case of linear learning models on the input space R^d. It is not our goal to be strictly rigorous, but rather to illustrate the point. Let the hypothesis functions be of the form

g_w(x) = w · x + w_0,    (2.1)
where w, w_0 are called the weights. We are given a data set D = {x_α, y_α = f(x_α) + n_α}, α = 1, ..., N. Here f(·) is some deterministic function (we place no other restrictions on it), and n_α is noise, drawn independently and identically distributed with zero mean and variance σ². We use mean squared deviation as the measure of error. Define the training error as

E_tr = (1/N) Σ_α (w · x_α + w_0 − y_α)².    (2.2)
Dene w ¤ , w ¤0 as those weights that minimize Etr . It can then be shown that at this minimum, the expected training error is given by (theorem 4 in the appendix)1 hEtr¤ i
,n
D E0 C s 2 ¡
¡ ¢
s 2 ( d C 1) C B 1 CO 3 N N2
.
(2.3)
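Equation 2.3 is easy to probe numerically. The following sketch (not from the article; the setup and variable names are ours) checks the realizable case f = 0, for which E_0 = 0 and B = 0, so the prediction reduces to ⟨E_tr*⟩ = σ²(1 − (d + 1)/N):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, sigma2, trials = 3, 50, 4.0, 4000

errs = []
for _ in range(trials):
    X = rng.standard_normal((N, d))
    X1 = np.hstack([np.ones((N, 1)), X])      # prepend the constant component w_0 multiplies
    y = rng.normal(0.0, np.sqrt(sigma2), N)   # target f = 0, so E_0 = 0 and B = 0
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = X1 @ w - y
    errs.append(r @ r / N)                    # training error at the minimizing weights

mean_etr = np.mean(errs)
predicted = sigma2 * (1 - (d + 1) / N)        # E_0 + sigma^2 - sigma^2 (d+1)/N
print(mean_etr, predicted)
```

Averaged over many data sets, the empirical training error matches the σ²(1 − (d + 1)/N) prediction; note how it underestimates σ² for small N, foreshadowing the bias discussed next.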
We use ⟨·⟩ to denote expectations; the expectation here is over the possible data sets (the inputs and the noise, where we have made the expectation over the noise explicit).² E_0 is the bias: the expected training error of the best hypothesis function in our learning model. B is a constant given by theorem 4; for the purposes of this section, its exact expression is not essential. It suffices to know that B is related to the variance of a sample statistic around its population value.

The goal here is to estimate the noise variance, σ². The training error, used as an estimate of the noise variance, is asymptotically biased: for large N, it overpredicts the noise variance by the bias, E_0. Since we know the form of ⟨E_tr*⟩, perhaps we can do better. We can sample N_1 data points from the N¹ data points and estimate the training error. We use bootstrapping (Shao & Tu, 1996) to estimate ⟨E_tr*(N_1)⟩, and by varying N_1 we can fit the resulting dependence of ⟨E_tr*(N_1)⟩ on 1/N_1 using equation 2.3. Let the slope of this fit be Â. From equation 2.3, we see that Â is an estimate of σ²(d + 1) + B; therefore Â/(d + 1), used as an estimate of σ², is off by B/(d + 1). Perhaps we can estimate B by the bootstrap as well: given a bootstrapped estimate of B, we could combine it with Â to obtain an estimate of σ². A more careful analysis, presented in the appendix, shows that this too fails.

Thus, to get information about σ², perhaps a more subtle approach, such as looking at terms of higher order in 1/N, would work. Perhaps some method other than simply using the training error would yield a better result. Instead of linear models, why not use a more complex model for which E_0 is zero? Then, at least asymptotically, the training error will approach σ² (provided the model is not too complex). However, for any finite N, we run the risk of overfitting the data, perhaps even obtaining the estimate σ² = 0. Perhaps there is a systematic way of choosing the model complexity as a function of N to get an optimal noise-variance estimate. The next few sections show that such an exhaustive search will necessarily be fruitless unless we make some statement about f; here it would suffice to be told "how good our model is at estimating f" (i.e., E_0). Rather than explore every other potential method for estimating σ² in this linear scenario, we will set up a more general framework for the estimation of noise distributions and show that, unless some assumptions are made about the target function, all noise models are just as probable given the data. The intuition is that the data points are consistent with any noise level if we do not restrict the target function. We pursue these issues next.

¹ We use standard order notation as N → ∞: f(N) = o(g(N)) means f(N)/g(N) → 0; f(N) = O(g(N)) means f(N) ≤ C g(N) for some C > 0.

² In the language of the NFL theorems (Wolpert, 1996b), ⟨E_tr*⟩_{,n} is E(C | f, m), where the on-training-set likelihood and the quadratic loss function are used.

3 Problem Set-Up
A finite data set D = {x_α, y_α}_{α=1}^l is given, where x_α ∈ R^d, y_α = f(x_α) + n_α, and f : R^d → R; the data set consists of l data points.³ We concentrate on the case where the noise distribution is fixed; the results can be modified to accommodate noise distributions that vary with x. It will always be understood that α indexes points in the data set. We concentrate on compact spaces and assume that the input space is the unit hypercube in d dimensions. Let the output space be the bounded interval [−C, C]. Discretize the input and output spaces as follows. Let the input variable, x, take on values in the set

    X = {0, Δ_x, 2Δ_x, …, 1 − Δ_x, 1}^d,   (3.1)

³ We use l rather than the customary N to avoid confusion with N_x, N_y, to be defined shortly.
Figure 3: Discretized function space. The space contains only a finite number of functions. Arbitrary precision can be obtained by making the grid finer and finer.
so there are N_x = 1 + 1/Δ_x possible values for each input variable. Let the output variable, y, take a value in the set

    Y = {−C, −C + Δ_y, …, 0, Δ_y, 2Δ_y, …, C − Δ_y, C},   (3.2)

so there are N_y = 2C/Δ_y + 1 possible y values. A function f will be a mapping f : X → Y. There are 𝒞 = N_y^{N_x^d} different functions. By making Δ_x, Δ_y as small as we choose, we can achieve arbitrary precision (see Figure 3). In all applications, only a finite number of functions are available (computers are finite-precision machines). Because of this finiteness, we can define a probability distribution on the function space, that is, a prior over the functions. One natural choice is the uniform prior, which assigns to every function the probability 1/𝒞. We assume that C can be made as large as we wish. Further, without loss of generality, we can rescale the outputs by 1/Δ_y; therefore we can assume that y ∈ {−C̃, −C̃ + 1, …, C̃ − 1, C̃}, with N_y = 2C̃ + 1 and C̃ = C/Δ_y. With this representation, the noise n ∈ {0, ±1, ±2, …} has a distribution given by
a vector P, where P_i = Pr[n = i] and Σ_i P_i = 1. Assume a prior probability measure g(P) on the possible noise distributions. We are interested in the Bayesian posterior on these noise distributions,

    g(P | D) = Pr[D | P] g(P) / Pr[D],   (3.3)

where Pr[D] = Σ_P Pr[D | P] g(P). We can calculate Pr[D | P] as follows:

    Pr[D | P, f] = ∏_{α=1}^{l} Pr[n_α = y_α − f(x_α)].   (3.4)
Note that the right-hand side implicitly depends on P through the probabilities of the noise realizations n_α. Multiplying by Pr[f] and summing over f, we have

    Pr[D | P] = (1/𝒞) Σ_{f(x_1)} Σ_{f(x_2)} ⋯ Σ_{f(x_l)} ∏_{α=1}^{l} Pr[n_α | f, D]
             (a)= (1/N_y^l) ∏_{α=1}^{l} Σ_{f(x_α)=−C}^{C} Pr[n_α | f, D],   (3.5)

where we use the notation Pr[n_α | f, D] to mean Pr[n_α = y_α − f(x_α)]; step (a) follows when we sum over all the points not in the data set and use the uniform prior, Pr[f] = 1/𝒞.

4 No Free Lunch for Noise Prediction

4.1 Boolean Functions. We use the following notation:

    x ∈ {0, 1}^d,
    f : x → {0, 1},
    1 − p ∈ [0, 1] is the probability of a flip.

There are 𝒞 = 2^{2^d} possible functions. Let g(p) be the prior distribution for the noise parameter, p.

Theorem 1. If the prior distribution on the possible functions is uniform, then g(p | D) = g(p).
Proof. Let m_i be the number of times f_i agrees with D, where we use i to index functions. Then Pr[D | p, f_i] = p^{m_i} (1 − p)^{l−m_i}, so

    Pr[D, f_i | p] = Pr[D | p, f_i] Pr[f_i | p] = (1/𝒞) p^{m_i} (1 − p)^{l−m_i}.   (4.1)

Let r(m_i) be the number of functions agreeing with the data set D exactly m_i times: r(m_i) = (l choose m_i) 2^{2^d − l} = (l choose m_i) 2^{−l} 𝒞. Summing equation 4.1 over f_i, we get

    Pr[D | p] = Σ_{f_i} Pr[D, f_i | p]
              = (1/𝒞) Σ_{m_i=0}^{l} r(m_i) p^{m_i} (1 − p)^{l−m_i}
              = 2^{−l} Σ_{m_i=0}^{l} (l choose m_i) p^{m_i} (1 − p)^{l−m_i}
              = 2^{−l},

independent of p. Therefore,

    g(p | D) = Pr[D | p] g(p) / Pr[D] = g(p).  ∎
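For d = 2 the 2^{2^d} = 16 Boolean functions can be enumerated directly, verifying theorem 1's claim that Pr[D | p] does not depend on p. This is a brute-force sketch (not from the article); it assumes the l inputs in D are distinct, as in the counting argument of the proof:

```python
from itertools import product

def pr_data_given_p(data, p, d=2):
    """Pr[D | p] under a uniform prior over all 2^(2^d) Boolean functions."""
    inputs = list(product([0, 1], repeat=d))
    total = 0.0
    # each truth table (one output bit per grid point) is one function f_i
    for table in product([0, 1], repeat=len(inputs)):
        f = dict(zip(inputs, table))
        m = sum(f[x] == y for x, y in data)            # m_i: agreements with D
        total += p ** m * (1 - p) ** (len(data) - m)   # Pr[D | p, f_i]
    return total / 2 ** len(inputs)                    # uniform prior 1/C

data = [((0, 0), 1), ((0, 1), 1), ((1, 0), 0)]         # l = 3 distinct inputs
vals = [pr_data_given_p(data, p) for p in (0.1, 0.5, 0.9)]
print(vals)  # all equal 2^-l = 0.125
```

Whatever the flip probability, the likelihood of the data is exactly 2^{−l}, so the posterior over p equals the prior.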
4.2 Extension to More General Functions. We would like to extend the previous analysis to the case where the output is not binary and the input is the discretized version of [0, 1]^d. Figure 4 illustrates the conclusions of this section on a particular discretized grid. This section is somewhat technical. Its essential content is contained in theorem 3, which formalizes the intuition that if every target function value is equally likely, then every noise realization at each data point is equally likely; hence nothing can be deduced about the relative likelihoods of different noise realizations.
4.2.1 Cyclic Noise. When the noise is additive modulo N_y, the vector P = [P_0, P_1, …, P_{N_y−1}] fully specifies the noise distribution.

Example. For the binary case with probability of flip 1 − p, we have N_y = 2, P_0 = p, P_1 = 1 − p.

Lemma 1. Σ_{f(x_α)} Pr[n_α | f, D] = 1 for all x_α ∈ D.

Proof.

    Σ_{f(x_α)=−C}^{C} Pr[n_α | f, D] = Σ_{n_α=y_α−C}^{y_α+C} Pr[n_α | f, D] = Σ_{n_α=0}^{N_y−1} P_{n_α} = 1.  ∎

Combining lemma 1 with equation 3.5, we have the following theorem.

Theorem 2. Let g(P) be the prior distribution for the cyclic noise. If the prior distribution for the functions is uniform, then g(P | D) = g(P).
Figure 4: Illustration of no noisy free lunch (NNFL). Suppose that the two functions shown are equally likely (probability 1/2) to be the function that created the data. Then the two hypotheses, noise = 0 and noise > 0, remain equally likely to be true if they were a priori equally likely.

Proof. By lemma 1 and equation 3.5, Pr[D | P] = 1/N_y^l. Therefore,

    g(P | D) = Pr[D | P] g(P) / Pr[D] = g(P).  ∎
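Theorem 2 can likewise be checked by enumerating all N_y^{N_x} functions on a small grid. A sketch (not from the article), using an assumed one-dimensional grid of three points and N_y = 4:

```python
from itertools import product

def pr_data_given_P(data, P, n_points):
    """Pr[D | P] for noise additive modulo Ny, under a uniform prior over
    all Ny^n_points functions f: {0,...,n_points-1} -> {0,...,Ny-1}."""
    Ny = len(P)
    total = 0.0
    for f in product(range(Ny), repeat=n_points):   # every function on the grid
        prob = 1.0
        for x, y in data:
            prob *= P[(y - f[x]) % Ny]              # noise realization n = y - f(x) (mod Ny)
        total += prob
    return total / Ny ** n_points                   # uniform prior 1/C

data = [(0, 3), (2, 1)]                 # l = 2 observations at distinct grid points
P1 = [0.7, 0.1, 0.1, 0.1]               # a skewed noise distribution
P2 = [0.25, 0.25, 0.25, 0.25]           # uniform noise
print(pr_data_given_P(data, P1, 3), pr_data_given_P(data, P2, 3))  # both 1/Ny^l = 0.0625
```

Both noise distributions assign the data the same probability 1/N_y^l, so the posterior over P equals the prior, as theorem 2 asserts.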
Note that the Boolean case is a special case of the cyclic-noise case. In the language of information theory (Cover & Thomas, 1991), the noise forms a symmetric channel between the function value and the output value. A uniform distribution on the function values induces a uniform distribution on the output values, independent of the details of the symmetric noisy channel. Hence observing the output conveys no information about the noisy channel. This is essentially the content of theorem 2.

4.2.2 Thresholded Additive Noise. Noise is added to the target, and the resultant value is then thresholded:

    y_i = min{C, f(x_i) + n_i}  if f(x_i) + n_i ≥ 0,
    y_i = max{−C, f(x_i) + n_i}  if f(x_i) + n_i < 0.
Here P = [P_0, P_{±1}, P_{±2}, …]. Unlike the previous case, edge effects make Pr[D | P] dependent on P. These effects can be made as small as we please by allowing C to be as large as we please. The intuition is that the data set is finite (y_i is bounded), and because the noise distribution must decay to zero, if C is large enough, the probability that f(x_α) is close to the edge while y_α is far away becomes negligible. This is formalized in the next lemma.

Lemma 2. Given ε > 0, there exists C* such that for all C ≥ C*,

    (1/N_y^l)(1 − ε) ≤ Pr[D | P] ≤ 1/N_y^l.   (4.2)

(Here N_y = 2C + 1 and, intuitively speaking, ε → 0 as C → ∞.)
Proof. Because {P_i} is summable, there exists γ such that Σ_{|i|>γ} P_i < ε/l. Let y_max = max{y_i} and y_min = min{y_i}. Choose C* > max{|y_max + γ|, |y_min − γ|} and let C ≥ C*. We then have

    Σ_{f(x_α)=−C}^{C} Pr[n_α | f, D] = Σ_{n_α=y_α−C}^{y_α+C} Pr[n_α | f, D]
        = 1 − Σ_{n_α<y_α−C} Pr[n_α | f, D] − Σ_{n_α>y_α+C} Pr[n_α | f, D]
        ≥ 1 − ε/l,

where the last inequality follows because, by construction, y_max − C < −γ and y_min + C > γ. Using equation 3.5, we have

    1/N_y^l ≥ Pr[D | P] ≥ (1/N_y^l)(1 − ε/l)^l.

The lemma now follows from the inequality (1 − x)^n ≥ 1 − nx for 0 ≤ x ≤ 1.  ∎

Thus we see that by choosing C, the bound on our function f, to be large enough, the noise distribution has almost no effect on the probability of obtaining a particular data set: all data sets become equally probable in the limit of large C. This tells us that the data should not tell us any more about the noise distribution than we already know; the prior distribution for the noise is unaffected by any finite data set. To formalize this intuition, we need to impose a technical condition on the prior over the noise-distribution space.
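The content of lemma 2 can be seen numerically: for a noise distribution with summable tails, the per-point factor in equation 3.5 approaches 1 as C grows, so Pr[D | P] approaches 1/N_y^l regardless of P. A sketch (not from the article), with an assumed geometric-tailed noise distribution:

```python
rho = 0.5
Z = (1 + rho) / (1 - rho)            # normalizer: sum over all integers n of rho^|n|
P = lambda n: rho ** abs(n) / Z      # a summable noise distribution on the integers

def point_factor(y, C):
    """The per-point factor in equation 3.5: sum over f(x) in [-C, C] of P[y - f(x)]."""
    return sum(P(y - f) for f in range(-C, C + 1))

y = 7                                # one observed output
for C in (8, 16, 32, 64):
    print(C, 1 - point_factor(y, C))  # the deficit shrinks geometrically with C
```

The factor is always at most 1 and its deficit decays like the tail of P once C exceeds |y| plus the tail width γ, which is exactly the choice of C* made in the proof.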
Definition 1. Let 0 < ε ≤ 1. Define γ_P(ε) by

    γ_P(ε) = min{ γ : Σ_{|i|>γ} P_i < ε }.

For all 0 < ε ≤ 1, γ_P(ε) < ∞, by the summability of P.

Definition 2. Let 0 < ε ≤ 1. Define γ̄(ε) by

    γ̄(ε) = sup_P { γ_P(ε) },

where the sup is taken over P ∈ support{g(P)}.

Definition 3. We say that a prior g(P) on the space in which P is defined is totally bounded if γ̄(ε) < ∞ for all ε > 0.

Essentially this means that the support of the prior on the noise distributions is not "too large." For example, a prior over the noise-distribution space that has support on only a finite number of points (i.e., that is nonzero for only a finite number of noise distributions) is totally bounded. We are now in a position to make precise the intuition mentioned above.

Theorem 3 (No Noisy Free Lunch (NNFL)). Let the prior over noise distributions, g(P), be totally bounded. For all 0 < ε < 1, there exists C* > 0 such that for all C ≥ C*,

    (1 − ε) g(P) ≤ g(P | D) ≤ g(P)/(1 − ε).   (4.3)

Therefore,

    lim_{C→∞} g(P | D) = g(P),

because ε → 0 can be attained by letting C → ∞.

Proof. Following lemma 2, let C* > max{ |y_max + γ̄(ε)|, |y_min − γ̄(ε)| } and let C ≥ C*. Then lemma 2 applies uniformly to every P ∈ support{g(P)}.
Therefore equation 4.2 holds for every P ∈ support{g(P)}. Now,

    g(P | D) = Pr[D | P] g(P) / Pr[D] = Pr[D | P] g(P) / Σ_P Pr[D | P] g(P).   (4.4)

Using equation 4.2, an upper bound is obtained by substituting 1/N_y^l for Pr[D | P] in the numerator and (1 − ε)/N_y^l in the denominator; for the lower bound, we do the opposite. Then, using the fact that Σ_P g(P) = 1, we get equation 4.3.  ∎

Therefore, having no bound on f and allowing all f's to be equally likely does not allow anything new to be deduced about the noise distribution from any finite data set. It should also be clear how to extend these results to the case where the noise distribution depends on x: one looks at each point independently. The nonthresholded additive noise model is obtained asymptotically by letting C become arbitrarily large. Further, the results apply to arbitrary Δ_x, Δ_y, so by letting Δ_x, Δ_y → 0, they apply to the nondiscretized model in this asymptotic sense.

5 Discussion
The posterior g(P | D) is given by

    g(P | D) ∝ g(P) Σ_f Pr[D | P, f] P[f],   (5.1)

where P[f] is the prior over target functions. We have demonstrated that when P[f] is "uniform," it is not possible to update a prior on the noise distribution from a finite data set, provided that a certain technical condition is satisfied. A nontrivial P[f] is needed for g(P | D) to differ from g(P); exactly how P[f] factors into g(P | D) is given by equation 5.1.

The result is more useful than it appears at first sight, for one might argue that no one ever has a uniform prior. A restatement of the result is: given two noise distributions, P_1 and P_2, there are just as many (nonuniform) priors that favor P_1 over P_2 (when one weights by the amount by which P_1 is favored) as vice versa.

This result fits into a set of NFL theorems that might be considered a collection of negative results, offering no hope for learning theory. The situation is considerably more positive, however. One is never in a situation with a uniform P[f], and it is rare that no information about P[f] is known (beyond that it is nonuniform). Therefore, incorporating P[f] into
the noise prediction or learning algorithm is necessary in order to claim superior performance. Distribution-independent bounds such as the VC bounds (Vapnik, 1995) say that with high probability, the training error is close to the test error for some fixed target function f. But VC results do not guarantee that the training error will be low with high probability; therefore they are of little practical use (for a deeper discussion of the subtleties underlying the connection between NFL and VC theory, see Wolpert, 1995, 1996b).⁴ How could one guarantee that a training error close to zero is achievable? One has to use the prior information available about the target function. In a similar vein, results that exemplify the superiority of algorithm A over algorithm B on a certain benchmark problem are also of little use, for the reverse will be true on some other "benchmark" problem, and we never know which "benchmark" problem we will run into. Similarly, looking at a data set alone, one cannot produce any reliable estimate of the noise variance. Perhaps we "feel" as though we ought to be able to (for how often can it be heard: "This data set is clearly noisy"), but only because we already have a smoothness prior built into our neural network. Why? Because the types of functions we tend to encounter in practice are smooth. NFL focuses our attention on the importance of the prior information available. One should not attempt to find learning systems that are universally good. Instead, one should study the possibilities that various priors present. Unfortunately, identifying and justifying the prior for the problem at hand is already a daunting task, let alone incorporating it into the learning problem. One way would be through the use of hints (Abu-Mostafa, 1995; Sill, 1998; Sill & Abu-Mostafa, 1997).
A possible starting point would be to identify priors, P[f], with learning systems that perform well with respect to them (for example, Barber & Williams, 1997; Zhu & Rohwer, 1996). One might then be able to relate real problems to these priors and thus choose an appropriate learning system; at the least, it might become possible to make precise the assumptions behind a belief that a certain learning system is appropriate for a certain problem. A useful result would be of the form, "Algorithm A [or learning system A] performs better than average on problems drawn from the (useful) class B."

Appendix
A.1 Estimating Noise Variance Using Linear Models. We have Â, an estimate of σ²(d + 1) + B, and we would like to estimate B by bootstrapping.

⁴ VC results make statements about Pr[training error | test error]: the probability of a low training error given a high test error is extremely small. In practice, what we would really like are statements about Pr[test error | training error], for which we need some statement about the a priori probability of a given test error; that is, we need a statement about a prior over target functions.

Suppose we draw N_B data points and estimate B (using the bootstrap technique) by B̂. It can be shown that (see theorem 5)
    ⟨B̂⟩_n = B̃ + σ²(d + 1)(1 − N_B/N) + (σ²/N) tr⟨S⁻¹VS⁻¹V⟩,   (A.1)

where B̃ is an unbiased estimate of B, the expectation in the last term is with respect to the bootstrapping, and the trace term is O(d + 1). The definitions of V and S are given in the next section and are unimportant for our purposes here. Thus, we now have two quantities: Â, an estimate of B + σ²(d + 1), and B̂, an estimate of B + σ²(d + 1)(1 − N_B/N) + σ² tr⟨S⁻¹VS⁻¹V⟩/N. Solving for σ², we have

    σ² = (Â − B̂) / [ (d + 1) N_B/N − tr⟨S⁻¹VS⁻¹V⟩/N ].   (A.2)
However, we are bootstrapping a quantity B that is related to the deviation of a sample statistic from its population value. This estimate is good only as long as N represents the population; that is, N_B/N → 0. In that limit, B̂ = B + σ²(d + 1), and thus we cannot combine it with Â to solve for σ² (the expression in equation A.2 becomes indeterminate). We are faced with a dilemma: for finite N_B/N we can get an estimate of σ², but this estimate is not very good, and we have no way to make it better, because making the bootstrap more effective for B makes the final result unsolvable for σ². Technically speaking, we cannot in this way obtain a consistent estimate of σ², even as N → ∞.

A.2 Proof of Theorems. We use the notation

    X = [x_1 x_2 ⋯ x_N],   y = f + n,
    S ≡ ⟨x xᵀ⟩_x,   q ≡ ⟨x f(x)⟩_x.
X is a (d + 1) × N matrix whose columns x_i are the data points (we use the convention that the first component is set to 1 and the remaining d components are the coordinates of the data point). f is an N × 1 vector of output values at the data points, and n is a vector of noise realizations. If ⟨x⟩ = 0, then S is the covariance matrix of the input distribution. The law of large numbers gives XXᵀ/N → S and Xf/N → q as N → ∞, where we assume that the conditions for this are satisfied. We thus define V and a by

    XXᵀ/N = S + V(X)/√N,   Xf/N = q + a(X)/√N.   (A.3)

Note that ⟨V⟩_X = ⟨a⟩_X = 0 and that Var(V) and Var(a) are O(1).
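The scaling claims about V can be checked by simulation. A sketch (not from the article), with an assumed input distribution of independent standard normal coordinates plus the constant first component, so that S is the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 2
S = np.eye(d + 1)                      # population <x x^T> for this input distribution

def V_sample(N):
    """One realization of V(X) = sqrt(N) (X X^T / N - S)."""
    X = np.vstack([np.ones(N), rng.standard_normal((d, N))])  # first row is the constant 1
    return np.sqrt(N) * (X @ X.T / N - S)

stats = {}
for N in (100, 10000):
    Vs = np.array([V_sample(N) for _ in range(300)])
    stats[N] = (Vs.mean(), Vs.std())
    print(N, stats[N])                 # mean near 0; spread stays O(1) as N grows
```

The entrywise mean of V is near zero and its spread is essentially the same at N = 100 and N = 10000, illustrating that ⟨V⟩_X = 0 and Var(V) = O(1) under the √N scaling.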
Theorem 4. Let the learning model be the set of linear functions w · x and let the learning algorithm be minimization of the squared error. Then

    ⟨E_tr*⟩_{,n} = E_0 + σ² − [σ²(d + 1) + B]/N + O(1/N^{3/2}),   (A.4)

where E_0 and B are constants given by

    E_0 = ⟨f²⟩ − qᵀS⁻¹q,   (A.5)
    B = ⟨qᵀS⁻¹VS⁻¹VS⁻¹q + aᵀS⁻¹a − 2aᵀS⁻¹VS⁻¹q⟩.   (A.6)

Proof. The least-squares estimate of w is given by

    ŵ = (XXᵀ)⁻¹ X y,   (A.7)

from which we calculate the expected training error as

    ⟨E_tr⟩ = ⟨(1/N)(Xᵀŵ − y)²⟩ = ⟨(1/N)((Xᵀ(XXᵀ)⁻¹X − 1) y)²⟩
           = [⟨fᵀf⟩ − ⟨fᵀXᵀ(XXᵀ)⁻¹Xf⟩ + ⟨nᵀn⟩ − ⟨nᵀXᵀ(XXᵀ)⁻¹Xn⟩]/N
           = ⟨f²⟩ − ⟨fᵀXᵀ(XXᵀ)⁻¹Xf⟩/N + σ² − σ²(d + 1)/N,

where we denote the first two terms by T_1. Using equation A.3 and the identity [1 + λA]⁻¹ = 1 − λA + λ²A² + O(λ³), we evaluate T_1 as

    T_1 = ⟨f²⟩ − qᵀS⁻¹q − ⟨qᵀS⁻¹VS⁻¹VS⁻¹q + aᵀS⁻¹a − 2aᵀS⁻¹VS⁻¹q⟩/N + O(1/N^{3/2}).

The theorem now follows.  ∎

This result is similar to results obtained by Amari, Murata, Müller, and Yang (1997) and Moody (1991). From theorem 4, we have

    B = ⟨qᵀS⁻¹VS⁻¹VS⁻¹q⟩ + ⟨aᵀS⁻¹a⟩ − 2⟨aᵀS⁻¹VS⁻¹q⟩ ≡ T_1 + T_2 − 2T_3,   (A.8)

where the three expectations define T_1, T_2, and T_3 (reusing the symbol T_1).
We wish to estimate T_1, T_2, T_3. Suppose that we try to bootstrap them as follows. For T_1, we take q̂ = Xy/N = X(f + n)/N and Ŝ = XXᵀ/N. We estimate V by sampling N_B of the N data points and computing √N_B (X_B X_Bᵀ/N_B − XXᵀ/N). We then estimate T_1 by T̂_1, a bootstrapped expectation of q̂ᵀŜ⁻¹VŜ⁻¹VŜ⁻¹q̂; this requires N_B/N to be small. Doing all this, we find that

    T̂_1 = ⟨(fᵀXᵀ/N) Ŝ⁻¹VŜ⁻¹VŜ⁻¹ (Xf/N)⟩ + (σ²/N) tr⟨Ŝ⁻¹VŜ⁻¹V⟩,   (A.9)

where we have taken an expectation with respect to the noise, and the remaining expectation is with respect to the bootstrapping. Notice that the first term is an unbiased estimator of T_1; denote it by T̃_1.

For T̂_2 = ⟨aᵀS⁻¹a⟩ we use a = √N_B (X_B y_B/N_B − q̂). Partitioning the input matrix and the noise vector into the sampled N_B points and the remainder, write X = [X_B X̃_B] and n = [n_Bᵀ ñ_Bᵀ]ᵀ. Noting that n_B and ñ_B are independent, we have for T̂_2:

    T̂_2 = T̃_2 + (1/(N² N_B)) ⟨(N_B Xn − N X_B n_B)ᵀ S⁻¹ (N_B Xn − N X_B n_B)⟩
        = T̃_2 + σ²(d + 1)(1 − N_B/N) + o(1/N),   (A.10)

where we have taken the expectation with respect to the noise and T̃_2 is an unbiased estimate of T_2. Performing the same kind of analysis for T̂_3, we find that T̂_3 = T̃_3, an unbiased estimate of T_3. Now we estimate B̂ = T̂_1 + T̂_2 − 2T̂_3, which is the unbiased estimate of B plus some correction terms. Therefore we have proved the following theorem.

Theorem 5. For N → ∞, N_B/N → 0, the bootstrapped estimate of B, which we call B̂, has an expectation given by
    ⟨B̂⟩_n = B̃ + σ²(d + 1)(1 − N_B/N) + (σ²/N) tr⟨S⁻¹VS⁻¹V⟩_X,   (A.11)

where B̃ is an unbiased estimate of B and the remaining expectation is with respect to the bootstrapping.
Acknowledgments
We thank Yaser Abu-Mostafa and the members of the Caltech Learning Systems Group for helpful comments. In addition, two anonymous referees provided useful comments.
References

Abu-Mostafa, Y. S. (1995). Hints. Neural Computation, 7(4), 639–671.
Amari, S., Murata, N., Müller, K., & Yang, H. (1997). Asymptotic statistical theory of overtraining and cross validation. IEEE Transactions on Neural Networks, 8(5), 985–996.
Barber, D., & Williams, C. K. I. (1997). Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 340–346). San Mateo, CA: Morgan Kaufmann.
Cataltepe, Z., Abu-Mostafa, Y. S., & Magdon-Ismail, M. (1999). No free lunch for early stopping. Neural Computation, 11, 995–1001.
Cortes, C., Jackel, L. D., & Chiang, W.-P. (1994). Limits on learning machine accuracy imposed by data quality. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 239–246). Cambridge, MA: MIT Press.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Moody, J. E. (1991). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Shao, J., & Tu, D. (1996). The jackknife and the bootstrap. New York: Springer-Verlag.
Sill, J. (1998). Monotonic networks. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Sill, J., & Abu-Mostafa, Y. S. (1997). Monotonicity hints. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 634–640). Cambridge, MA: MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wolpert, D. H. (1995). The mathematics of generalization. Reading, MA: Addison-Wesley.
Wolpert, D. H. (1996a). The existence of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1391–1420.
Wolpert, D. H. (1996b). The lack of a priori distinctions between learning algorithms. Neural Computation, 8(7), 1341–1390.
Wolpert, D. H., & Macready, W. G. (1997). No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1, 67–82.
Zhu, H. (1996). No free lunch for cross validation. Neural Computation, 8(7), 1421–1426.
Zhu, H., & Rohwer, R. (1996). Bayesian regression filters and the issue of priors. Neural Computing and Applications, 4, 130–142.

Received May 29, 1998; accepted March 31, 1999.
LETTER
Communicated by Peter Földiák
Self-Organization of Symmetry Networks: Transformation Invariance from the Spontaneous Symmetry-Breaking Mechanism

Chris J. S. Webber
Defence Evaluation and Research Agency, Malvern WR14 3PS, U.K.

Symmetry networks use permutation symmetries among synaptic weights to achieve transformation-invariant response. This article proposes a generic mechanism by which such symmetries can develop during unsupervised adaptation: it is shown analytically that spontaneous symmetry breaking can result in the discovery of unknown invariances of the data's probability distribution. It is proposed that a role of sparse coding is to facilitate the discovery of statistical invariances by this mechanism. It is demonstrated that the statistical dependences that exist between simple-cell-like threshold feature detectors, when exposed to temporally uncorrelated natural image data, can drive the development of complex-cell-like invariances, via single-cell Hebbian adaptation. A single learning rule can generate both simple-cell-like and complex-cell-like receptive fields.
Symmetry networks (Shawe-Taylor, 1989) are neural networks in which some synaptic weights are equal to others, to some degree of approximation. In other words, the weight vectors are invariant under certain permutations among the weights. For example, the simple permutation that swaps over the ith and i0 th components of weight vector w will leave w invariant if the weights wi and wi0 are equal. These symmetries have the consequence that the scalar product with an input vector x will be invariant under corresponding permutations of the input nodes: 0
Sii...... ( w ) D w
)
0
w ¢ Sii...... (x ) D w ¢ x .
(1.1)
This follows because w ¢ S( x) D S¡1 (w ) ¢ x and because S( w ) D w implies S¡1 ( w ) D w . This is true for any permutation, including those that cycle more than two weights and those that cycle several distinct equivalence classes of weights simultaneously. Because the length | w | of a vector is invariant under a permutation, invariance of w ¢ x implies invariance of the distance | x ¡ w | , so that Sii...... ( w ) D w 0
)
0 Sii...... ( x ) ¡ w D | x ¡ w |.
Neural Computation 12, 565–596 (2000)
(1.2)
c British Crown copyright 2000 °
566
Chris J. S. Webber
Any neuron whose response is a function of w · x or |x − w| only will respond invariantly under those input-node permutations that correspond to its weight vector's symmetries. The converse of equation 1.1 implies that, in general, such a neuron will discriminate between a pair of permutation-related patterns if its weight vector is not symmetric under that particular permutation. Depending on the weight vector's specific symmetries, the response will be invariant for some input-node permutations but not for others; the neuron's invariances are selective for specific groups of transformations. A neuron with a high-dimensional weight vector can be invariant for many distinct permutations while retaining the capacity to discriminate for very many other permutations. These arguments are also valid for the case of approximate symmetries, if the neural response is a continuous function of w · x or |x − w|, because S(w) ≈ w implies w · S(x) ≈ w · x. In fact, the arguments based on equations 1.1 and 1.2 apply to a more general group of linear transformations than the permutations: the continuous orthogonal¹ group of proper and improper rotations in the weight-vector space, of which the coordinate permutations form a discrete subgroup. (The orthogonal transformations are those linear transformations that leave vector lengths invariant: |S(w)| = |w|.) The condition for w · x and |x − w| to be invariant under such a transformation is that the weight vector lies in the invariant subspace of the transformation, defined as the subspace of vectors w for which S(w) = w. The invariant subspace is the multidimensional analog of a mirror plane or axis of rotation.
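Equations 1.1 and 1.2 can be illustrated directly in a few lines (a sketch, not from the article; the particular vectors are arbitrary):

```python
import numpy as np

w = np.array([0.5, 0.5, -1.0, 2.0])   # symmetric under swapping components 0 and 1

def S(v):
    """The input-node permutation that swaps components 0 and 1."""
    out = v.copy()
    out[[0, 1]] = out[[1, 0]]
    return out

x = np.array([0.3, -1.2, 0.8, 0.1])

# S(w) = w, so responses built from w.x or |x - w| are invariant under S on the input:
print(np.allclose(w @ S(x), w @ x))                                   # True
print(np.allclose(np.linalg.norm(S(x) - w), np.linalg.norm(x - w)))   # True

# w is not symmetric under swapping components 2 and 3, so it still discriminates:
x23 = x.copy(); x23[[2, 3]] = x23[[3, 2]]
print(np.isclose(w @ x23, w @ x))                                     # False
```

The same weight vector is thus invariant for one permutation and discriminative for another, which is the selectivity described above.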
Minsky and Papert (1969) originated the idea that invariant response can be related to weight symmetries in this way: their group-invariance theorem proves that if one can construct a perceptron for computing a given symmetric function, then one can also construct a perceptron with symmetric weights for computing the same function. Many results on the subject of transformation invariance have depended on the design of networks having equalities among weights. Such networks have been successfully applied to the problem of graph isomorphism (Shawe-Taylor, 1993). The rotation-invariant network of Fukumi, Omatu, Takeda, and Kosaka (1992), the handwritten zip code recogniser of Le Cun et al. (1989), and the neocognitron (Fukushima & Miyake, 1982) all implicitly use built-in weight equalities to achieve their transformation invariances. Instead of imposing prior knowledge of a desired invariance directly onto the network architecture in this way, one can incorporate that prior knowledge into the learning algorithm, in the form of prior constraints on the learning equations. The use of weight sharing to distinguish T and C shapes independently of translation and 90-degree rotation (Rumelhart, Hinton, & Williams, 1986) provides one example. Rao and Ballard (1998) and Wood (1995) provide others. Supervised networks have even been explicitly trained to be invariant under infinitesimal transformations of a known invariance group (Hinton, 1987; Simard, Victorri, Le Cun, & Denker, 1991).

¹ Results obtained for orthogonal transformations can be generalized straightforwardly to apply to unitary transformations if we consider complex-valued weights and data.

This article addresses the question of what kinds of intrinsic invariances of the data might be able to drive the development of weight symmetries during unsupervised adaptation, and proposes a generic mechanism by which specific invariances of the data can drive the development of transformation-invariant response. In principle, this generic mechanism could be used to generate invariance with respect to a wide range of transformations, in a range of learning rules. As a specific example, the Hebbian rule of Webber (1994) is studied and shown analytically to give rise to weight symmetries when trained on data having statistical invariances under permutations. Webber (1994) used an analytical model, and simulations involving synthetic images, to demonstrate that this single-cell learning rule can discover independent components in kurtotically distributed data. The component-finding behavior of this learning rule has also been demonstrated on natural data: the rule discovers phones and words when exposed to continuous speech (Webber, 1997). Blais, Intrator, Shouval, and Cooper (1998) have shown that various other single-cell rules, also belonging to the general class of projection-pursuit algorithms (Friedman, 1987), can discover independent components in natural images; these resemble the oriented receptive fields of cortical simple cells and appear localized to some extent. Such single-cell learning rules are very different from multicell independent-component-analysis algorithms such as those of Olshausen and Field (1996) and Bell and Sejnowski (1997), in that they are not derived from any kind of minimum mutual entropy or statistical-independence principle, such as was originally proposed by Barlow (1989).
In this article the Webber (1994) rule is also trained on natural images, and it generates simple-cell-like feature detectors having oriented and highly localized receptive fields as expected. The patterns of output activity induced in a set of these “simple cells,” when natural images are presented at their inputs, are then used to train other similar, higher-level Webber (1994) neurons. These higher-level neurons develop weight symmetries that give them complex-cell-like translation invariances. Thus, key response properties of both simple and complex cells can be derived within the framework of the Hebbian single-cell learning rule, without the need to rely on temporal correlations to explain transformation invariance, as others have done (Foldiak, 1991).

2 The Spontaneous Symmetry-Breaking Mechanism
If weight equalities are not explicitly constructed by prior constraints, how can they arise through self-organization? What properties of the data, unknown prior to learning, can determine the permutations under which the
weight vector is to become invariant? In other kinds of dynamical system, the symmetry of the convergent state is often determined by spontaneous breaking of dynamical symmetries. Examples are widespread in physical systems. This article proposes that spontaneous symmetry-breaking mechanisms can also determine the self-organization of invariant neural response. The term dynamical symmetry refers to an invariance of the equations of motion, under a transformation of the system’s dynamical coordinates. In the neural network context, these coordinates are a set of adaptive weight vectors $\mathbf{W} \equiv (\mathbf{w}_1, \ldots, \mathbf{w}_m)$, and the equations of motion are the adaptation rules,

\[ \frac{d\mathbf{w}_j}{dt} = \mathbf{v}_j(\mathbf{w}_1, \ldots, \mathbf{w}_m) \qquad (j = 1, \ldots, m), \tag{2.1} \]
or, defining $\mathbf{V} \equiv (\mathbf{v}_1, \ldots, \mathbf{v}_m)$,

\[ \frac{d\mathbf{W}}{dt} = \mathbf{V}(\mathbf{W}). \tag{2.2} \]
Equations of motion that are first order in t include most learning rules, notable exceptions being rules with second-order “momentum” terms. The velocity functions $\mathbf{v}_j$ will depend in some way on the distribution of the training data $\Pr(\mathbf{x})$. The class of rules of the form 2.2 includes the familiar rules in which the weight vector is updated smoothly on the basis of some average $\langle \cdots \rangle_{\{\mathbf{x}\}}$ over $\Pr(\mathbf{x})$:

\[ \mathbf{v}_j(\mathbf{w}_1, \ldots, \mathbf{w}_m) = \int \boldsymbol{\phi}_j(\mathbf{x}, \mathbf{w}_1, \ldots, \mathbf{w}_m) \Pr(\mathbf{x}) \, d\mathbf{x} \qquad (j = 1, \ldots, m). \tag{2.3} \]
In these ensemble-average rules, the order in which data patterns are drawn from $\Pr(\mathbf{x})$ has no influence on the learning dynamics: it is assumed that the statistics of $\Pr(\mathbf{x})$ are stationary over time. The velocities $\mathbf{v}_j$ are usually derived from a Lyapunov function U, called the objective function in the neural network context,

\[ U(\mathbf{w}_1, \ldots, \mathbf{w}_m) = \int u(\mathbf{x}, \mathbf{w}_1, \ldots, \mathbf{w}_m) \Pr(\mathbf{x}) \, d\mathbf{x}, \tag{2.4} \]

often via gradient ascent or descent:

\[ \mathbf{v}_j(\mathbf{w}_1, \ldots, \mathbf{w}_m) = \pm \frac{\partial U(\mathbf{w}_1, \ldots, \mathbf{w}_m)}{\partial \mathbf{w}_j} \qquad (j = 1, \ldots, m). \tag{2.5} \]
Consider a linear transformation of the coordinate system from old coordinates $\mathbf{W}$ to new coordinates $\mathbf{W}'$,

\[ \mathbf{W}' \equiv S(\mathbf{W}). \tag{2.6} \]
In the new coordinates, the equations of motion (equation 2.2) become

\[ \frac{dS^{-1}(\mathbf{W}')}{dt} = \mathbf{V}(S^{-1}(\mathbf{W}')), \quad \text{i.e.,} \quad \frac{d\mathbf{W}'}{dt} = S(\mathbf{V}(S^{-1}(\mathbf{W}'))), \tag{2.7} \]

because linearity of the transformation implies

\[ \frac{dS^{-1}(\mathbf{W}')}{dt} = \lim_{dt \to 0} \frac{S^{-1}(\mathbf{W}'_0 + d\mathbf{W}') - S^{-1}(\mathbf{W}'_0)}{dt} = S^{-1}\!\left(\frac{d\mathbf{W}'}{dt}\right). \tag{2.8} \]
Note that although we have assumed here that the transformation is linear, we do not have to assume that the dynamics are linear—that $\mathbf{V}(\mathbf{W})$ is a linear function of $\mathbf{W}$. The transformation S on the dynamical coordinates $\mathbf{W}$ is said to be a symmetry of the equations of motion if the equations have the same functional form,

\[ \frac{d\mathbf{W}'}{dt} = \mathbf{V}(\mathbf{W}'), \tag{2.9} \]

in the new coordinate system. For linear transformations, such dynamical symmetry is defined by

\[ \mathbf{V}(\mathbf{W}') = S(\mathbf{V}(S^{-1}(\mathbf{W}'))), \quad \text{i.e.,} \quad \mathbf{V}(S(\mathbf{W})) = S(\mathbf{V}(\mathbf{W})), \tag{2.10} \]
as can be seen by comparing equation 2.7 with 2.9. In cases where the equations of motion just perform gradient ascent or descent on an objective function, any orthogonal transformation that leaves the objective function invariant,

\[ U(\mathbf{W}') = U(\mathbf{W}), \tag{2.11} \]

will also be a symmetry (see equation 2.10) of the equations of motion, because the gradient in the transformed coordinate system is (using the chain rule and the definition of orthogonality²)

\[ \frac{\partial U}{\partial \mathbf{W}'} = \frac{\partial \mathbf{W}^T}{\partial \mathbf{W}'} \frac{\partial U}{\partial \mathbf{W}} = (S^{-1})^T \frac{\partial U}{\partial \mathbf{W}} = S \frac{\partial U}{\partial \mathbf{W}}. \tag{2.12} \]
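Equation 2.12 can be checked numerically. The sketch below uses a permutation matrix for S (which is orthogonal) and a hypothetical permutation-invariant objective $U(\mathbf{w}) = \sum_i w_i^4$, chosen purely for illustration; it also verifies the textbook fact, used in footnote 4, that a similarity transformation preserves eigenvalues:

```python
import numpy as np

# Check of equation 2.12: if U is invariant under an orthogonal S, then
# (grad U)(S w) = S (grad U)(w).  U(w) = sum(w**4) is a hypothetical
# permutation-invariant objective chosen for illustration.
rng = np.random.default_rng(1)
n = 6
S = np.eye(n)[rng.permutation(n)]     # a permutation matrix (orthogonal)
w = rng.normal(size=n)

grad_U = lambda w: 4.0 * w**3         # gradient of U(w) = sum(w**4)
assert np.allclose(grad_U(S @ w), S @ grad_U(w))

# A similarity transformation S^{-1} H S leaves the eigenvalue set of H
# unchanged (the fact invoked in footnote 4); S^{-1} = S^T here.
H = rng.normal(size=(n, n))
ev = lambda M: np.sort_complex(np.linalg.eigvals(M))
assert np.allclose(ev(S.T @ H @ S), ev(H))
```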
However, the concept and definition (see equation 2.10) of symmetries of the equations of motion apply even to dynamics for which no objective function U exists. If the equations of motion are invariant under some group of transformations, this has implications for the invariances of the weight vectors when the
² $S^{-1} = S^T$ for orthogonal matrices.
learning dynamics have brought them to convergence. From equation 2.2, it can be seen that the first of two conditions for a configuration of weight vectors to be a convergent state $\mathbf{W}^\infty \equiv (\mathbf{w}_1^\infty, \ldots, \mathbf{w}_m^\infty)$ is that it should be a stationary point,

\[ \mathbf{V}(\mathbf{W}^\infty) = \mathbf{0}, \tag{2.13} \]

even for dynamics for which no Lyapunov function exists. This condition is satisfied for unstable stationary states as well as for stable or metastable convergent states, so, in addition, convergent states must satisfy a second condition,

\[ \boldsymbol{\epsilon}^T \left( \partial \mathbf{V}^T / \partial \mathbf{W} \right) \boldsymbol{\epsilon} \leq 0, \tag{2.14} \]
for any small perturbation $\mathbf{W} = \mathbf{W}^\infty + \boldsymbol{\epsilon}$ away from the stationary state. This says that the “restoring force” $\mathbf{V}(\mathbf{W})$ should oppose³ perturbations $\boldsymbol{\epsilon}$ away from the stationary point, so $\boldsymbol{\epsilon} \cdot \mathbf{V}(\mathbf{W})$ should not be positive for any $\boldsymbol{\epsilon}$: equation 2.14 can be derived by Taylor expansion of $\mathbf{V}(\mathbf{W})$ about $\mathbf{W}^\infty$ in the standard way. In the special case of gradient ascent, the matrix $\partial \mathbf{V}^T / \partial \mathbf{W}$ is the objective function’s Hessian matrix, and equation 2.14 is just the condition for the stationary point defined by equation 2.13 to be a local maximum of the objective function. If $S(\mathbf{W})$ is an orthogonal symmetry of the equations of motion (equation 2.2) and if the particular state $\mathbf{W}^\infty$ is one of the stable convergent states satisfying equations 2.13 and 2.14, then the symmetry definition (equation 2.10) ensures that the state $S(\mathbf{W}^\infty)$ will also be one of the stable convergent states satisfying equations 2.13 and 2.14. Clearly $S(\mathbf{W}^\infty)$ will also be a stationary point because

\[ \mathbf{V}(\mathbf{W}^\infty) = \mathbf{0} \quad \text{implies} \quad S(\mathbf{V}(\mathbf{W}^\infty)) = S(\mathbf{0}) = \mathbf{0}, \tag{2.15} \]

and, by symmetry of the objective function under S, $S(\mathbf{W}^\infty)$ will obviously be a local maximum of U if $\mathbf{W}^\infty$ is. Stability (see equation 2.14) of $\mathbf{W}^\infty$ implies stability of $S(\mathbf{W}^\infty)$ even when no objective function U exists. If S is an orthogonal symmetry (see equation 2.10) of the equations of motion, a proof analogous to equation 2.12 shows that

\[ \frac{\partial \mathbf{V}^T(\mathbf{W}')}{\partial \mathbf{W}'} = S^{-1} \frac{\partial \mathbf{V}^T(\mathbf{W})}{\partial \mathbf{W}} S. \tag{2.16} \]
The matrices $\partial \mathbf{V}^T(\mathbf{W}') / \partial \mathbf{W}'$ and $\partial \mathbf{V}^T(\mathbf{W}) / \partial \mathbf{W}$ are thus related by a similarity transformation.⁴ For every possible perturbation $\boldsymbol{\epsilon}'$ in the $\mathbf{W}'$ coordinate system, there is a corresponding direction $\boldsymbol{\epsilon} = S(\boldsymbol{\epsilon}')$ in the $\mathbf{W}$ coordinate system for which the left-hand side of equation 2.14 has identical value (and vice versa),

\[ \boldsymbol{\epsilon}'^T \frac{\partial \mathbf{V}^T(\mathbf{W}')}{\partial \mathbf{W}'} \boldsymbol{\epsilon}' = \boldsymbol{\epsilon}^T \frac{\partial \mathbf{V}^T(\mathbf{W})}{\partial \mathbf{W}} \boldsymbol{\epsilon}, \tag{2.17} \]

so the condition (see equation 2.14) that these values are nonpositive for every possible perturbation is independent of the coordinate transformation: $S(\mathbf{W}^\infty)$ must be stable if $\mathbf{W}^\infty$ is stable, or metastable if $\mathbf{W}^\infty$ is metastable. Orthogonal symmetries of the equations of motion map convergent states onto convergent states. Two possibilities arise when the equations of motion are invariant under an orthogonal transformation $S(\mathbf{W})$. Either the convergent states $\mathbf{W}^\infty$ and $S(\mathbf{W}^\infty)$ are distinct from one another, in which case $S(\mathbf{W}^\infty) \neq \mathbf{W}^\infty$ and $S(\mathbf{W})$ is said to be a spontaneously broken, or hidden, symmetry, or else they represent identical states, in which case

\[ S(\mathbf{W}^\infty) = \mathbf{W}^\infty, \tag{2.18} \]

and $S(\mathbf{W})$ is said to be a preserved, or manifest, symmetry. The preserved symmetries always form a subgroup of the equations of motion’s symmetry group, because they are elements of that symmetry group and because symmetries form groups by definition. It is not helpful to think of spontaneous symmetry breaking as a temporal process that begins at the initial state and ends with convergence, and to regard it as being caused by asymmetry in the initial state. It simply describes the situation in which the objective function as a whole is symmetric under some transformation group, but in which the optima of the objective function are mapped onto one another, rather than onto themselves, by the action of that group. It is important to note that spontaneous symmetry breaking is not caused by slight asymmetries in the equations of motion. There are cases in physics where the dominant component of the Lyapunov function may be symmetric but some perturbation effect introduces a small asymmetry into the Lyapunov function, with the result that its optima no longer map onto themselves exactly. This situation is also called symmetry breaking, but the symmetry is then not hidden and the breaking is not called spontaneous. This article is concerned exclusively with spontaneous symmetry breaking. In the neural network context, whether a particular symmetry of the equations of motion is preserved or broken will depend on the distribution

³ Or not further enhance it, in the case of metastability.
⁴ It can be shown that the stability condition, equation 2.14, is equivalent to the requirement that $\partial \mathbf{V}^T / \partial \mathbf{W}$ has no positive eigenvalues. Matrices H and $H' \equiv S^{-1} H S$ related by a similarity transformation have the same set of eigenvalues: $H \mathbf{e} = c \mathbf{e}$ implies $H' S^{-1} \mathbf{e} = c S^{-1} \mathbf{e}$. The condition that $\partial \mathbf{V}^T(\mathbf{W}) / \partial \mathbf{W}$ has no positive eigenvalues is therefore independent of the transformation $S(\mathbf{W})$ if $S(\mathbf{W})$ is an orthogonal symmetry of the equations of motion.
$\Pr(\mathbf{x})$ and on the particular functional form of the $\boldsymbol{\phi}_j$ in equation 2.3 and on any fixed symmetry-breaking parameters implicit in the functions $\boldsymbol{\phi}_j$. In other words, whether a symmetry is preserved or broken depends on the particular learning rule; there is no guarantee that an arbitrary learning rule will preserve any symmetries, except the trivial identity transformation. The values of any symmetry-breaking parameters will control how large a subgroup of the dynamical symmetries is preserved; section 5 provides a simple example. Under the action of a learning rule that preserves a symmetry, the weight-vector configuration will become symmetric (see equation 2.18) on convergence, even though the learning starts from a random and asymmetric initial configuration in general.

3 Symmetry Preservation and Invariant Neural Response
The principles proved in the previous section will be familiar to those who have come across the spontaneous symmetry-breaking mechanism in physics. Although these principles have not elsewhere been set down and worked through explicitly for the neural network context, other authors have made references to symmetry breaking in neural networks. For example, Yamagishi (1994) refers to the breaking of left-right ocular symmetry in the visual cortex, which he suggests takes place when dominance columns are formed. However, other authors have not considered the significance of the preservation of symmetries for transformation-invariant neural response. Considering this leads us to new results, which connect the subjects of symmetry networks and unsupervised learning quite intimately. Because we are interested in the response invariances of individual neurons in symmetry networks, we concentrate on transformations that permute the coordinates of individual weight vectors, operating in the same way on all the weight vectors:

\[ (\mathbf{w}_1, \ldots, \mathbf{w}_m) \longrightarrow (S(\mathbf{w}_1), \ldots, S(\mathbf{w}_m)). \tag{3.1} \]
Where the transformation S permutes all the weight vectors in the same way, the general symmetry definition, equation 2.10, simplifies to

\[ \mathbf{v}_j(S(\mathbf{w}_1), \ldots, S(\mathbf{w}_m)) = S(\mathbf{v}_j(\mathbf{w}_1, \ldots, \mathbf{w}_m)) \qquad (j = 1, \ldots, m). \tag{3.2} \]
$S(\mathbf{W})$ will be an orthogonal transformation in the $\mathbf{W}$-space if $S(\mathbf{w})$ is an orthogonal transformation on all the individual weight vectors.⁵ If $S(\mathbf{W}) = (S(\mathbf{w}_1), \ldots, S(\mathbf{w}_m))$ is a preserved permutation symmetry, the symmetry-network conditions, 1.1 and 1.2, for the neurons’ responses to be invariant under $S(\mathbf{x})$ are automatically satisfied. If $S(\mathbf{W}) = (S(\mathbf{w}_1), \ldots, S(\mathbf{w}_m))$ is a spontaneously broken permutation symmetry, however, neurons will instead discriminate between data patterns related by the transformation $S(\mathbf{x})$. (These results are valid not just for permutations but hold for all orthogonal transformations in the weight-vector space.) If the dynamics are invariant under a whole group of symmetries, some of them may be preserved while others are spontaneously broken, so a neuron may have invariant response under some transformations but discriminate for others. We observe that symmetry preservation provides a mechanism for neurons to acquire invariant response through self-organization, and that symmetry breaking provides a mechanism for the response invariances to be restricted to a specific subgroup of transformations.

⁵ Because the $mn \times mn$ S-matrix that transforms the $\mathbf{W}$-space is assembled from m repetitions of the $n \times n$ S-matrix that transforms each $\mathbf{w}$-space, in block-diagonal form.

4 Invariances of the Data Determine the Dynamical Symmetries
Whether a given permutation is actually a symmetry of the equations of motion in the first place depends on $\Pr(\mathbf{x})$. Consider the familiar ensemble-average learning rules in which u is a continuous function of m scalar products $\mathbf{w}_j \cdot \mathbf{x}$, m distances $|\mathbf{w}_j - \mathbf{x}|$, and/or m lengths $|\mathbf{w}_j|$, and for which the m neurons see the same input space $\mathbf{x}$ and so belong to the same layer of the network. Then

\[ S(\mathbf{w}_j) \cdot S(\mathbf{x}) = \mathbf{w}_j \cdot \mathbf{x} \quad \text{and} \quad |S(\mathbf{w}_j) - S(\mathbf{x})| = |\mathbf{w}_j - \mathbf{x}|, \tag{4.1} \]

for any permutation S, so the objective function, equation 2.4, will be symmetric,

\[ U(S(\mathbf{w}_1), \ldots, S(\mathbf{w}_m)) = U(\mathbf{w}_1, \ldots, \mathbf{w}_m), \tag{4.2} \]

under any permutation that satisfies

\[ \Pr(S(\mathbf{x})) = \Pr(\mathbf{x}), \tag{4.3} \]

because

\[ \int u(\mathbf{x}, S(\mathbf{W})) \Pr(\mathbf{x}) \, d\mathbf{x} = \int u(S^{-1}(\mathbf{x}), \mathbf{W}) \Pr(\mathbf{x}) \, d\mathbf{x} = \int u(\mathbf{x}, \mathbf{W}) \Pr(S(\mathbf{x})) \, d\mathbf{x}. \tag{4.4} \]
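Equations 4.2 through 4.4 can be verified exactly on a small discrete distribution. In the sketch below, the three-pattern distribution, the coordinate swap, and the integrand $u(\mathbf{x}, \mathbf{w}) = (\mathbf{w} \cdot \mathbf{x})^2$ (a function of the scalar product only) are all hypothetical choices for illustration:

```python
import numpy as np

# If Pr(S(x)) = Pr(x) for a permutation S, then U(S(w)) = U(w)
# (equations 4.2-4.4).  All specific values here are illustrative.
S_perm = np.array([1, 0, 2])                    # swap the first two coordinates
patterns = [np.array(p, float) for p in [(1, 1, 0), (1, 0, 1), (0, 1, 1)]]
prob = {(1, 1, 0): 0.5, (1, 0, 1): 0.25, (0, 1, 1): 0.25}   # invariant under the swap

def U(w):
    """Discrete version of equation 2.4 with u(x, w) = (w.x)^2."""
    return sum(prob[tuple(x.astype(int))] * (w @ x) ** 2 for x in patterns)

w = np.array([0.3, -1.2, 0.7])
assert np.isclose(U(w[S_perm]), U(w))           # equation 4.2
```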
For many ensemble-average learning rules (see equation 2.3), we can derive analogous results even if no objective function exists. Consider the learning rules of the form

\[ \boldsymbol{\phi}_j(\mathbf{x}, \mathbf{w}_1, \ldots, \mathbf{w}_m) = \nu_j \mathbf{x} - \omega_j \mathbf{w}_j \qquad (j = 1, \ldots, m), \tag{4.5} \]
where each of the 2m scalars $\nu_j$ and $\omega_j$ is a continuous but not necessarily linear function of the m scalar products $\mathbf{w}_j \cdot \mathbf{x}$, the m distances $|\mathbf{w}_j - \mathbf{x}|$, and/or the m lengths $|\mathbf{w}_j|$. Their dynamics will be symmetric (see equation 2.10) under any permutation that satisfies equation 4.3, because all the $\nu_j$ and $\omega_j$ are invariants (see equation 4.1) under combined $S(\mathbf{W})$ and $S(\mathbf{x})$, so that

\[ \int \boldsymbol{\phi}_j(\mathbf{x}, S(\mathbf{W})) \Pr(\mathbf{x}) \, d\mathbf{x} = \int S(\boldsymbol{\phi}_j(S^{-1}(\mathbf{x}), \mathbf{W})) \Pr(\mathbf{x}) \, d\mathbf{x} = \int S(\boldsymbol{\phi}_j(\mathbf{x}, \mathbf{W})) \Pr(S(\mathbf{x})) \, d\mathbf{x}. \tag{4.6} \]

These new results state that any permutation that is an invariance of the data’s distribution will be a dynamical symmetry of these learning dynamics. Those permutations that are invariances of the data’s distribution will therefore give rise to corresponding weight equalities if they are preserved unbroken in the convergent state. We therefore propose that such invariances of the data, unknown prior to learning, can drive the self-organization of transformation invariances in the response of individual neurons, through symmetry preservation. (Again, all the arguments of this section hold for the wider group of orthogonal⁶ transformations.) The theory of spontaneous symmetry breaking applies to real physical systems, which are only ever deemed symmetric to a finite approximation. For $u(\mathbf{x}, \mathbf{w}_1, \ldots, \mathbf{w}_m)$ that are continuous functions, approximate symmetries of $\Pr(\mathbf{x})$ will give rise to approximate symmetries of the objective function $U(\mathbf{W})$. Any $\Pr(\mathbf{x})$ can be decomposed into a sum of symmetric and antisymmetric components⁷ under a given permutation $S(\mathbf{x})$, and any $U(\mathbf{W})$ can be decomposed under the corresponding $S(\mathbf{W})$. As one perturbs $\Pr(\mathbf{x})$ away from exact symmetry by introducing an infinitesimal antisymmetric component, $U(\mathbf{W})$ will acquire an infinitesimal antisymmetric component. As this component grows, any symmetry-preserving stable optima $\mathbf{W}^\infty$ will be perturbed away from the invariant subspace defined by $S(\mathbf{W}) = \mathbf{W}$, remaining approximately invariant under S but to an increasingly inaccurate degree.⁸ Thus, approximate symmetries of $\Pr(\mathbf{x})$ can give rise to approximate weight symmetries in the convergent state. The example of section 6 demonstrates that approximate invariances of the data’s distribution can indeed drive the development of approximate symmetries among synaptic weights.

⁶ The arguments could also be generalized straightforwardly to unitary transformations if we were to consider complex-valued weights and data.
⁷ The decomposition $\Pr(\mathbf{x}) = \Pr^+(\mathbf{x}) + \Pr^-(\mathbf{x})$ is merely an analytical device. The functions $\Pr^\pm(\mathbf{x})$ are not themselves probability densities and are not required to satisfy $\Pr^\pm(\mathbf{x}) \geq 0$ and $\int \Pr^\pm(\mathbf{x}) \, d\mathbf{x} = 1$.
⁸ A large enough antisymmetric component may eventually break the approximate symmetry of these stable optima, when they abruptly cease to be stable, or, alternatively, it may move them so far from the invariant subspace that their approximate symmetry becomes washed out.
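The symmetric/antisymmetric decomposition mentioned in footnote 7 is easy to compute explicitly for a discrete distribution. In this sketch the random distribution over binary patterns is an arbitrary illustrative choice, and S is taken to be an involution (an order-2 permutation) so that the two components are exactly symmetric and antisymmetric under it:

```python
import numpy as np
from itertools import product

# Footnote 7's decomposition Pr = Pr+ + Pr-, for a discrete distribution
# over binary patterns and an involutive coordinate permutation S.
rng = np.random.default_rng(2)
n = 4
S = np.array([1, 0, 3, 2])                      # an involution: S(S(x)) = x

patterns = list(product([0, 1], repeat=n))
p = rng.random(len(patterns)); p /= p.sum()     # arbitrary distribution Pr
Pr = dict(zip(patterns, p))

Sx = lambda x: tuple(np.array(x)[S])
Pr_plus  = {x: 0.5 * (Pr[x] + Pr[Sx(x)]) for x in patterns}
Pr_minus = {x: 0.5 * (Pr[x] - Pr[Sx(x)]) for x in patterns}

# Pr+ is symmetric, Pr- is antisymmetric, and they sum back to Pr.
for x in patterns:
    assert np.isclose(Pr_plus[x], Pr_plus[Sx(x)])
    assert np.isclose(Pr_minus[x], -Pr_minus[Sx(x)])
    assert np.isclose(Pr_plus[x] + Pr_minus[x], Pr[x])
```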
5 Hebbian Adaptation of Threshold Neurons as an Example
So far, only a few properties of learning rules have been assumed, but remarkably general predictions have been made. To demonstrate that spontaneous symmetry breaking gives rise to self-organized symmetry in practice, a specific learning rule is cited. Webber (1994) studied the Hebbian adaptation of a scalar product neuron whose continuous threshold nonlinearity approximates the limiting case

\[ \lim_{s \to 0} r(a) = \begin{cases} a - \vartheta & \text{if } a \geq \vartheta \\ 0 & \text{if } a \leq \vartheta, \end{cases} \qquad \lim_{s \to 0} \frac{dr}{da} = \begin{cases} 1 & \text{if } a \geq \vartheta \\ 0 & \text{if } a \leq \vartheta, \end{cases} \tag{5.1} \]

where $a \equiv \mathbf{w} \cdot \mathbf{x}$ is the scalar product. s represents a “softness” parameter, which determines how closely the threshold nonlinearity approximates this sharp piece-wise linear limit in practice: a convenient soft parameterization (Webber, 1997, 1998) is

\[ r(a) = s \log\left[ 1 + \exp\left( \frac{a - \vartheta}{s} \right) \right], \qquad \frac{dr}{da} = \left[ 1 + \exp\left( -\frac{a - \vartheta}{s} \right) \right]^{-1}. \tag{5.2} \]
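The soft parameterization of equation 5.2 can be implemented directly (here in a numerically stabilized softplus form; the threshold value and test points are arbitrary) and checked against the sharp s → 0 limit of equation 5.1:

```python
import numpy as np

def r_soft(a, theta, s):
    """Equation 5.2, r(a) = s log[1 + exp((a - theta)/s)], rewritten in a
    numerically stable form to avoid overflow when s is very small."""
    z = (a - theta) / s
    return s * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

def r_sharp(a, theta):
    """The sharp s -> 0 limit, equation 5.1."""
    return np.maximum(a - theta, 0.0)

a, theta = np.linspace(-2.0, 2.0, 9), 0.5
assert np.allclose(r_soft(a, theta, s=1e-3), r_sharp(a, theta), atol=1e-2)

# dr/da of equation 5.2 is a logistic function of (a - theta)/s; it
# approaches the step function of equation 5.1 as s -> 0.
z = np.clip((a - theta) / 1e-3, -500, 500)
dr = 1.0 / (1.0 + np.exp(-z))
assert dr[0] < 1e-6 and abs(dr[-1] - 1.0) < 1e-6
```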
The neuron’s weight vector rotates under the effect of Hebbian (Hebb, 1949) adaptation to a distribution $\Pr(\mathbf{x})$ of training patterns,

\[ \frac{d\mathbf{w}}{dt} \propto \langle \mathbf{x} \, r(\mathbf{w} \cdot \mathbf{x}) \rangle_{\{\mathbf{x}\}} - (\text{length constraint term parallel to } \mathbf{w}), \tag{5.3} \]
subject to a Lagrange length constraint $|\mathbf{w}| = 1$. The Lagrange term in equation 5.3 must operate parallel (or antiparallel) to $\mathbf{w}$, because it is derived from the gradient of $|\mathbf{w}|^2$ (Webber, 1994). The only effect of using a soft threshold, equation 5.2 instead of 5.1, is that the adaptation can boot-strap itself if all the training patterns initially lie below threshold. The convergent states for the two forms (equations 5.1 and 5.2) are clearly no different if s is small (Webber, 1997). (Actually, the soft form, equation 5.2, is used in the simulations here only because it is generally preferable to the original Webber (1994) form, equation 5.1: the simulation results can also be obtained using the s = 0 form, equation 5.1.) Because $u(\mathbf{x}, \mathbf{w}) = r(\mathbf{x} \cdot \mathbf{w})^2$, a function of $\mathbf{x} \cdot \mathbf{w}$ only, the objective function $U(\mathbf{w})$ will be symmetric under any permutation symmetry of $\Pr(\mathbf{x})$. Hebbian adaptation optimizes this objective function by gradient ascent (subject to the length constraint), seeking out one of its stable maxima. Maxima related to one another by permutation symmetries of $\Pr(\mathbf{x})$ have basins of attraction of equal area. A large enough population of neurons, adapting independently of one another, will converge to populate such maxima in equal numbers, provided their initial states were distributed randomly and uniformly over the unit hypersphere in $\mathbf{w}$-space.
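A discrete-time sketch of the adaptation just described (equations 5.1 and 5.3): the update is the ensemble average ⟨x r(w·x)⟩ with its component parallel to w removed, followed by renormalization to enforce |w| = 1. The learning rate, threshold, and gaussian training data below are arbitrary illustrative choices, not the article’s own simulation settings:

```python
import numpy as np

# One possible discrete-time version of equation 5.3 with the sharp
# threshold of equation 5.1; all parameter values are illustrative.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))                   # stand-in training ensemble
w = rng.normal(size=8); w /= np.linalg.norm(w)  # random unit initial state
theta, eta = 0.2, 0.1                           # threshold, learning rate

for _ in range(100):
    r = np.maximum(X @ w - theta, 0.0)          # r(w.x), equation 5.1
    drive = (X * r[:, None]).mean(axis=0)       # <x r(w.x)>_{x}
    drive -= (drive @ w) * w                    # subtract component || w
    w = w + eta * drive
    w /= np.linalg.norm(w)                      # enforce |w| = 1

assert np.isclose(np.linalg.norm(w), 1.0)       # length constraint holds
```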
Appendix A shows analytically that the threshold $\vartheta$ is a symmetry-breaking parameter, which determines how large a subgroup of $\Pr(\mathbf{x})$’s symmetries will leave the response $r(\mathbf{w} \cdot \mathbf{x})$ invariant when $\mathbf{w}$ has converged to a stable state. For low enough $\vartheta$, the weight vector will develop symmetry under every permutation that is a symmetry of $\Pr(\mathbf{x})$. As $\vartheta$ is raised, various symmetry-breaking bifurcations will be passed, which depend quantitatively on $\Pr(\mathbf{x})$. At each bifurcation, a previously stable state becomes unstable, to be replaced by a set of new stable states having a smaller symmetry group.⁹ The new stable states will be capable of invariant response for only a smaller group of transformations. The broken symmetries no longer map a stable state onto itself, but instead map the set of new stable states onto one another. The sequence in which the permutation symmetries of $\Pr(\mathbf{x})$ are progressively broken into smaller and smaller symmetry groups as $\vartheta$ is raised can be determined analytically (see appendix A). In this way, the symmetry-breaking parameter can be used to control the degree of invariance or discrimination in the neuron’s response. Among a collection of neurons adapting with various values for the symmetry-breaking parameter, those with lower values will develop invariant responses for a larger group of transformations, while those with higher values will discriminate to a greater extent. If all permutation symmetries are preserved so that the weight vector just ends up at the mean of the training ensemble, one obtains a useless neuron with no capacity to discriminate. If the spontaneous symmetry breaking is so extensive that only the identity symmetry is preserved, no invariances will develop.
It is the ability of the Webber (1994) rule to control the symmetry breaking under the influence of a symmetry-breaking parameter, in such a way that an intermediate-sized subgroup of the data’s permutation symmetries is preserved, which makes possible the development of response invariances in a discriminating neuron. It is clear from section 4 that a wide range of unsupervised algorithms will constitute spontaneous symmetry-breaking systems when trained on data distributions that possess symmetries, but this does not mean that such systems preserve their symmetries in general. Only the single-cell rule of Webber (1994) has so far been proved analytically to be able to preserve specific subgroups up to and including the group of all the permutation symmetries of the data and to be able to break them in a well-defined and controlled sequence under the influence of a symmetry-breaking parameter. (It is also possible to replace the threshold function r(a) in equations 5.1 and 5.2 by the more conventional and more biologically plausible sigmoid, and to use the same single-cell Hebbian rule and symmetry-breaking parameter of Webber (1994) to reproduce this article’s empirical demonstrations, though at the expense of no longer being able to characterize the symmetry-preserving behavior analytically so easily.)

⁹ If the symmetry is exact and discrete, the transition to states of lower symmetry will be abrupt. However, if $\Pr(\mathbf{x})$ has an almost continuous sequence of progressively less accurate symmetries, the stable states can appear to change continuously as these approximate symmetries are progressively broken. Section 6’s synthetic data set has examples of both of these kinds of symmetry.

6 A First Demonstration Using Completely Defined Statistics
Webber (1991) provided an empirical demonstration of self-organized permutation invariance, using a synthetic data set having completely defined statistics. The data set was designed to model the problem of finding statistical dependences among sparse-coded features. The same data set is used here to demonstrate that such invariances arise through the spontaneous symmetry-breaking mechanism. The statistics of this example data set are made simple to define, so that one can perceive clearly the invariances of the data and the self-organized symmetries of the weight vectors. However, the statistical dependences are nontrivial and cannot be revealed by principal component analysis or algorithms designed to find localized clusters, as is explained below. This example addresses the problem of revealing the statistical dependences between sparse, nonoverlapping, sharp spikes of neural activity, and is described in greater length, with the benefit of diagrams, in Webber (1991). Every pattern of this data set consists of exactly four spikes of unit height on a zero background. Each spike occurs in one of four separate one-dimensional (1D) coordinate spaces labeled A to D, and there is always exactly one spike in each of the four spaces. The location of the spike within each space is determined by a single position-coordinate t, which can take integer values modulo n/4; that is, the four 1D spaces $t_A, \ldots, t_D$ are each discretized into n/4 locations and have circular boundary conditions. The spikes can be thought of as representing the responses of n localized feature detectors, arranged in four rows, each containing n/4 detectors. Each row detects localized features of one of four types in hypothetical 1D images, within which the location of the feature is measured by the position-coordinate t. Together the four rows of feature detectors encode those images as n-dimensional code vectors $\mathbf{x}$, using a sparse 4-of-n code.
The analogy with rows of feature detectors is taken further in the more realistic but less precisely definable example of section 7. The n-dimensional $\mathbf{x}$-vectors of this data set all subtend an angle $\arccos\sqrt{4/n}$ with the uniform n-dimensional vector $(1, 1, \ldots, 1)$. Positive-valued sparse codes have the property that their code vectors subtend large angles with the uniform vector. Every $\mathbf{x}$-vector lies on one of the many vertices of the unit hypercube that lie $\sqrt{4}$ length-units from the origin. Nonidentical patterns are separated from one another by distances of $\sqrt{1}$, $\sqrt{2}$, $\sqrt{3}$, or $\sqrt{4}$ units in $\mathbf{x}$-space, depending on the number of spikes the patterns happen to share in common at corresponding locations. No two patterns are farther than 2 units from each other, and no two patterns lie closer than 1 unit.
Individually, the spikes are uniformly distributed over the four 1D spaces:

\[ \Pr(t_A) = \Pr(t_B) = \Pr(t_C) = \Pr(t_D) = 4/n. \tag{6.1} \]
Despite their uniform marginal distributions, the locations of the spikes are statistically dependent on one another, with two significantly different correlation lengths $s_2 > s_1 \gg 1$. The statistics $\Pr(\mathbf{x})$ of these training data are determined completely from the joint distribution

\[ \Pr(t_A, t_B, t_C, t_D) = \frac{4}{n} \, N(t_A - t_B, s_1) \, N(t_C - t_D, s_1) \, N\!\left( \frac{t_A + t_B}{2} - \frac{t_C + t_D}{2}, \, s_2 \right). \tag{6.2} \]
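One plausible way to sample patterns consistent with equation 6.2 (a reconstruction for illustration, not the article’s own code): draw a uniform global center, a gaussian pair-center separation of width $s_2$, gaussian within-pair separations of width $s_1$, and round the four resulting coordinates onto the circular rows:

```python
import numpy as np

# A hypothetical sampler for the four-spike data of section 6; the
# values of n, s1, and s2 are arbitrary illustrative choices.
rng = np.random.default_rng(4)
n, s1, s2 = 64, 2.0, 6.0
L = n // 4                                      # locations per row

def sample_pattern():
    c = rng.uniform(0, L)                       # global center (uniform)
    d2 = rng.normal(0, s2)                      # pair-center separation
    mAB, mCD = c + d2 / 2, c - d2 / 2           # (A,B) and (C,D) centers
    dAB, dCD = rng.normal(0, s1), rng.normal(0, s1)
    ts = [mAB + dAB/2, mAB - dAB/2, mCD + dCD/2, mCD - dCD/2]
    x = np.zeros(n)
    for row, t in enumerate(ts):                # circular boundaries
        x[row * L + int(np.round(t)) % L] = 1.0
    return x

X = np.array([sample_pattern() for _ in range(200)])
# Every pattern is a sparse 4-of-n code: exactly one spike per row.
assert np.all(X.sum(axis=1) == 4)
assert np.all(X.reshape(200, 4, L).sum(axis=2) == 1)
```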
Although (discretised) univariate gaussians N(t, s), of zero mean and standard deviation s, appear in $\Pr(t_A, t_B, t_C, t_D)$, $\Pr(\mathbf{x})$ is clearly not a multivariate gaussian and is not localized into distinct clusters. Algorithms designed to find localized clusters are inappropriate for revealing the statistical dependences between these nonoverlapping spikes, because there are no distinct, localized clusters in $\mathbf{x}$-space to find. However, both the pair of spikes (A, B) and the pair (C, D) appear as relatively localized structures in image-space t, as is evident from the marginal distributions

\[ \Pr(t_A, t_B) = \frac{4}{n} N(t_A - t_B, s_1) \quad \text{and} \quad \Pr(t_C, t_D) = \frac{4}{n} N(t_C - t_D, s_1). \tag{6.3} \]
In image-space t, each pair is a composite structure that has an internal degree of freedom ($t_A - t_B$ or $t_C - t_D$). Webber (1991) points out that internal degrees of freedom of such composite image-space structures are analogous to shape degrees of freedom. The “centers of gravity” of the two pairs are uniformly distributed over image-space t,

\[ \Pr\!\left( \frac{t_A + t_B}{2} \right) = \Pr\!\left( \frac{t_C + t_D}{2} \right) = 4/n, \tag{6.4} \]
although the two centers are locally correlated with each other:

\[ \Pr\!\left( \frac{t_A + t_B}{2}, \frac{t_C + t_D}{2} \right) = \frac{4}{n} \, N\!\left( \frac{t_A + t_B}{2} - \frac{t_C + t_D}{2}, \, s_2 \right). \tag{6.5} \]
The two pairs are intercorrelated more weakly than they are intracorrelated, because $s_2 > s_1$. In the limit $s_2 \to \infty$, the pairs would cease to be intercorrelated at all. The statistical dependences define composite structures on two size scales: the strongly correlated (A, B) and (C, D) pairs on the finer scale, and the weaker associations between those (A, B) and (C, D) pairs
on the coarser scale. The coarser-scale composite structures have a shape degree of freedom $(t_A + t_B)/2 - (t_C + t_D)/2$. The statistics $\Pr(t_A, t_B, t_C, t_D)$ are invariant under any translation $t \to t + \tau$ applied to all four t-coordinates (corresponding to translationally invariant statistics of the hypothetical images that contain the features). These global translation invariances give rise to permutation symmetries of $\Pr(\mathbf{x})$. The distribution is also symmetric under the two global exchanges $t_A \leftrightarrow t_B$ and $t_C \leftrightarrow t_D$, permutations that map the four t-coordinates onto one another, and also under the global exchange $(t_A, t_B) \leftrightarrow (t_C, t_D)$. More interesting are the approximate symmetries that permute individual pairs of neighboring $x_i$, corresponding to similar t-values, that is, $t \leftrightarrow t + \tau$ where $|\tau| \ll s_1$. These approximate symmetries result from the fact that N(t, s) changes appreciably only over a scale length $|t| \sim s$, so that a pair of $x_i$ corresponding to neighboring t-locations within the same t-space will have very similar statistics. These symmetries would correspond to approximate invariances of the statistics of the hypothetical 1D images, under local shifts of individual features. Principal component analysis (PCA) would be inadequate to deduce the localized nature of the structures defined by these correlations, because the eigenvectors of Toeplitz matrices are periodic, not localized. (The distinction between the two correlation lengths would result in periodic eigenvectors of distinct wavelengths, corresponding to distinct eigenvalues, however.) Because of its threshold nonlinearity, without which the Hebbian neuron could do no more than the PCA neuron of Oja (1982), the neuron described by equation 5.1 is sensitive to higher-order statistics—that is, properties of $\Pr(\mathbf{x})$ other than its mean and covariance. Its nonlinearity allows its convergent states to undergo symmetry-breaking bifurcations under the control of a neural parameter.
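The exact symmetries just listed can be confirmed directly on the density of equation 6.2 (the 4/n prefactor and gaussian normalizations cancel between the two sides and are omitted; the parameter and coordinate values are arbitrary):

```python
import numpy as np

# Symmetries of the joint density of equation 6.2, checked numerically.
def N(t, s):
    """Unnormalized zero-mean gaussian (normalization cancels below)."""
    return np.exp(-t**2 / (2 * s**2))

def joint(tA, tB, tC, tD, s1=2.0, s2=6.0):
    return (N(tA - tB, s1) * N(tC - tD, s1)
            * N((tA + tB) / 2 - (tC + tD) / 2, s2))

tA, tB, tC, tD, tau = 3.0, 4.5, 8.0, 7.0, 11.3
# Global translation t -> t + tau applied to all four coordinates:
assert np.isclose(joint(tA + tau, tB + tau, tC + tau, tD + tau),
                  joint(tA, tB, tC, tD))
# Exchanges tA <-> tB and (tA, tB) <-> (tC, tD):
assert np.isclose(joint(tB, tA, tC, tD), joint(tA, tB, tC, tD))
assert np.isclose(joint(tC, tD, tA, tB), joint(tA, tB, tC, tD))
```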
This is demonstrated in Figure 1. Through the preservation of some permutation symmetries of $\Pr(\mathbf{x})$ and the breaking of others, this Hebbian neuron is able to reveal the two scales of localized composite structure present in the data. If the bifurcation parameter $\vartheta$ is high enough, the neuron develops into a detector for either (A, B) pairs or (C, D) pairs—the finer-scale structures. Because the global translation symmetry of $\Pr(\mathbf{x})$ has been broken, a pair detector can resolve its pair’s rough location. Because the approximate symmetry of $\Pr(\mathbf{x})$ under exchange of neighboring t-locations has been preserved, the responses of these pair detectors are tolerant to variation in the shape degrees of freedom $t_A - t_B$ and $t_C - t_D$. If $\vartheta$ lies between the two bifurcation points, the neuron resolves the coarser-scale structure but can no longer resolve the finer-scale structure. Again, because the global translation symmetry of $\Pr(\mathbf{x})$ has been broken, the neuron can resolve the rough location of the coarser-scale structure. Again, the approximate symmetry of $\Pr(\mathbf{x})$ under exchange of neighboring t-locations has been preserved, so the neural response is tolerant to the shape degree of freedom $(t_A + t_B)/2 - (t_C + t_D)/2$. If the bifurcation parameter $\vartheta$ is low enough, none of
Figure 1: The weight vector $\mathbf{w}$ of the threshold neuron (see equation 5.1), having converged under Hebbian adaptation (see equation 5.3) to patterns drawn from the permutation-symmetric distribution $\Pr(\mathbf{x})$ defined by equation 6.2, is plotted for each of three values of the bifurcation parameter $\vartheta$. Permutation-symmetric receptive fields have developed from randomly initialized, asymmetric initial states. The approximate symmetry of $\Pr(\mathbf{x})$ under individual swaps of closely neighboring t-locations is preserved unbroken into the convergent states. The receptive fields vary smoothly with t, resulting in local position tolerance of the neural response, and tolerance to the “shape” degree of freedom. Above the second bifurcation, two kinds of convergent state can be obtained: the one depicted and another, related to it by the symmetry transformation $(t_A, t_B) \leftrightarrow (t_C, t_D)$, which is broken at the second bifurcation. The global translation symmetry of $\Pr(\mathbf{x})$ is broken at the first bifurcation, and the resulting localized convergent states are related to one another via those broken translation symmetries. The special choice of origin used for this figure shows the states depicted as centered, although there are equally many such states at every location, identical up to a translation. Broken symmetries always relate convergent states that have equal basins of attraction, so such states are equally likely to develop if the initial states were uniformly distributed. (The very small value $10^{-3}$ used for s means that $r(\mathbf{w} \cdot \mathbf{x})$ is a very close approximation to the sharp nonlinearity; see equation 5.1.)
Self-Organization of Symmetry Networks
581
the permutation symmetries of Pr( x) is broken, and w O converges to the mean of Pr(x ), as predicted by the theory. When no symmetries are broken, the neuron has no capacity to discriminate between patterns. Each bifurcation breaks some symmetries but not others; the symmetry-breaking mechanism gives rise to self-organized detectors that can discriminate between structures but generalise over their internal degrees of freedom. A different kind of structure is detected in each distinct regime of the symmetry-breaking parameter. A large enough population of such neurons, with randomly initialized weights and randomly set #-values, will converge to span the full variety of convergent states available. This demonstration gives examples of both discrete, exact symmetries (the global exchanges tA $ tB , tC $ tD and (tA , tB ) $ ( tC , tD )) and continuous, approximate symmetries (the symmetries under local pair-wise swaps t $ t C t, which become progressively less accurate with increasing |t |). The breaking of exact, discrete symmetries involves abrupt transitions to states of lower symmetry, as is the case for the two bifurcations illustrated in Figure 1. The breaking of an almost continuous sequence of progressively less accurate symmetries involves an apparently continuous process of transition, however. Although this is not illustrated, the smooth weight proles of Figure 1 become progressively narrower as the bifurcation parameter # is raised. This narrowing process corresponds to the progressive breaking of the continuous sequence of approximate symmetries under local pair-wise swaps t $ t C t , in which less accurate, larger |t | symmetries are broken before more accurate, smaller |t |, symmetries. Preservation of symmetry under local permutations of neighboring input coordinates has given rise to locally smooth weight-vector proles, even though the individual training patterns are sharply spiked and not smooth. 
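The symmetry-preserving property of the Hebbian adaptation can be sketched numerically. The update used below is an assumption for illustration (the actual rule, equation 5.3, is not reproduced in this excerpt): a batch thresholded Hebbian step whose fixed points satisfy the stationarity condition of appendix A. The sketch checks that the step commutes with a permutation that leaves the training set invariant, so a weight vector that starts symmetric under that permutation can only stay symmetric:

```python
import numpy as np

def hebb_step(X, w, theta):
    # One batch step of an assumed thresholded Hebbian update whose fixed
    # points satisfy the stationarity condition of appendix A (the exact
    # rule, equation 5.3, is not reproduced in this excerpt).
    proj = X @ w
    H = proj > theta                      # above-threshold subset H(w)
    w_new = ((proj[H] - theta)[:, None] * X[H]).sum(axis=0)
    return w_new / np.linalg.norm(w_new)

# Toy training set invariant under swapping coordinates 0 and 1.
X = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.0], [0.5, 0.5, 1.0]])
P = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]])   # the swap permutation
assert np.array_equal(np.unique(X @ P, axis=0), np.unique(X, axis=0))

# Equivariance: permuting the weights then updating equals updating then
# permuting, so a swap-symmetric weight vector remains swap-symmetric.
w = np.array([0.3, 0.8, 0.5]); w /= np.linalg.norm(w)
theta = 0.1
assert np.allclose(hebb_step(X, P @ w, theta), P @ hebb_step(X, w, theta))
```

An arbitrary adaptation rule would not satisfy this equivariance check; it holds here because the update is built only from projections x · w and from the patterns themselves.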
This does not happen with every equation of motion that has dynamical symmetries, however, because arbitrary adaptation rules are not in general guaranteed to preserve permutation symmetries as the Hebbian rule of Webber (1994) is. If the local permutation symmetries of this data set were not preserved, the converged weight-vector profiles would instead appear asymmetric, spiky, and disordered, and would not provide translation-tolerant response.

7 Hebbian Adaptation to Simple-Cell-Encoded Natural Images
The statistical dependences present in temporally uncorrelated, natural image data can also give rise to local weight symmetries. This demonstration also shows that the mechanism is no less applicable if the input vectors are real valued, rather than binary valued as in the previous idealized example. However, for nonsynthetic data, the distribution Pr(x) is not constructed with symmetries defined a priori, so it is not so easy to determine precisely which local permutation symmetries of Pr(x) are responsible for the approximate local weight symmetries that develop. These statistical symmetries
are not necessarily simple pair-wise swaps as in the previous synthetic example, but may involve larger subsets of the set of feature detectors. When exposed to a temporally uncorrelated ensemble of spectrally flattened natural images that has rotationally invariant statistics, the Hebbian threshold neurons of Webber (1994) develop localized, oriented, bandpass receptive fields that appear similar to those of cortical simple cells. Two examples are shown in Figure 2, obtained under simulation conditions described in appendix B. (The two examples chosen for Figure 2 are rather special in that they are horizontally and vertically oriented. Actually one can obtain receptive fields having a range of orientations, positions, and size scales, depending on the random initialization of the weight vector.) This result raises the questions of what is gained by transforming images into this sparse, nonlinear simple-cell representation, what happens if similar, higher-level Hebbian neurons are trained on input that has already been preprocessed by a lower network layer into this simple-cell-encoded form, and whether the naturally occurring statistical dependences between the simple-cell outputs can drive the development of weight symmetries in higher-level neurons. These questions have to be addressed using much larger natural images than those that other authors have used to derive simple-cell receptive fields (160 × 160 pixels as opposed to 16 × 16). This is because one has to simulate a significant number of simple cells simultaneously, which collectively cover an image area substantially larger than the individual receptive fields. Only then can one demonstrate the emergence of envelopes of translation tolerance that are substantially larger than the simple cells' receptive fields.
The simulations required are much more time-consuming, because more training patterns are required to characterize the greater number of degrees of freedom that exist in higher-dimensional pattern-vector spaces. Consequently, constraints of computing resources have restricted the number of simple-cell orientations simulated here to two (horizontal and vertical), and restricted all the simple cells to having the same receptive field size. This is because one can efficiently simulate many receptive fields that share the same orientation and size by exploiting the convolution theorem (see appendix B). A more realistic demonstration of complex-cell self-organization would require the simulation of many more simple cells, spanning a full range of orientations, sizes, modal spatial frequencies, and modal Fourier phases. However, these simulations are sufficient to demonstrate the self-organization of weight symmetries driven by the statistical dependences within temporally uncorrelated natural data. Figure 2 describes the first, simple-cell layer of a two-layer network, in which all neurons adapt according to the Webber (1994) Hebbian rule. Such a two-layer network of these Hebbian neurons was originally studied in Webber (1994), analytically and using synthetic data. All neurons in the simple-cell layer have the same threshold function (see equations 5.1
Figure 2: Grids of 512 "simple cells," whose outputs form the input vector for the "complex cell" whose weight vector is represented in subsequent figures. These simple cells have either horizontal orientation selectivity (left) or vertical (right). The numbers in the figure represent pixel coordinates within the input image. The 256 horizontal cells differ only in the location of their receptive field's centroid within the image plane (as do the 256 vertical cells). The centroids of the two kinds of simple cell are located on the lattice points of two corresponding grids (lower). These simple cells sample the image plane anisotropically; they are most densely packed along the dimension perpendicular to their orientation axis. The simple cells have no lateral connections with each other in this model, so the complex cell must deduce their topological relationships purely on the basis of the joint statistics of their responses. The receptive fields are bandpass; their 2D power spectra all have bandwidths of 0.082 fN and 0.14 fN measured parallel and perpendicular to the orientation axis, respectively. (fN denotes the Nyquist frequency.) Their mean spatial frequency perpendicular to the orientation axis is 0.26 fN. A single receptive-field size is used for all the simple cells so that the convolution theorem can be exploited in order to compute many scalar products efficiently (see appendix B).
and 5.2) and threshold value,¹⁰ and collectively transform the spectrally flattened input images into a nonlinear sparse representation. The array of real-valued, nonnegative simple-cell outputs so generated for each image is used as the input vector x for a single higher-level neuron, which adapts by the same rule. It is this "complex cell" whose self-organization we are primarily concerned with. The array of simple cells described in Figure 2 can be conceptually divided into two grids of neurons. One grid consists of similar receptive fields all oriented vertically, and the other of similar receptive fields all oriented horizontally. The centroids of the vertically oriented receptive fields are more closely packed along the horizontal image dimension than along the vertical, and the converse is true for the horizontally oriented fields. Field (1987) has proposed, on the grounds of code completeness and economy, that simple cells should not overlap with other similarly oriented simple cells to a greater extent along their orientation axes than perpendicular to their orientation axes, and therefore that their centroids may in reality be distributed anisotropically in this way with respect to orientation axis. With the topographic arrangement of simple cells that is used here (in Figure 2), the distinct receptive fields are roughly orthogonal to one another.¹¹ The simple cells have no lateral connections with each other in this model, so the complex cell must deduce their topological relationships purely on the basis of the joint statistics of their responses. Each 160 × 160-pixel training image is excised from a larger landscape photograph, which has been subjected to a different and uniformly distributed random rotation and then spectrally flattened (see appendix B). The ensemble of training images therefore has rotationally invariant statistics.
¹⁰ The simple cells' thresholds # are set at 0.012, the same value as was used during the training process that generated their receptive fields. (The training used a very small but nonzero value of the softness parameter s = 5 × 10⁻⁴, so the hard-threshold limit [see equation 5.1] is approximated very closely.) # and s are quoted in input-vector length units in which the whitened input images have pixel variance measured as 4 × 10⁻⁵ squared length units. The variance of the projection x · w along each simple cell's unit weight vector is also 4 × 10⁻⁵ (because the image ensemble is on average spectrally flat, and so must have an isotropic covariance matrix).

¹¹ The scalar product between the weight vectors of immediate neighbors in the same row is 0.0027, corresponding to a w-space angle of 89.8 degrees. The corresponding values for nearest neighbors in adjacent rows of the same orientation grid are 0.29 and 73 degrees, and 0.15 and 81 degrees for nearest neighbors separated by two rows. The values for colocated receptive fields of perpendicular orientations are 0.17 and 80 degrees.

Figure 3: A self-organized configuration of input synapses for the second-layer "complex cell." Heights represent values of synaptic weights (scaled by a factor of 1000). The 512 synapses provide input from the 512 "simple cells" located at the 512 grid points depicted in the previous figure. (Although not constrained a priori to be nonnegative, all the complex cell's synaptic weights have developed nonnegative values, because the outputs of the simple cells that feed these synapses are always nonnegative.) In the example, the global rotation symmetry of the data's statistics happens to have been broken in such a way that this complex cell will only be triggered by activity in the horizontally oriented simple cells: the complex cell depicted has horizontal orientation selectivity. The weight profile is smooth, approximately symmetric under permutations of neighboring weights. The complex cell will have translation-tolerant response as a result. In this simulation, the threshold parameter # was set at 0.02 (measured in the same length units as the simple cells' thresholds).

Consequently, the joint distribution Pr(x) of simple-cell outputs will be invariant under the global permutation that swaps over the two grids of Figure 2: the permutation that exchanges the set of all vertically oriented simple cells with the set of all horizontally oriented simple cells, while maintaining the topographic relationships within each set. Neglecting edge effects, the training ensemble is also invariant under global translations, because the training images were selected randomly from uniformly distributed locations within larger landscape photographs and because the training images are presented repeatedly to the network, translated to every possible position offset with respect to the network's input window (see appendix B). Pr(x) will therefore be approximately invariant under those global cycles among simple cells that correspond to these translations. These symmetries are only approximate because the periodic boundary condition implied by global cycling is an imperfect model of image edge effects. If Pr(x) also has approximate symmetries under more local permutations, involving smaller subsets of simple cells as in the synthetic example of section 6, those symmetries will give rise to localized weight symmetries when the symmetry-preserving Hebbian rule is trained on these simple-cell outputs. Further details of how this Hebbian rule is simulated here are given in appendix B.

Figure 3 depicts a self-organized configuration of the higher-level Hebbian "complex cell," trained on the outputs of Figure 2's simple-cell grid as images from the training ensemble are presented. After adapting from
a random initial configuration, the input synapses of this higher-level cell vary smoothly with the grid location of the corresponding simple cell. The approximate weight symmetries that have developed in the complex cell's receptive field mean that the complex cell will be tolerant to the position of the simple cells that trigger it. The complex cell's translation tolerance is substantially greater than that of the simple cells, which will be of the order of the separation between peak and trough in the simple cells' receptive fields (3 pixels), or less. This observation is consistent with the experimental result that real complex cells are position tolerant over a distance greater than the width of the barlike features for which they are optimally sensitive (Hubel & Wiesel, 1962). One can see from Figure 3 that the "complex cell" has developed so as to become sensitive to activity in simple cells of one orientation only; the global rotation symmetry has been broken. The complex cell's orientation selectivity has developed because the statistical dependences between neighboring simple cells of the same orientation are stronger than the statistical dependences between simple cells of very different orientations. Long image features typically trigger several similarly oriented neighboring simple cells simultaneously, but neighboring simple cells of very different orientations are simultaneously triggered less frequently, by rarer features such as corners, which have both horizontal and vertical components. (Because rotational symmetry has been broken, the translation-invariance result cannot be attributed to rotational symmetry, and so the credibility of that result is not undermined by the fact that rotational invariance of the training statistics has been reinforced by applying random rotations.)
Figure 4 illustrates that broken permutation symmetries map distinct stable states onto one another, and that raising the value of the Webber (1994) threshold parameter # breaks symmetry, as the analytical treatment of appendix A predicts. At the lower of the two threshold values depicted, the stable weight vectors are approximately invariant (i.e., invariant up to edge effects) under global permutations that correspond to translations of the whole image. The 90-degree rotation symmetry of Pr(x) is already broken at this #-value; this rotation symmetry maps the two stable configurations obtainable at this #-value onto one another. When # is raised to the higher value, the global translation symmetry is broken, and longer-range translations now map distinct localized stable configurations onto one another. Even at this higher #-value, the translation tolerance of the "complex cell" is still 10 times wider than that of the simple cells that provide its input; its weight vector remains approximately symmetric under the most localized permutations of neighboring simple cells.

8 Discussion
Webber (1991) proposed that a role of sparse feature coding might be to map the raw data into a representation in which transformations of features, such as local shift and deformation, are rendered as permutations of the output values among the various feature-coding units. That hypothesis can be extended by observing that any invariances of the raw data's statistics under such transformations would give rise to homomorphic symmetries of the sparse code's statistics under permutations among the coding units. Such mapping of invariances onto permutation symmetries would be a useful function for sparse coding to perform, because permutation symmetries can drive the development of invariant response in neurons of the next network layer, via the spontaneous symmetry-breaking mechanism. The examples have illustrated how translations (of individual features or sets of features) can be mapped onto permutations by a sparse feature code. When the features are moved, a different subset of feature detectors fires, which corresponds to the new feature locations. By analogy, one can see how scale transformations might be mapped onto permutations by a sparse code based on self-similar wavelets of various size scales; when a feature or set of features is enlarged or reduced, a different subset of wavelet detectors will fire, corresponding to the new feature sizes. Because the principles proposed here do not constrain the coding units of sparse codes to have any particular topological relationships (such as the 2D topology of the image plane, for example), it is conceivable that rotations in three dimensions could be represented as permutations of some more abstract, higher-level sparse code in which 3D features having very different orientations in 3D space are encoded by different feature detectors. (However, it would be inefficient to encode certain very simple transformation groups by homomorphism onto the permutations. For example, image illumination variation can be effectively removed by simple preprocessing operations such as local normalization or local background subtraction on a logarithmic scale.)
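This mapping of translations onto permutations can be illustrated with a toy detector bank built by circularly shifting a single feature template; the template, threshold, and wraparound boundary below are illustrative assumptions, not the simple-cell model of the paper:

```python
import numpy as np

def sparse_code(signal, kernel, theta):
    # Detector bank: one unit per (circular) shift of the same feature
    # template; a unit fires iff its matched-filter response exceeds theta.
    n = len(signal)
    resp = np.array([np.dot(np.roll(kernel, k), signal) for k in range(n)])
    return (resp > theta).astype(int)

kernel = np.array([1.0, 1.0, 0, 0, 0, 0, 0, 0])   # a localized feature template
signal = np.array([0, 0, 1.0, 1.0, 0, 0, 0, 0])   # the feature at position 2
code = sparse_code(signal, kernel, 1.5)

# Translating the input by t just permutes (cyclically shifts) the code vector:
t = 3
assert np.array_equal(sparse_code(np.roll(signal, t), kernel, 1.5),
                      np.roll(code, t))
```

When the feature moves, a different detector fires, at the position corresponding to the new feature location, which is exactly the permutation (here a cyclic shift) of the output values described above.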
Sparse feature codes represent individual features of the data on individual coding units, or feature detectors. If a sparse feature code has permutation symmetries Pr(S(x)) ≈ Pr(x), what does this imply about the subset of features exchanged under the permutation? If exchanging one feature with another (or one subset of features with another subset) is to transform every training pattern into a roughly equiprobable pattern, then the exchanged features (or subsets of features) must have similar statistics, so that the exchanged features always arise in similar contexts with similar likelihood. ("Similar context" means that apart from the features exchanged, the same combination of features is present in the two equiprobable training patterns; the pattern of responses of the rest of the code's feature detectors is the same for the two patterns.) Thus, if a feature code has permutation symmetries, this means that the exchanged features are indistinguishable from each other on the basis of their statistical dependences on the rest of the features that make up the code. In the spontaneous symmetry-breaking mechanism, it is this property of statistical indistinguishability among features that can drive the development of invariant neural response. The example of section 6 illustrates how discrete local permutations can represent effectively continuous transformations like local position shifts and shape deformations. However, permutations can also represent absolutely discrete transformations, because permutations are inherently discrete transformations themselves. One example of an absolutely discrete transformation is the global exchange of all visual cells having left-ocular dominance with all cells having right-ocular dominance. Yamagishi (1994) has already argued that the breaking of this obvious symmetry may be responsible for the formation of ocular dominance columns. This leads us to speculate that the symmetry-preservation mechanism proposed here could explain how invariance with respect to ocularity might be reacquired by cells involved in higher stages of visual processing. Another example of a discrete transformation is the replacement of a word in a sentence by a synonymous word. If the exchanged words have always arisen in similar contexts in the listener's experience and have similar frequencies of occurrence, then they are statistically indistinguishable by definition. Another example is the difference in pronunciation of the vowels between one speaking accent and another. In principle, the statistical-indistinguishability model could allow invariance with respect to pronunciation to be learned, because the different pronunciations of the vowels are always experienced in the same contexts.
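The criterion Pr(S(x)) ≈ Pr(x) can be checked empirically: a permutation S is a symmetry of a feature code's statistics exactly when the empirical distributions of x and S(x) coincide. A minimal sketch (the toy pattern set is invented for illustration):

```python
from collections import Counter

# Toy binary feature-code ensemble in which units 0 and 1 are statistically
# indistinguishable: every pattern containing unit 0 has an equally frequent
# counterpart with unit 1 in the same context.
patterns = [(1,0,1,0), (0,1,1,0), (1,0,0,1), (0,1,0,1), (0,0,1,1), (0,0,1,1)]

def tv_distance(perm):
    # Total-variation distance between Pr(x) and Pr(S(x)); zero means the
    # permutation is an exact symmetry of the empirical distribution.
    freq = Counter(patterns)
    swapped = Counter(perm(p) for p in patterns)
    keys = set(freq) | set(swapped)
    return 0.5 * sum(abs(freq[p] - swapped[p]) for p in keys) / len(patterns)

swap01 = lambda p: (p[1], p[0]) + p[2:]   # exchanges units 0 and 1
swap02 = lambda p: (p[2], p[1], p[0]) + p[3:]   # NOT a symmetry, for contrast

tv = tv_distance(swap01)
tv_bad = tv_distance(swap02)
print(tv, tv_bad)   # 0.0 and ~0.33
```

A zero distance certifies that the exchanged units always arise in the same contexts with the same frequency, which is the statistical indistinguishability discussed above.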
Figure 4: Facing page. The effect of varying the complex cell's threshold parameter # and the effect of starting from different random initial weight configurations. Neglecting edge effects, the two weight vectors obtainable at # = 0.02 (a and b) are approximately symmetric under long-range translations, but not under exchange of horizontal with vertical; the "complex cell" is translation tolerant but orientation specific. The global translation symmetry of the data's statistics (made approximate by edge effects) has been preserved, but the 90-degree rotation symmetry has been broken. The rotation symmetry instead maps the two distinct convergent states a and b onto one another. As the threshold is raised to # = 0.04, the receptive fields become more localized (c and d); the global translation symmetry has now been broken, but the weights are still approximately symmetric under more localized permutations. As a result, the translation tolerance of the complex cell's response will have become more short range but not eliminated. Even at the higher #-value, the width of the tolerance envelope is still about 20 pixels, 10 times wider than the simple cells' tolerance width. With the long-range translation symmetry broken at the higher #-value, translations as well as 90-degree rotations map distinct convergent states onto one another (e.g., c onto d). Although not depicted, the rotation symmetry may be restored by lowering # (to 0.01); then the weights all become equal (edge effects make the equality approximate).
Invariant neurons in symmetry networks achieve invariance by discarding information about the transformation's degrees of freedom. For example, a translation-invariant complex cell does not retain precise information about its feature's position degree of freedom. The discarded degree-of-freedom information is not destroyed; it is still obtainable from other neurons, for example, from the simple cells from which the complex cell takes its input. However, nowhere is there stored a feature-independent generator of the transformation, which could be used to generate transformed copies of arbitrary features from prototypical templates. The transformation is not represented independently of the object on which it acts. There is an
advantage to this apparent omission in that there is no binding problem: no ambiguity in deciding which, of the many objects that may be present, each particular transformation applies to. Single-cell learning rules such as the Webber (1994) rule do not attempt to minimize statistical dependences between neurons as Barlow (1989) proposed. Indeed, clear statistical dependences between simple cells are needed to drive the development of complex-cell invariances in the symmetry-preservation theory, so the elimination of statistical dependence between simple cells would not be a useful objective from the point of view of this theory. Other theories have been proposed to account for the unsupervised discovery of unknown invariances, based on the temporal association of sequences of transformation-related patterns over short memory timescales (Foldiak, 1991; Wallis et al., 1993; Bartlett & Sejnowski, 1995). Using simple synthetic training patterns rather than natural images, Foldiak (1991) argued that complex cells could discover short-timescale correlations between neighboring simple cells if they were to average the simple-cell activity pattern over those short timescales. Although the explanation of complex-cell development offered here does not exploit time dependence, the general principle of invariance discovery proposed here is not incapable of making use of time dependence. However, it would do so in other contexts and in a different way. For example, it could discover statistical indistinguishabilities among the statistics of sets of motion detectors. Temporal association theories of invariance discovery have problems in principle that are not shared by the symmetry-preservation theory.
If a neuron's memory timescale is to be short enough for it to be sensitive to the temporal associations in short transformation sequences, it is hard to see why it would not forget those associations just as quickly if its stimulus temporarily became stationary, and become matched to the stationary stimulus instead, so losing its transformation invariance. By exposure to rotation sequences, complex cells would find associations between simple cells of different orientations as easily as they find associations between simple cells of different positions from translation sequences. The temporal association model does not explain why complex cells are orientation specific but translation tolerant, except by assuming that because translation sequences are (presumably) so much more prevalent than rotation sequences, cells would quickly forget any rotation sequences that they happened to learn. That assumption raises a stability-plasticity dilemma, however; if cells forget some transformation sequences in order to learn newer ones, it is not obvious that their response properties would ever settle down. Temporal association models are not relevant to discrete transformations where neurons never experience a temporal sequence, for example, the transformations that relate different speakers' pronunciations of the same vowel, or the transformation that exchanges ocularity between left and right. Association is just as important in the symmetry-preservation theory as in temporal association theories, but the association is not on the basis of similarity over time, but similarity of context: statistical indistinguishability. In principle, symmetry preservation provides a generic mechanism for generating neurons that respond similarly to different features that arise in similar contexts.

Appendix A: Symmetry-Breaking Bifurcation in the Hebbian Threshold Neuron
Webber (1994) showed that the condition for the weight vector of the neuron (see equation 5.1) to be stationary under the Hebbian dynamics is

⟨x xᵀ⟩_H(ŵ₁) ŵ₁ − # ⟨x⟩_H(ŵ₁) = b ŵ₁   (positive scalar b),   (A.1)

where the average is now taken over just that subset H of patterns for which the neuron's response r rises above threshold:

x · ŵ > #  ⇔  x ∈ H(ŵ).   (A.2)

In addition, for ŵ₁ to be a stable stationary state, any small perturbation ŵ = ŵ₁ + ε perpendicular to ŵ₁ must satisfy

ε̂ᵀ ⟨x xᵀ⟩_H(ŵ₁) ε̂ < ŵ₁ᵀ ⟨x xᵀ⟩_H(ŵ₁) ŵ₁ − # ŵ₁ᵀ ⟨x⟩_H(ŵ₁)   (ε̂ ≡ |ε|⁻¹ ε).   (A.3)

Only perturbations ε perpendicular to ŵ₁ are permitted by the constraint |ŵ| = 1. This stability condition is an upper limit on the eigenvalues of that part N_H(ŵ₁) of the matrix ⟨x xᵀ⟩_H(ŵ₁) that lies in the subspace of permissible perturbations ε, that is, the part of ⟨x xᵀ⟩_H(ŵ₁) that projects normal to ŵ₁ (Webber, 1994). Those conditions can be used to define the symmetry-breaking behavior of this Hebbian neuron. When the threshold # is very large and negative, the entire training ensemble will belong to the set H, and the only stable state that will satisfy equations A.1 and A.3 is

ŵ₁ = ⟨x⟩_{x} / |⟨x⟩_{x}|   for # → −∞.   (A.4)
So when # is very negative, the weight vector inevitably converges to the mean of the ensemble. Because the mean is invariant under the entire group G of Pr(x)'s permutation symmetries, no permutation symmetry will be broken if # is low enough.
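This limiting behavior, and the symmetry breaking that appears as # is raised, can be reproduced with a small numerical sketch. The batch iteration below is an assumed rule whose fixed points satisfy equation A.1 (the actual update, equation 5.3, is not reproduced in this excerpt), applied to a toy two-cluster ensemble that is symmetric under exchange of its two coordinates:

```python
import numpy as np

def converge(patterns, w0, theta, iters=200):
    # Batch iteration consistent with the stationarity condition A.1:
    # w <- sum over above-threshold patterns of (x.w - theta) x, renormalized.
    w = w0 / np.linalg.norm(w0)
    for _ in range(iters):
        proj = patterns @ w
        H = proj > theta                  # above-threshold subset H(w)
        if not H.any():
            break
        w = ((proj[H] - theta)[:, None] * patterns[H]).sum(axis=0)
        w /= np.linalg.norm(w)
    return w

# Ensemble symmetric under exchange of the two coordinates.
X = np.array([[1.0, 0.1], [1.0, -0.1], [0.95, 0.0],
              [0.1, 1.0], [-0.1, 1.0], [0.0, 0.95]])
w0 = np.array([0.9, 0.436])              # asymmetric initial state

w_low = converge(X, w0, theta=-10.0)     # very negative threshold: H = everything
w_high = converge(X, w0, theta=0.6)      # high threshold: symmetry breaks
print(w_low)    # ~ the normalized ensemble mean, (1, 1)/sqrt(2)
print(w_high)   # ~ (1, 0): a detector for one cluster only
```

With # = −10 every pattern is above threshold and the iteration settles on the normalized ensemble mean, as equation A.4 predicts; with # = 0.6 only one cluster remains above threshold and the exchange symmetry is broken.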
As # is raised, the right-hand side of equation A.3 will get gradually smaller,¹² until it drops below the left-hand side, at which point the stationary state ŵ₁ will cease to be stable. It will bifurcate along the direction of the principal eigenvector of N_H(ŵ₁) (or along any direction within its principal subspace if its largest eigenvalue is degenerate). Regardless of Pr(x), the indicatrix ellipsoid of this matrix always has reflection and rotation symmetries about a number of invariant subspaces, each containing a number of its eigenvectors. The symmetries that will be broken by the bifurcation are those whose invariant subspaces do not contain the principal eigenvector(s) of N_H(ŵ). Permutation symmetries correspond to rotations and reflections that map the axes of x-space onto one another and are special cases of proper and improper rotations: the orthogonal transformations. If Pr(x) is symmetric under a group G of orthogonal transformations, every convergent state ŵ₁ will lie in the invariant subspaces of some group G′ of unbroken orthogonal symmetries, where G′ is a subgroup of G. ⟨x⟩_H(ŵ₁), ⟨x xᵀ⟩_H(ŵ₁), and N_H(ŵ₁) will all be symmetric under the unbroken orthogonal symmetries G′ of Pr(x), because ŵ₁ lies in the invariant subspaces of all these unbroken symmetries, and because the above-threshold subset H will therefore be symmetric if the whole of Pr(x) is. Thus, every unbroken orthogonal symmetry of Pr(x) will be a symmetry of the indicatrix of N_H(ŵ₁). Whichever unbroken symmetries have invariant subspaces that do not contain the principal eigenvector(s) of N_H(ŵ₁) will be the ones that will be broken when the next bifurcation point is passed.

Appendix B: Details of Algorithm Implementation for the Natural Image Example
¹² The left-hand and right-hand sides may both change discontinuously if the set H loses members as a result of the raising of the threshold #, although these discontinuous changes turn out not to break the inequality, because they affect both sides equally. This is because any pattern that is just about to pass below the x · w = # plane will make only a negligible contribution u(x, w) to the objective function.

The first network layer, represented in Figure 2, consists of "simple-cell" receptive fields of just two orientations but many different locations in the image plane. All the receptive fields of a single orientation can be mapped onto one another by translations. The correlation function between the image and a receptive field, computable using the convolution theorem, is equivalent to an array of scalar products between the image and an array of similar receptive fields translated to all possible image locations. Consequently, the array of scalar products x · ŵ required to compute the outputs of each grid of "simple cells" can be computed very efficiently by fast Fourier transform, as is done in Webber (1998), for example, which also explains how to compute the weight updates dŵ due to an array of cell
outputs equally efficiently. This trick can be used to train both the simple-cell layer and the complex cell. It is only by exploiting the huge speed enhancement afforded by the convolution theorem that the simulations presented here are feasible for a contemporary serial computer in reasonable time. Computing a pair of correlation functions, between a 160 × 160-pixel input image and a pair of 64 × 64-pixel "simple-cell" receptive-field kernels (one horizontally and the other vertically oriented), yields a pair of square correlation functions of side length 160 − 64 + 1 = 97 pixels, if the artifact of image wraparound is to be avoided. That artifact would introduce undesirable horizontal and vertical correlations between the simple-cell outputs, due to the long discontinuities at the input-image edges. This pair of 97 × 97-pixel correlation functions forms the basis for the training input to the single Hebbian "complex cell" that forms the second layer of the network. The "simple-cell" receptive fields are not located at 1-pixel intervals, but sample only at the grid points illustrated in Figure 2, so the "complex cell" is allowed to receive input from the two correlation functions only through a limited number of synapses, connecting it only to those coordinates of the correlation functions corresponding to these grid points. As illustrated, these grid points extend over a region of the image 64 pixels square, but there are only 512 of them, so the "complex cell" has 512 synaptic inputs, half of them from vertically oriented "simple cells" and half from horizontally oriented "simple cells." The size of the training set is augmented by updating the complex cell's synapses with each training image presented repeatedly, at every possible location translated with respect to the input synapses. This amounts to reinforcing the natural translational invariance of the image statistics. The convolution theorem allows this to be done very efficiently (Webber, 1998).
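The wraparound-free correlation described above can be sketched with a padded FFT; the array sizes here are scaled down from the paper's 160 × 160 images and 64 × 64 kernels, but the arithmetic of the valid output size (side length H − h + 1) is the same:

```python
import numpy as np

def valid_correlation_fft(image, kernel):
    # Cross-correlation of image (H x W) with kernel (h x w) at all offsets
    # that avoid wraparound: output size (H-h+1) x (W-w+1).
    H, W = image.shape
    h, w = kernel.shape
    s = (H + h - 1, W + w - 1)           # pad so circular corr = linear corr
    F = np.fft.rfft2(image, s) * np.conj(np.fft.rfft2(kernel, s))
    full = np.fft.irfft2(F, s)
    return full[:H - h + 1, :W - w + 1]

rng = np.random.default_rng(0)
image = rng.standard_normal((40, 40))    # stands in for a 160 x 160 image
kernel = rng.standard_normal((16, 16))   # stands in for a 64 x 64 field
out = valid_correlation_fft(image, kernel)
print(out.shape)                          # (25, 25), i.e. 40 - 16 + 1

# Spot-check one offset against the direct scalar product:
k1, k2 = 5, 7
direct = np.sum(image[k1:k1+16, k2:k2+16] * kernel)
assert np.isclose(out[k1, k2], direct)
```

Each entry of the output is the scalar product between the image and a copy of the receptive field translated to that offset, which is why one FFT pass evaluates an entire grid of "simple cells" at once.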
In this way, one can augment the training set by a factor of (97 − 64 + 1)² if one wishes to avoid wraparound of the correlation functions. Augmenting the size of the training set in this way is necessary for two closely related reasons. First, one needs a sufficiently large training set to characterize the relevant degrees of freedom of natural images adequately; without such augmentation, too few photographs would have been available to populate the high-dimensional pattern space, so there would have been very strong statistical biases due to gross undersampling of the space. Second, one needs to present at each iteration a sufficiently large random subsample of this training set to ensure that the adaptation converges smoothly and is not disrupted by statistical fluctuations introduced by undersampling bias. Each 200-iteration simulation required about six weeks' CPU time on a 266 MHz Alpha, benchmarked at 90 SPECfp rate95. At each iteration, 16,000 training images were presented to the network, each presented at each of the (97 − 64 + 1)² position-offsets computable simultaneously using the convolution theorem. Each set of 16,000 training images was selected randomly from a total ensemble of 392,400 images, which were
obtained by rotating[13] 109 large landscape photographs by 1800 different angles covering the full 360 degrees, duplicating these by reflections, then applying a spectral whitening filter[14] (the same filter is applied to all photographs), and then excising a randomly chosen 160 × 160-pixel region within the rotated photograph (avoiding the edges of the photograph where the whitening produces artifacts). Together with the translations, the reflections and rotations are necessary to make the training set large enough to eliminate statistical bias due to undersampling of the high-dimensional pattern space. In all simulations depicted in Figures 3 and 4, the complex cell was trained with the very small but nonzero softness parameter s = 5 × 10⁻⁴. The fractional rate constant λ′, which is related to the time-varying Webber (1994) rate-parameter λ (the constant of proportionality in equation 5.3) as defined in Webber (1997), was set at 0.3 for all cells in all these simulations.

Acknowledgments
I am grateful to Steve Luttrell for discussions.

References

Barlow, H. B. (1989). Unsupervised learning. Neural Computation, 1, 295–311.
Bartlett, M. S., & Sejnowski, T. (1995). Unsupervised learning of invariant representations of faces through temporal association. In J. M. Bower (Ed.), The neurobiology of computation. Norwell, MA: Kluwer.
Bell, A. J., & Sejnowski, T. J. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Blais, B. S., Intrator, N., Shouval, H. V., & Cooper, L. N. (1998). Receptive field formation in natural scene environments: Comparison of single-cell learning rules. Neural Computation, 10, 1797–1813.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. Journal of the Optical Society of America, 4, 2379–2394.
Foldiak, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82, 249–266.
Fukumi, M., Omatu, S., Takeda, F., & Kosaka, T. (1992). Rotation-invariant neural pattern recognition system with application to coin recognition. IEEE Transactions on Neural Networks, 3, 272–279.
[13] The rotation uses a conventional bilinear interpolation algorithm.
[14] The whitening filter has been optimized so as to make the average power spectrum over the entire training ensemble flat.
Fukushima, K., & Miyake, S. (1982). Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15, 455–469.
Hebb, D. O. (1949). The organisation of behaviour. New York: Wiley.
Hinton, G. E. (1987). Learning translation invariant recognition in a massively parallel network. In G. Goos & J. Hartmanis (Eds.), PARLE: Parallel architectures and languages Europe (pp. 1–13). New York: Springer-Verlag.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Le Cun, Y., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., & Jackel, L. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Oja, E. (1982). A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15, 267–273.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.
Rao, R. P. N., & Ballard, D. H. (1998). Development of localized oriented receptive fields by learning a translation-invariant code for natural images. Network: Computation in Neural Systems, 9, 219–234.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructures of cognition (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Shawe-Taylor, J. (1989). Building symmetries into feedforward networks. In First IEEE Conference on Artificial Neural Networks (pp. 158–162).
Shawe-Taylor, J. (1993). Symmetries and discriminability in feedforward network architectures. IEEE Transactions on Neural Networks, 4, 816–826.
Simard, P., Victorri, B., Le Cun, Y., & Denker, J. (1991). Tangent prop—a formalism for specifying selected invariances in an adaptive network. Neural Information Processing Systems, 4, 895–903.
Wallis, G., Rolls, E., & Foldiak, P. (1993). Learning invariant responses to the natural transformations of objects. Proceedings of the 1993 International Joint Conference on Neural Networks, 2, 1087–1090.
Webber, C. J. S. (1991). Self-organisation of position- and deformation-tolerant neural representations. Network: Computation in Neural Systems, 2, 43–61.
Webber, C. J. S. (1994). Self-organisation of transformation-invariant detectors for constituents of perceptual patterns. Network: Computation in Neural Systems, 5, 471–496.
Webber, C. J. S. (1997). Generalisation and discrimination emerge from a self-organising componential network: A speech example. Network: Computation in Neural Systems, 8, 425–440.
Webber, C. J. S. (1998). Emergent componential coding of a handwriting-image database by neural self-organisation. Network: Computation in Neural Systems, 9, 433–447.
Wood, J. J. (1995). A model of adaptive invariance. Unpublished doctoral dissertation, University of London.
Yamagishi, K. (1994). Spontaneous symmetry breaking and the formation of columnar structures in the primary visual cortex. Network: Computation in Neural Systems, 5, 61–73.

Received March 23, 1998; accepted April 26, 1999.
LETTER
Communicated by Bard Ermentrout
Geometric Analysis of Population Rhythms in Synaptically Coupled Neuronal Networks

J. Rubin
D. Terman
Department of Mathematics, Ohio State University, Columbus, Ohio 43210, U.S.A.

Neural Computation 12, 597–645 (2000)
© 2000 Massachusetts Institute of Technology

We develop geometric dynamical systems methods to determine how various components contribute to a neuronal network's emergent population behaviors. The results clarify the multiple roles inhibition can play in producing different rhythms. Which rhythms arise depends on how inhibition interacts with intrinsic properties of the neurons; the nature of these interactions depends on the underlying architecture of the network. Our analysis demonstrates that fast inhibitory coupling may lead to synchronized rhythms if either the cells within the network or the architecture of the network is sufficiently complicated. This cannot occur in mutually coupled networks with basic cells; the geometric approach helps explain how additional network complexity allows for synchronized rhythms in the presence of fast inhibitory coupling. The networks and issues considered are motivated by recent models for thalamic oscillations. The analysis helps clarify the roles of various biophysical features, such as fast and slow inhibition, cortical inputs, and ionic conductances, in producing network behavior associated with the spindle sleep rhythm and with paroxysmal discharge rhythms. Transitions between these rhythms are also discussed.

1 Introduction

Neuronal networks often exhibit a rich variety of oscillatory behavior. The dynamics of even a single cell may be quite complicated; it may, for example, fire repetitive spikes or bursts of action potentials, each followed by a silent phase of near-quiescent behavior (Rinzel, 1987; Wang & Rinzel, 1995). The bursting behavior may wax and wane on a slower timescale (Destexhe, Babloyantz, & Sejnowski, 1993; Bal & McCormick, 1996). Examples of population rhythms include synchronous behavior, in which every cell in the network fires at the same time, and clustering, in which the entire population of cells breaks up into subpopulations or blocks; the cells within a single block fire synchronously, while different blocks are desynchronized from each other (Golomb & Rinzel, 1994; Kopell & LeMasson, 1994). Of course, much more complicated population rhythms are also possible (Traub & Miles, 1991; Terman & Lee, 1997). Activity may also propagate through the
network in a wavelike manner (Kim, Bal, & McCormick, 1995; Destexhe, Bal, McCormick, & Sejnowski, 1996; Golomb, Wang, & Rinzel, 1996; Rinzel, Terman, Wang, & Ermentrout, 1998). A network’s population rhythm results from interactions among three separate components: the intrinsic properties of individual neurons, the synaptic properties of coupling between neurons, and the architecture of coupling (i.e., which neurons communicate with each other). These components typically involve numerous parameters and multiple timescales. The synaptic coupling, for example, can be excitatory or inhibitory, and its possible turn-on and turn-off rates can vary widely. Neuronal systems may include several different types of cells, as well as different types of coupling. An important and typically challenging problem is to determine the role each component plays in shaping the emergent network behavior. In this article we consider recent models for thalamic oscillations (see, for example, Destexhe, McCormick, & Sejnowski, 1993; Steriade, McCormick, & Sejnowski, 1993; Golomb, Wang, & Rinzel, 1994; Terman, Bose, & Kopell, 1996; Destexhe & Sejnowski, 1997). The networks consist of several types of cells and include excitatory as well as both fast and slow inhibitory coupling. One interesting property of these networks is that they exhibit very different rhythms for different parameter ranges. For some parameter values, the network behavior resembles that of the spindle sleep rhythm: one population of cells is synchronized at the spindle frequency, while another population of cells exhibits clustering. If a certain parameter, corresponding to the strength of fast inhibition, is varied, then the entire network becomes synchronized. This resembles paroxysmal discharge rhythms associated with spike-and-wave epilepsy. 
In other parameter ranges, the network behavior is similar to that associated with the delta sleep rhythm; in this case, each cell exhibits an entirely different behavior from before. We develop geometric dynamical systems methods to analyze the mechanisms responsible for each of these rhythms and the transitions between them. This approach helps to determine each component's contribution to the network behavior and to clarify how the behavior changes with respect to parameters. We are particularly interested in analyzing the role of inhibitory coupling in generating different oscillatory behaviors. This is done by considering a series of networks with increasing levels of complexity. Our analysis demonstrates, for example, how networks with distinct architectures can make different uses of inhibition to produce different rhythms. The techniques we develop are quite general and do not depend on the details of the specific systems. For a given network, however, these techniques lead to rather precise conditions for when a particular rhythm is possible. Numerous works have considered the role of inhibition in synchronizing oscillations (for example, Wang & Rinzel, 1992, 1993; Golomb & Rinzel, 1993; van Vreeswijk, Abbott, & Ermentrout, 1994; Whittington, Traub, & Jefferys, 1995; Bush & Sejnowski, 1996; Gerstner, van Hemmen, & Cowan, 1996; Rowat & Selverston, 1997; Terman, Kopell, & Bose, 1998). Many of
these articles used model neurons with very short spiking times. These include integrate-and-fire models along with alpha function–type dynamics for the synapses. More biophysically based models for bursting neurons were considered by, among others, Wang and Rinzel (1993) and Terman et al. (1998). One conclusion of those works is that inhibition can lead to synchrony only if the inhibition decays at a sufficiently slow rate; in particular, the rate of decay of the synapses must be slower than the rate at which the neurons recover in their refractory period. These theoretical and numerical studies, however, considered rather idealized networks: each cell contained only one channel state variable, and the network architecture was simply two mutually coupled cells. By considering more realistic biophysical models, we demonstrate that fast inhibitory coupling can indeed lead to synchronous rhythms. We show that this is possible in networks with complicated cells but simple architectures and in networks with more complicated architectures but simple cells. Geometric singular perturbation methods have been used previously to study the population rhythms of neuronal networks (for example, Somers & Kopell, 1993; Skinner, Kopell, & Marder, 1994; Terman & Wang, 1995; Terman & Lee, 1997; Terman et al., 1998; LoFaro & Kopell, in press). Each of the relevant networks possesses multiple timescales; this allows one to dissect the full system into fast and slow subsystems. Often, however, there is no clear-cut separation of timescales, so it is not obvious how one decides which variables should be considered fast or slow. This is particularly the case when there are multiple intrinsic channel state variables and the synaptic variables turn on and turn off at different rates. A primary goal of this article is to demonstrate how one can make the singular reduction in order to understand the mechanisms responsible for the thalamic rhythms.
Two crucial issues are related to the geometric analysis. The first is concerned with the existence of a singular solution corresponding to a particular pattern. We assume that individual cells, without any coupling, are unable to oscillate. The existence of network oscillatory behavior then depends on whether the singular trajectory is able to "escape" from the silent phase. An important point will be that greater cellular or network complexity enhances each cell's opportunity to escape the silent phase when coupled. The second issue is concerned with the stability of the solution. To demonstrate stability of a perfectly synchronous state, for example, we must show that the trajectories corresponding to different cells are brought closer together as they evolve in phase space. As we shall see, this compression is usually not controlled by a single factor; it depends on the underlying architecture as well as nontrivial interactions between the intrinsic and synaptic properties of the cells (see also Terman et al., 1998). Our analysis demonstrates, for example, why thalamic networks are well suited to use inhibitory coupling to help synchronize oscillations and produce other, clustered, rhythms. In the next section we describe in detail the types of models to be considered for individual neurons. We distinguish between basic and compound
cells. As a concrete example, we consider recent conductance-based models for thalamocortical relay (TC) cells (Golomb et al., 1994; Destexhe & Sejnowski, 1997, and references therein). Compound cells can be realized as a model for a TC cell that includes three ionic currents: a low-threshold calcium current (I_T), the nonselective sag current (I_sag), and a leak. A basic cell does not include I_sag. In this context, our results help to explain the role of I_sag in generating network activity; we shall see that this role depends on the architecture of the network. We also discuss different forms of synaptic coupling and different architectures. In section 3, we show that synchronization is possible in mutually coupled networks that include compound cells and fast synapses. This analysis is similar to that done by Terman et al. (1998), which showed that synchronization is possible in networks with basic cells and slow inhibitory coupling. Those networks contain two types of slow processes: one corresponds to an intrinsic ionic current and the other to a synaptic slow variable. The main conclusion of the analysis here is that what is crucial for synchronization is that the network possess at least two slow processes; one may be intrinsic and the other synaptic, or both may be intrinsic. In section 4, we consider networks with architectures motivated by recent models for the thalamic spindle sleep rhythm. The more complex architectures allow the network to use inhibition in different ways to produce different population rhythms. In particular, inhibition can play an important role in synchronizing the cells in a much more robust way than in the mutually coupled networks. We demonstrate how tuning various parameters allows the network to control the effect of inhibition and thereby control the emergent behavior. Consequences of these results for the full thalamic networks are presented in section 5.
We consider the roles of various biophysical parameters associated with fast and slow inhibition, the sag current, cortical inputs, and other currents. These results help clarify the mechanisms responsible for spindle and paroxysmal discharge rhythms and the ways that changes in biophysical parameters lead to transitions between different rhythms. We conclude with a discussion in section 6.
2 The Models
We begin by describing the equations corresponding to individual cells and then describe the synaptic coupling between two cells. It is necessary to explain which parameters determine whether the synapse is excitatory or inhibitory and which other parameters determine whether the synapse is fast or slow. It will also be necessary to distinguish between direct synapses and indirect synapses. Finally, we describe the types of architectures to be considered.
Figure 1: Nullclines of basic and compound cells. (A) The v- and w-nullclines of a basic cell intersect at p0, on the middle branch of the v-nullcline, in the oscillatory case. The closed curve indicates a singular periodic orbit, with double arrows denoting fast pieces and single arrows denoting slow pieces. (B) The v-nullcline and a singular solution of an excitable compound cell with a stable critical point p0.

2.1 Single Cells. We model a basic cell as the relaxation oscillator
    v′ = f(v, w)
    w′ = ε g(v, w),                                      (2.1)

where ′ = d/dt. Here ε is assumed to be small; hence, w represents a slowly evolving quantity. The active phase of the oscillation can be viewed as the envelope of a burst of spikes. We assume that the v-nullcline, f(v, w) = 0, defines a cubic-shaped curve, as shown in Figure 1A, and the w-nullcline, g = 0, is a monotone decreasing curve that intersects f = 0 at a unique point p0. We also assume that f > 0 (f < 0) above (below) the v-nullcline and g > 0 (< 0) below (above) the w-nullcline. If p0 lies on the middle branch of f = 0, then equation 2.1 gives rise to a periodic solution for all ε sufficiently small, and we say that the system is oscillatory. In the limit ε → 0, one can construct a singular solution as shown in Figure 1A. If p0 lies on the left branch of f = 0, then the system is said to be excitable; p0 is a stable fixed point, and there are no periodic solutions for all ε small. Note that the thalamic cells we seek to model are typically excitable during the sleep state. For some of our results, it will be necessary to make some more technical assumptions on the nonlinearities f and g. We will sometimes assume that

    f_w > 0,    g_v < 0,    and    g_w < 0               (2.2)
near the singular solutions.

By a compound cell we mean one that contains at least two slow processes. We consider compound cells that satisfy equations of the form

    v′ = f(v, w, y)
    w′ = ε g(v, w)
    y′ = ε h(v, y).                                      (2.3)
Precise assumptions required on the nonlinear functions in equation 2.3 are given later. For now, we assume that for each fixed value of y, the functions f(v, w, y) and g(v, w) satisfy the conditions described for a basic cell. Then {f(v, w, y) = 0} defines a cubic-shaped surface. The system (see equation 2.3) is said to be excitable if there exists a unique fixed point, which we denote by p0, and this lies on the left branch of the cubic-shaped surface. One can construct singular solutions of this equation, and one of these is shown in Figure 1B. The singular solution shown begins in the silent phase, or left branch, of the surface. It evolves there until it reaches the curve of jump-up points that correspond to the left knees of the cubic surface. The singular solution then jumps up to the active phase, or right branch, of the surface. It evolves in the active phase until it reaches the jump-down points, or right knees, of the surface. It then evolves in the silent phase, approaching the stable fixed point at p0. A more formal description of certain singular solutions is given in section 3.2.

2.2 Synaptic Coupling. Consider the network of two mutually coupled cells: E1 ↔ E2. The equations corresponding to this network are
    v1′ = f(v1, q1) − g_syn s2 (v1 − v_syn)
    q1′ = ε L(v1, q1)
    v2′ = f(v2, q2) − g_syn s1 (v2 − v_syn)
    q2′ = ε L(v2, q2).                                   (2.4)
Here, qi = wi and L = g if the cells are basic, while qi = (wi, yi) and L = (g, h) if the cells are compound. In equation 2.4, g_syn > 0. It is the parameter v_syn that determines whether the synapse is excitatory or inhibitory. If v_syn < v along each bounded singular solution, then the synapse is inhibitory. The coupling depends on the synaptic variables si, i = 1, 2. We consider two choices for the si. Each si may satisfy a first-order equation of the form

    si′ = α(1 − si) H(vi − θ_syn) − β si.                (2.5)
Here, α and β are positive constants, H is the Heaviside step function, and θ_syn is a threshold above which one cell can influence the other. Note that α and β are related to the rates at which the synapses turn on or turn off. For fast synapses, we assume that α is O(1) with respect to ε. It may seem natural also to assume that β is O(1), and this is, in fact, what we will do in the next section. However, when we consider the thalamic networks in sections 4 and 5, it will be necessary to assume that β = Kε, where K is a large constant. The reason for choosing the fast synaptic variables in this way is discussed in more detail later. (The model for slow synapses is discussed in section 5.2.) If the synaptic variables satisfy equation 2.5, then we say that the synapse is direct. We will also consider indirect synapses. These are modeled by introducing a second synaptic variable xi (see Golomb et al., 1994; Terman et al., 1998). The equations for (xi, si) are:

    xi′ = ε α_x (1 − xi) H(vi − θ_syn) − ε β_x xi
    si′ = α(1 − si) H(xi − θ_x) − β si.                  (2.6)
Here, α_x and β_x are positive constants. Note that indirect synapses have the effect of introducing a delay in the synaptic action, and this delay takes place on the slow timescale. If, say, the cell E1 fires, then x1 will activate once v1 crosses the threshold θ_syn. The activation of s1 must wait until x1 crosses the second threshold θ_x. Note also that an indirect synapse can be fast if α and β are O(1), as discussed above.

2.3 Globally Inhibitory Networks. Besides mutually coupled networks, we also consider networks with the architecture shown in Figure 2. This network contains two different types of cells, labeled E-cells and J-cells. Each E-cell sends fast excitation to some of the J-cells, and each J-cell sends inhibition to some of the E-cells. The inhibition may be fast or slow (or both). There is no communication among different E-cells; however, the J-cells communicate with each other via fast inhibitory coupling. This network is motivated by recent models for the thalamic sleep rhythms discussed in section 1. The E- and J-cells correspond to thalamocortical relay (TC) and thalamic reticularis (RE) cells, respectively.
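The slow-timescale delay of an indirect synapse, relative to a direct one, can be seen in a small forward-Euler sketch of the synaptic equations 2.5 and 2.6 with the presynaptic voltage held above threshold. The parameter values below are hypothetical round numbers, not the ones used in this article:

```python
# Sketch: direct synapse (eq. 2.5) versus indirect synapse (eq. 2.6),
# Euler-integrated while the presynaptic cell fires continuously.
# All parameter values are illustrative assumptions.
eps = 0.05                     # small singular-perturbation parameter
alpha, beta = 2.0, 1.0         # fast synaptic rates, O(1) in eps
alpha_x, beta_x = 2.0, 1.0     # rates of the slow intermediary x
theta_x = 0.5                  # second threshold for indirect coupling
dt, T = 0.01, 40.0

def heav(u):                   # Heaviside step function H
    return 1.0 if u > 0 else 0.0

v_above = 1.0                  # v_i - theta_syn > 0 throughout
s_dir = s_ind = x = 0.0
t_dir = t_ind = None           # first times each s exceeds 0.5
for k in range(int(T / dt)):
    t = k * dt
    s_dir += dt * (alpha * (1 - s_dir) * heav(v_above) - beta * s_dir)
    x += dt * eps * (alpha_x * (1 - x) * heav(v_above) - beta_x * x)
    s_ind += dt * (alpha * (1 - s_ind) * heav(x - theta_x) - beta * s_ind)
    if t_dir is None and s_dir > 0.5:
        t_dir = t
    if t_ind is None and s_ind > 0.5:
        t_ind = t

print(t_dir, t_ind)  # the indirect synapse turns on much later
```

The direct variable saturates on the O(1) timescale, while s_ind must wait an O(1/ε) time for x to cross θ_x, which is exactly the delay on the slow timescale noted above.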
Figure 2: Network of inhibitory J-cells and excitatory E-cells.
3 Mutually Coupled Compound Cells with Fast Inhibitory Synapses
Rubin and Terman (1998) proved that stable synchronous oscillations are not possible in mutually coupled networks with fast inhibitory coupling and only one slow variable corresponding to each cell. This holds regardless of whether the cells are excitable or oscillatory and whether the synapses are direct or indirect. These results are for networks with relaxation-type neurons, as discussed in the previous section. Here we demonstrate that when the mutually coupled cells are compound, they can exhibit synchronized oscillations when connected with fast inhibitory coupling. The synchronous solution may exist if the synapses are direct, but as in Terman et al. (1998), it can be stable only if the synapses are indirect (see section 3.3). We assume that each cell is excitable for fixed levels of input; however, there is no problem in extending the analysis if this does not hold. We analyze the network by constructing singular solutions, which are obtained by piecing together solutions of reduced fast and slow subsystems. Since the cells are compound, there will be at least two slow variables corresponding to each cell. The multiple slow variables are needed for both the existence and the stability of the synchronous solution. For existence, the multiple slow variables allow the singular trajectory to escape from the silent phase, despite the fact that each cell is excitable. The multiple slow variables also allow for several mechanisms that lead to compression of cells as they evolve in phase-space. We do not give precise conditions on the parameters and nonlinearities in the equations to specify when the synchronous solution is stable, as was done in Terman et al. (1998); this will be done elsewhere. Here we describe the geometric mechanisms that allow the cells to escape the silent phase so that a synchronous solution is possible and then characterize the compression mechanisms that act to stabilize the synchronous solution.
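As a minimal numerical sanity check on the setup of equations 2.3 through 2.5: two identical, symmetrically coupled compound cells started from identical initial data remain exactly synchronized, since the synchronous subspace is invariant under the coupled flow. The sketch below uses hypothetical FitzHugh–Nagumo-style nonlinearities and parameters, not the TC-cell model of this article, and it checks invariance only, not the stability question analyzed in the text:

```python
# Sketch: two mutually coupled compound cells (eqs. 2.3-2.5) with fast
# inhibitory direct synapses.  Nonlinearities and parameters are
# hypothetical stand-ins; the check verifies only that the synchronous
# subspace v1 = v2, w1 = w2, y1 = y2, s1 = s2 is invariant.
import numpy as np

eps, gsyn, vsyn, theta = 0.05, 0.5, -2.0, 0.0
alpha, beta = 2.0, 1.0

def f(v, w, y):      # fast voltage equation, cubic in v
    return 3.0 * v - v**3 - w + y

def g(v, w):         # first slow process (recovery)
    return v + 1.2 - 0.8 * w

def h(v, y):         # second slow process (sag-like)
    return -y - 0.5 * (v + 1.0)

def rhs(u):
    v1, w1, y1, s1, v2, w2, y2, s2 = u
    H1, H2 = float(v1 > theta), float(v2 > theta)
    return np.array([
        f(v1, w1, y1) - gsyn * s2 * (v1 - vsyn), eps * g(v1, w1),
        eps * h(v1, y1), alpha * (1 - s1) * H1 - beta * s1,
        f(v2, w2, y2) - gsyn * s1 * (v2 - vsyn), eps * g(v2, w2),
        eps * h(v2, y2), alpha * (1 - s2) * H2 - beta * s2,
    ])

u = np.array([-1.0, 0.0, 0.0, 0.0, -1.0, 0.0, 0.0, 0.0])  # identical cells
for _ in range(20000):
    u = u + 0.005 * rhs(u)   # forward Euler on the eight-dimensional system

print(abs(u[0] - u[4]))  # 0.0: the cells remain exactly synchronized
```

Whether nearby, nonidentical initial data are attracted to this synchronous solution is the delicate question; that is what the compression mechanisms discussed in this section address.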
A primary aim of this article is to compare how inhibition is used in different networks with different architectures to produce stable synchrony. By identifying the compression mechanisms here, we are able to evaluate the robustness of the
synchronous solution. As we shall see, the compression mechanisms for the mutually coupled architecture are considerably less robust than those that arise in the globally inhibitory thalamic networks. The analysis here is similar to but more general than that in Terman et al. (1998), where it is demonstrated that stable synchrony can arise in networks with basic cells but slow synapses. It is assumed in Terman et al. (1998) that the synapses activate on the fast timescale and that the evolution of the cells in the active phase does not depend on the level of synaptic input. These assumptions imply that the slow system corresponding to the active phase is one-dimensional. We do not make these assumptions here and demonstrate that the resulting richer dynamics may lead to additional compression mechanisms for stabilizing the synchronous solution.

3.1 Singular Solutions. We analyze the network by treating ε as a small, singular perturbation parameter and constructing singular solutions. These consist of various pieces, each piece corresponding to a solution of either fast or slow equations. The fast equations are obtained by simply letting ε = 0 in equation 2.4 and in either equation 2.5 or 2.6. The slow equations are obtained by replacing t with τ = εt as the independent variable and then letting ε = 0. We assume here that both α and β are independent of ε. There is no problem in extending this analysis if β = O(ε); this is actually an easier case, since the additional slow variable then provides additional opportunities for compression. Here we derive the reduced slow equations valid when the synapses are direct. The equations for indirect synapses are similar, but there are more cases to consider. The slow equations are
    0 = f(vi, wi, yi) − g_syn sj (vi − v_syn)
    ẇi = g(vi, wi)
    ẏi = h(vi, yi)
    0 = α(1 − si) H(vi − θ_syn) − β si,                  (3.1)
where ˙ = d/dτ. One can reduce this system to equations for just the slow variables (yi, wi). There are several cases to consider, depending on whether both cells are silent, both are active, or one is silent and the other is active. We assume that the solution of the first equation in 3.1 defines a cubic-shaped surface C_s, and the left and right branches of this surface can be expressed as vi = W_L(wi, yi, si) and vi = W_R(wi, yi, si), respectively. If both cells are silent, then each vi < θ_syn and si = 0. Let G_L(w, y, s) ≡ g(W_L(w, y, s), w) and H_L(w, y, s) ≡ h(W_L(w, y, s), y). Then each (wi, yi) satisfies the equations

    ẇ = G_L(w, y, 0)
    ẏ = H_L(w, y, 0).                                    (3.2)
If both cells are active, then each vi > θ_syn and the last equation in 3.1 implies that si = s_A ≡ α/(α + β). Let G_R(w, y, s) ≡ g(W_R(w, y, s), w) and H_R(w, y, s) ≡ h(W_R(w, y, s), y). Then each (wi, yi) satisfies the equations

    ẇ = G_R(w, y, s_A)
    ẏ = H_R(w, y, s_A).                                  (3.3)
Finally, suppose that one cell, say cell 1, is silent and cell 2 is active. Then the slow variables satisfy the reduced equations

    ẇ1 = G_L(w1, y1, s_A)        ẇ2 = G_R(w2, y2, 0)
    ẏ1 = H_L(w1, y1, s_A)        ẏ2 = H_R(w2, y2, 0).    (3.4)
We may view the singular solution as two points moving around in the (y, w) slow phase-space. Each point corresponds to the projection of one of the cells onto the slow phase plane. The points evolve according to one of the reduced slow systems until one of the points reaches a jump-up or jump-down curve. The cells then jump in the full phase-space; however, the slow variables remain constant during the fast transitions. The points then "change directions" in slow phase-space and evolve according to some other reduced slow equations.

Since the cells are excitable, the reduced system, equation 3.2, with s = 0 has a stable fixed point, which we denote by P0. The slow phase-space corresponding to equation 3.2 is illustrated in Figure 3A. Note that while some of the trajectories are attracted toward P0, others are able to reach the jump-up curve. That is, although the uncoupled cells are excitable, it is possible for a cell to begin in the silent phase and still fire. This will be important in the next section, when we discuss the existence of the synchronous solution.

The following lemma characterizes the left and right folds (or jump-up and jump-down curves) of C_s. We assume here that f_y > 0 on the left branch of C_s, while f_y < 0 on the right. This assumption is justified, based on biophysical considerations, in remark 4 (in appendix A).

Lemma. The left and right folds of C_s can be expressed as J_L = {(v_L(y, s), w_L(y, s), y)} and J_R = {(v_R(y, s), w_R(y, s), y)}, where ∂w_L/∂y < 0, ∂w_L/∂s > 0, ∂w_R/∂y > 0, and ∂w_R/∂s > 0.
Proof. Since $v_L(y, s) = \Phi_L(w_L(y, s), y, s)$, it follows from equation 3.1 and the definition of folds that
$$0 = f(\Phi_L(w_L(y, s), y, s), w_L(y, s), y) - g_{\rm syn}\, s\, (\Phi_L(w_L(y, s), y, s) - v_{\rm syn}),$$
$$0 = f_v(\Phi_L(w_L(y, s), y, s), w_L(y, s), y) - g_{\rm syn}\, s. \tag{3.5}$$
Population Rhythms
607
Figure 3: Singular solutions for compound cells. (A) The slow phase-space of an uncoupled compound cell, bounded by the jump-up curve $J_L$ and the jump-down curve $J_R$ of the slow manifold $\mathcal{C}_s$. The solid (dashed) curves represent evolution in the silent (active) phase. $P_0$ is a stable fixed point. (B) Numerically generated synchronous solution for mutually coupled compound cells that are separately excitable, in $(y, w)$-space. The inset ($v$ versus $t$) shows the voltage traces as two cells approach synchrony. These curves, as well as those in other figures, were generated using the program XPPAUT, developed by G. B. Ermentrout, with parameter values given in appendix A. The solid curve is the synchronous solution, the dashed curves are the jump-up (labeled) and jump-down (approximately horizontal, unlabeled) curves for $s = 0.2$, and the dash-dotted curves (partially obscured) are those for $s = s_A = 0.8$. Note that the curves for $s = 0.8$ lie at larger $w$-values than those for $s = 0.2$. Since $\varepsilon \neq 0$, the synchronous solution does not jump up [$y' = 0$, near $(y, w) = (0.08, 0.07)$] immediately on reaching the jump-up curve.
608
J. Rubin and D. Terman
Differentiating the first equation in 3.5 with respect to $s$ and using the second equation, we obtain
$$0 = f_w \frac{\partial w_L}{\partial s} - g_{\rm syn}(\Phi_L(w_L(y, s), y, s) - v_{\rm syn}).$$
Hence,
$$\frac{\partial w_L}{\partial s} = \frac{g_{\rm syn}(\Phi_L(w_L(y, s), y, s) - v_{\rm syn})}{f_w}. \tag{3.6}$$
The right-hand side of this expression has a positive numerator, because the coupling is inhibitory, and a positive denominator, from equation 2.2, so $\partial w_L/\partial s > 0$. Analogously, $\partial w_R/\partial s > 0$. Similarly, differentiating equation 3.5 with respect to $y$ yields $0 = f_w \frac{\partial w_L}{\partial y} + f_y$, or $\frac{\partial w_L}{\partial y} = -\frac{f_y}{f_w}$, with $f_y$ and $f_w$ evaluated on $J_L$. Analogously, $\frac{\partial w_R}{\partial y} = -\frac{f_y}{f_w}$, with $f_y$, $f_w$ evaluated on $J_R$. The above assumptions on $f_y$, together with equation 2.2, yield the desired result.
Remark 1. In the thalamic networks of interest, the jump-down curve $J_R$ is nearly horizontal (see Figure 3B). This holds because the $y$ current has a much smaller reversal potential and maximal conductance than the $w$ current; when a cell is in the active phase, this implies that $|f_y| \ll |f_w|$, so $|\partial w_R/\partial y|$ is quite small. (See remark 5 in appendix A.)
3.2 Existence of the Synchronous Solution. Here we illustrate why it is possible for a synchronous solution to exist in a network of mutually coupled compound cells even when the individual cells are excitable. A numerically generated picture of such a solution, projected onto the slow variables $(y, w)$, is shown together with certain jump-up and jump-down curves in Figure 3B. We also show a solution with initial conditions slightly perturbed from the synchronous solution. The precise equations that this solution satisfies and the parameter values used numerically are given in appendix A. One constructs the singular synchronous solution as follows. We begin when the cells are in the silent phase, just after they have jumped down. The slow variables then evolve according to equation 3.2. If they are able to reach the jump-up curve, then they jump up according to the fast equations. While in the active phase, the slow variables satisfy equation 3.3 until they reach the jump-down curve. They then jump down according to the fast equations, and this completes one cycle of the synchronous solution. Note that $w$ and $y$ are decreasing in the active phase and then increase just after jump-down; we use this in the next subsection.
It is not clear how to choose the starting point $(y_i(0), w_i(0))$ so that the singular orbit returns precisely to this point after one cycle. Note, however, that the variables $y_i$ relax very close to $y_i = 0$ during the active phase. If we suppose that $y_i \approx 0$ at the jump-down point, then the value of $w_i$ is determined; that is, for the coupled cells, $w_i \approx w_R(0, s_A)$. A straightforward fixed-point argument shows that the synchronous solution will therefore exist if the solutions of equation 3.2, which begin near $(y_i, w_i) = (0, w_R(0, s_A))$, are able to reach the curve of jump-up points. The reason that a synchronous solution can exist even when the uncoupled cells cannot oscillate is that the synchronous solution lies on a different cubic during the active phase than the uncoupled cells. For this reason, the synchronous solution jumps down along a different curve than the uncoupled cells do. From the lemma, the jump-down curve $J_R(s_A)$ has larger values of $w$ than the jump-down curve $J_R(0)$ (see Figure 3B). It is therefore possible for the coupled cells to jump down to a point from which they are able eventually to escape the silent phase, although the uncoupled cells jump down to a point from which they cannot escape. There is a nice biophysical interpretation for why coupled excitable cells may be able to oscillate. Recall that a thalamocortical relay cell is an example of a compound cell. (See section 5 for a more detailed discussion.) Then $w$ corresponds to the inactivation variable of the $I_T$ current. A larger value of $w$ means that this current is more deinactivated. This implies that if the cells jump down at a larger value of $w$, then it is easier for the cells to become sufficiently depolarized so they can reach threshold and fire. The construction of the synchronous solution for indirect synapses is very similar. The only difference is that after the cells jump up or jump down, there is a delay until the inhibition either turns on or turns off.
The cells switch their cubic surface while in the silent and active phases, assuming that the delay is shorter than the time that the cells spend in each of their silent and active phases. We will assume that this is the case throughout the remainder of this article.
3.3 Stability of the Synchronous Solutions. We assume throughout this section that the synapses are indirect. The synchronous solution cannot be stable if the synapses are direct, for the following reason. Suppose we start with both oscillators in the silent phase and assume that cell 1 jumps up. If the synapse is direct, then $s_2$ jumps instantly (with respect to the slow timescale) to $s_2 = 1$. The effect of this is to move cell 2 instantly away from its firing threshold, thus destabilizing the synchronous solution. Indirect synapses are needed for stability, since they provide a window of opportunity for both cells to jump up during the same cycle. However, this does not guarantee that the synchronous solution is stable. One must still show that cells that are initially close together are brought closer together, or compressed, as they evolve in phase-space.
It is not at all obvious how to define compression. We need to demonstrate that the cells are brought closer together; however, this requires that we have a notion of distance between the cells. There are several possible metrics, each with certain advantages on different pieces of the solution. One obvious metric is the Euclidean distance between the points in phase-space corresponding to the cells. It is sometimes convenient to work with a time metric; the "distance" between the cells is then the time it takes for one cell to reach the initial position of the other cell. This was used previously in Somers & Kopell (1993), Terman & Wang (1995), Terman et al. (1998), and LoFaro & Kopell (in press). Here, we describe several mechanisms for compression, each corresponding to a different piece of the singular solution. These mechanisms illustrate how the geometry of the two-dimensional slow subsystem, notably its curves of knees, allows for compression in the presence of inhibitory coupling. By identifying the compression mechanisms, we can then understand how changing parameters in the equations influences the stability of solutions. We can also compare the robustness of the compression mechanisms for this network with that for other networks with other architectures.
3.3.1 The Jump Up. Suppose that cell 1 lies on the jump-up curve when $t = 0$. After cell 1 fires, there is a delay in the onset of inhibition. We assume that cell 2 begins in the silent phase so close to the jump-up curve that it fires before it feels this inhibition. Suppose that cell 2 fires when $t = T_0$. We now need to make an assumption on the nonlinearities. Let $(y^*, w^*)$ be the point where the synchronous solution jumps up. We assume that
$$({\rm A}) \qquad |G_L(y^*, w^*, 0)| < |G_R(y^*, w^*, 0)| \quad\text{and}\quad |H_L(y^*, w^*, 0)| < |H_R(y^*, w^*, 0)|.$$
Note that this assumption implies that the $w$ and $y$ coordinates of both cells change at a faster rate in the active phase after the jump up than in the silent phase before the jump up. This is certainly satisfied for the example described in appendix A and is similar to assumptions in previous work (see, for example, Somers & Kopell, 1993, where the notion of fast threshold modulation is introduced). There are now several cases to consider, depending on the orientation of the cells both before and after they jump up. We work out two of these in detail. These are the cases that arise most often for the system described in appendix A. Similar analysis applies to other cases. Assume that $w_1(0) < w_2(T_0) < w_2(0)$ and $y_2(0) < y_1(T_0) < y_1(0)$ (see Figure 4A). These assumptions, together with the lemma, imply that $|y_1(T_0) - y_2(T_0)| < |y_1(0) - y_2(0)|$, so there is compression in the $y$-coordinates after the jumps. From assumption A, there is also compression in a time metric corresponding to the $y$-coordinate. For each $t_0$, let $\rho_y(t_0)$ be the time it takes for cell 2 to evolve from its position at $t = t_0$ until its $y$-coordinate is that of cell 1 when $t = t_0$. It then follows that $\rho_y(T_0) < \rho_y(0)$.
Figure 4: Compression mechanisms for mutually coupled compound cells. (A) Compression in the time metric $\rho_w$ can occur between mutually coupled compound cells in the jump up. Note that $w_1(0) < w_2(T_0) < w_2(0)$ and $y_2(0) < y_1(T_0) < y_1(0)$. The Euclidean distances $d_w(0)$, $d_w(T_0)$ are used to compute the time metrics $\rho_w(0)$, $\rho_w(T_0)$, respectively. (B) A reversal of orientation in the active phase (solid lines) can lead to compression in the jump down; the dashed line indicates evolution of cell 1 in the silent phase. (C) Numerically computed trajectories of a pair of mutually coupled compound cells undergoing order reversal. Cells 1, 2 correspond to $c_1$, $c_2$, respectively, in (B). Parameter values are given in appendix A.
We now show that there is also a compression in the time metric corresponding to the $w$-coordinate. This is denoted by $\rho_w(t)$. Let $a^- = |G_L(y^*, w^*, 0)|$ and $a^+ = |G_R(y^*, w^*, 0)|$. Then
$$\rho_w(0) \approx \frac{w_2(0) - w_1(0)}{a^-} = \frac{w_2(0) - w_2(T_0)}{a^-} + \frac{w_2(T_0) - w_1(0)}{a^-}$$
$$\approx T_0 + \frac{w_2(T_0) - w_1(0)}{a^-} > T_0 + \frac{w_2(T_0) - w_1(0)}{a^+}$$
$$\approx \frac{w_1(0) - w_1(T_0)}{a^+} + \frac{w_2(T_0) - w_1(0)}{a^+} \approx \rho_w(T_0). \tag{3.7}$$
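The inequality chain in equation 3.7 can be checked numerically under the constant-speed caricature used in its derivation: $w$ decreases at speed $a^-$ in the silent phase and at speed $a^+ > a^-$ in the active phase. The sketch below (Python) uses hypothetical parameter values chosen only to satisfy the assumed ordering $w_1(0) < w_2(T_0) < w_2(0)$.

```python
# Illustrative check of the compression estimate in equation 3.7, under the
# simplifying assumption that w decreases at constant speed a_minus = |G_L|
# in the silent phase and a_plus = |G_R| in the active phase (a_minus < a_plus).
# All numerical values are hypothetical.

def rho_w(gap, speed):
    """Time for the trailing cell to cover a w-gap at the given speed."""
    return gap / speed

a_minus, a_plus = 0.1, 0.4      # silent- and active-phase speeds, a_minus < a_plus
w1_0, w2_0, T0 = 0.2, 0.5, 0.5  # initial w-values; cell 2 fires at t = T0

w2_T0 = w2_0 - a_minus * T0     # cell 2 is still silent on [0, T0]
w1_T0 = w1_0 - a_plus * T0      # cell 1 is already active on [0, T0]
assert w1_0 < w2_T0 < w2_0      # ordering assumed in the text

rho_0 = rho_w(w2_0 - w1_0, a_minus)    # both silent: measure at speed a_minus
rho_T0 = rho_w(w2_T0 - w1_T0, a_plus)  # both active: measure at speed a_plus
assert rho_T0 < rho_0           # compression across the jump up
```

The compression comes entirely from measuring the same $w$-gap at the faster active-phase speed, which is the content of assumption (A).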
Now suppose that $w_1(0) < w_2(T_0) < w_2(0)$ and $y_1(T_0) < y_2(0)$. The exact same calculation given in equation 3.7 shows that there is compression in the time metric $\rho_w$ across the jump up. A simple calculation also shows that there is compression in $\rho_y$.
3.3.2 The Active Phase. Next assume that the cells are active with $v_i > \theta_{\rm syn}$. Then each $(y_i, w_i)$ satisfies equation 3.3. It is easy to see why the cells are compressed in the Euclidean metric if we make some simplifying assumptions concerning the nonlinear functions $g$ and $h$. These assumptions arise naturally if one considers the network in appendix A; it is also a simple matter to extend this analysis to more general systems. Suppose that $g$ and $h$ are of the form $g(v, w) = (w_\infty(v) - w)/\tau_w(v)$ and $h(v, y) = (y_\infty(v) - y)/\tau_y(v)$. Note that while in the active phase, $y_\infty(v)$ and $w_\infty(v)$ are very small. Moreover, $\tau_y(v)$ and $\tau_w(v)$ are nearly constant. We assume here that while in the active phase, $g(v, w) = -w/\tau_w$ and $h(v, y) = -y/\tau_y$, where $\tau_w$ and $\tau_y$ are positive constants. It follows that each $(w_i, y_i)$ satisfies simple linear equations. If we ignore the jump-down curve, then each slow variable decays to 0 at an exponential rate. In particular, the distance between the cells decays exponentially. Actually, more is true. Each $(y_i, w_i)$ approaches the origin tangent to the weakest eigendirection. Now suppose that the jump-down curve passes close to the origin. The Euclidean distance between the cells still decreases exponentially, and both cells jump down at nearly the same point. This is the point where the jump-down curve crosses the weakest eigendirection. We note that there is also a more subtle source of compression while the cells are active. There will be some period of time when cell 2 receives inhibition but cell 1 does not; that is, $s_1 = s_A$ but $s_2 = 0$. During this time, the $(y_i, w_i)$ satisfy different equations.
It is then possible that the trajectories $(y_i(t), w_i(t))$ cross in the slow phase-space. This leads to a reversal of orientation between the cells, as shown in Figure 4B. Next, we discuss why a reversal of orientation can lead to compression in the cells' trajectories.
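The exponential compression in the active phase can be made concrete: with $g = -w/\tau_w$ and $h = -y/\tau_y$, the slow variables of each cell decay independently in closed form, and the Euclidean distance between the two cells contracts. A minimal sketch (Python, hypothetical constants and initial conditions):

```python
import math

# Active-phase caricature from the text: w' = -w/tau_w, y' = -y/tau_y,
# so each slow variable decays exponentially with its own rate.
tau_w, tau_y = 1.0, 0.5   # hypothetical time constants (tau_y < tau_w here)

def evolve(w, y, t):
    """Closed-form solution of the linear active-phase equations."""
    return w * math.exp(-t / tau_w), y * math.exp(-t / tau_y)

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

cell1, cell2 = (0.30, 0.10), (0.35, 0.16)   # hypothetical initial slow variables
d0 = dist(cell1, cell2)
d1 = dist(evolve(*cell1, 2.0), evolve(*cell2, 2.0))
assert d1 < d0   # Euclidean compression while both cells are active

# With tau_y < tau_w, the y-components decay faster, so both trajectories
# approach the origin tangent to the w-axis (the "weakest eigendirection").
```

Because both cells approach the origin along the same weak eigendirection, a jump-down curve passing near the origin intersects both trajectories at nearly the same point, as claimed in the text.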
3.3.3 The Jump Down. We now show that if the cells reverse their orientation while in the active phase, this can lead to a form of compression after the cells jump down. Let $T_i$ be the time when cell $i$ jumps down. As above, assume that cell 1 jumped up first with $w_1(0) < w_2(0)$. We will assume that $w_1(t) < w_2(t)$ as long as both cells are active. Moreover, from remark 1, the jump-down curve is nearly horizontal. Hence, cell 1 jumps down first; that is, $T_1 < T_2$. If the cells' trajectories cross while in the active phase, then $y_1(T_1) < y_2(T_2)$. This is shown in Figure 4B. Let $\rho_y^A(T_1)$ be the time it would take for the solution of equation 3.3 starting at $(y_2(T_1), w_2(T_1))$ to reach the $y$-coordinate $y_1(T_1)$. It follows that $\rho_y^A(T_1) > T_2 - T_1$. For $T_1 < t < T_2$, cell 1 evolves in the silent phase with $y$ increasing, while cell 2 evolves in the active phase with $y$ decreasing. If $y_1(T_2) < y_2(T_2)$, then $|y_2(T_2) - y_1(T_2)| < |y_2(T_1) - y_1(T_1)|$, so there is compression in the $y$-coordinates of the cells across the jump. Now suppose that $y_1(T_2) > y_2(T_2)$, as shown in Figure 4B. Let $\rho_y^S(T_2)$ be the time it would take for the solution of equation 3.2 starting at $(y_2(T_2), w_2(T_2))$ to reach the $y$-coordinate of cell 1. Since $y_2(T_1) > y_1(T_1)$, it follows that $\rho_y^S(T_2) < T_2 - T_1$. We have now demonstrated that $\rho_y^S(T_2) < T_2 - T_1 < \rho_y^A(T_1)$. That is, there is compression in the time metric corresponding to the $y$-coordinate. We note that since the jump-down curve is nearly horizontal, any compression in the $w$-coordinates is, to first order, neutral. A numerical example of orientation reversal in two cells' trajectories in $(y, w)$-space is shown in Figure 4C. At the top of the figure, the cells are in the silent phase. Each cell jumps up where the corresponding $y_i' = 0$; chronologically, cell 1 jumps up first.
In the active phase, the trajectories cross, because the cells experience different levels of inhibition; cell 2 receives inhibition first. The paths cross again at the bottom left of the figure after the leading cell, cell 1, falls down to the silent phase.
3.3.4 The Silent Phase. Suppose that both cells lie in the silent phase with $v_i < \theta_{\rm syn}$. Then each $(y_i, w_i)$ satisfies equation 3.2 until one of them reaches the jump-up curve. We now define a metric between the cells and, in appendix B, we analyze how to choose parameters to guarantee that the metric decreases as the cells evolve in the silent phase. This metric is similar to that introduced by Terman et al. (1998). Suppose that cell 1 reaches the jump-up curve first, and this is at the point $(y_1^*, w_1^*)$. (See Figure 13 in appendix B.) Fix some time $t_0$, and let $w_L^{t_0}$ be the translate of the jump-up curve such that $(y_1^*, w_1^*)$ is translated to the point $(y_1(t_0), w_1(t_0))$. Then the "distance" between $(y_1(t_0), w_1(t_0))$ and $(y_2(t_0), w_2(t_0))$ is the time it takes for the solution of equation 3.2, which begins at $(y_2(t_0), w_2(t_0))$, to cross $w_L^{t_0}$. This is certainly well defined as long as the two cells are sufficiently close to each other.
One can compute explicitly how this metric changes as the cells evolve in the silent phase. (The computation, rather technical, is in appendix B.) A more complete discussion of how this metric is used to prove the stability of the synchronous solution of two mutually coupled basic cells with slow synapses is given in Terman et al. (1998).
3.4 Further Remarks. The compression of cells can take place as the cells evolve along the multidimensional slow manifold or as they jump up or down. The compression during the jumping process depends on the geometry (or slope) of the curve of knees and the orientation of the cells both before and after the jumps. Parameters that determine this slope may therefore have subtle effects on the stability of the synchronous solution; $g_{\rm syn}$ is one such parameter (see equation 3.6). The results in Terman et al. (1998) provide precise conditions on combinations of parameters that ensure that the synchronous solution is stable when the inhibition is slow. Increasing $g_{\rm syn}$, for example, may sometimes stabilize the synchronous solution; however, when other parameters satisfy a different relationship, increasing $g_{\rm syn}$ may destabilize the synchronous solution. The size of the domain of attraction of the synchronous solution is, to a large extent, determined by the delay in the onset of inhibition. The two cells are able to fire together if the trailing cell lies within the window of opportunity determined by this delay. If the trailing cell lies outside this window, then the network typically exhibits antiphase behavior in which the cells take turns firing, although other network behavior is possible. The system may crash, for example, since the completely quiescent state is asymptotically stable. In our analysis, we assumed that the cells and coupling are homogeneous.
The effect of heterogeneities on mutually coupled basic cells with slow synapses was studied in several papers (Golomb & Rinzel, 1993; White, Chow, Ritt, Soto, & Kopell, 1998; Chow, 1998). These found that the synchronous solution is not very robust to mild levels of heterogeneity; a 5% variation in parameters was sufficient to destroy synchronous behavior. We have done a number of numerical simulations in order to study the effects of heterogeneities on the network considered in this section. Our numerical results are consistent with those in previous studies.
4 Globally Inhibitory Networks
We now consider the network described in section 2.3. Recall that in this network, E-cells excite J-cells, which in turn inhibit E-cells. The thalamic networks involved in sleep rhythms, discussed in section 5, are examples of such a network, with compound cells, to which the results of this section apply. We assume for now that there are just two E-cells, denoted by E1 and E2 , and there is one J-cell, which we denote as J. Larger networks are considered later. Initially, each cell is assumed to be a basic cell; generalization
for compound cells is discussed in section 4.4. The E-cells are identical to each other, but they may be different from J. Each cell is also assumed to be excitable for fixed levels of input. The system of equations corresponding to each $E_i$ is
$$v_i' = f(v_i, w_i) - g_{\rm inh}\, s_J (v_i - v_{\rm inh})$$
$$w_i' = \varepsilon\, g(v_i, w_i)$$
$$s_i' = \alpha(1 - s_i) H(v_i - \theta) - \beta s_i, \tag{4.1}$$
while the equations for J are
$$v_J' = f_J(v_J, w_J) - \tfrac{1}{2}(s_1 + s_2)\, g_{\rm exc}(v_J - v_{\rm exc})$$
$$w_J' = \varepsilon\, g_J(v_J, w_J)$$
$$s_J' = \alpha_J(1 - s_J) H(v_J - \theta_J) - \varepsilon K_J s_J. \tag{4.2}$$
Here, each synapse is direct. Indirect synapses will be needed when we discuss the stability of solutions. Note that the inhibitory variable $s_J$ turns off on the slow timescale. The reason that we write the equations this way will become clear in the analysis. We assume that $\beta = O(1)$; however, there is no problem in extending the analysis if $\beta = O(\varepsilon)$. If $v_i > \theta$, then $s_i \to s_A \equiv \alpha/(\alpha + \beta)$ on the fast timescale. Two types of network behavior are shown in Figures 8 and 9. A synchronous solution, in which each cell fires during every cycle, is shown in Figure 9. In Figure 8, each excitatory cell fires every second cycle, while J fires during every cycle. This type of solution is referred to as a clustered solution. In the following sections, we construct singular orbits corresponding to each of these solutions and then analyze their stability. The constructions then lead to conditions for when the different solutions exist and are stable.
4.1 Existence of the Synchronous Solution. We now construct a singular trajectory corresponding to a synchronous solution in phase space. As before, the trajectory for each cell lies on the left or right branch of a cubic nullcline during the silent and active phases. Which cubic a cell inhabits depends on the total synaptic input that the cell receives. Nullclines for the $E_i$ are shown in Figure 5A and those for J in Figure 5B. Note in Figure 5A that the $s_J = 1$ nullcline lies above the $s_J = 0$ nullcline, while in Figure 5B, the $s_{\rm tot} \equiv \frac{1}{2}(s_1 + s_2) = s_A$ nullcline lies below the $s_{\rm tot} = 0$ nullcline. This is because the $E_i$ receive inhibition from J while J receives excitation from the $E_i$. We will make several assumptions concerning the flow in the following construction. These are justified later. We begin with each cell in the active phase just after it has jumped up. These are the points labeled $P_0$ and $Q_0$ in Figure 5. Each $E_i$ evolves down
Figure 5: Nullclines for (A) E-cells and (B) J-cells in a globally inhibitory network with basic cells. The closed curves and points $P_i$, $Q_i$ correspond to the singular synchronous solution discussed in the text. Note that $s_J$ decays on the slow timescale.
the right branch of the $s_J = 1$ cubic, while J evolves down the right branch of the $s_{\rm tot} = s_A$ cubic. We assume that the $E_i$ have a shorter active phase than J, so each $E_i$ reaches the right knee $P_1$ and jumps down to the point $P_2$ before J jumps down. We also assume that at this time, J lies above the right knee of the $s_{\rm tot} = 0$ cubic. J must then jump from the point $Q_1$ to the point $Q_2$ along the $s_{\rm tot} = 0$ cubic. On the next piece of the solution, J moves down the right branch of the $s_{\rm tot} = 0$ cubic while the $E_i$ move up the left branch of the $s_J = 1$ cubic. When J reaches the right knee $Q_3$, it jumps down to the point $Q_4$ along the left branch of the $s_{\rm tot} = 0$ cubic. Now the inhibition $s_J$ to the $E_i$ starts to turn off on the slow timescale. Thus, the $E_i$ do not jump to another cubic. Instead, the trajectory for the $E_i$ moves upward, with increasing $w_i$, until it crosses the $w$-nullcline. Then each $w_i$ starts to decrease. If this orbit is able to reach a left knee, it jumps up to
Figure 6: Slow phase plane for an E-cell. The curve $w_L(s_J)$ is the jump-up curve, which trajectories reach if $K_J$ is large enough (dash-dot path; the cell jumps up from the point marked *). The dotted curve $W_F(s_J)$ consists of zeros of $G_L(w, s_J)$ in system 4.3; trajectories tend to the stable critical point $W_F(0)$ as $s_J \to 0$ for small $K_J$ (dash-dot-dot path). Note that $w' < 0$ for $w > W_F$.
the active phase, and this completes one cycle of the synchronous solution. When the $E_i$ jump up, J also jumps up if it lies above the left knee of the $s_{\rm tot} = s_A$ cubic. We now derive more quantitative conditions for when the singular synchronous solution exists. It is not at all obvious, for example, why we needed to assume that the active phase of J is longer than that for the $E_i$. It is also not clear what conditions are needed to ensure that the $E_i$ are able to reach a jump-up curve and escape once they are released from inhibition. These two issues are actually closely related. We first discuss how the $E_i$ can reach the jump-up curve. For this, it is convenient to derive equations for the evolution of the slow variables $(w_i, s_J)$, as was done in section 3.1. Let $\tau = \varepsilon t$, denote the left branch of the cubic $f(v, w) - g_{\rm inh}\, s(v - v_{\rm inh}) = 0$ by $v = \Phi_L(w, s)$, and let $G_L(w, s) \equiv g(\Phi_L(w, s), w)$. Then each $(w_i, s_J)$ satisfies the slow equations
$$\dot w = G_L(w, s_J), \qquad \dot s_J = -K_J s_J. \tag{4.3}$$
The phase plane corresponding to this system is illustrated in Figure 6. There are two important curves shown in the figure. The first is the jump-up curve $w = w_L(s_J)$; this is the curve of "left knees." The second curve, denoted by $W_F(s_J)$, corresponds to the fixed points of the first two equations
in 4.1 with the input $s_J$ held constant. This corresponds to the $w$-nullcline of equation 4.3. We need to determine when a solution $(w(\tau), s_J(\tau))$ of equation 4.3 beginning with $s_J(0) = 1$ and $w(0) < W_F(1)$ can reach the jump-up curve $w_L(s_J)$. This is clearly impossible if $W_F(1) < w_L(0)$, so we shall assume that $W_F(1) > w_L(0)$. If $w(0) > w_L(0)$ and $K_J$ is sufficiently large, the solution will certainly reach the jump-up curve; this is because the solution will be nearly vertical, as shown in Figure 6. If, on the other hand, $K_J$ is too small, the solution will never be able to reach the jump-up curve. This is because the solution will slowly approach the curve $W_F(s_J)$ and lie very close to this curve as $s_J$ approaches zero. This is also shown in Figure 6. We conclude that the cells are able to escape the silent phase if the inhibitory synapses turn off sufficiently quickly and the $w$-values of the cells are sufficiently large when this deactivation begins. Escape is not possible for very slowly deactivating synapses (although it would be possible with slow deactivation if the cells were oscillatory for some levels of synaptic input). A biophysical interpretation of this is that escape is possible for GABA_A synapses and will occur if the cell's $I_T$ current is sufficiently deinactivated when inhibition begins to wear off. We assume that $K_J$ is large enough so that escape is possible. Choose $W_{\rm esc}$ so that the solution of equation 4.3 that begins with $s_J(0) = 1$ will be able to reach the jump-up curve only if $w(0) > W_{\rm esc}$. The existence of the singular synchronous solution now depends on whether the $E_i$ lie in the region where $w_i > W_{\rm esc}$ when J jumps down to the silent phase. We claim that this requires that the active phase of J be sufficiently long. One can give a simple estimate of how long this active phase must be as follows. Suppose that all the cells jump up when $t = 0$, the $E_i$ jump down when $t = t_E$, and J jumps down when $t = t_J$. We require that $w_i(t_J) > W_{\rm esc}$. Since the time the $E_i$ spend in the silent phase before they are released from inhibition is $t_J - t_E$, this implies that $t_J - t_E$ must be sufficiently large. Hence, J's active phase must be sufficiently longer than the $E_i$'s. More precisely, let $w_{RK}$ be the value of $w$ at the right knee of the $s_J = 1$ cubic, and let $t_L$ be the time it takes for the solution of the first equation in 4.3 with $s_J = 1$ to go from $w = w_{RK}$ to $w = W_{\rm esc}$. We require that
$$t_J - t_E > t_L. \tag{4.4}$$
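The escape mechanism can be illustrated by integrating system 4.3 directly. In the sketch below (Python), $W_F(s_J)$ and $w_L(s_J)$ are hypothetical linear caricatures, chosen only so that $W_F(1) > w_L(0)$, the knee rises with $s$ (as in the lemma), and $W_F(s) < w_L(s)$ for $s < 1$; they are not the functions of the actual thalamic model.

```python
# Schematic illustration of the escape condition for system 4.3:
#   w' = G_L(w, s_J),   s_J' = -K_J * s_J.
# W_F and w_L below are hypothetical linear caricatures satisfying
# W_F(1) > w_L(0) and dw_L/ds > 0, with W_F(s) < w_L(s) for s < 1.

def W_F(s):                      # fixed points of the w-equation (dotted curve)
    return 0.45 + 0.40 * s

def w_L(s):                      # jump-up curve ("left knees")
    return 0.60 + 0.30 * s

def escapes(K_J, w0=0.8, dt=1e-3, t_max=200.0):
    """Euler-integrate 4.3 from (w0, s_J = 1); True if (w, s_J) crosses w_L."""
    w, s = w0, 1.0
    for _ in range(int(t_max / dt)):
        if w > w_L(s):
            return True
        w += dt * (W_F(s) - w)   # G_L(w, s) = (W_F(s) - w)/tau with tau = 1
        s += dt * (-K_J * s)
    return False

assert escapes(K_J=50.0)         # fast deactivation: trajectory nearly vertical
assert not escapes(K_J=0.05)     # slow deactivation: w tracks W_F, no escape
```

For large $K_J$, $s_J$ collapses before $w$ can move, so the trajectory crosses the knee curve almost vertically; for small $K_J$, $w$ relaxes onto $W_F(s_J)$, which lies below $w_L(s_J)$, and the cell never fires.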
4.2 Stability of the Singular Synchronous Solution. We now demonstrate that the synchronous solution is stable if the synapse $s_J$ is indirect and the active phase of J is sufficiently long. We start with the $E_i$ a small distance apart, just after both have jumped up to the active phase. Assume that this causes J to fire. We will show that after one cycle, both of the $E_i$ are so close that they must fire together again. Moreover, there is a contraction in the distance between the E-cells during each cycle. The analysis proceeds as in the previous section. We assume that the active phases of the $E_i$ are shorter than that of J, so that the $E_i$ return to
the silent phase and proceed up the left branch of the $s_J = 1$ cubic before J jumps down. As the $E_i$ move up this left branch, they approach the point $P_L$ where the $s_J = 1$ cubic intersects the $w$-nullcline. (See Figure 5A.) If the active phase of J is sufficiently long, then the $E_i$ lie as close as we please to $P_L$, and therefore to each other, when J jumps down. This is precisely what is required to guarantee that both will fire together during the next cycle. While in the silent phase, the $E_i$ approach $P_L$ at an exponential rate (in the slow timescale). This leads to a very strong compression of the Euclidean distance between the cells while in the silent phase. This compression is certainly stronger than any possible expansion over the remainder of the cycle. After the J-cell falls down, $s_J$ decays on the slow timescale. This allows the J-cell to recover so that it can fire when excited by the firing of the E-cells, and the whole cycle repeats. We need to assume that $s_J$ corresponds to an indirect synapse for the same reason as we did previously. When one of the E-cells fires, this causes J to fire, which sends inhibition back to the other E-cell. If $s_J$ is direct, this causes the second E-cell to be "stepped on" on the fast timescale, and the synchronous solution cannot be stable. Note that the time between the firings of the E-cells is determined by $K_J$, the rate at which $s_J$ decays. If $K_J$ is large, the time between firings is short; it is then easier for the second cell to pass through the window of opportunity provided by the indirect synapse. Our analysis has shown that the dynamics of the J-cell can influence the domain of attraction of the synchronous solution in several ways. If J's active phase is long, then the E-cells lie close to each other, near $P_L$, when J jumps down and releases them from inhibition. Moreover, if J recovers quickly in the silent phase, then $K_J$ can be chosen to be large.
Both factors make it easier for the E-cells, once they are released from inhibition, to pass through a window of opportunity and fire during the same cycle. Hence, both enlarge the domain of attraction of the synchronous solution.
Remark 2. There are important differences between the ways in which mutually coupled and globally inhibitory networks use inhibition to synchronize oscillations. In mutually coupled networks, a second slow variable is required for the existence of the synchronized solution; it allows the cells to escape from the silent phase. The second slow variable is also required for the compression of the cells as they evolve in phase space. The existence and stability of the synchronous solution in globally inhibitory networks, on the other hand, are controlled by the dynamics of the J-cell. If the J-cell's active phase is long enough, then this pushes the E-cells, in their silent phase, to a position from which they can escape; moreover, this provides a strong compression of the E-cells. For this network, we require that the inhibition decays on the slow timescale; however, the reason is so that the J-cell can recover sufficiently. The slow recovery is not needed to allow the E-cells to escape or for compression. In fact, the domain of stability of the
synchronous solution is increased if the recovery of the J-cell and the decay of the synapses occur quickly, on the slow timescale.
4.3 Clustered Solution. We now describe the geometric construction of the singular antiphase, or clustered, solution. It suffices to consider half of a complete cycle. During this half-cycle, $E_1$ fires and returns to the initial position of $E_2$, J fires and returns to its initial position, and $E_2$ evolves in the silent phase to the initial position of $E_1$. By symmetry, we can then continue the solution for another half-cycle with the roles of $E_1$ and $E_2$ reversed. When $E_1$ jumps up, it forces J to jump up to the right branch of the $s_{\rm tot} = \frac{1}{2}s_A$ cubic. Then $E_1$ moves down the right branch of the $s_J = 1$ cubic, while J moves down the right branch of the $s_{\rm tot} = \frac{1}{2}s_A$ cubic and $E_2$ moves up the left branch of the $s_J = 1$ cubic. We assume, as before, that $E_1$'s active phase is shorter than J's active phase, so $E_1$ jumps down before J does so. It is possible that J lies below the right knee of the $s_{\rm tot} = 0$ cubic at this time, in which case J also jumps down. If J lies above this right knee, then it moves down the right branch of the $s_{\rm tot} = 0$ cubic until it reaches the right knee and then jumps down. During this time, both $E_1$ and $E_2$ move up the left branch of the $s_J = 1$ cubic. After J jumps down, $s_J(t)$ slowly decreases. If $E_2$ is able to reach the jump-up curve, then it fires, and this completes the first half-cycle of the singular solution. Suppose that $t = t_F$ when this occurs. For this to be one-half of an antiphase solution, we need $w_2(t_F) = w_1(0)$, $w_1(t_F) = w_2(0)$, and $w_J(t_F) = w_J(0)$. We now derive conditions for when the antiphase solution exists. These will imply that the active phase of J cannot be too long or too short, compared with the active phase of the $E_i$. If J's active phase is too long, then the network exhibits synchronous behavior as described before.
If J's active phase is too short, then the system approaches the stable quiescent state. Suppose that E1 and J jump up when t = 0, E1 jumps down when t = tE, and J jumps down when t = tJ. Let tL and Wesc be as defined in the previous section. To have a clustered solution, we require that
w1(tJ) < Wesc < w2(tJ).    (4.5)
The second inequality is necessary to allow E2 to fire during the second half-cycle. The first inequality guarantees that E1 does not fire during this half-cycle. It follows from the definitions that the first inequality is equivalent to

tJ - tE < tL.    (4.6)
Next we derive a similar expression for the second inequality in equation 4.5. For w0 < w1, let r(w0, w1) be the time it takes for a solution of the
Population Rhythms
621
first equation in 4.3 with sJ = 1 to go from w0 to w1. The second inequality is equivalent to r(wRK, w2(tJ)) > tL. Now, r(wRK, w2(tJ)) = r(wRK, w2(0)) + r(w2(0), w2(tJ)). Moreover, wRK = w1(tE) and w2(0) = w1(tF). Hence, r(wRK, w2(tJ)) = r(w1(tE), w1(tF)) + r(w2(0), w2(tJ)). Clearly, r(w2(0), w2(tJ)) = tJ, because E2 lies on the sJ = 1 cubic for 0 < t < tJ. It is not true that E1 lies on the sJ = 1 cubic for tE < t < tF; however, if the first equation in 4.3 is weakly dependent on sJ, then we have that r(w1(tE), w1(tF)) ≈ tF - tE. In this case, the second inequality in equation 4.5 is equivalent to

tF - tE + tJ > tL.    (4.7)
One can simplify this formula if the parameter KJ is rather large. In this case, tF - tJ is small; that is, E2 escapes the silent phase as soon as it is released from inhibition. Then equation 4.7 is approximately equivalent to

tJ > (1/2)(tE + tL).    (4.8)
Combining equations 4.6 and 4.8 leads to the following condition for the existence of a clustered solution if the synaptic variable sJ turns off quickly:

(1/2)(tE + tL) < tJ < tE + tL.    (4.9)
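As a quick numeric illustration of equation 4.9, the admissible window for tJ can be computed directly. The durations used here are hypothetical placeholders, not values derived from the model:

```python
# Sanity check of the clustered-solution window, equation 4.9:
#   (1/2)(tE + tL) < tJ < tE + tL,
# obtained by combining tJ - tE < tL (eq. 4.6) with tJ > (tE + tL)/2 (eq. 4.8).
# All durations below are illustrative placeholders, not model-derived values.

def clustered_window(tE, tL):
    """Return (lower, upper) bounds on the J-cell's active-phase duration tJ."""
    return 0.5 * (tE + tL), tE + tL

def admits_clustering(tJ, tE, tL):
    lower, upper = clustered_window(tE, tL)
    return lower < tJ < upper

tE, tL = 10.0, 30.0                      # E-cell active phase, silent-phase transit time
lower, upper = clustered_window(tE, tL)
print(f"clustered solution requires {lower} < tJ < {upper}")

print(admits_clustering(25.0, tE, tL))   # inside the window -> True
print(admits_clustering(50.0, tE, tL))   # too long: synchrony instead -> False
print(admits_clustering(15.0, tE, tL))   # too short: quiescence -> False
```

Outside the window, the construction above predicts the other two outcomes: synchrony when J's active phase is too long, approach to the quiescent state when it is too short.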
Remark 3. It is possible for both stable synchrony and stable clustering to exist for the same parameter values. Note that the domain of stability of the synchronous solution is controlled to a large extent by the size of the delay caused by the indirect inhibitory synapses. If this delay is small, the domain of stability will also be small. In this case, the synchronous solution will still be stable; however, most solutions will converge to a stable clustered solution.
4.4 Globally Inhibitory Networks with Compound Cells. The discussion in the previous subsections generalizes to globally inhibitory networks with compound cells, such as the model thalamic network in the next section. The primary difference is that each cell contains an additional slow variable, yi, so it is necessary to consider a higher-dimensional slow phase-space. As a consequence, the jump-up curves of the previous subsections are
Figure 7: Three-dimensional slow phase-space for a singular synchronous periodic orbit of the globally inhibitory network. Double (single) arrows denote evolution on the fast (slow) timescale, solid (dashed) lines indicate the silent (active) phase, and points Pi are as discussed in the text. The shaded region represents the jump-up surface w = wL(y, sJ).
replaced by jump-up surfaces {w = wα(y, sJ)}, α = L or R, where wα(y, sJ) is as in the lemma in section 3. Figure 7 illustrates the evolution of the slow variables (wi, yi, sJ) for the singular synchronous solution. We begin at the point labeled P1 on the jump-up surface w = wL(y, sJ). The E-cells then jump up, and this forces J to jump up. Hence, sJ → 1. This corresponds to the segment in Figure 7 that connects P1 to the point P2 on the sJ = 1 surface. Each cell then evolves in the active phase with sJ = 1. As before, we assume that the active phases of the E-cells are shorter than that of J. Hence, the E-cells jump down when the (wi, yi) reach the jump-down curve wi = wR(yi, 1). This is at the point labeled P3 in Figure 7. While J lies in the active phase, the E-cells evolve in the silent phase, but sJ = 1 still holds. At P4, J jumps down and the (wi, yi, sJ) evolve
until they reach the jump-up curve. This then completes one cycle of the singular solution. The stability analysis proceeds just as in section 4.2. What is crucial for stability is that J remains active long enough. The E-cells then approach the stable fixed point on the left branch of the sJ = 1 surface while J is active. This provides the compression needed for stability. The construction of a clustered solution is also very similar to that described in section 4.3. We do not describe the construction here; some comments are given in the next section. 4.5 Further Remarks. The geometric constructions of the synchronous and clustered solutions extend in a straightforward manner to globally inhibitory networks with an arbitrary number of excitatory cells Ei. Of course, in a larger network there are more possibilities for clustered solutions; however, if each cluster contains (approximately) the same number of cells, then inequalities similar to equation 4.9 must be satisfied. This is similar to the analysis of Terman and Wang (1995), which yields precise conditions for the existence of clustered states in a locally excitatory and globally inhibitory network model for scene segmentation. Further analysis of clustered solutions in inhibitory networks is given in Rubin and Terman (1999). The analysis leads to simple formulas for the periods of the synchronous and clustered solutions. Let tJ be, as above, the time cell J spends in the active phase, and let tS be the time for the E-cells to reach the jump-up curve after the J-cell jumps down. Then the period of the synchronous solution is simply tJ + tS. Now tJ is determined by the dynamics of the J-cell, while tS is primarily controlled by the rate at which the synapses turn off; this is the parameter KJ in equation 4.2. (Also see Figure 11.) Other parameters play a secondary role. Note, for example, that the parameter gsyn influences the period by controlling the slope of the jump-up curve, as shown in equation 3.6.
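The period formula can be made concrete with a minimal sketch. Purely as an assumption for illustration, suppose the inhibition decays exponentially, sJ(t) = exp(-KJ t), after the J-cell jumps down, and that the E-cells escape once sJ falls below a hypothetical threshold s*; then tS = ln(1/s*)/KJ, and the period tJ + tS drops sharply as KJ grows:

```python
import math

# Illustrative sketch of the period formula: period = tJ + tS, where tS is the
# time for the inhibition sJ(t) = exp(-KJ*t) to decay below a hypothetical
# escape threshold s_star. All numbers are placeholders, not model parameters.

def period(tJ, KJ, s_star):
    tS = math.log(1.0 / s_star) / KJ   # solves exp(-KJ * tS) = s_star
    return tJ + tS

tJ, s_star = 100.0, 0.2
for KJ in (0.04, 0.08, 0.14):
    print(f"KJ = {KJ:5.2f}  ->  period = {period(tJ, KJ, s_star):7.1f}")
# The period falls sharply as the decay rate KJ increases, while changes in tJ
# shift the period only additively.
```

This hyperbolic dependence on the decay rate is the qualitative shape seen for the solid curve in Figure 11.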
The synchronous and clustered solutions can exist only if trajectories are able to escape from the silent phase. Previous work on mutually inhibitory neurons has emphasized the distinction between “release” and “escape” in producing both synchronous and antiphase solutions (Wang & Rinzel, 1993). “Release” refers to the case in which the active phase of one cell ends and this releases another silent cell from inhibition. “Escape” refers to the case in which the dynamics of the inactive cell allow it to fire even if that cell receives inhibition. As pointed out in Terman et al. (1998), there is no clear distinction between escape and release when there are multiple slow processes. A silent E-cell is in some sense released when the J-cell jumps down to the silent phase. The rhythms can continue, however, only if this E-cell is able to escape the silent phase. Here we are assuming that each E-cell is excitable for constant levels of inhibition. Both escape and release are therefore needed to maintain oscillations.
5 Thalamic Network
The model for the thalamic spindle sleep rhythm falls into the framework of the network analyzed in the preceding section. The two populations of cells in the model are the thalamocortical relay (TC) cells and the thalamic reticularis (RE) cells, corresponding to the E-cells and J-cells, respectively. One difference between the spindle model and those considered earlier is that the spindle model contains numerous RE as well as TC cells. The RE cells communicate with each other through fast inhibitory synapses, as illustrated in the network shown in Figure 2. During spindling, the network exhibits behavior similar to the clustered solution discussed in the previous section. The RE population is synchronized, while the TC cells break up into groups; cells within each group are synchronized, while cells within different groups are desynchronized. The network also exhibits completely synchronized rhythms. The synchronized rhythms arise, for example, when fast inhibition is removed from the entire network or from between the RE cells only. Recent results have also shown that the synchrony can arise if the RE population receives additional phasic excitation, corresponding to cortical input, at the delta frequency. Hence, the network can transform from clustering to synchronized behavior without any change in the inhibitory synapses. In this section, we demonstrate how the geometric analysis helps to explain the dynamical mechanisms responsible for these rhythms and transitions between them. We begin by presenting a concrete model for the sleep rhythms and then present results of numerical simulations of this model. The numerical simulations clearly show that solutions of the model behave as predicted by the singular perturbation analysis; in particular, solutions jump up and jump down when they reach a curve (or surface) of knees. 
This confirms that the decomposition into fast and slow variables, as described in the previous section, provides the correct singular perturbation framework for analyzing these rhythms. We also demonstrate how the geometric analysis leads to quantitative statements concerning the behavior of solutions. In particular, the analysis correctly predicts how the frequency of oscillations depends on parameters. The analysis also leads to precise conditions for when the network exhibits either synchronous or clustered solutions. For example, factors that enable the TC cells to synchronize are a long RE active phase, a relatively fast RE recovery, and a fast decay of inhibition. Some of these factors are inconsistent with mechanisms for synchronization in mutually coupled inhibitory networks. Finally, the analysis clarifies the roles of the various intrinsic and synaptic currents in generating a particular rhythm. We illustrate this in section 5.3, where we consider the role of the sag current, which also differs from the role of the second slow intrinsic current in mutually coupled networks.
5.1 Model. The following model contains many parameters and nonlinear functions (these are given in appendix A). The cells are modeled using the Hodgkin-Huxley formalism (Hodgkin & Huxley, 1952); the equations are very similar to those in Golomb et al. (1994). The equations of each TC cell are:
vi' = -IT(vi, hi) - Isag(vi, ri) - IL(vi) - IA - IB
hi' = (h∞(vi) - hi)/τh(vi)
ri' = (r∞(vi) - ri)/τr(vi).    (5.1)
This describes a compound cell, with hi and ri corresponding to w and y, respectively, and with ε absorbed into τh, τr rather than written explicitly. The terms IT, Isag, and IL are intrinsic currents; they are given by IT(v, h) = gCa m∞²(v) h (v - vCa), Isag(v, r) = gsag r (v - vsag), and IL(v) = gL (v - vL). The terms IA and IB represent the fast (GABA_A) and slow (GABA_B) inhibitory input from the RE cells. We model the fast inhibition IA as in previous sections; that is, IA = gA (vi - vA) (1/NTR) Σj sA^j, where gA and vA are the maximal conductance and the reversal potential of the synaptic current. The sum is over all RE cells that send input to the ith TC cell, and NTR represents the maximum number of RE cells that send inhibition to a single TC cell. Each synaptic variable sA^j satisfies the first-order equation

(sA^j)' = αR (1 - sA^j) H(vR^j - θR) - βR sA^j,    (5.2)
where vR^j is the membrane potential variable of the jth RE cell. Motivated by recent experiments (Destexhe, Bal, McCormick, & Sejnowski, 1996; Destexhe & Sejnowski, 1997), we model the slow inhibition IB somewhat differently from IA. We first discuss, however, the model for the RE cells. The equations of each RE cell are:

viR' = -IRT(viR, hiR) - IAHP(viR, mi) - IRL(viR) - IRA - IE
hiR' = (hR∞(viR) - hiR)/τRh(viR)
mi' = μ1 [Ca]i (1 - mi) - μ2 mi
[Ca]i' = -ν IRT - γ [Ca]i.    (5.3)
The terms IRT, IAHP, and IRL represent intrinsic currents. These are given by IRT(v, h) = gRCa mR∞²(v) h (v - vRCa), IAHP(v, m) = gAHP m (v - vK), and IRL(v) = gRL (v - vRL). More details concerning the biophysical significance of each term are given in Golomb et al. (1994) and Terman et al. (1996). In equation 5.3, IRA represents the inhibitory input from other RE cells. It is modeled as IRA = gRA (viR - vRA) (1/NRR) Σj sRA^j, where the sum is over all RE cells that send input to the ith RE cell. Each synaptic variable sRA^j satisfies a
first-order equation similar to 5.2. The term IE represents excitatory (AMPA) input from the TC cells and is expressed as IE = gE (viR - vE) (1/NRT) Σj sE^j, where the sum is over all TC cells that send excitatory input to the ith RE cell. The
synaptic variables sE^j are fast and also satisfy first-order equations similar to 5.2. It remains to discuss how we model the slow inhibitory current IB. Similarly to Destexhe et al. (1996), we assume that IB = gB sbi^4 (vi - vB), where sbi, along with the variable xbi, satisfies

sbi' = k1 H(xbi - θxb)(1 - sbi) - k2 sbi
xbi' = (k3/NTR) [Σj H(vR^j - θRb)] (1 - xbi) - k4 xbi.

The parameters are such that xbi can become activated (i.e., exceed θxb) only if a sufficiently large number of RE cells have their membrane potentials above the threshold θRb. The threshold is chosen rather large, so the RE bursts must be sufficiently powerful to activate xbi. Once xbi becomes activated, it turns on the synaptic variable sbi; the expression sbi^4 in IB further delays the effect of the inhibition on the postsynaptic cell. 5.2 Numerical Simulations. A clustered solution is shown in Figure 8. There are three RE cells in this example, and they oscillate in synchrony at about 12.5 Hz; one of the RE cells is shown in Figure 8A. The RE cells synchronize due to excitation from the TC population. (This is discussed in more detail in section 5.3.) There are six TC cells, and they form two clusters, each oscillating at half of the RE oscillation frequency, as shown in Figure 8B. In Figure 8C, we show the time courses of the fast (sA) and slow (sb) inhibitory synaptic variables, respectively. Note that the fast inhibition activates during every cycle, providing the hyperpolarizing current needed to deinactivate each TC cell's IT current. The fast inhibition is also needed to desynchronize the TC cells so they can form clusters. The slow current IB never activates during this solution because the RE cells do not fire powerful enough bursts; that is, the membrane potentials viR do not rise above the threshold θRb = -25 mV long enough to activate the variables xbi. A synchronous solution is shown in Figure 9. The parameters are exactly as in Figure 8 except that we set gRA = 0; that is, we have turned off the fast inhibition between RE cells. Note in Figure 9 that each TC cell fires during every cycle along with the RE cells. The slow inhibitory current IB now activates during every cycle.
Comparing the slow inhibitory variable with the fast inhibitory variable in Figure 9C, we see that the slow inhibition stays on longer and both turns on and turns off more gradually. Geometric analysis is useful in understanding why removing fast inhibition allows slow inhibition to activate and why this leads to synchronization of the network. This is discussed in the next subsection.
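The contrast between the fast and slow inhibitory kinetics can be sketched with a forward-Euler integration of first-order gating equations of the form of equation 5.2. The rate constants and the square RE burst below are hypothetical placeholders, not the appendix A values:

```python
# Minimal Euler sketch (illustrative rate constants, not the appendix A values)
# contrasting a fast synapse of the form of equation 5.2 with the slow cascade
# x -> s: the slow variable turns on and off more gradually and stays on longer.

def heav(x):
    return 1.0 if x > 0 else 0.0

dt, T = 0.1, 300.0                        # ms
aR, bR = 2.0, 0.1                         # fast synapse rates (hypothetical)
k1, k2, k3, k4 = 0.5, 0.01, 0.1, 0.05     # slow cascade rates (hypothetical)
theta_xb = 0.5

sA, x, sb = 0.0, 0.0, 0.0
trace = []
t = 0.0
while t < T:
    # Square RE population burst standing in for H(vR - theta_R):
    burst = 1.0 if 50.0 <= t < 150.0 else 0.0
    sA += dt * (aR * (1 - sA) * burst - bR * sA)         # fast synapse
    x  += dt * (k3 * burst * (1 - x) - k4 * x)           # slow intermediate
    sb += dt * (k1 * heav(x - theta_xb) * (1 - sb) - k2 * sb)  # slow synapse
    trace.append((t, sA, sb))
    t += dt

# sA rises almost immediately during the burst; sb turns on only after x
# crosses theta_xb, and decays much more slowly (k2 << bR) after the burst.
```

With k2 much smaller than βR, the slow variable outlasts the fast one by an order of magnitude, which is the qualitative behavior visible in Figure 9C.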
Figure 8: Numerical solution, with two TC clusters, of the thalamic network (parameter values are given in appendix A). Voltages are in mV and time is in msec. (Top) RE cell (the time course of which matches that of the RE population of three cells). (Middle) TC population of six cells, forming two clusters of three cells each. (Bottom) Inhibitory synaptic variables sA (dashed) and sb (solid). Note that sb ≡ 0, since the RE cell bursts are not powerful enough to activate slow inhibition in this case.
In Figure 10 we plot the trajectory (hi(t), ri(t), sb(t)) corresponding to the synchronous solution shown in Figure 9. We also plot the numerically computed surface of knees corresponding to the jump-up points; there is a similar jump-down surface, but it is not shown. The behavior of the trajectory is consistent with that discussed in the previous section. For example, the cells jump up when the trajectory crosses the jump-up surface of knees (approximately, since ε ≠ 0 numerically) (see Rubin & Terman, 1998, for a similar plot for the clustered solution). Figure 11 shows a plot of the period of synchronous TC oscillations as the rate of decay of inhibition, the rate of change of h, and the parameter gB are
Figure 9: Numerical synchronous solution of the thalamic network (parameter values are given in appendix A). Voltages are in mV and time is in msec. (Top) RE cell (the time course of which matches that of the RE population of three cells). (Middle) TC population of six cells (synchronized). (Bottom) Inhibitory synaptic variables sA (dashed) and sb (solid). Note that sA turns on and off faster than sb , which in turn stays on longer.
separately varied. These relations can be derived from our analysis, as can predictions about the dependence of the period on other model parameters. As noted earlier, the period is given by the duration of the RE active phase (tJ) plus the time it takes for inhibition to decay sufficiently that the TC cells can jump up (tS); hence, the period drops relatively sharply as the inhibitory decay rate increases. The same strong dependence, which our analysis explains, was found in the modeling work of Destexhe (1998). As the rate of change of h increases, there is essentially no change in the length of the TC silent phase (not shown) or in the period. This reflects the fact that over the parameter range considered, the TC cells are compressed close to their silent-phase rest state while the RE cells are active, and then the TC cells
Figure 10: Numerical trajectory of a TC cell, together with jump-up surface of knees (shaded), in the (h, r, sb ) slow phase-space. The TC cell shown belongs to the synchronized population of six TC cells shown in Figure 9. Note that sb does not immediately increase when the cell jumps up, since we have taken slow inhibitory synapses to be indirect.
evolve with little change in h after the RE cells jump down. Interestingly, gB has little effect on the period because it has little influence on tJ and tS. The mild effect that does occur is due to the subtle influence of gB on the slope of the curve of knees for the TC silent phase. This actually causes the period to increase as gB, and hence the strength of the slow inhibitory coupling, increases (see also Golomb et al., 1994). Our numerical studies show that the synchronous and clustered solutions are robust to moderate levels of heterogeneity and variations in the parameters. This is due to the strong TC compression provided by the RE cells. For example, these solutions were not affected by heterogeneities of about 20% in sag conductances and about 5% in the TC IT conductance.
Figure 11: Period of synchronous TC oscillations versus rate of decay of inhibition (k2 = βR, solid curve), rate of change of h (dashed line), and gB (dotted line) for the population shown in Figure 9. The parameters k2, βR were varied from 0.04 to 0.14 and were held at 0.1 for the other two curves. A parameter w was factored out from τh0, τh1 in the h-equation and was varied from 1.5 to 2.5; w was held at 2.0 for the other curves. The parameter gB was varied from 0.03 to 0.07 and was held at 0.05 for the other two curves. Other parameter values are as given in appendix A.
Stronger TC heterogeneities tend to promote TC clustering in this model. In the resultant patterns, TC cells with similar properties fire together, independent of the way they are initially perturbed from synchrony. 5.3 Further Implications of the Analysis.
5.3.1 Removing Fast Inhibition. Fast inhibition occurs in two places: the RE cells inhibit themselves as well as the TC cells. Removing fast inhibition has different consequences for each of these synaptic connections, and
both of these help to synchronize the TC cells (see also Golomb et al., 1994; Destexhe & Sejnowski, 1997). Removing the RE-TC fast inhibition is clearly helpful for synchronization among the TC cells. This holds because the fast inhibition has a very short rise time, which follows since the fast inhibitory synapses are direct in equations 5.1 and 5.2. This short rise time corresponds to a small domain of attraction of the synchronous solution. In fact, it is precisely this inhibition that is responsible for desynchronizing the TC cells during the clustered solution. Removing the RE-RE fast inhibition appears to be even more crucial for synchronizing the TC cells (see also Steriade, McCormick, & Sejnowski, 1993; Huguenard & Prince, 1994; Destexhe, Bal, McCormick, & Steriade, 1996; Destexhe, 1998). This allows the RE cells to fire longer, more powerful bursts, which activate the slow inhibitory current IB. The analysis in section 4 demonstrates that long, powerful bursting of the RE cells is needed for the TC cells to synchronize, unless the desynchronizing effect of inhibition is somehow removed as discussed in Terman et al. (1996). Our numerical simulations (e.g., Figures 9–12) show, in fact, that the TC cells will synchronize even if fast inhibition is removed from within the RE population but not from the RE-TC connections. Why removal of inhibition leads to stronger RE bursts can be easily understood by our analysis of trajectories in phase space. This removal forces the RE cells to lie on the right branch of a different cubic while in the active phase. The cubic of the disinhibited cells lies below the cubic of the cells with inhibition. The disinhibited cells therefore jump up to larger values of membrane potential; moreover, their jump-down point (right knee) lies below the jump-down point of the inhibited cells. The disinhibited RE cells therefore have a longer active phase.
Note that the slow inhibitory current IB, activated when RE-RE fast inhibition is removed, has slower rise and decay times than IA. The slow rise time enhances the domain of attraction of the synchronous solution by expanding the window of opportunity. The slow decay time can help to bring the cells closer together while in the silent phase, as discussed in section 3.3 and appendix B. Hence, both effects improve the ability of the TC cells to synchronize in the absence of fast inhibition. 5.3.2 Role of the Sag Current. We view the TC cells as examples of compound cells. If one removes the sag current Isag or keeps it constant, then they become basic cells. Hence, questions concerning the differences between networks with basic or compound cells are closely related to issues pertaining to the role of the sag current. The analysis in section 3 shows that in mutually coupled networks, whether the cells are basic or compound is significant. Added complexity allows the cells to escape from the silent phase and also leads to possible compression mechanisms. In globally inhibitory networks, the dynamical mechanisms responsible for the synchronous and clustered solutions are basically the same for basic or compound cells. We
therefore do not view the sag current as playing a direct role in the generation of these rhythms. Recall that oscillations arise when the cells are capable of escaping from the silent phase. The sag current helps modulate the intrinsic properties of the TC cells, and this can determine whether escape is even possible. From a more biophysical viewpoint, the sag current depolarizes the TC cells while they are silent. Hence, a stronger sag current raises the TC cells' resting potential. If this resting potential is too large, then inhibitory input from the RE cells may not be capable of hyperpolarizing the TC cells sufficiently to deinactivate the IT current. If this is the case, then the TC cells will not be able to fire (i.e., escape the silent phase). We note that this relates closely to some theories about mechanisms responsible for waxing and waning behavior during the spindle rhythm (Destexhe, Babloyantz, & Sejnowski, 1993; Bal & McCormick, 1996). Within the range where escape is possible, increasing the sag current tends to enhance TC synchronization. This occurs because the increased sag current tends to depolarize the TC cells more rapidly, leading to stronger compression. Note that while the sag current is depolarizing during the silent phase, it is hyperpolarizing in the active phase. Hence, increasing the sag conductance lowers the left branch of the cubic corresponding to the TC cells but raises the right branch of the cubic. This decreases the voltage level to which the TC cells jump up and hence the overall amplitude of TC oscillations (see Figure 12); at jump-down, the sag current is essentially deactivated, so the size of gsag has little effect there. Increasing gsag also tends to decrease the level of deinactivation of IT at firing, as we can show using arguments similar to the proof of the lemma in section 3, and to decrease the period of the TC cells (see Figure 12).
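The effect of gsag on the silent-phase duration can be checked with a small numeric sketch. Assume, purely for illustration, that during the silent phase h relaxes toward 1 according to h' = w(1 - h); the transit time from a fall-down value hf to a knee value hk then has a closed form, and lowering hk (which is what increasing gsag does to the curve of knees) shortens the silent phase. All parameter values are hypothetical:

```python
import math

# Silent-phase transit time under the linear relaxation model h' = w*(1 - h).
# Exact transit time from h_f up to h_k: t = (1/w) * ln((1 - h_f)/(1 - h_k)).
# An estimate of the form -(1/w) * ln(1 + h_f - h_k) is close when h_f is
# small (and exact when h_f = 0). All parameter values here are hypothetical.

def t_exact(w, hf, hk):
    return math.log((1 - hf) / (1 - hk)) / w

def t_approx(w, hf, hk):
    return -math.log(1 + hf - hk) / w

w, hf, hk = 2.0, 0.05, 0.30
print(t_exact(w, hf, hk), t_approx(w, hf, hk))

# Both expressions decrease as h_k moves down toward h_f, i.e., as the curve
# of knees is lowered; this is the sense in which dt_k/dg_sag < 0.
assert t_exact(w, hf, 0.20) < t_exact(w, hf, hk)
```

The monotone dependence of the transit time on hk is the ingredient behind the nearly linear relationship between oscillation frequency and gsag seen in Figure 12.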
To analyze this dependence of the period more carefully, note that the evolution of the inactivation variable h for IT is almost linear for most of the silent phase; we can model this by h0 D w (1 ¡ h), independent of the inhibition level. From this equation, we can approximate the time tk it takes for the cell to reach the curve of knees. This depends on gsag , since the location of the curve of knees does. We nd that tk ¼ ¡ w1 ln(1 Ch f ¡ hk ), where h f is the h-value when the cell falls down and hk is the h-value when it hits the curve of knees to jump up. Incorporating the dependence of h f and hk on gsag implies that @tk / @gsag < 0; in fact, since h f ¡ hk is small relative to 1 and h f , hk depend close to linearly on gsag , the dependence of tk on gsag is almost linear as well. An analogous argument shows that the time spent in the active phase also has a roughly linear relation to gsag . These nearly linear dependencies sum to yield a nearly linear relationship between overall oscillation frequency and gsag , as appears in Figure 12. 5.3.3 Synchronization Among the RE Cells. The RE cells are synchronized during the spindle rhythm, primarily due to the excitation they receive from
Figure 12: The effect of gsag on synchronous TC oscillations. Six TC cells and three RE cells were simulated with parameter values given in appendix A.
the TC cells. Although only about half of the TC cells fire during each cycle, if each TC cell sends excitation to several RE cells, then every RE cell will receive a sufficient amount of excitation to fire during every cycle. Experiments have also shown that the RE cells can sustain synchronized rhythms when these cells are completely isolated (Steriade, Domich, Oakson, & Deschênes, 1987; Steriade, Jones, & Llinás, 1990). In fact, these are the experiments that motivated many of the theoretical studies concerning how synchronization can arise in a population of inhibitory cells. If one considers the RE cells to be modeled as basic cells, then the conclusion of these studies is that synchronization is possible only if the decay of inhibition is sufficiently slow. However, modeling the RE cells as in equation 5.3, we consider them to be compound cells; the slow variables are hiR and mi. We then conclude from our analysis of mutually coupled inhibitory cells in section 3 that synchronization is possible even if the inhibition decays quickly.
5.3.4 Cortical Inputs. Our results clearly demonstrate that globally inhibitory networks provide a natural framework for analyzing RE-TC interactions in the generation of thalamic sleep rhythms. These results are also consistent with recent work about the impact of cortical inputs on the development and manifestation of the sleep rhythms (Steriade, Curró Dossi, & Nuñez, 1991; Steriade, Nuñez, & Amzica, 1993; Steriade, 1994; Contreras, Destexhe, Sejnowski, & Steriade, 1996; Timofeev & Steriade, 1996; Contreras, Destexhe, & Steriade, 1997; Destexhe, Contreras, & Steriade, 1998). Here we review some of this work and briefly describe how it relates to our analysis. Recent work has emphasized the importance of the cortex in the transformation of spindle oscillations into spike-and-wave-like (SW) epileptiform oscillations in the thalamus (Steriade, 1994; Steriade, Contreras, & Amzica, 1994; Contreras, Destexhe, & Steriade, 1996; Destexhe, Contreras, Sejnowski, & Steriade, 1996; Destexhe & Sejnowski, 1997; Steriade, Contreras, & Amzica, 1997; Destexhe, 1998). The experiments and modeling described in these works have suggested that this can arise without the removal of fast inhibition from the thalamus. Recall that in the mechanism we have discussed, disinhibition of the RE cells leads to more powerful RE bursts, and this permits the TC cells to synchronize. Our analysis supports the finding (Contreras, Destexhe, & Steriade, 1996; Destexhe et al., 1996; Steriade et al., 1997; Destexhe, 1998) that if one does not remove the fast inhibition among the RE cells but instead induces sufficiently strong excitation from the cortex, then this will have the same effect: the RE cells will fire more powerful bursts because of the additional excitation (their cubics are lowered). Hence, GABA_B inhibition of the TC cells ensues, and the TC cells can synchronize even though they receive fast inhibition from the RE cells.
In particular, this explains the mechanism behind the results of Destexhe (1998), in which sufficiently strong corticothalamic excitation (achieved by blocking only cortical GABA_A) is found to be crucial for triggering powerful RE bursts. These bursts in turn lead to the activation of GABA_B in the thalamus and the generation of synchronized ~3 Hz oscillations in TC and RE cells. It is interesting to note that SW discharges of pyramidal cells and interneurons in the cortex, as modeled by Destexhe, apparently result from a related interaction: excitation from TC cells induces the “spike” of prolonged firing of pyramidal cells and interneurons in the cortex, and the resulting GABA_B inhibition from the interneurons to the pyramidal cells enhances synchronization and maintains the subsequent “wave” of inactivity until the TC cells rebound and restart the cycle. Our analysis has shown that it is possible for the TC cells to synchronize even in the presence of fast inhibition if the RE cell bursts are powerful enough. This does not, however, contradict the argument in Terman et al. (1996) that the effect of fast inhibition to the TC cells must be removed during delta. Since the RE cells fire at the slow rhythm in delta, their bursts
are completely absent (after the first cycle) while the TC cells fire. Hence, the RE cells do not produce the powerful bursts that would be needed to synchronize the TC cells if the effect of the fast inhibition remained. 5.3.5 Variations in Synchronous Oscillations. The thalamic network may exhibit other types of solutions than those discussed. This is because assumptions required for the geometric constructions of the solutions may not be satisfied for some ranges of parameter values. Subtle variations in network behavior can arise as parameters that change the underlying geometry of phase space are varied. One possibility that arises, for certain biophysically relevant parameter values, is that the right knees of each RE cell's family of cubics lie very close to the left knees of the RE cell's cubics. In this case, each RE cell recovers almost instantly on jump-down (after a long, gradual decrease in vR during the active phase). If the TC cells approach sufficiently close to the stable equilibrium in the silent phase while the RE cells are still active, then they can all fire as soon as they are disinhibited by RE jump-down. Hence, stable synchrony can occur here without activation of slow inhibition, even if fast inhibition decays on the fast timescale, since slow decay of inhibition was needed only to allow the RE cells to recover; this holds even though RE-TC fast inhibitory synapses are direct. (For more details, see Rubin & Terman, 1998.) 6 Discussion
Numerous articles have considered models for thalamic oscillations and mechanisms for synchronization. The RE-TC model studied here is closely related to that in Golomb et al. (1994); however, those results are based primarily on numerical studies. Here we develop, for the first time, a systematic approach for analytically studying models as complex as the thalamic networks. Related articles that deal with the role of inhibition in synchronizing bursting-like oscillations are Wang and Rinzel (1993) and Terman et al. (1998). Their conclusion is that a slow decay of inhibition is needed to obtain synchrony. This does not account, however, for synchronization of the RE-TC network in the presence of GABA_A inhibition. By extending the analysis in Terman et al. (1998) to more complex networks, we are able to understand the mechanisms responsible for this. Several other articles, including van Vreeswijk et al. (1994) and Gerstner et al. (1996), have also shown that inhibition can lead to synchrony. These articles considered integrate-and-fire type models and therefore do not directly apply to the thalamic networks. One of their conclusions is that the synchronous solution must be unstable if the synapses have instantaneous onset. This is consistent with our result (also in Terman et al., 1998) that indirect synapses are needed for synchrony. Our analysis shows, however, that parameters not present in the integrate-and-fire models, such as the time on the excited branch or the geometry of the curves of knees, can strongly influence when synchrony arises.
636
J. Rubin and D. Terman
Our results clarify the multiple roles that inhibition can play in producing different rhythms. Inhibition may help to synchronize or desynchronize oscillations, depending on several factors. Fast onset of inhibition tends to desynchronize the cells, because when one cell fires, it then quickly "steps on" other cells. For this reason, we generally need to assume that the synapses are indirect for stable synchronization to occur. Slow offset of inhibition may help to synchronize the oscillations; it can lead to compression of the cells while they evolve in the silent phase. Compression (or expansion) can also take place as the cells jump up or down between the silent and active phases. This depends on the geometry of curves (or surfaces) of knees, which in turn depends on both the intrinsic and the synaptic parameters, such as g_syn and g_sag. A more powerful source of compression arises in the globally inhibitory networks. While the inhibitory cell J is active, it produces sustained inhibition to the E-cells. This forces the E-cells close to a stable fixed point and therefore close to each other. In these networks, fast offset of inhibition also helps to synchronize the cells; it helps to get cells through the narrow window of opportunity for firing. We have demonstrated that mutually coupled networks of excitable cells with indirect fast inhibitory coupling can produce synchronized rhythms. This is not possible if the network contains only basic cells. The geometric approach helps explain why additional cellular complexity allows for synchronized rhythms. The additional complexity translates, within the framework of the geometric approach, to higher-dimensional slow manifolds. This allows the cells to escape from the silent phase. It also leads to additional sources of compression. We conclude that synchronization is possible in mutually coupled inhibitory networks if there are at least two slow variables in the intrinsic or synaptic dynamics.
This has relevance for isolated RE cell populations, which can synchronize even though they include fast inhibition (Steriade et al., 1990). Globally inhibitory networks can produce different rhythms depending on the intrinsic dynamics of the J-cells, which controls the amount of inhibition sent back to the E-cells. A rapid rate of synchronization, and a large domain of attraction for the synchronous solution, are achieved when the following factors are present: indirect synapses to provide a window of opportunity, a long J-cell active phase to enhance compression among E-cells, and a fast J-cell recovery coupled with relatively fast synaptic decay. Less powerful inhibition, or a smaller window of opportunity relative to synaptic decay rate, results in clustering among the E-cells. The network crashes if the amount of inhibition is too small. In the models for sleep rhythms, there are several possible ways to control the RE cells’ bursts and therefore the emergent network behavior. More powerful RE bursts result from removal of fast inhibition from among the RE cells (Steriade, McCormick, & Sejnowski, 1993; Huguenard & Prince, 1994; Destexhe, Bal, McCormick, & Sejnowski, 1996; Destexhe, 1998) or from the addition of excitatory input from the cortex (Contreras, Destexhe, & Steriade, 1996; Destexhe, Contreras, Sejnowski,
& Steriade, 1996; Steriade et al., 1997; Destexhe, 1998). Other intrinsic RE parameters, such as a leak conductance, may also greatly influence the RE cells' dynamics. We have seen how such considerations help explain the transition between spindling, delta, and paroxysmal discharges in RE-TC networks. The geometric analysis can lead to precise statements for when a particular rhythm is possible. For example, a TC cell can fire only if it is sufficiently hyperpolarized; this deinactivates the I_T current. A geometric interpretation is that a TC cell can fire only if, during the silent phase, it lies in the region where trajectories are able to reach the jump-up curve of knees. By considering the slow equations corresponding to the TC cells (the compound analogue of equation 4.1), one can then derive conditions for when a TC cell will fire. In this sense, the geometric analysis helps clarify how different parameters influence the rhythms. For example, we have seen how the analysis demonstrates that the sag current plays different roles in generating synchronized rhythms in mutually coupled and globally inhibitory networks. We note that the geometric approach used here (see also Terman & Wang, 1995; Terman & Lee, 1997; Terman et al., 1998) is somewhat different from that used in many dynamical systems studies. All of the networks considered here consist of many differential equations, especially the larger networks. Traditionally, one would interpret the solution of this system as a single trajectory evolving in a very high-dimensional phase space. Instead, we consider several trajectories, one corresponding to each cell, moving around in a much lower-dimensional phase space. After reducing the full system to a system for just the slow variables, the dimension of the lower-dimensional phase space equals the number of slow intrinsic variables and slow synaptic variables corresponding to each cell.
In the worst case considered here, there are two slow variables for each compound cell and one slow synaptic variable; hence, we never have to consider phase spaces with dimension more than three. Of course, the particular phase space we need to consider may change, depending on whether the cells are active or silent and also depending on the synaptic input that a cell receives. We assumed throughout that the networks were completely homogeneous. The analysis certainly extends to network models with mild heterogeneities. In this case, cells within each cluster will no longer be perfectly synchronized. They may lie on different branches of different cubic surfaces. If the cubics are sufficiently close to each other, however, the jumps will be very close to each other and almost-synchrony will result.

Appendix A
The equations for the TC and RE cells in the thalamic network are given in section 5.1. As our biophysical mutually coupled network of compound cells, we considered a pair of TC cells, each governed by equation 5.1; note that in the case of basic cells without slow inhibition, these could be thought
of as simplified RE cells or TC cells. The synaptic variables in this model satisfy equation 2.6. For numerical simulations, we approximated Heaviside functions by functions of the form

H_∞(v) = 1 / (1 + exp(−k(v − θ))).
The functions h_∞(v), m_∞(v), r_∞(v), h_∞^R(v), and m_∞^R(v) are assumed to be of the same form: if j = h, m, r, h^R, or m^R, then j_∞(v) = 1 / (1 + exp((v + θ_j)/σ_j)). Further, we take

τ_h(v) = τ_h0 + τ_h1 / (1 + exp((v + v_τh)/σ_τh)),

with τ_h^R(v) having an analogous form, and

τ_r(v) = τ_r0 + τ_r1 / (exp((v + v_τr0)/σ_τr0) + exp(−(v + v_τr1)/σ_τr1)).
To generate our numerical figures, we started the cells under a slight perturbation from a synchronous state. For Figures 3B and 4C, we used the following parameter values, based on Golomb et al. (1994) and Terman et al. (1996). I_T: g_Ca = 2.5, θ_m = 57.0, σ_m = −6.0, v_Ca = 140.0, θ_h = 81.0, σ_h = 4.0, τ_h0 = 10.0, τ_h1 = 73.3̄, v_τh = 78.0, σ_τh = 3.0; I_sag: g_sag = 0.2, v_sag = −50.0, θ_r = 75.0, σ_r = 5.5, τ_r0 = 20.0, τ_r1 = 1000.0, v_τr0 = 71.5, σ_τr0 = 14.2, v_τr1 = 89.0, σ_τr1 = 11.6; I_L: g_L = 0.025, v_L = −75.0; I_A: g_A = 0.4, v_A = −79.0, α = 16.0, β = 4.0, θ_x = 0.1, k = 100.0, ε_αx = 0.3, ε_βx = 0.1, θ_syn = −50.0. We did not include slow inhibition here.

To generate the clustered solution displayed in Figure 8, we used the following parameter values, also based on Golomb et al. (1994) and Terman et al. (1996). Six TC cells: I_T: g_Ca = 1.5, θ_m = 59.0, σ_m = −9.0, v_Ca = 90.0, θ_h = 82.0, σ_h = 5.0, τ_h0 = 66.6̄, τ_h1 = 333.3̄, v_τh = 78.0, σ_τh = 1.5; I_sag: g_sag = 0.15, v_sag = −40.0, θ_r = 75.0, σ_r = 5.5, τ_r0 = 20.0, τ_r1 = 1000.0, v_τr0 = 71.5, σ_τr0 = 14.2, v_τr1 = 89.0, σ_τr1 = 11.6; I_L: g_L = 0.2, v_L = −76.0; I_A: g_A = 0.1, v_A = −84.0, α_R = 8.0, β_R = 0.05, θ_R = −50.0, k_A = 2.0; I_B: g_B = 0.05, v_B = 95.0, λ = 10⁻⁴, k_1 = 0.1, k_2 = 0.05, θ_xb = 0.8, k_xb = 0.02, k_3 = 0.5, k_4 = 0.005, θ_Rb = −25.0, k_b = 2.0. Three RE cells: I_T^R: g_Ca^R = 2.0, θ_m^R = 52.0, σ_m^R = −9.0, v_Ca^R = 90.0, θ_h^R = 72.0, σ_h^R = 2.0, τ_h0^R = 66.6̄, τ_h1^R = 333.3̄, v_τh^R = 78.0, σ_τh^R = 1.0; I_AHP: g_AHP = 0.1, v_K = −90.0, m_1 = 0.02, m_2 = 0.025, ν = 0.01, c = 0.08; I_L^R: g_L^R = 0.3, v_L^R = −76.0; I_A^R: g_A^R = 0.25, v_A^R = −84.0, other parameters as in I_A for the TC cells; I_E: g_E = 0.6, v_E = 0, α_E = 2.0, β_E = 0.05, θ_E = −35.0, k_E = 2.0. [Note that in the current I_E, the variables s_E^j satisfy an equation of the form 5.2, with the parameters α_E, … replacing α_R, … .]
To generate the synchronous solution displayed in Figures 9 and 10, we used the same parameter values except α = 16, β = 0.2, σ_τh = 3.0, σ_τh^R = 2.0, g_A^R = 0. By setting g_A^R = 0, we removed the RE-RE fast inhibition; for α_R = 16, β_R = 0.2 with the same initial conditions and g_A^R = 0.25, the solution forms two clusters, but slow inhibition is slightly activated and the solution eventually destabilizes, while for g_A^R = 0.5, the solution forms two clusters without activation of slow inhibition. Finally, for Figures 11 and 12, we used the above parameter values except that certain parameters were varied as discussed in the text and figure captions, and τ_h0 = 100.0, τ_h1 = 500.0, σ_τh = 3.0, τ_h0^R = 100.0, τ_h1^R = 500.0, σ_τh^R = 2.0.

Remark 4. We assumed in the proof of the lemma in section 3 that for compound cells, f_y > 0 in the silent phase, while f_y < 0 in the active phase. This is justified for the TC cell model for the following reason. Note that y corresponds to the variable r; hence, f_y = −g_sag(v − v_sag). Since v_sag ≈ −40 mV typically, while v ranges from around −80 mV in the silent phase to at least −30 mV in the active phase, the result follows.

Remark 5. We claimed in remark 1 that |∂w^R/∂y| is quite small. This is also shown numerically in Figure 3B. We can understand analytically why this is so by recalling, from the proof of the lemma, that ∂w^a/∂y = −f_y/f_w for a = L or R, corresponding to the silent or active phase, respectively. From equation 5.1,

∂w^a/∂y = −g_sag(v − v_sag) / (g_Ca m_∞²(v)(v − v_Ca)).  (A.1)
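The size of this ratio can be checked by evaluating equation A.1 at representative voltages. A quick Python sketch, using the typical values quoted in this appendix (with m_∞ built from the illustrative θ_m = 57, σ_m = −6 given above):

```python
import math

def m_inf(v, theta_m=57.0, sigma_m=-6.0):
    # Steady-state activation of I_T, in the sigmoidal form used in Appendix A.
    return 1.0 / (1.0 + math.exp((v + theta_m) / sigma_m))

def dw_dy(v, g_sag=0.2, v_sag=-40.0, g_Ca=2.5, v_Ca=140.0):
    # Equation A.1: dw^a/dy = -g_sag (v - v_sag) / (g_Ca m_inf(v)^2 (v - v_Ca)).
    return -g_sag * (v - v_sag) / (g_Ca * m_inf(v) ** 2 * (v - v_Ca))

silent = dw_dy(-80.0)  # m_inf small: numerator and denominator are comparable
active = dw_dy(0.0)    # m_inf ~ 1 and |v - v_Ca| large: the ratio is small
```

The active-phase value comes out well below 0.05 in magnitude, while the silent-phase value is of order 10, matching the claim that |∂w^R/∂y| is quite small.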
Typical values are g_sag ≈ 0.04 up to 2.0, v_sag ≈ −40 mV, g_Ca ≈ 2.5, v_Ca ≈ 140 mV. In the silent phase, m_∞(v) is small, so the numerator and denominator in equation A.1 have similar magnitudes, even though v ≈ −80 mV. In the active phase, however, m_∞(v) ≈ 1 while v ≈ 0, typically. Hence, |∂w^R/∂y| is quite small.

Appendix B
Consider a pair of compound cells in the active phase, with mutual inhibitory coupling. Suppose that the cells are perturbed from synchrony such that the lead cell falls down to the silent phase at time t₁ and the following cell falls down at time t₂ > t₁. Assume slow decay of inhibition (β = εK), although the analysis simplifies otherwise (see below). During the silent phase, for t ≥ t₂, the slow dynamics of the two cells are given by

ẇ = g(v, w)
ẏ = h(v, y)
ṡ = −Ks,  (B.1)
Figure 13: The setup for analysis of compression in the silent phase in two dimensions (i.e., assuming s is fixed). The picture generalizes naturally to 3D when s is included as another slowly evolving variable. Cell 1 jumps up first, with t = T₁, at (y₁*, w₁*). The larger dotted square shows a blow-up of the smaller dotted square, in which the vectors V and (γ, ρ) are defined.

where v satisfies v = W_L(w, y, s) and ˙ = d/dt. Let V(t) = (g(w, v), h(y, v), −Ks). The equation of variations that describes the evolution of tangent vectors to the flow of equation B.1 is

δẇ = −a_w δw − b_w δy − c_w δs
δẏ = −a_y δw − b_y δy − c_y δs
δṡ = −K δs,  (B.2)

where a_w = −∂g/∂w, b_w = −∂g/∂y = −∂g/∂v · ∂W_L/∂y, and so on. Fix the cell that jumps up out of the silent phase first as cell 1 and call the other cell 2. Let T_i denote the jump-up time of cell i and restrict to t ∈ [t₂, T₁]. Let (γ(t), ρ(t), ς(t)) denote the vector from the position of cell 1 to that of cell 2, such that (γ, ρ, ς) satisfies equation B.2; see Figure 13 for a two-dimensional representation of this setup. Next, let w_L(y, s) denote the projection of the left surface of knees to (w, y, s)-space and let (a, b, c) = ∇[w_L(y₁(T₁), s(T₁)) − w(T₁)] = (−1, ∂w_L/∂y (y₁(T₁), s(T₁)), ∂w_L/∂s (y₁(T₁), s(T₁))). Let w_t^L denote the physical translate
of w_L(y, s) to the position of cell 1 at time t; for example, at time t₂, cell 1 lies in w_{t₂}^L. Again, see Figure 13. We define the time T(t) as the time for cell 2 to flow to its first intersection with w_t^L. This satisfies

(a, b, c) · [(γ, ρ, ς) + ∫₀^{T(t)} V(t + ξ) dξ] = 0,  (B.3)

where V is evaluated along the path of cell 2 (see Figure 13), but

∫₀^{T(t)} V(t + ξ) dξ = ∫₀^{T(t)} (V(t) + ξ V̇(t) + O(ξ²)) dξ = V(t) T(t) + O(T²).  (B.4)

For small perturbations from synchrony, if the vector field of equation B.1 is O(1), then the O(T²) term in equation B.4 can be neglected. Substituting the approximation B.4 into B.3 yields, at leading order,

T(t) = −(a, b, c) · (γ, ρ, ς) / ((a, b, c) · V) = (γ − bρ − cς) / (−g(w₂, v₂) + b h(y₂, v₂) − cKs₂),  (B.5)
since a = −1. Henceforth, we omit the arguments indicating evaluation along the path of cell 2 when this is clear. A sufficient condition for compression in the silent phase is that Ṫ(t) < 0 for t ∈ [t₂, T₁]. Thus, we proceed to compute Ṫ(t). Since (γ, ρ, ς) and V(t) satisfy equation B.2, we differentiate, substitute from B.2, and simplify to obtain

Ṫ(t) = [V(t) × (γ, ρ, ς)] · [((a, b, c) · DV) × (a, b, c)]† / ((a, b, c) · V(t)†)²,  (B.6)
where † denotes transpose. Geometrically, the determinant of three vectors gives the volume of the parallelepiped they bound. The numerator of Ṫ(t) consists of such a determinant, with the three edges of the parallelepiped given by the vector field, the vector from the position of cell 1 to that of cell 2, and a third vector. This last vector relates the linearization of the vector field to the gradient of the translate of the surface of knees. Next, consider

Z ≡ (Z₁, Z₂, Z₃) := V × (γ, ρ, ς) = (ςh + ρKs, −ςg − γKs, ρg − γh)
  = ((h, −Ks) ∧ (ρ, ς), (γ, ς) ∧ (g, −Ks), (g, h) ∧ (γ, ρ)).  (B.7)
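As a quick numerical sanity check of equation B.5: by construction, the leading-order time T solves the linearized form of B.3, (a, b, c) · [(γ, ρ, ς) + V T] = 0. A Python sketch with arbitrary illustrative numbers (these are not model values):

```python
# Leading-order crossing time, equation B.5: T = -((a,b,c) . u) / ((a,b,c) . V),
# with u = (gamma, rho, varsigma) the cell-1-to-cell-2 vector and V the slow vector field.
def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

abc = (-1.0, 0.3, -0.2)    # (a, b, c), with a = -1 as in the text
u = (-0.05, 0.02, -0.01)   # small perturbation (gamma, rho, varsigma); illustrative
V = (0.4, -0.1, -0.3)      # (g, h, -K s) at the base point; illustrative

T = -dot(abc, u) / dot(abc, V)
# To leading order, cell 2 reaches the translated surface of knees at time T:
residual = dot(abc, tuple(ui + vi * T for ui, vi in zip(u, V)))
```

The residual vanishes up to rounding, confirming that B.5 is exactly the first-order solution of B.3.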
Differentiating equation B.7 and using B.2 gives, in system form,

Ż₁ = −(b_y + K) Z₁ + a_y Z₂
Ż₂ = b_w Z₁ − (a_w + K) Z₂
Ż₃ = c_w Z₁ + c_y Z₂ − (a_w + b_y) Z₃.  (B.8)
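System B.8 can be verified numerically: if V and (γ, ρ, ς) both evolve under the variational equation B.2 (with coefficients frozen for simplicity), then Z = V × (γ, ρ, ς) must obey B.8. A forward-Euler sketch with illustrative constant coefficients (not model values):

```python
# Check equation B.8: integrate V and u = (gamma, rho, varsigma) under B.2,
# integrate Z under B.8, and compare Z with cross(V, u) at the end.
aw, bw, cw = 0.5, -0.3, 0.2    # illustrative a_w, b_w, c_w
ay, by, cy = 0.1, 0.4, -0.2    # illustrative a_y, b_y, c_y
K = 0.25

def b2_rhs(p):
    # Equation B.2 for a tangent vector p = (dw, dy, ds).
    dw, dy, ds = p
    return (-aw * dw - bw * dy - cw * ds,
            -ay * dw - by * dy - cy * ds,
            -K * ds)

def cross(p, q):
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])

V = [0.4, -0.1, -0.3]
u = [0.05, -0.02, 0.01]
Z = list(cross(V, u))
dt = 1e-3
for _ in range(2000):
    dV, du = b2_rhs(V), b2_rhs(u)
    dZ = (-(by + K) * Z[0] + ay * Z[1],              # B.8, first equation
          bw * Z[0] - (aw + K) * Z[1],               # B.8, second equation
          cw * Z[0] + cy * Z[1] - (aw + by) * Z[2])  # B.8, third equation
    V = [x + dt * d for x, d in zip(V, dV)]
    u = [x + dt * d for x, d in zip(u, du)]
    Z = [x + dt * d for x, d in zip(Z, dZ)]

err = max(abs(zi - ci) for zi, ci in zip(Z, cross(V, u)))
```

The mismatch err stays at the level of the Euler discretization error: Z integrated via B.8 tracks V × (γ, ρ, ς) integrated via B.2.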
For basic cells, without the current y, equation B.6 simplifies to

Ṫ(t) = [−Z₂ / (g + cKs)²] (c(a_w − K) + c_w) = [(ςg + γKs) / (g + cKs)²] (c(a_w − K) + c_w).

Moreover, Z₁ ≡ 0, so Ż₂ = −(a_w + K) Z₂. Hence, the sign of Z₂(t) is invariant for t ∈ [t₂, T₁]. In Terman et al. (1998), Z₂(t₂) = (w₁(t₂) − w₂(t₂)) Ks₂(t₂) < 0, which implies that −Z₂(t₂) > 0. Thus, the sign of Ṫ(t) matches that of c(a_w − K) + c_w (Terman et al. use λ₁ to denote c). By showing that this quantity is negative for all relevant t when K < a₋ (case I in Terman et al., 1998), Terman et al. obtain a sufficient condition for compression in the silent phase. In general, one can compute the signs of a, b, c as well as a_w, b_w, … (see, e.g., the lemma in section 3) and then use equations B.6 and B.8 to derive compression conditions for the silent phase. Simplifications facilitate this process in certain cases even for compound cells. For example, s = ṡ = 0 during the silent phase for E-cells in a globally inhibitory network, as well as for much of the silent phase for mutually coupled cells with fast synaptic decay. In the latter case, however, an adjustment must be made to compensate for the fact that one cell loses inhibition before the other; we omit the details and explicit computations here.

Acknowledgments
We thank N. Kopell for her very useful comments. Research for this article was supported in part by NSF grants DMS-9423796 and DMS-9804447. Some of the research was done while the authors were visiting the IMA at the University of Minnesota.

References

Bal, T., & McCormick, D. A. (1996). What stops synchronized thalamocortical oscillations? Neuron, 17, 297–308.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns of realistic network models. J. Comp. Neurosci., 3, 91–110.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Contreras, D., Destexhe, A., Sejnowski, T. J., & Steriade, M. (1996). Control of spatiotemporal coherence of a thalamic oscillation by corticothalamic feedback. Science, 274, 771–774.
Contreras, D., Destexhe, A., & Steriade, M. (1996). Cortical and thalamic participation in the generation of seizures after blockage of inhibition. Soc. Neurosci. Abst., 22, 2099.
Contreras, D., Destexhe, A., & Steriade, M. (1997). Spindle oscillations during cortical spreading depression in naturally sleeping cats. Neuroscience, 77, 933–936.
Destexhe, A. (1998). Spike-and-wave epileptic oscillations based on the properties of GABA_B receptors. J. Neurosci., 18, 9099–9111.
Destexhe, A., Babloyantz, A., & Sejnowski, T. J. (1993). Ionic mechanisms for intrinsic slow oscillations in thalamic relay and reticularis neurons. Biophys. J., 65, 1538–1552.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J. Neurophysiol., 76, 2049–2070.
Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1996). Cortical projections to the thalamic reticular nucleus may control the spatiotemporal coherence of spindle and epileptic oscillations. Soc. Neurosci. Abst., 22, 2099.
Destexhe, A., Contreras, D., & Steriade, M. (1998). Mechanisms underlying the synchronizing action of corticothalamic feedback through inhibition of thalamic relay cells. J. Neurophysiol., 79, 999–1016.
Destexhe, A., McCormick, D. A., & Sejnowski, T. J. (1993). A model for 8–10 Hz spindling in interconnected thalamic relay and reticularis neurons. Biophys. J., 65, 2474–2478.
Destexhe, A., & Sejnowski, T. J. (1997). Synchronized oscillations in thalamic networks: Insights from modeling studies. In M. Steriade, E. G. Jones, & D. A. McCormick (Eds.), Thalamus. Amsterdam: Elsevier.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Golomb, D., Wang, X.-J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol., 72, 1109–1126.
Golomb, D., Wang, X.-J., & Rinzel, J. (1996). Propagation of spindle waves in a thalamic slice model. J. Neurophysiol., 75, 750–769.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (London), 117, 500–544.
Huguenard, J., & Prince, D. (1994). Clonazepam suppresses GABA_B-mediated inhibition in thalamic relay neurons. J. Neurophysiol., 71, 2576–2581.
Kim, U., Bal, T., & McCormick, D. A. (1995). Spindle waves are propagating synchronized oscillations in the ferret LGNd in vitro. J. Neurophysiol., 74, 1301–1323.
Kopell, N., & LeMasson, G. (1994). Rhythmogenesis, amplitude modulation and multiplexing in a cortical architecture. Proc. Nat. Acad. Sci. (USA), 91, 10586–10590.
LoFaro, T., & Kopell, N. (In press). Time regulation in a network reduced from voltage-gated equations to a one-dimensional map. J. Math. Biology.
Rinzel, J. (1987). A formal classification of bursting mechanisms in excitable systems. In A. M. Gleason (Ed.), Proceedings of the International Congress of Mathematics (pp. 1578–1593). Providence, RI: AMS.
Rinzel, J., Terman, D., Wang, X.-J., & Ermentrout, B. (1998). Propagating activity patterns in large-scale inhibitory neuronal networks. Science, 279, 1351–1355.
Rowat, P., & Selverston, A. (1997). Oscillatory mechanisms in pairs of neurons connected with fast inhibitory synapses. J. Comp. Neurosci., 4, 103–127.
Rubin, J., & Terman, D. (1998). Geometric analysis of population rhythms in synaptically coupled neuronal networks. Unpublished manuscript, IMA Preprint Series 1596, University of Minnesota.
Rubin, J., & Terman, D. (1999). Analysis of clustered firing patterns in synaptically coupled networks of oscillators. Manuscript submitted for publication.
Skinner, F., Kopell, N., & Marder, E. (1994). Mechanisms for oscillation and frequency control in networks of mutually inhibitory relaxation oscillators. J. Comp. Neurosci., 1, 69–87.
Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cyber., 68, 393–407.
Steriade, M. (1994). Coherent activities in corticothalamic networks during resting sleep and their development into paroxysmal events. In G. Buzsáki, R. Llinás, W. Singer, A. Berthoz, & Y. Christen (Eds.), Temporal coding in the brain. New York: Springer-Verlag.
Steriade, M., Contreras, D., & Amzica, F. (1994). Synchronized sleep oscillations and their paroxysmal developments. TINS, 17, 199–208.
Steriade, M., Contreras, D., & Amzica, F. (1997). The thalamocortical dialogue during wake, sleep, and paroxysmal oscillations. In M. Steriade, E. G. Jones, & D. A. McCormick (Eds.), Thalamus. Amsterdam: Elsevier.
Steriade, M., Currò Dossi, R., & Núñez, A. (1991). Network modulation of a slow intrinsic oscillation of cat thalamocortical neurons implicated in sleep delta waves: Cortically induced synchronization and brainstem cholinergic suppression. J. Neurosci., 11, 3200–3217.
Steriade, M., Domich, L., Oakson, G., & Deschênes, M. (1987). Abolition of spindle oscillations in thalamic neurons disconnected from nucleus reticularis thalami. J. Neurophysiol., 57, 260–273.
Steriade, M., Jones, E. G., & Llinás, R. R. (1990). Thalamic oscillations and signaling. New York: Wiley.
Steriade, M., McCormick, D. A., & Sejnowski, T. J. (1993). Thalamocortical oscillations in the sleeping and aroused brain. Science, 262, 679–685.
Steriade, M., Núñez, A., & Amzica, F. (1993). Intracellular analysis of relations between the slow (< 1 Hz) neocortical oscillation and other sleep rhythms of the electroencephalogram. J. Neurosci., 13, 3266–3283.
Terman, D., Bose, A., & Kopell, N. (1996). Functional reorganization in thalamocortical networks: Transition between spindling and delta sleep rhythms. Proc. Natl. Acad. Sci. (USA), 93, 15417–15422.
Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled inhibitory neurons. Physica D, 117, 241–275.
Terman, D., & Lee, E. (1997). Partial synchronization in a network of neural oscillators. SIAM J. Appl. Math., 57, 252–293.
Terman, D., & Wang, D. L. (1995). Global competition and local cooperation in a network of neural oscillators. Physica D, 81, 148–176.
Timofeev, I., & Steriade, M. (1996). Low-frequency rhythms in the thalamus of intact-cortex and decorticated cats. J. Neurophysiol., 76, 4152–4168.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. New York: Cambridge University Press.
van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition, not excitation, synchronizes neural firing. J. Comp. Neurosci., 313–321.
Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. (USA), 92, 5577–5581.
Wang, X.-J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp., 4, 84–97.
Wang, X.-J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalamic nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.
Wang, X.-J., & Rinzel, J. (1995). Oscillatory and bursting properties of neurons. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 686–691). Cambridge, MA: MIT Press.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
White, J., Chow, C. C., Ritt, J., Soto, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comp. Neurosci., 5, 5–16.

Received July 20, 1998; accepted February 23, 1999.
LETTER
Communicated by George Gerstein
An Accurate Measure of the Instantaneous Discharge Probability, with Application to Unitary Joint-Event Analysis

Quentin Pauluis
Laboratory of Neurophysiology, School of Medicine, Université Catholique de Louvain, Brussels, Belgium

Stuart N. Baker
Department of Anatomy, University of Cambridge, Cambridge, CB2 3DY, U.K.

We present an estimate for the instantaneous discharge probability of a neurone, based on single-trial spike-train analysis. By detecting points where the neurone abruptly changes its firing rate and treating them specially, the method is able to achieve smooth estimates yet avoid the blurring of significant changes. This estimate of instantaneous discharge probability is then applied to the method of unitary event analysis. We show that unitary event analysis as originally conceived is highly sensitive to firing-rate nonstationarities and covariations, but that it can be considerably improved if calculations of statistical significance use an instantaneous discharge probability instead of a firing-rate estimate based on averaging across multiple trials.
1 Introduction
A large number of new techniques have been developed for the analysis of single-unit neuronal discharge. The increasing mathematical complexity of these techniques permits maximal information to be extracted about how these signals are being used by the brain. However, most new techniques are developed and tested on the assumption that neurone spikes resemble stationary Poisson processes. This often yields mathematical tractability, but at the expense of grossly oversimplifying the statistics of the target spike trains. Nonstationarity in cell discharge can occur in a number of ways. First, the firing rate often changes following delivery of a stimulus or performance of a behavioral task by the experimental animal. These changes can be profound (from no firing to > 100 Hz) and rapid. In order to account for such rate modulation, many analysis techniques estimate the firing rate on a single trial from the peristimulus time histogram (PSTH; Gerstein & Kiang, 1960; Lemon, 1984), derived by summing together the discharge on a number of repeats of the same trial. The underlying model for the spike train then

Neural Computation 12, 647–669 (2000)
© 2000 Massachusetts Institute of Technology
648
Quentin Pauluis and Stuart N. Baker
becomes a Poisson process with instantaneous firing rate on any one trial modulating as the all-trial PSTH (Aertsen, Gerstein, Habib, & Palm, 1989; Grün, 1996). More complex nonstationarities occur when the firing-rate modulation is not the same from one trial to the next. First, there may be a jitter in the time of occurrence of a rise in rate relative to the alignment event; this is common in work on awake, behaving animals, where a trial lasting several seconds must be aligned to one point in the performance (Seidemann, Meilijson, Abeles, Bergmann, & Vaadia, 1996; Munoz & Wurtz, 1993). Second, the overall shape could differ from one trial to the next, depending on uncontrolled external or random events such as attentional shifts. These two forms of nonstationarity may show no covariation between simultaneously recorded cells. However, since such variation is often not simply noise in the system, but instead reflects genuine variation in the task performance of the animal from trial to trial, it is more likely that the times, the magnitudes, or the overall shape of rate changes will be correlated. The difference between these two situations is of great importance for methods that seek to estimate the expected number of synchronous discharges between two cells; unfortunately, the PSTH can reflect only the mean firing-rate profile and gives no clue to its variability or correlation structure from trial to trial. Brody (1997, 1999a, 1999b) proposed a method of overcoming these problems by fitting the cell discharge on a single trial to a function that had the same shape as the mean PSTH but was shifted by an adjustable latency and scaled by a multiplicative gain. This estimate of the single-trial firing rates permitted a better prediction of the cross-correlation histogram between two units than one derived from the PSTH alone.
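The shift-and-scale idea just described can be sketched as a grid search over latency with a closed-form least-squares gain at each candidate shift. This is an illustrative reimplementation of the idea on synthetic data, not Brody's published algorithm; the function and variable names are ours:

```python
def fit_gain_latency(trial, template, max_shift):
    # Fit trial[i] ~ gain * template[i - lag] by least squares:
    # for each candidate lag, the optimal gain is <trial, shifted> / <shifted, shifted>.
    best = None
    n = len(trial)
    for lag in range(-max_shift, max_shift + 1):
        shifted = [template[i - lag] if 0 <= i - lag < n else 0.0 for i in range(n)]
        denom = sum(s * s for s in shifted)
        if denom == 0.0:
            continue
        gain = sum(t * s for t, s in zip(trial, shifted)) / denom
        resid = sum((t - gain * s) ** 2 for t, s in zip(trial, shifted))
        if best is None or resid < best[0]:
            best = (resid, lag, gain)
    return best[1], best[2]

# Synthetic check: a "trial" that is the "PSTH template" delayed by 3 bins, scaled by 1.5.
template = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0, 0, 0]
trial = [0.0] * len(template)
for i, v in enumerate(template):
    if i + 3 < len(trial):
        trial[i + 3] = 1.5 * v

lag, gain = fit_gain_latency(trial, template, max_shift=5)
```

On this constructed example the fit recovers a lag of 3 bins and a gain of 1.5, as built in.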
However, the all-trial PSTH reflects only the mean change in rate, taking no account of the variability of this profile among trials, so this technique is probably limited in its applicability to experimental data. In this article, we present a method for estimating the instantaneous discharge probability density of a neurone in a single trial. We then apply it to the method of unitary event analysis pioneered by Grün (1996) and recently applied to data recorded in the motor cortex of awake monkeys (Riehle, Grün, Diesmann, & Aertsen, 1997). We show that this method, as originally conceived, is highly sensitive to the nonstationarities described above, but that it can be considerably improved using single-trial firing-rate estimates.

2 Measurement of Instantaneous Discharge Probability
We first draw a distinction between the instantaneous discharge probability density and the instantaneous firing rate. We view the latter as an output, which may be read by a neurone farther downstream. By contrast, we view the discharge probability density as an input function that could be fed into a neuronal model to generate a spike train similar to the one observed.

Instantaneous Discharge Probability

649

Generation of such a measure requires allowance to be made for the delayed response of the neuronal integrator to changes in its inputs. The discharge probability density should be smooth, and not replicate the variability typically seen in interspike intervals (Softky & Koch, 1992; Shadlen & Newsome, 1994). However, in conflict with this, the function should also be able to identify, and vary abruptly with, sudden firing changes. Finally, since the only data available are the spike train, the method has to start with a measure of the instantaneous firing rate. Two distinct methods exist in the literature for estimating the instantaneous firing rate of a spike train. In one, pulses produced at the spike times are low-pass filtered (French & Holden, 1971), either using analog filters (Houk, Dessem, Miller, & Sybirska, 1987) or by convolution with a smoothing function (often a gaussian; MacPherson & Aldridge, 1979; Richmond, Optican, Podell, & Spitzer, 1987). The closer together the spikes, the more summation of the smoothed waveforms occurs, and hence the larger the output signal. However, this method requires that the cutoff of the low-pass filter be adjusted to suit the firing rate under study. If it is too low, a highly smoothed signal is obtained, which cannot follow rapid rate changes; if it is too high, periods of zero estimated firing rate are interposed between spikes with long intervals. Adaptive kernel methods (Silverman, 1986; Richmond, Optican, & Spitzer, 1990) attempt to adjust the amount of smoothing depending on the local spike density, so that these two extremes are avoided. More sophisticated techniques, such as orthogonal series estimates and maximum penalized likelihood estimates (Silverman, 1986), can suffer from the need to tune smoothing parameters to find the optimal representation of the data. An alternative is to use the reciprocal of the interspike interval (ISI) as an estimate of the firing rate during the time spanned by that interval.
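The convolution approach above can be sketched directly: every spike contributes a unit-area gaussian, so the estimate integrates to the spike count. The spike times and the 10 ms kernel width below are invented for illustration:

```python
import math

def gaussian_rate(spike_times, t_grid, sigma):
    # Rate estimate: sum of unit-area gaussians centred on the spikes.
    norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return [sum(norm * math.exp(-0.5 * ((t - s) / sigma) ** 2) for s in spike_times)
            for t in t_grid]

spikes = [0.100, 0.105, 0.110, 0.115, 0.400]  # a brief burst, then a lone spike (s)
grid = [i * 0.001 for i in range(500)]        # 0 to 0.5 s in 1 ms steps
rate = gaussian_rate(spikes, grid, sigma=0.010)
```

With sigma = 10 ms the burst is resolved as a single peak near 0.11 s; a much larger sigma would blur rapid rate changes, while a much smaller one would leave periods of near-zero estimated rate between spikes, the trade-off described in the text.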
Quentin Pauluis and Stuart N. Baker

Such a method is often implicitly employed in “frequencygram” plots, where a point is plotted at the time of each spike with a height equal to the reciprocal of the preceding interval (Matthews, 1963; Bessou, Laporte, & Pagès, 1968; Awiszus, 1988). It is similar to the use of “fractional intervals” (Schwartz, 1992). This technique has the advantage that it never has periods of zero estimated rate; however, in its raw form, the estimate shows great variability. Any subsequent smoothing will necessarily be a trade-off between the need to reduce this variability and the need not to remove rapid fluctuations. A means of avoiding this trade-off would be to detect rapid changes in firing rate and treat them as a special case. This is essentially the same problem as detection of bursts in a spike train, and a number of techniques have been suggested. The first was the CUMSUM technique (Hurst, 1950), which has been applied both to averaged PSTHs (Ellaway, 1978) and to single-trial analysis (Butler, Horne, & Hawkins, 1992; Butler, Horne, & Rawson, 1992). The CUMSUM involves subtracting a control reference level from a series of data points and adding the differences cumulatively. However, when the goal is the formation of a continuous estimate of discharge probability, it is not clear where the control level should be placed. CUMSUM methods are also very sensitive to small variations during the control period and require excessive parameter tuning (Churchward et al., 1997). An alternative has been proposed by Cocatre-Zilgien and Delcomyn (1992), who first identified ISI subpopulations using an ISI histogram and then tested identified bursts against the mean firing frequency with a chi-square test. This method is well suited to a spike train where firing is at a constant level for most of the time, with occasional high-frequency bursts; it is much less suitable for neurones whose firing rate changes continually over a wide range. Other methods in the literature suffer from similar problems: ISIs are always compared with the mean ISI for the whole recording, making the assumption that intervals are Poisson distributed (Legéndy & Salcman, 1985; Eggermont, Smith, & Bowman, 1993). Our proposed technique is a variation of these methods. The first difference is that we work with ISIs locally. In this way, a change in firing rate is identified relative to the firing of the cell close to the time of interest, rather than by comparison to the global ISI statistics. Three consecutive intervals are used, since a single interval shorter than the preceding one could simply represent the tail of the ISI distribution, even if the firing rate were constant. Two consecutive ISIs significantly shorter than the preceding ISI are much less likely without a change in firing rate. The same limit to the smallest detectable burst has been used by Cocatre-Zilgien and Delcomyn (1992). Second, we assume that the ISIs follow a gamma density. This allows for an absolute refractory period and for the relative refractory period produced by afterhyperpolarization, and is a more realistic description of interval statistics than the more commonly used Poisson process (for discussion, see Berry & Meister, 1998; de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997).
Moreover, by changing the order parameter of the gamma density, it can be adjusted to fit interval statistics from a wide range of experimental data.
The first step is to generate an instantaneous firing-rate function F(t), sampled at an appropriately fine temporal resolution, whose value at any point is simply the reciprocal of the ISI that spans it (see Figure 1B). The series of ISIs is then scanned to determine times at which the firing rate abruptly changes. To do this, we make the assumption that, at constant firing rate, ISIs are distributed according to a gamma density. The probability that an interval i will be larger than some value I, given that the mean interval is μ, is then:

P(i > I \mid \mu) = \int_I^{\infty} \left(\frac{a}{\mu}\right)^{a} \frac{t^{a-1} e^{-ta/\mu}}{(a-1)!} \, dt,   (2.1)
where a is the order of the gamma process used to model the intervals. In the remainder of this article, we use a = 4, although this can be adjusted to fit the interval statistics for a given cell (a = 1 would correspond to a Poisson process).

Instantaneous Discharge Probability

As illustrated in Figure 1A, the nth interval was accepted as the onset of an increase in rate if the following two criteria were both fulfilled:

P(i > I_n \mid \mu = I_{n+1}) < 0.05
P(i > I_n \mid \mu = I_{n+2}) < 0.05.   (2.2)
Likewise, the nth interval was accepted as the onset of a decrease in rate if both of the following two criteria were met:

P(i > I_n \mid \mu = I_{n-1}) < 0.05
P(i > I_n \mid \mu = I_{n-2}) < 0.05.   (2.3)
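For integer order a, the tail probability of equation 2.1 has a closed form (the Erlang survival function), which makes the tests of equations 2.2 and 2.3 straightforward to implement. A minimal Python sketch, with function names of our own choosing:

```python
import math

def gamma_tail(I, mu, a=4):
    """P(i > I | mu) of equation 2.1: survival function of a gamma (Erlang)
    density of integer order a with mean interval mu."""
    x = a * I / mu        # the density has rate a/mu
    return math.exp(-x) * sum(x**k / math.factorial(k) for k in range(a))

def rate_increase_onset(isis, n, alpha=0.05, a=4):
    """Equation 2.2: interval n marks a rate increase onset if it is
    significantly longer than expected given the two succeeding intervals."""
    return (gamma_tail(isis[n], isis[n + 1], a) < alpha and
            gamma_tail(isis[n], isis[n + 2], a) < alpha)

def rate_decrease_onset(isis, n, alpha=0.05, a=4):
    """Equation 2.3: interval n marks a rate decrease onset if it is
    significantly longer than expected given the two preceding intervals."""
    return (gamma_tail(isis[n], isis[n - 1], a) < alpha and
            gamma_tail(isis[n], isis[n - 2], a) < alpha)
```

With a = 4 and constant firing, P(i > μ) ≈ 0.43, so an isolated interval equal to the local mean is never flagged; a 100 ms interval followed by two 10 ms intervals is.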
The longest of the three intervals under consideration is always the one tested. As noted, the requirement that the interval be significantly larger than both of the intervals preceding it (for a rate fall) or succeeding it (for a rate rise) ensures that only genuine rate changes are detected, and not anomalous intervals arising from the tail of the ISI distribution.
Changes in rate can thus be detected. The next stage is to determine at what precise time the change occurs. One possibility would obviously be the time of the first spike contributing to the changed interval. However, this is likely to be biased. A neurone that suddenly receives a higher rate of input will not fire immediately; the moment at which the firing probability rises is therefore likely to be some time before the first spike of the shortened interval. The best solution can be appreciated by the following formalism. Let the long interval I_n be divided into two parts: one lasting T_Lo, in which the discharge probability density is F_Lo, and another lasting T_Hi, in which the discharge probability density is F_Hi. We have therefore that:

T_{Lo} + T_{Hi} = I_n.   (2.4)
Since exactly one spike resulted from the interval I_n, the integral of the firing probability density over this time must be 1:

T_{Lo} F_{Lo} + T_{Hi} F_{Hi} = 1.   (2.5)
Finally, in the absence of any further information about the timing of the change in discharge probability, we assume that the change occurs independently of the spike process. Had the firing probability density not changed, the interval would have been 1/F_Lo. We therefore assume that, on average, T_Lo is half of this value:

T_{Lo} = \frac{1}{2 F_{Lo}}, \qquad T_{Lo} F_{Lo} = 0.5 = T_{Hi} F_{Hi}.   (2.6)
F_Hi is assumed to be equal to the reciprocal of the short interval I_Short either preceding (for a rate fall) or succeeding (for a rate rise) I_n. Hence equations 2.4–2.6 provide a unique solution, which is:

T_{Hi} = \frac{1}{2 F_{Hi}} = \frac{I_{Short}}{2}
T_{Lo} = I_n - \frac{1}{2 F_{Hi}} = I_n - \frac{I_{Short}}{2}
F_{Lo} = \frac{1}{2 T_{Lo}}.   (2.7)
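Equations 2.4–2.7 amount to a few lines of arithmetic. A sketch (names ours) that also makes the normalization of equation 2.5 easy to check:

```python
def place_breakpoint(long_isi, short_isi):
    """Split the long interval I_n (equations 2.4-2.7): the high rate from the
    adjoining short interval is extended half a short interval into I_n, and
    the remainder of I_n gets a reduced rate. Returns (T_hi, F_hi, T_lo, F_lo)."""
    f_hi = 1.0 / short_isi       # high rate = reciprocal of the short ISI
    t_hi = 0.5 * short_isi       # T_hi = 1/(2 F_hi), equation 2.7
    t_lo = long_isi - t_hi       # T_lo + T_hi = I_n, equation 2.4
    f_lo = 1.0 / (2.0 * t_lo)    # F_lo = 1/(2 T_lo), equation 2.7
    return t_hi, f_hi, t_lo, f_lo
```

By construction T_hi·F_hi = T_lo·F_lo = 0.5, so equation 2.5 holds exactly: the split interval still contributes exactly one spike.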
The high discharge probability density calculated from the short interval is therefore extended out for half of this interval into the adjoining long interval, as shown in Figure 1C, for both rate increases and rate decreases. This is compensated for by reducing the discharge probability density in the remainder of the long interval to a value close to half of the original (see Figure 1C). A probability density function of this form is exactly normalized, in the sense that its integral from time zero to a given time, rounded down, gives the number of spikes that have occurred up to that point. This function can therefore be seen as the input to an integrate-and-fire neurone, which would thereby generate the observed spike train.
The discharge probability density of Figure 1C still exhibits high variability, due to the high variance of intervals even when the underlying rate remains constant (Shadlen & Newsome, 1994; Softky & Koch, 1992). To partially remove this unwanted variability, intervals between abrupt firing changes are then smoothed by convolution with a gaussian kernel (standard deviation: 10 ms, chosen to be in the range of the somatic time constant; Koch, Rapp, & Segev, 1996). This smoothing is not carried out across the boundaries of rate changes as defined above, producing a measure that varies smoothly between rate change points but abruptly at them. This is illustrated in Figure 1D. It is the explicit identification and localization of abrupt firing probability changes that confer on this measure the most tangible advantage over other techniques. Without such a step, there is necessarily a trade-off between oversmoothing rapid changes and high variability, which cannot be satisfactorily resolved.
The method described produces an estimate of the discharge probability density function that can be computed without averaging across trials. The next section investigates an application of this method to the assessment of the significance of unitary joint events.

3 Application to Unitary Joint Event Analysis
The aim of the unitary joint event method is to detect discharges of two neurones that are synchronous within an allowed jitter and occur more often than expected by chance. The technique, similar to that proposed by Grün (1996; UEMWA method) but with minor changes to avoid unnecessary binning, is reviewed in Figures 2A–G. From the repeated trials (see Figures 2A and 2C), PSTHs for the two cells are built up, and these are smoothed with a moving window (here chosen to be 101 ms wide; see Figures 2B and 2D). Spikes from the two cells occurring at times t_1 and t_2 are designated a coincident event (CE) if

|t_1 - t_2| < \Delta,   (3.1)
where Δ in this study was taken as 5 ms. We conventionally place the occurrence time of the CE halfway between the two coincident spikes, at (t_1 + t_2)/2. The number of CEs is counted in successive bins, and the count is plotted as dots in Figure 2E. The number of these raw CEs rises as the cell discharge rates increase. The expected number of coincidences is approximately

E(\mathrm{CEs\ in\ bin\ } T) = 2 w \Delta N F_1(T) F_2(T),   (3.2)

where N is the number of trials, w is the bin width used to count the CEs (here 1 ms), and F_1, F_2 represent the firing rates of the cells estimated from the smoothed PSTHs. Computation of the expected number of CEs is made more complex than equation 3.2 because the bin width w is in general larger than the precision to which spike times are recorded. If a CE falls into the bin centered on time T, then

\left| \frac{t_1 + t_2}{2} - T \right| < \frac{w}{2}.   (3.3)
Suppose as an example that t_1 is in the bin centered on T − w, and t_2 is in the bin centered on T + w. For some possible combinations of t_1 and t_2, equation 3.3 could still be left unfulfilled (e.g., t_1 = T − 0.7w and t_2 = T + 1.3w). This is shown schematically by Figure 3A: CEs generated by spikes each confined to one bin may fall into two adjacent bins. Extending these considerations to all possible combinations of bin occupancy for t_1 and t_2 that could fulfill equations 3.1 and 3.3 leads us to an exact solution for the case of Δ = 5 ms, shown in Figure 3B and described numerically as:

E(\mathrm{CEs\ in\ } T) = w^2 N \sum_{j=-3}^{+3} \left( F_1(T + jw) \sum_{k=-3}^{+3} M_{j,k} F_2(T + kw) \right)   (3.5)

Figure 2: Facing page. Comparison of significance calculation for coincident events using the method of Grün (1996) and the novel method proposed here. One hundred trials, each lasting 500 ms, were generated with a central 250 ms burst of activity (firing rate rose from 25 Hz to 75–125 Hz), which began with a temporal jitter of 100 ms. (A–G) Methods based on the peristimulus time histograms (PSTH). For each cell (A and C), the mean firing rates are computed by compiling PSTHs across trials; these are smoothed by a 101 ms moving window (B and D). (E) The expected coincidence rate (solid line) is calculated by multiplying the two mean firing rates, with a correction for the size of the window within which coincidences are detected. Dots show the actual coincidences detected in this simulation. (F) Actual and expected coincidence rates are smoothed with a 101 ms moving window. (G) The probability P of finding as many or more CEs than observed is calculated from the expected rate assuming a Poisson distribution. P is displayed as log10((1 − P)/P), facilitating visualization of high and low values of P. (H–P) Method based on the single-trial firing-rate estimate. For each cell (rasters for the first trial shown in H and J), the instantaneous frequency is reconstructed on a trial-by-trial basis (I and K; filter SD between rate change points = 10 ms). Then the rate of expected CEs is calculated (L: first trial; M: four other trials). A bin-per-bin summation of the single-trial CE expectancy curves produces the global CE expectancy (N, solid line), plotted with the number of CEs detected (dots). (O) Both rates are then filtered using a 101 ms moving window. (P) Significance of departures from the expected rate is assessed as in (G) and plotted in the same way. Vertical bar: (B and D) 100 sp.s^{-1}; (E, F, N, O) 100 CE.s^{-1}.trial^{-1}; (I, K) 500 sp.s^{-1}; (L, M) 1667 CEs.s^{-1}; (G, P) 5.
M_{j=-3\ldots3,\ k=-3\ldots3} =
\begin{pmatrix}
0 & 0 & 0 & 0 & 0 & 1/4 & 0 \\
0 & 0 & 0 & 0 & 2/4 & 4/4 & 1/4 \\
0 & 0 & 0 & 2/4 & 4/4 & 2/4 & 0 \\
0 & 0 & 2/4 & 4/4 & 2/4 & 0 & 0 \\
0 & 2/4 & 4/4 & 2/4 & 0 & 0 & 0 \\
1/4 & 4/4 & 2/4 & 0 & 0 & 0 & 0 \\
0 & 1/4 & 0 & 0 & 0 & 0 & 0
\end{pmatrix}.   (3.6)
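The entries of M can be checked numerically: M_{j,k} is the fraction of spike pairs, uniform within bins centered at jw and kw, that form a CE (equation 3.1 with Δ = 5w) whose midpoint lands in the central bin (equation 3.3). A brute-force sketch in Python (grid resolution and names are our choices):

```python
def binning_matrix(delta=5, half=3, n=100, w=1.0):
    """Recover M of equation 3.6 numerically. Spike offsets within a bin are
    taken on an n x n grid of sub-bin midpoints; M[j][k] is the fraction of
    offset pairs giving a CE whose midpoint lies in the central bin."""
    offsets = [((i + 0.5) / n - 0.5) * w for i in range(n)]
    size = 2 * half + 1
    M = [[0.0] * size for _ in range(size)]
    for j in range(-half, half + 1):
        for k in range(-half, half + 1):
            hits = sum(
                1
                for u1 in offsets
                for u2 in offsets
                if abs((j * w + u1) - (k * w + u2)) < delta * w   # eq. 3.1
                and abs((j * w + u1 + k * w + u2) / 2) < w / 2    # eq. 3.3
            )
            M[j + half][k + half] = hits / n**2
    return M
```

With n = 100 this reproduces the tabulated fractions to within about 1%: for example, M[3][3] = 1 (both spikes in the central bin) and M[0][5] ≈ 1/4 (bins at −3w and +2w).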
Figure 2E plots the expected number of coincidences E as a solid line. The number of observed coincidences is then smoothed with a 101 ms moving window, yielding the plot in Figure 2F. Finally, the probability that the number of coincidences c will be at least as many as the number observed, C, given that E are expected, is calculated assuming that they follow a Poisson distribution:

P(c \geq C) = 1 - \sum_{i=0}^{C-1} \frac{e^{-E} E^{i}}{i!}.   (3.7)
(3.7)
This probability is plotted in Figure 2G as log10 ((1 ¡ P) / P ), thereby permitting bins with a signicant excess, or a signicant decit, of CEs to stand out. The horizontal lines mark the 99% condence limits (two tailed). The modication to this technique that we propose is illustrated in Figures 2H–2P. For each trial (see Figures 2H and 2J), the instantaneous discharge probability density is calculated (see Figures 2I and 2K). This is used to calculate the expected number of coincidences in a single bin, in one trial; summing over all trials yields the approximate number of coincidences in a given bin: E (CEs in T ) D 2wD
N X nD1
Fn1 ( T ). Fn2 ( T ),
(3.8)
where F_1^n, F_2^n represent the instantaneous discharge probability density during trial n. Once again, however, the situation is made more complex by the binning used. By similar considerations to those above, we find the exact formula for the case Δ = 5w to be:

E(\mathrm{CEs\ in\ } T) = w^2 \sum_{n=1}^{N} \sum_{j=-3}^{+3} \left( F_1^n(T + jw) \sum_{k=-3}^{+3} M_{j,k} F_2^n(T + kw) \right),   (3.9)

where M_{j,k} is as defined in equation 3.6. This measure is plotted as a solid line in Figure 2N, overlaid with the observed number of coincidences (dots). The moving-window-smoothed traces are shown in Figure 2O, and the significance plot, calculated according to equation 3.7, is illustrated in Figure 2P.
The crucial difference between the two methods lies in the order in which averaging across trials is performed. In the original method proposed by
Figure 3: Schematic illustration of CE binning. Calculation of the CEs before the binning step ensures their exact determination, but renders the calculation of their expectancy more complex. Black squares represent a bin containing spikes from one unit; white squares represent a bin containing spikes from the other. The gray squares code on a gray scale the density of CEs resulting from spikes in the black and white squares. White bins with dotted outlines indicate a situation where not all of the possible combinations of spikes will lie within the coincidence detection interval Δ. (A) The CE distribution resulting from the combination of spikes in two bins has a triangular shape. If bins are separated by (Δ − 1)w, w being odd, half the spike combinations do not verify the coincidence criterion of equation 3.1, and the number of CEs is halved. (B) The problem is solved exhaustively for Δ = 5w. Figures on the right represent the fraction of spikes from these bins that would form CEs within the central bin of interest (shown with thick vertical lines). The values given are used to construct the matrix M_{j,k} of equation 3.6.
Table 1: Evaluation of the Numbers of False-Positive CEs Occurring in Randomly Simulated Independent Spike Trains.

Frequency    IDPM                     PSTHM                    Difference
(Hz)         α=0.005   α=0.025       α=0.005   α=0.025        α=0.005   α=0.025
10           6         26.3          6.5       26             −0.5      0.3
30           2         15.8          3.3       20             −1.3*     −4.3*
50           0.5       8.3           0.5       12.5           0         −4.3*
70           0         2.3           0.3       4              −0.3      −1.8*

Notes: One hundred trials, 1500 ms long, were simulated at a constant firing frequency for the two cells. CEs were counted within the central 500 ms of the simulation. The table shows the percentage of simulations with significant CEs in a sample of 400 simulations with the same firing frequency. Two values for the criterion level α were used. Results for the methods using instantaneous discharge probability (IDPM) and normalization by the PSTH (PSTHM) are presented. The PSTH normalization method gives more false positives, although false-positive rates are high for both methods with α = 0.025. * = a significant difference between the two methods (binomial test, p < 0.05).
Grün (1996), firing rate is first averaged across trials, and then the expected number of coincidences is calculated as a product of firing rates. In the current method, the expectation product is calculated on a single trial, and this is then summed over trials. These will produce different results if the firing rates show any covariation from trial to trial about their mean profiles. In agreement with this, Table 1 presents data showing that both methods are indeed similar if the data are stationary. Two spike trains were generated as independent gamma processes, order 4, with constant firing rate. Simulations were run for 1500 ms, with 100 trials per simulation; 400 simulations were run for each set of conditions.

Figure 4: Facing page. Performance of the two methods on spike trains with firing-rate changes that covary in their timing. One hundred trials were generated following the sequence of firing-rate changes shown in (A). The black bars indicate the time range over which the rate changes began (uniform probability distribution). The first column shows the results obtained with the single-trial expectancy calculation, and the second column with the method based on the mean firing rate. (G) Rasters of activity over the 100 trials for both cells. (B) Reconstructed instantaneous frequency for both cells during the first trial (filter SD between rate change points: 10 ms) and (H) mean firing rates assessed over all trials. (C and I) CE rasters. (D and J) CE expectancy (solid) and CE observed rate (dots), smoothed with a 101 ms moving window. (E and K) Probability plot (same display as Figures 2G and 2P). (F and L) CE rasters. The CEs in windows that exceed the 0.5% confidence limits are marked by a square. A large number of false positives occur with the mean firing-rate method.

The number of simulations with significant CEs in the central 500 ms was counted using both of the methods described above, for two different significance criterion levels α. At the more relaxed criterion of α = 0.025, a large number of false positives occur with each method, although the
situation is slightly better for the instantaneous discharge probability technique. With the more rigorous criterion of α = 0.005, the two methods give comparable numbers of false positives.
Figure 4 illustrates the performance of the two algorithms on simulated data that have a covariation in the timing of the firing-rate modulation. The spike trains were independent, apart from the rate covariation, so that no significant CEs should be detected. For a particular trial, the underlying firing-rate profile of the two cells was the same. Figure 4A shows how the underlying firing rate changed with time; the onset of the two periods of increased firing rate was uniformly distributed over the times shown by the black bars; the timing of the two bursts was independent. One hundred such trials were simulated. Figure 4G shows rasters of the unit activities simulated according to the firing-rate profile illustrated in Figure 4A. The estimated instantaneous discharge probability density of the first trial (see Figure 4B) can be compared with the firing rate derived by averaging across all trials in Figure 4H. Spikes that were coincident within 5 ms are shown in Figures 4C and 4I. The numbers of CEs, counted in a 101 ms wide moving window, are plotted as dots in Figures 4D and 4J. Overlaid as a solid line is the expected number of coincidences, calculated according to the instantaneous discharge probability method (see Figure 4D) or the PSTH method (see Figure 4J). While the two plots closely follow one another in Figure 4D, there are discrepancies in Figure 4J associated with the changes in firing rate. These differences reach significance in the calculation of equation 3.7, such that the significance plot of Figure 4K goes outside the 0.5% confidence intervals. In Figures 4F and 4L, following the display convention established by Grün (1996), we display rasters of the CEs in which all events falling within a window that tested as significant are highlighted by a square.
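The difference between the two estimators is purely one of operator order, which a toy Python example (with made-up rate values, and the 2wΔ factor omitted for clarity) makes concrete:

```python
def expected_ces_psth(f1_trials, f2_trials):
    """PSTH method: average each cell's rate across trials first, then
    multiply (trial-averaged analogue of equation 3.2)."""
    n = len(f1_trials)
    return n * (sum(f1_trials) / n) * (sum(f2_trials) / n)

def expected_ces_single_trial(f1_trials, f2_trials):
    """Single-trial method (equation 3.8): multiply the rates within each
    trial, then sum across trials."""
    return sum(f1 * f2 for f1, f2 in zip(f1_trials, f2_trials))
```

With covarying rates of 10 and 90 Hz on alternate trials for both cells, the PSTH estimate is 2·50·50 = 5000 while the single-trial estimate is 10·10 + 90·90 = 8200: the extra chance coincidences produced by rate covariation are absorbed into the single-trial expectation instead of surfacing as spuriously significant CEs. When rates are stationary, the two estimates coincide.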
It is clear that the use of a trial-averaged firing-rate estimate can lead to errors in significance testing of unitary events when the neurone firing rates show a temporal jitter relative to the analysis alignment point, and that jitter is not independent between the two units. Our method based on single-trial discharge probability estimates is capable of alleviating this problem. The
Figure 5: Facing page. Performance of the two methods on spike trains that show covariant firing rates. (A) Schematic indicating the simulated firing-rate profiles. After an initial period of constant rate, the rate changed to a value chosen at random from the range shown. Three further rate changes occurred. The organization of the panels is the same as in Figure 4, with the methods based on the single-trial expectancy in the left column, and the mean firing-rate approximation on the right. Since the mean firing rate does not change in these simulations, the PSTH-based method cannot predict the increased number of CEs expected, and a large number of false positives result.
improvement primarily results from the changed order of the operators (multiply, then sum, versus sum, then multiply). It is likely that almost as much of an improvement would be obtained with simpler estimates of the
instantaneous discharge probability, such as the ISI reciprocal (smoothed or not) or a low-pass filtered spike train. Our method, however, is capable of following both fast discharge probability changes (via insertion of breakpoints) and slow ones (via the smoothing of the ISI reciprocal); it is therefore preferred.
Figure 5 illustrates the results of analysis on spike trains simulated to display variation in firing rate from trial to trial. The underlying rates of the two neurones on a given trial were the same, allowing the effect of rate covariation to be examined. The simulated rate profiles are shown schematically in Figure 5A. The remainder of Figure 5 follows the same display conventions as Figure 4. It can be seen that the trial-averaged firing rates (smoothed PSTH) in Figure 5H show no modulation: this is expected from Figure 5A, since the mean firing rates do not change. However, the increase in firing-rate variability brings about a rise in the number of CEs, shown by the dots in Figures 5D and 5J. This cannot be predicted by the trial-averaged measure, which therefore assigns a high significance level to these extra events. By contrast, the expected and observed curves show good agreement using the instantaneous discharge probability method in Figure 5D, such that no CEs reach significance.
Figure 6 examines the detection of genuine excess CEs by the new algorithm. Figures 6A–6F follow the same format as Figures 4 and 5. One spike train was simulated as previously by a simple gamma process (order 4). Some of these spikes were selected at random to become “imposed CEs.”
Figure 6: Facing page. Detection of injected CEs by the method based on the single-trial CE expectation, and comparison of the sensitivity of the two methods. One hundred trials were simulated with a constant firing rate of 30 Hz; between −0.4 and 0.25 s, 371 coincident events were imposed on the spike train of the second cell (20% of the spikes of cell 1; this yielded 893 CEs in total, 250 more than the 643 CEs expected from a control simulation). (A–F) Detection of imposed CEs (same display conventions as Figures 4 and 5, left column). (G) Data from simulations of 100 trials at a constant firing rate, which investigated the sensitivity of the two methods. Trials lasted 500 ms; CEs were imposed during the central 150 ms. The number of windows that exceeded significance is plotted versus the number of excess CEs. Different curves show results for different simulated background firing rates. Circles represent the method based on single-trial expectancy (instantaneous discharge probability method, IDPM) and triangles the method based on the mean firing rates (PSTHM). Bars indicate one standard deviation in a sample of 20 simulations. The methods show similar sensitivity; the decrease seen at high firing rates is due to the high level of chance coincidences expected. (H) Number of CEs within windows that tested as significant versus the number of excess CEs, for the same simulations as shown in (G). The correlation is good, indicating that many of the injected CEs are retrieved. Filter SD between rate change points: 10 ms.
Instantaneous Discharge Probability
663
The second spike train was also initially simulated as a gamma process; the spike closest to each imposed CE was then moved to be at the same time as this event. This produced an increased number of CEs above the chance
level but maintained the firing rate of the two units. In Figures 6E and 6F, the CEs injected between −0.4 and 0.25 s were accurately detected. During this period, the firing rate was constant at 30 Hz, and there were 38% more CEs than expected by chance (893 CEs versus 643 in a control simulation; 2.5 extra CEs per trial spread over 650 ms).
The right-hand side of Figure 6 compares both methods for CE detection in stationary spike trains at different firing rates. This was investigated by simulations similar to the one illustrated in Figure 6A, except that the firing rate and the number of imposed CEs were varied systematically. Figure 6G plots the number of bins that tested as having a significant excess of CEs versus the number of excess CEs (calculated as the difference between the mean number of CEs observed and the mean number of CEs during a control simulation without imposed CEs). The curves for the two methods lie close together at all tested firing rates, indicating a similar sensitivity for each method. The decline of apparent sensitivity at high firing rates is due to the high number of CEs that would be expected by chance at these rates; larger numbers of excess CEs are therefore necessary before significance is reached. Figure 6H plots the number of detected CEs (the number of CEs in the significant bins) versus the number of excess CEs; for both methods there is a good correlation. Points lie above the identity line because significance is attributed to all the CEs in the 101 ms window; there is no means of dissociating the injected CEs from those that occurred by chance. The similar performance when conditions of firing-rate stationarity are fulfilled indicates that the different results in more complex situations cannot simply be explained by a difference in overall sensitivity between the two methods.

4 Discussion

4.1 Instantaneous Discharge Probability Density Calculation.
The instantaneous discharge probability density function has a number of attractive features that make it a useful measure for single-trial analysis. It is able to cope well with a large range of firing rates, unlike methods that smooth pulses derived from the spike train (French & Holden, 1971; Houk et al., 1987). It is locally smooth, except at a few points of rapid change, and does not show great fluctuations arising simply from the variable nature of ISIs. However, by explicitly detecting points of rate change, blurring of rapid changes is avoided. The probabilistic framework for the detection of rate changes means that it is an entirely objective method in application. Finally, the use of a general gamma distribution as the underlying spike train model permits accurate representation of neurones with many different firing characteristics simply by changing the order parameter a.

4.2 Unitary Event Significance Testing.

The results presented demonstrate that the current methods used for assessing the significance of unitary events can lead to errors when there is variability in the firing rate of the neurones from trial to trial. Such variation in responses is less likely in sensory systems where the response to a stimulus is measured under anesthesia, since relatively few synapses are interposed between peripheral receptor and recorded neurones. By contrast, in awake animals subject to attentional fluctuations, trial-to-trial variation is the norm. Animals can never be trained to produce precisely reproducible behavior, and even if the overt output appears similar, it is likely that there are many different redundant patterns of neural activity that will lead to the same goal. Lee, Port, Kruse, and Georgopoulos (1998) showed that single-trial firing rates often show correlation (noise correlation in their terminology), which is strongest for cells close together. The situations presented in Figures 4 and 5 are therefore not artificially contrived to test current methods to their limits, but rather represent realistic possibilities common in experimental recordings. Other methods that rely on trial-averaged measures, such as the joint peristimulus time histogram (JPSTH) of Aertsen et al. (1989), or that assume intertrial constancy, such as the widely used shuffle predictor (Perkel, Gerstein, & Moore, 1967; Engel, König, Gray, & Singer, 1990; Lee et al., 1998), are also vulnerable to error in such situations. When there is little variation in firing rate from trial to trial, the proposed method based on instantaneous discharge probability has sensitivity similar to use of the PSTH (see Table 1 and Figure 6). The novel method is therefore preferable when analyzing a wide range of empirical data. Any technique for significance testing of CEs will make assumptions (the null hypothesis); a significant excess of CEs indicates a region of the data where these assumptions do not hold.
Quentin Pauluis and Stuart N. Baker

Normalization using the PSTH makes the assumption that each trial is a realization of a random process whose discharge probability changes as the PSTH. This will detect as significant all correlation between the two spike trains, over both the short term and the longer time course covariation of the firing rates about the trial-averaged level. Our method assumes that the neurones fire at random in response to an input that either changes smoothly over time or makes abrupt jumps in magnitude. CEs that are assigned significance using this approach result from transient input fluctuations correlated between the two neurones and occur over a timescale briefer than that implied by the assumed smooth changes in input. We consider that this is appropriate, since we are interested in short-term synchronization, which may reflect processes of temporal coding rather than longer-term comodulation of firing rate. Artifactual CEs could be produced, however, if other assumptions of the method are violated. For example, consider the situation where the neurone is more regular in its response than assumed in the algorithm used to position breakpoints (e.g., the cell fires as a gamma process of order 4, but we test it against a Poisson process). In this case, the algorithm will fail to detect most of the step changes in input undergone by the cell; if these input changes covary between the two neurones investigated, an artifactual excess of CEs will be detected. It is therefore important to adjust the assumed order of the gamma distribution to agree with the observed ISI distribution in experimental data.

There is one interesting circumstance in which our method will fail to assign significant CEs; it is debatable whether this is desirable. If a neurone fires brief bursts of only a few spikes, breakpoints will be set to delimit the bursts. The instantaneous discharge probability will therefore show a sharp increase closely surrounding the burst. If two neurones burst simultaneously, the expected CE count will rise sharply during the burst, such that any CEs that do occur are unlikely to reach significance. Whether such CEs should be treated as significant is an open question. Some other instantaneous discharge probability estimates could be proposed that would smooth across the burst limits; they would result in a measure of burst coincidence between the two trains rather than a measure of spike coincidence inside synchronous bursts. Alternative measures more suited to the data, such as the correlation between neurones of burst onset latency measured on single trials, would probably be simpler to interpret.

The one report to date that has applied unitary joint-event analysis to experimental data demonstrated that significant CEs were seen in association with rate changes and during periods of constant firing rate (Riehle et al., 1997). No data were presented on the variability of firing rate around rate changes, but it is likely that some of these CEs would not be significant using our improved technique. In the instances where significant CEs occurred without a modulation in firing rate, it is possible that there was nevertheless a change in the variability of discharge, as simulated in Figure 5. This could also lead to erroneous acceptance of chance CEs as significant.
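A minimal way to carry out this adjustment is moment matching: a gamma renewal process of order k has ISI coefficient of variation 1/sqrt(k), so the assumed order can be set to round(1/CV²) of the observed ISIs. The sketch below uses synthetic ISIs and an estimator of our own devising for illustration; it is not the authors' breakpoint algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_gamma_order(isis):
    """Match a gamma order to an observed ISI sample by moment matching:
    a gamma renewal process of order k has CV = 1 / sqrt(k)."""
    cv = isis.std() / isis.mean()
    return max(1, round(1.0 / cv ** 2))

# A regular train: gamma process of order 4 (as in the example above).
regular_isis = rng.gamma(shape=4.0, scale=0.01, size=5000)
print(estimate_gamma_order(regular_isis))    # recovers order 4

# A Poisson train: exponential ISIs, order 1.
poisson_isis = rng.exponential(scale=0.04, size=5000)
print(estimate_gamma_order(poisson_isis))    # recovers order 1
```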
The unitary joint-event method has received considerable interest, since it appears to allow the detection of synchronous firing on single trials (Fetz, 1997). However, as made clear by section 3, determination of the significance of CEs requires averaging across trials; this article has been concerned primarily with the best way in which to do this. Formally, therefore, the unitary joint-event method is similar to time-resolved cross-correlation methods, such as the JPSTH (Aertsen et al., 1989), except that it concentrates only on the central portion of the cross-correlation histogram and then identifies the moment and the trials where synchronous firings occurred that contributed to significant cross-correlation peaks. Whether this trial identification is useful depends on our underlying model for how synchrony occurs and is used in information processing in the brain. If we postulate that synchrony between the two recorded neurones is itself an important event, then knowing the trials on which it occurs is of obvious value in investigating the relationship between the CEs and, for example, task precision or reaction time. However, we may alternatively assume that the brain is composed of many small assemblies of neurones and that information is carried by the synchronous firing of a random subset of cells in one assembly (e.g., the synfire chain hypothesis of Abeles, 1991; Pauluis, Baker, & Olivier, 1999).
Instantaneous Discharge Probability
This seems likely, since at any moment some cells of a given assembly may be refractory or otherwise nonexcitable due to their recent participation in other assemblies. If this is the case, trials on which the two cells whose discharge is monitored fire synchronously are those trials where not only this assembly was active, but also where both cells were (by chance) able to participate in it. Knowing which trials contained the synchronous events is, on this model, less useful than knowing where, within the task, there was on average an excess of CEs.

We have shown that modification of unitary event analysis to use an instantaneous discharge probability density measure can compensate accurately for changes in the occurrence rate of CEs produced by firing-rate covariation. It is anticipated that this technique, or others based on the instantaneous discharge probability density, will find wide applicability in the analysis of synchronization among neural populations recorded from awake behaving animals.

Acknowledgments
We thank Sonia Grün for helpful discussions. This work was supported by a British Council–CGRI travel grant. S.N.B. is supported by the UK MRC and Christ's College, Cambridge.

References

Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of "effective connectivity." J. Neurophysiol., 61, 900–917.
Awiszus, F. (1988). Continuous functions determined by spike trains of a neuron subject to stimulation. Biol. Cybern., 58, 321–327.
Berry, M. J., & Meister, M. (1998). Refractoriness and neural precision. J. Neurosci., 18, 2200–2211.
Bessou, P., Laporte, Y., & Pagès, B. (1968). A method of analysing the response of spindle primary endings to fusimotor stimulation. J. Physiol., 196, 37–45.
Brody, C. D. (1997). Latency, excitability, and spike timing correlations. Soc. Neurosci. Abstr., 23, 14.
Brody, C. D. (1999a). Disambiguating different covariation types. Neural Comput., 11, 1527–1535.
Brody, C. D. (1999b). Latency, excitability, and spike timing correlations. Neural Comput., 11, 1537–1551.
Butler, E. G., Horne, M. K., & Hawkins, N. J. (1992). The activity of monkey thalamic and motor cortical neurones in a skilled, ballistic movement. J. Physiol., 42, 25–48.
Butler, E. G., Horne, M. K., & Rawson, J. A. (1992). Sensory characteristics of monkey thalamic and motor cortex neurones. J. Physiol., 442, 1–24.
Churchward, P. R., Butler, E. G., Finkelstein, D. I., Aumann, T. D., Sudbury, A., & Horne, M. K. (1997). A comparison of methods used to detect changes in neuronal discharge patterns. J. Neurosci. Meth., 76, 203–210.
Cocatre-Zilgien, J. H., & Delcomyn, F. (1992). Identification of bursts in spike trains. J. Neurosci. Meth., 41, 19–30.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 1805–1808.
Eggermont, J. J., Smith, G. M., & Bowman, D. (1993). Spontaneous firing in cat primary auditory cortex: Age and depth dependence and its effect on neural interaction measures. J. Neurophysiol., 69, 1292–1313.
Ellaway, P. H. (1978). Cumulative sum technique and its application to the analysis of peristimulus time histograms. Electroenceph. Clin. Neurophys., 45, 302–304.
Engel, A. K., Konig, P., Gray, C. M., & Singer, W. (1990). Stimulus-dependent neuronal oscillations in cat visual cortex—intercolumnar interaction as determined by cross-correlation analysis. Europ. J. Neurosci., 2, 588–606.
Fetz, E. E. (1997). Temporal coding in neural populations? Science, 278, 1901–1902.
French, A. S., & Holden, A. V. (1971). Alias-free sampling of neuronal spike trains. Kybernetik, 8, 165–171.
Gerstein, G. L., & Kiang, N. Y. S. (1960). An approach to the quantitative analysis of electrophysiological data from single neurons. Biophys. J., 1, 15–28.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity. Thun, Frankfurt am Main: Harri Deutsch.
Houk, J. C., Dessem, D. A., Miller, L. E., & Sybirska, E. H. (1987). Correlation and spectral analysis of relations between single unit discharge and muscle activities. J. Neurosci. Methods, 21, 201–224.
Hurst, H. E. (1950). Long-term storage capacity of reservoirs. Proc. Amer. Soc. Civil Engineering, 76, 1–30.
Koch, C., Rapp, M., & Segev, I. (1996). A brief history of time (constants). Cereb. Cortex, 6, 93–101.
Lee, D., Port, N. L., Kruse, W., & Georgopoulos, A. P. (1998). Variability and correlated noise in the discharge of neurons in motor and parietal areas of the primate cortex. J. Neurosci., 18, 1161–1170.
Legéndy, C. R., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53, 926–939.
Lemon, R. N. (1984). Methods for neuronal recording in conscious animals. London: Wiley.
MacPherson, J. M., & Aldridge, J. W. A. (1979). A quantitative method of computer analysis of spike train data collected from behaving animals. Brain Res., 175, 183–187.
Matthews, P. B. C. (1963). Apparatus for studying the responses of muscle spindles to stretching. J. Physiol., 169, 58–60P.
Munoz, D. P., & Wurtz, R. H. (1993). Fixation cells in monkey superior colliculus.
I. Characteristics of cell discharge. J. Neurophysiol., 70, 559–575.
Pauluis, Q., Baker, S. N., & Olivier, E. (1999). Emergent oscillations in a realistic network: The role of inhibition and the effect of spatiotemporal distribution of the input. J. Comput. Neurosci., 6, 27–48.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Richmond, B. J., Optican, L. M., Podell, M., & Spitzer, H. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. I. Response characteristics. J. Neurophysiol., 57, 132–146.
Richmond, B. J., Optican, L. M., & Spitzer, H. (1990). Temporal encoding of two-dimensional patterns by single units in primate primary visual cortex. I. Stimulus-response relations. J. Neurophysiol., 64, 351–369.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Seidemann, E., Meilijson, I., Abeles, M., Bergman, H., & Vaadia, E. (1996). Simultaneously recorded single units in the frontal cortex go through sequences of discrete and stable states in monkeys performing a delayed localization task. J. Neurosci., 16, 752–768.
Schwartz, A. B. (1992). Motor cortical activity during drawing movements: Single-unit activity during sinusoid tracing. J. Neurophysiol., 68, 528–541.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman & Hall.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Comput., 4, 643–646.

Received August 31, 1998; accepted March 1, 1999.
LETTER
Communicated by Anthony Zador
Impact of Correlated Inputs on the Output of the Integrate-and-Fire Model

Jianfeng Feng
David Brown
Computational Neuroscience Laboratory, Babraham Institute, Cambridge CB2 4AT, U.K.

For the integrate-and-fire model with or without reversal potentials, we consider how correlated inputs affect the variability of cellular output. For both models, the variability of efferent spike trains measured by the coefficient of variation (CV) of the interspike interval is a nondecreasing function of input correlation. When the correlation coefficient is greater than 0.09, the CV of the integrate-and-fire model without reversal potentials is always above 0.5, no matter how strong the inhibitory inputs. When the correlation coefficient is greater than 0.05, the CV for the integrate-and-fire model with reversal potentials is always above 0.5, independent of the strength of the inhibitory inputs. Under a given condition on correlation coefficients, we find that correlated Poisson processes can be decomposed into independent Poisson processes. We also develop a novel method to estimate the distribution density of the first passage time of the integrate-and-fire model.

1 Introduction
Neural Computation 12, 671–692 (2000)  © 2000 Massachusetts Institute of Technology

Although the single-neurone model has been widely studied using computer simulation, most of these studies have made the assumption that inputs are independent (Brown & Feng, 1999; Feng, 1997; Feng & Brown, 1998a, 1998b; Tuckwell, 1988), both spatially and temporally. This assumption obviously contradicts the physiological data, which clearly show that nearby neurones usually fire in a correlated way (Zohary, Shadlen, & Newsome, 1994)—what we term spatial correlation—and neurones with similar functions frequently form groups and fire together. In fact, "firing together, coming together" is a basic principle in neuronal development (Sheth, Sharma, Rao, & Sur, 1996). Furthermore, data in Zohary et al. (1994) show that even a weak correlation within a population of neurones can have a substantial impact on network behavior, which suggests that when comparing simulation results with experimental data, it is of vital importance to investigate the effects of correlated input signals on the output of single cells. We address the following two important issues: how correlation between the inputs affects the output of single-neurone models and when and how
correlated inputs can be transformed into equivalent independent inputs. The second issue is interesting since it is usually not easy to model or analyze a correlated system, except when it is gaussian distributed. Theoretically we do not have general analytical tools to handle correlated systems, even to find analytical expressions for the distribution. It is even difficult numerically to generate a correlated random vector. Therefore, a transformation from correlated to independent inputs greatly simplifies the problem theoretically and numerically and, furthermore, provides insights into the functional role of the correlation. The technique can be applied both to simulation of biophysical models and to experiments, in order to generate correlated Poisson inputs.

The neuronal model we use is the integrate-and-fire model with or without reversal potentials. Positive correlation in input signals usually enlarges the coefficient of variation (CV) of the integrate-and-fire model. When the correlation coefficient is greater than 0.09, we find that the CV for the leaky integrate-and-fire model without reversal potentials is always greater than 0.5. When the correlation coefficient is greater than 0.05, the CV for the equivalent model with reversal potentials is always greater than 0.5. There has been much research devoted to finding a neuronal mechanism that can generate spike trains with a CV greater than 0.5 (see Brown & Feng, 1999; Feng, 1999; Feng & Brown, 1998b, 1999a; Konig, Engel, & Singer, 1996, and references therein). To the best of our knowledge, this is the first article to provide a possible mechanism for doing this for the integrate-and-fire model independently of the ratio between the inhibitory and excitatory inputs. Independently and experimentally, Stevens and Zador (1998) found that correlation between inputs was necessary in order to generate a sufficiently large CV in output spike trains of neurons in neocortical slices.
Our analysis and simulations therefore provide a theoretical explanation of their findings. We also show that it is only under a restrictive condition that dependent inputs can be decomposed into independent inputs. Finally, we propose a novel method of finding the distribution density of the first passage time of the integrate-and-fire model.

Correlation usually imposes a strong restriction on the system and has a profound impact on its output (Gawne & Richmond, 1993; Zohary et al., 1994). Imagine a neurone exclusively receiving excitatory postsynaptic potentials (EPSPs) from a population of p excitatory neurones with mutually positive correlation coefficient c; then the output CV of the perfect integrate-and-fire neurone (see section 3) is proportional to [1 + (p² − p)c/p]/N_th = [1 + (p − 1)c]/N_th, where N_th is the number of EPSPs required for the cell to fire. The first term, 1/N_th, is well known (Softky & Koch, 1993; Abbott, Varela, Sen, & Nelson, 1997) and is responsible for what is usually termed the central limit theorem effect. The important difference between the first and second terms is that the latter depends on p, which is usually a large number. The appearance of this factor here and not in the first term is due to the fact that
the correlation is a second-order statistical quantity. Hence, even when c is small, say 0.1, the quantity (p − 1)c can be substantially greater than the first term, approaching or exceeding N_th, thus resulting in a high CV.

This is the second of our series of articles aiming to elucidate how more realistic inputs, in contrast to the conventional independently and identically distributed Poisson inputs that have been intensively studied in the literature, affect the outputs of simple neuronal models. We aim to describe as completely as possible the full spectrum of the behavior inherent in these models, documenting more thoroughly their restrictions and potential. Feng and Brown (1998b) have considered the behavior of the integrate-and-fire model subject to independent inputs with different distribution tails. In the near future, we will report on more complicated biophysical models with realistic inputs as we develop here and in Feng and Brown (1998b).

2 Models
In this section we define the models used and then discuss the effects of including reversal potentials on the dynamic behavior of the leaky integrate-and-fire model, attempting to elucidate the essential difference this makes to its behavior.

Suppose that a cell receives EPSPs at p excitatory synapses and inhibitory postsynaptic potentials (IPSPs) at q inhibitory synapses. The activities among excitatory synapses and among inhibitory synapses are correlated, but for simplicity of notation here, we assume that the activities of the two classes are independent of each other. When the membrane potential V_t is between the resting potential V_rest and the threshold V_thre,

dV_t = -\frac{1}{\gamma} V_t \, dt + a \sum_{i=1}^{p} dE_i(t) - b \sum_{j=1}^{q} dI_j(t),   (2.1)
where 1/γ is the decay rate, E_i(t) and I_j(t) are Poisson processes with rates λ_E and λ_I, respectively, and a, b are the magnitudes of each EPSP and IPSP (Thomson, 1997). Once V_t crosses V_thre from below, a spike is generated, and V_t is reset to V_rest. This model is termed the integrate-and-fire model. The interspike interval of the efferent spike process is T = inf{t : V_t ≥ V_thre}. Without loss of generality, we assume that the correlation coefficient between the ith excitatory (inhibitory) synapse and the jth excitatory (inhibitory) synapse is

c(i,j) = \frac{\langle (E_i(t) - \langle E_i(t)\rangle)(E_j(t) - \langle E_j(t)\rangle)\rangle}{\sqrt{\langle (E_i(t) - \langle E_i(t)\rangle)^2\rangle \, \langle (E_j(t) - \langle E_j(t)\rangle)^2\rangle}} = r(\|i - j\|),
where r is a nonincreasing function. More specifically, we consider two structures for the correlation: block-type correlation and gaussian-type correlation. Here block-type correlation means that N cells are divided into N/k blocks, and cell activities inside each block are correlated, whereas between blocks they are independent; gaussian-type correlation implies that the correlation between cells is a decreasing function of their geometrical distance (we impose a periodic boundary condition).

A slightly more general model than the integrate-and-fire model defined above is the integrate-and-fire model with reversal potentials, defined by

dZ_t = -\frac{Z_t - V_{re}}{\gamma} \, dt + \bar{a} \sum_{i=1}^{p} (V_E - Z_t) \, dE_i(t) + \bar{b} \sum_{j=1}^{q} (V_I - Z_t) \, dI_j(t),   (2.2)
where V_re is the resting potential, \bar{a} and \bar{b} are the magnitudes of a single EPSP and IPSP, respectively, and V_E and V_I are the reversal potentials. Z_t (the membrane potential) is now a birth-and-death process with boundaries V_E and V_I. Once Z_t is below V_re, the decay term Z_t − V_re will increase the membrane potential Z_t, whereas when Z_t is above V_re, the decay term will decrease it.

A detailed analysis might clarify the essential difference between the models with and without reversal potentials. Rewrite the model with reversal potentials, equation 2.2, in the following way:

dZ_t = -\frac{Z_t - V_{re}}{\gamma} \, dt + \bar{a}(V_E - V_{re}) \sum_{i=1}^{p} dE_i(t) + \bar{b}(V_I - V_{re}) \sum_{j=1}^{q} dI_j(t) - \bar{a}(Z_t - V_{re}) \sum_{i=1}^{p} dE_i(t) - \bar{b}(Z_t - V_{re}) \sum_{j=1}^{q} dI_j(t).   (2.3)
We see that the term

-\frac{Z_t - V_{re}}{\gamma} \, dt + \bar{a}(V_E - V_{re}) \sum_{i=1}^{p} dE_i(t) + \bar{b}(V_I - V_{re}) \sum_{j=1}^{q} dI_j(t)

is exactly the same as the integrate-and-fire model without reversal potentials; the coefficients of the diffusion term are fixed and independent of time. Compared to the integrate-and-fire model without reversal potentials, there is an additional term,

-\bar{a}(Z_t - V_{re}) \sum_{i=1}^{p} dE_i(t) - \bar{b}(Z_t - V_{re}) \sum_{j=1}^{q} dI_j(t),
which is a pure decay term. Hence equation 2.3 can be expressed in the following way:

dZ_t = -(Z_t - V_{re}) \left[ \frac{1}{\gamma} \, dt + \bar{a} \sum_{i=1}^{p} dE_i(t) + \bar{b} \sum_{j=1}^{q} dI_j(t) \right] + \bar{a}(V_E - V_{re}) \sum_{i=1}^{p} dE_i(t) + \bar{b}(V_I - V_{re}) \sum_{j=1}^{q} dI_j(t).   (2.4)
Therefore, the model with reversal potentials essentially increases the decay rate of the membrane potential, resulting in the system forgetting its dynamic history more quickly. This property is important when we consider how a group of cells synchronize their behavior (Feng & Brown, 1999b) and is also important for understanding numerical differences in the behavior of models with and without reversal potentials as presented below. Furthermore, equation 2.4 might help us reach a theoretical conclusion on a debate (Tuckwell, 1988, p. 165; Wilbur & Rinzel, 1983) that has existed in the literature for decades.

3 Analytical Results
We first use the technique of martingale decomposition to approximate the input to the integrate-and-fire model. Together with the result in the appendix, this leads in theorem 1 to a formula for the distribution of the interspike interval for the perfect integrate-and-fire model subjected to correlated inputs. We then obtain (in lemma 1) the interspike interval distribution of the leaky integrate-and-fire model for the special case when the leakage rate is such that the attractor of the deterministic part of the dynamics (Feng & Brown, 1999a) coincides with the spiking threshold. From this result, we find an expression in theorem 2 for the distribution density of the interspike interval for the general leaky integrate-and-fire model, which is then discussed. Finally, the model with reversal potentials is considered.

First, martingale decomposition is used to approximate the integrate-and-fire model with or without reversal potentials. We do not discuss the approximation accuracy, since this has been done by many authors previously (Tuckwell, 1988; Ricciardi & Sato, 1990) (see section 5). The Poisson point processes representing EPSP and IPSP input involve perturbations of membrane potential with the following two properties. First, they are either fixed in size (for the model without reversal potentials) or of a size dependent on membrane potential (for the model with reversal potentials). Second, they arrive irregularly in time according to Poisson processes. Using martingale decomposition, we effectively rewrite this process as a Wiener or Brownian motion process, which is simulated as the sum of an equivalent continuous current input, representing the drift that occurs with unbalanced EPSP/IPSP
input, and an unbiased random input involving independently varying, small, random perturbations of membrane potential, arriving at regularly spaced, very short intervals of time. The martingale decomposition reads

E_i(t) \approx \lambda_E t + \sqrt{\lambda_E} \, B_E^i(t),

and, similarly,

I_i(t) \approx \lambda_I t + \sqrt{\lambda_I} \, B_I^i(t),

where B_E^i(t) and B_I^i(t) are standard Brownian motions. For fixed time t, the above decomposition is equivalent to the central limit theorem. Therefore the integrate-and-fire model without reversal potentials can be approximated by

dv_t = -\frac{1}{\gamma} v_t \, dt + a \sum_{i=1}^{p} \lambda_E \, dt - b \sum_{i=1}^{q} \lambda_I \, dt + a \sqrt{\lambda_E} \sum_{i=1}^{p} dB_E^i(t) - b \sqrt{\lambda_I} \sum_{i=1}^{q} dB_I^i(t).
Since the summation of Brownian motions is again a Brownian motion, we can rewrite the equation above as follows:

dv_t = -\frac{1}{\gamma} v_t \, dt + (ap\lambda_E - bq\lambda_I) \, dt + \sqrt{a^2 \lambda_E \sum_{i=1}^{p} \sum_{j=1}^{p} c_E(i,j) + b^2 \lambda_I \sum_{i=1}^{q} \sum_{j=1}^{q} c_I(i,j)} \; dB(t)   (3.1)

     = -\frac{1}{\gamma} v_t \, dt + (ap\lambda_E - bq\lambda_I) \, dt + \sqrt{a^2 p \lambda_E + b^2 q \lambda_I + a^2 \lambda_E \sum_{i \neq j} c_E(i,j) + b^2 \lambda_I \sum_{i \neq j} c_I(i,j)} \; dB(t).
When c_E(i,j) = c_I(i,j) = 0, equation 3.1 gives rise to the same results as in the literature (Tuckwell, 1988) for independent inputs. When q = 0 and c_E(i,j) = c, the signal-to-noise ratio of the inputs given by equation 3.1 is

SNR = \frac{p\lambda_E}{\sqrt{p\lambda_E + p(p-1)\lambda_E c}},

which coincides with that in Zohary et al. (1994, fig. 3a).
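The decomposition of correlated Poisson processes into independent ones, mentioned in section 1, can be illustrated numerically. In the sketch below (our own construction for illustration, under the block-type assumption c(i, j) = c for i ≠ j; not necessarily the authors' decomposition), each of p trains is the superposition of a shared Poisson process of rate cλ_E and a private process of rate (1 − c)λ_E, which yields pairwise count correlation c.

```python
import numpy as np

rng = np.random.default_rng(1)

def correlated_poisson_counts(p, lam, c, T, n_trials):
    """Counts over a window T for p Poisson trains with pairwise
    correlation c: cov(N_i, N_j) = c*lam*T and var(N_i) = lam*T."""
    shared = rng.poisson(c * lam * T, size=(n_trials, 1))
    private = rng.poisson((1.0 - c) * lam * T, size=(n_trials, p))
    return shared + private

counts = correlated_poisson_counts(p=10, lam=100.0, c=0.1, T=1.0,
                                   n_trials=20000)
corr = np.corrcoef(counts.T)               # empirical 10 x 10 matrix
off_diag = corr[~np.eye(10, dtype=bool)]
print(off_diag.mean())                     # close to c = 0.1
```

The same superposition reproduces the signal-to-noise ratio above: the summed input has mean pλ_E T and variance pλ_E T + p(p − 1)λ_E cT.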
Combining the results above with the conclusions in the appendix, we arrive at the following theorem for the perfect integrate-and-fire model. Let T be the first passage time of v_t.

Theorem 1.  If γ = ∞, the distribution density of T is

p(t) = \frac{V_{thre}}{\sqrt{2\pi \left( a^2 p \lambda_E + b^2 q \lambda_I + a^2 \lambda_E \sum_{i \neq j} c_E(i,j) + b^2 \lambda_I \sum_{i \neq j} c_I(i,j) \right) t^3}} \exp\left[ -\frac{\left( V_{thre} - (ap\lambda_E - bq\lambda_I) t \right)^2}{2 \left( a^2 p \lambda_E + b^2 q \lambda_I + a^2 \lambda_E \sum_{i \neq j} c_E(i,j) + b^2 \lambda_I \sum_{i \neq j} c_I(i,j) \right) t} \right]   (3.2)

and in particular

\langle T \rangle = \frac{V_{thre}}{ap\lambda_E - bq\lambda_I}, \qquad CV^2 = \frac{a^2 p \lambda_E + b^2 q \lambda_I}{V_{thre}(ap\lambda_E - bq\lambda_I)} + \frac{a^2 \lambda_E \sum_{i \neq j} c_E(i,j) + b^2 \lambda_I \sum_{i \neq j} c_I(i,j)}{V_{thre}(ap\lambda_E - bq\lambda_I)}.   (3.3)

Theorem 1 tells us that for the perfect integrate-and-fire model, correlation between inputs does have an impact on the variation (CV) of the output interspike interval, but not on the mean firing rate. Compared to the perfect integrate-and-fire model without correlation, the CV of efferent interspike intervals falls between 0.5 and 1 for greater degrees of imbalance between EPSPs and IPSPs (Feng & Brown, 1998b). Furthermore, the output distribution is a long-tailed distribution if and only if an exact balance between EPSPs and IPSPs is reached, although in general correlated inputs will generate efferent spikes with a larger CV. Although theorem 1 is a special case of theorem 2, we prefer to write it as a separate theorem since we apply it in the next section.

In appendix A, we present a new and simple proof of theorem 1 without resorting to the classic Laplace transformation (Tuckwell, 1988), which we hope opens up new possibilities for obtaining an analytical formula for general cases—the integrate-and-fire model without reversal potentials. The idea of our approach is to use Girsanov's theorem (Protter, 1980), which connects two diffusion processes with a displacement transformation. Once we know the properties of the following diffusion process (σ > 0, a constant),

dX_t = b(X_t) \, dt + \sigma \, dB_t,
we can, by Girsanov's theorem, infer properties of the following diffusion process:

dY_t = b(Y_t) \, dt + \mu \, dt + \sigma \, dB_t.

For the problem we consider here, we can find the distribution density in the following case,

d\bar{v}_t = -\bar{v}_t/\gamma \, dt + \mu_0 \, dt + \sigma \, dB_t, \quad \text{with } \mu_0 = V_{thre}/\gamma,   (3.4)

the special case in which the attractor of the deterministic dynamics coincides with the threshold. (For further interesting properties of the dynamics under the condition of equation 3.4, see Feng, 1999; Feng & Brown, 1999a.) Let \bar{T} be the first passage time of \bar{v}_t.

Lemma 1.
When μ = V_thre/γ, the distribution density of \bar{T} is given by

p_0(t) = \frac{2\sigma^2 V_{thre} \exp(-t/\gamma)}{\sqrt{\pi \left[ \sigma^2 \gamma \left( 1 - \exp(-2t/\gamma) \right) \right]^3}} \exp\left[ -\frac{V_{thre}^2 \exp(-2t/\gamma)}{\sigma^2 \gamma \left( 1 - \exp(-2t/\gamma) \right)} \right].   (3.5)
Hence, by using Girsanov's theorem, as we do in the appendix, we find an analytical formula for the general case,

dv_t = -v_t/\gamma \, dt + \mu_0 \, dt + (\mu - \mu_0) \, dt + \sigma \, dB_t,

as described below.

Theorem 2.
The distribution density of the first passage time T of v_t is

p(t) = p_0(t) \exp\left[ -\frac{(\mu - \mu_0)^2 t}{2\sigma^2} - \frac{(\mu - \mu_0)\mu_0 t}{\sigma^2} + \frac{V_{thre}(\mu - \mu_0)}{\sigma^2} \right] \left\langle \exp\left( \frac{\mu - \mu_0}{\sigma^2} \int_0^{\bar{T}} \frac{\bar{v}_t}{\gamma} \, dt \right) \Big|\, \mathcal{F}_{\bar{T}} \right\rangle_{\bar{T} = t}.   (3.6)
As in the appendix, we only need to calculate the Radon-Nikodym derivative in Girsanov's theorem, which is given by

F(t) = \exp\left( \int_0^t \frac{\mu - \mu_0}{\sigma} \, dB_s - \frac{1}{2} \int_0^t \frac{(\mu - \mu_0)^2}{\sigma^2} \, ds \right) = \exp\left( \frac{\mu - \mu_0}{\sigma} B_t - \frac{1}{2} \frac{(\mu - \mu_0)^2}{\sigma^2} t \right).
Therefore we have

\langle g(T) \rangle = \int_0^{\infty} g(t) \, p_0(t) \exp\left( -\frac{(\mu - \mu_0)^2 t}{2\sigma^2} \right) \left\langle \exp\left( \frac{\mu - \mu_0}{\sigma} B_{\bar{T}} \right) \Big|\, \mathcal{F}_{\bar{T}} \right\rangle_{\bar{T} = t} dt,   (3.7)
where g is a measurable, nonnegative function, T is the first passage time, and \mathcal{F}_t is the σ-algebra up to time t. Now the only remaining task in proving theorem 2 is to find an analytical formula for B_{\bar{T}}. From

d\bar{v}_t = -\bar{v}_t/\gamma \, dt + \mu_0 \, dt + \sigma \, dB_t,

we have

\bar{v}_{\bar{T}} = -\int_0^{\bar{T}} \bar{v}_t \, dt/\gamma + \mu_0 \bar{T} + \sigma B_{\bar{T}}.
Equation 3.7 thus becomes

\langle g(T) \rangle = \int_0^{\infty} g(t) \, p_0(t) \exp\left[ -\frac{(\mu - \mu_0)^2 t}{2\sigma^2} + \frac{V_{thre}(\mu - \mu_0)}{\sigma^2} - \frac{(\mu - \mu_0)\mu_0 t}{\sigma^2} \right] \left\langle \exp\left( \frac{\mu - \mu_0}{\sigma^2} \int_0^{\bar{T}} \frac{\bar{v}_t}{\gamma} \, dt \right) \Big|\, \mathcal{F}_{\bar{T}} \right\rangle_{\bar{T} = t} dt,

which gives the desired results.
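The proof suggests a simple Monte Carlo estimator for p(t): simulate the reference process of equation 3.4, record each first passage time T̄ together with the path integral of v̄_t/γ, and attach the Girsanov weight of theorem 2. The sketch below (Euler scheme, illustrative parameter values of our own choosing, not the authors' implementation) checks the change-of-measure identity that the weights average to 1; a weighted histogram of the recorded passage times then approximates p(t).

```python
import numpy as np

rng = np.random.default_rng(2)

Vthre, gamma, sigma = 20.0, 20.0, 5.0   # illustrative values only
mu0 = Vthre / gamma                      # reference drift (eq. 3.4)
mu = 1.2 * mu0                           # drift of the target process

n, dt, t_max = 2000, 0.01, 400.0
v = np.zeros(n)                          # reference paths v_bar
area = np.zeros(n)                       # accumulates int v_bar/gamma dt
T = np.full(n, np.nan)                   # first passage times
alive = np.ones(n, dtype=bool)
t = 0.0
while alive.any() and t < t_max:
    m = alive.sum()
    v[alive] += (-v[alive] / gamma + mu0) * dt \
        + sigma * np.sqrt(dt) * rng.standard_normal(m)
    area[alive] += v[alive] / gamma * dt
    t += dt
    hit = alive & (v >= Vthre)
    T[hit] = t
    alive &= ~hit

ok = ~np.isnan(T)
d = mu - mu0
# Girsanov weight appearing in theorem 2 / equation 3.7
w = np.exp(-d ** 2 * T[ok] / (2 * sigma ** 2)
           + Vthre * d / sigma ** 2
           - d * mu0 * T[ok] / sigma ** 2
           + d * area[ok] / sigma ** 2)
print(ok.mean(), w.mean())   # nearly all paths hit; mean weight near 1
```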
Now let us consider the meaning of the term \int_0^{\bar{T}} \bar{v}_t \, dt. It is the area under the curve visited by the process \bar{v}_t before it hits the threshold, as shown in Figure 1. It is easily seen that we are able to use different ways to approximate \int_0^{\bar{T}} \bar{v}_t \, dt (we know the distribution of \bar{T}, lemma 1), which give rise to various approximations of the distribution density of T. Finding an analytical formula for the distribution density of T is a problem that has been studied for decades, and a number of different methods, mainly in terms of PDEs, have been employed (see, for example, Ricciardi & Sato, 1990). Here we present a novel and probably the most natural way—by virtue of the probability method—to approximate the distribution density. (A study of how to improve the accuracy of the approximation is an interesting problem, but it is outside the scope of this article and will be published elsewhere.)

Figure 1: Dark area above the resting potential − dark area below the resting potential = \int_0^{\bar{T}} \bar{v}_t \, dt (V_rest = 0 mV).

In fact, from theorem 2 we can obtain some informative conclusions. For example, we claim that as soon as μ ≥ μ_0, then p(t) goes to zero (t → ∞) at least as fast as

\exp\left[ -\left( \frac{(\mu - \mu_0)^2}{2\sigma^2} + \frac{1}{\gamma} \right) t \right],
an exponentially decreasing function. This is because \bar{v}_t/\gamma \le V_{thre}/\gamma = \mu_0, and therefore

\exp\left( -\frac{(\mu - \mu_0)\mu_0 t}{\sigma^2} \right) \left\langle \exp\left( \frac{\mu - \mu_0}{\sigma^2} \int_0^{\bar{T}} \frac{\bar{v}_t}{\gamma} \, dt \right) \Big|\, \mathcal{F}_{\bar{T}} \right\rangle_{\bar{T} = t} \le 1.
Now we turn to the model with reversal potentials. Similar to the above treatment of the model without reversal potentials, we can rewrite the model in the following form,

dz_t = -\frac{1}{\gamma} z_t \, dt + a \sum_{i=1}^{p} \lambda_E (V_E - z_t) \, dt + b \sum_{j=1}^{q} \lambda_I (V_I - z_t) \, dt + a \sqrt{\lambda_E} \sum_{i=1}^{p} |V_E - z_t| \, dB_E^i(t) + b \sqrt{\lambda_I} \sum_{j=1}^{q} |V_I - z_t| \, dB_I^j(t),
which gives

dz_t = -\frac{1}{\gamma} z_t \, dt + \left( ap(V_E - z_t)\lambda_E - bq(z_t - V_I)\lambda_I \right) dt + \sigma_r(t) \, dB_t,   (3.8)
where

\sigma_r^2(t) = a^2 (z_t - V_E)^2 \lambda_E \sum_{i=1}^{p} \sum_{j=1}^{p} c_E(i,j) + b^2 (z_t - V_I)^2 \lambda_I \sum_{i=1}^{q} \sum_{j=1}^{q} c_I(i,j)

            = a^2 (z_t - V_E)^2 p \lambda_E + b^2 (z_t - V_I)^2 q \lambda_I + a^2 (z_t - V_E)^2 \lambda_E \sum_{i \neq j} c_E(i,j) + b^2 (z_t - V_I)^2 \lambda_I \sum_{i \neq j} c_I(i,j).   (3.9)
There are other forms for the diffusion terms in equation 3.8 that approximate the original process (Musila & Lánský, 1994). For simplicity of notation, we confine ourselves to equation 3.9.

4 Numerical Results
From now on we assume that c_E(i,j) = c_I(i,j) = c(i,j). In this section we present and discuss the results of numerical simulations of equations 3.1 and 3.8, showing the effect of block-type correlation on the mean interspike interval (ISI) and CV of the leaky integrate-and-fire model without (see Figure 2) and with reversal potentials (see Figure 4) for moderate EPSP/IPSP sizes, and also for very small EPSP/IPSP sizes (see Figure 3). We also demonstrate the effect of a correlation that falls off as an inverse square law (see Figure 5).

4.1 Block-type Correlation. For simplicity of notation, we consider only the case that c(i,j) = c for i ≠ j, i, j = 1, . . . , p and i ≠ j, i, j = 1, . . . , q. It is reported in the literature (Zohary et al., 1994) that the correlation coefficient between cells is around 0.1 in V5 of rhesus monkeys in vivo. In human motor units of a variety of muscles, the correlation coefficients are usually in the range 0.1 to 0.3 (Matthews, 1996). In Figure 2 we show numerical simulations for q = 0, 10, 20, . . . , 100, p = 100, V_rest = 0, V_thre = 20 mV, a = b = 0.5 mV, γ = 20.2 msec, λ_E = λ_I = 100 Hz, a set of parameters as in the literature (Brown & Feng, 1999; Feng, 1999; Feng & Brown, 1998b, 1999a), and c = 0, 0.01, 0.02, . . . , 0.1. When c = 0, inputs without correlation, there
682
Jianfeng Feng and David Brown
Figure 2: Mean firing time (left) and CV (right) versus q for c(i, j) = 0, 0.01, 0.05, 0.09, 0.1 of the integrate-and-fire model without reversal potentials. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text. As soon as c(i, j) ≥ 0.09, the CV of efferent spike trains is greater than 0.5.
are many numerical and theoretical investigations (see, for example, Softky & Koch, 1993; Ricciardi & Sato, 1990; Feng & Brown, 1998b). As one might expect, the larger the input correlation, the larger the output CV. In particular, we note that when c ≥ 0.09, CV > 0.5 for any value of q. When the ratio q/p is small, the mean firing time is independent of the correlation, a phenomenon we have pointed out for the perfect integrate-and-fire model (see theorem 1). However, when q/p is large (i.e., q ≥ 50), the mean firing time with weak correlation is greater than that with large correlation. This is easily understood from the properties of the process: the larger the correlation, the larger the diffusion term, which drives the process across the threshold more often. As in the perfect integrate-and-fire model, however, the CV of efferent spike trains does depend on both the ratio and the correlation: the larger the correlation and the ratio, the larger the CV. Many studies have been devoted to the problem of how to generate spike trains with CV between 0.5 and 1 (see, for example, Brown & Feng, 1999; Feng, 1999; Feng & Brown, 1998b, 1999a; Softky & Koch, 1993; Shadlen & Newsome, 1994). In particular, Softky and Koch (1993) point out that it is impossible for the integrate-and-fire model to generate spike trains with CV
Figure 3: Mean firing time (left) and CV (right) versus q for c(i, j) = 0, 0.01, 0.05, 0.09, 0.1 of the integrate-and-fire model without reversal potentials and a = b = 0.05, p = 1000. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text. For the case of c(i, j) = 0 and q = 0, the CV is 0.05. Note that we calculate q = 0, 100, …, 800 only for c(i, j) = 0, since the mean firing time is too large when q = 900, 1000.
between 0.5 and 1 if the inputs are exclusively excitatory. A phenomenon referred to as the central limit effect is widely cited in the literature (Abbott et al., 1997) as justification of this statement. Many approaches have been proposed to get around this problem (Shadlen & Newsome, 1994). Here we clearly demonstrate that even with exclusively excitatory inputs, the integrate-and-fire model is capable of emitting spike trains with CV greater than 0.5. No matter what the ratio between the excitatory and inhibitory synapses, the CV of efferent spike trains is always greater than 0.5, provided that the correlation coefficient between the inputs is greater than or equal to 0.09. Furthermore, we want to emphasize that Zohary et al. (1994) claim, based on their physiological data, that c > 0.1 is the most plausible case. To confirm further that a small correlation plays an important role in neuronal output, in Figure 3 we simulate an extreme case with an EPSP and IPSP size of 0.05 mV. Shadlen and Newsome (1994) have reported that the size of EPSPs is between 0.05 mV and 2 mV. This choice implies that 400 EPSPs are needed to drive a cell to fire. To allow a reasonable comparison with the numerical results for a = 0.5, μ is kept constant when the ratio
Figure 4: Mean firing time (left) and CV (right) versus q for c(i, j) = 0, 0.01, 0.05, 0.09, 0.1 of the integrate-and-fire model with reversal potentials. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text. As soon as c(i, j) ≥ 0.05, the CV of efferent spike trains is greater than 0.5.
q/p is the same, and hence p = 1000 in Figure 3. When c(i, j) = 0 and q = 0, the CV is about 0.05 (see Figure 3). Figure 3 clearly shows that a large efferent CV is again obtained with a small correlation. Now we turn to the integrate-and-fire model with reversal potentials. As we already pointed out, the essential difference between this model and the model without reversal potentials is that the former has a larger effective membrane-potential decay rate. Therefore in this case it is natural to expect that a larger efferent spike-train CV is obtained, since the decay term plays a predominant role. We assume that V_re = −50 mV, V_I = −60 mV, V_thre = −30 mV, and V_E = 50 mV, values that match experimental data. As in the literature (Musila & Lánský, 1994; Feng & Brown, 1998b), we impose a local balance condition on the magnitudes of EPSPs and IPSPs: ā(V_E − V_re) = b̄(V_re − V_I) = 1 mV; starting from the resting potential, a single EPSP or a single IPSP will depolarize or hyperpolarize the membrane potential by 1 mV (see the discussion in section 3). Figure 4 clearly shows that efferent spike trains of the integrate-and-fire model with reversal potentials are more variable than for the model without reversal potentials (except for high q and c(i, j)). When the correlation coefficient is greater than 0.05, CV > 0.5, no matter what the ratio
Figure 5: Mean firing time (left) and CV (right) versus q for σ = 1, 2, 5, 9, 10 of the integrate-and-fire model without reversal potentials for a gaussian-type correlation pattern. Ten thousand spikes are generated for calculating the mean firing time and CV. Parameters are specified in the text. As soon as σ ≥ 5, the CV of efferent spike trains is greater than 0.5.
between inhibitory and excitatory inputs. Due to this effect (large CV), the mean firing rates are also more widely spread than for the model without reversal potentials. When q ≥ 20, the differences between mean firing times become discernible (see Figure 4). From Figures 2 and 4 we can see another interesting phenomenon. When q/p approaches one, the CV of the integrate-and-fire model without reversal potentials is greater than that of the model with reversal potentials (which itself is less than or equal to one) (see Tuckwell, 1988, p. 165; Wilbur & Rinzel, 1983). The reason can be understood from our analyses in section 3. Note that the decay term returns the system toward the resting potential. The model with reversal potentials is equivalent to one with a large decay rate, and so Z_t generally remains closer to the resting potential than V_t. When q gets close to 100, the deterministic forces tending to increase the membrane potential become weak, and so the process starting from the resting potential tends to fall below the resting potential more and more often. However, the decay terms together with the reversal potentials in the integrate-and-fire model with reversal potentials prevent the process from going far below the resting potential. The process Z_t is thus more densely packed in a compact region, and a smaller range of CV is observed.
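The trends of this subsection can be reproduced directly from the diffusion approximation. The sketch below is a minimal Euler integration of the model without reversal potentials, with the constant drift and variance implied by block-type correlation c(i, j) = c (equations 3.8 and 3.9 with constant coefficients). The function, its defaults, and the integration step are ours and purely illustrative; this is not the authors' simulation code.

```python
import numpy as np

def simulate_isis(p=100, q=0, a=0.5, b=0.5, gamma=20.2,
                  lam_E=0.1, lam_I=0.1, c=0.0,
                  V_rest=0.0, V_thre=20.0,
                  n_spikes=1000, dt=0.02, seed=0):
    """Euler integration of dV = -V/gamma dt + mu dt + sigma dB for the
    model without reversal potentials.  The correlation c enters only
    through the variance (off-diagonal sums, as in eq. 3.9).
    Times in msec, voltages in mV, rates in 1/msec."""
    rng = np.random.default_rng(seed)
    mu = a * p * lam_E - b * q * lam_I
    var = (a ** 2 * p * lam_E + b ** 2 * q * lam_I
           + c * a ** 2 * p * (p - 1) * lam_E
           + c * b ** 2 * q * (q - 1) * lam_I)
    sigma, sq_dt = np.sqrt(var), np.sqrt(dt)
    isis, V, t = [], V_rest, 0.0
    while len(isis) < n_spikes:
        V += (-V / gamma + mu) * dt + sigma * sq_dt * rng.standard_normal()
        t += dt
        if V >= V_thre:              # spike: record ISI and reset
            isis.append(t)
            V, t = V_rest, 0.0
    isis = np.asarray(isis)
    return isis.mean(), isis.std() / isis.mean()   # mean ISI, CV

# Excitatory-only input (q = 0): a correlation of 0.1 alone raises the CV.
_, cv_uncorr = simulate_isis(c=0.0)
_, cv_corr = simulate_isis(c=0.1)
print(cv_uncorr < cv_corr)   # True
```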
4.2 Gaussian-Type Correlation. We suppose that c(i, j) = exp(−(i − j)²/σ²), i, j = 1, …, p, with periodic boundary conditions. Hence the larger σ is, the more widely spread the correlation for the ith cell. We simulate the integrate-and-fire model using the Euler scheme (Feng, Lei, & Qian, 1992) with σ = 1, 2, …, 10. Phenomena similar to those of the previous subsection are observed (see Figure 5): there is a threshold value σ_c such that as soon as σ > σ_c, the CV of efferent trains becomes greater than 0.5, independent of the number of inhibitory inputs. Similar results are obtained for the integrate-and-fire model with reversal potentials (see the previous subsection) as well (not shown).

5 From Dependent to Independent Synapses
We now consider how we can realize a constant-correlated Poisson process. Define E_i(t) = N_i(t) + N(t), where N_i(t), i = 1, …, p, are independently and identically distributed Poisson processes with rate (1 − c)λ_E, and N(t) is an independent Poisson process with rate cλ_E. It is then easily seen that the correlation coefficient between E_i and E_j is c. Therefore, in terms of our conclusions above, we can rewrite the total synaptic input Σ_i E_i(t) in the following way:
$$\sum_i E_i(t) = \sum_i N_i(t) + pN(t). \qquad (5.1)$$
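The claim that a shared Poisson component of rate cλ_E produces pairwise correlation c can be checked numerically. The sketch below (window length, seed, and trial count are arbitrary choices of ours) compares the empirical correlation coefficient of the counts with c:

```python
import numpy as np

# E_i = N_i + N, with N_i ~ Poisson((1-c)*lam_E) private to each process
# and N ~ Poisson(c*lam_E) common to all of them (counts over a 1 s window).
rng = np.random.default_rng(1)
lam_E, c, p, trials = 100.0, 0.1, 5, 200_000

N_priv = rng.poisson((1 - c) * lam_E, size=(trials, p))   # independent parts
N_comm = rng.poisson(c * lam_E, size=(trials, 1))         # common source
E = N_priv + N_comm

# Cov(E_i, E_j) = Var(N_comm) = c*lam_E and Var(E_i) = lam_E,
# so the correlation coefficient is c.
corr = np.corrcoef(E[:, 0], E[:, 1])[0, 1]
print(round(corr, 2))   # ~ 0.1
```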
This tells us that constant-correlated Poisson inputs are equivalent to the case in which there is a common source, the term N(t) in equation 5.1, shared by all synapses. Independent inputs are much easier to deal with than correlated inputs, and we exploit this again here. Let us rewrite the integrate-and-fire model in the following way:
$$dV_t = -\frac{V_t}{\gamma}\,dt + a\sum_{i=1}^{p} dN_i^E(t) + ap\,dN^E(t) - b\sum_{i=1}^{q} dN_i^I(t) - bq\,dN^I(t)$$
$$\approx -\frac{V_t}{\gamma}\,dt + \mu\,dt + \sqrt{a^2 p(1-c)\lambda_E + b^2 q(1-c)\lambda_I + a^2 p^2 c\lambda_E + b^2 q^2 c\lambda_I}\;dB_t.$$
The independent terms contribute to the variance an amount a²pλ_E(1 − c), but the common source contributes an amount a²p²λ_E c. Since p is usually large, the common source thus plays a predominant role in the model's behavior when we employ physiological parameters (a = 0.5, c = 0.1) in the integrate-and-fire model. We thus ask whether a given series of correlated Poisson processes (suppose that c(i, j), j = 1, …, p, is not constant) can be decomposed into a
series of independent Poisson processes. Furthermore, without loss of generality, we assume that λ_E = 1.

Theorem 3 (Decomposition Theorem). Suppose that p is large and
$$\sum_{j=2}^{p} c(1,j) \le \frac{1}{2}. \qquad (5.2)$$
Then the Poisson process E_i(t) defined below,
$$E_i(t) = N_i(t) + \sum_{j=1}^{i-1} N_{j,\,i-j}(t) + \sum_{j=1}^{p-i} N_{i,j}(t),$$
is correlated, with the correlation between E_i(t) and E_j(t) equal to c(1, |i − j|) for i ≠ j, j ± 1, where N_i(t), N_{i,j}(t), i = 1, …, p; j = 1, …, p − i, are independent Poisson processes with rates
$$\lambda_{N_i} = 1 - \sum_{j=2}^{i-1} c(1,j) - \sum_{j=2}^{p-i} c(1,j) \quad\text{and}\quad \lambda_{N_{i,j}} = c(1,j).$$

Proof. For i ≠ j, j + 1, j − 1, we obtain
$$\langle E_i(t) - \langle E_i(t)\rangle,\; E_j(t) - \langle E_j(t)\rangle\rangle = \langle N_i(t) - \langle N_i(t)\rangle,\; N_j(t) - \langle N_j(t)\rangle\rangle$$
$$+ \sum_{k=1}^{i-1}\sum_{l=1}^{j-1} \langle N_{k,i-k}(t) - \langle N_{k,i-k}(t)\rangle,\; N_{l,j-l}(t) - \langle N_{l,j-l}(t)\rangle\rangle$$
$$+ \sum_{k=1}^{i-1}\sum_{l=1}^{p-j} \langle N_{k,i-k}(t) - \langle N_{k,i-k}(t)\rangle,\; N_{j,l}(t) - \langle N_{j,l}(t)\rangle\rangle$$
$$+ \sum_{k=1}^{p-i}\sum_{l=1}^{j-1} \langle N_{i,k}(t) - \langle N_{i,k}(t)\rangle,\; N_{l,j-l}(t) - \langle N_{l,j-l}(t)\rangle\rangle$$
$$+ \sum_{k=1}^{p-i}\sum_{l=1}^{p-j} \langle N_{i,k}(t) - \langle N_{i,k}(t)\rangle,\; N_{j,l}(t) - \langle N_{j,l}(t)\rangle\rangle$$
$$= \tfrac{1}{2}\bigl(1 + \operatorname{sgn}(i-j)\bigr)\,c(1,|i-j|)\,t + \tfrac{1}{2}\bigl(1 + \operatorname{sgn}(j-i)\bigr)\,c(1,|i-j|)\,t = c(1,|i-j|)\,t.$$
Using a similar algebraic calculation, we have ⟨E_i(t) − ⟨E_i(t)⟩, E_i(t) − ⟨E_i(t)⟩⟩ = t. Therefore the correlation coefficient between E_i(t) and E_j(t) is given by c(1, |i − j|).
It is obvious that condition 5.2 is rather restrictive, since c(i, j) = c with c > 1/p violates it. However, the following example shows that the condition in theorem 3 is a necessary and sufficient condition for a linear decomposition.

Example. Consider four cells i = 1, 2, 3, 4, and assume that c(2, 3) > 0 and c(2, 4) = 0. Since we have c(1, 3) = 0, the most efficient way to construct E_2 is E_2(t) = N_1(t) + N_3(t), with ⟨N_1(t)⟩ = c(2, 3) = ⟨N_3(t)⟩. Therefore we have a linear decomposition of E_i(t) if and only if 2c(1, 2) ≤ 1.

6 Discussion
We have shown that a weak correlation between neurones can have a dramatic influence on model output. In the integrate-and-fire model without reversal potentials, when the correlation coefficient is above 0.09, the CV of efferent spike trains is above 0.5. In the integrate-and-fire model with reversal potentials, the CV of efferent spike trains exceeds 0.5 already when the correlation coefficient is above 0.05. These properties are independent of the number of inhibitory inputs and therefore resolve an existing problem in the literature: how the integrate-and-fire model generates spike trains with CV greater than 0.5 when the ratio between inhibitory and excitatory inputs is low. Note that Zohary et al. (1994) have claimed that a correlation greater than 0.1 is the most plausible value. The integrate-and-fire model is one of the most widely used single-neuronal models. Its advantage is that some useful analytical formulas are available, although a fundamental question remains open: to find the distribution density of the firing time. We have proposed a novel approach based on Girsanov's theorem, which opens up new possibilities for answering this question. Analytical approaches of this type are a useful and, wherever possible, essential supplement to computer simulation results, since analytically derived expressions are usually very general rather than applicable only to the specific parameter values used in simulations. We illustrate this by deriving an algebraic form for the tail of the interspike interval distribution. This could be important for fitting to experimental data, and hence for obtaining parameter estimates, but it also has implications, explored in our previous work (Feng & Brown, 1998a), when the output of the current layer of neurones is the input to a subsequent layer. When the correlation coefficient between neurones is a constant, we can easily decompose them into a linear summation of independent Poisson
processes, and therefore all results in the literature are applicable. Under a sufficient condition, we also study the possibility of decomposing a correlated Poisson process into independent Poisson processes. The results can be widely used in biophysical models and experiments. Finally, we discuss the implication of random inputs (Poisson process inputs) in our model. This is a puzzling issue, and a solid answer can be provided only in terms of experiments (Abeles, 1982, 1990). Mainen and Sejnowski (1995) pointed out that "Reliability of spike timing depended on stimulus transients. Flat stimuli led to imprecise spike trains, whereas stimuli with transients resembling synaptic activity produced spike trains with timing reproducible to less than one millisecond." However, it must be emphasized that their experiments were carried out in neocortical slices. It is interesting to see that the variability of spike trains depends on the nature of the inputs, but the CV of efferent spike trains in vivo might be very different from that in slices due to the absence of other inputs in slice experiments. For example, it has been reported that CV is between 0.5 and 1 for visual cortex (V1) and extrastriate cortex (MT) (Softky & Koch, 1993; Marsalek, Koch, & Maunsell, 1997), whereas even in human motor cells CV is between 0.1 and 0.25 (Matthews, 1996, p. 597). There are also examples demonstrating that a group of cells in vivo can behave totally differently from those in slices: oxytocin cells fire synchronously in vivo, but this property is totally lost in slices (see Brown & Moos, 1997, and references therein). Very recently, it has been reported that random noise of amplitude similar to the deterministic component of the signal plays a key role in the neural control of eye and arm movements (Harris & Wolpert, 1998; Sejnowski, 1998). Thus randomness is present, and appears to have a functionally important role, in a number of physiological functions.

Appendix
We know that, in terms of the reflection principle, the distribution density of the first passage time from (−∞, V_thre) of a Brownian motion σB_t is given by
$$p(t) = \frac{V_{\mathrm{thre}}}{\sqrt{2\pi t^3 \sigma^2}} \exp\bigl(-V_{\mathrm{thre}}^2/(2\sigma^2 t)\bigr).$$
We intend to find the distribution density of the process X_t = μt + σB_t. For any measurable positive function g, we have
$$\langle g(T)\rangle = \langle g(\hat{T})\,F(\hat{T})\rangle, \qquad (\mathrm{A.1})$$
690
Jianfeng Feng and David Brown
where F(t) is the Radon–Nikodym derivative given by Girsanov's theorem, as follows:
$$F(t) = \exp\Bigl(\int_0^t \frac{\mu}{\sigma}\,dB_s - \frac{1}{2\sigma^2}\int_0^t \mu^2\,ds\Bigr) = \exp\bigl(\mu B_t/\sigma - \mu^2 t/(2\sigma^2)\bigr),$$
and T̂ is the first passage time of B_t. Inserting the expression for F(t) into equation A.1 and using the fact that B_{T̂} = V_thre/σ, we obtain
$$\langle g(T)\rangle = \bigl\langle g(\hat{T}) \exp\bigl(\mu B_{\hat{T}}/\sigma - \mu^2 \hat{T}/(2\sigma^2)\bigr)\bigr\rangle$$
$$= \int_0^\infty g(t)\,\exp\bigl(\mu V_{\mathrm{thre}}/\sigma^2 - \mu^2 t/(2\sigma^2)\bigr)\,\frac{V_{\mathrm{thre}}}{\sqrt{2\pi t^3 \sigma^2}}\,\exp\bigl(-V_{\mathrm{thre}}^2/(2\sigma^2 t)\bigr)\,dt$$
$$= \int_0^\infty g(t)\,\frac{V_{\mathrm{thre}}}{\sqrt{2\pi \sigma^2 t^3}}\,\exp\bigl(\mu V_{\mathrm{thre}}/\sigma^2 - \mu^2 t/(2\sigma^2) - V_{\mathrm{thre}}^2/(2\sigma^2 t)\bigr)\,dt$$
$$= \int_0^\infty g(t)\,\frac{V_{\mathrm{thre}}}{\sqrt{2\pi \sigma^2 t^3}}\,\exp\bigl(-(\mu t - V_{\mathrm{thre}})^2/(2\sigma^2 t)\bigr)\,dt.$$
Therefore the distribution density of T, the first exit time of X_t, is given by
$$\frac{V_{\mathrm{thre}}}{\sqrt{2\pi \sigma^2 t^3}}\,\exp\bigl[-(\mu t - V_{\mathrm{thre}})^2/(2\sigma^2 t)\bigr].$$
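The density just derived is that of an inverse Gaussian distribution, whose mean is V_thre/μ. The sketch below (with illustrative parameter values of our choosing) checks that the density integrates to one with the expected mean, and that a direct Euler simulation of X_t = μt + σB_t reproduces the first-passage mean:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, V_thre = 2.0, 1.0, 5.0
dt, n_paths = 1e-3, 20_000

# Analytic first-passage density of X_t = mu*t + sigma*B_t through V_thre,
# as derived in the appendix (an inverse Gaussian density).
def p_T(t):
    return (V_thre / np.sqrt(2 * np.pi * sigma ** 2 * t ** 3)
            * np.exp(-(mu * t - V_thre) ** 2 / (2 * sigma ** 2 * t)))

# It integrates to one, and its mean is V_thre / mu = 2.5.
ts = np.linspace(1e-6, 40.0, 400_000)
w = ts[1] - ts[0]
assert abs((p_T(ts) * w).sum() - 1.0) < 1e-3
assert abs((ts * p_T(ts) * w).sum() - V_thre / mu) < 1e-2

# Euler simulation of the first passage time, all paths in parallel.
x = np.zeros(n_paths)
T = np.zeros(n_paths)
alive = np.ones(n_paths, dtype=bool)
t = 0.0
while alive.any():
    n = alive.sum()
    x[alive] += mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    t += dt
    crossed = alive & (x >= V_thre)
    T[crossed] = t
    alive &= x < V_thre
print(round(T.mean(), 2))   # close to V_thre/mu = 2.5
```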
Acknowledgments
We are grateful to the anonymous referees for bringing the references Stevens and Zador (1998) and Gawne and Richmond (1993) to our attention and for their valuable comments on the manuscript of this article. The work was partially supported by the BBSRC and an ESEP grant of the Royal Society.

References

Abeles, M. (1982). Local cortical circuits: An electrophysiological study. New York: Springer-Verlag.
Abeles, M. (1990). Corticonics. Cambridge: Cambridge University Press.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Brown, D., & Moos, F. (1997). Onset of bursting in oxytocin cells in suckled rats. J. of Physiology, 503, 652–634.
Brown, D., & Feng, J. (1999). Is there a problem matching model and real CV(ISI)? Neurocomputing, 26–27, 117–122.
Feng, J. (1997). Behaviours of spike output jitter in the integrate-and-fire model. Phys. Rev. Lett., 79, 4505–4508.
Feng, J. (1999). Origin of firing variability of the integrate-and-fire model. Neurocomputing, 26–27, 87–91.
Feng, J., & Brown, D. (1998a). Spike output jitter, mean firing time and coefficient of variation. J. Phys. A: Math. Gen., 31, 1239–1252.
Feng, J., & Brown, D. (1998b). Impact of temporal variation and the balance between excitation and inhibition on the output of the perfect integrate-and-fire model. Biol. Cybern., 78, 369–376.
Feng, J., & Brown, D. (1999a). Coefficient of variation greater than 0.5: How and when? Biol. Cybern., 80, 291–297.
Feng, J., & Brown, D. (1999b). Synchronisation due to common pulsed input in Stein's model. Phys. Rev. E, in press.
Feng, J., Lei, G., & Qian, M. (1992). Second-order algorithms for SDE. Journal of Computational Mathematics, 10, 376–387.
Gawne, T. J., & Richmond, B. J. (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? J. Neurosci., 13, 2758–2771.
Harris, C. M., & Wolpert, D. M. (1998). Signal-dependent noise determines motor planning. Nature, 394, 780–784.
König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. TINS, 19, 130–137.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurones. Science, 268, 1503–1506.
Marsalek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurones. Proceedings of the National Academy of Sciences U.S.A., 94, 735–740.
Matthews, P. B. C. (1996). Relationship of firing intervals of human motor units to the trajectory of post-spike after-hyperpolarization and synaptic noise. J. Physiology, 492, 597–628.
Musila, M., & Lánský, P. (1994).
On the interspike intervals calculated from diffusion approximations for Stein's neuronal model with reversal potentials. J. Theor. Biol., 171, 225–232.
Protter, P. (1980). Stochastic integration and differential equations: A new approach. Berlin: Springer-Verlag.
Ricciardi, L. M., & Sato, S. (1990). Diffusion processes and first-passage-time problems. In L. M. Ricciardi (Ed.), Lectures in applied mathematics and informatics. Manchester: Manchester University Press.
Sejnowski, T. J. (1998). Neurobiology: Making smooth moves. Nature, 394, 725–726.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Sheth, B. R., Sharma, J., Rao, S. C., & Sur, M. (1996). Orientation maps of subjective contours in visual cortex. Science, 274, 2110–2115.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stevens, C. F., & Zador, A. M. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neurosci., 1, 210–217.
Thomson, A. M. (1997). Activity-dependent properties of synaptic transmission at two classes of connections made by rat neocortical pyramidal. J. Physiology, 502, 131–147.
Tuckwell, H. C. (1988). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Wilbur, W. J., & Rinzel, J. (1983). A theoretical basis for large coefficient of variation and bimodality in neuronal interspike interval distributions. J. Theor. Biol., 105, 345–368.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370, 140–143.

Received September 10, 1998; accepted March 4, 1999.
LETTER
Communicated by Raphael Ritz
Weak, Stochastic Temporal Correlation of Large-Scale Synaptic Input Is a Major Determinant of Neuronal Bandwidth

David M. Halliday
Division of Neuroscience and Biomedical Systems, Institute of Biomedical and Life Sciences, University of Glasgow, Glasgow, G12 8QQ, U.K.

We determine the bandwidth of a model neurone to large-scale synaptic input by assessing the frequency response between the outputs of a two-cell simulation that share a percentage of the total synaptic input. For temporally uncorrelated inputs, a large percentage of common inputs is required before the output discharges of the two cells exhibit significant correlation. In contrast, a small percentage (5%) of the total synaptic input consisting of stochastic spike trains that are weakly correlated over a broad range of frequencies exerts a clear influence on the output discharge of both cells over this range of frequencies. Inputs that are weakly correlated at a single frequency induce correlation between the output discharges only at the frequency of correlation. The strength of temporal correlation required is sufficiently weak that analysis of a sample pair of input spike trains could fail to reveal the presence of correlated input. Weak temporal correlation between inputs is therefore a major determinant of the transmission to the output discharge of frequencies present in the spike discharges of presynaptic inputs, and therefore of neural bandwidth.
1 Introduction
Mammalian central nervous system neurones have extensive dendritic trees (Mel, 1994). The combined effects of large-scale synaptic input to such structures act in several ways, which contribute directly and indirectly to the output discharge of the cell. Bernander, Douglas, Martin, and Koch (1991) and Rapp, Yarom, and Segev (1992) demonstrated that large-scale synaptic activity can alter the integrative properties of a cell by acting to reduce the input resistance and time constant. Bernander et al. (1991) further demonstrated that large-scale background synaptic activity results in an increased sensitivity to highly synchronized inputs. Murthy and Fetz (1994) and Bernander, Koch, and Usher (1994) showed that highly synchronized inputs can modulate the mean firing rate of a cell. Dendrites are known to act as low-pass filters (Rinzel & Rall, 1974), which in terms of populations of synaptic inputs results in a filtering out of higher-frequency components present in

Neural Computation 12, 693–707 (2000)
© 2000 Massachusetts Institute of Technology
694
David M. Halliday
input spike trains (Farmer, Bremner, Halliday, Rosenberg, & Stephens, 1993; Halliday, 1995). This study examines the response of cells to different strengths of temporal correlation within subpopulations of the total synaptic input. Particular attention is given to an important aspect of neuronal systems: the widespread presence of noise at all levels (Holden, 1976), which results in the random fluctuations in membrane potential observed in intracellular recordings (Calvin & Stevens, 1968) and in the stochastic nature of both interspike intervals and the temporal correlation between spike trains. This study includes weak stochastic temporal correlation between presynaptic inputs, in contrast to the highly synchronized form of temporal correlation used in the above studies, where all synchronized inputs always fire together within a narrow time window. The aim is to match natural patterns of correlation more closely (Singer, 1993; Gray, 1994). We consider first the effects of uncorrelated inputs, where a large percentage of common inputs is required before the two cells exhibit significant correlation. In contrast, for inputs with weak temporal correlation, a small percentage (5–10%) of the total synaptic input can exert a clear influence on the output discharge at frequencies where the inputs are correlated. Preliminary results are presented in Halliday (1998b).

2 Model Description and Parameters
The model is based on a class of cells with extensive dendrites whose morphology and electrophysiology have been widely studied (Cullheim, Fleshman, Glenn, & Burke, 1987; Fleshman, Segev, & Burke, 1988; Rall et al., 1992): spinal motoneurones. Based on these studies, the membrane resistivity, R_m, is 11,000 Ω·cm²; the cytoplasmic resistivity, R_i, is 70 Ω·cm; and the membrane capacitance, C_m, is 1.0 μF/cm². Each cell has a spherical soma of 50 μm diameter with 12 tapered dendrites, 4 each of three types: short, medium, and long. A uniform taper of 0.5 μm per 100 μm length is used for dendrite diameters in the distal direction. For each dendrite type, the initial diameters at the soma are 5, 7.5, and 10 μm; the physical lengths are 766, 1258, and 1904 μm; the electrotonic lengths are L = 0.7, 1.0, and 1.5; and the total membrane areas are 8100, 18,500, and 33,500 μm², respectively. Each dendrite is modeled as a sequence of connected cylinders of fixed electrotonic length 0.1 unit; the complete cell model has 129 compartments, representing a total membrane area of 248,200 μm², with 97% of this area in the 12 dendrites. The input resistance of the complete model neurone (including soma and tapered dendrites) to hyperpolarizing current injected at the soma is 4.92 MΩ, and the time constant is 9.7 msec. The compartmental model provides a model of current flow in dendrites accurate to second order in both space and time (Segev, Fleshman, & Burke, 1989; Mascagni, 1989). Action potential generation and afterhyperpolarization (AHP) currents are based on the model of Baldissera and Gustafsson (1974). This incorpo-
Dynamic Modulation of Neuronal Bandwidth
695
rates a threshold-crossing mechanism in conjunction with a time-dependent potassium conductance to represent AHP currents. The three-term model for potassium conductances proposed by Baldissera and Gustafsson is reduced to its single dominant term, an exponential function with a time constant of 14 msec. At the low firing rates used in this study, it is not necessary to include the faster-acting conductances. Output spikes are generated when the membrane potential at the soma exceeds the fixed threshold of −55 mV, and the AHP conductance is activated after each output spike. This conductance produces a rapid hyperpolarization of the membrane potential toward the reversal potential of −75 mV. With constant injected current sufficient to induce an output firing rate of 12 spikes/sec, the membrane potential at the soma following each output spike decreases by approximately 12 mV. Individual excitatory synaptic inputs use a time-dependent conductance change specified by an alpha function, g_syn(t) = G_s (t/τ) e^(−t/τ) (Jack, Noble, & Tsien, 1975), with the time constant for this conductance change set at τ = 2 × 10⁻⁴ sec (Segev, Fleshman, & Burke, 1990) and scaling factor G_s = 1.19 × 10⁻⁸, giving a peak conductance change of 4.38 nS at t = 0.2 msec. The membrane resting potential is −70 mV, and the reversal potential for excitatory postsynaptic potentials (EPSPs) is −10 mV. When activated at the soma from rest, this conductance generates an EPSP with peak magnitude 100 μV at t = 0.525 msec, and a rise time (10% to 90%) and half-width of 0.275 msec and 2.475 msec, respectively. The EPSP parameters at the soma when activated in the most distal compartment of the short dendrite (L = 0.7) are 52.8 μV at t = 2.00 msec, and 0.900 and 7.05 msec. When activated in the most distal compartment of the medium dendrite (L = 1.0), the EPSP parameters are 40.4 μV at t = 3.08 msec, and 1.35 and 9.35 msec.
Activation in the most distal compartment of the long dendrite (L = 1.5) gives EPSP parameters at the soma of 27.1 μV at t = 4.75 msec, and a rise time and half-width of 2.08 and 11.38 msec. Presynaptic inputs are distributed spatially uniformly by membrane area over the soma and dendrites. The level of excitation necessary to induce a constant firing rate of around 12 spikes/sec in the model cell is 31,872 EPSPs/sec. This can be achieved by 996 uniformly distributed inputs firing randomly at 32 spikes/sec. In this configuration the soma receives 32 inputs, and the dendrites of each type (short, medium, and long) receive 33, 74, and 134 synaptic inputs each, distributed by area over the surface of the dendrite. In the examples below, where periodic inputs form part of the total synaptic input, the mean rate and number of inputs are adjusted in tandem to provide the same number of EPSPs/sec acting on each compartment of the model.

3 Data Analysis and Inference of Neural Bandwidth
The simulations consist of two cells in which a percentage of the total synaptic input is common to both cells. The cells therefore exhibit a tendency for
correlated discharge, which we use to estimate the range of frequencies in the common input spike trains that are transmitted to the output discharge of the cells. The data analysis methods are those set out in Rosenberg, Amjad, Breeze, Brillinger, and Halliday (1989) and Halliday et al. (1995) for frequency-domain analysis of stochastic neuronal data. We use estimates of coherence functions and cumulant density functions to characterize the correlation between the output discharges. Cumulant density functions provide a measure of correlation between spike trains as a function of time; coherence functions provide a measure of correlation as a function of frequency. The cumulant density function at lag u between two spike trains (1, 2), denoted by q₁₂(u), can be defined and estimated as the inverse Fourier transform of the cross spectrum:
$$q_{12}(u) = \int_{-\pi}^{\pi} f_{12}(\lambda)\, e^{i\lambda u}\, d\lambda,$$
where f₁₂(λ) is the cross spectrum between (1, 2) at frequency λ. The corresponding coherence function, |R₁₂(λ)|², can be defined and estimated as the magnitude of the cross spectrum squared, normalized by the product of the two auto spectra:
$$|R_{12}(\lambda)|^2 = \frac{|f_{12}(\lambda)|^2}{f_{11}(\lambda)\, f_{22}(\lambda)}.$$
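A coherence estimate of this general form can be computed from binned spike trains. The sketch below is purely illustrative: it uses scipy's Welch-based coherence estimator as a stand-in for the procedures of Halliday et al. (1995), and toy 0/1 trains sharing a common Poisson input (rates of our choosing) rather than the model's output discharges. Two trains with a shared input show coherence well above zero across frequencies.

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(3)
fs, dur = 1000.0, 400.0                  # 1 ms bins, 400 s of data
n = int(fs * dur)

common = rng.random(n) < 0.010           # shared input, ~10 events/s
s1 = ((rng.random(n) < 0.005) | common).astype(float)  # binned train 1
s2 = ((rng.random(n) < 0.005) | common).astype(float)  # binned train 2

# Welch-based coherence estimate between the two 0/1 trains.
f, C = coherence(s1, s2, fs=fs, nperseg=1024)
band = (f > 10) & (f < 100)
print(round(C[band].mean(), 2))   # roughly (10/15)^2, i.e. ~0.44
```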
For further details, including estimation procedures and the setting of confidence limits, see Halliday et al. (1995). For two identical neurones acted on by a single common input, the estimated coherence between the output discharges is proportional to the product of the input spectrum with the squared magnitude of the transfer function (defined as the ratio of the cross spectrum and the input spectrum) for this single input (Rosenberg, Halliday, Breeze, & Conway, 1998). Using random, or Poisson, common spike-train inputs allows the bandwidth of the cell's input-output transfer function to be inferred from the estimated coherence between the output discharges of the two cells. We use a similar approach in this study, except that we are dealing with a population of common inputs to the two cells. This method has the advantage of being independent of the number of inputs applied to the cell; the application and interpretation are the same for 1 or 1000 common inputs. The estimation of a multivariate transfer function for a cell with several hundred inputs is not practical. In addition, the two-cell method will also be useful when nonlinear mechanisms are involved in synaptic transmission. Our concern is with the presence of common frequency components in the two output discharges; since the common inputs have the same effect on both cells, the two-cell model will also be useful when linear and nonlinear transformations of input signals occur. Some of the simulations use presynaptic input spike trains that are periodic. We interpret a distinct peak in the estimated coherence between the
Dynamic Modulation of Neuronal Bandwidth
697
Figure 1: Synchronized discharge in the two-cell model induced by 50% common inputs with Poisson discharges. (a) Cumulant density estimate. Dashed and solid horizontal lines are the expected value and upper and lower 95% confidence limits, respectively, based on the assumption of independence. (b) Coherence estimate. Dashed horizontal line is the upper 95% confidence limit based on the assumption of independence. Results estimated from 400 seconds of data.
two output discharges at the frequency of the periodicity to indicate that this frequency is within the bandwidth of the cell for that particular input configuration. This inference uses an important property of a Fourier-based analytical framework: distinct frequencies persist within a (linear) system (Brillinger, 1974) and can be detected as a periodic correlation between an input and an output, or between two outputs. Inferring the properties of common inputs to neurones from the estimated correlation between their output discharges is a well-established approach in both the modeling of neural systems (Perkel, Gerstein, & Moore, 1967; Moore, Segundo, Perkel, & Levitan, 1970; Rosenberg et al., 1998) and neurophysiology (Farmer et al., 1993).

4 Results

4.1 Uncorrelated Inputs. With both cells acted on by 996 inputs, each activated by a Poisson spike train of rate 32 spikes/sec, and no common inputs, both cells fire with a mean rate of around 12 spikes/sec, and, as expected, their output discharges are uncorrelated. When 50% of the inputs are applied commonly, the cells share 15,936 EPSPs/sec, and their output discharges exhibit a tendency for synchronous discharge. This results in a peak at time zero in the time-domain correlation (see Figure 1a), estimated as a cumulant density function (Halliday et al., 1995). In the frequency domain, significant correlation is present at frequencies up to around 50 Hz in the estimated coherence (see Figure 1b). Therefore the bandwidth of this configuration, which reflects the ability of 50% of the synaptic inputs activated by random spike trains to influence the output discharge, is around 50 Hz.

Figure 2: Response of two-cell model to different percentages of common Poisson inputs. Estimated coherence between output discharges for (a) 10%, (b) 20%, (c) 40%, and (d) 80% common Poisson inputs applied uniformly over entire dendritic tree. Dashed horizontal lines are confidence limits as described for Figure 1. Results estimated from 800 seconds of data for each configuration.

Figure 2 shows coherence estimates between the output discharges when each cell receives the same number of EPSPs/sec as above, except that the percentage of common EPSPs/sec is varied from 10% to 80%. These common inputs are activated by Poisson spike trains, their location is distributed uniformly by area over the cell body and dendrites, and they have the same location in both cells. The coherence estimate with 10% common inputs (see Figure 2a) has no significant features; therefore, the combined effect of 10% of the total synaptic input is insufficient to exert any influence on the timing of output spikes. The configuration with 80% common input (see Figure 2d) has a coherence estimate with significant values up to frequencies in excess of 200 Hz. The bandwidth of this configuration is therefore above 200 Hz. The bandwidth of the neurone therefore depends on the percentage of the total synaptic input that is considered to provide the input.

Figure 3: Response of two-cell model to different percentages of common uncorrelated periodic 25 spikes/sec inputs. Estimated coherence between output discharges for (a) 10%, (b) 20%, (c) 40%, and (d) 80% common inputs applied uniformly over entire dendritic tree. Dashed horizontal lines are confidence limits as described for Figure 1. Results estimated from 800 seconds of data for each configuration.

Coherence estimates provide a normative measure of association on a scale from 0 to 1 (Rosenberg et al., 1989; Halliday et al., 1995). The peak coherence between the two output discharges with 80% common synaptic input is 0.15. Thus, even with 80% common synaptic input to the two cells, the discharge of one cell can predict a maximum of 15% of the variability in the discharge of the companion cell. The above configurations using Poisson spike trains are equivalent to probing a continuous system with a white noise signal; both types of signal have a flat spectral estimate. Many neural systems have pathways that carry periodic spike trains. In addition, there has been recent interest in the role of rhythmic neuronal activity in neural integration (Singer, 1993; Gray, 1994). This raises the question of whether the bandwidths inferred from the coherence estimates in Figure 2 are valid for spike train inputs that contain distinct periodicities. This is explored in Figure 3, which shows
coherence estimates, with the common inputs activated by periodic spike trains with a mean rate of 25 spikes/sec and a coefficient of variation (COV) of 0.1. As above, in determining the number of inputs, the total number of EPSPs/sec acting on each compartment remained fixed. For example, the soma of each cell receives 1024 EPSPs/sec, which with 10% common synaptic input consists of 4 periodic inputs firing at 25 spikes/sec common to both cells and 37 random inputs firing at 25 spikes/sec applied independently to each cell. Figure 3 demonstrates that the behavior of the simulation in response to a varying percentage of common periodic inputs is similar to that for random inputs. With only 10% of the inputs firing at 25 spikes/sec there is no significant coherence at 25 Hz (see Figure 3a). An increasing percentage of inputs firing at 25 spikes/sec results in an increase in the magnitude of the coherence estimate at 25 Hz, and also an increase in the range of frequencies present, seen as peaks at harmonics of the 25 Hz input.

4.2 Correlated Inputs. The simulations suggest that populations of synaptic inputs activated by uncorrelated spike trains that constitute less than 20% of the total synaptic input to a cell are unlikely to exert any influence on the timing of output spikes. Next we explore the response to input spike trains that are themselves correlated. Each spike train is generated using an integrate-and-fire encoder, and the resulting correlation within each population of common inputs is both weak and stochastic in nature (Halliday, 1998a). These integrate-and-fire encoders are much simpler than the detailed conductance-based compartmental neurone model. They are used to generate correlated spike trains and are not intended to represent presynaptic circuitry acting on motoneurones.
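The encoder scheme described above can be caricatured as follows. This is a hypothetical toy version: the drive distribution, gain, threshold, and rates are invented for illustration and are not the generator parameters of Halliday (1998a):

```python
import numpy as np

def if_encoder_population(n_trains, n_steps, shared_gain=0.1, seed=0):
    """Integrate-and-fire encoders driven by private noise plus a weak
    shared drive, yielding weakly correlated spike trains. Illustrative
    sketch only; parameters are not those of Halliday (1998a)."""
    rng = np.random.default_rng(seed)
    shared = (rng.random(n_steps) < 0.032).astype(float)  # common Poisson drive
    spikes = np.zeros((n_trains, n_steps), dtype=bool)
    v = np.zeros(n_trains)                                # encoder state
    threshold = 10.0
    for t in range(n_steps):
        # private exponential drive plus the weak shared component
        v += rng.exponential(0.3, n_trains) + shared_gain * shared[t]
        fired = v >= threshold
        spikes[fired, t] = True
        v[fired] -= threshold                             # reset on firing
    return spikes

trains = if_encoder_population(20, 100_000)
print(trains.mean(axis=1))   # per-train firing probability per bin
```

Because the shared component is a small fraction of each encoder's total drive, pairwise correlation between the resulting trains is weak and may be undetectable in short records, which is the property exploited in the text.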
First we examine the response of the paired motoneurone model to a population of common inputs with the same firing rate (32 spikes/sec) and COV (1.0) as the Poisson inputs used in Figures 1 and 2, except that they are weakly correlated over a broad range of frequencies. These inputs are generated using the spike generators described in Halliday (1998a), with the temporal correlation generated by a single Poisson spike train of rate 32 spikes/sec acting as common input to all spike generators. The resulting strength of correlation between sample pairs of spike trains is broadband and weak. Because of this weak temporal correlation, analysis of a sample pair of spike trains of duration 100 seconds can fail to reveal any significant correlation between inputs (Halliday, 1998a). However, combined analysis of 20 sample pairs of 100 seconds duration using the technique of pooled coherence (Amjad, Halliday, Rosenberg, & Conway, 1997; Halliday, 1998a) reveals correlation within the population of inputs up to 100 Hz, with a maximum estimated coherence of 0.013. Figure 4a shows the coherence estimate between the two output discharges, with 10% common input supplied by 100 of these correlated inputs distributed uniformly by area as before. This estimate is similar to that obtained with 60% common Poisson inputs (not
Figure 4: Response of two-cell model to populations of inputs with weak temporal correlation. Estimated coherence between output discharges for (a) 10% common correlated broadband inputs and (b) 5% common correlated periodic inputs at 10 spikes/sec and 5% common correlated periodic inputs at 25 spikes/sec applied uniformly over entire dendritic tree. Dashed horizontal lines are confidence limits as described for Figure 1. Results estimated from 800 seconds of data in (a) and 400 seconds of data in (b).
shown; cf. Figures 2c and 2d). Weakly correlated inputs exert a significantly stronger influence on the output discharge of the cell than uncorrelated inputs. The configuration with 10% weakly correlated inputs has a bandwidth in excess of 100 Hz, in contrast to the configuration with 10% uncorrelated inputs, which has no significant effect on the timing of output spikes. Next we examine the effects of applying two sets of common periodic spike trains, with each set assigned 5% of the total number of EPSPs/sec acting on the cells, consisting of 160 inputs firing at 10 spikes/sec with COV 0.1 and 64 inputs firing at 25 spikes/sec with COV 0.1. These spike trains have a similar strength of correlation to the broadband inputs above, except that the correlation is at the firing frequency of each set, 10 Hz and 25 Hz, respectively (Halliday, 1998a). The coherence estimate between the two output discharges has clear peaks at 10 Hz and 25 Hz. Thus, the output discharge supports both periodicities present in the two populations of weakly correlated common inputs. The 5% of inputs at 25 spikes/sec induce a peak coherence similar to that in Figure 3c for 40% common uncorrelated 25 spikes/sec inputs. A small percentage of weakly correlated periodic inputs is able to exert an influence on the timing of output spikes whose relative magnitude is far in excess of the percentage of input EPSPs involved. Weak stochastic correlation between inputs (which may not be detected) is converted into stronger correlation between outputs. A small change in the temporal correlation structure of a subset of the total synaptic input, without any alteration in mean firing rate, can alter the bandwidth of the cells.
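The pooled coherence technique mentioned above (Amjad et al., 1997) combines estimates from several record pairs before forming the coherence ratio. A sketch under the assumption of equal-length records, with arbitrary toy signals standing in for the spike-train pairs:

```python
import numpy as np
from scipy.signal import csd, welch

def pooled_coherence(pairs, fs=1000.0, nperseg=1024):
    """Pooled coherence across several (x, y) record pairs: sum the
    cross- and auto-spectra over pairs, then form the coherence ratio.
    Sketch assuming equal-length records (cf. Amjad et al., 1997)."""
    f12 = f11 = f22 = 0
    for x, y in pairs:
        freqs, pxy = csd(x, y, fs=fs, nperseg=nperseg)
        _, pxx = welch(x, fs=fs, nperseg=nperseg)
        _, pyy = welch(y, fs=fs, nperseg=nperseg)
        f12 = f12 + pxy
        f11 = f11 + pxx
        f22 = f22 + pyy
    return freqs, np.abs(f12) ** 2 / (f11 * f22)

# Twenty toy record pairs sharing a weak common component, so that any
# single pair shows little correlation but the pooled estimate accumulates.
rng = np.random.default_rng(1)
pairs = []
for _ in range(20):
    common = 0.1 * rng.standard_normal(100_000)   # weak shared signal
    pairs.append((common + rng.standard_normal(100_000),
                  common + rng.standard_normal(100_000)))
freqs, pooled = pooled_coherence(pairs)
```

Pooling the spectra before taking the ratio is what allows a correlation too weak to detect in any single 100-second pair to emerge from the combined analysis.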
5 Discussion
The main finding of this study is the dynamic nature of the cell bandwidth, which, for a fixed level of synaptic excitation (EPSPs/sec), exhibits systematic changes in response to changes in the number of presynaptic inputs considered to provide the input signal and in their temporal correlation characteristics. For uncorrelated inputs, neural bandwidth is determined by the percentage of the total synaptic input that is considered to provide the input signal. Populations that constitute less than 10% of the total synaptic input have no significant influence on the output discharge (see Figures 2a and 3a). In contrast, weak temporal correlation within 5–10% of the total synaptic input has considerable influence on the timing of output spikes (see Figure 4). Correlation between the output discharges occurs at the frequencies at which inputs are correlated. Weak temporal correlation between inputs is converted into stronger correlation between outputs. The question of neural bandwidth has been addressed previously. Farmer et al. (1993) used a similar two-cell simulation of a point neurone model and found that EPSPs with longer rise times and half-widths acted to filter out higher-frequency components. Based on a two-cell model with 96.5% of the total inputs common to both cells, they concluded that the bandwidth of their point neurone model was in excess of 250 Hz. Using a two-cell model of a motoneurone plus single attached dendrite with 75% of the total synaptic input common to both cells, Halliday (1995) concluded that a distal input location could result in a bandwidth of around 50 Hz. The findings of this study are consistent with these previous results and show that these different estimates of the cell bandwidth result from the different percentages of common inputs applied.
The classical view of the input-output relationship for motoneurones is that, over a wide range of discharge rates, the steady-state discharge rate is determined by a simple linear relation proportional to the effective synaptic current input to the cell (Binder, Heckman, & Powers, 1993). This study is concerned with spatiotemporal interactions on a millisecond timescale, which for large-scale synaptic input result in membrane potential fluctuations (Calvin & Stevens, 1968). It is the random nature of these fluctuations, and the interactions between the populations of common and independent inputs, that produces the modulation of neural bandwidth for the same level of average excitation. Two previous studies of the effects of input synchrony have concentrated on variations in mean firing rate (Murthy & Fetz, 1994; Bernander et al., 1994). Both sets of authors found that increased synchronization led to a decrease in output firing rate. The proposed mechanism was an "overcrowding" effect, in which excess depolarizing input above that required to fire the cell was effectively wasted during the refractory period. For the present simulations the average output firing rate was 12.32 spikes/sec for the Poisson input configurations (see Figures 1 and 2), 12.37 spikes/sec for the weakly correlated broadband inputs (see Figure 4a), and 12.16 spikes/sec for the
weakly correlated periodic inputs (see Figure 4b). This study differs from these in its use of weak, stochastically correlated inputs. Generation of spike trains using integrate-and-fire encoders results in patterns of correlation that mimic natural patterns of correlation more realistically than the highly synchronized inputs used by Murthy and Fetz (1994) and Bernander et al. (1994), where all correlated inputs fire synchronously with probability 1 in a narrow time window. Bernander et al. (1991) and Rapp et al. (1992) found that large-scale synaptic activity can alter the basic characteristics of a cell, resulting in a reduction in cell input resistance and time constant. The present model exhibits similar behavior; when excited by 31,872 EPSPs/sec, the input resistance is 3.8 MΩ, the time constant is 6.45 msec, and the average membrane potential is −55.1 mV. This compares with 4.9 MΩ, 9.7 msec, and −70 mV with no synaptic inputs. Bernander et al. (1991) concluded that large-scale background synaptic activity also resulted in increased sensitivity to synchronized inputs, which they demonstrated by contrasting the cell response to 150 inputs that were either exactly synchronized every 25 msec or distributed throughout time. There are two differences in the methods of this study. Bernander et al. (1991) and Rapp et al. (1992) replaced individual background synaptic conductances with a time-averaged constant calculated to provide the same average level of excitation to each compartment. This study is concerned with the stochastic nature of neuronal spike trains and therefore models the time course of each individual synaptic input in full for all synaptic inputs to both cells. In the absence of any other synaptic inputs, a constant background conductance will result in a constant value of membrane potential, which will not exhibit the random fluctuations observed in intracellular recordings (Calvin & Stevens, 1968; see Bernander et al., 1991, Figure 1c).
This will bias the response of the model in favor of any synaptic inputs that are applied, since these will be the only source of membrane potential fluctuations, and the cell will always respond with an increased depolarization to excitatory inputs. This is in contrast to the present model, where the synaptic inputs that provide the input signal interact and compete with the background synaptic inputs on a millisecond timescale. Theoretical studies have indicated that the addition of noise to a system can result in an improved signal-to-noise ratio, particularly for periodic signals in the presence of noise (Wiesenfeld & Moss, 1995; Bezrukov & Vodyanoy, 1997). In such a case it is important to model the stochastic nature of all inputs to the cell; indeed, in the light of this work on stochastic resonance, the presence of stochastic background synaptic activity may contribute to the extreme sensitivity of the cell to other temporally correlated synaptic inputs. The second difference between this study and that of Bernander et al. (1991) is in the form of correlated inputs. Bernander et al. (1991) used precisely synchronized inputs, in contrast to the weak, stochastic temporal correlation used in our populations of inputs. This temporal correlation is sufficiently weak that examination of a sample
pair of spike trains selected from the population may fail to reveal the presence of any correlation (Halliday, 1998a). Despite the presence of large-scale background activity, the model cell is able to detect the presence of weak, stochastic temporal correlation in a small fraction of the total input, without any change in the overall level of excitation (EPSPs/sec). This high selectivity to small changes in the temporal characteristics of subpopulations of the total synaptic input suggests that weak, stochastic temporal correlation between presynaptic inputs is a major determinant of neural bandwidth. The results of this study are also relevant to studies considering the relative merits of rate coding, coincidence detection, and temporal integration as mechanisms involved in neural coding (Shadlen & Newsome, 1994, 1995, 1998; Softky, 1995; König, Engel, & Singer, 1996), which have led to the view that temporal integration, where many PSPs contribute to the generation of each output spike, will not preserve temporal coding in input spike trains (Shadlen & Newsome, 1994; König et al., 1996). In the simulations here, around 2666 EPSPs contribute to each spike; therefore temporal integration would appear to be the dominant mode of operation. The presence of weak stochastic temporal correlation in 5–10% of the total synaptic input allows these inputs to exert a strong influence on the timing of output spikes, allowing the temporal coding in the inputs to be transmitted to the output discharges. This result is not consistent with the suggestion that temporal information will not propagate through neurones whose mode of operation is temporal integration. An alternative mode of operation proposed for neurones is coincidence detection (Abeles, 1982; Softky, 1994; König et al., 1996).
It is not clear that this is an appropriate mechanism to describe the present simulations, where weakly correlated inputs are distributed over the extensive dendritic tree, with propagation delays of up to 5 ms for distal inputs to arrive at the somatic spike generation site. In addition, the weak and stochastic nature of the correlation structure in the common inputs does not produce reliable coincidences in the input spike trains (Halliday, 1998a), but rather a diffuse temporal correlation distributed across all the spike trains in each population. We therefore suggest that the mode of operation of the present neurones is temporal integration with correlation detection. The results also demonstrate that correlation between neurones does not require any alteration in output firing rates (deCharms & Merzenich, 1996; Riehle, Grün, Diesmann, & Aertsen, 1997). Weak correlation between inputs results in stronger correlation between outputs; reciprocal connections in networks of neurones may provide a means of sustaining these oscillations. In such networks, conduction delays are an important factor (König & Schillen, 1991); however, in the present data, dendritic conduction delays of up to 5 ms do not affect the ability of the neurone to detect temporal correlation between inputs. One unresolved issue in cortical signaling is the variability exhibited in neuronal discharges, which have a COV ≈ 1.0 (Softky & Koch, 1993; Shadlen & Newsome, 1994, 1998). In this study the neurones
have a regular output discharge. However, all inputs in the example in Figure 4a have a COV ≈ 1.0, yet the cell is sensitive to weak stochastic temporal correlation between inputs distributed over the dendritic tree. Further simulation studies are necessary to ascertain whether weak stochastic temporal correlation in populations of inputs contributes to the variability in cortical neurone discharges.

Acknowledgments
This research was undertaken initially at the Information Science Division, NTT Basic Research Labs, Musashino, Tokyo. Completion was supported in part by grants from the Joint Research Council/HCI Cognitive Science Initiative and the Wellcome Trust (036928; 048128).

References

Abeles, M. (1982). Role of cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.

Amjad, A. M., Halliday, D. M., Rosenberg, J. R., & Conway, B. A. (1997). An extended difference of coherence test for comparing and combining several independent coherence estimates—Theory and application to the study of motor units and physiological tremor. Journal of Neuroscience Methods, 73, 69–79.

Baldissera, F., & Gustafsson, B. (1974). Afterhyperpolarization conductance time course in lumbar motoneurones of the cat. Acta Physiologica Scandinavica, 91, 512–527.

Bernander, Ö., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity influences spatiotemporal integration in single pyramidal cells. Proceedings of the National Academy of Sciences, 88, 11569–11573.

Bernander, Ö., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Computation, 6, 622–641.

Bezrukov, S. M., & Vodyanoy, I. (1997). Stochastic resonance in non-dynamical systems without response thresholds. Nature, 385, 319–321.

Binder, M. D., Heckman, C. J., & Powers, R. K. (1993). How different afferent inputs control motoneuron discharge and the output of the motoneuron pool. Current Opinion in Neurobiology, 3, 1028–1034.

Brillinger, D. R. (1974). Fourier analysis of stationary processes. Proceedings of the IEEE, 62, 1628–1643.

Calvin, W. H., & Stevens, C. F. (1968). Synaptic noise and other sources of randomness in motoneuron interspike intervals. Journal of Neurophysiology, 31, 574–587.

Cullheim, S., Fleshman, J. W., Glenn, L. L., & Burke, R. E. (1987). Membrane area and dendritic structure in type-identified triceps surae alpha motoneurones.
Journal of Comparative Neurology, 255, 68–81.

deCharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potential timing. Nature, 381, 610–613.
Farmer, S. F., Bremner, F. D., Halliday, D. M., Rosenberg, J. R., & Stephens, J. A. (1993). The frequency content of common synaptic inputs to motoneurones studied during voluntary isometric contraction in man. Journal of Physiology, 470, 127–155.

Fleshman, J. R., Segev, I., & Burke, R. E. (1988). Electrotonic architecture of type-identified α-motoneurones in the cat spinal cord. Journal of Neurophysiology, 60, 60–85.

Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. Journal of Computational Neuroscience, 1, 11–38.

Halliday, D. M. (1995). Effects of electrotonic spread of EPSPs on synaptic transmission in motoneurones—A simulation study. In A. Taylor, M. H. Gladden, & R. Durbaba (Eds.), Alpha and gamma motor systems (pp. 337–339). New York: Plenum.

Halliday, D. M. (1998a). Generation and characterization of correlated spike trains. Computers in Biology and Medicine, 28, 143–152.

Halliday, D. M. (1998b). Weak, stochastic temporal correlation of large scale synaptic input can modulate neuronal bandwidth. Society for Neuroscience Abstracts, 24, 1151.

Halliday, D. M., Rosenberg, J. R., Amjad, A. M., Breeze, P., Conway, B. A., & Farmer, S. F. (1995). A framework for the analysis of mixed time series/point process data—Theory and application to the study of physiological tremor, single motor unit discharges and electromyograms. Progress in Biophysics and Molecular Biology, 64, 237–278.

Holden, A. V. (1976). Models of the stochastic activity of neurones. Berlin: Springer-Verlag.

Jack, J. J. B., Noble, D., & Tsien, R. W. (1975). Electrical current flow in excitable cells (2nd ed.). Oxford: Clarendon Press.

König, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. TINS, 19, 130–137.

König, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses. I. Synchronization. Neural Computation, 3, 155–166.

Mascagni, M. V. (1989).
Numerical methods for neuronal modeling. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 439–484). Cambridge, MA: MIT Press.

Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.

Moore, G. P., Segundo, J. P., Perkel, D. H., & Levitan, H. (1970). Statistical signs of synaptic interaction in neurones. Biophysical Journal, 10, 876–900.

Murthy, V. N., & Fetz, E. E. (1994). Effects of input synchrony on the firing rate of a three-conductance cortical neuron model. Neural Computation, 6, 1111–1126.

Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophysical Journal, 7, 419–440.

Rall, W., Burke, R. E., Holmes, W. R., Jack, J. J. B., Redman, S. J., & Segev, I. (1992). Matching dendritic neuron models to experimental data. Physiological Reviews, 72 (Suppl.), S159–S186.
Rapp, M., Yarom, Y., & Segev, I. (1992). The impact of parallel fiber background activity on the cable properties of cerebellar Purkinje cells. Neural Computation, 4, 518–533.

Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.

Rinzel, J., & Rall, W. (1974). Transient response in a dendritic neuron model for current injected at one branch. Biophysical Journal, 14, 759–789.

Rosenberg, J. R., Amjad, A. M., Breeze, P., Brillinger, D. R., & Halliday, D. M. (1989). The Fourier approach to the identification of functional coupling between neuronal spike trains. Progress in Biophysics and Molecular Biology, 53, 1–31.

Rosenberg, J. R., Halliday, D. M., Breeze, P., & Conway, B. A. (1998). Identification of patterns of neuronal activity—Partial spectra, partial coherence, and neuronal interactions. Journal of Neuroscience Methods, 83, 57–72.

Segev, I., Fleshman, J. R., & Burke, R. E. (1989). Compartmental models of complex neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From synapses to networks (pp. 63–96). Cambridge, MA: MIT Press.

Segev, I., Fleshman, J. R., & Burke, R. E. (1990). Computer simulations of group Ia EPSPs using morphologically realistic models of cat α-motoneurones. Journal of Neurophysiology, 64, 648–660.

Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Current Opinion in Neurobiology, 4, 569–579.

Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Current Opinion in Neurobiology, 5, 248–250.

Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications of connectivity, computation and information coding. Journal of Neuroscience, 18, 3870–3896.

Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annual Review of Physiology, 55, 349–374.

Softky, W. R. (1994).
Sub-millisecond coincidence detection in active dendritic trees. Neuroscience, 58, 13–41.

Softky, W. R. (1995). Simple versus efficient codes. Current Opinion in Neurobiology, 5, 239–247.

Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. Journal of Neuroscience, 13, 334–350.

Wiesenfeld, K., & Moss, F. (1995). Stochastic resonance and the benefits of noise—From ice ages to crayfish and SQUIDs. Nature, 373, 33–36.

Received July 24, 1998; accepted February 26, 1999.
LETTER
Communicated by Steven Nowlan
Second-Order Learning Algorithm with Squared Penalty Term

Kazumi Saito
Ryohei Nakano*
NTT Communication Science Laboratories, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan

This article compares three penalty terms with respect to the efficiency of supervised learning, by using first- and second-order off-line learning algorithms and a first-order on-line algorithm. Our experiments showed that for a reasonably adequate penalty factor, the combination of the squared penalty term and the second-order learning algorithm drastically improves the convergence performance in comparison with the other combinations, at the same time bringing about excellent generalization performance. Moreover, in order to understand how differently each penalty term works, a function surface evaluation is described. Finally, we show how cross validation can be applied to find an optimal penalty factor.

1 Introduction
It has been found empirically that adding some penalty term to an objective function in the learning of neural networks can lead to significant improvements in network generalization. Such terms have been proposed on the basis of several viewpoints, such as weight decay (Hinton, 1987), regularization (Poggio & Girosi, 1990), function smoothing (Bishop, 1995), weight pruning (Hanson & Pratt, 1989; Ishikawa, 1996), and Bayesian priors (MacKay, 1992; Williams, 1995). Some are calculated by using simple arithmetic operations, while others use higher-order derivatives. The most important evaluation criterion for adding these terms is how much the generalization performance improves, but learning efficiency is also an important criterion in large-scale practical problems; computationally demanding terms are hardly applicable to such problems. Here, it is naturally conceivable that the effects of penalty terms depend on learning algorithms; thus, we need comparative evaluations. We have already given an outline of an evaluation of the learning performance of first- and second-order off-line learning algorithms with three penalty terms, using an artificial regression problem (Saito & Nakano, 1997b). This article gives the details of the above evaluation and also includes evaluation using an actual classification problem. It also compares the learning performance of a standard on-line learning algorithm with the three penalty terms, using both the artificial regression problem and the real classification problem. Section 2 explains the framework of the learning and shows a second-order learning algorithm with the penalty terms. Sections 3 and 4 show experimental results for the regression problem and the real classification problem, respectively. Section 5 describes a function surface evaluation, in order to explain how differently each penalty term works. Section 6 shows how cross validation can be applied to find an optimal penalty factor.

* Present address: Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan.

Neural Computation 12, 709–729 (2000) © 2000 Massachusetts Institute of Technology

2 Learning with Penalty Term

2.1 Framework. Let $\{(x_1, y_1), \dots, (x_m, y_m)\}$ be a set of examples, where $x_t$ denotes an n-dimensional input vector and $y_t$ a k-dimensional target output vector corresponding to $x_t$. In a three-layer neural network, let h be the number of hidden units, $w_j = (w_{j0}, \dots, w_{jn})^T$ be the weight vector between all the input units and the hidden unit j, and $c_i = (c_{i0}, \dots, c_{ih})^T$ be the weight
vector between all the hidden units and the output unit i; $w_{j0}$ and $c_{i0}$ are bias terms, and $x_{t0}$ is set to 1. Note that $a^T$ denotes the transpose of a vector $a$. Hereafter, the vector consisting of all parameters, $(c_1^T, \dots, c_k^T, w_1^T, \dots, w_h^T)^T$, is simply expressed as $W = (w_1, \dots, w_N)^T$, where $N = h(n+1) + k(h+1)$ denotes the dimension of $W$. Then the output value of the output unit i is defined as follows:

$$z_i(x_t; W) = c_{i0} + \sum_{j=1}^{h} c_{ij}\, \sigma(w_j^T x_t),$$
where s(u) represents a sigmoidal function, s(u) D 1 / (1 C e¡u ). First, we consider k-dimensional regression problems. Provided that each target output value independently includes noise with a mean of 0 and an unknown common standard deviation, then the learning error term can be dened by using a squared-error criterion: f (W) D
m X k 1X ( yti ¡ zi ( x t I W)) 2 . 2 tD1 iD1
Next, we consider k-class classication problems. Note that yti 2 f0, 1g because each target value indicates a class label. Provided that each example belongs to only one class, then the learning error term can be dened by using a soft-max function and a cross-entropy criterion: ! m X k X exp(zi (x t I W)) . f (W) D ¡ yti log Pk tD1 iD1 pD1 exp(zp ( x t I W))
(
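To make the framework concrete, the network output and the two error criteria can be sketched in NumPy; the function and variable names here are ours, not the authors':

```python
import numpy as np

def forward(x, W_hid, c):
    """Three-layer network output z(x; W) for one input vector x.

    W_hid: (h, n+1) hidden-unit weight vectors w_j (column 0 is the bias w_j0).
    c:     (k, h+1) output-unit weight vectors c_i (column 0 is the bias c_i0).
    x:     (n,) input vector; a constant x_{t0} = 1 is prepended for the bias.
    """
    x1 = np.concatenate(([1.0], x))            # prepend x_{t0} = 1
    s = 1.0 / (1.0 + np.exp(-(W_hid @ x1)))    # sigmoidal hidden activations
    s1 = np.concatenate(([1.0], s))
    return c @ s1                              # z_i = c_i0 + sum_j c_ij s(w_j^T x)

def squared_error(X, Y, W_hid, c):
    """f(W) = 1/2 sum_t sum_i (y_ti - z_i(x_t; W))^2  (regression)."""
    Z = np.array([forward(x, W_hid, c) for x in X])
    return 0.5 * np.sum((Y - Z) ** 2)

def cross_entropy(X, Y, W_hid, c):
    """f(W) = -sum_t sum_i y_ti log softmax(z(x_t; W))_i  (classification)."""
    Z = np.array([forward(x, W_hid, c) for x in X])
    Z = Z - Z.max(axis=1, keepdims=True)       # numerical stabilization
    logp = Z - np.log(np.exp(Z).sum(axis=1, keepdims=True))
    return -np.sum(Y * logp)
```

With all weights at zero, every hidden unit outputs 0.5, so the network output reduces to the output biases plus half the sum of the output weights.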
Second-Order Learning Algorithm
Incidentally, for classification problems, we can consider another learning error term using the squared-error criterion: $\frac{1}{2} \sum_{t=1}^{m} \sum_{i=1}^{k} (y_{ti} - \sigma(z_i(x_t; W)))^2$. However, when $z_i$ is defined as the linear output value with respect to the input vector, we can easily see that the Hessian matrix of the cross-entropy criterion is always nonnegative definite, but that of the squared-error criterion is not always so. The learning error term using the cross-entropy criterion has a unique (weak) local minimum. Thus, even if $z_i$ is the nonlinear output value of the neural network, learning with the cross-entropy error term will be easier than with the squared-error criterion. In this article, we consider the following three penalty terms:

$$V_1(W) = \frac{1}{2}\sum_{k=1}^{N} w_k^2, \qquad V_2(W) = \sum_{k=1}^{N} |w_k|, \qquad V_3(W) = \frac{1}{2}\sum_{k=1}^{N} \frac{w_k^2}{1 + w_k^2}. \tag{2.1}$$
Hereafter, $V_1$, $V_2$, and $V_3$ are referred to as the squared (Hinton, 1987; MacKay, 1992), absolute (Williams, 1995; Ishikawa, 1996), and normalized (Hanson & Pratt, 1989) penalty terms, respectively. Learning with one of these three terms can then be defined as the problem of minimizing the following objective function,

$$F_i(W) = f(W) + \mu V_i(W), \tag{2.2}$$
where $\mu$ is a penalty factor. Although each single penalty term is used on its own in our evaluation, using an adequate combination of multiple penalty terms has lately attracted considerable attention. Ripley (1996) has pointed out that a single penalty term will make sense only if the inputs and outputs have been rescaled to the range [0, 1]. In data analysis, such normalization is usually employed; thus, the single-penalty-term approach remains practical and significant due to its simplicity. One limitation of the single-penalty-term approach is that it is inconsistent with certain scaling properties of network mappings (Ripley, 1996). From this standpoint, we should consider multiple penalty factors; however, if we consider a practically meaningful scaling of inputs and outputs such as $\tilde{x}_m = a_m x_m + b_m$ and $\tilde{y}_i = c_i y_i + d_i$, we have to optimize as many penalty factors as there are inputs and outputs. The Bayesian approach (MacKay, 1992; Williams, 1995) may make it possible to estimate numerous penalty factors, but it has seen only partial implementation and considerable controversy (Ripley, 1996). Moreover, a constant penalty factor is used in our evaluation. Although the Bayesian approach can adaptively optimize penalty factors together with network weights, the following approach is still practical and useful: train several networks with different penalty factors, estimate the generalization error for each, and then choose the penalty factor that minimizes the estimated generalization error.
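The three penalty terms of equation 2.1 and the penalized objective of equation 2.2 are easy to state directly; a sketch in our notation, operating on the flattened weight vector $W$:

```python
import numpy as np

def V1(w):
    """Squared penalty (weight decay): 1/2 * sum_k w_k^2."""
    return 0.5 * np.sum(w ** 2)

def V2(w):
    """Absolute penalty: sum_k |w_k|."""
    return np.sum(np.abs(w))

def V3(w):
    """Normalized penalty: 1/2 * sum_k w_k^2 / (1 + w_k^2)."""
    return 0.5 * np.sum(w ** 2 / (1.0 + w ** 2))

def objective(f_value, w, V, mu):
    """F_i(W) = f(W) + mu * V_i(W) (equation 2.2); f_value is the
    learning error term f(W), computed elsewhere."""
    return f_value + mu * V(w)
```

Note how the normalized penalty saturates: for $|w_k| \gg 1$ its per-weight contribution approaches a constant 1/2, so large weights are penalized only weakly.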
Table 1: Classification of Supervised Learning Algorithms.

Search Direction    Step Length: Fixed         Step Length: Variable
First order         Backpropagation, etc.      Silva-Almeida method, etc.
Second order        Newton method              SCG, OSS, BPQ, etc.

Note: SCG = scaled conjugate gradient; OSS = one-step secant; BPQ = backpropagation based on quasi-Newton.
2.2 Learning Algorithms. Since a large number of algorithms have been proposed for supervised neural network learning, it is almost impossible to evaluate the learning performance of every algorithm in combination with each penalty term. In general, supervised learning algorithms can be classified into on-line and off-line algorithms. In this article, as a representative on-line algorithm, the on-line backpropagation (BP) algorithm with a fixed learning rate is used for evaluation. The basic characteristics of off-line algorithms, on the other hand, are determined by the methods for calculating a search direction and its step length. Thus, as shown in Table 1, we classify the search-direction calculation methods into first and second order, and the step-length calculation methods into fixed and variable; a representative algorithm from each category is then used for evaluation. Since it is well known that the off-line BP algorithm with a fixed step length (learning rate) is very inefficient, and Newton techniques are hardly applicable even to midscale problems, hereafter we consider only the first- and second-order methods whose step length is adjusted.

2.3 Second-Order Algorithm with Penalty Term. Among the second-order algorithms shown in Table 2, each of the existing representative methods has advantages and disadvantages with respect to applicability to large-scale problems and efficiency under an inaccurate step length: second-order algorithms based on Levenberg-Marquardt or quasi-Newton methods cannot suitably scale up to large problems,¹ while a line search to calculate an adequate step length is indispensable for second-order algorithms based on quasi-Newton or conjugate gradient methods.² In order to make the most of the second-order approach, we employ a newly invented second-order learning algorithm based on a quasi-Newton method (BPQ) (Saito & Nakano, 1997a).
¹ Levenberg-Marquardt algorithms require $O(N^2 m)$ operations to calculate the descent direction in each iteration; standard quasi-Newton algorithms require $O(N^2)$ storage space to maintain an approximation to the inverse Hessian matrix.

² Successful convergence of conjugate gradient algorithms relies heavily on the accuracy of the line search; thus, they generally incur a nonnegligible computational load, whereas successful convergence of quasi-Newton algorithms is theoretically guaranteed even with an inaccurate line search if certain conditions are satisfied.
Table 2: Second-Order Learning Methods.

Learning Algorithm           Applicable Scale of Network    Required Accuracy for Step-Length Calculation
Newton method                Small                          Unnecessary to calculate
Quasi-Newton method          Middle                         Moderate
Conjugate gradient method    Moderately large               High
Although we could employ other second-order learning algorithms such as scaled conjugate gradient (Møller, 1993) or one-step secant (Battiti, 1992), BPQ worked the most efficiently among them in our own experience (Saito & Nakano, 1997a). In BPQ, the descent direction, $\Delta W$, is calculated on the basis of a partial Broyden-Fletcher-Goldfarb-Shanno (BFGS) update, and a reasonably accurate step length, $\lambda$, is efficiently calculated as the minimal point of a second-order approximation. The partial BFGS update can be applied directly, while the step length $\lambda$ is evaluated as follows:

$$\lambda = \frac{-\nabla F_i(W)^T \Delta W}{\Delta W^T \nabla^2 F_i(W)\, \Delta W} = \frac{-\nabla F_i(W)^T \Delta W}{\Delta W^T \nabla^2 f(W)\, \Delta W + \mu\, \Delta W^T \nabla^2 V_i(W)\, \Delta W}. \tag{2.3}$$
The quadratic form for the training error term, $\Delta W^T \nabla^2 f(W)\, \Delta W$, can be calculated efficiently with a computational complexity of $Nm + O(hm)$ by using the BPQ procedure, while the quadratic forms for the penalty terms are calculated as follows:

$$\Delta W^T \nabla^2 V_1(W)\, \Delta W = \sum_{k=1}^{N} \Delta w_k^2, \qquad \Delta W^T \nabla^2 V_2(W)\, \Delta W = 0,$$

$$\Delta W^T \nabla^2 V_3(W)\, \Delta W = \sum_{k=1}^{N} \frac{(1 - 3 w_k^2)\, \Delta w_k^2}{(1 + w_k^2)^3}. \tag{2.4}$$
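The penalty-term quadratic forms of equation 2.4 and the step length of equation 2.3 are cheap to evaluate. A sketch in our notation; the quadratic form for the training error term, produced by the BPQ procedure itself, is taken here as a given input:

```python
import numpy as np

def penalty_quad_form(i, w, dw):
    """Delta W^T grad^2 V_i(W) Delta W (equation 2.4)."""
    if i == 1:
        return np.sum(dw ** 2)     # squared penalty: always nonnegative
    if i == 2:
        return 0.0                 # absolute penalty: zero curvature
    if i == 3:                     # normalized: negative where |w_k| > 1/sqrt(3)
        return np.sum((1.0 - 3.0 * w ** 2) * dw ** 2 / (1.0 + w ** 2) ** 3)
    raise ValueError("i must be 1, 2, or 3")

def step_length(grad_F, dw, quad_f, quad_V, mu):
    """lambda of equation 2.3: minimizer of the second-order approximation
    of F_i along the descent direction Delta W. quad_f stands for
    Delta W^T grad^2 f(W) Delta W, supplied by the BPQ procedure."""
    return -(grad_F @ dw) / (quad_f + mu * quad_V)
```

Evaluating `penalty_quad_form(3, ...)` at weights with $|w_k| > 1/\sqrt{3}$ makes its sign flip visible, which is exactly the undesirable feature of the normalized term noted below.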
Note that in the step-length calculation, $\Delta W^T \nabla^2 F_i(W)\, \Delta W$ is expected to be positive. The three terms affect $\Delta W^T \nabla^2 F_i(W)\, \Delta W$ differently: the squared penalty term always adds a nonnegative value, the absolute penalty term has no effect, and the normalized penalty term may add a negative value if many weight values are larger than $1/\sqrt{3}$. This indicates that the squared penalty term has a very desirable property.

3 Experiments Using Regression Problem

3.1 Nonlinear Regression Problem. The learning performance of adding a penalty term was evaluated using a regression problem for the function $y = (1 - x + 2x^2)\, e^{-0.5 x^2}$. In the experiment, a value of $x$ was randomly generated in the range $[-4, 4]$, and the corresponding value of $y$ was calculated from $x$; each value of $y$ was then corrupted by additive gaussian noise with a mean of 0 and a standard deviation of 0.2. The total number of training examples was set to 30. The number of hidden units was set to 5; the initial values of the weights between the input and hidden units were independently generated from a normal distribution with a mean of 0 and a standard deviation of 1, the initial values of the weights between the hidden and output units were set to 0, and the bias of the output unit was initialized to the average output value of all training examples. The iteration was terminated when the gradient vector became sufficiently small ($\|\nabla F_i(W)\|^2 / N < 10^{-12}$) or the total processing time exceeded 100 seconds. Our experiments using the regression problem were done on SUN S-4/20 computers. The penalty factor $\mu$ was gradually changed from $2^0$ to $2^{-19}$ by repeatedly multiplying by $2^{-1}$; trials were performed 20 times for each penalty factor. Figure 1 shows the training examples, the true function, and a function obtained after learning without a penalty term. We can see that such learning overfitted the training examples to some degree.

Figure 1: Nonlinear regression problem.

3.2 Evaluation Using Second-Order Algorithm. Using the BPQ, an evaluation was made after adding each penalty term. Figure 2 compares the generalization performance, evaluated by the average RMSE (root mean squared error) and its standard deviation over a set of 5000 newly generated test examples.³ The best possible RMSE level is 0.2 because each test example includes the same amount of gaussian noise as each training example.
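The data-generation protocol just described can be reproduced as follows (the random seed is our choice; the paper does not specify one):

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility

def true_function(x):
    """Target function y = (1 - x + 2x^2) exp(-0.5 x^2)."""
    return (1.0 - x + 2.0 * x ** 2) * np.exp(-0.5 * x ** 2)

# 30 training examples: x uniform in [-4, 4], gaussian noise N(0, 0.2^2) on y
x_train = rng.uniform(-4.0, 4.0, size=30)
y_train = true_function(x_train) + rng.normal(0.0, 0.2, size=30)

# 5000 fresh examples for estimating the generalization RMSE
x_test = rng.uniform(-4.0, 4.0, size=5000)
y_test = true_function(x_test) + rng.normal(0.0, 0.2, size=5000)
```

Because every $y$ carries noise with standard deviation 0.2, no model can reach a test RMSE below about 0.2, which is the floor cited in the text.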
For each penalty term, the generalization performance improved when $\mu$ was set adequately, but the normalized penalty term was the most unstable of the three because it frequently got stuck in undesirable local minima.
³ All graphs in this article include results of unconverged trials within the maximum processing time.
Figure 2: Generalization performance using the BPQ for the regression problem.
Figure 3 compares the statistics of the processing time until convergence, where plots are omitted when the standard deviation is 0. In comparison to learning without a penalty term, the squared penalty term drastically decreased the processing time, especially when $\mu$ was large, while the absolute penalty term did not converge when $\mu$ was in $[2^{-7}, 2^0]$; the normalized penalty term generally required much more processing time. Thus, only the squared penalty term improved the convergence performance, by a factor of roughly 2 to 100, while keeping better generalization performance for an adequate penalty factor. Figure 4 shows the learning results using the squared penalty term with different penalty factors. The figure shows that the result with the large penalty factor $\mu = 2^{-3}$ underfitted the training examples, while the result with the small one, $\mu = 2^{-15}$, overfitted them. We can see that the result with an adequate penalty factor, $\mu = 2^{-6}$, closely approximated the true function.
Figure 3: Convergence performance using the BPQ for the regression problem.
3.3 Evaluation Using Off-line BP. Using the off-line BP, a similar evaluation was made after adding each penalty term. Here, we adopted Silva and Almeida's (1990) learning-rate adaptation rule: the learning rate $\eta_k$ for each weight $w_k$ is adjusted according to the signs of two successive gradient values.⁴ Figure 5 compares the generalization performance, and Figure 6 compares the processing time until convergence; trials without a penalty term are not displayed because none of them converged within 100 seconds. For each penalty term, the generalization performance improved when $\mu$ was set adequately. Note that the off-line BP with the squared penalty term $V_1$ required more processing time than the BPQ with $V_1$. As for the normalized penalty term $V_3$, the off-line BP with $V_3$ worked more stably than the BPQ with $V_3$.
⁴ The increasing and decreasing parameters were set to 1.1 and 1/1.1, respectively, as recommended by Silva and Almeida (1990); if the value of the objective function increases, all learning rates are halved until the value decreases.
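A minimal sketch of this sign-based adaptation rule, implemented from the description above (function names are ours):

```python
import numpy as np

UP, DOWN = 1.1, 1.0 / 1.1   # parameters recommended by Silva and Almeida (1990)

def adapt_rates(eta, grad, prev_grad):
    """Grow eta_k when two successive gradient components agree in sign;
    shrink eta_k when they disagree."""
    return np.where(grad * prev_grad > 0, eta * UP, eta * DOWN)

def on_increase(eta):
    """If the objective value increases, all learning rates are halved
    (repeated until the value decreases again)."""
    return 0.5 * eta
```

The per-weight rates let well-behaved directions accelerate while oscillating ones are damped, which is what makes the method a "variable step length" entry in Table 1.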
Figure 4: Learning results of a squared penalty term for the regression problem.
Figure 5: Generalization performance using the off-line BP for the regression problem.

3.4 Evaluation Using On-line BP. Using the on-line BP, a similar evaluation was made after adding each penalty term, with the maximum processing time set to 1000 seconds to allow enough processing time. Here, the learning rate $\eta$ was set to 0.01, 0.001, or 0.0001, and the penalty factor $\mu$ was gradually changed from 0.01 to $0.01 \times 2^{-19}$ by repeatedly multiplying by $2^{-1}$. Since on-line learning algorithms perform a weight update for each single example, each penalty operation on the weight values should also be performed for each single example. Consequently, the adequate penalty factors for on-line algorithms are generally much smaller than those for off-line algorithms. Figure 7 compares the generalization performance, where large RMSE values are omitted to keep the graphs on the same scale. As for the standard deviation, since the graphs for the three learning rates were almost the same, only the graph for $\eta = 0.01$ is shown. This figure shows that for the same learning rate, the generalization-performance curves with each penalty term were very similar. Moreover, the best generalization performance with each penalty term was almost equivalent to that of the on-line BP without a penalty term. As for the convergence performance, no trials converged within 1000 seconds. This indicates that the on-line BP with a penalty term could not reduce the error efficiently; it was even worse than the off-line BP.

3.5 Consideration. As a consequence of the evaluation using the regression problem, among all the combinations, the combination of $V_1$ and the BPQ performed best when considering both the generalization performance and the convergence performance. Incidentally, the generalization performance of the on-line or off-line BP without a penalty term was substantially better than that of the BPQ without one. We conjecture that this is because the effect of early stopping (Bishop, 1995) worked for both, and more so for the on-line BP. In this problem, the convergence of the BP is so slow that the effect of early stopping is inevitable.
Figure 6: Convergence performance using the off-line BP for the regression problem.
On the other hand, the choice of maximum processing time in our experiments is considered sufficient and reasonable. Figure 8 shows the learning curves of the BPQ, the off-line BP, and the on-line BP, each without a penalty term, plotted as the average RMSE over the training examples with respect to processing time. This figure indicates that even if the maximum processing time were increased, most of our experimental results would be unchanged, because each of the learning curves is almost saturated within the maximum processing time.

4 Experiments Using Classification Problem
We performed similar evaluations using a real handwritten numeral recognition problem, in which each numeral, 0–9 ($k = 10$), is described by reduced Glucksman's features ($n = 16$) (Glucksman, 1967; Ishii, 1983). In the experiments, among 4000 examples (400 per class), 2000 examples (200 per class) were used for training, and the remaining examples were used for testing; the generalization performance was evaluated by the misclassification rate. The experimental conditions were exactly the same as in section 3, except that the number of hidden units was set to 10 and the maximum processing time was set to 200 seconds. Our experiments using the classification problem were done on HP C180 computers.

Figure 7: Generalization performance using the on-line BP for the regression problem.

Figure 8: Learning curves for the regression problem.

Figures 9a and 9b show learning results with each of the three penalty terms using second- and first-order off-line learning algorithms, respectively. Here, the penalty factor $\mu$ was gradually changed from 40 to $40 \times 2^{-9}$ by repeatedly multiplying by $2^{-1}$. We can see that for each penalty term, the generalization performance was greatly improved when $\mu$ was set adequately. The misclassification rate with the squared penalty term $V_1$ is very similar to that with the normalized penalty term $V_3$, but the former showed the best generalization performance in this experiment.

Figure 9: Generalization performance for the classification problem.

Figure 10 shows learning results with each of the three penalty terms using the on-line BP, whose learning rate $\eta$ was set to 0.1, 0.01, or 0.001. Here, the penalty factor $\mu$ was gradually changed from 0.002 to $0.002 \times 2^{-19}$ by repeatedly multiplying by $2^{-1}$. We can see that for each penalty term, the generalization performance was also improved when $\mu$ was set adequately. However, in contrast to the experiments using the off-line algorithms, the best generalization performance with each penalty term was only slightly better than the generalization performance without a penalty term.
Figure 10: Generalization performance using the on-line BP for the classification problem.
Figure 11: Convergence performance using the BPQ with $V_1$ for the classification problem.
Except for the case of the on-line BP with its learning rate set to 0.1, the generalization performance of the on-line or off-line BP without a penalty term was substantially better than that of the BPQ without one; the effect of early stopping may work clearly for large-scale problems. Incidentally, for all trials except those of the off-line algorithms with the penalty factor set to 40 or 20, the standard deviations of the generalization misclassification rates were less than 1%. As for the convergence performance, only the combination of the squared penalty term $V_1$ and the second-order algorithm BPQ could make the gradient vector sufficiently small; the runs with the other combinations were terminated by the maximum processing time. Figure 11 shows the convergence performance of the BPQ with $V_1$. Thus, when considering both the generalization performance and the convergence performance, these experiments using the classification problem also show that the combination of $V_1$ and the BPQ performed the best learning.

5 Comparison of Function Surfaces
In order to examine graphically why the effects of the penalty terms differ, we designed a simple experiment: learning a function $y = \sigma(w_1 x) + \sigma(w_2 x)$, where only the two weights $w_1$ and $w_2$ are adjustable. In this three-layer network, shown in Figure 12, the input and output layers each consist of a single unit, and the hidden layer consists of two units. Note that the weights between the hidden units and the output unit are fixed at 1, there are no biases, and the activation function of the hidden units is $\sigma(x) = 1/(1 + \exp(-x))$. Each target value $y_t$ was calculated from the corresponding input value $x_t \in \{-0.2, -0.1, 0, 0.1, 0.2\}$ by setting $(w_1, w_2) = (1, 3)$.
Figure 12: A two-weight learning problem.
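The two-weight objective surface can be computed directly. A sketch for generating contour data of the kind shown in Figure 13 (the grid range is our guess, chosen to cover the start point and the optimum; it is not taken from the paper):

```python
import numpy as np

def sigma(u):
    return 1.0 / (1.0 + np.exp(-u))

# Targets generated from the true weights (w1, w2) = (1, 3)
xs = np.array([-0.2, -0.1, 0.0, 0.1, 0.2])
ys = sigma(1.0 * xs) + sigma(3.0 * xs)

def F(w1, w2, V, mu):
    """Objective f + mu*V on the (w1, w2) plane; accepts scalars or grids."""
    err = 0.0
    for x, y in zip(xs, ys):
        err = err + 0.5 * (y - (sigma(w1 * x) + sigma(w2 * x))) ** 2
    return err + mu * V(w1, w2)

V1 = lambda w1, w2: 0.5 * (w1 ** 2 + w2 ** 2)                  # squared
V2 = lambda w1, w2: np.abs(w1) + np.abs(w2)                    # absolute
V3 = lambda w1, w2: 0.5 * (w1 ** 2 / (1 + w1 ** 2)
                           + w2 ** 2 / (1 + w2 ** 2))          # normalized

# Contour grid covering the start point (-1, -3) and the optimum (1, 3)
W1, W2 = np.meshgrid(np.linspace(-4, 4, 81), np.linspace(-4, 4, 81))
Z = F(W1, W2, V1, 0.1)
```

Evaluating `F` with `V1`, `V2`, and `V3` on the same grid reproduces the qualitative shapes discussed next: ovals, a square-like surface with a gradient discontinuity at the axes, and a valley.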
Figure 13 shows the learning trajectories on contour maps of the objective with respect to $w_1$ and $w_2$ during 100 iterations starting at $(w_1, w_2) = (-1, -3)$, where the penalty factor $\mu$ was set to 0.1 or 0.01. Here the BPQ was used as the learning algorithm. The contours for the squared penalty term form ovals, making the BPQ learn easily. When $\mu = 0.1$, the contours for the absolute penalty term form an almost square shape, and the learning trajectories oscillate near the origin ($w_1 = w_2 = 0$) owing to the discontinuity of the gradient function. The contours for the normalized penalty term form a valley, making BPQ's learning more difficult. To evaluate the learning results graphically, the number of weights was set to two here, but the property of each penalty term extends naturally to higher dimensions: even if the number of weights is larger than two, these properties are considered to be the same. Moreover, these considerations coincide with the experimental results of sections 3 and 4: the combination of the squared penalty term $V_1$ and the second-order learning algorithm BPQ performed the best learning.

6 Determining Penalty Factor
In general, for a given problem, we cannot know an adequate penalty factor in advance. Given a limited number of examples, we must find a reasonably adequate penalty factor. The procedure of cross-validation (Stone, 1978) is adopted for this purpose. Since we found that the combination of the squared penalty term $V_1$ and the second-order algorithm BPQ works very efficiently, hereafter we consider only this combination.

Cross-validation was implemented as a leave-one-out method: the learning error term without the $h$th example is defined as

$$f^{(h)}(W) = \frac{1}{2} \sum_{t=1,\, t \neq h}^{m} \left( y_t - z(x_t; W) \right)^2,$$

and the parameters minimizing the objective function with the penalty term are defined as

$$\hat{W}^{(h)} = \arg\min_W \left( f^{(h)}(W) + \mu V_1(W) \right).$$

Figure 13: Graphical evaluation of function surfaces for the two-weight learning problem.

Then the cross-validation error is

$$\mathrm{RMSE}^{(CV)}(\mu) = \sqrt{\frac{1}{m} \sum_{h=1}^{m} \left( y_h - z(x_h; \hat{W}^{(h)}) \right)^2}.$$

Thus, the optimal penalty factor can be found by gradually changing the factor:

$$\mu_{\mathrm{opt}} = \arg\min_{\mu} \mathrm{RMSE}^{(CV)}(\mu).$$
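The leave-one-out procedure can be sketched as follows. Since reproducing BPQ is beyond a short example, a one-weight linear model with the squared penalty, which has a closed-form minimizer, stands in for the neural network here; the structure of the loop is the point:

```python
import numpy as np

def train_ridge(xs, ys, mu):
    """Stand-in learner: one weight w minimizing
    1/2 * sum_t (y_t - w x_t)^2 + mu * w^2 / 2.
    (The paper trains a neural network with BPQ; a closed-form
    model keeps this sketch self-contained.)"""
    return np.sum(xs * ys) / (np.sum(xs ** 2) + mu)

def predict(x, w):
    return w * x

def loo_cv_rmse(xs, ys, mu):
    """Leave-one-out estimate RMSE^(CV)(mu)."""
    m = len(xs)
    sq = 0.0
    for h in range(m):
        keep = np.arange(m) != h                   # drop the h-th example
        w_h = train_ridge(xs[keep], ys[keep], mu)  # refit without it
        sq += (ys[h] - predict(xs[h], w_h)) ** 2
    return np.sqrt(sq / m)

def pick_mu(xs, ys, mus):
    """mu_opt = argmin_mu RMSE^(CV)(mu) over a grid of candidate factors."""
    scores = [loo_cv_rmse(xs, ys, mu) for mu in mus]
    return mus[int(np.argmin(scores))]
```

With $m$ candidate factors and $m$ examples, this naive loop trains $m$ models per factor; the warm-start trick described next makes each refit cheap for the neural network case.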
Figure 14: Generalization performance using cross validation for the regression problem.
On the other hand, the initial weight values for evaluating the cross-validation error were set to the learning results over the entire set of examples:

$$\hat{W} = \arg\min_W \left( f(W) + \mu V_1(W) \right).$$

The advantage of this initialization is that $\hat{W}^{(h)}$ can be calculated with a light computational load, because the difference between $\hat{W}$ and $\hat{W}^{(h)}$ is generally considered to be small. Moreover, although neural network learning generally has several local minima, by searching for $\hat{W}^{(h)}$ in the neighborhood of $\hat{W}$, the cross-validation error is expected to be stable.

Experiments were performed using the above regression problem with exactly the same experimental conditions as in section 3. Figure 14 compares the generalization error and the cross-validation error using averages and standard deviations. Although the cross-validation error was a pessimistic estimator of the generalization error, it showed the same tendency and was minimized at almost the same penalty factor. Moreover, since the standard deviations of the results were very small, we can see that this learning was stable. Figure 15 shows the average processing time and its standard deviation. Although the processing time includes the cross-validation evaluation, we can see that the learning was performed quite efficiently.

Figure 15: Convergence performance using cross validation for the regression problem.

7 Conclusion
This article investigated the efficiency of supervised learning with each of three penalty terms, using first- and second-order off-line learning algorithms and a standard on-line algorithm. Our experiments showed that for a reasonably adequate penalty factor, the combination of the squared penalty term and the second-order algorithm drastically improves the convergence performance in comparison to the other combinations, while achieving excellent generalization performance. With other second-order learning algorithms such as SCG or OSS, similar results are likely, because the main difference between the BPQ and those algorithms involves only the learning efficiency. In the future, we plan to perform evaluations using larger-scale problems.

Acknowledgments
We thank Kenichiro Ishii for providing handwritten numeral recognition data.
References

Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4(2), 141–166.

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Glucksman, H. (1967). Classification of mixed-font alphabetics by characteristic loci. In IEEE Computer Conference (pp. 138–141).

Hanson, S., & Pratt, L. (1989). Comparing biases for minimal network construction with back-propagation. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 177–185). San Mateo, CA: Morgan Kaufmann.

Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. deBakker, A. Nijman, & P. Treleaven (Eds.), Proceedings PARLE Conference on Parallel Architectures and Languages Europe (pp. 1–13). Berlin.

Ishii, K. (1983). Generation of distorted characters and its applications. Systems and Computers in Japan, 14(6), 19–27.

Ishikawa, M. (1996). Structural learning with forgetting. Neural Networks, 9(3), 509–521.

MacKay, D. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447.

Møller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525–533.

Poggio, T., & Girosi, F. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247, 978–982.

Ripley, B. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.

Saito, K., & Nakano, R. (1997a). Partial BFGS update and efficient step-length calculation for three-layer neural networks. Neural Computation, 9(1), 123–141.

Saito, K., & Nakano, R. (1997b). Second-order learning algorithm with squared penalty term. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 627–633). Cambridge, MA: MIT Press.

Silva, F., & Almeida, L. (1990). Speeding up backpropagation. In R. Eckmiller (Ed.), Advanced neural computers (pp. 151–160). Amsterdam: North-Holland.

Stone, M. (1978). Cross-validation: A review. Operationsforsch. Statist. Ser. Statistics B, 9(1), 111–147.

Williams, P. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1), 117–143.

Received March 30, 1998; accepted February 4, 1999.
Neural Computation, Volume 12, Number 4
CONTENTS Articles
Minimizing Binding Errors Using Learned Conjunctive Features Bartlett W. Mel and József Fiser
731
The Multifractal Structure of Contrast Changes in Natural Images: From Sharp Edges to Textures Antonio Turiel and Néstor Parga
763
Note
Exponential or Polynomial Learning Curves? Case-Based Studies Hanzhong Gu and Haruhisa Takahashi
795
Letters
Training Feedforward Neural Networks with Gain Constraints Eric Hartman
811
Variational Learning for Switching State-Space Models Zoubin Ghahramani and Geoffrey E. Hinton
831
Retrieval Properties of a Hopfield Model with Random Asymmetric Interactions Zhang Chengxiang, Chandan Dasgupta, and Manoranjan P. Singh
865
On “Natural” Learning and Pruning in Multilayered Perceptrons Tom Heskes
881
Synthesis of Generalized Algorithms for the Fast Computation of Synaptic Conductances with Markov Kinetic Models in Large Network Simulations Michele Giugliano
903
Hierarchical Bayesian Models for Regularization in Sequential Learning J. F. G. de Freitas, M. Niranjan, and A. H. Gee
933
Sequential Monte Carlo Methods to Train Neural Network Models J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet
955
ARTICLE
Communicated by Shimon Edelman
Minimizing Binding Errors Using Learned Conjunctive Features

Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.

József Fiser
Department of Brain and Cognitive Sciences, University of Rochester, Rochester, NY 14627-0268, U.S.A.

We have studied some of the design trade-offs governing visual representations based on spatially invariant conjunctive feature detectors, with an emphasis on the susceptibility of such systems to false-positive recognition errors—Malsburg's classical binding problem. We begin by deriving an analytical model that makes explicit how recognition performance is affected by the number of objects that must be distinguished, the number of features included in the representation, the complexity of individual objects, and the clutter load, that is, the amount of visual material in the field of view in which multiple objects must be simultaneously recognized, independent of pose, and without explicit segmentation. Using the domain of text to model object recognition in cluttered scenes, we show that with corrections for the nonuniform probability and nonindependence of text features, the analytical model achieves good fits to measured recognition rates in simulations involving a wide range of clutter loads, word sizes, and feature counts. We then introduce a greedy algorithm for feature learning, derived from the analytical model, which grows a representation by choosing those conjunctive features that are most likely to distinguish objects from the cluttered backgrounds in which they are embedded. We show that the representations produced by this algorithm are compact, decorrelated, and heavily weighted toward features of low conjunctive order.
Our results provide a more quantitative basis for understanding when spatially invariant conjunctive features can support unambiguous perception in multiobject scenes, and lead to several insights regarding the properties of visual representations optimized for specific recognition tasks.

1 Introduction
Neural Computation 12, 731–762 (2000)

The problem of classifying objects in visual scenes has remained a scientific and a technical holy grail for several decades. In spite of intensive research in
the field of computer vision and a large body of empirical data from the fields of visual psychophysics and neurophysiology, the recognition competence of a two-year-old human child remains unexplained as a theoretical matter and well beyond the technical state of the art. This fact is surprising given that (1) the remarkable speed of recognition in the primate brain allows for only very brief processing times at each stage in a pipeline containing only a few stages (Potter, 1976; Oram & Perrett, 1992; Heller, Hertz, Kjær, & Richmond, 1995; Thorpe, Fize, & Marlot, 1996; Fize, Boulanouar, Ranjeva, Fabre-Thorpe, & Thorpe, 1998), (2) the computations that can be performed within each neural processing stage are strongly constrained by the structure of the underlying neural tissue, about which a great deal is known (Hubel & Wiesel, 1968; Szentagothai, 1977; Jones, 1981; Gilbert, 1983; Van Essen, 1985; Douglas & Martin, 1998), (3) the response properties of neurons in each of the relevant brain areas have been well studied, and evolve from stage to stage in systematic ways (Oram & Perrett, 1994; Logothetis & Sheinberg, 1996; Tanaka, 1996), and (4) computer systems may already be powerful enough to emulate the functions of these neural processing stages, were it only known what exactly to do. A number of neuromorphic approaches to visual recognition have been proposed over the years (Pitts & McCulloch, 1947; Fukushima, Miyake, & Ito, 1983; Sandon & Uhr, 1988; Zemel, Mozer, & Hinton, 1990; Le Cun et al., 1990; Mozer, 1991; Swain & Ballard, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Schiele & Crowley, 1996; Mel, 1997; Weng, Ahuja, & Huang, 1997; Lang & Seitz, 1997; Wallis & Rolls, 1997; Edelman & Duvdevani-Bar, 1997).
Many of these neurally inspired systems involve constructing banks of feature detectors, often called receptive fields (RFs), each of which is sensitive to some localized spatial configuration of image cues but invariant to one or more spatial transformations of its preferred stimulus—typically including invariance to translation, rotation, scale, or spatial distortion, as specified by the task.¹ Since the goal to extract useful invariant features is common to most conventional approaches to computer vision as well, the "neuralness" of a recognition system lies primarily in its emphasis on feedforward, hierarchical construction of the invariant features, where the computations are usually restricted to simple spatial conjunction and disjunction operations (e.g., Fukushima et al., 1983), and/or in its inclusion of a relatively large number of invariant features in the visual representation (see Mozer, 1991; Califano & Mohan, 1994; Mel, 1997)—anywhere from hundreds to millions. Neural approaches typically also involve some form of learning, whether supervised by category labels at the output layer (Le Cun et al., 1990), supervised directly within

¹ A number of the above-mentioned systems have emphasized other mechanisms instead of or in addition to straightforward image filtering operations (Zemel et al., 1990; Mozer, 1991; Hummel & Biederman, 1992; Lades et al., 1993; Califano & Mohan, 1994; Edelman & Duvdevani-Bar, 1997).
Minimizing Binding Errors
intermediate network layers (Fukushima et al., 1983), or involving purely unsupervised learning principles that home in on features that occur frequently in target objects (Fukushima et al., 1983; Zemel et al., 1990; Wallis & Rolls, 1997; Weng et al., 1997). While architectures of this general type have performed well in a variety of difficult, though limited, recognition problems, it has yet to be proved that a few stages of simple feedforward filtering operations can explain the remarkable recognition and classification capacities of the human visual system, which is commonly confronted with scenes acquired from varying viewpoints, containing multiple objects drawn from thousands of categories, and appearing in an infinite panoply of spatial configurations. In fact, there are reasons for pessimism.

1.1 The Binding Problem. One of the most persistent worries concerning this class of architecture is due to Malsburg (1981), who noted the potential for ambiguity in visual representations constructed from spatially invariant filters. A spatial binding problem arises, for example, when each detector in the visual representation reports only the presence of some elemental object attribute but not its spatial location (or, more generally, its pose). Under these unfavorable conditions, an object is hallucinated (i.e., detected when not actually present) whenever all of its elemental features are present in the visual field, even when embedded piecemeal in improper configurations within a scattered coalition of distractor objects (see Figure 1A). This type of failure mode for a visual representation emanates from the dual pressures to cope with uncontrolled variation in object pose, which forces the use of detectors with excessive spatial invariances, and the necessity to process multiple objects simultaneously, which overstimulates the visual representation and increases the probability of ambiguous perception.
One approach to the binding problem is to build a separate, full-order conjunctive detector for every possible view (e.g., spatial pose, state of occlusion, distortion, degradation) of every possible object, and then for each object provide a huge disjunction that pools over all of the views of that object. Malsburg (1981) pointed out that while the binding problem could thus be solved, the remedy is a false one since it leads to a combinatorial explosion in the number of needed visual processing operations (see Figure 1B). Other courses of action include (1) strategies for image preprocessing designed to segment scenes into regions containing individual objects, thus reducing the clutter load confronting the visual representation, and (2) explicit normalization procedures to reduce pose uncertainty (e.g., centering, scaling, warping), thereby reducing the invariance load that must be sustained by the individual receptive fields. Both strategies reduce the probability that any given object's feature set will be activated by a spurious collection of features drawn from other objects.

Figure 1: (A) A binding problem occurs when an object's representation can be falsely activated by multiple (or inappropriate) other objects. (B) A combinatorial explosion arises when all possible views of all possible objects are represented by explicit conjunctions. (C) Compromise representation containing an intermediate number of receptive fields that bind image features to intermediate order. [Panel annotations: (A) "Binding Problem": too few detectors, with excessive spatial pooling; (B) "Combinatorial Explosion": too many high-order conjunctions needed; multiple inputs activate the same internal code; (C) "Wickelgren's Compromise": a tractable number of intermediate-order conjunctions; the binding problem arises, but less frequently.]

1.2 Wickelsystems. Preprocessing strategies aside, various workers have explored the middle ground between "binding breakdown" and "combinatorial catastrophe," where a visual representation contains an intermediate number of detectors, each of which binds together an intermediate-sized subset of feature elements in their proper relations before integrating out the irrelevant spatial transformations (see Mozer, 1991, for a thoughtful advocacy of this position). Under this compromise, each detector becomes a spatially invariant template for a specific minipattern, which is more complex than an elemental feature, such as an oriented edge, but simpler than a full-blown view of an object (see Figure 1C). A set of spatially invariant features of intermediate complexity can be likened to a box of jumbled jigsaw puzzle pieces: although the final shape of the composite object—the puzzle—is not explicitly contained in this peculiar "representation," it is nonetheless unambiguously specified: the collection of parts (features) can interlock in only one way. An important early version of this idea is due to Wickelgren (1969), who proposed a scheme for representing the phonological structure of spoken language involving units sensitive to contiguous triples of phonological features but not to the absolute position of the tuples in the speech stream. Though not phrased in this language, "Wickelfeatures" were essentially translation-invariant detectors of phonological minipatterns of intermediate complexity, analyzing the one-dimensional speech signal. Representational schemes inspired by Wickelgren's proposal have since cropped up in other influential models of language processing (McClelland & Rumelhart, 1981; Mozer, 1991). The use of invariant conjunctive feature schemes to address representational problems in fields other than vision highlights the fact that the binding problem is not an inherently visuospatial one. Rather, it can arise whenever the set of features representing an object (visual, auditory, olfactory, etc.)
can be fully, and thus inappropriately, activated by input "scenes" that do not contain the object. In general, a binding problem has the potential to exist whenever a representation lacks features that conjoin an object's elemental attributes (e.g., parts, spatial relations, surface properties) to sufficiently high conjunctive order. On the solution side, binding ambiguity can be mitigated as in the scenario of Figure 1 by building conjunctive features that are increasingly object specific, until the probability of false-positive activation for any object is driven below an acceptable threshold. Given that this conjunctive order-boosting prescription to reduce ambiguity is algorithmically trivial, it shifts emphasis away from the question as to how to eliminate binding ambiguity, toward the question as to whether, for a given recognition problem, ambiguity can be eliminated at an acceptable cost.

1.3 A Missing Theory. In spite of the importance of spatially invariant conjunctive visual representations as models of neural function, and the demonstrated practical successes of computer vision systems constructed along these lines, many of the design trade-offs that govern the performance of visual "Wickelsystems" in response to cluttered input images have remained poorly understood. For example, it is unknown in general how recognition performance depends on (1) the number of object categories that must be distinguished, (2) the similarity structure of the object category distribution (i.e., whether object categories are intrinsically very similar or very different from each other), (3) the featural complexity of individual objects, (4) the number and conjunctive order of features included in the representation, (5) the clutter load (i.e., the amount of visual material in the field of view from which multiple objects must be recognized without explicit segmentation), and (6) the invariance load (i.e., the set of spatial transformations that do not affect the identities of objects, and that must be ignored by each individual detector). In a previous analysis that touched on some of these issues, Califano and Mohan (1994) showed that large, parameterized families of complex invariant features (e.g., containing 10⁶ elements or more) could indeed support recognition of multiple objects in complex scenes without prior segmentation, or discrimination among a large number of highly similar objects, in a spatially invariant fashion. However, these authors were primarily concerned with the mathematical advantages of large versus small feature sets in a generic sense, and thus did not test their analytical predictions using a large object database while varying object complexity, category structure, clutter load, and other factors. Beyond our lack of understanding regarding the trade-offs influencing system performance, the issue as to which invariant conjunctive features, and of what complexity, should be incorporated into a visual representation through learning has also not been well worked out as a conceptual matter.
Previous approaches to feature learning in neuromorphic visual systems have invoked (1) generic gradient-based supervised learning principles (e.g., Le Cun et al., 1990) that offer no direct insight into the qualities of a good representation, or (2) straightforward unsupervised learning principles (e.g., Wallis & Rolls, 1997), which tend to discover frequently occurring features or principal components rather than those needed to distinguish objects from cluttered backgrounds on a task-specific basis. Network approaches to learning have also typically operated within a fixed network architecture, which largely predefines and homogenizes the complexity of the top-level features to be used for recognition, though optimal representations may actually require features drawn from a range of complexities and could vary in their composition on a per-object basis. All this suggests that further work on the principles of feature learning is needed. In this article, we describe our work to test a simple analytical model that captures several trade-offs governing the performance of visual recognition systems based on spatially invariant conjunctive features. In addition, we introduce a supervised greedy algorithm for feature learning that grows a visual representation in such a way as to minimize false-positive recognition errors. Finally, we consider some of the surprising properties of "good" representations and the implications of our results for more realistic visual recognition problems.
2 Methods

2.1 Text World as a Model for Object Recognition. The studies described here were carried out in text world, a domain with many of the complexities of vision in general (see Figure 2): target objects are numerous (more than 40,000 words in the database), are highly nonuniform in their prior probabilities (relative probabilities range from 1 to 400,000), are constructed from a set of relatively few underlying parts (26 letters), and individually can contain from 1 to 20 parts. Furthermore, the parts from which words are built are highly nonuniform in their relative frequencies and contain strong statistical dependencies. Finally, word objects occur in highly cluttered visual environments—embedded in input arrays in which many other words are simultaneously present—but must nonetheless be recognized regardless of position in the visual field. Although text world lacks several additional complexities characteristic of real object vision (see section 6), it nonetheless provides a rich but tractable domain in which to quantify binding trade-offs and to facilitate the testing of analytical predictions and the development of feature learning algorithms. An important caveat, however, is that the use of text as a substrate for developing and testing our analytical model does not imply that our conclusions bear directly on any aspect of human reading behavior. Two databases were used in the studies. The word database contained 44,414 entries, representing all lowercase punctuation-free words and their relative frequencies found in 5 million words in the Wall Street Journal (WSJ) (available online from the Association for Computational Linguistics at http://morph.ldc.upenn.edu/). The English text database consisted of approximately 1 million words drawn from a variety of sources (Kučera & Francis, 1967).

2.2 Recognition Using n-Grams.
A natural class of visual features in text world is the position-invariant n-gram, defined here as a binary detector that responds when a specific spatial configuration of n letters is found anywhere (one or more times) in the input array.² The value of n is termed the conjunctive or binding order of the n-gram, and the diameter of the n-gram is the maximum span between characters specified within it. For example, th**_ is a 3-gram of diameter 5, since it specifies the relative locations of three characters (including the space character, written here as _, but not the wild card characters *) and spans a field five characters in width. The concept of an n-gram is used also in computational linguistics, though usually referring to n-tuples of consecutive words (Charniak, 1993).
² An n-gram could also be multivalued, that is, respond in a graded fashion depending on the number of occurrences of the target feature in the input; the binary version was used here to facilitate analysis.
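As a concrete illustration (our own sketch, not code from the paper; the function name, the use of '*' for wild cards, and a literal space for the space character are assumptions), a binary position-invariant n-gram detector can be written as a scan for a fixed-width pattern anywhere in the input array:

```python
def ngram_matches(pattern, text):
    """Binary position-invariant n-gram detector (sketch).

    `pattern` is a fixed-width string in which '*' is a wild card
    matching any single character; every other character, including
    a literal space, must match exactly.  The detector is binary:
    it reports presence anywhere in `text`, not occurrence counts.
    """
    span = len(pattern)  # the n-gram's diameter
    for start in range(len(text) - span + 1):
        window = text[start:start + span]
        if all(p == '*' or p == t for p, t in zip(pattern, window)):
            return True
    return False
```

Under this encoding, the text's example of "th" followed by two wild cards and a space is a 3-gram of diameter 5: three specified characters spanning a field five characters wide.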
Figure 2: Text world was used as a surrogate for visual recognition in multiobject scenes. (A) Word frequencies in the 5-million-word Wall Street Journal (WSJ) corpus are highly nonuniform, following an approximately exponential decay according to Zipf's law. (B) Words range from 1 to 20 characters in length; 7-letter words are most numerous. (C, D) Relative frequencies of individual letters or adjacent letter pairs are highly nonuniform. Ordinate values represent the number of times the feature was found in the WSJ corpus. The dashed rectangle in D represents the simplifying assumption regarding the 2-gram frequency distribution used for quantitative predictions below. (E) The parts of words show strong statistical dependencies. Columns show the information gain about the identity of the adjacent letter (or space character) to the left or right, given the presence of the letter indicated on the x-axis. The dashed horizontal line indicates perfect knowledge—log₂(27) = 4.75 bits. [Panel titles: (A) Non-Uniform Word Frequencies; (B) Non-Uniform Word Sizes; (C) Non-Uniform Letter Frequencies; (D) Non-Uniform 2-gram Frequencies; (E) Statistical Dependencies Among Adjacent Letters.]
In relation to conventional visual domains, n-grams may be viewed as complex visual feature detectors, each of which responds to a specific conjunction of subparts in a specific spatial configuration, but with built-in invariance to a set of spatial transformations predefined by the recognition task. In relation to a shape-based visual domain, a 2-gram is analogous to a corner detector that signals the co-occurrence of a vertical and a horizontal edge, with appropriate spatial offsets, anywhere within an extended region of the image. Let R be a representation containing a set of position-invariant n-grams analyzing the input array and W be a set of word detectors analyzing R, with one detector for every entry in the word database. Each word detector receives input from some or all of the n-gram units in R that are contained in the respective word. However, not all of a word's n-grams are necessarily contained in R, and not all of the word's n-grams that are contained in R necessarily provide input to the word's detector. For example, the representation might contain four spatially invariant features R = {a, c, o, s}, while the detector for the word cows receives input from features c and s, but not from the o feature, which is present in R, or from the w feature, which is not contained in R. Word detectors are binary-valued conjunctions, responding 1 when all of their inputs are active and 0 otherwise. The collection of n-grams feeding a word detector is hereafter referred to as the word's minimal feature set. A correct recognition is scored when an input array of characters is presented and each word detector is activated if and only if its associated word is present in the input. Any word detector activated when its corresponding word is not actually present in the input array is counted as a hallucination, and the recognition trial is scored as an error.
Recognition errors for a particular representation R are reported as the percentage of input arrays drawn randomly from the English text database that generated one or more word hallucinations.
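The word-detector layer just described can be sketched as a set-containment test (a minimal construction of our own; the names are illustrative): a detector fires if and only if every feature in its minimal feature set is active in R.

```python
def make_word_detector(minimal_feature_set):
    """Return a binary conjunction over the word's minimal feature set:
    the detector responds 1 when all of its input features are active
    in the representation, and 0 otherwise."""
    required = frozenset(minimal_feature_set)

    def detector(active_features):
        return int(required <= set(active_features))

    return detector

# The text's example: R = {a, c, o, s}, and the detector for "cows"
# receives input only from features c and s.
cows = make_word_detector({"c", "s"})
print(cows({"a", "c", "o", "s"}))  # 1: fires (correctly, or as a hallucination)
print(cows({"c", "a", "t"}))       # 0: feature s is inactive
```

Note that any input activating both c and s, such as a scene containing "cats" and "socks" but no cow, would trigger this detector: exactly the hallucination counted as an error above.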
3 Results

3.1 Word-Word Collisions in a Simple 1-Gram Representation. We begin by considering the case without visual clutter and ask how often pairs of isolated words collide with each other within a binary 1-gram representation, where each word is represented as an unordered set of individual letters (with repeats ignored). This is the situation in which the binding problem is most evident, as schematized in Figure 1A. Results for the approximately 2 billion distinct pairs of words in the database are summarized in Table 1. About half of the 44,414 words contained in the database collide with at least one other word (e.g., cerebrum ↔ cucumber, calculation ↔ unconstitutional). The largest cohort of 1-gram-identical words contained 28 members.
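The binary 1-gram collision test can be sketched as follows (our own code, not the paper's; the roughly 2 billion pairwise comparisons reduce to grouping words by their letter sets):

```python
from collections import defaultdict

def one_gram_signature(word):
    """Binary 1-gram code: the unordered set of distinct letters,
    with repeats ignored."""
    return frozenset(word)

def colliding_groups(words):
    """Group words whose binary 1-gram signatures are identical;
    each returned group is a cohort of mutually ambiguous words."""
    groups = defaultdict(list)
    for word in words:
        groups[one_gram_signature(word)].append(word)
    return [g for g in groups.values() if len(g) > 1]
```

For example, cerebrum and cucumber both reduce to the letter set {b, c, e, m, r, u}, so they collide; members of the stare/tears/rarest cohort listed in Table 1 likewise collapse to {a, e, r, s, t}.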
Table 1: Performance of a Binary 1-Gram Representation in Word-Word Comparisons.

Quantity                                      Value        Comment
Number of 1-grams in R                        27           a, b, ..., z, _ (space)
Number of distinct words                      44,414       From 5 million words in the WSJ
Word-word comparisons                         ≈2 billion
Number of ambiguous words                     24,488       analogies ↔ gasoline; suction ↔ continuous;
                                                           scientists ↔ nicest
Largest self-similar cohort                   28 words     stare, arrest, tears, restates, reassert,
                                                           rarest, easter, ...
Baseline entropy in word-frequency
  distribution                                9.76 bits    < log₂(44,414) = 15.4 bits
Residual uncertainty about identity of a
  randomly drawn word, knowing its 1-grams    1.4 bits     Narrows field to ≈3 possible words
As an aside, we noted that if the two words compared were required also to contain the same number of letters, the number of ambiguously represented words fell to 11,377, and if the words were compared within a multivalued 1-gram representation (where words matched only if they contained the identical multiset of letters, i.e., in which repeats were taken into account), the ambiguous word count fell further to 5,483, or about 12% of English words according to this sample. The largest self-similar cohort in this case contained only 5 words (traces, reacts, crates, caters, caster). Although a 1-gram representation contains no explicit part-part relations and fails to pin down uniquely the identity of more than half of the English words in the database, it nonetheless provides most of the needed information to distinguish individually presented words. The baseline entropy in the WSJ word-frequency distribution is 9.76 bits, significantly less than the 15.4 = log₂(44,414) bits for an equiprobable word set. The average residual uncertainty about the identity of a word drawn from the WSJ word-frequency distribution, given its 1-gram representation, is only 1.4 bits, meaning that for individually presented words, the 1-gram representation narrows the set of possibilities to about three words on average.

3.2 Word-Word Collisions in an Adjacent 2-Gram Representation.
Since many word objects have identical 1-gram representations even when full multisets of letters are considered, we examined the collision rate when an adjacent binary-valued 2-gram representation was used (i.e., coding the set of all adjacent letter pairs, including the space character). This representation is significant in that adjacent 2-grams are the simplest features that explicitly encode spatial relations between parts. The results of this test are summarized in Table 2. There were only 46 word-pair collisions in this case, comprising 57 distinct words. Forty-two of the 46 problematic word pairs differed only in the length of a string of repeated letters (e.g., ahhh ↔ ahhhh ↔ ahhhhh ..., similarly for ohhh, ummm, zzz, wheee, whirrr) or fell into sets representing varying spellings or misspellings of the same word also involving repeated letters (e.g., tepees ↔ teepees, llama ↔ lllama). Only four pairs of distinct words collided that had different morphologies in the linguistic sense (see Table 2). When words were constrained to be the same length or to contain the identical multiset of adjacent 2-grams, only a single collision persisted: intended ↔ indented. At the level of word-word comparisons, therefore, the binding of adjacent letter pairs is sufficient practically to eliminate the ambiguity in a large database of English words. We noted empirically that the inclusion of fewer than 20 additional nonadjacent 2-grams in the representation entirely eliminated the word-word ambiguity in this database, without recourse to multiset comparisons or word-length constraints.

Table 2: Performance of a Binary Adjacent 2-Gram Representation in Word-Word Comparisons.

Quantity                                      Value        Comment
Number of adjacent 2-grams                    729          [aa], [ab], ..., [a_], ..., [_z], [__]
Number of words                               44,414
Word-word comparisons                         ≈2 billion
Number of ambiguous words                     57           ohhh, ahhh, shhh, hmmm, whoosh, whirr, ...
Largest self-similar cohort                   5 words      ohhh ↔ ohhhh ↔ ohhhhh ↔ ohhhhhh ↔ ohhhhhhh
Ambiguous word pairs that are
  linguistically distinct                     4 pairs      asses ↔ assess; possess ↔ possesses;
                                                           seamstress ↔ seamstresses; intended ↔ indented
Words with identical adjacent
  2-gram multisets                            2 words      intended ↔ indented

4 An Analytical Model
The results of these word-word comparisons suggest that under some circumstances, a modest number of low-order n-grams can disambiguate a large, complex object set. However, one of the most dire challenges to a spatially invariant feature representation, object clutter, remains to be dealt with. To this end, we developed a simple probabilistic model to predict recognition performance when a cluttered input array containing multiple objects is presented to a bank of feature detectors. To permit analysis of this problem, we make two simplifying assumptions. First, we assume that the features contained in R are statistically independent. This assumption is false for English n-grams: for example, [th] predicts that [he] will follow about one in three times, whereas the baseline rate for [he] is only 0.5%. Second, we assume that all the features in R are activated with equal probability. This is also false in English: for example, [ed] occurs approximately 90,000 times more often than [mj] (as in ramjet). Proceeding nonetheless under these two assumptions, it is straightforward to calculate the probability that no object in the database is hallucinated in response to a randomly drawn input image. We first calculate the probability o that all of the features feeding a given object detector are activated by an arbitrary input image:
$$o \;=\; \frac{\binom{d-w}{c-w}}{\binom{d}{c}} \;=\; \frac{(d-w)!\,c!}{(c-w)!\,d!} \;\approx\; \left(\frac{c}{d}\right)^{\!w}, \qquad (4.1)$$
where d is the total number of feature detectors in R, w is the size of the object's minimal feature set (i.e., the number of features required by the target object (word) detector), and c is the number of features activated by a "cluttered" multiobject input image. The left-most combinatorial ratio term in equation 4.1 counts the number of ways of choosing c from the set of d detectors, including all w belonging to the target object, as a fraction of the total number of ways of choosing c of the d detectors in any fashion. The approximation in equation 4.1 holds when c and d are large compared to w, as is generally the case. Equation 4.1 is, incidentally, the value of a "hypergeometric" random variable in the special case where all of the target elements are chosen; this distribution is used, for example, to calculate the probability of picking a winning lottery ticket. The probability that an object is hallucinated (i.e., perceived but not actually present) is given by

$$h_i \;=\; \frac{o_i - q_i}{1 - q_i}, \qquad (4.2)$$
where q_i is the probability that the object does in fact appear in the input image, in which case by definition it cannot be hallucinated. Zipf's law, however, tells us that most objects in the world are quite uncommon (see Figure 2A), so we may in some cases make the further simplification that q_i ≪ o_i, giving h_i ≈ o_i. We adopt this assumption in the following. (This simplification would also result from the assumption that inputs consist of object-like noise structured to trigger false-positive perceptions.) We may now write the probability of veridical perception—the probability that no object in the database is hallucinated:

$$p_v \;=\; (1 - h)^N \;\approx\; \left[\,1 - \left(\frac{c}{d}\right)^{\!w}\right]^{N}, \qquad (4.3)$$
where N is the number of objects in the database. This expression assumes a homogeneous object database, where all N objects require exactly w features in R. Given the independence assumption, the expression for a heterogeneous object database consists of a product of object-specific probabilities, p_v = Π_i (1 − (c/d)^{w_i}), where w_i is the number of features required by object i. Note that objects (words) containing the same number of parts (letters) do not necessarily activate the same number of features (n-grams) in R, depending on both the object and the composition of R. From a first-order expansion of equation 4.3 around (c/d)^w = 0, which gives p_v = 1 − N(c/d)^w, it may be seen that perception is veridical (i.e., p_v ≈ 1) when (c/d)^w ≪ 1/N. Thus, recognition errors are expected to be small when (1) a cluttered input scene activates only a small fraction of the detectors in R, (2) individual objects depend on as many detectors in R as possible, and (3) the number of objects to be discriminated is not too large. The first two effects generate pressure to grow large representations, since if d is scaled up and c and w scale with it (as would be expected for statistically independent features), recognition performance improves rapidly. Conveniently, the size of the representation can always be increased by including new features of higher binding order or larger diameters, or both. It is also interesting to note that item 1 in the previous paragraph appears to militate for a sparse visual representation, while equation 4.2 militates for a dense representation. This apparent contradiction reflects the fundamental trade-off between the need for individual objects to have large minimal feature sets in order to ward off hallucinations and the need for input scenes containing many objects to activate as few features as possible, again in order to minimize hallucinations.
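Equations 4.1 through 4.3 can be checked numerically with a short sketch (ours, not the authors'; the sample values of d, w, c, and N are illustrative only, not measured quantities from the paper):

```python
from math import comb

def p_minimal_set_covered(d, w, c):
    """Equation 4.1, exact form: the probability o that all w members of
    an object's minimal feature set fall among the c features activated
    by a cluttered input, out of d detectors in the representation."""
    return comb(d - w, c - w) / comb(d, c)

def p_veridical(N, d, w, c):
    """Equation 4.3, homogeneous database: the probability that none of
    the N objects is hallucinated, using h ~ o ~ (c/d)**w."""
    return (1.0 - (c / d) ** w) ** N

# The (c/d)**w approximation tracks the exact hypergeometric form
# when c and d are large compared to w:
o_exact = p_minimal_set_covered(d=729, w=5, c=100)
o_approx = (100 / 729) ** 5
```

Under these illustrative values the approximation lies within roughly 10% of the exact ratio, and p_v falls toward 0 as N grows or as the ratio c/d increases, matching the three conditions for small error listed above.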
This trade-off is not captured within the standard conception of a feature space, since the notion of "similarity as distance" in a feature space does not naturally extend to the situation in which inputs represent multiobject scenes and require multiple output labels.

4.1 Fitting the Model to Recognition Performance in Text World. We ran a series of experiments to test the analytical model. Input arrays varying in width from 10 to 50 characters were drawn at random from the text database and presented to R. The representation contained either all adjacent 2-grams, called R{2} to signify a diameter of 2, or the set of all 2-grams of diameter 2 or 3, called R{2,3}, where R{2,3} was twice the size of R{2}.
In individual runs, the word database was restricted to words of a single length—two-letter words (N = 253), five-letter words (N = 4024), or seven-letter words (N = 7159)—in order for the value of w to be kept relatively constant within a trial.³ For each word size and input size, and for each of the two representations, average values for w and c were measured from samples drawn from the word or text database. Not surprisingly, given the assumption violations noted above (feature independence and equiprobability), predicting recognition performance using these measured average values led to gross underestimates of the hallucination errors actually generated by the representation. Two kinds of corrections to the model were made. The first correction involved estimating a better value for d, the number of n-grams contained in the representation, since some of the detectors nominally present in R{2} and R{2,3} were infrequently, if ever, actually activated. The value for d was thus down-corrected to reflect the number of detectors that would account for more than 99.9% of the probability density over the detector set: for R{2}, the representation size dropped from d = 729 to d′ = 409, while for R{2,3}, d was cut from 1,458 to 948. The second correction was based on the fact that the n-grams activated by English words and phrases are highly statistically redundant, meaning that a set of d′ n-grams has many fewer than d′ underlying degrees of freedom. We therefore introduced a redundancy factor r, which allowed us to "pretend" that the representation contained only d′/r virtual n-grams, and that a word or an input image activated only w/r or c/r virtual n-grams in the representation, respectively. (Note that since c and d appear only as a ratio in the approximation of equation 4.3, their r-corrections cancel out.)
By systematically adjusting the redundancy factor depending on test condition (ranging from 1.60 to 3.45), qualitatively good fits to the empirical data could be generated, as shown in Figure 3. The redundancy factor was generally larger for long words relative to short words, reflecting the fact that long English words have more internal predictability (i.e., fill a smaller proportion of the space of long strings than short words fill in the space of short strings). The r-factor was also larger for R{2,3} than for R{2}, reflecting the fact that the increased feature overlap due to the inclusion of larger feature diameters in R{2,3} leads to stronger interfeature correlations, and hence fewer underlying degrees of freedom. Thus, with the inclusion of correction factors for nonuniform activation levels and nonindependence of the features in R, the remarkably simple relation in equation 4.3 captures the main trends that describe recognition performance in scenarios with database sizes ranging from 253 to 7159 words, words containing between
³ Since R{2} and R{2,3} contained all n-grams of their respective sizes, the value of w could vary only from word to word to the extent that words contained one or more repeated n-grams.
Minimizing Binding Errors
[Figure 3 appears here. Panel A: "Fits of Analytical Model to Measured Error Rates: Adjacent 2-gram Representation"; R{2}, 729 total features: [{aa}, {ab}, ... {a_}, ... {_z}, {__}]; legend: 2-letter words, r = 1.60; 5-letter words, r = 1.60; 7-letter words, r = 2.13. Panel B: "Fits of Analytical Model to Measured Error Rates: Extended 2-gram Representation"; R{2,3}, 1,458 total features: [{aa}, {a*a}, {ab}, {a*b}, ... {a_}, {a*_}, ...]; legend: 2-letter words, r = 2.19; 5-letter words, r = 2.95; 7-letter words, r = 3.45. Axes in both panels: x, Text Input Width (Characters), 10-50; y, Hallucination Errors (% of inputs generating), 0-100.]
Figure 3: Fits of equation 4.3 to recognition performance in text world. (A) The representation, designated R{2}, contained the complete set of 729 adjacent 2-grams (including the space character). The x-axis shows the width of a text window presented to the n-gram representation, and the y-axis shows the probability that the input generates one or more hallucinated words. Analytical predictions are shown in dashed lines using redundancy factors specified in the legend. (B) Same as A, with results for R{2,3}.
746
Bartlett W. Mel and József Fiser
2 and 7 letters, input arrays containing from 10 to 50 characters, and using representations varying in size by a factor of 2. We did not attempt to fit larger ranges of these experimental variables. Regarding the sensitivity of the model fits to the choice of d′ and r, we noted first that fits were not strongly affected by the cumulative probability cutoff used to choose d′—qualitatively reasonable fits could also be generated using a cutoff value of 95%, for example, for which the d′ values for R{2} and R{2,3} shrank to 220 and 504, respectively. In contrast, we noted that a change of one part per thousand in the value of r could result in a significant change in the shape of a curve generated by equation 4.3, and with it significant degradation of the quality of a fit. However, the systematic increase in the optimal value of r with increasing word size and representation size indicates that this redundancy correction factor is not an entirely free parameter. It is also worth emphasizing that a redundancy "factor," which uniformly scales w, c, and d, is a very crude model of the complex statistical dependencies that exist among English n-grams; the fact that any set of small r values can be found that tracks empirical recognition curves over wide variations in the several other task variables is therefore significant.

5 Greedy Feature Learning
Several lessons from the analysis and experiments are that (1) larger representations can lead to better recognition performance, (2) only those features that are actually used should be included in a representation, (3) considerable statistical redundancy exists in a fully enumerated (or randomly drawn) set of conjunctive features, which should be suppressed if possible, and (4) features of low binding order, while potentially adequate to represent isolated objects, are insufficient for veridical perception of scenes containing multiple objects. In addition, we infer that different objects are likely to have different representational requirements depending on their complexity and that the composition of an ideal representation is likely to depend heavily on the particular recognition task. Taken together, these points suggest that a representation should be learned that includes higher-order features only as needed. In contrast, blind enumeration of an ever-increasing number of higher-order features is an expensive and relatively inefficient way to proceed. Standard network approaches to supervised learning could in principle address all of the points raised above, including finding appropriate weights for features of various orders depending on their utility and helping to decorrelate the input representation. However, the potential need to acquire features of high order—so high that enumeration and testing of the full cohort of high-order features is impractical—compels the use of some form of explicit search as part of the learning process. The use of greedy, incremental, or heuristic techniques to grow a neural representation through increasing nonlinear orders is not new (e.g., Barron, Mucciardi, Cook, Craig,
& Barron, 1984; Fahlman & Lebiere, 1990), though such techniques have been relatively little discussed in the context of vision. To cope with the many constraints that must be satisfied by a good visual representation, we developed a greedy supervised algorithm for feature learning that (1) selects the order in which features should be added to the representation, and (2) selects which features added to the representation should contribute to the minimal feature sets of each object in the database. By differentiating the log probability u_v = log(p_v) = log ∏_i (1 - h_i) with respect to the hallucination probability of the ith object, we find that

du_v / dh_i = -1 / (1 - h_i),    (5.1)
indicating that the largest multiplicative increments to p_v are realized by squashing the largest values of h_i = (c/d)^{w_i}. Practically, this is achieved by incrementing the w-values (i.e., growing the minimal feature sets) of the most frequently hallucinated objects. The following learning algorithm aims to do this in an efficient way:

1. Start with a small bootstrap representation. (We typically used 50 2-grams found to be useful in pilot experiments.)

2. Present a large number of text "images" to the representation.

3. Collect hallucinated words—any word that was ever detected but not actually present in an input image.

4. For each hallucinated word and the input array that generated it, increment a global frequency table for every n-gram (up to a maximum order of 5 and diameter of 6) that is (i) contained in the hallucinated word, (ii) not contained in the offending input, and (iii) not currently included in the representation.

5. Choose the most frequent n-gram in this table of any order and diameter, and add it to the current representation.

6. Build a connection from the newly added n-gram to any word detector involved in a successful vote for this feature in step 4. Inclusion of this n-gram in these words' minimal feature sets will eliminate the largest number of word hallucination events encountered in the set of training images.

7. Go to 2.

We ran the algorithm using input arrays 50 characters in width (often containing 10 or more words) and plotted the number of hallucinated words (see Figure 4A) and the proportion of input arrays causing hallucinations (see Figure 4B) as a function of increasing representation size during learning. The periodic ratcheting in the performance curves (most visible in the lower curve of Figure 4B) was due to staging of the training process for efficiency
reasons: batches of 1000 input arrays were processed until a fixed hallucination error criterion of 1% was reached, when a new batch of 1000 random input arrays was drawn, and so on. The envelope of the oscillating performance curves provides an estimate of the true performance curve that would result from an infinitely large training corpus. Learning was terminated when the average initial error rate for three consecutive new training batches fell below 1%. Given that this termination condition was typically reached in fewer than 50 training epochs (consisting of 50,000 50-character input images), no more than half the text database was visited by random drawings of training-testing sets. We found that for 50-character input arrays, the hallucination error rate fell below 1% after the inclusion of 2287 n-grams in R. These results were typical. The inferior performance of a simple unsupervised algorithm, which adds n-grams to the representation in order of their raw frequencies in English, is shown in Figures 4A and 4B for comparison (upper dashed curves). The counts of total words hallucinated (see Figure 4A), with initial values indicating many tens of hallucinated words per input, can be seen to fall far faster than the hallucination error rate (see Figure 4B), since a sentence was scored as an error until its very last hallucinated word was quashed.
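The learning loop above can be sketched in miniature. The following is a toy illustration, not the authors' code: it uses a four-word lexicon of our choosing, only contiguous 2- and 3-grams (the paper allows orders up to 5 and diameters up to 6, including wildcard positions), and a handful of fixed "images":

```python
from collections import Counter

def ngrams(s, n):
    """All contiguous n-grams of s, with boundary spaces included."""
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def features(s, orders=(2, 3)):
    """Union of 2- and 3-gram features (a stand-in for the paper's
    n-grams of order up to 5 and diameter up to 6)."""
    out = set()
    for n in orders:
        out |= ngrams(s, n)
    return out

def detect(word, rep, input_feats):
    """A word detector fires iff its minimal feature set (the word's
    features that are present in the representation) is fully active."""
    minimal = features(word) & rep
    return bool(minimal) and minimal <= input_feats

def greedy_step(lexicon, inputs, rep):
    """Steps 2-5 above: tally, over all hallucinations, the n-grams that
    are in the hallucinated word but neither in the offending input nor
    in the representation; then add the most frequent candidate."""
    votes = Counter()
    for text in inputs:
        input_feats = features(text)
        present = set(text.split())
        for word in lexicon:
            if word not in present and detect(word, rep, input_feats):
                for f in features(word) - input_feats - rep:
                    votes[f] += 1
    if votes:
        rep = rep | {votes.most_common(1)[0][0]}
    return rep

lexicon = ["nation", "ration", "vacation", "national"]
inputs = ["national debt", "vacation plans", "a national vacation"]
rep = {"na", "ti", "io", "on"}          # tiny bootstrap representation
for _ in range(10):                      # step 7: loop until votes dry up
    rep = greedy_step(lexicon, inputs, rep)
```

In this toy run, the fixable hallucinations (e.g., ration detected in "national debt") are eliminated after a few added n-grams, while nation remains hallucinated in "a national vacation", whose two words jointly contain every one of nation's small n-grams: an instance of the distractor-superposition failure described below.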
Figure 4: Facing page. Results of the greedy n-gram learning algorithm using input arrays 50 characters in width and a 50-element bootstrap representation. Training was staged in batches of 1000 input arrays until the steady-state hallucination error rate fell below 1%. (A) Growth of the representation is plotted along the x-axis, and the number of words hallucinated at least once in the current 1000-input training batch is plotted on the y-axis. The drop is far faster for the greedy algorithm (lower curve) than for a frequency-based unsupervised algorithm (upper curve). The unsupervised bootstrap representation also included the 27 1-grams (i.e., individual letters plus the space character) that were excluded from the bootstrap representation used in the greedy learning run. Without these 1-grams, performance of the unsupervised algorithm was even more severely hampered; with these 1-grams incorporated in the greedy learning runs, the w-values of every object were unduly inflated. (B) Same as (A), but the y-axis shows the proportion of input arrays producing at least one hallucination error. Jaggies occur at transitions to new training batches each time the 1% error criterion was reached. Asymptote is reached in this run after the inclusion of 2287 n-grams. (C) A breakdown of the learned representation showing relative numbers of n-grams of each order and diameter. (D) The column height shows the minimal feature set size after learning, that is, the average number of n-grams feeding word conjunctions for words of various lengths. The dark lower portion of the columns shows initial counts based on full connectivity to the bootstrap representation; the light upper portions show the increment in w due to learning. Nearly constant w values after learning reflect the tendency of the algorithm to grow the smallest minimal feature sets. The dashed line corresponds to one n-gram per letter in the word.
One of the most striking outcomes of the greedy learning process is that the learned representation is heavily weighted toward low-order features, as shown in Figure 4C. Thus, while the learned representation includes n-grams up to order 5, the number of higher-order features required to remedy perceptual errors falls off precipitously with increasing order—far faster than the explosive growth in the number of available features at each order. Quantitatively, the ratios of n-grams included in R to n-grams contained in the word database for orders 2, 3, 4, and 5 were 0.42, 0.0095, 0.00017, and 0.0000055, respectively. The rapidly diminishing ratios of useful features to available features at higher orders confirm the impracticality of learning on a fully enumerated high-order feature set and illustrate the natural bias in the learning procedure to choose the lowest-order features that "get the job done." The dominance of low-order features is of considerable practical significance, since higher-order features are more expensive to compute, less robust to image degradation, operate with lower duty cycles, and provide a more limited basis for generalization. The relatively broad distribution of
[Figure 4 appears here. Panel A: "Fewer Words are Hallucinated as Representation Grows"; x, N-grams Included in Representation (0-5000); y, Words Hallucinated (0-10,000); upper curve Unsupervised, lower curve Greedy. Panel B: "Fewer Inputs Cause Hallucinations as Representation Grows"; same x-axis; y, Inputs Causing Hallucinations (%) (0-100); upper curve Unsupervised, lower curve Greedy. Panel C: "Sizes of N-grams in Final Learned Representation"; x, N-gram Order (2-5); y, Number of N-grams (0-1600); bars subdivided by diameter (2-6). Panel D: "Number of N-grams Feeding Words of Varying Lengths"; x, Word Size (Letters) (1-20); y, Mean Number of N-grams (0-20); column segments: initial, learned.]
diameters of the learned feature set are also shown in Figure 4C for n-grams of each order. In spite of its relatively small size, it is likely that the n-gram representation produced by the greedy algorithm could be significantly compacted without loss of recognition performance, by alternately pruning the least useful n-grams from R (which lead to minimal increments in error rate) and retraining back to criterion. However, such a scheme was not tried. After learning to the 1% error criterion, a 50-character input array activated an average of 168 (7.3%) of the 2287 detectors in R, while the average minimal feature set size for an individual word was 11.4. By comparison, the initial c/d ratio for 50-character inputs projected onto the 50-element bootstrap representation was a much less favorable 44%. Thus, a primary outcome of the learning algorithm in this case involving heavy clutter is a larger, more sparsely activated representation, which minimizes collisions between input images and the feature sets associated with individual target words. Further, the estimated r value from this run was 1.95, indicating significantly less redundancy in the learned representation than was estimated using seven-letter words (r = 3.45) in the fully enumerated 2-gram representations reported in Figure 3; this is in spite of the fact that the learned representation contained more than twice the number of active features as the fully enumerated 2-gram representation. The total height of each column in Figure 4D shows the w value—the size of the minimal feature sets after learning for words of various lengths. The dark lower segment of each column counts initial connections to the word conjunctions from the 50-element bootstrap representation, while the average number of new connections gained during learning for words of each length is given by the height of each gray upper segment.
Given the algorithm's tendency to expand the smallest minimal feature sets, the w values for short words grew proportionally much more than those for longer words, producing a final distribution of w values that was remarkably independent of word length for all but the shortest words. The w values for one-, two-, and three-letter words were lower either because they reached their maximum possible values (4 for one-letter words) or because these words contained rarer features, on average, derived from a relatively high concentration in the database of unpronounceable abbreviations, acronyms, and so forth. The dashed line indicates a minimal feature set size equal to the number of letters contained in the word; the longest words can be seen to depend on substantially fewer n-grams than they contain letters. The pressures influencing both the choice of n-grams for inclusion in R and the assignment of n-grams to the minimal feature sets of individual words were very different from those associated with a frequency-based learning scheme. Thus, the features included in R were not necessarily the most common ones, and the most commonly occurring English words were not necessarily the most heavily represented.
Table 3: First 10 n-Grams to Be Added to R During a Learning Run with 50-Character Input Arrays, Initialized with a 200-Element Bootstrap Representation.

Order of Inclusion    n-Gram      Relative Frequency    Cumulative %
 1                    [z ]            417                  8
 2                    [ j]         14,167                 55
 3                    [k* ]        39,590                 71
 4                    [x ]          6090                  42
 5                    [ki]         20,558                 62
 6                    [ q]          8771                  47
 7                    [ *v]        24,278                 64
 8                    [ **z]        3184                  31
 9                    [o*i]        37,562                 70
10                    [p*** ]      55,567                 76

Note: The " " character represents the space character, and "*" matches any character. A value of k in the cumulative distribution column indicates that the total probability summed over all less common features is k%.
First, in lieu of choosing the most commonly occurring n-grams, the greedy algorithm prefers features that are relatively rare in English text as a whole but relatively common in those words most likely to be hallucinated: short words and words containing relatively common English structure. The paradoxical need to find features that distinguish objects from common backgrounds, but do so as often as possible, leads to a representation containing features that occur with moderate frequency. Table 3 shows the first 10 n-grams added to a 200-element bootstrap representation during learning in one experimental run. The features' relative frequencies in English are shown, which jump about erratically but are mostly from the middle ranges of the cumulative probability distribution for English n-grams. Most of these first 10 features include a space character, indicating that failure to represent elemental features properly in relation to object boundaries is a major contributor to hallucination errors in this domain. Specifically, we found that hallucination errors are frequently caused by the presence in the input of multiple "distractor" objects that together contain most or all of the features of a target object, for example, in the sense that national and vacation together contain nearly all the features of the word nation. To distinguish nation from the superposition of these two distractors requires the inclusion in the representation of a 2-gram such as [n***** ], whose diameter of 7 exceeds the maximum value of 6 arbitrarily established in the present experiments.
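This superposition effect can be checked mechanically. A small sketch (the helper function is ours; the paper's representation also includes higher orders and wildcard n-grams) shows that with adjacent 2-grams alone, national and vacation jointly cover every feature of nation:

```python
def two_grams(word):
    """Adjacent 2-grams of a word, with leading and trailing spaces
    so that object-boundary features are represented."""
    s = f" {word} "
    return {s[i:i + 2] for i in range(len(s) - 1)}

target = two_grams("nation")
distractors = two_grams("national") | two_grams("vacation")
assert target <= distractors   # every 2-gram of "nation" is covered
```

The interior 2-grams are all supplied by national, and the boundary feature "n " by vacation, so no adjacent 2-gram can separate the target from the superposition; hence the need for a wide wildcard feature such as [n***** ].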
Table 4: Correlation Between Word Frequency in English and Minimal Feature Set Size After Learning Was Only 0.1.

Word        Relative Frequency    Minimal Feature Set Sizeᵃ
resting         100                   25
leading         315                   24
buffalo         188                    8
unhappy         144                    7
thereat           2                   25
fording           1                   23
bedknob           1                    7
jackdaw           1                    6

ᵃ Number of n-grams in R that feed the word's conjunction, that is, which must be present in order for the word to be detected. The table shows examples of common and uncommon words with large and small minimal feature sets.
learning. In fact, if attention is restricted to the set of all words of a given length, with nominally equivalent representational requirements, the correlation between the relative frequency of a word in English and its postlearning minimal feature set size was only 0.1. Thus some common words require a conjunction of many n-grams in R, while others require few, and some rare words require many n-grams while others require few (see Table 4).

5.1 Sensitivity of the Learned Representation to Object Category Structure. Words are unlike common objects in that every word forms its own
category, which must be distinguished from all other words. In text world, therefore, the critical issue of generalization to novel class exemplars cannot be directly studied. To address this shortcoming partially and assess the impact of broader category structures on the composition of the n-gram representation, we modified the learning algorithm so that a word was considered to be hallucinated only if no similar—rather than identical—word was present in the input when the target word was declared recognized. In this case, similar was defined as different up to any single letter replacement—for example, dog = {fog, dig, . . .}, but dog ≠ {fig, og, dogs, . . .}. The resulting category structure imposed on words was thus a heavily overlapping one in which most words were similar to several other words.⁴

⁴ This learning procedure, combined with the assumption that every word is represented by only a simple conjunction of features in R (disjunctions of multiple prototypes per word were not supported), cannot guarantee that every word detector is activated
[Figure 5 appears here: "Decrease in Representational Costs for Broader, Overlapping Object Categories"; x, N-gram Order (2-5); y, Number of N-grams (0-1400); two bar series: tolerance = 0 and tolerance = 1.]
Figure 5: Breakdown of representations produced by two learning runs differing in breadth of category structure. Fifty-character input arrays were used as before to train to an asymptotic 1% error criterion, this time beginning with a 200-element bootstrap representation. The run with tolerance = 0 was as in Figure 4, where a hallucination was scored whenever a word's minimal feature set was activated but the word was not identically present in the input. In the run with tolerance = 1, a hallucination was not scored if any similar word was present in the input, up to a single character replacement. The latter run, with a more lenient category structure, led to a representation that was significantly smaller, since it was not required to discriminate as frequently among highly similar objects.
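The tolerance = 1 similarity test can be stated in a few lines. A minimal sketch with our naming (the paper defines similarity only as "different up to any single letter replacement," so length changes never count as similar):

```python
def similar(a, b, tolerance=1):
    """True if a and b have equal length and differ in at most
    `tolerance` character positions (single-letter replacements)."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) <= tolerance
```

Under this predicate, "dog" matches "fog" or "dig", but not "fig" (two replacements), "og" (a deletion), or "dogs" (an insertion), matching the example in the text.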
The main result of using this more tolerant similarity metric during learning is that fewer word hallucinations are encountered per input image, leading to a final representation that is smaller by more than half (see Figure 5). In short, the demands on a conjunctive n-gram representation are lessened when the object category structure is broadened, since fewer distinctions among objects must be made.

5.2 Scaling of Learned Representation with Input Clutter. As is confirmed by the results of Figure 3, one of the most pernicious threats to a spatially invariant representation is that of clutter, defined here as the quantity of input material that must be simultaneously processed by the representa-

⁴ (continued) in response to any of the legal variants of the word. Rather, the procedure allows a word detector to be activated by any of the legal variants without penalty.
Table 5: Measured Average Values of c, d, and w and Calculated Values of r in a Series of Three Learning Runs with Input Clutter Load Increased from 25 to 75 Characters, Trained to a Fixed Error Criterion of 3%.

Input Width      c       d       w       r
25              55      816     7.6     1.45
50             152     1768    10.7     1.85
75             253     2634    12.8     2.11

Note: The 50-element bootstrap representation was used to initialize all three runs. Equation 4.3 was used to find values of r for each run, such that with w → w/r, the measured values of c and d, and N = 44,414, a 3% error rate was predicted.
tion. The difficulty may be easily seen in equation 4.3, where an unopposed increase in the value of c leads to a precipitous drop in recognition performance. On the other hand, equation 4.3 also indicates that increases in c can be counteracted by appropriate increases in d and w, which conserve the value of (c/d)^w. To examine this issue, we systematically increased the clutter load in a series of three learning runs from 25 to 75 characters, training always to a fixed 3% error criterion, and recorded for each run the postlearning values of c, d, and w. Although the WSJ database contained a wide range of word sizes (from 1 to 20 characters in length), the measurement of a single value for w averaged over all words was justified since, as shown in Figure 4D, the learning algorithm tended to equalize the minimal feature set sizes for words independent of their lengths. Under the assumption of uniformly probable, statistically independent features in the learned representations, the value of h = (c/d)^w for all three runs should be equal to ≈ 7 × 10⁻⁷, the value that predicts a 3% error rate for a 44,414-object database using equation 4.3. However, corrections (w → w/r) were needed to factor out unequal amounts of statistical redundancy in representations of various sizes, where larger representations contained more redundancy. The measured values of c, d, and w and the empirically determined r values are shown for each run in Table 5. As in the runs of Figure 3, redundancy values were again systematically greater for the larger representations generated, and the redundancy values calculated here for the learned representation were again significantly smaller than those seen for the fully enumerated 2-gram representation.

6 Discussion
Using a simple analytical model as a taking-off point and a variety of simulations in text world, we have shown that low-ambiguity perception in un-
segmented multiobject scenes can occur whenever the probability of fully activating any single object's minimal feature set, by a randomly drawn scene, is kept sufficiently low. One prescription for achieving this condition is straightforward: conjunctive features are mined from hallucinated objects until false-positive recognition errors are suppressed to criterion. Even in a complex domain containing a large number of highly similar object categories and severe background clutter, this prescription can produce a remarkably compact representation. This confirms that brute-force enumeration of large families or, worse, all high-order conjunctive features—the combinatorial explosion recognized by Malsburg—is not inevitable. The analysis and experiments reported here have provided several insights into the properties of feature sets that support hallucination-proof recognition and into learning procedures capable of building these feature sets:

1. Efficient feature sets are likely to (1) contain features that span the range from very simple to very complex features, (2) contain relatively many simple features and relatively few complex features, (3) emphasize features that are only moderately common (giving a representation that is neither sparse nor dense) in response to the conflicting constraints that features should appear frequently in objects but not in backgrounds also composed of objects, and (4) in spatial domains, emphasize features that encode the relations of parts to object boundaries.

2. An efficient learning algorithm works to drive toward zero, and therefore to equalize, the false-positive recognition rates for all objects considered individually. Thus, frequently hallucinated objects—objects with few parts or common internal structure, or both—demand the most attention during learning.
Two consequences of this focus of effort on frequently hallucinated objects are that (1) the average value of w, the size of the minimal feature set required to activate an object, becomes nearly independent of the number of parts contained in the object, so that simpler objects are relatively more intensively represented than complex objects, and (2) among objects of the same part complexity, the minimal conjunctive feature sets grow largest for objects containing the most common substructures, though these are not necessarily the most common objects. A curious implication of the pressure to heavily represent objects that are frequently hallucinated is that the backgrounds in which objects are typically embedded can strongly influence the composition of the optimal feature set for recognition.

3. The demands on a visual representation are heavily dependent on the object category structure imposed by the task. Where object classes are large and diffuse, the required representation is smaller and weighted
to features of lower conjunctive order, whereas for a category structure like that of words, in which every object forms its own category that must often be distinguished from a large number of highly similar objects (e.g., cat from cats, cut, rat), the representation must be larger and depend more heavily on features of higher conjunctive order.

6.1 Further Implications of the Analytical Model. Equation 4.3 provides an explicit mathematical relation among several quantities relating to multiple-object recognition. Since the equation assumes statistical independence and uniform activation probability of the features contained in the representation, neither of which is a valid assumption in most practical situations, we were unable to predict recognition errors using measured values for d, c, and w. However, we found that error rates in each case examined could be fitted using a small correction factor for statistical redundancy, which uniformly scaled d, c, and w and whose magnitude grew systematically for larger collections of features activated by larger—more redundant—units of text. From this we conclude that equation 4.3 at least crudely captures the quantitative trade-offs governing recognition performance in receptive-field-based visual systems. One of the pithier features of equation 4.3 is that it provides a criterion for good recognition performance: (c/d)^w ≪ 1/N. The exponential dependence on the minimal feature set size, w, means that modest increases in w can counter a large increase in the number of object categories, N, or unfavorable increases in the activation density, c/d, which could be due to either an increase in the clutter load or a collapse in the size of the representation.
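These trade-offs can be checked numerically. The sketch below uses our variable names, with values rounded from the postlearning measurements (c/d ≈ 0.073 and r ≈ 1.95) reported for the run of Figure 4:

```python
ratio = 0.07   # activation density c/d (rounded postlearning value)
r = 2.0        # redundancy factor (rounded)

def h(ratio, w, r):
    """Per-object full-activation probability with the w -> w/r correction."""
    return ratio ** (w / r)

# Raising w from 10 to 12 shrinks h by 1/ratio (about 14-fold), enough
# to offset a more-than-10-fold increase in the number of objects N.
gain_in_N = h(ratio, 10, r) / h(ratio, 12, r)

# Equivalently, it absorbs an increase in the c/d ratio by the factor x
# solving (x * ratio)**(12 / r) == ratio**(10 / r): about 1.56,
# that is, nearly 60%.
x = gain_in_N ** (r / 12)

# Halving the input material halves c/d; with an effective exponent
# w/r = 5, the expected error rate drops by a factor of 2**5 = 32.
clutter_gain = h(ratio, 10, r) / h(ratio / 2, 10, r)
```

Because the error rate is approximately N·h for small h, a multiplicative drop in h translates directly into the tolerable multiplicative growth in N at a fixed error rate.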
For example, using roughly the postlearning values of c/d ≈ 0.07 and r ≈ 2 from the run of Figure 4, we find that increasing w from 10 to 12 can compensate for a more than 10-fold increase in N, or a nearly 60% increase in the c/d ratio, while maintaining the same recognition error rate. The first-order expansion of equation 4.3 also states that recognition errors grow as the wth power of c/d, where w is generally much larger than 1. A visual system constructed to minimize hallucinations therefore abhors uncompensated increases in clutter or reductions in the size of the representation. These effects are exactly those that have motivated the various compensatory strategies discussed in section 1. The abhorrence of clutter generates strong pressure to invoke any readily available segmentation strategies that limit processing to one or a small number of objects at a time. For example, cutting out just half the visual material in the input array when w/r = 5, as in the above example, knocks down the expected error rate by a factor of 32. The pressure to reduce the number of objects simultaneously presented to the visual system could account in part for the presence in biological visual systems of (1) a fovea, which leads to a huge overrepresentation of the center of fixation and marginalization of the surround, (2) covert attentional mechanisms, which selectively up- or down-modulate sensitivity to different portions of the visual field, and
(3) dynamical processes that segment "good" figures (in the Gestalt sense) from background. The second pressure involved in maintaining a low c/d ratio involves maintaining a visual representation of adequate size, as measured by d. One situation in which this is particularly difficult arises when the task mandates that objects be distinguished under excessively wide ranges of spatial variation, an invariance load that must be borne in turn by each of the individual feature detectors in the representation. For example, consider a relatively simple task in which the orientation of image features is a valid cue to object identity, such as a task in which objects usually appear in a standard orientation. In this case a bank of several orientation-specific variants of each feature can be separately maintained within the representation. In contrast, in a more difficult task that requires that objects be recognized in any orientation, each bank of orientation-specific detectors must be pooled together to create a single, orientation-invariant detector for that conjunctive feature. The inclusion of this invariance in the task definition can thus lead to an order-of-magnitude reduction in the size of the representation, which, unopposed, could produce a catastrophic breakdown of recognition according to equation 4.3. As a rule of thumb, therefore, the inclusion of an additional order-of-magnitude invariance must be countered by an order-of-magnitude increase in the number of distinct feature conjunctions included in the representation, drawn from the reservoir of higher-order features contained in the objects in question. The main cost in this approach lies in the additional hardware, requiring components that are more numerous, expensive, prone to failure, and used at a far lower rate.

6.2 Space, Shape, Invariance, and the Binding Problem.
While one emphasis in this work has been the issue of spatial invariance in a visual representation, we reiterate that the mathematical relation expressed by equation 4.3 knows nothing of space or shape or invariance, but rather views objects and scenes as collections of statistically independent features. What, then, is the role of space? From the point of view of predicting recognition performance and with regard to the quantitative trade-offs discussed here, we can find no special status for worlds in which spatial relations among parts help to distinguish objects, since spatial relations may be viewed simply as additional sources of information regarding object identity. (Not all worlds have this property; an example is olfactory identification.) Further, the issue of spatial invariance plays into the mathematical relation of equation 4.3 only indirectly, in that it specifies the degree of spatial pooling needed to construct the invariant features included in the representation; once again, the details relating to the internal construction of these features are not visible to the analysis. In this sense, the spatial binding problem, which appears at first to derive from an underspecification of the spatial relations needed to glue an object together, in fact reduces to the more mundane problem of ambiguity arising
758
Bartlett W. Mel and J´ozsef Fiser
from overstimulation of the visual representation by input images—that is, a c/d ratio that is simply too large. The hallucination of objects based on parts contained in background clutter can thus occur in any world, even one in which there is no notion of space (e.g., matching bowls of alphabet soup). However, the fact that object recognition in real-world situations typically depends on spatial relations between parts and the fact that recognition invariances are also very often spatial in nature together strongly influence the way in which a visual representation based on spatially invariant receptive fields can be most efficiently constructed. In particular, many economies of design can be realized through the use of hierarchy, as exemplified by the seminal architecture of Fukushima et al. (1983). The key observation is that a population of spatially invariant, higher-order conjunctive features generally shares many underlying processing operations. The present analysis is, however, mute on this issue.

6.3 Relations to "Real" Vision.

The quantitative relations given by equation 4.3 cast the problem of ambiguous perception in cluttered scenes in the simple terms of set intersection probabilities: a cluttered scene activates c of the d feature detectors at random and is correctly perceived with high probability as long as the set of c activated features only rarely includes all w features associated with any one of the N known objects. The analysis is thus heavily abstracted from the details of the visual recognition problem, most obviously ignoring the particular selectivities and invariances that parameterize the detectors contained in the representation.
On the other hand, the analysis makes it possible to understand in a straightforward way how the simultaneous challenges of clutter and invariance conspire to create binding ambiguity and the degree to which compensatory increases in the size of the representation, through conjunctive order boosting, can be used to suppress this ambiguity. These basic trade-offs have operated close to their predicted levels in the text world experiments carried out in this work, in spite of violations of the simplifying assumptions of feature independence and equiprobability. Since the application of our analysis to any particular recognition problem involves estimating the domain-specific quantities c, d, w, and N, we do not expect predictions relating to levels of binding ambiguity or recognition performance to carry over directly from the problem of word recognition in blocks of symbolic text to other visual recognition domains, such as viewpoint-independent recognition of real three-dimensional objects embedded in cluttered scenes. On the other hand, we expect the basic representational trade-offs to persist. The most important way that recognition in more realistic visual domains differs from recognition in text world lies in the invariance load, which in most interesting cases extends well beyond simple translation invariance. Basic-level object naming in humans, for example, is largely invariant to
translation, rotation, scale, and various forms of distortion, occlusion, and degradation—invariances that persist even for certain classes of nonsense objects (Biederman, 1995). Following the logic of our probabilistic analysis, the daunting array of invariances that must be maintained in this and many other natural visual tasks would seem to present a huge challenge to biological visual systems. In keeping with this, performance limitations in human perception suggest that our visual systems have responded in predictable ways to the pressures of the various tasks with which they are confronted; in particular, recognition in a variety of visual domains appears to include only those invariances that are absolutely necessary and those for which simple compensatory strategies are not available. For example, (1) recognition is restricted to a highly localized foveal acceptance window for text or other detailed figures, where by "giving up" on translation invariance, this substream of the visual system frees hardware resources for the several other essential invariances that cannot be so easily compensated for (e.g., size, font, kerning, orientation, distortion); (2) face recognition operates under strong restrictions on the image-plane orientation and direction of lighting of faces (Yin, 1969; Johnston, Hill, & Carman, 1992), a reasonable compromise given that faces almost always present themselves to our visual systems upright and with positive contrast and lighting from above; and (3) unlike the case for common objects, reliable discrimination of complex three-dimensional nonsense objects (e.g., bent paper clips, crumpled pieces of paper) is restricted to a very limited range of three-dimensional orientations in the vicinity of familiar object views (Biederman & Gerhardstein, 1995), though this deficiency can be overcome in some cases by extensive training (Logothetis & Pauls, 1995).
When viewed within our framework, the exceptional difficulty of this latter task arises from the need for, or more precisely the lack of, the very large number of very-high-order conjunctive features—essentially full object views locked to specific orientations (see Figure 1B)—that are necessary to support reliable viewpoint-invariant discrimination among a large set of objects with highly overlapping part structures. In summary, the primate visual system manifests a variety of domain-specific design compromises and performance limitations, which can be interpreted as attempts to achieve simultaneously the needed perceptual invariances, acceptably low levels of binding ambiguity, and reasonable hardware costs. In considering the astounding overall performance of the human visual system in comparison to the technical state of the art, it is also worth noting that human recognition performance is based on neural representations that could contain tens of thousands of task-appropriate visual features (extrapolated from the monkey; see Tanaka, 1996) spanning a wide range of conjunctive orders and finely tuned invariances. In continuing work, we are exploring further implications of the trade-offs discussed here in relation to the neurobiological underpinnings of spatially invariant conjunctive receptive fields in visual cortex (Mel, Ruderman, & Archie, 1998) and on the development of more efficient supervised
760
Bartlett W. Mel and J´ozsef Fiser
and unsupervised hierarchical learning procedures needed to build high-performance, task-specific visual recognition machines.

Acknowledgments
Thanks to Gary Holt for many helpful comments on this work. This work was funded by the Office of Naval Research.

References

Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive learning networks: Development and application in the United States of algorithms related to GMDH. In S. Farrow (Ed.), Self-organizing methods in modeling. New York: Marcel Dekker.
Biederman, I. (1995). Visual object recognition. In S. Kosslyn & D. Osherson (Eds.), An invitation to cognitive science (2nd ed.) (pp. 121–165). Cambridge, MA: MIT Press.
Biederman, I., & Gerhardstein, P. (1995). Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). J. Exp. Psychol. (Human Perception and Performance), 21, 1506–1514.
Califano, A., & Mohan, R. (1994). Multidimensional indexing for recognizing visual shapes. IEEE Trans. on PAMI, 16, 373–392.
Charniak, E. (1993). Statistical language learning. Cambridge, MA: MIT Press.
Douglas, R., & Martin, K. (1998). Neocortex. In G. Shepherd (Ed.), The synaptic organization of the brain. Oxford: Oxford University Press.
Edelman, S., & Duvdevani-Bar, S. (1997). A model of visual recognition and categorization. Phil. Trans. R. Soc. Lond. [Biol], 352, 1191–1202.
Fahlman, S., & Lebiere, C. (1990). The cascade-correlation learning architecture. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 524–532). San Mateo, CA: Morgan Kaufmann.
Fize, D., Boulanouar, K., Ranjeva, J., Fabre-Thorpe, M., & Thorpe, S. (1998). Brain activity during rapid scene categorization—a study using event-related fMRI. J. Cog. Neurosci., suppl. S, 72–72.
Fukushima, K., Miyake, S., & Ito, T. (1983). Neocognitron: A neural network model for a mechanism of visual pattern recognition. IEEE Trans. Sys. Man & Cybernetics, SMC-13, 826–834.
Gilbert, C. (1983). Microcircuitry of the visual cortex. Ann. Rev. Neurosci.
Heller, J., Hertz, J., Kjær, T., & Richmond, B. (1995).
Information flow and temporal coding in primate pattern vision. J. Comput. Neurosci., 2, 175–193.
Hubel, D., & Wiesel, T. (1968). Receptive fields and functional architecture of monkey striate cortex. J. Physiol., 195, 215–243.
Hummel, J., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psych. Rev., 99, 480–517.
Johnston, A., Hill, H., & Carman, N. (1992). Recognizing faces: Effects of lighting direction, inversion, and brightness reversal. Perception, 21, 365–375.
Jones, E. (1981). Anatomy of cerebral cortex: Columnar input-output relations. In F. Schmitt, F. Worden, G. Adelman, & S. Dennis (Eds.), The organization of cerebral cortex. Cambridge, MA: MIT Press.
Kučera, H., & Francis, W. (1967). Computational analysis of present-day American English. Providence, RI: Brown University Press.
Lades, M., Vorbrüggen, J., Buhmann, J., Lange, J., Malsburg, C., Würtz, R., & Konen, W. (1993). Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42, 300–311.
Lang, G. K., & Seitz, P. (1997). Robust classification of arbitrary object classes based on hierarchical spatial feature-matching. Machine Vision and Applications, 10, 123–135.
Le Cun, Y., Matan, O., Boser, B., Denker, J., Henderson, D., Howard, R., Hubbard, W., Jackel, L., & Baird, H. (1990). Handwritten zip code recognition with multilayer networks. In Proc. of the 10th Int. Conf. on Patt. Rec. Los Alamitos, CA: IEEE Computer Science Press.
Logothetis, N., & Pauls, J. (1995). Psychophysical and physiological evidence for viewer-centered object representations in the primate. Cerebral Cortex, 3, 270–288.
Logothetis, N., & Sheinberg, D. (1996). Visual object recognition. Ann. Rev. Neurosci., 19, 577–621.
Malsburg, C. (1994). The correlation theory of brain function (reprint from 1981). In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks II (pp. 95–119). Berlin: Springer.
McClelland, J., & Rumelhart, D. (1981). An interactive activation model of context effects in letter perception. Psych. Rev., 88, 375–407.
Mel, B. W. (1997). SEEMORE: Combining color, shape, and texture histogramming in a neurally inspired approach to visual object recognition. Neural Computation, 9, 777–804.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. J. Neurosci., 17, 4325–4334.
Mozer, M. (1991).
The perception of multiple objects. Cambridge, MA: MIT Press.
Oram, M., & Perrett, D. (1992). Time course of neural responses discriminating different views of the face and head. J. Neurophysiol., 68(1), 70–84.
Oram, M. W., & Perrett, D. I. (1994). Modeling visual recognition from neurobiological constraints. Neural Networks, 7(6/7), 945–972.
Pitts, W., & McCulloch, W. (1947). How we know universals: The perception of auditory and visual forms. Bull. Math. Biophys., 9, 127–147.
Potter, M. (1976). Short-term conceptual memory for pictures. J. Exp. Psychol.: Human Learning and Memory, 2, 509–522.
Sandon, P., & Uhr, L. (1988). An adaptive model for viewpoint-invariant object recognition. In Proc. of the 10th Ann. Conf. of the Cog. Sci. Soc. (pp. 209–215). Hillsdale, NJ: Erlbaum.
Schiele, B., & Crowley, J. (1996). Probabilistic object recognition using multidimensional receptive field histograms. In Proc. of the 13th Int. Conf. on Patt. Rec. (Vol. 2, pp. 50–54). Los Alamitos, CA: IEEE Computer Society Press.
Swain, M., & Ballard, D. (1991). Color indexing. Int. J. Computer Vision, 7, 11–32.
Szentágothai, J. (1977). The neuron network of the cerebral cortex: A functional interpretation. Proc. R. Soc. Lond. B, 201, 219–248.
Tanaka, K. (1996). Inferotemporal cortex and object vision. Ann. Rev. Neurosci., 19, 109–139.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Van Essen, D. (1985). Functional organization of primate visual cortex. In A. Peters & E. Jones (Eds.), Cerebral cortex (pp. 259–329). New York: Plenum Publishing.
Wallis, G., & Rolls, E. T. (1997). Invariant face and object recognition in the visual system. Prog. Neurobiol., 51, 167–194.
Weng, J., Ahuja, N., & Huang, T. S. (1997). Learning recognition and segmentation using the cresceptron. Int. J. Comp. Vis., 25(2), 109–143.
Wickelgren, W. (1969). Context-sensitive coding, associative memory, and serial order in (speech) behavior. Psych. Rev., 76, 1–15.
Yin, R. (1969). Looking at upside-down faces. J. Exp. Psychol., 81, 141–145.
Zemel, R., Mozer, M., & Hinton, G. (1990). TRAFFIC: Recognizing objects using hierarchical reference frame transformations. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 266–273). San Mateo, CA: Morgan Kaufmann.

Received September 29, 1998; accepted March 5, 1999.
This paper was originally printed in Neural Computation, Volume 12, Number 2—pages 247–278. It is being reprinted here because of print errors in Figures 2, 4, and 5.
ARTICLE
Communicated by William Bialek
The Multifractal Structure of Contrast Changes in Natural Images: From Sharp Edges to Textures

Antonio Turiel
Néstor Parga
Departamento de Física Teórica, Universidad Autónoma de Madrid, 28049 Madrid, Spain

We present a formalism that leads naturally to a hierarchical description of the different contrast structures in images, providing precise definitions of sharp edges and other texture components. Within this formalism, we achieve a decomposition of the pixels of the image in sets, the fractal components of the image, such that each set contains only points characterized by a fixed strength of the singularity of the contrast gradient in its neighborhood. A crucial role in this description of images is played by the behavior of contrast differences under changes in scale. Contrary to naive scaling ideas where the image is thought to have uniform transformation properties (Field, 1987), each of these fractal components has its own transformation law and scaling exponents. A conjecture on their biological relevance is also given.
1 Introduction
It has been frequently remarked that natural images contain redundant information. Two nearby points in a given image are very likely to have similar values of light intensity, except when an edge lies between them. In this case, the sharp change in luminosity also represents a loss of predictability, because probably the two points belong to two different objects, with almost independent luminosities. Since they cannot be predicted easily, it is frequently said that edges are the most informative structures in the image and that they represent their independent features (Bell & Sejnowski, 1997). Anyone's subjective experience is consistent with this: most people would start drawing a scene by tracing the lines of the obvious contours in the scene. Even more, the picture can be understood by anyone observing that sketch. Only afterward does one start adding texture to the light-flat areas in the drawing. These textures represent the less informative data about the position and intensity of the light sources. It seems that different structures can be distinguished in every image, following a hierarchy defined according to their information content, from the sharpest edges to the softest textures.

Neural Computation 12, 763–793 (2000)
© 2000 Massachusetts Institute of Technology
764
Antonio Turiel and N´estor Parga
Image-processing techniques also emphasize that edges are the most prominent structures in a scene. A large variety of methods for edge detection have been proposed, among them high-pass filtering, zero crossing, and wavelet projections (Gonzalez & Woods, 1992). But once the edges have been extracted from the image, these methods do not address the problem of how to describe and obtain the other structures. In order to detect these other sets and extract them in the context of an organized hierarchy, it is necessary to provide a more general mathematical framework and, in particular, to define the variables suitable to perform this classification. In this article we present a more rigorous approach to these issues, proposing precise definitions of edges and other texture components of images. As we shall see, our formalism leads very naturally to a hierarchical description of the different structures in the images, achieving a decomposition of its pixels in sets, or components, ranging from the most to the least informative structures. Another motivation to understand the role that edges play in natural images is related to the development of the early visual system. The brain has to solve the problem of how to represent and transmit efficiently the information it receives from the outside world. As has been frequently pointed out (Barlow, 1961; Linsker, 1988; Atick, 1992; van Hateren, 1992), cells located in the first stages of the visual pathway do this by trying to take advantage of statistical regularities in the image ensemble. The idea is that since natural images are highly redundant over space, they have to be represented in a compact and nonredundant way before being transmitted to inner areas in the brain. In order to do this, it is first necessary to know the regularities of the images, which then could be used to build more appropriate and efficient internal representations of the environment.
The statistical properties of natural images have been studied for many years, but the effort was mainly restricted to second-order statistics. The emphasis was put on the two-point correlation, first because of the simplicity of the analysis and then because its Fourier transform (the power spectrum) has a power-law behavior that reflects the existence of scale-invariant properties in natural images. This implies that the second-order statistics do capture some of the regularities. However, most of the underlying statistical structure remains unveiled. One can easily convince oneself that most of the correlations in the image are still present after whitening is performed: the contours of the objects in the whitened image are still easily recognizable (Field, 1987). The conclusion is that non-Gaussian statistical properties of edges are important to describe natural images, a fact that has also been recognized by Ruderman and Bialek (1994) and Ruderman (1994). However, no systematic studies of the statistical properties of edges had been done until very recently (Turiel, Mato, Parga, & Nadal, 1998). A complete characterization of the regularities of natural images is impossible, and probably useless. Some intuition is necessary to guide the search for regularities. The aim of this work is to exploit the basic idea that
Structure of Contrast Changes
765
changes in contrast are relevant because they occur at the most informative pixels of the scene. Hence we focus on the statistics of changes in luminosity. Since these changes are graded, the task could seem rather difficult and even hopeless, but the solution to this problem turns out to be quite simple and elegant. A single parameter is enough to describe a full hierarchy of differences in contrast, and the hierarchy can be explicitly constructed. Moreover, it is intimately related to the scale-invariance properties of the image, which are certainly more elaborate than those that have been used until now but are still simple enough and have plenty of structure to be detected by a network adapting to the statistics of the stimuli. To understand these scale invariances we first need the basic concepts of multiscaling and multifractality (Falconer, 1990), together with a simple explanation of their meaning for images. While in naive scaling there is a single transformation law throughout the image (which is then said to be a fractal), in the more general case there is no invariance under a global scale transformation, but the space can be decomposed in sets with a different transformation law for each of them (the image is then said to be a multifractal). To make this concept more precise, let us consider a simple situation where a set of images presents a single type of change in contrast: the luminosity suddenly changes from one (roughly) constant value to another. In this case, a natural description of the properties of changes in contrast would be to count the number of such changes that appear along a segment of size r, oriented in a given direction. The statistical distribution of the position and size of these jumps in contrast would give relevant information about an ensemble of such images. However, in real images, this counting is not enough to describe well the properties of contrast changes.
This is because these changes do not follow such a simple pattern: some of them can be sharp (the notion of sharp changes has to be given), but there will also be all types of softer textures that would be lost by counting just the most noticeable changes. It is then necessary to consider all of them as a whole. A natural way to deal with contrast changes is by defining a quantity that accumulates all of them, whatever their strength, contained inside a scale r, that is:

$$\varepsilon_r(\vec{x}) \;=\; \frac{1}{r^2} \int_{x_1 - \frac{r}{2}}^{x_1 + \frac{r}{2}} dx'_1 \int_{x_2 - \frac{r}{2}}^{x_2 + \frac{r}{2}} dx'_2 \, |\nabla C(\vec{x}')| . \tag{1.1}$$

Hereafter, the contrast $C(\vec{x})$ is taken as $C(\vec{x}) = I(\vec{x}) - \langle I \rangle$, where $I(\vec{x})$ is the field of luminosities and $\langle I \rangle$ its average value across the ensemble. The bidimensional integral¹ on the right-hand side, defined on the set of pixels contained

¹ The variables $x'_1$, $x'_2$ are the components of the vector $\vec{x}'$, and $|\cdot|$ denotes the modulus of a vector.
in a square of linear size r, is a measure of that square.² It is divided by the factor r², which is the usual Lebesgue measure (which we will denote λ) of a square of linear size r. The quantity $\varepsilon_r(\vec{x})$ can then be regarded as the ratio between these two measures. More generally, we can define the measure μ of a subset A of image pixels as

$$\mu(A) = \int_A d\vec{x}' \, |\nabla C(\vec{x}')| \tag{1.2}$$

($\int_A d\vec{x}'$ means bidimensional integration over the set A)—what we will call the edge measure (EM) of the subset A. It is also possible to generalize the definition of $\varepsilon_r$ for any subset A:

$$\varepsilon_A = \frac{\mu(A)}{\lambda(A)} . \tag{1.3}$$

$\varepsilon_A$ is a density, a quantity that compares how much the distribution of $|\nabla C|$ deviates from being homogeneously distributed on A. We shall call it the edge content (EC) of the set A. Denoting $B_r(\vec{x})$ the "ball"³ of radius r centered around $\vec{x}$, it is clear that $\varepsilon_r(\vec{x}) \equiv \varepsilon_{B_r(\vec{x})}$, so this definition generalizes the previous one. We shall refer to $\varepsilon_r(\vec{x})$ as the EC of $\vec{x}$ at the scale r. Notice that the μ-density at $\vec{x}$ is $|\nabla C|(\vec{x})$. Thus, equation 1.2 can be symbolically expressed as

$$d\mu(\vec{x}) = |\nabla C|(\vec{x}) \, d\vec{x} . \tag{1.4}$$
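As a concrete numerical reading of equations 1.1 through 1.3 (our sketch, not the authors' code), the edge measure μ of a square of linear size r is the sum of the gradient modulus |∇C| over its pixels, and the edge content is that measure divided by the Lebesgue measure r²:

```python
import numpy as np

def edge_content(C, x1, x2, r):
    """Edge content eps_r at pixel (x1, x2): the edge measure mu of the
    r x r square centered there (sum of |grad C| over the square, a
    discrete stand-in for equation 1.2) divided by the Lebesgue
    measure lambda = r**2 (equation 1.3)."""
    g1, g2 = np.gradient(C)
    grad_mod = np.hypot(g1, g2)              # |grad C| at every pixel
    h = r // 2
    square = grad_mod[x1 - h:x1 + h + 1, x2 - h:x2 + h + 1]
    return square.sum() / float(r * r)

rng = np.random.default_rng(0)
I = rng.standard_normal((64, 64))
C = I - I.mean()                             # contrast: C = I - <I>
print(edge_content(C, 32, 32, 9) > 0.0)      # True: the measure is positive
```

The additivity of μ noted in footnote 2 holds by construction here: summing |∇C| over two disjoint halves of the square gives the same measure as summing over the whole square.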
The main point is that contrast changes are distributed over the image in such a way that the EC has rather large contributions even from pixels that are very close together. As a consequence, the measure presents a rather irregular behavior. But it is precisely in its irregularity where the information about the contrast lies. Let us assume, for instance, that as r becomes very small, the measure behaves at every point $\vec{x}$ as

$$\mu(B_r(\vec{x})) \;\equiv\; \mu_r(\vec{x}) \;=\; \alpha(\vec{x}) \, r^{h(\vec{x}) + d} , \tag{1.5}$$

where d = 2 is the dimension of the images. For convenience a shift in the definition of the exponent of $\mu_r(\vec{x})$ has been introduced. The EC then verifies

$$\varepsilon_r(\vec{x}) = \alpha(\vec{x}) \, r^{h(\vec{x})} , \tag{1.6}$$
² Any measure of a set has the property of being additive; that is, if one splits the set in pieces, the measure of the set is equivalent to the sum of the measures of all its pieces.
³ "Ball" in a mathematical sense: it does not need to be a true circle, but could also be a square or diamond. In fact, here the "balls" will be taken as squares.
and this choice of $h(\vec{x})$ removes trivial dependencies on r. The exponent $h(\vec{x})$ is just the singularity exponent of $|\nabla C|(\vec{x})$. That is, h < 0 indicates a divergence of $|\nabla C|(\vec{x})$ to infinity, while h > 0 indicates finiteness and continuous behavior. The greater the exponent, the smoother is the density around that point. In general, we will talk about "singularity" even for the positive-h cases. Thus, the meaning of equation 1.5 is that all the points are singular (in this wide sense) and that the singularity exponent is not uniform. This irregular behavior would allow us to classify the pixels in a given image: the set of pixels with a singularity exponent contained in the interval $[h - \Delta h, h + \Delta h]$ (where $\Delta h$ is a small positive number) defines a class $F_h$. These classes are the fractal components of the image. The smaller the exponent, the more singular is the class, and the most singular component is the one with the smallest value of h. A measure μ verifying equation 1.5 in every point $\vec{x}$ is called a multifractal measure. We will refer to these components of the images as fractal sets because they are of a very irregular nature. One way to characterize the odd arrangement of the points in fractal sets is by just counting the number of points contained inside a given ball of radius r; we will denote by $N_r(h, \Delta h)$ this number for the set $F_h$. As $r \to 0$ it is verified that

$$N_r(h, \Delta h) \;\sim\; r^{D(h)} . \tag{1.7}$$
This new exponent, D(h), quantifies the size of the set of pixels with singularity h as the image is covered with small balls of radius r. It is the fractal dimension of the associated fractal component $F_h$, and the function D(h) is called the dimension spectrum (or the singularity spectrum) of the multifractal. It is also worth estimating the probability for a ball of radius r to contain a pixel belonging to a given fractal component. Its behavior for small scales r is

$$\text{Prob}(h, \Delta h) \;\sim\; r^{d - D(h)} , \tag{1.8}$$
where d = 2 as before. Notice that since r is small, this quantity increases as D(h) approaches d.⁴ The multifractal behavior also implies a multiscaling effect. Because every fractal component possesses its own fractal dimension, each changes differently under changes in the scale. Thus, although every component has no definite scale and is invariant under the scale transformations, scale invariance is broken for a given image: the points in the image transform with different exponents according to the fractal component to which they belong. This leads to the presence of intermittency effects: if the image were
⁴ It can be proved that any set contained in a d-dimensional real space has fractal dimension D smaller than or equal to d.
enlarged, its statistical properties would change. Self-similarity is lost. However, scale invariance is restored in the averages over an ensemble of such images. At this point, one may wonder if a multifractal behavior is actually realized in natural images. One way to check its validity is to verify that equation 1.5 is indeed correct. This in turn leads to the technical problem of how to obtain in practice the exponent $h(\vec{x})$ at a given pixel $\vec{x}$ of the image. This is a local analysis where the singularity is found by looking at the neighborhood of the point $\vec{x}$. It has the advantage that it not only provides a tool to check the multifractal character of the measure, but it also gives a way to decompose each image in its fractal components. These issues are addressed in the next section. The multifractality of the measure can also be studied by means of statistical methods. In this case one looks for the effect of the singularities on the moments of marginal distributions of, for example, the EC defined in equation 1.1. This is done in section 3, where the important notions of self-similarity, extended self-similarity, and multiplicative processes are explained. Once the relevant mathematical tools are given, these are used in section 4 to analyze natural images. First, in section 4.2, we check that the measure introduced in equation 1.2 is indeed multifractal. Examples of fractal components of natural images are also presented in the same section. After this we study the consistency between the singularity analysis and the statistical analysis of the measure (section 4.3). The results of the numerical study of self-similarity properties of the contrast $C(\vec{x})$ and of the measure density $|\nabla C|(\vec{x})$ are presented in section 4.4. The discussion of the results and perspectives for future work are given in the last section. Some technical aspects are included in the appendix.
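The classification into fractal components, and the dimension spectrum of equations 1.7 and 1.8, can be sketched numerically. The toy example below is ours, with a synthetic exponent map (robustly extracting h(x) requires the wavelet machinery of section 2): pixels are binned by singularity exponent into classes F_h, and D(h) is estimated by box counting from Prob(h, Δh) ∼ r^(d−D(h)):

```python
import numpy as np

def fractal_components(h_map, edges):
    """Bin pixels by singularity exponent: F_h collects the pixels whose
    exponent lies in [lo, hi), the bins being given by their edges."""
    return {(lo, hi): (h_map >= lo) & (h_map < hi)
            for lo, hi in zip(edges[:-1], edges[1:])}

def dimension(mask, box):
    """Estimate D(h) via equation 1.8: the fraction of box x box cells
    that meet the component scales as r**(d - D), with r = box / L."""
    L = mask.shape[0]
    cells = [mask[i:i + box, j:j + box].any()
             for i in range(0, L, box) for j in range(0, L, box)]
    prob = np.mean(cells)
    return 2.0 - np.log(prob) / np.log(box / L)

# A space-filling component (every pixel, h = 0) should give D close to d = 2.
h_map = np.zeros((128, 128))
comps = fractal_components(h_map, [-0.5, 0.5])
print(round(dimension(comps[(-0.5, 0.5)], 8), 2))  # 2.0
```

For a component concentrated on a single row of pixels (a line), the same estimator returns a dimension of 1, as expected for a one-dimensional set.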
2 Singularity Analysis: The Wavelet Transform
In this section we present the appropriate techniques to deal with multifractal measures, especially those adapted to natural images. To check if μ defines a multifractal measure and to obtain the local exponent $h(\vec{x})$, we need to analyze the dependence of $\mu(B_r(\vec{x}))$ on the scale r. Unfortunately, direct logarithmic regression performed on equation 1.5 yields rather coarse results when it is applied to discretized images, allowing only the detection of very sharp, well-isolated features. It is then necessary to use more sophisticated methods. A convenient technique for this purpose (Mallat & Huang, 1992; Arneodo, Argoul, Bacry, Elezgaray, & Muzy, 1995; Arneodo, 1996) is the wavelet transform (see, e.g., Daubechies, 1992). Roughly speaking, the wavelet transform provides a way to interpolate the behavior of the measure μ over balls of radius r when r is a noninteger number of pixels. Besides, if the image is a discretization of a continuous signal, the wavelet analysis is the appropriate technique to retrieve the correct exponents, as we explain later.
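A minimal numerical sketch of this scale analysis (our illustration; the Gaussian kernel and the scales are arbitrary choices): project the gradient-modulus field onto a positive, normalized kernel at several scales and read the exponent h(x) off a log-log regression.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def singularity_exponent(grad_mod, x1, x2, scales):
    """Estimate h at pixel (x1, x2) as the slope of the log of the
    wavelet projection versus log r.  Psi_r is a normalized Gaussian
    of width r: a positive (non-admissible) wavelet, which suffices
    for a positive measure."""
    proj = [gaussian_filter(grad_mod, sigma=r)[x1, x2] for r in scales]
    slope, _ = np.polyfit(np.log(scales), np.log(proj), 1)
    return slope

field = np.zeros((129, 129))
field[64, 64] = 1.0                     # an isolated, delta-like singularity
h = singularity_exponent(field, 64, 64, [2.0, 4.0, 8.0])
print(round(float(h)))                  # -2: the most singular case, h = -d
```

A uniform gradient field gives h ≈ 0 under the same estimator, matching the interpretation of h > 0 (or h = 0) as smooth, nonsingular behavior.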
The wavelet transform of a measure μ is a function that is not localized in either space or frequency. Its arguments are then the site $\vec{x}$ and the scale r. It also involves an appropriate, smooth, interpolating function $\Psi(\vec{x})$. More precisely, the wavelet transform of μ is:

$$T_\Psi^r \, d\mu(\vec{x}) \;\equiv\; \int d\mu(\vec{x}') \, \Psi_r(\vec{x} - \vec{x}') \;=\; \left( \frac{d\mu}{d\vec{x}} * \Psi_r \right)(\vec{x}) , \tag{2.1}$$

where $\Psi_r(\vec{x}) \equiv \frac{1}{r^d} \Psi\!\left(\frac{\vec{x}}{r}\right)$. The important point here is that for the case of a positive measure, it can be proved that

$$T_\Psi^r \, d\mu(\vec{x}_0) \;\sim\; r^{h(\vec{x}_0)} \tag{2.2}$$
if and only if μ verifies equation 1.5 (h being exactly the same) and Ψ decreases fast enough (see Arneodo, 1996, and references therein). In that way, this property allows us to extract the singularities directly from $T_\Psi^r d\mu$. It is very important to remark that Ψ can be a positive function, which makes an essential difference with respect to the analysis of multiaffine functions (see section 3.4).⁵

In general the ideal, continuous measure $d\mu(\vec{x})$ is unknown. In these cases the data are given by a discretized sampling, which in our case is a collection of pixels. One can reasonably argue that each pixel is the result of the convolution of an ideal signal with a compact-support function $\chi_a(\vec{x})$ (describing, for example, the resolution ∼ a of the optical device). In this case, under convenient hypotheses, the wavelet transform yields the correct projection of the ideal signal. Let us call $\vec{x}_{(n_1,n_2)} \equiv \vec{x}_{\vec{n}}$ the positions of the pixels in the discretized sample and $C_{\vec{n}}$ the values of the discretized version of the contrast; then

$$C_{\vec{n}} \;=\; T_{\chi_a} C(\vec{x}_{\vec{n}}) \;\equiv\; C * \chi_a(\vec{x}_{\vec{n}}). \qquad (2.3)$$

Taking the sample $\{\vec{x}_{\vec{n}}\}$ in such a way that the centers of adjacent pixels are at a distance larger than a, it is verified that

$$\nabla C_{\vec{n}} \;\equiv\; \left( C_{(n_1+1,n_2)} - C_{(n_1,n_2)},\; C_{(n_1,n_2+1)} - C_{(n_1,n_2)} \right) \;\sim\; \nabla C * \chi_a(\vec{x}_{\vec{n}}), \qquad (2.4)$$

and so the discretized version $\mu^{(dis)}$ of the ideal measure μ can be considered as a wavelet coefficient of the continuous μ with the wavelet χ at the scale a, namely,

$$d\mu^{(dis)}_{\vec{n}} \;\sim\; d\vec{x}\; |\nabla C| * \chi_a(\vec{x}_{\vec{n}}) \;=\; d\vec{x}\; T_{\chi_a} d\mu(\vec{x}_{\vec{n}}). \qquad (2.5)$$

⁵ This means that the exponents can be obtained with a nonadmissible wavelet Ψ, that is, a function with a nonzero mean.
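As a concrete illustration of the projection in equation 2.1 and the per-pixel regression implied by equation 2.2, here is a minimal numpy sketch (ours, not the authors' code). It assumes the positive wavelet family $\Psi_\gamma(\vec{x}) = 1/(1+|\vec{x}|^2)^\gamma$ with γ = 2, which the authors report as convenient later in section 4.2, and a plain truncated direct convolution; the function names are hypothetical.

```python
import numpy as np

def wavelet_projection(mu_density, r, gamma=2.0):
    """Project the measure density |grad C| onto the positive (nonadmissible)
    wavelet Psi(x) = 1 / (1 + |x|^2)^gamma rescaled to scale r (in pixels)."""
    half = int(np.ceil(4 * r))
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    psi = 1.0 / (1.0 + (x**2 + y**2) / r**2) ** gamma
    psi /= r**2                       # Psi_r(x) = Psi(x/r) / r^d, here d = 2
    # direct truncated 2-D convolution (FFT is avoided; cf. the aliasing caveat)
    H, W = mu_density.shape
    padded = np.pad(mu_density, half, mode="reflect")
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + 2*half + 1, j:j + 2*half + 1] * psi)
    return out

def singularity_exponents(contrast, scales=(0.5, 1.0, 1.5, 2.0)):
    """Estimate h(x) per pixel by a log-log regression of T_Psi^r dmu
    against r (equation 2.2), over scales of a few pixels."""
    gy, gx = np.gradient(contrast.astype(float))
    mu_density = np.hypot(gx, gy)                 # mu-density |grad C|
    logs = np.log(np.array(scales))
    T = np.stack([np.log(wavelet_projection(mu_density, r) + 1e-12)
                  for r in scales])
    lm = logs.mean()
    # pixel-wise least-squares slope of log T versus log r
    h = ((logs[:, None, None] - lm) * (T - T.mean(axis=0))).sum(axis=0) \
        / ((logs - lm) ** 2).sum()
    return h
```

A pixel on a sharp line edge should come out with a markedly more negative exponent than a pixel in a smooth region.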
Antonio Turiel and Néstor Parga
Taking into account that the convolution is associative, the discretized wavelet analysis of $d\mu^{(dis)}_{\vec{n}}$ with a wavelet Ψ can be simply expressed as

$$T_\Psi^r d\mu^{(dis)}(\vec{x}_{\vec{n}}) \;\sim\; T^r_{\Psi * \chi_{a/r}} d\mu(\vec{x}_{\vec{n}}), \qquad (2.6)$$
that is, analyzing the discretized measure $\mu^{(dis)}$ with a wavelet Ψ is equivalent to analyzing the continuous measure μ with a wavelet $\Psi * \chi_{a/r}$. But if the scale r is large enough compared to the photoreceptor extent a, $\chi_{a/r}(\vec{x}) \sim \delta(\vec{x})$, which does not depend on r. Thus, if the internal size a is small enough to allow fine detection, we can recover the correct exponent at the point $\vec{x}_{\vec{n}}$ by just analyzing the discretized measure. Conversely, the extent a imposes a lower cutoff on the details that can be observed in any image constructed from pixels. Because it is just the size of the pixel, the estimation of the singularity exponent at a given point requires performing the analysis over scales containing several pixels.

This analysis is similar to that performed in Mallat and Zhong (1991) and Mallat and Huang (1992): any image is analyzed by means of its scaling properties under wavelet projection. There are, however, important differences between the study performed by these authors and the work examined here. The first difference concerns the basic scalar field to be analyzed: Mallat and Zhong considered $C(\vec{x})$ instead of $|\nabla C|(\vec{x})$, which we prefer because it is the μ-density of a multifractal measure μ (see section 4). Another difference is methodological: we intend to classify the points according to their singularity exponents to form every fractal component, while the previous work was devoted to obtaining only the wavelet transform modulus maxima, a concept related to but different from what we will call the most singular component, that is, only one (although the most important) of the fractal components. The third difference concerns motivation: we want to classify every point with respect to a hierarchical scheme, while these authors sought a wavelet-based optimal codification algorithm, in the sense of providing an (almost) perfect reconstruction algorithm from the wavelet maxima.

3 Statistical Self-Similarity of Natural Images
In this section we introduce the basic concepts of the statistical approach, defining at the same time new variables that are closer to the wavelet analysis.

3.1 SS and ESS. From the statistical point of view, $\mu_r$ is a random variable, in the sense that given a random point $\vec{x}$ in an arbitrary image I, the value of $\mu_r(\vec{x})$ cannot be deterministically predicted. By definition, the multifractal structure is a multiscaling effect. This multiscaling should be somehow apparent from the dependence of the probability density of $\mu_r$ on the scale parameter r. To determine the whole probability
density function of $\mu_r$ requires a large data set, especially for the rare events; besides, its dependence on r could mix multiscaling effects in a complicated way. On the contrary, the p-moments of $\mu_r$ (that is, the expectation values of $\mu_r$ raised to the pth power) suffice to characterize the probability density completely,⁶ and given that they are averages of powers of $\mu_r$, they are likely to depend on r in a simple way, somehow related to the behavior of the measure in equation 1.5. It is more convenient to deal with the p-moments of normalized versions of $\mu_r$, such as the EC $\epsilon_r \equiv \frac{1}{r^2}\,\mu_r$ or the wavelet projections $|T_\Psi^r d\mu|$. They are normalized in such a way that the first-order moments, $\langle \epsilon_r \rangle$ and $\langle |T_\Psi^r d\mu| \rangle$, do not depend on r: the trivial factor $r^d$ is hence removed from the random variable. As we will see, the moments $\langle \epsilon_r^p \rangle$ and $\langle |T_\Psi^r d\mu|^p \rangle$ exhibit the remarkable scaling properties of self-similarity (SS):
$$\langle \epsilon_r^p \rangle \;=\; \alpha_p\, r^{\tau_p} \qquad (3.1)$$
(analogously for $|T_\Psi^r d\mu|$ with exponents $\tau_p^\Psi$) and of extended self-similarity (ESS), referred to the moment of order 2 (Benzi, Ciliberto, Baudet, Chavarria, & Tripiccione, 1993; Benzi, Biferale, Crisanti, Paladin, Vergassola, & Vulpiani, 1993; Benzi, Ciliberto, & Chavarria, 1995):
$$\langle \epsilon_r^p \rangle \;=\; A(p,2)\, \langle \epsilon_r^2 \rangle^{\rho(p,2)} \qquad (3.2)$$
(and similarly for $|T_\Psi^r d\mu|$ with exponents $\rho_\Psi(p,2)$). ESS has been referred to the moment of order 2, but equation 3.2 trivially implies that any moment can be expressed as a power of the moment of order q with an exponent $\rho(p,q) = \rho(p,2)/\rho(q,2)$. In this way, we have the sets of SS exponents $\tau_p$ and ESS exponents $\rho(p,2)$ for $\epsilon_r$, and similarly the exponents $\tau_p^\Psi$ and $\rho_\Psi(p,2)$ for $T_\Psi^r d\mu$. Notice that if SS holds, so does ESS, in which case there is a simple relation between $\tau_p$ and $\rho(p,2)$:

$$\tau_p \;=\; \tau_2\, \rho(p,2). \qquad (3.3)$$
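To make equations 3.1 through 3.3 concrete, the following toy sketch (our own illustration, not the authors' pipeline) estimates the SS exponents $\tau_p$ and the ESS exponents ρ(p, 2) by the log-log regressions those equations suggest, on synthetic monofractal data for which $\tau_p = -0.3p$ and hence ρ(p, 2) = p/2 exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

def ss_ess_exponents(samples_by_scale, scales, ps=(1, 2, 3, 4)):
    """Estimate SS exponents tau_p (eq. 3.1) by regressing log <eps_r^p> on
    log r, and ESS exponents rho(p,2) (eq. 3.2) by regressing log <eps_r^p>
    on log <eps_r^2>.  samples_by_scale[i] holds samples of eps_r at scales[i]."""
    logm = np.array([[np.log(np.mean(s ** p)) for p in ps]
                     for s in samples_by_scale])           # (n_scales, n_p)
    logr = np.log(np.asarray(scales, float))
    tau = np.array([np.polyfit(logr, logm[:, j], 1)[0] for j in range(len(ps))])
    i2 = ps.index(2)
    rho = np.array([np.polyfit(logm[:, i2], logm[:, j], 1)[0]
                    for j in range(len(ps))])
    return tau, rho

# monofractal toy data: eps_r = X * r**(-0.3) with an r-independent factor X,
# so tau_p = -0.3 p and rho(p,2) = p/2 exactly
scales = [4, 8, 16, 32, 64]
data = [rng.lognormal(0.0, 0.5, 200_000) * r ** (-0.3) for r in scales]
tau, rho = ss_ess_exponents(data, scales)
```

For such monofractal data the estimated $\tau_p$ is linear in p and $\tau_p = \tau_2\,\rho(p,2)$ holds; for a multifractal signal, $\tau_p$ would bend away from a straight line.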
The actual dependence of $\tau_p$ and $\rho(p,2)$ on p determines the fractal structure of the system. A trivial dependence $\tau_p \propto p$ would reveal a monofractal structure. A different dependence on p indicates the existence of a more complicated (multifractal) geometrical structure.

3.2 The Multiplicative Process. A random variable subtending an area of linear size r and possessing SS (see equation 3.1) and ESS (see equation 3.2)
⁶ Provided they do not diverge too fast with p.
can be described statistically by means of a multiplicative process (Benzi, Biferale, Crisanti, Paladin, Vergassola, & Vulpiani, 1993; Novikov, 1994). The statistical formulation is rather simple: given two different scales r and L such that r < L, there is a simple stochastic relation between the ECs at these two scales,

$$\epsilon_r \;\stackrel{.}{=}\; \alpha_{rL}\, \epsilon_L, \qquad (3.4)$$
where $\alpha_{rL}$ is a random variable, independent of the EC, which describes how the change between these two scales takes place. The symbol $\stackrel{.}{=}$ indicates that both sides of the equation are distributed in the same way, but this does not necessarily imply the validity of the relation at every pixel of a given image. The random variables $\alpha_{rL}$ between all possible pairs of scales r and L are said to define a multiplicative process. Let us understand in a more intuitive way the meaning of this process. It is rather clear that, given the distribution of $\epsilon_L$ at a fixed large scale L, to compute the distribution of the EC $\epsilon_r$ it is enough to know the distribution of $\alpha_{rL}$. But let us now introduce an intermediate scale r′. One could also obtain first $\epsilon_{r'}$ using $\alpha_{r'L}$ and then $\epsilon_r$ by means of $\alpha_{rr'}$. Thus, the multiplicative process must verify the cascade relation

$$\alpha_{rL} \;=\; \alpha_{rr'}\, \alpha_{r'L}, \qquad (3.5)$$
which tells us that the variables $\alpha_{rL}$ are infinitely divisible. Infinitely divisible stochastic variables are then completely characterized by their behavior under infinitely small changes in the scale, that is, by $\alpha_{r,r+dr}$. The simplest example of an infinitely divisible random variable $\alpha_{rL}$ is the one having a binomial infinitesimal distribution:

$$\alpha_{r,r+dr} \;=\; \begin{cases} 1 - \Delta\,\frac{dr}{r} & \text{with probability } 1 - [d - D_\infty]\,\frac{dr}{r} \\[4pt] \beta\left(1 - \Delta\,\frac{dr}{r}\right) & \text{with probability } [d - D_\infty]\,\frac{dr}{r}. \end{cases} \qquad (3.6)$$
The form of this process is determined by its compatibility with equation 3.5 (see, e.g., She & Leveque, 1994; She & Waymire, 1995; and Castaing, 1996, for a detailed study). Moreover, there are only two free parameters: it is verified that $d - D_\infty = \Delta/(1-\beta)$. They also have to satisfy the bounds 0 < β < 1 and 0 < Δ < 1. The infinitesimal process (see equation 3.6) can be easily interpreted in terms of the EC and the measure density:

Probability $1 - [d - D_\infty]\,\frac{dr}{r}$: This is the most likely situation. In this case, the EC (and consequently the measure) changes smoothly under small changes in scale: $\alpha_{r,r+dr}$ is close to one, and its variation is proportional to $d\ln r$.
Structure of Contrast Changes
773
Probability $[d - D_\infty]\,\frac{dr}{r}$: In this unlikely case, under infinitesimal changes in scale, the EC undergoes finite variations, which is reflected in the fact that $\alpha_{r,r+dr}$ deviates from one by a factor β. The μ-density $|\nabla C|$ must then be divergent somewhere along the boundary of the ball of radius r, and the parameter β is a measure of how sharp this divergence is. The process for noninfinitesimal changes in scale can be easily derived from the infinitesimal process presented in equation 3.6. In this case the random variable $\alpha_{rL}$ follows a log-Poisson process (She & Waymire, 1995); its probability distribution $\rho_{\alpha_{rL}}(\alpha_{rL})$ has the form
$$\rho_{\alpha_{rL}}(\alpha_{rL}) \;=\; \left[\frac{r}{L}\right]^{d-D_\infty} \sum_{n=0}^{\infty} \frac{\left[(d-D_\infty)\ln\frac{L}{r}\right]^{n}}{n!}\; \delta\!\left(\alpha_{rL} - \beta^{n}\left(\frac{r}{L}\right)^{-\Delta}\right). \qquad (3.7)$$
If a multiplicative process is realized, the EC has the SS and ESS properties, equations 3.1 and 3.2. Thus, knowledge of the ESS exponents ρ(p, 2) and of τ₂ is enough to compute $\tau_p$ using equation 3.3. For the log-Poisson process, the ESS exponents are

$$\rho(p,2) \;=\; \frac{p(1-\beta) - (1-\beta^p)}{(1-\beta)^2}, \qquad (3.8)$$
which depend only on β. This process was proposed in She and Leveque (1994) for turbulent flows, so we will refer to it as either the She-Leveque (S-L) or the log-Poisson model. The SS exponents are

$$\tau_p \;=\; -\Delta\, p + (d - D_\infty)(1 - \beta^p), \qquad (3.9)$$
which compared with equation 3.3 gives the following set of relations:

$$\begin{cases} \Delta \;=\; -\dfrac{\tau_2}{1-\beta} \\[6pt] d - D_\infty \;=\; -\dfrac{\tau_2}{(1-\beta)^2} \;=\; \dfrac{\Delta}{1-\beta}. \end{cases} \qquad (3.10)$$
This means that the model has only two free parameters, which can be chosen to be τ₂ and β, or Δ and $D_\infty$, for instance. The geometrical interpretation of the last two is very interesting (Turiel et al., 1998) and will be explained in the next section. For the time being, let us notice that the maximum value of the EC, $\|\epsilon_r\|_\infty$, also follows a power law with exponent −Δ,

$$\|\epsilon_r\|_\infty \;=\; \alpha_\infty\, r^{-\Delta}, \qquad (3.11)$$
and thus the parameter Δ characterizes the most divergent behavior present in natural images.
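The log-Poisson process of equation 3.7 is easy to check by Monte Carlo. The sketch below (ours) draws the cascade factor $\alpha_{rL} = \beta^n (r/L)^{-\Delta}$ with n Poisson-distributed, using parameter values close to those measured later for natural images (Δ ≈ 0.5, $D_\infty$ ≈ 1, d = 2; these are assumptions here), and verifies both the moment prediction $\langle \alpha_{rL}^p \rangle = (r/L)^{\tau_p}$ of equation 3.9 and the cascade relation of equation 3.5:

```python
import numpy as np

rng = np.random.default_rng(1)

# illustrative parameters close to the values reported for natural images
beta, Delta, Dinf, d = 0.5, 0.5, 1.0, 2

def tau(p):
    """SS exponents of the log-Poisson (She-Leveque) model, equation 3.9."""
    return -Delta * p + (d - Dinf) * (1.0 - beta ** p)

def sample_alpha(r, L, size):
    """Draw alpha_rL = beta^n (r/L)^(-Delta) with n ~ Poisson((d - D_inf) ln(L/r)),
    which is exactly the distribution written in equation 3.7."""
    n = rng.poisson((d - Dinf) * np.log(L / r), size)
    return beta ** n * (r / L) ** (-Delta)

L, r_mid, r = 64.0, 16.0, 8.0
a_direct = sample_alpha(r, L, 1_000_000)
# cascade relation (3.5): alpha_rL is distributed as alpha_{r r'} alpha_{r' L}
a_cascade = sample_alpha(r, r_mid, 1_000_000) * sample_alpha(r_mid, L, 1_000_000)

for p in (1, 2, 3):
    predicted = (r / L) ** tau(p)        # <alpha_rL^p> from equation 3.9
    print(p, np.mean(a_direct ** p), np.mean(a_cascade ** p), predicted)
```

Note that $\tau_1 = 0$ with these relations, so $\langle\alpha_{rL}\rangle = 1$: the cascade preserves the first-order moment, consistent with the normalization of the EC.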
3.3 SS and Multifractality. Let us now analyze in more detail the geometrical meaning of the model and its relation with the singularities described in section 2. For a multifractal measure μ with singularity spectrum D(h), there is an important relation between the SS exponents $\tau_p$ and D(h). Let us denote by $\rho_h(h)$ the probability density of the distribution of the singularity exponents h in the image. Then, partitioning the image pixels according to their values of h, equation 1.6 can be expressed, for r small enough, as

$$\langle \epsilon_r^p \rangle \;\sim\; \int dh\, \rho_h(h)\, \langle \alpha^p \rangle_{F_h}\, r^{hp + d - D(h)}. \qquad (3.12)$$

The factor $r^{d - D(h)}$ comes from the probability of a randomly chosen pixel belonging to the fractal $F_h$ (see equation 1.8) and guarantees the correct normalization of $\langle \alpha^p \rangle_{F_h}$ (see Frisch, 1995; Arneodo, 1996, for details). For very small r, the application of the saddle-point method to equation 3.12 yields

$$\langle \epsilon_r^p \rangle \;=\; \alpha_p\, r^{\tau_p} \;\propto\; r^{\min_h \{ph + d - D(h)\}}. \qquad (3.13)$$
This gives a very interesting relation between $\tau_p$ and D(h): $\tau_p$ is the Legendre transform of D(h). The important point is that D(h) is easily expressed in terms of $\tau_p$ by means of another Legendre transform, namely:

$$D(h) \;=\; \min_p \{ph + d - \tau_p\}. \qquad (3.14)$$
If one has a model for the $\tau_p$'s (the S-L model, equation 3.9), the dimension spectrum can be predicted:

$$D(h) \;=\; D_\infty \;-\; \frac{h + \Delta}{\ln \beta}\left[\,1 - \ln\!\left(-\frac{h + \Delta}{(d - D_\infty)\ln \beta}\right)\right]. \qquad (3.15)$$
The spectrum of equation 3.15 is represented in Figure 8. There are natural cutoffs for the dimension spectrum; in particular, there cannot be any exponent h below the minimal value $h_\infty = -\Delta$. Thus, Δ defines the most singular of the fractal components, which has fractal dimension $D(h_\infty) = D_\infty$. We could use these properties of the most singular fractal (its dimension and its associated exponent) to characterize the whole dimension spectrum, which is in agreement with the arguments given in section 3.2.⁷

⁷ Remember that β can be expressed in terms of Δ and $D_\infty$: $\beta = 1 - \Delta/(d - D_\infty)$.

3.4 Multiaffinity. Related to multifractality there exists the simpler but more unstable property of multiaffinity (Benzi, Biferale, Crisanti, Paladin, Vergassola, & Vulpiani, 1993). This is a characterization of chaotic, irregular scalar functions that somewhat generalizes the concepts of continuity and differentiability.
First, let us give the concept of Hölder exponent: a scalar function $F(\vec{x})$ is said to be Hölder of exponent $h_F(\vec{x}_0)$ at a given point $\vec{x}_0$ if for any point $\vec{y}$ close enough to $\vec{x}_0$ the following inequality holds:

$$|F(\vec{y}) - F(\vec{x}_0)| \;<\; A_0\, |\vec{y} - \vec{x}_0|^{h_F(\vec{x}_0)}, \qquad (3.16)$$
$A_0$ being a constant depending on the point $\vec{x}_0$. We define the Hölder exponent of $F(\vec{x})$ at $\vec{x}_0$ as the maximum of the exponents $h_F(\vec{x}_0)$ verifying equation 3.16. Defining the linear increment (LI) of $F(\vec{x})$ by a displacement vector $\vec{r}$ as $\delta_{\vec{r}} F(\vec{x}) \equiv |F(\vec{x} + \vec{r}) - F(\vec{x})|$, if the function F has a Hölder exponent $h_F(\vec{x})$ at $\vec{x}$, then

$$\delta_{\vec{r}} F(\vec{x}) \;\sim\; \alpha_F(\vec{x})\, r^{h_F(\vec{x})}, \qquad (3.17)$$
which is similar to equation 1.6. In the same spirit, we say that a function is multiaffine if equation 3.17 is verified for every point $\vec{x}$. Multiaffinity is a more intuitive concept than multifractality, because the Hölder exponents are good characterizations of Taylor-like local expansions of the function $F(\vec{x})$. For instance, $h_F(\vec{x}) > 0$ means that the function is continuous at $\vec{x}$, $h_F(\vec{x}) > 1$ implies that the function has a continuous first derivative, and so on. It also works in the other sense: a function having exponent $h_F = -1$ behaves as badly as a δ-function (see the appendix for a brief tutorial on Hölder exponents of simple functions). Moreover, a function F has Hölder exponent $h_F$ at a given point if and only if its first-order derivatives have Hölder exponent $h_F - 1$ at the same point. Besides, multiaffinity implies multifractality in the following sense: given a measure μ, if its density $\frac{d\mu}{d\vec{x}}$ is a multiaffine function, then μ is a multifractal measure having exactly the same exponents as $\frac{d\mu}{d\vec{x}}$ at every point.⁸ For natural images, taking $F = |\nabla C|$, this fact can be expressed as

$$\epsilon_r \;\sim\; T_\Psi^r d\mu \;\sim\; \delta_{\vec{r}} |\nabla C|, \qquad (3.18)$$

in the sense that the three variables have the same statistical dependence on r. This relation can also be expressed in terms of the SS exponents. Let us denote by $\tau_p$ the SS exponents of $\epsilon_r$, by $\tau_p^\Psi$ those of $T_\Psi^r d\mu$, and by $\tau_p^{\nabla C}$ those of $\delta_{\vec{r}}|\nabla C|$. Then the statistical relation in equation 3.18 implies that

$$\tau_p \;=\; \tau_p^\Psi \;=\; \tau_p^{\nabla C}. \qquad (3.19)$$
⁸ Notice a shift of size d in the definition of the exponents of μ (see equation 1.5) that is not present in the definition of the Hölder exponents of a multiaffine function (see equation 3.16).

We are also interested in the relation between $\tau_p$ and the SS exponents $\tau_p^C$ of the moments of $\delta_{\vec{r}} C$. If C exhibits the multiaffine behavior shown in equation 3.17, recalling the meaning of the Hölder exponents in terms of differentiability, we have that $\delta_{\vec{r}}|\nabla C|(\vec{x})$ verifies:

$$\delta_{\vec{r}}|\nabla C|(\vec{x}) \;\sim\; \alpha_{|\nabla C|}(\vec{x})\, r^{h_C(\vec{x}) - 1}. \qquad (3.20)$$

This can be statistically interpreted as

$$\epsilon_r \;\sim\; \delta_{\vec{r}}|\nabla C| \;\sim\; \frac{1}{r}\, \delta_{\vec{r}} C, \qquad (3.21)$$

which reflects the shift by −1 in the exponents of the derivative of C.⁹ In terms of the SS exponents, this implies that

$$\tau_p^C \;=\; \tau_p^{\nabla C} + p \;=\; \tau_p + p. \qquad (3.22)$$
Let us remark that the singularity spectra associated with the variables $\delta_{\vec{r}} C$, $\delta_{\vec{r}}|\nabla C|$, and $T_\Psi^r d\mu$ are the Legendre transforms of the SS exponents $\tau_p^C$, $\tau_p^{\nabla C}$, and $\tau_p^\Psi$, respectively (a fact that can be derived in the same way as equation 3.14). There still remains the converse question: does multifractality imply multiaffinity? Unfortunately, this is not true (see Daubechies, 1992). In fact, the term $|F(\vec{y}) - F(\vec{x}_0)|$ in equation 3.16 suggests the existence of a pseudo-Taylor expansion, in the sense that

$$F(\vec{y}) \;\approx\; p_n(\vec{y} - \vec{x}_0) + A_0\, |\vec{y} - \vec{x}_0|^{h_F(\vec{x}_0)}, \qquad (3.23)$$

where $p_n(\vec{y} - \vec{x}_0)$ is a polynomial of degree n in the projection of the vector $\vec{y} - \vec{x}_0$ along its direction. We should thus generalize the concept of Hölder exponent: given a point $\vec{x}_0$, we will say that this point possesses a Hölder exponent $n < h_F(\vec{x}_0) < n + 1$ (n integer) if there exists a polynomial $p_n(\vec{x})$ of degree n and a constant $A_0$ such that

$$|F(\vec{y}) - p_n(\vec{y} - \vec{x}_0)| \;<\; A_0\, |\vec{y} - \vec{x}_0|^{h_F(\vec{x}_0)} \qquad (3.24)$$
and $h_F(\vec{x}_0)$ is the maximum of the exponents verifying such a relation. Multifractal measures with no multiaffine densities (as defined in equation 3.16) have been studied in Bacry, Muzy, & Arneodo (1993) and Arneodo (1996). The problem can be explained by the masking presence of the regular, polynomial part, which stands for global, nonlocalized effects in the signal field $F(\vec{y})$. To remove this part, those authors propose a generalization of the LI. Instead of LIs of the function, wavelet transforms of the appropriate one-dimensional restriction are considered, taking the scalar field as a real-valued measure density.

⁹ Let us note that the relation $r\,\epsilon_r \sim \delta_{\vec{r}} C$ is the analog of the Kolmogorov hypothesis of local similarity (Arneodo, 1996).

This wavelet transform of the function $F(\vec{x})$ is defined as

$$T_{\vec{\Psi}}^r F(\vec{x}_0) \;\equiv\; \int ds\, F\!\left(\vec{x}_0 + s\,\frac{\vec{r}}{r}\right) \Psi_r(s), \qquad (3.25)$$

where Ψ is a real, one-dimensional analyzing wavelet. Since the integral is one-dimensional, the wavelet at the scale r is $\Psi_r(s) \equiv \frac{1}{r}\Psi(s/r)$. If the scale r is small enough, the integral is dominated by the first terms in the pseudo-Taylor expansion of $F(\vec{x})$ around $\vec{x}_0$, equation 3.23:
$$T_{\vec{\Psi}}^r F(\vec{x}_0) \;\approx\; \int ds\, p_n(s)\, \Psi_r(s) \;+\; A_0 \int ds\, |s|^{h_F(\vec{x}_0)}\, \Psi_r(s). \qquad (3.26)$$
If the analyzing wavelet is now required to have at least its first n integer moments vanishing, the singular part appears as the main contribution to the wavelet transform. (For later reference, let us say now that a wavelet with a vanishing zeroth-order moment, that is, a null integral, is called an admissible wavelet.) Hence, changing variables $t = s/r$ and defining $A = A_0 \int dt\, |t|^{h_F(\vec{x}_0)}\, \Psi(t)$, one has

$$T_{\vec{\Psi}}^r F(\vec{x}_0) \;\approx\; A\, r^{h_F(\vec{x}_0)}, \qquad (3.27)$$
where all the dependence on r sits in the prefactor $r^{h_F(\vec{x}_0)}$. In this way, the Hölder exponent (in the sense of equation 3.24) can be obtained by a wavelet projection, provided the wavelet has zero moments up to a large enough order. Since $h_F(\vec{x}_0)$ is noninteger, this procedure cannot cancel the singular part even if the order n of the largest vanishing moment of Ψ is larger than $h_F(\vec{x}_0)$. On the other hand, the wavelet would not be able to detect singularity exponents $h_F > n + 1$. This is because a polynomial of order n + 1 could still be present and dominate the contribution of the singular term. For this reason, it is important to determine the right order of the required zero moments of the wavelet. This is, then, the appropriate scheme to detect the Hölder exponents of signals with a regular part. In cases in which the μ-density $|\nabla C|$ or its primitive C is not multiaffine, the concept of multiaffinity (see equation 3.16) should be applied to their wavelet projections. It is then necessary to determine how many zero moments the wavelet should have, because this provides information on the global aspects of the signal. This generalization is straightforward, replacing in all the cases $\delta_{\vec{r}} C$ by $T_{\vec{\Psi}}^r C$ and $\delta_{\vec{r}}|\nabla C|$ by $T_{\vec{\Psi}}^r |\nabla C|$. For instance, if the wavelet transforms of C and $|\nabla C|$ verify equation 3.27 for certain A and $h_F(\vec{x})$, the same arguments that led us to equations 3.21 and 3.22 can be used to derive that

$$\epsilon_r \;\sim\; T_{\vec{\Psi}}^r |\nabla C| \;\sim\; \frac{1}{r}\, T_{\vec{\Psi}}^r C, \qquad (3.28)$$
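The role of the vanishing moments in equations 3.25 to 3.27 can be demonstrated on a one-dimensional toy signal whose singular part is masked by a linear trend. In the sketch below (ours; the Mexican-hat wavelet is an illustrative choice with two vanishing moments, not necessarily the one used by the authors), the raw linear increments report h ≈ 1 because the polynomial dominates, while the wavelet projection recovers the true exponent 1.5:

```python
import numpy as np

def mexican_hat(t):
    # second derivative of a gaussian: its zeroth AND first moments vanish,
    # so it cancels polynomial trends up to degree 1 (cf. equation 3.26)
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def wavelet_coeff(F, x0, r, span=8.0, n=4001):
    # T^r F(x0) = int ds F(x0 + s) Psi_r(s), with Psi_r(s) = Psi(s/r)/r (eq. 3.25)
    s = np.linspace(-span * r, span * r, n)
    return np.sum(F(x0 + s) * mexican_hat(s / r) / r) * (s[1] - s[0])

def slope(xs, ys):
    return np.polyfit(np.log(xs), np.log(ys), 1)[0]

# singular term |x|^1.5 hidden under a linear trend; true Holder exponent 1.5
F = lambda x: 3.0 + 2.0 * x + np.abs(x) ** 1.5
radii = np.geomspace(1e-3, 1e-2, 10)

h_li = slope(radii, [abs(F(r) - F(0.0)) for r in radii])            # ~1.0, fooled
h_wt = slope(radii, [abs(wavelet_coeff(F, 0.0, r)) for r in radii]) # ~1.5
```

The increments see only the regular part; the two vanishing moments of the wavelet annihilate it, leaving the singular contribution as in equation 3.27.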
778
Antonio Turiel and N´estor Parga
Table 1: Images from van Hateren's Database.

imk00034.imc imk00211.imc imk00478.imc imk00586.imc imk00605.imc
imk00801.imc imk00808.imc imk00881.imc imk01017.imc imk01032.imc
imk01164.imc imk01406.imc imk02035.imc imk02603.imc imk02626.imc
imk02649.imc imk03134.imc imk03514.imc imk03536.imc imk03789.imc
imk03807.imc imk03842.imc imk03940.imc imk04042.imc imk04069.imc

Note: The images can be observed and downloaded from the following URL: http://hlab.phys.rug.nl/archive.html.
which is expressed in terms of the SS exponents $\tau_p^{\Psi C}$ (of $T_{\vec{\Psi}}^r C$) and $\tau_p^{\Psi \nabla C}$ (of $T_{\vec{\Psi}}^r |\nabla C|$) as:

$$\tau_p^{\Psi C} \;=\; \tau_p^{\Psi \nabla C} + p \;=\; \tau_p + p. \qquad (3.29)$$
Let us make a final remark: it is possible that the wavelet projections of a given scalar function F verify the scaling in equation 3.27 without verifying equation 3.24. This can happen, for instance, for functions with such an irregular behavior that the pseudo-Taylor expansion is not possible. In those cases, $T_{\vec{\Psi}}^r F$ can be used as a stronger generalization of the usual LI. As will be shown in section 4.4, this is the case for $|\nabla C|$.

4 Numerical Analysis of Images

4.1 Databases. We have used two different image data sets: one from Daniel Ruderman (1994) consisting of 45 black-and-white images of 256 × 256 pixels with up to 13 bits of luminosity depth (from now on, this set will be referred to as the first data set), and another containing 25 images taken from Hans van Hateren's database (see van Hateren & van der Schaaf, 1998), having 1536 × 1024 pixels and 16 bits of luminosity depth (we will refer to these images as the second data set). These 25 images are enumerated in Table 1. The first data set has been used for the statistical analysis of the one-dimensional wavelet transform, equation 3.25, while the second data set was used for the more demanding task (from the statistical point of view) of computing the moments of the two-dimensional variable in equation 1.1. No matter which ensemble is considered, however, the studied properties seem to behave in essentially the same way.

4.2 Multifractality of the Measure and Singularity Analysis. We first address the issue of the multifractality of the measure μ defined in equation 1.2. Our numerical analysis gives an affirmative answer to this question: the wavelet transform $T_\Psi^r d\mu(\vec{x}_0)$ is well fitted by $\sim r^{h(\vec{x}_0)}$ in about 98% of the pixels $\vec{x}_0$ in each image of the two data sets. The analysis was done using
several wavelets, although the most convenient family of one-parameter functions was $\Psi_\gamma(\vec{x}) = \frac{1}{(1 + |\vec{x}|^2)^\gamma}$ (γ ≥ 1). These are positive (nonadmissible) wavelets. The singularity exponent of the measure at a given pixel, $h(\vec{x})$, was obtained from a regression on equation 2.2. The value of the scale r was typically taken in the range (0.5, 2.0) in pixel units. Notice that for scales smaller than r ∼ 0.5, the method would detect an artificial edge that actually corresponds to the pixel boundaries, which corresponds to h = −1. Similarly, as the scale becomes large, the wavelet can discriminate only between very sharp edges, and for r of the order of the size of the image, it would yield the value h = 0 everywhere.

To obtain the fractal components $F_h$, the pixels of a given image were classified according to the value of their exponents. The exponents lie in the interval (−0.5, 1.4); only a negligible fraction of pixels have exponents less than −0.5 (about 0.8% of them), and none has a singularity below −1. Also, there are no exponents larger than 1.4. This range of possible exponent values coincides exactly with that predicted in Turiel et al. (1998) and is in agreement with the model explained in section 3.3. The most singular component is the one with h = −0.5, defined within a window of size Δh. The visualization of this set is very instructive (see Figure 1). Clearly, the points appear to be associated with the contours of the objects present in the image. This is in agreement with our expectation that the behavior of the measure at these edges should be singular, and as much as possible. But there is something rather surprising: for a sharp, theta-like contrast, with independent values at both sides of the discontinuity, one would have expected an exponent −1, which appears as a Dirac δ-function in the density measure $\nabla C$ (see example number 4 in the appendix).
Our interpretation of the minimal value of h, $h_\infty = -0.5$, is that it reflects the existence of correlations among different fractal components of the images, which smooth the singularities (in particular, those of the sharpest edges). It is also interesting to observe the fractal components with exponents just above the most singular one. Figure 2 shows the next two components, defined with Δh = 0.05. A comparison with Figure 1 shows that pixels in these components are close to pixels in the most singular one. It is also rather clear that the probability that a random ball of size r contains a pixel from the component $F_h$ increases with h, at least for small values of this exponent. Since this probability scales as shown in equation 1.8, this behavior reveals an increase with h of the dimensionality of the fractal components. It is also observed (although not shown in the figures) that the dimensionality starts decreasing beyond h ≈ 0.2. This is in agreement with the arguments of section 3.3: the singularity spectrum D(h) of the S-L model, shown in Figure 8, reflects this behavior of the fractal components.

The analysis done in this subsection justifies the arguments presented in section 1. One cannot consider the sharpest edges of an image independent
Figure 1: Image from Ruderman's ensemble (Ruderman, 1994) and its most singular component (taken as the set of points having a singularity exponent in the interval h = −0.5 ± 0.05).
Figure 2: Fractal components with exponents h = −0.4 ± 0.05 (above) and h = −0.3 ± 0.05 (below) for the same image represented in Figure 1. Notice the spatial correlations between components with similar singularities.
of the other texture structures (the other fractal components). Luminosity jumps from one roughly constant value to another (statistically independent) constant value are not typical in natural images. The sharpest edges appear correlated with smoother textures, an effect that tends to decrease the strength of their singularity exponents, reaching the value h ∼ −0.5. Standard edge-detection techniques (Gonzalez & Woods, 1992) have not been designed to search for singularities of this strength. Good edge-detection algorithms should be specifically constructed to capture singularities close to $h_\infty$.
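The decomposition into fractal components described in this subsection can be sketched programmatically. The snippet below is a crude stand-in for the authors' $\Psi_\gamma$ analysis (it uses plain gaussian windows, and all function names are ours): it assigns each pixel an exponent by log-log regression and then selects the component within a window h₀ ± Δh:

```python
import numpy as np

def local_exponents(contrast, scales=(1.0, 1.5, 2.0)):
    """Crude per-pixel singularity exponents: log-log slope of gaussian-window
    averages of |grad C| versus the scale r, standing in for the regression on
    equation 2.2 (this is not the authors' Psi_gamma analysis)."""
    gy, gx = np.gradient(contrast.astype(float))
    dens = np.hypot(gx, gy)                       # mu-density |grad C|
    proj = []
    for r in scales:
        half = int(np.ceil(3 * r))
        g = np.exp(-np.arange(-half, half + 1) ** 2 / (2.0 * r ** 2))
        g /= g.sum()
        # separable gaussian smoothing with reflect padding
        p = np.pad(dens, half, mode="reflect")
        rows = np.array([np.convolve(row, g, mode="valid") for row in p])
        both = np.array([np.convolve(col, g, mode="valid") for col in rows.T]).T
        proj.append(np.log(both + 1e-12))
    proj = np.stack(proj)
    lr = np.log(np.asarray(scales, float))
    num = ((lr - lr.mean())[:, None, None] * (proj - proj.mean(axis=0))).sum(0)
    return num / ((lr - lr.mean()) ** 2).sum()

def fractal_component(h, h0, dh=0.05):
    """Pixels whose singularity exponent lies in the window h0 +/- dh."""
    return np.abs(h - h0) <= dh
```

On a synthetic image with a sharp vertical edge over a smooth ramp, the most singular component (the window around the minimum exponent) picks out the edge pixels, while smooth regions fall in components near h = 0.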
4.3 Statistical Analysis of the Measure. We computed the p-moments of three variables related to the measure:

Bidimensional edge content ($\epsilon_r$). The definition is given in equation 1.1. This analysis was done using the 25 images of the second data set. Because the variable itself is positive, we computed directly the quantities $\langle \epsilon_r^p \rangle$.

Two one-dimensional restrictions of the wavelet projection of the measure¹⁰ ($T_\Psi^r d\mu$). We considered horizontal (h) and vertical (v) restrictions of the measure dμ, that is, $d\mu_h = dx\,\left|\frac{\partial \mu}{\partial x}\right|$ and $d\mu_v = dy\,\left|\frac{\partial \mu}{\partial y}\right|$. The appropriate wavelet transforms are denoted by $T_\Psi^r d\mu_l$ (l = h, v). Because these coefficients need not be positive (in general Ψ is not positive), we computed the moments of their absolute values, $\langle |T_\Psi^r d\mu_l|^p \rangle$.
It is remarkable that the bidimensional EC and the two one-dimensional wavelet projections exhibit the scaling properties of SS (see equation 3.1) and ESS (see equation 3.2) (the ESS tests for the three variables are presented in Figures 3 and 4). This was also observed in Turiel et al. (1998), although for somewhat different variables: the one-dimensional horizontal and vertical restrictions of the local edge variance. Surprisingly, no matter the particular variable, the exponents $\tau_p$ and ρ(p, 2) obtained are very similar (see Figure 5 and those in Turiel et al., 1998). Therefore, they should refer to essential, robust aspects of images. The S-L model fits the experimental exponents well (see Figure 5). It seems, then, that the multifractal structure underlying the statistical description could be described with the log-Poisson process, which means that under infinitesimal changes in the scale, there are only two possible different types of transformations.
¹⁰ Since the calculations involving $T_\Psi^r d\mu$ are highly computer-time-consuming, we did not consider the bidimensional wavelet transform. The fast Fourier transform cannot be safely used due to aliasing, which heavily changes the tails (rare events) of the distribution.
Figure 3: Test of ESS for the bidimensional edge content $\epsilon_r$. The graphs represent the logarithm of its moments of order 3, 5, and 7 versus the logarithm of the second-order moment, for distances r = 4 to r = 64 pixels, that is, about an order of magnitude smaller than the image size. This upper bound is needed both to consider values of r small enough compared to the size of the images and to be able to use moments of order as high as p = 7. The data correspond to the second image ensemble. According to equation 3.2, these graphs should be straight lines; the best linear fits are also represented. Except for lower cutoff effects at very small distances, the scaling law is well verified.
It is remarkable that for our data sets, Δ ≈ 0.5 and $D_\infty$ ≈ 1. This means that the most singular component is a collection of lines (the sharp edges present in the image), which can be characterized by their common exponent −Δ = −0.5. But this is precisely the smallest value we observed previously in our wavelet analysis. The statistical approach confirms the result obtained by the local singularity analysis. The nonlinear dependence of ρ(p, 2) on p again indicates that the EM is multifractal. This has to be contrasted with the results of Ruderman and
Figure 4: Test of ESS for (a) $T_\Psi^r d\mu_h$ and (b) $T_\Psi^r d\mu_v$. The graphs represent the logarithm of the moments of order 3, 5, and 7 for both variables versus the logarithm of the corresponding second-order moment. The scales range from r = 4 to r = 64 pixels. The first data set was used in this computation. The corresponding best linear fits are also represented. Except for lower and upper cutoff effects at small and large distances, the scaling law is well verified. Although the test is not shown here, SS also holds for these variables.
Bialek (1994) and Ruderman (1994), who find that the log-contrast distribution follows a simple classical scaling.¹¹ This scaling also holds for any derived variable, such as the gradient (see Ruderman, 1994, for details). This particular scaling is related, in that context, to general properties of scale invariance of natural images: the averaging of the contrast does not keep track of the details (e.g., the edges completely contained in the averaging block) and produces a statistically equivalent image. The EC and the wavelet projections, being local averages of a gradient, are able to keep the information about the more complex underlying multifractal structure.
¹¹ Other luminosity analyses performed by D. Ruderman provided some evidence of multiscaling behavior (private communication).
Structure of Contrast Changes
785
Figure 5: ESS exponents ρ(p, 2) for (a) the bidimensional edge content ε_r, (b) the horizontal wavelet transform T_Ψ^r dμ_h, and (c) the vertical one, T_Ψ^r dμ_v. The convolution function Ψ was taken to be a gaussian function. Each value of ρ(p, 2) was obtained by linear regression of the logarithm of the pth moment versus the logarithm of the second moment, for r between 8 and 32. The solid line represents the fit with the log-Poisson process. The best fit is obtained with (a) β = 0.52 ± 0.05, (b) β = 0.55 ± 0.06, and (c) β = 0.5 ± 0.07. The SS parameters τ₂ were also calculated; they turned out to be (a) τ₂ = −0.26 ± 0.06, (b) τ₂ = −0.25 ± 0.06, and (c) τ₂ = −0.25 ± 0.08.
The same authors have also designed an iterative procedure that tends to decompose the image into a gaussian-like piece (local fluctuations of the log-contrast) and an image-like piece (local variances). Such a decomposition would suggest the existence of a multiplicative process for the log-contrast itself, where the random variable relating the image at two different levels of the hierarchy is given by the local fluctuations. A closer look at the iterative procedure indicates that this is not the case: independence between the two factors of the process is not guaranteed, and the process is not infinitely divisible.

4.4 Multiaffinity Properties of C and |∇C|. In this section we study the scaling properties of the LIs of the contrast C(x) and the measure density |∇C|(x). Our aim is to check whether C and |∇C| satisfy the scaling defined in equation 3.17. Recalling the connection (the analog of equation 3.12 for these two quantities) between local scaling (described by the singularity exponents) and statistical scaling (characterized by the SS exponents), one concludes that the validity of SS implies that multiaffinity holds. We have done a statistical analysis of the LIs of the contrast and the measure density. We first computed the p-moments of the LIs of both variables, looking for SS and ESS scalings. Let us note that there is no a priori argument to expect that δ_r C(x) shows SS. On the other hand, since the measure μ is multifractal (as was shown in section 4.2), it would be reasonable to expect
Figure 6: Test of ESS for (a) the horizontal linear increment of the contrast, δ_{ri} C, and (b) the horizontal wavelet coefficient |T_Ψ^r C_h| (Ψ was taken as the second derivative of a gaussian). The first data set was used, and the range of scales was from r = 4 to r = 64. ESS is not present for δ_{ri} C, but it is clearly observed for the wavelet transform (notice the presence of the cutoffs due to numerical effects). Note that the scale of the axes in (a) and (b) is different.
that |∇C|(x) has a multiaffine behavior. However, this has to be verified explicitly, because the multifractality of the measure does not guarantee the multiaffinity of its density, as was discussed in section 3.4. In fact, our numerical analysis showed that neither C(x) nor |∇C|(x) has SS. The same is true for ESS, as shown in Figures 6a and 7, where the moments exhibit a rather erratic behavior. It seems that the failure of SS for C(x) and for |∇C|(x) has to be explained in different terms. In the case of |∇C|(x), this quantity is the measure density and, as stated in equation 2.2, its convolution with a nonadmissible (i.e., positive) wavelet has the same singularities as the measure, equation 1.2. This implies that the lack of SS cannot be explained in terms of a polynomial part as described in equation 3.23. In fact, a positive wavelet cannot alter the contribution from this polynomial, while one observes that
Figure 7: Test of ESS for (a) the horizontal linear increment of |∇C|, δ_{ri}|∇C|, and (b) the vertical linear increment of the same function, δ_{rj}|∇C|. The first data set was used, and the scales were taken ranging from r = 4 to r = 64 pixels. Again, these variables do not have ESS. The corresponding wavelet projections are shown in Figure 4, where one can see that ESS holds.
the data for δ_r|∇C|(x) (see Figure 7) are different from those of T_Ψ^r dμ_h and T_Ψ^r dμ_v (see Figure 4). On the other hand, it is plausible that the failure of SS of δ_r C(x) can be attributed to the presence of a polynomial part in C(x). In contrast to what happened with |∇C|(x), when the contrast is convolved with a nonadmissible wavelet, SS and ESS are still not present (that is, one observes erratic behavior similar to that in Figure 6a). The same behavior is observed when C(x) is convolved with an admissible wavelet with a zero mean. The situation changes drastically when a wavelet with vanishing zero- and first-
Figure 8: Singularity spectrum of the log-Poisson model for β = 0.5 and τ₂ = −0.25. For these values D_∞ = 1.0 and Δ = 0.5 (see text).
order moments is used. The result obtained using the second derivative of a gaussian is presented in Figure 6b, which clearly shows ESS (it also exhibits SS). This can be understood by recalling that the possible values of the Hölder exponent of C(x) are contained in the interval (0.5, 2.4) (see equation 3.20 and section 4.2). A nonadmissible wavelet truncates the detected singularities at the value h = 0; an admissible one truncates them at h = 1. The third wavelet considered, with two vanishing moments, truncates the singularities only from h = 2, and so it covers almost the whole interval. (In theory a wavelet with another vanishing moment would be more appropriate; however, in practice it is numerically more unstable because the resolution of the wavelet is smaller.) Finally, we also checked that the experimental values of τ_p, τ_p^{Ψ,C}, and τ_p^{Ψ,∇C} satisfy the relation given in equation 3.29.

5 Discussion
We have shown that in natural scenes, pixels are arranged in nonoverlapping classes, the fractal components of the image, characterized by the behavior of the contrast gradient under changes in the scale. In each of these classes this quantity exhibits a distinct power-law behavior under scale transformations, which indicates that there is no well-defined scale for the isolated components, but at the same time the scene is not globally scale invariant. This result implies that a given image is composed of many different spatial structures of the contrast associated with the fractal components, which can be arranged in a hierarchical way. In fact, how sharp or soft a
change in contrast is at a given point can be quantified in terms of the value of the scaling exponent at that site. The smaller the exponent, the sharper the local change in contrast. More precisely, we have dealt with two basic quantities: the contrast C(x) and the edge measure μ, whose density is the modulus of the contrast gradient, |∇C|(x) (see equation 1.2). Closely related to the measure, we can define the edge content of a region A of the image (see equation 1.3), which quantifies how much the contrast gradient deviates from being uniformly distributed inside the region A. This is a suitable definition to study the local behavior of the contrast gradient. As the size of A is reduced, some contributions of the contrast gradient to the edge content are left outside this region. Under an infinitesimal change in the scale, these contributions are often small, but sometimes a small change in the scale produces a large change in the edge content, due to singularities in the contrast gradient. As a consequence, the edge content (and the measure) exhibits large spatial fluctuations. The edge content can have different singularities at neighboring points, characterized by a power-law behavior with exponents h(x) that depend on the site (see equation 1.6). This means that the measure is multifractal, as has been explicitly verified in this work. The set of pixels with the same value of h(x) defines a particular spatial structure of contrast: the fractal component F_h. The smaller h(x) is, the sharper the contrast gradient at the pixel x. Its minimal value was, for our data set, h_∞ = −0.5. Another question refers to the fractal dimension D_∞ of this component. One finds D_∞ = 1 (Turiel et al., 1998), and when this component is extracted from the image, one checks that it corresponds to the boundaries of the objects in the scene. The associated contrast structure is made of the sharpest edges of the image.
The less singular components represent softer contrast structures inside the objects. It is also observed that there are strong spatial correlations between fractal components with similar exponents. We have also shown how to extract the fractal components from the image. To do this, one needs a suitable technique to compute the scale behavior at a given pixel. A wavelet projection (Mallat & Huang, 1992; Arneodo, 1996) seems to be the correct way to perform this analysis, since it provides a way of interpolating to scale values that are not an integer number of pixels. Our approach differs from the one followed in those references in several aspects, but in particular in that we have been interested here in the characterization of the whole set of fractal components. We remark on the important fact that the multifractal can be explained by a model with only two parameters (e.g., β and Δ). This had been noticed before in terms of a variable that integrates the square of the derivative of the contrast along a segment of size r, both taken in either the horizontal or the vertical direction (Turiel et al., 1998). This quantity admits the interpretation of a local linear-edge variance. We have seen that the same model is valid for many other quantities, some of them more general and natural.
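The claim that two parameters fix the whole spectrum can be sketched numerically. The snippet below assumes the standard log-Poisson form of the moment exponents, τ_p = −pΔ + (2 − D_∞)(1 − β^p); this particular parameterization is our assumption, as the present excerpt does not write it out. With β = 0.5, Δ = 0.5, and D_∞ = 1 it reproduces the value τ₂ = −0.25 quoted in Figure 8:

```python
# Sketch of the two-parameter log-Poisson model for the edge measure.
# Assumed parameterization (not stated explicitly in this excerpt):
#   tau_p = -p * Delta + (2 - D_inf) * (1 - beta**p),
# where Delta = -h_inf is the exponent of the most singular component
# and D_inf its fractal dimension (ambient dimension 2).

def tau(p, beta=0.5, delta=0.5, d_inf=1.0):
    """Moment exponent tau_p of the log-Poisson model."""
    return -p * delta + (2.0 - d_inf) * (1.0 - beta ** p)

def rho(p, beta=0.5, delta=0.5, d_inf=1.0):
    """ESS exponent rho(p, 2) = tau_p / tau_2."""
    return tau(p, beta, delta, d_inf) / tau(2, beta, delta, d_inf)

if __name__ == "__main__":
    print(tau(1))   # 0.0   (measure normalization)
    print(tau(2))   # -0.25 (matches Figure 8)
    print(rho(3))   # 2.5   (nonlinear in p => multifractality)
```

Note that the quoted fit values are mutually consistent under this form: τ₂ = −2Δ + (2 − D_∞)(1 − β²) = −0.25 exactly for the parameters above.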
The basic idea of the model is that the contrast gradient has singularities that, under an infinitesimal change in the size of A, produce a substantial change in the edge content inside that region. The simplest stochastic process to assign a value to this change is a binomial distribution, equation 3.6; its most likely value is very small, but with small probability this change is given by the parameter β, which for our data set is about 0.5. The other parameter of the model, Δ, measures the strength of the singularity. The geometrical locus of these singularities is the most singular fractal component. For a finite scale transformation, this process becomes log-Poisson. The model appeared before in other problems (She & Leveque, 1994). In a sense, the most singular component conveys a lot of information about the whole image. The two model parameters can be obtained by analyzing properties of this particular class of pixels: its dimension (D_∞) and the scale exponent of the edge content (Δ = −h_∞). In turn, these two parameters completely define the whole dimension spectrum in the log-Poisson model. It is then very plausible that the other components have a great deal of redundancy with respect to the most singular one. This hierarchical representation of spatial structure can be used to predict specific feature detectors in the visual pathways. We conjecture that the most singular component contains highly informative pixels of the images that are responsible for the epigenetic development leading to the adaptation of receptive fields. Learning of the statistical regularities (Barlow, 1961) present in this portion of visual scenes (using, for instance, the algorithm in Bell & Sejnowski, 1997) would give relevant predictions about the structure of receptive fields of cells in the early visual system.

Appendix A: Characterization of the Singularities of Simple Functions
It is instructive to compute the singularity exponents defined in equation 3.16 for several simple functions.

Continuous functions. If a function f(x) is continuous at a given point x, the value of f at y should be rather close to f(x) provided y is close to x; then |f(y)| < |f(x)| + ε, so |f(x) − f(y)| < 2|f(x)| + ε ≡ A. That is, f is Hölder of exponent 0 at x.

Smooth functions with continuous, nonvanishing first derivative. Using the Taylor expansion, f(x) − f(y) = ∇f(y₀) · (x − y), with y₀ a point between x and y. Let us call A = max |∇f(y₀)| > 0, so |f(x) − f(y)| < A|x − y|. That is, f is Hölder of exponent 1 at x.

Analogously, any function f having n − 1 vanishing derivatives at a point x and a nonvanishing continuous nth derivative is Hölder of exponent n at x.
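These exponents can also be checked numerically: regressing log |f(x₀) − f(x₀ + r)| against log r over a few small scales recovers the Hölder exponent at x₀. A minimal sketch (the test functions and the scales are illustrative choices of ours, not from the text):

```python
import math

def holder_slope(f, x0=0.0, scales=(1e-4, 1e-3, 1e-2)):
    """Estimate the Holder exponent of f at x0 by least-squares
    regression of log|f(x0) - f(x0 + r)| on log r."""
    xs = [math.log(r) for r in scales]
    ys = [math.log(abs(f(x0) - f(x0 + r))) for r in scales]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

if __name__ == "__main__":
    # A cusp |x|^(1/2) has exponent 0.5 at the origin; a smooth function
    # with nonvanishing first derivative has exponent 1, as derived above.
    print(holder_slope(lambda x: math.sqrt(abs(x))))  # 0.5
    print(holder_slope(lambda x: 1.0 + x))            # 1.0
```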
θ-function. The Heaviside function θ(x) is given by

θ(x) = 0 for x < 0, 1 for x > 0, and undefined at x = 0.

Since for x ≠ 0 the increments are zero (for a small enough displacement), it is Hölder of any order there. For x = 0 we make use of the property that f is Hölder of exponent h at a given point if and only if its primitive F is Hölder of exponent h + 1 at the same point. One possible primitive F(x) of θ(x) is

F(x) = 0 for x < 0, x for x ≥ 0,

which is a continuous function. The increments |F(0) − F(r)| = |F(r)| = F(r) are thus immediate, and obviously F(r) ≤ |r|. Moreover, it is also clear that the maximal h such that |F(0) − F(r)| < A|r|^h is h = 1. Thus, F has Hölder exponent 1 at x = 0, and therefore θ(x) has Hölder exponent 0 at x = 0.
δ-function. Dirac's δ-function is in fact a distribution. Nevertheless, it is possible to define the Hölder exponent of distributions, using the fact that any distribution can be expressed as a finite-order derivative of a bounded function. So if the nth-order primitive of a distribution is a Hölder function of exponent h at a given point, the distribution is Hölder of exponent h − n at the same point. Since the δ(x) distribution vanishes for x ≠ 0, the only nontrivial point is x = 0. Since δ is the derivative of θ(x), and θ has Hölder exponent 0 at x = 0, one concludes that δ(x) has Hölder exponent −1 at x = 0.

Acknowledgments
We thank Jean-Pierre Nadal, Germán Mato, and Dan Ruderman for useful comments and Ángel Nevado for many fruitful discussions during the preparation of this work. We are grateful to Dan Ruderman and Hans van Hateren, some of whose images we used in the statistical analysis. A.T. is financially supported by an FPI grant from the Comunidad Autónoma de Madrid, Spain. This work was funded by a Spanish grant, PB96-0047.

References

Arneodo, A. (1996). Wavelet analysis of fractals: From the mathematical concepts to experimental reality. In G. Erlebacher, M. Y. Hussaini, & L. Jameson (Eds.), Wavelets: Theory and applications. Oxford: Oxford University Press.
Arneodo, A., Argoul, F., Bacry, M., Elezgaray, J., & Muzy, J. F. (1995). Ondelettes, multifractales et turbulence. Paris: Diderot Editeur.
Atick, J. J. (1992). Could information theory provide an ecological theory of sensory processing? Network: Comput. Neural Syst., 3, 213–251.
Bacry, E., Muzy, J. F., & Arneodo, A. (1993). Singularity spectrum of fractal signals from wavelet analysis: Exact results. J. Stat. Phys., 70, 635–673.
Barlow, H. B. (1961). Possible principles underlying the transformation of sensory messages. In W. Rosenblith (Ed.), Sensory communication. Cambridge, MA: MIT Press.
Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vision Research, 37, 3327–3338.
Benzi, R., Biferale, L., Crisanti, A., Paladin, G., Vergassola, M., & Vulpiani, A. (1993). A random process for the construction of multiaffine fields. Physica D, 65, 352–358.
Benzi, R., Ciliberto, S., Baudet, C., Chavarria, G. R., & Tripiccione, C. (1993). Extended self-similarity in the dissipation range of fully developed turbulence. Europhysics Letters, 24, 275–279.
Benzi, R., Ciliberto, S., & Chavarria, C. B. G. R. (1995). On the scaling of three dimensional homogeneous and isotropic turbulence. Physica D, 80, 385–398.
Benzi, R., Ciliberto, S., Tripiccione, C., Baudet, C., Massaioli, F., & Succi, S. (1993). Extended self-similarity in turbulent flows. Physical Review E, 48, R29–R32.
Castaing, B. (1996). The temperature of turbulent flows. Journal de Physique II, 6, 105.
Daubechies, I. (1992). Ten lectures on wavelets. Montpelier, VT: Capital City Press.
Falconer, K. (1990). Fractal geometry: Mathematical foundations and applications. Chichester: Wiley.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am., 4, 2379–2394.
Frisch, U. (1995). Turbulence. Cambridge: Cambridge University Press.
Gonzalez, R. C., & Woods, R. E. (1992). Digital image processing. Reading, MA: Addison-Wesley.
Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21, 105.
Mallat, S., & Huang, W. L. (1992).
Singularity detection and processing with wavelets. IEEE Trans. Inf. Theory, 38, 617–643.
Mallat, S., & Zhong, S. (1991). Wavelet transform maxima and multiscale edges. In M. B. Ruskai et al. (Eds.), Wavelets and their applications. Boston: Jones and Bartlett.
Novikov, E. A. (1994). Infinitely divisible distributions in turbulence. Physical Review E, 50, R3303.
Ruderman, D. L. (1994). The statistics of natural images. Network, 5, 517–548.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Physical Review Letters, 73, 814.
She, Z. S., & Leveque, E. (1994). Universal scaling laws in fully developed turbulence. Physical Review Letters, 72, 336–339.
She, Z. S., & Waymire, E. (1995). Quantized energy cascade and log-Poisson statistics in fully developed turbulence. Physical Review Letters, 74, 262–265.
Turiel, A., Mato, G., Parga, N., & Nadal, J. P. (1998). The self-similarity properties of natural images resemble those of turbulent flows. Physical Review Letters, 80, 1098–1101.
van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive fields of fly LMCs, and experimental validation. J. Comp. Physiology A, 171, 157–170.
van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. R. Soc. Lond., B265, 359–366.

Received August 11, 1998; accepted May 5, 1999.
NOTE
Communicated by Naftali Tishby
Exponential or Polynomial Learning Curves? Case-Based Studies

Hanzhong Gu
Haruhisa Takahashi
Department of Information and Communication Engineering, University of Electro-Communications, Chofugaoka, Chofu, Tokyo 182, Japan

Learning curves exhibit a diversity of behaviors, such as phase transitions. However, the understanding of learning curves is still extremely limited, and existing theories can give the impression that without empirical studies (e.g., cross-validation), one can probably do nothing more than qualitative interpretation. In this note, we propose a theory of learning curves based on the idea of reducing learning problems to hypothesis-testing ones. This theory provides a simple approach that is potentially useful for predicting and interpreting (a diversity of) learning curve behaviors qualitatively and quantitatively; it applies to finite training sample sizes and finite learning machines and to learning situations not necessarily within the Bayesian framework. We illustrate the results by examining some exponential learning curve behaviors observed in the experiment of Cohn and Tesauro (1992).

1 Introduction
A fundamental problem in neural network learning theory is how average generalization performance depends on the training sample size m; this dependence is called the learning curve. According to the experimental results of Cohn and Tesauro (1992), learning curves in average-case learning processes (from backpropagation experiments) can exhibit different behaviors that correspond to different learning problems. Specifically, in the "majority" and the "majority-XOR" problems of binary inputs, the learning curves roughly exhibit exponential behaviors for moderately large sample size. The exponential learning curve behaviors may be covered by the theoretical results from statistical mechanics (Tishby, Levin, & Solla, 1989; Schwartz, Samalam, Solla, & Denker, 1990; Haussler, Seung, Kearns, & Tishby, 1994; for a survey, see Watkin, Rau, & Biehl, 1993), which suggest that if the spectrum of possible generalizations has a "gap" near perfect performance, then the generalization error will approach zero exponentially in the number of examples. One might think that since the inputs are binary, the generalization is necessarily quantized, and this quantization provides the gap that

Neural Computation 12, 795–809 (2000)
© 2000 Massachusetts Institute of Technology
explains the results. However, Cohn and Tesauro did not find evidence of the gap in their experiment. Thus, it remains interesting to see which mechanism accounts for the exponential behaviors. In this note, we try to bridge this gap in a learning setting discussed in Gu and Takahashi (1996, 1997), based on a conceptual lower-quality learning algorithm, the so-called ill-disposed learning algorithm, and the Boolean interpolation dimension (Takahashi & Gu, 1998). One of the main ideas is to relate a learning problem to a hypothesis-testing one, so that a learning problem with the ill-disposed learning algorithm can be equivalently reduced to a simple one for which the evaluation of generalization performance is usually much simpler. This note extends the result of Gu and Takahashi (1997) so as to examine the diversity of learning curve behaviors. As an example, we apply the theory to examine again the learning curves of the majority problem.

2 Learning Model
We begin by briefly describing the basic learning setting. (For details, see Gu & Takahashi, 1997, and Takahashi & Gu, 1998.) In particular, we assume that the learning machine implements a class of {0, 1}-valued functions {f(x, w): w ∈ W} on an input space X, and the target function to be learned is chosen from this class. Here, W ⊆ R^p is called the configuration space. We refer to any f(x, w) or w as a hypothesis and to the boundary of the region {x ∈ X: f(x, w) = 1} as the decision boundary of w. Consider the problem of inferring a function f(x, w) by a learning algorithm from m random training examples (x₁, y₁), …, (x_m, y_m). Each example consists of an instance (input vector) x ∈ X and a {0, 1}-valued output y. If y_i = 1 (y_i = 0), then (x_i, y_i) is called a positive (negative) example. We assume that the instances x₁, …, x_m are generated independently according to a fixed distribution P(x) over X and that the corresponding outputs y₁, …, y_m are given according to a fixed unknown target function f(x, w_t), w_t ∈ W. Let x^m = (x₁, …, x_m) denote the instances; x^m is called a sample of size m. A hypothesis w is said to be consistent with x^m if f(x_i, w) = y_i for i = 1, …, m. A learning algorithm is called a consistent learning algorithm if, for any sample, it outputs a hypothesis consistent with the given sample. For any f(x, w), the generalization error of w is the probability that w misclassifies a randomly drawn example: R(w) = ∫_X (y − f(x, w))² dP(x). Let w be output by a learning algorithm, and let E{R(w)} denote the expectation of R(w) over training samples of size m. Then E{R(w)} is called the learning curve of the algorithm. Now let w₁ ∼ w₂ ⇔ f(·, w₁) = f(·, w₂) be an equivalence relation on W. Let W* be a set of representatives of the equivalence classes. Then, generally, W* is a space whose dimensionality is not greater than W's.
In what follows, we transfer the term hypothesis to W*, and, with some abuse of notation, use f(x, w) or w, w ∈ W*, to denote a hypothesis.
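A concrete toy instance of this setting (our own illustration, not an example from the note): threshold functions f(x, w) = 1{x ≥ w} on X = [0, 1] with P(x) uniform. For this class the generalization error is simply R(w) = |w − w_t|, and a learner that places its decision boundary on a training instance, in the spirit of the ill-disposed algorithm with d = 1, already has expected error of order 1/(m + 1), consistent with the d/(m + d) bound stated below:

```python
import random

def run_trial(m, rng):
    """One learning experiment: threshold target on [0, 1], uniform P(x).
    The learner puts its boundary at a training instance whose labeling is
    consistent with the remaining m - 1 examples (d = 1 toy version of an
    ill-disposed hypothesis)."""
    w_t = rng.random()                         # unknown target threshold
    xs = sorted(rng.random() for _ in range(m))
    neg = [x for x in xs if x < w_t]           # negative instances
    pos = [x for x in xs if x >= w_t]          # positive instances
    # candidate boundaries: largest negative and smallest positive instance
    cands = ([neg[-1]] if neg else []) + ([pos[0]] if pos else [])
    w = rng.choice(cands)
    return abs(w - w_t)                        # R(w) for uniform P(x)

def mean_error(m, trials=20000, seed=0):
    rng = random.Random(seed)
    return sum(run_trial(m, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    m = 50
    est = mean_error(m)
    bound = 1.2 / (m + 1)      # (1 + o) * d / (m + d), d = 1, o = 0.2 slack
    print(est, bound, est <= bound)
```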
We say that a decision boundary is uniquely specified by a set of m instances x₁, …, x_m if it is the common decision boundary of all possible hypotheses w ∈ W* whose decision boundaries contain all x_i, i = 1, …, m. Then the Boolean interpolation dimension d of the function class is the smallest integer m such that a set of m distinct instances in X uniquely specifies a decision boundary with the property that there are at most two hypotheses w₁, w₂ ∈ W* having this decision boundary in common and satisfying f(x, w₁) = 1 − f(x, w₂) for all permissible x ∈ X outside the decision boundary.

Now we introduce the ill-disposed learning algorithm, which behaves worse than ordinary algorithms and thus can be applied to bound their generalization performance (Takahashi & Gu, 1998). Let x^m be a training sample. We call a hypothesis w_ill an ill-disposed hypothesis if its decision boundary contains d instances in x^m and it is consistent with the other m − d training instances. Here the output values of the target function on the instances lying on the decision boundary of w_ill can be either 0 or 1; that is, w_ill is not necessarily consistent with x^m. There are generally a number of ill-disposed hypotheses for a given x^m. For convenience, we call any artificial procedure that chooses a w_ill randomly from the collection of ill-disposed hypotheses an ill-disposed learning algorithm. An example of such a procedure is described by Takahashi and Gu (1998). The following theorem is due to Gu and Takahashi (1997).

Theorem 1. Let P(x) be nondegenerate.¹ Let m ≥ d, where d < ∞. Then

E{R(w_ill)} ≤ (1 + o) d / (m + d),

where o = 0 for m = d and for m/d → ∞, and o ≪ 1 otherwise.

Thus, if w₀ is produced by an average good, consistent learning algorithm, E{R(w₀)} ≤ d/(m + d).

3 Characterizing Generalization Error Under a Degenerate Input Distribution
Assume that P(x) is degenerate, so that the probability measure of some instance is not zero. Let us name such an instance x₀ and assume that x₀ is contained in the decision boundary of some w_ill. Let w₀ and w₁ be two possible limit hypotheses of w_ill such that w₀ can correctly classify x₀, whereas w₁ cannot. Then, since the measure of x₀ is not zero, there is a discontinuity

¹ This means that for any given x^m drawn from P(x), all the examples are, with probability one, distinct. In practice, this is equivalent to saying that all the examples in the training set and the test set are almost surely distinct.
in the generalization error of a hypothesis when it moves from w₀ to w₁. Let P′(x) be induced from P(x) by perturbing P(x) in a neighborhood of x = x₀, such that P′(x) is nondegenerate in the neighborhood and the generalization error of w₀ with respect to P′(x) is R(w₀). Then the generalization error of w_ill with respect to P′(x), denoted by R′(w_ill), ranges from R(w₀) through R(w₁), depending on the perturbation. Here one can establish a (one-to-one) relationship between R′(w_ill) and R(w₀) according to the perturbation. Generalizing, we conclude that in the case where P(x) is degenerate, one can generally characterize the generalization ability of a consistent hypothesis w₀ by an ill-disposed hypothesis w_ill such that

R′(w_ill) = ε,  R(w₀) = χ(ε),

where χ(·) is some real-valued function and R′(w_ill) is measured with respect to some nondegenerate P′(x) perturbed from P(x).

Now apply the method of relating learning to hypothesis testing (Gu & Takahashi, 1997). Let w_ill be produced on a sample of size m chosen from P′(x). We call the d instances contained in the decision boundary of w_ill the support of w_ill. Let x^{m′}, m′ = m/d, be an average-case sample, obtained by deleting at random d − 1 instances from the support and (m − d)(d − 1)/d instances from the other m − d instances. This is the uniform pruning method. Then, according to Gu and Takahashi (1997), one can approximately treat w_ill as a hypothesis whose decision boundary is uniquely specified by one instance in x^{m′}, such that, ignoring the constraints on w_ill due to x^m − x^{m′},

Pr{R′(w_ill) ≥ ε} ≤ (1 + o)(1 − ε)^{m′}  (3.1)
holds, where o = 0 for m′ = 1 and for m′ → ∞, and o ≪ 1 otherwise. The real constraints due to x^m − x^{m′} make R′(w_ill) tend to exhibit a behavior with a smaller deviation than equation 3.1. Therefore, if w₀ is produced by an average good learning algorithm, then

Pr{R(w₀) ≥ ε} = Pr{R′(w_ill) ≥ χ⁻¹(ε)} ≤ (1 − χ⁻¹(ε))^{m/d}.

Note that the uniform pruning method equivalently reduces the learning problem to a simple one in which d = 1 and the training sample size is m′, and for m′ = 1, Pr{R(w₀) ≤ ε} ≥ χ⁻¹(ε).

4 Analysis of a Simple Problem
In this section, we develop a result that, when applied with the method of relating learning to hypothesis testing, can be used to account for the learning curve behaviors in learning majority functions.
Consider a simple learning problem in which d = 1 and an instance x_i takes one of a number of discrete values over X = [a, b] for some real a, b. We use w₀ to denote the output hypothesis consistent with the training sample x^m. For simplicity, we shall not specify how w₀ is generated but directly specify the distribution of R(w₀). Assume that there are only a finite number K + 1 of discrete values 0 = e₀ < e₁ < e₂ < ··· < e_K = 1 that R(w₀) can take, where K > 1 is some integer. For m = 1, let R(w₀) take one of these values with probabilities p₀, p₁, …, p_K, respectively. We assume that Σ_{i=0}^{k−1} p_i ≥ e_k for k = 1, …, K, so that Pr{R(w₀) ≤ ε} ≥ ε, a situation in which the learning algorithm performs better than the ill-disposed one does. Then for m ≠ 1,

Pr{R(w₀) ≥ ε} = (1 − Σ_{i=0}^{k−1} p_i)^m  for ε ∈ [e_{k−1}, e_k), k = 1, …, K,  (4.1)

and

E{R(w₀)} = Σ_{k=1}^{K} ∫_{e_{k−1}}^{e_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε = Σ_{k=1}^{K} (e_k − e_{k−1}) (1 − Σ_{i=0}^{k−1} p_i)^m.
Since equation 4.1 implies Pr{R(w₀) ≥ ε} ≤ (1 − ε)^m, we have

E{R(w₀)} = Σ_{k=1}^{K₁} ∫_{e_{k−1}}^{e_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε + Σ_{k=K₁+1}^{K₂} ∫_{e_{k−1}}^{e_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε + Σ_{k=K₂+1}^{K} ∫_{e_{k−1}}^{e_k} (1 − Σ_{i=0}^{k−1} p_i)^m dε

≤ ∫_0^{e_{K₁}} (1 − ε)^m dε + Σ_{k=K₁+1}^{K₂} (e_k − e_{k−1}) (1 − Σ_{i=0}^{k−1} p_i)^m + ∫_{e_{K₂}}^{1} (1 − ε)^m dε

= (1/(m + 1)) (1 − (1 − e_{K₁})^{m+1}) + Σ_{k=K₁+1}^{K₂} (e_k − e_{k−1}) (1 − Σ_{i=0}^{k−1} p_i)^m + (1 − e_{K₂})^{m+1} / (m + 1),  (4.2)
where K₁ < K₂ < K. Note that if e_{K₁} is sufficiently small but e_{K₂} is considerably large, the learning curve would quickly transit to several exponential
behaviors (the second part) and then to a polynomial behavior (the first part). Also note that if e_{K₁} is far smaller than e_{K₁+1}, the last polynomial behavior may not be visible unless the sample size is sufficiently large.

5 Some General but More Descriptive Results
It is expected that the above results will allow us to interpret a number of learning curve behaviors in general cases. It is also clear, however, that even though a learning problem can be reduced to one equivalent to the case where d = 1, a complete characterization of the learning problem might still be impossible (see, e.g., section 6). In this section, we develop some results useful for a qualitative interpretation.

5.1 Learning Is an Experiment of m Trials. Recall that the training examples can be presented to the learning machine either one by one or in batch mode. We shall now regard the learning process, in which a hypothesis is randomly selected from the set of consistent hypotheses, as an experiment consisting of a sequence of m trials, where a trial in the learning process is one in which the learning machine receives a training example (and consequently the learning algorithm adjusts its output). Sets of realizations (hypotheses produced by the learning algorithm) will be called outcomes of the experiment.² The most important outcome of interest here is the one in which the learning machine attains perfect performance (R(w₀) = 0).

5.2 Events of a Trial. In each trial, the learning machine receives an example according to some distribution P(x). For convenience, we shall take the input space X as a feature space and instances as points of the feature space. Thus, in each trial, the learning machine is assigned a feature according to some probability distribution. In this case, we say that the feature occurs in that trial. It is possible that a set of features exhibits the same overall effect on generalization as a single feature in the set does. For example, in the case where X = [0, 1], P(x) is uniform over [0, 0.5 − a] ∪ [0.5 + a, 1], and the boundary of the target function lies at x_t = 0.5, where 0 ≤ a < 0.5, any x ∈ (0.5 − a − c, 0.5 − a] has the same effect as far as perfect performance is concerned.
We call such a set of features an event of a trial.^3

^2 Accordingly, the terminologies experiment, trial, and outcome here are exactly those used in classic probability theory.
^3 It might be instructive to see that if one considers the definition of events as simply based on a specific generalization error, then it should have a close relation to the discussion of slicing the ε-ball (see, e.g., Haussler et al., 1994), and the occurrence of the events should be related to the information entropy of the ε-ball. It is clear, however, that finding such events or the way of slicing the ε-ball is usually very difficult and even unrealistic, except in some very simple problems such as d = 1.
Exponential or Polynomial Learning Curves?
Thus, mutually exclusive sets of features can be assigned to different events. If at least one feature contained in an event occurs in a trial, we say that the event occurs in that trial. Further, since the (learning) experiment consists of a number of trials, if an event occurs in a trial, it will often be convenient to say that the event occurs in the experiment.

Lemma 1. Let A be an outcome of an experiment consisting of a sequence of m trials. Let A_1, A_2, ..., A_N be N < ∞ mutually exclusive events of a trial such that once all of them occur in m trials, the outcome of the experiment is A. Let P(A) denote the probability that the outcome of the experiment is A. Then P(Ā) = 1 − P(A) decreases exponentially in m for sufficiently large m.
Lemma 2. Let A be an outcome of an experiment consisting of a sequence of m trials. Let A_1, A_2, ..., A_N be N mutually exclusive, equally probable events of a trial such that once all of them occur in m trials, the outcome of the experiment is A. Let p denote the probability that the event in a trial is A_i for 1 ≤ i ≤ N. Let P(A) denote the probability that the outcome of the experiment is A. Then
    P(Ā) = 1 − P(A) = Σ_{i=1}^{N} (N choose i) (−1)^{i+1} (1 − ip)^m.
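As a quick numerical sketch (not from the paper; the parameter values and the Monte Carlo sampler are illustrative), the inclusion-exclusion closed form of lemma 2 can be checked against simulation:

```python
import math
import random

def p_all_events_occur(N, p, m):
    """Closed form of lemma 2: probability P(A) that all N mutually exclusive,
    equally probable events (per-trial probability p each) occur at least once
    within m trials, via inclusion-exclusion."""
    return 1.0 - sum(math.comb(N, i) * (-1) ** (i + 1) * (1.0 - i * p) ** m
                     for i in range(1, N + 1))

def p_all_events_mc(N, p, m, runs=5000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        seen = set()
        for _ in range(m):
            u = rng.random()
            if u < N * p:            # u fell in one of the N event intervals
                seen.add(int(u / p))
        hits += (len(seen) == N)
    return hits / runs

N, p, m = 4, 0.05, 200               # illustrative values only
exact = p_all_events_occur(N, p, m)
approx = (1.0 - math.exp(-p * m)) ** N   # small-p approximation from the text
mc = p_all_events_mc(N, p, m)
print(exact, approx, mc)
```

For p ≪ 1 and moderately large m, the exact expression, the (1 − e^{−pm})^N approximation, and the simulation agree closely.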
In case p ≪ 1, P(Ā) ≈ 1 − (1 − e^{−pm})^N, which is approximately N e^{−pm} for sufficiently large m.

5.3 Exponential Transition of Learning Curves. Now assume that A_1, A_2, ..., A_N are N mutually exclusive events of a trial such that once they occur in the learning process, perfect performance will be attained. Let x_m be a sample such that not all of A_1, A_2, ..., A_N occur. Let w_0 be output by an average good learning algorithm. Then R(w_0) may not be zero. Let e_1 be the maximum of such R(w_0). Consider now x_m as being drawn from a nondegenerate input distribution^4 P'(x), let w_ill be an ill-disposed hypothesis, and let R'(w_ill) be the generalization error measured with respect to P'(x). Since we had assumed that an average good learning algorithm is better than the ill-disposed one, R'(w_0) ≤ R'(w_ill) and R(w_0) ≤ R(w_ill), where R(w_ill) is measured with respect to P(x). Thus, R(w_ill) is at least e_1, that is, R(w_ill) ∈ [e_1, 1]. Applying the uniform pruning method, we obtain a subsample x'_m of x_m, and E{R'(w_ill)} ≤ (1 + o)d / (m + d). Since in the case d = 1 and m' = 1, R'(w_ill) is uniformly distributed over [0, 1] and R(w_ill) is uniformly distributed over [e_1, 1] (refer to equation 3.1), one can simply consider that in the case d = 1 there is a linear map from R'(w_ill) to R(w_ill), that is, R(w_ill) = e_1 + (1 − e_1) R'(w_ill). Hence in the case where not all A_1, A_2, ..., A_N
^4 If necessary, x_m can be perturbed to some extent.
Hanzhong Gu and Haruhisa Takahashi
occur, E{R(w_0)} is upper bounded by e_1 + (1 − e_1)d / (m + d) according to theorem 1. Let A denote the outcome of the (learning) experiment in which the learning machine attains perfect performance. Then
    E{R(w_0)} ≤ 0 · P(A) + P(Ā) (e_1 + (1 − e_1) d/(m + d))
              = e_1 P(Ā) + (d/(m + d)) (1 − e_1) P(Ā).   (5.1)
Theorem 2. If the measure of solutions to perfect performance is not zero, then the learning process will improve its average generalization performance exponentially in m for sufficiently large m.
Proof. Since the measure of solutions to perfect performance is not zero, it is sufficient to assume that there are finite N mutually exclusive events of a trial such that once the training sample contains all of these N events, perfect performance will be attained. Let p denote the smallest of the probabilities of the N events of a trial. Then by lemma 1 and equation 5.1,
    E{R(w_0)} ≤ N e_1 (1 − p)^m + N (1 − e_1) (1 − p)^m · d/(m + d),
which decays exponentially in m when the exponential term dominates. Thus, as far as perfect performance is concerned, average-case learning curves will probably exhibit roughly polynomial behaviors for small m and roughly exponential behaviors for sufficiently large m. However, it is also clear that unless N ≪ m and p is considerably large, the exponential transition to perfect performance may in fact be invisible at finite m.

We have so far focused on the situation where only the transition to perfect performance is concerned. In some learning problems, there could exist events that make an output hypothesis have only a small generalization error. In this case, it is not difficult to verify that the learning curve could also exhibit exponential behaviors, but the transition would not be toward perfect performance. On the other hand, it is also possible that there could exist events such that if the ill-disposed algorithm outputs w_ill with R(w_ill) = e, then the events would make the learning algorithm produce w_0 with R(w_0) = ae for some a < 1, and without the events, R(w_0) is close to R(w_ill). In this case, the learning curve will follow from

    E{R(w_0)} ≤ P(A) a d/(m + d) + P(Ā) d/(m + d).
Since P(A) exponentially approaches unity for sufficiently large m, the learning curve could exhibit an exponential transition between two polynomial behaviors. Furthermore, it is still possible that for a given learning problem, a learning algorithm A' could perform either optimally or nonoptimally, depending on how much information is contained in the training sample. In this case, one may consider that there exists another consistent learning algorithm A'' that is the same as A' when A' performs nonoptimally. The situation could then be modeled as follows: there are events A such that if A'' outputs w'' with R(w'') = e, then the events would make A' produce w_0 with R(w_0) = ae, and without the events, R(w_0) = e. The learning curve will then be upper bounded by P(A) a E{R(w'')} + P(Ā) E{R(w'')}. Thus if E{R(w'')} has an exponential behavior, the learning curve of A' exhibits an exponential transition between two exponential behaviors (for example, the learning curve in section 6.3 may be interpreted in this way). However, since these discussions are rather descriptive, we shall not pursue them further.

6 Learning Majority Functions
In this section, we present simulation results in learning majority functions. (For a detailed discussion, see Gu, 1997.)

The majority function is a boolean predicate and is linearly separable. The output of a D-dimensional majority function is 1 if and only if more than half of the D inputs are 1. In what follows, we use (x_{i1}, x_{i2}, ..., x_{iD}) to denote the D inputs of the ith instance x_i, where x_{ij} ∈ {0, 1}. Accordingly, the majority function sets the target output y_i to 1 if Σ_{j=1}^{D} x_{ij} > D/2 and to 0 otherwise. We assume that D is even and that each of the D inputs of an example is set independently to 0 or 1 with probability 1/2.

Since a positive example satisfies Σ_{j=1}^{D} x_{ij} > D/2, we shall categorize the positive instances (examples) into A_k^+ = {x_i : Σ_{j=1}^{D} x_{ij} = D/2 + k + 1} for k = 0, ..., D/2 − 1, and the negative instances (examples) into A_k^− = {x_i : Σ_{j=1}^{D} x_{ij} = D/2 − k} for k = 0, ..., D/2, respectively. If an instance x_i categorized into A_k^+ (or A_k^−) occurs in a trial, we shall simply say that the instance in that trial is A_k^+ (or A_k^−). Let p_k^+, p_k^− denote the probabilities that the instance in a trial is A_k^+, A_k^− for k = 0, ..., D/2 − 1, respectively. Then p_k^+ = (D choose D/2 + k + 1)(1/2)^D and p_k^− = (D choose D/2 − k)(1/2)^D. In the case where D is large, p_k^+ ≈ p_k^−. For simplicity, we shall treat A_k^+ and A_k^− as two equally probable instances that occur in a trial, and denote the probability of the occurrence by p_k.
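The event probabilities p_k^± above follow directly from binomial counts. A small numerical sketch (the averaging of p_k^+ and p_k^− into a single p_k mirrors the simplification in the text):

```python
import math

def event_probs(D):
    """Per-trial probabilities of the instance categories A_k^+ / A_k^- for a
    D-dimensional majority function with i.i.d. fair-coin inputs (D even)."""
    pk_plus = [math.comb(D, D // 2 + k + 1) * 0.5 ** D
               for k in range(D // 2)]          # positive instances
    pk_minus = [math.comb(D, D // 2 - k) * 0.5 ** D
                for k in range(D // 2 + 1)]     # negative instances
    # For large D, p_k^+ is close to p_k^-; the text treats the pair as one
    # event of probability p_k (here taken as their average).
    pk = [0.5 * (pk_plus[k] + pk_minus[k]) for k in range(D // 2)]
    return pk_plus, pk_minus, pk

pk_plus, pk_minus, pk = event_probs(50)
print(round(pk[0], 2), round(pk[1], 2))  # close to the 0.11 and 0.10 of section 6.1
```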
Figure 1: The learning curve of learning a 50-dimensional majority function with a perceptron whose 49 input-output weights are fixed to 1. The dashed line indicates a theoretical prediction (see equation 6.1).
6.1 Training a Perceptron with D − 1 Weights Fixed. We begin with the results of learning with a perceptron with D − 1 fan-in connections set to 1; that is, only one weight and the threshold need to be set by some learning algorithm. It is not difficult to verify that in this case, d = 2. Since d = 2, one can directly assume that after applying the uniform pruning method, R(w_0) takes the discrete values 0, p_0, p_0 + p_1, ..., with probabilities p_0, p_1, p_2, ..., when m' = 1. Then, according to equation 4.2 and theorem 1, the learning curve would follow from

    E{R(w_0)} ≤ p_0 (1 − p_0)^{m/2} + p_1 (1 − p_0 − p_1)^{m/2} + (2/(m + 2)) (1 − p_0 − p_1)^{m/2+1}.   (6.1)
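The bound of equation 6.1 is easy to evaluate numerically; a minimal sketch using the p_0, p_1 values quoted below for D = 50:

```python
def bound_6_1(m, p0=0.11, p1=0.10, d=2):
    """Right-hand side of equation 6.1 (perceptron with D - 1 weights fixed,
    so d = 2); p0 and p1 are the event probabilities quoted for D = 50."""
    return (p0 * (1 - p0) ** (m / d)
            + p1 * (1 - p0 - p1) ** (m / d)
            + (d / (m + d)) * (1 - p0 - p1) ** (m / d + 1))

for m in (10, 50, 200, 1000):
    print(m, bound_6_1(m))
```

The bound decreases monotonically in m and becomes dominated by the exponential terms once m is moderately large.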
Figure 1 illustrates the simulation result in the case D = 50 (p_0 = 0.11, p_1 = 0.10). The learning algorithm is the usual reward-punishment scheme with a learning rate of 0.00005. We have found that the result depends little on the learning rate. The generalization error is estimated by testing against 4096 new examples. The result is obtained by averaging over 100 runs for each training sample size.

6.2 Learning with Perceptrons. Let the learning machine be a perceptron. Then d = D. In this case, since the understanding of the learning
Figure 2: The learning curve of learning a 50-dimensional majority function with perceptrons. The dashed lines denote two possible theoretical predictions, one of which is valid only for small m.
problem is still very limited, we first present the simulation result and then examine possible theoretical interpretations of the resultant learning curve behavior.

Figure 2 illustrates the experimental result of learning a 50-dimensional majority function with perceptrons, where two possible theoretical interpretations (B_1, B_2) are also plotted. The training sample size varies from 60 to 8000. The learning algorithm is reward-punishment with momentum. The learning rate is set to 0.65, and the momentum is set to 0.5. The generalization error is estimated by testing on 4096 new examples (8000 for training sample sizes greater than 6000). The learning curve is obtained by averaging over 50 runs for each training sample size. The possible theoretical interpretations are described below.

Since d = 50 ≫ 1 and average-case samples obtained via the uniform pruning method are only statistically unbiased, it is not difficult to verify that for m' = 1, if the decision boundary crosses over A_k^+ (or A_k^−), the change in generalization error is nearly p_k. More specifically, if x'_m contains A_0^+ or A_0^−, then one can only assert that almost surely R(w_0) is close to zero. Therefore, a crude bound on the learning curve could be
    B_1 = p_0 (1 − p_0)^{m/50} + p_1 (1 − p_0 − p_1)^{m/50} + (50/(m + 50)) (1 − p_0 − p_1)^{m/50+1},   (6.2)
which should work satisfactorily for small m but fail for large m, since it assumes that w_0 attains perfect performance whenever x'_m contains A_0^+ or A_0^−. Note from Figure 2 that the learning curve seems to differ from equation 6.2 only within a small multiplicative factor for sample sizes less than 1500 (m/d < 30).

A refined bound could be obtained by taking x_m − x'_m into account. In the case where d is large, for m' = 1, if the decision boundary of w_0 crosses over A_k^+, then the change in generalization error is almost surely nearly p_k. This is equivalent to saying that in the case m' = 1 the distribution of R(w_0) is, compared with that used in section 6.1, smeared to a certain extent, so that in some range Pr{R(w_0) ≤ e} remains nearly constant. Therefore for m' = 1, Pr{R(w_0) = 0} would be considerably small, but Pr{R(w_0) ≤ e} quickly climbs to p_0 for 0 < e ≤ p_0. Moreover, because w_0 depends on x_m − x'_m, even if x'_m contains A_0^+ or A_0^−, w_0 could still misclassify a part of A_1^+ or A_1^−, for example, A_{11}^+ = {x_i : Σ_{j=2}^{D} x_{ij} = D/2 + 1, x_{i1} = 1}. According to this type of reasoning, Pr{R(w_0) ≤ e} in the case m' = 1 is possibly characterized by the discrete values p_0 + 0.5p_1, p_0 + p_1 + p_2, p_0 + p_1 + p_2 + p_3 + 0.5p_4, ..., such that there are quick transitions when R(w_0) is near 0, p_0 + 0.5p_1, p_0 + p_1 + p_2, p_0 + p_1 + p_2 + p_3 + 0.5p_4, .... Then, according to equation 4.2 and theorem 1, the learning curve bound would be characterized by

    B_2 = (m/(m + 50)) (1 − (1 − e_1)^{m/50+1}) + (e_2 − e_1)(1 − p_0)^{m/50}
        + (p_0 + 0.5p_1 − e_2)(1 − p_0 − 0.5p_1)^{m/50}
        + (0.5p_1 + p_2)(1 − p_0 − p_1 − p_2)^{m/50}
        + (m/(m + 50)) (1 − p_0 − p_1 − p_2)^{m/50+1},
for some e_1, e_2, p_0. A possible fit is obtained when e_1 = 0.00028, e_2 − e_1 = 0.005, and p_0 = 0.03, and is plotted in the figure. Note that the first two terms in B_2 are in general extremely descriptive. The exact interpretation of the learning curve is still far from clear.

6.3 Learning with Multilayer Perceptrons. We illustrate here the result of learning a 50-dimensional majority function with 50-4-1 feedforward fully connected networks. The output layer uses a threshold unit, and the hidden layer uses sigmoidal units. To calculate the learning curve bounds for the ill-disposed learning algorithm, we simply substitute d with the number of weights; in this case, d = 209. The training algorithm is a mixture of the perceptron learning algorithm (in the output layer) and backpropagation (in the hidden layer) without momentum. The learning rate is set to 0.25. The generalization error is estimated by testing against 4096 new examples (8000 for training sample sizes exceeding 1200). The final learning curve is obtained by averaging over 50 runs for each training sample size and is
Figure 3: The learning curve of learning a 50-dimensional majority function with 50-4-1 networks. The dashed lines indicate two possible theoretical interpretations.
plotted in Figure 3, where two possible theoretical interpretations are also plotted. Similar to the discussion of learning with perceptrons, a crude bound on the learning curve could be

    B_1 = p_0 (1 − p_0)^{m/209} + p_1 (1 − p_0 − p_1)^{m/209} + (209/(m + 209)) (1 − p_0 − p_1)^{m/209+1}.
Of course, due to the limitation of the theory, the bound cannot explain the part where m < d. However, it works quite well for sample sizes in the range [500, 1700]. On the other hand, it is interesting that in this case, the bound is still valid for sample sizes up to 6000. This may arise from the fact that m/d < 30. More interesting, however, is that the learning curve appears to exhibit an exponential transition for sample sizes in [2000, 3000], and then to return to an exponential behavior that has the same exponential rate as that in [500, 1700].

One possible explanation of the learning curve behavior may be inherent in the nature of the learning algorithm. It has been shown that the backpropagation algorithm is Bayes optimal to the extent that the algorithm is able to find a global error minimum and to the extent that the network is able to represent the optimal Bayes function (Ruck, Rogers, Kabriskey, Oxley, & Suter, 1990; Wang, 1990). Therefore, if the learning algorithm performs optimally, it is reasonable to assume that for m' = 1, if the decision boundary
of w_0 crosses over A_k^+, then w_0 can classify not only A_k^+ but also most of A_{k−1}^+. Then according to equation 4.2 and theorem 1, a possible bound on the learning curve would be

    B_2 = e_1 (1 − p_0)^{m/209} + (e_2 − e_1)(1 − p_0 − p_1)^{m/209} + (209/(m + 209)) (1 − e_1)^{m/209+1},
where e_1 < p_0 and e_2 − e_1 < p_1. Since the learning algorithm is a mixture of the perceptron learning algorithm and backpropagation without momentum, when the sample size is small it may not be optimal, so that e_1 ≈ p_0, and when the sample size is sufficiently large the optimality saturates, so that e_1 ≈ a p_0 for some constant a < 1. Clearly, this is equivalent to saying that there are events of a trial sufficient to make the learning algorithm perform optimally. Therefore, the learning curve could exhibit exponential transitions between B_1 and B_2. Since B_2 ≈ a B_1 for large m/d, a possible bound on the learning curve would be B = P(A) a B_1 + [1 − P(A)] B_1, where P(A) denotes the probability that the events occur.

Note that the target function (the majority function) is linearly separable, and its decision boundary can be uniquely specified by D points. Hence, because an optimal learning machine is a perceptron with d = D, whether the learning algorithm can perform optimally could depend essentially on the fact that a reduced sample of size m/D obtained by the uniform pruning method contains A_0^+ (or A_0^−), which is necessary to make the perceptron show perfect performance. This defines one event. Clearly, to ensure that the training algorithm performs optimally, we would need D such events (because D points are needed to specify the decision boundary of the target function). Therefore, by lemma 2, P(A) ≈ [1 − exp(−p_0 m / D)]^D. Indeed, a good fit to the learning curve can be obtained by setting a = 0.17 and P(A) = [1 − exp(−0.0021m)]^{50} (see Figure 3). The exact interpretation is not yet clear and seems challenging.

7 Conclusion
Although our discussion is somewhat descriptive, it accounts for learning curve behaviors in learning majority functions. The results show that these learning curves contain both polynomial and exponential behaviors and that the exponential transitions are not necessarily toward perfect performance. Similar results can be obtained for learning majority-XOR functions (Gu, 1997). We expect that our results can also be applied to other learning problems.
Acknowledgments
H. Gu acknowledges a student scholarship from the Japanese government (Monbusho). This work was supported in part by a Grant-in-Aid for Scientific Research from the Ministry of Education, Science and Culture, Japan.

References

Cohn, D., & Tesauro, G. (1992). How tight are the Vapnik-Chervonenkis bounds? Neural Computation, 4, 249–269.
Gu, H. (1997). Towards a practically applicable theory of learning curves in neural networks. Unpublished doctoral dissertation, University of Electro-Communications, Tokyo.
Gu, H., & Takahashi, H. (1996). Towards more practical average bounds on supervised learning. IEEE Transactions on Neural Networks, 7, 953–968.
Gu, H., & Takahashi, H. (1997). Estimating learning curves of concept learning. Neural Networks, 10, 1089–1102.
Haussler, D., Seung, H. S., Kearns, M., & Tishby, N. (1994). Rigorous learning curve from statistical mechanics. In Proc. 7th Annu. ACM Workshop on Comput. Learning Theory (COLT) (pp. 76–87).
Ruck, D. W., Rogers, S. K., Kabriskey, M., Oxley, M. E., & Suter, B. W. (1990). The multilayer perceptron as an approximation to a Bayes optimal discriminant function. IEEE Trans. Neural Networks, 1, 296–298.
Schwartz, D. B., Samalam, V. K., Solla, S. A., & Denker, J. S. (1990). Exhaustive learning. Neural Computation, 2, 374–385.
Takahashi, H., & Gu, H. (1998). A tight bound on concept learning. IEEE Transactions on Neural Networks, 9, 1191–1202.
Tishby, N., Levin, E., & Solla, S. (1989). Consistent inference of probabilities in layered networks: Predictions and generalizations. IJCNN Proc., 2, 403–409.
Wang, E. (1990). Neural network classification: A Bayesian interpretation. IEEE Trans. Neural Networks, 1, 303–305.
Watkin, T. L. H., Rau, A., & Biehl, M. (1993). The statistical mechanics of learning a rule. Rev. Mod. Phys., 65, 499–556.

Received February 27, 1998; accepted July 2, 1999.
LETTER
Communicated by Nicol Schraudolph
Training Feedforward Neural Networks with Gain Constraints

Eric Hartman
Pavilion Technologies, 1110 Metric Blvd. #700, Austin, TX 78758-4018, U.S.A.

Inaccurate input-output gains (partial derivatives of outputs with respect to inputs) are common in neural network models when input variables are correlated or when data are incomplete or inaccurate. Accurate gains are essential for optimization, control, and other purposes. We develop and explore a method for training feedforward neural networks subject to inequality or equality-bound constraints on the gains of the learned mapping. Gain constraints are implemented as penalty terms added to the objective function, and training is done using gradient descent. Adaptive and robust procedures are devised for balancing the relative strengths of the various terms in the objective function, which is essential when the constraints are inconsistent with the data. The approach has the virtue that the model domain of validity can be extended via extrapolation training, which can dramatically improve generalization. The algorithm is demonstrated here on artificial and real-world problems with very good results and has been advantageously applied to dozens of models currently in commercial use.

1 Introduction
Neural networks and other empirical modeling techniques are generally intended for situations where no physical (first-principles) models are available. However, in many instances, partial knowledge about physical models may be known. Because empirical and physical modeling methods have largely complementary sets of advantages and disadvantages (e.g., physical models generally are more difficult to construct but extrapolate better than neural network models), the ability to incorporate available a priori knowledge into neural network models can capture advantages from both modeling methods.

Neural Computation 12, 811–829 (2000)
© 2000 Massachusetts Institute of Technology

The nature of partial prior knowledge varies by application. For example, in the continuous process industries, such as refining and chemical manufacturing, operators and process engineers usually know, from physical understanding or experience, a great deal about their processes, including (1) input-output causality relations (knowledge that is easily modeled by network connectivity), (2) input-output nonlinear-linear relations (knowledge easily modeled by network architecture), and (3) approximate bounds
on some or all of the gains. This knowledge can be modeled by adding constraints of the form

    min_ki ≤ ∂y_k/∂x_i ≤ max_ki   (1.1)
to the training algorithm, where y_k is the kth output, x_i is the ith input, and min_ki and max_ki are constants. (Note that a nonzero max_ki − min_ki range allows for nonlinear input-output relationships.) This article describes a robust method for carrying out item 3, which we call gain-constrained training.

Gain-constrained models give superior performance both when calculating predictions and when calculating gains. When calculating predictions, extrapolation (generalization) accuracy can be improved by augmenting gain-constrained training with extrapolation training, in which the gain constraints alone are imposed outside the training region. When calculating gains, accuracy is vital for optimization, control, and other uses. Correlated inputs (substantial cross-correlations among the time series of the input variables) and "weak" (inaccurate, incomplete, or closed-loop) data are very common in real applications and frequently cause incorrect gains in neural network models. When inputs are correlated, many different sets of gains can yield equivalent data-fitting quality (the gain solution is underspecified; see section 5.1 for an example). Constraining the gains breaks this symmetry by specifying which set of gains is known a priori to be correct (the gain solution is fully specified).^1 Such gain constraints are consistent by definition with the data and therefore do not affect the data-fitting quality. In the case of "weak" data, on the other hand, gain constraints are imposed purposefully to override or contradict the data. A simple process industries example is illustrated in Figure 1, and a simple generic example is given in Sill and Abu-Mostafa (1997). Such gain constraints are not consistent with the data and therefore typically increase the data-fitting error.

Merging prior knowledge with neural networks has been addressed before in the literature (especially relevant references are Lee & Oh, 1997; Sill & Abu-Mostafa, 1997; and Lampinen & Selonen, 1995).
Thompson (1996) provides a broad framework focusing on input-output relations rather than gains. All previous work formulates the problem in terms of equality bounds (point gain targets), as opposed to the inequality bounds formulation of equation 1.1 (interval gain targets), although the formulation of Lampinen and Selonen (1995) in effect can achieve interval targets by means of an associated certainty coefficient function.
^1 A gain constraint for every input-output pair need not be specified; in this case the gain solution may be more fully, but not necessarily completely, specified.
Figure 1: Dots represent scatter-plot data of pH versus acid for a stirred-tank reactor, where the flow rate of base into the tank is constant and the pH is closed-loop controlled to a constant set-point value by adjusting the flow rate of acid into the tank. (The plot represents a small, linear region of operation.) Whenever the pH rises, the controller increases the acid inflow in order to lower the pH, and vice versa, creating data with a positive pH-acid correlation, as shown; that is, the data pH-to-acid gain is positive. The correct physical relationship, given by the solid line, has a negative pH-to-acid gain, which completely contradicts the data.
This article makes the following contributions. An extrapolation training procedure is presented. Term balancing, which is especially important in the case of conflicting data and constraints, is directly addressed. Gain constraints are formulated explicitly as inequalities (or equalities). Also, the second derivative expressions for direct input-output connections, which are important in many applications, are new.

2 The Objective Function
A neural network with outputs y_k and inputs x_i is trained subject to the set of constraints of equation 1.1. The objective function to be minimized in training is

    E = E_D + E_C,   (2.1)

where E_D is the data-fitting error, and E_C is a penalty for violating the constraints. Weighting factors for E_D and E_C are introduced later in the form
Figure 2: Penalty function f(∂y_k/∂x_i).
of learning rates (section 3). We use the standard sum of squares for E_D:

    E_D = Σ_{k=1}^{N_out} E_{D_k} = Σ_{p=1}^{N_trp} Σ_{k=1}^{N_out} (t_k^p − y_k^p)^2,   (2.2)
where p is the pattern index, t_k is the target for output y_k, N_out is the number of output units, and N_trp is the number of training patterns. Although we assume equation 2.2 throughout, the algorithm has no inherent restrictions to this form for E_D. We write E_C as

    E_C = Σ_{k=1}^{N_out} Σ_{i=1}^{N_inp} E_{C_ki} = Σ_{p=1}^{N_trp} Σ_{k=1}^{N_out} Σ_{i=1}^{N_inp} f(∂y_k^p / ∂x_i^p),   (2.3)
where N_inp is the number of input units and f measures the degree to which ∂y_k^p/∂x_i^p violates its constraints (see section 2.1 and Figure 2). In the following, the pattern index p is suppressed for clarity.

2.1 The Penalty Function. We define the penalty function f as
    f(∂y_k/∂x_i) ≡ f_ki =
        ∂y_k/∂x_i − max_ki   if ∂y_k/∂x_i > max_ki   (max is violated)
        min_ki − ∂y_k/∂x_i   if ∂y_k/∂x_i < min_ki   (min is violated)
        0                    otherwise               (no violation)   (2.4)

and its derivative as

    f'_ki =
        +1   if ∂y_k/∂x_i > max_ki   (max is violated)
        −1   if ∂y_k/∂x_i < min_ki   (min is violated)
        0    otherwise               (no violation).   (2.5)
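A minimal sketch of equations 2.4 and 2.5 in code, together with the backward-pass computation of the gains themselves (section 3.2, step 2). The 3-4-1 tanh network, the random weights, and the [−1, 1] gain bounds are all illustrative assumptions, not values from the article:

```python
import numpy as np

# -- Penalty f (eq. 2.4) and its derivative f' (eq. 2.5) for one gain --
def gain_penalty(gain, lo, hi):
    if gain > hi:
        return gain - hi, 1.0    # max is violated
    if gain < lo:
        return lo - gain, -1.0   # min is violated
    return 0.0, 0.0              # no violation

# -- A tiny 3-4-1 tanh network with random (hypothetical) weights --
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=4), 0.1

def forward(x):
    return np.tanh(x @ W1 + b1) @ W2 + b2

def gains(x):
    """Backward pass: set the delta at the (linear) output to y' = 1 and
    backpropagate; the deltas arriving at the inputs are dy/dx_i."""
    h = np.tanh(x @ W1 + b1)
    return W1 @ (W2 * (1.0 - h ** 2))   # backprop through the tanh layer

x = np.array([0.2, -0.5, 0.8])
g = gains(x)

# Finite-difference check of the backward-pass gains
eps = 1e-6
fd = np.array([(forward(x + eps * np.eye(3)[i]) - forward(x - eps * np.eye(3)[i]))
               / (2 * eps) for i in range(3)])
assert np.allclose(g, fd, atol=1e-5)

# Penalty terms for each input gain, constrained here to [-1, 1]
penalties = [gain_penalty(gi, -1.0, 1.0) for gi in g]
print(g, penalties)
```

In the full algorithm, each nonzero f'_ki would multiply the second-derivative expressions of the appendix to produce the constraint contribution to the weight update of equation 3.1.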
The function f is shown in Figure 2 (such shapes are widely used in penalty functions). No improvement in solution quality resulted from using an everywhere-smooth (and more time-consuming) approximation of equation 2.4.

3 Gradient Descent Training
From equations 2.2, 2.3, and 2.5 and the chain rule, the gradient descent update equation for a single pattern is given by

    Δw = − Σ_{k=1}^{N_out} η_k (∂/∂w)(t_k − y_k)^2 − Σ_{k=1}^{N_out} Σ_{i=1}^{N_inp} λ_ki f'_ki (∂/∂w)(∂y_k/∂x_i),   (3.1)
where w denotes an arbitrary weight or bias, η_k is the learning rate for E_{D_k}, and λ_ki is the learning rate for E_{C_ki}. Expressions for the second derivatives (∂/∂w)(∂y_k/∂x_i) are given in the appendix. The weights and biases are updated after M patterns (M a parameter) according to

    Δw(T) = Δw(T) + μ Δw(T − 1),   (3.2)
where T is the weight update index and μ is the momentum coefficient. (Section 3.2 details the steps of the training algorithm.)

An alternative to this gradient-descent formulation would be to train using a constrained nonlinear programming (NLP) method (e.g., SQP or GRG) (Nash & Sofer, 1996; Smith & Lasdon, 1992). We chose our formulation instead because (1) commercial NLP codes use Hessian information (either exact or approximate), and in our experience any such explicitly higher-order training method tends toward overfitting in a way that is difficult to control in a problem-independent manner, and (2) the methods of sections 3.1 and 4 constitute "changes to the problem" for NLPs, which would interfere with their Hessian calculation procedures.

3.1 Extrapolation Training. Some degree of improved model extrapolation may be expected from gain-constrained training solely from enforcing correct gains at the boundaries of the training data set (Lee & Oh, 1997). However, extrapolation training explicitly improves extrapolation accuracy by shaping the response surface outside the training regions. Extrapolation training applies the gain constraint penalty terms alone to the network for patterns artificially generated anywhere in the entire input space, whether or not training patterns exist there. An extrapolation pattern consists of an input vector only; no output target is needed because the penalty terms do not involve output target values. The number of extrapolation patterns to be included in each training epoch is an adjustable parameter. Extrapolation input vectors are generated at random from a uniform distribution covering all areas of the input space over which the constraints are assumed to
be valid. The effectiveness of this method is shown in the synthetic data example of section 5.1.

3.2 Training Procedure Summary. The training algorithm proceeds as
follows:

1. Backpropagation (gradient descent) (LeCun, 1985; Parker, 1985; Rumelhart, Hinton, & Williams, 1986) is performed on E_D, and the Δw's in equation 3.1 are incremented with the first-term contributions.

2. A backward pass for each y_k is performed to compute ∂y_k/∂x_i. This is done by clearing the propagation-error δ's (see Rumelhart et al., 1986) for all units, setting δ_{y_k} = y'_k, and backpropagating to the inputs, which yields δ_{x_i} = ∂y_k/∂x_i. This follows easily from the definition of δ, and this technique has long been known to practitioners. (See Lee & Oh, 1997, for a complete derivation.)

3. Quantities computed in step 2 are used to evaluate the (∂/∂w)(∂y_k/∂x_i) expressions given in the appendix, and the Δw's in equation 3.1 are incremented with the second-term contributions.

4. The weights are updated after every M patterns (1 ≤ M ≤ N_trp) according to equation 3.2.

For an extrapolation pattern (section 3.1), step 1 is omitted and step 2 is preceded by a forward pass.

4 Dynamic Balancing of Objective Function Terms
If the constraints do not conflict with the data, balancing the relative strengths (learning rates) of the data and penalty terms in the objective function is relatively straightforward (Lee & Oh, 1997). If the constraints conflict with the data, however, a robust method for term balancing is of key importance.

While the constraints have priority over the data in case of conflict, insisting that all constraint violations be zero is ordinarily neither necessary nor conducive to obtaining good models (models of practical value) in real applications. Typically it is not desirable to drive "constraint outliers" of negligible practical consequence into absolute conformance at the expense of significantly worsening the data fit. Instead, we allow for each constraint a specified fraction of training patterns (typically 10–20%) to violate the constraint to any degree before deeming the constraint to be in violation. The violating patterns may be different for each constraint, and the violating patterns for each constraint may be chosen from the entire training set. The resulting combinatorics allows the network enormous degrees of freedom to achieve maximum data fitting while satisfying the constraints.
Training Feedforward Neural Networks
817
Dynamic term balancing proceeds as follows. The initial values of the ηk are set to a standard value. For each output, the initial values of the λki are computed before training begins, such that the average first-level derivatives of the data term (for that output) are approximately equal to the first-level derivatives of the sum of the constraint terms (for that output). The learning rates ηk and λki are then adjusted after each epoch according to the current state of the training process.

A wide variety of situations are possible in gain-constrained training: constraints may be consistent or inconsistent with the data; constraints may be violated at the outset of training or only much later, after the gains have grown in magnitude; or the constraints may be misspecified and cannot be achieved at all. Hence, a variety of rules for adjusting the learning rates are necessary. To adjust ηk and λki after each epoch, we use an extension of the rapid-slow ηk dynamics described by Vogl et al. (1988) for training backprop nets, where ηk is slowly (additively) increased if the error is decreasing and rapidly (multiplicatively) decreased if the error is increasing. In the following, a "large" change indicates a multiplicative change, and a "small" change indicates an additive change. The fundamental goal that the constraints should override the data in case of conflict dictates the following basic scheme (λki is, of course, subject to change only if a constraint for input-output variable pair ki has been specified):

λki: Large increase if constraint ki is not satisfied ("satisfaction" is defined above in this section). Small decrease if constraint ki is satisfied.

ηk: Large decrease if ED has worsened (increased). Small increase if ED has improved (decreased).
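A minimal sketch of the basic per-epoch adjustment (the exceptions discussed next in the text are omitted), using the "typical" parameter values quoted at the end of this section; the function name and rate variables are ours:

```python
def update_rates(eta_k, lam_ki, constraint_satisfied, E_D_worsened):
    """One epoch of the basic rapid (multiplicative) / slow (additive)
    learning-rate dynamics, exceptions omitted. lam_ki is the
    constraint-term rate, eta_k the data-term rate; step sizes are the
    'typical' values quoted in the text (assumed proportional steps).
    """
    if not constraint_satisfied:
        lam_ki *= 1.3            # large increase: constraint violated
    else:
        lam_ki -= 0.03 * lam_ki  # small decrease: constraint satisfied
    if E_D_worsened:
        eta_k *= 0.8             # large decrease: data error worsened
    else:
        eta_k += 0.02 * eta_k    # small increase: data error improved
    return eta_k, lam_ki

# Constraint violated while the data error worsens: the constraint term
# is strengthened and the data term is throttled back.
eta, lam = update_rates(0.1, 1.0, constraint_satisfied=False, E_D_worsened=True)
```

The asymmetry (multiplicative when pushing toward constraint satisfaction, additive when relaxing) is what lets the constraints dominate in case of conflict.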
The exceptions to these rules are presented in correspondence with the rules above:

λki: No increase is made if ECki has improved, since, given that ECki is improving, increasing λki would make it unnecessarily harder for ED to improve. No decrease is made if all constraints are exactly satisfied (EC = 0). In this case, either the constraints agree with the data, in which case retaining the λki level will actually speed convergence, or the constraints are such that they require large gains to be violated and will therefore remain satisfied until later in the training process, when the weights have grown sufficiently to produce large gains. In this case, repeatedly decreasing the λki's early in the training process would result in their being too small to be effective when such constraints become violated.

ηk: No decrease is made in two cases: (1) all constraints are satisfied, but EC has increased; both ED and EC are increasing, but the decrease in λki (due to the constraints being satisfied) might allow ED to improve without decreasing ηk. (2) Not all constraints are satisfied, but EC has improved; this improvement may allow ED to improve without decreasing ηk. No increase is made if EC has increased, in accordance with allowing the constraints to dominate.

Parameter values used in these dynamics depend on the network architecture, but for typical networks they are generally in the range of 1.3 × λki and 0.8 × ηk for multiplicative changes, and (−)0.03 × λki and (+)0.02 × ηk for additive changes. Also, expected statistical fluctuations in the training errors are accounted for when deeming whether ED, EC, or ECki has increased.

5 Examples
This section applies gain-constrained training to two problems. The first is a synthetic example with maximally correlated (identical) inputs, which clearly illustrates the effectiveness of extrapolation training as well as the use of gain constraints to fully specify gain solutions that are otherwise underspecified due to correlated inputs. The second is a real-world process-industry application involving "weak" data.

5.1 Synthetic Example. The synthetic example consisted of 10 maximally correlated (identical) inputs, x1 through x10, and one output, y. The target t was also equal to the inputs,
t = x1 = x2 = x3 = ··· = x10,  (5.1)
each variable being a ramp from 0 to 1. Figure 3 illustrates the data configuration in input space (in 3 instead of 10 dimensions). The data set consisted of 10,000 patterns. A three-layer architecture was used in all cases, with sigmoid output and hidden units. Although the data were known to be linear, a nonlinear model was used to demonstrate the effectiveness of the algorithm.

Figure 3: Data configuration for the synthetic example. All variables are identical; the heavy line represents the training data (the actual input space is a 10-dimensional hypercube). In gain-constrained training, both gain constraints and data fitting are applied to the training data. The dots represent extrapolation training patterns (randomly generated throughout training), where the gain constraints alone are applied (output targets neither exist nor are required).

According to equation 5.1, any set of gains summing to unity would be consistent with the data. We arbitrarily chose the gains ∂y/∂x1 = 0.25 and ∂y/∂x2 = 0.75. Allowing 5% tolerance (typical in real problems), we specified the constraints as

0.244 ≤ ∂y/∂x1 ≤ 0.256  (5.2)
0.731 ≤ ∂y/∂x2 ≤ 0.769  (5.3)
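A sketch of this data set and the constraint specification (variable names and the dictionary layout are ours):

```python
import numpy as np

# 10 identical inputs and a target that equals them, each ramping from
# 0 to 1 (equation 5.1); the pattern count follows the text.
n_patterns = 10_000
ramp = np.linspace(0.0, 1.0, n_patterns)
X = np.tile(ramp[:, None], (1, 10))   # columns x1..x10, all identical
t = ramp.copy()                       # target equals the inputs

# Gain constraints 5.2 and 5.3, keyed by input index; inputs x3..x10
# are deliberately left unconstrained.
constraints = {1: (0.244, 0.256), 2: (0.731, 0.769)}
```

Because every input column is identical, any distribution of gains summing to one fits the data; the constraints pin down one particular solution.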
No gain constraints involving inputs x3 through x10 were specified. Figure 4 contains gain plots from models trained (a) without gain constraints, (b and c) with gain constraints but without extrapolation training, and (d and e) with both gain constraints and extrapolation training. The constraints were consistent with the data, and in all cases the data-fitting error was 0.0. Figure 4a shows the gains from the model trained without constraints. Each curve shows the gain of the output with respect to a single input, as a function of that input (ranging from 0 to 1), with the other nine inputs held constant at 0.5. That is, each gain curve is along a "horizontal" or "vertical" line running through the center of the 10-dimensional input space hypercube (see Figure 3). As expected, the gains of the unconstrained model are distributed roughly uniformly across the 10 identical inputs, with a mean of about 1/10. Figure 4b shows the gains for a model trained with gain constraints but without extrapolation training. In this case, training (with constraints) occurs only within the training data region, which runs diagonally through
the center of the input space (see Figure 3). The "horizontal" or "vertical" gain curve for an input xi therefore intersects the training data region only at xi = 0.5 (the center of the input space; in each curve all other inputs are
held constant at 0.5). Not surprisingly, then, Figure 4b shows that the gain constraint specified for ∂y/∂x1 (0.25) is satisfied only at x1 = 0.5 (where training data exist), whereas away from the data the gains vary drastically from 0.25. The same comments hold for the ∂y/∂x2 plot: the constraint for ∂y/∂x2 (0.75) is satisfied only at the midpoint, where there are training data, and varies drastically away from the data. Figure 4c is identical to Figure 4b, except that the remaining nine inputs in each curve are held constant at values corresponding to a vertex of the 10-dimensional hypercube rather than the midpoint. Thus, the gain plots run along edges of the input space rather than through its center. The vertex was chosen such that the plots nowhere intersect the training data. The gains ∂y/∂x1 and ∂y/∂x2 satisfy their constraints only coincidentally at certain points.

Figures 4d and 4e correspond to 4b and 4c, respectively, but with extrapolation training added. Figures 4d and 4e indicate that the model corresponding to the specified gain constraints has been learned everywhere over the 10-dimensional input space, even in corners far distant from any training data. Recalling that constraints were applied only to ∂y/∂x1 and ∂y/∂x2, Figures 4d and 4e also show that extrapolation training more tightly enforced the "implied" constraints ∂y/∂x3–10 = 0. Because the constraints in this example were consistent with the data, the number of epochs required to achieve a given data-fitting error with the gain constraints was fewer than the number required to achieve the same error without the constraints.

5.2 Industrial Process Application. In addition to the usual sparsity and inaccuracies, the data for this industrial process example were "weak" in a way that is not uncommon in data gathered from manufacturing processes.

Figure 4: Facing page. Gains for the synthetic example (equations 5.1–5.3). Each curve shows the gain of the output with respect to a single input while holding the other inputs constant at their average (midpoint) values in (a), (b), and (d), and at a vertex in (c) and (e) (the same vertex in both cases, distant from the data set). GC = gain constraints, ET = extrapolation training: models were trained with (a) no GC, (b) GC but no ET (midpoint), (c) GC but no ET (vertex), (d) GC with ET (midpoint), (e) GC with ET (vertex). (a) The unconstrained gains average 0.10 (as expected). (b) The specified constraints are satisfied only where the gain curves cross the data set, at the horizontal midpoint (cf. Figure 3). (c) The gain curves nowhere cross the data set and are correct only at accidental locations. (d, e) With extrapolation training, the specified constraints are satisfied within tolerance everywhere in the 10-dimensional hypercube input space, and the x3 through x10 gains are driven more closely to zero. The constraints were consistent with the data (constraining the otherwise underspecified gain solutions), and the data-fitting error was 0.0 in all cases. Constraints were applied only to ∂y/∂x1 and ∂y/∂x2.
822
Eric Hartman
Figure 5: Network architecture for the industrial process application. The output units are linear and the hidden units are sigmoidal.
The data were collected while parts of the process were under closed-loop control. In this case, an unconstrained neural network will model a mixture of the controller and the process. Because a controller (whether automated or human) essentially represents an inverse model of the process, networks trained without gain constraints often contain gains with reversed signs (see Figure 1 for a simple illustration), as is the case in this example.

The process has seven inputs, x1 through x7, and two outputs, y1 and y2 (generic labels are used to protect confidentiality). The data set consisted of approximately 79,000 patterns. Figure 5 shows the three-layer architecture that was used throughout, containing a separate hidden layer for each output. Note that only the x3 input was connected to both outputs. All units were fully connected to their appropriate layers. Linear output and sigmoid hidden units were used. Gain constraints were imposed to override the incorrect gains. The physically correct gain signs, known a priori, were specified as:

0.05 ≤ ∂y1/∂x3 ≤ +1   (positive)  (5.4)
0.05 ≤ ∂y1/∂x4 ≤ +1   (positive)  (5.5)
0.05 ≤ ∂y1/∂x5 ≤ +1   (positive)  (5.6)
−1 ≤ ∂y1/∂x6 ≤ −0.05  (negative)  (5.7)
−1 ≤ ∂y2/∂x1 ≤ −0.05  (negative)  (5.8)
0.05 ≤ ∂y2/∂x2 ≤ +1   (positive)  (5.9)
−1 ≤ ∂y2/∂x3 ≤ −0.05  (negative)  (5.10)
0.05 ≤ ∂y2/∂x7 ≤ +1   (positive)  (5.11)
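A sketch of how such sign constraints might be represented and checked against a model's computed gains; the dictionary layout and the stub gain values are illustrative, not the paper's implementation:

```python
# Bounds 5.4-5.11, keyed by (output, input) pair.
SIGN_CONSTRAINTS = {
    ("y1", "x3"): (0.05, 1.0),   ("y1", "x4"): (0.05, 1.0),
    ("y1", "x5"): (0.05, 1.0),   ("y1", "x6"): (-1.0, -0.05),
    ("y2", "x1"): (-1.0, -0.05), ("y2", "x2"): (0.05, 1.0),
    ("y2", "x3"): (-1.0, -0.05), ("y2", "x7"): (0.05, 1.0),
}

def violations(gains):
    """Return the (output, input) pairs whose gain falls outside its bounds."""
    return [ki for ki, (lo, hi) in SIGN_CONSTRAINTS.items()
            if not lo <= gains[ki] <= hi]

# Stub gains for an unconstrained model with one reversed-sign gain,
# as happens under closed-loop data (cf. Figure 6b).
example_gains = {ki: 0.3 if hi > 0 else -0.3
                 for ki, (lo, hi) in SIGN_CONSTRAINTS.items()}
example_gains[("y2", "x1")] = 0.2   # wrong sign
print(violations(example_gains))    # [('y2', 'x1')]
```

In practice the gains themselves would come from the backward pass of section 3.2, evaluated over the training (and extrapolation) patterns.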
The small margin of 0.05 was used to avoid zero crossings; for bounds other than zero, crossing the bounds is of less significance.

Figure 6 shows output responses to selected inputs (the gains are the slopes of these curves, in contrast to Figure 4). To avoid cluttering the plots, only two inputs are shown for each output. Figures 6a and 6b show the model trained without gain constraints, and 6c and 6d show the model trained with gain constraints. The architecture was the same in both cases (see Figure 5). Figure 6a shows the y1 response of the unconstrained model to inputs x3 and x5. In each curve, the remaining input variables were held constant at their average values. The negative slopes in these responses clearly contradict the prior knowledge that they should be positive. (The other two inputs to y1 also violated their constraints.) Figure 6b shows the y2 response of the unconstrained model to inputs x1 and x3. The positive slopes in these responses clearly contradict the prior knowledge that they should be negative. Figures 6c and 6d show the responses (for the same inputs as in Figures 6a and 6b) of a model trained with the above eight gain (sign) constraints (and with extrapolation training). The gains now satisfy the constraints.

The training and validation data-fitting error values for the unconstrained model were both 0.50. These are normalized ED error values, defined as

ED_norm = √( Σ_{p,k} (t_k^p − y_k^p)² / Σ_{p,k} (t_k^p − t̄_k)² ).  (5.12)

These error values are not unusual in industrial applications and typically represent useful models. The gain-constrained model training and validation data-fitting errors were 1.01 and 0.98, respectively. This large increase in data-fitting error is not unexpected, given the many drastically data-contradicting constraints. Our experience with this algorithm in industrial applications has ranged from very large increases in data-fitting error, as in this example (primarily overriding "weak" data), to zero increase in data-fitting error (purely constraining underspecified gain solutions). Because the constraints in this example were inconsistent with the data, the number of epochs required to achieve minimal data-fitting error subject to the constraints was somewhat greater than the number required to achieve minimal data-fitting error without the constraints. The constrained model is one of dozens of such models currently operating successfully in industrial plants.
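Equation 5.12 can be implemented directly. In this sketch (the function name is ours), the error is normalized by the spread of the targets about their per-output means, so predicting the mean gives an error of exactly 1.0:

```python
import numpy as np

def normalized_error(t, y):
    """Normalized data-fitting error of equation 5.12:
    sqrt( sum_pk (t_k^p - y_k^p)^2 / sum_pk (t_k^p - tbar_k)^2 ),
    with t, y of shape (patterns, outputs)."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    num = np.sum((t - y) ** 2)
    den = np.sum((t - t.mean(axis=0)) ** 2)  # spread about per-output means
    return np.sqrt(num / den)

# Sanity check: the mean predictor scores 1.0 by construction.
t = np.random.default_rng(0).normal(size=(100, 2))
mean_pred = np.tile(t.mean(axis=0), (100, 1))
print(round(normalized_error(t, mean_pred), 6))  # 1.0
```

On this scale, the unconstrained industrial model's 0.50 means it explains substantially more variance than the mean predictor, while the constrained model's roughly 1.0 reflects the deliberate overriding of the "weak" data.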
Figure 6: Output responses of unconstrained and constrained models of the industrial process application. The gains are the slopes of the plots. The process was under partial closed-loop control during data gathering, which resulted in some wrong-sign gains in the unconstrained model. For clarity, each part of the figure shows the response of an output to only two of its four inputs. Each curve shows the response to a single input while the remaining inputs are held constant at their average values. (a) Output y1 responses of the unconstrained model (slopes should be positive). (b) Output y2 responses of the unconstrained model (slopes should be negative). (c) Output y1 responses of the constrained model (slopes have correct sign). (d) Output y2 responses of the constrained model (slopes have correct sign). In (a) and (b), the responses shown (and others not shown) violate their constraints. In (c) and (d), the responses shown (and those not shown) conform to the specified gain constraints. Not unexpectedly, the validation data-fitting error increased significantly. The constrained model is successfully operating at a continuous-process manufacturing plant.
6 Summary and Outlook
An algorithm was presented for training feedforward neural networks subject to inequality (or equality) bound constraints on the gains of the learned mapping. Neural networks trained without gain constraints often develop wrong gains due to correlated inputs (which cause the gain solution to be underspecified), "weak" (inaccurate or incomplete) data, or both. A method for dynamically balancing mutually inconsistent constraint and data terms in the objective function was described. The robustness of this procedure depends strongly on a certain measure of constraint violation, which was also defined. The novel extrapolation training technique that was presented applies the gain constraints alone to the model in regions of input space where no training data are available. This procedure improves model extrapolation in both the prediction and the gain modes.

The effectiveness of extrapolation training was demonstrated on a 10-dimensional synthetic problem, in which the response surface of the model was correct everywhere over the input space, despite the fact that the training set was confined to a single line running diagonally through the 10-dimensional hypercube. The problem also demonstrated the effectiveness of gain constraints in fully specifying the gain solution, which was otherwise underconstrained due to correlated inputs. The algorithm was also applied to a real process-industry problem with "weak" data. Several gain constraints, known a priori, were specified to reverse the incorrect signs of the gains inherent in the data.

The effectiveness of this algorithm has been proved by its use in creating dozens of models that are operating successfully on a variety of units in continuous process plants, such as polymer reactors, boilers for power turbines, paper manufacturing, and regulated emissions monitoring.
These models are being used for continuous steady-state optimization as well as for supplying the gain component in a closed-loop nonlinear dynamic model predictive control system. In many of these cases, neural network modeling would otherwise have been essentially impossible due to critical gain problems caused by correlated inputs and "weak" data. Both sign-only and more restricted magnitude-range constraints have been used in these models. In particular, for projects involving multiple identical (same-manufacturer) units, relatively exact gain bounds have been developed, which constitute model templates, allowing very rapid and accurate model development for new such units.

This work was restricted to the assumption that specified gain constraints are valid everywhere in the input space (global constraints). We are currently exploring elaborations of the same algorithmic concepts, such as specifying gain constraints as a function of location (local constraints). Such constraints
may be formulated as

min_kil ≤ (∂yk/∂xi)(x⃗l) ≤ max_kil,  (6.1)

where x⃗l indicates a location in input space.

Appendix: The Second Derivatives
Gradient descent (see equation 3.1) requires the second derivatives

∂/∂w (∂yk/∂xi),  (A.1)
where w denotes an arbitrary weight or bias in the network. This appendix gives the expressions for these second derivatives. (See section 3.2 for the complete training procedure.)

Direct input-output connections are often important in real applications (e.g., item 2 of section 1). Generalizing the recursive equations of Lee and Oh (1997) to include layer-skipping connections would be relatively straightforward. Instead, we provide the second derivatives in an easy-to-implement nonrecursive form for the usual three-layer network and include direct input-output connections. Full connectivity between layers is not assumed. A second derivative ∂/∂w(∂yk/∂xi) is zero if no path exists from w to yk. The equations below are valid provided that weights corresponding to nonexistent connections are fixed at zero.

Figure 7 indicates the notation for the basic network elements used in the following. In addition, let ak and aj denote the total input to output yk and to hidden unit hj, respectively:

ak = Σ_j wkj hj + Σ_i wki xi + bk  (A.2)
aj = Σ_i wji xi + bj.  (A.3)
The activation values are then

yk^s = σ(ak)    hj^s = σ(aj)  (A.4)
yk^L = ak       hj^L = aj  (A.5)

where the superscripts s and L indicate sigmoid and linear units, respectively, and σ(·) is the usual sigmoid transfer function σ(x) = (1 + e^(−x))^(−1).
Figure 7: Network notations for the second derivative expressions.
The derivatives for output and hidden units are then

yk' ≡ dyk/dak = { yk(1 − yk) for yk^s;  1 for yk^L }  (A.6)
hj' ≡ dhj/daj = { hj(1 − hj) for hj^s;  1 for hj^L }  (A.7)

For conciseness in the equations below, we define the quantities

uk = { 1 − 2yk for yk^s;  0 for yk^L }  (A.8)
uj = { 1 − 2hj for hj^s;  0 for hj^L }  (A.9)
The gains ∂yk/∂xi appearing in the equations below are computed according to section 3.2. We write the second derivative equations separately for each type of weight and bias.

Hidden-output weight:

∂/∂wkj (∂ym/∂xn) = δkm ( um hj ∂ym/∂xn + ym' hj' wjn ),  (A.10)

where δkm is a Kronecker delta.
Input-output weight:

∂/∂wki (∂ym/∂xn) = δkm ( um xi ∂ym/∂xn + ym' δin ).  (A.11)

Input-hidden weight:

∂/∂wji (∂ym/∂xn) = hj' wmj ( um xi ∂ym/∂xn + ym' [ uj xi wjn + δin ] ).  (A.12)

Output bias:

∂/∂bk (∂ym/∂xn) = δkm um ∂ym/∂xn.  (A.13)

Hidden bias:

∂/∂bj (∂ym/∂xn) = hj' wmj ( um ∂ym/∂xn + ym' uj wjn ).  (A.14)
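For reference, the first-level gains ∂yk/∂xi used throughout this appendix have a closed form for the all-sigmoid case of a three-layer network with direct input-output connections: ∂yk/∂xi = yk' ( wki + Σ_j wkj hj' wji ). The following sketch (weight names and shapes are ours) computes these gains and verifies them against finite differences:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_and_gains(x, W_ji, b_j, W_kj, W_ki, b_k):
    """Three-layer sigmoid net with direct input-output weights W_ki.
    Returns outputs y and the gain matrix G[k, i] = dy_k/dx_i."""
    h = sigmoid(W_ji @ x + b_j)
    y = sigmoid(W_kj @ h + W_ki @ x + b_k)
    hp = h * (1 - h)              # h_j' (A.7, sigmoid case)
    yp = y * (1 - y)              # y_k' (A.6, sigmoid case)
    G = yp[:, None] * (W_ki + W_kj @ (hp[:, None] * W_ji))
    return y, G

# Finite-difference check on a small random net (shapes are arbitrary).
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W_ji, b_j = rng.normal(size=(3, 4)), rng.normal(size=3)
W_kj, W_ki, b_k = rng.normal(size=(2, 3)), rng.normal(size=(2, 4)), rng.normal(size=2)
_, G = forward_and_gains(x, W_ji, b_j, W_kj, W_ki, b_k)
eps, num = 1e-6, np.empty((2, 4))
for i in range(4):
    dx = np.zeros(4); dx[i] = eps
    num[:, i] = (forward_and_gains(x + dx, W_ji, b_j, W_kj, W_ki, b_k)[0]
                 - forward_and_gains(x - dx, W_ji, b_j, W_kj, W_ki, b_k)[0]) / (2 * eps)
print(np.allclose(G, num, atol=1e-6))  # True
```

The second derivatives A.10 through A.14 are obtained by differentiating this same gain expression with respect to each weight or bias, which is where the um and uj quantities of A.8 and A.9 arise.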
Acknowledgments
I thank L. Lasdon, D. Nehme, and S. Piché for useful discussions and C. Peterson, S. Piché, and the referees for valuable comments on the manuscript.

References

Lampinen, J., & Selonen, A. (1995). Multilayer perceptron training with inaccurate derivative information. In Proceedings of the 1995 IEEE International Conference on Neural Networks (ICNN'95), Perth, WA (Vol. 5, pp. 2811–2815).

LeCun, Y. (1985). Une procédure d'apprentissage pour réseau à seuil asymétrique (A learning procedure for asymmetric threshold networks). Proc. Cognitiva, 85, 599–604.

Lee, J., & Oh, J. (1997). Hybrid learning of mapping and its Jacobian in multilayer neural networks. Neural Comp., 9, 937–958.

Nash, S. G., & Sofer, A. (1996). Linear and nonlinear programming. New York: McGraw-Hill.

Parker, D. B. (1985). Learning-logic (Tech. Rep. TR-47). Cambridge, MA: MIT, Center for Computational Research in Economics and Management Science.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & the PDP Research Group, Parallel distributed processing: Explorations in the microstructure of cognition. Vol. 1: Foundations (pp. 318–362). Cambridge, MA: MIT Press.

Sill, J., & Abu-Mostafa, Y. S. (1997). Monotonicity hints. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 634–640). Cambridge, MA: MIT Press.
Smith, S., & Lasdon, L. (1992). Solving large sparse nonlinear programs using GRG. ORSA Journal on Computing, 4, 3–15.

Thompson, M. L. (1996). Combining prior knowledge and nonparametric models of chemical processes. Unpublished doctoral dissertation, MIT.

Vogl, T. P., Mangis, J. K., Rigler, A. K., Zink, W. T., & Alkon, D. L. (1988). Accelerating the convergence of the back-propagation method. Biol. Cybern., 59, 257.

Received July 14, 1998; accepted March 4, 1999.
LETTER
Communicated by Volker Tresp
Variational Learning for Switching State-Space Models

Zoubin Ghahramani
Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

We introduce a new statistical model for time series that iteratively segments data into regimes with approximately linear dynamics and learns the parameters of each of these linear regimes. This model combines and generalizes two of the most widely used stochastic time-series models—hidden Markov models and linear dynamical systems—and is closely related to models that are widely used in the control and econometrics literatures. It can also be derived by extending the mixture of experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991) to its fully dynamical version, in which both expert and gating networks are recurrent. Inferring the posterior probabilities of the hidden states of this model is computationally intractable, and therefore the exact expectation-maximization (EM) algorithm cannot be applied. However, we present a variational approximation that maximizes a lower bound on the log-likelihood and makes use of both the forward and backward recursions for hidden Markov models and the Kalman filter recursions for linear dynamical systems. We tested the algorithm on artificial data sets and a natural data set of respiration force from a patient with sleep apnea. The results suggest that variational approximations are a viable method for inference and learning in switching state-space models.

1 Introduction
Neural Computation 12, 831–864 (2000)  © 2000 Massachusetts Institute of Technology

Most commonly used probabilistic models of time series are descendants of either hidden Markov models (HMMs) or stochastic linear dynamical systems, also known as state-space models (SSMs). HMMs represent information about the past of a sequence through a single discrete random variable—the hidden state. The prior probability distribution of this state is derived from the previous hidden state using a stochastic transition matrix. Knowing the state at any time makes the past, present, and future observations statistically independent. This is the Markov independence property that gives the model its name.

SSMs represent information about the past through a real-valued hidden state vector. Again, conditioned on this state vector, the past, present, and future observations are statistically independent. The dependency between
the present state vector and the previous state vector is specified through the dynamic equations of the system and the noise model. When these equations are linear and the noise model is gaussian, the SSM is also known as a linear dynamical system or Kalman filter model.

Unfortunately, most real-world processes cannot be characterized by either purely discrete or purely linear-gaussian dynamics. For example, an industrial plant may have multiple discrete modes of behavior, each with approximately linear dynamics. Similarly, the pixel intensities in an image of a translating object vary according to approximately linear dynamics for subpixel translations, but as the image moves over a larger range, the dynamics change significantly and nonlinearly.

This article addresses models of dynamical phenomena that are characterized by a combination of discrete and continuous dynamics. We introduce a probabilistic model called the switching SSM, inspired by the divide-and-conquer principle underlying the mixture-of-experts neural network (Jacobs, Jordan, Nowlan, & Hinton, 1991). Switching SSMs are a natural generalization of HMMs and SSMs in which the dynamics can transition in a discrete manner from one linear operating regime to another. There is a large literature on models of this kind in econometrics, signal processing, and other fields (Harrison & Stevens, 1976; Chang & Athans, 1978; Hamilton, 1989; Shumway & Stoffer, 1991; Bar-Shalom & Li, 1993). Here we extend these models to allow for multiple real-valued state vectors, draw connections between these fields and the relevant literature on neural computation and probabilistic graphical models, and derive a learning algorithm for all the parameters of the model based on a structured variational approximation that rigorously maximizes a lower bound on the log-likelihood.

In the following section we review the background material on SSMs, HMMs, and hybrids of the two. In section 3, we describe the generative model—the probability distribution defined over the observation sequences—for switching SSMs. In section 4, we describe a learning algorithm for switching state-space models that is based on a structured variational approximation to the expectation-maximization algorithm. In section 5 we present simulation results in both an artificial domain, to assess the quality of the approximate inference method, and a natural domain. We conclude with section 6.

2 Background

2.1 State-Space Models. An SSM defines a probability density over time series of real-valued observation vectors {Yt} by assuming that the observations were generated from a sequence of hidden state vectors {Xt}. (Appendix A describes the variables and notation used throughout this article.) In particular, the SSM specifies that given the hidden state vector at one time step, the observation vector at that time step is statistically independent from all other observation vectors, and that the hidden state vectors
Figure 1: A directed acyclic graph (DAG) specifying conditional independence relations for a state-space model: a chain X1 → X2 → ··· → XT of hidden state nodes, with each observation Yt attached to its state Xt. Each node is conditionally independent from its nondescendants given its parents. The output Yt is conditionally independent from all other variables given the state Xt; and Xt is conditionally independent from X1, ..., Xt−2 given Xt−1. In this and the following figures, shaded nodes represent observed variables, and unshaded nodes represent hidden variables.
obey the Markov independence property. The joint probability for the sequences of states Xt and observations Yt can therefore be factored as

P({Xt, Yt}) = P(X1) P(Y1 | X1) ∏_{t=2}^{T} P(Xt | Xt−1) P(Yt | Xt).  (2.1)
The conditional independencies specified by equation 2.1 can be expressed graphically in the form of Figure 1. The simplest and most commonly used models of this kind assume that the transition and output functions are linear and time invariant and that the distributions of the state and observation variables are multivariate gaussian. We use the term state-space model to refer to this simple form of the model. For such models, the state transition function is

Xt = A Xt−1 + wt,  (2.2)

where A is the state transition matrix and wt is zero-mean gaussian noise in the dynamics, with covariance matrix Q. P(X1) is assumed to be gaussian. Equation 2.2 ensures that if P(Xt−1) is gaussian, so is P(Xt). The output function is

Yt = C Xt + vt,  (2.3)
where C is the output matrix and vt is zero-mean gaussian output noise with covariance matrix R; P(Yt | Xt) is therefore also gaussian:

P(Yt | Xt) = (2π)^(−D/2) |R|^(−1/2) exp{ −(1/2) (Yt − C Xt)′ R^(−1) (Yt − C Xt) },  (2.4)

where D is the dimensionality of the Y vectors.
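Equations 2.2 through 2.4 define a generative process that is easy to simulate; a minimal sketch (the function name and parameter values are illustrative):

```python
import numpy as np

def sample_lgssm(A, C, Q, R, x1_mean, x1_cov, T, seed=0):
    """Sample a state/observation sequence from the linear gaussian SSM
    of equations 2.2-2.4: X_t = A X_{t-1} + w_t,  Y_t = C X_t + v_t."""
    rng = np.random.default_rng(seed)
    K, D = A.shape[0], C.shape[0]
    X, Y = np.empty((T, K)), np.empty((T, D))
    X[0] = rng.multivariate_normal(x1_mean, x1_cov)   # gaussian P(X_1)
    for t in range(1, T):
        X[t] = A @ X[t - 1] + rng.multivariate_normal(np.zeros(K), Q)
    for t in range(T):
        Y[t] = C @ X[t] + rng.multivariate_normal(np.zeros(D), R)
    return X, Y

# A 2-d decaying-rotation state observed through one noisy dimension.
A = 0.99 * np.array([[np.cos(0.1), -np.sin(0.1)],
                     [np.sin(0.1),  np.cos(0.1)]])
X, Y = sample_lgssm(A, np.array([[1.0, 0.0]]), 0.01 * np.eye(2),
                    np.array([[0.1]]), np.zeros(2), np.eye(2), T=50)
print(X.shape, Y.shape)  # (50, 2) (50, 1)
```

Inference reverses this process: given Y alone, the filtering and smoothing recursions described next recover the gaussian posterior over the hidden X sequence.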
Often the observation vector can be divided into input (or predictor) variables and output (or response) variables. To model the input-output behavior of such a system—the conditional probability of output sequences given input sequences—the linear gaussian SSM can be modied to have a state-transition function, Xt D AXt¡1 C BUt C wt ,
(2.5)
where Ut is the input observation vector and B is the (xed) input matrix.1 The problem of inference or state estimation for an SSM with known parameters consists of estimating the posterior probabilities of the hidden variables given a sequence of observed variables. Since the local likelihood functions for the observations are gaussian and the priors for the hidden states are gaussian, the resulting posterior is also gaussian. Three special cases of the inference problem are often considered: ltering, smoothing, and prediction (Anderson & Moore, 1979; Goodwin & Sin, 1984). The goal of ltering is to compute the probability of the current hidden state Xt given the sequence of inputs and outputs up to time t—P( Xt |fYgt1 , fUgt1 ).2 The recursive algorithm used to perform this computation is known as the Kalman lter (Kalman & Bucy, 1961). The goal of smoothing is to compute the probability of Xt given the sequence of inputs and outputs up to time T, where T > t. The Kalman lter is used in the forward direction to compute the probability of Xt given fYgt1 and fUgt1 . A similar set of backward recursions from T to t completes the computation by accounting for the observations after time t (Rauch, 1963). We will refer to the combined forward and backward recursions for smoothing as the Kalman smoothing recursions (also known as the RTS, or Rauch-Tung-Streibel smoother). Finally, the goal of prediction is to compute the probability of future states and observations given observations up to time t. Given P( Xt |fYgt1 , fUgt1 ) computed as before, the model is simulated in the forward direction using equations 2.2 (or 2.5 if there are inputs) and 2.3 to compute the probability density of the state or output at future time t C t . The problem of learning the parameters of an SSM is known in engineering as the system identication problem; in its most general form it assumes access only to sequences of input and output observations. 
We focus on maximum likelihood learning, in which a single (locally optimal) value of the parameters is estimated, rather than Bayesian approaches that treat the parameters as random variables and compute or approximate the posterior distribution of the parameters given the data. One can also distinguish between on-line and off-line approaches to learning. On-line recursive algorithms, favored in real-time adaptive control applications, can be obtained by computing the gradient or the second derivatives of the log-likelihood (Ljung
1 One can also define the state such that $X_{t+1} = A X_t + B U_t + w_t$.
2 The notation $\{Y\}_1^t$ is shorthand for the sequence $Y_1, \ldots, Y_t$.
Switching State-Space Models
835
& Söderström, 1983). Similar gradient-based methods can be obtained for off-line methods. An alternative method for off-line learning makes use of the expectation maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977). This procedure iterates between an E-step that fixes the current parameters and computes posterior probabilities over the hidden states given the observations, and an M-step that maximizes the expected log-likelihood of the parameters using the posterior distribution computed in the E-step. For linear gaussian state-space models, the E-step is exactly the Kalman smoothing problem as defined above, and the M-step simplifies to a linear regression problem (Shumway & Stoffer, 1982; Digalakis, Rohlicek, & Ostendorf, 1993). Details on the EM algorithm for SSMs can be found in Ghahramani and Hinton (1996a), as well as in the original Shumway and Stoffer (1982) article.

2.2 Hidden Markov Models. Hidden Markov models also define probability distributions over sequences of observations $\{Y_t\}$. The distribution over sequences is obtained by specifying a distribution over observations at each time step $t$ given a discrete hidden state $S_t$, and the probability of transitioning from one hidden state to another. Using the Markov property, the joint probability for the sequences of states $S_t$ and observations $Y_t$ can be factored in exactly the same manner as equation 2.1, with $S_t$ taking the place of $X_t$:
$$P(\{S_t, Y_t\}) = P(S_1) P(Y_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1}) P(Y_t \mid S_t). \tag{2.6}$$
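As a quick check on this factorization, the joint probability in equation 2.6 can be computed by direct multiplication for a small HMM; summing it over all state and observation sequences must give 1. A minimal sketch (the numerical values are invented for illustration):

```python
import numpy as np

# Toy HMM (all numbers illustrative): K = 2 states, L = 2 output symbols.
pi = np.array([0.6, 0.4])                   # P(S_1)
Phi = np.array([[0.7, 0.3], [0.2, 0.8]])    # P(S_t | S_{t-1}), rows sum to 1
B = np.array([[0.9, 0.1], [0.3, 0.7]])      # P(Y_t | S_t), rows sum to 1

def joint(states, ys):
    """Equation 2.6: P({S_t, Y_t}) = P(S_1)P(Y_1|S_1) prod_t P(S_t|S_{t-1})P(Y_t|S_t)."""
    p = pi[states[0]] * B[states[0], ys[0]]
    for t in range(1, len(states)):
        p *= Phi[states[t - 1], states[t]] * B[states[t], ys[t]]
    return p
```

Enumerating all $K^T L^T$ configurations for a short sequence confirms that the factored probabilities sum to one, that is, that equation 2.6 defines a proper distribution.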
Similarly, the conditional independencies in an HMM can be expressed graphically in the same form as Figure 1. The state is represented by a single multinomial variable that can take one of $K$ discrete values, $S_t \in \{1, \ldots, K\}$. The state transition probabilities, $P(S_t \mid S_{t-1})$, are specified by a $K \times K$ transition matrix. If the observables are discrete symbols taking on one of $L$ values, the observation probabilities $P(Y_t \mid S_t)$ can be fully specified as a $K \times L$ observation matrix. For a continuous observation vector, $P(Y_t \mid S_t)$ can be modeled in many different forms, such as a gaussian, mixture of gaussians, or neural network. HMMs have been applied extensively to problems in speech recognition (Juang & Rabiner, 1991), computational biology (Baldi, Chauvin, Hunkapiller, & McClure, 1994), and fault detection (Smyth, 1994). Given an HMM with known parameters and a sequence of observations, two algorithms are commonly used to solve two different forms of the inference problem (Rabiner & Juang, 1986). The first computes the posterior probabilities of the hidden states using a recursive algorithm known as the forward-backward algorithm. The computations in the forward pass are exactly analogous to the Kalman filter for SSMs, and the computations in the backward pass are analogous to the backward pass of the Kalman smoothing
836
Zoubin Ghahramani and Geoffrey E. Hinton
equations. As noted by Bridle (pers. comm., 1985) and Smyth, Heckerman, and Jordan (1997), the forward-backward algorithm is a special case of exact inference algorithms for more general graphical probabilistic models (Lauritzen & Spiegelhalter, 1988; Pearl, 1988). The same observation holds true for the Kalman smoothing recursions. The other inference problem commonly posed for HMMs is to compute the single most likely sequence of hidden states. The solution to this problem is given by the Viterbi algorithm, which also consists of a forward and backward pass through the model. To learn maximum likelihood parameters for an HMM given sequences of observations, one can use the well-known Baum-Welch algorithm (Baum, Petrie, Soules, & Weiss, 1970). This algorithm is a special case of EM that uses the forward-backward algorithm to infer the posterior probabilities of the hidden states in the E-step. The M-step uses expected counts of transitions and observations to reestimate the transition and output matrices (or linear regression equations in the case where the observations are gaussian distributed). Like SSMs, HMMs can be augmented to allow for input variables, such that they model the conditional distribution of sequences of output observations given sequences of inputs (Cacciatore & Nowlan, 1994; Bengio & Frasconi, 1995; Meila & Jordan, 1996).

2.3 Hybrids. A burgeoning literature on models that combine the discrete transition structure of HMMs with the linear dynamics of SSMs has developed in fields ranging from econometrics to control engineering (Harrison & Stevens, 1976; Chang & Athans, 1978; Hamilton, 1989; Shumway & Stoffer, 1991; Bar-Shalom & Li, 1993; Deng, 1993; Kadirkamanathan & Kadirkamanathan, 1996; Chaer, Bishop, & Ghosh, 1997). These models are known alternately as hybrid models, SSMs with switching, and jump-linear systems.
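The forward-backward recursions of section 2.2 can be sketched in a few lines. This is an illustrative implementation for discrete observations, with our own naming conventions:

```python
import numpy as np

def forward_backward(pi, Phi, B, obs):
    """Posterior state marginals P(S_t = k | Y_1..Y_T) for a discrete HMM.
    pi: initial distribution (K,); Phi: transition matrix (K, K);
    B: observation matrix (K, L); obs: list of observed symbol indices."""
    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                       # forward pass
        alpha[t] = (alpha[t - 1] @ Phi) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):              # backward pass
        beta[t] = Phi @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta                        # unnormalized posteriors
    return gamma / gamma.sum(axis=1, keepdims=True)
```

For short sequences the result can be verified against brute-force enumeration of all $K^T$ state sequences, which is exactly the exponential computation the recursions avoid.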
We briefly review some of this literature, including some related neural network models.3 Shortly after Kalman and Bucy solved the problem of state estimation for linear gaussian SSMs, attention turned to the analogous problem for switching models (Ackerson & Fu, 1970). Chang and Athans (1978) derive the equations for computing the conditional mean and variance of the state when the parameters of a linear SSM switch according to arbitrary and Markovian dynamics. The prior and transition probabilities of the switching process are assumed to be known. They note that for $M$ models (sets of parameters) and an observation length $T$, the exact conditional distribution of the state is a gaussian mixture with $M^T$ components. The conditional mean and variance, which require far less computation, are therefore only summary statistics.

3 A review of how SSMs and HMMs are related to simpler statistical models such as principal components analysis, factor analysis, mixture of gaussians, vector quantization, and independent components analysis (ICA) can be found in Roweis and Ghahramani (1999).
Figure 2: Directed acyclic graphs specifying conditional independence relations for various switching state-space models. (a) Shumway and Stoffer (1991): the output matrix ($C$ in equation 2.3) switches independently between a fixed number of choices at each time step. Its setting is represented by the discrete hidden variable $S_t$. (b) Bar-Shalom and Li (1993): both the output equation and the dynamic equation can switch, and the switches are Markov. (c) Kim (1994). (d) Fraser and Dimitriadis (1993): outputs and states are observed. Here we have shown a simple case where the output depends directly on the current state, previous state, and previous output.
Shumway and Stoffer (1991) consider the problem of learning the parameters of SSMs with a single real-valued hidden state vector and switching output matrices. The probability of choosing a particular output matrix is a prespecified time-varying function, independent of previous choices (see Figure 2a). A pseudo-EM algorithm is derived in which the E-step, which
in its exact form would require computing a gaussian mixture with $M^T$ components, is approximated by a single gaussian at each time step. Bar-Shalom and Li (1993, sec. 11.6) review models in which both the state dynamics and the output matrices switch, and where the switching follows Markovian dynamics (see Figure 2b). They present several methods for approximately solving the state-estimation problem in switching models (they do not discuss parameter estimation for such models). These methods, referred to as generalized pseudo-Bayesian (GPB) and interacting multiple models (IMM), are all based on the idea of collapsing into one gaussian the mixture of $M$ gaussians that results from considering all the settings of the switch state at a given time step. This avoids the exponential growth of mixture components at the cost of providing an approximate solution. More sophisticated but computationally expensive methods that collapse $M^2$ gaussians into $M$ gaussians are also derived. Kim (1994) derives a similar approximation for a closely related model, which also includes observed input variables (see Figure 2c). Furthermore, Kim discusses parameter estimation for this model, although without making reference to the EM algorithm. Other authors have used Markov chain Monte Carlo methods for state and parameter estimation in switching models (Carter & Kohn, 1994; Athaide, 1995) and in other related dynamic probabilistic networks (Dean & Kanazawa, 1989; Kanazawa, Koller, & Russell, 1995). Hamilton (1989, 1994, sec. 22.4) describes a class of switching models in which the real-valued observation at time $t$, $Y_t$, depends on both the observations at times $t-1$ to $t-r$ and the discrete states at times $t$ to $t-r$. More precisely, $Y_t$ is gaussian with a mean that is a linear function of $Y_{t-1}, \ldots, Y_{t-r}$ and of binary indicator variables for the discrete states, $S_t, \ldots, S_{t-r}$.
The system can therefore be seen as an $(r+1)$th-order HMM driving an $r$th-order autoregressive process, and is tractable for small $r$ and small numbers of discrete states in $S$. Hamilton's models are closely related to the hidden filter HMM (HFHMM; Fraser & Dimitriadis, 1993). HFHMMs have both discrete and real-valued states. However, the real-valued states are assumed to be either observed or a known, deterministic function of the past observations (i.e., an embedding). The outputs depend on the states and previous outputs, and the form of this dependence can switch randomly (see Figure 2d). Because at any time step the only hidden variable is the switch state, $S_t$, exact inference in this model can be carried out tractably. The resulting algorithm is a variant of the forward-backward procedure for HMMs. Kehagias and Petridis (1997) and Pawelzik, Kohlmorgen, and Müller (1996) present other variants of this model. Elliott, Aggoun, and Moore (1995, sec. 12.5) present an inference algorithm for hybrid (Markov switching) systems for which there is a separate observable from which the switch state can be estimated. The true switch states, $S_t$, are represented as unit vectors in $\mathbb{R}^M$, and the estimated switch state is a vector in the unit square with elements corresponding to the estimated probability of being in each switch state. The real-valued state, $X_t$, is approximated as a gaussian given the estimated switch state by forming a linear combination of the transition and observation matrices for the different SSMs weighted by the estimated switch state. Elliott et al. also derive control equations for such hybrid systems and discuss applications of the change-of-measures whitening procedure to a large family of models. With regard to the literature on neural computation, the model presented in this article is a generalization of both the mixture-of-experts neural network (Jacobs et al., 1991; Jordan & Jacobs, 1994) and the related mixture of factor analyzers (Hinton, Dayan, & Revow, 1997; Ghahramani & Hinton, 1996b). Previous dynamical generalizations of the mixture-of-experts architecture consider the case in which the gating network has Markovian dynamics (Cacciatore & Nowlan, 1994; Kadirkamanathan & Kadirkamanathan, 1996; Meila & Jordan, 1996). One limitation of this generalization is that the entire past sequence is summarized in the value of a single discrete variable (the gating activation), which for a system with $M$ experts can convey on average at most $\log M$ bits of information about the past. In the models we consider here, both the experts and the gating network have Markovian dynamics. The past is therefore summarized by a state composed of the cross-product of the discrete variable and the combined real-valued state-space of all the experts. This provides a much wider information channel from the past. One advantage of this representation is that the real-valued state can contain componential structure. Thus, attributes such as the position, orientation, and scale of an object in an image, which are most naturally encoded as independent real-valued variables, can be accommodated in the state without the exponential growth required of a discretized HMM-like representation.
It is important to place the work in this article in the context of the literature we have just reviewed. The hybrid models, state-space models with switching, and jump-linear systems we have described all assume a single real-valued state vector. The model considered in this article generalizes this to multiple real-valued state vectors.4 Unlike the models described in Hamilton (1994), Fraser and Dimitriadis (1993), and the current dynamical extensions of mixtures of experts, in the model we present, the real-valued state vectors are hidden. The inference algorithm we derive, which is based on making a structured variational approximation, is entirely novel in the context of switching SSMs. Specifically, our method is unlike all the approximate methods we have reviewed in that it is not based on fitting a single gaussian to a mixture of gaussians by computing the mean and covariance of the mixture.5 We derive a learning algorithm for all of the parameters of the model, including the Markov switching parameters. This algorithm maximizes a lower bound on the log-likelihood of the data rather than a heuristically motivated approximation to the likelihood. The algorithm has a simple and intuitive flavor: it decouples into forward-backward recursions on an HMM and Kalman smoothing recursions on each SSM. The states of the HMM determine the soft assignment of each observation to an SSM; the prediction errors of the SSMs determine the observation probabilities for the HMM.

4 Note that the state vectors could be concatenated into one large state vector with factorized (block-diagonal) transition matrices (cf. the factorial hidden Markov model; Ghahramani & Jordan, 1997). However, this obscures the decoupled structure of the model.
5 Both classes of methods can be seen as minimizing Kullback-Leibler (KL) divergences. However, the KL divergence is asymmetrical, and whereas the variational methods minimize it in one direction, the methods that merge gaussians minimize it in the other direction. We return to this point in section 4.2.

3 The Generative Model

In switching SSMs, the sequence of observations $\{Y_t\}$ is modeled by specifying a probabilistic relation between the observations and a hidden state-space comprising $M$ real-valued state vectors, $X_t^{(m)}$, and one discrete state vector $S_t$. The discrete state, $S_t$, is modeled as a multinomial variable that can take on $M$ values, $S_t \in \{1, \ldots, M\}$; for reasons that will become obvious, we refer to it as the switch variable. The joint probability of observations and hidden states can be factored as

$$P\left(\{S_t, X_t^{(1)}, \ldots, X_t^{(M)}, Y_t\}\right) = P(S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1}) \cdot \prod_{m=1}^{M} P\left(X_1^{(m)}\right) \prod_{t=2}^{T} P\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) \cdot \prod_{t=1}^{T} P\left(Y_t \mid X_t^{(1)}, \ldots, X_t^{(M)}, S_t\right), \tag{3.1}$$

which corresponds graphically to the conditional independencies represented by Figure 3. Conditioned on a setting of the switch state, $S_t = m$, the observable is multivariate gaussian with output equation given by state-space model $m$. Notice that $m$ is used both as an index for the real-valued state variables and as a value for the switch state. The probability of the observation vector $Y_t$ is therefore

$$P\left(Y_t \mid X_t^{(1)}, \ldots, X_t^{(M)}, S_t = m\right) = |2\pi R|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left(Y_t - C^{(m)} X_t^{(m)}\right)' R^{-1} \left(Y_t - C^{(m)} X_t^{(m)}\right) \right\}, \tag{3.2}$$

where $R$ is the observation noise covariance matrix and $C^{(m)}$ is the output matrix for SSM $m$ (cf. equation 2.4 for a single linear-gaussian SSM). Each
Switching State-Space Models
841
Figure 3: (a) Graphical model representation for switching state-space models. $S_t$ is the discrete switch variable, and the $X_t^{(m)}$ are the real-valued state vectors. (b) Switching state-space model depicted as a generalization of the mixture of experts. The dashed arrows correspond to the connections in a mixture of experts. In a switching state-space model, the states of the experts and the gating network also depend on their previous states (solid arrows).
real-valued state vector evolves according to the linear gaussian dynamics of an SSM with differing initial state, transition matrix, and state noise (see equation 2.2). For simplicity we assume that all state vectors have identical dimensionality; the generalization of the algorithms we present to models with different-sized state-spaces is immediate. The switch state itself evolves according to the discrete Markov transition structure specified by the initial state probabilities $P(S_1)$ and the $M \times M$ state transition matrix $P(S_t \mid S_{t-1})$. An exact analogy can be made to the mixture-of-experts architecture for modular learning in neural networks (see Figure 3b; Jacobs et al., 1991). Each SSM is a linear expert with a gaussian output noise model and linear-gaussian dynamics. The switch state “gates” the outputs of the $M$ SSMs, and therefore plays the role of a gating network with Markovian dynamics. There are many possible extensions of the model; we shall consider three obvious and straightforward ones:

(Ex1) Differing output covariances, $R^{(m)}$, for each SSM

(Ex2) Differing output means, $\mu_Y^{(m)}$, for each SSM, such that each model is allowed to capture observations in a different operating range

(Ex3) Conditioning on a sequence of observed input vectors, $\{U_t\}$
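To make the generative process concrete, here is a minimal sampling sketch of the model defined by equations 3.1 and 3.2: all $M$ state chains evolve in parallel under their own linear gaussian dynamics, while the Markov switch selects which chain generates each observation. (The function name and the zero-mean initial state are our simplifying assumptions, not the article's notation.)

```python
import numpy as np

def sample_switching_ssm(T, A, C, Q, R, pi, Phi, rng):
    """Sample {S_t, X_t^(1..M), Y_t} from the switching SSM generative model.
    A, C, Q are length-M lists of per-SSM parameter matrices; R is the tied
    output noise covariance; pi and Phi parameterize the Markov switch.
    Simplifying assumption: X_1^(m) ~ N(0, Q[m])."""
    M = len(A)
    n, p = A[0].shape[0], C[0].shape[0]
    X = np.zeros((T, M, n))
    Y = np.zeros((T, p))
    S = np.zeros(T, dtype=int)
    S[0] = rng.choice(M, p=pi)
    for m in range(M):
        X[0, m] = rng.multivariate_normal(np.zeros(n), Q[m])
    Y[0] = C[S[0]] @ X[0, S[0]] + rng.multivariate_normal(np.zeros(p), R)
    for t in range(1, T):
        S[t] = rng.choice(M, p=Phi[S[t - 1]])      # switch: Markov dynamics
        for m in range(M):                          # every chain keeps evolving
            X[t, m] = A[m] @ X[t - 1, m] + rng.multivariate_normal(np.zeros(n), Q[m])
        # Only the chain selected by the switch produces the observation.
        Y[t] = C[S[t]] @ X[t, S[t]] + rng.multivariate_normal(np.zeros(p), R)
    return S, X, Y
```

Note that, as in the model, chains that are not selected by the switch still evolve; only the output equation is gated.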
4 Learning
An efficient learning algorithm for the parameters of a switching SSM can be derived by generalizing the EM algorithm (Baum et al., 1970; Dempster et al., 1977). EM alternates between optimizing a distribution over the hidden states (the E-step) and optimizing the parameters given the distribution over hidden states (the M-step). Any distribution over the hidden states, $Q(\{S_t, X_t\})$, where $X_t = [X_t^{(1)}, \ldots, X_t^{(M)}]$ is the combined state of the SSMs, can be used to define a lower bound, $B$, on the log probability of the observed data:

$$\log P(\{Y_t\} \mid \theta) = \log \sum_{\{S_t\}} \int P(\{S_t, X_t, Y_t\} \mid \theta) \, d\{X_t\} \tag{4.1}$$

$$= \log \sum_{\{S_t\}} \int Q(\{S_t, X_t\}) \left[ \frac{P(\{S_t, X_t, Y_t\} \mid \theta)}{Q(\{S_t, X_t\})} \right] d\{X_t\} \tag{4.2}$$

$$\geq \sum_{\{S_t\}} \int Q(\{S_t, X_t\}) \log \left[ \frac{P(\{S_t, X_t, Y_t\} \mid \theta)}{Q(\{S_t, X_t\})} \right] d\{X_t\} = B(Q, \theta), \tag{4.3}$$
where $\theta$ denotes the parameters of the model and we have made use of Jensen's inequality (Cover & Thomas, 1991) to establish equation 4.3. Both steps of EM increase the lower bound on the log probability of the observed data. The E-step holds the parameters fixed and sets $Q$ to be the posterior distribution over the hidden states given the parameters,

$$Q(\{S_t, X_t\}) = P(\{S_t, X_t\} \mid \{Y_t\}, \theta). \tag{4.4}$$
This maximizes $B$ with respect to the distribution, turning the lower bound into an equality, which can be easily seen by substitution. The M-step holds the distribution fixed and computes the parameters that maximize $B$ for that distribution. Since $B = \log P(\{Y_t\} \mid \theta)$ at the start of the M-step and since the E-step does not affect $\log P$, the two steps combined can never decrease $\log P$. Given the change in the parameters produced by the M-step, the distribution produced by the previous E-step is typically no longer optimal, so the whole procedure must be iterated. Unfortunately, the exact E-step for switching SSMs is intractable. Like the related hybrid models described in section 2.3, the posterior probability of the real-valued states is a gaussian mixture with $M^T$ terms. This can be seen by using the semantics of directed graphs, in particular the d-separation criterion (Pearl, 1988), which implies that the hidden state variables in Figure 3, while marginally independent, become conditionally dependent given the observation sequence. This induced dependency effectively couples all of the real-valued hidden state variables to the discrete switch variable, as a
consequence of which the exact posteriors become gaussian mixtures with an exponential number of terms.6 In order to derive an efficient learning algorithm for this system, we relax the EM algorithm by approximating the posterior probability of the hidden states. The basic idea is that since expectations with respect to $P$ are intractable, rather than setting $Q(\{S_t, X_t\}) = P(\{S_t, X_t\} \mid \{Y_t\})$ in the E-step, a tractable distribution $Q$ is used to approximate $P$. This results in an EM learning algorithm that maximizes a lower bound on the log-likelihood. The difference between the bound $B$ and the log-likelihood is given by the Kullback-Leibler (KL) divergence between $Q$ and $P$:

$$\mathrm{KL}(Q \| P) = \sum_{\{S_t\}} \int Q(\{S_t, X_t\}) \log \left[ \frac{Q(\{S_t, X_t\})}{P(\{S_t, X_t\} \mid \{Y_t\})} \right] d\{X_t\}. \tag{4.5}$$
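The relationship between the bound and the log-likelihood can be verified numerically on a toy model with a single binary hidden variable, where the sum over hidden configurations is trivial. In this sketch (all numbers invented), $B(Q) = \log P(Y) - \mathrm{KL}(Q \| P(S \mid Y))$ holds for any $Q$, so the bound is tight exactly when $Q$ is the posterior, which is the exact E-step:

```python
import numpy as np

# Toy model with one binary hidden variable S (all numbers invented).
prior = np.array([0.3, 0.7])        # P(S)
lik = np.array([0.9, 0.2])          # P(Y = y_obs | S)
joint = prior * lik                 # P(S, y_obs)
log_evidence = np.log(joint.sum())  # log P(y_obs)
posterior = joint / joint.sum()     # P(S | y_obs)

def bound(q):
    """B(Q) = E_Q[log P(S, Y)] - E_Q[log Q(S)], the analogue of equation 4.3."""
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    """KL(Q || P) for discrete distributions."""
    return np.sum(q * (np.log(q) - np.log(p)))
```

For any distribution `q` over the hidden variable, `bound(q)` equals `log_evidence - kl(q, posterior)`, so the gap between the bound and the log-likelihood closes only at the exact posterior.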
Since the complexity of exact inference in the approximation given by $Q$ is determined by its conditional independence relations, not by its parameters, we can choose $Q$ to have a tractable structure: a graphical representation that eliminates some of the dependencies in $P$. Given this structure, the parameters of $Q$ are varied to obtain the tightest possible bound by minimizing equation 4.5. Therefore, the algorithm alternates between optimizing the parameters of the distribution $Q$ to minimize equation 4.5 (the E-step) and optimizing the parameters of $P$ given the distribution over the hidden states (the M-step). As in exact EM, both steps increase the lower bound $B$ on the log-likelihood; however, equality is not reached in the E-step. We will refer to the general strategy of using a parameterized approximating distribution as a variational approximation and refer to the free parameters of the distribution as variational parameters. A completely factorized approximation is often used in statistical physics, where it provides the basis for simple yet powerful mean-field approximations to statistical mechanical systems (Parisi, 1988). Theoretical arguments motivating approximate E-steps are presented in Neal and Hinton (1998; originally in a technical report in 1993). Saul and Jordan (1996) showed that approximate E-steps could be used to maximize a lower bound on the log-likelihood, and proposed the powerful technique of structured variational approximations to intractable probabilistic networks. The key insight of their work, which this article makes use of, is that by judicious use of an approximation $Q$, exact inference algorithms can be used on the tractable substructures in an intractable network. A general tutorial on variational approximations can be found in Jordan, Ghahramani, Jaakkola, and Saul (1998).
6 The intractability of the E-step or smoothing problem in the simpler single-state switching model has been noted by Ackerson and Fu (1970), Chang and Athans (1978), Bar-Shalom and Li (1993), and others.
Figure 4: Graphical model representation for the structured variational approximation to the posterior distribution of the hidden states of a switching state-space model.
The parameters of the switching SSM are $\theta = \{A^{(m)}, C^{(m)}, Q^{(m)}, \mu_{X_1}^{(m)}, Q_1^{(m)}, R, \pi, \Phi\}$, where $A^{(m)}$ is the state dynamics matrix for model $m$, $C^{(m)}$ is its output matrix, $Q^{(m)}$ is its state noise covariance, $\mu_{X_1}^{(m)}$ is the mean of the initial state, $Q_1^{(m)}$ is the covariance of the initial state, $R$ is the (tied) output noise covariance, $\pi = P(S_1)$ is the prior for the discrete Markov process, and $\Phi = P(S_t \mid S_{t-1})$ is the discrete transition matrix. Extensions (Ex1) through (Ex3) can be readily implemented by substituting $R^{(m)}$ for $R$ and adding means $\mu_Y^{(m)}$ and input matrices $B^{(m)}$. Although there are many possible approximations to the posterior distribution of the hidden variables that one could use for learning and inference in switching SSMs, we focus on the following:

$$Q(\{S_t, X_t\}) = \frac{1}{Z_Q} \left[ \psi(S_1) \prod_{t=2}^{T} \psi(S_{t-1}, S_t) \right] \prod_{m=1}^{M} \psi\left(X_1^{(m)}\right) \prod_{t=2}^{T} \psi\left(X_{t-1}^{(m)}, X_t^{(m)}\right), \tag{4.6}$$
where the $\psi$ are unnormalized probabilities, which we will call potential functions and define soon, and $Z_Q$ is a normalization constant ensuring that $Q$ integrates to one. Although $Q$ has been written in terms of potential functions rather than conditional probabilities, it corresponds to the simple graphical model shown in Figure 4. The terms involving the switch variables $S_t$ define a discrete Markov chain, and the terms involving the state vectors $X_t^{(m)}$ define $M$ uncoupled SSMs. As in mean-field approximations, we have approximated the stochastically coupled system by removing some of the
couplings of the original system. Specifically, we have removed the stochastic coupling between the chains that results from the fact that the observation at time $t$ depends on all the hidden variables at time $t$. However, we retain the coupling between the hidden variables at successive time steps, since these couplings can be handled exactly using the forward-backward and Kalman smoothing recursions. This approximation is therefore structured, in the sense that not all variables are uncoupled. The discrete switching process is defined by

$$\psi(S_1 = m) = P(S_1 = m) \, q_1^{(m)} \tag{4.7}$$

$$\psi(S_{t-1}, S_t = m) = P(S_t = m \mid S_{t-1}) \, q_t^{(m)}, \tag{4.8}$$
where the $q_t^{(m)}$ are variational parameters of the $Q$ distribution. These parameters scale the probabilities of each of the states of the switch variable at each time step, so that $q_t^{(m)}$ plays exactly the same role that the observation probability $P(Y_t \mid S_t = m)$ would play in a regular HMM. We will soon see that minimizing $\mathrm{KL}(Q \| P)$ results in an equation for $q_t^{(m)}$ that supports this intuition. The uncoupled SSMs in the approximation $Q$ are also defined by potential functions that are related to probabilities in the original system. These potentials are the prior and transition probabilities for $X^{(m)}$ multiplied by a factor that changes these potentials to try to account for the data:

$$\psi\left(X_1^{(m)}\right) = P\left(X_1^{(m)}\right) \left[ P\left(Y_1 \mid X_1^{(m)}, S_1 = m\right) \right]^{h_1^{(m)}} \tag{4.9}$$

$$\psi\left(X_{t-1}^{(m)}, X_t^{(m)}\right) = P\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) \left[ P\left(Y_t \mid X_t^{(m)}, S_t = m\right) \right]^{h_t^{(m)}}, \tag{4.10}$$
where the $h_t^{(m)}$ are variational parameters of $Q$. The vector $h_t$ plays a role very similar to that of the switch variable $S_t$. Each component $h_t^{(m)}$ can range between 0 and 1. When $h_t^{(m)} = 0$, the posterior probability of $X_t^{(m)}$ under $Q$ does not depend on the observation $Y_t$. When $h_t^{(m)} = 1$, the posterior probability of $X_t^{(m)}$ under $Q$ includes a term that assumes that SSM $m$ generated $Y_t$. We call $h_t^{(m)}$ the responsibility assigned to SSM $m$ for the observation vector $Y_t$. The difference between $h_t^{(m)}$ and $S_t^{(m)}$ is that $h_t^{(m)}$ is a deterministic parameter, while $S_t^{(m)}$ is a stochastic random variable. To maximize the lower bound on the log-likelihood, $\mathrm{KL}(Q \| P)$ is minimized with respect to the variational parameters $h_t^{(m)}$ and $q_t^{(m)}$ separately for each sequence of observations. Using the definition of $P$ for the switching state-space model (equations 3.1 and 3.2) and the approximating distribution $Q$, the minimum of KL satisfies the following fixed-point equations for the variational parameters (see appendix B):

$$h_t^{(m)} = Q(S_t = m) \tag{4.11}$$
$$q_t^{(m)} = \exp\left\{ -\frac{1}{2} \left\langle \left(Y_t - C^{(m)} X_t^{(m)}\right)' R^{-1} \left(Y_t - C^{(m)} X_t^{(m)}\right) \right\rangle \right\}, \tag{4.12}$$
where $\langle \cdot \rangle$ denotes expectation over the $Q$ distribution. Intuitively, the responsibility $h_t^{(m)}$ is equal to the probability under $Q$ that SSM $m$ generated observation vector $Y_t$, and $q_t^{(m)}$ is an unnormalized gaussian function of the expected squared error if SSM $m$ generated $Y_t$. To compute $h_t^{(m)}$, it is necessary to sum $Q$ over all the switch variables other than $S_t$. This can be done efficiently using the forward-backward algorithm on the switch state variables, with $q_t^{(m)}$ playing exactly the same role as an observation probability associated with each setting of the switch variable. Since $q_t^{(m)}$ is related to the prediction error of model $m$ on data $Y_t$, this has the intuitive interpretation that the switch state associated with models with smaller expected prediction error on a particular observation will be favored at that time step. However, the forward-backward algorithm ensures that the final responsibilities for the models are obtained after considering the entire sequence of observations. To compute $q_t^{(m)}$, it is necessary to calculate the expectations of $X_t^{(m)}$ and $X_t^{(m)} X_t^{(m)\prime}$ under $Q$. We see this by expanding equation 4.12:

$$q_t^{(m)} = \exp\left\{ -\frac{1}{2} Y_t' R^{-1} Y_t + Y_t' R^{-1} C^{(m)} \left\langle X_t^{(m)} \right\rangle - \frac{1}{2} \operatorname{tr}\left[ C^{(m)\prime} R^{-1} C^{(m)} \left\langle X_t^{(m)} X_t^{(m)\prime} \right\rangle \right] \right\}, \tag{4.13}$$
where tr is the matrix trace operator, and we have used $\operatorname{tr}(AB) = \operatorname{tr}(BA)$. The expectations of $X_t^{(m)}$ and $X_t^{(m)} X_t^{(m)\prime}$ can be computed efficiently using the Kalman smoothing algorithm on each SSM, where for model $m$ at time $t$ the data are weighted by the responsibilities $h_t^{(m)}$.7 Since the $h$ parameters depend on the $q$ parameters, and vice versa, the whole process has to be iterated, where each iteration involves calls to the forward-backward and Kalman smoothing algorithms. Once the iterations have converged, the E-step outputs the expected values of the hidden variables under the final $Q$. The M-step computes the model parameters that optimize the expectation of the log-likelihood (see equation B.7), which is a function of the expectations of the hidden variables. For switching SSMs, all the parameter reestimates can be computed analytically. For example, taking derivatives of the expectation of equation B.7 with respect to $C^{(m)}$ and setting to zero,
7 Weighting the data by $h_t^{(m)}$ is equivalent to running the Kalman smoother on the unweighted data using a time-varying observation noise covariance matrix $R_t^{(m)} = R / h_t^{(m)}$.
Figure 5: Learning algorithm for switching state-space models.
we get

$$\hat{C}^{(m)} = \left( \sum_{t=1}^{T} \left\langle S_t^{(m)} \right\rangle Y_t \left\langle X_t^{(m)} \right\rangle' \right) \left( \sum_{t=1}^{T} \left\langle S_t^{(m)} \right\rangle \left\langle X_t^{(m)} X_t^{(m)\prime} \right\rangle \right)^{-1}, \tag{4.14}$$
which is a weighted version of the reestimation equations for SSMs. Similarly, the reestimation equations for the switch process are analogous to the Baum-Welch update rules for HMMs. The learning algorithm for switching state-space models using the above structured variational approximation is summarized in Figure 5.

4.1 Deterministic Annealing. The KL divergence minimized in the E-step of the variational EM algorithm can have multiple minima in general. One way to visualize these minima is to consider the space of all possible segmentations of an observation sequence of length $T$, where by segmentation we mean a discrete partition of the sequence between the SSMs. If there are $M$ SSMs, then there are $M^T$ possible segmentations of the sequence. Given one such segmentation, inferring the optimal distribution for the real-valued states of the SSMs is a convex optimization problem, since these real-valued states are conditionally gaussian. So the difficulty in the KL minimization lies in trying to find the best (soft) partition of the data.
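The effect of the temperature parameter used in the deterministic annealing of section 4.1 can be illustrated in isolation on the soft assignment of a single observation to $M$ models. In this simplified sketch (not the article's full fixed-point equations; all numbers are invented), tempering the unnormalized gaussian factors flattens the responsibilities at high temperature and sharpens them as the temperature approaches 1:

```python
import numpy as np

def annealed_responsibilities(sq_err, temp):
    """Soft assignment of one observation to M models from expected squared
    prediction errors, with the gaussian factors tempered by 1/temp.
    High temp -> near-uniform assignment; temp = 1 -> standard posteriors."""
    logq = -0.5 * np.asarray(sq_err) / temp     # tempered log q_t^(m)
    q = np.exp(logq - logq.max())               # subtract max for stability
    return q / q.sum()
```

Initializing at a high temperature and cooling toward 1 keeps the early assignments soft, which is what reduces the risk of committing prematurely to a poor segmentation.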
As in other combinatorial optimization problems, the possibility of getting trapped in local minima can be reduced by gradually annealing the cost function. We can employ a deterministic variant of the annealing idea by making the following simple modifications to the variational fixed-point equations 4.11 and 4.12:

$$h_t^{(m)} = Q(S_t = m)^{1/T} \tag{4.15}$$

$$q_t^{(m)} = \exp\left\{ -\frac{1}{2T} \left\langle \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \right\rangle \right\}. \tag{4.16}$$
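As a rough sketch of the annealed q-update (equation 4.16) and of a decaying temperature schedule of the kind used in the experiments, the following uses dummy error values in place of the Kalman smoother's output:

```python
import numpy as np

# err[m, t] stands in for the expected squared error
# <(Y_t - C X_t)' R^{-1} (Y_t - C X_t)> that the smoother would produce.
rng = np.random.default_rng(0)
err = rng.uniform(0.5, 5.0, size=(2, 6))

def annealed_q(err, temp):
    """Annealed bias terms q_t^(m) = exp(-err / (2*temp))."""
    return np.exp(-err / (2.0 * temp))

# Temperature schedule T_{i+1} = T_i/2 + 1/2, decaying from 100 toward 1.
temps = [100.0]
for _ in range(12):
    temps.append(temps[-1] / 2.0 + 0.5)
```

At high temperature the q's are squashed toward 1, so no SSM is strongly favored; as the temperature approaches 1, the update reduces to equation 4.12.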
Here T is a temperature parameter, which is initialized to a large value and gradually reduced to 1. The above equations maximize a modified form of the bound B in equation 4.3, where the entropy of Q has been multiplied by T (Ueda & Nakano, 1995).

4.2 Merging Gaussians. Almost all the approximate inference methods described in the literature for switching SSMs are based on the idea of merging, at each time step, a mixture of M gaussians into one gaussian. The merged gaussian is obtained by setting its mean and covariance equal to the mean and covariance of the mixture. Here we briefly describe, as an alternative to the variational approximation methods we have derived, how this more traditional gaussian merging procedure can be applied to the model we have defined. In the switching state-space models described in section 3, there are M different SSMs, with possibly different state-space dimensionalities, so it would be inappropriate to merge their states into one gaussian. However, it is still possible to apply a gaussian merging technique by considering each SSM separately. In each SSM m, the hidden state density produces at each time step a mixture of two gaussians: one for the case S_t = m and one for S_t ≠ m. We merge these two gaussians, weighted by the current estimates of P(S_t = m | Y_1, ..., Y_t) and 1 − P(S_t = m | Y_1, ..., Y_t), respectively. This merged gaussian is used to obtain the gaussian prior for X_{t+1}^{(m)} for the next time step. We implemented a forward-pass version of this approximate inference scheme, which is analogous to the IMM procedure described in Bar-Shalom and Li (1993). This procedure finds at each time step the "best" gaussian fit to the current mixture of gaussians for each SSM. If we denote the approximating gaussian by Q and the mixture being approximated by P, "best" is defined here as minimizing KL(P‖Q).
For a gaussian Q, KL(P‖Q) has no local minima, and it is very easy to find the optimal Q by computing the first two moments of P. Furthermore, gaussian merging techniques are greedy in that the "best" gaussian is computed at every time step and used immediately for the next time step. Inaccuracies in this greedy procedure arise because the estimates of P(S_t | Y_1, ..., Y_t) are based on this single merged gaussian, not on the real mixture.
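The moment-matching step can be sketched as follows; this is a generic implementation of the merge, not code from the paper. The single gaussian Q minimizing KL(P‖Q) for a two-component mixture P simply matches the mixture's mean and covariance:

```python
import numpy as np

def merge_gaussians(p, m1, C1, m2, C2):
    """Moment-match one gaussian to the mixture p*N(m1, C1) + (1-p)*N(m2, C2)."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    m = p * m1 + (1 - p) * m2
    # mixture second moment minus the outer product of the merged mean
    C = (p * (C1 + np.outer(m1, m1))
         + (1 - p) * (C2 + np.outer(m2, m2))
         - np.outer(m, m))
    return m, C
```

Merging N(0, 1) and N(2, 1) with equal weights gives mean 1 and variance 2: the between-component spread inflates the merged covariance.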
In contrast, variational methods seek to minimize KL(Q‖P), which can have many local minima. Moreover, these methods are not greedy in the same sense: they iterate forward and backward in time until obtaining a locally optimal Q.

5 Simulations

5.1 Experiment 1: Variational Segmentation and Deterministic Annealing. The goal of this experiment was to assess the quality of solutions
found by the variational inference algorithm and the effect of using deterministic annealing on these solutions. We generated 200 sequences of length 200 from a simple model that switched between two SSMs. These SSMs and the switching process were defined by

$$X_t^{(1)} = 0.99\, X_{t-1}^{(1)} + w_t^{(1)}, \qquad w_t^{(1)} \sim N(0, 1) \tag{5.1}$$

$$X_t^{(2)} = 0.9\, X_{t-1}^{(2)} + w_t^{(2)}, \qquad w_t^{(2)} \sim N(0, 10) \tag{5.2}$$

$$Y_t = X_t^{(m)} + v_t, \qquad v_t \sim N(0, 0.1), \tag{5.3}$$
where the switch state m was chosen using priors π^{(1)} = π^{(2)} = 1/2 and transition probabilities W_{11} = W_{22} = 0.95 and W_{12} = W_{21} = 0.05. Five sequences from this data set are shown in Figure 6, along with the true state of the switch variable. We compared three different inference algorithms: variational inference, variational inference with deterministic annealing (section 4.1), and inference by gaussian merging (section 4.2). For each sequence, we initialized the variational inference algorithms with equal responsibilities for the two SSMs and ran them for 12 iterations. The nonannealed inference algorithm ran at a fixed temperature of T = 1, while the annealed algorithm was initialized to a temperature of T = 100, which decayed down to 1 over the 12 iterations, using the decay function T_{i+1} = (1/2) T_i + 1/2. To eliminate the effect of model inaccuracies, we gave all three inference algorithms the true parameters of the generative model. The segmentations found by the nonannealed variational inference algorithm showed little similarity to the true segmentations of the data (see Figure 7). Furthermore, the nonannealed algorithm generally underestimated the number of switches, often converging on solutions with no switches at all. Both the annealed variational algorithm and the gaussian merging method found segmentations that were more similar to the true segmentations of the data. Comparing percentage correct segmentations, we see that annealing substantially improves the variational inference method and that the gaussian merging and annealed variational methods perform comparably (see Figure 8): the average performance of the annealed variational method is only about 1.3% better than that of gaussian merging.
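A minimal generator for this synthetic model (equations 5.1–5.3, with the priors and transition probabilities above) might look like the following sketch:

```python
import numpy as np

def generate_switching_ssm(T=200, seed=0):
    """Sample a switch path and observations from equations 5.1-5.3."""
    rng = np.random.default_rng(seed)
    A = [0.99, 0.9]                  # state dynamics coefficients
    q = [1.0, 10.0]                  # state noise variances
    r = 0.1                          # observation noise variance
    W = np.array([[0.95, 0.05],      # switch transition probabilities
                  [0.05, 0.95]])
    x = np.zeros(2)
    s = rng.integers(2)              # uniform prior over the switch state
    S, Y = [], []
    for _ in range(T):
        # both state processes evolve at every time step
        x = np.array([A[m] * x[m] + rng.normal(0.0, np.sqrt(q[m]))
                      for m in range(2)])
        Y.append(x[s] + rng.normal(0.0, np.sqrt(r)))
        S.append(s)
        s = rng.choice(2, p=W[s])    # switch state for the next step
    return np.array(S), np.array(Y)
```

Note that the two AR(1) processes are tuned so their stationary variances nearly match, which is why segmenting the sequences from the dynamics alone is hard (compare Figure 6).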
Figure 6: Five data sequences of length 200, with their true segmentations below them. In the segmentations, switch states 1 and 2 are represented by the presence and absence of dots, respectively. Notice that it is difficult to segment the sequences correctly based only on knowing the dynamics of the two processes.
5.2 Experiment 2: Modeling Respiration in a Patient with Sleep Apnea.
Switching state-space models should prove useful in modeling time series that have dynamics characterized by several different regimes. To illustrate this point, we examined a physiological data set from a patient tentatively diagnosed with sleep apnea, a medical condition in which an individual intermittently stops breathing during sleep. The data were obtained from the repository of time-series data sets associated with the Santa Fe Time Series Analysis and Prediction Competition (Weigend & Gershenfeld, 1993) and are described in detail in Rigney et al. (1993).8 The respiration pattern in sleep apnea is characterized by at least two regimes: no breathing and gasping breathing induced by a reflex arousal. Furthermore, in this patient there also seem to be periods of normal rhythmic breathing (see Figure 9).
8 The data are available online at http://www.stern.nyu.edu/~aweigend/TimeSeries/SantaFe.html#setB. We used samples 6201–7200 for training and 5201–6200 for testing.
Figure 7: For 10 different sequences of length 200, segmentations are shown with presence and absence of dots corresponding to the two SSMs generating these data. The rows are the segmentations found using the variational method with no annealing (N), the variational method with deterministic annealing (A), the gaussian merging method (M), and the true segmentation (T). All three inference algorithms give real-valued h_t^{(m)}; hard segmentations were obtained by thresholding the final h_t^{(m)} values at 0.5. The first five sequences are the ones shown in Figure 6.
Figure 8: Histograms of percentage correct segmentations: (a) control, using random segmentation; (b) variational inference without annealing; (c) variational inference with annealing; (d) gaussian merging. Percentage correct segmentation was computed by counting the number of time steps for which the true and estimated segmentations agree.
We trained switching SSMs, varying the random seed, the number of components in the mixture (M = 2 to 5), and the dimensionality of the state space in each component (K = 1 to 10), on a data set consisting of 1000 consecutive measurements of the chest volume. As controls, we also trained simple SSMs (i.e., M = 1), varying the dimension of the state space from K = 1 to 10, and simple HMMs (i.e., K = 0), varying the number of discrete hidden states from M = 2 to M = 50. Simulations were run until convergence or for 200 iterations, whichever came first; convergence was assessed by measuring the change in likelihood (or bound on the likelihood) over consecutive steps of EM. The likelihood of the simple SSMs and the HMMs was calculated on a test set consisting of 1000 consecutive measurements of the chest volume. For the switching SSMs, the likelihood is intractable, so we calculated the lower bound on the likelihood, B. The simple SSMs modeled the data very poorly for K = 1, and the performance was flat for values of K = 2 to 10 (see
Figure 9: Chest volume (respiration force) of a patient with sleep apnea during two noncontinuous time segments of the same night (measurements sampled at 2 Hz). (a) Training data. Apnea is characterized by extended periods of small variability in chest volume, followed by bursts (gasping). Here we see such behavior around t = 250, followed by normal rhythmic breathing. (b) Test data. In this segment we find several instances of apnea and an approximately rhythmic region. (The thick lines at the bottom of each plot are explained in the main text.)
Figure 10a). The large majority of runs of the switching state-space model resulted in models with higher likelihood than those of the simple SSMs (see Figures 10b–10e). One consistent exception should be noted: for values of M = 2 and K = 6 to 10, the switching SSM performed almost identically to the simple SSM. Exploratory experiments suggest that in these cases a single component takes responsibility for all the data, so the model effectively has M = 1. This may be a local minimum problem or a result of poor initialization heuristics. Looking at the learning curves for simple and switching SSMs, it is easy to see that there are plateaus at the solutions found by the simple one-component SSMs in which the switching SSM can get caught (see Figure 11). The likelihoods for HMMs with around M = 15 were comparable to those of the best switching SSMs (see Figure 10f).

Figure 10: Log likelihood (nats per observation) on the test data from a total of almost 400 runs of simple state-space models (a), switching state-space models with differing numbers of components (b–e), and hidden Markov models (f).

Purely in terms of coding efficiency, switching SSMs have little advantage over HMMs on this data. However, it is useful to contrast the solutions learned by HMMs with the solutions learned by the switching SSMs. The thick dots at the bottom of Figures 9a and 9b show the responsibility assigned to one of two components in a fairly typical switching SSM with M = 2 components of state size K = 2. This component has clearly specialized to modeling the data during periods of apnea, while the other component models the gasps and periods of rhythmic breathing. These two switching components provide a much more intuitive model of the data than the 10 to 20 discrete components needed in an HMM with comparable coding efficiency.9
9 By using further assumptions to constrain the model, such as continuity of the real-valued hidden state at switch times, it should be possible to obtain even better performance on these data.
Figure 11: Learning curves for a state-space model (K = 4) and a switching state-space model (M = 2, K = 2).
6 Discussion
The main conclusion we can draw from the first series of experiments is that even when given the correct model parameters, the problem of segmenting a switching time series into its components is difficult. There are combinatorially many alternatives to be considered, and the energy surface suffers from many local minima, so local optimization approaches like the variational method we used are limited by the quality of the initial conditions. Deterministic annealing can be thought of as a sophisticated initialization procedure for the hidden states: the final solution at each temperature provides the initial conditions at the next. We found that annealing substantially improved the quality of the segmentations found. The first experiment also indicates that the much simpler gaussian merging method performs comparably to annealed variational inference. The gaussian merging methods have the advantage that at each time step, the cost function minimized has no local minima. This may account for how well they perform relative to the nonannealed variational method. On the other hand, the variational methods have the advantage that they iteratively improve their approximation to the posterior, and they define a lower bound
on the likelihood. Our results suggest that it may be very fruitful to use the gaussian merging method to initialize the variational inference procedure. Furthermore, it is possible to derive variational approximations for other switching models described in the literature, and a combination of gaussian merging and variational approximation may provide a fast and robust method for learning and inference in those models. The second series of experiments suggests that on a real data set believed to have switching dynamics, the switching SSM can indeed uncover multiple regimes. When it captures these regimes, it generalizes to the test set much better than the simple linear dynamical model. Similar coding efficiency can be obtained by using HMMs, which, due to the discrete nature of their state space, can model nonlinear dynamics. However, in doing so, the HMMs had to use 10 to 20 discrete states, which makes their solutions less interpretable. Variational approximations provide a powerful tool for inference and learning in complex probabilistic models. We have seen that when applied to the switching SSM, they can incorporate within a single framework well-known exact inference methods like Kalman smoothing and the forward-backward algorithm. Variational methods can be applied to many of the other classes of intractable switching models described in section 2.3. However, training more complex models also makes apparent the importance of good methods for model selection and initialization. To summarize, switching SSMs are a dynamical generalization of mixture-of-experts neural networks, are closely related to well-known models in econometrics and control, and combine the representations underlying HMMs and linear dynamical systems. For domains in which we have some a priori belief that there are multiple, approximately linear dynamical regimes, switching SSMs provide a natural modeling tool.
Variational approximations provide a method for overcoming the most difficult problem in learning switching SSMs: that the inference step is intractable. Deterministic annealing further improves on the solutions found by the variational method.
Appendix A: Notation
Symbol          Size    Description

Variables

  Y_t             D×1     observation vector at time t
  {Y_t}           D×T     sequence of observation vectors [Y_1, Y_2, ..., Y_T]
  X_t^{(m)}       K×1     state vector of state-space model (SSM) m at time t
  X_t             KM×1    entire real-valued hidden state at time t: X_t = [X_t^{(1)}, ..., X_t^{(M)}]
  S_t             M×1     switch state variable (represented either as a discrete variable S_t ∈ {1, ..., M}, or as an M×1 vector S_t = [S_t^{(1)}, ..., S_t^{(M)}]′, where S_t^{(m)} ∈ {0, 1})

Model parameters

  A^{(m)}         K×K     state dynamics matrix for SSM m
  C^{(m)}         D×K     output matrix for SSM m
  Q^{(m)}         K×K     state noise covariance matrix for SSM m
  μ_{X_1}^{(m)}   K×1     initial state mean for SSM m
  Q_1^{(m)}       K×K     initial state noise covariance matrix for SSM m
  R               D×D     output noise covariance matrix
  π               M×1     initial state probabilities for switch state
  W               M×M     state transition matrix for switch state

Variational parameters

  h_t^{(m)}       1×1     responsibility of SSM m for Y_t
  q_t^{(m)}       1×1     related to expected squared error if SSM m generated Y_t

Miscellaneous

  X′                      matrix transpose of X
  |X|                     matrix determinant of X
  ⟨X⟩                     expected value of X under the Q distribution

Dimensions

  D       size of observation vector
  T       length of a sequence of observation vectors
  M       number of state-space models
  K       size of state vector in each state-space model
Appendix B: Derivation of the Variational Fixed-Point Equations
In this appendix we derive the variational fixed-point equations used in the learning algorithm for switching SSMs. First, we write out the probability density P defined by a switching SSM. For convenience, we express this probability density in the log domain, through its associated energy function or hamiltonian, H. The probability density is related to the hamiltonian through the usual Boltzmann distribution (at a temperature of 1),

$$P(\cdot) = \frac{1}{Z} \exp\{-H(\cdot)\},$$
where Z is a normalization constant required so that P(·) integrates to unity. Expressing the probabilities in the log domain does not affect the resulting algorithm. We then similarly express the approximating distribution Q through its hamiltonian H_Q. Finally, we obtain the variational fixed-point equations by setting to zero the derivatives of the KL divergence between Q and P with respect to the variational parameters of Q. The joint probability of observations and hidden states in a switching SSM is (see equation 3.1)

$$P(\{S_t, X_t, Y_t\}) = P(S_1) \left[ \prod_{t=2}^{T} P(S_t \mid S_{t-1}) \right] \prod_{m=1}^{M} \left[ P\!\left(X_1^{(m)}\right) \prod_{t=2}^{T} P\!\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) \right] \prod_{t=1}^{T} P(Y_t \mid X_t, S_t). \tag{B.1}$$
We proceed to dissect this expression into its constituent parts. The initial probability of the switch variable at time t = 1 is given by

$$P(S_1) = \prod_{m=1}^{M} \left( \pi^{(m)} \right)^{S_1^{(m)}}, \tag{B.2}$$

where S_1 is represented by an M×1 vector [S_1^{(1)}, ..., S_1^{(M)}] in which S_1^{(m)} = 1 if the switch state is m, and 0 otherwise. The probability of transitioning from a switch state at time t − 1 to a switch state at time t is given by

$$P(S_t \mid S_{t-1}) = \prod_{m=1}^{M} \prod_{n=1}^{M} \left( W^{(m,n)} \right)^{S_t^{(m)} S_{t-1}^{(n)}}. \tag{B.3}$$
The initial distribution for the hidden state variable in SSM m is gaussian with mean μ_{X_1}^{(m)} and covariance matrix Q_1^{(m)}:

$$P\!\left(X_1^{(m)}\right) = \left| 2\pi Q_1^{(m)} \right|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right)' \left( Q_1^{(m)} \right)^{-1} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right) \right\}. \tag{B.4}$$

The probability distribution of the state in SSM m at time t given the state at time t − 1 is gaussian with mean A^{(m)} X_{t-1}^{(m)} and covariance matrix Q^{(m)}:

$$P\!\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) = \left| 2\pi Q^{(m)} \right|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right)' \left( Q^{(m)} \right)^{-1} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right) \right\}. \tag{B.5}$$
Finally, using equation 3.2, we can write

$$P(Y_t \mid X_t, S_t) = \prod_{m=1}^{M} \left[ |2\pi R|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2} \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \right\} \right]^{S_t^{(m)}}, \tag{B.6}$$

since the terms with exponent equal to 0 vanish in the product. Combining equations B.1 through B.6 and taking the negative of the logarithm, we obtain the hamiltonian of a switching SSM (ignoring constants):

$$\begin{aligned}
H = {} & \frac{1}{2} \sum_{m=1}^{M} \log \left| Q_1^{(m)} \right| + \frac{1}{2} \sum_{m=1}^{M} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right)' \left( Q_1^{(m)} \right)^{-1} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right) \\
& + \frac{(T-1)}{2} \sum_{m=1}^{M} \log \left| Q^{(m)} \right| \\
& + \frac{1}{2} \sum_{m=1}^{M} \sum_{t=2}^{T} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right)' \left( Q^{(m)} \right)^{-1} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right) \\
& + \frac{T}{2} \log |R| + \frac{1}{2} \sum_{m=1}^{M} \sum_{t=1}^{T} S_t^{(m)} \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \\
& - \sum_{m=1}^{M} S_1^{(m)} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} \sum_{n=1}^{M} S_t^{(m)} S_{t-1}^{(n)} \log W^{(m,n)}.
\end{aligned} \tag{B.7}$$
The hamiltonian for the approximating distribution can be analogously derived from the definition of Q (see equation 4.6):

$$Q(\{S_t, X_t\}) = \frac{1}{Z_Q} \psi(S_1) \left[ \prod_{t=2}^{T} \psi(S_{t-1}, S_t) \right] \prod_{m=1}^{M} \psi\!\left(X_1^{(m)}\right) \prod_{t=2}^{T} \psi\!\left(X_{t-1}^{(m)}, X_t^{(m)}\right). \tag{B.8}$$

The potentials for the initial switch state and the switch state transitions are

$$\psi(S_1) = \prod_{m=1}^{M} \left( \pi^{(m)} q_1^{(m)} \right)^{S_1^{(m)}}, \tag{B.9}$$

$$\psi(S_{t-1}, S_t) = \prod_{m=1}^{M} \prod_{n=1}^{M} \left( W^{(m,n)} q_t^{(m)} \right)^{S_t^{(m)} S_{t-1}^{(n)}}. \tag{B.10}$$

The potential for the initial state of SSM m is

$$\psi\!\left(X_1^{(m)}\right) = P\!\left(X_1^{(m)}\right) \left[ P\!\left(Y_1 \mid X_1^{(m)}, S_1 = m\right) \right]^{h_1^{(m)}}, \tag{B.11}$$

and the potential for the state at time t given the state at time t − 1 is

$$\psi\!\left(X_{t-1}^{(m)}, X_t^{(m)}\right) = P\!\left(X_t^{(m)} \mid X_{t-1}^{(m)}\right) \left[ P\!\left(Y_t \mid X_t^{(m)}, S_t = m\right) \right]^{h_t^{(m)}}. \tag{B.12}$$
The hamiltonian for Q is obtained by combining these terms and taking the negative logarithm:

$$\begin{aligned}
H_Q = {} & \frac{1}{2} \sum_{m=1}^{M} \log \left| Q_1^{(m)} \right| + \frac{1}{2} \sum_{m=1}^{M} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right)' \left( Q_1^{(m)} \right)^{-1} \left( X_1^{(m)} - \mu_{X_1}^{(m)} \right) \\
& + \frac{(T-1)}{2} \sum_{m=1}^{M} \log \left| Q^{(m)} \right| \\
& + \frac{1}{2} \sum_{m=1}^{M} \sum_{t=2}^{T} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right)' \left( Q^{(m)} \right)^{-1} \left( X_t^{(m)} - A^{(m)} X_{t-1}^{(m)} \right) \\
& + \frac{T}{2} \sum_{m=1}^{M} \log |R| + \frac{1}{2} \sum_{m=1}^{M} \sum_{t=1}^{T} h_t^{(m)} \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \\
& - \sum_{m=1}^{M} S_1^{(m)} \log \pi^{(m)} - \sum_{t=2}^{T} \sum_{m=1}^{M} \sum_{n=1}^{M} S_t^{(m)} S_{t-1}^{(n)} \log W^{(m,n)} \\
& - \sum_{t=1}^{T} \sum_{m=1}^{M} S_t^{(m)} \log q_t^{(m)}.
\end{aligned} \tag{B.13}$$
Comparing H_Q with H, we see that the interaction between the S_t^{(m)} and the X_t^{(m)} variables has been eliminated, at the cost of introducing two sets of variational parameters: the responsibilities h_t^{(m)} and the bias terms q_t^{(m)} on the discrete Markov chain. In order to obtain the approximation Q that maximizes the lower bound on the log-likelihood, we minimize the KL divergence KL(Q‖P) as a function of these variational parameters:

$$KL(Q \| P) = \sum_{\{S_t\}} \int Q(\{S_t, X_t\}) \log \frac{Q(\{S_t, X_t\})}{P(\{S_t, X_t\} \mid \{Y_t\})} \, d\{X_t\} \tag{B.14}$$

$$= \langle H - H_Q \rangle - \log Z_Q + \log Z, \tag{B.15}$$
where ⟨·⟩ denotes expectation over the approximating distribution Q and Z_Q is the normalization constant for Q. Both Q and P define distributions in the exponential family. As a consequence, the zeros of the derivatives of KL with respect to the variational parameters can be obtained simply by equating derivatives of ⟨H⟩ and ⟨H_Q⟩ with respect to the corresponding sufficient statistics (Ghahramani, 1997):

$$\frac{\partial \langle H_Q - H \rangle}{\partial \langle S_t^{(m)} \rangle} = 0 \tag{B.16}$$

$$\frac{\partial \langle H_Q - H \rangle}{\partial \langle X_t^{(m)} \rangle} = 0 \tag{B.17}$$

$$\frac{\partial \langle H_Q - H \rangle}{\partial P_t^{(m)}} = 0, \tag{B.18}$$

where $P_t^{(m)} = \langle X_t^{(m)} X_t^{(m)\prime} \rangle - \langle X_t^{(m)} \rangle \langle X_t^{(m)} \rangle'$ is the covariance of X_t^{(m)} under Q. Many terms cancel when we subtract the two hamiltonians:

$$H_Q - H = \sum_{m=1}^{M} \sum_{t=1}^{T} \left[ \frac{1}{2} \left( h_t^{(m)} - S_t^{(m)} \right) \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) - S_t^{(m)} \log q_t^{(m)} \right]. \tag{B.19}$$
Taking derivatives, we obtain

$$\frac{\partial \langle H_Q - H \rangle}{\partial \langle S_t^{(m)} \rangle} = -\log q_t^{(m)} - \frac{1}{2} \left\langle \left( Y_t - C^{(m)} X_t^{(m)} \right)' R^{-1} \left( Y_t - C^{(m)} X_t^{(m)} \right) \right\rangle \tag{B.20}$$

$$\frac{\partial \langle H_Q - H \rangle}{\partial \langle X_t^{(m)} \rangle} = -\left( h_t^{(m)} - \langle S_t^{(m)} \rangle \right) \left( Y_t - C^{(m)} \langle X_t^{(m)} \rangle \right)' R^{-1} C^{(m)} \tag{B.21}$$

$$\frac{\partial \langle H_Q - H \rangle}{\partial P_t^{(m)}} = \frac{1}{2} \left( h_t^{(m)} - \langle S_t^{(m)} \rangle \right) C^{(m)\prime} R^{-1} C^{(m)}. \tag{B.22}$$
From equation B.20, we get the fixed-point equation 4.12 for q_t^{(m)}. Both equations B.21 and B.22 are satisfied when h_t^{(m)} = ⟨S_t^{(m)}⟩. Using the fact that ⟨S_t^{(m)}⟩ = Q(S_t = m), we get equation 4.11.

References

Ackerson, G. A., & Fu, K. S. (1970). On state estimation in switching environments. IEEE Transactions on Automatic Control, AC-15(1), 10–17.
Anderson, B. D. O., & Moore, J. B. (1979). Optimal filtering. Englewood Cliffs, NJ: Prentice-Hall.
Athaide, C. R. (1995). Likelihood evaluation and state estimation for nonlinear state space models. Unpublished doctoral dissertation, University of Pennsylvania.
Baldi, P., Chauvin, Y., Hunkapiller, T., & McClure, M. (1994). Hidden Markov models of biological primary sequence information. Proc. Nat. Acad. Sci. (USA), 91(3), 1059–1063.
Bar-Shalom, Y., & Li, X.-R. (1993). Estimation and tracking. Boston: Artech House.
Baum, L., Petrie, T., Soules, G., & Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics, 41, 164–171.
Bengio, Y., & Frasconi, P. (1995). An input-output HMM architecture. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 427–434). Cambridge, MA: MIT Press.
Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 719–726). San Mateo, CA: Morgan Kaufmann.
Carter, C. K., & Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81, 541–553.
Chaer, W. S., Bishop, R. H., & Ghosh, J. (1997). A mixture-of-experts framework for adaptive Kalman filtering. IEEE Trans. on Systems, Man and Cybernetics.
Chang, C. B., & Athans, M. (1978). State estimation for discrete systems with switching parameters. IEEE Transactions on Aerospace and Electronic Systems, AES-14(3), 418–424.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dean, T., & Kanazawa, K. (1989). A model for reasoning about persistence and causation. Computational Intelligence, 5(3), 142–150.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Society Series B, 39, 1–38.
Deng, L. (1993). A stochastic model of speech incorporating hierarchical nonstationarity. IEEE Trans. on Speech and Audio Processing, 1(4), 471–474.
Digalakis, V., Rohlicek, J. R., & Ostendorf, M. (1993). ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition. IEEE Transactions on Speech and Audio Processing, 1(4), 431–442.
Elliott, R. J., Aggoun, L., & Moore, J. B. (1995). Hidden Markov models: Estimation and control. New York: Springer-Verlag.
Fraser, A. M., & Dimitriadis, A. (1993). Forecasting probability densities by using hidden Markov models with mixed states. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 265–282). Reading, MA: Addison-Wesley.
Ghahramani, Z. (1997). On structured variational approximations (Tech. Rep. No. CRG-TR-97-1). Toronto: Department of Computer Science, University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/struct.ps.gz.
Ghahramani, Z., & Hinton, G. E. (1996a). Parameter estimation for linear dynamical systems (Tech. Rep. No. CRG-TR-96-2). Toronto: Department of Computer Science, University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-2.ps.gz.
Ghahramani, Z., & Hinton, G. E. (1996b). The EM algorithm for mixtures of factor analyzers (Tech. Rep. No. CRG-TR-96-1). Toronto: Department of Computer Science, University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-1.ps.gz.
Ghahramani, Z., & Jordan, M. I. (1997). Factorial hidden Markov models. Machine Learning, 29, 245–273.
Goodwin, G., & Sin, K. (1984). Adaptive filtering prediction and control. Englewood Cliffs, NJ: Prentice-Hall.
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57, 357–384.
Hamilton, J. D. (1994). Time series analysis. Princeton, NJ: Princeton University Press.
Harrison, P. J., & Stevens, C. F. (1976). Bayesian forecasting (with discussion). J. Royal Statistical Society B, 38, 205–247.
Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixture of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1998). An introduction to variational methods in graphical models. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Jordan, M. I., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Juang, B. H., & Rabiner, L. R. (1991). Hidden Markov models for speech recognition. Technometrics, 33, 251–272.
Kadirkamanathan, V., & Kadirkamanathan, M. (1996). Recursive estimation of dynamic modular RBF networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 239–245). Cambridge, MA: MIT Press.
Kalman, R. E., & Bucy, R. S. (1961). New results in linear filtering and prediction. Journal of Basic Engineering (ASME), 83D, 95–108.
Kanazawa, K., Koller, D., & Russell, S. J. (1995). Stochastic simulation algorithms for dynamic probabilistic networks. In P. Besnard & S. Hanks (Eds.), Uncertainty in artificial intelligence: Proceedings of the eleventh conference (pp. 346–351). San Mateo, CA: Morgan Kaufmann.
Kehagias, A., & Petrides, V. (1997). Time series segmentation using predictive modular neural networks. Neural Computation, 9(8), 1691–1710.
Kim, C.-J. (1994). Dynamic linear models with Markov-switching. J. Econometrics, 60, 1–22.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. J. Royal Statistical Society B, 157–224.
Ljung, L., & Söderström, T. (1983). Theory and practice of recursive identification. Cambridge, MA: MIT Press.
Meila, M., & Jordan, M. I. (1996). Learning fine motion by Markov mixtures of experts. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Neal, R. M., & Hinton, G. E. (1998). A new view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.
Pawelzik, K., Kohlmorgen, J., & Müller, K.-R. (1996). Annealed competition of experts for a segmentation and classification of switching dynamics. Neural Computation, 8(2), 340–356.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Rabiner, L. R., & Juang, B. H. (1986). An introduction to hidden Markov models. IEEE Acoustics, Speech & Signal Processing Magazine, 3, 4–16.
Rauch, H. E. (1963). Solutions to the linear smoothing problem. IEEE Transactions on Automatic Control, 8, 371–372.
Rigney, D., Goldberger, A., Ocasio, W., Ichimaru, Y., Moody, G., & Mark, R. (1993). Multi-channel physiological data: Description and analysis. In A. Weigend & N. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past (pp. 105–129). Reading, MA: Addison-Wesley.
Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear gaussian models. Neural Computation, 11(2), 305–345.
Saul, L., & Jordan, M. I. (1996). Exploiting tractable substructures in intractable networks. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8. Cambridge, MA: MIT Press.
Shumway, R. H., & Stoffer, D. S. (1982). An approach to time series smoothing and forecasting using the EM algorithm. J. Time Series Analysis, 3(4), 253–264.
Shumway, R. H., & Stoffer, D. S. (1991). Dynamic linear models with switching. J. Amer. Stat. Assoc., 86, 763–769.
Smyth, P. (1994). Hidden Markov models for fault detection in dynamic systems. Pattern Recognition, 27(1), 149–164.
Smyth, P., Heckerman, D., & Jordan, M. I. (1997). Probabilistic independence networks for hidden Markov probability models. Neural Computation, 9, 227–269.
Ueda, N., & Nakano, R. (1995). Deterministic annealing variant of the EM algorithm. In G. Tesauro, D. Touretzky, & J. Alspector (Eds.), Advances in neural information processing systems, 7 (pp. 545–552). San Mateo, CA: Morgan Kaufmann.
Weigend, A., & Gershenfeld, N. (1993). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Received March 2, 1998; accepted March 16, 1999.
LETTER
Communicated by Misha Tsodyks
Retrieval Properties of a Hopfield Model with Random Asymmetric Interactions

Zhang Chengxiang
Chandan Dasgupta
Department of Physics, Indian Institute of Science, Bangalore 560012, India

Manoranjan P. Singh
Laser Programme, Centre for Advanced Technology, Indore 452013, India

The process of pattern retrieval in a Hopfield model in which a random antisymmetric component is added to the otherwise symmetric synaptic matrix is studied by computer simulations. The introduction of the antisymmetric component is found to increase the fraction of random inputs that converge to the memory states. However, the size of the basin of attraction of a memory state does not show any significant change when asymmetry is introduced in the synaptic matrix. We show that this is due to the fact that the spurious fixed points, which are destabilized by the introduction of asymmetry, have very small basins of attraction. The convergence time to spurious fixed-point attractors increases faster than that for the memory states as the asymmetry parameter is increased. The possibility of convergence to spurious fixed points is greatly reduced if a suitable upper limit is set for the convergence time. This prescription works better if the synaptic matrix has an antisymmetric component.

1 Introduction
Neural network models of associative memory have received a great deal of attention in recent years (Amit, 1989). The best-known model of this kind is the one proposed by Hopfield (1982, 1984). In the Hopfield model, a set of random binary patterns ("memories") are stored in a network using the modified Hebb rule, which leads to a symmetric matrix of synaptic interactions (two model neurons in the network affect each other in exactly the same way). This assumption of symmetry of the synaptic interaction matrix is known (Eccles, 1964) to be biologically unrealistic. Therefore, it is of interest to try to understand the behavior of neural networks in which the constraint of symmetry is relaxed. Furthermore, it has been suggested (Parisi, 1986; Hertz, Grinstein, & Solla, 1987; Krauth, Nadal, & Mézard, 1988) that the presence of asymmetry in the synaptic interactions may improve the performance of neural network models of associative memory.

Neural Computation 12, 865–880 (2000)
© 2000 Massachusetts Institute of Technology

It is well known (Amit, 1989) that the
Hopfield model has a large number of spurious fixed-point attractors that are not strongly correlated with any one of the memories stored in the network. The presence of these attractors adversely affects the performance of the network by "trapping" initial configurations that are not close to one of the memory states. This problem is aggravated by the fact that the network cannot distinguish between successful retrieval of a memory and convergence to a spurious fixed point because in both cases it settles down to a time-independent state. This problem is not specific to the Hopfield model with binary neurons. It is also present in many other models with symmetric synaptic connections, such as the Hopfield model with analog neurons (Waugh, Marcus, & Westervelt, 1990), Hopfield-type models with more sophisticated learning rules (Sompolinsky, 1987; Forrest, 1988), and networks with sparse representations (Golomb, Rubin, & Sompolinsky, 1990; Erichsen & Theumann, 1991). The presence of asymmetry in the interactions is believed to alleviate this problem in two ways. First, analytic and numerical studies (Treves & Amit, 1988; Fukai, 1990; Singh, Zhang, & Dasgupta, 1995) have shown that the introduction of asymmetry greatly reduces the number of spurious fixed-point attractors while leaving the attractors corresponding to memory states largely unaffected. These results suggest that the basins of attraction of the memory states may be enlarged by the introduction of asymmetry. Second, there are indications (Parisi, 1986) that the presence of asymmetry greatly increases the time of convergence to spurious attractors relative to that for convergence to memory states. If this were true in general, then it might be possible to use asymmetry to discriminate effectively between memory states and spurious attractors by monitoring the temporal behavior of the network.
Although these suggestions are reasonable and physically plausible, they have not been tested by carrying out detailed and systematic studies of the retrieval performance of Hopfield-like models with asymmetric interactions. We describe the results of a numerical study of the retrieval properties of a Hopfield-like model in which a random antisymmetric component is added to the symmetric, Hebbian synaptic matrix. The presence of asymmetry in the interactions rules out the use of the methods of equilibrium statistical mechanics that have been employed extensively (Amit, 1989) in the analysis of the behavior of the Hopfield model. Also, it is not clear whether the methods used in rigorous mathematical analysis (Komlós & Paturi, 1993) of Hopfield-type networks can be extended to networks with asymmetric interactions. Analytic studies of the dynamics of neural networks are difficult; exact results can be obtained only in very special limits (e.g., in the limit of extreme asymmetric dilution (Derrida, Gardner, & Zippelius, 1987)). For this reason, existing analytic studies (Hertz et al., 1987; Crisanti & Sompolinsky, 1987) of the dynamics of neural networks with asymmetric interactions use simplifications and approximations whose validity for models like the one we consider here is questionable. There are other analytic methods (Gardner, Derrida, & Mottishaw, 1987; Kepler & Abbott, 1988)
that may be used in the study of the dynamical evolution of the state of the system for the first few steps of update. It is difficult to draw any conclusion about the asymptotic behavior of the network from such studies because the short-time dynamics of the network may differ significantly from its long-time behavior (Gardner et al., 1987). For these reasons, we have chosen to use numerical simulations in our study of the retrieval properties of this model. Section 2 contains definitions of the model we consider and several other quantities used in the analysis of our numerical data. The numerical methods used and the results obtained from our simulations are described in detail in section 3. Section 4 contains a summary of the main results of this study and a few concluding remarks.

2 Model
The network under investigation consists of N two-state model neurons ("spins") $\{S_i\}$, each of which may assume the values $+1$ or $-1$. A configuration or state of the network is defined by giving specific values to all of its N spins. The synaptic interconnection matrix considered here is defined as follows:
$$J_{ij} = J^{s}_{ij} + k J^{as}_{ij}, \tag{2.1}$$
where $J^{s}_{ij}$ is symmetric and is given by the modified Hebb rule (Hopfield, 1982):
$$J^{s}_{ij} = \frac{1}{N} \sum_{\mu=1}^{p} \xi^{\mu}_{i} \xi^{\mu}_{j}, \quad i \neq j, \qquad J^{s}_{ii} = 0, \tag{2.2}$$
where the $\{\xi^{\mu}_{i}\}$, $\mu = 1, \ldots, p$, are the stored patterns or memories. Each $\xi^{\mu}_{i}$ may take the values $\pm 1$ with equal probability. The number of patterns stored in the network is $p$, and $\alpha = p/N$ is the memory loading of the network. $J^{as}_{ij}$ is an antisymmetric matrix with independent gaussian elements, that is,
$$P(J^{as}_{ij}) = \frac{1}{\sqrt{2\pi/N}} \exp\left[ \frac{-(J^{as}_{ij})^{2}}{2/N} \right], \quad i < j, \tag{2.3}$$
and $J^{as}_{ij} = -J^{as}_{ji}$. The parameter $k$ of equation 2.1 is a measure of the strength of the antisymmetric part of the coupling relative to that of the symmetric part. As discussed in Singh et al. (1995), a synaptic matrix of this form may arise in the tabula non rasa learning scenario proposed by Toulouse, Dehaene, & Changeux (1986).
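The construction of equations 2.1 through 2.3 is straightforward to implement. The following is a minimal NumPy sketch (the function name is ours, not from the paper): the symmetric Hebbian part has a zero diagonal, and the antisymmetric part is built from independent gaussian entries of variance $1/N$ above the diagonal, mirrored with a sign flip below it.

```python
import numpy as np

def synaptic_matrix(patterns, k, rng):
    """Build J = J^s + k * J^as as in equations 2.1-2.3.

    patterns: (p, N) array of +/-1 stored memories.
    k: relative strength of the antisymmetric component.
    """
    p, N = patterns.shape
    # Symmetric Hebbian part (eq. 2.2), zero diagonal.
    Js = patterns.T @ patterns / N
    np.fill_diagonal(Js, 0.0)
    # Antisymmetric part (eq. 2.3): gaussian with variance 1/N above
    # the diagonal, with J^as_ji = -J^as_ij below it.
    A = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
    upper = np.triu(A, 1)
    Jas = upper - upper.T
    return Js + k * Jas

# Parameters matching the small-system simulations (N = 60, p = 6).
rng = np.random.default_rng(0)
N, p = 60, 6
patterns = rng.choice([-1, 1], size=(p, N))
J = synaptic_matrix(patterns, 0.1, rng)
```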
The network evolves according to the iterated map,
$$S_i(t+1) = \mathrm{sign}\left[ \sum_{j=1}^{N} J_{ij} S_j(t) \right], \quad i = 1, \ldots, N, \tag{2.4}$$
where the discrete "time" t is measured in units of updates per spin. A fixed point of the dynamics is a configuration $\{S_i\}$ satisfying
$$S_i = \mathrm{sign}(h_i), \tag{2.5}$$
where
$$h_i \equiv \sum_{j=1}^{N} J_{ij} S_j \tag{2.6}$$
is the local field of the ith spin. Equation 2.5 is equivalent to the conditions
$$\lambda_i \equiv S_i h_i \geq 0. \tag{2.7}$$
The quantity $\lambda_i$, called the local stability parameter of the ith spin of the network, may serve as a measure of the local stability of the fixed point $\{S_i\}$. The fractional Hamming distance between two configurations $\{S_i^{\mu}\}$ and $\{S_i^{\nu}\}$ of the network is defined as
$$H_{\mu\nu} = \frac{1}{2}\left(1 - m_{\mu\nu}\right), \tag{2.8}$$
where
$$m_{\mu\nu} = \frac{1}{N} \sum_{i=1}^{N} S_i^{\mu} S_i^{\nu} \tag{2.9}$$
is the overlap of the two configurations. Although an energy function does not exist for an asymmetric network, one may still define an energy-like function using the symmetric part of the interconnection matrix:
$$E = -\frac{1}{2} \sum_{i,j=1}^{N} J^{s}_{ij} S_i S_j \equiv \sqrt{\alpha}\, N \epsilon. \tag{2.10}$$
Although asymmetric networks do not always evolve from higher values of E to lower values, the energy function defined above still provides some insight into the dynamics of the network.
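The update rule and the fixed-point and distance definitions of equations 2.4 through 2.9 translate directly into code. A minimal NumPy sketch, with helper names of our own choosing:

```python
import numpy as np

def sequential_update(J, S, rng):
    """One sweep of random sequential dynamics (eq. 2.4):
    spins are visited in random order and set to the sign of
    their local field."""
    S = S.copy()
    for i in rng.permutation(len(S)):
        h = J[i] @ S              # local field h_i (eq. 2.6)
        S[i] = 1 if h >= 0 else -1
    return S

def is_fixed_point(J, S):
    """Check the stability conditions lambda_i = S_i h_i >= 0 (eq. 2.7)."""
    return bool(np.all(S * (J @ S) >= 0))

def hamming(S1, S2):
    """Fractional Hamming distance H = (1 - m) / 2 (eqs. 2.8-2.9)."""
    return 0.5 * (1.0 - np.mean(S1 * S2))
```

As a sanity check, a purely symmetric network storing a single pattern retrieves it from an input a few spin flips away within one or two sweeps.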
3 Results of Simulations
We have carried out numerical simulations for various sizes of the network ranging from N = 60 to 500. For asymmetric networks, k was taken to be 0.1, and simulations for each realization of the symmetric component of the synaptic matrix were carried out for two different sets of antisymmetric interconnections. The results reported here are averages over 20 realizations of the symmetric component of the synaptic matrix. The memory loading level α was fixed at 0.1. Asynchronous dynamics with both serial and random sequential updating was employed in our simulations. We found no qualitative difference between the behavior of the network for the two updating procedures considered. We therefore present here only the results of random sequential updating.

We first looked at the effects of synaptic asymmetry on the number and properties of the fixed-point attractors of the system. All the fixed-point attractors were located by starting the system from a very large number of random initial states and following its evolution until a fixed point is reached. Such exhaustive enumeration (Singh et al., 1995) of the fixed points is possible only for small systems. We present here the results obtained for 60-neuron networks with six memories. The total number of input states used for each realization of the coupling matrix was of the order of $10^6$. We find that the average number of fixed points is 151 for k = 0 and 38 for k = 0.1 (a fixed point and its complement, obtained by reversing the signs of all the spins, are not counted separately). In contrast to this dramatic decrease of the total number of stable states, the stored patterns are nearly unaffected by the introduction of asymmetry. In particular, we found only one realization of $J^{s}_{ij}$ for which one memory becomes unstable for one of the two realizations of $J^{as}_{ij}$.
The accessibility (Hopfield, Feinstein, & Palmer, 1983) of the stored patterns, defined as the fraction of random initial states converging to the set of stored memory states, increases from 0.64 to 0.70 when the asymmetry is turned on.

Second, the influence of asymmetry on the size of the basins of attraction of the stored patterns was investigated. Previous studies (Crisanti & Sompolinsky, 1987; Krauth et al., 1988; Spitzner & Kinzel, 1989) have led to contradictory conclusions about this issue. In our simulations of a network of 500 neurons and 50 memories, we chose each initial state in such a way that it is at a given Hamming distance H from a specific memory state (the target state), and at the same time, its Hamming distances to other memory states are larger than those to the target state. The total number of initial states used for each value of H was 10,000 for the asymmetric network and 5000 for the symmetric network. Figure 1 shows the probability of correct retrieval (operationally defined as one in which the overlap of the retrieved state with the target memory exceeds 0.95) at different values of H. The curves for k = 0 and k = 0.1 are found to be almost coincident. When H < 0.2, the probability of convergence to the target memory is almost equal to 1. It drops sharply when H exceeds 0.2 and becomes 0 as H approaches 0.4 for both symmetric and asymmetric networks.

Figure 1: Probability of convergence to a memory state as a function of the fractional Hamming distance H of the initial state from the memory. The data shown are for networks with N = 500 and α = 0.1. Results for the symmetric network (k = 0) are shown by the solid line; the triangles represent results for an asymmetric network with k = 0.1.

It is surprising that although a large number of spurious states are suppressed, the size of the basin of attraction of the memory states does not increase when the asymmetry is turned on. Spitzner & Kinzel (1989) found a similar result for a Hopfield model with directed bonds. A probable explanation of the results shown in Figure 1 is that the spurious stable states that become unstable due to the turning on of the asymmetry usually have high values of the energy E defined in equation 2.10. As noted in Hopfield et al. (1983), the typical size of the basin of attraction of such high-E states is expected to be small. So the disappearance of these high-energy states has no significant influence on the size of the basins of attraction of the memory states.
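The two ingredients of this experiment, an initial state at a prescribed fractional Hamming distance from the target and the 0.95-overlap retrieval criterion, can be sketched as follows. This is an illustrative simplification: the paper's additional condition that the initial state be closer to the target than to any other memory is omitted here, and the helper names are ours.

```python
import numpy as np

def initial_state(target, H, rng):
    """Return a random state at fractional Hamming distance H from
    `target` (a +/-1 pattern), obtained by flipping round(H * N)
    randomly chosen spins."""
    N = len(target)
    S = target.copy()
    flip = rng.choice(N, size=int(round(H * N)), replace=False)
    S[flip] *= -1
    return S

def correct_retrieval(S_final, target, threshold=0.95):
    """Retrieval counts as correct when the overlap m (eq. 2.9)
    with the target memory exceeds the threshold."""
    return float(np.mean(S_final * target)) > threshold
```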
The connection between the stability and the energy of a state of the network may be understood from the relation between the local stability parameters $\{\lambda_i\}$ defined in equation 2.7 and the energy. The energy of a symmetric network (k = 0) may be written as
$$E = -\frac{1}{2} \sum_{i,j=1}^{N} J^{s}_{ij} S_i S_j = -\frac{1}{2} \sum_{i=1}^{N} \lambda_i. \tag{3.1}$$
It is obvious that stable states with lower energy correspond to a higher local stability on the average, and vice versa. The energy of the network does not change when the antisymmetric component is added. However, all the local stability parameters are changed: some are increased, and the others are decreased. If the decrease of one of the local stability parameters is so large that its sign changes from positive to negative, then that state becomes unstable. Since the antisymmetric part of the connection matrix is chosen randomly, it is clear that stable states with higher energy (i.e., with lower average local stability) are more likely to become unstable when synaptic asymmetry is introduced.

This proposed explanation has been confirmed by the results of our simulations. In Figures 2 and 3, we have shown the results obtained from simulations of networks with N = 60 and 6 memories. To obtain the data shown in Figure 2, the entire range of possible energy values was divided into 20 equal parts ("bins"). For each realization of the connection matrix, the number of stable states whose energies fall into a particular bin was counted. The numbers obtained in this way for all the bins were then averaged over all realizations of the connection matrix. We observe that the antisymmetric part of the connection matrix has a stronger effect in reducing the number of stable states with higher energy. Figure 3 shows the dependence of the accessibility of a fixed point on the energy parameter ε. The data shown in this figure were obtained in a similar way as those in Figure 2. For each coupling matrix, the fraction of input states that converge to a stable state whose energy falls in a particular bin was calculated. This fraction was then averaged over all the stable states belonging to that bin. Finally, an averaging was done over all realizations of the coupling matrix. We note that the accessibility drops rapidly with increasing ε.
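Equation 3.1 is an exact identity for any configuration of a symmetric network, since $\sum_i \lambda_i = \sum_i S_i h_i = \sum_{i,j} J^{s}_{ij} S_i S_j = -2E$ when $k = 0$. It can be checked numerically in a few lines (a sketch; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 60, 6
xi = rng.choice([-1, 1], size=(p, N))

# Symmetric Hebbian matrix (eq. 2.2) for k = 0.
Js = xi.T @ xi / N
np.fill_diagonal(Js, 0.0)

# Arbitrary configuration; the identity holds for any {S_i}.
S = rng.choice([-1, 1], size=N)
h = Js @ S            # local fields (eq. 2.6)
lam = S * h           # local stability parameters (eq. 2.7)
E = -0.5 * S @ Js @ S # energy (eq. 2.10 with k = 0)
assert np.isclose(E, -0.5 * lam.sum())  # equation 3.1
```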
Furthermore, the presence of asymmetry in the coupling matrix enhances the accessibility of low-energy stable states because a large number of high-energy stable states are eliminated. Since the total number of random input states used in these simulations is quite large, the fraction of inputs converging to a stable state may be considered to be a measure of the size of its basin of attraction. It is then clear from the data shown in Figure 3 that the size of the basin of attraction of a stable state decreases significantly with increasing ε.

Figure 2: Dependence of the average number of fixed points, $N_{fp}$, on the energy parameter ε for networks with N = 60 and α = 0.1. Results for the symmetric network (k = 0) and the asymmetric network with k = 0.1 are shown.

Although the introduction of asymmetry in the synaptic interactions does not directly increase the size of the basin of attraction of stored patterns, it does have a positive effect on the retrieval performance of the network. This positive effect stems from the fact that the convergence time for wrong retrievals increases in the presence of asymmetry. This makes it possible to discriminate between a correct retrieval and an incorrect one. Results for the average convergence time τ, obtained for simulations of networks of 500 neurons, are presented in Figure 4. Data for τ for both correct and wrong retrievals are shown for the symmetric network and an asymmetric network with k = 0.1. We find that the average convergence time for wrong retrievals is significantly larger than that for correct retrievals, and that the former increases significantly in the presence of asymmetry, whereas the latter remains almost unchanged. Our results for the convergence times in the symmetric network are consistent with those of Tanaka and Yamada (1993).

There may be several reasons for these observations. The typical Hamming distance of an input state to a spurious stable state may be longer than
Figure 3: Dependence of the accessibility of a fixed point (i.e., the fraction of random initial states that converge to it) on the energy parameter ε for networks with N = 60 and α = 0.1. Results for the symmetric network (k = 0) and the asymmetric network with k = 0.1 are shown.
that to its target memory state. The former may increase further in the presence of asymmetry in the coupling matrix, which suppresses a large number of spurious states. A second possibility is that the dynamical evolution process for eventual trapping into a spurious state is more complicated than convergence to the target memory. Introduction of synaptic asymmetry further increases the complexity of the processes leading to wrong retrieval. In order to shed some light on this question, we have calculated the average Hamming distance H′ between the input state and the output spurious state to which the system converges for different values of H. We find that H′ is only slightly higher than H and that the introduction of asymmetry has almost no effect on H′. We have also found that for a fixed value of H, the values of H′ obtained for networks of different size are nearly the same. This implies that the above observation is not a finite-size effect. It is then reasonable to conclude that the second reason mentioned above plays a crucial role in slowing the process of wrong retrievals. The same mechanism may lead to a transformation of the spurious stable states to chaotic trajectories (Parisi, 1986; Crisanti & Sompolinsky, 1987) when the asymmetry parameter is increased.

Figure 4: Average convergence time τ, measured in units of updates per neuron, as a function of the fractional Hamming distance H of the initial state from a memory state. The data shown are for networks with N = 500 and α = 0.1. The two curves at the bottom represent data for convergence to memory states for two different values of the asymmetry parameter k: k = 0 (solid line) and k = 0.1 (dashed line). The two curves at the top represent data for τ when the convergence is to a spurious state: k = 0 (solid line); k = 0.1 (dashed line).

The dip near H ≈ 0.2 in the curves for the time of convergence to spurious attractors (see Figure 4) arises from the fact that the spurious attractors to which the system converges have significant correlations with the memory states. We have calculated the distribution of the minimum Hamming distance $H_m$ of a spurious stable state from all the memory states. This distribution is found to have a sharp peak at a relatively small value (0.2–0.3) of $H_m$ (Singh et al., 1995). The presence of a large number of spurious stable
states at a Hamming distance of 0.2 to 0.3 from a memory state results in relatively fast convergence to one of them when the Hamming distance of the initial state from its target memory lies in this range. The sharp drop in the probability of convergence to a memory state as H is increased beyond 0.2 (see Figure 1) is also due to the same reason. The peak of the distribution of $H_m$ is found to move to higher values as the sample size is increased. This observation suggests that the effect mentioned above may be an artifact of the smallness of the sample size. This point merits further investigation.

The observation that the average convergence times for correct and wrong retrievals are well separated suggests (Tanaka & Yamada, 1993) that if we set a suitable upper limit $t_m$ for the convergence time and allow the network to evolve only up to the time $t_m$, then wrong retrievals will be mostly discarded and the correct ones will be selected out. However, this strategy may suffer because of the overlap of the two "bands" of convergence times. This difficulty may be alleviated by adding an antisymmetric component to the synaptic matrix because the presence of asymmetry reduces the overlap between the two "bands" by increasing the convergence time for wrong retrievals. The effectiveness of this procedure is illustrated in Figure 5, where we show simulation results for networks with N = 500. The upper limit used in these simulations was $t_m$ = 16 updates per spin. The probability of convergence to memory states was calculated by considering only those runs in which the system reaches a stable state within time $t_m$. A comparison with Figure 1 clearly shows that this strategy leads to a significant improvement in the retrieval performance of the network. The probability of correct retrieval is close to 65% for H = 0.4 when an appropriate upper limit is set for the convergence time, whereas this probability is close to zero in the absence of any upper limit.
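The cutoff prescription can be sketched as follows: run the random sequential dynamics for at most $t_m$ sweeps, accept the result only if a fixed point (equation 2.7) is reached within the limit, and discard the run otherwise. This is a minimal NumPy sketch with a function name of our own choosing:

```python
import numpy as np

def retrieve_with_cutoff(J, S0, t_max, rng):
    """Random sequential dynamics (eq. 2.4) with an upper limit t_max
    on the convergence time, in units of updates per spin. Returns
    (final_state, sweeps) on convergence, or (None, t_max) when the
    run is discarded as a likely wrong retrieval."""
    S = S0.copy()
    for t in range(t_max):
        for i in rng.permutation(len(S)):
            h = J[i] @ S
            S[i] = 1 if h >= 0 else -1
        if np.all(S * (J @ S) >= 0):   # fixed-point check (eq. 2.7)
            return S, t + 1
    return None, t_max
```

For a correct retrieval near a memory, convergence typically occurs well within a limit of 16 sweeps, so the cutoff mainly rejects the slow runs that end in spurious states.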
The probability of correct retrieval for H = 0.4 increases to 73% for k = 0.1, indicating a further improvement caused by the introduction of asymmetry. Similar results were obtained for $t_m$ = 12 and 14 updates per spin.

We have also studied the dependence of the convergence times on the network size N. The results are shown in Figure 6. It can be seen that both the gap between the average convergence times for correct and incorrect retrievals and the effectiveness of synaptic asymmetry in widening this gap increase as the size of the system is increased. Therefore, the prescription proposed for improving the retrieval performance of the network is expected to work better for larger networks. The same conclusions hold for serial updating; the only difference is that in serial updating, the average convergence time is shorter for both memory states and spurious states.

4 Summary and Discussion
Figure 5: The probability of convergence to a memory state is plotted as a function of H, the Hamming distance of the initial configuration from the target memory. In these simulations, an upper limit of 16 updates per neuron was set for the time of convergence; only those runs that converged to fixed points within this time limit were used to calculate the probabilities. The results shown are for N = 500, α = 0.1, and two values of k: k = 0 (solid line) and k = 0.1 (triangles).

We have studied the retrieval properties of a Hopfield-like model in which a random antisymmetric part is added to the otherwise symmetric synaptic matrix. We find that the introduction of the antisymmetric component enhances the overall accessibility of the memory states. However, the size of the basin of attraction of a memory state does not show any significant change when asymmetry is introduced in the synaptic matrix. This is quite an unexpected result. We show that it is due to the fact that the spurious fixed points that are destabilized by the introduction of asymmetry are highly inaccessible to begin with. Therefore, their suppression does not have any appreciable effect on the basins of attraction of the more accessible fixed points. The convergence time to spurious fixed-point attractors increases faster than that to the memory states as the asymmetry parameter is increased. The possibility of convergence to spurious fixed points is
Figure 6: Dependence of the average convergence time τ on the number of neurons N of the network. The two curves at the bottom represent data for convergence to memory states; those at the top represent results for convergence to spurious states. The data shown are for α = 0.1. The Hamming distance of the initial state from the target memory was set at H = 0.3. The solid lines represent the results for the symmetric network, and the dashed lines correspond to k = 0.1.
greatly reduced if a suitable upper limit is set to the convergence time. This prescription for suppressing convergence to spurious fixed points becomes more effective in the presence of asymmetry in the synaptic interactions. Thus, the antisymmetric part can be very useful in the associative recall of the stored memories by slowing the process of wrong retrievals.

As the antisymmetric part of the coupling matrix is chosen randomly, we cannot expect it to have only the positive effects mentioned above. Analytic studies (Treves & Amit, 1988; Singh et al., 1995) of the behavior of this model in the large-N limit indicate that the introduction of random synaptic asymmetry decreases the storage capacity of the network and marginally degrades the quality of recall. Also, networks with asymmetric interconnections and serial updating may exhibit time-dependent attractors, such
as limit cycles of varying length (Nützel, 1991), in addition to fixed points. Since k = 0.1 corresponds to a small relative strength of the antisymmetric component, and possibly due to the smallness of the systems studied, we found little evidence of such adverse effects. These effects are probably responsible for the slight reduction of the recall probability (see Figure 1) as the asymmetry is turned on. The "optimal" value of k would have to be determined by balancing these negative effects against the positive ones discussed above. Our limited investigation of the behavior of the network for other values of k suggests that its optimal value is higher than 0.1. A more detailed study of this issue would be useful.

The retrieval properties of the network may improve if the antisymmetric component of the interconnection matrix is chosen in a systematic way instead of randomly. A recent study (Okada, Fukai, & Shiino, 1998) shows that random asymmetric dilution of the synaptic connections of a Hopfield-type model leads to a decreased storage capacity that can be rectified by a systematic dilution technique. Another indication of the beneficial effects of structured asymmetry is provided by the fact that some of the learning algorithms proposed (Krauth & Mézard, 1987; Gardner, 1988) for increasing the storage capacity and optimizing the local stability of the memory states lead to asymmetric synaptic connections.

We close with a few comments on the important question of whether the effects of synaptic asymmetry found in our study are also expected to be present, at least qualitatively, in other neural network models of associative memory. Our work and other similar studies (Crisanti & Sompolinsky, 1988; Gutfreund, Reger, & Young, 1988) of spin models with asymmetric interactions strongly suggest that the presence of synaptic asymmetry greatly reduces the number of spurious attractors.
However, it is not clear whether this reduction in the number of spurious attractors always leads to an enlargement of the basins of attraction of the memory states. This does not happen in our model or in a Hopfield model with asymmetric dilution of the synapses (Spitzner & Kinzel, 1989). In contrast, a recent study (Athithan & Dasgupta, 1997) shows that asymmetric dilution of the synapses of a network constructed using an optimized learning rule (Krauth & Mézard, 1987) based on linear programming causes a nearly total suppression of convergence to spurious attractors. To take another example, extreme asymmetric dilution (Derrida et al., 1987; Gardner, 1989) of the synapses leads to maximal size of the basins of attraction of the memory states in some but not all models. So the effects of synaptic asymmetry on the size of the basins of attraction of the memory states appear to be somewhat model specific. The asymmetry-induced relative increase of the convergence time to spurious attractors is expected (Crisanti & Sompolinsky, 1988; Nützel, 1991) to be quite general. Therefore, our prescription of introducing a moderate amount of asymmetry and then setting an upper limit to the convergence time is expected to be generally useful for improving the retrieval performance.
Acknowledgments
Z. C. thanks the Department of Physics, Indian Institute of Science, and T. V. Ramakrishnan for their hospitality during the period in which this work was carried out. Facilities for carrying out the numerical work were provided by the Supercomputer Education and Research Centre, Indian Institute of Science.

References

Amit, D. J. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Athithan, G., & Dasgupta, C. (1997). On the problem of spurious patterns in neural associative memory models. IEEE Trans. Neur. Net., 8, 1483–1491.
Crisanti, A., & Sompolinsky, H. (1987). Dynamics of spin systems with randomly asymmetric bonds: Langevin dynamics and a spherical model. Phys. Rev. A, 36, 4922–4939.
Crisanti, A., & Sompolinsky, H. (1988). Dynamics of spin systems with randomly asymmetric bonds: Ising spins and Glauber dynamics. Phys. Rev. A, 37, 4865–4879.
Derrida, B., Gardner, E., & Zippelius, A. (1987). An exactly solvable asymmetric neural network model. Europhys. Lett., 4, 167–173.
Eccles, J. C. (1964). The physiology of synapses. Berlin: Springer-Verlag.
Erichsen Jr., R., & Theumann, W. K. (1991). Mixture states and storage with correlated patterns in Hopfield model. Int. J. Neur. Sys., 1, 347–353.
Forrest, B. M. (1988). Content-addressability and learning in neural networks. J. Phys. A, 21, 245–255.
Fukai, T. (1990). Metastable states of neural networks incorporating the physiological Dale hypothesis. J. Phys. A, 23, 249–258.
Gardner, E. (1988). The space of interactions in neural network models. J. Phys. A, 21, 257–270.
Gardner, E. (1989). Optimal basins of attraction in randomly sparse network models. J. Phys. A, 22, 1969–1974.
Gardner, E., Derrida, B., & Mottishaw, P. (1987). Zero temperature dynamics for infinite range spin glasses and neural networks. J. Phys. (France), 48, 741–755.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Gutfreund, H., Reger, J. D., & Young, A. P. (1988). The nature of attractors in an asymmetric spin glass with deterministic dynamics. J. Phys. A, 21, 2775–2797.
Hertz, J. A., Grinstein, G., & Solla, S. A. (1987). Neural nets with asymmetric bonds. In L. N. van Hemmen & I. Morgenstern (Eds.), Heidelberg colloquium on glassy dynamics (pp. 538–546). Heidelberg: Springer-Verlag.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2558.
880
Z. Chengxiang, C. Dasgupta, and M. P. Singh
Hopeld, J. J. (1984). Neurons with gradedresponse have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092. Hopeld, J. J., Feinstein, D. I., & Palmer, R. G. (1983). “Unlearning” has a stabilizing effect in collective memories. Nature, 304, 158–159. Kepler, T. B., & Abbott, L. F. (1988). Domains of attraction in neural networks. J. Phys. (France), 49, 1657–1662. Koml os, ´ J., & Paturi, R. (1993). Effects of connectivity in an associative memory model. J. Comput. Sys. Sci., 47, 350–373. Krauth, W., & Mezard, M. (1987). Learning algorithms with optimal stability in neural networks. J. Phys. A, 20, L745–L752. Krauth, W., Nadal, J. P., & Mezard, M. (1988).The roles of stability and symmetry in the dynamics of neural networks. J. Phys., A, 21, 2995–3011. Nutzel, ¨ K. (1991). The length of attractors in asymmetric random neural networks with deterministic dynamics. J. Phys. A, 24, L151–L157. Okada, M., Fukai, T., & Shiino, M. (1998). Random and systematic dilutions of synaptic connections in a neural network with a nonmonotonic response function. Phys. Rev. E, 57, 2095–2103. Parisi, G. (1986). Asymmetric neural networks and the process of learning. J. Phys., A19, L675–L680. Singh, M. P., Zhang, C., & Dasgupta, C. (1995). Fixed points in a Hopeld model with random asymmetric interactions. Phys. Rev. E, 52, 5261–5272. Sompolinsky, H. (1987). The theory of neural networks: The Hebb rule and beyond. In L. N. van Hemmen & I. Morgenstern (Eds.), Heidelberg Colloquium on Glassy Dynamics. Heidelberg: Springer-Verlag. Spitzner, P., & Kinzel, W. (1989). Hopeld model with directed bonds. Z. Phys., 74, 539–545. Tanaka, T., & Yamada, M. (1993). The characteristics of the convergence time of associative neural networks. Neural Computation, 5, 463–472. Toulouse, G., Dehaene, S., & Changeux, J.-P. (1986). Spin glass model of learning by selection. Proc. Natl. Acad. Sci. USA, 83, 1695–1698. Treves, A., & Amit, D. J. (1988). 
Metastable states in asymmetrically diluted Hopeld networks. J. Phys. A, 21, 3155–3169. Waugh, F. R., Marcus, C. M., & Westervelt, R. M. (1990). Fixed-point attractors in analog neural computation. Phys. Rev. Lett., 64, 1986–1989. Received October 26, 1998; accepted May 26, 1999.
LETTER
Communicated by Shun-ichi Amari
On “Natural” Learning and Pruning in Multilayered Perceptrons
Tom Heskes
Foundation for Neural Networks, University of Nijmegen, 6525 EZ Nijmegen, The Netherlands

Several studies have shown that natural gradient descent for on-line learning is much more efficient than standard gradient descent. In this article, we derive natural gradients in a slightly different manner and discuss implications for batch-mode learning and pruning, linking them to existing algorithms such as Levenberg-Marquardt optimization and optimal brain surgeon. The Fisher matrix plays an important role in all these algorithms. The second half of the article discusses a layered approximation of the Fisher matrix specific to multilayered perceptrons. Using this approximation rather than the exact Fisher matrix, we arrive at much faster “natural” learning algorithms and more robust pruning procedures.

1 Introduction
Natural gradient has recently received quite a lot of attention (Amari, 1998; Rattray, Saad, & Amari, 1998; Yang & Amari, 1998). The basic idea is to replace the standard gradient by the natural gradient, defined as the steepest descent direction in Riemannian space (Amari, 1998). It has been shown (Rattray et al., 1998; Yang & Amari, 1998) that the use of natural gradients in on-line backpropagation learning for multilayered perceptrons (MLPs) leads to a tremendous speed-up in convergence. One of the important observations is that natural gradients greatly help to break the symmetry between hidden units (Rattray et al., 1998), which, using standard gradients, is the cause of long plateaus (long time spans in which the performance of the MLP barely improves—also referred to as temporary minima; see, among others, Ampazis, Perantonis, & Taylor, 1999; Saad & Solla, 1995; Wiegerinck & Heskes, 1996). The goal of this article is twofold:

A different formulation of natural gradients (section 2). We will derive “natural” gradients in a slightly different way. In this formulation, the appropriate metric follows naturally from the error function one tries to minimize (section 2.1). We discuss the overlap and differences with Amari’s approach (section 2.2). The natural equivalents of batch-mode
Neural Computation 12, 881–901 (2000)
© 2000 Massachusetts Institute of Technology
learning and pruning can be defined and linked to existing algorithms (sections 2.3 and 2.4).

An efficient approximation of the Fisher matrix for MLPs (sections 3 and 4). The Fisher matrix plays a key role in the theory about natural gradients. In practical applications of “natural” rather than “standard” learning or pruning procedures, one constantly has to compute and invert this matrix of size n_weights × n_weights, with n_weights the total number of network weights. The proposed approximation loosens this computational burden by orders of magnitude (section 4.1). For on-line learning, we show that the approximated Fisher matrix is sufficiently powerful to help break the symmetry between hidden units and escape from the corresponding plateau in the same order of iterations as with the exact Fisher matrix (section 4.2). In the context of batch-mode learning, we arrive at an alternative to Levenberg-Marquardt optimization, which can be much more efficient, especially for large networks (section 4.3). Pruning algorithms based on the approximated Fisher matrix already exist (although not interpreted in this way), and although in principle they are less powerful, they are claimed to be more robust than pruning algorithms using the exact Fisher matrix (Laar & Heskes, 1999) (section 4.4).

2 Natural Gradient in a Slightly Different Formulation

2.1 On-line Learning. We are concerned here with supervised learning. Given an input x, the output of a model with parameters W reads y(W, x). Both x and y can be vectors. We will use T to denote the transpose. In supervised learning, this output is compared with a given target vector t, and the error between the two is defined as d(y(W, x), t). Well-known error functions are the sum-squared error, d(y, t) = (y − t)²/2, and the cross-entropy, d(y, t) = t log[t/y] + (1 − t) log[(1 − t)/(1 − y)], but other error functions might be chosen as well, as long as they are at least twice differentiable with respect to y.
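The two error functions named above can be sketched directly; the scalar signatures and helper names below are illustrative assumptions, not from the article.

```python
# A minimal sketch of the two error functions of section 2.1, for scalar y, t:
# sum-squared d(y, t) = (y - t)^2 / 2 and cross-entropy
# d(y, t) = t log(t/y) + (1 - t) log((1 - t)/(1 - y)).
import math

def d_sum_squared(y, t):
    """Sum-squared error; zero exactly at y = t."""
    return (y - t) ** 2 / 2.0

def d_cross_entropy(y, t):
    """Cross-entropy for y, t in (0, 1); zero exactly at y = t."""
    return t * math.log(t / y) + (1.0 - t) * math.log((1.0 - t) / (1.0 - y))
```

Both are twice differentiable in y on their domain and vanish if and only if the output matches the target.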
In on-line learning, a weight update is based on one combination of inputs x and targets t. The standard stochastic gradient procedure is

    ΔW ≡ Ŵ − W = −η ∂d(y(W, x), t)/∂W,    (2.1)

with η a (usually small) learning rate, Ŵ the new, and W the current set of parameters. Amari’s work on natural gradient learning (see, e.g., Amari, 1998) has made very clear that this particular choice for the gradient is rather arbitrary and should be replaced by natural gradients, defined as the steepest descent direction in Riemannian space. Here we use a slightly different formulation to arrive at more or less the same results. The advantage of our formulation is that we can achieve similar “natural” gradients for any kind
of (twice differentiable) error function d(y, t), without the need to introduce (manifolds of) probability distributions or Riemannian spaces. First note that the standard gradient descent procedure (see equation 2.1) is, in lowest order of the learning parameter η, equivalent to¹

    Ŵ = argmin_{W'} [ η d(y(W', x), t) + (1/2) ||W' − W||² ],    (2.2)
with ||W||² ≡ Σ_i W_i² the Euclidean distance. The first term expresses the desire to learn the new training example and the second term the need to be conservative, that is, to retain the information acquired in previous learning trials (Kivinen & Warmuth, 1997). The obvious question is, Why use the Euclidean distance and not some other distance measure? Indeed the “natural” choice would be to use the same distance² function for comparing the outputs of the new model with the given targets as for comparing them with those of the old model: d(y(W', x), y(W, x)). To make this distance independent of the currently presented input x, we take the average over the distribution of all inputs; that is, we define

    D(W', W) = ⟨ d(y(W', x), y(W, x)) ⟩_x.    (2.3)
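The average in equation 2.3 can be estimated from a sample of inputs. The tiny linear model, the sum-squared error, and all variable names below are illustrative assumptions used only to make the sketch runnable.

```python
# Sketch: Monte Carlo estimate of D(W', W) = <d(y(W', x), y(W, x))>_x,
# averaging the output discrepancy of two parameter settings over sampled
# inputs. Linear model and sum-squared d are stand-in assumptions.
import numpy as np

def d(y_new, y_old):
    """Sum-squared error between two model outputs."""
    return 0.5 * np.sum((y_new - y_old) ** 2)

def model(W, x):
    """A hypothetical one-layer linear model y = W x."""
    return W @ x

def distance(W_new, W_old, inputs):
    """Estimate of D(W', W) over the input sample."""
    return np.mean([d(model(W_new, x), model(W_old, x)) for x in inputs])

rng = np.random.default_rng(0)
inputs = rng.normal(size=(100, 3))   # sample from the input distribution
W = rng.normal(size=(2, 3))
```

As the text notes, the estimate depends on the input distribution but not on the targets t; it is zero when the two models coincide.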
This distance function is independent of the distribution of the targets t. On the other hand, it depends on the distribution of the inputs x, so the underlying assumption is that we know or at least can estimate this distribution. We will discuss a batch-mode procedure, for which this is obviously not a problem. However, in the on-line version, knowledge of the input distribution is indeed an important assumption, on which all studies on natural gradient procedures for supervised learning are based. So replacing in equation 2.2 the Euclidean distance by the more “natural” D(W', W) of equation 2.3, we obtain

    Ŵ = argmin_{W'} [ η d(y(W', x), t) + D(W', W) ],    (2.4)
¹ For an exact correspondence with the gradient 2.1, we should write

    ΔW = η lim_{ε→0} (1/ε) { argmin_{W'} [ ε d(y(W', x), t) + (1/2) ||W' − W||² ] − W }.

In fact, this kind of limit can be viewed as an alternative definition of a gradient, to be compared with the schoolbook definition of a difference in the limit ε to zero, but this seems unnecessarily complicated for the current purposes.
² Since we do not require d(y, t) = d(t, y), the term distance is not always appropriate and could be replaced with divergence. We stick with distance because it does provide the correct sense.
which, in lowest order of η, corresponds to the gradient procedure

    ΔW = −η F⁻¹(W) ∂d(y(W, x), t)/∂W, with F(W) ≡ ∂²D(W', W)/∂W'∂W'ᵀ |_{W'=W}.    (2.5)
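A single update of the form of equation 2.5 can be sketched as follows; the Fisher matrix is assumed given, and the small diagonal jitter is an assumption for numerical stability, not part of the equation.

```python
# Minimal sketch of one natural gradient step, Delta W = -eta F^{-1} grad,
# per equation 2.5. F is assumed precomputed; a tiny jitter on the diagonal
# (an added assumption) keeps the linear solve well conditioned.
import numpy as np

def natural_gradient_step(W, grad, F, eta=0.01, jitter=1e-8):
    """Return the updated parameter vector W - eta * F^{-1} grad."""
    F_reg = F + jitter * np.eye(F.shape[0])
    return W - eta * np.linalg.solve(F_reg, grad)
```

With F equal to the identity, the step reduces to standard gradient descent, equation 2.1.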
We refer to F(W) as the Fisher matrix, although it need not be a Fisher matrix according to its stricter definition as in equation 2.8 below. In computing the Fisher matrix, it is quite convenient that all second-order derivatives of the outputs with respect to the parameters W disappear; that is, we have

    ∂²d(y(W', x), y(W, x))/∂W'∂W'ᵀ |_{W'=W} = [∂y(W, x)/∂W] h(y(W, x)) [∂y(W, x)/∂Wᵀ],    (2.6)

with

    h(y) = ∂²d(y', y)/∂y'∂y'ᵀ |_{y'=y}.    (2.7)
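The scalar curvatures h(y) of equation 2.7 can be checked numerically; the closed forms below (h = 1 for sum-squared error, h = 1/(y(1 − y)) for cross-entropy) follow from differentiating the error functions of section 2.1 twice, and the finite-difference helper is an illustrative assumption.

```python
# Numerical check of h(y) (equation 2.7) for the two error functions of
# section 2.1, via a central finite difference of d(y', y) in y' at y' = y.
import math

def h_numeric(d, y, eps=1e-4):
    """Second derivative of d(y', y) with respect to y' at y' = y."""
    return (d(y + eps, y) - 2.0 * d(y, y) + d(y - eps, y)) / eps ** 2

def d_sq(y_new, y_old):
    """Sum-squared error: h(y) = 1."""
    return (y_new - y_old) ** 2 / 2.0

def d_ce(y_new, y_old):
    """Cross-entropy between outputs: h(y) = 1 / (y (1 - y))."""
    return (y_old * math.log(y_old / y_new)
            + (1.0 - y_old) * math.log((1.0 - y_old) / (1.0 - y_new)))
```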
In other words, we only have to compute the second-order derivative h(y) of the error function with respect to the outputs, not the second-order derivative of the outputs with respect to the parameters W, which does appear in the computation of the Hessian matrix (see also section 2.3).

2.2 Relation to Amari’s Natural Gradients. In this section we first show that our definition of a “natural” gradient does indeed correspond to Amari’s expressions for error functions of the log likelihood or Kullback-Leibler type. Next we discuss when and how the two definitions of “natural” gradients differ. Let us consider the special case where the outputs y and targets t can be interpreted as probabilities or parameters in a probability density, and the error function d(y, t) is the Kullback-Leibler divergence between the probabilities implied by y and t:

    d(y, t) = ∫ dz p(z|t) log[ p(z|t) / p(z|y) ].
Both the sum-squared error (y refers to the mean of a gaussian distribution with fixed variance, that is, p(z|y) ∼ N(y, σ) with σ fixed and known) and the cross-entropy (y is the class probability, that is, p(z = 1|y) = y and p(z = 0|y) = 1 − y; the integral can be replaced by a sum over z ∈ {0, 1}), described in the beginning of section 2.1, can be written in this way up to irrelevant constants. For this kind of error function, we obtain, after some straightforward manipulations,

    F(W) = ∂²/∂W'∂W'ᵀ ⟨ ∫ dz p(z|W, x) log[ p(z|W, x) / p(z|W', x) ] ⟩_x |_{W'=W}
         = ⟨ ∫ dz p(z|W, x) [∂ log p(z|W, x)/∂W] [∂ log p(z|W, x)/∂Wᵀ] ⟩_x,    (2.8)
the Fisher information (Amari, 1985, 1998). In other words, we have shown here that in cases where the error function to be minimized can be interpreted as a Kullback-Leibler divergence, Amari’s natural gradient learning coincides with the formulation in equation 2.4. To understand when and how the two definitions of “natural” differ, let us first rephrase Amari’s arguments in our context. Given a manifold of probability distributions parameterized by W, the most general definition of information divergence under some invariance considerations and obvious constraints for divergences (nonnegativity, zero iff the two distributions are equivalent, triangle inequality) is the so-called f-divergence (Amari, 1985),

    D(W', W) = ⟨ ∫ dz p(z|W, x) f( p(z|W', x) / p(z|W, x) ) ⟩_x,    (2.9)
where f(·) is any at least twice differentiable convex function, strictly positive, except that f(1) = 0. The Kullback-Leibler divergence follows from f(r) = r − log r − 1. It is easy to show that any choice of f(·) yields, up to an irrelevant multiplying constant, the same Fisher information (see equation 2.8). This makes the Fisher information matrix the preferred metric in the manifold of probability distributions—that is, the only metric that is invariant under reparameterizations. In Amari’s formulation, there is no direct link between the first and the second term on the right-hand side of equation 2.4. No matter what the first term is, the second term is an f-divergence of the form of equation 2.9, and thus the gradient of the first term has to be transformed using the Fisher information metric, equation 2.8. In the formulation presented in section 2.1, there does exist a direct link between the two terms. Here the chosen distance function, since it depends only on the model outputs given the model parameters, by definition also ensures invariance under reparameterization. However, it is not unique: not every distance function with the appropriate invariance conditions gives rise to the same Fisher matrix (see equation 2.5). The remaining indifference stems from the fact that different choices for d(y', y) in the definition of D(W', W) can lead to different second-order derivatives h(y) in equation 2.7, which yield different Fisher matrices in equation 2.5. As long as the error function d(y, t) can be interpreted as some kind of f-divergence between probability distributions implied by the model outputs y and targets t, the two formulations yield exactly the same procedure. This is the case for most commonly used error functions, such as sum-squared error and cross-entropy. The formulation of section 2.1 is more general in that it also yields some kind of “natural” gradient if there is no such interpretation.
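The invariance claim for f-divergences can be illustrated numerically on the simplest case, a Bernoulli model p(z = 1) = w with no inputs; the two choices of f below both have f''(1) = 1, and the finite-difference helper is an assumption for illustration.

```python
# Numerical illustration: for a Bernoulli model p(z=1) = w, different choices
# of f in equation 2.9 give the same curvature at w' = w (up to the factor
# f''(1)), namely the Fisher information 1 / (w (1 - w)).
import math

def f_div(f, w_new, w_old):
    """D_f between Bernoulli(w_new) and Bernoulli(w_old), per equation 2.9."""
    return (w_old * f(w_new / w_old)
            + (1.0 - w_old) * f((1.0 - w_new) / (1.0 - w_old)))

def f_kl(r):
    """Kullback-Leibler choice, f(r) = r - log r - 1; f''(1) = 1."""
    return r - math.log(r) - 1.0

def f_chi(r):
    """Half the Pearson chi-square choice, f(r) = (r - 1)^2 / 2; f''(1) = 1."""
    return (r - 1.0) ** 2 / 2.0

def curvature(f, w, eps=1e-4):
    """Second derivative of D_f(w', w) in w' at w' = w, by finite differences."""
    return (f_div(f, w + eps, w) - 2.0 * f_div(f, w, w)
            + f_div(f, w - eps, w)) / eps ** 2
```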
Amari’s formulation, on the other hand, is more general in that it can be applied to learning rules that cannot be derived from an error function, but for which it is possible to define the Riemannian structure (and impossible or more difficult to define an appropriate distance function). An example is the application of natural gradients to independent component analysis (see, e.g., Amari, 1998).

2.3 Batch-Mode Learning. The above description, like Amari’s work on natural gradients, focuses on on-line learning. However, exactly the same line of reasoning can be followed for gradient-descent batch-mode learning. In batch-mode learning, a parameter update is obtained by averaging over all inputs x and corresponding targets t; that is, in equations 2.1, 2.2, 2.4, and 2.5, we can simply replace d(y(W, x), t) by ⟨d(y(W, x), t)⟩_{x,t}. In a batch-mode scenario, we can interpret equation 2.2 as a smooth approach to the minimum of ⟨d(y(W, x), t)⟩_{x,t}, trying to keep some history about the initial conditions by staying close to the previous set of parameters. This interpretation might explain some of the popularity of early stopping, where one starts with small parameters (a prior with no functional dependency between inputs and outputs) and terminates the procedure when the model starts overfitting. In this context, the distance measure used for comparing W' and W in equations 2.2 and 2.4 can be viewed as a kind of (dynamic) prior. Pushing this analogy even further, the Euclidean distance in equation 2.2 corresponds to a prior similar to standard weight decay, whereas the “natural” distance D(W', W) of equation 2.3 implements a dynamic version of Jeffreys’ prior (see any textbook on Bayesian statistics, e.g., Gelman, Carlin, Stern, & Rubin, 1995). There is another link, maybe not so much in spirit, but more in the expressions that have to be computed, with the Levenberg-Marquardt method for least-squares fitting (see, e.g., Luenberger, 1984, and many other textbooks on optimization).
The idea behind the Levenberg-Marquardt method is that the error ⟨d(y(W, x), t)⟩_{x,t} is, at least close to its minimum, quite well approximated by a quadratic form. If this is indeed a good approximation, we can jump to the minimum in a single leap, through (compare with equation 2.5),

    ΔW = −H⁻¹(W) ∂⟨d(y(W, x), t)⟩_{x,t}/∂W, with H(W) ≡ ∂²⟨d(y(W, x), t)⟩_{x,t}/∂W∂Wᵀ.    (2.10)

Now, the Levenberg approximation of the Hessian matrix H(W) for sum-squared error is nothing but our definition of the Fisher matrix:

    H(W) ≈ ⟨ [∂y(W, x)/∂W] [∂y(W, x)/∂Wᵀ] ⟩_x = F(W),    (2.11)
where in the first step the second-order derivative of y(W, x) with respect to the parameters W is neglected, and where the last step follows from equation 2.6 and h(y) = 1 for d(y, t) = (y − t)²/2 in equation 2.7. In other words, except for some technicalities like adding a constant to the diagonal, Levenberg-Marquardt least-squares fitting is equivalent to a greedy version of batch-mode natural gradient learning for sum-squared error. In fact, this correspondence need not be restricted to sum-squared error: the Fisher matrix F(W) for general error functions defined in equation 2.6 is indeed a valid approximation of the Hessian for these error functions, similar in spirit to the Levenberg approximation, equation 2.11, for the Hessian of sum-squared error.

2.4 Natural Pruning. Pruning seems quite different from learning. With pruning, one intends to remove from a trained model those parameters that hardly affect, or maybe even cause to deteriorate, the performance of the model. The reason to discuss pruning here is that most of the existing and popular pruning algorithms can be understood in terms of distance functions and metrics. In fact, as we will see, optimal brain surgeon (OBS) (Hassibi & Stork, 1993) can be interpreted as the natural equivalent of pruning based on weight magnitude. Usually pruning is done in an iterative procedure. Starting with the full model, one successively removes those parameters that appear to be least relevant. The better the performance of the model with a particular parameter (or set of parameters) left out (i.e., the lower the error), the less relevant this parameter. To quantify the relevance of a particular subset of parameters W_k, one has to go through the following two steps:
1. Find the new model Ŵ(k) closest to the original W according to the predefined distance function D(W', W), with the constraint W'_k = 0 (compare with equation 2.4):

    Ŵ(k) = argmin_{W', W'_k = 0} D(W', W).

Below we will elaborate on specific choices for D(W', W).

2. Compute the error of this new model according to some predefined criterion, E_k = E(Ŵ(k)). Often used criteria are training, validation, or test error or an estimate of these (see, e.g., Pedersen, Hansen, & Larsen, 1996), but also the distance to the original model, as, for example, in OBS, E(W') = D(W', W).

Apart from all kinds of more technical details, the principal difference between different pruning strategies is their choice for the distance function D(W', W) and, to a lesser extent, their choice for the performance criterion E(W'). A simple, and not recommended, choice would be the Euclidean distance D(W', W) = ||W' − W||² combined with E(W') = D(W', W), yielding
E_k = ||W_k||². The resulting algorithm is: prune the weights according to their magnitude. OBS (Hassibi & Stork, 1993) (see also its “generalized” version; Stahlberger & Riedmiller, 1997) can be interpreted as the “natural” equivalent of pruning based on weight magnitude. The distance function used in OBS, although derived from a different perspective, is the quadratic approximation of the “natural” distance, equation 2.3:

    D(W', W) = ⟨ d(y(W', x), y(W, x)) ⟩_x ≈ (1/2) (W' − W)ᵀ F(W) (W' − W),    (2.12)
with F(W) the Fisher matrix given in equation 2.5. So there is a close link between (natural) pruning and learning: it all boils down to the definition of an appropriate distance between two different models. Until now, everything has been quite general; apart from yielding an output y given an input x, we have not made any assumptions about our model with parameters W. Some of the algorithms mentioned in the previous sections have been derived and introduced specifically for neural networks. But although in practice there may be some advantages to a neural interpretation or implementation, in principle, they can be applied to any parameterized model.

3 The Fisher Matrix for Multilayered Perceptrons
The Fisher matrix plays a key role in the theory and algorithms discussed in the previous sections. In this section, we propose a simplification of the Fisher matrix, specific to MLPs. By replacing the exact Fisher matrix with this approximation, we hope to arrive at algorithms that make the most out of the layered structure of MLPs, by either increasing their speed or improving their robustness.

3.1 Notation. Let us consider an MLP with L layers of weights. Our notation will be slightly different from the one used in the previous section. y_l is an n_l × 1 vector with the output activities of the lth layer, where layer 0 is the input layer and layer L the output layer. In other words, the outputs y in the previous section are now called y_L; that is, the error function depends on only the outputs y_L. x_l stands for the input activity of the units in layer l. They are related to y_{l−1} and y_l through

    x_l = W_l y_{l−1} and y_l = f_l(x_l),    (3.1)

with f_l(·) some kind of transfer function (usually a sigmoid for hidden units, by definition the identity for the input units), and W_l an n_l × (n_{l−1} + 1) matrix with weights, where one column has been added to include biases (e.g., by defining y_{l0} = 1 for all layers l < L). The input x of the previous section
corresponds to x₀ = y₀. (We will still use ⟨...⟩_x to denote an average over all inputs and ⟨...⟩_{x,t} for an average over both inputs and targets.) The set of parameters W is now the combination of all weights {W₁, ..., W_L}. For notational convenience, we assume that there are no connections across layers of hidden units. What follows can be easily generalized to this more general case.

3.2 The Exact Fisher Matrix. Equation 3.1 fully specifies the forward pass through the network. Gradient and Fisher matrix follow from the usual backpropagation rules. With definitions
    γ ≡ ∂d(y_L, t)/∂y_L and Γ_l ≡ ∂y_L/∂x_lᵀ,

it is easy to derive

    ∂d(y_L(W, x), t)/∂W_l = [Γ_lᵀ γ] y_{l−1}ᵀ,    (3.2)

where the n_L × (n_{l−1} + 1) matrices Γ_l can be computed efficiently using the backprop rule, Γ_{l−1} = Γ_l W_l f'(x_{l−1}). The components of the Fisher matrix directly follow from equation 2.6 combined with the above definitions:

    F_{ll'}(W) = ⟨ [Γ_lᵀ h Γ_{l'}] [y_{l'−1} y_{l−1}ᵀ] ⟩_x,    (3.3)
where h ≡ h(y_L) from equation 2.7.

3.3 An Approximate Fisher Matrix for MLPs. Although the expressions may seem straightforward, what we have to deal with is a full n_weights × n_weights-dimensional matrix, with n_weights the total number of weights. We have to compute and, perhaps even worse, invert this matrix each iteration. For example, this is what makes Levenberg-Marquardt optimization unpopular for applications with large MLPs. Do we really need this complexity? Considering the way we arrived here, maybe not. The Fisher matrix follows from our choice of a “natural” metric to compute the distance between the network before and after the update. As we will see, an approximation of the Fisher matrix leads to a different choice of this metric. Or, the other way around, maybe we can choose a different metric that leads to a much faster “natural” learning procedure. Here we will arrive at such a metric by simplifying the exact Fisher matrix.
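The exact machinery of section 3.2 (the forward pass of equation 3.1 and the backprop rule for Γ_l) can be sketched as follows; the tanh hidden units, linear output, and bias convention are illustrative assumptions.

```python
# Sketch of equation 3.1 and the backprop rule Gamma_{l-1} = Gamma_l W_l f'(x_{l-1})
# of section 3.2, assuming tanh hidden units and a linear (identity) output layer.
import numpy as np

def forward(weights, y0):
    """Forward pass: x_l = W_l [y_{l-1}; 1], y_l = f_l(x_l). Returns (xs, ys)."""
    ys, xs = [np.asarray(y0, float)], []
    L = len(weights)
    for l, W_l in enumerate(weights, start=1):
        x_l = W_l @ np.append(ys[-1], 1.0)          # bias unit y_{l,0} = 1
        xs.append(x_l)
        ys.append(x_l if l == L else np.tanh(x_l))  # identity at the output
    return xs, ys

def gammas(weights, xs):
    """Backprop Gamma_l = dy_L/dx_l, starting from Gamma_L = identity."""
    L = len(weights)
    G = [None] * (L + 1)
    G[L] = np.eye(xs[-1].size)                      # linear output layer
    for l in range(L, 1, -1):
        W_no_bias = weights[l - 1][:, :-1]          # drop the bias column
        fprime = 1.0 - np.tanh(xs[l - 2]) ** 2      # tanh'(x_{l-1})
        G[l - 1] = (G[l] @ W_no_bias) * fprime
    return G[1:]                                    # [Gamma_1, ..., Gamma_L]
```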
The first simplification is to treat different layers independently; we make the Fisher matrix block diagonal by ignoring the cross-terms that occur from changes in weights across layers:

    F_{ll'} ≈ F_{ll} δ_{ll'} = ⟨ [Γ_lᵀ h Γ_l] [y_{l−1} y_{l−1}ᵀ] ⟩_x δ_{ll'}.
There is a very good reason to do this: these effects—for example, that a change in a weight can be compensated for by a change in a weight in a different layer—are highly nonlinear, so the quadratic approximation implied by the Fisher matrix may not be accurate at all (see Laar & Heskes, 1999, and section 4.4). The other simplification is to neglect correlations between the derivatives Γ_l and the activities y_{l−1}:

    F_{ll'} ≈ [C̄_l C_{l−1}] δ_{ll'} with C̄_l ≡ ⟨ Γ_lᵀ h Γ_l ⟩_x and C_l ≡ ⟨ y_l y_lᵀ ⟩_x.    (3.4)
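Estimating the two per-layer blocks of equation 3.4 from samples can be sketched as follows; the sum-squared case h = 1 and the random stand-in samples are illustrative assumptions.

```python
# Sketch of the per-layer blocks of equation 3.4 for sum-squared error (h = 1):
# C_bar_l = <Gamma_l^T Gamma_l>_x and C_{l-1} = <y_{l-1} y_{l-1}^T>_x,
# estimated by averaging over a sample of patterns.
import numpy as np

def approx_fisher_blocks(gammas, activities):
    """Average Gamma_l^T Gamma_l and y_{l-1} y_{l-1}^T over the pattern sample."""
    C_bar = np.mean([G.T @ G for G in gammas], axis=0)
    C_prev = np.mean([np.outer(y, y) for y in activities], axis=0)
    return C_bar, C_prev
```

Inverting these small per-layer blocks is what replaces the inversion of the full n_weights × n_weights matrix; for L = 2 layers of n = 5 units this is the factor-250 speed-up quoted in section 4.1.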
Again, this seems to be a reasonable approximation: the term Γ_lᵀ h Γ_l measures the impact of changes in hidden unit activities on the error. That is, it depends on what happens upstream (picturing an information flow from input to output), whereas the activities y_{l−1} are directly affected by what happens downstream. Now we can go back and derive the metric that corresponds to this approximation of the Fisher matrix. First, we note that the quadratic approximation, equation 2.12, yields exactly the same natural gradient procedure as the original D(W', W), since only the second-order derivative with respect to the first argument matters. By construction it is quadratic in W' but not in W (for nonconstant F(W)). The approximate distance function, which follows from substitution of the approximate Fisher matrix, equation 3.4, in equation 2.12, and some rewriting, has the same asymmetry:

    D_approx(W', W) = (1/2) Σ_{l=1}^{L} ⟨ [x_l(W) − W'_l y_{l−1}(W)]ᵀ C̄_l(W) [x_l(W) − W'_l y_{l−1}(W)] ⟩_x,    (3.5)
with x_l(W) and y_l(W) the incoming and outgoing activities of the units in layer l for a network with parameters W. In other words, the approximate Fisher matrix of equation 3.4 corresponds to a sum-squared error between the activities in all layers of the network, weighted by C̄_l(W), which measures how, on average, changes in these activities affect the error on the outputs of the network.

4 Implications for Learning and Pruning

4.1 Inversion of the Fisher Matrix. The reduction in computational complexity for inverting the Fisher matrix is enormous. We have to invert
the matrices C_l of size (n_l + 1) × (n_l + 1) for l = 0 to L − 1 and the matrices C̄_l of size n_l × n_l for l = 1 to L. To give a rough order of magnitude, suppose that all layers are of about the same size, n_l = n, and that inverting an n × n matrix requires n³ operations. Then, neglecting biases, inverting the exact Fisher matrix requires L³n⁶ (total number of weights cubed) operations, whereas inverting the approximated Fisher matrix requires only 2Ln³ operations (2L × number of units per layer cubed). For a moderate MLP with one layer of hidden units (L = 2) and n = 5 units per layer, we are already talking about a factor 250 speed-up. Computing the forward and backward pass through the network to compute the batch-mode gradient each requires on the order of Ln²p operations (number of weights × the number of patterns p), which is for reasonable amounts of patterns (p of the same order as n) about as time-consuming as inverting the approximate Fisher matrix. Computing the approximate Fisher matrix is only about twice as time-consuming as computing the batch-mode gradient. Note further that C₀, the covariance matrix of the network inputs, needs to be computed and inverted only once (in principle; see, however, section 4.3). So, roughly speaking, using the “natural” batch-mode gradient with the approximated Fisher matrix makes batch-mode learning only a factor 2 slower when all layers are of about the same size. Large variations in the size of the layers and a relatively small number of patterns increase the relative overhead, but using the approximate Fisher matrix is always an order of magnitude faster than working with the exact Fisher matrix.

4.2 Approximate On-line “Natural” Learning. Combining equations 2.5, 3.2, and 3.4, we obtain for the weights W_l in layer l the on-line learning rule

    ΔW_l = −η F_{ll}⁻¹(W) ∂d(y_L(W, x), t)/∂W_l = −η { C̄_l⁻¹ [Γ_lᵀ γ] } { C_{l−1}⁻¹ y_{l−1} }ᵀ.    (4.1)
The second term between braces is the output activity of the preceding layer of units y_{l−1}, but then normalized and decorrelated by the inverse of the corresponding covariance matrix C_{l−1}. This idea, to decorrelate the activities of the layers before computing the weight update, is not new (see, e.g., Almeida & Silva, 1992), but here it comes out naturally as an approximation to natural gradient learning. An interpretation of the first term between braces is slightly more involved. Recall that

    [Γ_lᵀ γ] = ∂d(y_L(W, x_l), t)/∂x_l

³ In some special cases, where the input distribution is known, e.g., gaussian, the Fisher matrix can be calculated analytically and inverted even in order n²_input time, if n_input, the number of inputs, is much larger than the number of hidden and output units (Yang & Amari, 1998).
and

    C̄_l = ⟨ ∂²d(y_L(W, x'_l), y_L(W, x_l))/∂x'_l ∂x'_lᵀ |_{x'_l = x_l} ⟩_x,

that is, [Γ_lᵀ γ] measures the impact of a change in x_l on the error d(y_L(W, x_l), t), while C̄_l is some kind of Fisher matrix, but now considering changes with respect to x_l rather than with respect to a set of parameters. The general effect of incorporating curvature information in this way is to make larger steps in flat directions and smaller steps in steep directions. So here, by multiplying the gradient [Γ_lᵀ γ] with the inverse of the curvature C̄_l, updates are more focused to create differences between the activities of units in the same layer. The tendency of hidden units to represent the same features, which often appears in the initial stages of learning, is the cause of the notorious plateaus, long time spans in which the performance of the MLP hardly changes. In previous studies it has been shown that incorporating curvature information helps a great deal to break the symmetry between the hidden units and thus considerably shortens the length of the plateau (see, e.g., Rattray et al., 1998; Yang & Amari, 1998). The obvious question is then whether we need the full Fisher matrix to achieve symmetry breaking or can do with the approximation as well.

4.2.1 Learning the Tent Map. To answer this question, we consider a simple simulation study: on-line learning of the tent map. In various studies it has been shown that standard on-line learning of the tent map is severely hampered by the existence of a long plateau (see, e.g., Hondou & Sawada, 1994; Wiegerinck & Heskes, 1996). In this work, the effect of correlations between subsequently presented patterns has been studied. Here we simply view it as a static problem, with a fixed training set of 100 patterns. The 100 inputs x were homogeneously drawn from the interval [−1, 1], and the corresponding targets t obey t = 1 − |x|. Our MLP has two hidden units (see Figure 1). Initial weights were drawn at random from a gaussian with zero average and standard deviation 0.1.
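The training set and initialization described above can be sketched directly; the seed and the count of 7 parameters (a 1-2-1 MLP with biases) are illustrative assumptions.

```python
# Sketch of the tent-map training set: 100 inputs drawn uniformly from
# [-1, 1], targets t = 1 - |x|, and initial weights from a zero-mean
# gaussian with standard deviation 0.1.
import numpy as np

rng = np.random.default_rng(0)                # seed is an arbitrary choice
x = rng.uniform(-1.0, 1.0, size=100)          # 100 homogeneously drawn inputs
t = 1.0 - np.abs(x)                           # tent-map targets
init_w = rng.normal(0.0, 0.1, size=7)         # 1-2-1 MLP: 2+2 input-side, 2+1 output-side
```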
We performed three runs of on-line learning, each starting from the same set of initial weights and with the same sequence of pattern presentations:

Standard on-line. ΔW = −η ∂d(y(W, x), t)/∂W with η = 0.1. With a higher learning parameter, fluctuations start to become unacceptably large. A smaller learning parameter further reduces the chance to escape from the plateau.
Exact natural gradient. D W D ¡g[F(W) Cm ]¡1
@d( y( W,x), t) @W
with F( W ) the Fisher matrix numerically computed based on the 100 inputs x, g D 0.01 andm D 0.01 a small constant added to the diagonal for stabilization (similar to the additional constant in Levenberg-Marquardt).
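As a concrete sketch of the setup just described (variable names are ours, not from the paper's code), the tent-map training set and the two-hidden-unit network can be generated as follows:

```python
import math
import random

random.seed(0)

# 100 inputs drawn homogeneously from [-1, 1]; targets t = 1 - |x|.
xs = [random.uniform(-1.0, 1.0) for _ in range(100)]
ts = [1.0 - abs(x) for x in xs]

def mlp(x, w, b, v, b_out):
    """Two-hidden-unit MLP with tanh transfer functions."""
    return b_out + sum(vj * math.tanh(wj * x + bj)
                       for wj, bj, vj in zip(w, b, v))

# Initial weights: gaussian with zero average and standard deviation 0.1.
w = [random.gauss(0.0, 0.1) for _ in range(2)]      # input-to-hidden
b = [random.gauss(0.0, 0.1) for _ in range(2)]      # hidden biases
v = [random.gauss(0.0, 0.1) for _ in range(2)]      # hidden-to-output
b_out = random.gauss(0.0, 0.1)                      # output bias

# On the plateau, b_out drifts toward the target mean (about 0.5) and the
# error barely changes until the hidden-unit symmetry is broken.
err = sum((mlp(x, w, b, v, b_out) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)
```

With such near-zero initial weights, both hidden units compute almost the same (nearly constant) function, which is precisely the symmetric configuration the curvature-based updates are meant to break.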
"Natural" Learning and Pruning in Multilayered Perceptrons
Tom Heskes
Figure 1: On-line learning of the tent map. The architecture of the network (middle), evolution of all weights and biases (on the sides), and the error (top) for three different learning procedures: standard on-line (dash-dotted; upper time axis), exact Fisher matrix (solid), and approximate Fisher matrix (dashed).
Approximate natural gradient.
$$\Delta W = -\eta \, [F_{\mathrm{approx}}(W) + \mu]^{-1} \, \frac{\partial d(y(W, x), t)}{\partial W},$$
with the same $\eta = 0.01$ and $\mu = 0.01$ as for natural gradient learning, and where $F_{\mathrm{approx}}(W)$ stands for the proposed approximation of the Fisher matrix. In other words, except for the replacement of the exact by the approximate Fisher matrix, the conditions for the latter two runs were completely equivalent. Standard on-line learning is used for reference. The results are displayed in Figure 1. The plot in the upper-left corner displays the error $\langle d(y(W, x), t) \rangle_{x,t}$ as a function of time (number of patterns presented) for the three runs. The upper time axis is for standard on-line
learning (dash-dotted lines); the lower time axis is for exact natural gradient (solid lines) and approximate natural gradient (dashed lines). The other plots show the evolution of all the weights and biases, where the time axes are the same as for the errors (20 standard on-line steps for 1 exact or approximate natural gradient step). Standard on-line learning has great difficulty escaping from the initial plateau. This plateau corresponds to a solution with all weights and biases close to zero except for the output bias, which is roughly 0.5, the average of the targets $t$; there is barely any functional dependence on the inputs $x$. The two runs using curvature information break the symmetry between the hidden units fairly rapidly and escape from this plateau. The run using the approximate Fisher matrix manages this even a little faster than the run using the exact Fisher matrix. After breaking this symmetry, both converge to a good solution (error < 0.001), the one for the exact Fisher matrix slightly better than the one for the approximate Fisher matrix. Both are still slowly improving. A closer look shows that the run using the approximate Fisher matrix in fact "jumps" from the initial plateau to a second, much lower plateau. The first plateau, the one that is so hard to escape for standard on-line learning, has to do with breaking the symmetry between the hidden units. Similar plateaus are the subject of many analytic studies of on-line learning in large networks (number of input units going to infinity with a fixed number of hidden units) (Rattray et al., 1998). The second plateau, however, is caused by the fact that small changes in the input-to-hidden weights and small (opposite) changes in the hidden-to-output weights almost cancel.
This effect is neglected by the proposed approximation, yet properly taken into account by the full Fisher matrix, which does "look across layers of hidden units." In short, the simulation shows that the approximate Fisher matrix is sufficiently powerful to break the symmetry between the units and reduces the length of the plateau by the same order of magnitude as the full Fisher matrix. A second plateau, which has to do with interdependencies between weights in different layers, cannot be tackled efficiently using the approximate Fisher matrix, but disappears when using the exact Fisher matrix.

4.2.2 Escape Times for Symmetry Breaking. The simulation described above is just a single instance of a single problem. To check whether the general picture is reproducible, we compared the three different learning strategies (standard on-line, exact natural gradient, and approximate natural gradient) on three different on-line learning problems and averaged over many different instances, generated by randomizing over both initial conditions and the presentation order of the training patterns. We focus on the number of iterations it takes to escape from the first plateau, which requires breaking the symmetry between hidden units. We considered the following problems. Unless specified otherwise, networks were initialized with weights drawn independently from a gaussian distribution with mean zero and standard deviation 0.01, had two hidden
units with hyperbolic tangent transfer functions, and were trained to minimize sum-squared error. For standard on-line learning, we took learning parameter $\eta = 0.1$; for both exact and approximate natural gradient learning, $\eta = 0.01$ with a constant $\mu = 0.01$ added to the diagonal to prevent ill-conditioning. In all simulations, these appeared to be close to optimal settings for the learning parameters: with much smaller learning parameters, breaking the symmetry takes (even) more time, while much larger learning parameters lead to numerical instabilities.

XOR. Two binary inputs, one output, $t = 0$ for $x = \{1, 1\}$ and $x = \{-1, -1\}$, $t = 1$ for $x = \{1, -1\}$ and $x = \{-1, 1\}$. At each iteration, one of the four training patterns is drawn at random.

Tent map. As described above, with $x$ at each iteration drawn homogeneously from the interval $[-1, 1]$ and $t = 1 - |x|$. Exact and approximate Fisher matrices were obtained through numeric integration.

Soft committee. A kind of standard problem used for analyzing neural network learning in the statistical mechanics framework (see, e.g., Rattray et al., 1998). Here we considered $N = 10$ inputs, at each iteration independently drawn from a gaussian with zero mean and unit standard deviation. The target comes from a "teacher," another neural network with exactly the same architecture as the "student." Both student and teacher have no thresholds, one output, and all hidden-to-output weights fixed to one. To obtain plateaus of about the same length as for the other problems, we had to initialize with extremely small weights, drawn from a gaussian distribution with standard deviation $10^{-15}$ rather than 0.01 (apparently symmetry breaking in soft-committee machines is relatively much easier than in learning the XOR problem or the tent map). Exact and approximate Fisher matrices were obtained through numeric integration (which is why we considered a relatively small network). Besides randomizing the initial conditions of the "student" and the inputs $x$ drawn at each iteration, we also randomized over "teachers": at the beginning of each run, the input-to-hidden weights of the teacher network were independently drawn from a gaussian distribution with standard deviation $1/\sqrt{N}$.

On each of the problems, we performed 1000 independent runs for each of the on-line learning procedures. For each run, we recorded the number of iterations needed to escape from the plateau, that is, the number of iterations before the training error reached a level midway between the error on the plateau (before symmetry breaking) and the near-optimal error (after symmetry breaking). The results are displayed in Figure 2. With standard on-line learning (dash-dotted lines), only 11 out of 1000 networks managed to escape from the plateau within $10^4$ iterations for the XOR problem, and none for the tent map. The results for natural learning
Figure 2: Number of iterations needed to escape from the plateau for the XOR problem, learning the tent map, and imitating a soft committee machine for three different on-line learning procedures: standard on-line (dash-dotted), exact Fisher matrix (solid), and approximate Fisher matrix (dashed).
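The escape-time criterion used here (the first iteration at which the training error crosses the midpoint between the plateau level and the near-optimal level) can be sketched as follows; the error trace below is synthetic, purely for illustration:

```python
def escape_time(errors, plateau_error, optimal_error):
    """Index of the first error below the midpoint between the plateau
    level and the near-optimal level; None if the run never escapes."""
    threshold = 0.5 * (plateau_error + optimal_error)
    for i, e in enumerate(errors):
        if e < threshold:
            return i
    return None

# Synthetic trace: 50 iterations stuck at 0.25, then an exponential drop.
trace = [0.25] * 50 + [0.25 * 0.9 ** k for k in range(1, 60)]
t_esc = escape_time(trace, plateau_error=0.25, optimal_error=0.001)  # 56
```

Returning `None` for runs that never cross the threshold mirrors the paper's bookkeeping for the runs that fail to escape within the iteration budget.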
using the approximate and exact Fisher matrices are comparable: where the approximate version seems to have somewhat more difficulty with breaking the symmetry between the hidden units for the XOR problem, the exact Fisher matrix has more problems with the tent map. Learning the soft committee machine appeared to be relatively easy for all procedures, and the differences between them are much less striking. As argued in Heskes and Coolen (1997), the plateaus for learning the soft committee machine and related problems studied in the statistical mechanics literature are an order of magnitude easier to tackle than the plateaus that occur when learning the tent map or the XOR function. In the former case, the plateau corresponds to a saddle point of the error: the gradient on the plateau is negligible, but the Hessian has a finite negative eigenvalue that drives the escape. In the latter two cases, the error surface really is flat: the smallest eigenvalue of the Hessian matrix is negligible as well (zero curvature), which makes it much harder to escape.

4.3 Using the Approximate Fisher Matrix in Levenberg-Marquardt.
As explained in section 2.3, there is a close connection between batch-mode natural learning and Levenberg-Marquardt optimization. With Levenberg-Marquardt, weight changes obey (compare with equation 2.10)
$$\Delta W = -[F(W) + \mu]^{-1} \left\langle \frac{\partial d(y_L(W, x), t)}{\partial W} \right\rangle_{x,t}, \tag{4.2}$$
where $\mu$ is a constant added to the diagonal of $F(W)$ to ensure that the error after the learning step is lower than the error before the learning step (an extremely large $\mu$ corresponds to a standard batch-mode gradient-descent weight update with learning parameter $\mu^{-1}$).
Simply replacing the exact $F(W)$ by its approximation, equation 3.4, would yield (compare with equation 4.1)
$$\Delta W_l = -[F_{ll}(W) + \mu]^{-1} \left\langle \frac{\partial d(y_L(W, x), t)}{\partial W_l} \right\rangle_{x,t} \quad \text{with} \quad F_{ll}(W) = C_l \otimes C_{l-1}.$$
However, adding the constant to the diagonal destroys the decomposition $F_{ll}^{-1} = C_l^{-1} \otimes C_{l-1}^{-1}$, which considerably speeds up the computation of the inversions. The alternative, with similar properties but keeping the decomposition intact, is to add the constant not to the diagonal of $F_{ll}(W)$ but to both $C_l$ and $C_{l-1}$:
$$\Delta W_l = -\left[ (C_l + \sqrt{\mu})^{-1} \otimes (C_{l-1} + \sqrt{\mu})^{-1} \right] \left\langle \frac{\partial d(y_L(W, x), t)}{\partial W_l} \right\rangle_{x,t}. \tag{4.3}$$
For large $\mu$ we still obtain a gradient step with learning parameter $\mu^{-1}$, whereas for small $\mu$ the update is dominated by the approximated "natural" gradient.4 As an illustration, we compared three different batch-mode procedures for training a neural network with eight hidden units on the Boston housing data set: standard Levenberg-Marquardt as in equation 4.2, "layered" Levenberg-Marquardt as in equation 4.3, and, for reference, (almost) steepest descent (equation 4.2 with $F(W) \equiv 0$, or equation 4.3 with $C_l \equiv 0$ and $C_{l-1} \equiv 0$). All 506 patterns were used for training. The training error as a function of CPU time is plotted in Figure 3. Especially in the early stages of learning, using the approximate Fisher matrix (dashed line) considerably speeds up learning compared with both the exact Fisher matrix (solid line) and almost steepest descent (dash-dotted line). Other runs (not shown), with a validation set left aside, revealed that the minimum of the validation error is reached at around $10^1$ seconds of CPU time for Levenberg-Marquardt learning using either the approximate or the exact Fisher matrix. The inset in Figure 3 shows how the CPU time per update scales with the number of hidden units for this particular problem (the number of weights is 15 times the number of hidden units plus 1).

4.4 Natural Pruning in the Layered Approximation. Also here, in the context of pruning, one might wonder whether using the approximate Fisher matrix instead of the exact one leads to faster or more robust pruning
4. Another speed-up can be obtained by using
$$\Delta W_1 = -\left[ (C_1 + \mu)^{-1} \otimes C_0^{-1} \right] \left\langle \frac{\partial d(y_L(W, x), t)}{\partial W_1} \right\rangle_{x,t}$$
for the weight changes in the first layer, in which case we have to compute and invert the input covariance matrix only once.
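The computational advantage of keeping the decomposition intact rests on the identity $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$: only the small factors ever need to be inverted. A minimal numeric check (the matrix values are made up):

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    m = len(B)
    size = len(A) * m
    return [[A[i // m][j // m] * B[i % m][j % m] for j in range(size)]
            for i in range(size)]

def inv2(M):
    """Closed-form inverse of a 2x2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add_sqrt_mu(M, mu):
    """Add sqrt(mu) to the diagonal, as in equation 4.3."""
    s = mu ** 0.5
    return [[M[i][j] + (s if i == j else 0.0) for j in range(len(M))]
            for i in range(len(M))]

mu = 0.01
C1 = [[2.0, 0.3], [0.3, 1.0]]    # hypothetical curvature factors
C0 = [[1.5, -0.2], [-0.2, 0.8]]

# Inverting the two regularized 2x2 factors and taking their Kronecker
# product equals inverting the regularized 4x4 product directly:
fast_inv = kron(inv2(add_sqrt_mu(C1, mu)), inv2(add_sqrt_mu(C0, mu)))
check = matmul(fast_inv, kron(add_sqrt_mu(C1, mu), add_sqrt_mu(C0, mu)))
# check is (numerically) the 4x4 identity matrix
```

This is exactly why equation 4.3 regularizes the factors rather than the diagonal of $F_{ll}$: the cost of the inversion then scales with the factor sizes, not with their product.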
Figure 3: Levenberg-Marquardt learning on the Boston housing data using the exact Fisher matrix (solid lines), the approximate Fisher matrix (dashed lines), and the identity matrix (dash-dotted lines). The main figure shows the training error as a function of CPU time for a network with eight hidden units. The inset shows the CPU time per update for the three different procedures as a function of the number of hidden units.
procedures. The effect of replacing the exact Fisher matrix by its approximation, equation 3.4, in OBS is most easily understood by looking at the corresponding metric, equation 3.5. Finding the new weights $\hat{W}$ boils down to solving single-layer problems of the form ($W'_{lk}$ denotes the weight(s) to be pruned)
$$\hat{W}_l(k) = \operatorname*{argmin}_{W'_l:\, W'_{lk} = 0} \frac{1}{2} \left\langle [x_l(W) - W'_l y_{l-1}(W)]^T \, C_l(W) \, [x_l(W) - W'_l y_{l-1}(W)] \right\rangle_x$$
$$= \operatorname*{argmin}_{W'_l:\, W'_{lk} = 0} \frac{1}{2} \left\langle \left\| x_l(W) - W'_l y_{l-1}(W) \right\|^2 \right\rangle_x, \tag{4.4}$$
where the last step, that the solution is independent of $C_l$, can easily be verified by taking into account that $C_l$ is positive definite and symmetric. In fact, we are just fitting the weights in order to reproduce the hidden-unit activities of the original network as accurately as possible. This idea is present in many independently proposed pruning algorithms (Castellano, Fanelli, & Pelillo, 1997; Egmont-Petersen, Talmon, Hasman, & Ambergen, 1998; Laar & Heskes, 1999). The link provided here, a layered approximation of the exact Fisher matrix, might explain some of their success. In Laar and Heskes (1999), "partial retraining," an algorithm that, apart from a few minor details, is based on equation 4.4, was compared with (generalized) OBS. It turned out that partial retraining is much more robust than
OBS, mainly because the quadratic approximation, equation 2.12, breaks down quite rapidly after removing a few weights. Interdependencies between weights in different layers are not accurately approximated, and the flexibility to compensate for pruned weights by changing weights in different layers is overestimated. The more conservative layered approximation (see equation 4.4) of the exact Fisher matrix appeared to be sufficiently powerful and much less sensitive to weight removals.

5 Discussion and Directions
We have described natural learning and pruning procedures from a slightly different perspective than Amari's original formulation. In all of this, the Fisher matrix, measuring the local curvature of the distance function, plays a key role. Making a block-diagonal ("layered") approximation of this Fisher matrix, we arrived at fast learning and robust pruning procedures specific to MLPs. The approximation proposed is rather straightforward and simplistic: it neglects all interdependencies between weights in different layers. Still, it appeared to be sufficiently powerful to break the symmetry between hidden units in the early stages of learning. However, in later stages, where subtle interactions between weights in different layers come into play, one may have to pay the price for this neglect. The odds for a layered approximation in pruning, which has been discussed at greater length in Laar and Heskes (1999), are even better than for learning. In learning, the Fisher matrix is used only to transform the standard gradient in a smart way; to guarantee that the error indeed decreases (always for batch mode, or on average for on-line learning), some regularization can be added to prevent jumps in weight space that are too large. In pruning algorithms, the role of the Fisher matrix is much more critical. Here the Fisher matrix is used in a quadratic approximation of a distance function, and if this approximation breaks down, there is no feedback from an error function as in learning. This might explain why in pruning it pays to be somewhat more conservative and use a more local metric, as in equation 3.5, rather than the full quadratic approximation, as in equation 2.12. For on-line natural learning, the important assumption remains that the distribution of inputs is known. Instead, one might try to sample the required statistics dynamically, while learning.
It would be interesting to see under which conditions and with what kind of schemes one could best approximate the "true" natural gradient, to get the most benefit out of using the natural rather than the standard gradient. In Yang and Amari (1998), exact expressions for the Fisher matrix are derived for gaussian input probability distributions. One could apply the same expressions as an approximation for nongaussian input distributions as well. The advantage compared to the approach taken in this article is that such an approximation can take into account dependencies among
different layers. The question is how useful the expressions still are when the shape of the input distribution starts to differ from a standard gaussian. One could also think of combining the two approaches, with the layered approximation yielding a kind of zeroth-order solution and the interdependencies among weights in different layers estimated assuming gaussian inputs.

Acknowledgments
I thank Bert Kappen and Martijn Leisink for discussions and comments on earlier drafts. This research was supported by the Technology Foundation STW, applied science division of NWO, and the technology program of the Ministry of Economic Affairs.

References

Almeida, L., & Silva, F. (1992). Adaptive decorrelation. In I. Aleksander & J. Taylor (Eds.), Artificial neural networks, 2 (pp. 149-156). Amsterdam: North-Holland.
Amari, S. (1985). Differential-geometric methods in statistics. New York: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251-276.
Ampazis, N., Perantonis, S., & Taylor, J. (1999). Dynamics of multilayer networks in the vicinity of temporary minima. Neural Networks, 12, 43-58.
Castellano, G., Fanelli, A., & Pelillo, M. (1997). An iterative pruning algorithm for feedforward neural networks. IEEE Transactions on Neural Networks, 8, 519-531.
Egmont-Petersen, M., Talmon, J., Hasman, A., & Ambergen, A. (1998). Assessing the importance of features for multi-layer perceptrons. Neural Networks, 11, 623-635.
Gelman, A., Carlin, J., Stern, H., & Rubin, D. (1995). Bayesian data analysis. London: Chapman & Hall.
Hassibi, B., & Stork, D. (1993). Second order derivatives for network pruning: Optimal brain surgeon. In S. Hanson, J. Cowan, & L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 164-171). San Mateo, CA: Morgan Kaufmann.
Heskes, T., & Coolen, J. (1997). Learning in two-layered networks with correlated examples. Journal of Physics A, 30, 4983-4992.
Hondou, T., & Sawada, Y. (1994). Analysis of learning processes of chaotic time series by neural networks. Progress in Theoretical Physics, 91, 397-402.
Kivinen, J., & Warmuth, M. (1997). Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132, 1-63.
Laar, P. v. d., & Heskes, T. (1999). Pruning using parameter and neuronal metrics. Neural Computation, 11, 977-993.
Luenberger, D.
(1984). Linear and nonlinear programming. Reading, MA: Addison-Wesley.
Pedersen, M., Hansen, L., & Larsen, J. (1996). Pruning with generalization based weight saliencies: γOBD, γOBS. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 521-527). Cambridge, MA: MIT Press.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461.
Saad, D., & Solla, S. (1995). Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74, 4337.
Stahlberger, A., & Riedmiller, M. (1997). Fast network pruning and feature extraction using the unit-OBS algorithm. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 655-661). Cambridge, MA: MIT Press.
Wiegerinck, W., & Heskes, T. (1996). How dependencies between successive examples affect on-line learning. Neural Computation, 8, 1743-1765.
Yang, H., & Amari, S. (1998). The efficiency and robustness of natural gradient descent learning rule. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 385-391). Cambridge, MA: MIT Press.

Received April 26, 1999; accepted July 12, 1999.
LETTER
Communicated by William Lytton
Synthesis of Generalized Algorithms for the Fast Computation of Synaptic Conductances with Markov Kinetic Models in Large Network Simulations
Michele Giugliano
N.B.T. Neural and Bioelectronic Technologies Group, Department of Biophysical and Electronic Engineering, University of Genova, Genova, Italy

Markov kinetic models constitute a powerful framework to analyze patch-clamp data from single-channel recordings and to model the dynamics of ion conductances and synaptic transmission between neurons. In particular, the accurate simulation of a large number of synaptic inputs in wide-scale network models may result in a computationally highly demanding process. We present a generalized consolidating algorithm to simulate efficiently a large number of synaptic inputs of the same kind (excitatory or inhibitory) converging on an isopotential compartment, independently modeling each synaptic current by a generic n-state Markov model characterized by piecewise-constant transition probabilities. We extend our findings to a class of simplified phenomenological descriptions of synaptic transmission that incorporate higher-order dynamics, such as short-term facilitation, depression, and synaptic plasticity.

1 Introduction
Realistic network simulations involving the description of individual synaptic contributions generally require a large proportion of the total processing time to be devoted to updating the state variables associated with each synapse and to computing the resulting postsynaptic current at every simulation time step (Lytton, 1996; Koch & Segev, 1998). Such a distributed representation of a set of identical, independent physical systems, performing potentially redundant calculations at slightly different times, has previously been analyzed in the case of specific model synapses, in order to improve its performance in terms of the CPU time and memory required by software implementations (Srinivasan & Chiel, 1993; Lytton, 1996; Köhn & Wörgötter, 1998; Giugliano, Bove, & Grattarola, 1999). This has been done by representing the contribution of multiple synaptic sites in a consolidated iterated closed form. In this work, we prove that such an approach can be extended to arbitrary Markov kinetic models of the postsynaptic receptor dynamics. This leads to the synthesis of a general class of algorithms efficiently implementing synaptic transmission in large network simulations. Besides a
Neural Computation 12, 903-931 (2000)
© 2000 Massachusetts Institute of Technology
theoretical analysis of the performance improvements, we report a few examples and computer simulations to clarify the explored strategy. We show that synaptic transmission models incorporating the spontaneous release of neurotransmitter can easily be implemented and optimized with the same technique. Our findings are then extended to the fast simulation of short-term dynamic synaptic phenomena (Abbott, Varela, Sen, & Nelson, 1997; Tsodyks & Markram, 1997; Varela et al., 1997; Chance, Nelson, & Abbott, 1998), involving calcium-dependent facilitation, refractoriness of neurotransmitter exocytosis, exhaustion of the readily releasable vesicle pool, desensitization of postsynaptic receptors, and plasticity of the absolute synaptic efficacy (Finkel & Edelman, 1987). A previous approach to the optimization of a large ensemble of identical, independent depressing synaptic conductances (Giugliano et al., 1999) is thus a particular case of a more general algorithm.

2 Methods
Biophysical details of the behavior of single channels have been widely investigated using patch recording techniques (Sakmann & Neher, 1983). It has been possible to demonstrate experimentally the stochastic nature of membrane ion conductances and to characterize the rapid transitions among different conformational states, which depend on the electric potential across the plasma membrane or on the interaction with second-messenger and ligand molecules (Sakmann & Neher, 1983; Hille, 1992). In such a context, Markov models have been widely and consistently used to model the stochastic microscopic properties (e.g., single-channel openings and closings, open-time distributions) (Colquhoun & Hawkes, 1983), as well as the deterministic macroscopic features determining membrane ionic conductances and other cellular signaling pathways, such as ionotropic and metabotropic chemical synaptic transmission (Destexhe, Mainen, & Sejnowski, 1994b). It is usually assumed that membrane channels and receptors fluctuate through a succession of discrete conformational configurations, among a finite number of states, separated by large energy barriers (Hille, 1992). The probability of a transition from one state to another is also assumed to depend only on the currently occupied state and not on the whole past history. Membrane channels and receptors can thus be regarded as finite-state Markov systems (Papoulis, 1991; Destexhe, Mainen, & Sejnowski, 1994a, 1994b). In this article we focus on chemical synaptic transmission between neurons (Walmsley, Alvarez, & Fyffe, 1998), and we discuss the optimization of the collective contribution of many synaptic currents converging on an isopotential compartment, independently modeling each synaptic site by an arbitrary n-state Markov model.
This framework is revealed to be appropriate for the efficient network simulation of phenomena such as the spontaneous release of neurotransmitter, useful for detailed investigations of the contribution of miniature synaptic potentials (i.e., minis) and spontaneous synaptic activity (Paré, Shink, Gaudreau, Destexhe, & Lang, 1998) to the electrophysiological activity, with realistic biophysical models.

2.1 Definition of a Markov Kinetic Model. The formal definition of a Markov kinetic model identifies a set $S$ of $n$ states and a set of transition probabilities among states, also known as transition rates (Howard, 1971):
$$S = \{s_1, \ldots, s_n\}, \qquad \{r_{ij} \geq 0 : i, j = 1, \ldots, n\}. \tag{2.1}$$
The concepts of state and transition are particularly important in the mathematical description of the system configurations, specifying, for instance, transitions and conformational states where the channel binds neurotransmitter molecules and opens or inactivates, consequently modifying the ionic flows through the membrane. The transition probabilities, and hence the Markov process itself, can be graphically represented by a state diagram: an oriented line segment is drawn from each state and labeled with the probability per unit time $r_{ij}$ that a transition from the $i$th state to the $j$th state occurs. Focusing now on a single ionotropic ligand-gated postsynaptic receptor, we define the probability $q_i(t)$ that the generic $i$th state is occupied at a given time $t$:
$$q_i(t) = \Pr\{\text{state } s = s_i \text{ occupied at time } t\}, \quad i = 1, \ldots, n, \quad 0 \leq q_i \leq 1. \tag{2.2}$$
These $n$ absolute probabilities evolve in time according to a set of linear differential equations (see equation 2.3) (Colquhoun & Hawkes, 1983; Papoulis, 1991; Destexhe et al., 1994a, 1994b; Grattarola & Massobrio, 1998), which can be equivalently rewritten by means of the vector notation of equation 2.4:
$$\dot{q}_i = \sum_{j=1}^{n} r_{ij} \, q_j, \quad i = 1, \ldots, n, \tag{2.3}$$
$$\dot{q} = R q. \tag{2.4}$$
In the previous equations, we introduced the matrix $R$ of the coefficients $r_{ij}$ (Gantmacher, 1959), explicitly defined as
$$r_{ij} = \begin{cases} r_{ij} & i \neq j, \\[4pt] -\displaystyle\sum_{\substack{k=1 \\ k \neq i}}^{n} r_{ik} & i = j, \end{cases} \tag{2.5}$$
assuming that $r_{ii} = 0$. When the coefficients of the matrix $R$ are constant in time, equation 2.4 can be solved in closed form (Gantmacher, 1959). Its
solution is
$$q(t) = e^{(t - t_o) R} \, q_o, \tag{2.6}$$
with $e^{(t - t_o) R}$ indicating the exponential of a matrix and $q_o$ the initial condition of the state vector. Equation 2.6 is the analytical solution of equation 2.4, and thus the most complete description of the time course of the probability that each state will be occupied at any time. However, in most computer simulations of large networks of neurons, such a description is not strictly required. Because of the architecture of digital computers, these probabilities must be computed at discrete times only. Depending on the numerical method implemented for the integration of the equations that describe the currents and voltages associated with each unit of the network, these discrete time events can be either uniformly spaced over the time axis (in the case of fixed-step numerical methods) or not (in the case of adaptive integration-step implementations). Thanks to the properties of the matrix $(t - t_o) \cdot R$, with $(t - t_o)$ a scalar multiplying factor, the exact solution of equation 2.4 can be iteratively expressed for an arbitrary sequence of time steps $t_1, t_2, \ldots, t_k, t_{k+1}, \ldots$ (Gantmacher, 1959):
$$q(t_{k+1}) = e^{\Delta_k R} \, q(t_k), \qquad \Delta_k = t_{k+1} - t_k. \tag{2.7}$$
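As a minimal numerical sketch of equations 2.4 through 2.7 (the rate values are hypothetical), consider a two-state scheme $C \leftrightarrow O$ with forward rate $a$ and backward rate $b$. We adopt the convention in which each column of $R$ sums to zero, so that the total probability is conserved by $\dot{q} = Rq$:

```python
a, b = 4.0, 1.0              # hypothetical opening and closing rates
R = [[-a, b],                # q_C' = -a*q_C + b*q_O
     [a, -b]]                # q_O' = +a*q_C - b*q_O

q = [1.0, 0.0]               # all receptors start in the closed state
dt = 0.001
for _ in range(10000):       # simple Euler integration of q' = R q
    dq = [sum(R[i][j] * q[j] for j in range(2)) for i in range(2)]
    q = [qi + dt * dqi for qi, dqi in zip(q, dq)]

# Probability is conserved (each column of R sums to zero), and q relaxes
# toward the steady state [b/(a+b), a/(a+b)] = [0.2, 0.8].
```

The Euler loop here stands in for any fixed-step integrator; equation 2.7 replaces it with an exact one-step map built from the matrix exponential.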
This general property of linear systems has already been used in the context of specific model synapses (Srinivasan & Chiel, 1993; Lytton, 1996; Giugliano et al., 1999), and it is fundamental to the development of the fast algorithms.

2.2 Fast Calculation of the Postsynaptic Currents. Let us consider an isopotential compartment of a model neuron. Neglecting propagation delays, whose management could be optimized, for instance, by a direct implementation of the single-queue strategy proposed in Lytton (1996), to a first approximation the deterministic macroscopic synaptic current $I_{\mathrm{syn}}$, resulting from $N$ independent synapses of the same type (e.g., excitatory), is given by a linear superposition of independent contributions, each expressed by a standard ohmic I-V relationship (Koch & Segev, 1998; Destexhe et al., 1994a, 1994b), as reported in equation 2.8:
$$I_{\mathrm{syn}} = \sum_{i=1}^{N} g_i(t) \cdot \left( E_{\mathrm{syn}}^{i} - V_m \right) = \left[ \sum_{i=1}^{N} g_i(t) \right] \cdot \left( E_{\mathrm{syn}} - V_m \right). \tag{2.8}$$
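The factorization in equation 2.8 is what makes consolidation possible: since all $N$ synapses of the same type share one driving force, their conductances can be summed once. A toy check with made-up numbers:

```python
E_syn = 0.0     # mV, reversal potential of the (excitatory) synapses
V_m = -65.0     # mV, postsynaptic membrane potential
g = [0.1, 0.25, 0.05, 0.4]   # per-synapse conductances, arbitrary units

per_synapse = sum(gi * (E_syn - V_m) for gi in g)  # left-hand form of 2.8
consolidated = sum(g) * (E_syn - V_m)              # right-hand form of 2.8
# both evaluate to 0.8 * 65.0 = 52.0
```

With the consolidated form, the driving-force multiplication is performed once per compartment rather than once per synapse.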
$V_m$ and $E_{\mathrm{syn}}$ indicate the postsynaptic membrane potential and the apparent equilibrium (reversal) potential, respectively. The $i$th synaptic conductance $g_i(t)$ is related to the fraction of ligand-gated receptors in particular conformational states, letting ions flow through the membrane. The fraction
of receptors in such states can be deterministically expressed under the hypothesis of the law of large numbers: in terms quantitatively defined by the Chebyshev inequality (Papoulis, 1991), the probability $q_i(t)$ referred to a single channel represents the fraction of units occupying the $i$th state at time $t$ within a large ensemble of receptors. For instance, let us consider a nine-state kinetic model (Standley, Norris, Ramsey, & Usherwood, 1993), characterized by five closed (C) and four open (O) channel configurations:
$$\begin{array}{ccccccccc}
C & \leftrightarrow & C_1 & \leftrightarrow & C_2 & \leftrightarrow & C_3 & \leftrightarrow & C_4 \\
 & & \updownarrow & & \updownarrow & & \updownarrow & & \updownarrow \\
 & & O & & O_2 & & O_3 & & O_4
\end{array} \tag{2.9}$$
In this example, the $i$th synaptic site conductance is expressed as
$$g_i(t) \propto \left[ q_O^{(i)}(t) + q_{O_2}^{(i)}(t) + q_{O_3}^{(i)}(t) + q_{O_4}^{(i)}(t) \right], \tag{2.10}$$
$q_O(t), q_{O_2}(t), \ldots, q_{O_4}(t)$ being the fractions of the postsynaptic receptors in the open states $O, O_2, \ldots, O_4$, respectively. In a simpler case, the conductance can be related to a single conformational state,
$$g_i(t) \propto \left[ q^{(i)}(t) \right], \tag{2.11}$$
where $q$ is the fraction of receptors in the open conformational state $O$, evolving, for example, according to a simplified two-state kinetics (see Destexhe et al., 1994a):
$$C \; \underset{b}{\overset{a \cdot T}{\rightleftharpoons}} \; O. \tag{2.12}$$
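Equation 2.10 amounts to summing the occupancies of the open states; a sketch (the state ordering, occupancy values, and maximal conductance are illustrative, not from the cited model):

```python
# Occupancies of the nine states of scheme 2.9, ordered as
# [C, C1, C2, C3, C4, O, O2, O3, O4]; values sum to one.
q = [0.55, 0.20, 0.10, 0.08, 0.04, 0.02, 0.006, 0.003, 0.001]
open_states = [5, 6, 7, 8]   # indices of O, O2, O3, O4 in q
g_max = 1.2                  # proportionality constant (e.g., nS)

g = g_max * sum(q[s] for s in open_states)   # equation 2.10
```

The two-state case of equation 2.11 is the same computation with a single open-state index.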
In the general case of equation 2.4, we now partly relax the constraint on $R$, letting it be time dependent. Specifically, we require it to be piecewise constant: we assume the existence of a set of $M$ constant matrices $\{W_1, W_2, \ldots, W_M\}$ such that at any given time $t$, $\exists m \in \{1, \ldots, M\}$:
$$R(t) = W_m. \tag{2.13}$$
Given a sequence of time events $t_1, t_2, t_3, \ldots, t_k, t_{k+1}, \ldots$, let us consider, for example, the following changes in the coefficients of $R(t)$:
$$R(t) = \begin{cases} W_5 & t \in [t_1; t_2) \\ W_7 & t \in [t_2; t_3) \\ \vdots & \\ W_m & t \in [t_k; t_{k+1}) \\ \vdots & \end{cases} \tag{2.14}$$
Michele Giugliano
Such a sequence of events models steep changes in the concentration of the ligand/neurotransmitter in the cleft, and thus in the value of the coefficients of R, induced by evoked presynaptic responses as well as by spontaneous exocytotic events, neuromodulation, or changes in the concentration of G-protein subunits in simplified models of second-messenger gating (Destexhe et al., 1994b). In the example of equation 2.14, it immediately follows that

$$q(t + \Delta t) = \begin{cases} e^{\Delta t\, W_5}\, q(t) & t \in [t_1;\, t_2) \\ e^{\Delta t\, W_7}\, q(t) & t \in [t_2;\, t_3) \\ \;\vdots & \\ e^{\Delta t\, W_m}\, q(t) & t \in [t_k;\, t_{k+1}) \\ \;\vdots & \end{cases} \tag{2.15}$$

by choosing Δt < Δ_k for each k (see equation 2.7). For small M, this suggests that it might be convenient to compute each exponential matrix off-line. In particular, we define the following M quantities:

$$U_1 = e^{\Delta t\, W_1},\quad U_2 = e^{\Delta t\, W_2},\quad \dots,\quad U_m = e^{\Delta t\, W_m},\quad \dots,\quad U_M = e^{\Delta t\, W_M}. \tag{2.16}$$
The kinetic model indicated in equation 2.12 is thus a particular case of a more general framework. In that example, the two transition probabilities have been made explicit: T indicates the concentration of neurotransmitter molecules in the cleft, and α and β are the forward and backward rate constants for transmitter binding, respectively (Destexhe et al., 1994a). Indicating with ḡ_i the maximal synaptic conductance (i.e., the absolute synaptic efficacy), the differential equation that describes the behavior of each synaptic conductance follows:

$$g_i(t) = \bar g_i \cdot q^{(i)}(t) \qquad i = 1, \dots, N \tag{2.17}$$

$$\frac{dq^{(i)}}{dt} = -\beta \cdot q^{(i)} + \alpha \cdot T_i(t) \cdot \bigl(1 - q^{(i)}\bigr). \tag{2.18}$$
Assuming the transmitter concentration T_i(t) in the ith synaptic cleft to occur as a pulse of amplitude T_max (i.e., M = 2) and duration C_dur (Anderson & Stevens, 1973; Colquhoun et al., 1992), triggered by a presynaptic action potential, we satisfy the piecewise time invariance previously discussed (see equation 2.13) and derive the iterative form of the solution:

$$q^{(i)}(t + \Delta t) = \begin{cases} q^{(i)}(t)\, e^{-(\alpha T_{max} + \beta)\Delta t} + \dfrac{\alpha\, T_{max}}{\alpha\, T_{max} + \beta}\,\bigl(1 - e^{-(\alpha T_{max} + \beta)\Delta t}\bigr) & \bar t_i < t + \Delta t < \bar t_i + C_{dur} \\[2mm] q^{(i)}(t)\, e^{-\beta\, \Delta t} & t + \Delta t > \bar t_i + C_{dur} \end{cases} \tag{2.19}$$

with $\bar t_i$ being the last occurrence time of a presynaptic action potential for the ith synapse (see equations 2.14 and 2.15). Such a simplified definition of the time course of the postsynaptic conductance implicitly accounts for saturation and summation of multiple presynaptic events, while being as fast to calculate as a single alpha function (Destexhe et al., 1994a). Its quantitative comparison with the general case and notation (see equation 2.15) leads to

$$\mathbf q = \begin{pmatrix} q \\ 1 - q \end{pmatrix}, \qquad n = 2, \quad M = 2,$$

$$W_1 = \begin{pmatrix} -\beta & \alpha\, T_{max} \\ \beta & -\alpha\, T_{max} \end{pmatrix} \qquad W_2 = \begin{pmatrix} -\beta & 0 \\ \beta & 0 \end{pmatrix}.$$
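For a single synapse, the update rule of equation 2.19 reduces to a pair of scalar affine maps, one active while the transmitter pulse lasts and one afterward. A minimal sketch, with hypothetical rate and pulse values:

```python
import math

# Illustrative parameters only: binding/unbinding rates, pulse amplitude/duration.
alpha, beta = 2.0, 1.0          # 1/(mM*ms), 1/ms
T_max, C_dur = 1.0, 1.0         # mM, ms
dt = 0.01                       # ms
q_inf = alpha * T_max / (alpha * T_max + beta)   # steady state during the pulse

def step(q, t, t_spike):
    """One iteration of equation 2.19 for one synapse."""
    if t_spike < t + dt < t_spike + C_dur:       # transmitter pulse present
        decay = math.exp(-(alpha * T_max + beta) * dt)
        return q * decay + q_inf * (1.0 - decay)
    return q * math.exp(-beta * dt)              # transmitter absent

q, t, t_spike = 0.0, 0.0, 0.0                    # a spike arrives at t = 0
trace = []
for _ in range(500):                             # 5 ms of simulation
    q = step(q, t, t_spike)
    t += dt
    trace.append(q)
```

q rises toward q_inf while the pulse lasts and decays exponentially afterward; saturation is automatic, since the map can never push q above q_inf.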
We focus again on the general case of N independent synaptic sites described by a generic kinetics. A state vector q^{(i)} and a matrix R^{(i)}, i = 1, ..., N, are associated with each site, and the following set of N·n differential equations can be derived (as in equation 2.4):

$$\begin{cases} \dot{\mathbf q}^{(1)} = R^{(1)}\, \mathbf q^{(1)} \\ \;\vdots \\ \dot{\mathbf q}^{(N)} = R^{(N)}\, \mathbf q^{(N)}. \end{cases} \tag{2.20}$$

Under the hypothesis of equation 2.13, we now require that for the ith system, i = 1, ..., N, at any given time t the following expression holds:

$$\exists\, m \in \{1, \dots, M\}:\quad R^{(i)}(t) = W_m. \tag{2.21}$$
Assuming M considerably smaller than N, the previous statement on the time course of each R^{(i)} matrix implicitly induces an instantaneous partition of the set of N systems into M classes, each characterized by the same matrix W_m, m = 1, ..., M, at any time. This fact represents the fundamental stage of the fast algorithm and generalizes the previous consolidating strategies (Lytton, 1996) to the case of n states and M constant transition matrices. More quantitatively, we define the set of all the systems sharing the transition matrix W_m at a given time t, indicating the cardinality of such a set as N_m. We thus introduce the indexes i_m = 1, ..., N_m in order to refer independently to each element of the mth class, so that

$$\forall\, m \in \{1, \dots, M\}\quad \forall\, i_m \in \{1, \dots, N_m\}:\quad R^{(i_m)} = W_m, \tag{2.22}$$

$$N = \sum_{m=1}^{M} N_m \qquad \forall\, t, \tag{2.23}$$

and

$$\dot{\mathbf q}^{(i_m)} = W_m\, \mathbf q^{(i_m)} \qquad m = 1, \dots, M. \tag{2.24}$$
The computer simulation of equation 2.8 in the case of large networks may result in a highly demanding computational task. This is mainly due to the need to track the temporal evolution of each synaptic site (e.g., as in equation 2.19) and then compute the overall synaptic conductance by means of a sum over the N sites. Without losing any generality, we assume that a single conducting state exists in the generic Markov model describing the postsynaptic receptors, and we identify it (e.g.) as the fourth component of the state vector q^{(i)}, for the ith synaptic site, i = 1, ..., N. Let us define the following M quantities:

$$\mathbf w_m = \sum_{i_m = 1}^{N_m} \mathbf q^{(i_m)} \qquad m = 1, \dots, M. \tag{2.25}$$

The sum of all the state vectors q^{(i)} can then also be expressed as the sum of these quantities:

$$\mathbf w = \sum_{i=1}^{N} \mathbf q^{(i)} = \sum_{m=1}^{M} \mathbf w_m. \tag{2.26}$$

For the sake of clarity, we point out that the total synaptic conductance required for the computation of equation 2.8 is given by the fourth component of the vector w, or equivalently by the sum of the fourth components of each w_m:

$$\bigl[\mathbf w_m\bigr]_4 = \sum_{i_m = 1}^{N_m} q_4^{(i_m)}. \tag{2.27}$$
Because of the linearity of equation 2.15, we can considerably speed up the computation of equation 2.27 by noting that each w_m satisfies the following set of linear differential equations:

$$\dot{\mathbf w}_m = W_m\, \mathbf w_m. \tag{2.28}$$

As for equations 2.6 and 2.7, by using the definitions given in equation 2.16, we can either analytically solve such differential equations, specifying their initial conditions w_m(t_o), or iteratively calculate their exact solutions. Choosing the second approach, we finally express w in a lumped form, summing over M instead of N terms (see appendix A for the summarized general form of the algorithm):

$$\mathbf w(t + \Delta t) = \sum_{m=1}^{M} \mathbf w_m(t + \Delta t) = \sum_{m=1}^{M} U_m\, \mathbf w_m(t). \tag{2.29}$$

In the last expression, an arbitrary time increment Δt has been considered. The size of such an increment is constrained only by the time interval between two consecutive events (e.g., as in equation 2.14), so that a fixed-step updating is not strictly required. However, we note that further performance improvements can be achieved by calculating off-line each U_m, m = 1, ..., M, for a fixed Δt or for a limited set of Δt values. Finally, equation 2.8 can be rewritten as follows:

$$I_{syn} = \bar g \cdot \left\{ \sum_{i=1}^{N} \bigl[\mathbf q^{(i)}\bigr]_4 \right\} \cdot (E_{syn} - V_m) = \bar g \cdot \left\{ \sum_{m=1}^{M} \bigl[\mathbf w_m\bigr]_4 \right\} \cdot (E_{syn} - V_m). \tag{2.30}$$
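The consolidation expressed by equations 2.25 through 2.30 can be checked numerically: updating the M class accumulators gives, at every step, the same total as updating all N individual states. A sketch with scalar two-state synapses (the n = 2 system of equation 2.19 with the closed fraction eliminated) split into ON/OFF classes; all parameter values are illustrative:

```python
import math, random

random.seed(0)
alpha, beta, T_max, dt = 2.0, 1.0, 1.0, 0.1    # hypothetical rates
a_on = math.exp(-(alpha * T_max + beta) * dt)
b_on = alpha * T_max / (alpha * T_max + beta) * (1.0 - a_on)
a_off, b_off = math.exp(-beta * dt), 0.0
coef = [(a_off, b_off), (a_on, b_on)]          # per-class affine update q' = a*q + b

N = 1000
q = [random.random() for _ in range(N)]
cls = [random.randrange(2) for _ in range(N)]  # 0 = OFF, 1 = ON (fixed here)

# Consolidated accumulators w_m (equation 2.25) and class sizes N_m.
w = [sum(qi for qi, c in zip(q, cls) if c == m) for m in range(2)]
n_m = [cls.count(0), cls.count(1)]

for _ in range(50):
    # naive scheme: N updates per step
    q = [coef[c][0] * qi + coef[c][1] for qi, c in zip(q, cls)]
    # fast scheme: M = 2 updates per step; the inhomogeneous term scales with N_m
    w = [coef[m][0] * w[m] + coef[m][1] * n_m[m] for m in range(2)]

naive_total, fast_total = sum(q), sum(w)
```

The two totals agree to floating-point precision, which is the content of equation 2.29 for this scalar special case.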
In equation 2.30 we assumed that all the synapses share the same value of the maximal conductance, but this is not strictly required, as we show later.

3 Results

3.1 Computational Efficiency. In this section, we analyze the number of arithmetic operations required at each time step by the implementation of the fast calculation algorithm (see appendix A) presented in section 2.2.¹ As stated by equation 2.7, assuming R time independent and without considering the off-line computation of its exponential matrix, the number of operations required for the iterative calculation of the analytical solution of equation 2.4 consists of the n·(n − 1) sums and n² products needed

¹ Computer simulations, implemented in standard ANSI C code available on request, were performed on a Digital DEC 3000 Alpha workstation and on an Intel Pentium II 333 MHz dual-processor workstation.
to compute the product of the (n × n) matrix e^{Δ_k R} with the (n × 1) column vector q(t_k). In the case of a nonoptimized calculation of the total synaptic conductance (see equation 2.8), the quantity

$$\bigl[\mathbf w\bigr]_4 = \sum_{i=1}^{N} q_4^{(i)} \tag{3.1}$$

must be computed by accumulating N terms, whose updating expression is given by

$$q_4^{(i)}(t + \Delta t) = \sum_{j=1}^{n} \bigl[U_{F(i)}\bigr]_{4j} \cdot q_j^{(i)}(t), \tag{3.2}$$

generically assuming R^{(i)}(t̂) = W_{F(i)} for t̂ ∈ [t; t + Δt), F(i) ∈ {1, ..., M}. In order to derive the computational load of the nonoptimized procedure for the summation of a large number of identical independent synaptic conductances, we rewrite equation 2.29 by definition as follows:

$$\bigl[\mathbf w(t + \Delta t)\bigr]_4 = \sum_{i=1}^{N} \sum_{j=1}^{n} \bigl[U_{F(i)}\bigr]_{4j} \cdot q_j^{(i)}(t). \tag{3.3}$$
Indicating with Δ_prod and Δ_sum (Δ_prod > Δ_sum) the times required by an abstract floating-point unit to compute the result of a two-operand product and sum, respectively, and neglecting any pipelined architecture, the total estimated time required for equation 3.3 is

$$\Delta T_{SLOW} = N \cdot \bigl[ n^2 \cdot \Delta_{prod} + n \cdot (n-1) \cdot \Delta_{sum} \bigr] + (N - 1) \cdot \Delta_{sum}. \tag{3.4}$$

In the fast algorithm, equation 3.1 is replaced by

$$\bigl[\mathbf w\bigr]_4 = \sum_{m=1}^{M} \bigl[\mathbf w_m\bigr]_4, \tag{3.5}$$

so that the accumulation of only M quantities is required, independently of N. Each term is iteratively updated, immediately leading to

$$\bigl[\mathbf w(t + \Delta t)\bigr]_4 = \sum_{m=1}^{M} \sum_{j=1}^{n} \bigl[U_m\bigr]_{4j} \cdot \bigl[\mathbf w_m(t)\bigr]_j. \tag{3.6}$$

An abstract floating-point unit would then require

$$\Delta T_{FAST} = M \cdot \bigl[ n^2 \cdot \Delta_{prod} + n \cdot (n-1) \cdot \Delta_{sum} \bigr] + (M - 1) \cdot \Delta_{sum}. \tag{3.7}$$
We can thus define a normalized gain factor f as follows (see equation 3.9), by defining

$$Q = \bigl( \gamma \cdot n^2 - n + 1 \bigr)^{-1}, \qquad \gamma = \frac{\Delta_{prod}}{\Delta_{sum}} + 1, \quad \gamma > 1, \tag{3.8}$$

$$\Delta T_{FAST} = \frac{1}{f} \cdot \frac{1}{N} \cdot \Delta T_{SLOW}. \tag{3.9}$$

When N is much bigger than Q, f becomes independent of N:

$$N \gg Q \;\Rightarrow\; f \cong \frac{1}{M - Q}, \tag{3.10}$$

quantitatively establishing that the acceleration increases as the number of simulated independent synaptic sites increases. Explicitly, the following equation holds:

$$\Delta T_{FAST} \cong \left[ \frac{\gamma \cdot M \cdot n^2 - M \cdot n + (M - 1)}{\gamma \cdot n^2 - n + 1} \right] \cdot \frac{1}{N} \cdot \Delta T_{SLOW}. \tag{3.11}$$
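Plugging representative numbers into the operation counts of equations 3.4 and 3.7 reproduces the asymptotic gain of equations 3.10 and 3.11. In the sketch below, the cost ratio Δ_prod/Δ_sum is an assumed hardware figure, and n, M, N are illustrative problem sizes:

```python
d_sum, d_prod = 1.0, 2.0            # assumed abstract FPU costs, d_prod > d_sum
gamma = d_prod / d_sum + 1.0        # as defined alongside equation 3.8
n, M, N = 9, 4, 10_000              # states per receptor, classes, synapses

def total_time(k):
    # k matrix-vector propagations plus (k - 1) accumulations (equations 3.4, 3.7)
    return k * (n * n * d_prod + n * (n - 1) * d_sum) + (k - 1) * d_sum

t_slow, t_fast = total_time(N), total_time(M)
Q = 1.0 / (gamma * n * n - n + 1.0)
f = 1.0 / (M - Q)                   # asymptotic gain factor, equation 3.10
speedup = t_slow / t_fast           # should approach f * N for N >> Q
```

With these numbers the measured speedup matches the prediction f·N to well within a percent, confirming that the acceleration grows linearly with the number of synaptic sites.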
In a real implementation of the algorithm, however, the acceleration improvements do not grow unbounded as N increases, because of the time required to process the temporal events t_1, t_2, ..., t_k, t_{k+1}, ..., each time executing part 1 of the algorithm (see appendix A); this cost depends roughly on the product of N and the average rate of activation of the synaptic inputs.

3.2 Numerical Accuracy. Equation 3.11 indicates that this technique improves the computational efficiency, reducing the number of arithmetic operations at each simulation step, since γ > 1, n > 1, and N ≫ M in all realistic situations. Such a result has other important consequences concerning the numerical accuracy of the computation (see Figure 1). Referring for the sake of simplicity to the example reported in equation 2.18, two main strategies exist to compute the total synaptic conductance resulting from an ensemble of N independent sites, under the usual hypotheses. The first consists of the computation of the iterated exact solution, given by equation 2.19 (Destexhe et al., 1994a), while the second is the integration of equation 2.18 by means of a numerical technique, such as Euler's method or the Runge-Kutta methods (Press, Teukolsky, Vetterling, & Flannery, 1996). It follows that the fast algorithm must be compared with the fastest computation strategy, namely the numerical integration. Of course, there is a compromise between the performance that a numerical method can achieve in terms of accuracy and its complexity, resulting
Figure 1: A comparison of the numerical accuracy of two computer implementations. (Top) Time course of the instantaneous arithmetic error of the numerical integration of equation 3.13 with respect to its iterated exact closed form. (Bottom) The higher arithmetic precision of the fast algorithm (equations 3.16-3.20) is evident: it shares the numerical accuracy of the exact iterated solution strategy, and as a consequence the instantaneous arithmetic error is zero most of the time. Both measures have been normalized with respect to the iterated exact closed form (equation 3.14), chosen as a reference.
in higher computational loads. However, our interest is not focused on the general convergence properties of any numerical method, but rather on the truncation errors occurring during the accumulation of the N terms in the sum of equation 2.10, because of the finite-precision arithmetic. We chose the fastest numerical method (Euler's method) and verified that an implementation of the proposed fast algorithm shares the numerical accuracy of the exact iterated solution strategy together with an optimal CPU-time performance, at parity of time-step size. We can also roughly quantify the generic consequences of the finite-precision arithmetic on the errors accumulating over N sums by emulating such effects with an additive white noise source in equation 2.8, regarding the global synaptic conductance as a random variable (Oppenheim & Shaffer, 1975). From this analysis, it follows that the reduction of the total number of sums from N to M (see equation 2.26) implies a decrease in the variance of the global conductance (see equation 2.8) at each time step by a factor M/N, thus improving the signal-to-noise ratio, especially when the number N of synapses is considerably high. An example of such an improvement is reported in Figure 1.

3.3 Efficient Simulation of Spontaneous Neurotransmitter Release.
In this section, we propose an example of a fast implementation of synaptic transmission with a kinetic Markov model, incorporating spontaneous release of neurotransmitter into the cleft.
Figure 2: (Top) The temporal evolution of the piecewise-constant transition rate. (Bottom) The resulting time course of the synaptic conductance, according to equation 4.5.
We consider a two-state kinetic scheme for the neurotransmitter-receptor binding, but any n-state kinetics could have been used. To a first approximation, we assume that T(t) results from the linear superposition of two processes, describing the profile of neurotransmitter in the cleft due to spontaneous exocytotic events and to spike-triggered releases. For simplicity, we assume that deterministic and spontaneous events differ only in the maximal amplitude of the neurotransmitter pulse (Anderson & Stevens, 1973; Sheperd, 1988; Colquhoun et al., 1992), but we note that distinct pulse durations could be considered too. The previous assumption ensures the piecewise time invariance of the rate coefficients, since T takes one out of four constant values at any given time (see Figure 2, top):

$$T(t) = \begin{cases} 0 \\ T_1 \\ T_2 \\ T_1 + T_2 \end{cases} \qquad T_1 < T_2. \tag{3.12}$$

The dynamics of the postsynaptic conductance is described by the solution of the following equation, where τ_ON and q_∞ are piecewise constant, according to the values reported in Table 1, depending on the amplitude of neurotransmitter in the cleft (see equation 3.12):

$$\frac{dq}{dt} = \frac{q_\infty - q}{\tau_{ON}}. \tag{3.13}$$
Under this framework, we can express the iterated exact solution of equation 3.13 by implementing the appropriate updating rule, depending on the value of T(t) in the current time interval.

Table 1: Constant Parameters Defined in the Example of Spontaneous Neurotransmitter Release.

    Symbol      Expression
    τ_OFF       β^{-1}
    q_∞0        0
    τ_ON1       (α·T_1 + β)^{-1}
    q_∞1        α·T_1·(α·T_1 + β)^{-1}
    τ_ON2       (α·T_2 + β)^{-1}
    q_∞2        α·T_2·(α·T_2 + β)^{-1}
    τ_ON3       [α·(T_1 + T_2) + β]^{-1}
    q_∞3        α·(T_1 + T_2)·[α·(T_1 + T_2) + β]^{-1}

Rewriting equation 2.15 in this context leads to

$$q(t + \Delta t) = \begin{cases} e^{-\Delta t / \tau_{OFF}}\, q(t) & T = 0 \\ e^{-\Delta t / \tau_{ON_1}}\, q(t) + \bigl(1 - e^{-\Delta t / \tau_{ON_1}}\bigr)\, q_{\infty_1} & T = T_1 \\ e^{-\Delta t / \tau_{ON_2}}\, q(t) + \bigl(1 - e^{-\Delta t / \tau_{ON_2}}\bigr)\, q_{\infty_2} & T = T_2 \\ e^{-\Delta t / \tau_{ON_3}}\, q(t) + \bigl(1 - e^{-\Delta t / \tau_{ON_3}}\bigr)\, q_{\infty_3} & T = T_1 + T_2. \end{cases} \tag{3.14}$$

Regarding a large population of such synaptic sites, we induce a partitioning of the ensemble of synapses into four classes (M = 4) by introducing the following index notation (see section 2.2): i_OFF = 1, ..., N_OFF; i_ON1 = 1, ..., N_ON1; i_ON2 = 1, ..., N_ON2; i_ON3 = 1, ..., N_ON3. The fast algorithm can be immediately instantiated to this case, first by defining, as in equation 2.25, the following quantities:

$$Q_{OFF} = \sum_{i_{OFF}=1}^{N_{OFF}} q^{(i_{OFF})} \qquad Q_{ON_1} = \sum_{i_{ON_1}=1}^{N_{ON_1}} q^{(i_{ON_1})} \qquad Q_{ON_2} = \sum_{i_{ON_2}=1}^{N_{ON_2}} q^{(i_{ON_2})} \qquad Q_{ON_3} = \sum_{i_{ON_3}=1}^{N_{ON_3}} q^{(i_{ON_3})} \tag{3.15}$$
and then by expressing them in an iterative way:

$$Q_{OFF}(t + \Delta t) = e^{-\Delta t / \tau_{OFF}}\, Q_{OFF}(t) \tag{3.16}$$

$$Q_{ON_1}(t + \Delta t) = e^{-\Delta t / \tau_{ON_1}}\, Q_{ON_1}(t) + \bigl(1 - e^{-\Delta t / \tau_{ON_1}}\bigr)\, q_{\infty_1}\, N_{ON_1} \tag{3.17}$$

$$Q_{ON_2}(t + \Delta t) = e^{-\Delta t / \tau_{ON_2}}\, Q_{ON_2}(t) + \bigl(1 - e^{-\Delta t / \tau_{ON_2}}\bigr)\, q_{\infty_2}\, N_{ON_2} \tag{3.18}$$

$$Q_{ON_3}(t + \Delta t) = e^{-\Delta t / \tau_{ON_3}}\, Q_{ON_3}(t) + \bigl(1 - e^{-\Delta t / \tau_{ON_3}}\bigr)\, q_{\infty_3}\, N_{ON_3} \tag{3.19}$$

The resulting total synaptic conductance is given by the sum of all four terms (see equations 3.15 and 2.29):

$$g(t) \propto \sum_{i=1}^{N} q^{(i)} = Q_{OFF} + Q_{ON_1} + Q_{ON_2} + Q_{ON_3}. \tag{3.20}$$
In Figure 3, the time courses of the total synaptic conductance computed by using several techniques are compared.

4 Extension to Dynamic Synaptic Transmission and Plasticity

4.1 General Case. The consolidating algorithmic strategy we explored in this work turns out to constitute an appropriate framework for the optimization of simplified phenomenological descriptions of higher-order synaptic phenomena (Sen, Jorge-Rivera, Marder, & Abbott, 1996; Abbott & Marder, 1998). Several experimental investigations at the neuromuscular junction (Magleby & Zengel, 1975) and at central synapses (Thomson & Deuchars, 1994; Markram & Tsodyks, 1996; Varela et al., 1997; Chance et al., 1998) revealed that many dynamic behaviors characterize signal transmission between excitable cells (see Figure 4). Indeed, several frequency-dependent phenomena together with short-term plasticity occur over a heterogeneous set of timescales (Markram & Tsodyks, 1996; Varela et al., 1997). This past-history-dependent signal transmission results from presynaptic calcium accumulation, exhaustion of readily releasable neurotransmitter resources, finite-time mobilization of vesicle pools, desensitization of postsynaptic receptors, and probably many other biophysical mechanisms. For instance, four temporally distinct components of the enhancement of postsynaptic response amplitude have been identified (facilitation type 1, type 2, augmentation, and potentiation), and at least two components seem to be required to account for activity-dependent synaptic depression,
Figure 3: Temporal evolution of the total synaptic conductance, simulated using 1000 synapses, randomly activated and spontaneously releasing neurotransmitter molecules. (a) Numerical integration of equation 3.13. (b) The iterated closed form (see equation 3.14). (c) The optimized algorithm. The lines do not coincide in all three cases, because of time-step round-off differences among the implementations (see section 3.2).
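The round-off behavior referred to in section 3.2 can be reproduced for a single relaxation of equation 3.13 with constant τ and q_∞ (illustrative values below): the iterated exact update matches the closed form to machine precision, while a forward Euler step carries an O(Δt) discretization error:

```python
import math

tau, q_inf, dt, K = 0.2, 0.8, 0.05, 20    # illustrative constants, K time steps

a = math.exp(-dt / tau)
q_exact_iter, q_euler = 0.0, 0.0
for _ in range(K):
    q_exact_iter = a * q_exact_iter + (1.0 - a) * q_inf  # iterated exact solution
    q_euler += dt * (q_inf - q_euler) / tau              # forward Euler step

q_closed = q_inf * (1.0 - math.exp(-K * dt / tau))       # closed-form reference

err_exact = abs(q_exact_iter - q_closed)
err_euler = abs(q_euler - q_closed)
```

The fast algorithm inherits the error profile of the iterated exact solution, since it applies exactly the same affine map, only to consolidated sums.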
Figure 4: Characteristic temporal evolution of the total synaptic conductance in the case of short-term depression. The simulation refers to the activity of 1000 independent synaptic sites, activated randomly by Poisson point processes with the same mean frequency, synchronously increasing from 20 to 80 Hz (see Tsodyks, Pawelzik, & Markram, 1998), with τ_F set very small: the fast algorithm reduced the total CPU time by 79%.
involving the refractoriness of the exocytotic processes and a slower depletion of releasable vesicles. Recently a simplified model describing components of facilitating and depressing synaptic responses has been proposed (Abbott et al., 1997; Abbott & Marder, 1998; Chance et al., 1998). It consists of a multiplicative composition of variables, each accounting for a different aspect of the response to a train of presynaptic spikes. For instance, the following choice was revealed to be appropriate to fit experimental traces in the case of excitatory synapses in layer 2/3 of rat primary visual cortex (Varela et al., 1997):

$$g(t) = \bar g \cdot \bigl[ F \cdot D_1 \cdot D_2 \cdot D_3 \cdot q \bigr]. \tag{4.1}$$

Equation 4.1 includes one component for facilitation, F, and three for depression, D_1, D_2, D_3, acting on different timescales. In the case of a large number of synaptic conductances of this kind, the total current would be given by the expression

$$I_{syn} = \left[ \sum_{i=1}^{N} \bar g^{(i)} \cdot F^{(i)} \cdot D_1^{(i)} \cdot D_2^{(i)} \cdot D_3^{(i)} \cdot q^{(i)} \right] \cdot (V_m - E_{syn}). \tag{4.2}$$
Equation 4.1 can be considered a particular case of a more general multiplicative composition of variables, these being components of different state vectors, each evolving according to a distinct piecewise time-invariant linear differential system. For simplicity, we now focus on the generic multiplicative composition of three variables, regarding the case of N identical systems and deriving the general differential equations for each of them (as in equation 2.20):

$$\mathbf q^{(1)}, \dots, \mathbf q^{(i)}, \dots, \mathbf q^{(N)} \qquad \dot{\mathbf q}^{(i)} = Q^{(i)}\, \mathbf q^{(i)} \qquad \dim(\mathbf q) = n_q$$
$$\mathbf p^{(1)}, \dots, \mathbf p^{(i)}, \dots, \mathbf p^{(N)} \qquad \dot{\mathbf p}^{(i)} = P^{(i)}\, \mathbf p^{(i)} \qquad \dim(\mathbf p) = n_p \tag{4.3}$$
$$\mathbf s^{(1)}, \dots, \mathbf s^{(i)}, \dots, \mathbf s^{(N)} \qquad \dot{\mathbf s}^{(i)} = S^{(i)}\, \mathbf s^{(i)} \qquad \dim(\mathbf s) = n_s$$

Assuming the usual piecewise time invariance of the transition matrices, we rewrite equation 2.21:

$$\forall t\; \forall i \quad \begin{cases} \exists\, m_q \in \{1, \dots, M_q\}: & Q^{(i)}(t) = W_{m_q} \\ \exists\, m_p \in \{1, \dots, M_p\}: & P^{(i)}(t) = W_{m_p} \\ \exists\, m_s \in \{1, \dots, M_s\}: & S^{(i)}(t) = W_{m_s}. \end{cases} \tag{4.4}$$

We are thereby inducing an instantaneous partition over the set of N systems, now regarding the possible combinations ⟨Q, P, S⟩, such that

$$\langle Q, P, S \rangle = \bigl\langle W_{m_q}, W_{m_p}, W_{m_s} \bigr\rangle, \qquad (m_q, m_p, m_s) \in \{1, \dots, M_q\} \times \{1, \dots, M_p\} \times \{1, \dots, M_s\} \tag{4.5}$$
at any given time. By defining the number N(m_q, m_p, m_s) of systems that share the same combination ⟨Q, P, S⟩ = ⟨W_{m_q}, W_{m_p}, W_{m_s}⟩, the following expression, analogous to equation 2.23, holds:

$$N = \sum_{m_q=1}^{M_q} \sum_{m_p=1}^{M_p} \sum_{m_s=1}^{M_s} N(m_q, m_p, m_s). \tag{4.6}$$

Finally, as reported in detail in appendix B, we introduce a small set of new lumped terms, in a similar way to what we discussed in section 2.2 (see equation 2.25). These are tensors ε_{yru}(m_q, m_p, m_s) whose components are defined by the sums of the state vectors composed in a multiplicative way, according to all possible combinations (see section 4.2 and Table 2). Moreover, multiplicative composition models and plasticity rules affecting the maximal synaptic efficacies can thus be introduced and optimized, given that the equations describing each variable are piecewise-stationary linear differential equations.

4.2 Fast Calculation of Short-Term Depressing and Facilitating Synaptic Conductances. We report an example of the optimization technique for
the simulation of two-component depressing and facilitating synaptic conductances. In particular, we consider a reduced version of the total synaptic current given in equation 4.2, assuming just a single-component facilitation and a single-component depression:

$$I_{syn} = \left[ \sum_{i=1}^{N} \bar g^{(i)} \cdot F^{(i)} \cdot D^{(i)} \cdot q^{(i)} \right] \cdot (E_{syn} - V_m). \tag{4.7}$$
We note that the maximal synaptic efficacy of each synapse is now considered to be unique (i.e., we regard ḡ as just another state variable, like q, p, or s in equation B.9) and constant in time. In other words, we will not introduce any synaptic-efficacy plasticity rule, but we note that specific long-term plasticity rules could easily have been introduced as well (Markram, Lübke, Frotscher, & Sakmann, 1997; Senn, Tsodyks, & Markram, 1997; Abbott & Song, 1999). The shape of the postsynaptic potential is described by the time course of q(t), the fraction of receptors bound to neurotransmitter molecules, according to the simple two-state kinetic scheme reviewed in section 2.2 (see equation 2.12). In a similar way to equation 3.14, since M = 2, we choose the following index notation: i_ON = 1, ..., N_ON and i_OFF = 1, ..., N_OFF, in order to address independently each conductance sharing the same value of T(t), and we
Table 2: Consolidated New State Variables, in the Example of the Fast Calculation of Short-Term Depressing and Facilitating Synaptic Conductances.

    Symbol    Expression
    G_ON      Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)} · q^{(i_ON)}
    G_OFF     Σ_{i_OFF=1}^{N_OFF}  ḡ^{(i_OFF)} · q^{(i_OFF)}
    C_ON      Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)} · D^{(i_ON)} · q^{(i_ON)}
    C_OFF     Σ_{i_OFF=1}^{N_OFF}  ḡ^{(i_OFF)} · D^{(i_OFF)} · q^{(i_OFF)}
    V_ON      Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)} · F^{(i_ON)} · q^{(i_ON)}
    V_OFF     Σ_{i_OFF=1}^{N_OFF}  ḡ^{(i_OFF)} · F^{(i_OFF)} · q^{(i_OFF)}
    Δ         Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)} · D^{(i_ON)}
    ℑ         Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)} · F^{(i_ON)}
    λ         Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)} · F^{(i_ON)} · D^{(i_ON)}
    Σ         Σ_{i_ON=1}^{N_ON}    ḡ^{(i_ON)}
express those terms iteratively:

$$q^{(i)}(t) = \begin{cases} q^{(i_{OFF})}(t) \\ q^{(i_{ON})}(t) \end{cases} \qquad \begin{aligned} q^{(i_{OFF})}(t + \Delta t) &= a_{OFF} \cdot q^{(i_{OFF})}(t) \\ q^{(i_{ON})}(t + \Delta t) &= a_{ON} \cdot q^{(i_{ON})}(t) + b. \end{aligned} \tag{4.8}$$
Following Abbott and Marder (1998), F and D are nonnegative variables evolving in time according to two-state kinetics, accounting for an exponential recovery toward a unitary resting value (see equations 4.10 and 4.12). Whenever a presynaptic spike occurs, D and F are instantaneously decreased and increased, respectively, in a multiplicative (equation 4.11) or
Table 3: Constant Parameters Defined in the Example of the Fast Calculation of Short-Term Depressing and Facilitating Synaptic Conductances.

    Symbol    Expression
    a_ON      e^{-Δt/τ}
    a_OFF     e^{-β·Δt}
    b         q_∞ · (1 - e^{-Δt/τ})
    c         e^{-Δt/τ_D}
    h         e^{-Δt/τ_F}
    τ         (α·T_MAX + β)^{-1}
    q_∞       α·T_MAX / (α·T_MAX + β)
    A_ON1     a_ON · c · h
    A_ON2     a_ON · c · (1 - h)
    A_ON3     a_ON · (1 - c) · h
    A_ON4     a_ON · (1 - c) · (1 - h)
    B_1       b · c · h
    B_2       b · c · (1 - h)
    B_3       b · (1 - c) · h
    B_4       b · (1 - c) · (1 - h)
    A_OFF1    a_OFF · c · h
    A_OFF2    a_OFF · c · (1 - h)
    A_OFF3    a_OFF · (1 - c) · h
    A_OFF4    a_OFF · (1 - c) · (1 - h)
additive (equation 4.13) manner:

$$\dot D = (1 - D)\,\tau_D^{-1} \qquad \dot F = (1 - F)\,\tau_F^{-1} \tag{4.9}$$

$$D^{(i)}(t + \Delta t) = c \cdot D^{(i)}(t) + (1 - c) \tag{4.10}$$

$$D^{(i)} := d^{(i)} \cdot D^{(i)} \tag{4.11}$$

$$F^{(i)}(t + \Delta t) = h \cdot F^{(i)}(t) + (1 - h) \tag{4.12}$$

$$F^{(i)} := f^{(i)} + F^{(i)}, \tag{4.13}$$
with a_ON, a_OFF, c, and h being constants whose values are reported in Table 3. As discussed for the general case (see appendix B), we note that the total synaptic conductance, referred to as

$$w(t) = \left[ \sum_{i=1}^{N} \bar g^{(i)} \cdot F^{(i)} \cdot D^{(i)} \cdot q^{(i)} \right], \tag{4.14}$$
can be expressed as the sum of two terms, independently of N:

$$w(t) = w_{ON}(t) + w_{OFF}(t), \tag{4.15}$$

where, by definition,

$$w_{ON}(t) = \left[ \sum_{i_{ON}=1}^{N_{ON}} \bar g^{(i_{ON})} \cdot F^{(i_{ON})} \cdot D^{(i_{ON})} \cdot q^{(i_{ON})} \right] \tag{4.16}$$

and

$$w_{OFF}(t) = \left[ \sum_{i_{OFF}=1}^{N_{OFF}} \bar g^{(i_{OFF})} \cdot F^{(i_{OFF})} \cdot D^{(i_{OFF})} \cdot q^{(i_{OFF})} \right]. \tag{4.17}$$
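The jump-and-relax rules of equations 4.10 through 4.13 can be sketched for a single synapse driven by a regular presynaptic spike train; all parameter values below are hypothetical:

```python
import math

dt = 1.0                      # ms
tau_D, tau_F = 200.0, 50.0    # recovery time constants (ms)
d, f = 0.6, 0.3               # per-spike multiplicative / additive jumps
c = math.exp(-dt / tau_D)     # Table 3
h = math.exp(-dt / tau_F)

D, F = 1.0, 1.0
spike_times = set(range(0, 500, 25))   # regular 40 Hz presynaptic train
trace_D, trace_F = [], []
for step in range(1000):               # 500 ms of spiking, then 500 ms recovery
    if step in spike_times:
        D *= d                         # equation 4.11: depression jump
        F += f                         # equation 4.13: facilitation jump
    D = c * D + (1.0 - c)              # equation 4.10: relax toward 1
    F = h * F + (1.0 - h)              # equation 4.12: relax toward 1
    trace_D.append(D)
    trace_F.append(F)
```

During the train D settles well below 1 and F well above 1; after the train ends, both relax back toward their unitary resting values, F much faster than D given the shorter τ_F.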
5 Discussion
We have introduced a general class of consolidating algorithms for efficiently implementing synaptic transmission in large network simulations, when the dynamics of the individual synaptic current is described by a generic n-state Markov model. Our approach relies on the piecewise time invariance and linearity of the differential equations defining the deterministic temporal evolution of the synaptic conductance, and thus it does not apply to all cases of synaptic transmission and neuromodulation. In fact, the technique introduced in the first part of this work deals only with the description of ionotropic postsynaptic receptors, under hypotheses ensuring that the coefficients of the transition matrix (even when both ligand gated and voltage dependent) can be regarded as piecewise constant (Destexhe et al., 1994a, 1994b; Destexhe, 1997). As discussed in Destexhe et al. (1994b), simplified models of second-messenger-gated ion channels can be provided by approximating the concentration of the second messenger (e.g., in the case of G-protein-gated channels, including GABA_B, 5-HT1, and muscarinic M2 receptors) as a pulse interacting with the transition coefficients of three- and four-state models, which can be optimized by our strategy. Throughout the article, we have alternately focused on complex models, including multiple states with potentially distinct functional roles and accounting for higher-order phenomena, and on a much simpler two-state kinetics exemplifying the general approaches described. In the latter case, only the gross kinetic properties of synaptic transmission, relevant in modeling phenomena such as the genesis of oscillatory electrophysiological activity (Destexhe, Contreras, & Steriade, 1998), are taken into account, and fast calculation strategies have already been proposed (Srinivasan & Chiel, 1993; Lytton, 1996).
Our work proved that such algorithmic strategies can be generalized to more detailed models of synaptic physiology and expressed under the same formalism. The generalization allowed us to predict quantitatively the general performance improvement in terms of CPU time and
numerical accuracy, and it turned out to be suitable for the computation of the total synaptic conductance resulting from a multiplicative composition of state variables in activity-dependent models of synaptic physiology (Abbott & Marder, 1998). This implies that our approach can be conveniently extended to contexts where temporal interactions and activity-dependent spike-train processing dynamics have to be considered, thus generalizing a previous attempt (Giugliano et al., 1999). In this perspective, the applicability of Markov kinetic models as a mathematical tool for synaptic modeling in large network implementations has been improved and extended, and specific examples and computer simulations of the algorithms that efficiently simulate a large collection of synaptic inputs converging onto the same isopotential compartment have been presented. Beyond the speed-up achieved by the algorithms, the strategies presented in the last part of this work constitute a relevant tool for developing new biophysically inspired models of complex self-organization properties of the nervous system. Indeed, because a very large number of synaptic inputs can be introduced into neuronal network simulations at no additional CPU-time cost, constrained only by the available computer memory, the impact on the collective activity of biophysical phenomena activating or modulating the properties of synaptic transmission at a site previously silent or irrelevant to the global activity can be investigated. In particular, phenomena such as long- and short-term plasticity (Markram et al., 1997; Senn et al., 1997; Tsodyks & Markram, 1997; Varela et al., 1997; Abbott & Song, 1999), synaptogenesis, synaptic maturation (Kirsch & Betz, 1998), and unmasking (Liao, Hessler, & Malinow, 1995) can be modeled and efficiently simulated in large networks using the discussed approach.

Appendix A: General Form of the Algorithm
According to the notation introduced in section 2.2, we summarize the general form of the algorithm by dividing it into two parts. The first is performed only when the kth system, previously characterized by the transition matrix W_m, changes at a time t and starts to evolve according to another matrix W_n (see equation 2.14), as a consequence of a change in the concentration of the neurotransmitter/neuromodulator in the kth synaptic cleft.

Part 1:

    q^{(k)}(t) ← e^{(t − t_OLD^{(k)}) W_m} q^{(k)}(t_OLD^{(k)})
    w_m ← w_m − q^{(k)}(t)        N_m ← N_m − 1
    w_n ← w_n + q^{(k)}(t)        N_n ← N_n + 1
    t_OLD^{(k)} ← t

Part 2:

    FOR m = 1, ..., M:
        w_m(t + Δt) ← U_m w_m(t)
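A minimal sketch of the two parts above, for scalar two-state synapses in M = 2 classes (illustrative rates): part 2 advances only the class accumulators at each step, while part 1 runs on an event, bringing the affected state forward analytically over the elapsed time and moving it between accumulators. A final explicit pass verifies the bookkeeping:

```python
import math, random

random.seed(2)
alpha, beta, T_max, dt = 2.0, 1.0, 1.0, 0.1    # hypothetical rate constants
par = {  # per-class (decay, q_inf): q(t + dt) = q_inf + (q - q_inf) * decay
    "OFF": (math.exp(-beta * dt), 0.0),
    "ON": (math.exp(-(alpha * T_max + beta) * dt),
           alpha * T_max / (alpha * T_max + beta)),
}

N = 200
cls = ["OFF"] * N
q = [random.random() for _ in range(N)]
t_last = [0] * N                     # last step at which q[i] was made explicit
w = {"OFF": sum(q), "ON": 0.0}
n_m = {"OFF": N, "ON": 0}

def bring_forward(i, step):
    """Part 1: update q[i] analytically over the elapsed number of steps."""
    decay, q_inf = par[cls[i]]
    q[i] = q_inf + (q[i] - q_inf) * decay ** (step - t_last[i])
    t_last[i] = step

for step in range(1, 301):
    # Part 2: propagate only the M consolidated accumulators.
    for m in ("OFF", "ON"):
        decay, q_inf = par[m]
        w[m] = decay * w[m] + (1.0 - decay) * q_inf * n_m[m]
    # Event: one random synapse switches class (e.g., a presynaptic spike).
    i = random.randrange(N)
    bring_forward(i, step)
    old, new = cls[i], ("ON" if cls[i] == "OFF" else "OFF")
    w[old] -= q[i]; n_m[old] -= 1
    w[new] += q[i]; n_m[new] += 1
    cls[i] = new

# Verification pass: make every q[i] explicit and compare against the accumulators.
for i in range(N):
    bring_forward(i, 300)
total_naive = sum(q)
total_fast = w["OFF"] + w["ON"]
```

The per-event cost is O(1), so the overall cost scales with the event rate rather than with a per-step sweep over all N synapses.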
Appendix B: Dynamic Synaptic Transmission

As done in equation 2.16, the exponentials of the matrices Q, P, and S can be computed off-line and referred to as

$$\tilde Q_{m_q} = e^{\Delta t\, W_{m_q}} \qquad m_q = 1, \dots, M_q \tag{B.1}$$

$$\tilde V_{m_p} = e^{\Delta t\, W_{m_p}} \qquad m_p = 1, \dots, M_p \tag{B.2}$$

$$\tilde X_{m_s} = e^{\Delta t\, W_{m_s}} \qquad m_s = 1, \dots, M_s. \tag{B.3}$$
In a similar way to what has been discussed for equation 2.22, we define an index i*(m_q, m_p, m_s), running from 1 to N(m_q, m_p, m_s), identifying the subset of systems sharing the same combination ⟨Q, P, S⟩ at any time t. We generalize the formulation given in equation 4.2 by defining e_{jhk} as the sum, over all the N systems, of the multiplicative composition of the jth component of the state vector q, the hth component of the state vector p, and the kth component of the state vector s, and by equivalently expressing it as a sum over the items of the same partition. We first define

e_{jhk}(m_q, m_p, m_s) = \sum_{i^*(m_q, m_p, m_s) = 1}^{N(m_q, m_p, m_s)} q_j^{(i)} \cdot p_h^{(i)} \cdot s_k^{(i)},   (B.4)
the same multiplicative composition summed over only those systems sharing ⟨Q, P, S⟩ = ⟨W_{m_q}, W_{m_p}, W_{m_s}⟩. We thus note that

e_{jhk} = \sum_{i=1}^{N} q_j^{(i)} \cdot p_h^{(i)} \cdot s_k^{(i)} = \sum_{m_q=1}^{M_q} \sum_{m_p=1}^{M_p} \sum_{m_s=1}^{M_s} e_{jhk}(m_q, m_p, m_s)   (B.5)
and, thanks to the linearity of equations 4.3, we conclude that it is possible to express e_{jhk} iteratively in terms of the (M_q · M_p · M_s) quantities e_{jhk}(m_q, m_p, m_s):

e_{jhk}\big|_{t+\Delta t} = \sum_{m_q=1}^{M_q} \sum_{m_p=1}^{M_p} \sum_{m_s=1}^{M_s} \left\{ \sum_{y=1}^{n_q} \sum_{r=1}^{n_p} \sum_{u=1}^{n_s} [\tilde{Q}_{m_q}]_{jy} [\tilde{V}_{m_p}]_{hr} [\tilde{X}_{m_s}]_{ku} \; e_{yru}(m_q, m_p, m_s)\big|_t \right\}.   (B.6)
Equation B.6 constitutes the main stage of the general optimization technique. It proves that this fast algorithm exactly replaces the standard procedure, given by equation B.8, and that it leads to a consistent speed-up as soon as

N \gg \frac{5 \cdot \Delta_{prod} + \Delta_{sum}}{3 \cdot \Delta_{prod} + \Delta_{sum}} \cdot (M_q \cdot M_p \cdot M_s),   (B.7)

the standard procedure being

\sum_{i=1}^{N} q_j^{(i)} \cdot p_h^{(i)} \cdot s_k^{(i)} \Big|_{t+\Delta t} = \sum_{i=1}^{N} \left\{ \sum_{y=1}^{n_q} \sum_{r=1}^{n_p} \sum_{u=1}^{n_s} [\tilde{Q}_{m_q}]_{jy} [\tilde{V}_{m_p}]_{hr} [\tilde{X}_{m_s}]_{ku} \; q_y^{(i)} \cdot p_r^{(i)} \cdot s_u^{(i)} \Big|_t \right\}.   (B.8)
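The equivalence between the grouped update of equation B.6 and the direct sum of equation B.8 can be checked numerically on toy sizes; the snippet below is an illustration with hypothetical names, using a single group and random matrices for brevity:

```python
import numpy as np

# Each system i carries state vectors q, p, s that evolve by fixed matrices
# Qm, Vm, Xm; the quantity e[j, h, k] = sum_i q_j p_h s_k can be advanced
# per group (B.6) instead of per system (B.8).

rng = np.random.default_rng(0)
n = 2                                 # state dimension n_q = n_p = n_s
Qm, Vm, Xm = (rng.random((n, n)) for _ in range(3))   # one group for brevity
N = 50
q = rng.random((N, n)); p = rng.random((N, n)); s = rng.random((N, n))

# e[j, h, k] summed over all systems (all in the same group here)
e = np.einsum('ij,ih,ik->jhk', q, p, s)

# B.6: advance the grouped quantity directly, without touching the N systems
e_fast = np.einsum('jy,hr,ku,yru->jhk', Qm, Vm, Xm, e)

# B.8: advance every system individually, then re-sum
q2, p2, s2 = q @ Qm.T, p @ Vm.T, s @ Xm.T
e_slow = np.einsum('ij,ih,ik->jhk', q2, p2, s2)
```

The grouped update touches n³ numbers per group per step, while the direct sum touches all N systems, which is the source of the speed-up condition B.7.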
Concerning the specific example discussed in section 4.2, for the fast calculation of short-term depressing and facilitating synaptic conductances, a few consolidated new state variables (i.e., e_{yru}(ON) and e_{yru}(OFF)) have been defined, as reported in Table 2. In order to calculate equation 4.15 quickly, at each time step it is now required to compute the following linear iterated expressions, independent of N:

ω_ON(t + Δt) = A_ON1·ω_ON(t) + A_ON2·C_ON(t) + A_ON3·V_ON(t) + A_ON4·Γ_ON(t) + B_1·λ(t) + B_2·D(t) + B_3·ℑ(t) + B_4·S(t)   (B.9)

ω_OFF(t + Δt) = A_OFF1·ω_OFF(t) + A_OFF2·C_OFF(t) + A_OFF3·V_OFF(t) + A_OFF4·Γ_OFF(t)   (B.10)

Γ_ON(t + Δt) = a_ON·Γ_ON(t) + b·S(t)
Γ_OFF(t + Δt) = a_OFF·Γ_OFF(t)
C_ON(t + Δt) = a_ON·c·C_ON(t) + a_ON·(1 − c)·Γ_ON(t) + b·c·D(t) + b·(1 − c)·S(t)
C_OFF(t + Δt) = a_OFF·c·C_OFF(t) + a_OFF·(1 − c)·Γ_OFF(t)
V_ON(t + Δt) = a_ON·h·V_ON(t) + a_ON·(1 − h)·Γ_ON(t) + b·h·ℑ(t) + b·(1 − h)·S(t)
V_OFF(t + Δt) = a_OFF·h·V_OFF(t) + a_OFF·(1 − h)·Γ_OFF(t)
D(t + Δt) = c·D(t) + (1 − c)·S(t)
ℑ(t + Δt) = h·ℑ(t) + (1 − h)·S(t)
λ(t + Δt) = c·h·λ(t) + c·(1 − h)·D(t) + h·(1 − c)·ℑ(t) + (1 − c)·(1 − h)·S(t)
The following updating steps must also be performed when the kth synapse, previously inactive, changes state from OFF to ON at time t:

q^(k) ← e^{−β·[t − (t_0^(k) + C_dur)]} · q^(k)
D^(k) ← e^{−(t − t_0^(k))/τ_D} · D^(k) + (1 − e^{−(t − t_0^(k))/τ_D})
F^(k) ← e^{−(t − t_0^(k))/τ_F} · F^(k) + (1 − e^{−(t − t_0^(k))/τ_F})
S ← S + ḡ^(k)
Γ_OFF ← Γ_OFF − ḡ^(k)·q^(k)          Γ_ON ← Γ_ON + ḡ^(k)·q^(k)
ω_OFF ← ω_OFF − ḡ^(k)·F^(k)·D^(k)·q^(k)
C_OFF ← C_OFF − ḡ^(k)·D^(k)·q^(k)
V_OFF ← V_OFF − ḡ^(k)·F^(k)·q^(k)
D^(k) ← d^(k)·D^(k)          F^(k) ← f^(k) + F^(k)
ω_ON ← ω_ON + ḡ^(k)·F^(k)·D^(k)·q^(k)
C_ON ← C_ON + ḡ^(k)·D^(k)·q^(k)
V_ON ← V_ON + ḡ^(k)·F^(k)·q^(k)
D ← D + ḡ^(k)·D^(k)          ℑ ← ℑ + ḡ^(k)·F^(k)          λ ← λ + ḡ^(k)·F^(k)·D^(k)
t_0^(k) ← t

When the kth synapse, already active, turns on again at time t, the following steps must be executed:

q^(k) ← e^{−(t − t_0^(k))/τ} · q^(k) + q_∞·(1 − e^{−(t − t_0^(k))/τ})
D^(k) ← e^{−(t − t_0^(k))/τ_D} · D^(k) + (1 − e^{−(t − t_0^(k))/τ_D})
F^(k) ← e^{−(t − t_0^(k))/τ_F} · F^(k) + (1 − e^{−(t − t_0^(k))/τ_F})
ω_ON ← ω_ON − ḡ^(k)·F^(k)·D^(k)·q^(k)
C_ON ← C_ON − ḡ^(k)·D^(k)·q^(k)
V_ON ← V_ON − ḡ^(k)·F^(k)·q^(k)
D ← D − ḡ^(k)·D^(k)          ℑ ← ℑ − ḡ^(k)·F^(k)          λ ← λ − ḡ^(k)·F^(k)·D^(k)
D^(k) ← d^(k)·D^(k)          F^(k) ← f^(k) + F^(k)
ω_ON ← ω_ON + ḡ^(k)·F^(k)·D^(k)·q^(k)
C_ON ← C_ON + ḡ^(k)·D^(k)·q^(k)
V_ON ← V_ON + ḡ^(k)·F^(k)·q^(k)
D ← D + ḡ^(k)·D^(k)          ℑ ← ℑ + ḡ^(k)·F^(k)          λ ← λ + ḡ^(k)·F^(k)·D^(k)
t_0^(k) ← t
When the jth synapse turns OFF at time t, the following steps must be performed:

S ← S − ḡ^(j)
q^(j) ← e^{−(t − t_0^(j))/τ} · q^(j) + q_∞·(1 − e^{−(t − t_0^(j))/τ})
D^(j) ← e^{−(t − t_0^(j))/τ_D} · D^(j) + (1 − e^{−(t − t_0^(j))/τ_D})
F^(j) ← e^{−(t − t_0^(j))/τ_F} · F^(j) + (1 − e^{−(t − t_0^(j))/τ_F})
Γ_ON ← Γ_ON − ḡ^(j)·q^(j)          Γ_OFF ← Γ_OFF + ḡ^(j)·q^(j)
ω_ON ← ω_ON − ḡ^(j)·F^(j)·D^(j)·q^(j)          ω_OFF ← ω_OFF + ḡ^(j)·F^(j)·D^(j)·q^(j)
C_ON ← C_ON − ḡ^(j)·D^(j)·q^(j)          C_OFF ← C_OFF + ḡ^(j)·D^(j)·q^(j)
V_ON ← V_ON − ḡ^(j)·F^(j)·q^(j)          V_OFF ← V_OFF + ḡ^(j)·F^(j)·q^(j)
D ← D − ḡ^(j)·D^(j)          ℑ ← ℑ − ḡ^(j)·F^(j)          λ ← λ − ḡ^(j)·F^(j)·D^(j)
Acknowledgments
I thank A. Destexhe for comments on an earlier version of this article, as well as W. Rocchia for useful discussions and the referees for their helpful comments and suggestions. I am especially grateful to M. Grattarola, M. Bove, and J. Lowenstein for critically reading the manuscript. This work was supported by the University of Genoa (Italy).

References

Abbott, L. F., & Marder, E. (1998). Modeling small networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (2nd ed.). Cambridge, MA: MIT Press.
Abbott, L. F., & Song, S. (1999). Temporally asymmetric Hebbian learning, spike timing, and neuronal response variability. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–223.
Anderson, C. R., & Stevens, C. F. (1973). Voltage-clamp analysis of acetylcholine-produced end-plate current fluctuations at frog neuromuscular junction. J. Physiol. (London), 235, 655–691.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (1998). Synaptic depression and the temporal response characteristics of V1 simple cells. J. Neurosci., 18, 4785–4799.
Colquhoun, D., & Hawkes, A. G. (1983). The principles of the stochastic interpretation of ion-channel mechanisms. In B. Sakmann & E. Neher (Eds.), Single-channel recording (pp. 135–175). New York: Plenum Press.
Destexhe, A. (1997). Conductance-based integrate-and-fire models. Neural Comput., 9, 503–514.
Destexhe, A., Contreras, D., & Steriade, M. (1998). Mechanisms underlying the synchronizing action of corticothalamic feedback through inhibition of thalamic relay cells. J. Neurophys., 79, 999–1016.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994a). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Comp., 6, 14–18.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994b). Synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci., 1, 195–230.
Finkel, L. H., & Edelman, G. M. (1987). Population rules for synapses in networks. In G. M. Edelman, W. E. Gall, & W. M. Cowan (Eds.), Synaptic function (pp. 711–757). New York: Wiley.
Gantmacher, F. R. (1959). The theory of matrices. New York: Chelsea Publishing Company.
Giugliano, M., Bove, M., & Grattarola, M. (1999). Fast calculation of short-term depressing synaptic conductances. Neural Comp., 11, 1413–1426.
Grattarola, M., & Massobrio, G. (1998). Bioelectronics handbook: MOSFETs, biosensors, neurons. New York: McGraw-Hill.
Hille, B. (1992). Ionic channels of excitable membranes (2nd ed.). Sunderland, MA: Sinauer.
Howard, R. A. (1971). Dynamic probabilistic systems. New York: Wiley.
Kirsch, J., & Betz, H. (1998). Glycine-receptor activation is required for receptor clustering in spinal neurons. Nature, 392.
Koch, C., & Segev, I. (Eds.). (1998). Methods in neuronal modeling: From ions to networks (2nd ed.). Cambridge, MA: MIT Press.
Köhn, J., & Wörgötter, F. (1998). Employing the Z-transform to optimize the calculation of the synaptic conductance of NMDA and other synaptic channels in network simulations. Neural Comp., 10, 1639–1651.
Liao, D., Hessler, N. A., & Malinow, R. (1995). Activation of postsynaptically silent synapses during pairing-induced LTP in CA1 region of hippocampal slice. Nature, 375, 400–404.
Lytton, W. W. (1996). Optimizing synaptic conductance calculation for network simulations.
Neural Comp., 8, 501–509.
Magleby, K. L., & Zengel, J. E. (1975). A quantitative description of stimulation-induced changes in transmitter release at the frog neuromuscular junction. J. Gen. Physiol., 80, 613–638.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Oppenheim, A. V., & Schafer, R. W. (1975). Digital signal processing. London: Prentice Hall International.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill.
Paré, D., Shink, E., Gaudreau, H., Destexhe, A., & Lang, E. J. (1998). Impact of spontaneous synaptic activity on the resting properties of cat neocortical
pyramidal neurons in vivo. J. Neurophys., 79, 1450–1460.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1996). Numerical recipes in C: The art of scientific computing (2nd ed.). Cambridge: Cambridge University Press.
Sakmann, B., & Neher, E. (1983). Single-channel recording. New York: Plenum Press.
Sen, K., Jorge-Rivera, J. C., Marder, E., & Abbott, L. F. (1996). Decoding synapses. J. Neurosci., 16, 6307–6318.
Senn, W., Tsodyks, M., & Markram, H. (1997). An algorithm for synaptic modification based on exact timing of pre- and post-synaptic action potentials. In W. Gerstner, A. Germond, M. Hasler, & J. D. Nicoud (Eds.), Artificial neural networks (pp. 121–126). Lausanne: ICANN.
Shepherd, G. M. (1988). Neurobiology (2nd ed.). Oxford: Oxford University Press.
Standley, C., Norris, T. M., Ramsey, R. L., & Usherwood, P. N. R. (1993). Gating kinetics of the quisqualate-sensitive glutamate receptor of locust muscle studied using agonist concentration jumps and computer simulations. Biophys. J., 65, 1379–1386.
Srinivasan, R., & Chiel, H. J. (1993). Fast calculation of synaptic conductances. Neural Comp., 5, 200–204.
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Tsodyks, M. V., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci. USA, 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Comp., 10, 821–835.
Varela, J., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. J. Neurosci., 17, 7926–7940.
Walmsley, B., Alvarez, F. J., & Fyffe, R. E. W. (1998). Diversity of structure and function at mammalian central synapses. Trends Neurosci., 21, 81–88.

Received April 2, 1999; accepted May 26, 1999.
LETTER
Communicated by C. Lee Giles
Hierarchical Bayesian Models for Regularization in Sequential Learning

J. F. G. de Freitas, M. Niranjan, A. H. Gee
Cambridge University Engineering Department, Cambridge CB2 1PZ, U.K.

We show that a hierarchical Bayesian modeling approach allows us to perform regularization in sequential learning. We identify three inference levels within this hierarchy: model selection, parameter estimation, and noise estimation. In environments where data arrive sequentially, techniques such as cross validation to achieve regularization or model selection are not possible. The Bayesian approach, with extended Kalman filtering at the parameter estimation level, allows for regularization within a minimum variance framework. A multilayer perceptron is used to generate the extended Kalman filter nonlinear measurements mapping. We describe several algorithms at the noise estimation level that allow us to implement on-line regularization. We also show the theoretical links between adaptive noise estimation in extended Kalman filtering, multiple adaptive learning rates, and multiple smoothing regularization coefficients.
1 Introduction
Sequential training of neural networks is important in applications where data sequences either exhibit nonstationary behavior or are difficult and expensive to obtain before the training process. Scenarios where this type of sequence arises include tracking and surveillance, control systems, fault detection, signal processing, communications, econometric systems, demographic systems, geophysical problems, operations research, and automatic navigation. Although there has been great interest in the topic of regularization in batch learning tasks, the topic has not received much attention in sequential learning tasks. In this article, we adopt a hierarchical Bayesian framework in conjunction with dynamical state-space models to derive regularization algorithms for sequential estimation. These algorithms are based on estimating the noise processes on-line, while estimating the network weights with the extended Kalman filter (EKF). In addition, we show that this methodology provides a unifying theoretical framework for many sequential estimation algorithms that attempt to avoid local minima in the error function. In particular, we show that adaptive noise covariances in extended Kalman filtering, multiple adaptive learning rates in on-line backpropagation, and multiple smoothing regularization coefficients are mathematically equivalent.

Section 2 describes the sequential learning task using state-space models and a three-level hierarchical Bayesian structure. The three levels of inference correspond to noise estimation, parameter estimation, and model selection. In section 3, we propose a solution to the parameter estimation level based on the application of the EKF to neural networks. Section 4 is devoted to the noise estimation level and regularization. Finally, we present our experiments in section 5 and point out several areas for further research in section 6.

Neural Computation 12, 933–953 (2000) © 2000 Massachusetts Institute of Technology

2 Dynamical Hierarchical Bayesian Models
We address the problem of training neural networks sequentially within the following dynamical state-space framework:

w_{k+1} = w_k + d_k   (2.1)
y_k = g_k(w_k, x_k) + v_k,   (2.2)

where k (k = 1, ..., L) denotes the discrete time index and y_k denotes the output measurements of the system. Each level of inference updates a posterior density of the form likelihood × prior / evidence.
We propose the following inference levels:

Level 1: Parameter estimation

p(w_{k+1} \mid Y_{k+1}, M_j, R_{k+1}, Q_k) = \frac{p(y_{k+1} \mid w_{k+1}, M_j, R_{k+1}, Q_k)}{p(y_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k)}\, p(w_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k).   (2.3)

Level 2: Noise estimation

p(R_{k+1}, Q_k \mid Y_{k+1}) = \frac{p(y_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k)}{p(y_{k+1} \mid Y_k, M_j)}\, p(R_{k+1}, Q_k \mid Y_k, M_j).   (2.4)

Level 3: Model selection

p(M_j \mid Y_{k+1}) = \frac{p(y_{k+1} \mid Y_k, M_j)}{p(y_{k+1} \mid Y_k)}\, p(M_j \mid Y_k).   (2.5)
The likelihood function at a particular level constitutes the evidence function at the next higher level. Therefore, by maximizing the evidence function at the parameter estimation level, we are, in fact, maximizing the likelihood of the noise covariances R_k and Q_k as the new data arrive. This result plays an important role when we devise methods for estimating the noise covariances. At the parameter estimation level, we apply the EKF algorithm to estimate the weights of an MLP. The EKF, however, requires knowledge of the noise covariances. To overcome this difficulty, in section 4 we present techniques for estimating these covariances in slowly changing nonstationary environments. There, we show that algorithms for estimating the noise covariances allow us to perform regularization in a sequential framework. Model selection is not covered in this article.

3 Parameter Estimation

For optimality reasons, we want ŵ_k to be an unbiased, minimum variance, and consistent estimate (Gelb, 1974). The minimum variance estimation framework is based on minimizing the variance of the neural network weights, thereby leading to smooth estimates for the network weights and outputs. A popular and suboptimal strategy for obtaining estimates of this type is to employ the EKF for parameter estimation. The EKF is a minimum variance estimator based on a Taylor series expansion of the nonlinear function g_k(w_k, x_k) around the previous estimate. Using this expansion, and under the assumptions that the state-space model noise processes are uncorrelated with each other and with the initial estimates of the parameters w_k and their covariance matrix P_k, we can model the prior, evidence, and likelihood functions as follows (de Freitas, Niranjan, & Gee, 1997):

Prior = p(w_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k) \approx N(\hat{w}_k,\; P_k + Q_k)

Evidence = p(y_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k) \approx N(g(\hat{w}_k, x_{k+1}),\; G_{k+1}(P_k + Q_k)G_{k+1}^T + R_{k+1})

Likelihood = p(y_{k+1} \mid w_{k+1}, M_j, R_{k+1}, Q_k) \approx N(g(w_{k+1}, x_{k+1}),\; R_{k+1}),

where G denotes the Jacobian \partial g / \partial w \mid_{w = \hat{w}} and the symbol T denotes the transpose of a matrix. Since the EKF is a suboptimal estimator based on linearization of a nonlinear mapping, ŵ is only an approximation to the expected value and, strictly speaking, P_k is an approximation to the covariance matrix. It is also important to point out that the EKF may diverge as a result of its inherent approximations. The consistency of the EKF may be evaluated by means of extensive Monte Carlo simulations (Bar-Shalom & Li, 1993). Substituting the expressions for the prior, likelihood, and evidence into equation 2.3 yields the posterior density function:

Posterior = p(w_{k+1} \mid Y_{k+1}, M_j, R_{k+1}, Q_k) \approx N(\hat{w}_{k+1},\; P_{k+1}),

where the updated weights, covariance, and Kalman gain (K_{k+1}) are given by:

\hat{w}_{k+1} = \hat{w}_k + K_{k+1}\,(y_{k+1} - g(\hat{w}_k, x_{k+1}))   (3.1)

P_{k+1} = P_k + Q_k - K_{k+1} G_{k+1} (P_k + Q_k)   (3.2)

K_{k+1} = (P_k + Q_k)\, G_{k+1}^T\, [R_{k+1} + G_{k+1}(P_k + Q_k)G_{k+1}^T]^{-1}.   (3.3)

By grouping the MLP weights into a single vector w, we can use the EKF equations 3.1 through 3.3 to compute new estimates of the weights recursively. The entries of the Jacobian matrix are calculated by backpropagating the o output values {y_1(t), y_2(t), ..., y_o(t)} through the network. An example of how to do this for a simple MLP is presented in appendix B of de Freitas et al. (1997).
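A minimal sketch of EKF training with equations 3.1 through 3.3, for a tiny single-output MLP, might look as follows. All names and sizes are illustrative, and the Jacobian is computed numerically here for brevity, whereas the text obtains it by backpropagation:

```python
import numpy as np

def mlp(w, x, H=3):
    """Tiny one-hidden-layer MLP: tanh hidden units, linear scalar output."""
    W1 = w[:H].reshape(H, 1); b1 = w[H:2 * H]; W2 = w[2 * H:3 * H]; b2 = w[3 * H]
    h = np.tanh(W1 @ x + b1)
    return float(W2 @ h + b2)

def jacobian(w, x, eps=1e-6):
    """Numerical row Jacobian of the network output with respect to w."""
    g0 = mlp(w, x)
    return np.array([(mlp(w + eps * np.eye(len(w))[i], x) - g0) / eps
                     for i in range(len(w))])

def ekf_step(w, P, x, y, Q, R):
    P_prior = P + Q
    G = jacobian(w, x)[None, :]                       # 1 x m
    S = R + G @ P_prior @ G.T                         # innovation variance
    K = P_prior @ G.T / S                             # equation 3.3
    w = w + (K * (y - mlp(w, x))).ravel()             # equation 3.1
    P = P_prior - K @ G @ P_prior                     # equation 3.2
    return w, P
```

Repeated calls to `ekf_step` over a data stream reduce the squared prediction error on that stream, which is the sense in which the EKF acts as a sequential (second-order) trainer.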
One of the earliest implementations of EKF-trained MLPs is due to Singhal and Wu (1988). The algorithm's computational complexity is of the order of om² multiplications per time step. Shah, Palmieri, and Datum (1992) and Puskorius and Feldkamp (1991) have proposed various approximations to the weights covariance so as to simplify this problem. The EKF is an improvement over conventional MLP estimation techniques, such as on-line backpropagation, in that it makes use of second-order statistics (Ruck, Rogers, Kabrisky, Maybeck, & Oxley, 1992; Schotky & Saad, 1999). These statistics are essential for placing error bars on the predictions and for combining separate networks into committees of networks when p(w_k | Y_k) has multiple modes (Bar-Shalom & Li, 1993; Blom & Bar-Shalom, 1988; Kadirkamanathan & Kadirkamanathan, 1995).
4 Noise Estimation and Regularization
A well-known limitation of the EKF is the assumption of known a priori statistics for the measurement and process noise. Setting these noise levels appropriately often makes the difference between success and failure in the use of the EKF (Candy, 1986). In many applications, it is not straightforward to choose the noise covariances (Jazwinski, 1970). In addition, in environments where the noise statistics change with time, such an approach can lead to large estimation errors and even to a divergence of errors. Several researchers in the estimation, filtering, and control fields have attempted to solve this problem (Jazwinski, 1969; Mehra, 1970, 1971; Myers & Tapley, 1976; Tenney, Hebbert, & Sandall, 1977). Mehra (1972) and Li and Bar-Shalom (1994) give brief surveys of this topic. It is important to note that algorithms for estimating the noise covariances within the EKF framework can lead to a degradation of the performance of the EKF. By increasing the process noise covariance Q_k, the Kalman gain also increases, thereby producing bigger changes in the weight updates (refer to equations 3.1 and 3.3). That is, more importance is placed on the most recent measurements. Consequently, it may be asserted that filters with adaptive process noise covariances exhibit adaptive memory. Additionally, as the Kalman gain increases, the bandwidth of the filter also increases (Bar-Shalom & Li, 1993); therefore, the filter becomes less immune to noise and outliers. The amount of oscillation in the model prediction clearly depends on the value of the process noise covariance. As a result, this covariance can be used as a regularization mechanism to control the smoothness of the prediction. In the following subsections, we derive three algorithms to estimate the noise covariances.

The first two derivations serve to illustrate the fact that algorithms for adapting distributed learning rates, smoothing regularizers, or stochastic noise in gradient descent methods are equivalent (see Figure 1).
Figure 1: One-dimensional example of several adaptive gradient descent methods. To escape from local minima, we can increase the momentum of the ball as it approaches a particular local minimum, allow the ball to bounce randomly within a given vertical interval (stochastic descent), or stretch the surface with smoothing regularizers.
In the derivation of the third algorithm, we address the trade-off between regularization and tracking performance in sequential learning.

4.1 Algorithm 1: Adaptive Distributed Learning Rates. Sutton (1992b) proposed an optimization approach for linear networks, using the Kalman filter equations with P_k updated by a variation of the least-mean-square rule (Jacobs, 1988; Sutton, 1992a). The main purpose of the method was to reduce the computational time at the expense of a small deterioration in the performance of the estimator. Another important aspect of the algorithm is that it circumvents the problem of choosing the process noise covariance Q. The technique involves approximating P by a diagonal matrix, whose mth diagonal entry is given by

p_{mm} = \exp(b_m),

where b_m is updated by the least-mean-square rule, modified so that the learning rates for each parameter are updated sequentially. The diagonal matrix approximation to P implies that the model parameters are uncorrelated. This assumption may, of course, degrade the performance of the EKF estimator.
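The flavor of this diagonal parameterization can be sketched for a linear model. This is a loose illustration with made-up names and a simplified meta-update, not the paper's exact MLP rule:

```python
import numpy as np

# Each weight carries its own log-learning-rate b_m; the diagonal of P is
# p_mm = exp(b_m), and the b_m are themselves adapted from the prediction
# error instead of hand-picking a process noise covariance Q.

def diag_kalman_step(w, b, x, y, R=1.0, eta=0.05):
    p = np.exp(b)                         # diagonal entries of P
    err = y - w @ x
    k = p * x / (R + x @ (p * x))         # Kalman gain with diagonal P
    b = b + eta * err * k * x             # meta-update of per-weight rates
    w = w + k * err
    return w, b
```

On a stationary linear target, the weight update behaves like a normalized LMS rule, while the exponential parameterization keeps every effective learning rate positive without constraints.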
To circumvent the problem of choosing the process noise covariance Q when training nonlinear neural networks, while at the same time increasing computational efficiency, we have extended Sutton's algorithm to the nonlinear case. In particular, we make use of MLPs with sigmoidal basis functions in the hidden layer and linear basis functions in the output layer. The network weights and Kalman gain are updated using the EKF, while the weights covariance P is updated by backpropagating the squared output errors, with b updated as follows:

b_{k+1} = b_k + \eta\, \delta_{ik}\, o_{jk}   (output layer)
b_{k+1} = b_k + \eta\, w_{ijk}\, \delta_{ik}\, o_{jk} (1 - o_{jk})\, x_{dk}   (hidden layer)

where the index i corresponds to the ith neuron in the output layer, j to the jth neuron in the hidden layer, d to the dth input variable, and k to the estimation step. δ_{ik} represents the output error for neuron i at step k. The symbols o_{jk} and η denote the output of the jth hidden neuron and the learning rate, respectively. This learning rate is a parameter that quantifies another parameter, P_k; we shall refer to it as a hyperparameter. We have found in practice that choosing this hyperparameter is easier than choosing the process noise covariance matrix. The EKF equation used to update the weights is similar to the update equations typically used to compute the weights of neural networks by error backpropagation. The only difference is that it assumes a different adaptive learning-rate parameter for each weight. The mathematical relation between the adaptive learning rates L in on-line backpropagation and the Kalman filtering parameters is given by de Freitas et al. (1997):

L = (P_k + Q_k)\,(R_{k+1} + G_{k+1}(P_k + Q_k)\, G_{k+1}^T)^{-1}.

Thus, adapting the process noise is equivalent to adapting the learning rates.

4.2 Algorithm 2: Evidence Maximization with Weight Decay Priors.
We derive a sequential method for updating R and Q, based on the evidence approximation framework with weight decay priors for batch learning (see Bishop, 1995, chap. 10). In so doing, we show that algorithms for adapting the noise covariances are equivalent to algorithms that make use of smoothing regularizers. In the evidence approximation framework for batch learning, the prior and likelihood are expressed as follows:

p(w) = \frac{1}{(2\pi)^{m/2} \alpha^{-m/2}} \exp\left(-\frac{\alpha}{2} \|w\|^2\right)   (4.1)

p(Y_k \mid w) = \frac{1}{(2\pi)^{L/2} \beta^{-L/2}} \exp\left(-\frac{\beta}{2} \sum_{k=1}^{L} (y_k - \hat{g}_{L,m}(w, x_k))^2\right),   (4.2)
where ĝ_{L,m}(w, x_k) corresponds to the model prediction. The hyperparameters α and β control the variance of the prior distribution of the weights and the variance of the measurement noise; α also plays the role of the regularization coefficient. Using Bayes' rule (see equation 2.3), and taking into account that the evidence does not depend on the weights, the following posterior density function may be obtained:

p(w \mid Y_k) = \frac{1}{Z_S(\alpha, \beta)} \exp(-S(w)),

where Z_S(α, β) is a normalizing factor. For the prior and the likelihood of equations 4.1 and 4.2, S(w) is given by

S(w) = \frac{\alpha}{2} \|w\|^2 + \frac{\beta}{2} \sum_{k=1}^{L} (y_k - \hat{g}_{L,m}(w, x_k))^2.   (4.3)

The posterior density may be approximated by applying a Taylor series expansion of S(w) around a local minimum w_MP and retaining the series terms up to second order:

S(w) = S(w_{MP}) + \frac{1}{2} (w - w_{MP})^T A (w - w_{MP}).

Hence, the gaussian approximation to the posterior density function becomes

p(w \mid Y_k) = \frac{1}{(2\pi)^{m/2} |A|^{-1/2}} \exp\left(-\frac{1}{2} (w - w_{MP})^T A (w - w_{MP})\right).   (4.4)

Maximizing the posterior probability density function involves minimizing the error function given by equation 4.3. Equation 4.3 is a particular case of a regularized error function. More generally, this error function is given by

S(w) = \sum_{k=1}^{L} (y_k - \hat{g}_{L,m}(w, x_k))^2 + \nu V,

where ν is a positive parameter that serves to balance the trade-off between smoothness and data approximation. A large value of ν places more importance on the smoothness of the model; a small value of ν places more emphasis on fitting the data. The functional V penalizes excessive model complexity (Girosi, Jones, & Poggio, 1995). In the evidence framework, the parameters w are obtained by minimizing equation 4.3, while the hyperparameters α and β are obtained by maximizing the evidence p(Y_k | α, β) after approximating the posterior density function by a gaussian function centered at w_MP. In doing so, the following recursive formulas for α and β are obtained:

\alpha_{k+1} = \frac{\gamma}{\sum_{i=1}^{m} w_i^2}   and   \beta_{k+1} = \frac{L - \gamma}{\sum_{k=1}^{L} (y_k - \hat{g}_{L,m}(w_k, x_k))^2}.   (4.5)
The quantity \gamma = \sum_{i=1}^{m} \lambda_i / (\lambda_i + \alpha) represents the effective number of parameters, where the λ_i are the eigenvalues of the Hessian of the error function E_D. The effective number of parameters, as the name implies, is the number of parameters that effectively contribute to the neural network mapping. The remaining weights have no contribution because their magnitudes are forced to zero by the weight decay prior. It is also possible to maximize the posterior density function by performing the integrations over the hyperparameters analytically (Buntine & Weigend, 1991; MacKay, 1996; Williams, 1995; Wolpert, 1993). The latter approach is known as the MAP framework for α and β. The hyperparameters computed by the MAP framework differ from the ones computed by the evidence framework in that the former makes use of the total number of parameters and not only the effective number of parameters. That is, α and β are updated according to

\alpha_{k+1} = \frac{m}{\sum_{i=1}^{m} w_i^2}   and   \beta_{k+1} = \frac{L}{\sum_{k=1}^{L} (y_k - \hat{g}_{L,m}(w_k, x_k))^2}.   (4.6)

By comparing the expressions for the prior, likelihood, and evidence in the EKF framework (see section 3) with equations 4.1, 4.2, and 4.4, we can establish the following relations:

P = A^{-1},   Q = \alpha^{-1} I_m - A^{-1},   and   R = \beta^{-1} I_o,   (4.7)

where I_m and I_o represent identity matrices of sizes m and o, respectively. Therefore, it is possible to update Q and R sequentially by expressing them in terms of the sequential updates of α and β. That is, adapting the noise processes is equivalent to adapting the regularization coefficients. A moving window may be implemented to estimate β; the size of the window is a parameter that requires tuning.

4.3 Algorithm 3: Evidence Maximization with Sequentially Updated Priors. In the EKF, we have knowledge of the equation describing the evidence
function in terms of the noise covariances. Consequently, we can compute R_k and Q_k automatically by maximizing the evidence density

p(y_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k) \sim N(g_{k+1}(\hat{w}_k, x_{k+1}),\; G_{k+1}(P_k + Q_k)G_{k+1}^T + R_{k+1}).

Strictly speaking, this is not a full Bayesian solution: we are computing solely the likelihood of the noise covariances; that is, we are assuming no knowledge of the prior at the noise estimation level. For simplicity, we have restricted our analysis in this section to a single output. Let us now define the model residuals:

r_{k+1} = y_{k+1} - E[y_{k+1} \mid Y_k, M_j, R_{k+1}, Q_k] = y_{k+1} - g_{k+1}(\hat{w}_k, x_{k+1}).   (4.8)
It follows that the probability of the residuals is equivalent to the evidence function at the parameter estimation level; that is, p(r_{k+1}) = p(y_{k+1} | Y_k, M_j, R_{k+1}, Q_k). Let us assume initially that the process noise covariance may be described by a single parameter q, more specifically Q = qI_m. The maxima of the evidence function with respect to q may be calculated by differentiating the evidence function as follows:

\frac{d}{dq} p(r_{k+1}) = \frac{1}{(2\pi)^{1/2}} \exp\left(-\frac{1}{2} \frac{r_{k+1}^2}{G_{k+1}(P_k + Q_k)G_{k+1}^T + R_{k+1}}\right) \left[-\frac{1}{2} G_{k+1} G_{k+1}^T \left(G_{k+1}(P_k + Q_k)G_{k+1}^T + R_{k+1}\right)^{-3/2} + \frac{1}{2} r_{k+1}^2\, G_{k+1} G_{k+1}^T \left(G_{k+1}(P_k + Q_k)G_{k+1}^T + R_{k+1}\right)^{-5/2}\right].

Equating the derivative to zero yields

r_{k+1}^2 = E[r_{k+1}^2].   (4.9)
It is straightforward to prove, by computing the second derivative, that this stationary point corresponds to a global maximum on [0, ∞). This result reveals that maximizing the evidence function corresponds to equating the covariance over time, r²_{k+1}, to the ensemble covariance E[r²_{k+1}]. That is, maximizing the evidence leads to a covariance-matching method. Jazwinski (1969; Jazwinski & Bailie, 1967) devised an algorithm for updating q according to equation 4.9. Since E[r_{k+1}^2] = G_{k+1} P_k G_{k+1}^T + q\, G_{k+1} G_{k+1}^T + R_{k+1}, it follows that q may be recursively computed according to
q = \begin{cases} \dfrac{r_{k+1}^2 - E[r_{k+1}^2 \mid q = 0]}{G_{k+1} G_{k+1}^T} & \text{if } q \ge 0 \\[4pt] 0 & \text{otherwise.} \end{cases}   (4.10)

This estimator increases q each time the variance over time of the model residuals exceeds the ensemble variance. When q increases, the Kalman gain also increases, and consequently the model parameter updates also increase (see equations 3.1 and 3.3). That is, the estimator places more
Regularization in Sequential Learning
943
emphasis on the incoming data. As long as the variance over time of the residuals remains smaller than the ensemble variance, the process noise input is zero and the filter carries on minimizing the variance of the parameters (i.e., tending to a regularized solution). Section 5.2 discusses an experiment in which this behavior is illustrated.

The estimator of equation 4.10 is based on a single residual and is therefore of little statistical significance. This difficulty is overcome by employing a sliding window, computing the sample mean over N predicted residuals instead of a single residual. Jazwinski (1969) shows that for the sample mean
$$ m_r = \frac{1}{N} \sum_{l=1}^{N} \frac{r_{k+l}}{R_{k+l}^{1/2}}, $$
we may proceed as above, by maximizing p(m_r), to obtain the following estimator:
$$ q = \begin{cases} \dfrac{m_r^2 - E[m_r^2 \mid q = 0]}{S} & \text{if } q \ge 0 \\[4pt] 0 & \text{otherwise,} \end{cases} \tag{4.11} $$
where
$$ E[m_r^2 \mid q = 0] = S_N P_k S_N^T + 1/N, \qquad S = S_N S_N^T + S_{N-1} S_{N-1}^T + \cdots + S_1 S_1^T, $$
and
$$ S_N = \frac{1}{N} \sum_{l=1}^{N} \frac{G_{k+l}}{R_{k+l}^{1/2}}, \qquad S_{N-1} = \frac{1}{N} \sum_{l=2}^{N} \frac{G_{k+l}}{R_{k+l}^{1/2}}, \qquad \ldots, \qquad S_1 = \frac{1}{N}\, \frac{G_{k+N}}{R_{k+N}^{1/2}}. \tag{4.12} $$
With this estimator, one has to choose the length N of the moving window used to update q. If the window size is too small, the algorithm places more emphasis on fitting the incoming data than on fitting the previous data; as a result, it might never converge. On the other hand, if the window size is too large, the algorithm will fail to adapt quickly to new environments. We refer to the problem of choosing the right window length as the regularization/tracking dilemma. It is a dilemma because we cannot ascertain, without a priori knowledge, whether the fluctuations in the data correspond to time-varying components of the data or to noise.
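A sketch of the windowed estimator of equations 4.11 and 4.12, under the simplifying assumptions of a scalar output and a constant measurement noise variance R (all names are illustrative):

```python
import numpy as np

def update_q_window(residuals, jacobians, P, R):
    """Sliding-window covariance-matching update (cf. eqs. 4.11-4.12)."""
    N = len(residuals)
    # Sample mean m_r of the standardized residuals (R assumed constant).
    m_r = np.mean(residuals) / np.sqrt(R)
    Gs = [np.asarray(G) / np.sqrt(R) for G in jacobians]
    # Partial sums S_N, S_{N-1}, ..., S_1 of eq. 4.12.
    S_list = [sum(Gs[j:]) / N for j in range(N)]
    S = sum(s @ s for s in S_list)
    # Ensemble value of m_r^2 when q = 0.
    expected_m2 = S_list[0] @ P @ S_list[0] + 1.0 / N
    q = (m_r ** 2 - expected_m2) / S
    return max(q, 0.0)
```

With N = 1 this reduces (up to the standardization by R) to the single-residual update of equation 4.10, which makes the regularization/tracking role of N explicit.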
It is possible to extend the derivation to a more general noise model by adopting the following diagonal covariance:
$$ Q = \begin{bmatrix} q_1 & 0 & \cdots & 0 \\ 0 & q_2 & & 0 \\ \vdots & & \ddots & \\ 0 & 0 & & q_m \end{bmatrix}. \tag{4.13} $$
By calculating the derivative of the evidence function with respect to a generic diagonal entry of Q and then equating it to zero, we obtain an estimator involving the following system of equations:
$$ \begin{bmatrix} \left(\dfrac{\partial y_{k+1}}{\partial w_1}\right)^2 & \left(\dfrac{\partial y_{k+1}}{\partial w_2}\right)^2 & \cdots & \left(\dfrac{\partial y_{k+1}}{\partial w_m}\right)^2 \\[6pt] \left(\dfrac{\partial y_{k}}{\partial w_1}\right)^2 & \left(\dfrac{\partial y_{k}}{\partial w_2}\right)^2 & & \left(\dfrac{\partial y_{k}}{\partial w_m}\right)^2 \\ \vdots & & \ddots & \vdots \\ \left(\dfrac{\partial y_{k-N}}{\partial w_1}\right)^2 & \left(\dfrac{\partial y_{k-N}}{\partial w_2}\right)^2 & \cdots & \left(\dfrac{\partial y_{k-N}}{\partial w_m}\right)^2 \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_m \end{bmatrix} = \begin{bmatrix} e_{k+1} \\ e_k \\ \vdots \\ e_{k-N} \end{bmatrix}. \tag{4.14} $$
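In practice, when the window is longer than the number of weights, the system of equation 4.14 is overdetermined and a least-squares solve is natural; its fragility is exactly the ill-conditioning noted in the text. A small numerical sketch with made-up numbers:

```python
import numpy as np

# Rows: squared output sensitivities at successive time steps (illustrative).
J2 = np.array([[0.9, 0.1],
               [0.2, 0.8],
               [0.5, 0.5]])
# Error terms e_{k+1}, e_k, e_{k-1}, chosen consistent with q = (1.0, 0.5).
e = np.array([0.95, 0.60, 0.75])

# Least-squares solution of J2 q = e; clip to respect q_i >= 0.
q, *_ = np.linalg.lstsq(J2, e, rcond=None)
q = np.clip(q, 0.0, None)
print(q)  # approximately [1.0, 0.5]
```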
Multiple hyperparameters are very handy when one considers distributed priors for automatic relevance determination (input and basis function selection) (de Freitas et al., 1997; Mackay, 1994, 1995). Estimating Q in equation 4.14 is, however, not very reliable (de Freitas et al., 1997; Mackay, 1994, 1995) because it involves estimating a large number of noise parameters and, in addition, requires a long moving window of size N to avoid ill-conditioning. We can also maximize the evidence function with respect to R_k = rI_o and obtain the following estimator for r:
$$ r = \begin{cases} r_{k+1}^2 - G_{k+1}(P_k + Q_k)G_{k+1}^T & \text{if } r \ge 0 \\[2pt] 0 & \text{otherwise.} \end{cases} \tag{4.15} $$
The hyperparameter r is not as useful as q in controlling filter divergence. This is because r can slow the rate of decrease of the covariance matrix P_k, but cannot cause it to increase (see equations 3.1–3.3).

5 Experiments

5.1 Experiment 1: Comparison Between the Noise Estimation Methods. To compare the performance of the various EKF training algorithms, 100 input-output data vectors were generated from the following nonlinear,
Figure 2: Data generated to train MLP.
nonstationary process:
$$ x_k = 0.5\, x_{k-1} + \frac{25\, x_{k-1}}{1 + x_{k-1}^2} + 8 \cos(1.2 (k - 1)) + d_k $$
$$ y_k = \frac{x_k^2}{20} + v_k, $$
where x_k denotes the input vectors and y_k the output vectors, of 300 time samples each. The gaussian process noise standard deviation was set to 0.1; the measurement noise standard deviation was set to 3 sin(0.05k). The initial state x_0 was 0.1. Figure 2 shows the data generated with this model. The changes of the measurement noise variance are similar to the ones typically observed in financial returns data (Shephard, 1996). We then proceeded to train an MLP with 10 sigmoidal neurons in the hidden layer and 1 linear output neuron with the following methods: the standard EKF algorithm, the EKF algorithm with P_k updated by error backpropagation (EKFBP), with evidence maximization and weight decay priors (EKFEV), with MAP noise adaptation (EKFMAP), and with evidence maximization and sequentially updated priors (EKFQ). The initial variance of the weights, initial weights covariance matrix entries, initial R, and
Figure 3: Simulation results for the EKF and EKFQ algorithms. The top left plot shows the actual data [· · ·] and the EKF [- -] and EKFQ [—] one-step-ahead predictions. The top right plot shows the estimated process noise parameter. The bottom plots show the innovations covariance for both methods.
initial Q were set to 1, 10, 3, and 1 × 10⁻⁵, respectively. The length of the sliding window of the adaptive noise algorithms was set to 10 time samples. The simulation results for the EKF and EKFQ algorithms are shown in Figure 3. Note that the EKFQ algorithm slows the convergence of the EKF parameter estimates so as to be able to track the changing measurement variance. In Table 1, we compare the one-step-ahead normalized square errors (NSE) obtained with each method. The NSE is defined as follows:
$$ \mathrm{NSE} = \sqrt{ \sum_{k=1}^{300} \left( y_k - \hat{g}(w_k, x_k) \right)^2 }. $$
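Despite the name, the NSE above is simply the root of the summed squared one-step-ahead errors; a minimal sketch:

```python
import numpy as np

def nse(y, y_pred):
    """Root of the summed squared one-step-ahead prediction errors."""
    y, y_pred = np.asarray(y, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.sum((y - y_pred) ** 2)))
```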
According to Table 1, the only algorithm that provides a clear prediction improvement over the standard EKF is the evidence maximization algorithm with sequentially updated priors. In terms of computational time, the EKF algorithm with P_k updated by backpropagation is
Table 1: Simulation Results for 100 Runs in Experiment 1.

Method    NSE     Mega Floating-Point Operations
EKF       25.95   21.9
EKFQ      23.01   24.1
EKFMAP    61.06   22.6
EKFEV     73.94   22.6
EKFBP     58.87    2.2
faster, but its prediction is worse than that of the standard EKF. This is not a surprising result, considering the assumption of uncorrelated weights. The EKFEV and EKFMAP performed poorly because they require the network weights to converge to a good solution before the noise covariances can be updated. That is, the noise estimation algorithm does not facilitate the estimation of the weights, as it does in the case of the EKFQ algorithm. The EKFEV and EKFMAP therefore appear to be unsuitable for sequential learning tasks.

5.2 Experiment 2: Sequential Evidence Maximization with Sequentially Updated Priors. This experiment aims to describe the behavior of
the evidence-maximization algorithm (EKFQ) of equation 4.11 in a time-varying, noisy, and chaotic scenario. The problem tackled is a more difficult variation of the chaotic quadratic (logistic) map. One hundred input (y_k) and output (y_{k+1}) data vectors were generated according to the following equation:
$$ y_{k+1} = \begin{cases} 3.5\, y_k (1 - y_k) + v_k & 1 \le k \le 150 \\ 3.7\, y_k (1 - y_k) + v_k & 150 < k \le 225 \\ 3.1\, y_k (1 - y_k) + v_k & 225 < k \le 300, \end{cases} $$
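This piecewise data-generation process can be sketched as follows. The noise level matches the description of the experiment; the initial condition y_0 = 0.1 is an assumption, since the text does not state it:

```python
import numpy as np

def generate_logistic_data(T=300, sigma=0.01, seed=0):
    """Piecewise logistic map of experiment 2: the control parameter
    switches from 3.5 to 3.7 (chaotic) to 3.1 at samples 150 and 225.
    The initial condition 0.1 is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    y = np.empty(T + 1)
    y[0] = 0.1
    for k in range(T):  # k = 0..T-1 corresponds to samples 1..T in the text
        a = 3.5 if k < 150 else (3.7 if k < 225 else 3.1)
        y[k + 1] = a * y[k] * (1.0 - y[k]) + sigma * rng.standard_normal()
    return y[:-1], y[1:]  # (y_k, y_{k+1}) input-output pairs
```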
where v_k denotes gaussian noise with a standard deviation of 0.01. In the interval 150 < k ≤ 225, the series exhibits chaotic behavior. A single hidden-layer MLP with 10 sigmoidal neurons in the hidden layer and a single linear output neuron was trained to approximate the mapping between y_k and y_{k+1}. The initial weights, weights covariance matrix diagonal entries, R, and Q were set to 1, 100, 1 × 10⁻⁴, and 0, respectively. The sliding window to estimate Q was set to three time samples. As shown in Figure 4, during the initialization and after each change of behavior (samples 150 and 225), the estimator for the process noise covariance Q becomes active. That is, each time the environment undergoes a severe change, more importance is given to the new data. As the environment stabilizes, the minimum-variance criterion of the Kalman filter leads to a decrease in the variance of the output. It is therefore possible to design an estimator that balances the trade-off between regularization
Figure 4: Performance of the evidence maximization for a nonstationary chaotic quadratic map. (Top) True data [· · ·] and the prediction [—]. (Middle) Output confidence intervals. (Bottom) Value of the adaptive noise parameter.
and tracking. The results obtained with the EKF and EKFQ algorithms are summarized in Table 2.

5.3 Example with Real-World Data. The mathematical modeling of financial derivatives has become increasingly popular for two reasons. First, financial institutions have much interest in developing more sophisticated pricing models for options contracts (Hull, 1997). Second, options data offer an excellent source of difficult and challenging problems to the statistical and neural computing communities (Hutchinson, Lo, & Poggio, 1994; Ingber &
Table 2: Simulation Results for 100 Runs of the Quadratic Chaotic Map.

Method   NSE     Mega Floating-Point Operations
EKF      31.76   21.8
EKFQ      1.37   23.0

Note: The EKFQ algorithm provides a large improvement over the EKF at a very small computational cost.
Wilson, 1999; Niranjan, 1996). So far, the research results seem to provide clear evidence that there is a nonlinear and nonstationary relation between the options' price and the cash product's price, maturity time, strike price, interest rates, and variance of the returns on the cash product (volatility). The standard model used to describe this relation is the Black-Scholes (1973) model. Hutchinson et al. (1994) and Niranjan (1996) have focused on the options pricing problem from a neural computing perspective. The former showed that good approximations to the widely used Black-Scholes formula may be obtained with neural networks, while the latter looked at the nonstationary aspects of the problem. Our work follows from Niranjan (1996), with the aim of showing that more accurate tracking of the options prices can be achieved by adapting the noise covariances. We train MLPs to generate one-step-ahead predictions of the options prices. These predictions for a group of options on the same cash product, but with different strike prices and/or times to maturity, can be used to determine whether one of the options is being mispriced. We treat the cash product's value normalized by the strike price and the time to maturity as the network inputs. The network's output consists of the call and put option prices normalized by the strike price. We used five pairs of call and put option contracts on the FTSE100 index (February 1994–December 1994) to evaluate our pricing algorithms. The parameters were estimated by the following methods:

Trivial: Involves using the current value of the option as the next prediction.

RBF-EKF: Represents a regularized radial basis function network with four hidden neurons, originally proposed in Hutchinson et al. (1994). The output weights are estimated with a Kalman filter, while the means of the radial functions correspond to random subsets of the data and their covariance is set to the identity matrix, as in Niranjan (1996).

BS: Corresponds to a conventional Black-Scholes model with two outputs (normalized call and put prices) and two parameters (risk-free interest rate and volatility). The risk-free interest rate was set to 0.06, while the volatility was estimated over a moving window (of 50 time steps) as described in Hull (1997).

MLP-EKF: Stands for an MLP, with six sigmoidal hidden units and a linear output neuron, trained with the EKF algorithm. The noise covariances R and Q and the states covariance P were set to diagonal matrices with entries equal to 10⁻⁶, 10⁻⁷, and 10, respectively. The weights prior corresponded to a zero-mean gaussian density with covariance equal to 1.

MLP-EKFQ: Represents an MLP, with six sigmoidal hidden units and a linear output neuron, trained with the EKF with evidence maximization and sequentially updated priors. The states covariance P was given by a
Figure 5: Tracking put and call option prices with an MLP trained with the EKFQ algorithm.
diagonal matrix with entries equal to 10. The weights prior corresponded to a zero-mean gaussian density with covariance equal to 1. Figure 5 shows the one-step-ahead predictions obtained with the EKFQ algorithm. In Table 3, we compare the one-step-ahead normalized square errors obtained with each method. The square errors were measured over only the last 100 days of trading, so as to allow the algorithms to converge. It is clear from the results that adapting the process noise covariance sequentially leads to improved predictions on the options data.
Table 3: One-Step-Ahead Prediction Errors on Call Options.

            Strike Price
Method      2925     3025     3125     3225     3325
Trivial     0.0783   0.0611   0.0524   0.0339   0.0205
RBF-EKF     0.0538   0.0445   0.0546   0.0360   0.0206
BS          0.0761   0.0598   0.0534   0.0377   0.0262
MLP-EKF     0.0414   0.0384   0.0427   0.0285   0.0145
MLP-EKFQ    0.0404   0.0366   0.0394   0.0283   0.0150
6 Conclusions
We have presented several algorithms to perform regularization in sequential learning tasks. The algorithms are based on adapting the noise processes sequentially. The experiments¹ indicated that one of these algorithms, EKFQ, may lead to improved prediction results when either the data source is time varying or there is little a priori knowledge about how to tune the noise processes. We showed that the hierarchical Bayesian inference methodology provides an elegant, unifying treatment of the sequential learning problem. We showed that distributed learning rates, adaptive noise parameters, and adaptive smoothing regularizers are mathematically equivalent. This result sheds light on many areas of the machine learning field: it places many diverse approaches to estimation and regularization within a unified framework. There are many avenues for further research. These include deriving convergence bounds, implementing static and dynamic mixtures of models, implementing other model structures such as recurrent networks, and testing the algorithms on additional problems.

7 Acknowledgments
We thank Analytical Mechanics Associates (Virginia, U.S.A.) for helpful assistance in providing some of the references. We are also very grateful to the referees for their valuable comments. J.F.G.F. is financially supported by two University of the Witwatersrand Merit Scholarships, a Foundation for Research Development Scholarship (South Africa), an ORS award (UK), and a Trinity College External Research Studentship (Cambridge).

References

Bar-Shalom, Y., & Li, X. R. (1993). Estimation and tracking: Principles, techniques and software. Boston: Artech House.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81, 637–659.
Blom, H. A. P., & Bar-Shalom, Y. (1988). The interacting multiple model algorithm for systems with Markovian switching coefficients. IEEE Transactions on Automatic Control, 33(8), 780–783.
Buntine, W. L., & Weigend, A. S. (1991). Bayesian back-propagation. Complex Systems, 5, 603–643.

¹ Matlab demos for each experiment are available online at: http://www.cs.berkeley.edu/~jfgf/.
Candy, J. V. (1986). Signal processing: The model based approach. New York: McGraw-Hill.
de Freitas, J. F. G., Niranjan, M., & Gee, A. H. (1997). Hierarchical Bayesian-Kalman models for regularization and ARD in sequential learning (Tech. Rep. No. CUED/F-INFENG/TR 307). Cambridge: Cambridge University. Available online at: http://svr-www.eng.cam.ac.uk/~jfgf.
Gelb, A. (Ed.). (1974). Applied optimal estimation. Cambridge, MA: MIT Press.
Girosi, F., Jones, M., & Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7, 219–269.
Hull, J. C. (1997). Options, futures, and other derivatives (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance, 49(3), 851–889.
Ingber, L., & Wilson, J. K. (1999). Volatility of volatility of financial markets. Journal of Mathematical and Computer Modeling, 99, 39–57.
Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1, 295–307.
Jazwinski, A. H. (1969). Adaptive filtering. Automatica, 5, 475–485.
Jazwinski, A. H. (1970). Stochastic processes and filtering theory. Orlando, FL: Academic Press.
Jazwinski, A. H., & Bailie, A. E. (1967). Adaptive filtering (Tech. Rep. No. 67-6). Hampton, VA: Analytical Mechanics Associates.
Kadirkamanathan, V., & Kadirkamanathan, M. (1995). Recursive estimation of dynamic modular RBF networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 239–245). Cambridge, MA: MIT Press.
Li, X. R., & Bar-Shalom, Y. (1994). A recursive multiple model approach to noise identification. IEEE Transactions on Aerospace and Electronic Systems, 30(3), 671–684.
Mackay, D. J. C. (1994). Bayesian nonlinear modeling for the prediction competition. ASHRAE Transactions, Pt. 2 (Vol. 100). Atlanta, GA.
Mackay, D. J. C. (1995). Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469–505.
Mackay, D. J. C. (1996). Hyperparameters: Optimize or integrate out? In G. Heidbreder (Ed.), Fundamental theories of physics (pp. 43–59). Norwell, MA: Kluwer.
Mehra, R. K. (1970). On the identification of variances and adaptive Kalman filtering. IEEE Transactions on Automatic Control, AC-15(2), 175–184.
Mehra, R. K. (1971). On-line identification of linear dynamic systems with applications to Kalman filtering. IEEE Transactions on Automatic Control, AC-16(1), 12–21.
Mehra, R. K. (1972). Approaches to adaptive filtering. IEEE Transactions on Automatic Control, AC-17, 693–698.
Myers, K. A., & Tapley, B. D. (1976). Adaptive sequential estimation of unknown noise statistics. IEEE Transactions on Automatic Control, AC-21, 520–523.
Niranjan, M. (1996). Sequential tracking in pricing financial options using model based and neural network approaches. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems (pp. 960–966). Cambridge, MA: MIT Press.
Puskorius, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman filter training of feedforward layered networks. In International Joint Conference on Neural Networks, Seattle (pp. 307–312).
Ruck, D. W., Rogers, S. K., Kabrisky, M., Maybeck, P. S., & Oxley, M. E. (1992). Comparative analysis of backpropagation and the extended Kalman filter for training multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6), 686–690.
Schottky, B., & Saad, D. (1999). Statistical mechanics of EKF learning in neural networks. Journal of Physics A, 32, 1605–1621.
Shah, S., Palmieri, F., & Datum, M. (1992). Optimal filtering algorithms for fast learning in feedforward neural networks. Neural Networks, 5, 779–787.
Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility. In D. P. Cox, O. E. Barndorff-Nielson, & D. V. Hinkley (Eds.), Time series models in econometrics, finance and other fields (pp. 1–67). London: Chapman and Hall.
Singhal, S., & Wu, L. (1988). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann.
Sutton, R. S. (1992). Adapting bias by gradient descent: An incremental version of delta-bar-delta. Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 171–176). Cambridge, MA: MIT Press.
Sutton, R. S. (1992). Gain adaptation beats least squares? Proceedings of the Seventh Yale Workshop on Adaptive Learning Systems (pp. 161–166).
Tenney, R. R., Hebbert, R. S., & Sandell, N. S. (1977). A tracking filter for maneuvering sources. IEEE Transactions on Automatic Control, 22, 246–251.
Williams, P. M. (1995). Bayesian regularization and pruning using a Laplace prior. Neural Computation, 7(1), 117–143.
Wolpert, D. H. (1993). On the use of evidence in neural networks. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 539–546). San Mateo, CA: Morgan Kaufmann.

Received March 31, 1998; accepted March 12, 1999.
LETTER
Communicated by Steven Nowlan and Gerard Dreyfus
Sequential Monte Carlo Methods to Train Neural Network Models J. F. G. de Freitas M. Niranjan A. H. Gee A. Doucet Cambridge University Engineering Department, Cambridge CB2 1PZ, England, U.K. We discuss a novel strategy for training neural networks using sequential Monte Carlo algorithms and propose a new hybrid gradient descent/sampling importance resampling algorithm (HySIR). In terms of computational time and accuracy, the hybrid SIR is a clear improvement over conventional sequential Monte Carlo techniques. The new algorithm may be viewed as a global optimization strategy that allows us to learn the probability distributions of the network weights and outputs in a sequential framework. It is well suited to applications involving on-line, nonlinear, and nongaussian signal processing. We show how the new algorithm outperforms extended Kalman filter training on several problems. In particular, we address the problem of pricing option contracts traded in financial markets. In this context, we are able to estimate the one-step-ahead probability density functions of the options prices.

1 Introduction
Probabilistic modeling coupled with nonlinear function approximators is a powerful tool for solving many real-world problems. In the engineering community, there are many examples of problems involving large data sets that have been tackled with nonlinear function approximators such as the multilayer perceptron (MLP) or radial basis functions. Much of the emphasis here is on performance, in the form of accuracy of prediction on unseen data. The tricks required to estimate the parameters of very large models and the handling of very large data sets become interesting challenges too. Tools such as cross-validation to deal with overfitting and the bootstrap to deal with model uncertainty have proved to be very effective. Neural networks applied to speech recognition (Robinson, 1994) and handwritten digit recognition (Le Cun et al., 1989) are examples. Such work has demonstrated that many interesting and hard inference tasks, involving thousands of free parameters, can indeed be solved at a desirable level of performance. On the other hand, in the statistics community, we see
Neural Computation 12, 955–993 (2000)
© 2000 Massachusetts Institute of Technology
that the emphasis is on rigorous mathematical analysis of algorithms, careful model formulation, and model criticism. This work has led to a rich paradigm for handling various inference problems, in which a good collection of powerful algorithms exists. Sampling techniques applied to neural networks, starting from the work of Radford Neal, are a classic example where both of the above features are exploited (Neal, 1996). He shows that nonlinear function approximators in the form of MLPs can be trained, and their performance evaluated, in a Bayesian framework. This formulation leads to probability distributions that are difficult to handle analytically. Markov chain Monte Carlo (MCMC) sampling methods are used to make inferences in the Bayesian framework. While sampling methods tend to be much richer in exploring the probability distributions, approximate methods, such as gaussian approximation, have also attracted interest (Bishop, 1995; Mackay, 1992). Many problems, such as time-series analysis, are characterized by data that arrive sequentially. In such problems one has the task of performing model estimation, model validation, and inference sequentially, on the arrival of each item of data. By the very nature of such real-time tasks, or if we suspect the source of the data to be time varying, we may wish to adopt a sequential strategy in which each item of data is used only once and then discarded. In this case, deciding what information we should be propagating on a sample-by-sample basis becomes an interesting topic. In the classic state-space model estimation by a Kalman filter setting, one defines a gaussian probability density over the parameters to be estimated. The mean and covariance matrix are propagated using recursive update equations each time new data are received. Use of the matrix inversion lemma allows an elegant update of the inverse covariance matrix without actually having to invert a matrix at each sample.
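The inverse-covariance bookkeeping mentioned here can be sketched in information form, where the measurement update is purely additive. This is a generic sketch, not the authors' code; H denotes the measurement Jacobian and R_inv the inverse measurement noise covariance:

```python
import numpy as np

def information_update(P_inv, H, R_inv):
    """Measurement update of the inverse covariance: additive, so no
    per-sample matrix inversion is required (matrix inversion lemma)."""
    return P_inv + H.T @ R_inv @ H
```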
This is a widely studied topic, optimal in the case of linear gaussian models. For nonlinear models, the common trick is a Taylor series expansion around the operating point, leading to the extended Kalman filter (EKF). The derivation of the Kalman filter starting from a Bayesian setting is of much interest. It allows for tuning of the noise levels as well as running multiple models in parallel to evaluate model likelihoods. Jazwinski (1970) and Bar-Shalom and Li (1993) are classic textbooks dealing with these aspects. Our earlier work reviews these concepts and shows them applied in a neural network context (de Freitas, Niranjan, & Gee, 1997, 1998a). The Taylor series approximation leading to the EKF makes a gross simplification of the probabilistic specification of the model. With nonlinear models, the probability distributions of interest tend to be multimodal. Gaussian approximation in such cases could miss the rich structure brought in by the nonlinear model. Sequential sampling techniques provide a class of algorithms to address this issue. A good application of these ideas is tracking in computer vision (Isard & Blake, 1996), where the Kalman filter approximation is shown to fail to capture the multimodalities. Sequential sampling algorithms, under the names of particle filters, sequential sampling-importance resampling (SIR), bootstrap filters, and condensation trackers, have also been applied to a wide range of problems, including target tracking (Gordon, Salmond, & Smith, 1993), financial analysis (Müller, 1992; Pitt & Shephard, 1997), diagnostic measures of fit (Pitt & Shephard, 1997), sequential imputation in missing data problems (Kong, Liu, & Wong, 1994), blind deconvolution (Liu & Chen, 1995), and medical prognosis (Berzuini, Best, Gilks, & Larizza, 1997).

In this article we focus on neural networks, constraining ourselves to sequential training and predictions. We use sampling techniques in a state-space setting to show how an MLP may be trained in environments where data arrive one at a time. We consider sequential importance sampling and resampling algorithms and illustrate their performance on some simple problems. Further, we introduce a hybrid algorithm in which each sampling step is supplemented by a gradient-type update step. This could be a straightforward gradient descent on all the samples propagated or even an EKF-type update of the samples. We do not deal here with the problem of choosing the network architecture.

We target this work primarily at the neural networks community. We believe that many signal processing applications addressed by this community will fall into this category. Part of the article is written in tutorial form to introduce the concepts. We also believe the real-life example we include and the experience gained in applying sequential sampling techniques with a highly nonlinear function approximator will be of benefit to the statistics community. In particular, we improve the existing sequential sampling methods by incorporating gradient information. This results in a hybrid optimization scheme (HySIR), whereby each sampling trajectory of the sampling algorithm is updated with an EKF.
The HySIR may therefore be interpreted as an efficient dynamic mixture of EKFs, whose accuracy improves as the number of filters increases. We show that the HySIR algorithm outperforms conventional sequential sampling methods in terms of computational speed and accuracy. In section 2, we formulate the problem of training neural networks in terms of a state-space representation. A theoretical Bayesian solution is subsequently proposed in section 3. Owing to the practical difficulties that arise when computing the Bayesian solution, three approximate methods (numerical integration, gaussian approximation, and Monte Carlo simulation) are discussed in section 4. In sections 5 and 6, we derive a generic sequential importance sampling methodology and point out some of the limitations of conventional sampling methods. In section 7, we introduce the HySIR algorithm. Finally, in section 8, we test the algorithms on several problems, including the pricing of options on the Financial Times Stock Exchange 100-Share (FTSE-100) index.
2 State-Space Neural Network Modeling
As in our previous work (de Freitas et al., 1997), we start from a state-space representation to model the neural network's evolution in time. A transition equation describes the evolution of the network weights, and a measurements equation describes the nonlinear relation between the inputs and outputs of a particular physical process, as follows:
$$ w_{k+1} = w_k + d_k \tag{2.1} $$
$$ y_k = g(w_k, x_k) + v_k, \tag{2.2} $$
where y_k denotes the output measurements, x_k the input measurements, and w_k the network weights. The random walk transition of equation 2.1 may also be replaced by a noisy gradient descent step on the prediction error:
$$ w_{k+1} = w_k + a\, \bigl( y_k - \hat{g}(w_k, x_k) \bigr)\, \frac{\partial \hat{g}(w_k, x_k)}{\partial w_k} + d_k, \tag{2.3} $$
where a is a learning-rate parameter and ĝ(w_k, x_k) is the model prediction at time k. The gradient descent model allows for directed steps when searching for optimal locations in parameter space. The noise terms are assumed to be uncorrelated with the network weights and the initial conditions. The evolution of the states (network weights) corresponds to a Markov process with initial probability p(w_0) and transition probability p(w_k | w_{k−1}). The observations are assumed to be conditionally independent given the states. These are standard assumptions in a large class of tracking and time-series problems (Harvey, 1970).
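The state-space model of equations 2.1 and 2.2 is simple to simulate; the following is a minimal sketch (all names are illustrative, and g(w, x) stands in for the network function):

```python
import numpy as np

def simulate_step(w, x, g, q_std, r_std, rng):
    """One step of the state-space model (eqs. 2.1-2.2): emit a noisy
    measurement of the current weights, then let the weights drift."""
    y = g(w, x) + r_std * rng.standard_normal()        # eq. 2.2
    w_next = w + q_std * rng.standard_normal(w.shape)  # eq. 2.1
    return y, w_next
```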
Typically, the task is to estimate the network weights ŵ_k and the noise parameters R and Q given the measurements Y_k = {y_1, y_2, ..., y_k}. To simplify the exposition, we do not treat the problem of estimating the noise covariances and the initial probabilities. (To understand how these variables may be estimated by hierarchical Bayesian models or expectation-maximization (EM) learning, see de Freitas et al., 1997, 1998a.) We shall, however, analyze the roles played by R and Q in sequential Monte Carlo simulation.

3 The Bayesian Solution
The posterior density $p(W_k \mid Y_k)$, where $Y_k = \{y_1, y_2, \ldots, y_k\}$ and $W_k = \{w_1, w_2, \ldots, w_k\}$, constitutes the complete solution to the sequential estimation problem. In many applications, such as tracking, it is of interest to estimate one of its marginals, the filtering density $p(w_k \mid Y_k)$. By computing the filtering density recursively, we do not need to keep track of the complete history of the weights. Thus, from a storage point of view, the filtering density turns out to be more parsimonious than the full posterior density function. If we know the filtering density of the network weights, we can easily derive various estimates of the network weights, including centroids, modes, medians, and confidence intervals. In sections 5 and 6, we show how the filtering density may be approximated using sequential importance sampling techniques.

The filtering density is estimated recursively in two stages: prediction and update, as illustrated in Figure 1. In the prediction step, the filtering density $p(w_{k-1} \mid Y_{k-1})$ is propagated into the future by the transition density $p(w_k \mid w_{k-1})$ as follows:

$$p(w_k \mid Y_{k-1}) = \int p(w_k \mid w_{k-1})\, p(w_{k-1} \mid Y_{k-1})\, dw_{k-1}. \tag{3.1}$$

The transition density is defined in terms of the probabilistic model governing the states' evolution (see equation 2.1) and the process noise statistics, that is:

$$p(w_k \mid w_{k-1}) = \int p(w_k \mid d_{k-1}, w_{k-1})\, p(d_{k-1} \mid w_{k-1})\, dd_{k-1} = \int \delta(w_k - d_{k-1} - w_{k-1})\, p(d_{k-1})\, dd_{k-1},$$

where the Dirac delta function $\delta(\cdot)$ indicates that $w_k$ can be computed via a purely deterministic relation when $w_{k-1}$ and $d_{k-1}$ are known. Note that $p(d_{k-1} \mid w_{k-1}) = p(d_{k-1})$ because the process and measurement noise terms have been assumed to be independent of past and present values of the states.
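Equation 3.1 is what a sampling-based filter implements when it propagates particles through the state dynamics: each sample of $p(w_{k-1} \mid Y_{k-1})$ is perturbed by a draw of the process noise, yielding samples of $p(w_k \mid Y_{k-1})$. A minimal sketch in Python/NumPy (the variable names are ours; scalar weights and gaussian process noise with variance $Q$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(particles, Q):
    """Propagate samples of p(w_{k-1} | Y_{k-1}) through the random-walk
    dynamics w_k = w_{k-1} + d_{k-1}, d ~ N(0, Q), giving samples of the
    prediction density p(w_k | Y_{k-1}) as in equation 3.1."""
    noise = rng.normal(0.0, np.sqrt(Q), size=particles.shape)
    return particles + noise

particles = rng.normal(0.0, 1.0, size=1000)   # samples of the previous filtering density
predicted = predict(particles, Q=0.1)          # samples of the prediction density
```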
960
J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet
Figure 1: Prediction and update stages in the recursive computation of the filtering density.
The update stage involves the application of Bayes' rule when new data are observed:

$$p(w_k \mid Y_k) = \frac{p(y_k \mid w_k)\, p(w_k \mid Y_{k-1})}{p(y_k \mid Y_{k-1})}. \tag{3.2}$$

The likelihood density function is defined in terms of the measurements model (see equation 2.2), as follows:

$$p(y_k \mid w_k) = \int \delta(y_k - g(w_k, x_k) - v_k)\, p(v_k)\, dv_k.$$
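For intuition, the predict-update recursion of equations 3.1 and 3.2 can be run exactly on a one-dimensional grid (anticipating the direct numerical integration approach discussed in section 4.1). Everything below, the gaussian dynamics and likelihood, the identity measurement model $g(w, x) = w$, the grid limits, and the data, is an illustrative assumption of ours:

```python
import numpy as np

# One-dimensional grid illustration of the predict-update recursion
# (equations 3.1 and 3.2).
w = np.linspace(-5.0, 5.0, 401)            # grid over the weight space
dw = w[1] - w[0]
Q, R = 0.2, 0.5                             # process / measurement variances (illustrative)

def gauss(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

filtering = gauss(w, 1.0)                   # prior p(w_0), here N(0, 1)
filtering /= filtering.sum() * dw

for y_k in [0.8, 1.1, 0.9]:                 # made-up observations
    # Prediction (eq. 3.1): integrate against the transition density.
    trans = gauss(w[:, None] - w[None, :], Q)
    predicted = trans @ filtering * dw
    # Update (eq. 3.2): multiply by the likelihood, normalize by the evidence.
    likelihood = gauss(y_k - w, R)          # measurement model g(w, x) = w
    unnorm = likelihood * predicted
    evidence = unnorm.sum() * dw
    filtering = unnorm / evidence

posterior_mean = float((w * filtering).sum() * dw)
```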
The normalizing denominator of equation 3.2 is often referred to as the evidence density function. It plays a key role in optimization schemes that exploit gaussian approximation (de Freitas et al., 1997; Jazwinski, 1969; Mackay, 1992; Sibisi, 1989). It is given by:

$$p(y_k \mid Y_{k-1}) = \int p(y_k \mid w_k)\, p(w_k \mid Y_{k-1})\, dw_k.$$

In this formulation, the question of how to choose the prior $p(w_0)$ arises immediately. This is a central issue in Bayesian inference. The problem is, however, much more general. It is related to the ill-posed nature of the inference and learning task. A finite set of data admits many possible explanations, unless some form of a priori knowledge is used to constrain the solution space. Assuming that the data conform to a specific degree of smoothness is certainly the most fundamental and widely applied type of a priori knowledge. This assumption is particularly suited to applications involving noisy data. Nonetheless, the Bayesian approach admits more general forms of a priori knowledge within a probabilistic formulation. It would be interesting to devise methods for mapping experts' knowledge, available in many formats, into the Bayesian probabilistic representation.

4 Practical Solutions
The prediction and update strategy given by equations 3.1 and 3.2 provides an optimal solution to the inference problem, but unfortunately, it involves multidimensional integrations, the source of most of the practical difficulties inherent in the Bayesian methodology. In most applications, analytical solutions are not possible. This is why we need to resort to alternative approaches, such as direct numerical integration, gaussian approximation, and Monte Carlo simulation methods.

4.1 Direct Numerical Integration. The direct numerical integration method relies on approximating the distribution of interest by a discrete distribution on a finite grid of points. The location of the grid is a nontrivial design issue. Once the distribution is computed at the grid points, an interpolation procedure is used to approximate it in the remainder of the space. Kitagawa (1987) used this method to replace the filtering integrals by finite sums over a large set of equally spaced grid points. He chose a piecewise linear interpolation strategy. Kramer and Sorenson (1988) adhered to the same methodology, but opted for a constant interpolating function. Pole and West (1990) have attempted to reduce the problem of choosing the grid's location by implementing a dynamic grid allocation method. When the grid points are spaced "closely enough" and encompass the region of high probability, the method works well. However, the method is very difficult to implement in high-dimensional multivariate problems such as neural network modeling. Here, computing at every point in a dense
multidimensional grid becomes prohibitively expensive (Gelman, Carlin, Stern, & Rubin, 1995; Gilks, Richardson, & Spiegelhalter, 1996).

4.2 Gaussian Approximation. Until recently, the popular approach to sequential estimation has been gaussian approximation (Bar-Shalom & Li, 1993). In the linear gaussian estimation problem, the Kalman filter provides an optimal recursive algorithm for propagating and updating the mean and covariance of the hidden states. In nonlinear scenarios, the EKF is a computationally efficient natural extension of the Kalman filter. The EKF is obtained by approximating the transition and measurements equations with a Taylor expansion about the last predicted state. Typically, first-order expansions, which lead to a gaussian approximation of the posterior density function, are employed. The mean and covariance are propagated and updated by a simple set of equations, similar to the Kalman filter equations. As the number of terms in the Taylor expansion increases, the complexity of the algorithm also increases due to the computation of derivatives of increasing order. For example, a linear expansion requires the computation of the Jacobian, while a quadratic expansion involves computing the Hessian matrix. Gaussian approximation results in a simple and elegant framework that is amenable to the design of inference and learning algorithms.

Various authors have studied the problem of approximating the distribution of neural network weights with a gaussian function. One of the earliest implementations of EKF-trained MLPs is due to Singhal and Wu (1988). In their method, the network weights are grouped into a single vector $w$ that is updated in accordance with the EKF equations. The entries of the Jacobian matrix (i.e., the derivative of the outputs with respect to the weights) are calculated by backpropagating the output observations through the network. The algorithm Singhal and Wu proposed requires considerable computational effort.
The complexity is of order $om^2$ multiplications per estimation step, where $m$ represents the number of weights and $o$ the number of network outputs. Shah, Palmieri, and Datum (1992) and Puskorius and Feldkamp (1991; Puskorius, Feldkamp, & Davis, 1996) have proposed strategies for decoupling the global EKF estimation algorithm into local EKF estimation subproblems. For example, they suggest that the weights of each neuron could be updated independently. The assumption in the local updating strategies is that the weights are decoupled and, consequently, that the Kalman filter weights covariance is a block-diagonal matrix.

The EKF is an improvement over conventional neural network estimation techniques, such as on-line backpropagation, in that it makes use of second-order statistics (covariances). These statistics are essential for placing error bars on the predictions and combining separate networks into committees of networks. The backpropagation algorithm is, in fact, a degenerate form of the EKF algorithm (de Freitas et al., 1997; Ruck, Rogers, Kabrisky, Maybeck, & Oxley, 1992). Previously we have shown that gaussian
approximation, within a hierarchical Bayesian framework, leads to interesting algorithms that exhibit adaptive memory and adaptive regularization coefficients. These algorithms enabled us to learn the dynamics and measurements noise statistics in slowly changing nonstationary environments (de Freitas et al., 1997; de Freitas, Niranjan, & Gee, 1998c). In stationary environments, where the data are available in batches, we have also shown (de Freitas et al., 1998a; de Freitas, Niranjan, & Gee, 1998b) that it is possible to learn these statistics by means of the Rauch-Tung-Striebel smoother (Rauch, Tung, & Striebel, 1965) and the EM algorithm (Dempster, Laird, & Rubin, 1977). There have been several attempts to implement the EM algorithm online (Collings, Krishnamurthy, & Moore, 1994; Elliott, 1994; Krishnamurthy & Moore, 1993). Unfortunately, these still rely on various heuristics and tend to be computationally demanding.

A natural progression on sequential gaussian approximation is to employ a mixture of gaussian densities (Kadirkamanathan & Kadirkamanathan, 1995; Li & Bar-Shalom, 1994; Sorenson & Alspach, 1971). These mixtures can be either static or dynamic. In static mixtures, the gaussian models assumed to be valid throughout the entire process are a subset of several hypothesized models. That is, we start with a few models and compute which of them describe the sequential process most accurately. The remaining models are discarded. In dynamic model selection, one particular model out of a set of $r$ operating models is selected during each estimation step. Dynamic mixtures of gaussian models are far more general than static mixtures of models. However, in stationary environments, static mixtures are obviously more adequate. Dynamic mixtures are particularly suited to the problem of noise estimation in rapidly changing environments, such as tracking maneuvering targets.
There each model corresponds to a different hypothesis of the value of the noise covariances (Bar-Shalom & Li, 1993). In summary, gaussian approximation, because of its simplicity and computational efficiency, constitutes a good way of handling many problems where the density of interest has a significant and predominant mode. Many problems, however, do not fall into this category. Mixtures of gaussians provide a better solution when there are a few dominant modes. However, they introduce extra computational requirements and complications, such as estimating the number of mixture components.

4.3 Monte Carlo Simulation. Many signal processing problems, such as equalization of communication signals, financial time series, medical prognosis, target tracking, and geophysical data analysis, involve elements of nongaussianity, nonlinearity, and nonstationarity. Consequently, it is not possible to derive exact closed-form estimators based on the standard criteria of maximum likelihood, maximum a posteriori, or minimum mean-squared error. Analytical approximations to the true distribution of the data do not take into account all the salient statistical features of the processes under consideration. Monte Carlo simulation methods, which provide a
complete description of the probability distribution of the data, improve the accuracy of the analysis. The basic idea in Monte Carlo simulation is that a set of weighted samples, drawn from the posterior density function of the neural network weights, is used to map the integrations, involved in the inference process, to discrete sums. More precisely, we make use of the following Monte Carlo approximation:

$$\hat{p}(W_k \mid Y_k) = \frac{1}{N} \sum_{i=1}^{N} \delta\!\left(W_k - W_k^{(i)}\right),$$

where $W_k^{(i)}$ represents the samples used to describe the posterior density and, as before, $\delta(\cdot)$ denotes the Dirac delta function. We carry the suffix $k$, denoting time, to emphasize that the approximation is performed at the arrival of each data point. Consequently, any expectations of the form

$$E[f_k(W_k)] = \int f_k(W_k)\, p(W_k \mid Y_k)\, dW_k$$

may be approximated by the following estimate:

$$E[f_k(W_k)] \approx \frac{1}{N} \sum_{i=1}^{N} f_k\!\left(W_k^{(i)}\right),$$

where the samples $W_k^{(i)}$ are drawn from the posterior density function. Monte Carlo sampling techniques are an improvement over direct numerical approximation in that they automatically select samples in regions of high probability.

In recent years, many researchers in the statistical and signal processing communities have, almost simultaneously, proposed several variations of sequential Monte Carlo algorithms. These algorithms have been applied to a wide range of problems, including target tracking (Gordon et al., 1993; Isard & Blake, 1996), financial analysis (Müller, 1992; Pitt & Shephard, 1997), diagnostic measures of fit (Gelfand, Dey, & Chang, 1992; Pitt & Shephard, 1997), sequential imputation in missing data problems (Kong et al., 1994), blind deconvolution (Liu & Chen, 1995), and medical prognosis (Berzuini et al., 1997). Most of the current research has focused on statistical refinements of this class of algorithm (Carpenter, Clifford, & Fearnhead, 1997; Liu & Chen, 1998; Pitt & Shephard, 1997).

The basic sequential Monte Carlo methods had been introduced in the automatic control field in the late sixties. For instance, Handschin and Mayne (1969) tackled the problem of nonlinear filtering with a sequential importance sampling approach. They combined analytical and numerical techniques, using the control variate method (Hammersley & Handscomb, 1968), to reduce the variance of the estimates. In the seventies, various researchers continued working on these ideas (Akashi & Kumamoto, 1977; Handschin, 1970; Zaritskii, Svetnik, &
Figure 2: Update and prediction stages in the sequential sampling process. In the update stage, the likelihood of each sample is evaluated. The samples are then propagated with their respective likelihood. The size of the black circles indicates the likelihood of a particular sample. The samples with higher likelihood are allowed to have more “children.” In the prediction stage, a process noise term is added to the samples so as to increase the variety of the sample set. The result is that the surviving samples provide a better weighted description of the likelihood function.
Shimelevich, 1975). Zaritskii et al. (1975) is particularly interesting. The authors treated the problems of continuous and discrete time filtering and introduced several novel ideas. With the exception of Doucet (1998), most authors have overlooked the Monte Carlo sampling ideas proposed in the seventies.

Figure 2 illustrates the operation of a generic sequential Monte Carlo method. It embraces the standard assumption that we can sample from the prior $p(w_k \mid Y_{k-1})$ and evaluate the likelihood $p(y_k \mid w_k^{(i)})$ of each sample. Only the fittest samples, that is, the ones with the highest likelihood, survive after the update stage. These then proceed to be multiplied, according to their likelihood, in the prediction stage. The update and prediction stages are governed by equations 3.2 and 3.1, respectively.

It is instructive to approach the problem from an optimization perspective. Figures 3 and 4 show the windowed global descent in the error function that is typically observed. The diagrams shed light on the roles played by the noise covariances $R$ and $Q$. $Q$ dictates by how much the cloud of samples is expanded in the
Figure 3: First and second steps of a one-dimensional sequential Monte Carlo optimization problem. The size of the initial cloud of samples determines the search region in parameter space. The goal is to find the best possible minimum of the error function. It is clear that as the number of samples increases, the chances of reaching lower minima increase. In the second step, the updated clouds of samples are generated. The width of these clouds is obtained from the intersection of the width of the prior cloud of samples, the error function, and a threshold determined by the measurements noise covariance $R$. The updated clouds of samples are denser than the prior cloud of samples. Next, the samples are grouped in regions of higher likelihood in a resampling step. Finally, the clouds of samples are expanded by a factor determined by the process noise covariance $Q$. This expansion allows the clouds to reach regions of the parameter space where the error function is lower.
Figure 4: Third and fourth steps of a one-dimensional sequential Monte Carlo optimization problem. To reach the global minimum on the right, the number of samples has to be increased.
prediction stage. By increasing Q, the density of the cloud of samples is reduced. Consequently, the algorithm will take longer to converge. A very small Q, on the other hand, will not allow the algorithm to explore new regions of the parameter space. Ideally, one needs to implement an algorithm that adapts Q automatically. This is explored in section 7. R controls the resolution of the update stage, as shown in Figure 5. A small value of R will cause the likelihood to be too narrow. Consequently, only a few trajectories
Figure 5: Role played by the measurements variance R. Small values of R result in a narrow likelihood, with only a few trajectories being able to propagate to the next time step. If R is too small, we might miss some of the important modes. On the other hand, if R is too large, we are not giving preference to any of the trajectories, and the algorithm might not converge.
will be able to propagate to the next time step. If $R$ is too small, the optimization scheme might fail to detect some of the important modes of the likelihood function. By increasing $R$, we broaden the likelihood function, thereby increasing the number of surviving trajectories. If we choose $R$ to be too large, all the trajectories become equally likely and the algorithm might not converge. As shown in Figures 3 and 4, $R$ gives rise to a threshold $T(R)$ in the error function, which determines the width of the updated clouds of samples.

If we can sample from the prior and evaluate the likelihood up to proportionality, three sampling strategies can be adopted to sample from the posterior: acceptance-rejection sampling, Markov chain Monte Carlo (MCMC), and sampling-importance resampling (SIR). After sampling from the prior, rejection sampling dictates that one should accept the samples with probability $p(y_k \mid w_k^{(i)}) / p(y_k \mid w_k^{\text{Max}})$, where $w_k^{\text{Max}} = \arg\max_{w_k} p(y_k \mid w_k)$. Unfortunately, the rejection sampler requires a random number of iterations at each time step. This proves to be computationally expensive in high-dimensional spaces (Doucet, 1998; Müller, 1991; Pitt & Shephard, 1997).

The fundamental idea behind MCMC methods is to construct a Markov chain whose asymptotic distribution tends to the required posterior density (Gilks et al., 1996). Generally, it can take a large number of iterations before
this happens. Moreover, it is difficult even to assess when the chain has converged. For most real-time sequential processing applications, MCMC methods can be too computationally demanding (Berzuini et al., 1997; Gordon & Whitby, 1995; Müller, 1992). In our work, we favor the SIR sampling option. This approach is the core of many successful sequential Bayesian tools, such as the well-known bootstrap or particle filters (Avitzour, 1995; Beadle & Djuric, 1997; Berzuini et al., 1997; Doucet, 1998; Gordon et al., 1993; Kitagawa, 1996; Liu & Chen, 1998; Pitt & Shephard, 1997). In the following sections, we derive this algorithm from an importance sampling perspective.

5 Importance Sampling
We can approximate the posterior density function with a function on a finite discrete support. Consequently, it follows from the strong law of large numbers that as the number of samples $N$ increases, expectations can be mapped into sums. Unfortunately, it is often impossible to sample directly from the posterior density function. However, we can circumvent this difficulty by sampling from a known, easy-to-sample, proposal density function $\pi(W_k \mid Y_k)$ and making use of the following substitution:

$$E[f_k(W_k)] = \int f_k(W_k)\, \frac{p(W_k \mid Y_k)}{\pi(W_k \mid Y_k)}\, \pi(W_k \mid Y_k)\, dW_k = \int f_k(W_k)\, \frac{p(Y_k \mid W_k)\, p(W_k)}{p(Y_k)\, \pi(W_k \mid Y_k)}\, \pi(W_k \mid Y_k)\, dW_k = \int f_k(W_k)\, \frac{q_k(W_k)}{p(Y_k)}\, \pi(W_k \mid Y_k)\, dW_k,$$

where the variables $q_k(W_k)$ are known as the unnormalized importance ratios:

$$q_k = \frac{p(Y_k \mid W_k)\, p(W_k)}{\pi(W_k \mid Y_k)}.$$

We can get rid of the unknown normalizing density $p(Y_k)$ as follows:

$$E[f_k(W_k)] = \frac{1}{p(Y_k)} \int f_k(W_k)\, q_k(W_k)\, \pi(W_k \mid Y_k)\, dW_k = \frac{\int f_k(W_k)\, q_k(W_k)\, \pi(W_k \mid Y_k)\, dW_k}{\int \frac{p(Y_k \mid W_k)\, p(W_k)}{\pi(W_k \mid Y_k)}\, \pi(W_k \mid Y_k)\, dW_k} = \frac{\int f_k(W_k)\, q_k(W_k)\, \pi(W_k \mid Y_k)\, dW_k}{\int q_k(W_k)\, \pi(W_k \mid Y_k)\, dW_k} = \frac{E_\pi[q_k(W_k)\, f_k(W_k)]}{E_\pi[q_k(W_k)]}. \tag{5.1}$$
Hence, by drawing samples from the proposal function $\pi(\cdot)$, we can approximate the expectations of interest by the following estimate:

$$E[f_k(W_k)] \approx \frac{\frac{1}{N} \sum_{i=1}^{N} f_k\!\left(W_k^{(i)}\right) q_k\!\left(W_k^{(i)}\right)}{\frac{1}{N} \sum_{i=1}^{N} q_k\!\left(W_k^{(i)}\right)} = \sum_{i=1}^{N} f_k\!\left(W_k^{(i)}\right) \tilde{q}_k\!\left(W_k^{(i)}\right), \tag{5.2}$$

where the normalized importance ratios $\tilde{q}_k^{(i)}$ are given by

$$\tilde{q}_k^{(i)} = \frac{q_k^{(i)}}{\sum_{j=1}^{N} q_k^{(j)}}.$$

The estimate of equation 5.2 is valid if the $W_k^{(i)}$ correspond to a set of independently and identically distributed samples drawn from the proposal density function, the support of the proposal function includes the support of the posterior function, and the expectations of $f(\cdot)$, $q$, and $qf^2(\cdot)$ exist and are finite (Geweke, 1989). Under these conditions and as $N$ tends to infinity, the posterior density function can be approximated arbitrarily well by the estimate:

$$\hat{p}(W_k \mid Y_k) = \sum_{i=1}^{N} \tilde{q}_k^{(i)}\, \delta\!\left(W_k - W_k^{(i)}\right).$$
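Equations 5.1 and 5.2 can be exercised on a toy one-dimensional problem in which the target is a gaussian known only up to proportionality. The target $N(1, 1)$, the proposal $N(0, 4)$, and the choice $f(w) = w$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Self-normalized importance sampling (equation 5.2) for a toy problem:
# target posterior N(1, 1), broader proposal N(0, 4).
N = 200_000
samples = rng.normal(0.0, 2.0, size=N)               # draws from the proposal

def log_gauss(x, mean, var):
    return -0.5 * (x - mean) ** 2 / var - 0.5 * np.log(2 * np.pi * var)

# Unnormalized importance ratios q = p(target) / pi(proposal); the target
# need only be known up to a constant, since the ratios are normalized.
log_q = log_gauss(samples, 1.0, 1.0) - log_gauss(samples, 0.0, 4.0)
q_tilde = np.exp(log_q - log_q.max())
q_tilde /= q_tilde.sum()                              # normalized ratios

posterior_mean = float(np.sum(q_tilde * samples))     # E[w] under the target
```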
6 Sequential Importance Sampling
In order to compute a sequential estimate of the posterior density function at time $k$ without modifying the previously simulated states $W_{k-1}$, we may adopt the following proposal density:

$$\pi(W_k \mid Y_k) = \pi(W_0 \mid Y_0) \prod_{j=1}^{k} \pi(w_j \mid W_{j-1}, Y_j) = \pi(W_{k-1} \mid Y_{k-1})\, \pi(w_k \mid W_{k-1}, Y_k). \tag{6.1}$$

At this stage, we need to recall that we assumed that the states correspond to a Markov process and that the observations are conditionally independent given the states, that is:

$$p(W_k) = p(w_0) \prod_{j=1}^{k} p(w_j \mid w_{j-1}) \tag{6.2}$$
and

$$p(Y_k \mid W_k) = \prod_{j=1}^{k} p(y_j \mid w_j). \tag{6.3}$$

By substituting equations 6.1 through 6.3 into equation 5.1, we can derive a recursive estimate for the importance weights as follows:

$$q_k = \frac{p(Y_k \mid W_k)\, p(W_k)}{\pi(W_{k-1} \mid Y_{k-1})\, \pi(w_k \mid W_{k-1}, Y_k)} = q_{k-1}\, \frac{p(Y_k \mid W_k)\, p(W_k)}{p(Y_{k-1} \mid W_{k-1})\, p(W_{k-1})}\, \frac{1}{\pi(w_k \mid W_{k-1}, Y_k)} = q_{k-1}\, \frac{p(y_k \mid w_k)\, p(w_k \mid w_{k-1})}{\pi(w_k \mid W_{k-1}, Y_k)}. \tag{6.4}$$

Equation 6.4 gives a mechanism to update the importance ratios sequentially. Since we can sample from the proposal function and evaluate the likelihood and transition probabilities (see equations 2.1 and 2.2), all we need to do is generate a prior set of samples and iteratively compute the importance ratios. This procedure allows us to obtain the following type of estimates:

$$E[f_k(W_k)] \approx \sum_{i=1}^{N} \tilde{q}_k^{(i)}\, f_k\!\left(W_k^{(i)}\right). \tag{6.5}$$
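When the transition prior $p(w_k \mid w_{k-1})$ is used as the proposal, the transition terms in equation 6.4 cancel and the update reduces to $q_k = q_{k-1}\, p(y_k \mid w_k)$. A sketch of one such step, carried out in the log domain to avoid numerical underflow (the scalar model and noise values are our illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

Q, R = 0.1, 0.5                          # process / measurement variances (illustrative)
N = 500
w = rng.normal(0.0, 1.0, size=N)         # particles at time k-1
log_q = np.zeros(N)                      # log importance ratios q_{k-1} (uniform)

def sis_step(w, log_q, y_k):
    """One sequential importance sampling step with the transition prior
    as proposal: sample the proposal, then multiply by the likelihood."""
    w_new = w + rng.normal(0.0, np.sqrt(Q), size=w.shape)
    log_lik = -0.5 * (y_k - w_new) ** 2 / R       # gaussian likelihood, g(w, x) = w
    return w_new, log_q + log_lik

w, log_q = sis_step(w, log_q, y_k=0.7)
q_tilde = np.exp(log_q - log_q.max())
q_tilde /= q_tilde.sum()
estimate = float(np.sum(q_tilde * w))    # posterior mean estimate (eq. 6.5)
```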
In section 6.4 we give a more detailed description of a generic sequential importance sampling algorithm. We need to delay this presentation because the SIS algorithm discussed so far has a serious limitation that requires immediate attention: the variance of the ratio $p(W_k \mid Y_k) / \pi(W_k \mid Y_k)$ increases over time when the observations are regarded as random (Kong et al., 1994). To understand why the variance increase poses a problem, suppose that we want to sample from the posterior. In that case, we want the proposal density to be very close to the posterior density. When this happens, we obtain the following results for the mean and variance:

$$E_\pi\!\left[\frac{p(W_k \mid Y_k)}{\pi(W_k \mid Y_k)}\right] = 1$$

and

$$\operatorname{var}_\pi\!\left[\frac{p(W_k \mid Y_k)}{\pi(W_k \mid Y_k)}\right] = E_\pi\!\left[\left(\frac{p(W_k \mid Y_k)}{\pi(W_k \mid Y_k)} - 1\right)^{\!2}\right] = 0.$$

In other words, we expect the variance to be close to zero in order to obtain reasonable estimates. Therefore, a variance increase has a harmful effect on the accuracy of the simulations. In practice, the degeneracy caused by the variance increase can be observed by monitoring the importance ratios. Typically what we observe is that after a few iterations, one of the normalized importance ratios tends to 1, while the remaining ratios tend to zero.
Figure 6: Resampling process, whereby a random measure $\{w_k^{(i)}, \tilde{q}_k^{(i)}\}$ is mapped into an equally weighted random measure $\{w_k^{(j)}, N^{-1}\}$. The index $i$ is drawn from a uniform distribution.
6.1 Resampling. To avoid the degeneracy of the SIS simulation method, a resampling stage may be used to eliminate samples with low importance ratios and multiply samples with high importance ratios. It is possible to see an analogy to the steps in genetic algorithms. Many of the ideas on resampling have stemmed from the work of Efron (1982), Rubin (1988), and Smith and Gelfand (1992). Various authors have described efficient algorithms for accomplishing this task in $O(N)$ operations (Doucet, 1998; Pitt & Shephard, 1997).

Resampling involves mapping the measure $\{w_k^{(i)}, \tilde{q}_k^{(i)}\}$ into an equally weighted random measure $\{w_k^{(j)}, N^{-1}\}$. This is accomplished by sampling uniformly from the discrete set $\{w_k^{(i)}\}_{i=1}^{N}$ with probabilities $\{\tilde{q}_k^{(i)}\}_{i=1}^{N}$. A mathematical proof of this can be found in Gordon (1994, pp. 111–112). Figure 6 shows a way of sampling from this discrete set. After constructing the cumulative distribution of the discrete set, a uniformly drawn sampling index $i$ is projected onto the distribution range and then onto the distribution domain. The intersection with the domain constitutes the new sample index $j$. That is, the vector $w_k^{(j)}$ is accepted as the new sample. Clearly, the vectors with the larger sampling ratios will end up with more copies after the resampling process. In the subsequent prediction stage, random disturbances are added to the samples, thereby adding variety to the dominant samples, as was shown in Figure 2.

Liu and Chen (1995, 1998) have argued that when all the importance ratios are nearly equal, resampling only reduces the number of distinctive
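The inverse-cumulative-distribution sampling just described can be sketched as follows (the particle values and ratios in the toy check are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

def resample(particles, q_tilde, rng, size=None):
    """Multinomial resampling via the inverse cumulative distribution
    (Figure 6): uniform draws are projected onto the cumulative sum of
    the normalized importance ratios, and the bin where each draw lands
    gives the new sample index j."""
    if size is None:
        size = len(particles)
    cdf = np.cumsum(q_tilde)
    cdf[-1] = 1.0                          # guard against round-off
    u = rng.uniform(size=size)
    return particles[np.searchsorted(cdf, u)]

# Toy check: the particle with ratio 0.8 should receive roughly 80% of
# the copies.
particles = np.array([-1.0, 0.0, 1.0, 2.0])
q_tilde = np.array([0.05, 0.05, 0.80, 0.10])
resampled = resample(particles, q_tilde, rng, size=10_000)
frac = float(np.mean(resampled == 1.0))
```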
streams and introduces extra variation in the simulations. Using a result from Kong et al. (1994), they suggest that resampling should be performed only when the effective sample size $N_{\text{eff}}$ is below a fixed heuristic threshold. The effective sample size is defined as

$$N_{\text{eff}} = \frac{1}{\sum_{i=1}^{N} \left(\tilde{q}_k^{(i)}\right)^{2}}.$$
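This definition translates directly into code; the two weight vectors below are illustrative extremes (uniform ratios give $N_{\text{eff}} = N$, a single dominant ratio gives $N_{\text{eff}}$ close to 1):

```python
import numpy as np

def effective_sample_size(q_tilde):
    """Effective sample size of a set of normalized importance ratios
    (Kong et al., 1994)."""
    return 1.0 / np.sum(np.asarray(q_tilde) ** 2)

n_uniform = effective_sample_size(np.full(100, 0.01))        # uniform ratios
n_degenerate = effective_sample_size(
    np.array([0.99] + [0.01 / 99] * 99))                     # one dominant ratio
```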
We have so far explained how to compute the importance ratios sequentially and how to improve the sample set by resampling. After discussing the choices of likelihood and proposal density functions, we shall be able to present a complete and detailed algorithm description.

6.2 Likelihood Function. The likelihood density function is defined in terms of the measurements equation. In our work, we adopted the following gaussian form:

$$p(y_k \mid w_k) \propto \exp\!\left(-\frac{1}{2}\, (y_k - \hat{g}(w_k, x_k))^T R^{-1} (y_k - \hat{g}(w_k, x_k))\right).$$
We also studied the effect of adding a weight decay term to the likelihood function to penalize large network weights. This did not seem to improve the results in the experiments that we performed. In addition, it introduced an extra tuning parameter. If it is suspected that the data set may contain several outliers, likelihood functions with heavy tails should be employed.

6.3 Choosing the Proposal Function. The choice of proposal function is one of the most critical design issues in importance sampling algorithms. Doucet (1997) has shown that the proposal function

$$\pi(w_k \mid W_{k-1}, Y_k) = p(w_k \mid w_{k-1}, y_k) \tag{6.6}$$

minimizes the variance of the importance ratios conditional on $W_{k-1}$ and $Y_k$. This choice of proposal function has also been advocated by other researchers (Kong et al., 1994; Liu & Chen, 1995; Zaritskii et al., 1975). Nonetheless, the density

$$\pi(w_k \mid W_{k-1}, Y_k) = p(w_k \mid w_{k-1}) \tag{6.7}$$

is the most popular choice of proposal function (Avitzour, 1995; Beadle & Djuric, 1997; Gordon et al., 1993; Isard & Blake, 1996; Kitagawa, 1996). Although it results in higher Monte Carlo variation than the optimal proposal (see equation 6.6), as a result of its not incorporating the most recent observations, it is usually easier to implement (Berzuini et al., 1997; Doucet, 1998; Liu & Chen, 1998).
974
J. F. G. de Freitas, M. Niranjan, A. H. Gee, and A. Doucet
In section 7, we describe an algorithm that allows us to update each sampling trajectory, with an extended Kalman filter, before the resampling stage. This Kalman update effectively constitutes a strategy for sampling from the proposal density of equation 6.6.

6.4 Algorithm Pseudocode. The essential structure of a sequential Monte Carlo algorithm to train neural networks with two layers, using the proposal function of equation 6.7, involves the following steps:
1. INITIALIZE NETWORK WEIGHTS ($k = 0$): For $i = 1, \ldots, N$

   Draw the weights $w_0^{(i)}$ from the first-layer prior $p_1(w_0)$ and the second-layer prior $p_2(w_0)$.

   Evaluate the importance ratios: $q_0^{(i)} = p(y_0 \mid w_0^{(i)})$.

   Normalize the importance ratios: $\tilde{q}_0^{(i)} = q_0^{(i)} \big/ \sum_{j=1}^{N} q_0^{(j)}$.

2. For $k = 1, \ldots, L$:

   (a) SAMPLING STAGE: For $i = 1, \ldots, N$

       Predict via the dynamics equation: $\hat{w}_{k+1}^{(i)} = w_k^{(i)} + d_k^{(i)}$, where $d_k^{(i)}$ is a sample from $p(d_k)$ ($\mathcal{N}(0, Q)$ in our case).

       Evaluate the importance ratios: $q_{k+1}^{(i)} = q_k^{(i)}\, p(y_{k+1} \mid \hat{w}_{k+1}^{(i)})$.

       Normalize the importance ratios: $\tilde{q}_{k+1}^{(i)} = q_{k+1}^{(i)} \big/ \sum_{j=1}^{N} q_{k+1}^{(j)}$.

   (b) RESAMPLING STAGE: For $i = 1, \ldots, N$

       If $N_{\text{eff}} \geq$ Threshold: $w_{k+1}^{(i)} = \hat{w}_{k+1}^{(i)}$.

       Else: resample a new index $j$ from the discrete set $\{\hat{w}_{k+1}^{(i)}, \tilde{q}_{k+1}^{(i)}\}$, and set $w_{k+1}^{(i)} = \hat{w}_{k+1}^{(j)}$ and $q_{k+1}^{(i)} = 1/N$.
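A compact runnable sketch of the algorithm above, for a hypothetical one-input, one-output MLP with six hidden units. The data (a noisy sine), the layer priors, the noise covariances, and the $N/2$ resampling threshold are all our illustrative choices, not the paper's experimental settings; the importance ratios are carried in the log domain to avoid underflow:

```python
import numpy as np

rng = np.random.default_rng(4)

H, N, Q, R = 6, 300, 1e-3, 0.1            # hidden units, particles, noise variances
n_w = 3 * H + 1                           # weights of a 1-H-1 MLP

def predict_net(w, x):
    """Evaluate the MLP g(w, x) for every particle (rows of w)."""
    W1, b1 = w[:, :H], w[:, H:2*H]
    W2, b2 = w[:, 2*H:3*H], w[:, 3*H]
    hidden = np.tanh(W1 * x + b1)
    return np.sum(W2 * hidden, axis=1) + b2

def resample(w, q_tilde):
    idx = rng.choice(len(w), size=len(w), p=q_tilde)
    return w[idx]

# Layer-specific priors (as suggested in the text): different spread per layer.
w = np.concatenate([rng.normal(0, 1.0, (N, 2*H)),      # first-layer weights and biases
                    rng.normal(0, 2.0, (N, H + 1))],   # second-layer weights and bias
                   axis=1)
log_q = np.zeros(N)

xs = np.linspace(-2, 2, 40)
ys = np.sin(xs) + rng.normal(0, np.sqrt(R), size=xs.shape)  # synthetic data

for x_k, y_k in zip(xs, ys):
    w = w + rng.normal(0, np.sqrt(Q), size=w.shape)     # sampling stage
    log_q += -0.5 * (y_k - predict_net(w, x_k))**2 / R  # gaussian likelihood
    q_tilde = np.exp(log_q - log_q.max())
    q_tilde /= q_tilde.sum()
    n_eff = 1.0 / np.sum(q_tilde**2)
    if n_eff < N / 2:                                   # heuristic threshold
        w = resample(w, q_tilde)
        log_q = np.zeros(N)                             # ratios reset to 1/N
        q_tilde = np.full(N, 1.0 / N)

posterior_mean_pred = float(np.sum(q_tilde * predict_net(w, 1.0)))
```

Statistical estimates such as the posterior mean prediction above follow equation 6.5; confidence intervals can be read off the weighted sample set in the same way.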
An order-$N$ algorithm to perform the resampling stage is described in pseudocode by Carpenter et al. (1997). The generic sequential sampling algorithm is rather straightforward to implement using a gaussian likelihood function.¹ We make use of a different prior for each layer because the functional form of the neurons varies with layer (Müller & Rios Insua, 1998). In addition, we can select the prior hyperparameters for each layer in accordance with the magnitude of the input and output signals. Statistical estimates such as mean values and confidence intervals can be computed using equation 6.5.

6.5 Problems with Sequential SIR Simulation. The discreteness of the resampling stage (see Figure 6) implies that any particular sample with a high importance ratio will be duplicated many times. As a result, the cloud of samples may eventually collapse to a single sample. This degeneracy will limit the ability of the algorithm to search for lower minima in other regions of the error surface. In addition, the number of samples used to describe the posterior density function will become too small and inadequate. Gordon (1994) discusses various strategies to overcome this problem. He suggests that the sample set can be boosted by ensuring that the number of elements in the discrete distribution exceeds the posterior samples by a factor of at least 10. In addition, he proposes smoothing the discrete distribution with a kernel interpolator. Gordon et al. (1993) propose a roughening procedure, whereby an independent jitter, with zero mean and standard deviation $s$, is added to each sample drawn in the resampling stage. The standard deviation of the jitter is given by:
$$s = K I N^{-1/m}, \tag{6.8}$$

where $I$ denotes the length of the interval between the maximum and minimum samples of each specific component, $K$ is a tuning parameter, and, as before, $m$ represents the number of weights. Large values of $K$ blur the distribution, and very small values produce tight clusters of points around the original samples. We have found in our simulations that this technique works very well and that tuning $K$ requires little effort.
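The roughening of equation 6.8 can be sketched as follows ($K = 0.2$ and the particle set are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(5)

def roughen(particles, K=0.2):
    """Roughening jitter of Gordon et al. (1993), equation 6.8: perturb each
    component of each resampled particle by zero-mean gaussian noise with
    standard deviation s = K * I * N**(-1/m), where I is the per-component
    range of the samples."""
    N, m = particles.shape
    I = particles.max(axis=0) - particles.min(axis=0)   # per-component interval
    s = K * I * N ** (-1.0 / m)
    return particles + rng.normal(0.0, 1.0, size=particles.shape) * s

particles = rng.normal(0.0, 1.0, size=(500, 4))
jittered = roughen(particles)
```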
¹ We have made the software for the implementation of the SIR algorithm available online at the following Web site: http://www.cs.berkeley.edu/~jfgf/.
Another serious problem arises when the likelihood function is much narrower than the prior. In this scenario, very few samples are accepted in the resampling stage. The problem is exacerbated when the regions of high likelihood are located in the tails of the prior density. This difficulty may be surmounted by the brute-force method of increasing the sample size. Alternatively, one should select the prior variances and likelihood variance with great care, after an exploratory data analysis stage. In our neural networks framework, we adopt different informative priors for each layer, whose variances depend on the magnitude of the input and output signals. Moreover, the covariance matrix of the network weights for each sampling trajectory is propagated and updated with an EKF. This leads to an improved annealed sampling algorithm, whereby the variance of each sampling trajectory decreases with time. That is, we start searching over a large region of the error surface, and as time progresses, we concentrate on the regions of lower error. This technique is described in the following section.

7 HySIR
The Monte Carlo conception of optimization relies solely on probing the error surface at several points, as shown in Figures 3 and 4. It fails to make use of all the information implicit in the error surface. For instance, the method could be enhanced by evaluating the gradient and other higher derivatives of the error surface. This is evident in Figure 7. By allowing the samples to follow the gradient after the first prediction stage, the algorithm can reach a better minimum. In order to improve existing sequential Monte Carlo simulation algorithms, we propose a new hybrid gradient descent/SIR method. The main feature of the algorithm is that the samples are updated by a gradient descent step in the prediction stage, as follows:

w_{k+1} = w_k - (a/2) ∂(y_k - ĝ(w_k, x_k))² / ∂w_k
        = w_k + a (y_k - ĝ(w_k, x_k)) ∂ĝ(w_k, x_k) / ∂w_k.   (7.1)
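As a sketch, the prediction-stage update of equation 7.1 can be written as follows for a single sample. This is our own illustration: `g` and `grad_g` stand in for the network output and its weight derivative, and the toy linear model at the bottom is an assumption used only to keep the example self-contained.

```python
import numpy as np

def gradient_step(w, x, y, g, grad_g, a):
    """One gradient descent step of equation 7.1.

    g(w, x) is the network output and grad_g(w, x) its derivative
    with respect to the weights; a is the learning rate.
    """
    err = y - g(w, x)                   # prediction error y_k - g^(w_k, x_k)
    return w + a * err * grad_g(w, x)   # equation 7.1

# Toy model (an assumption for illustration): a "network" that is
# linear in its weights, g(w, x) = w . x, whose weight derivative is x.
g = lambda w, x: w @ x
grad_g = lambda w, x: x

w = np.zeros(2)
x, y = np.array([1.0, 2.0]), 5.0
for _ in range(200):
    w = gradient_step(w, x, y, g, grad_g, a=0.05)
```

With a = 0.05 the prediction error contracts at each step; a learning rate that is too large would make the iteration diverge, as discussed below.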
Methods to Train Neural Network Models

Figure 7: HySIR algorithm. By following the gradient after the prediction stage, it is possible to reach better minima.

The term ∂ĝ(w_k, x_k)/∂w_k is the Jacobian. It can be easily computed by backpropagating derivatives through the network, as explained in appendix B of de Freitas et al. (1997). The gradient descent learning rate a is a tuning parameter that controls how fast we descend. Too large a value would cause the trajectory on the error surface to overfluctuate and never converge. A very small value, on the other hand, would not contribute toward speeding up the SIR algorithm's convergence. The plain gradient descent algorithm can get trapped at shallow local minima. One way of overcoming this problem is to define the error surface in terms of a Hamiltonian that accounts for the approximation errors and the momentum of each trajectory. This is the basis of the hybrid Monte Carlo algorithm (Brass, Pendleton, Chen, & Robson, 1993; Duane, Kennedy, Pendleton, & Roweth, 1987). In this algorithm, each trajectory is updated by approximating the Hamiltonian differential equations with a leapfrog discretization scheme. The discretization, however, may introduce biases. In our work, we favor a stochastic gradient descent approach based on the EKF. This technique avoids shallow local minima by adaptively adding noise to the network weights. This is shown in Figure 8. The EKF equations are given by:

w_{k+1|k} = w_k
P_{k+1|k} = P_k + Q* I_mm
K_{k+1} = P_{k+1|k} G_{k+1} [R* I_oo + G_{k+1}^T P_{k+1|k} G_{k+1}]^(-1)
w_{k+1} = w_{k+1|k} + K_{k+1} (y_{k+1} - g(x_{k+1}, w_{k+1|k}))
P_{k+1} = P_{k+1|k} - K_{k+1} G_{k+1}^T P_{k+1|k},

where K_{k+1} is known as the Kalman gain matrix, I_mm denotes the identity matrix of size m × m, and R* and Q* are two tuning parameters, whose
Figure 8: HySIR with EKF updating. By adding noise to the network weights and adapting the weights covariance, it is possible to reach better minima with the EKF than with plain gradient descent techniques.
roles are explained in detail in de Freitas et al. (1997). Here, it suffices to say that they control the rate of convergence of the EKF algorithm and the generalization performance of the neural network. In the general multiple-input, multiple-output (MIMO) case, the Jacobian is given by:

G = ∂g/∂w |_(w = ŵ) =

    [ ∂g_1/∂w_1   ∂g_2/∂w_1   ···   ∂g_o/∂w_1 ]
    [ ∂g_1/∂w_2       ·                 ·     ]
    [     ·           ·                 ·     ]
    [ ∂g_1/∂w_m      ···          ∂g_o/∂w_m  ]
Since the EKF is a suboptimal estimator based on linearization of a nonlinear mapping, strictly speaking, P_k is an approximation to the covariance matrix. In mathematical terms:

P_k ≈ E[(w_k - ŵ_k)(w_k - ŵ_k)^T | Y_k].
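For concreteness, one cycle of the EKF recursion above can be sketched as follows. This is our own illustration: `g` and `jacobian` are placeholders for the network output and its weight Jacobian, and the linear test model used in practice by the reader would replace them.

```python
import numpy as np

def ekf_step(w, P, x, y, g, jacobian, Q_star, R_star):
    """One EKF update of the weights w and their covariance P.

    g(x, w) returns the o-dimensional network output; jacobian(x, w)
    returns G, the m x o matrix of output derivatives with respect to
    the weights. Q_star and R_star are the two tuning parameters.
    """
    m = w.size
    o = np.atleast_1d(y).size
    P_pred = P + Q_star * np.eye(m)               # P_{k+1|k}
    G = jacobian(x, w)                            # m x o Jacobian
    S = R_star * np.eye(o) + G.T @ P_pred @ G     # innovation covariance
    K = P_pred @ G @ np.linalg.inv(S)             # Kalman gain K_{k+1}
    w_new = w + K @ np.atleast_1d(y - g(x, w))    # measurement update
    P_new = P_pred - K @ G.T @ P_pred             # posterior covariance
    return w_new, P_new
```

Because the posterior covariance shrinks as measurements accumulate, repeated calls anneal the search, which is exactly the property HySIR exploits.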
The Kalman filter step, before the update stage, allows us to incorporate the latest measurements into the prior. As a result, we can sample from the prior p(w_k | w_{k-1}, y_k) before the update stage. Another advantage of this method is that the covariance of the weights changes over time. Because the EKF is a minimum variance estimator, the covariance of the weights decreases with time. Consequently, as the tracking improves, the variation of the network weights is reduced. This annealing process improves the efficiency of the search for global minima in parameter space and reduces the variance of the estimates. It should be pointed out that the weights need to be resampled in conjunction with their respective covariances. The pseudocode for the HySIR algorithm with EKF updating is as follows:2

1. INITIALIZE NETWORK WEIGHTS (k = 0).
2. For k = 1, ..., L:
   (a) SAMPLING STAGE: For i = 1, ..., N:
       Predict via the dynamics equation:
           ŵ_{k+1}^(i) = w_k^(i) + d_k^(i),
       where d_k^(i) is a sample from p(d_k) (N(0, Q) in our case).
       Update the samples with the EKF equations.
       Evaluate the importance ratios:
           q_{k+1}^(i) = q_k^(i) p(y_{k+1} | ŵ_{k+1}^(i)).
       Normalize the importance ratios:
           q̃_{k+1}^(i) = q_{k+1}^(i) / Σ_{j=1}^N q_{k+1}^(j).
   (b) RESAMPLING STAGE: For i = 1, ..., N:
       If N_eff ≥ Threshold:
           w_{k+1}^(i) = ŵ_{k+1}^(i) and P_{k+1}^(i) = P̂_{k+1}^(i).
       Else:
           Resample a new index j from the discrete set {ŵ_{k+1}^(i), q̃_{k+1}^(i)}:
           w_{k+1}^(i) = ŵ_{k+1}^(j) and P_{k+1}^(i) = P̂_{k+1}^(j),
           q_{k+1}^(i) = 1/N.

2 We have made the software for the implementation of the HySIR algorithm available online at the following Web site: http://www.cs.berkeley.edu/~jfgf/.
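The resampling stage of the pseudocode above can be sketched as follows. This is our own illustration; in particular, the effective-sample-size estimate N_eff = 1/Σ_i q̃_i² is the commonly used one and is an assumption here, since the text does not spell out its formula.

```python
import numpy as np

def resampling_stage(w_hat, P_hat, q, threshold, rng=None):
    """Resampling stage of the HySIR pseudocode (a sketch).

    w_hat: (N, m) predicted weight samples; P_hat: (N, m, m) their EKF
    covariances; q: normalized importance ratios. The weights are
    resampled together with their covariances, as the text requires.
    """
    rng = rng or np.random.default_rng()
    N = len(q)
    n_eff = 1.0 / np.sum(q ** 2)          # effective sample size (assumed form)
    if n_eff >= threshold:
        return w_hat, P_hat, q            # keep samples and ratios unchanged
    idx = rng.choice(N, size=N, p=q)      # draw indices j from {w_hat, q}
    return w_hat[idx], P_hat[idx], np.full(N, 1.0 / N)
```

When the ratios are nearly uniform, N_eff stays above the threshold and nothing is resampled; when a single sample dominates, the stage duplicates it and resets the ratios to 1/N.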
The HySIR with EKF updates may be interpreted as a dynamical mixture of EKFs. That is, the estimates depend on a few modes of the posterior density function. In addition to having to compute the normalized importance ratios (O(N(oS_1² + d²S_1)) operations, where S_1 is the number of hidden neurons, N the number of samples, d the input dimension, and o the number of outputs) and to perform resampling (O(N) operations), the HySIR has to perform the EKF updates (O(Nom²) operations, where m is the number of network weights). That is, the computational complexity of the HySIR increases approximately linearly with the number of EKFs in the mixture. For problems with dominant modes, we need only a few trajectories, and, consequently, the HySIR provides accurate and computationally efficient solutions. The HySIR approach makes use of the likelihood function twice per time step. Consequently, if we make the noise distributions too narrow, the algorithm will give only a representation of a narrow region of the posterior density function. Typically, the algorithm will converge to a few modes of the posterior density function. This is illustrated in section 8.2.

8 Simulations
In this section, we discuss two experiments: one using synthetic data, and a real application involving the pricing of financial options on the FTSE100 index. The first experiment provides a comparison of the various algorithms discussed so far. It addresses the trade-off between accuracy and computational efficiency. The second experiment shows how the HySIR algorithm can be used to estimate time-varying latent variables. It also serves the purpose of illustrating the effect of the process noise covariance Q on the simulations.

8.1 Experiment 1: Time-Varying Function Approximation. We generated input-output data using the following time-varying function:

y(x_1(k), x_2(k)) = 4 sin(x_1(k) - 2) + 2 x_2(k)² + 5 cos(0.02k) + 5 + ν,

where the inputs x_1(k) and x_2(k) were simulated from a gaussian distribution with zero mean and unit variance. The noise ν was drawn from a gaussian distribution with zero mean and variance equal to 0.1. Subsequently, we approximated the data with an MLP with five hidden sigmoidal neurons and a linear output neuron. The MLP was trained sequentially using the
Table 1: RMS Errors and Floating-Point Operations for the Algorithms Used in the Function Approximation Problem.

                 EKF     SIS     SIR     HySIR
RMS error        6.51    3.87    3.27    1.17
Mega flops       4.8     5.6     5.8     48.6
Time (seconds)   1.2     27.4    28.9    24.9

Notes: The RMS errors include the convergence period. This explains why they are higher than the variance of the additive noise.
HySIR, SIR, SIS, and EKF algorithms. The difference between the SIR and the SIS algorithms is that SIR resamples at every time step, while SIS resamples only if the effective sample size falls below a threshold. We set our threshold to N/3. We set the number of samples (N) to 100 for the SIS and SIR methods and to 10 for the HySIR method. We assigned the values 100 and 1 to the initial variance of the weights and the diagonal entries of the weights covariance matrix. The diagonal entries of R and Q were given the values 0.5 and 2, respectively. Finally, the EKF noise hyperparameters R* and Q* were set to 2 and 0.01. Table 1 shows the average one-step-ahead prediction errors obtained for 100 runs, of 200 time steps each, on a Silicon Graphics R10000 workstation. On average, resampling steps occurred in 50% of the time steps when using the SIS algorithm. It is clear from the results that avoiding the resampling stage does not yield a great reduction in computational time. This is because one needs to evaluate the neural network mappings for each trajectory in the sampling stage. As a result, the sampling stage tends to be much more computationally expensive than the resampling stage. The results also show that the HySIR performs much better than the conventional SIR and EKF algorithms. It is, however, computationally more expensive, as explained at the end of section 7. It is interesting to note that the HySIR, despite requiring more floating-point operations, is faster than the SIR algorithm. This shows that the computational efficiency of sampling algorithms depends on the platform on which they are implemented. The design of efficient software and hardware platforms is an important avenue for further research. Figure 9 shows two plots of the prediction errors for the EKF and HySIR. The left plot shows the best and worst results obtained with the EKF out of 100 trials, each trial involving a different initial weights vector.
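The synthetic data of experiment 1 can be regenerated in a few lines. This sketch is our own; the function name, the seed, and the use of NumPy's random generator are assumptions.

```python
import numpy as np

def generate_data(L=200, seed=0):
    """Generate the time-varying data of experiment 1.

    The inputs x1, x2 are zero-mean, unit-variance gaussian, and the
    additive noise has variance 0.1, as stated in the text.
    """
    rng = np.random.default_rng(seed)
    k = np.arange(1, L + 1)
    x1 = rng.standard_normal(L)
    x2 = rng.standard_normal(L)
    noise = rng.normal(0.0, np.sqrt(0.1), L)
    y = 4 * np.sin(x1 - 2) + 2 * x2 ** 2 + 5 * np.cos(0.02 * k) + 5 + noise
    return np.column_stack([x1, x2]), y
```

The slow cos(0.02k) drift is what makes the function time varying, so a sequential trainer must keep tracking it rather than converge once and stop.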
We then used the 100 initial weights vectors to initialize the HySIR algorithm with 100 sampling trajectories. As shown in the right plot, the global search nature of the HySIR algorithm allows it to converge much faster than the EKF algorithm. This is a clear example that dynamic mixtures of models,
Figure 9: Convergence of the EKF and HySIR algorithms for experiment 1. The left plot shows the best and worst EKF convergence results out of 100 runs using different initial conditions. The initial conditions for each EKF were used to initialize the EKF filters within the HySIR algorithm. At each time step, the HySIR evaluates a weighted combination of several EKFs. By performing this type of dynamic model selection, it is able to search over large regions of parameter space and, hence, converge much faster than the standard EKF algorithm.
with model selection at each time step, perform better than static mixtures (choosing the best EKF out of the 100 trials).

8.2 Experiment 2: Latent States Tracking. To assess the ability of the hybrid algorithm to estimate time-varying hidden parameters, we generated input-output data from a logistic function followed by a linear scaling and a displacement, as shown in Figure 10. We applied two gaussian input sequences to the model. This simple model is equivalent to an MLP with one hidden neuron and a linear output neuron. We then trained a second model with the same structure using the input-output data generated by the first model. In the training phase, of 200 time steps, we allowed the model weights to vary with time. During this phase, the HySIR algorithm was used to track the input-output training data and estimate the latent model weights. After the 200th time step, we fixed the values of the weights and generated another 200 input-output test data points from the original model. The input test data were then fed to the trained model, using the weights values estimated at the 200th time step. Subsequently, the output prediction of the trained model was compared to the output data from the original model so as to assess the generalization performance of the training process.
Figure 10: Logistic function with linear scaling and displacement used in experiment 2. The weights were chosen as follows: w_1(k) = 1 + k/100, w_2(k) = sin(0.06k) - 2, w_3(k) = 0.1, w_4(k) = 1, w_5(k) = -0.5.
We repeated the experiment using two different values of Q, as shown in Figures 11 and 12. With a very small value of Q (0.0001), the algorithm was able to estimate the time-varying latent weights. After 100 time steps, the algorithm became very confident, as indicated by the narrow error bars. By increasing Q to 0.01, we found that the algorithm becomes less confident. Yet it is able to explore wider regions of parameter space, as shown in the histogram of Figure 13. The variance (Q) of the noise added in the prediction stage implies a trade-off between the algorithm's confidence and the size of the search region in parameter space. Too large a value of Q will allow the algorithm to explore a large region of the space but will never allow it to converge to a specific error minimum. A very small value, on the other hand, will cause all the sampling trajectories to converge to a very narrow region. The problem with the latter is that if the process generating the data changes with time, we want to keep the search region large enough to find the new minima quickly and efficiently. To address this problem, we have employed a very simple heuristic. Specifically, when we add gaussian noise in the prediction stage, we select a small set of the trajectories randomly and then add to them gaussian noise with a covariance much larger than Q. In theory, this should allow the algorithm to converge and yet keep exploring other regions of parameter space. From an evolutionary computing perspective, we can think of these few randomly selected trajectories as mutations. The results of applying this idea to our problem are shown in Figure 13. They suggest that our heuristic reduces the trade-off between the algorithm's confidence and the size of the search region. Yet we believe that further research on algorithms for adapting Q is necessary.

8.3 Example with Real Data.
An interesting source of inference problems requiring nonlinear modeling in a time-varying environment is financial data analysis. Many applications, ranging from time-series prediction to stock selection, have been attempted using neural networks. Hutchinson,
Figure 11: Weights tracking with Q = 0.0001. (Top left) One-step-ahead prediction errors in the training set. (Top right) Prediction errors on validation data, assuming that after the 200th time step, the weights stop varying with time. (Middle) Histogram of w_2. (Bottom) Tracking of the weights with the posterior mean estimate computed by the HySIR and the one-standard-deviation error bars of the estimator. For this particular value of Q, the estimator becomes very confident after 100 iterations, and, consequently, the histogram for w_2 converges to a narrow gaussian distribution.
Lo, and Poggio (1994) and Niranjan (1996) look at the problem of pricing options contracts in a data-driven manner. The relationship between the price of an options contract and the parameters relating to the option, such as the price of the underlying asset and the time to maturity, is known to be nonlinear. Hutchinson et al. show that a neural network can form a very good approximation to the price of the options contract as a function of these variables. This relationship is also likely to change over the
Figure 12: Weights tracking with Q = 0.01. (Top left) One-step-ahead prediction errors in the training set. (Top right) Prediction errors on validation data, assuming that after the 200th time step, the weights stop varying with time. (Middle) Histogram of w_2. (Bottom) Tracking of the weights with the posterior mean estimate computed by the HySIR and the one-standard-deviation error bars of the estimator. For this value of Q, the search region is particularly large (wide histograms), but the algorithm lacks confidence (wide error bars).
lifetime of the option. Hence, one would expect that a sequential approach to tracking the price of the options contract might be a more natural way to handle the nonstationarity. Niranjan addresses this by means of an EKF. We extend the above work, using the data in Niranjan (1996), consisting of daily FTSE100 options during the February 1993–December 1993 period. We compare the approximation and convergence capabilities of a number of algorithms discussed in earlier sections. The models considered are:

Trivial. Uses the current value of the option as the next prediction.
Figure 13: Weights tracking with Q = 0.0001 and two random mutations per time step with Q = 0.1. (Top left) One-step-ahead prediction errors in the training set. (Top right) Prediction errors on validation data, assuming that after the 200th time step, the weights stop varying with time. (Middle) Histogram of w_2. (Bottom) Tracking of the weights with the posterior mean estimate computed by the HySIR and the one-standard-deviation error bars of the estimator. The histogram shows that the mutations expand the search region considerably.
RBF-EKF. Represents a regularized radial basis function network with four hidden neurons, originally proposed in Hutchinson et al. (1994). The output weights are estimated with a Kalman filter, while the means of the radial functions correspond to random subsets of the data and their covariance is set to the identity matrix, as in Niranjan (1996).

BS. Corresponds to a conventional Black-Scholes model with two outputs (normalized call and put prices) and two parameters (risk-free interest
rate and volatility). The risk-free interest rate was set to 0.06, while the volatility was estimated over a moving window (of 50 time steps), as described in Hull (1997).

MLP-EKF. A multilayer perceptron with six sigmoidal hidden units and a linear output neuron, trained with the EKF algorithm (de Freitas et al., 1997). The noise covariances R and Q and the states covariance P were set to diagonal matrices with entries equal to 10^-6, 10^-7, and 10, respectively. The weights prior corresponded to a zero-mean gaussian density with covariance equal to 1.

MLP-EKFE. Represents an MLP, with six sigmoidal hidden units and a linear output neuron, trained with a hierarchical Bayesian model, whereby the weights are estimated with the EKF algorithm and the noise statistics are computed by maximizing the evidence density function (de Freitas et al., 1997). The states covariance P was given by a diagonal matrix with entries equal to 10. The weights prior corresponded to a zero-mean gaussian density with covariance equal to 1.

MLP-SIR. Corresponds to an MLP, with six sigmoidal hidden units and a linear output neuron, trained with the SIR algorithm. One thousand samples were employed in each simulation. The noise covariances R and Q were set to diagonal matrices with entries equal to 10^-4 and 10^-5, respectively. The prior samples for each layer were drawn from a zero-mean gaussian density with covariance equal to 1.

MLP-HySIR. Corresponds to an MLP, with six sigmoidal hidden units and a linear output neuron, trained with the Kalman HySIR algorithm. Each simulation made use of 10 samples. The noise covariances R and Q were set to diagonal matrices with entries equal to 1 and 10^-6, respectively. The prior samples for each layer were drawn from a zero-mean gaussian density with covariance equal to 1. The Kalman filter parameters R*, Q*, and P were set to diagonal matrices with entries equal to 10^-4, 10^-5, and 1000, respectively.

Table 2 shows the mean squared error between the actual value of the options contracts and the one-step-ahead prediction from the various algorithms. Although the differences are small, one sees that sequential algorithms produce lower prediction errors and that the HySIR algorithm produces even lower errors than the Kalman filter. These errors are computed on data corresponding to the last 100 days, to allow for convergence of the algorithms. Note that in the on-line setting considered here, all of these errors are computed on unseen data. Figure 14 shows the approximations of the MLP-HySIR algorithm on the call and put option prices at the same strike price (3325). The algorithm exhibits fast convergence and accurate tracking. In Figure 15, we plot the probability density function over the network's one-step-ahead prediction, computed through 100 samples propagated by the HySIR approach.
Figure 14: Tracking call and put option prices with the MLP-HySIR method. Estimated values [—] and actual values [- -].
Figure 15: Probability density function of the neural network's one-step-ahead prediction of the call price for the FTSE option with a strike price of 3325.
Table 2: One-Step-Ahead Prediction Errors on Call Options.

Strike Price   2925     3025     3125     3225     3325
Trivial        0.0783   0.0611   0.0524   0.0339   0.0205
RBF-EKF        0.0538   0.0445   0.0546   0.0360   0.0206
BS             0.0761   0.0598   0.0534   0.0377   0.0262
MLP-EKF        0.0414   0.0384   0.0427   0.0285   0.0145
MLP-EKFE       0.0404   0.0366   0.0394   0.0283   0.0150
MLP-SIR        0.0419   0.0405   0.0432   0.0314   0.0164
MLP-HySIR      0.0394   0.0362   0.0404   0.0273   0.0151
Typically, the one-step-ahead predictions for a group of options on the same cash product, but with different strike prices or durations to maturity, can be used to determine whether one of the options is being mispriced. Knowing the probability distribution of the network outputs allows us to design more interesting pricing tools.

9 Conclusions
We have presented a sequential Monte Carlo approach for training neural networks in a Bayesian setting. This approach enables accurate characterization of the probability distributions, which tend to be very complicated for nonlinear models such as neural networks. Propagating a number of samples in a state-space dynamical-systems formulation of the problem is the key idea in this framework. The experimental illustrations presented here suggest that it is possible to apply sampling methods effectively to tackle difficult problems involving nonlinear and time-varying processes, typical of many sequential signal processing problems. A particular contribution is the HySIR algorithm, which combines gradient information of the error function with probabilistic sampling information. Simulations show that this combination performs better than the EKF and conventional SIR approaches. HySIR can be interpreted as a gaussian mixture filter, in that only a few sampling trajectories need to be employed. Yet as the number of trajectories increases, the computational requirements increase only linearly. Therefore, the method is also suitable as a sampling strategy for multimodal optimization. Our experiment with financial data showed that the sequential sampling approach, in addition to matching the one-step-ahead squared errors of the best available pricing methods, allowed us to obtain complete estimates of the probability density functions of the one-step-ahead predictions. Further avenues of research include the design of algorithms for adapting the noise covariances R and Q, studying the effect of different noise models
for the network weights, and improving the computational efficiency of the algorithms.

10 Acknowledgments
We thank Neil Gordon (DERA, U.K.), Sue Johnson (Department of Engineering, Cambridge), and Michael Isard (Computer Vision Group, Oxford) for interesting and useful discussions on this work. We are also very grateful to the referees for their valuable comments. J. F. G. F. is financially supported by two University of the Witwatersrand Merit Scholarships, a Foundation for Research Development Scholarship (South Africa), an ORS award, and a Trinity College External Studentship (Cambridge).

References

Akashi, H., & Kumamoto, H. (1977). Random sampling approach to state estimation in switching environments. Automatica, 13, 429–434.
Avitzour, D. (1995). A stochastic simulation Bayesian approach to multitarget tracking. IEE Proceedings on Radar, Sonar and Navigation, 142(2), 41–44.
Bar-Shalom, Y., & Li, X. R. (1993). Estimation and tracking: Principles, techniques and software. Boston: Artech House.
Beadle, E. R., & Djuric, P. M. (1997). A fast weighted Bayesian bootstrap filter for nonlinear model state estimation. IEEE Transactions on Aerospace and Electronic Systems, 33(1), 338–343.
Berzuini, C., Best, N. G., Gilks, W. R., & Larizza, C. (1997). Dynamic conditional independence models and Markov chain Monte Carlo methods. Journal of the American Statistical Association, 92(440), 1403–1412.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Brass, A., Pendleton, B. J., Chen, Y., & Robson, B. (1993). Hybrid Monte Carlo simulations theory and initial comparison with molecular dynamics. Biopolymers, 33(8), 1307–1315.
Carpenter, J., Clifford, P., & Fearnhead, P. (1997). An improved particle filter for nonlinear problems (Tech. Rep.). Oxford: Department of Statistics, Oxford University. Available online at: http://www.stats.ox.ac.uk/~clifford/index.htm.
Collings, I. B., Krishnamurthy, V., & Moore, J. B. (1994). On-line identification of hidden Markov models via recursive prediction error techniques.
IEEE Transactions on Signal Processing, 42(12), 3535–3539.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.
de Freitas, J. F. G., Niranjan, M., & Gee, A. H. (1997). Hierarchical Bayesian-Kalman models for regularisation and ARD in sequential learning (Tech. Rep. No. CUED/F-INFENG/TR 307). Cambridge: Cambridge University. Available online at: http://svr-www.eng.cam.ac.uk/~jfgf.
de Freitas, J. F. G., Niranjan, M., & Gee, A. H. (1998a). The EM algorithm and neural networks for nonlinear state space estimation (Tech. Rep. No. CUED/F-INFENG/TR
313). Cambridge: Cambridge University. Available online at: http://svr-www.eng.cam.ac.uk/~jfgf.
de Freitas, J. F. G., Niranjan, M., & Gee, A. H. (1998b). Nonlinear state space learning with EM and neural networks. In IEEE International Workshop on Neural Networks in Signal Processing. Cambridge, England.
de Freitas, J. F. G., Niranjan, M., & Gee, A. H. (1998c). Regularisation in sequential learning algorithms. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, 39, 1–38.
Doucet, A. (1997). Monte Carlo methods for Bayesian estimation of hidden Markov models: Application to radiation signals. Unpublished doctoral dissertation, University Paris-Sud, Orsay, France.
Doucet, A. (1998). On sequential simulation-based methods for Bayesian filtering (Tech. Rep. No. CUED/F-INFENG/TR 310). Cambridge: Cambridge University. Available online at: http://www.stats.bris.ac.uk:81/MCMC/pages/list.html.
Duane, S., Kennedy, A. D., Pendleton, B. J., & Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.
Efron, B. (1982). The bootstrap, jackknife and other resampling plans. Philadelphia: SIAM.
Elliott, R. J. (1994). Exact adaptive filters for Markov chains observed in gaussian noise. Automatica, 30(9), 1399–1408.
Gelfand, A. E., Dey, D. K., & Chang, H. (1992). Model determination using predictive distributions with implementation via sampling-based methods. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 147–167). Oxford: Oxford University Press.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. London: Chapman and Hall.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 24, 1317–1399.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in practice. Suffolk: Chapman and Hall.
Gordon, N., & Whitby, A. (1995). A Bayesian approach to target tracking in the presence of a glint. In O. E. Drummond (Ed.), Signal and data processing of small targets (pp. 472–483). Bellingham, WA: SPIE Press.
Gordon, N. J. (1994). Bayesian methods for tracking. Unpublished doctoral dissertation, Imperial College, University of London.
Gordon, N. J., Salmond, D. J., & Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings-F, 140(2), 107–113.
Hammersley, J. H., & Handscomb, D. C. (1968). Monte Carlo methods. London: Methuen.
Handschin, J. E. (1970). Monte Carlo techniques for prediction and filtering of non-linear stochastic processes. Automatica, 6, 555–563.
Handschin, J. E., & Mayne, D. Q. (1969). Monte Carlo techniques to estimate
the conditional expectation in multi-stage non-linear filtering. International Journal of Control, 9(5), 547–559.
Harvey, A. C. (1970). Forecasting, structural time series models, and the Kalman filter. Cambridge: Cambridge University Press.
Hull, J. C. (1997). Options, futures, and other derivatives (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance, 49(3), 851–889.
Isard, M., & Blake, A. (1996). Contour tracking by stochastic propagation of conditional density. In European Conference on Computer Vision (pp. 343–356). Cambridge, UK.
Jazwinski, A. H. (1969). Adaptive filtering. Automatica, 5, 475–485.
Jazwinski, A. H. (1970). Stochastic processes and filtering theory. Orlando, FL: Academic Press.
Kadirkamanathan, V., & Kadirkamanathan, M. (1995). Recursive estimation of dynamic modular RBF networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 239–245). Cambridge, MA: MIT Press.
Kitagawa, G. (1987). Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association, 82(400), 1032–1063.
Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5, 1–25.
Kong, A., Liu, J. S., & Wong, W. H. (1994). Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89(425), 278–288.
Kramer, S. C., & Sorenson, H. W. (1988). Recursive Bayesian estimation using piece-wise constant approximations. Automatica, 24(6), 789–801.
Krishnamurthy, V., & Moore, J. B. (1993). On-line estimation of hidden Markov model parameters based on the Kullback-Leibler information measure. IEEE Transactions on Signal Processing, 41(8), 2557–2573.
Le Cun, Y., Boser, B., Denker, J.
S., Henderson, D., Howard, R. E., & Hubbard, W. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1, 541–551. Li, X. R., & Bar-Shalom, Y. (1994). A recursive multiple model approach to noise identication. IEEE Transactions on Aerospace and Electronic Systems, 30(3), 671–684. Liu, J. S., & Chen, R. (1995). Blind deconvolution via sequential imputations. Journal of the American Statistical Association, 90(430), 567–576. Liu, J. S., & Chen, R. (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93, 1032–1044. Mackay, D. J. C. (1992).Bayesian interpolation. Neural Computation, 4(3),415–447. Muller, ¨ P. (1991). Monte Carlo integration in general dynamic models. Contemporary Mathematics, 115, 145–163. Muller, ¨ P. (1992). Posterior integration in dynamic models. Computer Science and Statistics, 24, 318–324.
Methods to Train Neural Network Models
993
Muller, ¨ P., & Rios Insua, D. (1998). Issues in Bayesian analysis of neural network models. Neural Computation, 10, 571–592. Neal, R. M. (1996). Bayesian learning for neural networks. New York: SpringerVerlag. Niranjan, M. (1996).Sequential tracking in pricing nancial options using model based and neural network approaches. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems 9 (pp. 960– 966). Cambridge, MA: MIT Press. Pitt, M. K., & Shephard, N. (1997). Filtering via simulation: Auxiliary particle lters (Tech. Rep.). London: Department of Statistics, Imperial College of London. Available online at: http://www.nuff.ox.ac.uk/economics/papers. Pole, A., & West, M. (1990). Efcient Bayesian learning on non-linear dynamic models. Journal of Forecasting, 9, 119–136. Puskorious, G. V., & Feldkamp, L. A. (1991). Decoupled extended Kalman lter training of feedforward layered networks. International Joint Conference on Neural Networks (pp. 307–312). Seattle. Puskorius, G. V., Feldkamp, L. A., & Davis, L. I. (1996).Dynamic neural network methods applied to on-vehicle idle speed control. Proceedings of the IEEE, 84(10), 1407–1420. Rauch, H. E., Tung, F., & Striebel, C. T. (1965). Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8), 1445–1450. Robinson, T. (1994). The application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5(2), 298–305. Rubin, D. B. (1988). Using the SIR algorithm to simulate posterior distributions. In J. M. Bernardo, M. H. DeGroot, D. V. Lindley, & A. F. M. Smith (Eds.), Bayesian statistics, 3 (pp. 395–402). Oxford: Oxford University Press. Ruck, D. W., Rogers, S. K., Kabrisky, M., Maybeck, P. S., & Oxley, M. E. (1992). Comparative analysis of backpropagation and the extended Kalman lter for training multilayer perceptrons. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(6), 686–690. Shah, S., Palmieri, F., & Datum, M. 
(1992). Optimal ltering algorithms for fast learning in feedforward neural networks. Neural Networks, 5, 779–787. Sibisi, S. (1989). Regularization and inverse problems. In J. Skilling (Ed.), Maximum entropy and Bayesian methods (pp. 389–396). Norwell, MA: Kluwer. Singhal, S., & Wu, L. (1988). Training multilayer perceptrons with the extended Kalman algorithm. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 133–140). San Mateo, CA: Morgan Kaufmann. Smith, A. F. M., & Gelfand, A. E. (1992). Bayesian statistics without tears: A sampling-resampling perspective. American Statistician, 46, 84–88. Sorenson, H. W., & Alspach, D. L. (1971). Recursive Bayesian estimation using gaussian sums. Automatica, 7, 465–479. Zaritskii, V. S., Svetnik, V. B., & Smihelevich, L. I. (1975).Monte-Carlo techniques in problems of optimal information processing. Automatic Remote Control, 36, 2015–2022. Received July 17, 1998; accepted March 12, 1999.
Neural Computation, Volume 12, Number 5
CONTENTS Review
Expanding NEURON’s Repertoire of Mechanisms with NMODL M. L. Hines and N. T. Carnevale
995
Article
Latent Attractors: A Model for Context-Dependent Place Representations in the Hippocampus Simona Doboli, Ali A. Minai, and Phillip J. Best
1009
Notes
The Approach of a Neuron Population Firing Rate to a New Equilibrium: An Exact Theoretical Result B. W. Knight, A. Omurtag, and L. Sirovich
1045
Formation of Direction Selectivity in Natural Scene Environments Brian Blais, Leon N. Cooper, and Harel Shouval
1057
Letters
A Phase Model of Temperature-Dependent Mammalian Cold Receptors Peter Roper, Paul C. Bressloff, and André Longtin
1067
The Number of Synaptic Inputs and the Synchrony of Large Sparse Neuronal Networks D. Golomb and D. Hansel
1095
A Neural Network Architecture for Visual Selection Yali Amit
1141
Using Bayes’ Rule to Model Multisensory Enhancement in the Superior Colliculus Thomas J. Anastasio, Paul E. Patton, and Kamel Belkacem-Boussaid
1165
Choice and Value Flexibility Jointly Contribute to the Capacity of a Subsampled Quadratic Classifier Panayiota Poirazi and Bartlett W. Mel
1189
New Support Vector Algorithms Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson, and Peter L. Bartlett
1207
REVIEW
Communicated by William Lytton
Expanding NEURON’s Repertoire of Mechanisms with NMODL

M. L. Hines
Department of Computer Science, Yale University, New Haven, CT 06520, U.S.A.

N. T. Carnevale
Department of Psychology, Yale University, New Haven, CT 06520, U.S.A.

Neuronal function involves the interaction of electrical and chemical signals that are distributed in time and space. The mechanisms that generate these signals and regulate their interactions are marked by a rich diversity of properties that precludes a “one size fits all” approach to modeling. This article presents a summary of how the model description language NMODL enables the neuronal simulation environment NEURON to accommodate these differences.

1 Introduction
Recently we described the core concepts and strategies that are responsible for much of the utility of NEURON as a tool for empirically based neuronal modeling (Hines & Carnevale, 1997). That article focused on the strategy used in NEURON to deal with the problem of mapping a spatially distributed system into a discretized (compartmental) representation in a manner that ensures conceptual control while maintaining numeric accuracy and computational efficiency. Now we shift our attention to another important feature of NEURON: its special facility for expanding and customizing its library of biophysical mechanisms. The need for this facility stems from the fact that experimentalists are applying an ever-growing armamentarium of techniques to dissect neuronal operation at the cellular level. There is a steady increase in the number of phenomena that are known to participate in electrical and chemical signaling and are characterized well enough to support empirically based simulations. Since the mechanisms that underlie these phenomena differ across neuronal cell class, developmental stage, and species (e.g., chapter 7 in Johnston & Wu, 1995; also see McCormick, 1998), a simulator that is useful in research must provide a flexible and powerful means for incorporating new biophysical mechanisms in models. It must also help the user remain focused on the model instead of programming. Such a means is provided to the NEURON simulation environment by NMODL, a high-level language that was originally implemented for NEURON by Michael Hines, which he and Upinder Bhalla later extended to generate code suitable for linking with GENESIS (Wilson & Bower, 1989).

Neural Computation 12, 995–1007 (2000)
© 2000 Massachusetts Institute of Technology

The aim of this article is to provide a high-level summary of those aspects of NMODL that we regard as unique and most pertinent to the creation of representations of biophysical mechanisms. These include features that enable the user to construct mechanisms using differential equations or kinetic schemes, specify continuous (density) or delta function (point process) spatial distributions, and implement models of synaptic connectivity, while preserving mass balance for each ionic species. A detailed explanation of the unusual structure and features of NMODL requires numerous, specific illustrations. Therefore we have posted an expanded version of this document at http://www.neuron.yale.edu/nmodl.html. The expanded version is organized around a sequence of examples of increasing complexity and sophistication that introduce important topics in the context of problems of scientific interest, including models of the following mechanisms:

- A passive “leak” current and a localized transmembrane shunt (density mechanisms versus point processes)
- An electrode stimulus (discontinuous parameter changes with variable time-step methods)
- Voltage-gated channels (differential equations versus kinetic schemes)
- Ion accumulation in a restricted space (e.g., extracellular K+)
- Ion buffering, diffusion, and active transport (e.g., Ca2+)
- Use-dependent synaptic plasticity
- Multiple streams of synaptic input and output (network fan-in and fan-out)

This article makes extensive use of specialized concepts and terminology that pertain to NEURON itself. For definitive treatment of these, see Hines and Carnevale (1997) and NEURON’s on-line help files, which are available through links at http://www.neuron.yale.edu.

2 Describing Mechanisms with NMODL
A brief overview of how NMODL is used will clarify its underlying rationale. The first step is to write a text file (a “mod file”) that describes a mechanism as a set of nonlinear algebraic equations, differential equations, or kinetic reaction schemes. The description employs a syntax that closely resembles familiar mathematical and chemical notation. This text is passed to a translator that converts each statement into many statements in C, automatically generating code that handles details such as mass balance for each ionic species and producing code suitable for each of NEURON’s integration methods. The output of the translator is then compiled for computational efficiency. This achieves tremendous conceptual leverage and savings of effort not only because the high-level mechanism specification is much easier to understand and far more compact than the equivalent C code, but also because it spares the user from having to bother with low-level programming issues, like how to “interface” the code with other mechanisms and with NEURON itself. NMODL is a descendant of the MOdel Description Language (MODL; Kohn, Hines, Kootsey, & Feezor, 1994), which was developed at Duke University by the National Biomedical Simulation Resource project for the purpose of building models that would be exercised by the Simulation Control Program (SCoP; Kootsey, Kohn, Feezor, Mitchell, & Fletcher, 1986). NMODL has the same basic syntax and style of organizing model code into named blocks as MODL. Variable declaration blocks, such as PARAMETER, STATE, and ASSIGNED, specify names and attributes of variables that are used in the model. Other blocks are directly involved in setting initial conditions or generating solutions at each time step (the equation definition blocks, e.g., INITIAL, BREAKPOINT, DERIVATIVE, KINETIC, FUNCTION, PROCEDURE). There is also a provision for inserting C statements into a mod file to accomplish implementation-specific goals. NMODL recognizes all the keywords of MODL, but we will limit this discussion to those that are relevant to NEURON simulations, with an emphasis on changes and extensions that were necessary to endow NMODL with NEURON-specific features.

2.1 The NEURON Block.

The principal extension that differentiates NMODL from its MODL origins is that there are separate instances of mechanism data, with different values of states and parameters, in each segment (compartment) of a model cell.
The NEURON block was introduced to make this possible by defining what the model of the mechanism looks like from the “outside” when there are many instances of the model sprinkled at different locations on the cell. The specifications entered in this block are independent of any particular simulator, but the detailed “interface code” requirements of a particular simulator determine whether the output C file is suitable for NEURON (NMODL) or GENESIS (GMODL). For example, consider the accumulation of potassium in the extracellular space adjacent to squid axon (see Figure 1). Satellite cells and other extracellular structures act as a diffusion barrier that prevents free communication between the bath and this “Frankenhaeuser-Hodgkin space” (F-H space; Frankenhaeuser & Hodgkin, 1956). When there is a large efflux of K+ ions from the axon, as during the repolarizing phase of an action potential or in response to injected depolarizing current, K+ builds up in the F-H space. This elevation of [K+]o shifts EK in a depolarized direction, which has two important consequences. First, it reduces the driving force for K+ efflux and causes a decline of the outward IK. Second, when the action potential terminates or the injected depolarizing current is stopped, the persistent elevation of EK causes a slowly decaying depolarization or inward current. This depolarizing shift dissipates gradually as [K+]o equilibrates with [K+]bath.

Figure 1: Extracellular potassium may accumulate adjacent to the cell membrane due to restricted diffusion from the Frankenhaeuser-Hodgkin space.

    NEURON {
        SUFFIX kext
        USEION k READ ik WRITE ko
        GLOBAL kbath
        RANGE fhspace, txfer
    }
This is the NEURON block for the kext mechanism, which emulates the effects of the F-H space. (Readers who desire more detailed information, for example, how the ODE for K+ accumulation is specified, are referred to http://www.neuron.yale.edu/nmodl.html for the expanded version of this article, which contains complete code and descriptions for this and other mechanisms.) The SUFFIX keyword has two consequences. First, it identifies this as a density mechanism, which can be incorporated into a NEURON cable section by an insert statement (see Hines & Carnevale, 1997). Second, it tells the NEURON interpreter that the names for variables and parameters that belong to this mechanism will include the suffix kext, so there will be no conflict with similar names in other mechanisms.

2.1.1 Mechanisms That Interact with Ionic Concentrations.

Many biophysical mechanisms involve processes that have direct interactions with ionic concentrations (e.g., diffusion, buffers, pumps). Furthermore, any compartment in a model may contain several such mechanisms. Therefore, total ionic currents and concentrations must be computed consistently. This is facilitated by the USEION statement, which sets up the necessary bookkeeping by automatically creating a separate mechanism that keeps track of four essential variables: the total outward current carried by an ion, the internal and external concentrations of the ion next to the membrane, and its equilibrium potential. In this case the name of the ion is “k” and the automatically created mechanism is called “k_ion” in the hoc interpreter. The k_ion mechanism has variables ik, ki, ko, and ek, which represent IK, [K+]i, [K+]o, and EK, respectively. These do not have suffixes; furthermore, they are RANGE variables so they can have different values in every segment of each section in which they exist (see the discussion of sections, segments, and RANGE variables in Hines & Carnevale, 1997). In other words, the K+ current through Hodgkin-Huxley potassium channels near one end of the section cable would be cable.ik_hh(0.1), but the total K+ current generated by all sources, including other ionic conductances and pumps, would be cable.ik(0.1) (the 0.1 signifies a location that is one-tenth of the distance along the length of the section away from its origin). Since mechanisms can generate transmembrane fluxes that are attributed to specific ionic species by the USEION x WRITE ix syntax, modeling restricted diffusion is straightforward. A mechanism needs a separate USEION statement for each of the ions that it affects or is affected by. For example, the Hodgkin-Huxley mechanism hh has separate USEION statements for na and k, and the USEION statement for potassium includes READ ek because the potential gradient that drives ik_hh depends on the equilibrium potential for K+.
Since the resulting ionic flux may affect local [K+], the USEION statement for hh also includes WRITE ik so that NEURON can keep track of the total outward current that is carried by potassium, [K+]i and [K+]o, and EK. The kext mechanism computes [K+]o from the outward potassium current, so it READs ik and WRITEs ko. When a mechanism WRITEs a particular ionic concentration, this means that it sets the value for that concentration at all locations in every section into which it has been inserted. This has an important consequence: in any given section, no ionic concentration should be “written” by more than one mechanism. The bath is assumed to be a large, well-stirred compartment that envelops the entire “experimental preparation.” Therefore kbath is a GLOBAL variable so that all sections that contain the kext mechanism will have the same numeric value for [K+]bath. Since this would be one of the controlled variables in an experiment, the value of kbath is specified by the user and will remain constant during the simulation. The thickness of the F-H space is fhspace, the time constant for equilibration with the bath is txfer, and both are RANGE variables so they can vary along the length of each section.
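The qualitative behavior that kext captures can be sketched numerically. The following Python fragment is a minimal illustration, not NEURON’s code; the ODE form (current charges the F-H space, which relaxes toward the bath with time constant txfer) and every parameter value are assumptions chosen for the sketch.

```python
# Minimal sketch (not NEURON's code) of K+ accumulation in an F-H space:
# outward K+ current raises [K+]o, which equilibrates with the bath.
# The ODE form and all parameter values are illustrative assumptions.
F = 96485.0  # Faraday constant (C/mol)

def simulate_ko(ik, t_on, t_off, kbath=10.0, fhspace=1e-7, txfer=0.05,
                dt=1e-4, t_end=0.5):
    """Forward-Euler integration of
       d[K+]o/dt = ik/(F*fhspace) - ([K+]o - kbath)/txfer,
    with ik (A/m^2) applied only between t_on and t_off (seconds);
    concentrations come out in mol/m^3, i.e., mM."""
    ko, trace = kbath, []
    n = int(round(t_end / dt))
    for i in range(n):
        t = i * dt
        cur = ik if t_on <= t < t_off else 0.0
        ko += dt * (cur / (F * fhspace) - (ko - kbath) / txfer)
        trace.append((t, ko))
    return trace

trace = simulate_ko(ik=1.0, t_on=0.05, t_off=0.15)
ko_peak = max(k for _, k in trace)
ko_final = trace[-1][1]
```

During the current pulse [K+]o climbs several millimolar above kbath; after the pulse it decays back with time constant txfer, reproducing the slowly dissipating depolarizing shift described above.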
2.2 General Comments About Kinetic Schemes.

Kinetic schemes provide a high-level framework that is perfectly suited for compact and intuitively clear specification of models that involve discrete states in which “material” is conserved. The basic notion in such mechanisms is that flow out of one state equals flow into another. Almost all models of membrane channels, chemical reactions, macroscopic Markov processes, and ionic diffusion are elegantly expressed through kinetic schemes. The unknowns in a kinetic scheme, which are usually concentrations of individual reactants, are declared in the STATE block. The user expresses the kinetic scheme with a notation that is similar to a list of simultaneous chemical reactions. The NMODL translator converts the kinetic scheme into a family of ODEs whose unknowns are the STATEs. Hence the simple
    STATE { mc m }

    KINETIC scheme1 {
        ~ mc <-> m (a(v), b(v))
    }
is equivalent to

    DERIVATIVE scheme1 {
        mc' = -a(v)*mc + b(v)*m
        m'  =  a(v)*mc - b(v)*m
    }
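This equivalence can be checked numerically. The Python sketch below (with illustrative constant rates standing in for a(v) and b(v)) integrates the translated ODEs and confirms the two properties the scheme demands: mc + m is conserved, and the equilibrium satisfies m/mc = a/b.

```python
# Numerical check that the translated ODEs behave as the scheme demands:
# mc + m stays constant, and equilibrium satisfies m/mc = a/b.
# Rate values are illustrative stand-ins for a(v) and b(v).
a, b = 4.0, 1.0       # forward and backward rates (1/ms), assumed
mc, m = 1.0, 0.0      # start with everything in the closed state
dt = 1e-4             # ms
for _ in range(200_000):          # 20 ms >> relaxation time 1/(a+b)
    dmc = -a * mc + b * m         # mc' from the DERIVATIVE block
    dm = a * mc - b * m           # m'  (note dm = -dmc: conservation)
    mc += dt * dmc
    m += dt * dm
```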
The first character of a reaction statement is the tilde, ~, which is used to distinguish this kind of statement from other sequences of tokens that could be interpreted as an expression. The expression to the left of the three-character reaction indicator, <->, specifies the reactants, and the expression immediately to the right specifies the products. The two expressions in parentheses specify the forward and reverse reaction rates (here, the rate functions a(v) and b(v)). After each reaction, the variables f_flux and b_flux are assigned the values of the forward and reverse fluxes, respectively. These can be used in assignment statements such as

    ~ cai + pump <-> capump (k1, k2)
    ~ capump <-> pump + cao (k3, k4)
    ica = (f_flux - b_flux)*2*Faraday/area

In this case, the forward flux is k3*capump, the reverse flux is k4*pump*cao, and the positive-outward current convention is consistent with the sign of the expression for ica (in the second reaction, forward flux means positive ions move from the inside to the outside).
More complicated reaction sequences, such as the wholly imaginary

    KINETIC scheme2 {
        ~ 2A + B <-> C (k1, k2)
        ~ C + D <-> A + 2B (k3, k4)
    }

begin to show the clarity of expression and suggest the comparative ease of modification of the kinetic representation over the equivalent but stoichiometrically confusing

    DERIVATIVE scheme2 {
        A' = -2*k1*A^2*B + 2*k2*C + k3*C*D - k4*A*B^2
        B' = -k1*A^2*B + k2*C + 2*k3*C*D - 2*k4*A*B^2
        C' = k1*A^2*B - k2*C - k3*C*D + k4*A*B^2
        D' = -k3*C*D + k4*A*B^2
    }
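Because the translated ODEs are just signed sums of the reaction fluxes, stoichiometric bookkeeping errors show up immediately as violations of a conservation law. For these two reactions the combinations A + B + 3C and B + C + D are both conserved (their time derivatives cancel term by term), which the following Python sketch (illustrative rate constants and initial values) verifies numerically.

```python
# Stoichiometry check for scheme2: mass action requires A + B + 3C and
# B + C + D to be conserved; integrating the DERIVATIVE equations should
# preserve both to rounding error.  All numbers are illustrative.
k1, k2, k3, k4 = 1.0, 0.5, 0.8, 0.3
A, B, C, D = 1.0, 1.0, 0.5, 0.5
inv1 = A + B + 3 * C              # conserved combination 1
inv2 = B + C + D                  # conserved combination 2
dt = 1e-4
for _ in range(100_000):          # integrate 10 time units
    f1, r1 = k1 * A**2 * B, k2 * C        # fluxes of reaction 1
    f2, r2 = k3 * C * D, k4 * A * B**2    # fluxes of reaction 2
    dA = -2 * f1 + 2 * r1 + f2 - r2
    dB = -f1 + r1 + 2 * f2 - 2 * r2
    dC = f1 - r1 - f2 + r2
    dD = -f2 + r2
    A += dt * dA; B += dt * dB; C += dt * dC; D += dt * dD
```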
Clearly a statement such as

    ~ calmodulin + 3Ca <-> active (k1, k2)

would be easier to modify (e.g., so it requires combination with four calcium ions) than the relevant term in the three differential equations for the STATEs that this reaction affects. The kinetic representation is easy to debug because it closely resembles familiar notations and is much closer to the conceptualization of what is happening than the differential equations would be. Another benefit of kinetic schemes is the simple polynomial nature of the flux terms, which allows the translator to perform easily a great deal of preprocessing that makes implicit numerical integration more efficient. Specifically, the nonzero elements ∂y′i/∂yj (partial derivatives of dyi/dt with respect to yj) of the sparse matrix are calculated analytically in NMODL and collected into a C function that is called by solvers to calculate the Jacobian. Furthermore, the form of the reaction statements determines if the scheme is linear, obviating an iterative computation of the solution. Kinetic scheme representations provide a great deal of leverage because a single compact expression is equivalent to a large amount of C code. One special advantage from the programmer’s point of view is that these expressions are independent of the solution method. Different solution methods require different code, but the NMODL translator generates this code automatically. This saves the user time and effort and ensures that all code expresses the same mechanism. Another advantage is that the NMODL translator handles the task of interfacing the mechanism to the remainder of the program. This is a tedious exercise that would require the user to have
special knowledge that is not relevant to neurophysiology and may change from version to version.

2.3 Models with Discontinuities.
2.3.1 Discontinuities in PARAMETERs.

In the past, abrupt changes in PARAMETERs and ASSIGNED variables, such as the sudden change in current injection during a current pulse, have been implicitly assumed to take place on a time-step boundary. This is inadequate with variable time-step methods because it is unlikely that a time-step boundary will correspond to the onset and offset of the pulse. Worse, the time step may be longer than the pulse itself, which may thus be entirely ignored. For these reasons, a model description must explicitly notify NEURON, via the at_time() function, of the times at which any discontinuities occur. The statement at_time(event_time) guarantees that during simulation with a variable time-step method, as t advances past event_time, the integrator will reduce the step size so that it completes at t = event_time - ε, where ε ≈ 10^-9 ms. The next step resets the integrator to first order, thereby discarding any previous solution history, and immediately returns after computing all the dyi/dt at t = event_time + ε. Note that at_time() returns “true” only during the “infinitesimal” step that ends at t = event_time + ε. During a variable time-step simulation, a missing at_time() call may cause one of two symptoms. If a PARAMETER changes but returns to its original value within the same interval, the pulse may be entirely missed. More often a single discontinuity will take place within a time-step interval, in which case what seems like a binary search will start for the location of the discontinuity in order to satisfy the error tolerance on the step; this is very inefficient. Time-dependent PARAMETER changes at the hoc interpreter level are highly discouraged because they cannot currently be properly computed in the context of variable time steps. For instance, with fixed time steps, it was convenient to change PARAMETERs prior to fadvance() calls, as in

    proc advance() {
        IClamp[0].amp = imax*sin(w*t)
        fadvance()
    }
With variable time-step methods, all time-dependent changes must be described explicitly in a model, in this case with

    BREAKPOINT {
        i = imax*sin(w*t)
    }
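The pulse-skipping failure mode described above is easy to reproduce outside NEURON. In the Python sketch below (all numbers illustrative), a pulse shorter than the step is invisible to an integrator that samples parameters only at step boundaries, while forcing step boundaries at the discontinuities, which is the effect of notifying the integrator about them, recovers it.

```python
# Why discontinuity notification matters: a 0.01-wide pulse is missed
# entirely by 0.1-wide steps that sample only at step boundaries, but is
# captured once steps are forced to land on the pulse edges.
def pulse(t):
    return 1.0 if 0.33 <= t < 0.34 else 0.0

def integrate(break_points):
    """Left-endpoint (forward-Euler-style) quadrature of dy/dt = pulse(t),
    stepping only at the supplied time points."""
    y = 0.0
    for t0, t1 in zip(break_points[:-1], break_points[1:]):
        y += (t1 - t0) * pulse(t0)
    return y

coarse = [i * 0.1 for i in range(11)]    # fixed 0.1 steps: none hit the pulse
aware = sorted(coarse + [0.33, 0.34])    # steps forced at the discontinuities
y_missed = integrate(coarse)             # pulse invisible
y_caught = integrate(aware)              # recovers the true integral, 0.01
```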
A future version of NEURON may provide a facility to specify time-dependent and discontinuous PARAMETER changes safely at the hoc level in the context of variable time-step methods. (The BREAKPOINT block is discussed fully in the expanded version of this article. Its purpose is to declare the integration method for calculating the time evolution of STATEs and assign values to the currents declared by the mechanism.)

2.3.2 Discontinuities in STATEs.

Some kinds of synaptic models process an event as a discontinuity in one or more of their STATE variables. For example, a synapse whose conductance follows the time course of an alpha function (for more detail about the alpha function itself see Rall, 1977, and Jack, Noble, & Tsien, 1983) can be implemented as a kinetic scheme in the two-state model

    KINETIC state {
        ~ a <-> g (k, 0)
        ~ g -> (k)
    }
where a discrete synaptic event is handled as an abrupt increase of STATE a. This formulation has the attractive property that it can handle multiple streams of events with different weights, so that g will be the sum of the individual alpha functions with their appropriate onsets. However, because of the special nature of states in variable time-step ODE solvers, it is necessary not only to notify NEURON about the time of the discontinuity with the at_time(onset) call, but also to notify NEURON about any discontinuities in STATEs. If onset is the time of the synaptic event and gmax is the desired maximum conductance change, this would be accomplished by including a state_discontinuity() call in the BREAKPOINT block, as follows:

    BREAKPOINT {
        if (at_time(onset)) {
            : scale factor exp(1) = 2.718... ensures
            : that peak conductance will be gmax
            state_discontinuity(a, a + gmax*exp(1))
        }
        SOLVE state METHOD sparse
        i = g*(v - e)
    }
The first argument to state_discontinuity() will be assigned the value of its second argument just once for any time step. This is important, since for several integration methods, BREAKPOINT assignment statements are often executed twice to calculate the di/dv terms of the Jacobian matrix.
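The reason the exp(1) scale factor yields a peak conductance of exactly gmax: the scheme gives a' = -k*a and g' = k*(a - g), whose solution from a(0) = A, g(0) = 0 is the alpha function g(t) = A*k*t*exp(-k*t), which peaks at t = 1/k with value A/e. Setting A = gmax*exp(1) therefore puts the peak at gmax. The Python sketch below (illustrative k and gmax) confirms this numerically.

```python
import math

# Verify the exp(1) scaling: from a(0) = gmax*e the conductance follows
# the alpha function g(t) = gmax*e*k*t*exp(-k*t), peaking at t = 1/k
# with value exactly gmax.  k and gmax are illustrative values.
k = 2.0                   # rate constant (1/ms), assumed
gmax = 0.5                # desired peak conductance, assumed
a = gmax * math.exp(1)    # the jump applied by state_discontinuity()
g = 0.0
dt = 1e-4                 # ms
g_peak = 0.0
for _ in range(50_000):   # 5 ms, well past the peak at t = 1/k
    a += dt * (-k * a)
    g += dt * (k * a - k * g)
    g_peak = max(g_peak, g)
```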
Although this synaptic model works well with deterministic stimulus trains, it is difficult for the user to supply the administrative hoc code for managing the onset and gmax variables to take advantage of the promise of multiple streams of events with different weights. The most important problem is how to save events that have significant delay between their generation and their handling at time onset. As is, an event can be passed to this model by assigning values to onset and gmax only after the previous onset event has been handled. Discussion of the details of how NEURON now treats streams of synaptic events with arbitrary delays and weights is beyond the scope of this article. Let it suffice that from the local view of the postsynaptic model, the state discontinuity should no longer be handled in the BREAKPOINT block, and the above synaptic model is more properly written in the form

    BREAKPOINT {
        SOLVE state METHOD sparse
        i = g*(v - e)
    }
    NET_RECEIVE(weight (microsiemens)) {
        state_discontinuity(a, a + weight*exp(1))
    }
in which event distribution is handled internally from a specification of network connectivity (see the next section).

2.4 General Comments About Synaptic Models.

The examples so far have been of mechanisms that are local in the sense that an instance of a mechanism at a particular location on the cell depends only on STATEs and PARAMETERs of the model at that location. Of course, they normally depend on voltage and ionic variables as well, but these also are at that location and automatically available to the model. Synaptic models have an essential distinguishing characteristic that sets them apart: in order to compute their contribution to membrane current at the postsynaptic site, they require information from another place, for example, presynaptic voltage. Models that contain LONGITUDINAL_DIFFUSION are perhaps also an exception, but their dependence on adjacent compartment ion concentration is handled automatically by the translator. In the past, model descriptions could only use POINTER variables to obtain their presynaptic information. A POINTER in NMODL holds a reference to another variable; the specific reference is defined by a hoc statement such as
setpointer postcell.synapse.vpre, precell.axon.v(1)
in which vpre is a POINTER, declared in the indicated POINT_PROCESS
synapse instance, which references the value of a specific membrane voltage—in this case, at the distal end of the presynaptic axon. Gap junctions or ephaptic synapses can be handled by a pair of POINT_PROCESSes on the two sides of the junction that point to each other's voltage, as in

    section1 gap1 = new Gap(x1)
    section2 gap2 = new Gap(x2)
    setpointer gap1.vpre, section2.v(x2)
    setpointer gap2.vpre, section1.v(x1)
This kind of detailed piecing together of individual components is acceptable for models with only a few synapses, but larger network models have required considerable administrative effort from users to create mechanisms that handle synaptic delay, exploit very great simulation efficiencies available with simplified models of synapses, and maintain information about the connectivity of the network. The experience of NEURON users, especially Alain Destexhe and William Lytton, in creating special models and procedures for managing network simulations has been incorporated in a new built-in network connection (NetCon) class, whose instances manage the delivery of presynaptic threshold events to postsynaptic POINT_PROCESSes. The NetCon class works for all NEURON integrators, including a local variable time-step method in which each cell is integrated with a time step appropriate to the state changes occurring in that cell. With this event delivery system, model descriptions of synapses never need to queue events, and they do not have to make heroic efforts to work properly with variable time-step methods. These features offer enormous convenience to users. NetCon connects a presynaptic variable such as voltage to a synapse with arbitrary (individually specified on a per NetCon instance) delay and weight. If the presynaptic variable passes threshold at time t, a special NET_RECEIVE procedure in the postsynaptic POINT_PROCESS is called at time t + delay. The only constraint on delay is that it be nonnegative. Events always arrive at the postsynaptic object at the interval delay after the time they were generated, and there is no loss of events under any circumstances. This new class also reduces the computational burden of network simulations, because the event delivery system for NetCon objects supports unlimited fan-in and fan-out (convergence and divergence). That is, many NetCon objects can be connected to the same postsynaptic POINT_PROCESS (fan-in).
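The delivery semantics just described can be captured in a few lines. The following Python sketch is a toy model, not NEURON’s implementation, and all names in it are invented for the illustration: a priority queue keyed on delivery time gives per-connection delay and weight, fan-out from one presynaptic threshold detector, and strictly time-ordered handling with no lost events.

```python
import heapq

# Toy sketch of NetCon-style event delivery (NOT NEURON's implementation).
# Each connection carries its own delay and weight; one presynaptic
# threshold crossing fans out to every attached connection, and events
# are handled in nondecreasing time order.
class EventQueue:
    def __init__(self):
        self.heap = []           # min-heap keyed on delivery time
        self.delivered = []      # (time, target, weight) in handling order

    def send(self, t, netcons):
        # fan-out: one crossing generates one event per connection
        for delay, weight, target in netcons:
            assert delay >= 0.0  # the only constraint on delay
            heapq.heappush(self.heap, (t + delay, target, weight))

    def run(self):
        while self.heap:         # pops come out sorted by delivery time
            self.delivered.append(heapq.heappop(self.heap))

# two connections from the same cell, targeting hypothetical synapses
netcons = [(1.0, 0.1, "syn_a"), (3.5, 0.2, "syn_b")]
q = EventQueue()
q.send(2.0, netcons)             # presynaptic spike at t = 2.0
q.send(2.5, netcons)             # and another at t = 2.5
q.run()
```

Each event arrives exactly delay after it was generated, and interleaved spikes from the two streams come out in time order, mirroring the guarantees stated above.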
This yields large efficiency improvements because a single set of equations for synaptic conductance change can be shared by many streams of inputs (one input stream per connecting NetCon instance). Likewise, many NetCon objects can be connected to the same presynaptic variable (fan-out), thus providing additional efficiency improvement since the presynaptic variable is checked only once per time step, and when it crosses
threshold in the positive direction, events are generated for each connecting NetCon object.

3 Discussion
The model description framework has proved to be a useful, efficient, and flexible way to implement computational models of biophysical mechanisms. The leverage that NMODL provides is amplified by its platform independence, since it runs in the MacOS, MSWindows, and UNIX/Linux environments. Another important factor is its use of high-level syntax, which allows it to incorporate advances in numerical methods in a way that is transparent to the user. NMODL continues to undergo revision and improvement in response to the evolving needs of computational neuroscience, particularly in the domain of empirically based modeling. One recent example of the extension of NMODL to encompass new kinds of mechanisms is longitudinal diffusion. Another is kinetic schemes in a form that can be interpreted as Markov processes (Colquhoun & Hawkes, 1981), that is, linear schemes, which are now translated into single-channel models. By removing arbitrary limits related to programming complexity, such advances give NEURON the ability to accommodate insights derived from new experimental findings and enable modeling to keep pace with the broad arena of “wet-lab” neuroscience.

Acknowledgments
This work was supported in part by NIH grant NS11613. We thank John Moore for inspiration and encouragement, Alain Destexhe and William Lytton for invaluable contributions to the convenient and efficient simulation of networks, Ragnhild Halvorsrud for helpful suggestions regarding the manuscript for this article, and the many users of NEURON who have provided indispensable feedback, presented challenging problems that have stimulated new advances in the program, and developed their own enhancements.

References

Colquhoun, D., & Hawkes, A. G. (1981). On the stochastic properties of single ion channels. Philosophical Transactions of the Royal Society of London Series B, 211, 205–235.
Frankenhaeuser, B., & Hodgkin, A. L. (1956). The after-effects of impulses in the giant nerve fibers of Loligo. J. Physiol., 131, 341–376.
Hines, M. L., & Carnevale, N. T. (1997). The NEURON simulation environment. Neural Computation, 9, 1179–1209.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1983). Electric current flow in excitable cells. London: Oxford University Press.
Expanding NEURON with NMODL
1007
Johnston, D., & Wu, S. M.-S. (1995). Foundations of cellular neurophysiology. Cambridge, MA: MIT Press.
Kohn, M. C., Hines, M. L., Kootsey, J. M., & Feezor, M. D. (1994). A block organized model builder. Mathematical and Computer Modelling, 19, 75–97.
Kootsey, J. M., Kohn, M. C., Feezor, M. D., Mitchell, G. R., & Fletcher, P. R. (1986). SCoP: An interactive simulation control program for micro- and minicomputers. Bulletin of Mathematical Biology, 48, 427–441.
McCormick, D. A. (1998). Membrane properties and neurotransmitter actions. In G. M. Shepherd (Ed.), The synaptic organization of the brain (pp. 37–75). New York: Oxford University Press.
Rall, W. (1977). Core conductor theory and cable properties of neurons. In E. R. Kandel (Ed.), Handbook of physiology, vol. 1, part 1: The nervous system (pp. 39–98). Bethesda, MD: American Physiological Society.
Wilson, M. A., & Bower, J. M. (1989). The simulation of large scale neural networks. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 291–333). Cambridge, MA: MIT Press.

Received September 30, 1998; accepted April 20, 1999.
ARTICLE
Communicated by Michael Hasselmo
Latent Attractors: A Model for Context-Dependent Place Representations in the Hippocampus

Simona Doboli
Ali A. Minai
Complex Adaptive Systems Laboratory, ECECS Department, University of Cincinnati, Cincinnati, OH 45221-0030, U.S.A.

Phillip J. Best
Laboratory of Cognitive Psychobiology, Department of Psychology, Miami University, Oxford, OH 45056, U.S.A.

Cells throughout the rodent hippocampal system show place-specific patterns of firing called place fields, creating a coarse-coded representation of location. The dependencies of this place code—or cognitive map—on sensory cues have been investigated extensively, and several computational models have been developed to explain them. However, place representations also exhibit strong dependence on spatial and behavioral context, and identical sensory environments can produce very different place codes in different situations. Several recent studies have proposed models for the computational basis of this phenomenon, but it is still not completely understood. In this article, we present a very simple connectionist model for producing context-dependent place representations in the hippocampus. We propose that context dependence arises in the dentate gyrus-hilus (DGH) system, which functions as a dynamic selector, disposing a small group of granule and pyramidal cells to fire in response to afferent stimulus while depressing the rest. It is hypothesized that the DGH system dynamics has "latent attractors," which are unmasked by the afferent input and channel system activity into subpopulations of cells in the DG, CA3, and other hippocampal regions, as observed experimentally. The proposed model shows that a minimally structured hippocampus-like system can robustly produce context-dependent place codes with realistic attributes.

1 Introduction
The hippocampus has been a region of great interest for neuroscientists and psychobiologists because of its apparently central role in memory and cognition. In primates (including humans), the hippocampus is crucial to the formation and consolidation of episodic memories (Scoville & Milner, 1957; Squire, 1992; Squire & Zola-Morgan, 1988), and this function proba-

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1009–1043 (2000)
[Figure 1 appears here: a schematic of the EC projecting to DG (Group 1 inactive, Group 2 active) and CA3, with the hilus providing global inhibition.]
Figure 1: Schematic of the network model. The DG and H layers are configured into randomly chosen cell groups with possible overlap. The EC projects broadly to both DG and CA3, and the DG projects to CA3. There are no explicit groups in CA3, but DG-to-CA3 projections are very sparse. The figure shows only two DG/H groups: Group 1 is nominally inactive and Group 2 active. Filled squares symbolize cell activity, and arrows indicate excitatory signals between layers, with line width representing signal strength. Note that cells in each group are placed contiguously only for clarity: the model does not assume topographic organization of cell groups.
bly extends to other mammals as well (Eichenbaum, Kuperstein, Fagan, & Nagode, 1987). In rodents, the hippocampus has been implicated specifically in spatial cognition (Morris, Garrud, Rawlins, & O'Keefe, 1982) and is hypothesized to form a spatial cognitive map (O'Keefe & Nadel, 1978). Primary cells in the dentate gyrus (DG), CA3, and CA1 regions of the hippocampus (see Figure 1) have activity tuned to an animal's spatial location (O'Keefe & Dostrovsky, 1971; O'Keefe & Nadel, 1978; Muller, Kubie, & Ranck, 1987; Jung & McNaughton, 1993). These cells are called place cells, and the spatial regions where they are active are termed place fields. Together, the activity pattern of primary cells in CA3 or CA1 forms a coarse-coded, distributed representation of the animal's position. We call this a place representation or place code.
The sensory, behavioral, and motivational correlates of place fields throughout the hippocampus have been elucidated experimentally (Miller & Best, 1980; Hill & Best, 1981; Muller & Kubie, 1987; O'Keefe & Speakman, 1987; Thompson & Best, 1989, 1990; Sharp, Kubie, & Muller, 1990; Quirk, Muller, & Kubie, 1990; Quirk, Muller, Kubie, & Ranck, 1992; Bostock, Muller, & Kubie, 1991; Wilson & McNaughton, 1993; Markus et al., 1995; Gothard, Skaggs, Moore, & McNaughton, 1996; Gothard, Skaggs, & McNaughton, 1996; O'Keefe & Burgess, 1996; Barnes et al., 1997; Rotenberg & Muller, 1997). An important aspect of place representations that is not yet completely understood is their dependence on spatial or behavioral context. Several studies have shown that CA3/CA1 place representations, while strongly controlled by sensory cues (O'Keefe & Nadel, 1978; Hill & Best, 1981; Muller & Kubie, 1987), are not causally dependent on these cues and are affected significantly by an animal's perception of its situation (Quirk et al., 1990; Bostock et al., 1991; Markus et al., 1995; Markus, Qin, McNaughton, & Barnes, 1994; Gothard, Skaggs, Moore, & McNaughton, 1996; Gothard, Skaggs, & McNaughton, 1996). This phenomenon has been described as the dependence of place codes on frames of reference (Gothard, Skaggs, Moore, & McNaughton, 1996; Touretzky & Redish, 1996; Samsonovich & McNaughton, 1997). Our notion of context dependence refers in particular to the observation that identical sensory environments can produce different place representations without affecting the intraenvironment stability or spatial coherence of these representations (Quirk et al., 1990), suggesting the existence of a multistable mechanism in the hippocampus (Barnes et al., 1997). In this article, we address the issue of context-dependent place representations in the hippocampal system in an abstract connectionist framework.

2 Theoretical Development and Justification

2.1 Motivating Experimental Results.
This research is motivated by several observations about hippocampal place fields:

1. All regions of the hippocampus show place fields (O'Keefe & Nadel, 1978; Muller et al., 1987; Jung & McNaughton, 1993). Entorhinal cortex (EC) neurons also have broadly tuned and noisy place fields (Barnes et al., 1990; Quirk, Muller, Kubie, & Ranck, 1992).

2. Only a subset (20–30%) of cells in regions CA1 and CA3 show place activity in any specific environment, and this active set appears to be randomly selected for each situation (Thompson & Best, 1989, 1990; Kubie & Muller, 1991; Muller, Kubie, & Saypoff, 1991; Wilson & McNaughton, 1993). Silent cells have also been reported in the DG (Jung & McNaughton, 1993), although to our knowledge, there is no systematic study of this effect. EC neurons, in contrast, appear to fire in almost all environments, with their place specificity more closely tied
to sensory stimulus than is the case for CA3/CA1 neurons (Quirk et al., 1992).

3. Within an environment, place fields are controlled primarily by distal visual cues (O'Keefe & Nadel, 1978; Miller & Best, 1980; Hill & Best, 1981; Muller & Kubie, 1987; Sharp et al., 1990; Bostock et al., 1991). However, removal of these cues in the animal's presence does not disrupt the fields (O'Keefe & Speakman, 1987; Quirk et al., 1990).

4. The place representation is established very early in an episode. Even in novel environments, animals have full-fledged place fields soon after entry (Hill, 1978), although these might take some time to consolidate (Wilson & McNaughton, 1993).

5. Different place representations are active in different environments (Thompson & Best, 1989; Kubie & Muller, 1991) and behavioral situations (Markus et al., 1994, 1995). However, the choice of representation seems to depend on the animal's perception of its environment rather than purely on sensory input. Identical environments—and even the same environment—can support distinct place representations in different contexts (Quirk et al., 1990).

6. Once a place representation has been activated in the hippocampus, it is very stable against large and cognitively dissonant changes in visual cues (Quirk et al., 1990). When it does change, the switch is usually to a significantly different global representation with a different set of active cells rather than a gradual shift (Bostock et al., 1991; Gothard, Skaggs, Moore, & McNaughton, 1996; Rotenberg & Muller, 1997). However, there are also data suggesting that the remapping leaves some place fields unchanged (Wilson & McNaughton, 1993; Shapiro, Tanila, & Eichenbaum, 1997; Tanila, Shapiro, & Eichenbaum, 1997; Knierim, Kudrimoti, & McNaughton, 1998; Skaggs & McNaughton, 1998), while others may switch their dependencies from one set of cues to another (Shapiro et al., 1997; Tanila et al., 1997; Skaggs & McNaughton, 1998). Also, local alterations to place fields can occur if the topology of the environment is altered locally, for example, by the insertion of barriers (Muller & Kubie, 1987).

7. Old rats show a propensity to activate an incorrect place representation upon return to a familiar environment after a hiatus, but when the correct representation is activated, it appears to show no degradation (Barnes et al., 1997). This suggests a multistable system of internally stable cognitive maps with a recognition-based selection of the correct one. Barnes et al. (1997) hypothesize that activation of incorrect maps in old rats may reflect a failure of recognition due to degradation in associative learning.
2.2 Primary Hypothesis. Based on the above observations, we hypothesize that DG and CA3 place fields form via the combination of two effects:

1. Informational: Specific sensitivity of individual cells to conjunctions and disjunctions of features from sensory, vestibular, directional, motivational, and motor information.

2. Contextual: Global selection of a cell group for activity in the particular environment and task at hand.

Similar ideas are inherent in some previously proposed models of the hippocampus, most notably in the work of Redish and Touretzky (Touretzky & Redish, 1996; Redish & Touretzky, 1997; Redish, 1997; Redish & Touretzky, 1999) and Samsonovich and McNaughton (1997). The term context throughout this article refers to a long-lasting bias that affects place representations for the duration of an episode (Quirk et al., 1990; Markus et al., 1994, 1995), unless disrupted by a significant mismatch between expectation and reality. This bias implicitly encodes the animal's perception of its situation, signifying a particular set of relationships, expectations, and conditions—and possibly, goals, actions, and consequences—that characterize the entire episode in a stable way and remain invariant over its duration. This should be distinguished from the transient influence of a system's recent dynamics on its current state (e.g., the influence of an animal's recent positions and actions on its current position). The latter type of context dependence has been proposed as a means for disambiguating stimulus sequences of finite length (Minai, Barrows, & Levy, 1994; Wu, Baxter, & Levy, 1995; Levy & Wu, 1996; Levy, 1996), and is quite likely to exist in the hippocampus. However, global context dependence requires a mechanism of much greater stability. The issues involved in coding global context are illustrated well by the requirements of effective place representations. There are two conflicting imperatives:

Intraenvironment consistency. Within a specific environment, the place representation must vary smoothly over space to encode consistent spatial information and allow generalization.
If the place code is to serve as a robust representation of the animal's location, proximate locations should be coded similarly, with the similarity declining monotonically with distance. This is precisely what place fields achieve (as demonstrated by measures of spatial coherence in place fields; Muller et al., 1991; Sharp, 1997). The stability of place fields, in turn, requires dependence of pyramidal cell activity on information about the animal's location: sensory input (O'Keefe & Nadel, 1978; Muller & Kubie, 1987), vestibular cues (Hill & Best, 1981), and almost certainly, the estimate from a path integration (PI) system (Etienne, Maurer, & Séguinot, 1996), which updates the animal's estimate of its location internally based on self-motion (Touretzky & Redish, 1996; Redish & Touretzky, 1997; Redish, 1997; Samsonovich & McNaughton, 1997).
Interenvironment discrimination. In two similar—even identical—environments with different significance, the place representations must be very different in order to discriminate between the situations. This means that the place representation must not depend exclusively on sensory and vestibular cues; it must also be shaped by the animal's perception of the global context it is in. This is borne out by the experiments of Quirk et al. (1990), who showed that an animal can have two distinct place representations of the same sensory environment in different contexts. The fact that place representations switch in a discontinuous manner (Quirk et al., 1990; Bostock et al., 1991; Gothard, Skaggs, Moore, & McNaughton, 1996; Gothard, Skaggs, & McNaughton, 1996; Barnes et al., 1997) also supports a global context-dependence hypothesis. How does a system resolve these conflicting requirements? We propose that the hippocampus does so by using a two-part strategy. First, when a distinct context is recognized, it triggers the stable priming of a subgroup of dentate gyrus granule cells for firing and depresses the rest. Resetting this stable priming then requires a major disruptive event (e.g., removal from the environment). Once the selected group has been primed, the actual firing of cells within this group depends on sensory, vestibular, and path integration cues, thus producing place fields and a smooth representation of the environment in the DG (Minai & Best, 1998; Doboli, Minai, & Best, 1999). Since the projection from DG to CA3 is very tightly focused, the granule cells in the selected group excite only a subset of CA3 pyramidal cells, leading to smooth place fields in this subset and silence in the rest of the CA3. By selecting different groups in different contexts, the animal can thus create very different—but internally consistent—representations of sensorily identical environments.
This conjecture—which we call the latent attractor hypothesis for reasons that will become clear—is consistent with the statistical characteristics of place representations seen experimentally (Thompson & Best, 1989, 1990; Wilson & McNaughton, 1993). Indeed, a variant of this hypothesis has been suggested previously (though not formally developed) by Kubie and Muller (1991). It is also implicit in the chart theory advanced by Samsonovich and McNaughton (1997). However, our theory differs from these previous proposals in three major respects: (1) the computational mechanism we propose is very simple, with minimal preconfiguration of place-related network structure in the DG or CA3; (2) the stable priming is seen as originating in the dentate gyrus-hilus (DGH) system rather than the CA3; and (3) the system underlying the priming process is a two-layer (disynaptic) rather than a one-layer (monosynaptic) recurrent network (we use the terms two-layer and one-layer in their connectionist sense and not in the anatomical sense of cortical layers). The justification and implications of these choices are discussed below. More recently, Samsonovich (1999) has independently
proposed a simpler context-selection mechanism similar to ours in principle. However, it is still embodied in a one-layer CA3 model network. In contrast, we suggest that context dependence in CA3 place representations is not produced internally, but simply reflects the context dependence already present in the afferent input from the DG. We show that, based on current (mostly indirect) evidence, placing context selection in the DGH is logically and biologically reasonable.

2.3 Why the DGH System? Previous suggestions about the encoding of context via multistable attractors have proposed CA3 as the likeliest locus (Kubie & Muller, 1991; McNaughton et al., 1996; Shen & McNaughton, 1996; Redish, 1997; Samsonovich & McNaughton, 1997; Samsonovich, 1999). However, we prefer the DGH system for several reasons:
Granule cells show highly place-specific behavior like CA3 cells (Jung & McNaughton, 1993). There is also evidence that some granule cells are silent in a given environment (Jung & McNaughton, 1993).

DG granule cells show the same context dependencies as CA3/CA1 place cells (Markus et al., 1995), suggesting that context is established by this point in the trisynaptic pathway.

There is potential evidence of attractor dynamics in the DGH system (Shen, McNaughton, & Barnes, 1997).

The DGH system has recurrent excitatory connectivity (Buckmaster & Schwartzkroin, 1994; Scharfman, 1995; Jackson & Scharfman, 1996; Golarai & Sutula, 1996) and a very structured inhibitory system (Han, Buhl, Lörinczi, & Somogyi, 1993). The recurrent connectivity in DGH is disynaptic (unlike CA3) and highly divergent and convergent (Scharfman, 1991), making it more suitable than the CA3 for implementing flexible multiattractor dynamics. In particular, the disynaptic recurrence allows the biasing function (needed for the latent attractor) to be located in the hilus and the coding function (needed for the projection to CA3) in the DG. This is important because biasing requires high source activity and a diffuse efferent projection, while coding for CA3 representations requires sparse activity and a very focused efferent projection. A two-layer system implements this very well, and the required characteristics correspond well to the known anatomy and physiology of the DGH.

The associational input from mossy cells to the granule cells shows NMDA-dependent LTP (Hetherington, Austin, & Shapiro, 1994).

DG has a very large population of cells (Amaral, Ishizuka, & Claiborne, 1990) and low intrinsic activity, which would allow for many more attractors than CA3.
DG granule cells and their mossy fiber projections are organized in a lamellar fashion (Andersen, Bliss, & Skrede, 1971; Amaral & Witter, 1989), which has been compared to a piano keyboard (Sloviter & Brisman, 1995). There is also strong evidence for mutual inhibition between lamellae (Struble, Desmond, & Levy, 1978; Sloviter, 1994). Finally, the mossy cell projection to granule cells is highly divergent. This architecture appears well suited to the formation of mutually inhibitory granule cell groups via the associative modification of mossy-to-granule cell synapses linking different lamellae.

Using the CA3 recurrent connections for implementing context attractors for multiple environments might interfere with their use in pattern completion and associative memory recall (Treves & Rolls, 1992; Hasselmo, Schnell, & Barkai, 1995; Hasselmo, Wyble, & Wallenstein, 1996). Redish and Touretzky (1998) have shown that a CA3-like network can simultaneously accommodate pattern completion, memory recall, and contextual attractors, but the effect of this multiple functionality on system capacity is not yet known.

Essentially, our model proposes a division of labor between EC, DGH, and CA3. Many functions, including associative recall (Marr, 1971; McNaughton & Morris, 1987; O'Reilly & McClelland, 1994; Rolls, 1996; Treves & Rolls, 1992; Hasselmo et al., 1995, 1996), sequence completion (Minai & Levy, 1993; Minai et al., 1994; Wu, Baxter, & Levy, 1995; Levy & Wu, 1996; Levy, 1996; Sohal & Hasselmo, 1998; Lisman, 1999), path selection (Muller et al., 1991; Muller & Stead, 1996; Blum & Abbott, 1996), path integration (Samsonovich & McNaughton, 1997), and spontaneous replay of learned memories (Minai & Levy, 1993; Wilson & McNaughton, 1994; Shen & McNaughton, 1996; Skaggs & McNaughton, 1996; Redish & Touretzky, 1998) have been imputed to CA3 by various researchers.
Meanwhile, most models have automatically assumed that the DG performs orthogonalization of sensory inputs (Treves & Rolls, 1992; O’Reilly & McClelland, 1994), and the hilus has been neglected completely. A notable exception to this is the recent model proposed by Lisman (1999), which suggests that the recurrent DGH functions as a pattern completion module in a system for sequence recall. Our hypothesis elegantly addresses three issues raised by these approaches. First, it provides CA3 place representations with the requisite context dependence (due to DG input) and informational contingencies (due to EC input) while leaving the CA3 free to perform other functions such as associating places along a path, disambiguating stimulus sequences, and recalling previously learned representations. Second, it leaves the orthogonalization function for the DG in place, with the subtle distinction that similar stimuli are orthogonalized only if they occur in different contexts or environments. This obviates the need to block DG activity during associative recall as hypothesized in earlier models (O’Reilly & McClelland, 1994; Redish, 1997; Minai, 1997). Finally, the hypothesis provides an
elegant explanation for the very rich recurrent architecture of the DGH system.

3 Computational Model

3.1 Functional Description. We model the DGH system as a recurrent system of granule-mossy-granule cell connections with various inhibitory subsystems. Both the DG and hilus are seen as organized into distributed, non-disjoint excitatory cell groups (Andersen et al., 1971; Soltesz, Bourassa, & Deschenes, 1993; Sloviter & Brisman, 1995). Cells in the kth DG or hilus group tend to activate cells in the corresponding group in the other region, while each region uniformly inhibits all cells in the other region in proportion to its own activity. The net effect is to make the groups mutually inhibitory across the system. Due to competitive inhibition, the total allowed number of simultaneously active cells within a region is constrained to be fewer than the number of cells in a group. Since the groups are mutually inhibitory, a disproportionately high level of activity in one group tends to inhibit other groups strongly, quashing any incipient activity there and, as a consequence, perpetuating its own high level of activity. Thus, once a group "wins" the intergroup competition, activity will tend to stay within this group until it is strongly perturbed. In this sense, each group is an "attractor" of the system, but the whole attractor never becomes active. We hypothesize that the firing thresholds of granule cells are high enough that the system is capable of intrinsic (non-afferent-driven) activity only under conditions of explicit disinhibition (Scharfman, 1991; Buckmaster & Schwartzkroin, 1995). When afferent activity from the EC is present, the effect of the recurrent system is to focus firing within a metastable subset of granule cells. We call this the latent attractor effect, since the effect of the EC activity is channeled by an attractor that cannot sustain itself autonomously, but whose effect is unmasked by the afferent input.
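The winner-persistence dynamics just described can be illustrated with a toy simulation: two mutually inhibitory cell groups, a K-of-N competitive update, and a firing threshold high enough that the primed pattern dies out unless afferent drive is present. All numbers below (group sizes, weights, threshold, drive) are invented for illustration and are not the parameters of our model:

```python
# Toy demonstration of a "latent attractor": two mutually inhibitory cell
# groups whose attractor structure only shows up under afferent drive.
N, K, THETA = 40, 6, 3.0                  # units, winners allowed, firing threshold
group_of = [0] * 20 + [1] * 20            # unit -> group label
W_IN, W_OUT = 0.4, -0.5                   # intra-group excitation, inter-group inhibition

def step(z, afferent):
    """One K-of-N competitive update with a high firing threshold."""
    y = []
    for i in range(N):
        drive = afferent[i]
        for j in range(N):
            if j != i and z[j]:
                drive += W_IN if group_of[i] == group_of[j] else W_OUT
        y.append(drive)
    winners = sorted(range(N), key=lambda i: y[i])[-K:]
    return [1.0 if (i in winners and y[i] > THETA) else 0.0 for i in range(N)]

z0 = [1.0] * K + [0.0] * (N - K)          # group 0 briefly primed

# Without afferent input the primed pattern cannot sustain itself (latent):
print(sum(step(z0, [0.0] * N)))           # 0.0

# Uniform afferent drive unmasks it: activity persists, confined to group 0.
z = z0
for _ in range(10):
    z = step(z, [2.0] * N)
print(sum(z), sum(z[:20]))                # 6.0 6.0
```

The same recurrent weights thus produce no self-sustained activity on their own, yet channel afferent-driven activity into the primed group, which is the sense in which the attractor is latent.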
Formally, we define a latent attractor for a recurrent network of binary neurons as a nontrivial pattern of activity that is not realized (or only partially realized) at any particular time, but that would be self-sustaining and attracting within a subdomain of the network's phase space under conditions of sufficient disinhibition or external stimulus. It is termed "latent" because it is implicit in the connectivity structure of the network but requires external stimulus to become visible, much as the peaks in a dark landscape may be picked out by a searchlight. This is different from the situation in other recurrent networks—including disynaptic ones (Carpenter & Grossberg, 1987a, 1987b; Kosko, 1988)—where the entire attractor is activated. The activity of the EC cells in our hypothetical model encodes the various "informational" contingencies, including (but not limited to) conjunctions and disjunctions between sensory cues, vestibular information, and information from the path integration system. There is strong evidence for location-specific activity in EC neurons (Barnes et al., 1990; Quirk et al.,
1992), and it has been hypothesized (Touretzky & Redish, 1996; Redish & Touretzky, 1997) that the EC forms a stage in the rat's path integration system. Although we do not model these here, the firing contingencies of EC neurons are seen as being encoded through LTP-based learning and competitive mechanisms, as suggested by Shapiro et al. (1997). Since the EC provides input to both DG and CA3, the same contingencies appear in neurons of those regions, and all would respond to cue manipulations in a heterogeneous manner consistent with the multiplicity of informational inputs to the system (Shapiro et al., 1997; Tanila et al., 1997; Knierim et al., 1998). At the beginning of an episode, a particular pattern of input from the EC (a "recognition stimulus") excites DG granule cells disproportionately within a specific group, focusing future activity within this group. For the duration of the episode, then, granule cells are driven primarily by the afferent input, but cells that are not part of the selected group are effectively suppressed. As the animal moves through its environment, the granule cell activity is projected onto the CA3 pyramidal cells where, in combination with the EC input, it appears as context-specific place fields. Place is thus coded with a two-level hierarchical representation. At the first, coarser level, each context is represented by the selection of active cell groups. Within this, individual cells are fired by sensory and idiothetic afferent information. The disynaptic recurrence in our model is important because subcortical inputs (e.g., theta from the septum) can precisely control the recurrent loop via modulation of mossy and inhibitory cell excitability. Direct evidence for exactly such control exists (Deadwyler, West, & Robinson, 1981; Moser, 1996). However, we do not address this issue here.

3.2 Neuron Equations. The environment for our simulations is an $M \times M$ grid, $(u, v)$, on which the simulated animal moves randomly. The position of the animal at time $t$ is denoted by $L(t) = (L_u(t), L_v(t))$. The network has four layers: EC, DG, H (hilus), and CA3.

3.2.1 EC Layer. The EC layer has $N_{EC}$ neurons and is modeled at a purely phenomenological level, based on the known characteristics of EC place fields (Barnes et al., 1990; Quirk et al., 1992). Each cell $i$ is given a broad, noisy gaussian place field defined by

$$f_i(t) = \exp\Bigl[-a_i \bigl(L_u(t) - c_{ui} + q_u\bigr)^2 - b_i \bigl(L_v(t) - c_{vi} + q_v\bigr)^2 + d_i \sqrt{a_i}\,\bigl(L_u(t) - c_{ui}\bigr)\,\sqrt{b_i}\,\bigl(L_v(t) - c_{vi}\bigr)\Bigr], \tag{3.1}$$

where $\bar{c}_i = (c_{ui}, c_{vi})$ is the center of $i$'s field, $a_i$ and $b_i$ are field size and shape parameters, $d_i \in [-1, 1]$ is the field orientation parameter, and $q_u$, $q_v$ are zero-mean positional noise terms with variance $\sigma_q^2$. All of these parameters are chosen randomly within reasonable ranges for each cell. As the animal moves on the
grid, neuron $i$ is fired noisily, as

$$z_i(t) = f_i(t) + \eta(t) + \kappa, \tag{3.2}$$

where $\eta(t)$ is zero-mean uniform noise with variance $\sigma^2$ and $\kappa$ is a baseline activity level (set to a small value). The value of $z_i(t)$ is clipped to 0 if noise would make it negative. While rather simple, this method provides a sufficiently accurate model of the broad, noisy EC place fields seen experimentally (Quirk et al., 1992).

3.2.2 DG Layer. The DG layer has $N_{DG}$ simulated granule cells that get excitatory input from the EC and H layers, and a broad inhibitory input from the H layer. In addition, the DG layer is assumed to have competitive recurrent inhibition, which maintains activity within a narrow range. This recurrent inhibition is not modeled explicitly, but is reflected in the stochastic K-of-N firing rule. The activation of granule cell $i$ at time $t$ is given by

$$y_i(t) = g_{EC\text{-}DG} \sum_{j \in EC} w_{ij} z_j(t) + g_{H\text{-}DG} \sum_{j \in H} w_{ij} z_j(t-1) - G_{H\text{-}DG} \sum_{j \in H} z_j(t-1), \tag{3.3}$$
where $w_{ij}$ is the connection strength from neuron $j$ to neuron $i$, and $g_{EC\text{-}DG}$, $g_{H\text{-}DG}$, and $G_{H\text{-}DG}$ are gain parameters that are set randomly within narrow ranges. The firing of cell $i$ is based on the following rule:

$$z_i(t) = \begin{cases} 1 & \text{with probability } \rho_i(t) \\ 0 & \text{otherwise,} \end{cases} \tag{3.4}$$

where $\rho_i(t)$ is defined as follows. Let $C_1(t)$ be the set of the $K_{DG}$ most excited DG cells at time $t$, $C_2(t)$ the set of the next $K_{DG}$ most excited ones, and $C_3(t)$ the remaining neurons. Then

$$\rho_i(t) = \begin{cases} \rho_1^{DG} & \text{if } i \in C_1(t) \\ \rho_2^{DG} & \text{if } i \in C_2(t) \\ \rho_3^{DG} & \text{if } i \in C_3(t), \end{cases} \tag{3.5}$$

where $\rho_1^{DG}$ is close to 1, $\rho_2^{DG}$ is a small value close to 0, and $\rho_3^{DG}$ is an even smaller value, which indicates the sporadic activity of noncoding neurons. The quantity $\beta_{DG} = \rho_1^{DG} / \sum_k \rho_k^{DG}$ quantifies the reliability of the $K_{DG}$-of-$N_{DG}$ competitive firing mechanism. Qualitatively, the rule means that almost all of the $K_{DG}$ most excited neurons fire, a few of the next $K_{DG}$ most excited ones fire, and the rest fire only sporadically, giving an approximate net
ring of KDG neurons per time step. Neurons with yi ( t) · 0 are explicitly barred from ring, so that under very weak stimulus, the total ring may be signicantly below KDG . However, we do not consider this case here. Consistent with experimentally observed data, KDG is taken to be a small fraction ( < 5%). 3.2.3 H Layer. The H layer has NH mossy cells, which receive excitatory input from the DG layer. The activation of cell i at time t is given by: yi (t) D gDG¡H
X
wij zj ( t),
(3.6)
j2DG
where $g_{DG\text{-}H}$ is a gain parameter. Firing is $K_H$-of-$N_H$, implemented as for the DG layer with parameters $r_k^H$.

3.2.4 CA3 Layer. The CA3 layer has $N_{CA3}$ simulated pyramidal cells, which get very sparse excitatory input from the DG layer and a diffuse excitatory input from the EC. The DG-CA3 synapses are taken to be much stronger than the EC-CA3 ones, so the net effects of the two projections are approximately balanced. The CA3 also has extensive recurrent connectivity. However, these connections are not included in our current model, in order to avoid obscuring our main hypothesis with additional effects. The recurrent projection might, of course, interact significantly with the context, but we see it primarily as a substrate for associative recall of stored, context-dependent representations (Marr, 1971; Rolls, 1996; Hasselmo et al., 1995, 1996) and perhaps for path learning (Muller & Stead, 1996; Blum & Abbott, 1996; Mehta, Barnes, & McNaughton, 1997). The activation of CA3 cell $i$ at time $t$ is given by:

$$y_i(t) = g_{EC\text{-}CA3} \sum_{j \in EC} w_{ij} z_j(t) + g_{DG\text{-}CA3} \sum_{j \in DG} w_{ij} z_j(t) - G_{DG\text{-}CA3} \sum_{j \in DG} z_j(t). \qquad (3.7)$$
The firing of neurons is implemented by a competitive $K_{CA3}$-of-$N_{CA3}$ rule similar to the one for the DG, with parameters $r_k^{CA3}$.

3.3 Network Architecture. The EC-DG and EC-CA3 connections are selected randomly, with each DG cell receiving input from a fraction $C_{EC\text{-}DG}$ of EC cells and each CA3 cell from a fraction $C_{EC\text{-}CA3}$. The connections are made such that the fan-outs from individual EC neurons are approximately equal. The actual weight values are chosen randomly in a fixed range. The heart of our model is the architecture of the DG-hilus subsystem, which is as follows.
We choose $m$ cell groups, $Q_1, \ldots, Q_m$, of excitatory cells in both the DG and H layers. All groups in DG have size $n_{DG}$, and those in H have size $n_H$. The cells in each group are chosen randomly and independently of other groups. A uniform random connection pattern is established between the DG and H layers in both directions, with the connectivity fractions given by $C_{DG\text{-}H}$ and $C_{H\text{-}DG}$ for DG-to-H and H-to-DG connections, respectively. As in the case of the EC-DG and EC-CA3 connections, the fan-outs of presynaptic neurons are approximately balanced. The synaptic weights for these connections are then set as follows:

DG → H weights:

$$w_{ij} = \begin{cases} h_{DG\text{-}H} & \text{if } i \in Q_k \text{ and } j \in Q_k \text{ for some } k = 1, \ldots, m \\ l_{DG\text{-}H} & \text{otherwise} \end{cases} \qquad (3.8)$$

H → DG weights:

$$w_{ij} = \begin{cases} h_{H\text{-}DG} & \text{if } i \in Q_k \text{ and } j \in Q_k \text{ for some } k = 1, \ldots, m \\ l_{H\text{-}DG} & \text{otherwise} \end{cases} \qquad (3.9)$$
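In code, this two-valued, group-structured weight assignment might be built as in the following sketch (Python; the helper and its names are ours, not from the paper):

```python
import random

def group_weights(n_pre, n_post, groups_pre, groups_post, conn_frac, h, l):
    """Random connectivity with two-valued, group-structured weights
    (cf. equations 3.8 and 3.9): a synapse exists with probability
    conn_frac; it gets the strong value h when the pre- and postsynaptic
    cells belong to corresponding groups Q_k, and the weak value l
    otherwise."""
    w = [[0.0] * n_pre for _ in range(n_post)]
    for i in range(n_post):
        for j in range(n_pre):
            if random.random() >= conn_frac:
                continue  # no synapse between these two cells
            same = any(i in qi and j in qj
                       for qi, qj in zip(groups_post, groups_pre))
            w[i][j] = h if same else l
    return w
```

Calling this once for each direction (DG-to-H and H-to-DG) with the appropriate fractions and weight values reproduces the construction described above.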
where $h_{DG\text{-}H} \gg l_{DG\text{-}H}$ and $h_{H\text{-}DG} \gg l_{H\text{-}DG}$. This very simple connectivity, which is essentially equivalent to the embedding of each group as an attractor via a two-level Hebbian rule (Willshaw, Buneman, & Longuet-Higgins, 1969), is sufficient to set up the required system behavior. The competitive firing rule then ensures that only one attractor is active at a time, and that only in part. The connections from DG to CA3 are very sparse, with connection probability $C_{DG\text{-}CA3}$, but the gains $g_{EC\text{-}CA3}$ and $g_{DG\text{-}CA3}$ are set such that the effects of the EC and DG inputs on the CA3 are roughly equal.

4 Model Evaluation Methods

We evaluate our model system by studying whether it satisfies the two conflicting requirements needed for place representation: intraenvironment consistency and interenvironment discrimination. The simulated animal is run in two sensorily identical environments, A and B (both have identical EC place fields), for a large number of steps (typically 5000–10,000) along random paths. The initial input to the DG layer is significantly different in the two cases, signaling the difference of context at the time of entry. To simulate repeated trials in the same environment, the context input is similar each time, simulating the initial sensory input whose recognition causes the animal to decide on a specific context. A randomly chosen subset, $S$, of CA3 cells is monitored as the animal moves about, and the empirical place fields for these cells are reconstructed as follows. First, the conditional probability
of cell $i$ firing at a location is estimated as:

$$P^1_{i|u,v} \equiv P(z_i(t) = 1 \mid L_u(t) = u, L_v(t) = v) \approx \frac{\sum_t z_i(t)\, \delta_{uv}(t)}{\sum_t \delta_{uv}(t)}, \qquad (4.1)$$

where $\delta_{uv}(t) = 1$ if $L_u(t) = u$, $L_v(t) = v$, and 0 else. If $\sum_t \delta_{uv}(t) = 0$ (location never visited), $P^1_{i|u,v}$ is also set to 0. The place field for cell $i$ is then calculated as:

$$f_i^e(u,v) = \begin{cases} a & P^1_{i|u,v} < a \\ P^1_{i|u,v} & a \le P^1_{i|u,v} \le v \\ v & P^1_{i|u,v} > v, \end{cases} \qquad (4.2)$$
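Equations 4.1 and 4.2 amount to a clipped empirical firing probability per location; a minimal Python sketch (all names are ours):

```python
def place_field(spikes, positions, M, a, v):
    """Estimate a clipped place field from spike/position data.

    spikes[t] is the 0/1 firing of one cell at step t; positions[t] is
    its (u, w) grid location.  Returns an M x M map holding the
    empirical firing probability at each visited location, clipped to
    [a, v]; never-visited locations are set to 0 (cf. eqs. 4.1-4.2)."""
    counts = [[0] * M for _ in range(M)]
    fires = [[0] * M for _ in range(M)]
    for z, (u, w) in zip(spikes, positions):
        counts[u][w] += 1
        fires[u][w] += z
    field = [[0.0] * M for _ in range(M)]
    for u in range(M):
        for w in range(M):
            if counts[u][w] == 0:
                continue  # location never visited -> stays 0
            p = fires[u][w] / counts[u][w]
            field[u][w] = min(max(p, a), v)  # clip to [a, v]
    return field
```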
where $e \in \{A, B\}$ denotes the environment. Parameters $a = r_3^{CA3}$ and $v = r_1^{CA3}$ are used to model the fact that no CA3 cell, under our firing rule, has probability less than $r_3^{CA3}$ or greater than $r_1^{CA3}$ of firing at any location, but this may not be picked up in the finite sample used to estimate $P^1_{i|u,v}$.

Intraenvironment consistency is measured by the extent to which the reconstructed place fields allow accurate localization of the simulated animal (Wilson & McNaughton, 1993; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998). This is done using a Bayesian maximum likelihood formulation as follows. The activity of cells in set $S_e$ is monitored as the simulated animal moves on a random path in environment $e$, to get the activity vectors $Z(t) = \{z_k(t)\}$, $k \in S_e$. The set $S_e \subset S$ includes only those cells that have significant place fields in environment $e$. This is determined as follows. The activity of a cell in every $3 \times 3$ neighborhood of the environment is considered, and the number of neighborhoods in which at least 7 of the 9 cells have above-average firing is counted. If this number exceeds 10 (out of a possible 324 neighborhoods for a $20 \times 20$ environment), the cell is deemed to have a place field and is included in $S_e$. We also impose a continuity constraint (Zhang et al., 1998) based on the previous estimated position, in order to avoid unrealistic jumps from one step to another. Thus, the posterior probability function at each location depends not only on the current firing pattern but also on the previous reconstructed position $(\hat{L}_u(t-1), \hat{L}_v(t-1))$:

$$P(u, v \mid Z(t), \hat{L}_u(t-1), \hat{L}_v(t-1)) = C\, P(u, v \mid Z(t))\, P(\hat{L}_u(t-1), \hat{L}_v(t-1) \mid u, v), \qquad (4.3)$$
where $C$ is a normalizing factor, which does not depend on $(u, v)$, and

$$P(u, v \mid Z(t)) = \prod_k \left[ P(u, v \mid z_k{=}1)\, z_k(t) + P(u, v \mid z_k{=}0)\,(1 - z_k(t)) \right] \qquad (4.4)$$
with

$$P(u, v \mid z_k = 1) = P(z_k = 1 \mid u, v)\, P(u, v) / P(z_k = 1) = f_k^e(u, v)\, P(u, v) / \langle z_k \rangle \qquad (4.5)$$

$$P(u, v \mid z_k = 0) = P(z_k = 0 \mid u, v)\, P(u, v) / P(z_k = 0) = [1 - f_k^e(u, v)]\, P(u, v) / (1 - \langle z_k \rangle) \qquad (4.6)$$

and

$$P(\hat{L}_u(t-1), \hat{L}_v(t-1) \mid u, v) \sim \exp\left[ -\left\| (\hat{L}_u(t-1), \hat{L}_v(t-1)) - (u, v) \right\|^2 / 2\sigma^2 \right]. \qquad (4.7)$$
Angular brackets indicate time averages and $\|\cdot\|$ the Euclidean distance. The value of $\sigma$ represents the width of the gaussian function centered at the previous estimated position. Assuming $P(u, v) = M^{-2}$ (all locations equiprobable), we calculate the estimate of the current location:

$$(u^*(t), v^*(t)) = \arg\max_{(u,v)} P(u, v \mid Z(t), \hat{L}_u(t-1), \hat{L}_v(t-1)). \qquad (4.8)$$
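One step of this maximum a posteriori reconstruction (equations 4.3 through 4.8) can be sketched as follows (Python; the uniform prior $P(u,v) = M^{-2}$ is absorbed into the normalization, and all names are ours):

```python
import math

def estimate_location(z, fields, means, prev, sigma, M):
    """One step of Bayesian place reconstruction (cf. eqs. 4.3-4.8).

    z[k]: 0/1 activity of cell k; fields[k][u][v]: its place field f_k;
    means[k]: its time-averaged activity <z_k>; prev: the previously
    estimated (u, v); sigma: width of the gaussian continuity term."""
    best, best_score = None, -1.0
    for u in range(M):
        for v in range(M):
            # P(u,v | Z(t)) up to normalization (equations 4.4-4.6)
            p = 1.0
            for k in range(len(z)):
                f = fields[k][u][v]
                if z[k]:
                    p *= f / means[k]
                else:
                    p *= (1.0 - f) / (1.0 - means[k])
            # gaussian continuity constraint (equation 4.7)
            d2 = (prev[0] - u) ** 2 + (prev[1] - v) ** 2
            p *= math.exp(-d2 / (2.0 * sigma ** 2))
            if p > best_score:  # argmax over locations (equation 4.8)
                best, best_score = (u, v), p
    return best
```

For realistic numbers of cells the product should be accumulated in log space to avoid underflow; the direct product is kept here for clarity.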
As the animal moves along its path, the error (Euclidean distance) between the estimated and true positions is calculated, and its mean value is used as a measure of localization performance:

$$\lambda = \frac{1}{l} \sum_{t=1}^{l} \left\| (L_u(t), L_v(t)) - (\hat{L}_u(t), \hat{L}_v(t)) \right\|, \qquad (4.9)$$
where $l$ is the length of the path.

Interenvironment discrimination is measured using the mean correlation coefficient of the reconstructed place fields for the two environments:

$$\xi(A, B) = \frac{1}{|S'|} \sum_{i=1}^{|S'|} \frac{ M^2 \sum_{u,v} f_i^A(u,v)\, f_i^B(u,v) - \left( \sum_{u,v} f_i^A(u,v) \right) \left( \sum_{u,v} f_i^B(u,v) \right) }{ \sqrt{ M^2 \sum_{u,v} f_i^A(u,v)^2 - \left( \sum_{u,v} f_i^A(u,v) \right)^2 }\, \sqrt{ M^2 \sum_{u,v} f_i^B(u,v)^2 - \left( \sum_{u,v} f_i^B(u,v) \right)^2 } }, \qquad (4.10)$$

where $S'$ is the set of monitored cells that have place fields in at least one of the environments (i.e., $S' = S_A \cup S_B$), and $|S'|$ denotes the cardinality of $S'$. A low value for $\xi(A, B)$ indicates that the place representations for environments A and B are significantly different. However, because of the pervasive
noise in the system, the value of $\xi(A, B)$ itself is not a sufficient indicator of discrimination and must be compared with the correlation between place representations within the same environment during two different episodes. To this end, we also calculate a quantity $\xi(A, A)$, analogous to $\xi(A, B)$ but calculated over two sessions in environment A with the same initial condition in both sessions. A high positive $\xi(A, A)$ indicates that the place representation in environment A is stable over different sessions. The interenvironment discrimination achieved by the system is then measured as

$$D(A, B) = \langle \xi(A, A) \rangle - \langle \xi(A, B) \rangle, \qquad (4.11)$$
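The measures in equations 4.10 and 4.11 are ordinary Pearson correlations over the $M \times M$ field maps, averaged over cells and runs; a Python sketch (all names are ours):

```python
import math

def field_correlation(fa, fb):
    """Pearson correlation between two M x M place-field maps
    (the per-cell term of equation 4.10)."""
    xs = [x for row in fa for x in row]
    ys = [y for row in fb for y in row]
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

def discrimination(xi_AA, xi_AB):
    """D(A,B) = <xi(A,A)> - <xi(A,B)> (equation 4.11), given lists of
    per-run mean correlations for the A-A and A-B comparisons."""
    return sum(xi_AA) / len(xi_AA) - sum(xi_AB) / len(xi_AB)
```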
where the angular brackets indicate averaging over many independent A and B environments. A significantly positive $D(A, B)$ indicates that the system actually discriminates between A and B, while a value close to zero means that the difference in place codes for A and B arises only from the noise within the system.

In order to provide a baseline for our evaluations and determine the utility of the latent attractor architecture, we compare our system with one that is identical to ours in every way except that the DGH system does not have groups. The actual connectivity profiles of all postsynaptic neurons are matched exactly for the two systems. The goal is to determine how much the group architecture helps with context encoding beyond what one might get from a system without groups. The system with groups is henceforth termed the latent attractor (LA) model, while the network without groups is called the no latent attractors (NLA) model. The independent parameter of interest is defined as $R = g_{EC\text{-}DG} / g_{H\text{-}DG}$, which quantifies the relative strength of the afferent (informational) and recurrent (contextual) inputs to the DG layer.

5 Results

We evaluate our model by simulating LA and NLA networks with layers of the following sizes: $N_{EC} = 200$, $N_{DG} = 1000$, $N_H = 500$, and $N_{CA3} = 300$. The H-to-DG gain, $g_{H\text{-}DG}$, is fixed at 0.5, and $g_{EC\text{-}DG}$ is varied systematically to change $R$. The LA network has 10 cell groups, with 100 cells per group in DG and 50 cells per group in H. This implies that the mean overlap between DG groups is 10, about 60 cells in each group belong to at least one other group, and about 350 cells are not part of any group. The values of other parameters are as given in the appendix.

Figure 2 shows the empirically calculated localization and discrimination values for the LA and NLA systems as a function of $R$. The two systems are quite similar in terms of localization.
When $R$ is small, the DG gets little sensory information, leading to poorly formed CA3 place fields (see Figure 3a) and poor localization. As $R$ increases, the place fields improve
[Figure 2 graphs: (a) interenvironment discrimination $D(A, B)$ versus $R$; (b) reconstruction error $\lambda$ (grid squares) versus $R$; LA and NLA systems.]
Figure 2: (a) The interenvironment discrimination, $D(A, B)$, for the LA and NLA systems. Environments A and B (size $20 \times 20$) have identical EC place fields and differ only in the initial afferent stimulus given to the DG. The data are based on the output from 200 CA3 cells whose place fields were reconstructed over 5000 steps in each environment. (b) The localization accuracy over 100 steps in Environment A for both systems. Only place fields meeting a quality criterion were used in localization. Each bar was averaged over five runs. Error bars indicate one standard deviation above and below the mean.
[Figure 3 panels, with the peak firing probability Max indicated above each: (a) 0.95, (b) 0.95, (c) 0.67, (d) 0.95, (e) 0.95, (f) 0.86, (g) 0.95, (h) 0.95.]
Figure 3: (A) Place-dependent activity for weak EC input to the DG layer ($R = 1.0$). Graphs a and b are for two typical CA3 cells in the LA system, and graphs c and d for two NLA system cells. Note the lack of coherent place activity in both systems, which explains the poor localization. (B) Place-dependent activity for strong EC input to the DG layer ($R = 12.0$). Graphs e and f are for two CA3 cells in the LA system, and graphs g and h for two NLA system cells. Note that the fields are well formed and localization is good. Each figure uses 24 gray-scale levels between 0 (black) and Max (white), where Max (indicated above each graph) is the peak probability of firing for the cell at any location in the environment.
(see Figures 3b and 4), and so does localization. The NLA system localizes slightly better than the LA system because its place code is distributed over more active neurons. However, this is inconsistent with the observation that only a small percentage of CA3 place cells are active in any one environment.

A point worth noting is that the CA3 place representation in the LA system shows interenvironment discrimination even though the CA3 is structurally homogeneous and does receive a strong direct input from the EC. Thus, the partial influence from the DG is enough to produce a high degree of context dependence in CA3 place fields. This is reflected in the place fields for environments A and B, as shown in Figure 4.

The NLA system shows no interenvironment discrimination at any value of $R$. Although $\xi(A, B)$ is fairly small for low $R$, it is matched almost exactly by $\xi(A, A)$. Thus, the lack of correlation between A and B place fields for low $R$ does not indicate any memory of the initial difference in context, but simply reflects the noise inherent in the system. As $R$ increases and more sensory information (which is identical for A and B) gets through to the DG, the correlation increases identically for both the A-A and A-B cases. These results show that the recurrent loop in the DGH subsystem is not able to provide an adequate memory of the difference in initial conditions over the course of a session.

The LA system, in contrast, shows strong interenvironment discrimination at all $R$ levels, though the value drops slightly, as expected, at high $R$. With increasing $R$ values (not shown), we expect discrimination to be lost eventually, but the LA system clearly provides high discrimination and good localization over a broad range of $R$ values. One point of note is that even at best, $D(A, B) < 1$, whereas one might expect a value close to 1. The reason is that $\xi(A, A) < 1$ due to factors such as noise and path dependence. We now consider this issue in more detail.
Ideally, a noise-free system run along identical paths in identical environments with the same initial conditions will produce perfectly correlated place fields ($\xi(A, A) = 1$). Thus, the factors that reduce the correlation below 1 are:

Path dependence. Differences in place fields reconstructed over the different paths taken in the two sessions. This occurs because the recurrence in the DGH system makes DG (and thus CA3) activity dependent on previous states of the system.

Context dependence. Differences in place fields due to the difference in initial conditions (indicating context) for the two sessions.

Noise. Differences in place fields for the two sessions due to independent noise sources: spatial noise in EC fields, additive noise in EC cell activity, and stochasticity in the competitive firing of DG, H, and CA3 neurons.

Differences in afferent input. Place code differences due to real differences in the afferent sensory input.
[Figure 4 panels, with the peak firing probability Max indicated above each: (a) 0.95, (b) 0.33, (c) 0.20, (d) 0.95, (e) 0.95, (f) 0.95, (g) 0.95, (h) 0.95.]
Figure 4: Activity patterns of the same cells in Environments A (left column) and B (right column) for $R = 6.0$. Note that the cells in the LA system (graphs a–d) have distinct place fields in the two environments, while cells in the NLA system (graphs e–h) do not, showing that the NLA system does not discriminate well on the basis of context. Many CA3 cells in the LA system did not have place fields in one or both environments, while almost all NLA cells had similar fields in both, which is contrary to experimental observation. Gray-scale coding is as in Figure 3.
Simply looking at $D(A, B)$ (or even $\xi(A, A)$ and $\xi(A, B)$) does not indicate the degree to which these factors affect the system's discrimination quality. The data shown in Figure 5 attempt to do this. The leftmost bars in graphs a and b show $\xi(A, A)$ in noise-free LA and NLA systems, respectively. All three sources of noise are eliminated, and the EC place fields and the initial conditions are identical for the two sessions. Thus, the residual lack of correlation, $1 - \xi(A, A)$, can be ascribed completely to path dependence. Clearly, the NLA network is much more susceptible to path dependence than the LA network, though both improve as $R$ increases and the similar sensory input from the EC dominates the effects of the DGH recurrence. The second set of bars in both graphs shows $\xi(A, A)$ with all sources of noise present, but the EC place fields and initial conditions again identical in the two sessions. These are the $\xi(A, A)$ values used in determining $D(A, B)$ for Figure 2. A comparison with the leftmost bars shows how much additional loss of correlation is caused by noise. Both systems are affected significantly (as expected). Finally, the rightmost set of bars in the two graphs shows $\xi(A, B)$, that is, the place code correlation when the EC inputs are the same but the initial conditions (contexts) are different and all noise sources are present. The NLA system correlations show no change from the situation with identical context (middle bars), while the LA system correlations drop dramatically, indicating context-based discrimination of place codes. These results verify that the dissimilarity in NLA representations for environments A and B is accounted for wholly by the effects of noise and path dependence.

For completeness, we also simulated the case where the sensory input was different (different EC place fields) but the initial condition was the same in the two sessions, and the case where both the EC fields and the initial conditions were different.
Both the LA and NLA systems produced substantially distinct representations in these cases, showing that differences in sensory input remain salient (we argue later that path integration can partially override these).

Finally, we also briefly considered the issue of network capacity for latent attractors. A full investigation of capacity (such as that carried out by Battaglia & Treves, 1998, for the charts system) is beyond the scope of this article, but we used simulations to study how the number of stable latent attractors in DG changed as the size of the attractors, $n_{DG}$, increased relative to the system size, $N_{DG}$. This was parameterized by the ratio $f = n_{DG} / N_{DG}$. The gain parameters were adjusted to keep the expected excitation to DG neurons in the active group constant. Group stability was evaluated by running the LA system in environment A for 110 steps and checking the DG activity pattern for steps 101 through 110. The degree of confinement to the initially activated (i.e., correct) group, $Q^*$, was calculated as

$$\psi(Q^*) = \frac{1}{10} \sum_{t=101}^{110} \frac{n_{DG}}{K(t)} \left[ \frac{\sum_{i \in Q^*} z_i(t)}{n_{DG}} - \frac{\sum_{i \notin Q^*} z_i(t)}{N_{DG} - n_{DG}} \right], \qquad (5.1)$$
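A Python sketch of the confinement measure in equation 5.1 (names ours; the scaling is such that $\psi = 1$ for activity wholly inside $Q^*$ and $\psi = 0$ for activity spread uniformly over the whole layer):

```python
def confinement(z_steps, group, N):
    """Confinement measure psi(Q*) (cf. equation 5.1), averaged over
    the recorded steps.

    z_steps: list of 0/1 activity vectors of length N (one per step);
    group: indices of the cells in Q*."""
    n = len(group)
    total = 0.0
    for z in z_steps:
        K = sum(z)                          # neurons firing this step
        inside = sum(z[i] for i in group)   # firing inside Q*
        outside = K - inside                # firing outside Q*
        total += (n / K) * (inside / n - outside / (N - n))
    return total / len(z_steps)
```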
[Figure 5 graphs: mean correlation versus $R$ for (a) the LA system and (b) the NLA system.]
Figure 5: Mean correlations between reconstructed place fields for two sessions. The stimulus and network conditions in the two sessions are: (1) black bars: different paths, identical initial states (contexts), identical EC fields, and no noise; (2) gray bars: different paths, identical initial states (contexts), identical EC fields, and noise; (3) white bars: different paths, different initial states (contexts), identical EC fields, and noise. See the text for discussion.
where $K(t)$ is the number of DG neurons firing at step $t$. A value of $\psi(Q^*) = 1$ indicates total confinement of activity to group $Q^*$, while $\psi(Q^*) = 0$ means that the confinement is no better than for activity distributed randomly over
[Figure 6 graph: capacity (number of groups) versus $f$ for two network sizes.]
Figure 6: Capacity of the DGH system as a function of $f$, the size of the embedded latent attractors relative to network size. Network sizes are: $N_{EC} = 400$, $N_{DG} = 2000$, $N_H = 1000$ (circles); $N_{EC} = 200$, $N_{DG} = 1000$, $N_H = 500$ (asterisks) (the size used in all other simulations). All connectivities are as shown in the appendix. See the text for details.
the whole DG. For each $f$ value, we determined the maximum number of groups that could be stabilized, where stabilization was defined by the following criterion: the mean $\psi$ over all groups was at least 0.85, and no group had a $\psi$ value less than 0.7. This was taken as the network's capacity. Figure 6 shows this capacity as a function of $f$ for networks of two sizes. In both cases, we find that the capacity decreases exponentially with $f$, but when groups are small compared to the neuron populations, a large number of latent attractors can be stored.

6 Discussion

6.1 Overall Significance of the Model. The primary result of our modeling is that the inclusion of a simple structured recurrent module (the DGH subsystem) in the hippocampal complex is sufficient to produce long-term,
stable, and reliable context dependence in CA3 place fields. However, an unstructured recurrent module is insufficient for this purpose. The LA model we present is a very simple (even simplistic) one and does not account for many important aspects of hippocampal function. Its purpose, however, is to illustrate the latent attractors idea in the simplest possible terms within the context of hippocampal place representations.

There has been at least one previous model that proposes mechanisms similar to ours: the charts model of Samsonovich and McNaughton (1997). While elegant and insightful, it involves considerably more detailed preconfiguration of the system. A primary aim of our model is to demonstrate that a much simpler system is capable of essentially the same type of behavior as that hypothesized in the charts model. Unlike the charts model, we do not assume any intrinsic, noise-free place specificity in our DG or CA3 cells, but allow it to emerge through the convergence of EC afferents and competitive firing. In our model, place-specific firing is explicitly produced only in the EC layer and is very noisy and unreliable. Also, unlike the charts model, we do not configure the recurrent connectivity in our system to reflect the place fields of individual cells in different contexts. Our DGH system is connected randomly, the groups comprising the attractors are randomly chosen, and the connection weights have only two possible values (rather than a continuum). The results indicate that although the detailed chart structure does produce better location estimates and more compact place fields, only minimal network structure is required to obtain the degree of localization, place specificity, and context dependence seen in actual recordings (Zhang et al., 1998). The same conclusion is reached by Samsonovich (1999) for a model similar to ours in principle but located in CA3, with connections specified via an off-line Hebbian prescription.
In spite of its simplicity, our model captures many essential properties of hippocampal place representations, including the following:

The model accounts for different, randomly determined populations of active cells and silent cells in different environments (Thompson & Best, 1989, 1990).

The model provides a simple explanation of the remapping phenomenon (Bostock et al., 1991; Shapiro et al., 1997; Tanila et al., 1997; Knierim et al., 1998), whereby place representations switch in a discontinuous manner. In our model, this can be seen as a switch in the selected groups. Partial remappings can be explained by assuming that the overlaps between some groups are larger than usual, or by more complex connectivity patterns that allow multiple groups to be active together and switch separately. We do not explore this issue in the current simulations.

The model provides a simple mechanism for producing different place representations in identical sensory environments based on initial expectation (context) (Quirk et al., 1990), without interfering unduly with the reliable reproducibility of each representation within its context.

The model shows how broad, noisy, and unreliable place fields in the EC can lead to much more reliable, context-specific place representations in the CA3 (Quirk et al., 1992). The increased accuracy, of course, is a direct result of the averaging over multiple EC afferents performed by each DG and CA3 cell. The context dependence is a consequence of group selection in the DG.

6.2 Predictions. As a means of verifying (or rejecting) our hypotheses, we offer a set of experimental predictions that could be of interest to experimentalists:

DG place fields will show the same phenomena of "silent cells" and coding of similar environments by uncorrelated cell groups as the CA3/CA1. This prediction, if true, would mean that the phenomena observed in CA3/CA1 place codes (Thompson & Best, 1989) are already present at the DG stage in the trisynaptic pathway. There is already some evidence for silent cells (Jung & McNaughton, 1993) and context dependence (Markus et al., 1995) in DG place representations, but, to our knowledge, there is no systematic study of the correlatedness of DG place codes for similar environments (i.e., whether these codes are more EC-like (Quirk et al., 1992) or more CA3-like (Thompson & Best, 1989; Kubie & Muller, 1991)). A potential problem is that of locating silent granule cells, since granule cell firing is rather sparse and easily swamped by interneuron activity. However, disinhibition by bicuculline has been used successfully to elicit activity (albeit abnormal) from inactive granule cells (see, e.g., Sloviter & Brisman, 1995), and in vivo granule cell identification procedures have also been used successfully by Jung and McNaughton (1993).

CA3/CA1 place fields following a lesion of the DG using colchicine (McNaughton et al., 1989) would show significantly reduced context dependence.
It is known (McNaughton, Barnes, & O'Keefe, 1983; Knierim & McNaughton, 1995) that, at least in strongly directional environments such as arm mazes, CA3/CA1 place fields persist after DG lesion, presumably because of the direct input from the EC to CA3. There is also evidence that EC place representations are not context dependent, in the sense that similar environments produce similar EC representations (Quirk et al., 1992) (though Mizumori, Ward, & Lavoie, 1992, report that they do show directional firing on arm mazes). This contrasts with the context dependence seen in the CA3/CA1 fields (O'Keefe & Nadel, 1978; Muller & Kubie, 1987; Thompson & Best, 1989; Kubie & Muller, 1991; Quirk et al., 1990). If the DGH is responsible for this context dependence in CA3/CA1, its lesion should eliminate it, or at least reduce it significantly.

CA3/CA1 place codes in rats with mossy cell loss will show significantly reduced context dependence. Rats with hilar cell loss, usually due to kainic acid
application, are used as a model for temporal-lobe epilepsy (Nadler, Perry, & Cotman, 1978, 1980). It is widely believed that the cells lost include mossy cells, whose loss leads to a disruption in the pattern of inhibition to granule cells (Sloviter, 1987, 1994; Sloviter & Brisman, 1995; Gibbs, Shumate, & Coulter, 1997). Our hypothesis predicts that rats with hilar cell loss will also show less context dependence in their CA3/CA1 place codes.

Environments with similar entry points will be coded by cell groups with greater than usual overlap. This is a somewhat speculative prediction, but it is included because it would be very easy to test and, if correct, would say a great deal. The idea is that if cell groups for representing novel environments are chosen at the time of entry (as hypothesized), environments with identical or similar entry areas would tend to get assigned to the same DGH cell group. Subsequently encountered differences in sensory input within each environment would produce different place representations in the two cases, but using the same group of cells, implying the prediction. There are several caveats. First, there may be other mechanisms that ensure random rather than stimulus-based group assignment. Second, group assignment during the first encounter with an environment may happen too slowly to show the effect. Third, the phenomenon might not happen reliably and may be observable only over a large enough sample of environment pairs.

6.3 Extensions and Open Issues. The work reported in this article addresses only some of the issues that our model is directed toward exploring. These include the consideration of more subtle latent attractor schemes, the explicit use of pattern recognition in the EC-DG projection to trigger appropriate groups, and the inclusion of path integration in the model. The last issue is of special significance because some of the experimental results that motivated this research require path integration for adequate modeling.
For example, the model in its current form cannot directly duplicate the light and dark results of Quirk et al. (1990), though it certainly suggests a detailed explanation. As indicated earlier, we agree with the hypothesis (Touretzky & Redish, 1996; Redish & Touretzky, 1997) that path integration information enters the hippocampus through the EC. The simplest mechanism would be if the positional information reconstructed by the PI system directly elicited approximately correct firing in the EC (Redish & Touretzky, 1998). The rest of the system could then function as under normal circumstances. This framework can also be used to address the issues of hierarchical conflict resolution between distal sensory, local sensory, and ideothetic information that apparently occurs in the hippocampus (Hill & Best, 1981; Shapiro et al., 1997; Knierim et al., 1998). If EC cells are sensitive to subsets of cues, individual cues (local and distal), vestibular cues, PI estimates, and so on, the same contingencies will be reflected in DG and CA3 place fields, which might explain the heterogeneous and complex changes seen in place representations in pathologically inconsistent environments (Bures et al., 1997; Rotenberg & Muller, 1997; Knierim et al., 1998).
An important aspect of the hippocampal system that is not addressed in this work is the function of the CA3 recurrent connections. We essentially agree with previous proposals (Marr, 1971; Treves & Rolls, 1992; O'Reilly & McClelland, 1994; Hasselmo et al., 1994, 1996; Blum & Abbott, 1996; Rolls, 1996; Redish & Touretzky, 1998) that the CA3 functions as an autoassociator performing the dual roles of pattern completion (to produce correct place codes from degraded afferent information) and sequence learning (Minai & Levy, 1993; Levy, 1996; Blum & Abbott, 1996; Skaggs & McNaughton, 1996; Redish & Touretzky, 1998; Sohal & Hasselmo, 1998). It has been suggested (Hasselmo & Schnell, 1994; Redish & Touretzky, 1998) that the recurrent CA3 connections are used only during associative recall and are blocked by cholinergic (and perhaps even GABAergic) modulation during learning (Hasselmo & Schnell, 1994; Hasselmo et al., 1995, 1996). Our current model focuses only on the response of the CA3 place cells to afferent sensory information. Once a PI system is incorporated into the model, the inclusion of CA3 recurrent connections will become more relevant, since pattern completion would be needed to resolve conflicts between sensory and PI information (Redish & Touretzky, 1998).

Another issue closely related to the role of the CA3 is the 5–12 Hz theta rhythm of the hippocampal EEG (Vanderwolf, 1969). The discrete time steps in our model can be seen as corresponding to single theta cycles. We do not include any explicit provision for theta in our system, but our assumption of updating the afferent stimulus once every theta cycle is consistent with evidence that rats sample sensory information at the theta frequency (Greenstein, Pavlides, & Winson, 1988). Long-term potentiation (LTP), which is probably the basis for neural learning, is also optimized when stimuli are spaced one theta cycle apart (Larson, Wong, & Lynch, 1986).
It has been suggested (Lisman & Idiart, 1995) that the phase of the theta cycle may index multiple patterns simultaneously active in a hippocampal associative memory. An alternative possibility is that a single stored pattern may be retrieved recurrently via multiple high-frequency (gamma) oscillations nested in each theta cycle. Such a process could be very useful in eliciting previously stored CA3 place codes in response to partially correct input based on sensory stimuli and path integration (Redish & Touretzky, 1998). We do not include this in our current model but envisage doing so in the future.

Appendix
The following parameter values were used in the simulations reported.

M                      Size of environment                                                20
N_EC                   Number of neurons in layer EC                                      200
N_DG                   Number of neurons in layer DG                                      1000
N_H                    Number of neurons in layer H                                       500
N_CA3                  Number of neurons in layer CA3                                     300
m (LA system only)     Number of cell groups                                              10
n_DG (LA system only)  Number of cells in each DG group                                   100
n_H (LA system only)   Number of cells in each H group                                    50
C_EC→DG                EC → DG connectivity                                               0.05
C_EC→CA3               EC → CA3 connectivity                                              0.07
C_DG→CA3               DG → CA3 connectivity                                              0.003
C_DG→H                 DG → H connectivity                                                0.6
C_H→DG                 H → DG connectivity                                                0.6
                       EC → DG weights                                                    U[0, 1]
                       EC → CA3 weights                                                   U[0.01, 0.1]
                       DG → CA3 weights                                                   U[0.4, 0.6]
h_DG→H                 Within-group DG → H weight                                         1.0
l_DG→H                 Cross-group DG → H weight                                          0.01
h_H→DG                 Within-group H → DG weight                                         1.0
l_H→DG                 Cross-group H → DG weight                                          0.01
a_i                    EC place field width parameter                                     U[0.004, 0.006]
b_i                    EC place field width parameter                                     U[0.004, 0.006]
c_ui, c_vi             EC place field center locations                                    U[−9, 29]
d_i                    EC place fields orientation parameter                              U[−1.0, 1.0]
σ_q²                   EC place field positional noise variance                           1.0
σ²                     EC firing rate noise variance                                      0.01
κ                      EC baseline firing rate                                            0.0
g_EC→DG                EC → DG excitatory gain                                            [0 ··· 7]
g_H→DG                 H → DG excitatory gain                                             0.5
G_H→DG                 H → DG inhibitory gain                                             0.2
K_DG                   Nominal number of active DG neurons                                40
r1_DG                  Probability of firing in the K_DG most active DG neurons           0.95
r2_DG                  Probability of firing in the next K_DG most active DG neurons      0.05
r3_DG                  Probability of firing in the remaining N_DG − 2K_DG DG neurons     0.003
g_DG→H                 DG → H excitatory gain                                             1.0
K_H                    Nominal number of active H neurons                                 20
r1_H                   Probability of firing in the K_H most active H neurons             0.95
r2_H                   Probability of firing in the next K_H most active H neurons        0.05
r3_H                   Probability of firing in the remaining N_H − 2K_H H neurons        0.003
g_EC→CA3               EC → CA3 excitatory gain                                           1.0
g_DG→CA3               DG → CA3 excitatory gain                                           1.0
G_DG→CA3               DG → CA3 inhibitory gain                                           0.01
K_CA3                  Nominal number of active CA3 neurons                               15
r1_CA3                 Probability of firing in the K_CA3 most active CA3 neurons         0.95
r2_CA3                 Probability of firing in the next K_CA3 most active CA3 neurons    0.05
r3_CA3                 Probability of firing in the remaining N_CA3 − 2K_CA3 CA3 neurons  0.003
a                      Minimum place field firing rate                                    r3_CA3
v                      Maximum place field firing rate                                    r1_CA3
Acknowledgments
This research was supported by grants no. IBN-9634424 (A.A.M. and P.J.B.), IBN-9808664 (A.A.M.), and IBN-9816612 (P.J.B.) from the National Science Foundation. We thank John Lisman, David Redish, Alexei Samsonovich, and Dave Touretzky for providing preprints. Discussions with many people have contributed to this work, but A.A.M. especially acknowledges André Fenton, Alex Guazzelli, Mike Hasselmo, Jaideep Kapur, Chip Levy, John Lisman, Mayank Mehta, David Redish, Alexei Samsonovich, Lokendra Shastri, Dave Touretzky, Aaron White, and Brad Wyble. Detailed feedback from three anonymous reviewers significantly enhanced the quality of this report.

References

Amaral, D. G., Ishizuka, N., & Claiborne, B. (1990). Neurons, numbers and the hippocampal network. Progress in Brain Research, 83, 1–11.
Amaral, D. G., & Witter, M. P. (1989). The three dimensional organization of the hippocampal formation: A review of anatomical data. Neuroscience, 31, 571–591.
Andersen, P., Bliss, T. V. P., & Skrede, K. K. (1971). Lamellar organization of hippocampal excitatory pathways. Exp. Brain Res., 13, 222–238.
Barnes, C. A., Suster, M. S., Shen, J., & McNaughton, B. L. (1997). Multistability of cognitive maps in the hippocampus of old rats. Nature, 388, 272–275.
Barnes, C. A., McNaughton, B. L., Mizumori, S. J. Y., Leonard, B. W., & Lin, L-H. (1990). Comparison of spatial and temporal characteristics of neuronal activity in sequential stages of hippocampal processing. Progress in Brain Research, 87, 287–300.
Battaglia, F. P., & Treves, A. (1998). Attractor neural networks storing multiple space representations: A model for hippocampal place fields. Phys. Rev. E, 58, 7738–7753.
Blum, K. I., & Abbott, L. F. (1996). A model of spatial map formation in the hippocampus of the rat. Neural Computation, 8, 85–93.
Simona Doboli, Ali A. Minai, and Phillip J. Best
Bostock, E., Muller, R. U., & Kubie, J. L. (1991). Experience-dependent modifications of hippocampal place cell firing. Hippocampus, 1, 193–206.
Brown, E. N., Frank, L. M., Tang, T., Quirk, M. C., & Wilson, M. A. (1998). A statistical paradigm for neural spike-train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. J. Neurosci., 18, 7411–7425.
Buckmaster, P. S., & Schwartzkroin, P. A. (1994). Hippocampal mossy cell function: A speculative view. Hippocampus, 4, 393–402.
Buckmaster, P. S., & Schwartzkroin, P. A. (1995). Interneurons and inhibition in the dentate gyrus of the rat in vivo. J. Neurosci., 15, 774–789.
Bures, J., Fenton, A. A., Kaminsky, Y., Rossier, J., Sacchetti, A., & Zinyuk, L. (1997). Dissociation of exteroceptive and ideothetic orientation cues: Effect on hippocampal place cells and place navigation. Phil. Trans. R. Soc. Lond. B, 352, 1515–1524.
Carpenter, G. A., & Grossberg, S. (1987a). A massively parallel architecture for a self-organizing neural pattern recognition machine. Comp. Vision, Graphics, and Im. Proc., 37, 54–115.
Carpenter, G. A., & Grossberg, S. (1987b). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919–4930.
Deadwyler, S. A., West, M. O., & Robinson, J. H. (1981). Entorhinal and septal inputs differentially control sensory-evoked responses in the rat dentate gyrus. Science, 211, 1181–1183.
Doboli, S., Minai, A. A., & Best, P. J. (1999). A latent attractors model of context-selection in the dentate gyrus-hilus system. Neurocomputing, 26–27, 671–676.
Eichenbaum, H., Kuperstein, M., Fagan, A., & Nagode, J. (1987). Cue-sampling and goal-approach correlates of hippocampal unit activity in rats performing an odor-discrimination task. J. Neurosci., 7, 716–732.
Etienne, A. S., Maurer, R., & Séguinot, V. (1996). Path integration in mammals and its interaction with visual landmarks. J. Exper. Biol., 199, 201–209.
Gibbs III, J.
W., Shumate, M. D., & Coulter, D. A. (1997). Differential epilepsy-associated alterations in postsynaptic GABAA receptor function in dentate granule and CA1 neurons. J. Neurophysiol., 77, 1924–1938.
Golarai, G., & Sutula, T. P. (1996). Bilateral organization of parallel and serial pathways in the dentate gyrus demonstrated by current-source density analysis in the rat. J. Neurophysiol., 75, 329–342.
Gothard, K. M., Skaggs, W. E., Moore, K. E., & McNaughton, B. L. (1996). Binding of hippocampal CA1 neural activity to multiple reference frames in a landmark-based navigation task. J. Neurosci., 16, 823–835.
Gothard, K. M., Skaggs, W. E., & McNaughton, B. L. (1996). Dynamics of mismatch correction in the hippocampal ensemble code for space: Interaction between path integration and environmental cues. J. Neurosci., 16, 8027–8040.
Greenstein, Y. J., Pavlides, C., & Winson, J. (1988). Long-term potentiation in the dentate gyrus is preferentially induced at theta rhythm periodicity. Brain Res., 438, 331–334.
Han, Z.-S., Buhl, E. H., Lörinczi, Z., & Somogyi, P. (1993). A high degree of spatial selectivity in the axonal and dendritic domains of physiologically identified
local-circuit neurons in the dentate gyrus of the rat hippocampus. Europ. J. Neurosci., 5, 395–410.
Hasselmo, M. E., & Schnell, E. (1994). Laminar selectivity of the cholinergic suppression of synaptic transmission in rat hippocampal region CA1: Computational modeling and brain slice physiology. J. Neurosci., 14, 3898–3914.
Hasselmo, M. E., Schnell, E., & Barkai, E. (1995). Dynamics of learning and recall at excitatory recurrent synapses and cholinergic modulation in hippocampal region CA3. J. Neurosci., 15, 5249–5262.
Hasselmo, M. E., Wyble, B. P., & Wallenstein, G. V. (1996). Encoding and retrieval of episodic memories: Role of cholinergic and GABAergic modulation in the hippocampus. Hippocampus, 6, 693–708.
Hetherington, P. A., Austin, K. B., & Shapiro, M. L. (1994). Ipsilateral associational pathway in the dentate gyrus: An excitatory feedback system that supports N-methyl-D-aspartate-dependent long-term potentiation. Hippocampus, 4, 422–438.
Hill, A. J. (1978). First occurrence of hippocampal spatial firing in a new environment. Exp. Neurol., 62, 282–297.
Hill, A. J., & Best, P. J. (1981). The effect of deafness and blindness on the spatial correlates of hippocampal unit activity in the rat. Exper. Neurol., 74, 204–207.
Jackson, M. B., & Scharfman, H. E. (1996). Positive feedback from hilar mossy cells to granule cells in the dentate gyrus revealed by voltage-sensitive dye and microelectrode recording. J. Neurophysiol., 76, 601–616.
Jung, M. W., & McNaughton, B. L. (1993). Spatial selectivity of unit activity in the hippocampal granular layer. Hippocampus, 3, 165–182.
Knierim, J. J., Kudrimoti, H. S., & McNaughton, B. L. (1998). Interactions between ideothetic cues and external landmarks in the control of place cells and head-direction cells. J. Neurophysiol., 80, 425–446.
Knierim, J. J., & McNaughton, B. L. (1995). Differential effects of dentate gyrus lesions on pyramidal cell firing in 1- and 2-dimensional spatial tasks. Soc. Neurosci. Abstr., 21, 940.
Kosko, B.
(1988). Bidirectional associative memories. IEEE Trans. Syst. Man. Cyber., 18, 49–60.
Kubie, J. L., & Muller, R. U. (1991). Multiple representations in the hippocampus. Hippocampus, 1, 240–242.
Larson, J., Wong, D., & Lynch, G. (1986). Patterned stimulation at the theta frequency is optimal for the induction of long-term potentiation. Brain Res., 368, 347–350.
Levy, W. B. (1996). A sequence predicting CA3 is a flexible associator that learns and uses context to solve hippocampal-like tasks. Hippocampus, 6, 579–591.
Levy, W. B., & Wu, X. (1996). The relationship of local context cues to sequence length memory capacity. Network: Computation in Neural Systems, 7, 371–384.
Lisman, J. E. (1999). Relating hippocampal circuitry to function: The role of reciprocal dentate-CA3 interaction in the recall of sequences. Neuron, 22, 233–242.
Lisman, J. E., & Idiart, M. A. P. (1995). Storage of 7 ± 2 short-term memories in oscillatory subcycles. Science, 267, 1512–1515.
Markus, E. J., Qin, Y.-L., McNaughton, B. L., & Barnes, C. A. (1994). Place fields
are affected by behavioral and spatial constraints. Cog. Neurosci. Soc. Abstr., 1, 91.
Markus, E. J., Qin, Y.-L., Leonard, B., Skaggs, W. E., McNaughton, B. L., & Barnes, C. A. (1995). Interactions between location and task affect the spatial and directional firing of hippocampal neurons. J. Neurosci., 15, 7079–7094.
Marr, D. (1971). Simple memory: A theory for archicortex. Phil. Trans. R. Soc. Lond. B, 262, 23–81.
McNaughton, B. L., Barnes, C. A., & O'Keefe, J. (1983). The contributions of position, direction, and velocity to single unit activity in the hippocampus of freely moving rats. Experimental Brain Research, 52, 41–49.
McNaughton, B. L., Barnes, C. A., Gerrard, J. L., Gothard, K., Jung, M. W., Knierim, J. J., Kudrimoti, H., Qin, Y., Skaggs, W. E., Suster, M., & Weaver, K. L. (1996). Deciphering the hippocampal polyglot: The hippocampus as a path integration system. J. Exp. Biol., 199, 173–185.
McNaughton, B. L., & Morris, R. G. M. (1987). Hippocampal synaptic enhancement and information storage within a distributed memory system. Trends in Neurosci., 10, 408–415.
Mehta, M., Barnes, C. A., & McNaughton, B. L. (1997). Experience-dependent, asymmetric expansion of hippocampal place fields. Proc. Natl. Acad. Sci. USA, 94, 8918–8921.
Miller, V. M., & Best, P. J. (1980). Spatial correlates of hippocampal unit activity are altered by lesions of the fornix and entorhinal cortex. Brain Res., 194, 311–323.
Minai, A. A. (1997). Control of CA3 place fields by the dentate gyrus: A neural network model. In J. M. Bower (Ed.), Computational neuroscience: Trends in research 1997: Proceedings of the Fifth Computational Neuroscience Meeting (pp. 411–416).
Minai, A. A., Barrows, G. L., & Levy, W. B. (1994). Disambiguation of pattern sequences with recurrent networks. In Proc. World Congress on Neural Networks, San Diego (Vol. 4, pp. 176–180).
Minai, A. A., & Best, P. J. (1998). Encoding spatial context: A hypothesis on the function of the dentate gyrus-hilus system. In Proc. Int.
Joint Conf. on Neural Networks, Anchorage (pp. 587–592).
Minai, A. A., & Levy, W. B. (1993). Sequence learning in a single trial. In Proc. World Congress on Neural Networks, Portland (Vol. 2, pp. 505–508).
Mizumori, S. J. Y., Ward, K. E., & Lavoie, A. M. (1992). Medial septal modulation of entorhinal single unit activity in anesthetized and freely moving rats. Brain Res., 570, 188–197.
Morris, R. G. M., Garrud, P., Rawlins, J. N. P., & O'Keefe, J. (1982). Place navigation impaired in rats with hippocampal lesions. Nature, 297, 681–683.
Moser, E. I. (1996). Altered inhibition of dentate granule cells during spatial learning in an exploration task. J. Neurosci., 16, 1247–1259.
Muller, R. U., & Kubie, J. L. (1987). The effects of changes in the environment on the spatial firing properties of hippocampal place cells. J. Neurosci., 7, 1951–1968.
Muller, R. U., Kubie, J. L., & Ranck, J. B., Jr. (1987). Spatial firing patterns of
hippocampal complex-spike cells in a fixed environment. J. Neurosci., 7, 1935–1950.
Muller, R. U., Kubie, J. L., & Saypoff, R. (1991). The hippocampus as a cognitive graph (abridged version). Hippocampus, 1, 243–246.
Muller, R. U., & Stead, M. (1996). Hippocampal place cells connected by Hebbian synapses can solve spatial problems. Hippocampus, 6, 709–719.
Nadler, J. V., Perry, B. W., & Cotman, C. W. (1978). Intraventricular kainic acid preferentially destroys hippocampal pyramidal cells. Nature, 271, 676–677.
Nadler, J. V., Perry, B. W., & Cotman, C. W. (1980). Selective reinnervation of hippocampal area CA1 and the fascia dentata after destruction of CA3-CA4 afferents with kainic acid. Brain Res., 182, 1–9.
O'Keefe, J., & Burgess, N. (1996). Geometric determinants of the place fields of hippocampal neurons. Nature, 381, 425–428.
O'Keefe, J., & Dostrovsky, J. (1971). The hippocampus as a spatial map: Preliminary evidence from unit activity in the freely moving rat. Brain Res., 34, 171–175.
O'Keefe, J., & Nadel, L. (1978). The hippocampus as a cognitive map. Oxford: Clarendon Press.
O'Keefe, J., & Speakman, A. (1987). Single unit activity in the rat hippocampus during a spatial memory task. Exp. Br. Res., 68, 1–27.
O'Reilly, R. C., & McClelland, J. L. (1994). Hippocampal conjunctive encoding, storage, and recall: Avoiding a trade-off. Hippocampus, 4, 661–682.
Quirk, G. J., Muller, R. U., & Kubie, J. L. (1990). The firing of hippocampal place cells in the dark depends on the rat's recent experience. J. Neurosci., 10, 2008–2017.
Quirk, G. J., Muller, R. U., Kubie, J. L., & Ranck, J. B., Jr. (1992). The positional firing properties of medial entorhinal neurons: Description and comparison with hippocampal place cells. J. Neurosci., 12, 1945–1963.
Redish, A. D. (1997). Beyond the cognitive map: Contributions to a computational neuroscience theory of rodent navigation. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.
Redish, A. D., & Touretzky, D.
S. (1997). Cognitive maps beyond the hippocampus. Hippocampus, 7, 15–35. Redish, A. D., & Touretzky, D. S. (1998). The role of the hippocampus in solving the Morris water maze. Neural Computation, 10, 73–111. Redish, A. D., & Touretzky, D. S. (1999). Separating hippocampal maps. In N. Burgess, K. Jeffery, & J. O’Keefe (Eds.), Spatial functions of the hippocampal formation and the parietal cortex. New York: Oxford University Press. Rolls, E. T. (1996). A theory of hippocampal function in memory. Hippocampus, 6, 601–620. Rotenberg, A., & Muller, R. U. (1997). Variable place-cell coupling to a continuously viewed stimulus: Evidence that the hippocampus acts as a perceptual system. Phil. Trans. R. Soc. Lond. B, 352, 1505–1513. Samsonovich, A. (1999). Theory of “partial remapping” in the hippocampal representation of space. Unpublished manuscript, University of Arizona. Samsonovich, A., & McNaughton, B. L. (1997). Path integration and cognitive
mapping in a continuous attractor neural network model. J. Neurosci., 17, 5900–5920.
Scharfman, H. E. (1991). Dentate hilar cells with dendrites in the molecular layer have lower thresholds for synaptic activation by perforant path than granule cells. J. Neurosci., 11, 1660–1673.
Scharfman, H. E. (1995). Electrophysiological evidence that dentate hilar mossy cells are excitatory and innervate both granule cells and interneurons. J. Neurophysiol., 74, 179–194.
Scoville, W. B., & Milner, B. (1957). Loss of recent memory after bilateral hippocampal lesions. J. Neurol. Psychiatr., 20, 11–21.
Shapiro, M. L., Tanila, H., & Eichenbaum, H. (1997). Cues that hippocampal place cells encode: Dynamic and hierarchical representation of local and distal stimuli. Hippocampus, 7, 624–642.
Sharp, P. E. (1997). Subicular cells generate similar spatial firing patterns in two geometrically and visually distinctive environments: Comparison with hippocampal place cells. Behav. Brain Res., 85, 71–92.
Sharp, P. E., Kubie, J. L., & Muller, R. U. (1990). Firing properties of hippocampal neurons in a visually symmetrical environment: Contributions of multiple sensory cues and mnemonic processes. J. Neurosci., 10, 3093–3105.
Shen, B., & McNaughton, B. L. (1996). Modeling the spontaneous reactivation of experience-specific hippocampal cell assemblies during sleep. Hippocampus, 6, 685–693.
Shen, J., McNaughton, B. L., & Barnes, C. A. (1997). Reactivation of neuronal ensembles in hippocampal dentate gyrus during sleep following spatial experience. Soc. Neurosci. Abstr., 23, 506.
Skaggs, W. E., & McNaughton, B. L. (1996). Replay of neuronal firing sequences in rat hippocampus during sleep following spatial experience. Science, 271, 1870–1873.
Skaggs, W. E., & McNaughton, B. L. (1998). Spatial firing properties of hippocampal CA1 populations in an environment containing two visually identical regions. J. Neurosci., 18, 8455–8466.
Sloviter, R. S. (1987).
Decreased hippocampal inhibition and a selective loss of interneurons in experimental epilepsy. Science, 235, 73–76.
Sloviter, R. S. (1994). The functional organization of the hippocampal dentate gyrus and its relevance to the pathogenesis of temporal lobe epilepsy. Ann. Neurol., 35, 640–654.
Sloviter, R. S., & Brisman, J. L. (1995). Lateral inhibition and granule cell synchrony in the rat hippocampal dentate gyrus. J. Neurosci., 15, 811–820.
Sohal, V. S., & Hasselmo, M. E. (1998). GABAB modulation improves sequence disambiguation in computational models of hippocampal region CA3. Hippocampus, 8, 171–193.
Soltesz, I., Bourassa, J., & Deschenes, M. (1993). The behavior of mossy cells of the rat dentate gyrus during theta oscillations in vivo. Neuroscience, 57, 555–564.
Squire, L. R. (1992). Memory and the hippocampus: A synthesis from findings with rats, monkeys, and humans. Psych. Rev., 99, 195–231.
Squire, L. R., & Zola-Morgan, S. (1988). Memory: Brain systems and behavior.
Trends in Neurosci., 11, 170–175.
Struble, R. G., Desmond, N. L., & Levy, W. B. (1978). Anatomical evidence for interlamellar inhibition in the fascia dentata. Brain Res., 152, 580–585.
Tanila, H., Shapiro, M. L., & Eichenbaum, H. (1997). Discordance of spatial representation in ensembles of hippocampal place cells. Hippocampus, 7, 613–623.
Thompson, L. T., & Best, P. J. (1989). Place cells and silent cells in the hippocampus of freely-behaving rats. J. Neurosci., 9, 2382–2390.
Thompson, L. T., & Best, P. J. (1990). Long-term stability of the place-field activity of single units recorded from the dorsal hippocampus of freely behaving rats. Brain Res., 509, 299–318.
Touretzky, D. S., & Redish, A. D. (1996). Theory of rodent navigation based on interacting representations of space. Hippocampus, 6, 247–270.
Treves, A., & Rolls, E. T. (1992). Computational constraints suggest the need for two distinct input systems to the hippocampal CA3 network. Hippocampus, 2, 189–200.
Vanderwolf, C. H. (1969). Hippocampal electrical activity and voluntary movement in the rat. EEG Clin. Neurophysiol., 26, 407–418.
Willshaw, D. J., Buneman, O. P., & Longuet-Higgins, H. C. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Wilson, M. A., & McNaughton, B. L. (1993). Dynamics of the hippocampal ensemble code for space. Science, 261, 1055–1058.
Wilson, M. A., & McNaughton, B. L. (1994). Reactivation of hippocampal ensemble memories during sleep. Science, 265, 676–679.
Wu, X., Baxter, R. A., & Levy, W. B. (1995). Context codes and the effect of noisy learning on a simplified hippocampal CA3 model. Biol. Cybern., 74, 159–165.
Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79, 1017–1044.

Received September 11, 1998; accepted May 5, 1999.
NOTE
Communicated by Carl van Vreeswijk
The Approach of a Neuron Population Firing Rate to a New Equilibrium: An Exact Theoretical Result

B. W. Knight
Rockefeller University, New York, NY 10021, and Laboratory of Applied Mathematics, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.

A. Omurtag
Laboratory of Applied Mathematics, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.

L. Sirovich
Rockefeller University, New York, NY 10021, and Laboratory of Applied Mathematics, Mount Sinai School of Medicine, New York, NY 10029, U.S.A.

The response of a noninteracting population of identical neurons to a step change in steady synaptic input can be calculated exactly and analytically from the dynamical equation that describes the population's evolution in time. Here, for model integrate-and-fire neurons that undergo a fixed finite upward shift in voltage in response to each synaptic event, we compare the theoretical prediction with the result of a direct simulation of 90,000 model neurons. The degree of agreement supports the applicability of the population dynamics equation. The theoretical prediction is in the form of a series. Convergence is rapid, so that the full result is well approximated by a few terms.

1 Introduction and Results
© 2000 Massachusetts Institute of Technology. Neural Computation, 12, 1045–1055 (2000).

A population-dynamics approach has been introduced as a means for the efficient simulation of the activity of large neuron populations (Knight, Manin, & Sirovich, 1996). Recent studies support and verify the power of this new approach (Omurtag, Knight, & Sirovich, 2000; Sirovich, Knight, & Omurtag, forthcoming; Nykamp & Tranchina, 2000). A general treatment of this approach with further analytic exploration appears in Knight (2000). In this article we follow up on one such feature. It is commonly the case that a dynamical system, which is not disturbed by any time-varying external inputs, will possess a time-independent equilibrium configuration. It also is typically the case that when a dynamical system is in a state near—but not at—that equilibrium, we may describe the early evolution of its subsequent motion in terms of the changing amplitudes of a set of normal modes that in effect impose a privileged set of
coordinates on the system's state-space in the neighborhood of the equilibrium point. Normal modes come as either singles, whose amplitudes change exponentially with time, or natural pairs, whose coordinated amplitudes change together in time as sine and cosine multiplied by an envelope that changes exponentially with time. An arbitrary initial condition near equilibrium will yield a time evolution that is a superposition of these motions. The dynamical equation that describes the time evolution of a noninteracting neuron population density conforms to the general rule just quoted. However, there is an important additional feature: because the neurons are noninteracting, superposition of component probabilities is respected by the time evolution of their density function, and so the dynamical equation is linear in the population density. Because of this, the general rule quoted above is exact for the population dynamics equation: the rule describes the nature of our system's evolution toward equilibrium even if we do not confine our initial condition to the neighborhood of the equilibrium point. If the neuron population receives an input signal that is time independent at one value and then jumps abruptly to a new value, where it holds steady again, the prejump equilibrium point becomes a specific postjump nonequilibrium point from which the system will proceed to evolve to the new equilibrium point in accordance with our rule, and the transient response of the population firing rate will reflect this dynamics. All of this we have reasoned out here without so far having to define a specific neuron model and without having to turn to any explicit mathematics. To proceed to quantitative predictions, we must have an explicit neuronal model and, from the dynamical equation for its population density, calculate the following things:

1. The equilibrium points before and after the jump in input.

2.
The normal modes for the postjump equilibrium point, and their characteristic exponential decay rates and sinusoidal frequencies.

3. The coefficients that express the prejump equilibrium point as a superposition of those normal modes.

4. As the calculations above yield an explicit expression for the population density's time course, all that remains is to express the population firing rate in terms of that density.

The result of these steps is an explicit expression (Knight, 2000) stating that under common circumstances, the new equilibrium firing rate is approached with a time course that is an exponentially decaying sinusoid, whose oscillation rate is near to the single-neuron firing rate at the new equilibrium. We compare the analytic result with a direct simulation of a population of 90,000 neurons. These are integrate-and-fire neurons with firing threshold
Figure 1: Prediction of simulation results by theory. A population of 90,000 neurons was simulated and per-neuron firing rate (irregular thin line) calculated from impulses in each millisecond. Theoretical equilibrium levels are 4.54, 11.92, 24.79 sec⁻¹. For comparison, the damped transients have theoretical principal-mode frequencies of 5.77, 11.50, 24.70 sec⁻¹.
voltage scaled to unity, with an ohmic shunt that gives a decay rate constant of γ = 20 sec⁻¹, and for which each synaptic event advances the dimensionless scaled voltage variable toward threshold by an instantaneous step of h = 0.03. These nominal values are representative of realistic measures. The input to each neuron is a Poisson process with a common specified mean rate. To simplify comparison with the more familiar nonstochastic limit (h → 0), it is more convenient to specify the scaled mean input current s (sec⁻¹), which is the mean input rate times the synaptic step h. (In the nonstochastic limit and without ohmic leakage, s would be equal to the neuron firing rate; however, with the presence of ohmic leakage, the nonstochastic neuron's firing rate drops to zero when s is moved to less than γ. For large s, the firing rate of the nonstochastic neuron with leakage approaches s − γ/2 from below. The following section furnishes some theoretical details.) Figure 1 shows the results of four direct simulation numerical experiments in which s is jumped to the three steady values 18, 24, 36 sec⁻¹ and the firing rates find equilibrium at 4.54, 11.92, 24.79 sec⁻¹. The larger two values are not far below their nonstochastic-and-large-s estimated values (s − γ/2) of 14 and 26 sec⁻¹. However, the lowest input (s = 18 sec⁻¹ < γ = 20 sec⁻¹) is below the corresponding nonstochastic firing threshold, and the firing that we see depends on unusually closely spaced clusters of synaptic events
which are permitted by the Poisson arrival statistics. In Figure 1, for graphic presentation the neuron firings have been averaged over 1-millisecond bins. The transients that lead to equilibrium in Figure 1 convey further information. In each case the exact analytical solution is given by the smooth line, which passes centrally through the slightly noisy direct simulation result. There is no indication of any systematic departure of the exact analytical solution from the results of the direct simulation. In each of the four cases, after a fraction of a cycle, the return of the transient to equilibrium takes the form of a damped sinusoid. Although it is hard to see in the highly damped case of the lowest equilibrium firing rate, in each case the frequency of the sinusoid lies close to the mean firing rate at equilibrium. In Figure 1, the amplitude of the residual noise in the simulation results is inversely proportional to the square root of the number of neurons simulated. We see that to approach the precision of the theoretical curves, simulation of about 9 million neurons would be required.

2 Some Further Theoretical Details
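Before developing the theory, the step-input experiment of Section 1 can be made concrete with a short direct simulation. This is a minimal stand-in of our own devising, not the authors' code: it uses simple forward-Euler time stepping, a population of 5,000 rather than 90,000 neurons, and the decay rate constant of 20 sec⁻¹ and synaptic step of 0.03 quoted in Section 1.

```python
import numpy as np

rng = np.random.default_rng(0)

GAMMA = 20.0    # ohmic decay rate constant (1/s), value from the text
H = 0.03        # voltage step per synaptic event, value from the text
DT = 1e-4       # Euler time step (s); our choice
N = 5_000       # population size (the article simulated 90,000)

def firing_rate(s, T=0.5):
    """Simulate N leaky integrate-and-fire neurons (threshold 1, reset 0)
    driven by independent Poisson inputs with scaled mean current s = rate*H.
    Returns the per-neuron firing rate (1/s) in 1-ms bins."""
    rate = s / H                            # Poisson event rate per neuron
    v = rng.uniform(0.0, 1.0, N)            # spread of initial voltages
    steps = int(round(T / DT))
    spikes = np.empty(steps)
    for k in range(steps):
        v += H * rng.poisson(rate * DT, N)  # synaptic arrivals in DT
        v -= GAMMA * v * DT                 # ohmic leak
        fired = v >= 1.0
        spikes[k] = fired.sum()
        v[fired] = 0.0                      # reset after a firing
    bins = spikes.reshape(-1, int(round(1e-3 / DT))).sum(axis=1)
    return bins / (N * 1e-3)

for s in (18.0, 24.0, 36.0):                # the three steady inputs of Figure 1
    r = firing_rate(s)
    print(s, r[-100:].mean())               # late-time (equilibrium) rate
```

With these settings the late-time rates should land near the equilibrium levels (4.54, 11.92, 24.79 sec⁻¹) reported for Figure 1, up to discretization error and the larger sampling noise of the smaller population.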
The exact analytical solution of the return-to-equilibrium problem follows directly from the general population dynamics theory and has been presented in some detail in Knight (2000). Here we give a brief general development and then discuss the specifics of the case that was used in the simulation. The momentary state of a model neuron is specified by values of internal variables such as its transmembrane potential and the channel variables that determine its various transmembrane ionic conductances. The neuron's state may be thought of as a point in a state-space in which the possible values of the internal variables specify a coordinate system. As time advances, the neuron's point in the state-space moves in accordance with a specified dynamics. A set of similar neurons corresponds to a set of points, and typically their motions will depend on a mean external input s, which they share in common, and also on stochastic inputs, which they receive independently. The momentary state of a large population of similar neurons can be described by a population density ρ, which specifies the likelihood that a neuron blindly picked at random will be found in the vicinity of any specified point in the state-space. The population density thus integrates to unity. It evolves in time in accordance with a dynamical equation of the general form

∂ρ/∂t = Q(s)ρ,    (2.1)
where the right-hand side is specified by the law of motion of the individual neurons and is a function, over the state-space, that depends linearly on ρ. Thus Q(s) is technically a linear operator, which depends on the common input s.
In our situation, s remains fixed after an abrupt jump. In this circumstance it is a generic property (essentially a commonly occurring feature) of linear operators that in the state-space we can find a set of eigenfunctions, φ_n, such that the action of the linear operator on these is simply to multiply by an eigenvalue λ_n according to the relation

Qφ_n = λ_n φ_n.    (2.2)
The eigenfunctions have the further important property that any function in the state-space may be expressed as a weighted sum of them. In particular we can do this for the population density ρ, though we must observe that ρ evolves in time according to equation 2.1, and so the weightings in its eigenfunction sum must evolve as well. Equation 2.1 in fact demands that ρ must be of the form

ρ = Σ_n a_n e^{λ_n t} φ_n.    (2.3)
Substitute equation 2.3 in 2.1. Perform the time differentiation on the left, and use equation 2.2 on the right to confirm that 2.3 is indeed a solution. Below we will see that a few terms of equation 2.3 furnish a close approximation to the full sum. Now equation 2.1 has an equilibrium solution that is independent of time. From either equation 2.1 or 2.3, the equilibrium solution is clearly an eigenfunction φ_0 of equation 2.2 with the eigenvalue

λ_0 = 0.    (2.4)
(See Sirovich et al., forthcoming, for a general treatment of the equilibrium problem.) From equation 2.3, we note that all the other eigenfunctions have the properties that were discussed in Section 1 as the properties of normal modes. In fact the eigenvalues λ_n of equation 2.2 typically will be complex numbers with negative real parts, arising in complex conjugate pairs. It is convenient to assign positive integers to them in order of increasingly negative real part, and corresponding negative integers to those eigenvalues that are the complex conjugates of eigenvalues whose imaginary parts are positive. The eigenfunctions typically will be complex and occur in conjugate pairs; similarly their coefficients in the sum for ρ, equation 2.3, will be complex conjugates, so that the sum breaks up into natural pairs of terms. Each pair is real, and as the time t advances, the nth pair decrements exponentially and oscillates sinusoidally according to the rate and the radian frequency expressed by the real and imaginary parts of the nth eigenvalue. To determine the population density ρ, we still have to evaluate the constants a_n in equation 2.3. If we set t to zero in equation 2.3, the summation
B. W. Knight, A. Omurtag, and L. Sirovich
must give us the equilibrium distribution of the population density before the input s underwent its abrupt jump. Let us call this equilibrium density $\varphi_0^{(-)}$; we must expand it in terms of the eigenfunctions after the jump. Pairs of complex-valued functions that are supported by our state-space have a natural bilinear inner product. If x denotes a point in the state-space, and v(x) and u(x) are functions, their natural inner product is

$$(v, u) = \int dx\, \left(v(x)\right)^{*} u(x), \tag{2.5}$$

where the integral extends over the state-space. The set of eigenfunctions from equation 2.2 determines a companion set $\hat{\varphi}_n$ of biorthonormal functions that satisfy the inner product relations

$$\left(\hat{\varphi}_m, \varphi_n\right) = \delta_{mn}, \tag{2.6}$$

where δ_mn is unity if m = n and zero otherwise. This biorthonormal set is determined by equation 2.6 and may be constructed from the φ_n by straightforward procedures of linear algebra. Alternatively, if equation 2.1 is discretized for computational purposes, standard software will compute the $\hat{\varphi}_n$ as the eigenvectors of the matrix transpose of Q. If t is set to zero in equation 2.3, we have

$$\varphi_0^{(-)} = \sum_n a_n \varphi_n. \tag{2.7}$$

With the prejump steady-state distribution $\varphi_0^{(-)}$ on hand, we may evaluate the as-yet-unknown coefficients a_n by

$$a_m = \left(\hat{\varphi}_m, \varphi_0^{(-)}\right); \tag{2.8}$$

if we substitute equation 2.7 into 2.8 and use 2.6, we confirm 2.8 as an identity. The postjump time course of ρ, given by equation 2.3, is now fully evaluated. The methodology just reviewed is classical. Stability near equilibrium in nonlinear systems is extensively discussed by Iooss and Joseph (1980). The eigenvalue theory of linear operators, with applications, is discussed in much detail by Morse and Feshbach (1953) and more briefly by Page (1955). To find the time course of the population firing rate, as implied by the dynamics that we have solved, we need some further structure of our neuron model. As we follow the trajectory of a neuron in its state-space, at some point on that trajectory we can say that the neuron has just fired an impulse. As we look at the set of all trajectory lines, each has a firing point. The rate (on a per-neuron basis) at which the noninteracting neurons of a large population pass their firing points must depend in a linear way on the population density function. We may write

$$r = R\{\rho\}; \tag{2.9}$$
the firing rate r is a linear functional of the population density ρ. If we substitute ρ from equation 2.3 into 2.9, we obtain

$$r(t) = \sum_n a_n\, e^{\lambda_n t}\, R\{\varphi_n\} = \sum_n \left(\hat{\varphi}_n, \varphi_0^{(-)}\right) R\{\varphi_n\}\, e^{\lambda_n t}. \tag{2.10}$$
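The chain from equation 2.2 through equation 2.10 can be exercised numerically. The following sketch is not the authors' code: the operator Q here is a fabricated random probability-conserving generator, and R is an arbitrary linear functional standing in for the firing-rate functional. It builds the biorthonormal set from the inverse of the eigenvector matrix, extracts the coefficients a_m of equation 2.8, and checks that the eigenmode sum of equation 2.10 agrees with the evolved density.

```python
import numpy as np

# Sketch of equations 2.2-2.10 on a small discretized operator.  Q is a
# fabricated random generator whose columns sum to zero, so it conserves
# probability and has a lambda_0 = 0 equilibrium mode (eq. 2.4).
rng = np.random.default_rng(0)
N = 30
Q = rng.random((N, N))
Q -= np.diag(Q.sum(axis=0))              # columns sum to zero

lam, phi = np.linalg.eig(Q)              # eigenmodes, eq. 2.2
phi_hat = np.linalg.inv(phi).conj().T    # biorthonormal partners, eq. 2.6
assert np.allclose(phi_hat.conj().T @ phi, np.eye(N))  # (phi_hat_m, phi_n) = delta_mn

rho0 = rng.random(N)
rho0 /= rho0.sum()                       # prejump density, integrates to unity
a = phi_hat.conj().T @ rho0              # coefficients a_m, eq. 2.8
assert np.allclose((phi @ a).real, rho0) # the expansion of eq. 2.7 at t = 0

t = 0.5
rho_t = (phi * np.exp(lam * t)) @ a      # rho(t) = sum_n a_n e^{lam_n t} phi_n, eq. 2.3
assert abs(rho_t.sum().real - 1.0) < 1e-8   # probability stays conserved

R = rng.random(N)                        # an arbitrary linear functional, eq. 2.9
r_modes = np.sum(a * np.exp(lam * t) * (R @ phi))  # eq. 2.10, mode by mode
assert np.allclose(r_modes, R @ rho_t)
```

With the genuine discretized operator of equation 2.16 in place of the random Q, the same few lines reproduce the transient firing-rate curves.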
Once we have the prejump equilibrium distribution, the postjump eigenfunctions and eigenvalues, and the firing-rate linear functional, this is an explicit and exact evaluation. Equation 2.10 was used to calculate the smooth lines in Figure 1. The neurons in our simulation follow the dynamical equation

$$dx/dt = -\gamma x + \sum_n h\, \delta(t - t_n), \tag{2.11}$$
where x is the dimensionless voltage, and the synaptic input times, t_n, are drawn from a Poisson process with mean rate s/h and are independently chosen for each neuron. When a synaptic input jumps a neuron beyond x = 1, the neuron is reset to x = 0 and a nerve impulse is registered. If we hold s fixed and let h go to zero, equation 2.11 goes to

$$dx/dt = -\gamma x + s, \tag{2.12}$$
which is the familiar equation for the forgetful or "leaky" integrate-and-fire neuron. We note that if γ is larger than s, then

$$x = s/\gamma \tag{2.13}$$
gives an equilibrium solution to equation 2.12 that is below the firing threshold. Equation 2.12 may be rearranged to the form dt = dx/(s − γx), or

$$T = \int_0^1 dx/(s - \gamma x) = -\frac{1}{\gamma}\ln\left(1 - \frac{\gamma}{s}\right) \tag{2.14}$$
for the firing period, unless small s makes s/γ less than unity. The firing rate is the reciprocal firing period,

$$r = 1/T \to s - \tfrac{1}{2}\gamma \quad \text{for large } s, \tag{2.15}$$

as mentioned in section 1.
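As a concrete check of equations 2.14 and 2.15, one can compare the analytic firing period with an event-driven Monte Carlo run of equation 2.11. This is a sketch with illustrative parameter values (γ, s, h, and the run length below are assumptions, not the paper's settings):

```python
import numpy as np

gamma = 20.0          # leak rate (illustrative)

def period(s, gamma):
    """Firing period T of the smooth limit, eq. 2.14 (requires s > gamma)."""
    return -np.log(1.0 - gamma / s) / gamma

# Large-s limit, eq. 2.15: r = 1/T approaches s - gamma/2
s = 500.0
assert abs(1.0 / period(s, gamma) - (s - gamma / 2.0)) < 0.1

# Event-driven Monte Carlo of eq. 2.11: jumps of size h at Poisson rate s/h,
# exponential leak between jumps, reset upon crossing x = 1.
rng = np.random.default_rng(1)
h = 0.01
x, spikes, t, T_total = 0.0, 0, 0.0, 0.2
while t < T_total:
    dt = rng.exponential(h / s)      # interarrival time of synaptic events
    x *= np.exp(-gamma * dt)         # leak between events
    x += h                           # synaptic jump
    t += dt
    if x > 1.0:                      # threshold crossing registers a spike
        spikes += 1
        x = 0.0
rate = spikes / T_total
assert 0.8 * (s - gamma / 2.0) < rate < 1.2 * (s - gamma / 2.0)
```

For small h the empirical rate tracks the smooth-limit rate of equation 2.15 closely.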
The dynamical equation for the population density that follows from equation 2.11 is

$$\frac{\partial \rho(x,t)}{\partial t} = \frac{\partial}{\partial x}\left[\gamma x\, \rho(x,t)\right] + \frac{s}{h}\left[\rho(x - h, t) - \rho(x, t)\right]. \tag{2.16}$$
The first right-hand term corresponds to backward advective drift in voltage as the current γx leaks back through the ohmic shunt; the second term (absent if x < h) corresponds to density accumulation as neurons jump in from voltage x − h. The third term similarly corresponds to density depletion as neurons jump away from the voltage x. The s-dependent right-hand side expresses a linear transformation upon ρ, and so defines the linear operator Q(s) in equation 2.1 for this particular case. Equation 2.16 was studied by Wilbur and Rinzel (1982) and in further detail by Omurtag et al. (2000) and Sirovich et al. (forthcoming). As shown in Omurtag et al. (2000), equation 2.16 follows rigorously from equation 2.11 and the Poisson specification of synaptic input times, without introduction of free parameters. (If in equation 2.16 we were to expand the term that is offset in h as a Taylor series in h and keep terms through h², we would reduce the equation to a second-order partial differential equation that describes the dynamics of the probability density in terms of advection and diffusion in state-space. Such an approximate dynamical equation for the probability density is technically known as a Fokker-Planck equation and is extensively discussed by Risken, 1996. Application of the Fokker-Planck equation to the integrate-and-fire neuron model is discussed by Tuckwell, 1988.) The population firing rate linear functional R{ρ} in the present case is the rate at which neurons make a final synaptic jump that carries them beyond x = 1. It is proportional to the synaptic rate s/h and also clearly depends equally on ρ(x) for all values of x larger than 1 − h. It is

$$r(t) = R\{\rho\} = \frac{s}{h}\int_{1-h}^{1} dx\, \rho(x, t). \tag{2.17}$$
The specific case of the eigenvalue equation 2.2 that corresponds to the dynamical equation 2.16 is

$$\frac{d}{dx}\left[\gamma x\, \varphi_n(x)\right] + \frac{s}{h}\left[\varphi_n(x - h) - \varphi_n(x)\right] = \lambda_n \varphi_n(x). \tag{2.18}$$
We have discretized this equation into 200 compartments. We have also discretized equation 2.17, which serves as the input into the first compartment of the discretized equation 2.18. We have solved the resulting system for its eigenvalues and eigenfunctions. These compare well with corresponding
Figure 2: Evolution of the population density ρ(x, t) following a change of mean input current from 18 sec⁻¹ to 24 sec⁻¹.
eigenvalues and eigenfunctions obtained by approximate analytic methods (Omurtag et al., 2000; Sirovich et al., forthcoming). These eigenvalues and eigenfunctions (including $\varphi_0^{(-)}$) and the discretized evaluation of R{φ_n} from equation 2.17 have been used in equation 2.10 to calculate the smooth lines in Figure 1. Figure 2 gives a perspective view of the evolution of the population density, as time advances from the back of the figure toward the viewer, in the case where the average current suddenly shifts from its lowest setting of 18 sec⁻¹ up to 24 sec⁻¹. Initially the population density is mostly concentrated in the upper end of the range, where forward jumping and backward advection approximately balance. With the onset of more frequent forward jumping, at time t = 0, this accumulation is swept away and is returned to the low-voltage region. Then it is swept forward as it disperses and arrives once more at the high-voltage end after about 1/10 sec, which is consistent with the new single-neuron firing rate of 11.92/sec. We are able to follow this still-dispersing concentration through about one additional round trip,
Figure 3: Convergence of partial sums to the full theoretical result.
consistent with the corresponding firing response in Figure 1, which goes through about two visible decrementing oscillations. The high peaks of probability concentration at low voltages arise because of the discrete voltage jump of h = .03, and they wash out because the irregularly timed jumping is not synchronized with the steady advective backward drift, which is due to capacitor discharge as ohmic current leaks through the model nerve membrane. The choice above of 200 voltage divisions for the numerical work was made to resolve these peaks adequately in the population density. Figure 3 addresses the convergence of the series in equation 2.10. For the case of s = 24 sec⁻¹, we plot the partial sums for both the ascending and descending transients. After the second passage across the equilibrium value, the first pair of eigenfunctions yields an accurate description of the transient in both cases. Between the first and second passages, the descending transient is fairly well fit by two pairs of eigenfunctions, while four pairs give a result indistinguishable from the full result. For the ascending transient in the same span, only two pairs are needed to replicate the full solution visually. In summary, we see that the exact analytic solution for the transient to a new equilibrium accounts well for the results of the direct simulation, and after the first equilibrium crossing, the full analytic result is excellently approximated by a short truncated sum that may be expressed in terms of a few parameters.
The manner in which this same sort of treatment may be applied to more detailed neuron models, such as that of Hodgkin and Huxley (1952), is given some discussion in Knight (2000). The effect of feedback (where the output contributes to the input and the population dynamics consequently becomes nonlinear in the population density) is also discussed in Knight (2000), and with a somewhat different emphasis in Sirovich et al. (forthcoming). These publications are available at our web site.

Acknowledgments
This work has been supported by NIH/NEI EY11276, NIH/NIMH MH50166, and ONR N00014-96-1-0492. Publications from the Laboratory of Applied Mathematics may be found online at http://camelot.mssm.edu/.

References

Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol., 117, 500–544.
Iooss, G., & Joseph, D. D. (1980). Elementary stability and bifurcation theory. New York: Springer-Verlag.
Knight, B. W. (2000). Dynamics of encoding in neuron populations: Some general mathematical features. Neural Computation, 12, 497–542.
Knight, B. W., Manin, D., & Sirovich, L. (1996). Dynamical models of interacting neuron populations. In E. C. Geil (Ed.), Symposium on Robotics and Cybernetics: Computational Engineering in Systems Applications. Lille, France: Cite Scientifique.
Morse, P. M., & Feshbach, H. (1953). Methods of theoretical physics. New York: McGraw-Hill.
Nykamp, D., & Tranchina, D. (2000). A population density approach that facilitates large-scale modeling of neural networks: Analysis and an application to orientation tuning. J. Comp. Neurosci., 8, 19–50.
Omurtag, A., Knight, B. W., & Sirovich, L. (2000). On the simulation of large populations of neurons. J. Comp. Neurosci., 8, 51–63.
Page, C. H. (1955). Physical mathematics. New York: Van Nostrand.
Risken, H. (1996). The Fokker-Planck equation: Methods of solution and applications. New York: Springer-Verlag.
Sirovich, L., Knight, B. W., & Omurtag, A. (Forthcoming). Dynamics of neuronal populations: The equilibrium solution. SIAM.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
Wilbur, W. J., & Rinzel, J. (1982). An analysis of Stein's model for stochastic neuronal excitation. Biological Cybernetics, 45, 107–114.

Received February 16, 1999; accepted May 20, 1999.
NOTE
Communicated by Hans van Hateren
Formation of Direction Selectivity in Natural Scene Environments

Brian Blais, Leon N. Cooper, and Harel Shouval
Departments of Physics and Neuroscience, Institute for Brain and Neural Systems, Brown University, Providence, RI 02912, U.S.A.

Most simple and complex cells in the cat striate cortex are both orientation and direction selective. In this article we use single-cell learning rules to develop both orientation and direction selectivity in a natural scene environment. We show that a simple principal component analysis rule is inadequate for developing direction selectivity, but that the BCM rule, as well as similar higher-order rules, can. We also demonstrate that the convergence of lagged and nonlagged cells depends on the velocity of motion in the environment, and that strobe rearing disrupts this convergence, resulting in a loss of direction selectivity.

1 Introduction
Most simple and complex cells in the cat striate cortex are both orientation (Hubel & Wiesel, 1959, 1962) and direction selective (Hammond, 1978; Reid, Soodak, & Shapley, 1991; Deangelis, Ohzawa, & Freeman, 1995). At the preferred orientation, a cell that is direction selective responds to a drifting grating moving in one direction more strongly than to the opposite direction. The ability of the cell to detect the direction of motion depends on the interaction of responses to at least two different points in the visual field at different times; that is to say, it depends on the spatiotemporal receptive field of the cell (Reid et al., 1991). A cell that is not direction selective (non-DS) has a maximum response to a sine grating moving in one direction equal to its maximum response to a sine grating moving in the opposite direction. In a linear approximation, the response of a cell, c(t), can be written as a convolution between the spatiotemporal input pattern, I(x, t), and a spatiotemporal receptive field kernel, K(x, t), giving

$$c(t) = \int_{-\infty}^{+\infty} dx' \int_{-\infty}^{t} dt'\, I(x', t')\, K(x', t - t'). \tag{1.1}$$

Neural Computation 12, 1057–1066 (2000) © 2000 Massachusetts Institute of Technology

Brian Blais, Leon N. Cooper, and Harel Shouval

It is easy to show that in this approximation, a DS cell must have a spatiotemporal (ST) inseparable receptive field; that is, the kernel cannot be expressed as K(x, t) = F(t)G(x), where F(t) and G(x) are functions that depend on only time and space, respectively. There are many models of direction selectivity (Barlow & Levick, 1965; Adelson & Bergen, 1985; Watson & Ahumada, 1985; Burr, 1981). In all of these models, the response of the cell is determined by receptive fields that have different temporal response properties at different spatial locations (i.e., spatiotemporally inseparable). This can be realized by the appropriate spatial positioning of the receptive fields and the introduction of temporal shifts. These temporal shifts could possibly arise from delays caused by cortical loops (Suarez, Koch, & Douglas, 1995; Maex & Orban, 1996), phase advances caused by depressing synapses (Chance, Nelson, & Abbott, 1998), or lagged responses in the lateral geniculate nucleus (LGN; Mastronarde, 1987; Saul & Humphrey, 1990). In this article we present a feedforward model of the development of direction selectivity, which includes the effects of two types of LGN cells, called lagged and nonlagged cells, that differ only in their response timing. This is similar to a previous model (Feidler, Saul, Murthy, & Humphrey, 1997), differing significantly in the input environment used. The previous model used a neuron with three inputs, each with sinusoidally varying activations governed by a single parameter. In our case, we use a natural scene environment (Law & Cooper, 1994), providing a more realistic correspondence with biology and a more direct connection to experiment. We examine two types of visual environments, differing primarily in their temporal properties. In one environment, static natural images are used with a simple model of eye movements to provide motion over the cortical receptive field. In the other, natural image movies are used to provide motion over the receptive field.
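The separability condition above can be checked numerically: sampled on a grid, a kernel K(x, t) = F(t)G(x) is a rank-1 matrix, so all of its energy sits in the leading singular component. The drifting-Gabor kernel below is our own illustration of an inseparable, DS-like receptive field, not a fitted one:

```python
import numpy as np

nx, nt = 16, 16
xs = np.linspace(-1, 1, nx)
ts = np.linspace(0, 1, nt)

# Separable kernel K(x,t) = F(t)G(x): an outer product, hence rank 1.
F = np.exp(-ts) * np.sin(8 * ts)
G = np.exp(-xs**2) * np.cos(6 * xs)
K_sep = np.outer(G, F)

# Inseparable kernel: a Gabor whose spatial phase drifts with time, the
# space-time-slanted structure a DS cell needs (illustrative parameters).
K_ds = np.array([[np.exp(-x**2) * np.cos(6 * x - 10 * t) for t in ts]
                 for x in xs])

def separability(K):
    """Fraction of kernel energy in the leading SVD component."""
    sv = np.linalg.svd(K, compute_uv=False)
    return sv[0]**2 / np.sum(sv**2)

assert separability(K_sep) > 0.999   # rank 1: fully separable
assert separability(K_ds) < 0.95     # energy spread over several components
```

This rank test is a convenient diagnostic for the receptive fields a learning rule produces: a separable (non-DS) outcome shows up as a near-unity leading-component fraction.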
Using these more realistic environments, we reproduce the observation of Feidler et al. (1997) that simple Hebb rules are incapable of producing DS cells, and that the BCM rule (Bienenstock, Cooper, & Munro, 1982) is capable of producing DS cells. We also consider several other statistically motivated rules (Blais, Intrator, Shouval, & Cooper, 1998), which have a similar form to the BCM rule, and explore how they develop direction selectivity. In addition, we provide an insight into the reason that direction selectivity is, or is not, produced by these rules. Finally, we compare these simulations to experiments in strobe light environments (Cynader, Berman, & Hein, 1973; Cynader & Chernenko, 1976; Humphrey & Saul, 1998).

2 Methods

We use as the visual environment 13 × 13 circular patches from 12 images of natural scenes processed with a retinal difference-of-gaussians filter (Law & Cooper, 1994). The cortical cell receives input from two sets of lateral geniculate nucleus (LGN) cells that view the same area of space but differ in their timing. The first set has a delayed response to the input (lagged cells)
relative to the second set (nonlagged cells). Essentially, the cortical cell has a receptive field that is composed of two different receptive fields that receive input from two different times at the same spatial location. We will refer to these as the lagged and the nonlagged RFs. Two possible factors contribute to motion in the visual environment: movements of the eyes and head, and movements of objects in the world. To model the former, input patches are chosen using a sequence of random saccades and drifts (Carpenter, 1977). A saccade is a large jump to a random part of an image, and a drift is a continuous motion within an image in a particular direction at a particular velocity. In the model, the drift velocity is kept constant, and the drifts last a random time. In between drifts are saccades to a different image or to part of the same image. Although this is a simplification of both the temporal properties of lagged and nonlagged cells and of the true input structure available to an animal in a dynamic environment composed of moving objects in addition to eye and head movements, the added complexities make no noticeable difference in the results (see Figure 3). In contrast, the simplification of the environment provides insights that may be obscured by the complexities, and it allows the velocities and delays to be more tightly controlled. We denote the input vector by d and the weight vector by m. Neural activity is given by the rectified product of the inputs and the weights, c = σ(d · m). The derivative of the sigmoidal is given simply by σ′. We consider several synaptic modification rules, such as the quadratic form of BCM (Bienenstock et al., 1982; Intrator & Cooper, 1992), as well as other statistically motivated rules that share the basic properties proposed by BCM (Blais et al., 1998). We include, for comparison, a stabilized Hebb rule used for extracting the principal component of the input (PCA) (Oja, 1982).
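As an illustration of the retinal preprocessing, a difference-of-gaussians filter over a 13 × 13 patch might look like the following sketch. The center and surround widths here are assumed for illustration; they are not the values of Law & Cooper (1994):

```python
import numpy as np

# Illustrative difference-of-gaussians (DOG) retinal filter on a 13 x 13
# grid; sigma_c and sigma_s are assumed widths, not the paper's values.
def dog_kernel(size=13, sigma_c=1.0, sigma_s=2.5):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_c**2)) / (2 * np.pi * sigma_c**2)
    surround = np.exp(-r2 / (2 * sigma_s**2)) / (2 * np.pi * sigma_s**2)
    return center - surround

k = dog_kernel()
# The DOG is band-pass: it responds to local contrast, not mean luminance,
# so its coefficients sum to (approximately) zero.
assert abs(k.sum()) < 0.05
assert k[6, 6] > 0 > k[6, 0]   # excitatory center, inhibitory surround
```

Convolving an image with such a kernel yields the contrast-coded patches that serve as input d to the learning rules below.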
These rules have the following form.

Quadratic BCM (Intrator & Cooper, 1992):

$$\frac{dm}{dt} = c\,(c - \Theta_M)\,\sigma' d = c\,\left(c - E[c^2]\right)\sigma' d. \tag{2.1}$$

Skewness [S1]. This rule is based on the statistical measure of skewness:

$$\frac{dm}{dt} = c\,\left(c - E[c^3]/E[c^2]\right)/\,E^{1.5}[c^2]\;\sigma' d. \tag{2.2}$$

Kurtosis [K1]. This rule is based on the statistical measure of kurtosis:

$$\frac{dm}{dt} = c\,\left(c^2 - E[c^4]/E[c^2]\right)/\,E^{2}[c^2]\;\sigma' d. \tag{2.3}$$

For all of these rules, we replace the spatial average, E[·], by a temporal average of the form

$$E[c^n(t)] \approx \frac{1}{\tau}\int_{-\infty}^{t} c^n(t')\, e^{-(t-t')/\tau}\, dt'.$$
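A single update step of the quadratic BCM rule (equation 2.1), with the sliding threshold Θ_M kept as a discrete exponential temporal average of c², can be sketched as follows. The rectification σ(u) = max(u, 0) and the discrete update constants are our assumptions for illustration:

```python
import numpy as np

# One step of quadratic BCM, eq. 2.1, with theta = E[c^2] maintained as a
# running exponential average.  sigma(u) = max(u, 0), so sigma' is a 0/1 gate.
# eta and tau are illustrative constants.
def bcm_step(m, d, theta, eta=0.005, tau=100.0):
    u = m @ d
    c = max(u, 0.0)                        # rectified response
    sprime = 1.0 if u > 0 else 0.0         # derivative of the rectification
    m = m + eta * c * (c - theta) * sprime * d
    theta = theta + (c**2 - theta) / tau   # temporal average of c^2
    return m, theta

m, theta = np.array([1.0, 0.0]), 1.0
m2, theta2 = bcm_step(m, np.array([2.0, 0.0]), theta)
# c = 2 > theta = 1, so the active weight is potentiated:
# dm = eta * 2 * (2 - 1) * d, and theta moves toward c^2 = 4.
assert np.allclose(m2, [1.02, 0.0])
assert abs(theta2 - 1.03) < 1e-12
```

Responses above the sliding threshold potentiate the active weights; responses below it depress them, which is the source of the selectivity these rules develop.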
Figure 1: Sample receptive fields (lagged and nonlagged) and their orientation tuning, for the BCM, S1, K1, and PCA rules, at a velocity of two pixels per iteration. The orientation tuning was obtained using drifting oriented sine gratings. Orientations larger than 180 degrees denote motion in the opposite direction. Tuning curves that have a larger response for one direction than the other are for direction-selective cells. The PCA neuron is the only one that did not achieve direction selectivity.
PCA (Oja, 1982):

$$\frac{dm}{dt} = c\,(d - c\,m). \tag{2.4}$$
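The stabilized Hebb (Oja) rule of equation 2.4 can be demonstrated on synthetic zero-mean input (a sketch; the data distribution, learning rate, and step count below are illustrative assumptions): the weight vector self-normalizes and aligns with the principal component of the input.

```python
import numpy as np

# Oja's rule, eq. 2.4, on zero-mean Gaussian input whose first axis
# carries most of the variance (illustrative parameters throughout).
rng = np.random.default_rng(0)
std = np.array([3.0, 1.0])            # input standard deviations per axis
m = rng.normal(size=2)                # random initial weights
eta = 0.005
for _ in range(20000):
    d = rng.normal(size=2) * std
    c = m @ d
    m += eta * c * (d - c * m)        # dm/dt = c(d - cm)

assert abs(np.linalg.norm(m) - 1.0) < 0.1      # self-normalizing
assert abs(m[0]) > 0.95 * np.linalg.norm(m)    # aligned with the first axis
```

The −c²m term is what stabilizes plain Hebbian growth; without it the weights diverge, and with it the fixed point is the unit principal eigenvector of the input correlation matrix.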
We measure direction selectivity using the DS index, defined as

$$DS \equiv \frac{R(\text{preferred}) - R(\text{nonpreferred})}{R(\text{preferred}) + R(\text{nonpreferred})}, \tag{2.5}$$
where R(preferred) and R(nonpreferred) are the responses to a sine grating, at optimum orientation and spatial frequency, moving in the preferred and nonpreferred directions, respectively.

3 Results

Example receptive fields and their orientation tuning, for a drift velocity of 2 pixels per iteration, are shown in Figure 1. The orientation tuning was obtained using drifting oriented sine gratings. It is clear that the PCA rule did not develop direction selectivity, whereas the other rules did. A spatiotemporally separable receptive field would be attained if the lagged RF were a simple scalar multiple of the nonlagged RF, as is observed to occur with the PCA learning rule.
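The DS index of equation 2.5 reduces to a one-line function; a minimal sketch:

```python
# Direction-selectivity index, eq. 2.5: 0 for a non-DS cell, approaching 1
# for a cell that responds only in its preferred direction.
def ds_index(r_pref, r_nonpref):
    return (r_pref - r_nonpref) / (r_pref + r_nonpref)

assert ds_index(10.0, 10.0) == 0.0             # equal responses: non-DS
assert abs(ds_index(9.0, 3.0) - 0.5) < 1e-12   # strongly direction selective
```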
Figure 2: Sample receptive fields with polar tuning plots (above) for BCM, for eye drift velocities v = 0 through v = 10. The direction selectivity index as a function of drift velocity (below) for four different learning rules (PCA, BCM, K1, S1). The PCA learning rule did not develop direction selectivity. The other rules show some tuning to eye drift velocity; they all lose direction selectivity for velocities that are too high (no overlap) or too low (near complete overlap). The LGN lagged cells had a constant one-iteration lag.
The other learning rules developed different lagged and nonlagged RFs. We observe that the lagged RF is a shifted version of the nonlagged RF, which yields a spatiotemporally inseparable receptive field. Figure 2 shows the direction selectivity index as a function of eye drift velocity, for a constant LGN lag of one iteration. Example receptive fields from the BCM learning rule for each velocity are shown. Other than the PCA rule, all rules show some tuning to eye drift velocity: they develop direction selectivity for some velocities, but all lose it for velocities that are too high or too low. At very low velocities, the lagged and nonlagged RFs are identical (ST separable), and at very high velocities only the lagged or the nonlagged RF develops while the other RF is small and random (also ST separable). Using a more dynamic environment (movies from van Hateren & Ruderman, 1998) and a more realistic temporal filtering (Saul & Humphrey, 1990; Wimbauer, Wenisch, Miller, & von Hemmen, 1997a) does not alter the main
Figure 3: Direction selectivity in a movie environment. Neurons are trained with sequential patches from movies preprocessed with a spatial DOG and temporal filters for lagged and nonlagged cells, as specified in Wimbauer et al. (1997a). The BCM rule was used in these examples. Shown are the temporal filters (left), the receptive fields for the lagged and nonlagged channels (middle), and the orientation tuning of the cells (right). The results are comparable to those from the drift-saccade input model.
result (see Figure 3). The neurons develop a spatiotemporally inseparable receptive field and are thus direction selective.

4 Summary and Conclusions
This work extends previous work (Feidler et al., 1997) in which a cell attained direction selectivity in a simplified environment; it reproduces the authors' conclusion that a simple Hebb rule is inadequate to develop direction selectivity. Other than the PCA rule, all the rules examined develop direction selectivity for some velocities, but all lose it for velocities that are too high or too low. There is a parallel between these results and the results on binocular cortical misalignment (Shouval, Intrator, Law, & Cooper, 1996), which we now explain. In the misalignment work, it was shown that the BCM rule in a natural scene environment, with varying degrees of binocular overlap, developed identical receptive fields (for complete overlap), monocular receptive fields (for no overlap), or receptive fields formed only in the overlap region (for intermediate overlap). In the current work, if the lag time of the LGN lagged cells is kept constant, then a constant velocity implies a constant amount of overlap of input patterns during eye drift. Thus, zero velocity would yield identical lagged and nonlagged RFs, yielding no direction selectivity. Similarly, high velocity would give no overlap of input patterns during eye drift and would yield either a completely lagged or a completely nonlagged receptive field, and again no direction selectivity. The high-velocity case is analogous to the strobe light environment (Cynader et al., 1973; Cynader & Chernenko, 1976; Humphrey & Saul, 1998), because in both situations the temporal correlations are lost. In experiment, direction selectivity was lost but orientation selectivity remained, which is reproduced by the simulations for the high-velocity case. In addition, there is evidence (Humphrey, Saul, & Feidler, 1998) that strobe rearing prevents the convergence of the lagged and nonlagged inputs onto the cortical cell; that is, the cortical receptive field is affected by either lagged or nonlagged inputs but not both. This result is reproduced completely in the simulations. It is straightforward to show that a PCA neuron in a spatially isotropic input environment develops either two identical receptive fields or two receptive fields differing only by a sign.¹ These two cases are, in the context of lagged and nonlagged receptive fields, ST separable and not direction selective. A recent model of direction selectivity used a correlation-based learning paradigm, with lagged and nonlagged LGN responses (Wimbauer et al., 1997a; Wimbauer, Wenisch, Miller, & von Hemmen, 1997b). The results were that direction selectivity could not form unless the lagged and nonlagged RFs developed independently (very weak correlations) or the environment contained anisotropic motion (objects moving in some directions more often than others). Despite the added complications of network effects and continuous response properties, these results can be understood from the results presented here using the single-cell PCA neuron. The BCM rule, as well as the higher-order rules, as we have seen, requires neither of these assumptions to develop direction selectivity. Recent comparisons between independent component analysis (ICA) or efficient coding of natural image sequences and the ST receptive fields of visual cortical neurons have been performed (van Hateren & Ruderman, 1998; Rao & Ballard, 1997).
The assumptions used there were different from the ones used in the current work; however, some of the results are similar. The primary differences are that many more temporal delays are included in the ICA work, and the environment is required to be "whitened." There are two possible interpretations of that work. One is that there are many different types of LGN neurons, each with a different time lag and each converging through independent synaptic weights on a single cortical cell. This interpretation is similar to our work, differing in our use of only two populations of LGN cells, as indicated experimentally, and in that we have investigated several different learning rules. Another interpretation is that there is one population of LGN cells but that synaptic weights have independent values at different time points. This interpretation is very different from our suggestion and is biologically unlikely.
¹ Simply stated, if the two-channel correlation function is symmetric and has the form $C = \begin{pmatrix} C_1 & C_2 \\ C_2 & C_1 \end{pmatrix}$, then the two channels of the principal component, $v = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$, are identical up to a sign, $v_1 = \pm v_2$.
In addition, the use of "whitening" can be problematic. Although LGN responses may be approximately decorrelated (Atick & Redlich, 1992; Dan et al., 1996), some of the rules used for ICA are particularly sensitive to small correlations remaining in nearly whitened environments (Blais et al., 1998). Also, under the interpretations described above, the spatiotemporal whitening performed could be questionable biologically. The learning rules in the current work do not require whitening, nor are they affected adversely by it, and thus they provide a more robust system for achieving direction selectivity. The interpretations of these other works and their comparison to biological quantities are therefore significantly different; what is shared with the current work is that they achieve direction selectivity in the natural scene environment. Although we have focused on a geniculocortical single-cell model, there is some indication that network effects, especially cortical-cortical interactions, play a part in direction selectivity (Livingstone, 1998). This work can be seen as a first step toward a network model and as a straightforward way of addressing some of the issues regarding direction selectivity.

Acknowledgments
This work was supported in part by the Charles A. Dana Foundation and the Office of Naval Research.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. Science, 275, 220–222.
Atick, J. J., & Redlich, A. (1992). What does the retina know about natural scenes? Neural Computation, 4, 196–211.
Barlow, H., & Levick, R. (1965). The mechanism of direction selectivity in the rabbit's retina. Journal of Physiology, 173, 477–504.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32–48.
Blais, B. S., Intrator, N., Shouval, H., & Cooper, L. N. (1998). Receptive field formation in natural scene environments: Comparison of single cell learning rules. Neural Computation, 10(7).
Burr, D. (1981). A dynamic model for image registration. Computer Graphics and Image Processing, 15, 102–112.
Carpenter, R. (1977). Movements of the eyes. London: Pion.
Chance, F. S., Nelson, S. B., & Abbott, L. F. (1998). Synaptic depression and the temporal response characteristics of V1 cells. Journal of Neuroscience, 18, 4785–4799.
Cynader, M., Berman, N., & Hein, A. (1973). Cats reared in stroboscopic illumination: Effects on receptive fields in visual cortex. Proceedings of the National
Direction Selectivity
1065
Academy of Sciences, 70, 1353–1354.
Cynader, M., & Chernenko, G. (1976). Abolition of direction selectivity in the visual cortex of the cat. Science, 193, 504–505.
Dan, Y., Atick, J. J., & Reid, R. C. (1996). Efficient coding of natural scenes in lateral geniculate nucleus: Experimental test of a computational theory. J. Neurosci., 16(1), 3351–3362.
Deangelis, G. C., Ohzawa, I., & Freeman, R. C. (1995). Receptive field dynamics in the central visual pathway. Trends in Neuroscience, 18, 451–458.
Feidler, J. C., Saul, A. B., Murthy, A., & Humphrey, A. L. (1997). Hebbian learning and the development of direction selectivity: The role of geniculate response timings. Network: Computational Neural Systems, 8, 195–214.
Hammond, P. (1978). Directional tuning of complex cells in area 17 of the feline visual cortex. J. Physiol. (Lond.), 275, 479–491.
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurons in the cat striate cortex. J. Physiol. (Lond.), 148, 509–591.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol., 160, 106–154.
Humphrey, A. L., & Saul, A. B. (1998). Strobe rearing reduces direction selectivity in area 17 by altering spatiotemporal receptive-field structure. Journal of Neurophysiology, 80, 2991–3004.
Humphrey, A. L., Saul, A. B., & Feidler, J. C. (1998). Strobe rearing prevents the convergence of inputs with different response timings onto area 17 simple cells. Journal of Neurophysiology, 80, 3005–3020.
Intrator, N., & Cooper, L. N. (1992). Objective function formulation of the BCM theory of visual cortical plasticity: Statistical connections, stability conditions. Neural Networks, 5, 3–17.
Law, C., & Cooper, L. (1994). Formation of receptive fields according to the BCM theory in realistic visual environments. Proceedings of the National Academy of Sciences, 91, 7797–7801.
Livingstone, M. S. (1998). Mechanisms of direction selectivity in V1. Neuron, 20, 509–526.
Maex, R., & Orban, G. A. (1996). Model circuit of spiking neurons generating directional selectivity in simple cells. Journal of Neurophysiology, 75, 1515–1545.
Mastronarde, D. N. (1987). Two classes of single-input X-cells in cat lateral geniculate nucleus. I. Receptive field properties and classification of cells. Journal of Neurophysiology, 57, 357–380.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Rao, R. P. N., & Ballard, D. H. (1997). Efficient encoding of natural time varying images produces oriented space-time receptive fields (Tech. Rep.). Rochester, NY: Department of Computer Science, University of Rochester.
Reid, R. C., Soodak, R. E., & Shapley, R. M. (1991). Directional selectivity and spatiotemporal structure of receptive fields of simple cells in cat striate cortex. Journal of Neurophysiology, 66(2).
1066
Brian Blais, Leon N. Cooper, and Harel Shouval
Saul, A. B., & Humphrey, A. L. (1990). Spatial and temporal properties of lagged and nonlagged cells in the cat lateral geniculate nucleus. Journal of Neurophysiology, 68, 1190–1208.
Shouval, H., Intrator, N., Law, C. C., & Cooper, L. N. (1996). Effect of binocular cortical misalignment on ocular dominance and orientation selectivity. Neural Computation, 8(5), 1021–1040.
Suarez, H., Koch, C., & Douglas, R. (1995). Modeling direction selectivity of simple cells in striate visual cortex within the framework of the canonical microcircuit. Journal of Neuroscience, 15, 6700–6719.
van Hateren, J., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. R. Soc. Lond. B, 265, 2315–2320.
Watson, A. B., & Ahumada, A. J. (1985). Model of human visual-motion sensing. Journal of the Optics Society of America, A2, 322–342.
Wimbauer, S., Wenisch, O. G., Miller, K. D., & van Hemmen, J. L. (1997a). Development of spatiotemporal receptive fields of simple cells: I. Model formulation. Biological Cybernetics, 77, 453–461.
Wimbauer, S., Wenisch, O. G., Miller, K. D., & van Hemmen, J. L. (1997b). Development of spatiotemporal receptive fields of simple cells: II. Simulation and analysis. Biological Cybernetics, 77, 463–477.

Received October 30, 1998; accepted June 14, 1999.
LETTER
Communicated by John Byrne
A Phase Model of Temperature-Dependent Mammalian Cold Receptors

Peter Roper
Paul C. Bressloff
Nonlinear and Complex Systems Group, Department of Mathematical Sciences, University of Loughborough, Loughborough, Leicestershire, U.K.

André Longtin
Département de Physique, Université d’Ottawa, Ottawa, Ontario, Canada K1N 6N5

We present a tractable stochastic phase model of the temperature sensitivity of a mammalian cold receptor. Using simple linear dependencies of the amplitude, frequency, and bias on temperature, the model reproduces the experimentally observed transitions between bursting, beating, and stochastically phase-locked firing patterns. We analyze the model in the deterministic limit and predict, using a Strutt map, the number of spikes per burst for a given temperature. The inclusion of noise produces a variable number of spikes per burst and also extends the dynamic range of the neuron, both of which are analyzed in terms of the Strutt map. Our analysis can be readily applied to other receptors that display various bursting patterns following temperature changes.
1 Introduction
Bursting is the rhythmic generation of several action potentials during a short time, followed by a period of inactivity. The limiting case of a single spike per burst is termed beating. There are a wide variety of burst phenomena, but it appears that many are due to a similar underlying mechanism. The various chemical and electrical dynamics of the neuron operate on many timescales, and in some cases, the full dynamics can be dissected into a fast system coupled to a slowly oscillating subsystem (Rinzel & Lee, 1987). Typically the fast system operates on a millisecond timescale and models the currents associated with spike generation, while the slow subsystem, with a timescale of up to 1 second or more, is often associated with calcium concentrations. The fast system is modulated by the slow one and has two parameter regimes: a stationary state or “fixed point” and a limit cycle state during which action potentials are periodically generated. Thus, for this “slow-wave” bursting to occur, the slow variable must parameterize bifurcations in the fast system.

Neural Computation 12, 1067–1093 (2000) © 2000 Massachusetts Institute of Technology
1068
Peter Roper, Paul C. Bressloff, and André Longtin
Cold receptor neurons are free nerve endings (Schäfer, Braun, & Rempe, 1990) that transduce local temperature information into neuronal signals. Their neuronal discharge takes the form of regular bursts, phase-locked single spikes (“beating”), or stochastically phase-locked spikes (“skipping”). The cells are subcutaneous to the skin and tongue but are sparsely distributed (Ivanov, 1990). Due to their small size, they cannot be recorded intracellularly, and the only quantity that can be measured directly is the spike train. However, other cellular properties may be inferred by the use of pharmacological agents. Temporal firing patterns and interspike interval histograms (ISIH) from these neurons (see Figure 1) show that the bursting dynamics is highly temperature dependent and further suggest the existence of a slowly oscillating current with a frequency that increases with temperature (Braun, Schäfer, & Wissing, 1990). In this article we present a canonical model for a thermoresponsive cold receptor with noise that exhibits bursting, beating, and skipping. Our model is a simplified version of a full ionic model of a slow-wave burster, such as Plant’s model, which formed the basis of a recently proposed model of cold thermoreception (Longtin & Hinzer, 1996). The simplified model derives from the ionic model by means of a phase-reduction procedure due to Ermentrout and Kopell (1986). Although we focus specifically on cold receptors, the phase-model approach is general and is applicable to any thermosensitive neuron that discharges in a similar manner to cold receptors (for example, certain cells exhibit similar patterns but in a different order; Braun, Schäfer, Wissing, & Hensel, 1984). It is also applicable to other more recently proposed ionic models of cold receptors (Braun, Huber, Dewald, Schäfer, & Voigt, 1998). The model contains biologically motivated parameters and exhibits behavior that is consistent with experiment.
Moreover, it has the advantage of being mathematically tractable, so that if these parameters were to be quantified, analytic predictions about the behavior of real receptors could be made. We analyze the model in the limiting case of zero noise (the deterministic limit) and also when subjected to noise originating primarily from conductance fluctuations and electrogenic pumps. For the deterministic case, we are able to predict how many action potentials are generated per burst for a given temperature and derive the transition temperature from bursting to beating. For the latter we are able to estimate the skipping rate at a given temperature. We demonstrate that skipping is a noise-induced effect that can occur for both the suprathreshold and the subthreshold dynamics. Above threshold, spikes become deleted as a consequence of noise-induced trapping (Apostolico, Gammaitoni, Marchesoni, & Santucci, 1994), while below threshold, the firing pattern becomes augmented by noise-induced spiking. The article is organized as follows. Section 2 is a brief review of cold receptor physiology, a subject covered more completely in Longtin and Hinzer (1996). Our model is introduced in section 3. In section 4 we consider the
Model of Mammalian Cold Receptors
1069
Figure 1: Characteristic discharge patterns of bursting cold receptors of the cat lingual nerve at different constant temperatures. The steady-state patterns are recorded at least two minutes after each 5°C temperature change (from Figure 1 of Schäfer et al., 1988, reproduced with permission).
limiting case of zero noise and analyze the simulated firing patterns using Floquet theory. We then examine how noise can alter the deterministic discharge pattern.

2 A Summary of the Neuro- and Electrophysiology of Cold Receptors

Figure 1 shows characteristic discharge patterns and their associated ISIH, measured from cold receptors in a cat tongue at various static temperatures. Not every cell has this repertoire; some exhibit only a few of these patterns when the temperature is varied. Furthermore, the temperature at which a given discharge pattern occurs varies between receptors. It has been argued (Dykes, 1975) that by using the temporal structure of the discharges, the central nervous system is able to discriminate between firing patterns that have the same mean firing rate but correspond to different temperatures. This is plausible because bursts can drive higher neurons efficiently (Lisman, 1997): the relatively high firing rates during a burst release higher levels
of transmitter than would be the case if the individual spikes were more broadly distributed in time. At low temperatures, the neuron bursts repetitively with a uniform burst length and with bursts that appear to be synchronized with some underlying slow oscillation. However, the timing of individual spikes within the burst is nonuniform (Braun, Bade, & Hensel, 1980), since the firing frequency within a burst increases. This is a feature shared with parabolic bursting neurons (Rinzel & Lee, 1987).¹ When the temperature is quasi-statically increased, the burst length and interburst period diminish, until at the middle of its operating range, the neuron emits a regular spike train with phase-locked spikes. If the temperature is increased still further, the spike train becomes aperiodic so that occasional double spikes appear and skipping occurs. The spike train is still synchronized to some underlying rhythm, but between spikes a random integer number of oscillation cycles may be skipped (Braun et al., 1990). The origin of this randomness is uncertain, but due to the cell’s lack of synapses and its small size, it is thought that thermal, conductance, and other fluctuations might be important. It has been argued (see Longtin & Hinzer, 1996, for a review), based on physiological and pharmacological studies, that the key ingredients of this skipping are noise and a slow internal oscillation with a maximum amplitude that lies close to the spiking threshold. The skipping rate is also thermally dependent, increasing with temperature, and above 40°C, skips of up to eight cycles have been observed (Schäfer, Braun, & Rempe, 1988). We introduce some nomenclature for describing the discharge pattern. The number of action potentials in a given burst is $N$, and the mean number of spikes per burst, $\bar{N}$, is the temporal average of $N$. $\Omega$ is the frequency of the slow oscillatory cycle and is the reciprocal of the sum of the inter- and intraburst durations.
At a fixed temperature, the number of spikes per burst for a given receptor can vary by one or two (Braun et al., 1980), implying that the bursting dynamics also has a degree of stochasticity. This further suggests the presence of noise or chaos, or both (Longtin & Hinzer, 1996). The breadth of the ISIH peak corresponding to the interburst period suggests fluctuations in the slow-wave frequency $\Omega(T)$, the mean of which is determined only by $T$. The variability in the number of spikes per burst further contributes to this breadth. Examination of the spike trains of several cells at different static temperatures reveals (Braun et al., 1984) that both $\Omega$ and $\bar{N}$ depend monotonically (and sometimes approximately linearly; Braun et al., 1980) on the temperature $T$, with $\Omega$ increasing and $\bar{N}$ decreasing with $T$.
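This nomenclature maps directly onto spike-train analysis. The sketch below is a hypothetical helper (not from the paper; the gap threshold and spike times are illustrative) that segments a spike train into bursts and reports $N$ for each burst, $\bar{N}$, and the slow-cycle frequency $\Omega$ as the reciprocal of the mean burst-to-burst period:

```python
# Hypothetical burst-segmentation helper, assuming an interspike-interval
# threshold `gap` separates bursts. Not from the paper; for illustration only.
def burst_stats(spike_times, gap):
    bursts, current = [], [spike_times[0]]
    for s in spike_times[1:]:
        if s - current[-1] > gap:      # a long gap ends the current burst
            bursts.append(current)
            current = [s]
        else:
            current.append(s)
    bursts.append(current)
    counts = [len(b) for b in bursts]          # N for each burst
    nbar = sum(counts) / len(counts)           # mean spikes per burst, Nbar
    starts = [b[0] for b in bursts]
    periods = [t2 - t1 for t1, t2 in zip(starts, starts[1:])]
    # Omega: reciprocal of the mean inter- plus intraburst duration
    omega = 1.0 / (sum(periods) / len(periods)) if periods else None
    return counts, nbar, omega

counts, nbar, omega = burst_stats(
    [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0], gap=5.0)
# Three bursts of three spikes, burst-to-burst period 10, so omega = 0.1.
```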
¹ A burst is termed parabolic if the spiking frequency is lower at both the beginning and the end of the burst compared with that during the middle of the burst; thus, here we have only half of the “parabolic” bursting.
The thermosensitivity and regular discharge of cold receptors have been likened to those of Aplysia R15 neurons. Intracellular recordings of R15 have revealed the mechanism of slow-wave bursting and have shown the presence of an underlying slow-wave voltage oscillation that persists even when spike generation is pharmacologically blocked. The rhythmic behavior of Aplysia is well reviewed in Canavier, Clarke, and Byrne (1991). The thermosensitivity of Aplysia R15 cells has been investigated (Carpenter, 1967; reviewed in Longtin & Hinzer, 1996), and it points to the importance of the thermosensitivity of their electrogenic ionic pumps (such as the Na⁺–K⁺ ATPase); such pumps have also been found in cold receptors. It has thus been proposed that the mechanisms of bursting and thermosensitivity for both cold receptors and Aplysia R15 neurons are related (Willis, Gaubatz, & Carpenter, 1974; Wiederhold & Carpenter, 1982). That hypothesis was further substantiated in a recent study (Longtin & Hinzer, 1996) that extended the Plant model (Plant, 1981), devised originally for R15 bursting, to explain cold receptor firing. Finally, skipping occurs in other preparations in which there is no apparent periodic stimulation. For example, a related thermoresponsive preparation is the ampullae of Lorenzini of the dogfish. These mandibular sensory afferents exhibit a similar temperature-dependent slow wave, but although skipping occurs, they do not burst in the same manner. Recent data (Braun, Wissing, Schäfer, & Hirsch, 1994) point to this skipping as being a consequence of noise internal to the neuron, following a mechanism similar to that suggested for cold receptors (Braun et al., 1980): a slow oscillation periodically brings the membrane potential close to the firing threshold, at which point action potentials are caused randomly by noise pushing the potential over threshold. All of these studies suggest the following scenario.
At low temperatures, the membrane potential is high and the slow wave depolarizes the membrane for a significant part of the cycle; the neuron therefore bursts. As the temperature increases, so does the oscillation frequency of the slow wave. In addition, the mean potential becomes more negative (due to the electrogenic pump), and so the depolarization time due to the slow wave, and hence the burst length, is reduced. At high temperatures, the slow wave is no longer able to generate spikes by depolarization; instead, all spikes must be noise driven.

3 Temperature-Dependent Phase Model of Cold Receptor Function

3.1 A Phase Model. Plant’s model (1981) of slow-wave, parabolic bursting, extended to cold receptors in Longtin and Hinzer (1996), is based on a Hodgkin-Huxley-type ionic current description. Although such ionic models (see also Braun et al., 1998) elucidate cellular function, they involve many coupled differential equations and so are difficult to treat analytically. They also involve many parameters that are currently inaccessible
to experimental determination. Consequently, we seek a simpler canonical model. We write the vector of potentials and gating variables associated with spiking (e.g., fast sodium/calcium and potassium channels) as $u \in \mathbb{R}^p$, and the vector of concentration and gating variables associated with the slow dynamics (e.g., calcium concentration) as $v \in \mathbb{R}^q$. Ermentrout and Kopell (1986) studied parabolic bursting models with the general form

$$\dot{u} = f(u) + \epsilon^2 g(u, v, \epsilon), \qquad \dot{v} = \epsilon\, h(u, v, \epsilon) \tag{3.1}$$

for which the following two properties hold:

1. When $\epsilon = 0$ the system $\dot{u} = f(u)$ has an attracting invariant circle and a single degenerate critical point; the system is at a saddle-node bifurcation on a limit cycle.

2. $\dot{v} = h(0, v, 0)$ has a limit cycle solution.

They showed that such models converge in the weak-coupling limit, $\epsilon \to 0$, to the canonical form

$$\dot{\phi} = [1 - \cos(\phi)] + [1 + \cos(\phi)]\,\frac{1}{\gamma}\,g(0, v, 0), \qquad \dot{v} = \bar{h}(0, v, 0), \tag{3.2}$$
where $\gamma$ is some constant and $f(\cdot)$, $g(\cdot)$, $h(\cdot)$, and $\bar{h}(\cdot)$ are smooth functions of their arguments. The phase $\phi \in [-\pi, \pi]$ is defined on the unit circle $S^1$ and represents the phase of the membrane potential at the trigger zone. A full rotation of $\phi$ corresponds to the generation of an action potential. The requirement of weak coupling ($\epsilon \to 0$) implies that the timescales of the slow dynamics must be much longer than those of the fast subsystem. For real cold receptors, $\epsilon$ is not quantifiable since there are no intracellular recordings. However, we previously proposed (Longtin & Hinzer, 1996) that the Plant model might well describe cold receptors, and for this model the coupling between fast and slow subsystems is indeed weak (Rinzel & Lee, 1987). However, as it stands, the Plant model fails to satisfy the second hypothesis, since the slow subsystem approaches a fixed point when the membrane potential is clamped to any fixed value. Soto-Treviño, Kopell, and Watson (1996) have therefore generalized Ermentrout and Kopell's (EK) phase reduction and furthermore have taken the Plant model as their specific example. They show that hypothesis 2 can be replaced by the more lax criterion that there be a slow periodic orbit that remains close to a curve of degenerate homoclinic points for the fast system. Under such constraints, the Plant model may be reduced to the canonical form in equation 3.2.
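The qualitative content of this canonical form is easy to exhibit numerically. The following sketch (illustrative, not from the paper; it replaces the slow term $g(0, v, 0)$ with a constant drive $I$) integrates the phase equation with forward Euler and counts rotations: for $I > 0$ the phase rotates and the model fires repetitively, while for $I < 0$ it relaxes to a fixed point.

```python
import math

def theta_neuron_spikes(I, T=200.0, dt=1e-3):
    """Integrate the canonical phase equation with constant drive I,
    dphi/dt = (1 - cos(phi)) + (1 + cos(phi)) * I,
    by forward Euler, counting spikes (full rotations past phi = pi)."""
    phi, spikes = -math.pi / 2, 0
    for _ in range(int(T / dt)):
        dphi = (1 - math.cos(phi)) + (1 + math.cos(phi)) * I
        phi += dt * dphi
        if phi >= math.pi:          # a full rotation = one action potential
            phi -= 2 * math.pi      # wrap back onto the circle
            spikes += 1
    return spikes

# I > 0: repetitive firing; I < 0: the phase relaxes to rest and never spikes.
```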
We will modify the canonical form to account for temperature effects, while retaining similar dynamics and bifurcations. As a first step, we write the slow dynamics as the sum of a constant term $X$ and a zero-mean periodic term $Y(\Omega, t)$ that oscillates with frequency $\Omega$, and for convenience we set $\theta = \phi + \pi$. Thus

$$\dot{\theta} = [1 + \cos(\theta)] + [1 - \cos(\theta)]\,[X + Y(\Omega, t)]. \tag{3.3}$$
In the absence of any electrophysiological evidence, we initially choose to model the oscillatory component with a simple cosine, where the upward trend might derive from an inward Ca²⁺ current while the downward trend could be a Ca²⁺-dependent outward K⁺ current. Furthermore, we rescale time so that $t' = (1 - X)t$, and introduce

$$b = \frac{1 + X}{1 - X} \qquad \text{and} \qquad A\cos(\Omega t') = -\frac{Y(\Omega, t)}{1 - X}. \tag{3.4}$$

Thus we obtain

$$\frac{d\theta}{dt'} = [b + \cos(\theta)] - A\cos(\Omega t')\,[1 - \cos(\theta)] = \left[b - A\cos(\Omega t')\right] + \left[1 + A\cos(\Omega t')\right]\cos(\theta). \tag{3.5}$$
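Equation 3.5 can be integrated directly to make the bursting mechanism concrete. Below is a minimal forward-Euler sketch (parameter values are illustrative, not fitted to any receptor): whenever the slow cosine drives the system past its saddle node for part of each cycle, spikes arrive in bursts locked to the slow oscillation.

```python
import math

def phase_model_spike_times(b, A, Omega, T=600.0, dt=1e-3):
    """Forward-Euler integration of equation 3.5,
    dtheta/dt = [b - A cos(Omega t)] + [1 + A cos(Omega t)] cos(theta).
    Returns the times at which theta completes a full rotation (spikes)."""
    theta, t, times = 0.0, 0.0, []
    for _ in range(int(T / dt)):
        c = A * math.cos(Omega * t)
        theta += dt * ((b - c) + (1 + c) * math.cos(theta))
        t += dt
        if theta >= math.pi:        # full rotation = one action potential
            theta -= 2 * math.pi
            times.append(t)
    return times

# With b slightly below 1 and A large enough that b - A cos(Omega t) exceeds
# 1 + A cos(Omega t) for part of each slow cycle, firing comes in bursts
# separated by long quiescent intervals.
spikes = phase_model_spike_times(b=0.9, A=0.3, Omega=2 * math.pi / 100)
```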
The parameter $b$ modulates the mean potential of the cell (and hence its excitability), and thus could be associated with the activity of one or more electrogenic pumps that move ions against their concentration gradients.² The phase model in equation 3.5 displays a saddle-node bifurcation on a limit cycle whenever

$$b - A\cos(\Omega t) = 1 + A\cos(\Omega t). \tag{3.6}$$
Time appears here explicitly (for clarity we have omitted the prime), and so for certain parameter values, the bifurcation occurs periodically in one direction followed by the other. If the slow oscillation is of such magnitude that for part of the cycle

$$A\cos(\Omega t) < \frac{b - 1}{2} - \eta, \tag{3.7}$$

where $\eta$ is some small positive parameter, then the neuron bursts. This oscillatory behavior across the saddle node is much simpler than the corresponding bifurcations undergone by ionic models. For example, the Plant model has a two-dimensional slow subsystem that depends on the three-dimensional fast subsystem, with the result that the latter is periodically driven through homoclinic transitions between steady-state and limit-cycle solutions. Close to the bifurcation the phase model exhibits critical slowing down; relaxation to a fixed point (the resting potential) becomes polynomial in time rather than exponential, thus increasing the time for $\theta(t)$ to traverse the unit circle. Such behavior has also been described as passage through a bottleneck (Strogatz, 1994). Therefore, within a burst there is a nonuniform distribution of spikes: a signature of parabolic bursting. Equation 3.5 may be made more tangible by graphing $1 + A\cos(\Omega t)$ and $b - A\cos(\Omega t)$, as in Figure 2a. A saddle-node bifurcation occurs when the sinusoids cross, and hence bursting occurs. The mean number of spikes per burst, $\bar{N}$, is proportional to $\Delta$, the overlap of the two curves. $\Delta$ is a function of $A$, $\Omega$, and $b$, and is approximated by

$$\Delta = \frac{2}{\Omega}\left(\pi - \cos^{-1}\!\left(\frac{b - 1}{2A}\right)\right) = \frac{1}{\Omega}\left(\pi + \frac{b - 1}{A}\right) + O\!\left[\left(\frac{b - 1}{2A}\right)^{3}\right]. \tag{3.8}$$

² Kopell and Ermentrout (1986) discuss other possible biological correlates of the slow oscillation and make the alternative suggestion that it could derive solely from oscillations in pump activity.
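As a numerical sanity check on equation 3.8, the exact expression for the overlap and its leading-order expansion can be compared directly (the parameter values below are arbitrary illustrations):

```python
import math

def overlap_exact(b, A, Omega):
    """Delta = (2/Omega) * (pi - acos((b - 1) / (2A))), equation 3.8 (exact form)."""
    return (2.0 / Omega) * (math.pi - math.acos((b - 1) / (2 * A)))

def overlap_series(b, A, Omega):
    """Leading-order expansion: Delta ~ (1/Omega) * (pi + (b - 1)/A)."""
    return (1.0 / Omega) * (math.pi + (b - 1) / A)

# For small (b - 1)/(2A) the two expressions agree up to a cubic-order error.
b, A, Omega = 1.02, 0.3, 2 * math.pi / 100
```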
3.2 Modeling Temperature Dependence and Fluctuations. The main effect of a temperature change is a variation in the rate constants of the biochemical processes occurring within the cell. Therefore, to obtain expressions for the variation with temperature of the coefficients of the phase model, we must first consider the thermal dependence of the parameters of more biophysical models of slow-wave bursting. In the appendix, we derive a novel relationship between the frequency $\Omega$ of the slow wave in equation 3.5 and parameters of the Plant model extended to cold receptors (Longtin & Hinzer, 1996). The two main temperature-dependent parameters of the slow subsystem in the Plant model are the (positive) kinetic rates $\lambda$ and $\rho$ of, respectively, the slow inward current gating variable and the calcium dynamics. We find that $\Omega \approx (\lambda\rho)^{1/2}$. Also, numerics suggest that the transition from fixed point to the slow-wave limit cycle is a supercritical Hopf bifurcation and that the limit cycle frequency does not depend strongly on the bifurcation parameter, as is usually the case in the vicinity of this kind of bifurcation. If both of these rates are assigned the same temperature dependence as in Longtin and Hinzer (1996) (the same $Q_{10}$), then $\Omega$ will vary with temperature according to this $Q_{10}$ as well. For $Q_{10} = 3$ as in Longtin and Hinzer (1996), this means that $\Omega \sim 3^{0.1T}$. This functional form readily fits the temperature dependence seen in the conductance-based model (Longtin & Hinzer, 1996) (not shown). Further, since this form is almost linear over the temperature range of interest, it is similar to the experimental observation (Braun et al., 1980) that the burst frequency (slow-wave frequency) increases linearly with $T$. Over a temper-
Figure 2: (a) Saddle-node bifurcation and bursting criteria for three different parameter regimes, with $b_1 > b_2 > b_3$ and $b_2 = 1 - 2A$. (i) When the overlap is large, the neuron bursts (fires repetitively) for a time $\Delta$. (ii) Bursting has ceased; any spiking activity must be noise driven and the neuron skips (it does not always fire at every slow-wave cycle). (iii) All spiking has ceased, and the neuron is silent. (b) There are three possible mechanisms by which the burst length can be reduced ($\Delta' < \Delta$) (compare with (a)): (i) increasing slow-wave frequency, $\Omega' > \Omega$; (ii) decreasing pump parameter, $b' < b_1$; and (iii) decreasing slow-wave magnitude, $A' < A$.
ature range of 17.8° to 40°C, our analysis produces a factor 11.5 increase in frequency, which is close to the factor of 10.0 obtained from numerical simulations of the full model, and close also to actual ratios exhibited by some cold fibers. Further bifurcation analysis could be done to compute the leading-order term for the temperature dependence of the slow-wave amplitude $A(T)$ and thus also to relate the amplitude parameter in the phase model to biophysical parameters of a conductance-based ionic model. Because we are interested only in reproducing qualitative features of thermosensitivity, we model thermal dependence with linear functions of the magnitude of the
pump coefficient, $b$, and of the magnitude, $A$, and frequency, $\Omega$, of the slow wave:

$$b \to b(T) = b_0 - b_T T, \qquad A \to A(T) = A_0 + A_T T, \qquad \Omega \to \Omega(T) = \Omega_0 + \Omega_T T, \tag{3.9}$$
where $\Omega_0$, $\Omega_T$, $A_0$, $A_T$, $b_0$, and $b_T$ are constants. More complex relationships could be used if required. Figure 2b shows how the individual variation of each of these parameters can alter the burst length; it is likely that the observed discharge patterns are due to some combination of these mechanisms. We furthermore confine our interest to the temperature range for which $A(T), \Omega(T) > 0$. Neuronal noise has many possible sources and is generally difficult to measure, a problem exacerbated by the lack of intracellular recordings from these neurons. In the absence of synaptic input, its main components are due to thermal ionic movement, conductance fluctuations in the ion channels, and electrogenic pump noise (DeFelice, 1981). Thermal noise is proportional to the absolute temperature, and so varies only slightly over the temperature range of interest. For simplicity again, we lump noise terms together into an additive noise on the $\theta$ dynamics (see equation 3.5), as previously proposed in Longtin and Hinzer (1996) (and references therein). We further assume that the noise $\xi(t)$ has a zero-mean gaussian distribution and is white:
and
hj ( t)j ( t 0 )i D 2Dd (t ¡ t 0 ).
(3.10)
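Numerically, white noise with the correlation of equation 3.10 is realized by adding an independent Gaussian increment of variance $2D\,dt$ at each Euler step. The sketch below (an Euler–Maruyama integration of equation 3.5 plus this additive noise, using the linear temperature dependence of equation 3.9 with the Figure 3 parameter values) is illustrative rather than a reproduction of the paper's simulations:

```python
import math
import random

# Figure 3 parameter values (from the paper); the integration scheme and
# step size here are our own illustrative choices.
A0, AT = 0.3, 0.001
B0, BT = 0.675, 0.007
W0, WT = -math.pi / 150, math.pi / 1500

def spike_count(T_celsius, D=0.05, duration=1500.0, dt=5e-3, seed=1):
    """Euler-Maruyama integration of the noisy phase model at temperature
    T_celsius, counting rotations of theta. White noise of strength D enters
    as Gaussian increments with standard deviation sqrt(2*D*dt) per step."""
    b = B0 - BT * T_celsius
    A = A0 + AT * T_celsius
    W = W0 + WT * T_celsius
    rng = random.Random(seed)
    sigma = math.sqrt(2 * D * dt)
    theta, t, spikes = 0.0, 0.0, 0
    for _ in range(int(duration / dt)):
        c = A * math.cos(W * t)
        theta += dt * ((b - c) + (1 + c) * math.cos(theta)) + rng.gauss(0.0, sigma)
        t += dt
        if theta >= math.pi:
            theta -= 2 * math.pi
            spikes += 1
    return spikes

# At low temperature the deterministic drift produces bursts; at high
# temperature spiking becomes sparse and predominantly noise driven.
```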
We refer to the noise variance $D = \sigma^2$ as the noise strength. Thus we write

$$\frac{d\theta}{dt} = b(T) - A(T)\cos(\Omega(T)\,t) + \left[1 + A(T)\cos(\Omega(T)\,t)\right]\cos(\theta) + \xi(t). \tag{3.11}$$

From now on, the temperature dependence of $b(T)$, $A(T)$, and $\Omega(T)$ is left implicit. If we now examine the system numerically for different temperatures (see Figure 3) and choose $b_0$, $b_T$, $A_0$, and $A_T$ such that at low temperatures there is a part of the cycle for which equation 3.7 is satisfied, then we observe that the neuron bursts (see the inset to Figure 3f). As $T$ increases, $\Omega(T)$ and $A(T)$ both increase while $b(T)$ decreases. However, for a given temperature change, the increase in $A(T)$ is smaller than and is counteracted by the larger change in $b(T)$. In this way both the intra- and interburst periods also decrease with temperature (see the insets to Figures 3d and 3e). We thus obtain a sequence of firing patterns similar to that seen experimentally (see Figure 1). At low temperatures, bursts are lengthy, and so the dominant component of the ISIH is a narrow peak close to the origin (see the inset to Figure 3f), which corresponds to the timing of successive spikes in a burst (the intraburst period). As the temperature increases, the burst length decreases and the
Figure 3: Simulated spike trains (inset figures) and corresponding interspike interval histograms (main figures) for increasing temperatures (panels from top to bottom: $T = 40$, 35, 30, 25, 20, and 15°C). Parameters are: $A_0 = 0.3$, $A_T = 0.001$, $b_0 = 0.675$, $b_T = 0.007$, $\Omega_0 = -\pi/150$, $\Omega_T = \pi/1500$, and $D = 0.05$.
Figure 4: Cold receptors encode temperature by means of their firing patterns and not by a mean rate code. The efficiency of this strategy is shown by their ability to encode two different temperatures with the same mean firing rate: (a) The variability with temperature of the number of spikes in a burst (SB) and the burst frequency (BF). (b) The mean firing rate (BF × SB) varies with increasing temperature. See Figures 10a and 10b of Braun et al. (1980). Parameters are: $A_0 = 0.234$, $A_T = 0.0004$, $b_0 = 0.5868$, $b_T = 0.00088$, $\Omega_0 = -\pi/150$, $\Omega_T = \pi/3000$, and $D = 0.025$.
interburst period becomes a pronounced feature of the ISIH: a new peak appears close to one period of the slow-wave oscillation (see Figures 3b and 3c). For high $T$, equation 3.7 is never satisfied, and bursting does not occur; instead, all spikes are noise driven (see the insets to Figures 3a and 3b), and higher subharmonics of the slow wave begin to appear in the ISIH. Furthermore, our canonical model can encode two different temperatures by two different firing patterns that have the same mean firing rate, as shown in Figure 4 (cf. Figures 10a and 10b of Braun et al., 1980).

4 Analysis of Bursting and Beating

The neuron’s behavior depends sensitively on both the noise strength and the temperature. If we graph the mean number of spikes per burst, $\bar{N}(T)$, versus $T$ for a neuron subject to a vanishingly small level of noise, we observe the staircase depicted in Figure 5. Note that $\bar{N}(T)$ is constant over each plateau, but between adjacent plateaus changes by a single spike per burst. Each plateau is labeled by its respective value of $\bar{N}(T)$, and the transition
Figure 5: (a) Mean number of action potentials per burst, $\bar{N}$, versus temperature, $T$. Simulation results with several noise levels ($D = 0$, $10^{-4}$, and $0.05$) are shown; the circles mark the corresponding Mathieu predictions for the deterministic neuron. (b) Monte Carlo estimation of the firing rate close to the transition temperature $T_1$ for $D = 10^{-4}$. The squares show numerically observed firing rates (measured in spikes per slow-wave cycle), and the crosses depict the Monte Carlo approximation to the rate as estimated by the method of section 4.2 (the dashed curve is a spline fit through these data points). The inset figure shows a Monte Carlo approximation to the probability density $p(\theta, t = \pi/\Omega \mid \hat{\theta}_s, 0)$ at $T = 42$°C. Parameters are: $A_0 = 0.3$, $A_T = 0.001$, $b_0 = 0.675$, $b_T = 0.007$, $\Omega_0 = -\pi/150$, and $\Omega_T = \pi/1500$.
temperature between the nth and (n ¡ 1)th plateaus by Tn . If a low level of noise is now introduced, the staircase retains its shape but the steps become rounded; as D increases, the plateaus disappear, and N ( T ) approaches a smoothly decreasing function of temperature. To understand the origin of the staircase, we reinterpret equation 3.11 as a gradient system in the limit of high friction: dh dU D¡ C j ( t). dt dh
(4.1)
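As a quick numerical sanity check on this gradient form, minus the θ-derivative of the potential U(θ, t) given next in equation 4.2 should reproduce the deterministic drift of the phase equation. The sketch below is not from the paper, and the parameter values are purely illustrative:

```python
import numpy as np

# Check that -dU/dtheta, with U from equation 4.2, equals the deterministic
# drift gamma(t) * (lam(t) + cos(theta)) of the phase equation.
# Parameter values here are illustrative, not taken from the paper.
b, A, Omega = 0.6, 0.5, 0.01

def U(theta, t):
    gamma = 1 + A * np.cos(Omega * t)
    lam = (b - A * np.cos(Omega * t)) / gamma
    return -gamma * (lam * theta + np.sin(theta))   # equation 4.2

def drift(theta, t):
    c = A * np.cos(Omega * t)
    return (b - c) + (1 + c) * np.cos(theta)        # gamma * (lam + cos theta)

theta = np.linspace(0.0, 2 * np.pi, 50)
for t in (0.0, 100.0, 250.0):
    dU = (U(theta + 1e-6, t) - U(theta - 1e-6, t)) / 2e-6   # central difference
    print(np.allclose(-dU, drift(theta, t), atol=1e-6))      # → True
```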
We now view our model as a particle obeying noisy dynamics in a time-dependent potential U(θ, t) of the form

U(θ, t) = −γ(t)[λ(t)θ + sin(θ)],
(4.2)
where

γ(t) = 1 + A cos(Ωt)   and   λ(t) = (b − A cos(Ωt)) / (1 + A cos(Ωt)),    (4.3)
which is equivalent to an active rotator with periodic coefficients. The multiplicative term γ(t) periodically rescales the magnitude of U, while λ(t) periodically sculpts the shape of U (see Figure 6). The coefficients γ(t) and λ(t) are both periodic with period 2π/Ω but are in antiphase.

4.1 The Deterministic Limit (D → 0): The Strutt Map. At any time t, the bias λ(t) characterizes the instantaneous deterministic dynamics. Three regimes occur:

1. λ(t) < 1. The oscillator has one stable and one unstable fixed point, given by the solutions to θ = cos⁻¹(−λ). This is termed the locked state, and the dynamics relax to the stable fixed point (see the inset to Figure 6a).
2. λ(t) = 1. The stable and unstable fixed points coalesce via a saddle-node bifurcation to form a half-stable fixed point at θ = π.
3. λ(t) > 1. The potential U has no minima, and the deterministic dynamics have no fixed points. The oscillator therefore rotates with the variable velocity

θ̇(t) = γ(t)(λ(t) + cos(θ)),
(4.4)
and each rotation corresponds to the firing of an action potential. Such a solution is termed a running state (see the inset to Figure 6b).
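In this deterministic limit, the number of spikes per burst is simply the winding number of θ per slow-wave cycle, which can be obtained by direct integration of equation 4.4. The sketch below is not from the paper; b and A match the values quoted in the Figure 8 caption, while Ω is enlarged here purely to keep the integration short:

```python
import numpy as np

def spikes_per_cycle(b, A, Omega, n_transient=1, n_cycles=3, dt=0.02):
    """Integrate the deterministic rotator (equation 4.4) with the Euler
    method and return the winding number of theta per slow-wave cycle,
    i.e. the number of action potentials per burst."""
    period = 2 * np.pi / Omega
    t = np.arange(0.0, (n_transient + n_cycles) * period, dt)
    theta = np.pi / 2                 # start near the locked state
    trace = np.empty_like(t)
    for i, ti in enumerate(t):
        trace[i] = theta
        c = A * np.cos(Omega * ti)
        theta += dt * ((b - c) + (1 + c) * np.cos(theta))  # gamma*(lam + cos)
    start = int(n_transient * period / dt)                 # discard transient
    winding = (trace[-1] - trace[start]) / (2 * np.pi)
    return int(round(winding / n_cycles))

# Partially stable slow wave (lam_max > 1): a burst of several spikes per cycle.
print(spikes_per_cycle(b=0.6, A=0.5, Omega=0.01))
# Always stable (lam_max < 1): quiescent.
print(spikes_per_cycle(b=0.2, A=0.2, Omega=0.01))   # → 0
```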
Figure 6: The potential U(θ, t) for a single cycle of the slow wave. (a) λmax < 1, the always-stable potential. For low noise levels the oscillator tends to remain close to the minima of the potential. (b) λmax > 1, λmin < 1, the partially stable potential. When the barrier is absent, the oscillator may escape beyond 2π, generating an action potential. The particle is then reinjected at θ = 0. The inset figures caricature the respective potentials at times t = (2n + 1)π/Ω, n ∈ ℤ.
Thus the extremal values of λ(t), λmax = (b + A)/(1 − A) and λmin = (b − A)/(1 + A), define the global behavior of the system:

1. λmin > 1. The potential U never has a barrier. The neuron fires regularly, and the system is always unstable.

2. λmax < 1. The potential always has a finite barrier. The neuron is quiescent, and the system is always stable.

3. λmin < 1 and λmax > 1. A potential barrier exists for part of the cycle, and the system is partially stable.

An always-unstable system spikes repetitively at a very high rate. It therefore has no useful temporal structure and so no relevance to our study. We therefore choose A0 and AT such that at low temperatures the system is partially stable, and at high temperatures it is always stable. We denote the critical temperature for which λmax = 1 by Tc, which is defined by

b(Tc) + A(Tc) = 1 − A(Tc)
(4.5)
and so, for the linear system (see equation 3.9),

Tc = (1 − b0 − 2A0) / (2AT − bT)    (4.6)
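Equation 4.6 can be checked directly against the coefficient values quoted in the caption of Figure 5; a minimal evaluation:

```python
# Equation 4.6 evaluated with the coefficients quoted in the caption of
# Figure 5: b0 = 0.675, bT = 0.007, A0 = 0.3, AT = 0.001.
b0, bT = 0.675, 0.007
A0, AT = 0.3, 0.001
Tc = (1 - b0 - 2 * A0) / (2 * AT - bT)
print(round(Tc, 6))   # → 55.0 (degrees C)
```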
(for the coefficients shown in Figures 3 and 5, Tc = 55 °C). We therefore see that the nth burst plateau corresponds to a partially stable system for which the time when the barrier is absent is commensurate with the time to wind n times round the torus. Perhaps surprisingly, it is found that T1 < Tc (recall that T1 is the temperature beyond which the deterministic neuron ceases firing). In fact, there is a finite temperature range between T1 and Tc for which one would expect the running mode to persist over a significant part of the oscillation period even though, in the limit D → 0, no spikes are generated. This is a consequence of critical slowing down close to the bifurcation. If λmax is only marginally greater than unity, then a “ghost” of the half-stable fixed point causes the relaxation time t0 to become comparable to the slow-wave oscillation period. The system is then unable to escape beyond this laminar region before λ decreases again below unity and the system goes through the saddle-node bifurcation in reverse. We may in fact derive a condition for beating to occur. For at least one action potential to be generated per oscillation, θ must pass through π (i.e., the point at which the stable and unstable fixed points collide) within the first half of the cycle, that is, within t0 < π/Ω. If θ passes through π after one cycle, then it moves so slowly that it is unable to escape before the bifurcation
recurs, and it becomes trapped by the barrier. Thus, if the oscillator is found at the stable state, θ̂s ≡ cos⁻¹(−λmin), at t = 0, then we have the condition

∫_0^{π/Ω} θ̇ dt ≥ π − θ̂s.    (4.7)
The envelope function (recall equation 3.8) is a heuristic that loosely predicts how the deterministic pattern varies with temperature. However, for this deterministic case, it is possible to predict exactly how many action potentials are actually generated during one cycle. Following Ermentrout and Kopell (1986) we recast the zero-noise limit of equation 3.11 as a Mathieu equation. We use the transformation
((1 + b)/2) cot(θ/2) = (1/V)(dV/dt)    (4.8)
and simple trigonometric identities to obtain

d²V/ds² + [a − 2q cos(2s)]V = 0,
(4.9)
where we have rescaled time so that 2s = Ωt and introduced, to accord with convention (McLachlan, 1947), the coefficients a and q:

a = (b² − 1)/Ω²   and   q = A(b + 1)/Ω².    (4.10)
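Whether a given (a, q) pair falls in a stable region or in an instability tongue of the Strutt map discussed next can be decided numerically from Floquet theory: the solutions of equation 4.9 are bounded exactly when the trace of the monodromy matrix over one period of the coefficient has magnitude below 2. A sketch, not from the paper; the two test points are illustrative:

```python
import numpy as np

def mathieu_trace(a, q, n_steps=2000):
    """Trace of the monodromy matrix of V'' + (a - 2 q cos 2s) V = 0 over
    one period (pi) of the coefficient.  |trace| > 2 iff (a, q) lies in an
    instability tongue of the Strutt map; |trace| < 2 iff it is stable."""
    h = np.pi / n_steps

    def rhs(s, y):          # y = [V, V']
        return np.array([y[1], -(a - 2 * q * np.cos(2 * s)) * y[0]])

    def propagate(y):       # classical 4th-order Runge-Kutta over [0, pi]
        s = 0.0
        for _ in range(n_steps):
            k1 = rhs(s, y)
            k2 = rhs(s + h / 2, y + h / 2 * k1)
            k3 = rhs(s + h / 2, y + h / 2 * k2)
            k4 = rhs(s + h, y + h * k3)
            y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
            s += h
        return y

    col1 = propagate(np.array([1.0, 0.0]))   # fundamental solution 1
    col2 = propagate(np.array([0.0, 1.0]))   # fundamental solution 2
    return col1[0] + col2[1]

print(abs(mathieu_trace(1.0, 0.5)) > 2)   # → True: inside the first tongue
print(abs(mathieu_trace(2.0, 0.1)) > 2)   # → False: a stable region
```

Applying this test along the temperature-parameterized trajectory (a(T), q(T)) of equation 4.10 is what generates the staircase discussed below.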
The Mathieu equation, 4.9, is a linear equation with periodic coefficients (McLachlan, 1947) and hence, according to Floquet’s theorem, has a general solution of the form

V(s) = c1 exp(ρ1 s) p1(s) + c2 exp(ρ2 s) p2(s),
(4.11)
where c1 and c2 are constants, ρ1 and ρ2 are termed characteristic exponents, and the pi(s) are periodic functions with the same minimal period as the periodic coefficient of the original equation, 4.9. When the ρi are purely imaginary (and conjugate), the solutions are bounded and oscillatory. However, when both are real (with ρ1 = −ρ2), all solutions are unbounded. The (a, q) plane divides into a countable set of simply connected regions for which either all solutions fall into the former class (the stable regions) or all belong to the latter (the unstable regions). This situation is depicted in the Strutt stability map (see Figure 7) (McLachlan, 1947). Let Ij denote the jth instability tongue of the Strutt map, which corresponds to a set of general solutions with real exponents. Unbounded solutions are generally of one of two qualitative types (Nayfeh & Mook, 1995):
Figure 7: The Strutt map: stability and instability regimes for the Mathieu equation. The graph is symmetric about the ordinate. The jth instability tongue represents a region of parameter space in which the neuron bursts (fires repetitively with a periodic pattern), with each burst containing j spikes.
either oscillatory but with an amplitude that increases exponentially with time, or nonoscillatory but exponentially increasing. Our interest is in the former. According to the Sturm oscillation theorem (Coddington & Levinson, 1955), each of the periodic functions, p1(s) and p2(s), of a solution in the jth instability region has exactly j zeros per oscillation period of the slow wave (Ermentrout & Kopell, 1986). Returning to our original variable, and writing ρ1 = −ρ2 ≡ α,
θ(s) = 2 cot⁻¹[ 2(c1(αp1(s) + p1′(s)) + c2 e^{−2αs}(−αp2(s) + p2′(s))) / ((1 + b)(c1 p1(s) + c2 e^{−2αs} p2(s))) ],    (4.12)
since α > 0, if c1 ≠ 0 then θ(s) exponentially approaches the stable periodic solution

θs(s) = 2 cot⁻¹[ 2(αp1(s) + p1′(s)) / ((1 + b) p1(s)) ].    (4.13)

Since p1(s) has j zeros over the period of the slow-wave oscillation, the argument of cot⁻¹ “blows up” j times over this period. Consequently, θ passes 0 j times within a period. Furthermore, by use of equations 3.5 and 3.7, it is simple to verify that each time θ passes through 0, it does so with a positive velocity. Thus θ wraps around the torus j times per slow-wave cycle, and this corresponds to a burst containing j spikes. The coefficients of the Mathieu equation, 4.9, are parameterized by the temperature, and so as T varies, the system traces a trajectory across the (a, q) plane. As the trajectory passes through the kth instability region, the neuron has a burst length of k spikes, generating the plateaus seen in Figure 5. Note also that while ε appears in the EK transform, it drops out of the canonical model (see equation 3.2). Ermentrout and Kopell (1986, lemma 4) further show that for any pair of Mathieu coefficients (a, q) that lie in an instability region of the Strutt map, there exists a sufficiently small ε such that the full equations have a solution that converges to the reduced one.

4.2 Stochastic Dynamics of Bursting and Beating.
4.2.1 Smoothing the Mathieu Staircase. To clarify how the transitions between plateaus become smoothed as noise is introduced, we will examine the transition from beating to quiescence, occurring at T = T1. Our conclusions extrapolate to each transition between the nth and (n − 1)th plateaus. We introduce a small amount of noise into the dynamics of equation 3.11, so that D ≠ 0.³ For T marginally greater than T1, the neuron now emits a succession of single spikes that are entrained to the underlying slow wave, but occasionally cycles are skipped. As the temperature increases beyond T1, cycles are skipped more frequently, until the neuron becomes silent. At any T, the skipping rate depends on the noise intensity; the noise simply propels the voltage over the unstable fixed point (Gang, Ditzinger, Ning, & Haken, 1993; Rappel & Strogatz, 1994). Conversely, for temperatures slightly below T1, the noisy neuron is seen to misfire occasionally and thus skip a period of the slow-wave oscillation. In this regime, although the deterministic neuron is able to fire, the noise can trap the system above the ghost bottleneck and postpone its firing. To understand this, note that critical slowing down in the laminar bottleneck means
³ Recall equation 3.10; the noise intensity is assumed to be independent of T.
that the noisy dynamics have negligible deterministic motion (“drift”) in this region and so approximate a one-dimensional Wiener process. Thus, the neuron is equally likely to diffuse in either direction. If the diffusion acts to diminish θ̇, the firing condition in equation 4.7 may be violated and firing retarded. Qualitatively similar retardation and trapping due to noise has been noted by Apostolico et al. (1994) as a failure mechanism in bistable switches.

These two skipping modes have a natural interpretation in terms of the Strutt map. Neural dynamics close to a transition temperature correspond to a region in parameter space close to a tongue boundary. The inclusion of noise allows the neuron to execute a random walk through the Strutt map and so to explore adjacent regions of the parameter space. Thus, if the neuron is in the jth tongue but lies close to the (j + 1)th, the noise can carry the neuron over the boundary and thus augment the burst. Conversely, if the neuron lies closer to the boundary with the (j − 1)th tongue, the noise can delete a spike from the burst.

4.2.2 The Skipping Rate. We have seen that for the deterministic neuron, a spike is generated only if θ passes through π within the first half of the cycle. Numerical investigations of equation 4.1 show that the condition in equation 4.7 generalizes to the case of weak noise, provided we consider instead the probability that θ is greater than π after half a driving period. We introduce the conditional probability density p(θ, t | θ0, t0), subject to the initial condition

p(θ, 0 | θ̂s, 0) = δ(θ − θ̂s)
(4.14)
(recall that the stable state θ̂s ≡ cos⁻¹(−λmin)). Therefore the probability P(θ > π, t | θ̂s, 0) that at a time t, θ is greater than π is given by

P(θ > π, t | θ̂s, 0) = ∫_π^{2π} p(θ, t | θ̂s, 0) dθ,    (4.15)
which is equal to the probability of generating an action potential. By performing an ensemble average, we may equate this quantity with the mean number of spikes per slow-wave cycle. The conditional probability density obeys a Smoluchowski equation (Risken, 1989),

∂p(θ, t | θ̂s, 0)/∂t = (∂/∂θ)[ U′(θ, t) + (D/2)(∂/∂θ) ] p(θ, t | θ̂s, 0),
(4.16)
where U′(θ, t) represents the spatial derivative of the potential. The time dependence of the potential forbids a general closed-form solution to equation 4.16 and furthermore makes a numerical solution difficult to obtain. However,
a Monte Carlo approximation to the probability density p(θ, t | θ̂s, 0) may be obtained. We first numerically iterate an ensemble of N receptors θi(t), i = 1, …, N, for half of one slow-wave period, subject to the initial condition θi(t = 0) = θ̂s ≡ cos⁻¹(−λmin) for all i. An approximation to p(θ, t | θ̂s, 0) is then given by a normalized histogram of the ensemble of θi(t = π/Ω), and so an estimate of the firing rate may be found from equation 4.15 (see Figure 5).

5 Discussion
We have developed a phase model of slow-wave parabolic bursting cold receptors that qualitatively reproduces the observed firing patterns at different constant temperatures. The stochastic extension of this model accounts for the variability in the number of spikes per burst and for the skipping patterns. Our analysis in terms of a Strutt map allows a description of the regions of parameter space that produce specific firing patterns, in both the deterministic and noise-driven regimes. It is worthwhile to make a few points regarding our model.

5.1 Paths for Other Receptors. There are many different thermally responsive bursting cells (Braun et al., 1984), for example, the feline lingual and infraorbital nerves, the boa constrictor warm fiber, and the dogfish ampullae of Lorenzini. The discharge patterns of all of these cell types exhibit many similar qualitative features, but quantitatively they differ; for example, they have differing burst lengths at a given temperature. In addition, there can also be considerable variation within a single cell type (recall section 2). The paradigm of a temperature-dependent noisy trajectory through the Strutt map therefore offers a universal model that might help to explain the discharge patterns of more of these cells.

5.2 Chaos. The existence of chaos in thermoresponsive neuronal spike trains has recently been studied in both real (Braun et al., 1997) and model (Longtin & Hinzer, 1996) neurons. However, the phase model reported here does not support chaos; instead, its spike-train irregularities have a stochastic origin. Is this important? For this class of neurons at least, the answer is “probably not,” since it is more likely that the bursting pattern itself, rather than the timing of individual spikes within a burst, is the fundamental carrier of information. Such patterns may be more reliably detected by higher neurons owing to synaptic facilitation.

5.3 Additive Versus Multiplicative Noise.
We introduced thermal and pump noise as a simple additive term on the θ dynamics. However, there are other ways in which noise might enter the dynamical equations. Consider instead an additive term, ξs(t), on the slow dynamics (see also Gutkin &
Ermentrout, 1998), so that equation 3.5 becomes

dθ/dt = [b + cos(θ)] − [1 − cos(θ)][A cos(Ωt) + ξs(t)]
      = F(θ, t) − [1 − cos(θ)]ξs(t).
(5.1)
ξs(t) now couples multiplicatively to the fast dynamics, and when θ = 0, the effect of the noise vanishes. Since the noise cannot carry the system around the limit cycle, it might therefore appear that this sort of noise cannot contribute to the spiking dynamics. However, recall the gradient system (see equation 4.1) and Figure 6, and note that θ = 0 lies halfway down a fairly steep slope. At this time (the peak amplitude of a fast sodium spike) and for the noise levels considered here, the deterministic dynamics dominate and carry the cell around the limit cycle. The additive noise introduced in equation 3.11 would also have very little impact at this time.⁴ Further, following Sigeti (1988), we could replace [1 − cos(θ)]ξs(t) by a mean noise (Gutkin & Ermentrout, 1998), averaged over some spatial domain (say, between the stable and unstable fixed points), and still obtain quantitatively similar behavior.

5.4 Different Noise Distribution. In the absence of any electrophysiological motivation, we have introduced a white noise process. Recall that the discharge pattern derives from a random walk through the Strutt map and that spike augmentation and deletion arise when the random walk crosses a tongue boundary. In consequence, we note that other additive noise distributions (such as the Ornstein-Uhlenbeck process considered in Longtin & Hinzer, 1996) will produce qualitatively similar burst patterns, and so the actual noise distribution is not pertinent to an understanding of the general model. However, the choice of noise distribution will be extremely important when describing a specific burst pattern.

5.5 Asymmetric Burst Patterns and Parabolicity. In contrast to our phase model, discharge patterns from real cold receptors exhibit asymmetric burst patterns. Typical burst profiles (Braun et al., 1980) comprise a rapid increase followed by a slower decrease in spike frequency, resulting in a sawtooth profile.
However, parabolic cells generally exhibit burst profiles with firing rates that vary nonmonotonically throughout the burst. Thus, definitively labeling the discharge of these cells as parabolic is problematic. However, according to Rinzel and Lee (1987), the symmetry of the discharge of the Plant model can be manipulated by parameter modification. Furthermore,

⁴ For additive noise, there is in fact a finite probability for θ to recross zero in the opposite direction. However, such an event is unlikely, and, moreover, simply crossing zero does not constitute a spike. A spike is a full rotation around the limit cycle, and the probability that this will happen is vanishingly small (at least for these noise levels).
Figure 8: A burst pattern from a sawtooth slow wave (the first few terms of the expansion A[1 − Σₙ (1/n) sin(nx)]). Parameters are Ω = 0.001, A = 0.5, and b = 0.6 (cf. figure 1 of Braun et al., 1980). See also Ermentrout and Kopell (1986).
for our canonical model, the time-dependent term A cos(Ωt) models the oscillatory slow-wave dynamics. We have no electrophysiological guidelines to direct our choice of its functional form, and so we have chosen a cosine for its analytical tractability. However, the asymmetry of the discharge of real cells suggests that if the bursting is of slow-wave type, then the wave is asymmetric. Thus a more complex waveform may better reproduce the finer details of the discharge pattern. Figure 8 shows a single burst driven by a sawtooth slow wave, and the finer details of this pattern are in better agreement with the data (compare with figure 1 of Braun et al., 1980). Our analysis readily extends to slow waves of this form, except that the Mathieu equation must now be replaced by a more general Hill equation (Ermentrout & Kopell, 1986).

6 Conclusions
We have presented a tractable phase model for cold-receptor function. Our phase model can be related to more complex, biophysical models of neuronal operation, and we have shown how certain parameters of the phase model derive from Plant’s ionic model. We have investigated the phase model, both numerically and analytically, in the deterministic regime and also when subject to a finite amount of thermal noise. Numerically obtained
spike trains and interspike interval histograms from the phase model agree well with the experimental data. From our investigations, we have shown that skipping might be caused by noise. We have further shown that both the number of spikes in a burst and the skipping rate at any given temperature may be predicted. Finally, we have studied how altering the noise level affects the dynamics and shown that the skipping regime may be subdivided: the first part of skipping is caused by noise-induced trapping, and the second by noise-induced spiking. Our findings provide a framework in which to study the effect of various physicochemical parameters in the environment of these receptors, and analytical insight into the variability of patterns across receptors.

Appendix
Here we analyze the temperature dependence of the slow-wave frequency in a model of cold receptors based on the Plant model described in Longtin and Hinzer (1996). This model is five-dimensional and possesses a clear separation of timescales, as discussed in Rinzel and Lee (1987). The three fast variables are the voltage V and the gating variables h and n of the fast-spiking dynamics (sodium inactivation and potassium activation). The slow variables are the gating variable x of the slow inward current and the intracellular calcium concentration y. The dynamics of V are given by

Cm dV/dt = −Iion(V, h, n; x, y).    (A.1)
The dynamics of x and y are given by

dx/dt = λ[x∞(V) − x]/τx    (A.2)
dy/dt = ρ[Kc x(VCa − V) − y].    (A.3)
In the approximation where the full system can be separated into fast and slow subsystems, the slow dynamics are

dx/dt = λ[x∞(Vss) − x]/τx    (A.4)
dy/dt = ρ[Kc x(VCa − Vss) − y],    (A.5)
where Vss is the pseudo-steady-state membrane voltage, obtained as a solution of

0 = Iion(Vss, h∞(Vss), n∞(Vss); x, y).    (A.6)
The important fact is that Vss = Vss(x, y). Hence, the (x, y) dynamics are coupled together by Vss. Based on numerical computations of this surface in the absence of fast-spiking dynamics (Rinzel & Lee, 1987; see in particular figure 2), we then approximate the pseudo-steady-state surface by the plane

Vss(x, y) ≈ Vo + a1 x + a2 y,
(A.7)
where Vo is the mean value of the slow-wave potential, and the partial derivatives a1 and a2 are constants. These constants are negative according to Rinzel and Lee (1987; see figure 2). Using this form for Vss, we can write

x∞⁻¹(x, y) = exp[0.15(−50 − Vo − a1 x − a2 y)] + 1.    (A.8)

Introducing the coordinates X ≡ x − x* and Y ≡ y − y*, the linearization about the fixed point (x*, y*) of this two-dimensional subsystem is

dX/dt = αλX + βλY    (A.9)
dY/dt = γρX + δρY,    (A.10)
where we have isolated the temperature-dependent parameters λ and ρ from the other parameters:

α ≡ ( ∂x∞/∂x |_(x*,y*) − 1 ) τx⁻¹    (A.11)
β ≡ τx⁻¹ ∂x∞/∂y |_(x*,y*)    (A.12)
γ ≡ Kc(VCa − Vo − 2a1 x* − a2 y*)    (A.13)
δ ≡ −(Kc a2 x* + 1).    (A.14)
We note that x and y are positive quantities, as is γ; α and β are negative; and the sign of δ depends on the precise parameter values. If (αλ + ρδ)² < 4ρλ(αδ − γβ), the eigenvalues are complex. At the bifurcation point, αλ + ρδ = 0, and the frequency is given by

ωH = √((αδ − γβ)ρλ) ∼ (ρλ)^(1/2),    (A.15)

which is real since αδ − γβ > 0.

Acknowledgments
This work was supported by a Loughborough University postgraduate studentship, a traveling fellowship from the John Phillips Guest fund, and NSERC (Canada).
References

Apostolico, F., Gammaitoni, L., Marchesoni, F., & Santucci, S. (1994). Resonant trapping: A failure mechanism in switch transitions. Physical Review E, 55(1), 36–39.

Braun, H. A., Bade, H., & Hensel, H. (1980). Static and dynamic discharge patterns of bursting cold fibres related to hypothetical receptor mechanisms. Pflügers Archiv: European Journal of Physiology, 386, 1–9.

Braun, H. A., Huber, M. T., Dewald, M., Schäfer, K., & Voigt, K. (1998). Computer simulations of neuronal signal transduction: The role of nonlinear dynamics and noise. International Journal of Bifurcation and Chaos, 8(5), 881–889.

Braun, H. A., Schäfer, K., Voigt, K., Peters, R., Bretschneider, F., Pei, X., Wilkens, L., & Moss, F. (1997). Low-dimensional dynamics in sensory biology 1: Thermally sensitive electroreceptors of the catfish. Journal of Computational Neuroscience, 4, 335–347.

Braun, H. A., Schäfer, K., & Wissing, H. (1990). Theories and models of temperature transduction. In J. Bligh & K. Voigt (Eds.), Thermoreception and thermal regulation (pp. 18–29). Berlin: Springer-Verlag.

Braun, H. A., Schäfer, K., Wissing, H., & Hensel, H. (1984). Periodic transduction processes in thermosensitive receptors. In W. Hamman & A. Iggo (Eds.), Sensory receptor mechanisms (pp. 147–156). Singapore: World Scientific Publishing.

Braun, H. A., Wissing, H., Schäfer, K., & Hirsch, M. C. (1994). Oscillation and noise determine signal transduction in shark multimodal sensory cells. Nature, 367, 270–273.

Canavier, C. C., Clark, J. W., & Byrne, J. H. (1991). Simulation of the bursting activity of neuron R15 in Aplysia: Role of ion currents, calcium balance, and modulatory transmitters. Journal of Neurophysiology, 66, 2107–2124.

Carpenter, D. O. (1967). Temperature effects on pacemaker generation, membrane potential, and critical firing threshold in Aplysia neurons. Journal of General Physiology, 50(6), 1469–1484.

Coddington, E. A., & Levinson, N. (1955). Theory of ordinary differential equations. New York: McGraw-Hill.

DeFelice, L. J. (1981). Introduction to membrane noise. New York: Plenum.

Dykes, R. W. (1975). Coding of steady and transient temperatures by cutaneous “cold” fibers serving the hand of monkeys. Biophysical Journal, 98, 485–500.

Ermentrout, G. B., & Kopell, N. (1986). Parabolic bursting in an excitable system coupled with a slow oscillation. SIAM Journal of Applied Mathematics, 46(2), 233–253.

Gang, H., Ditzinger, T., Ning, C. Z., & Haken, H. (1993). Stochastic resonance without external periodic force. Physical Review Letters, 71(6), 807–810.

Gutkin, B. S., & Ermentrout, G. B. (1998). Dynamics of membrane excitability determine interspike interval variability: A link between spike generation mechanisms and cortical spike train statistics. Neural Computation, 10, 1047–1065.
Ivanov, K. P. (1990). The location and function of different skin thermoreceptors. In J. Bligh & K. Voigt (Eds.), Thermoreception and thermal regulation (pp. 37–43). Berlin: Springer-Verlag.

Kopell, N., & Ermentrout, G. B. (1986). Subcellular oscillations and bursting. Mathematical Biosciences, 78, 265–291.

Lisman, J. E. (1997). Bursts as a reliable unit of neural information: Making unreliable synapses reliable. Trends in Neuroscience, 20(1), 38–43.

Longtin, A., & Hinzer, K. (1996). Encoding with bursting, subthreshold oscillations, and noise in mammalian cold receptors. Neural Computation, 8, 215–255.

McLachlan, N. W. (1947). Theory and application of Mathieu functions. Oxford: Clarendon Press.

Nayfeh, A. H., & Mook, D. T. (1995). Nonlinear oscillations. New York: Wiley.

Plant, R. E. (1981). Bifurcation and resonance in a model for bursting nerve cells. Journal of Mathematical Biology, 11, 15–32.

Rappel, W., & Strogatz, S. H. (1994). Stochastic resonance in an autonomous system with a nonuniform limit cycle. Physical Review E, 50(4), 3249–3250.

Rinzel, J., & Lee, Y. S. (1987). Dissection of a model for neuronal parabolic bursting. Journal of Mathematical Biology, 25, 653–675.

Risken, H. (1989). The Fokker-Planck equation (2nd ed.). New York: Springer-Verlag.

Schäfer, K., Braun, H. A., & Rempe, L. (1988). Classification of a calcium conductance in cold receptors. Progress in Brain Research, 74, 29–36.

Schäfer, K., Braun, H. A., & Rempe, L. (1990). Mechanisms of sensory transduction in cold receptors. In J. Bligh & K. Voigt (Eds.), Thermoreception and thermal regulation (pp. 30–36). Berlin: Springer-Verlag.

Sigeti, D. E. (1988). Universal results for the effects of noise on dynamical systems. Unpublished doctoral dissertation, University of Texas at Austin.

Soto-Treviño, C., Kopell, N., & Watson, D. (1996). Parabolic bursting revisited. Journal of Mathematical Biology, 35, 114–128.

Strogatz, S. H. (1994). Nonlinear dynamics and chaos. Reading, MA: Addison-Wesley.

Wiederhold, M. L., & Carpenter, D. O. (1982). Possible role of pacemaker mechanisms in sensory systems. In D. O. Carpenter (Ed.), Cellular pacemakers (Vol. 2, pp. 27–58). New York: Wiley-Interscience.

Willis, J. A., Gaubatz, G. L., & Carpenter, D. O. (1974). The role of the electrogenic pump in modulation of pacemaker discharge of Aplysia neurons. Journal of Cellular Physiology, 84, 463–471.

Received November 10, 1998; accepted June 14, 1999.
LETTER
Communicated by Bard Ermentrout
The Number of Synaptic Inputs and the Synchrony of Large, Sparse Neuronal Networks

D. Golomb
Zlotowski Center for Neuroscience and Department of Physiology, Faculty of Health Sciences, Ben Gurion University of the Negev, Be’er-Sheva 84105, Israel

D. Hansel
Laboratoire de Neurophysique et de Physiologie du Système Moteur, EP 1848 CNRS, Université René Descartes, 45 rue des Saints-Pères, 75270 Paris Cedex 06, France

The prevalence of coherent oscillations in various frequency ranges in the central nervous system raises the question of the mechanisms that synchronize large populations of neurons. We study synchronization in models of large networks of spiking neurons with random sparse connectivity. Synchrony occurs only when the average number of synapses, M, that a cell receives is larger than a critical value, Mc. Below Mc, the system is in an asynchronous state. In the limit of weak coupling, assuming identical neurons, we reduce the model to a system of phase oscillators that are coupled via an effective interaction, Γ. In this framework, we develop an approximate theory for sparse networks of identical neurons to estimate Mc analytically from the Fourier coefficients of Γ. Our approach relies on the assumption that the dynamics of a neuron depend mainly on the number of cells that are presynaptic to it. We apply this theory to compute Mc for a model of inhibitory networks of integrate-and-fire (I&F) neurons as a function of the intrinsic neuronal properties (e.g., the refractory period Tr), the synaptic time constants, and the strength of the external stimulus, Iext. The number Mc is found to vary nonmonotonically with the strength of Iext. For Tr = 0, we estimate the minimum value of Mc over all the parameters of the model to be 363.8. Above Mc, the neurons tend to fire in smeared one-cluster states at high firing rates and smeared two-or-more-cluster states at low firing rates. Refractoriness decreases Mc at intermediate and high firing rates.
These results are compared to numerical simulations. We show numerically that systems with different sizes, N, behave in the same way provided the connectivity, M, is such that 1/Meff = 1/M − 1/N remains constant when N varies. This allows extrapolating the large-N behavior of a network from numerical simulations of networks of relatively small sizes (N = 800 in our case). We find that our theory predicts with remarkable accuracy the value of Mc and the patterns of synchrony above Mc, provided the synaptic coupling is not too large. We also study the strong coupling regime of inhibitory

Neural Computation 12, 1095–1139 (2000)  © 2000 Massachusetts Institute of Technology
1096
D. Golomb and D. Hansel
sparse networks. All of our simulations demonstrate that increasing the coupling strength reduces the level of synchrony of the neuronal activity. Above a critical coupling strength, the network activity is asynchronous. We point out a fundamental limitation for mechanisms of synchrony relying on inhibition alone, if heterogeneities in the intrinsic properties of the neurons and spatial fluctuations in the external input are also taken into account.

1 Introduction

The ubiquity of coherent oscillations in the central nervous system (for a review, see Gray, 1994) raises fundamental biophysical issues concerning the role of intrinsic neuronal and synaptic characteristics in the emergence of coherent neuronal activity. In recent years, these issues have been addressed in various theoretical studies, focusing mostly on the simplest situation, in which the neurons are identical and coupled all-to-all (Gerstner, 1999; Gerstner, van Hemmen, & Cowan, 1996; Hansel, Mato, & Meunier, 1993a, 1993b, 1993c, 1995; van Vreeswijk, Abbott, & Ermentrout, 1994). These studies have demonstrated that the degree of synchrony that can emerge in a large neural system depends crucially on the nature, whether excitatory or inhibitory, of the synaptic interactions between the neurons. More recently, the effect of heterogeneities in neuron properties, still assuming all-to-all coupling, has been investigated (Chow, 1998; Chow, White, Ritt, & Kopell, 1998; Neltner, Hansel, Mato, & Meunier, 1999; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998). On the basis of these studies, it has been suggested that inhibition should play a crucial role in the emergence of synchrony in the central nervous system. A similar conclusion has been reached, in parallel with these theoretical investigations, in experimental studies of in vitro preparations.
Indeed, Whittington, Traub, and Jefferys (1995) demonstrated that an inhibitory population can fire in synchrony in the gamma band even if the excitatory population is blocked pharmacologically. This result goes along with the idea that pyramidal cells in vivo may be entrained by synchronous rhythmic inhibition originating from fast-spiking interneurons, as suggested as early as 1983 by Buzsáki, Leung, and Vanderwolf (1983) and by Buzsáki and Chrobak (1995) (see also the theoretical study by Lytton & Sejnowski, 1991). Most of the theoretical investigations of neural synchrony in large networks have focused on the all-to-all architecture. However, in cortex, each neuron is coupled to only a certain number of neurons, which is much smaller than the total number of neurons in the network. This property of the network architecture, referred to as sparseness or dilution, is expected to reduce, or even destroy, the synchrony in the network even if the corresponding network with all-to-all coupling is highly coherent. It is often assumed in models that the coupling between neurons is random. In these cases (Barkai, Kanter, & Sompolinsky, 1990; Wang, Golomb, & Rinzel, 1995), it was found
Synchrony of Sparse Networks
1097
that when the all-to-all network is synchronized, synchrony will still prevail if the average number of inputs a cell receives is larger than a critical number. Wang and Buzsáki (1996) numerically studied a model of a sparse and heterogeneous network made of spiking neurons with inhibitory coupling. They claimed that synchrony could emerge if the ratio between the synaptic decay time constant and the oscillation period was sufficiently large and the heterogeneity level was modest. For their specific parameter set, they determined that the average number of inputs a cell should receive in order to get synchrony is on the order of 100. However, this number depends on the intrinsic properties of the cells as well as on the synaptic and network properties. In this article, we study models of large, sparse networks made of identical, intrinsically oscillating neurons with inhibitory coupling. Our goal is to find how the critical average number of synaptic inputs, M_c, above which synchrony emerges, and the patterns of synchrony in networks in which neurons receive more inputs than this critical number, depend on the intrinsic and synaptic properties of the neurons. Our strategy is to study the limit of weak synaptic coupling, in which the network dynamics can be reduced to the dynamics of coupled phase oscillators, which, to a large extent, can then be studied analytically. Using an approximate mean-field theory, we estimate the critical average number of inputs for the integrate-and-fire (I&F) neuron model. This result helps us to understand the network dynamics even beyond this limit, as obtained numerically. Simulations of sparse I&F networks are performed to verify the prediction of our approximate theory for finite (but not too strong) coupling, to find the patterns of synchrony in networks for which the asynchronous state is unstable, and to study the model behavior beyond the weak coupling limit.
In the next section we present a general framework for modeling diluted networks. Section 3 is devoted to the weak coupling case. The theory is applied to networks of I&F neurons in section 4, and simulations of such networks are described in section 5. Section 6 summarizes and discusses our findings.

2 The Model

2.1 Architecture of the Network. We are interested in the dynamics of one population of N neurons of the same type, all excitatory or all inhibitory, which are randomly connected. A neuron i (i = 1, …, N) makes a synapse onto neuron j with probability M/N, where M < N. We define a connectivity matrix w_ij by:

w_ij = { 1   if neuron j is presynaptic to neuron i
       { 0   otherwise.   (2.1)
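As an illustration, the random architecture above can be sampled directly. The following sketch is ours, not code from the article; the function name random_connectivity is our own, and the exclusion of self-connections is our assumption:

```python
import numpy as np

# Sample the random architecture of section 2.1: each off-diagonal entry
# w_ij is 1 with probability M/N, independently (self-connections excluded
# by assumption).
def random_connectivity(N, M, seed=None):
    """Return an N x N 0/1 matrix with w_ij = 1 iff j is presynaptic to i."""
    rng = np.random.default_rng(seed)
    w = (rng.random((N, N)) < M / N).astype(int)
    np.fill_diagonal(w, 0)
    return w

w = random_connectivity(N=1000, M=100, seed=0)
m = w.sum(axis=1)            # m_i: in-degree of neuron i
print(m.mean(), m.std())     # mean near M = 100, std near sqrt(M(1 - M/N))
```

The in-degree fluctuates from neuron to neuron with standard deviation close to √(M(1 − M/N)) ≈ 9.5 here; these fluctuations are the source of the neuron-to-neuron differences in synaptic input analyzed below.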
In this architecture, the number of synaptic inputs fluctuates from neuron to neuron with a population average M. The total synaptic input received by neuron i is

g_syn,i^tot = (g_syn/M) Σ_{j=1}^N w_ij s_j,   (2.2)
where s_i is the synaptic variable of the ith neuron and g_syn does not depend on M. With this normalization, the system-averaged synaptic input that a cell receives when all of the synaptic channels are open (s_i = 1 for all i) is g_syn. The standard deviation of the neuron-to-neuron fluctuations is

g_syn √(1/M − 1/N).   (2.3)

This value is equal to g_syn/√M for N ≫ M. The spatial fluctuations in the synaptic input decrease as M increases, and we expect the system to be more synchronized. Analysis of diluted networks is often carried out in two behavioral regimes:

The ratio M/N remains constant as N → ∞. In this regime, input fluctuations go to zero at large N, and the behavior is similar to the case of all-to-all coupling (M = N) (Golomb, Wang, & Rinzel, 1994).
The number M remains constant as N → ∞. This is the sparse regime, and synchrony is expected to be reduced at small values of M in comparison to the case M = N. In previous work on different systems (Barkai et al., 1990; Wang et al., 1995; Wang & Buzsáki, 1996), it was suggested that the asynchronous state is stable if M is below a critical value M_c. In this work we study the dynamical properties of sparse networks in the limit N → ∞.

2.2 Integrate-and-Fire Dynamics. In this article we mainly study networks of I&F neurons (Tuckwell, 1988; Abbott & van Vreeswijk, 1993; Bressloff & Coombes, 1998a, 1998b; Brunel & Hakim, 1999; Chow, 1998; Gerstner, 1999; Gerstner et al., 1996; Hansel et al., 1995; Kuramoto, 1991; Mirollo & Strogatz, 1990; Treves, 1993; Tsodyks, Mitkov, & Sompolinsky, 1993; Tsodyks & Sejnowski, 1995; van Vreeswijk et al., 1994). This enables us to investigate the synchronization properties of models without active ionic currents that generate spikes. In the following, we study the simplest I&F model (the Lapicque model), in which the membrane potential of a neuron satisfies the differential equation

dV/dt = −V/τ₀ + I_ext + I_syn(t)   (2.4)
for 0 < V < θ, and

V(t₀⁺) = 0   if   V(t₀⁻) = θ.   (2.5)
This last condition corresponds to a fast resetting of the neuron after the firing of a spike at time t₀; this firing occurs when the membrane potential reaches the spiking threshold θ; we choose θ = 10 mV above rest. The time constant of the membrane is τ₀ = 10 ms, and I_ext = Ĩ_ext/C, where Ĩ_ext represents the contribution of an external (e.g., sensory) stimulus to the system. The quantity C is the membrane capacitance. Note that the normalized parameter I_ext has units of mV/ms. Refractoriness is introduced into the model by imposing that V(t) remains equal to 0 for a refractory period, T_r, after the firing of a spike. The last term in equation 2.4 is the synaptic current received by the neuron, normalized to the value of C. For this model we adopt

I_syn(t) = g_syn s(t),   (2.6)
where g_syn is the normalized synaptic current (in units of mV/ms) and s is a synaptic variable given by

s = Σ_spikes f(t − t_spike),   (2.7)
where the summation is done over all of the spikes emitted prior to t by all of the neurons presynaptic to the neuron we consider. With this simplified form of synaptic coupling, excitation corresponds to g_syn > 0 and inhibition to g_syn < 0. The function f(t) is the extended α-function:

f(t) = [exp(−t/τ₂) − exp(−t/τ₁)] / (τ₂ − τ₁).   (2.8)
The characteristic times τ₁ and τ₂ are the rise and decay times of the synapse, respectively: τ₁ ≤ τ₂. If the neurons are not interacting (g_syn = 0), they emit spikes periodically for I_ext > θ/τ₀, with a firing period T₀ (firing frequency f₀ = 1/T₀):

T₀ = T_r − τ₀ ln(1 − θ/(τ₀ I_ext)).   (2.9)
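The single-neuron quantities above are straightforward to evaluate. The sketch below is our own illustration (the parameter values are our choices, not prescribed by the article):

```python
import numpy as np

# Minimal sketch of the single-neuron quantities of section 2.2.
tau0, theta = 10.0, 10.0    # membrane time constant (ms), threshold (mV)
tau1, tau2 = 1.0, 3.0       # synaptic rise and decay times (ms), tau1 <= tau2
Tr = 0.0                    # refractory period (ms)

def f(t):
    """Extended alpha-function, equation 2.8."""
    return (np.exp(-t / tau2) - np.exp(-t / tau1)) / (tau2 - tau1)

def T0(I_ext):
    """Free firing period, equation 2.9; valid for I_ext > theta/tau0."""
    return Tr - tau0 * np.log(1.0 - theta / (tau0 * I_ext))

print(T0(2.0))   # 10*ln(2) ~ 6.93 ms, i.e., f0 ~ 144 Hz
print(f(0.0))    # the kernel starts at 0, rises on tau1, decays on tau2
```

Note that f(0) = 0 and f(t) > 0 for t > 0, so each presynaptic spike contributes a smooth, delayed pulse of current with unit time-integral.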
3 The Weak Coupling Limit 3.1 Reduction to Phase Models. In general, the dynamical equations of large networks of spiking neurons cannot be solved analytically, and
the study of synchronization in such networks relies on numerical computations. However, under the following three conditions, the state of each neuron can be described by a phase variable, φ_i, which characterizes, at time t, the location of neuron i along its limit cycle (Kuramoto, 1984):

1. The uncoupled neurons display a periodic behavior (limit cycle) when the external current is above some threshold.

2. Their firing rates all lie in a narrow range; that is, the heterogeneities in the neuronal intrinsic properties and in the external inputs are small.

3. The coupling is weak (and therefore it modifies the firing rate of the neurons only slightly).

The original system of equations for the N oscillators is replaced by a simpler set of N differential equations that governs the time evolution of the φ_i:

dφ_i/dt = 2π f_{0i} + (1/M) Σ_{j≠i} w_ij Γ(φ_i − φ_j),   (3.1)
where f_{0i} is the firing rate of neuron i when uncoupled. Here, Γ is the effective interaction between any two neurons:

Γ(φ_i − φ_j) = (1/2π) ∫₀^{2π} Z(ψ + φ_i) I_syn(ψ + φ_i, ψ + φ_j) dψ.   (3.2)
The function Z is the phase-resetting curve (Kuramoto, 1984; Hansel et al., 1993a) of the neuron in the limit of vanishingly small perturbations of the membrane potential, and I_syn(φ_i, φ_j) is the synaptic interaction, which is a function of the phases of the presynaptic and postsynaptic neurons. The function Γ depends only on the single-neuron dynamics. For simple I&F neurons it can be computed analytically. For conductance-based neuronal models, numerical methods have to be used, as described in Ermentrout and Kopell (1984), Kopell (1988), or Hansel et al. (1993a). Once this effective phase interaction is determined, it can be used to analyze networks of arbitrary complexity. Reduction to a phase model assumes that the coupling is small enough for amplitude effects to be neglected. The more stable the limit cycle is, the weaker this constraint will be, but it is difficult to derive quantitative a priori estimates of the validity of the phase reduction. However, predictions of phase models often remain valid, at least qualitatively, for moderate values of the coupling. This is the case for the examples studied below.
3.2 Approximate Theory for Randomly and Weakly Coupled Neurons. The Fourier series representation of the Γ function is

Γ(φ) = Γ₀ + Γ̃(φ) = Γ₀ − |Γ₀| Σ_{n=1}^∞ c_n sin(nφ + α_n),   (3.3)
where Γ₀ is the zeroth Fourier component of Γ(φ), Γ̃(φ) includes all the phasic components of Γ, c_n is the normalized amplitude of the nth Fourier component and is therefore always nonnegative, and α_n is the angle of that mode. This parameterization is chosen in order to be consistent with previous works on networks of phase oscillators (Kuramoto, 1984; Strogatz & Mirollo, 1991). Substituting equation 3.3 in equation 3.1 yields the following dynamics:

dφ_i/dt = 2π f_{0i} + (m_i/M) Γ₀ + (1/M) Σ_{j=1}^N w_ij Γ̃(φ_i − φ_j),   i = 1, …, N,   (3.4)
where

m_i = Σ_{j=1}^N w_ij   (3.5)
is the number of synaptic inputs the ith cell receives from other cells. The second term in this equation is a major source of disorder in the system, which tends to desynchronize it. For example, if Γ₀ < 0, a cell that receives a larger number of inputs, m_i, will tend, in general, to delay the time of occurrence of its next spike in comparison to another cell that receives a smaller m_i. The third term in equation 3.4 includes the Fourier components, which depend on the difference φ_i − φ_j and therefore can help to synchronize the system. Therefore, the synchronization of the system is determined by a competition between the second and third terms. In most of this article, we assume that all the neurons are identical, that is, f_{0i} = f₀ for i = 1, …, N. In that case the first term in equation 3.4 can be gauged out of the equations. We analyze the synchronization properties of equation 3.4 in the limits

N → ∞   (3.6)

and

1 ≪ M ≪ N.   (3.7)
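The synchronizing role of the phasic term in equation 3.4 can be illustrated numerically. The sketch below is ours, not from the article: it takes the simplest special case (all-to-all coupling M = N, identical neurons, a single Fourier mode Γ̃(φ) ∝ −sin(φ), i.e., n = 1 and α₁ = 0, which lies inside −π/2 < α₁ < π/2), drops the common 2πf₀ drift, and integrates the phases with a forward Euler step of our choosing:

```python
import numpy as np

# All-to-all, identical-neuron special case of equation 3.4 with
# Gamma~(phi) = -sin(phi): a Kuramoto-like system that pulls the phases
# into in-phase synchrony.
rng = np.random.default_rng(5)
N = 500
phi = rng.uniform(-np.pi, np.pi, N)   # nearly asynchronous initial phases
dt = 0.05
for _ in range(2000):
    z = np.mean(np.exp(1j * phi))     # complex order parameter (modulus |z_1|)
    # All-to-all sum: (1/N) sum_j -sin(phi_i - phi_j) = -|z| sin(phi_i - arg z)
    phi += dt * (-np.abs(z) * np.sin(phi - np.angle(z)))
print(np.abs(np.mean(np.exp(1j * phi))))   # grows from ~1/sqrt(N) toward 1
```

In the sparse case, the second term of equation 3.4 adds a quenched spread of effective frequencies proportional to the in-degree fluctuations, which is precisely what competes against this synchronizing drift.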
The number of neurons that receive m inputs follows a binomial distribution, which we denote by P₀(m, M). Under the condition N ≫ M ≫ 1 (see equation 3.7),
this distribution (see equations 2.1 and 3.5) becomes a gaussian distribution, P(δm, M):

P(δm, M) = (1/√(2πM)) exp(−(δm)²/(2M)),   (3.8)
where δm = m − M. In this work, we develop an approximate theory. Our analysis is based on the assumption that the state of the network at time t can be characterized by the density function ρ(φ, t, δm) dφ, which is the fraction of neurons with M + δm inputs whose phase lies between φ and φ + dφ at time t. It is normalized such that

∫_{−π}^{π} ρ(φ, t, δm) dφ = 1.   (3.9)
In other words, we assume that the dynamics of a neuron depends only on the number of inputs it directly receives from other neurons in the network, and we derive (approximate) dynamics in terms of a Fokker-Planck equation for ρ (see equation A1). As shown in appendix A, this amounts to the assumption that most of the effects of the sparseness are due to the second term in equation 3.4. This is exact in the limit c_n → 0 for all n, but only approximate for a general phase model (Golomb & Hansel, unpublished). However, as we will see, this approximation provides good estimates in the case of neuronal systems. The asynchronous state is given by

ρ₀(φ, t, δm) = 1/(2π).   (3.10)
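The gaussian approximation of equation 3.8 for the in-degree distribution is easy to check by sampling. The following is our own quick numerical check, with arbitrary parameter values in the regime N ≫ M ≫ 1:

```python
import numpy as np

# Compare the exact binomial in-degree distribution P0(m, M) with the
# gaussian approximation of equation 3.8: dm = m - M should have mean ~0
# and standard deviation ~sqrt(M).
N, M = 100000, 400
rng = np.random.default_rng(2)
m = rng.binomial(N, M / N, size=200000)   # sampled in-degrees m_i
dm = m - M
print(dm.mean(), dm.std())                # mean near 0, std near sqrt(M) = 20
```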
In appendix A we estimate the critical value, M_c, such that for M > M_c the asynchronous state is an unstable solution of the (approximate) dynamics. Here we give the final results. The critical number, M_c, is the minimum of the numbers M_{c,n} over all positive integers n for which it is finite,

M_c = min_n M_{c,n},   (3.11)
where M_{c,n} is determined by the amplitude, c_n, and the phase, α_n, of the nth Fourier term of Γ(φ) (see equation 3.3):

M_{c,n} = { 8 / [π c_n² b²(α_n)]   −π/2 < α_n < π/2
          { infinity               otherwise.   (3.12)
The function b(α) is universal and does not depend on the model details (see Figure 12B). It is a decreasing function of |α|. As α → ±π/2, b(±π/2) = 0. The decrease is gradual for values of α sufficiently far from ±π/2. Near
these boundaries, it is very steep. For example, the value of b is 0.25 at α = ±0.995π/2. The integer n_c is defined as the value of n for which M_{c,n} is minimal. Therefore, M_c = M_{c,n_c}. The eigenmode that destabilizes the stationary distribution is a linear combination of exp(i n_c φ) and exp(−i n_c φ). The strength of the coupling, g_syn, affects only Γ₀, and not the c_n's and α_n's (see equation 3.3). Hence, in the limit of weak coupling, M_c does not depend on g_syn. Intuitively, this result stems from the fact that M_c is determined by the competition between the neuron-to-neuron fluctuations in the level of synaptic input, represented by Γ₀, and the strength of the higher components, represented by Γ₀ c_n (see equation 3.4). Both factors are proportional to Γ₀, and therefore g_syn does not affect the balance between them. From equation 3.12 we see that in order to get a finite M_{c,n}, the condition −π/2 < α_n < π/2 should hold. Moreover, M_{c,n} is an increasing function of |α_n|. The larger the coefficient c_n, the smaller M_{c,n} becomes. If the transition from an asynchronous state to a state with some synchrony is supercritical, n_c determines, for M ≳ M_c, the nature of the state in which the system settles at large time for almost asynchronous initial conditions. For example, if n_c = 1, the system is expected to settle in a "smeared" one-cluster state in which the phase distribution is broad but monomodal. If n_c = 2, the system is expected to behave above M_c as a "smeared" two-cluster state, characterized by bimodal phase distributions. Examples of such states are shown below.

4 Sparse Networks of Integrate-and-Fire Neurons

4.1 Estimates of M_c at Weak Coupling. The synaptic coupling term (see equation 2.6) in our version of the I&F model depends only on the dynamics of the presynaptic neuron via the variable s(t). Therefore, the effective interaction, Γ(φ), for this model is (see equation 3.2)

Γ(φ) = (1/2π) ∫₀^{2π} dψ Z(ψ + φ) I_syn(ψ).   (4.1)

The right-hand side is almost a convolution integral (except for the plus sign in the function Z(ψ + φ)). The phase coordinates are obtained from the time coordinate according to φ = ω₀ t, with ω₀ = 2π f₀. The function Z(φ) for the I&F model is (Hansel et al., 1995)

Z(φ) = { 0                                 0 < φ ≤ φ_r
       { (1/I_ext) e^{(φ − φ_r)/(ω₀ τ₀)}   φ_r < φ ≤ 2π,   (4.2)

where φ_r = ω₀ T_r. The function I_syn(φ) is (Hansel et al., 1995)

I_syn(φ) = (g_syn/(τ₂ − τ₁)) [ e^{−φ/(ω₀ τ₂)}/(1 − e^{−1/(f₀ τ₂)}) − e^{−φ/(ω₀ τ₁)}/(1 − e^{−1/(f₀ τ₁)}) ].   (4.3)
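Equations 4.1 through 4.3 can be evaluated numerically. The sketch below is our own (approximating the integral of equation 4.1 by a Riemann sum of our choosing, with the parameters of Figure 1); it illustrates that Γ is well defined and, for inhibitory coupling, negative:

```python
import numpy as np

# Numerical evaluation of the effective interaction Gamma of equation 4.1
# from Z (eq. 4.2) and I_syn (eq. 4.3); parameters as in Figure 1.
tau0, theta, Tr = 10.0, 10.0, 0.0
tau1, tau2 = 5.9, 6.0
I_ext, g_syn = 1.12, -1.0
T0 = Tr - tau0 * np.log(1.0 - theta / (tau0 * I_ext))   # eq. 2.9, ~22.3 ms
f0 = 1.0 / T0                                           # in 1/ms (~44.8 Hz)
w0 = 2.0 * np.pi * f0
phi_r = w0 * Tr

def Z(phi):                      # phase-resetting curve, eq. 4.2
    phi = np.mod(phi, 2 * np.pi)
    return np.where(phi <= phi_r, 0.0,
                    np.exp((phi - phi_r) / (w0 * tau0)) / I_ext)

def I_syn(phi):                  # periodized synaptic current, eq. 4.3
    phi = np.mod(phi, 2 * np.pi)
    return (g_syn / (tau2 - tau1)) * (
        np.exp(-phi / (w0 * tau2)) / (1 - np.exp(-1 / (f0 * tau2)))
        - np.exp(-phi / (w0 * tau1)) / (1 - np.exp(-1 / (f0 * tau1))))

def Gamma(phi, K=2000):          # Riemann-sum approximation of eq. 4.1
    psi = np.linspace(0, 2 * np.pi, K, endpoint=False)
    return np.mean(Z(psi + phi) * I_syn(psi))

print(Gamma(0.0))   # negative for g_syn < 0, since Z >= 0 and I_syn <= 0
```

Since the periodized kernel in equation 4.3 carries the sign of g_syn everywhere and Z ≥ 0, both Γ(φ) and its mean Γ₀ are negative here, consistent with the desynchronizing second term of equation 3.4 for inhibition.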
Figure 1: The functions V(φ) (normalized to θ), s(φ), Z(φ), and Γ(φ) (normalized to |g_syn|) for the I&F model with no refractory period (T_r = 0) and inhibitory coupling (g_syn < 0). Parameters: τ₁ = 5.9 ms, τ₂ = 6 ms, I_ext = 1.12 mV/ms, f₀ = 44.8 Hz.
The functions V(φ), s(φ), Z(φ), and Γ(φ) for a specific case are plotted in Figure 1. The function Z(φ) is discontinuous at φ = 0 and φ = φ_r, but the function Γ(φ) is continuous for all φ. Using the parameterization (see equation 3.3) and the calculation reported in appendix B, one obtains:

c_n = 2 {(1 + n²ω₀²τ₁²)(1 + n²ω₀²τ₂²)}^{−1/2} (1 + n²ω₀²τ₀²)^{−1/2} [1 + (2 I_ext (I_ext − θ)/θ²)(1 − cos(nω₀T_r))]^{1/2}   (4.4)

α_n = −π + (π/2) sign(g_syn) + arctan(nω₀τ₁) + arctan(nω₀τ₂) + arctan(nω₀τ₀) + arctan[ sin(nω₀T_r) / (I_ext/(I_ext − θ) − cos(nω₀T_r)) ].   (4.5)
The first factor in equation 4.4 depends only on the synaptic properties, whereas the second factor reflects the intrinsic response properties of the neurons. Similarly, the first three nontrivial terms in equation 4.5 depend on the nature of the interactions and on the synaptic dynamics, while the last two terms in this equation depend only on the intrinsic properties of the neurons.

4.2 Inhibitory Coupling, T_r = 0. The angle, α_n, for inhibitory coupling and T_r = 0 is given by

α_n = −3π/2 + arctan(nω₀τ₀) + arctan(nω₀τ₁) + arctan(nω₀τ₂).   (4.6)
The critical number, M_{c,n}, is finite only if −π/2 < α_n < π/2. All the arguments of the arctan function in equation 4.6 are positive; therefore α_n < 0, since the sum of the three last terms is between 0 and 3π/2. To get α_n > −π/2, the sum of these three terms should be larger than π (see Figure 2A). The coefficients, c_n, are given by

c_n = 2 [(1 + n²ω₀²τ₀²)(1 + n²ω₀²τ₁²)(1 + n²ω₀²τ₂²)]^{−1/2}.   (4.7)
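Equations 4.6 and 4.7 are simple enough to evaluate directly. The following sketch is ours (the parameter values match those of Figures 1 and 2; the universal function b(α) is only given graphically in the article, so the sketch tests the finiteness condition −π/2 < α_n < π/2 rather than computing M_{c,n} itself):

```python
import numpy as np

# Evaluate alpha_n (eq. 4.6) and c_n (eq. 4.7) for inhibitory coupling,
# Tr = 0, and report which modes give a finite M_{c,n}.
tau0, tau1, tau2 = 10.0, 5.9, 6.0     # ms
f0 = 0.0448                           # firing rate in 1/ms (44.8 Hz)
w0 = 2 * np.pi * f0

def alpha(n):
    return (-3 * np.pi / 2 + np.arctan(n * w0 * tau0)
            + np.arctan(n * w0 * tau1) + np.arctan(n * w0 * tau2))

def c(n):
    return 2.0 / np.sqrt((1 + (n * w0 * tau0) ** 2)
                         * (1 + (n * w0 * tau1) ** 2)
                         * (1 + (n * w0 * tau2) ** 2))

for n in range(1, 4):
    finite = -np.pi / 2 < alpha(n) < np.pi / 2
    print(n, alpha(n), c(n), "M_c,n finite" if finite else "M_c,n infinite")
```

The output illustrates the two competing trends discussed below: α_n increases with n (and with f₀), loosening the finiteness condition, while c_n decreases with n, pushing M_{c,n} up.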
Below, the implications of equations 4.6 and 4.7 are listed. While reading them, it is useful to look at Figure 2B, where α_n and c_n are plotted as a function of f₀ for τ₂ = 6 ms, τ₁ = 5.9 ms, and n = 1, 2, 3; and at the two examples in Figure 3A, in which M_{c,n} is shown as a function of f₀ for that parameter set (left) and for τ₁ = 1 ms (right). The first six conclusions below are valid for τ₁ > 0.

1. Competition between α_n and c_n. The angle, α_n, increases with f₀, whereas the amplitude, c_n, decreases with it (see Figure 2). Therefore, for each n, at low firing rates, α_n is smaller than −π/2, and M_{c,n} is infinite. There is a critical firing rate at which M_{c,n} becomes finite, and for slightly larger f₀, M_{c,n} decreases, because near α_n = −π/2 the function b is very steep. For large frequencies, α_n approaches 0, and c_n continues to decrease; therefore, M_{c,n} increases. As a result, M_{c,n} has a minimum as a function of f₀ at a finite value of f₀.

2. The asynchronous state is always unstable for high enough M. The angle, α_n, is a monotonically increasing function of n. For τ₁ > 0, α_n → 0 for n → ∞. Therefore, there is always an n_c such that α_n > −π/2 for all n > n_c. This implies that the asynchronous state loses stability at a finite value of M, which is M_{c,n_c}.

3. The critical mode is n_c = 1 at high enough values of f₀. This is because, for high enough values of f₀, α₁ > −π/2 and c₁ is the minimal c_n. At very high values of f₀, α₁ ≈ 0, and from equations 3.12, 4.7, and 2.9 one concludes that M_c scales like f₀⁶ (or I_ext⁶).
4. The critical mode, n_c, increases with decreasing f₀. This is a consequence of the facts that (I) for each specific n, α_n decreases as the firing frequency, f₀, decreases; and (II) α_n increases with n. When, for some n, α_n goes below −π/2, n_c is higher than that specific n.

5. All M_{c,n}'s have the same minimum with respect to f₀. For fixed τ₁/τ₀ and τ₂/τ₀, M_{c,n} is a function of the variable x = n f₀ τ₀. Therefore,

M_{c,n}(f₀) = M_{c,1}(n f₀).   (4.8)

In particular, the minima over the network parameters of all M_{c,n} are the same.

6. The critical number of synaptic connections, M_c, is nonmonotonic with f₀. Here we describe the behavior of M_c as f₀ decreases. At high f₀, n_c = 1, and M_c decreases with decreasing f₀. At a specific value of f₀, it reaches its minimum M_{c,1}(f₀) and then increases again. As f₀ continues to decrease, a value of f₀ for which n_c = 2 is reached, and M_c decreases again until it reaches a second minimum. Then it increases until n_c = 3, decreases again, and so on. More generally, M_c oscillates infinitely many times as f₀ decreases toward 0, that is, as I_ext decreases toward the threshold θ.

7. Positive τ₁ is needed for destabilizing the asynchronous state. If τ₁ = 0, and in the absence of refractoriness, α_n < −π/2 for all n. Therefore the asynchronous state is stable for every M. This extends to the case of a sparse, weakly coupled network a similar result established by van Vreeswijk (1996) in the case of all-to-all coupling.

8. Effects of synaptic times. As τ₁ increases from 0 (note that τ₁ ≤ τ₂), a lower frequency is needed for M_{c,1} to be finite. Since c₁ is larger at lower frequencies, the minimal value of M_{c,1} as a function of f₀ is smaller at higher τ₁. For the parameters of Figure 3A, it is 418 for
Figure 2: Facing page. (A) Schematic demonstration of the condition needed to achieve α_n > −π/2 for the I&F model with inhibitory coupling and T_r = 0. The sum of the three angles, denoted by the three levels of shading, should be larger than π for achieving this condition. The angle arctan(nω₀τ₀) stems from the contribution of the response function, Z, of the I&F neuron to the Γ function (see equation 4.2), whereas the two angles arctan(nω₀τ₁) and arctan(nω₀τ₂) stem from the contribution of the two exponents in the synaptic coupling function, I_syn (see equation 4.3). (B) The dependence of α_n (I) and c_n (II) on f₀ for T_r = 0, τ₁ = 5.9 ms, τ₂ = 6 ms; n = 1, 2, 3. The angle, α_n, increases with I_ext, whereas the coefficient, c_n, decreases with it. For a given I_ext, α_n increases with n while c_n decreases with it.
τ₁ = 5.9 ms and 2825 for τ₁ = 1 ms. The dependence of M_c on the synaptic rise and decay times at high frequency (f₀ = 102 Hz) is shown in Figure 3B. In the left panel, M_{c,1} varies with τ₁ for τ₂ = 6 ms. At low τ₁, M_{c,1} is indefinite because α₁ < −π/2. It increases with τ₁ for high
Figure 3: The first three critical numbers of inputs M_{c,n} for an I&F model with inhibitory coupling and T_r = 0. The graphs of M_{c,n} are denoted by the solid, dotted, and dashed lines for n = 1, 2, 3, respectively. (A) M_{c,n} versus f₀ for τ₂ = 6 ms and τ₁ = 5.9 ms (left) or τ₁ = 1 ms (right). Both figures demonstrate the nonmonotonicity of M_c, the minimal M_{c,n}, as a function of I_ext. (B) M_{c,n} versus τ₁ for τ₂ = 6 ms (left) and versus τ₂ for τ₁ = 1 ms (right). In both cases, I_ext = 1.6 mV/ms (f₀ = 102 Hz). In the left panel, M_{c,2} and M_{c,3} are too large to be plotted; in the right panel, M_{c,3} is not plotted for the same reason. The scale for τ₂ starts from 1 ms because τ₂ ≥ τ₁ by definition.
τ₁ because c₁ decreases. Higher modes have M_{c,n}'s that are out of the scale of the figure. In the right panel, M_{c,1} varies with τ₂ for τ₁ = 1 ms. Again, it has a minimum with respect to τ₂. At small τ₂, n_c = 2. We have found that the minimal number, over all the parameters of the I&F model with inhibition and T_r = 0, of synaptic inputs M required
to destabilize the asynchronous state is obtained for τ₁ = τ₂ = τ₀, where M_c = 363.8; for τ₀ = 10 ms, f₀ at the minimum is 31.2 Hz. This shows that I&F neurons are hard to synchronize with inhibitory coupling, in the sense that a large number of inputs is necessary to destabilize the asynchronous state of a sparse network of such neurons.

4.3 The Effect of Refractoriness. In the I&F model with no refractoriness, M_c is on the order of several hundred or more. Can M_c be reduced by the inclusion of refractoriness? To investigate this issue, we analyze the effect of T_r for fixed firing frequency, f₀. Finite T_r contributes a factor that is larger than 1 to c_n (see equation 4.4) and a term to α_n (see equation 4.5). This factor and term depend explicitly on ω₀T_r. They are small at low firing rates, but they can have a major effect at high firing rates. The dependence of M_{c,n}, n = 1, 2, 3, on I_ext is presented in Figure 4A for T_r = 2 ms, τ₂ = 6 ms, and two values of τ₁: 5.9 ms and 1 ms. As in the case T_r = 0, the stability is determined by the first mode at large I_ext and by higher modes as I_ext decreases. The effect of T_r on M_{c,n} is demonstrated by comparing Figure 4A to Figure 3A. This comparison demonstrates that the effect of the refractory period is indeed more important at high frequencies. The dependence of M_c on T_r is demonstrated in Figure 4B for τ₂ = 6 ms, τ₁ = 1 ms. The level of external current, I_ext, is tuned such that the frequency is fixed: f₀ = 102 Hz. At small T_r, n_c = 1. M_c decreases considerably with T_r (from 2941 at T_r = 0 to 141 at T_r = 4.1 ms), mainly because of an increase of c₁. At T_r = 5.5 ms, α₁ = −π/2, and hence n_c = 2 for higher T_r. In summary, if the mode n = 1 destabilizes the asynchronous state for T_r = 0, refractoriness decreases M_c considerably at high firing rates (as long as it is not too large), but only slightly at low firing rates.

5 Numerical Simulations

5.1 Goals and Methods.
The analytical theory that enables us to estimate M_c has been developed under the conditions of weak coupling, 1 ≪ M ≪ N, and in the limit where c_n → 0 for all n. There are several questions it cannot answer: How good is the estimate of equation 3.12 when these conditions are not satisfied? What are the patterns of synchrony for M > M_c? Are the qualitative predictions of the theory correct beyond the weak coupling limit? How does M_c vary when the coupling increases? In order to answer these questions, we carry out numerical simulations of the full model and the corresponding phase model.
Figure 4: Effects of the refractory period, T_r. The first three critical numbers of inputs M_{c,n} are plotted for an I&F model with inhibitory coupling and τ₂ = 6 ms. The graphs of M_{c,n} are denoted by the solid, dotted, and dashed lines for n = 1, 2, 3, respectively. (A) M_{c,n} versus f₀ for T_r = 2 ms and τ₁ = 5.9 ms (left) or τ₁ = 1 ms (right). (B) M_{c,n} versus T_r for τ₁ = 1 ms (f₀ = 102 Hz).
5.1.1 Phase Model. The simulations of the phase model were done as follows. The phase equations (see equation 3.4) (without the term 2πf₀, which does not affect synchronization) were integrated using a fourth-order Runge-Kutta method. The time step, Δt, was chosen such that there are at least 20 steps in a time period of the simulated model. The initial conditions for φ_i were chosen at random from a uniform distribution between −π and π. The state of the network (the level and form of synchrony) was characterized
by the order parameters (Kuramoto, 1984),

Z_n = (1/T_m) ∫₀^{T_m} |z_n(t)| dt,   (5.1)
where

z_n = (1/N) Σ_{i=1}^N e^{inφ_i}   (5.2)
and T_m is a long-enough measurement time. The first-order parameter, Z₁, measures the tendency of the oscillator population to evolve in full synchrony: Z₁ = 1 if and only if φ_i(t) = φ(t) for all i and t. The higher-order parameters, Z_n, measure the tendency of the population to segregate into n clusters. For example, if a large population segregates into two equally populated clusters oscillating locked in antiphase, Z₂ = 1 while Z₁ = 0. In the limit N → ∞, Z_n = δ_{n,0} if and only if the system is in an asynchronous state. When synchrony develops, z_n(t) rises from near 0 until it reaches a steady state. The order parameters, Z_n, are computed over a period of 5000 time steps after the transients are gone. All numerical simulations are carried out with 10 realizations, over which results are averaged for each set of parameters.

5.1.2 Integrate-and-Fire Network. The I&F equations (2.4 and 2.5) were integrated using a second-order Runge-Kutta method supplemented by a first-order interpolation of the firing time at threshold, as explained in Hansel, Mato, Meunier, and Neltner (1998) and Shelley (1999). This algorithm is optimal for I&F neuronal networks and allows one to get reliable results for substantially large time steps, Δt. The integration time step is Δt = 0.25 ms. In typical simulations, the dynamics of the network was integrated for 4 × 10⁴ to 3 × 10⁵ time steps. Stability of the results was checked against varying the number of time steps in the simulations and their size. The initial conditions were chosen according to
V_i(0) = I_ext τ₀ [1 − exp(−b (i − 1)(T₀ − T_r)/(τ₀ N))]   (5.3)
(i = 1, …, N). The coefficient 0 ≤ b ≤ 1 controls the degree of synchrony of the initial condition. Setting b = 0 yields a perfectly synchronized initial condition. At the other extreme, b = 1 corresponds to a uniform distribution of firing times for uncoupled neurons between 0 and T₀ − T_r. All synaptic currents were set to 0 at t = 0. To measure the degree of synchrony in the steady state of the network, we follow the method proposed and developed in Ginzburg and Sompolinsky (1994), Golomb and Rinzel (1993, 1994), and Hansel and Sompolinsky (1992),
which is grounded in the analysis of the temporal fluctuations of macroscopic observables of the networks, such as the instantaneous activity of the system or the spatial average of the membrane potential. For the latter, one evaluates at a given time, t, the quantity

V(t) = (1/N) Σ_{i=1}^N V_i(t).   (5.4)
The variance of the time fluctuations of V(t) is

σ_V² = ⟨[V(t)]²⟩_t − [⟨V(t)⟩_t]²,   (5.5)

where ⟨…⟩_t = (1/T_m) ∫₀^{T_m} dt … denotes time averaging over a large time, T_m. After normalization of σ_V to the average over the population of the variances of the single-cell membrane potentials,

σ_{V_i}² = ⟨[V_i(t)]²⟩_t − [⟨V_i(t)⟩_t]²,   (5.6)
one defines a synchrony measure, χ(N), for the activity of a system of N neurons by

χ²(N) = σ_V² / [(1/N) Σ_{i=1}^N σ_{V_i}²].   (5.7)
This synchrony measure, χ(N), is between 0 and 1. In the limit N → ∞ it behaves as

χ(N) = χ(∞) + a/√N + O(1/N),   (5.8)
where a is some constant between 0 and 1. In particular, χ(N) = 1 if the system is fully synchronized (i.e., V_i(t) = V(t) for all i), and χ(N) = O(1/√N) if the state of the system is asynchronous. Asynchronous and synchronous states are unambiguously characterized in the thermodynamic limit (i.e., when the number of neurons is infinite). In the asynchronous state, χ(∞) = 0. By contrast, in synchronous states, χ(∞) > 0. In homogeneous fully connected networks, an important class of partially synchronous states (0 < χ(∞) < 1) are the "cluster states" (Golomb, Hansel, Shraiman, & Sompolinsky, 1992; Golomb & Rinzel, 1994; Golomb et al., 1994; Hansel et al., 1995). In these states, the system segregates into several clusters that may fire alternately. In heterogeneous or sparse networks, the disorder can smear the clustering. For instance, in a "smeared one-cluster state," the population voltage, V(t) (see equation 5.4), oscillates with time and has one peak as a function of time in each time period. More generally, in a "smeared n-cluster state," V(t) has n peaks as a function of time in each time period.
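The synchrony measure of equations 5.4 through 5.7 is easy to compute from voltage traces. The sketch below is our own, applied to synthetic data of our making (not to simulations from the article), to illustrate the two limiting behaviors:

```python
import numpy as np

# Synchrony measure chi(N) of equation 5.7, computed from traces V_i(t).
def chi(V):
    """V has shape (T, N): membrane potentials of N neurons at T time points."""
    Vbar = V.mean(axis=1)              # population average, eq. 5.4
    var_pop = Vbar.var()               # sigma_V^2, eq. 5.5
    var_cells = V.var(axis=0).mean()   # (1/N) sum_i sigma_{V_i}^2, eq. 5.6
    return np.sqrt(var_pop / var_cells)

t = np.linspace(0, 10, 1000)
rng = np.random.default_rng(4)
# Fully synchronized: every neuron follows the same trace -> chi = 1.
V_sync = np.repeat(np.sin(2 * np.pi * t)[:, None], 200, axis=1)
# Asynchronous-like: independent phases -> chi = O(1/sqrt(N)).
V_async = np.sin(2 * np.pi * t[:, None] + rng.uniform(0, 2 * np.pi, 200))
print(chi(V_sync), chi(V_async))
```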
Synchrony of Sparse Networks
1113
5.1.3 Finite-Size Effects in Sparse Networks. In general, for finite N, even in the asynchronous state, χ(N) is not equal to 0. However, for finite but large N, according to equation 5.8, a synchronous state differs from an asynchronous state in the way χ(N) scales with N, that is, by the way the finite size of the system affects the level of synchrony of the neural activity. As explained in Hansel and Sompolinsky (1996), these finite-size effects can be used in numerical simulations to characterize the parameter regimes in which the network states are asynchronous or synchronous. If in the thermodynamic limit a continuous transition occurs between asynchronous and synchronous regimes when a parameter is changed (e.g., M), this transition is replaced for finite N by a crossover between the two different scalings of χ with N. These finite-size effects occur no matter what the connectivity. In networks with finite M and N, sparseness is another source of finite-size effects. The reason is that the desynchronizing effect due to the sparseness depends not only on M, but also on the relative values of M and N. For instance, in equation 3.4, the synchrony level in the system is determined by the competition between the term C̃, which may be synchronizing, and the distribution of mi/M, which tends to desynchronize the system. If M is sufficiently large, this binomial distribution of mi/M is approximated by a gaussian with a variance 1/M (see equation 3.8). When N is finite but large enough (say, N > 30), and N − M is also large enough, the binomial distribution of mi/M is well approximated by a gaussian with a variance

$$\mathrm{Var}\!\left(\frac{m_i}{M}\right) = \frac{1}{M} - \frac{1}{N}. \qquad (5.9)$$
This suggests that the level of synchrony should depend on M and N through the effective variable, Meff, defined as

$$\frac{1}{M_{\mathrm{eff}}} = \frac{1}{M} - \frac{1}{N}. \qquad (5.10)$$
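Equation 5.10, together with the scaling in equation 5.8, suggests a simple numerical diagnostic, sketched below (helper functions of our own, not from the article): convert (M, N) to Meff, and check whether χ√N stays roughly flat as N grows, as it should in the asynchronous state.

```python
import numpy as np

def m_eff(M, N):
    """Effective connectivity of equation 5.10: 1/Meff = 1/M - 1/N."""
    return 1.0 / (1.0 / M - 1.0 / N)

def looks_asynchronous(chis, Ns, tol=0.5):
    """Crude finite-size diagnostic: in the asynchronous state
    chi(N) ~ a/sqrt(N), so chi*sqrt(N) should be roughly constant
    across sizes.  tol is an arbitrary cutoff on the relative spread."""
    scaled = np.asarray(chis, dtype=float) * np.sqrt(np.asarray(Ns, dtype=float))
    return scaled.std() / scaled.mean() < tol

print(m_eff(400, 800))   # -> 800.0 (e.g., M = 400, N = 800 as in Figure 10)
```

A proper estimate of the transition point would require a systematic finite-size scaling analysis; this helper only captures the qualitative crossover discussed in the text.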
Relying on this heuristic argument, we expect that Meff is a more appropriate parameter than M with which to represent simulation results.

5.2 Simulation Results for Weakly Coupled Neurons.

5.2.1 Finite-Size Effects. In Figure 5A, we present the dependence of χ on M for I & F networks of different sizes, N = 200 to N = 3200. The parameter set is: Tr = 0, t1 = 1 ms, t2 = 3 ms, Iext = 2 mV/ms, gsyn = −0.2 mV/ms. The initial conditions were chosen such that b = 0.5 (see equation 5.3). When N increases, the curves shift toward higher values of M. Very large values of N would be necessary in order to see the lines converge to a limit curve. This is due to the fact that, for these parameters,
1114
D. Golomb and D. Hansel
Figure 5: Simulations of the I & F model with Tr = 0, t1 = 1 ms, t2 = 3 ms, Iext = 2 mV/ms (f0 = 144.3 Hz), gsyn = −0.2 mV/ms, b = 0.5. Simulations are carried out with N = 200, averaged over 100 realizations of the network connectivity and random initial conditions (dots); N = 400, 50 realizations (crosses); N = 800, 25 realizations (dashed line); N = 1600, 20 realizations (dotted line); and N = 3200, 10 realizations (solid line). The integration time step is Δt = 0.25 ms. The runs were performed over 4 × 10⁴ time steps. The synchrony measure was computed after discarding a transient of 2 × 10⁴ time steps. (A) Synchrony measure χ versus M. (B) Synchrony measure χ versus Meff (see equation 5.10). The lines for N = 800, 1600, 3200 almost coincide with each other, and the deviations of the lines with N = 400 and N = 200 are small. The analytical estimate for Mc is denoted by the arrow. Note that for Meff ≲ 2500, χ decreases with N: χ ∝ 1/√N.
Mc is very large. In Figure 5B, we present χ as a function of Meff for the same values of N. It is remarkable that the lines for N = 800, 1600, 3200 collapse onto almost the same curve. Deviations from this curve are more
apparent at small values of M, where one can check that χ depends on N as 1/√N, which means that the network state is asynchronous there. For larger values of Meff, around Meff ≈ 2500, a crossover occurs between a regime in which χ = O(1/√N) and a regime where χ remains finite in the large-N limit. The region in which this crossover occurs is in good correspondence with our weak coupling estimate for the critical value of M above which, for this set of parameters, the asynchronous state loses stability: Mc = 2785. For N = 400 and N = 200, χ(Meff) is also quite close to the effective line, although deviations are more prominent than for larger values of N. This example shows that it is possible to estimate Mc by studying how χ depends on Meff in simulations with a relatively small N (about 200–800), even though Mc is much larger than these values of N.

5.2.2 Weakly Coupled Inhibitory Neurons. To proceed further in the comparison between the theory and the numerical simulations, we study the dependence of the synchrony level on M for three values of the external input current Iext: 1.02 mV/ms, 1.07 mV/ms, and 1.12 mV/ms. We simulated both the phase model and the full I & F model for inhibitory coupling with N = 800 neurons, Tr = 0, t1 = 5.9 ms, t2 = 6 ms, and b = 0.5. The rise and decay times were chosen such that Mc would not be too large. Indeed, the theory for weak coupling predicts a value of Mc that is smaller by almost one order of magnitude than for t1 = 1 ms and t2 = 3 ms. Another prediction of the theory is that Mc (for N → ∞) varies nonmonotonously with Iext: Mc = 490, 1737, 418.4 for Iext = 1.02 mV/ms, 1.07 mV/ms, and 1.12 mV/ms, respectively. For the full I & F model, we chose gsyn = −0.025 mV/ms. This is a weak coupling: for example, for Iext = 1.12 mV/ms, the firing rate of the neurons changes by less than 3%. The results for the phase models that correspond to these parameters of the I & F network are shown in Figure 6A.
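The network model itself is specified in Section 2 of the article, which is not reproduced in this excerpt. For readers who want to reproduce the qualitative finite-size behavior, the following is a minimal sketch of a sparse, current-based inhibitory I & F network of the same general type; the threshold theta, reset Vreset, and membrane time constant tau_m are placeholder values of our own, not the article's.

```python
import numpy as np

def simulate_sparse_if(N=200, M=100, T=500.0, dt=0.25, seed=0,
                       Iext=2.0, gsyn=-0.02, tau_m=10.0, t1=1.0, t2=3.0,
                       theta=15.0, Vreset=0.0, Tr=0.0):
    """Schematic sparse, current-based inhibitory I&F network.

    Each directed connection exists independently with probability M/N,
    so every neuron receives M inputs on average.  Synapses are unit-area
    differences of exponentials with rise time t1 and decay time t2
    (t1 != t2), entering the voltage equation as a current gsyn * s(t).
    theta, Vreset, and tau_m are placeholders (the article fixes its model
    in its Section 2); initial conditions are fully incoherent rather than
    controlled by the coherence parameter b.
    Returns a list of (spike time, neuron index) pairs.
    """
    rng = np.random.default_rng(seed)
    C = rng.random((N, N)) < M / N      # C[i, j] is True if j projects to i
    np.fill_diagonal(C, False)
    V = rng.uniform(0.0, theta, N)      # random initial membrane potentials
    s1 = np.zeros(N)                    # fast (rise) synaptic filter
    s2 = np.zeros(N)                    # slow (decay) synaptic filter
    refrac = np.zeros(N)                # remaining refractory time
    spikes = []
    for step in range(int(T / dt)):
        s = (s2 - s1) / (t2 - t1)       # per-presynaptic-cell synaptic variable
        I_syn = gsyn * (C @ s)          # summed recurrent (inhibitory) input
        free = refrac <= 0.0
        V[free] += dt * (-V[free] / tau_m + Iext + I_syn[free])
        refrac[~free] -= dt
        s1 -= dt * s1 / t1
        s2 -= dt * s2 / t2
        fired = free & (V >= theta)
        V[fired] = Vreset
        refrac[fired] = Tr
        s1[fired] += 1.0                # both filters jump at each spike
        s2[fired] += 1.0
        spikes.extend((step * dt, i) for i in np.flatnonzero(fired))
    return spikes

spikes = simulate_sparse_if()
rate = len(spikes) / 200 / (500.0 / 1000.0)   # population-average rate (Hz)
```

Applying the synchrony measure of equations 5.4 to 5.7 to voltages recorded from such a sketch at several N is enough to see the qualitative crossover of Figure 5; none of the article's quantitative values should be expected from these placeholder parameters.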
For all Iext values, Z1 and Z2 are small at small M. Careful analysis shows that they scale as 1/√N, indicating that at small Meff, the network is in the asynchronous state. By increasing the value of Meff, a continuous transition to a synchronous state occurs. The analytical estimates for Mc (see equation 5.10) for the three values of the external current are denoted by the arrows in Figure 6. There is a very good correspondence between these values and the range of Meff values in which Z1 or Z2 starts to be substantially different from 0. Note, however, that in order to estimate the value of Mc from the simulations in a systematic way, a precise finite-size scaling procedure is necessary (Binder & Heermann, 1992). This is beyond the scope of this article. The pattern of synchrony displayed by the network in the synchronous regime also depends on the value of Iext. Indeed, for Iext = 1.12 mV/ms, the network settles in a smeared one-cluster state in which both Z1 and Z2 are O(1) in the large-N limit. By contrast, for Iext = 1.02 mV/ms and Iext = 1.07 mV/ms, the pattern of synchrony is a smeared two-cluster state characterized by Z2 = O(1) but
Figure 6: Simulations of inhibitory I & F networks. Parameters: Tr = 0, t1 = 5.9 ms, t2 = 6 ms, N = 800. Three values of Iext are used: 1.02 mV/ms (f0 = 25.4 Hz); 1.07 mV/ms (f0 = 36.7 Hz); 1.12 mV/ms (f0 = 44.8 Hz). (A) The order parameters Z1 and Z2 in the reduced-phase model versus Meff. Integration time step is 0.1. The results are averaged over 5 × 10³ time steps (after a discarded transient of 5 × 10³ time steps) and 10 realizations of the network architecture and initial conditions (homogeneous distribution between 0 and 2π). The error bars represent the standard deviation of Z. The arrows denote the analytical estimates for Mc. (B) The synchrony measure χ in the full I & F model with gsyn = −0.025 mV/ms. Time averaging over 8 × 10⁴ time steps; discarded transient 8 × 10⁴ time steps; Δt = 0.025 ms; 25 realizations; the initial conditions are chosen at random with b = 0.5. The error bars represent the standard deviation of χ.
Z1 = O(1/√N). This is in agreement with the theory, which predicts nc = 1 for Iext = 1.12 mV/ms, and nc = 2 for Iext = 1.02 mV/ms and Iext = 1.07 mV/ms. Note that for Iext = 1.07 mV/ms and Iext = 1.02 mV/ms at large values of M, the system displays multistability: smeared two-cluster states coexist with a stable smeared one-cluster state (data not shown). In Figure 6B we present the results of our simulations of the full I & F networks for a synaptic strength gsyn = −0.025 mV/ms. Similar to the
phase model, when M increases, a continuous transition occurs between the asynchronous state and a synchronous state in a range of values of Meff that agrees with our analytical estimate for Mc. The synchrony measure χ does not discriminate between different patterns of synchrony of the full model. To characterize these patterns further, rastergrams are plotted in Figure 7 for Meff = 600 and the three Iext values. We also plot r, the percentage of neurons firing as a function of time (1 ms bins). For Iext = 1.12 mV/ms, simulations reveal only one group of neurons firing every cycle; that is, the state of the system is a smeared one-cluster state (see Figure 7A). By contrast, for Iext = 1.07 mV/ms and Iext = 1.02 mV/ms, the network settles in a smeared two-cluster state (see Figures 7B and 7C). These results are in agreement with what we have found for the corresponding phase models. The theory predicts a nonmonotonous variation of Mc with the external current, or equivalently with f0 (see Figure 4). Our simulations of both the phase model and the full model are in agreement with this prediction (see Figure 6). From this result, one expects that for fixed M, the synchrony measure χ is a nonmonotonous function of f0. This is confirmed by our simulations, as shown in Figure 8A, where we have plotted χ versus f0 for Meff = 800 (M = 400 and N = 800). In the range of f0 we studied, χ displayed two peaks of different amplitudes. The peak of smaller amplitude occurs for 20 Hz < f0 < 30 Hz. Raster plots reveal that for these values of f0, the network settles into smeared two-cluster states (results not shown). A second peak, of larger amplitude, occurs for 40 Hz < f0 < 70 Hz. In this range, the network settles into a smeared one-cluster state. For 30 Hz < f0 < 40 Hz, the asynchronous state is stable. This is also in agreement with our analytical predictions (see Figure 4). Interestingly, in a small range of f0 (37 Hz < f0 < 40 Hz), the network state depends on the initial conditions.
Starting with initial conditions that are strongly coherent (b = 0.95), the network settles into a one-cluster state. By contrast, if b = 0.1, the asymptotic state is asynchronous. Therefore, in this range of frequencies, the system is bistable, with the coexistence of a stable asynchronous state and a stable smeared one-cluster state. The results shown in Figure 4B suggest that for a fixed firing rate, f, the synchrony measure can be a nonmonotonic function of the refractory period. This prediction is checked in Figure 8B, where χ is plotted versus the refractory period for f0 = 102 Hz and Meff = 800 (other parameters as in Figure 4B; N = 800, M = 400). For a small refractory period, Tr < 1 ms, the network is in an asynchronous state. By increasing Tr, the asynchronous state becomes unstable, and the synchrony increases. The network settles in a smeared one-cluster state. However, increasing Tr further causes χ to decrease. In a small range of Tr, namely 5.5 ms < Tr < 6.5 ms, the asynchronous state becomes stable again, as predicted from our analytic results (see Figure 4B). For large values of Tr, the activity of the network
synchronizes again and settles into a smeared two-cluster state (result not shown). We conclude that the phase model and our analytical estimation can be used to predict the behavior of the full model with small coupling strength. Similar behaviors are observed in the simulations of the full model (see Figure 6B) for finite but not too large coupling strength.

5.3 Beyond Weak Coupling. Here we are interested in how Mc depends on the coupling strength in the I & F model beyond the weak coupling regime. In Figure 9, we plot results from simulations showing the synchrony measure χ versus Meff for a fixed value of the external input, Iext = 2 mV/ms, and different coupling strengths, |gsyn|. For sufficiently low values of |gsyn|, χ(M) depends weakly on |gsyn|, and Mc remains close to the weak coupling estimate, although it starts to increase. For larger −gsyn (e.g., 0.5 mV/ms), the limit of stability of the asynchronous state becomes substantially different from its weak-coupling limit. Moreover, the system displays multistability: a one-cluster state (upper branch) coexists with an asynchronous or two-cluster state (lower branch). Therefore, increasing |gsyn| opposes synchrony instead of favoring it, as one might at first expect. This is also clear from Figure 10A (solid line). Here we have plotted χ versus −gsyn for given values of M and Iext. Not only does the synchrony of the system decrease when the coupling increases, but there exists a critical coupling, −gc(M), above which the network settles into an asynchronous state. In these simulations, Iext was kept fixed. Therefore, the firing rate decreased when the inhibitory coupling was increased. In order to compare networks with comparable average firing rates but different coupling strengths, we have performed another set of simulations in which Iext and −gsyn increase in such a way that the average firing rate, f, in the asynchronous state is kept fixed. The results are shown in Figure 10A (dotted line).
Here too, χ is a decreasing function of −gsyn, although it varies more slowly than when the firing rate is not fixed. For a sufficiently strong coupling, |gsyn| > −gc(M), the network settles into a state in which neurons fire in an asynchronous and (almost) periodic fashion.
Figure 7: Facing page. Simulations of networks of inhibitory I & F neurons. Parameters are the same as in Figure 6, and M = 600. Rastergrams of 400 neurons out of N = 800 are shown in the upper panels. Initial conditions are such that b = 0. A bar denoting the time period, T0, is shown above each rastergram. In each lower panel, r, the number of firing neurons within a time bin of 1 ms, is displayed as a function of time. For Iext = 1.12 mV/ms (A), the system is in a smeared one-cluster state, whereas for Iext = 1.07 mV/ms (B) and Iext = 1.02 mV/ms (C), it is in a two-cluster state.
Figure 8: Simulation of the full I & F network with N = 800; integration time step Δt = 0.25 ms. (A) Synchrony measure χ versus frequency f0 for Tr = 0, t1 = 5.9 ms, t2 = 6 ms, gsyn = −0.025 mV/ms, M = 450 (corresponding to M = 1028.6 in the limit N → ∞). Results were averaged over 40 realizations of the network architecture and initial conditions (b = 0.95 and b = 0.1); 2.4 × 10⁵ time steps (discarded transient of 1.8 × 10⁵ time steps). (B) Synchrony measure χ versus the refractory period Tr. The applied current was changed such that f0 = 102 Hz. Other parameters: gsyn = −0.1 mV/ms, M = 400 (corresponding to M = 800 in the limit N → ∞), t1 = 1 ms, t2 = 6 ms. All the runs had 3 × 10⁵ time steps. Time averages were computed after discarding a transient of 1.5 × 10⁵ time steps. The results were averaged over 50 realizations (b = 0.95 and b = 0.1).

Figure 9: Beyond weak coupling: simulations of the full inhibitory I & F network. The synchrony measure χ is plotted versus Meff for different synaptic strengths. Parameters are Tr = 2 ms, t1 = 1 ms, t2 = 3 ms, Iext = 2 mV/ms (f = 112 Hz), N = 800. Simulations with 4 × 10⁴ time steps with Δt = 0.25 ms. Time averages were computed after discarding a transient of 2 × 10⁴ time steps. The results were averaged over 100 realizations. The arrow denotes the analytical estimate for Mc in the weak coupling limit. In the asynchronous regime, the average firing rate in the network changes from f ≈ 105 Hz for gsyn = −0.1 mV/ms to f ≈ 80 Hz for gsyn = −0.6 mV/ms.

By further increasing −gsyn, one can maintain the population-average firing rate constant by increasing Iext. However, the neural activity becomes more and more heterogeneous across the network, and the distribution of the time-averaged firing rates of the neurons becomes broad. Simultaneously, the activity of the neurons becomes less and less periodic. This is shown in Figure 11 for f = 100 Hz and gsyn = −25 mV/ms. In this figure, we present results from the simulations of networks of different sizes (N = 400, N = 800, and N = 1600) but the same Meff (Meff = 800). Figure 11A shows the distribution of the firing rates of the neurons. The coefficient of variation of the interspike-interval distribution (CV) of the neurons versus their firing rate is displayed in Figure 11B. Remarkably, the curves for the three sizes superimpose; in Figure 11B, the three curves are indistinguishable. This means that in the strong coupling regime, the dynamic state of the network also depends only on Meff. Note that the variability in the firing pattern is stronger for the neurons with the lower rates. For fixed gsyn, the lower the population-average firing
Figure 10: Synchrony measure χ plotted versus coupling strength gsyn for M = 400, N = 800 (Meff = 800). Parameters are: Tr = 2 ms, t1 = 1 ms, t2 = 3 ms. Simulations were carried out with 4 × 10⁴ time steps and Δt = 0.25 ms. (A) No heterogeneity. Solid line: the external current is fixed at Iext = 1.816 mV/ms. The firing rate decreases with increasing −gsyn, from 100 Hz for gsyn = 0 to 56 Hz for gsyn = −1 mV/ms. Dotted line: the external current increases when −gsyn increases, such that the average firing rate in the asynchronous state remains constant (f = 100 Hz). (B) Heterogeneity. The external current varies from neuron to neuron and is homogeneously distributed between Īext − δ and Īext + δ. The average, Īext, is determined such that the population average of the firing rate, f, in the asynchronous state is kept constant, f = 100 Hz. Solid line: the width of the distribution is δ = 0.075 mV/ms. The network activity is synchronized for 0.5 mV/ms ≤ −gsyn ≤ 2.2 mV/ms. Dotted line: the width of the distribution is δ = 0.15 mV/ms. The network activity is always asynchronous. Simulations with larger N show that the residual nonzero values of χ are finite-size effects (result not shown). Simulations indicate that in the fully connected network with the same parameters, the asynchronous state becomes unstable at −gsyn ≈ 0.6 mV/ms (result not shown).
rate or the lower the Meff, the larger is the variability of the neural activity (result not shown). This is in agreement with results obtained by previous authors (van Vreeswijk & Sompolinsky, 1996, 1998; Kim, van Vreeswijk, & Sompolinsky, 1998).

5.4 Synchrony in Sparse and Heterogeneous Inhibitory Networks. Until now, we have investigated sparse homogeneous networks. Here we briefly consider some effects of heterogeneities. For the sake of simplicity, heterogeneities are introduced as spatial fluctuations in the external input, that
Figure 11: Distribution histogram of the firing rates (A) and coefficient of variation of the interspike-interval distribution (B) for Meff = 800 and N = 400, 800, 1600. Parameters are: Tr = 2 ms, t1 = 1 ms, t2 = 3 ms, f = 100 Hz, gsyn = −25 mV/ms. Simulations are done with 3.2 × 10⁶ time steps and Δt = 0.125 ms. Results for N = 400, 800, 1600 are averaged over 40, 20, and 10 realizations, respectively, with a bin width of 2 Hz. In both figures, the silent neurons (neurons with firing rate < 2 Hz, about 6.2% of the population) are not represented. In (B), the three curves superimpose almost perfectly and are hardly distinguishable.
is, we assume that each neuron receives an external input,

$$I_{i,\mathrm{ext}} = \bar{I}_{\mathrm{ext}} + \delta_i, \qquad (5.11)$$

where Īext is the population average of the external current and δi is a random number, uniformly distributed between −δ and δ. These distributions of external current can be thought of as a consequence of spatial heterogeneities
in some external stimulus or, equivalently, as the result of heterogeneities in the thresholds of the neurons in the network. Neltner et al. (2000) have shown that for a given level of heterogeneity, δ, the asynchronous state of a fully connected inhibitory network is destabilized if the coupling |gsyn| is larger than a critical value, which is roughly proportional to δ. When the network is heterogeneous and sparse, the situation is different. This is shown in Figure 10B, where the synchrony measure, χ, is plotted as a function of −gsyn for two values of δ, δ = 0.075 mV/ms and δ = 0.15 mV/ms, and the same parameters as in Figure 10A. At weak coupling, the network activity is asynchronous in both cases (χ = O(1/√N)). This is a consequence of the heterogeneities, which prevent the neural activity from synchronizing; indeed, for δ = 0 the network activity would synchronize (Meff > Mc). For δ = 0.075 mV/ms, χ starts to increase above some critical value of the coupling. This signals the fact that the asynchronous state becomes unstable because the stronger coupling overcomes the desynchronizing effect of the heterogeneities. However, if the coupling increases too much, χ decreases again. This is because the spatial fluctuations that result from the sparseness become sufficiently strong to desynchronize the network activity again. Eventually, at large coupling, the neurons fire asynchronously. For δ = 0.15 mV/ms, the network activity remains asynchronous at any coupling value. More generally, this will be the situation if δ is too large. Therefore, in contrast to what happens in a fully connected network, there is a maximum value of δ, which depends on M, that is compatible with the emergence of synchrony.

6 Discussion

6.1 Analytical Results. The analytical results reported in this article are based on two approximations. The first is the weak coupling approximation, in which the model can be reduced to a phase model.
The conditions under which this approximation is valid have been discussed before (Kuramoto, 1984). Simulations of networks of identical neurons with all-to-all coupling have revealed that results based on the weak-coupling limit are also valid for intermediate coupling strengths (Hansel et al., 1995). A similar conclusion is obtained here. In particular, simulations of the full model with moderate coupling strength reveal values of Mc that are similar to what is obtained by simulating the reduced-phase model. Our analytical approach also relies on a second assumption: the dynamics of a neuron is governed only by the number of synaptic inputs it receives from other neurons in the network. A more thorough analysis of this assumption (Golomb & Hansel, unpublished) reveals that it is exact in the limit cn → 0, but is only approximate for a general phase model. Our simulations indicate that this approximation provides a good estimate of the dynamics of sparse neuronal systems. In all the cases we have
examined, the differences between the theoretically estimated values of Mc and the values of Mc obtained from simulation have been less than 20%. The good correspondence between the analytical and simulation results stems from the fact that in most cases, I & F neurons fire at a not-too-small rate, c1 is 0.2 or less, and the cn values with n > 1 are even smaller. Our approximation means that the difference in the number of inputs between different neurons is the main source of the desynchronizing disorder coming from the sparseness of the connectivity. A second source of disorder, the fact that neurons receive synaptic input from different sources (output neurons), was found to be of minor effect in our simulations. Another type of sparse architecture that has been used in modeling neuronal systems by other authors (Whittington et al., 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996) consists of neurons randomly connected, but with a number of inputs that is the same for all the neurons. For such models, this second source of disorder is the only desynchronizing factor due to sparseness. Our simulations show that such a network exhibits a high level of synchrony for values of M on the order of 10 or less (data not shown; see also Wang & Buzsáki, 1996, for a conductance-based model). We have also found that the system exhibits bistability even for weak coupling, when a highly synchronized state coexists with an asynchronous state.

6.2 Patterns of Synchrony for Weakly Coupled Neurons. The nth mode of the stability matrix of the asynchronous state is unstable if −π/2 < an < π/2 and cn is large enough. The first condition is identical to the instability criterion of the asynchronous state in a fully connected network. The second condition means that the modulation of the nth Fourier mode of the effective phase interaction is large enough to overcome the disorder caused by the sparse connectivity.
If M = N, this disorder is absent, and therefore any modulation can destabilize the asynchronous state. We first discuss the pattern of synchrony of inhibitory I & F networks. In that case, Mc is a nonmonotonous function of the frequency, f0 (or, equivalently, of the external current, Iext), and it has many minima. At high frequency, the asynchronous state is destabilized by the first modes, whereas at lower frequencies, it is destabilized by higher modes. These are consequences of the interplay between the effects of the angle, an, and the amplitude, cn. For each mode, cn decreases with frequency, and the phasic modulation of the mode decreases, but an is within the desynchronizing range (−π/2, π/2) only for high-enough frequencies. Therefore, for each mode, there is a minimal value of its critical number, Mc,n, with respect to f0. Moreover, cn is smaller and an is larger for higher n, and therefore these minima occur at lower frequencies for higher modes. The asynchronous state is always stable for Tr = 0 if the synaptic rise is instantaneous (t1 = 0). For small t1, Mc is very high. However, with refractoriness, Mc can be moderate and even small for small or zero t1.
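The angle condition above can be checked numerically once the effective phase interaction is known. The sketch below is our own construction: it extracts amplitudes cn and angles an from samples of a phase coupling function under the assumed expansion Γ(φ) = Σn cn sin(nφ + an) with cn ≥ 0 (the article fixes its own convention in earlier sections), and the threshold c_min standing in for "cn large enough" is left as a free parameter because the true threshold depends on M.

```python
import numpy as np

def mode_angles(gamma_samples):
    """Fourier amplitudes c_n and angles a_n of a phase interaction
    sampled on a uniform grid over [0, 2*pi), under the assumed
    convention Gamma(phi) = sum_n c_n * sin(n*phi + a_n), c_n >= 0."""
    L = len(gamma_samples)
    G = np.fft.rfft(gamma_samples) / L
    # For c*sin(n*phi + a), the one-sided coefficient is (c/2)*(sin a - i*cos a).
    c = 2.0 * np.abs(G[1:])
    a = np.arctan2(G[1:].real, -G[1:].imag)
    return c, a

def candidate_unstable_modes(gamma_samples, c_min):
    """Modes n with -pi/2 < a_n < pi/2 (the desynchronizing range) and
    amplitude above the sparseness-dependent threshold c_min."""
    c, a = mode_angles(gamma_samples)
    return [n + 1 for n in range(len(c))
            if c[n] > c_min and -np.pi / 2 < a[n] < np.pi / 2]

phi = np.linspace(0.0, 2.0 * np.pi, 256, endpoint=False)
gamma = np.sin(phi) + 0.3 * np.sin(2.0 * phi + 2.0)
print(candidate_unstable_modes(gamma, 0.1))   # -> [1]  (mode 2 fails the angle test)
```

In this toy example, mode 2 has sufficient amplitude (0.3) but its angle (2.0 rad) lies outside (−π/2, π/2), so only the first mode is a candidate for destabilizing the asynchronous state.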
Our simulations show that destabilization of the asynchronous state by the nth mode leads to a "smeared" n-cluster state. This explains why cluster states are observed at low frequencies. However, for sufficiently large M, these states coexist with smeared one-cluster states. A consequence is that in numerical simulations, the synchrony pattern in which the network settles depends on the initial conditions from which the simulation begins. The existence of cluster states is not specific to sparse networks. Indeed, for I & F network models, cluster states can also be found in fully connected networks when the synaptic timescale is fast in comparison to the firing frequency (van Vreeswijk, 1996; Neltner et al., 2000). Cluster states have also been found in conductance-based models of inhibitory spiking neurons (Hansel et al., 1993c; Wang & Buzsáki, 1996). Models of thalamic networks, in which inhibitory neurons display postinhibitory rebound, also exhibit cluster states with hyperpolarizing inhibition and fast synaptic kinetics (Golomb & Rinzel, 1993, 1994; Golomb et al., 1994). Cluster states are therefore an important type of state displayed by various models of recurrent neuronal networks. However, their occurrence in the central nervous system has yet to be demonstrated.

6.3 Patterns of Synchrony for Finite Coupling Strength. Increasing the strength of the inhibitory synapses (keeping the firing rate constant by also increasing the external input) has two competing effects. On one hand, it increases the average synaptic modulation, which may synchronize the system; on the other hand, it increases the neuron-to-neuron fluctuations, which tend to desynchronize the neuronal activity. For weak coupling, these two factors cancel each other, and the stability of the asynchronous state is not affected by gsyn; the convergence time to the synchronous state, however, is proportional to 1/|gsyn|.
Beyond weak coupling, the synaptic inputs also distort the neuronal time course away from the limit cycle of the isolated neuron; this distortion is different for neurons that receive different numbers of inputs. Then it is not clear which of the two competing effects is the stronger. Remarkably, in all the cases we have studied, the synchrony measure, χ, was found to decrease monotonously with the coupling strength, and for |gsyn| larger than a critical value, |gc(M)|, the network is in the asynchronous state. This suggests that the increase in the spatial fluctuations always wins. It would be interesting to have a more rigorous proof of this result. van Vreeswijk and Sompolinsky (1996, 1998) have studied analytically a network of sparsely connected binary neurons in the limit 1 ≪ M ≪ N, with N, M → ∞ and gsyn, Iext ∝ 1/√M. They have shown that in this limit, the network settles in an asynchronous state in which the neuronal firing is random and close to a Poisson process. In this state, the total input to the cells is of order 1, although the excitation (from the external input) and the inhibition are separately of order O(√M). Therefore, excitation and inhibition
are balancing each other without any fine-tuning of the parameters. More recently, Kim et al. (1998) have performed numerical simulations of large, sparse networks of I & F neurons. They have found that under similar conditions, the network settles in a chaotic state very similar to this balanced state. Our finding that the network activity becomes asynchronous for gsyn > gc(M) is consistent with these results. In order to understand the dynamics of sparse networks further, it would be interesting to know how gc(M) scales with M for large M.

6.4 Behavior of More General I & F Network Models. The model we have studied is restricted, even within the I & F framework. We assume that the postsynaptic neuron receives a postsynaptic current immediately after the presynaptic neuron has fired. In cortex, there is a delay of about 2 ms even between adjacent neurons (Thomson, Deuchars, & West, 1993; Markram, Lubke, Frotscher, Roth, & Sakmann, 1997). With a time delay td, the effective coupling of the phase model, Cd, is related to C by Cd(φ) = C(φ − Δ), where Δ = 2π td / T0 (Hansel et al., 1995; Izhikevich, 1998). Therefore, all the angles, an, are increased by Δ. For inhibitory coupling, a small delay reduces the frequencies at which the destabilization of the asynchronous state occurs. The coupling in equation 2.4 is via "synaptic currents," in the sense that the total conductance of a cell does not depend on the synaptic input. In a version of the I & F network model that is more faithful to the biophysics, the interactions are mediated by conductance changes. The synaptic current is then Isyn(t) = gsyn(t)(V − Vrev), where Vrev is the synaptic reversal potential. In this case, the effective time constant t0 decreases with increasing synaptic input. This is expected to have some quantitative, although not qualitative, effects on the results presented here.
The temporal summation of synaptic inputs is assumed here to be linear, and kinetic properties of synapses, such as depression or facilitation, are not considered. This assumption is justified provided one deals with stationary properties of the network, and the synaptic time constants and strengths fit those of real synapses at long times, not those of the response to the initial stimulus (Markram & Tsodyks, 1996).

6.5 Sparseness Versus Heterogeneity. Random architecture has two consequences (van Vreeswijk & Sompolinsky, 1998): (1) the number of inputs fluctuates from neuron to neuron, and (2) different neurons receive synaptic inputs from different sources. In the approximate analytical theory we have developed in the weak coupling limit, the latter effect is neglected. That is why the instability condition of the asynchronous state (see equations A.18 and A.19) is identical to what is found for a fully connected network of phase oscillators with random intrinsic frequencies (Sakaguchi & Kuramoto, 1986; Mirollo & Strogatz, 1991; Neltner et al., 2000), distributed according to a gaussian distribution with an average and a variance that depend on the zero mode of the phase interaction and therefore on the coupling strength.
1128
D. Golomb and D. Hansel
The good agreement of the simulations with this theory is due to the fact that for I & F neurons (and more generally for conductance-based neurons; Golomb & Hansel, unpublished), the zero mode of the phase interaction is large in comparison to the amplitude of the first mode. A more exact mean-field theory would also require taking into account the fact that different neurons receive synaptic inputs from different sources. This can be done in the large-M, large-N limit. One then finds that the dynamics of the sparse network are similar to the dynamics of an effective phase model with randomly distributed intrinsic frequencies and a time-correlated multiplicative noise. The properties of the noise and of the frequency distribution have to be determined self-consistently. This shows that a sparse network is not equivalent to a fully connected network with heterogeneities. However, the solution of this exact mean-field theory is difficult. We intend to study it in future work. The difference between the effects of sparseness and of heterogeneities is also clear from our simulations at large coupling. Indeed, with a finite level of heterogeneity, a large, inhibitory all-to-all network with fixed external drive becomes asynchronous at large enough |gsyn| (Wang & Buzsáki, 1996; White et al., 1998). This is related to the decrease of the firing rate when the coupling increases. However, if the external drive is changed such that the firing rate remains fixed when the coupling strength increases, the asynchronous state becomes unstable at large enough |gsyn| (Neltner et al., 2000). By contrast, our simulations show that when the firing rate is kept fixed, a sparse network becomes asynchronous at large enough |gsyn|. In systems with both heterogeneities and sparseness, we have found that for coupling that is not too strong, χ is an increasing function of |gsyn| because the desynchronizing effect of the heterogeneities is counterbalanced by the interactions.
However, for sufficiently strong coupling, the effect of the sparseness wins. Therefore, an optimal value of gsyn exists at which χ is maximal. Note that nonmonotonic variations of the synchrony level in sparse and heterogeneous networks have also been found by Wang and Buzsáki (1996). However, they found this without controlling the firing rate of the neurons when the coupling is changed.

6.6 Consequences for Neural Modeling. In many cases, modeling large neuronal systems of the central nervous system (CNS) in their full size would require simulating networks of tens of thousands of neurons, interacting through several million synapses. Simulations of such large systems are both time and memory consuming, and systematic studies of their dynamical properties using numerical simulations are hard and frequently impossible due to limitations in computer capabilities. In order to avoid these problems, systems of reduced size are frequently studied. However, it is not always clear how to carry out the size reduction of the real system. Our work suggests a simple scaling rule that can be applied to the study of patterns of synchrony in sparse networks. Assume that the system one wants to
study consists of N neurons connected at random, with an average number of synapses per neuron, M. Assume for simplicity that all the synapses have the same strength, G_s. Our study shows that the synchronization properties of this system can be studied by simulating a model network consisting of N_m randomly connected neurons, with average connectivity M_m and synaptic strength G_m, satisfying the following scaling relations:

G_m = \frac{M}{M_m}\, G_s \qquad (6.1)

\frac{1}{M_m} - \frac{1}{N_m} = \frac{1}{M} - \frac{1}{N}. \qquad (6.2)
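As a concrete numerical illustration (a sketch; the network sizes and synaptic strength below are hypothetical, not taken from the article), one can pick N_m and solve equation 6.2 for M_m, then obtain G_m from equation 6.1:

```python
def reduced_network(N, M, Gs, Nm):
    """Map a sparse network (N neurons, M synapses/neuron, strength Gs)
    onto a reduced network of Nm neurons using the scaling relations
    6.1 and 6.2: Gm = (M/Mm)*Gs and 1/Mm - 1/Nm = 1/M - 1/N."""
    inv_Mm = 1.0 / M - 1.0 / N + 1.0 / Nm   # solve equation 6.2 for Mm
    Mm = 1.0 / inv_Mm
    Gm = (M / Mm) * Gs                      # equation 6.1
    return Mm, Gm

# Hypothetical full system: 10^5 neurons, 5000 synapses per neuron.
Mm, Gm = reduced_network(N=100_000, M=5_000, Gs=0.01, Nm=400)
```

By construction M_m G_m = M G_s (same mean synaptic input) and 1/M_m − 1/N_m = 1/M − 1/N (same spatial fluctuations); with these numbers M_m ≈ 372, close to N_m.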
The first equation ensures that the average synaptic input is the same in the real and the simulated systems. The second relation guarantees that the spatial fluctuations of this input are the same in both systems. As we have shown, using this scaling allows one to investigate the behavior of very large networks by simulating a network with a relatively small N_m, on the order of a few hundred. Assuming that in the real system 1 ≪ M ≪ N, N_m and M_m will not be very different. In spite of this fact, our simulation results indicate that the synchronization properties of the two systems are similar. Synchrony in I & F inhibitory networks is very sensitive to disorder. In the context of our work, this is expressed in the fact that M_c is large for these models. Similar conclusions have been reached for networks of I & F neurons with heterogeneous properties. This lack of robustness is a consequence of the purely passive integration of the synaptic and external inputs. However, even within the I & F framework, refractoriness can substantially improve the robustness of the synchrony by decreasing M_c, as we have shown. Synchrony of conductance-based neurons, which take into account more of the biophysics of neurons, seems to be more robust (Wang & Buzsáki, 1996; Neltner et al., 2000; Golomb & Hansel, unpublished). The patterns of synchrony for M > M_c can also depend on the active properties and their kinetics. For example, for the conductance-based model of Wang and Buzsáki (1996), the network settles into a smeared one-cluster state except at very low frequencies. With slower kinetics of the intrinsic currents, the second mode destabilizes the asynchronous state over the entire frequency range (Golomb & Hansel, preliminary results). In that case, the network may settle into a two-cluster state.
Therefore, the role of the intrinsic properties of the neurons is as important as the role of the synaptic properties in determining when synchrony can emerge in the neural activity of large networks. Nevertheless, our study points out qualitative trends regarding the role of active properties in the emergence of synchrony, such as their effect in reducing M_c. However, one cannot draw general quantitative conclusions regarding the minimal number of synaptic inputs required for synchrony to emerge, since this number can vary by one order of magnitude or more, depending on
the model considered. This raises an important issue for neural modeling: under which conditions can a conductance-based dynamics be reduced to an I & F dynamics without losing synchronization properties?

6.7 Consequences for the Mechanisms of Neural Synchrony. Finally, we discuss some consequences of our results for the possible mechanisms underlying neural synchrony in the CNS. It has recently been proposed that inhibitory neurons could play a central role in the emergence of synchrony in large networks in the CNS. According to these ideas, synchronous rhythmic oscillations would emerge in the subnetwork of inhibitory interneurons and subsequently entrain other populations of neurons in the system. This mechanism is supported by theoretical considerations. Indeed, it has been realized that it is easier to achieve coherent activity in fully connected networks of identical neurons if the interactions are inhibitory than if they are excitatory. For this to be true, no specific active properties of the neurons are required. Recent studies have pointed out that because of heterogeneities, synchrony in inhibitory networks can be a problem, especially at moderate or high firing rates. Nevertheless, as demonstrated in Neltner et al. (2000), the desynchronizing effect of heterogeneities in fully connected networks can be overcome provided the external input and the synaptic interactions are sufficiently strong. We have seen that if the network is sparsely connected, increasing the interaction is desynchronizing. Hence, the capability of compensating for the heterogeneities in the strong coupling regime is opposed by the amplification, in this regime, of the spatial fluctuations due to the sparseness. To characterize this limitation in a more quantitative manner, more specific information is needed about the intrinsic active properties of the interneurons and the degree of connectivity of networks of interneurons in the CNS.
However, this article shows a fundamental limitation of a mechanism of synchrony relying solely on inhibition. An important issue is therefore whether more robust synchrony in the CNS can be achieved through recurrent inhibition between excitatory and inhibitory populations. Work is in progress in this direction.

Appendix A: Analytical Estimate of M_c in the Weak Coupling Limit

We assume that the state of the network at time t can be completely characterized by the density function ρ(φ, t, m) dφ, which is the fraction of neurons with exactly m inputs whose phase lies between φ and φ + dφ at time t. Under this assumption, one can write the continuity equation for ρ(φ, t, m), which reads:

\frac{\partial \rho(\varphi, t, m)}{\partial t} = -\frac{\partial}{\partial \varphi} \left\{ \rho(\varphi, t, m) \left[ \frac{2\pi}{T_0} + \frac{m}{M}\, \Gamma_0 + \frac{m}{M} \int_0^N dm'\, P_0(m', M) \int_{-\pi}^{\pi} d\varphi'\, \rho(\varphi', t, m')\, \tilde{\Gamma}(\varphi' - \varphi) \right] \right\} \qquad (A.1)
The right-hand-side term in equation A.1 means that a neuron receives phasic contributions from m other neurons, and that the state of these "ancestor" neurons is determined by the number of inputs, m′, they receive themselves. Because M ≫ 1, the contribution from the tails of the gaussian distribution (see equation 3.8) to the integral \int_0^N dm' is negligible, and therefore this integral is well approximated by \int_{-\infty}^{\infty} dm'. The following transformations,

\hat{t} = |\Gamma_0|\, t, \qquad \delta m = m - M, \qquad \hat{\varphi} = \varphi - \left( \frac{2\pi}{T_0} + \Gamma_0 \right) t, \qquad (A.2)
lead to

\frac{\partial \rho(\hat{\varphi}, \hat{t}, \delta m)}{\partial \hat{t}} = -s_0 \frac{\delta m}{M} \frac{\partial \rho(\hat{\varphi}, \hat{t}, \delta m)}{\partial \hat{\varphi}} - \frac{1}{|\Gamma_0|} \left(1 + \frac{\delta m}{M}\right) \frac{\partial}{\partial \hat{\varphi}} \left[ \rho(\hat{\varphi}, \hat{t}, \delta m) \int_{-\infty}^{\infty} d(\delta m')\, P(\delta m', M) \int_{-\pi}^{\pi} d\hat{\varphi}'\, \rho(\hat{\varphi}', \hat{t}, \delta m')\, \tilde{\Gamma}(\hat{\varphi}' - \hat{\varphi}) \right] \qquad (A.3)
where s_0 = sign(Γ_0), with sign(x) = 1 if x > 0 and −1 if x < 0. From now on, we omit the hat symbols in equation A.3 and refer to t and φ as the new variables defined in equation A.2. A trivial solution of equation A.3 is

\rho(\varphi, t, \delta m) = \frac{1}{2\pi}. \qquad (A.4)

It corresponds to the asynchronous state of the network. To analyze its stability, we follow the approach of Strogatz and Mirollo (1991) and investigate the evolution of small deviations around this solution:

\rho(\varphi, t, \delta m) = \frac{1}{2\pi} + \eta(\varphi, t, \delta m). \qquad (A.5)
Using equation 3.3 and expanding η(φ, t, δm) in a Fourier series,

\eta(\varphi, t, \delta m) = \sum_{n=1}^{\infty} \left[ B_n(t, \delta m)\, e^{in\varphi} + \mathrm{c.c.} \right], \qquad (A.6)
one finds

\frac{\partial B_n(t, \delta m)}{\partial t} = -i n s_0 \frac{\delta m}{M} B_n(t, \delta m) + \frac{1}{2}\, n c_n e^{i a_n} \left(1 + \frac{\delta m}{M}\right) \int_{-\infty}^{\infty} d(\delta m')\, B_n(t, \delta m')\, P(\delta m', M). \qquad (A.7)
We seek, for each n, a solution of equation A.7 of the form

B_n(t, \delta m) = b_n(\delta m)\, e^{\lambda_n t}, \qquad (A.8)
where the eigenvalue, λ_n, does not depend on δm. Solving equations A.7 and A.8 for b_n(δm), we find

b_n(\delta m) = \frac{\left(1 + \frac{\delta m}{M}\right) A_n}{\lambda_n + i n s_0 \frac{\delta m}{M}}, \qquad (A.9)
where A_n is determined by the self-consistent equation:

A_n = \frac{n}{2}\, c_n e^{i a_n} \int_{-\infty}^{\infty} d(\delta m)\, P(\delta m, M)\, b_n(\delta m). \qquad (A.10)
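The step from equation A.7 to equation A.9 is pure algebra and can be checked numerically: with the ansatz A.8, and with A_n collecting the integral term as in A.10, equation A.7 reduces to λ_n b_n = −in s_0 (δm/M) b_n + (1 + δm/M) A_n, which A.9 solves identically. The parameter values in this sketch are arbitrary test inputs:

```python
# Check that b(dm) from equation A.9 satisfies the reduced mode equation
#   lam * b = -1j*n*s0*(dm/M)*b + (1 + dm/M)*A_n,
# i.e. equation A.7 after substituting the ansatz A.8 and collecting the
# integral term into A_n (equation A.10). All values are arbitrary.
n, s0, M = 2, 1, 100.0
lam = 0.3 + 0.2j
A_n = 1.5 + 0.4j

for dm in (-30.0, -5.0, 0.0, 7.0, 40.0):
    b = (1.0 + dm / M) * A_n / (lam + 1j * n * s0 * dm / M)   # equation A.9
    lhs = lam * b
    rhs = -1j * n * s0 * (dm / M) * b + (1.0 + dm / M) * A_n  # equation A.7
    assert abs(lhs - rhs) < 1e-12
```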
The complex number λ_n can be written as λ_n = μ_n − inΩ_n, where μ_n and Ω_n are real. Using this notation, the complex equations A.9 and A.10 can be expressed as:

\frac{2}{n c_n} \cos a_n = \int_{-\infty}^{\infty} d(\delta m)\, P(\delta m, M)\, \frac{\mu_n \left(1 + \frac{\delta m}{M}\right)}{\mu_n^2 + n^2 \left(s_0 \frac{\delta m}{M} - \Omega_n\right)^2} \qquad (A.11)

\frac{2}{n c_n} \sin a_n = \int_{-\infty}^{\infty} d(\delta m)\, P(\delta m, M)\, \frac{n \left(s_0 \frac{\delta m}{M} - \Omega_n\right) \left(1 + \frac{\delta m}{M}\right)}{\mu_n^2 + n^2 \left(s_0 \frac{\delta m}{M} - \Omega_n\right)^2} \qquad (A.12)
The term δm/M is of order 1/√M. Since M ≫ 1 (see equation 3.6), we neglect this term in equations A.11 and A.12 in comparison to 1, but not in comparison to |Ω_n|, which may be small. We want to find the critical value, M_{c,n}, above which λ_n has a positive real part (μ_n > 0). Hence, we look for a solution, M_{c,n}, of equations A.11 and A.12 in the limit μ_n → 0⁺. The right-hand side of equation A.11 is always positive for μ_n ≥ 0, and therefore this equation has a solution only if

-\pi/2 < a_n < \pi/2. \qquad (A.13)
Using the fact that \lim_{\mu_n \to 0^+} \mu_n / (\mu_n^2 + \nu^2) = \pi \delta(\nu), defining y_n = \sqrt{M_{c,n}}\, \Omega_n, and using the identity

P(m, M) = \frac{1}{\sqrt{M}}\, P\!\left( \frac{m}{\sqrt{M}},\, 1 \right), \qquad (A.14)

equation A.11 yields (after some straightforward algebra)

\frac{2}{c_n} \cos a_n = \pi P(y_n, 1)\, \sqrt{M_{c,n}}. \qquad (A.15)
Similarly, taking the limit μ_n → 0⁺ in equation A.12 gives

\frac{1}{c_n} \sin a_n = \sqrt{M_{c,n}}\, J(y_n, 1), \qquad (A.16)
where, following Sakaguchi and Kuramoto (1986), we have defined

J(\Omega, M) = \int_0^{\infty} \frac{dx}{2x} \left[ P(\Omega + x, M) - P(\Omega - x, M) \right]. \qquad (A.17)
Therefore, M_{c,n} and Ω_n are computed from:

J(y, 1) = \frac{\pi}{2}\, P(y, 1) \tan a_n \qquad (A.18)

M_{c,n} = \left( \frac{2 \cos a_n}{\pi c_n P(y, 1)} \right)^{2}. \qquad (A.19)
These equations are similar to those obtained by Sakaguchi and Kuramoto (1986) for the emergence of a synchronized state in a heterogeneous phase model with coupling via the first mode only: Γ(φ) = sin(φ + a). The value of y obtained from solving equation A.18 is a function of a_n only. In Figure 12A, we plot the graphs of the functions J(y, 1) and (π/2)P(y, 1) tan a for a = −π/4. It is clear that y increases with |a| for |a| < π/2, and therefore P(y, 1) decreases with |a| in this range. We define the universal function b(a) to be

b(a) = \frac{\sqrt{2\pi}\, P(y, 1)}{\cos a} \qquad (A.20)
(see Figure 12B), where y is the solution of equation A.18. Equations A.19 and A.13 become

M_{c,n} = \frac{8}{\pi c_n^2\, b^2(a_n)}; \qquad -\pi/2 < a_n < \pi/2. \qquad (A.21)
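For the standard gaussian P(y, 1), equations A.18 through A.21 are easy to evaluate numerically. The sketch below (illustrative code, not from the article) solves equation A.18 for y by scanning and bisection and then evaluates b(a_n) and M_{c,n}; as a sanity check, for a_n = 0 one recovers y = 0, b(0) = 1, and M_{c,n} = 8/(π c_n²).

```python
import math

def P(y):
    """Standard gaussian density, P(y, 1)."""
    return math.exp(-0.5 * y * y) / math.sqrt(2.0 * math.pi)

def J(y, dx=0.01, xmax=10.0):
    """Equation A.17 for the standard gaussian, by midpoint integration."""
    s, x = 0.0, 0.5 * dx
    while x < xmax:
        s += (P(y + x) - P(y - x)) / (2.0 * x) * dx
        x += dx
    return s

def solve_y(a):
    """Solve equation A.18, J(y,1) = (pi/2) P(y,1) tan(a), for y by scanning
    for a sign change and bisecting (assumes -pi/2 < a < pi/2)."""
    f = lambda y: J(y) - 0.5 * math.pi * P(y) * math.tan(a)
    ys = [-8.0 + 0.1 * k for k in range(161)]
    lo = hi = None
    for y1, y2 in zip(ys, ys[1:]):
        if f(y1) * f(y2) <= 0.0:
            lo, hi = y1, y2
            break
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def Mc_n(a, c_n):
    """Critical connectivity of mode n from equations A.20 and A.21."""
    y = solve_y(a)
    b = math.sqrt(2.0 * math.pi) * P(y) / math.cos(a)   # equation A.20
    return 8.0 / (math.pi * c_n ** 2 * b ** 2)          # equation A.21
```

Taking the minimum of Mc_n over the admissible modes then implements equation A.22 below.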
Figure 12: (A) Functions J(y, 1) (solid line) and (π/2)P(y, 1) tan a (dashed line) for a = −π/4. (B) Graph of the function b(a), referred to in equation 3.12 and defined in equations A.18 and A.20.
Equation A.21 gives the number of inputs, M_{c,n}, above which the nth Fourier mode destabilizes the asynchronous state. In order to find the number M_c above which this state becomes unstable, we look for the minimal M_{c,n}:

M_c = \min_n M_{c,n}, \qquad (A.22)

where the minimum is taken over all positive integer values of n for which −π/2 < a_n < π/2. Below M_c, the asynchronous state can be stable or marginally stable. In the case of the Kuramoto model (Kuramoto, 1984), the asynchronous state is linearly marginally stable when it is not unstable, in the sense that the stability matrix has a zero eigenvalue. Still, fluctuations of the asynchronous state decay with time (Strogatz & Mirollo, 1991; Strogatz, Mirollo, & Matthews,
1992). This is probably also the situation here. Our numerical results show that below M_c, the system dynamics converge to the asynchronous state when the initial conditions are chosen from a uniform distribution. Note that the instability criterion we have established is formally similar to the equations that determine the stability boundary of the asynchronous state in a fully connected network of neurons with heterogeneous intrinsic properties (Neltner et al., 2000).

Appendix B: The Fourier Transform of Γ(φ)

The Fourier transform of a periodic function h(φ), in the complex exponential representation, is

h(\varphi) = \sum_{n=-\infty}^{\infty} h_n\, e^{-in\varphi}. \qquad (B.1)
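With the convention of equation B.1, the coefficients are h_n = (1/2π) ∫ h(φ) e^{+inφ} dφ (note the plus sign in the exponent). A quick numerical check of this convention (illustrative code, not from the article), using the test function h(φ) = sin(2φ + 0.3), for which |h₂| = 1/2 and arg(h₂) = π/2 − 0.3:

```python
import cmath
import math

def fourier_coeff(h, n, K=4096):
    """h_n in the convention h(phi) = sum_n h_n exp(-i n phi) (equation B.1),
    computed as (1/2pi) * integral of h(phi) exp(+i n phi) dphi via a uniform
    grid sum (exact, up to rounding, for band-limited periodic h)."""
    s = 0.0 + 0.0j
    for k in range(K):
        phi = 2.0 * math.pi * k / K
        s += h(phi) * cmath.exp(1j * n * phi)
    return s / K

h2 = fourier_coeff(lambda phi: math.sin(2.0 * phi + 0.3), 2)
# abs(h2) = 0.5 and cmath.phase(h2) = pi/2 - 0.3; also h_{-2} = conj(h2),
# as required for a real-valued h.
```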
The complex number h_n is represented as h_n = |h_n| exp[i arg(h_n)], where |h_n| is nonnegative and −π < arg(h_n) ≤ π. For a real number x, arg(x) = 0 if x > 0 and arg(x) = π if x < 0. It is straightforward to calculate the Fourier coefficients of the functions Z(φ) (see equation 4.2) and I_syn(φ) (see equation 4.3). One finds:

|Z_n| = \frac{\theta}{I_{\mathrm{ext}} (I_{\mathrm{ext}} - \theta)\, T_0 \left(1 + n^2 \omega_0^2 \tau_0^2\right)^{1/2}} \left\{ 1 + \frac{2 I_{\mathrm{ext}} (I_{\mathrm{ext}} - \theta)}{\theta^2} \left[ 1 - \cos(n \omega_0 T_r) \right] \right\}^{1/2} \qquad (B.2)

\arg(Z_n) = -\arctan(n \omega_0 \tau_0) - \arctan \frac{\sin(n \omega_0 T_r)}{I_{\mathrm{ext}} / (I_{\mathrm{ext}} - \theta) - \cos(n \omega_0 T_r)} \qquad (B.3)

I_{\mathrm{syn},n} = \frac{2}{T_0}\, \frac{|g_{\mathrm{syn}}|\, \tau_0}{\left[ \left(1 + n^2 \omega_0^2 \tau_1^2\right) \left(1 + n^2 \omega_0^2 \tau_2^2\right) \right]^{1/2}} \qquad (B.4)

\arg(I_{\mathrm{syn},n}) = \arg(g_{\mathrm{syn}}) + \arctan(n \omega_0 \tau_1) + \arctan(n \omega_0 \tau_2), \qquad (B.5)

where the function arctan takes values in the interval [−π/2, π/2]. The convolution theorem implies that the Fourier components of Γ(φ) (in the complex exponential representation) are given by

\Gamma_n = Z_n\, I_{\mathrm{syn},n}^{*}, \qquad (B.6)
where the asterisk means complex conjugate. Transferring this representation to the sine representation (see equation 3.3), this formula becomes

\Gamma_0 = |Z_0|\, I_{\mathrm{syn},0}, \qquad c_n = 2\, |Z_n|\, I_{\mathrm{syn},n} \qquad (B.7)

a_n = -\frac{\pi}{2} - \arg(Z_n) + \arg(I_{\mathrm{syn},n}). \qquad (B.8)
Equations B.6, B.7, and B.8 are valid for every model in which the synaptic coupling is expressed in terms of currents (and not in terms of conductances). Substituting equations B.2–B.5 into equations B.7 and B.8 yields equations 4.4 and 4.5.

Acknowledgments

We are thankful to C. van Vreeswijk, G. Mato, H. Sompolinsky, L. Neltner, and G. B. Ermentrout for most helpful discussions. We thank C. van Vreeswijk for a careful reading of the manuscript. D. H. acknowledges the hospitality of the Center for Neural Computation and the Racah Institute of Physics of the Hebrew University. This research was partially supported by the PICS 236 (PIR Cognisciences and M.A.E) and by a grant from AFIRST to D. H., and by a grant from the Israel Science Foundation to D. G.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Barkai, E., Kanter, I., & Sompolinsky, H. (1990). Properties of sparsely connected excitatory neural networks. Phys. Rev. A, 41, 590–597.
Binder, K., & Heermann, D. W. (1992). Monte Carlo simulation in statistical physics. Berlin: Springer-Verlag.
Bressloff, P. C., & Coombes, S. (1998a). Desynchronization, mode locking, and bursting in strongly coupled integrate-and-fire oscillators. Phys. Rev. Lett., 81, 2168–2171.
Bressloff, P. C., & Coombes, S. (1998b). Spike train dynamics underlying pattern formation in integrate-and-fire oscillator networks. Phys. Rev. Lett., 81, 2384–2387.
Brunel, N., & Hakim, V. (1999). Fast global oscillations in networks of integrate-and-fire neurons with low firing rates. Neural Comput., 11, 1621–1671.
Buzsáki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuron networks. Curr. Opin. Neurobiol., 5, 504–510.
Buzsáki, G., Leung, L., & Vanderwolf, C. H. (1983). Cellular bases of hippocampal EEG in the behaving rat. Brain Res. Rev., 6, 139–171.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comput. Neurosci., 5, 407–420.
Ermentrout, G. B., & Kopell, N. (1984). Frequency plateaus in a chain of weakly coupled oscillators. SIAM J. Math. Anal., 15, 215–237.
Gerstner, W. (2000). Population dynamics of spiking neurons: Fast transients, asynchronous states, and locking. Neural Comput., 12, 43–89.
Gerstner, W., van Hemmen, J. L., & Cowan, J. (1996). What matters in neuronal locking. Neural Comput., 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neuronal networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Hansel, D. (2000). Synchrony in sparse networks of phase oscillators. Unpublished manuscript.
Golomb, D., Hansel, D., Shraiman, B., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Phys. Rev. A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 71, 259–282.
Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiol., 72, 1109–1126.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neurosci., 1, 11–38.
Hansel, D., Mato, G., & Meunier, C. (1993a). Phase dynamics for weakly coupled Hodgkin-Huxley neurons. Europhys. Lett., 23, 367–372.
Hansel, D., Mato, G., & Meunier, C. (1993b). Clustering and slow switching in globally coupled phase oscillators. Phys. Rev. E, 48, 3470–3477.
Hansel, D., Mato, G., & Meunier, C. (1993c). Phase reduction and neural modeling. Concepts in Neuroscience, 4, 192–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hansel, D., Mato, G., Meunier, C., & Neltner, L. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comp., 10, 467–485.
Hansel, D., & Sompolinsky, H. (1992). Synchrony and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comp. Neurosci., 3, 7–34.
Izhikevich, E. M. (1998). Phase models with explicit time delays. Phys. Rev. E, 58, 905–908.
Kim, J., van Vreeswijk, C., & Sompolinsky, H. (1998). Stochastic activity in balanced integrate-and-fire neural networks. Unpublished manuscript, Racah Institute of Physics, Hebrew University of Jerusalem.
Kopell, N. (1988). Toward a theory of modeling central pattern generators. In A. Cohen (Ed.), Neural control of rhythmic movements in vertebrates (pp. 369–413). New York: Wiley.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Kuramoto, Y. (1991). Collective synchronization of pulse-coupled oscillators and excitable units. Physica D, 50, 15–30.
Lytton, W. W., & Sejnowski, T. J. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory interneurons. J. Neurophysiol., 66, 1059–1079.
Markram, H., Lubke, J., Frotscher, M., Roth, A., & Sakmann, B. (1997). Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. J. Physiol. (Lond.), 500, 409–440.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
Mirollo, R. E., & Strogatz, S. H. (1990). Synchronization of pulse-coupled biological oscillators. SIAM J. Appl. Math., 50, 1645–1657.
Neltner, L., Hansel, D., Mato, G., & Meunier, C. (2000). Synchrony in heterogeneous networks of spiking neurons. Neural Comput., 12 (in preparation).
Sakaguchi, H., & Kuramoto, Y. (1986). A solvable active rotator model showing phase transitions via mutual entrainment. Prog. Theor. Phys., 76, 576–581.
Shelley, M. (1999). Efficient and accurate integration schemes for systems of integrate-and-fire neurons. Unpublished manuscript, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.
Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635.
Strogatz, S. H., Mirollo, R. E., & Matthews, P. C. (1992). Coupled nonlinear oscillations below the synchronization threshold: Relaxation by generalized Landau damping. Phys. Rev. Lett., 68, 2730–2733.
Thomson, A. M., Deuchars, J., & West, D. C. (1993). Large, deep layer pyramid-pyramid single axon EPSPs in slices of rat motor cortex display paired pulse and frequency-dependent depression, mediated presynaptically, and self-facilitation, mediated postsynaptically. J. Neurophysiol., 70, 2354–2369.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Treves, A. (1993). Mean field analysis of neuronal spike dynamics. Network, 4, 259–284.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Patterns of synchrony in integrate-and-fire networks. Phys. Rev. Lett., 71, 1280–1283.
Tsodyks, M., & Sejnowski, T. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology. New York: Cambridge University Press.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1372.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
Wang, X.-J., Golomb, D., & Rinzel, J. (1995). Emergent spindle oscillations and intermittent burst firing in a thalamic model: Specific neuronal mechanisms. Proc. Natl. Acad. Sci. (USA), 92, 5577–5581.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronous oscillations in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.

Received April 30, 1999; accepted June 30, 1999.
LETTER
Communicated by David Lowe
A Neural Network Architecture for Visual Selection

Yali Amit
Department of Statistics, University of Chicago, Chicago, IL 60637, U.S.A.

This article describes a parallel neural net architecture for efficient and robust visual selection in generic gray-level images. Objects are represented through flexible star-type planar arrangements of binary local features, which are in turn star-type planar arrangements of oriented edges. Candidate locations are detected over a range of scales and other deformations, using a generalized Hough transform. The flexibility of the arrangements provides the required invariance. Training involves selecting a small number of stable local features from a predefined pool, which are well localized on registered examples of the object. Training therefore requires only small data sets. The parallel architecture is constructed so that the Hough transform associated with any object can be implemented without creating or modifying any connections. The different object representations are learned and stored in a central module. When one of these representations is evoked, it "primes" the appropriate layers in the network so that the corresponding Hough transform is computed. Analogies between the different layers in the network and those in the visual system are discussed. Furthermore, the model can be used to explain certain experiments on visual selection reported in the literature.
1 Introduction
The issue at hand is visual selection in real, still gray-scale images, guided by the very specific task of detecting a "memorized" or "learned" object class. Assume no "easy" segmentation cues are available, such as color or brightness of the object, or a relatively flat and uniform background. In other words, segmentation cannot precede detection. Moreover, the typical 240 × 320 gray-scale image is highly cluttered with multiple objects at various scales, and dealing with false positives is crucial. The aim of selection is to choose quickly a small number of candidate poses for the object in the scene, from among the millions of possibilities, taking into account all possible locations and a range of scales. Some of these candidate poses may be false and may be further analyzed for final verification, using more intensive procedures, which are not discussed here.

Neural Computation 12, 1141–1164 (2000) © 2000 Massachusetts Institute of Technology
1142
Yali Amit
Various biological facts and experiments must be dealt with when constructing a computational model for visual selection. Selection is a very rapid process, occurring approximately 150 milliseconds after presentation of the image (Thorpe, Fize, & Marlot, 1996; Desimone, Miller, Chelazzi, & Lueschow, 1995). Moreover, of these 150 milliseconds, at least 50 milliseconds are spent on processing from the retina through the primary visual cortex V1, which is assumed to be involved with detection of edges, various local motion cues, and so forth. So very little time is left for the global processing necessary to determine a candidate location of the object. There is a limited number of processing stages between V1 and IT, where there is clear evidence for selective responses to full objects even prior to the saccade (see Desimone et al., 1995). A common schematic description of these stages includes V1 simple cells, V1 complex cells, V2 cells detecting more complex structures (e.g., illusory contours, T-junctions, corners; see von der Heydt, 1995), V4 cells, and finally IT. Receptive field size increases as one moves up this hierarchy. The receptive fields of cells in a given column, which correspond to approximately the same location in the visual field and to the same feature, exhibit significant deviations with respect to each other. These deviations increase with the level in the hierarchy (Zeki, 1993). Selection is not a very accurate process. Mistakes can be made. Often it is necessary to foveate to the location and process the data further to make a final determination as to whether the required object is indeed there. Selection is invariant to a range of scales and locations (Lueschow, Miller, & Desimone, 1994; Ito, Tamura, Fujita, & Tanaka, 1995).
The question is whether there are computational approaches that accommodate the constraints listed above; that is, can efficient selection be achieved at a range of scales, with a minimal number of false negatives and a small number of false positives, using a small number of functional layers? Can the computation be parallelized, making minimal assumptions regarding the information carried by an individual neuron? Such computational approaches, if identified, may serve as models for information processing beyond V1. They can provide a source of hypotheses that can be experimentally tested. The aim of this article is to present such a model in the form of a neural network architecture that is able to select candidate locations for any object representation evoked in a memory model; the network does not need to be changed for detecting different objects. The network has a sequence of layers similar to those found in visual cortex. The operations of all units are simple integer counts. A unit is either on or off, and its status depends
Neural Network Architecture for Visual Selection
1143
on the number of on units feeding into it. Selection at a fixed resolution is invariant to a range of scales and all locations, as well as to small linear and nonlinear deformations.

1.1 Overview of the Selection Algorithm. Selection involves finding candidate locations for the center of the object, which can be present at a range of poses (location, small rotations, scaling of ±20%, and other small deformations). In other words, the detection is invariant to the range of poses but does not convey information about the parameters of the pose except for location. Various options for recovering pose parameters such as scale and rotation are discussed in Amit and Geman (1999) and Amit (1998). This issue is not dealt with here. Selection can therefore be viewed as a process whereby each location is classified as object or background. By object, we mean that an object is present at one of an admissible range of deformations at that location. There are two pitfalls arising from this point of view that should be avoided. The first is the notion that one simply needs training samples from the object at the predetermined range of scales and samples from background images, which are then processed through some learning algorithm (e.g., tree classifiers, feedforward neural nets, support vector machines) to produce the classifier. The second is that this classifier is subsequently implemented at each location. The problem with this point of view is that it will produce a different global architecture (tree, neural net, etc.) for each new object that needs to be detected; it will require a large training set, both to represent background images and to learn the required range of scales; and it will not directly address the issue of efficiency in computation and resources required for detection of multiple objects. It should be noted, however, that this approach has led to successful computer vision algorithms for specific objects.
For an example in the context of face detection, see Rowley, Baluja, and Takeo (1998) and Sung and Poggio (1998). In view of these issues and in order to accommodate the constraints described in the previous section, the selection system described below is guided by the following assumptions:
The classification carried out at each image location—object versus background—must be a very simple operation. One cannot expect a sophisticated classifier such as a heavily connected neural net to be present at each location.

Objects are represented through spatial arrangements of local features, not through individual local features. Individual local features are insufficient to discriminate between an object and the rich structure of the background; multiple local features are also insufficient without spatial constraints on their relative locations.
Yali Amit
The basic inputs to the network are coarse oriented edge detectors with a high degree of photometric invariance. Local features are defined as functions of these edges.

Training involves determining moderate-probability local features, at specific locations, on examples of the object, which are registered to a reference pose in a reference grid.

Invariance to local deformations is hard-wired into the definition of the local features, not learned. Invariance to global transformations is hard-wired into the computation and not learned.

No changes need to be made to the network for detecting new objects except for learning their representation in a central memory module. The representation does not need to be distributed to all locations.

The network codes for the continuous location parameter through a retinotopic layer, which is active (on) only at the detected locations. Individual units are not expected to code for continuous variables.

The solution proposed here is to represent objects in terms of a collection of local features constrained to lie in certain regions relative to a virtual center. In other words, the spatial conjunction employed here can be expressed in a very simple star-type graph (see Figure 2, right panel). Any constraints involving relative locations of pairs or larger subsets of features significantly increase the computational load or the complexity of the required parallel architecture. Selection involves finding those locations for which there is a sufficient number of such features satisfying the relative spatial constraints. This is done using a version of the generalized Hough transform (see Grimson, 1990). The parallel architecture is constructed so that the Hough transform associated with any object can be implemented without creating or modifying any connections. The different object representations are stored in a central module.
When one of these representations is evoked, it primes the appropriate layers in the network so that the corresponding Hough transform is computed. The network involves two levels of simple and complex type neurons. The simple layers detect features (edges and edge conjunctions) at specific locations; the complex layers perform a disjunction over a certain region. These disjunctions are appropriately displaced from the original location of the feature, enabling a simple summation to compute the Hough transform. The disjunction regions are not learned but are hard-wired into the architecture. The features come from a predefined pool. They are constructed so as to have sufficiently low density on generic background. Training for an object class involves taking a small number of examples at reference pose (i.e., registered to the reference grid) and identifying those features from the pool that are frequent at particular locations in the reference grid. These locations implicitly define the global spatial relations. There is no special learning phase for these relations. The object representation is simply the list of selected features with the selected locations on the reference grid.

The entire architecture uses a caricature of neurons as binary variables that are either on or off. This is clearly a very simplistic assumption, but it has several advantages. The description of the algorithm is greatly simplified, its serial implementation becomes very efficient, and it does not a priori commit to any specific assumption on the information conveyed in the continuous output of any of the neurons. On the other hand, starting from the model presented here, it is possible to incorporate multivalued outputs for the neurons at each stage and carefully study how this affects the performance, stability, and invariance properties of the system.

Two main conclusions can be drawn from the experiments and model described here. First, there is a way to achieve good object and background discrimination, with low false-negative and false-positive rates, in real images, with a very simple form of spatial conjunction of local features. The local features are simple functions of the edge maps. This can be viewed as a statistical statement about objects and background in images. Second, the detection algorithm has a simple parallel implementation, based on the well-known Hough transform, that exhibits interesting analogies with the visual system. In other words, efficient visual selection can be done in parallel.

The article is organized as follows. Section 2 discusses relations to other work on detection. Section 3 provides a description of the detection algorithm, the type of local features employed, and how they are identified in training. In section 4 the neural architecture for such detection is described. In section 5 the biological analogies of the model are discussed, along with how the architecture can be used to explain certain experiments in visual selection reported in the literature.

2 Related Work
The idea of using local feature arrangements for detection can also be found in Burl, Leung, and Perona (1995), Wiskott, Fellous, Kruger, and von der Malsburg (1997), and Cootes and Taylor (1996). In these approaches the features, or certain relevant parameters, are also identified through training. One clear difference, however, is that the approach presented here makes use of very simple binary features with hard-wired invariances and employs a very simple form of spatial arrangement for the object representation. This leads to an efficient implementation of the detection algorithm. On the other hand, the representations described in these articles are more detailed, provide more precise registration information, and are useful in the more
intensive verification and recognition processes subsequent to selection and foveation.

In Van Rullen, Gautrais, Delorme, and Thorpe (1998) a similar representation is derived. Features defined in terms of simple functionals of an edge input layer are used to represent components of the face, making use of the statistics of the edges on the training set. They model only three locations (two eyes and mouth) on the object, which are chosen by the user, training multiple features for each. In the approach suggested here, many more locations on the object are represented, using one feature for each. These locations are identified automatically in training. One problem with overdedicating resources to only three locations is the issue of noise and occlusion, which may eliminate one of the three. Our representation for faces, for example, is at a much lower resolution—14 pixels between the eyes on average. At that resolution the eyes and mouth in themselves may be quite ambiguous. In Van Rullen et al. (1998) pose-invariant detection is achieved by having a “face neuron” for each location in the scene. This raises a serious problem of distributing the representation to all locations. One of the main goals of the architecture presented here is to overcome the need to distribute the representation of each new object that is learned to a dedicated “object neuron” at each location.

The generalized Hough transform has been extensively used in object detection (see Ballard, 1981; Grimson, 1990; Rojer & Schwartz, 1992). However, in contrast to most previous uses of the Hough transform, the local features here are more complex and are identified through training, not from the geometric properties of the object. It should be emphasized that this “trained” Hough transform can be implemented with any pool of binary features that exhibit the appropriate invariance properties and the appropriate statistics on object and background—for example, those suggested by Van Rullen et al. (1998).
In the use of simple and complex layers, this architecture has similarities with Fukushima (1986) and Fukushima and Wake (1991). Indeed, in both articles the role of ORing in the complex cell layers as a means of obtaining invariance is emphasized. However, there are major differences between the architecture described here and the neocognitron paradigm. In our model, training is done only for local features. The global integration of local-level information is done by a fixed architecture, driven by top-down information. Therefore, features do not need to get more and more complex, and there is no need for a long sequence of layers. Robust detection occurs directly in terms of the local feature level, which has only the oriented edge level below it.

3 Detection
We first describe a simple counting classifier for object versus background at the reference pose. Then an efficient implementation of this classifier on an entire scene is presented using the Hough transform.
3.1 A Counting Classifier. Let G denote a d × d reference grid, centered at the origin, where d ≈ 30. An image from the object class is said to be registered to the reference grid if certain well-defined landmarks on the object are located at fixed locations in G. For example, an image of a face would be registered if the two eyes and the mouth are at fixed locations. Let a_1, ..., a_{n_ob} be binary local features (filters) that have been chosen through training (see section 3.4), together with locations y_1, ..., y_{n_ob} in G. This collection constitutes the object representation. For any i = 1, ..., n_ob and any location x in a scene, we write a_i(x) = 1/0 to indicate the presence or absence of local feature a_i in a small neighborhood of x. A threshold t is chosen. An image in the reference grid is then classified as object if at least t of these features are present at the correct locations, and background otherwise. Assume that, conditional on the presence of a registered object in the reference grid, the variables a_i(y_i) are independent, with P(a_i(y_i) = 1 | Object) = p_o, and that, conditional on the presence of a generic background image in the reference grid, these variables are also independent, with P(a_i(y_i) = 1 | Bgd.) = p_b ≪ p_o. Using basic properties of the binomial distribution, we can predict the false-positive and false-negative probabilities for given n_ob, t, p_o, p_b (see Rojer & Schwartz, 1992; Amit, 1998).

3.2 Finding Candidate Poses. How can this classifier be extended to the full image, taking into account that the object may appear at all locations and under a large range of allowable transformations? Assume all images are presented on a D × D grid L and centered at the origin, with D ≈ 300. Thus, the reference grid is a small subset of the image grid. Henceforth y will denote locations on the reference grid, and x will denote locations in the scene or image.
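Under these independence assumptions the error rates are binomial tails: the false-negative probability is P(Bin(n_ob, p_o) < t) and the false-positive probability per grid classification is P(Bin(n_ob, p_b) ≥ t). A minimal sketch of this prediction in Python; the values n_ob = 20, t = 10, p_o ≈ 0.8 are those reported in section 3.4, while p_b = 0.05 is an illustrative assumption, not a figure from the text:

```python
from math import comb

def binom_tail(n, p, t):
    """P(X >= t) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

# n_ob, t, p_o from section 3.4; p_b is an assumed background probability.
n_ob, t, p_o, p_b = 20, 10, 0.8, 0.05

# Object present but fewer than t features fire:
false_negative = 1.0 - binom_tail(n_ob, p_o, t)
# Background, yet at least t features fire:
false_positive = binom_tail(n_ob, p_b, t)
```

With these values the predicted false-positive probability per classification is vanishingly small, which is what makes scanning every image location feasible.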
A pose of the object in the image is determined by a location x_c and a map A ∈ 𝒜, where 𝒜 is some subset of linear (and perhaps even nonlinear) transformations of the plane. Typically 𝒜 accommodates a range of scales of ±25% and a small range of rotations. To decide whether the object is present at a location x_c ∈ L, for each transformation A ∈ 𝒜 we need to check how many of the n_ob features a_i are present at the transformed locations x_i = Ay_i + x_c. If at least t are found for some A, then the object is declared present at location x_c. This operation involves an extensive search through the pose set 𝒜 or, alternatively, verifying complex constraints on the relative locations of pairs or triples of local features (see Grimson, 1990). It is not clear how to implement this in a parallel architecture. We simplify by decoupling the constraints between the local features. To decide if the object is present at x_c, we check for each feature a_i whether it is present at x_i = Ay_i + x_c for any A ∈ 𝒜. Namely, the map A may vary from feature to feature. In other words, we count how many of the regions x_c + B_{y_i} contain at least one instance of the corresponding feature a_i, where
B_{y_i} = {Ay_i : A ∈ 𝒜}. If we find at least t, we declare the object present at x_c. This is summarized as follows:

(C) There exist at least t indices i_1, ..., i_t such that the region x_c + B_{y_{i_j}} contains an instance of local feature a_{i_j}, for j = 1, ..., t.

The object model can now be viewed as a flexible star-type planar arrangement in terms of the local features, centered at the origin (see Figure 2, right panel). This is a very simple system of spatial constraints on the relative locations of the features. Decoupling the constraints allows us to detect candidate locations with great efficiency. There is no need to proceed location by location and verify (C). Starting from the detected locations of the local features, the following Hough transform provides the candidate locations of the object:

Hough Transform

1. Identify the locations of all n_ob local features in the image.

2. Initialize D × D arrays Q_i, i = 1, ..., n_ob, at 0.

3. For every location x of feature a_i, set to 1 all locations in the region x − B_{y_i} in array Q_i.

4. Choose locations x_c for which Σ_{i=1}^{n_ob} Q_i(x_c) ≥ t.

The locations of step 4 are precisely the locations where condition (C) is satisfied. For a 30 × 30 reference grid, the size of B_i = B_{y_i} needed to accommodate the range of scales and rotations mentioned above varies depending on the distance of the point y_i from the origin; it is at most on the order of 100 pixels. Note that B_y is entirely determined by y once the set of allowable transformations 𝒜 is fixed. Under condition (C), both p_o and p_b increase relative to their values on the reference grid. The new background probability p_b is approximated by λ_b|B|, where λ_b is the density of the features in the background and |B| is the area of the regions B_i. The new object probability p_o can be reestimated from the training data using the regions B_i. The main concern is the increase in false positives due to the increase in the background probabilities. This is directly related to the background density λ_b. As long as λ_b is sufficiently small, this “greedy” detection algorithm will not yield too many false positives.

3.3 Local Features. The key question is which local features to use—namely, how to decrease λ_b while maintaining relatively high “on object” probabilities p_o. The first possibility would be to use the oriented edges, which serve as input to the entire system. These are the most basic local features detected in the visual system. The problem with edges, however, is
Figure 1: A vertical edge is present at z if |I(z) − I(y)| > |I(z) − I(z_i)| and |I(z) − I(y)| > |I(y) − I(y_i)| for all i = 1, 2, 3. The polarity depends on the sign of I(z) − I(y). Horizontal edges are defined by rotating this scheme by 90 degrees.
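The test of Figure 1 can be sketched as follows; the exact choice of the three comparison neighbours of z and of y (here the vertical neighbours plus the outer horizontal neighbour on each side, as Figure 1 suggests) is an assumption:

```python
import numpy as np

def vertical_edges(I):
    """
    Sketch of the Figure 1 edge test: a vertical edge is marked at pixel
    z = (r, c), with horizontal neighbour y = (r, c+1), when the difference
    across the pair dominates the differences between z and its remaining
    neighbours z_1..z_3 and between y and its neighbours y_1..y_3.
    Polarity follows the sign of I(z) - I(y).
    """
    I = I.astype(float)
    H, W = I.shape
    pos = np.zeros((H, W), dtype=bool)   # darker on the right
    neg = np.zeros((H, W), dtype=bool)   # darker on the left
    for r in range(1, H - 1):
        for c in range(1, W - 2):
            z, y = I[r, c], I[r, c + 1]
            d = abs(z - y)
            zn = [I[r - 1, c], I[r + 1, c], I[r, c - 1]]          # neighbours of z
            yn = [I[r - 1, c + 1], I[r + 1, c + 1], I[r, c + 2]]  # neighbours of y
            if all(d > abs(z - q) for q in zn) and all(d > abs(y - q) for q in yn):
                (pos if z > y else neg)[r, c] = True
    return pos, neg
```

On a step image (dark left half, bright right half) the test fires exactly once per row, on the dark side of the step, which matches the intended sparseness of the detector.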
that they appear with a rather high density in typical scenes. In other words, λ_b is such that λ_b|B| > 1, leading to very large numbers of false positives. (See also the detailed discussion in Grimson, 1990.) The edge detector used here is a simple local maximum of the gradient with four orientations—horizontal and vertical of both polarities (see Figure 1). The edges exhibit strong invariance to photometric transformations and local geometric transformations. Typical λ_b for such edges on generic images is .03 per pixel. We therefore move on to flexible edge conjunctions to reduce λ_b while maintaining high p_o. These conjunctions are flexible enough to be stable at specific locations on the object for the previously defined range of deformations, and they have lower density in the background. These conjunctions are simple binary filters computed on the edge maps. We emphasize that the computational approach and associated architecture can be implemented with other families of local features exhibiting similar statistical properties.

3.3.1 Flexible Edge Arrangements. Define a collection of small regions b_v in the immediate vicinity of the origin, indexed by their centers v. Let G_loc denote the collection of these centers. A local feature is defined in terms of n_loc pairs of edges and locations (e_ℓ, v_ℓ), ℓ = 1, ..., n_loc. It is present at point x if there is an edge of type e_ℓ in the neighborhood x + b_{v_ℓ}, for each ℓ = 1, ..., n_loc. Figure 2 shows an example of an arrangement with three edges. It will be convenient to assume that v_1 = 0 and that b_1 = {0}. Invariance to local deformations is explicitly incorporated in the disjunction over the regions b_v. Note that each local feature can be viewed as a flexible star-type planar arrangement centered at x. Arrangements with two edges—n_loc = 2—loosely describe local features such as corners, T-junctions, slits, and various contour segments with different curvature.
On generic gray-scale images, the density λ_b is found to decay exponentially with n_loc (see Amit & Geman, 1999) and, for fixed n_loc, is more or less the same no matter which arrangement is chosen. For example, for n_loc = 4 the density λ_b lies between .002 and .003. For n_loc = 2, λ_b ≈ .007.
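A minimal sketch of detecting one such flexible arrangement from precomputed edge maps; the square 3 × 3 disjunction regions b_v (as in the Figure 2 example), the dictionary layout, and the function name are illustrative assumptions:

```python
import numpy as np

def detect_arrangement(edge_maps, pairs, half=1):
    """
    Detect a flexible edge arrangement (section 3.3.1), a sketch.

    edge_maps: dict edge_type -> boolean array of edge detections.
    pairs: list of (edge_type, (dr, dc)) giving the (e_l, v_l) pairs; the
        first pair is taken at the origin with b_1 = {0}, as in the text.
    half: half-width of the square disjunction regions b_v.
    Returns a boolean array, True at x when, for every l, an edge of type
    e_l occurs somewhere in x + b_{v_l}.
    """
    shape = next(iter(edge_maps.values())).shape
    H, W = shape
    out = np.ones(shape, dtype=bool)
    for k, (e, (dr, dc)) in enumerate(pairs):
        m = edge_maps[e]
        h = 0 if k == 0 else half      # b_1 = {0}: no slack for the first edge
        hit = np.zeros(shape, dtype=bool)
        for r in range(H):
            for c in range(W):
                r0, r1 = max(0, r + dr - h), min(H, r + dr + h + 1)
                c0, c1 = max(0, c + dc - h), min(W, c + dc + h + 1)
                hit[r, c] = m[r0:r1, c0:c1].any()   # disjunction over b_v
        out &= hit                                   # conjunction over the pairs
    return out
```

Using the offsets of Figure 2, v_1 = (0, 0), v_2 = (−2, 2), v_3 = (2, −3), the arrangement fires only where all three displaced regions contain an edge of the right type.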
Figure 2: (Left) An arrangement with three edges, using v_1 = (0, 0), v_2 = (−2, 2), v_3 = (2, −3). The regions b_{v_2} and b_{v_3} are 3 × 3, and b_1 = {0}. This arrangement is present at a location x if a horizontal edge is present at x, another horizontal edge is present anywhere in x + b_{v_2}, and a vertical edge is present anywhere in x + b_{v_3}. (Right) A graphical representation of the full model. The large circle in the middle represents the center point, each medium-sized circle represents the center of a local feature, and the small circles represent the edges.
3.4 Training. Training images of the object class are registered to the reference grid. A choice of feature complexity is made; that is, n_loc is chosen. At each location y, a search through the pool of features with n_loc edges is carried out to find a feature that is present in over 50% of the data, anywhere in a small neighborhood of y. If one is found, it is recorded together with the location. For a variety of object classes, such as faces, synthetically deformed symbols and shapes, and rigid 3D objects, and using n_loc = 4, a greedy search typically yields several tens of locations on the reference grid for which p_o ≈ .5. After correcting for the larger regions B_i, we obtain p_o ≈ 0.8, and it turns out that, taking into consideration the background statistics, using only 20 of these feature and location pairs is sufficient. These are chosen randomly from all identified feature-location pairs. We then use a threshold t ≈ 10 in the Hough transform. Alternatively, a full search over all features with two edges (n_loc = 2) yields p_o > 0.9. In this case 40 local features with t ≈ 30 are needed for good classification. It is precisely the fact that this hierarchy of local features exhibits such statistics in real gray-level images that makes the algorithm feasible.

3.5 Some Experiments. In order to cover a range of scales of approximately 4:1, the same detection procedure is carried out at 5 to 6 resolutions. On a 200-MHz Pentium II, the computation time for all scales together is about 1.5 seconds for generic 240 × 320 gray-scale images. Example detections are shown in Figures 7, 8, and 9. Also in these displays are random samples from the training sets and some of the local features used in each
representation in their respective locations on the reference grid. The detectors for the 2D view of the clip (see Figure 9) and for the “eight” (see Figure 8) were made from only one image, which was subsequently deformed synthetically using random linear transformations. In the example of the clip, due to the invariances built into the detection algorithm, the clip can be detected in a moderate range of viewing angles around the original one, which were not present in training. The face detector was produced from only 300 faces of the Olivetti database. Still, faces are detected in very diverse lighting conditions and in the presence of occlusion. (More can be viewed online at http://galton.uchicago.edu/~amit/detect.) For each image, we show both the detection of the Hough transform, using a representation with 40 local features of two edges each, and the result of an algorithm that employs one additional stage of filtering of the false positives, as described in Amit and Geman (1999). In this data set, there are on average 13 false positives per image at all six resolutions together. This amounts to .00005 false positives per pixel. The false-negative rate for this collection of images is 5%. Note that the lighting conditions of the faces are far more diverse than those in the Olivetti training set.

4 Neural Net Implementation
Our aim is to describe a network that can easily adjust to detect arrangements of local features representing different object classes. All features used in any object representation come from a predetermined pool ℱ = {a_1, ..., a_N}. The image locations of each feature in ℱ are detected in arrays F_i, i = 1, ..., N (see section 4.1). A location x in F_i is on if local feature a_i is present at location x in the image. Let ℱ × G denote the entire collection of pairs of features and reference-grid locations—namely, ℱ × G = {(a_i, y), i = 1, ..., N, y ∈ G}. Define a module M that has one unit corresponding to each such pair—N × |G| units. An object representation consists of a small collection of n_ob such pairs (a_{i_j}, y_j), j = 1, ..., n_ob. For simplicity, assume that n_ob is the same for all objects and that t is the same as well. Typically n_ob will be much smaller than N—on the order of several tens. An object representation is a simple binary pattern in M, with n_ob 1s and (N − n_ob) 0s.

For each local feature array F_i, introduce a system of arrays Q_{i,y}, one for each location y ∈ G. These Q arrays lay the ground for the detection of any object representation by performing step 3 of the Hough transform for each of the B_y regions. Thus a unit at location x ∈ Q_{i,y} receives input from the region x + B_y in F_i. Note that for each unit u = (a_i, y) in M, there is a corresponding Q_{i,y} array. All units in Q_{i,y} receive input from u. This is where the top-down flow of information is achieved. A unit x in Q_{i,y} is on only if both (a_i, y) is on and some unit in the region x + B_y is on in F_i. In other words, the representation
evoked in M primes the appropriate Q_{i,y} arrays. The system of Q arrays sums into an array S. A unit at location x ∈ S receives input from all Q_{i,y} arrays at location x and is on if Σ_{i=1}^{N} Σ_{y∈G} Q_{i,y}(x) ≥ t. The array S therefore shows those locations for which condition (C) is satisfied. If a unique location needs to be chosen, say, for the next saccade, then picking the x with the highest sum seems a natural choice. In Figures 7, 8, and 9, the red dots show all locations satisfying condition (C), and the green dot shows the one with the most hits. This architecture, called NETGLOB, is described in Figure 3.

For each local feature a_i we need |G| = d² complex Q-type arrays, so that the total number of Q-type arrays is N_Q = Nd². For example, if we use arrangements with two edges, and assuming 4 edge types and 16 possible regions b_v (see section 3.3), the total number of features is N = 4 × 4 × 16 = 256. Taking a 30 × 30 reference grid, N_Q is on the order of 10^5. It is, of course, possible to give up some degree of accuracy in the locations of the y_j's, assuming, for example, that they lie on some subgrid of the reference grid. Recall that the S array is only supposed to provide candidate locations, which must then undergo further processing after foveation, including small corrections for location. Therefore, assuming, for example, that the coordinates of y_j are multiples of three would lead to harmless errors. But then the number of possible locations in the reference grid reduces to 100, and N_Q ≈ 10^4. Using such a coarse subgrid of G also allows the F and Q arrays to have lower resolution (by the same factor of three), thus reducing the space required by a factor of nine. Nonetheless, N_Q is still very large, due to the assumption that all possible local features are detected bottom-up in hard-wired arrays F_i. This is quite wasteful in space and computational resources.
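The primed summation just described—steps 3 and 4 of the Hough transform, restricted to the Q-type arrays activated by the memory module—can be sketched as follows; square B_y regions of fixed half-width are a simplifying assumption (in the article the size of B_y grows with the distance of y from the origin):

```python
import numpy as np

def netglob_sum(memory_pattern, feature_maps, B_half, t):
    """
    Top-down primed summation of NETGLOB (section 4), a sketch.

    memory_pattern: set of (feature_id, (dr, dc)) pairs -- the units
        (a_i, y) active in the module M.
    feature_maps: dict feature_id -> boolean detection array F_i.
    Returns (S, candidates): the summation array S and the locations
    where S >= t, i.e., where condition (C) holds.
    """
    shape = next(iter(feature_maps.values())).shape
    H, W = shape
    S = np.zeros(shape, dtype=int)
    for fid, (dr, dc) in memory_pattern:   # only primed Q_{i,y} arrays contribute
        F = feature_maps[fid]
        Q = np.zeros(shape, dtype=bool)
        for r in range(H):
            for c in range(W):
                # unit x of Q_{i,y} is on iff some unit of x + B_y is on in F_i
                r0, r1 = max(0, r + dr - B_half), min(H, r + dr + B_half + 1)
                c0, c1 = max(0, c + dc - B_half), min(W, c + dc + B_half + 1)
                Q[r, c] = F[r0:r1, c0:c1].any()
        S += Q
    return S, np.argwhere(S >= t)
```

Note that only the pairs present in the memory pattern are ever computed, which is the point of the top-down priming: the same fixed connectivity serves every object representation.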
In the next section we introduce adaptable local feature detectors, which greatly reduce the number of Q-type layers needed in the system.

4.1 Detecting the Local Features. The local features are local star-type arrangements of edges; they are detected in the F_i layers with input from edge detector arrays E_e, e = 1, ..., 4. For each edge type e, define a system of “complex” arrays C_{e,v}, v ∈ G_loc. A unit x ∈ C_{e,v} receives input from the region x + b_v in E_e. It is on if any unit in x + b_v is on. A local feature a_i with pattern (e_ℓ, v_ℓ), ℓ = 1, ..., n_loc, as defined in section 3.3, is detected in the array F_i. Each x ∈ F_i receives input from C_{e_ℓ,v_ℓ}(x), ℓ = 1, ..., n_loc, and is on if all n_loc units are on. The full system of E, C, and F layers, called NETLOC, is shown in Figure 4. Note that in contrast to the global detection scheme, where a top-down model determined which arrays contributed to the summation array S, here everything is hard-wired and calculated bottom-up. Each unit in an F_i array sums the appropriate inputs from the edge detection arrays.

Alongside the hard-wired local feature detectors, we introduce adaptable local feature detectors. Here local feature detection is driven by top-down
Figure 3: NETGLOB. The object is defined in terms of four features at four locations. Each feature-location pair (a_i, y) provides input to all units in the corresponding Q_{i,y} array (thick lines). The locations of the feature detections are shown as dots. They provide input to regions in the Q arrays, shown as double-thick lines. At least three local features need to be found for a candidate detection. The fourth feature does not contribute to the detection; it is not present in the correct location. There are N systems of F, Q arrays—one for each a_i ∈ ℱ.
Figure 4: NETLOC. Four E layers detect edge locations. For each E_e layer, there is a C_{e,v} layer corresponding to each v ∈ G_loc—all in all, 4 × |G_loc| C layers. There is an F layer for each local feature; each location in the F layer receives input from the same location in C_{e_ℓ,v_ℓ}, ℓ = 1, ..., n_loc.
information. Only features required by the global model are detected. If only adaptable feature detectors are used in the full system, the number of Q arrays required is greatly reduced and is equal to the size of the reference grid. For each location y in the reference grid, there is a module M_{loc,y}, a detector array F_{loc,y}, one “complex” array Q_y, and a dedicated collection of C_{e,v} arrays. Without a local feature representation being evoked in M_{loc,y}, nothing happens in this system. When a local feature representation is turned on in M_{loc,y}, only the corresponding C_{e,v} arrays are primed. Together with the input from the edge arrays, the local feature locations are detected in a summation array F_{loc,y}. Finally, a detection at x in F_{loc,y} is spread to the region x − B_y in Q_y. (See Figure 5.)
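A sketch of this adaptable pathway: the evoked pattern primes only the needed C_{e,v} disjunctions, the F_{loc,y} array conjoins them, and the detections are spread over the B region toward the global summation. Square b_v and B_y regions are simplifying assumptions, and the displacement by −y in the spread is omitted for brevity:

```python
import numpy as np

def dilate(mask, half):
    """OR (disjunction) of a boolean array over a square neighborhood."""
    H, W = mask.shape
    out = np.zeros_like(mask)
    for r in range(H):
        for c in range(W):
            out[r, c] = mask[max(0, r - half):min(H, r + half + 1),
                             max(0, c - half):min(W, c + half + 1)].any()
    return out

def netadap(edge_arrays, evoked_pairs, b_half=1, B_half=3):
    """
    Adaptable local-feature detection (NETADAP, Figure 5), a sketch.

    edge_arrays: dict edge_type -> boolean array E_e.
    evoked_pairs: the (e_l, v_l) pattern turned on in M_loc,y, given as a
        list of (edge_type, (dr, dc)).
    Returns (F_loc, Q): the detections of the evoked local feature and
    their spread over the B region feeding the global summation layer.
    """
    shape = next(iter(edge_arrays.values())).shape
    H, W = shape
    F_loc = np.ones(shape, dtype=bool)
    for e, (dr, dc) in evoked_pairs:           # only primed C_{e,v} arrays respond
        C = dilate(edge_arrays[e], b_half)      # complex layer: OR over b_v
        shifted = np.zeros(shape, dtype=bool)   # displace by v so a plain AND works
        src = C[max(0, dr):H + min(0, dr), max(0, dc):W + min(0, dc)]
        shifted[max(0, -dr):H + min(0, -dr), max(0, -dc):W + min(0, -dc)] = src
        F_loc &= shifted                        # conjunction in F_loc,y
    Q = dilate(F_loc, B_half)                   # spread detections over B_y
    return F_loc, Q
```

The same arrays serve any local feature: changing the evoked pattern changes which C layers participate, without creating or modifying connections.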
Figure 5: NETADAP. The representation of the local feature is turned on in the local feature memory module. This determines which C layers receive input from M_{loc,y}. Units in these C layers that also receive input from the edge layer become activated. The F_{loc,y} layer sums the activities in the C layers and indicates the locations of the local features. F_{loc,y} then feeds into Q_y: x ∈ Q_y is on if any point in x + B_y is on in F_{loc,y}.
A global system with only adaptable feature detectors is shown in Figure 6. In this system the number of Q-type arrays is |G|. Top-down information does not go directly to these Q layers. When an object representation is turned on in M, each unit (a_i, y) that is on activates the representation for a_i in M_{loc,y}, which primes the associated C_{e,v} arrays. The detections of these features are obtained in the F_{loc,y} arrays and appropriately spread to the Q_y arrays. All the arrays Q_y are summed and thresholded in S to find the candidate locations. The adaptable local feature detectors F_{loc,y} can operate alongside the hard-wired F arrays. The arrays Q_y, together with the original system of
Figure 6: NETGLOB (adaptable). The object is defined in terms of four features at four locations. Each location y provides input to an M_{loc,y}. If a pair (a, y) is on in M_glob, the representation of a is on in M_{loc,y}. This is represented by the dots in the corresponding modules. The detections of feature a are then obtained in F_{loc,y}, using a system as in Figure 5, and spread to Q_y. At least three local features need to be found for a candidate detection. The fourth feature does not contribute to the detection; it is not present in the correct location. There was no feature on at location y, so the system of M_{loc,y} is not active.
Figure 7: Each of the top two panels shows 10 of the 20 local features from the face representation, trained from 300 faces of the Olivetti database. The blue-green combinations indicate which edge is involved in the edge arrangement. Blue stands for darker. The three pink dots are the locations of the two eyes and mouth on the reference grid. Each red dot in an image represents a candidate location for the center of the face. The green dot shows the location with the most counts. Detections are shown from all six resolutions together. The top image was scanned from Nature, vol. 278, October 1997.
Figure 8: First panel: 12 samples from the training set. Second panel: 10 images of the prototype 8 with two different local features in each. Detections are shown in four randomly generated displays; red dots indicate detected locations, and the green dot indicates the location with the highest count.
Q_{i,y} arrays, sum into S_glob. If a local feature (a_i, y) is on in M, either there is an associated, hard-wired detector layer F_i and Q_{i,y} is primed, or it feeds into M_{loc,y}, and the required detections are obtained in F_{loc,y} and spread to the associated Q_y. It appears more reasonable to have a small number of recurring local features that are hard-wired in the system alongside adaptable local feature detectors.

Figure 9: Same information for the clip. The clip was trained on 32 randomly perturbed images of one original image, 12 of which are shown. Detection is only for this range of views, not for all possible 3D views of the clip.

5 Biological Analogies

5.1 Labeling the Layers. Analogies can be drawn between the layers defined in the architecture of NETGLOB and existing layers of the visual system. The simple edge-detector layers E_e, e = 1, ..., 4, correspond to simple orientation-selective cells in V1. These layers represent a schematic abstraction of the information provided by the biological cells. However, the
empirical success of the algorithm on real images indicates that not much more information is needed for the task of selection. The Ce,y layers correspond to complex orientation-selective cells. One could imagine that all cells corresponding to a fixed edge type e and fixed image location x are arranged in a column. The only difference as one proceeds down the column is the region over which disjunction (ORing) is performed (e.g., the receptive field). In other words, the units in a column are indexed by v ∈ Gloc, which determines the displaced region bv. Indeed, any report on vertical electrode probings in V1, in which orientation selectivity is constant, will show a variation in the displacement of the receptive fields (see Zeki, 1993). The detection arrays Fi correspond to cells in V2 that respond to more complex structures, including what are termed illusory contours (see von der Heydt, 1995). If an illusory contour is viewed as a loose local arrangement of edge elements, that is precisely what the Fi layers are detecting. The Qi,y layers correspond to V4 cells. These have much larger receptive fields, and the variability of the location of these fields as one proceeds down a column is much more pronounced than in V1 or V2; it is on the order of the size of the reference grid.

5.2 Bottom-Up–Top-Down Interactions. Top-down and bottom-up processing are explicitly modeled. Bottom-up processing is constantly occurring in the simple cell arrays Ee, which feed into the complex cell arrays Ce,y, which in turn feed into the Fi (V2-type) arrays. One could even imagine the Q (V4-type) arrays being activated by the F arrays regardless of the input from M. Using the terminology introduced in Ullman (1996), only those units that are primed by receiving input from M contribute to the summation into S. The object pattern that is excited in the main memory module determines which of the Q arrays will have enhanced activity toward their summation into S.
Thus the final determination of the candidate locations is given by an interaction of the bottom-up processing and the top-down processing manifested in the array S. With the adaptable local feature detectors, the top-down influence is even greater. Nothing occurs in an Floc,y detector unless a local feature representation is turned on in Mloc,y due to the activation in M of an object representation that contains that feature at location y.

5.3 Gating and Invariant Detection. The summation array S can serve as a gating mechanism for visual attention. Introduce feedback connections from each unit x ∈ S to the unit at the same location in all Q layers. The only way a unit in a Qi,y layer can remain active for an extended period is if it received input from some (ai, y) ∈ M and from the S layer. This could provide a model for the behavior of IT neurons in delay match-to-sample task experiments (see Chelazzi, Miller, Duncan, & Desimone, 1993, and Desimone et al., 1995). In these experiments neurons in IT se-
lectively responsive to two different objects are found. The subject is then presented with the object to be detected: the sample. After a delay period, an image with both objects is displayed, both displaced from the center. The subject is supposed to saccade to the sample object. After presentation of the test image, neurons responsive to both objects are active. After about 100 milliseconds, and a few tens of milliseconds prior to the saccade, the activity of the neurons selective for the nonsample object decays. Only neurons selective for the sample object remain active. If all units in the Qi,y layer feed into (ai, y) in M, then (ai, y) receives bottom-up input whenever ai is present anywhere in the image, no matter what y is. Thus, the units in M do not discriminate between the locations from which the inputs are coming, and at the presentation of a test image, one may observe activity in M due to bottom-up processing in the Q layers, which responds to other objects in the scene. This is consistent with the scale and location invariance observed in IT neuron responses (Ito et al., 1995). When a location is detected in S, gating occurs as described above, and the only input into M from the Q arrays is from the detected location at the layers corresponding to the model representation. For example, assume that feature ai is detected at locations x1, x2, x3 in the image (i.e., units x1, x2, x3 are on in Fi). Then all the units in x1 − By, x2 − By, x3 − By will be on in Qi,y for all y ∈ G. This corresponds to the bottom-up flow of information. In the object model, assume ai is paired with location ŷ. Assume that only x1 comes from the correct location on an instance of the object, that is, the center of the object lies around x1 − Bŷ. Then if enough other features are detected on this instance of the object, some point x* ∈ x1 − Bŷ will be on in S, and subsequently the area around x* in the Qi,ŷ layer will remain active.
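The gating rule just described reduces, for a single unit, to a three-way conjunction. The sketch below is purely illustrative: the function name and the reduction of each layer's contribution to one boolean per unit are simplifications of mine, not part of the architecture's specification.

```python
# Illustrative sketch of the gating rule: a unit in a Q layer sustains
# its activity only when it is driven bottom-up from its feature layer F,
# primed top-down by the object pattern in M, and receives feedback
# from the detected location in the summation layer S.
def gated_q_activity(bottom_up_F: bool, primed_by_M: bool, feedback_S: bool) -> bool:
    return bottom_up_F and primed_by_M and feedback_S
```

A unit that is driven bottom-up but lies at a non-detected location (no feedback from S) therefore shuts off, which is the invariance-restoring step of the model.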
After detection occurs in S, followed by the gating process, the only input coming into M from the Q arrays corresponding to feature ai (Qi,y, y ∈ G) is from Qi,ŷ(x*). The same is true for the other features in the object representation that have been detected at location x*. This means that after detection, the M module is either receiving no input (if no candidate locations are found) or some subset of the units in the object representation is receiving reinforced input. Only the object representation is reinforced in M, signifying that detection has occurred invariant to translation, scale, and other deformations. To summarize, the units in M exhibit selective activity for the sample object only due to the input from a lower-level process in which location has already been detected and gating has occurred.

5.4 Hard-Wired Versus Adaptable Features. The number of hard-wired local feature arrays Fi and Qi,y we would expect to find in the system is limited. It is reasonable to assume that a complementary system of adaptable features exists as described in section 4.1. These features are learned for more specific tasks and may then be discarded if not needed; for example,
finer angle tuning of the edges may be required, as in the experiments in Ahissar and Hochstein (1997). People show dramatic improvement in detection tasks over hundreds of trials in which they are expected to repeat the task. It may well be that the hard-wired local features are too coarse for these tasks. The repetition is needed to learn the appropriate local features, which are then detected through the adaptable local feature system.

5.5 Learning. Consider the reference grid as the central field of view (fovea), and let each unit (ai, y) ∈ M be directly connected to Fi(y) for the locations y ∈ G. The unit (ai, y) ∈ M is on if Fi(y) is on. In other words, the area of the reference grid in the Fi arrays feeds directly into the M module. Learning is then done through images of the object class presented at the reference grid. Each such image generates a binary pattern in the module M. In the current implementation (see section 3.4), a search is carried out at each location y ∈ G to identify stable features for each location. These are then externally stored. However, it is also possible to consider M as a connected neural net that employs Hebbian learning (see, for example, Brunel, Carusi, & Fusi, 1998, and Amit & Fusi, 1994) to store the binary patterns of high-frequency units for an object class internally. These patterns of high-frequency units would be the attractors generated by presentations of random samples from the object class. The attractor pattern produced in M then activates the appropriate Q layers for selection, as described above.

6 Discussion
The architecture presented here provides a model for visual selection when we know what we are looking for. What if the system is not guided toward a specific object class? Clearly high-level contextual information can restrict the number of relevant object classes to be entertained (see Ullman, 1996). However, even in the absence of such information, visual selection occurs. One possibility is that there is a small collection of generic representations, either learned or hard-wired, that are useful for “generic” object and background discrimination and have dedicated systems for detecting them. For example, we noticed, not surprisingly, that the object representation produced for the “clip” (see Figure 9) often detects generic rectangular, elongated dark areas. Many psychophysical experiments on detection and visual selection involve measuring time as a function of the number of distractors. In Elder and Zucker (1993) there is evidence that detection of shapes with “closure” properties does not slow linearly with the number of distractors, whereas shapes that are “open” do exhibit slowing. Also in Treisman and Sharon (1990), some detection problems exhibit slowing; others do not. This is directly related to the issue of false positives. The architecture presented above is parallel and is not affected by the number of distractors. However, the number of
distractors can influence the number of false positives. The chance that the first saccade moves to the correct location decreases, and slowing down occurs. How can the effect of the distractors be avoided? Again, this could be achieved using generic object representations with hard-wired detection mechanisms, that is, a summation layer S dedicated to a specific representation, and hence responding in a totally bottom-up fashion. This discussion also relates to the question of how the representation for detection interacts with representations for more detailed recognition and classification, which occur after foveation. How many false positives can detection tolerate, leaving final determination to the stage after foveation? There are numerous feedback connections in the visual system. In the current architecture, there is no feedback to the local feature detectors and edge detectors beyond the basic top-down information provided by the model. Also, the issue of larger rotations has not been addressed. The local features and edges are not invariant to large rotations. How to deal with this issue remains an open question. Finally, in the context of elementary local cues, we have not made use of anything other than spatial discontinuities: edges. However, the same architecture can incorporate local motion cues, color cues, depth cues, texture, and so forth. Local features involving conjunctions of all of these local cues can be used.

Acknowledgments
I am indebted to Daniel Amit and Donald Geman for numerous conversations and suggestions. The experiments in this article would not have been possible without the coding assistance of Kenneth Wilder. This work was supported in part by the Army Research Office under grant DAAH04-96-1-0061 and MURI grant DAAHO4-96-1-0445.

References

Ahissar, M., & Hochstein, S. (1997). Task difficulty and visual hierarchy: Reverse hierarchies in sensory processing and perceptual learning. Nature, 387, 401–406.
Amit, Y. (1998). Deformable template methods for object detection (Tech. Rep.). Chicago: Department of Statistics, University of Chicago.
Amit, D. J., & Fusi, S. (1994). Dynamic learning in neural networks with material synapses. Neural Computation, 6, 957.
Amit, Y., & Geman, D. (1999). A computational model for visual selection. Neural Computation, 11, 1691–1715.
Ballard, D. H. (1981). Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition, 13, 111–122.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network, 9, 123–152.
Burl, M. C., Leung, T. K., & Perona, P. (1995). Face localization via shape statistics. In M. Bichsel (Ed.), Proc. Intl. Workshop on Automatic Face and Gesture Recognition (pp. 154–159).
Chelazzi, L., Miller, E. K., Duncan, J., & Desimone, R. (1993). A neural basis for visual search in inferior temporal cortex. Nature, 363, 345–347.
Cootes, T. F., & Taylor, C. J. (1996). Locating faces using statistical feature detectors. In Proc., Second Intl. Workshop on Automatic Face and Gesture Recognition (pp. 204–210).
Desimone, R., Miller, E. K., Chelazzi, L., & Lueschow, A. (1995). Multiple memory systems in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 475–486). Cambridge, MA: MIT Press.
Elder, J., & Zucker, S. W. (1993). The effect of contour closure on the rapid discrimination of two-dimensional shapes. Vision Research, 33, 981–991.
Fukushima, K. (1986). A neural network model for selective attention in visual pattern recognition. Biol. Cyber., 55, 5–15.
Fukushima, K., & Wake, N. (1991). Handwritten alphanumeric character recognition by the neocognitron. IEEE Trans. Neural Networks, 2, 355–365.
Grimson, W. E. L. (1990). Object recognition by computer: The role of geometric constraints. Cambridge, MA: MIT Press.
Ito, M., Tamura, H., Fujita, I., & Tanaka, K. (1995). Size and position invariance of neuronal response in monkey inferotemporal cortex. Journal of Neurophysiology, 73(1), 218–226.
Lueschow, A., Miller, E. K., & Desimone, R. (1994). Inferior temporal mechanisms for invariant object recognition. Cerebral Cortex, 5, 523–531.
Rojer, A. S., & Schwartz, E. L. (1992). A quotient space Hough transform for space-variant visual attention. In G. A. Carpenter & S. Grossberg (Eds.), Neural networks for vision and image processing. Cambridge, MA: MIT Press.
Rowley, H. A., Baluja, S., & Kanade, T. (1998). Neural network-based face detection. IEEE Trans. PAMI, 20, 23–38.
Sung, K. K., & Poggio, T. (1998). Example-based learning for view-based face detection. IEEE Trans. PAMI, 20, 39–51.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Treisman, A., & Sharon, S. (1990). Conjunction search revisited. Journal of Experimental Psychology: Human Perception and Performance, 16, 459–478.
Ullman, S. (1996). High-level vision. Cambridge, MA: MIT Press.
Van Rullen, R., Gautrais, J., Delorme, A., & Thorpe, S. (1998). Face processing using one spike per neuron. Biosystems, 48, 229–239.
von der Heydt, R. (1995). Form analysis in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 365–382). Cambridge, MA: MIT Press.
Wiskott, L., Fellous, J.-M., Krüger, N., & von der Malsburg, C. (1997). Face recognition by elastic bunch graph matching. IEEE Trans. on Patt. Anal. and Mach. Intel., 7, 775–779.
Zeki, S. (1993). A vision of the brain. Oxford: Blackwell Scientific Publications.

Received November 19, 1998; accepted July 12, 1999.
LETTER
Communicated by B. E. Stein
Using Bayes’ Rule to Model Multisensory Enhancement in the Superior Colliculus

Thomas J. Anastasio
Beckman Institute and Department of Molecular and Integrative Physiology, University of Illinois at Urbana/Champaign, Urbana, IL 61801, U.S.A.

Paul E. Patton
Kamel Belkacem-Boussaid
Beckman Institute, University of Illinois at Urbana/Champaign, Urbana, IL 61801, U.S.A.

The deep layers of the superior colliculus (SC) integrate multisensory inputs and initiate an orienting response toward the source of stimulation (target). Multisensory enhancement, which occurs in the deep SC, is the augmentation of a neural response to sensory input of one modality by input of another modality. Multisensory enhancement appears to underlie the behavioral observation that an animal is more likely to orient toward weak stimuli if a stimulus of one modality is paired with a stimulus of another modality. Yet not all deep SC neurons are multisensory. Those that are exhibit the property of inverse effectiveness: combinations of weaker unimodal responses produce larger amounts of enhancement. We show that these neurophysiological findings support the hypothesis that deep SC neurons use their sensory inputs to compute the probability that a target is present. We model multimodal sensory inputs to the deep SC as random variables and cast the computation of target probability in terms of Bayes’ rule. Our analysis suggests that multisensory deep SC neurons are those that combine unimodal inputs that would be more uncertain by themselves. It also suggests that inverse effectiveness results because the increase in target probability due to the integration of multisensory inputs is larger when the unimodal responses are weaker.
1 Introduction
The power of the brain as an information processor is due in part to its ability to integrate inputs from multiple sensory systems. Nowhere has neurophysiological research provided a clearer picture of multisensory integration than in the deep layers of the superior colliculus (SC) (see Stein & Meredith, 1993, for a lucid review). Enhancement is a form of multisensory integration in which the neural response to a stimulus of one modality is augmented by a stimulus of another modality. We propose a probabilistic

Neural Computation 12, 1165–1187 (2000) © 2000 Massachusetts Institute of Technology
1166
T. J. Anastasio, P. E. Patton, and K. Belkacem-Boussaid
theory of the functional significance of multisensory enhancement in the deep SC that can account for its most important features. The SC is a layered structure located in the mammalian midbrain (Wurtz & Goldberg, 1989). The deep layers receive inputs from various brain regions composing the visual, auditory, and somatosensory systems (Edwards, Ginsburgh, Henkel, & Stein, 1979; Cadusseau & Roger, 1985; Sparks & Hartwich-Young, 1989). The deep SC integrates this multisensory input and then initiates an orienting response, such as a saccadic eye movement, toward the source of the stimulation. Neurons in the SC are organized topographically according to the location in space of their receptive fields (auditory, Middlebrooks & Knudsen, 1984; visual, Meredith & Stein, 1990; somatosensory, Meredith, Clemo, & Stein, 1991). Individual deep SC neurons can receive inputs from multiple sensory systems (Meredith & Stein, 1983, 1986b; Wallace & Stein, 1996). There is considerable overlap between the receptive fields of individual multisensory neurons for inputs of different modalities (Meredith & Stein, 1996). The responses of multisensory deep SC neurons are dependent on the spatial and temporal relationships of the multisensory stimuli. Stimuli that occur at the same time and place can produce response enhancement (King & Palmer, 1985; Meredith & Stein, 1986a, 1986b; Meredith, Nemitz, & Stein, 1987). Multisensory enhancement is the augmentation of the response of a deep SC neuron to a sensory input of one modality by an input of another modality. Percent enhancement (% enh) is quantified as (Meredith & Stein, 1986a):

    % enh = ((CM − SMmax) / SMmax) × 100,    (1.1)
where CM and SMmax are the numbers of neural impulses evoked by the combined-modality and best single-modality stimuli, respectively. Multisensory enhancement is dependent on the size of the responses to unimodal stimuli. Smaller unimodal responses are associated with larger percentages of multisensory enhancement (Meredith & Stein, 1986b). This property is called inverse effectiveness (Wallace & Stein, 1994). Percentage enhancement can range upwards of 1000% (Meredith & Stein, 1986b). Yet not all deep SC neurons are multisensory. In cat and monkey, respectively, 46% and 73% of deep SC neurons studied receive sensory input of only one modality (Wallace & Stein, 1996). It is not clear why some neurons in the deep SC exhibit such extreme levels of multisensory enhancement, while others receive only unimodal input. Since the deep SC initiates orienting movements, it is not surprising to find that the orienting responses observed in behaving animals also show multisensory enhancement (Stein, Huneycutt, & Meredith, 1988; Stein, Meredith, Huneycutt, & McDade, 1989). For example, the chances that an orienting response will be made to a stimulus of one modality can be increased by a stimulus of another modality presented simultaneously in the same place, especially when the available stimuli are weak. The rules of multisensory enhancement make intuitive sense in the behavioral context. Strong stimuli that effectively command orienting responses by themselves require no multisensory enhancement. In contrast, the chances that a weak stimulus indicates the presence of an actual source should be greatly increased by another stimulus (even a weak one) of a different modality that is coincident with it. We will develop this intuition into a formal model using elementary probability theory. Our model, based on Bayes’ rule, provides a possible explanation for why some deep SC neurons are multisensory while others are not. It also offers an explanation for the property of inverse effectiveness observed for multisensory deep SC neurons.

2 Bayes’ Rule
We hypothesize that each deep SC neuron computes the probability that a stimulus source (target) is present in its receptive field, given the sensory inputs it receives. The sensory inputs are neural and are modeled as stochastic. Stochasticity implies that the input carried by sensory neurons indicating the presence of a target is to some extent uncertain. Thus, we postulate that each deep SC neuron computes the conditional probability that a target is present given its uncertain sensory inputs. We will model the computation of this conditional probability using Bayes’ rule. The model is schematized in Figure 1, in which the deep SC receives input from the visual and auditory systems. We represent these two sensory systems because the bimodal, visual-auditory combination is the most common one observed in cat and monkey (Wallace & Stein, 1996). The results generalize to any other combination. Each block on the grid in Figure 1 represents a single deep SC neuron. Our model will focus on individual deep SC neurons and on the visual and auditory inputs they receive from within their receptive fields. We begin with the simplest case: a deep SC neuron receives an input of only one sensory modality. We denote the target as random variable T and a visual sensory input as random variable V. Then we hypothesize that a deep SC neuron computes the conditional probability P(T | V). This conditional probability can be computed using Bayes’ rule (Hellstrom, 1984; Applebaum, 1996):

    P(T | V) = P(V | T) P(T) / P(V).    (2.1)
Bayes’ rule is a fundamental concept in probability that has found wide application (Duda & Hart, 1973; Bishop, 1995). It underlies all modern systems for probabilistic inference in artificial intelligence (Russell & Norvig, 1995).
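Equation 2.1, together with the total-probability expansion of P(V) used throughout this letter, can be sketched in a few lines; the numeric likelihood values below are purely illustrative, not taken from the model:

```python
def bayes_posterior(prior_t1, lik_v_given_t1, lik_v_given_t0):
    """P(T=1 | V=v) by Bayes' rule; the evidence P(V=v) comes from the
    principle of total probability over the binary target T in {0, 1}."""
    p_v = lik_v_given_t1 * prior_t1 + lik_v_given_t0 * (1.0 - prior_t1)
    return lik_v_given_t1 * prior_t1 / p_v

# Illustrative numbers: prior P(T=1) = 0.1; the observed input is four
# times as likely with a target present (0.8) than absent (0.2).
posterior = bayes_posterior(0.1, 0.8, 0.2)  # 0.08 / 0.26, about 0.31
```

An input that is more likely under the target-present hypothesis thus raises the posterior above the 0.1 prior, which is the pattern developed quantitatively in the next section.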
Figure 1: Schematic diagram illustrating multisensory integration in the deep layers of the superior colliculus (SC). Neurons in the SC are organized topographically according to the location in space of their receptive fields. Individual deep SC neurons, represented as blocks on a grid, can receive sensory input of more than one modality. The receptive fields of individual multisensory neurons for inputs of different modalities overlap. V and A are random variables representing input from the visual and auditory systems, respectively. T is a binary random variable representing the target. P(T = 1 | V, A) is the Bayesian probability that a target is present given sensory inputs V and A. The bird image is from Stein and Meredith (1993).
We suggest that a deep SC neuron can be regarded as implementing Bayes’ rule in computing the probability of a target given its sensory input. According to equation 2.1, we suppose that the nervous system can compute P(T | V) given some input V because it already has available the other three probabilities: P(V | T), P(V), and P(T). The conditional probability of V given T [P(V | T)] is the probability of obtaining some visual sensory input V = v given a target T = t (upper- and lowercase italicized letters represent random variables and particular numerical values of those variables, respectively). In the context of Bayes’ rule, P(V | T) is referred to as the likelihood of V given T. The unconditional probability of V [P(V)] is the probability of obtaining some visual sensory input V = v regardless of whether a target is present. Thus, P(V) and P(V | T) are properties of the visual system and the environment, and it is reasonable to suppose that those properties could be represented by the brain.
The unconditional probability of T [P(T)] is the probability that T = t. P(T) is a property of the environment. In the context of Bayes’ rule, P(T) is referred to as the prior probability of T. What Bayes’ rule essentially does is modify the prior probability on the basis of input. In Bayes’ rule (equation 2.1), P(T | V) is computed by multiplying P(T) by the ratio P(V | T) / P(V). Because P(T | V) can be thought of as a modified version of P(T) based on input, P(T | V) can be referred to as the posterior probability of T. Thus, when computed using Bayes’ rule, P(T | V) is referred to as either the Bayesian or the posterior probability of T. It is possible that the prior probability of T [P(T)], and its representation by the brain, could change as circumstances dictate. With Bayes’ rule, we suggest a model of how the brain could use knowledge of the statistical properties of its sensory systems to modify this prior on the basis of sensory input.

3 Bayes’ Rule with One Sensory Input
Equation 2.1 can be used to compute the posterior (Bayesian) probability of a target given sensory input of only one modality. It can be used to simulate the responses of unimodal deep SC neurons. To evaluate equation 2.1, it is necessary to define the probability distributions for the terms on the right-hand side. We define T as a binary random variable where T = 1 if the target is present and T = 0 if it is absent. We define the prior probability distribution of T [P(T)] by arbitrarily assigning the probability of 0.1 to P(T = 1) and 0.9 to P(T = 0). These values are not meant to reflect precisely some experimental situation, but are chosen because they offer advantages in illustrating the principles. Bayes’ rule is valid for any prior distribution. We define V as a discrete random variable that represents the number of neural impulses arriving as input to a deep SC neuron from a visual neuron in a unit time interval. The unit interval for the data we model is 250 msec (Meredith & Stein, 1986b). We define the likelihoods of V given T = 1 [P(V | T = 1)] or T = 0 [P(V | T = 0)] as Poisson densities with different means. The Poisson density provides a reasonable first approximation to the discrete distribution of the number of impulses per unit time observed in the spontaneous activity and driven responses of single neurons, and offers certain modeling advantages. The Poisson density function for discrete random variable V with mean λ is defined as (Hellstrom, 1984; Applebaum, 1996):

    P(V = v) = λ^v e^(−λ) / v!.    (3.1)
Example likelihood distributions of V are illustrated in Figure 2A. The likelihood of V given T = 0 [P(V | T = 0)] is the probability distribution of V when a target is absent. Thus, P(V | T = 0) is the likelihood of V under
Figure 2: Likelihood distributions and Bayesian probability in the unimodal case. The target is denoted by the binary random variable T. The prior probabilities of the target [P(T)] are 0.9 that it is absent [P(T = 0) = 0.9] and 0.1 that it is present [P(T = 1) = 0.1]. The discrete random variable V represents the number of impulses arriving at the deep SC from a visual input neuron in a unit time interval (250 msec). Poisson densities, shown in (A), model the likelihood distributions of V [P(V | T)]. A mean-5 Poisson density models the spontaneous likelihood of V [with target absent, P(V | T = 0), dashed line]. A mean-8 Poisson density models the driven likelihood of V [with target present, P(V | T = 1), line with circles]. The unconditional probability of V [P(V), line without symbols in (A)] is computed from the input likelihoods and target priors (using equation 3.2). The posterior (Bayesian) probability that the target is present given input V [P(T = 1 | V)] is computed using equation 3.3, and the result is shown in (B) (line with circles). The target-present prior [P(T = 1) = 0.1] is also shown for comparison in (B) (dashed line).
spontaneous conditions. It is modeled as a Poisson density (equation 3.1) with λ = 5 (dashed line). The mean of 5 impulses per 250-msec interval corresponds to a mean spontaneous firing rate of 20 impulses per second. This spontaneous rate is meant to represent a reasonable generic value, but one that is high enough so that the behavior of the model under spontaneous
conditions can be examined easily. The model is valid for any spontaneous rate. The likelihood of V given T = 1 [P(V | T = 1)] is the probability distribution of V when a target is present. Thus, P(V | T = 1) is the likelihood of V under driven conditions. The driven likelihood takes into account random variability in the stimulus effectiveness of the target, as well as the intrinsic stochasticity of the sensory neurons. It is modeled as a Poisson density (equation 3.1) with λ = 8 (line with circles in Figure 2A). The mean of 8 impulses per 250-msec interval corresponds to a mean driven firing rate of 32 impulses per second. Notice that there is overlap between the spontaneous and driven likelihoods of V (see Figure 2A). Given the prior probability distribution of T [P(T)] and the likelihood distributions of V [P(V | T)], the unconditional probability of V [P(V)] can be computed using the principle of total probability (Hellstrom, 1984; Applebaum, 1996):

    P(V) = P(V | T = 1) P(T = 1) + P(V | T = 0) P(T = 0).    (3.2)
Equation 3.2 shows that the unconditional probability of V is the sum of the likelihoods of V weighted by the priors of T. The result is the compound distribution shown in Figure 2A (line without symbols). With P(T) and P(V | T) defined, and P(V) computed from them (equation 3.2), all of the variables needed to evaluate Bayes’ rule in the unimodal case (equation 2.1) have been specified. Our hypothesis is that a deep SC neuron computes the Bayesian (posterior) probability that a target is present in its receptive field given its sensory input. We therefore want to evaluate equation 2.1 for T = 1 with unimodal input V:

    P(T = 1 | V) = P(V | T = 1) P(T = 1) / P(V).    (3.3)
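With the definitions above (Poisson likelihoods with driven mean 8 and spontaneous mean 5, priors P(T = 1) = 0.1 and P(T = 0) = 0.9), equation 3.3 can be evaluated directly; a minimal sketch reproducing the posterior curve of Figure 2B:

```python
from math import exp, factorial

def poisson_pmf(v: int, lam: float) -> float:
    """Equation 3.1: Poisson density with mean lam at count v."""
    return lam ** v * exp(-lam) / factorial(v)

def unimodal_posterior(v: int, lam_driven=8.0, lam_spont=5.0, prior_t1=0.1) -> float:
    """Equation 3.3: P(T = 1 | V = v), with P(V) from equation 3.2."""
    lik1 = poisson_pmf(v, lam_driven)                 # P(V = v | T = 1)
    lik0 = poisson_pmf(v, lam_spont)                  # P(V = v | T = 0)
    p_v = lik1 * prior_t1 + lik0 * (1.0 - prior_t1)   # equation 3.2
    return lik1 * prior_t1 / p_v

# The sigmoidal curve of Figure 2B, over v = 0, ..., 25 impulses.
curve = [unimodal_posterior(v) for v in range(26)]
```

For these parameters the posterior first exceeds the 0.1 prior at v = 7 impulses per unit interval, the crossing point described in the text.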
We evaluate equation 3.3 by taking integer values of v in the range from 0 to 25 impulses per unit time. For each value of v, the values of P(V | T = 1) and P(V) are read off the curves previously defined (see Figure 2A). The unimodal posterior probability that T = 1 [P(T = 1 | V)] is plotted in Figure 2B (line with circles). The prior probability that T = 1 [P(T = 1) = 0.1] is also shown for comparison in Figure 2B (dashed line). For the Poisson density means chosen, the posterior exceeds the prior probability when v ≥ 7 impulses per unit time (see Figure 2B). For the same values of v (v ≥ 7), the driven likelihood of V [P(V | T = 1)] exceeds the unconditional probability of V [P(V)] (see Figure 2A). Because the target-present prior [P(T = 1) = 0.1] is much smaller than the target-absent prior of T [P(T = 0) = 0.9], the unconditional probability of V [P(V)] is dominated by the spontaneous likelihood of V [P(V | T = 0)]
1172
T. J. Anastasio, P. E. Patton, and K. Belkacem-Boussaid
(equation 3.2). This is shown in Figure 2A [P(V), line without symbols; P(V | T = 0), dashed line]. For the priors of T as defined, the ratio P(V | T = 1)/P(V) is roughly equal to the ratio P(V | T = 1)/P(V | T = 0), and computing the posterior probability of the target using Bayes' rule (equation 3.3) is roughly equivalent to multiplying the target-present prior by the ratio of the driven and spontaneous likelihoods of the input. Thus, as the input increases from 0 to 25 impulses per unit time, it changes from being most likely spontaneous to most likely driven (see Figure 2A). On the basis of this increasing unimodal input, Bayes' rule indicates that the target changes from being most probably absent to most probably present (see Figure 2B). We suggest that the sigmoidal curve (Figure 2B, line with circles) describing the Bayesian probability that a target is present given unimodal input V [P(T = 1 | V)] is proportional to the responses of individual, unimodal deep SC neurons.

4 Bayes' Rule with Two Sensory Inputs
We can now consider the bimodal case of visual and auditory inputs. We define A as a discrete random variable that represents the number of neural impulses arriving at the SC neuron from an auditory neuron in a unit time interval. The Bayesian probability in the bimodal case (given V and A) can be computed as in the unimodal case, except that in the bimodal case, the input is the conjunction of V and A. The posterior probability that T = 1 given V and A [P(T = 1 | V, A)] can be computed using Bayes' rule as follows:

P(T = 1 | V, A) = P(V, A | T = 1)P(T = 1) / P(V, A).    (4.1)
To evaluate Bayes' rule in the bimodal case (equation 4.1) we need to define the likelihoods of A given that a target is present or absent [P(A | T = 1) or P(A | T = 0)]. As for V, we define the likelihoods of A given T as Poisson densities with different means. To simplify the analysis, we assume that V and A are conditionally independent given T. On this assumption, the joint likelihoods of V and A given T [P(V, A | T)] can be computed as the products of the corresponding likelihoods of V given T [P(V | T)] and of A given T [P(A | T)]. For the case in which a target is present,

P(V, A | T = 1) = P(V | T = 1)P(A | T = 1).    (4.2)
Also on the assumption of conditional independence, the unconditional joint probability of V and A [P(V, A)] can be computed using the principle of total probability:

P(V, A) = P(V | T = 1)P(A | T = 1)P(T = 1) + P(V | T = 0)P(A | T = 0)P(T = 0).    (4.3)
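Under the conditional-independence assumption, equations 4.1 through 4.3 reduce to products of the per-modality Poisson likelihoods. A minimal sketch, using the Figure 2A means for both inputs (names are ours):

```python
from math import exp, factorial

def poisson(k, mean):
    # Poisson probability of k impulses per unit time (equation 3.1)
    return mean ** k * exp(-mean) / factorial(k)

def bimodal_posterior(v, a, spont=(5.0, 5.0), driven=(8.0, 8.0), prior=0.1):
    # P(T = 1 | V = v, A = a) via equation 4.1, with the joint likelihoods
    # factored into per-modality products (equations 4.2 and 4.3)
    joint_driven = poisson(v, driven[0]) * poisson(a, driven[1])  # P(v, a | T = 1)
    joint_spont = poisson(v, spont[0]) * poisson(a, spont[1])     # P(v, a | T = 0)
    p_va = joint_driven * prior + joint_spont * (1.0 - prior)     # equation 4.3
    return joint_driven * prior / p_va
```

With v = a, the two likelihood ratios multiply, so the bimodal posterior rises more steeply with input than the unimodal one, as in Figure 3A.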
Multisensory Enhancement in the Superior Colliculus
1173
We use the same priors of T as before. With the priors of T and the likelihoods of V and A defined, and the conditional and unconditional joint probabilities of (V, A) computed from them (equations 4.2 and 4.3), we can use Bayes' rule (equation 4.1) to simulate the bimodal responses of deep SC neurons. We evaluate equation 4.1 by taking integer values of v and a in the range from 0 to 25 impulses per unit time. To simplify the comparison between simulated unimodal and bimodal responses, we consider the hypothetical case in which v and a are equal. The results are qualitatively similar when v and a are unequal as long as they increase together. An example of a bimodal Bayesian probability that a target is present is shown in Figure 3A (line without symbols). In this example, the likelihoods of A are the same as those of V (in Figure 2A). The means of the Poisson densities modeling the spontaneous and driven likelihoods of both V and A are 5 and 8, respectively. The relationship between the Bayesian probability and the inputs in the bimodal case, in which V and A increase together, is qualitatively similar to that in the unimodal case, in which V increases alone. As in the unimodal case (Figure 2), Bayes' rule in the bimodal case indicates that the target changes from being most probably absent to most probably present as the input changes from being most likely spontaneous to most likely driven. In the bimodal case, however, the spontaneous and driven input distributions reflect the products of the individual likelihood distributions of V and A (equations 4.2 and 4.3). The roughly sigmoidal curves in Figure 3A for the bimodal [P(T = 1 | V, A)] (line without symbols) and the unimodal [P(T = 1 | V)] (line with circles) Bayesian probabilities are similar in that both rise from 0 to 1 as v and a (bimodal) or v only (unimodal) increase from 0 to 25 impulses per unit time. The difference is that the bimodal curve rises faster.
There is a range of values of v over which the probability of a target is higher in the bimodal case, when a is also available as an input, than in the unimodal case with v alone. This can be called the bimodal-unimodal difference (BUD) range. Within the BUD range, the Bayesian probability that a target is present will be higher in the bimodal than in the unimodal case at the same level of input. The size of the BUD range is related to the amount of overlap between the spontaneous and driven likelihood distributions of the inputs (e.g., Figure 2A). The amount of overlap decreases as the difference between the driven and spontaneous means increases. As an example, the mean of the Poisson density modeling the driven likelihoods of both V and A is changed from 8 to 20. The spontaneous mean for both V and A remains at 5. The unimodal (line with circles) and bimodal (line without symbols) Bayesian probability curves for this example are shown in Figure 3B. The BUD range is appreciably smaller in this example, with driven and spontaneous means of 20 and 5 (see Figure 3B) than in the previous example with means of 8 and 5 (see Figure 3A).
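The BUD at each input level is simply the bimodal posterior minus the unimodal posterior, and summing that signed difference over the input range gives the cumulative BUD plotted in Figure 4. A sketch, with v = a and names of our choosing:

```python
from math import exp, factorial

def poisson(k, mean):
    # Poisson probability of k impulses per unit time (equation 3.1)
    return mean ** k * exp(-mean) / factorial(k)

def posterior(lik_driven, lik_spont, prior=0.1):
    # Generic Bayes' rule; the likelihood arguments may be per-modality
    # densities (unimodal) or their products (bimodal, equations 4.2-4.3).
    num = lik_driven * prior
    return num / (num + lik_spont * (1.0 - prior))

def cumulative_bud(driven, spont=5.0, prior=0.1, max_input=25):
    # Sum of (bimodal - unimodal) posteriors over integer inputs 0..25, v = a.
    total = 0.0
    for v in range(max_input + 1):
        uni = posterior(poisson(v, driven), poisson(v, spont), prior)
        bi = posterior(poisson(v, driven) ** 2, poisson(v, spont) ** 2, prior)
        total += bi - uni
    return total
```

Consistent with Figure 4, for a spontaneous mean of 5 the cumulative BUD is several times larger for a driven mean of 8 than for a driven mean of 20, and it grows as the target-present prior shrinks.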
Figure 3: Bayesian target-present probabilities computed in unimodal and bimodal cases. The prior probabilities of the target are set as before [P(T = 1) = 0.1 and P(T = 0) = 0.9] and the likelihoods of the visual (V) and auditory (A) inputs are modeled using Poisson densities as in Figure 2A. Unimodal Bayesian probabilities (see equation 3.3) are computed as V takes integer values v in the range from 0 to 25 impulses per unit time. Bimodal Bayesian probabilities (equation 4.1) are computed as V and A vary together over the same range with v = a. (A) Spontaneous and driven likelihood means are 5 and 8, respectively, for the unimodal case of V only, and for both V and A in the bimodal case. (B) Spontaneous and driven likelihood means are 5 and 20, respectively. The bimodal exceeds the unimodal Bayesian probability over a much narrower range in (B) than in (A). For (A) and (B), lines with circles and lines without symbols denote unimodal and bimodal Bayesian probabilities, respectively.
Comparison of the curves in Figures 3A and 3B illustrates the general finding that the BUD range decreases as the difference between the driven and spontaneous input means increases. The BUD range is also dependent on the prior probabilities of the target. The influences of changing the spontaneous and driven input means and target priors on the BUD range are illustrated in Figure 4. In this figure, the summed difference between the bimodal and unimodal Bayesian probability curves (cumulative BUD) is
Figure 4: The difference between the bimodal and unimodal Bayesian probabilities decreases rapidly as the difference between the driven and spontaneous input means increases. The spontaneous means for both V and A are fixed at 5 impulses per unit time, and the driven means for V and A are varied together from 7 to 25. Unimodal (equation 3.3) and bimodal (equation 4.1) Bayesian probabilities are computed as v, or v and a together, vary over the range from 0 to 25 impulses per unit time (as in Figure 3). The unimodal is subtracted from the bimodal curve at each point in the input range. The difference (BUD) is summed over the range and plotted. The cumulative BUD is computed with target-present prior probabilities of 0.1 (line without symbols), 0.01 (dashed line), and 0.001 (line with asterisks) (target-absent priors are 0.9, 0.99, and 0.999, respectively).
plotted as the driven means of both V and A are varied together from 7 to 25. The spontaneous means of both V and A are fixed at 5. The target prior distribution is altered so that the target-present prior equals 0.1 (line without symbols), 0.01 (dashed line), or 0.001 (line with asterisks). Figure 4 shows that the cumulative BUD decreases as the difference between the driven and spontaneous input means increases. The difference is higher for all means as the target-present prior decreases. Whether a deep SC neuron receives multimodal or only unimodal input may depend on the amount by which the driven response of a unimodal input differs from its spontaneous discharge. If the spontaneous and driven means are not well separated, then a unimodal sensory input provided to a deep SC neuron indicating the presence of a target would be uncertain and ambiguous. In the ambiguous case, a sensory input of another modality makes a relatively big difference in the computation of the probability that a target is present (e.g., Figure 4 for driven means less than about 15). Conversely, if the spontaneous and driven means are well separated, then a unimodal input provided to a deep SC neuron would be unambiguous. In the unambiguous case, a sensory input of another modality makes relatively little difference in the computation of the probability that a target is present (e.g., Figure 4 for driven means greater than about 15). Unimodal deep SC neurons may be those that receive an unambiguous input of one sensory modality and have no need for an input of another modality.

5 Bayes' Rule with One Spontaneous and One Driven Input
To compare the Bayes' rule model with neurophysiological data on multisensory enhancement, we must explore the behavior of the bimodal model (equation 4.1) not only for cases of bimodal stimulation, but also for cases where stimulation of only one or the other modality is available. To model the experimental situation in which sensory stimulation of both modalities is presented, we consider the case in which v and a are equal and vary together over the range from 0 to 25 impulses per unit time (as before). This simulates the driven responses of both inputs (both-driven). To model situations in which sensory stimulation of only one modality is presented, we consider the cases in which v or a is allowed to vary over the range, while the other input is fixed at the mean of its spontaneous likelihood distribution (driven/spontaneous). The results of these simulations are shown in Figure 5A, in which the spontaneous and driven input means are 5 and 8 for both V and A, as before. Because both sensory inputs have the same spontaneous and driven means, the bimodal Bayesian probability curves for the two driven/spontaneous cases are the same. Thus, the curves coincide for the bimodal cases in which v is fixed at 5 while a varies (line with diamonds) and in which a is fixed at 5 while v varies (line with crosses). There is a range of v and a in Figure 5A over which the bimodal Bayesian probability in the both-driven
Figure 5: Bimodal Bayesian probabilities (equation 4.1) are computed with both inputs driven or with one input driven and the other spontaneous. Either v and a vary together with v = a (both-driven), or one input varies while the other is fixed at its spontaneous mean of 5 (v = 5 or a = 5, driven/spontaneous). Target prior probabilities are set as before [P(T = 1) = 0.1 and P(T = 0) = 0.9] and the likelihoods of inputs V and A are modeled using Poisson densities (as in Figure 2A). (A) The driven input mean is set at 8 impulses per unit time for both V and A. The unimodal Bayesian probability (equation 3.3) is computed for comparison. (B) The driven input mean is 8 for A (as in A) but 10 for V. In (A) and (B), the Bayesian probabilities for the cases where both inputs vary together exceed over a broad range those for which one or the other input is fixed at 5. The driven/spontaneous bimodal Bayesian probability curves are denoted by lines with diamonds for v = 5, and lines with crosses for a = 5. The both-driven bimodal probability curves (v = a) are denoted by lines without symbols, and the unimodal Bayesian probability curve in (A) is denoted by a line with circles, as in Figure 3.
case (line without symbols) exceeds that in the driven/spontaneous cases (lines with diamonds and crosses). This can be called the range of enhancement. The bimodal curves (both-driven and driven/spontaneous; equation 4.1) are compared in Figure 5A with the corresponding unimodal Bayesian probability curve computed before (equation 3.3). It shows that the Bayesian probabilities in the bimodal, driven/spontaneous cases (lines with diamonds and crosses) are smaller than in the unimodal case (line with circles) for the same inputs v and/or a. This illustrates that Bayes' rule works in both directions. The bimodal probability can be higher than the unimodal probability when the target most likely drives both inputs. Conversely, the bimodal probability can be lower than the unimodal probability when the target most likely drives one input while the other input is most likely spontaneous. For simplicity, we have considered cases of the bimodal Bayes' rule where the likelihood distributions are the same for inputs of both modalities. The Bayesian analysis presented is valid for any input mean values, as long as inputs that are intended to represent neurons that are activated by sensory stimulation have driven means that are higher than spontaneous means. Figure 5B presents an example of bimodal Bayesian probabilities computed with the same spontaneous means (of 5), but with different driven means of 10 and 8 for V and A, respectively. This case is qualitatively similar to the previous one except that the driven/spontaneous curves do not coincide. The range of enhancement is much larger relative to the curve for which a varies alone while v = 5 (line with diamonds) than relative to the curve for which v varies alone while a = 5 (line with crosses). There are values of the input for which a driven/spontaneous probability is almost zero while the both-driven probability is almost one.
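The both-driven and driven/spontaneous cases differ only in which input values are supplied to equation 4.1. A sketch for the symmetric parameters of Figure 5A (names are ours):

```python
from math import exp, factorial

def poisson(k, mean):
    # Poisson probability of k impulses per unit time (equation 3.1)
    return mean ** k * exp(-mean) / factorial(k)

def bimodal_posterior(v, a, spont=5.0, driven=8.0, prior=0.1):
    # Equation 4.1 with conditionally independent Poisson inputs (4.2, 4.3)
    jd = poisson(v, driven) * poisson(a, driven)
    js = poisson(v, spont) * poisson(a, spont)
    return jd * prior / (jd * prior + js * (1.0 - prior))

def unimodal_posterior(v, spont=5.0, driven=8.0, prior=0.1):
    # Equation 3.3, for comparison with the driven/spontaneous bimodal case
    d, s = poisson(v, driven), poisson(v, spont)
    return d * prior / (d * prior + s * (1.0 - prior))

v = 10
both_driven = bimodal_posterior(v, v)    # v = a, both inputs driven
driven_spont = bimodal_posterior(v, 5)   # a fixed at its spontaneous mean
unimodal = unimodal_posterior(v)
# Bayes' rule works in both directions: at the same driven input,
# driven_spont < unimodal < both_driven.
```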
We hypothesize that the response of a deep SC neuron is proportional to the Bayesian probability that a target is present given its inputs. For the curves in Figures 5A and 5B, the ranges of input over which the bimodal, both-driven probabilities (lines without symbols) exceed the driven/spontaneous probabilities (lines with diamonds and crosses) correspond to ranges of multisensory enhancement. As v and a increase within a range of enhancement, the amount of enhancement increases and then decreases until, above the upper limit of the range, enhancement decreases to zero. This behavior of the bimodal Bayes' rule model can be used to simulate the results of a neurophysiological experiment (Meredith & Stein, 1986b). The experiment involves recording the responses of a multisensory deep SC neuron to stimuli that by themselves produce minimal, suboptimal, or optimal responses. The simulation is based on the bimodal case shown in Figure 5B, in which the driven means for V and A are different. Inputs for v and a are chosen that give similar minimal, suboptimal, and optimal Bayesian probabilities in the driven/spontaneous case. The both-driven probability is computed using the same, driven input values. The percentage enhancement is computed using equation 1.1, in which the Bayesian, bimodal both-driven probability is substituted for CM and the larger of the two driven/spontaneous probabilities is substituted for SMmax. The values are listed in Table 1.

Table 1: Numerical Values for a Simulated Enhancement Experiment.

              Driven Input      Bimodal Bayesian Probability
               v       a      v Driven   a Driven   va Driven   % enh
Minimal        8       9       0.0476     0.0487     0.3960       713
Suboptimal    12      15       0.4446     0.4622     0.9865       113
Optimal       16      21       0.9276     0.9351     1              7

Notes: Target prior probabilities are set at 0.1 for target present and 0.9 for target absent. The likelihoods of inputs V and A are modeled using Poisson densities with a spontaneous mean of 5 impulses per unit time for both V and A and driven means of 10 for V and 8 for A. With one input fixed at the spontaneous mean, the other input (v or a) is assigned a driven value such that the two driven/spontaneous, bimodal Bayesian probabilities are approximately equal at each of three levels (minimal, suboptimal, and optimal). The both-driven (va) bimodal Bayesian probability is computed using the same driven input values. The percentage enhancement is computed using equation 1.1, in which the both-driven probability is substituted for CM and the larger of the two driven/spontaneous probabilities is substituted for SMmax. The percentage increase in the Bayesian probability (% enh, equation 1.1) when both inputs are driven is largest when the driven/spontaneous Bayesian probabilities are smallest.

The results are illustrated in Figure 6, where the bars correspond to the Bayesian, bimodal probabilities in the driven/spontaneous cases (white bar, visual driven (v); gray bar, auditory driven (a)) or in the both-driven case (black bar, visual and auditory driven (va)). Figure 6 is drawn as a bar graph to facilitate comparisons with published figures based on experimental data (Meredith & Stein, 1986b; Stein & Meredith, 1993). Figure 6 shows that in the model, the percentage enhancement goes down as the driven/spontaneous probabilities go up. These modeling results simulate the property of inverse effectiveness that is observed for multisensory deep SC neurons, by which the percentage of multisensory enhancement is largest when the single-modality responses are smallest.

6 Discussion
Multisensory enhancement is a particularly dramatic form of multisensory integration in which the response of a deep SC neuron to a stimulus of one modality can be increased many times by a stimulus of another modality. Yet not all deep SC neurons are multisensory. Those that are exhibit the property of inverse effectiveness, by which the percentage of multisensory enhancement is largest when the single-sensory responses are smallest. We hypothesize that the responses of individual deep SC neurons are proportional to the Bayesian probability that a target is present given their sensory inputs. Our model, based on this hypothesis, suggests that multisensory deep SC neurons are those that receive unimodal inputs that would be more ambiguous by themselves. It also provides an explanation for the property of inverse effectiveness. The probability of a target computed on the basis of a large unimodal input will not be increased much by an input of another modality. However, the increase in target probability due to the integration of inputs of two modalities goes up rapidly as the size of the unimodal inputs decreases. Thus, our hypothesis provides a functional interpretation of multisensory enhancement that can explain its most salient features.

Figure 6: Comparing the driven/spontaneous bimodal Bayesian probabilities at three levels (minimal, suboptimal, and optimal) with the corresponding both-driven probability. Parameters for the target prior and input likelihood distributions are as in Figure 5B. Bimodal probabilities (equation 4.1) are computed when both v and a take various values within the range representing driven responses (va, black bars), or when one input takes the driven value (v, white bars; a, gray bars) but the other is fixed at the spontaneous mean of 5. The percentage increase in the Bayesian probability when both inputs are driven is largest when the driven/spontaneous Bayesian probabilities are smallest. The values are reported in Table 1.
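As a consistency check, the % enh column of Table 1 follows directly from the tabulated probabilities via equation 1.1, which with these substitutions amounts to 100 × (CM − SMmax)/SMmax (a sketch; names are ours):

```python
def percent_enhancement(both_driven, v_driven, a_driven):
    # Equation 1.1: CM is the both-driven probability, SMmax the larger
    # of the two driven/spontaneous probabilities.
    sm_max = max(v_driven, a_driven)
    return 100.0 * (both_driven - sm_max) / sm_max

# Rows of Table 1: (v driven, a driven, va driven)
table = {
    "minimal": (0.0476, 0.0487, 0.3960),
    "suboptimal": (0.4446, 0.4622, 0.9865),
    "optimal": (0.9276, 0.9351, 1.0),
}
for level, (vd, ad, vad) in table.items():
    print(level, round(percent_enhancement(vad, vd, ad)))
```

This reproduces the 713, 113, and 7 percent entries of Table 1 and makes the inverse-effectiveness trend explicit.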
6.1 Hypothesis, Assumptions, and Model Parameters. Our basic hypothesis is that the responses of deep SC neurons are proportional to the probability that a target is present. A similar hypothesis was put forward by Barlow (1969, 1972), who suggested that the responses of feature detector neurons are proportional to the probability that their trigger feature is present. Barlow suggested a model based on classical statistical inference, in which the response of a neuron is proportional to −log(P), where P is the probability of the results on a null hypothesis (i.e., that the trigger feature is absent). Classical inference differs from Bayesian analysis in many respects, with the main difference being that prior information is used in the latter but not in the former (Applebaum, 1996). In our model, the response of a deep SC neuron is directly proportional to the Bayesian (posterior) probability of a target, based on the prior probability of the target and sensory input of one or more modalities. We constructed our model so as to minimize the number of assumptions needed to evaluate it. That was our motivation in choosing the Poisson density rather than some other probability density function to represent the visual and auditory inputs, and in considering those two inputs as conditionally independent given the target. The Bayesian formulation is valid regardless of the probability density function chosen to represent the inputs, or whether those inputs are independent or dependent. Our particular construction of the model demonstrates its robustness, since the parameters of other instantiations requiring additional assumptions can easily be set so as to exaggerate the multisensory enhancement effects that we successfully simulate despite the constraints we impose. Conditional independence given the target implies that the visibility of a target indicates nothing about its audibility, and vice-versa.
One can easily imagine real targets for which this assumption is valid and others for which it is not. The advantage of the assumption of conditional independence is that the conditional and unconditional joint distributions of the visual and auditory inputs [P(V, A | T ) and P( V, A)] are then computed directly from the likelihoods of each individual input and the target priors (equations 4.2 and 4.3). The alternative assumption of conditional dependence carries with it a number of additional assumptions that would have to be made in order to specify how V and A are jointly distributed. The parameters of such a dependent, joint distribution could not be constrained on the basis of available data, but could be chosen in such a way that the simulated multisensory enhancement effects would be greatly exaggerated. Thus, our model construction based on conditional independence of the inputs is more restrictive than the dependent alternative, and conclusions based on it are therefore stronger. The spike trains of individual neurons have been described using various stochastic processes (Gerstein & Mandelbrot, 1964; Geisler & Goldberg, 1966; Lansky & Radil, 1987; Gabbiani & Koch, 1998). We use the discrete Poisson density to approximate the number of impulses per unit time in
the discharge of individual neurons. Real spike trains can be well described as Poisson processes in some cases but not in others (Munemori, Hara, Kimura, & Sato, 1984; Turcott et al., 1994; Lestienne, 1996; Rieke, Warland, & de Ruyter van Steveninck, 1997). The advantage of the Poisson density is that it is specified using only one parameter, the mean, and this parameter has been reported for neurons in many of the brain regions that supply sensory input to the deep SC. Other densities require setting two or more parameters, which could not be well constrained on the basis of available data but could be chosen in such a way that the simulated multisensory enhancement effects would be greatly exaggerated. Thus, our model construction using Poisson densities to represent the inputs is better constrained than others that would use alternative densities, and conclusions based on it are therefore more robust.

6.2 Unimodality versus Bimodality. Our analysis suggests that whether a deep SC neuron is multisensory may depend on the statistical properties of its inputs. We use two probability distributions to define each sensory input: one for its spontaneous discharge and the other for its activity as it is driven by the sensory attributes of the target. A critical feature of these distributions is their degree of overlap. The more the spontaneous and driven distributions of an input overlap, the more uncertain and ambiguous will that input be for indicating the presence of a target. Our model suggests that if a deep SC neuron receives a unimodal input for which the spontaneous and driven distributions have little overlap, then that neuron receives unambiguous unimodal input and has no need for input of another modality. Conversely, if a deep SC neuron receives a unimodal input for which the spontaneous and driven distributions overlap a lot, then that neuron receives ambiguous unimodal input.
Such a deep SC neuron could effectively disambiguate this input by combining it with an input of another modality, even if the other input is also ambiguous by itself. About 45% of deep SC neurons in the cat and 21% in the monkey receive input from two sensory modalities (Wallace & Stein, 1996). An input of a third modality could provide further disambiguation. About 9% of all deep SC neurons in the cat and 6% in the monkey receive input of three modalities. Inputs to the deep SC are likely to vary in the degree to which their spontaneous discharges and driven responses overlap. For example, sensory neurons vary in their sensitivity, and overlap would occur for those that are less sensitive to stimulation. Sensory inputs would therefore vary in their degree of ambiguity. This could explain why some deep SC neurons are unimodal while others are multimodal (Wallace & Stein, 1996).

6.3 Spontaneous Activity. In the model, the Bayesian probability of a target increases from zero to one as the input increases. We suggest that the responses of individual deep SC neurons are proportional to this Bayesian probability. The Bayesian probability is zero or nearly zero for inputs that are
most likely spontaneous. For the example shown in Figure 2, the Bayesian probability is nearly zero even for an input of 16 impulses per second (4 impulses per 250-msec interval). The spontaneous activity of deep SC neurons is also zero or nearly zero, despite the fact that most of the neurons that supply input to the deep SC have nonzero spontaneous firing rates. The spontaneous rates of neurons in some of the cortical and subcortical visual and auditory regions that project to the deep SC range from zero to over 100 impulses per second (anterior ectosylvian visual area in anesthetized cat, 1–8 Hz: Mucke, Norita, Benedek, & Creutzfeldt, 1982; pretectum in anesthetized cat, 1–31 Hz: Schmidt, 1996; retinal ganglion cells in anesthetized cat, 0–80 Hz: Ikeda, Dawes, & Hankins, 1992; superior olivary complex in anesthetized cat, 0–140 Hz: Guinan, Guinan, & Norris, 1972; inferior colliculus in alert cat, 2–54 Hz: Bock, Webster, & Aitkin, 1971; trapezoid body in anesthetized cat, 0–104 Hz: Brownell, 1975). For the unanesthetized rabbit, mean spontaneous rates of 36.1 ± 37.8 (SD) and 11.1 ± 11.4 (SD) impulses per second have been observed for single neurons in the superior olivary complex and inferior colliculus, respectively (D. Fitzpatrick and S. Kuwada, pers. comm.). Although neurons discharge spontaneously in most of the regions that provide inputs to the deep SC, most deep SC neurons exhibit low or zero spontaneous discharge, in both the anesthetized preparation and the alert, behaving animal (Peck, Schlag-Rey, & Schlag, 1980; Middlebrooks & Knudsen, 1984; Grantyn & Berthoz, 1985). In the context of the model, we would interpret this low level of deep SC neuron activity as indicating that the probability of a target is low when the inputs are spontaneous.

6.4 Driven Responses. The driven responses of deep SC neurons are influenced by a multidimensional array of stimulus parameters in addition to location in space.
For visual stimuli these can include size, and rate and direction of motion (Stein & Arigbede, 1972; Gordon, 1973; Peck et al., 1980; Wallace & Stein, 1996). For auditory stimuli, factors influencing driven responses include sound frequency, intensity, and motion of the sound source (Gordon, 1973; Wise & Irvine, 1983, 1985; Hirsch, Chan, & Yin, 1985; Yin, Hirsch, & Chan, 1985). Specificity for these parameters is most likely derived from the sensory systems that provide input to the deep SC. In the model, we represent input activity as the number of impulses per unit time, which increases monotonically as the optimality of the sensory stimulus increases. The multidimensional array of stimulus parameters, as well as the multiplicity of input sources, raises the possibility of enhancement within a single sensory modality. For example, a visual input specific for size might enhance another visual input specific for direction of motion. The basic model, which includes only one input from each of two sensory modalities, can be generalized to include arbitrarily many inputs of various modalities and specificities.
The Bayesian probability increases as the input increases, and is one or nearly one for inputs that are most likely driven. A Bayesian probability of one could correspond to the maximum activity level of deep SC neurons. The model suggests that unimodal input could maximally drive individual deep SC neurons, even those that are multisensory (e.g., Figure 5). However, in order for some inputs to drive deep SC neurons maximally, they would have to attain very high activity levels. For example, when the auditory input alone is driven in Figure 5B (line with diamonds), it does not produce a Bayesian probability of one even when it has attained 80 impulses per second (20 impulses per unit time). It is possible that some unimodal inputs are unable to attain the activity levels needed to drive deep SC neurons maximally, regardless of the optimality of the unimodal stimulus. The maximal response of a multimodal neuron to such unimodal input would be less than its maximal bimodal or multimodal response. Neurophysiological evidence suggests that this may be the case for some deep SC neurons (Stein & Meredith, 1993).

6.5 Multisensory Enhancement and Inverse Effectiveness. When the responses of deep SC neurons are interpreted as being proportional to the Bayesian probability of a target, then multisensory enhancement is analogous to an increase in that probability due to multimodal input. The strength of the evidence of a target provided to a deep SC neuron is related to the size of its driven input. If a driven input of one modality is large, a deep SC neuron receives overwhelming evidence on the basis of that input modality alone. The probability of a target would be high, and driven input of another modality will not increase that probability much. Conversely, if a driven input of one modality is small, then the evidence it provides to a deep SC neuron is weak, and the probability of a target computed on the basis of that input alone would be small.
In that case, the probability of a target could be greatly increased by corroborating input of another modality. According to the probabilistic interpretation, the amount of multisensory enhancement should be larger for smaller unimodal inputs. The experimental paradigm actually used to demonstrate multisensory enhancement would accentuate this effect. Typically in studying a multisensory deep SC neuron, the response to stimuli of two modalities presented together is compared with the response to a stimulus of either modality presented alone (Meredith & Stein, 1986b; Stein & Meredith, 1993). According to the model, the response of a bimodal, deep SC neuron to a unimodal stimulus would be proportional to the probability of a target given that one input is driven and the other is spontaneous. This bimodal, driven/spontaneous probability is lower than that which would be computed by a unimodal deep SC neuron receiving only the driven input (see Figure 5A). The both-driven bimodal probability can be increased greatly over the driven/spontaneous bimodal probabilities, especially when the driven/spontaneous probabilities are smaller.
Multisensory Enhancement in the Superior Colliculus
1185
Thus, the amount of increase in target probability due to the integration of multimodal sensory inputs goes up rapidly as the responses to unimodal inputs decrease. This relationship is illustrated in Figures 5 and 6. It is analogous to the phenomenon of inverse effectiveness, which is the hallmark of multisensory enhancement in deep SC neurons (Stein & Meredith, 1993). The property of inverse effectiveness is precisely what would be expected on the basis of the hypothesis that deep SC neurons use their multisensory inputs to compute the probability of a target.

Acknowledgments
We thank Pierre Moulin, Hao Pan, Sylvian Ray, Jesse Reichler, and Christopher Seguin for many helpful discussions and for comments on the manuscript prior to submission. This work was supported by NSF grant IBN 92-21823 and a grant from the Critical Research Initiatives of the State of Illinois, both to T.J.A.

References

Applebaum, D. (1996). Probability and information: An integrated approach. Cambridge: Cambridge University Press.
Barlow, H. B. (1969). Pattern recognition and the responses of sensory neurons. Annals of the New York Academy of Sciences, 156, 872–881.
Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Clarendon Press.
Bock, G. R., Webster, W. R., & Aitkin, L. M. (1971). Discharge patterns of single units in inferior colliculus of the alert cat. J. Neurophysiol., 35, 265–277.
Brownell, W. E. (1975). Organization of the cat trapezoid body and the discharge characteristics of its fibers. Brain Res., 94(3), 413–433.
Cadusseau, J., & Roger, M. (1985). Afferent projections to the superior colliculus in the rat, with special attention to the deep layers. J. Hirnforsch., 26(6), 667–681.
Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.
Edwards, S. B., Ginsburgh, C. L., Henkel, C. K., & Stein, B. E. (1979). Sources of subcortical projections to the superior colliculus in the cat. J. Comp. Neurol., 184(2), 309–329.
Gabbiani, F., & Koch, C. (1998). Principles of spike train analysis. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (2nd ed., pp. 312–360). Cambridge, MA: MIT Press.
Geisler, C. D., & Goldberg, J. M. (1966). A stochastic model of the repetitive activity of neurons. Biophys. J., 6, 53–69.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
1186
T. J. Anastasio, P. E. Patton, and K. Belkacem-Boussaid
Gordon, B. (1973). Receptive fields in deep layers of cat superior colliculus. J. Neurophysiol., 36(2), 157–178.
Grantyn, A., & Berthoz, A. (1985). Burst activity of identified tecto-reticulospinal neurons in the alert cat. Exp. Brain Res., 57(2), 417–421.
Guinan, J. J., Guinan, S. S., & Norris, B. E. (1972). Single auditory units in the superior olivary complex. I. Responses to sounds and classifications based on physiological properties. Int. J. Neurosci., 4, 101–120.
Helstrom, C. W. (1984). Probability and stochastic processes for engineers. New York: Macmillan.
Hirsch, J. A., Chan, J. C., & Yin, T. C. (1985). Responses of neurons in the cat’s superior colliculus to acoustic stimuli. I. Monaural and binaural response properties. J. Neurophysiol., 53(3), 726–745.
Ikeda, H., Dawes, E., & Hankins, M. (1992). Spontaneous firing level distinguishes the effects of NMDA and non-NMDA receptor antagonists on the ganglion cells in the cat retina. Eur. J. Pharm., 210, 53–59.
King, A. J., & Palmer, A. R. (1985). Integration of visual and auditory information in bimodal neurones in guinea-pig superior colliculus. Exp. Brain Res., 60, 492–500.
Lansky, P., & Radil, T. (1987). Statistical inference on spontaneous neuronal discharge patterns. Biol. Cybern., 55, 299–311.
Lestienne, R. (1996). Determination of the precision of spike timing in the visual cortex of anaesthetised cats. Biol. Cybern., 74, 55–61.
Meredith, M. A., Clemo, H. R., & Stein, B. E. (1991). Somatotopic component of the multisensory map in the deep laminae of the cat superior colliculus. J. Comp. Neurol., 312(3), 353–370.
Meredith, M. A., Nemitz, J. W., & Stein, B. E. (1987). Determinants of multisensory integration in superior colliculus neurons. I. Temporal factors. J. Neurosci., 7(10), 3215–3229.
Meredith, M. A., & Stein, B. E. (1983). Interactions among converging sensory inputs in the superior colliculus. Science, 221(4608), 389–391.
Meredith, M. A., & Stein, B. E. (1986a). Spatial factors determine the activity of multisensory neurons in cat superior colliculus. Brain Res., 365(2), 350–354.
Meredith, M. A., & Stein, B. E. (1986b). Visual, auditory, and somatosensory convergence on cells in superior colliculus results in multisensory integration. J. Neurophysiol., 56(3), 640–662.
Meredith, M. A., & Stein, B. E. (1990). The visuotopic component of the multisensory map in the deep laminae of the cat superior colliculus. J. Neurosci., 10(11), 3727–3742.
Meredith, M. A., & Stein, B. E. (1996). Spatial determinants of multisensory integration in cat superior colliculus neurons. J. Neurophysiol., 75(5), 1843–1857.
Middlebrooks, J. C., & Knudsen, E. I. (1984). A neural code for auditory space in the cat’s superior colliculus. J. Neurosci., 4(10), 2621–2634.
Mucke, L., Norita, M., Benedek, G., & Creutzfeldt, O. (1982). Physiologic and anatomic investigation of a visual cortical area situated in the ventral bank of the anterior ectosylvian sulcus of the cat. Exp. Brain Res., 46, 1–11.
Munemori, J., Hara, K., Kimura, M., & Sato, R. (1984). Statistical features of impulse trains in cat’s lateral geniculate neurons. Biol. Cybern., 50, 167–172.
Peck, C. K., Schlag-Rey, M., & Schlag, J. (1980). Visuo-oculomotor properties of cells in the superior colliculus of the alert cat. J. Comp. Neurol., 194(1), 97–116.
Rieke, F., Warland, D., & de Ruyter van Steveninck, R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Upper Saddle River, NJ: Prentice Hall.
Schmidt, M. (1996). Neurons in the cat pretectum that project to the dorsal lateral geniculate nucleus are activated during saccades. J. Neurophysiol., 76(5), 2907–2918.
Sparks, D. L., & Hartwich-Young, R. (1989). The deep layers of the superior colliculus. In R. H. Wurtz & M. Goldberg (Eds.), The neurobiology of saccadic eye movements (pp. 213–255). Amsterdam: Elsevier.
Stein, B. E., & Arigbede, M. O. (1972). A parametric study of movement detection properties of neurons in the cat’s superior colliculus. Brain Res., 45, 437–454.
Stein, B. E., Huneycutt, W. S., & Meredith, M. A. (1988). Neurons and behavior: The same rules of multisensory integration apply. Brain Res., 448(2), 355–358.
Stein, B. E., & Meredith, M. A. (1993). The merging of the senses. Cambridge, MA: MIT Press.
Stein, B. E., Meredith, M. A., Huneycutt, W. S., & McDade, L. (1989). Behavioral indices of multisensory integration: Orientation to visual cues is affected by auditory stimuli. J. Cog. Neurosci., 1(1), 12–24.
Turcott, R. G., Lowen, S. B., Li, E., Johnson, D. H., Tsuchitani, C., & Teich, M. C. (1994). A non-stationary Poisson point process describes the sequence of action potentials over long time scales in lateral-superior-olive auditory neurons. Biol. Cybern., 70, 209–217.
Wallace, M. T., & Stein, B. E. (1994). Cross-modal synthesis in the midbrain depends on input from cortex. J. Neurophysiol., 71(1), 429–432.
Wallace, M. T., & Stein, B. E. (1996). Sensory organization of the superior colliculus in cat and monkey. Prog. Brain Res., 112, 301–311.
Wise, L. Z., & Irvine, D. R. (1983). Auditory response properties of neurons in deep layers of cat superior colliculus. J. Neurophysiol., 49(3), 674–685.
Wise, L. Z., & Irvine, D. R. (1985). Topographic organization of interaural intensity difference sensitivity in deep layers of cat superior colliculus: Implications for auditory spatial representation. J. Neurophysiol., 54(2), 185–211.
Wurtz, R. H., & Goldberg, M. E. (Eds.). (1989). The neurobiology of saccadic eye movements. Amsterdam: Elsevier.
Yin, T. C., Hirsch, J. A., & Chan, J. C. (1985). Responses of neurons in the cat’s superior colliculus to acoustic stimuli. II. A model of interaural intensity sensitivity. J. Neurophysiol., 53(3), 746–758.

Received August 5, 1998; accepted May 10, 1999.
LETTER
Communicated by Anthony Zador
Choice and Value Flexibility Jointly Contribute to the Capacity of a Subsampled Quadratic Classifier

Panayiota Poirazi
Bartlett W. Mel
Department of Biomedical Engineering, University of Southern California, Los Angeles, CA 90089, U.S.A.

Biophysical modeling studies have suggested that neurons with active dendrites can be viewed as linear units augmented by product terms that arise from interactions between synaptic inputs within the same dendritic subregions. However, the degree to which local nonlinear synaptic interactions could augment the memory capacity of a neuron is not known in a quantitative sense. To approach this question, we have studied the family of subsampled quadratic classifiers: linear classifiers augmented by the best k terms from the set of K = (d^2 + d)/2 second-order product terms available in d dimensions. We developed an expression for the total parameter entropy, whose form shows that the capacity of an SQ classifier does not reside solely in its conventional weight values—the explicit memory used to store constant, linear, and higher-order coefficients. Rather, we identify a second type of parameter flexibility that jointly contributes to an SQ classifier’s capacity: the choice as to which product terms are included in the model and which are not. We validate the form of the entropy expression using empirical studies of relative capacity within families of geometrically isomorphic SQ classifiers. Our results have direct implications for neurobiological (and other hardware) learning systems, where in the limit of high-dimensional input spaces and low-resolution synaptic weight values, this relatively little explored form of choice flexibility could constitute a major source of trainable model capacity.

1 Introduction
Experimental evidence continues to accumulate indicating that the dendritic trees of many neuron types are electrically “active”; they contain voltage-dependent ionic conductances that can lead to full regenerative propagation of action potentials and other dynamical phenomena (for recent reviews see Johnston, Magee, Colbert, & Christie, 1996; Stuart, Spruston, Sakmann, & Hausser, 1997; Stuart, Spruston, & Hausser, 1999). Although our state of knowledge regarding the functions of dendritic trees is largely confined to the realm of hypothesis (Koch, Poggio, & Torre, 1982; Rall &

Neural Computation 12, 1189–1205 (2000) © 2000 Massachusetts Institute of Technology
Figure 1: Conceptual illustration of location-dependent nonlinear interactions in an active dendritic tree. Two inputs, A and B, activated at high frequency on a dendritic branch containing a thresholding nonlinearity could be sufficient to cause the branch (and the cell) to “fire” repeatedly, while coactivation of more distantly separated inputs, A and C, might generate only a rare suprathreshold event at any location. This type of nonlinear interaction between A and B is sometimes loosely termed multiplicative.
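The interaction sketched in Figure 1 can be captured by a toy model in which each branch applies a threshold to its summed synaptic input before passing the result to the soma. The unit input strengths and the threshold value below are illustrative assumptions, not biophysical parameters:

```python
def branch_output(inputs, theta=1.5):
    """A dendritic branch passes its summed input only if it exceeds a threshold."""
    s = sum(inputs)
    return s if s >= theta else 0.0

def soma_response(branches, theta=1.5):
    """The soma sums the thresholded outputs of all branches."""
    return sum(branch_output(b, theta) for b in branches)

# A and B on the same branch cross the branch threshold together...
colocated = soma_response([[1.0, 1.0], [0.0]])   # A+B on one branch
# ...while A and C on separate branches each fall below it.
separated = soma_response([[1.0], [1.0]])        # A and C on different branches
```

Co-located inputs thus interact superlinearly (a response of 2.0 versus 0.0 for the same total synaptic drive), which is the sense in which such interactions are loosely multiplicative.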
Segev, 1987; Shepherd & Brayton, 1987; Borg-Graham & Grzywacz, 1992; Mel, 1992a, 1992b, 1993, 1994, 1999; Zador, Claiborne, & Brown, 1992; Yuste & Tank, 1996; Segev & Rall, 1998; Mel, Ruderman, & Archie, 1998a, 1998b), results of existing experimental and modeling studies strongly suggest that electrical compartmentalization provided by the dendritic branching structure, coupled with local thresholding provided by active dendritic channels, together are likely to have a profound impact on the way in which synaptic inputs to a dendritic tree are integrated to produce a cell’s output (see Figure 1). A key challenge in deciphering the functions of individual neurons involves properly characterizing the form of nonlinear interactions between synapses within a dendritic compartment, and to this end, a variety of biophysical modeling studies have identified possible functional roles for such interactions (Koch et al., 1982; Rall & Segev, 1987; Shepherd & Brayton, 1987; Borg-Graham & Grzywacz, 1992; Koch & Poggio, 1992; Zador et al., 1992; Mel, 1992a, 1992b, 1993). In this same vein, two recent biophysical modeling studies (Mel et al., 1998a, 1998b) have shown that the time-averaged response of a cortical pyramidal cell with weakly excitable dendrites can mimic the output of a quadratic “energy” model, a formalism that has been used to model a variety of receptive field properties in visual cortex (Pollen & Ronner, 1983; Adelson & Bergen, 1985; Heeger, 1992; Ohzawa, DeAngelis, & Freeman, 1997). This close connection between dendritic and receptive
field nonlinearities is highly suggestive and supports the notion that active dendrites could contribute to a diverse set of neuronal functions in the mammalian central nervous system that involve a common, quasi-static multiplicative nonlinearity of relatively low order (for further discussion, see Mel, 1999). The potential contributions of active dendritic processing to the brain’s capacity to learn and remember pose a fundamental problem in neuroscience that has thus far received relatively little attention. In one study addressing the issue of neuronal capacity, Zador and Pearlmutter (1996) computed the Vapnik-Chervonenkis (1971) dimension of an integrate-and-fire neuron with a passive dendritic tree. However, location-dependent nonlinear synaptic interactions, which are likely to occur within the dendrites of many neuron types, were not considered in this study. To begin to address this issue, we previously developed a spatially extended single-neuron model called the “clusteron,” which explicitly included location-dependent multiplicative synaptic interactions in the neuron’s input-output function (Mel, 1992a) and illustrated how a neuron with active dendrites could act as a high-dimensional quadratic learning machine. The clusteron abstraction was conceived primarily to help quantify the impact of local dendritic nonlinearities on the usable memory capacity of a cell; an earlier biophysical modeling study had demonstrated empirically that this impact could be significant (Mel, 1992b). However, an additional insight gained from this work was that a cell with s input sites is likely capable of representing only a small fraction of the O(d^2) second-order interaction terms available among d input lines, since in the brain, it is often reasonable to assume that d is very large. In short, if a neuron does indeed pull out higher-order features in its dendrites, it is likely to have to choose very carefully among them.
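The combinatorial pressure described here is easy to quantify: the number of second-order product terms, K = (d^2 + d)/2, grows quadratically with the number of input lines d, so a cell with a fixed budget of representable product terms covers a vanishing fraction of them. A quick sketch (the budget of 1,000 terms is an arbitrary illustration):

```python
def num_product_terms(d):
    """Number of distinct second-order terms x_i * x_j (i <= j) in d dimensions."""
    return (d * d + d) // 2

k_budget = 1000  # hypothetical fixed budget of representable product terms
fractions = {d: k_budget / num_product_terms(d) for d in (100, 1000, 10000)}
```

Even this generous budget covers roughly 20% of the available pairs at d = 100, but only about 0.2% at d = 1,000 and 0.002% at d = 10,000.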
Inspired in part by this observation, our goal here has been to quantify the degree to which inclusion of a subset of the best available higher-order terms augments the capacity of a subsampled quadratic (SQ) classifier relative to its linear counterpart. Although it has been long appreciated that inclusion of higher-order terms can increase the power of a learning machine for both regression and classification (Poggio, 1975; Barron, Mucciardi, Cook, Craig, & Barron, 1984; Giles & Maxwell, 1987; Ghosh & Shin, 1992; Guyon, Boser, & Vapnik, 1993; Karpinsky & Werther, 1993; Schurmann, 1996), and algorithms for learning polynomials have often included heuristic strategies for subsampling the intractable number of higher-order product terms, existing theory bearing on the learning capabilities of quadratic classifiers is limited. Cover (1965) showed that an rth order polynomial classifier consisting of all r-wise products of the input variables in d dimensions has a natural separating capacity of

$$2\binom{d + r}{r},$$
or twice the number of parameters. However, an SQ classifier with k nonzero product terms, which would appear at first to have a Cover capacity of 2(1 + d + k), in fact contains additional covert parameters relating to the choice of nonzero coefficients to be included in the classifier, since this choice is a bona fide form of model flexibility that can be used to fit training data. Viewed in another way, the SQ classifier with k nonzero terms also implicitly specifies zero values on the product terms that were not chosen, so that the capacity of an SQ classifier must necessarily take account of these additional internal degrees of freedom. Our specific objective here has been to obtain an expression that quantifies the separate contributions to the capacity of an SQ classifier that arise from parameter value versus choice flexibility. Beyond the theoretical significance of this question and its broader implications for the learning performance of classifiers that subsample from large sets of nonlinear basis functions, our hope has been that this analysis and its subsequent refinements will lead to further insights into the potential contributions of active dendritic processing to the memory capacity of neural tissue.

2 Toward the Capacity of a Subsampled Quadratic Classifier
We consider the family of SQ classifiers in which only k of the K = (d^2 + d)/2 available second-order terms in d dimensions are allowed to take on a nonzero value. In the special case where k = 0, we are left with a pure linear classifier, while k = K corresponds to a full quadratic classifier. Crucially, the k nonlinear terms included in the model are not chosen at random, but are chosen instead to be the best subset of k terms—that which minimizes training set error. The capacity of a learning machine depends on the number of distinct input-output functions that can be expressed, which in the case of a classifier corresponds to the number of different labelings (dichotomies) that can be assigned to N patterns in d dimensions (Cover, 1965; Vapnik & Chervonenkis, 1971). Although a general expression for the number of dichotomies realizable by an SQ classifier is difficult to obtain as a function of N, d, and k, an upper bound on the total parameter entropy of an SQ classifier taken over the space of randomly drawn training sets may be expressed as

$$B(k, d) = (1 + d + k)\,w + \log_2 \binom{K}{k}, \qquad (2.1)$$
where the first term measures the entropy contained in the stored parameter values, with w an unknown representing the average number of bits of value entropy per parameter, and the second term measures the entropy contained in the choice of higher-order terms, assuming that all combinations of k higher-order terms are chosen equally often. The factorization of value versus choice entropy assumes that the choice of higher-order terms, and the values assigned to them, are statistically independent.

Figure 2: SQ capacity curves were generated using equation 2.1 with w = 4, for d = 10, 20, and 30 (bits used in classifiers plotted against the product term ratio p = k/K, in percent). Dashed lines indicate the contribution of explicit coefficients to the bit total, while the additional increment is provided by the combinatorial term in equation 2.1 pertaining to the choice of nonzero coefficients over the available collection of product terms. Both terms are O(d^2).

The relative magnitudes of value versus choice capacity are shown in Figure 2, where B(k, d) is plotted as a function of the product term ratio p = k/K, for w = 4. The contribution of the weight values to the bit total is given by the dashed line connecting the p = 0% and p = 100% points on each curve. The contribution of the choice term has the form of an upside-down “U” and can be seen as the increment to the bit count above the dashed line.¹ When the mapping between learning problems and classifier parameters is deterministic (i.e., when there always exists a single best setting of classifier parameters to minimize training error), and when all classifier states are equally likely, equation 2.1 represents the mutual information contained in the classifier parameters about training set labels. Unfortunately, equation 2.1 cannot be directly used to predict error rates for SQ classifiers. First, the value of w is generally unknown a priori. Second, actual error rates do not depend on parameter entropy alone, but depend also on geometric factors that determine the suitability of the classifier’s parameters to specific training set distributions. Following the maxim that capacity is closely linked to trainable degrees of freedom, however, and the principle that capacity and error rates should be related to each other in consistent (if unknown) ways, we conjectured that there could exist subfamilies of isomorphic SQ classifiers wherein the relative capacity of any two classifiers i and j, at any error rate and for any training set distribution, is given by the ratio of their respective B(k, d) values. Specifically, a family of isomorphic SQ classifiers C is defined such that for any two classifiers {i, j} ∈ C, the relative number of patterns N learnable by each classifier at error rate ε obeys the relation

$$R_{i/j} = \frac{N_i}{N_j} = \frac{B_i}{B_j}, \quad \text{for all } \varepsilon. \qquad (2.2)$$

¹ The expression for B assumes that the value bits and the choice bits are independent of each other, which holds only when the k best product terms chosen for inclusion in the classifier are unlikely to include any zero values. This assumption breaks down as k → K, in which limit some of the k largest learned coefficients will have values near zero. This leads in turn to an overprediction of classifier capacity according to equation 2.1 and the resulting slight nonmonotonicity seen in Figure 2 for large p values. A more intricate counting method would be needed to separate these two sources of classifier capacity for large p values.
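Equations 2.1 and 2.2 are simple to evaluate numerically. The sketch below uses w = 4, the value used for Figure 2 (the function names are ours); it checks that the choice term vanishes in the linear (k = 0) and full quadratic (k = K) special cases, and evaluates the capacity ratio for two classifiers with matched product term ratios p = k/K:

```python
from math import comb, log2

def B(d, k, w=4):
    """Upper bound on total SQ parameter entropy in bits (equation 2.1)."""
    K = (d * d + d) // 2                 # available second-order terms
    value_bits = (1 + d + k) * w         # constant, linear, and k product coefficients
    choice_bits = log2(comb(K, k))       # which k of the K product terms are used
    return value_bits + choice_bits

def R(di, dj, p, w=4):
    """Capacity ratio B_i / B_j for two p-matched SQ classifiers (equation 2.2)."""
    Ki = (di * di + di) // 2
    Kj = (dj * dj + dj) // 2
    return B(di, round(p * Ki), w) / B(dj, round(p * Kj), w)
```

For d = 10, the choice term contributes nothing at k = 0 or k = K = 55, while intermediate k values gain capacity beyond the explicit weight bits; for p-matched classifiers, the ratio R tends toward (d_i/d_j)^2 as both dimensions grow large.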
Put another way, two isomorphic classifiers should yield equal error rates when asked to learn an equivalent number of patterns per parameter bit, for any input distribution. We describe several tests of this scaling conjecture in the following sections.

2.1 Special Cases: Linear and Full Quadratic Classifiers. The scaling relation expressed in equation 2.2 may be easily verified in the limiting cases of k = 0 and k = K, which correspond to linear and full quadratic classifiers, respectively. In both cases, the combinatorial term representing the choice capacity in equation 2.1 vanishes. Assuming that the value-entropy scaling factor w is always equal for the two classifiers at a given error rate, the capacity ratio for two isomorphic “choiceless” classifiers indexed by i and j simplifies to the ratio of their explicit parameter counts:

$$R_{i/j} = \frac{1 + d_i}{1 + d_j} \qquad (2.3)$$

for the linear case and

$$R_{i/j} = \frac{\binom{d_i + 2}{2}}{\binom{d_j + 2}{2}} = \frac{d_i^2 + 3d_i + 2}{d_j^2 + 3d_j + 2} \qquad (2.4)$$
for the full quadratic case. Both outcomes are consistent with the fact that the natural separating capacity for both linear and full polynomial discriminant functions is directly proportional to the number of free parameters
available to the classifier (Cover, 1965). Numerical simulations and supporting calculations confirmed that this capacity scaling at every error rate holds for linear and quadratic classifiers in 10, 20, and 30 dimensions (see Figure 3). The superposition of performance curves for linear classifiers in 10, 20, and 30 dimensions shown in Figure 3C validates the assumption in equation 2.3 that w is independent of d. Thus, randomly labeled samples drawn from a spherical gaussian distribution evoke the same number of bits of entropy in a linear classifier, per parameter, regardless of the dimension in which the classifier sits. A statistical mechanical interpretation of this fact is given in Riegler & Seung (1997). Similarly, the scaling of full quadratic performance curves shown in Figure 3C implies that the value of w is independent of d within this family of learning machines as well. However, it is important to note that the value of w is not necessarily consistent between the two families, since the graphs of error rate versus training set size for linear and quadratic classifiers are not related by any simple scaling factor (see Figure 3C). In fact, full quadratic classifiers performed better per parameter than linear classifiers for small training sets—in this case, up to six patterns per model parameter. The opposite was true for large training sets, for which the training set covariance matrix approaches the identity, thus depleting any information about class boundaries contained in the O(d^2) quadratic parameters.

2.2 Capacity Scaling for General SQ Classifiers. We tested the scaling relation in equation 2.2 for SQ classifiers in the general case when 0 < k < K. In all simulations reported here, the coefficients for the k nonzero second-order terms were determined as follows. A conjugate gradient algorithm was used to train a full quadratic classifier to minimize mean squared error (MSE) over the training set.
Given the sphericity of the training set distribution, the selection of the k best product terms after training could be made based on weight magnitude, a strategy that is not valid in general (see Bishop, 1995). However, in this case, it was verified empirically that the increase in MSE or classification error when a single weight was set to zero grew roughly monotonically with the weight magnitude (see Figure 4). The pruned classifier, including only the k largest second-order terms, was retrained to minimize MSE. During this second round of training, the output of the classifier was passed through a sigmoidal thresholding function (1 − e^{−x/s})/(1 + e^{−x/s}) with slope s = 0.51. In some cases, the resulting value parameters were binarized to take on values of ±1. Graphs of error rate versus training set size for SQ classifiers are shown in Figure 5A for a range of k values in 10 dimensions. The capacity boost resulting from inclusion of varying numbers of second-order terms to a linear classifier was culled from this graph by choosing a fixed error rate
Figure 3: We trained linear and full quadratic classifiers using randomly labeled patterns drawn from a d-dimensional zero-mean spherical gaussian distribution and measured average classification error versus training set size. (A) A simple analytical model (crosses) fit numerical simulation results (squares) in the asymptote of large N (see the appendix); d = 10. Dashed lines indicate the VC dimension 1 + d and Cover capacity 2(1 + d) for the perceptron, which correspond to 1% and 7% error rates, respectively. Error rates slightly above the theoretical minimum (0% at the VC dimension) were due to imperfect control of the learning-rate parameter. (B) A Bayes-optimal classifier based on estimated covariance matrices (crosses) provided a control for the performance of a numerically optimized full quadratic classifier (squares); the performance curves again merge in the limit of large N. (C) Scaling of linear and quadratic error rates across dimension (d = 10, 20, 30), when plotted against training patterns per model parameter (1 + d for linear, 1 + d + (d^2 + d)/2 for quadratic).
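The VC dimension and Cover capacity cited in Figure 3A follow from Cover’s (1965) function-counting argument: for N points in general position, the fraction of the 2^N possible dichotomies that a linear classifier with m free parameters can realize is computable exactly. A minimal sketch (the function name and variable choices are ours):

```python
from math import comb

def frac_separable(N, m):
    """Fraction of dichotomies of N points in general position realizable by a
    linear classifier with m free parameters (Cover, 1965)."""
    return 2 * sum(comb(N - 1, i) for i in range(m)) / 2 ** N

# For a perceptron with bias in d dimensions, m = 1 + d. All dichotomies are
# realizable up to N = m (the VC dimension), and exactly half are realizable
# at the natural separating capacity N = 2m.
d = 10
at_vc = frac_separable(1 + d, 1 + d)
at_cover = frac_separable(2 * (1 + d), 1 + d)
```

The fraction falls from 1 at N = m to exactly 1/2 at N = 2m, and decays rapidly beyond that, which is why the dashed capacity lines in Figure 3A sit near the low-error end of the perceptron performance curves.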
Figure 4: Distributions of learned weight values and increments to MSE with pruning of individual weights (d = 10, N = 100). (A) Similar distribution of linear versus second-order weight values after training a full quadratic classifier. (B) Effect of pruning individual weights on the MSE. For each data point, the x-value indicates weight magnitude; the y-value indicates the increment to MSE when that weight was set to 0 (without further retraining). As expected from the symmetry of the training distribution, the increase in MSE is roughly monotonic with weight magnitude for both linear and second-order weights. A similar relationship was seen for increments in classification error as a function of weight magnitude.
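The train-prune-retrain procedure behind Figures 4 and 5 can be sketched as follows. This is a simplified stand-in, not the authors’ implementation: it fits by ordinary least squares rather than conjugate gradient descent, omits the sigmoidal output stage, and runs on made-up synthetic data for illustration:

```python
import numpy as np

def quad_features(X):
    """Constant, linear, and all (d^2 + d)/2 second-order product terms."""
    n, d = X.shape
    prods = np.stack([X[:, i] * X[:, j] for i in range(d) for j in range(i, d)], axis=1)
    return np.hstack([np.ones((n, 1)), X, prods])

def train_sq(X, y, k):
    """Train a full quadratic classifier, keep the k largest-magnitude
    product terms, and retrain the pruned model (least-squares sketch)."""
    F = quad_features(X)
    d = X.shape[1]
    w_full, *_ = np.linalg.lstsq(F, y, rcond=None)
    keep = np.argsort(np.abs(w_full[1 + d:]))[::-1][:k]  # k best product terms
    mask = np.zeros(F.shape[1], dtype=bool)
    mask[:1 + d] = True                                   # always keep constant + linear
    mask[1 + d + keep] = True
    w = np.zeros(F.shape[1])
    w[mask], *_ = np.linalg.lstsq(F[:, mask], y, rcond=None)
    return w, mask

# Illustration on synthetic data whose true boundary is quadratic.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 4))
y = np.sign(X[:, 0] * X[:, 1] + 0.5 * X[:, 2])
w, mask = train_sq(X, y, k=3)
accuracy = np.mean(np.sign(quad_features(X) @ w) == y)
```

Pruning by weight magnitude is justified here only because the inputs are drawn from a spherical distribution, as the text above notes; for correlated inputs, a magnitude-based criterion would not be a valid surrogate for the increase in training error.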
and recording the associated capacities for the linear versus various SQ curves (see Figure 5B). The contribution of the choice flexibility is evident here, leading to capacity boosts in excess of that predicted by explicit parameter counts. For example, at a fixed 1% error rate in 10 dimensions, addition of the 10 best quadratic terms to the 11 linear and constant terms boosted the trainable capacity by more than a factor of 3, whereas the number of explicit parameters grew by less than a factor of 2. By analogy with the special cases of linear and full quadratic classifiers, we sought to identify subfamilies of isomorphic SQ classifiers within which the relative capacity of any two classifiers was given by R_{i/j}. In doing so, however, two complications arise. First, it is a priori unclear what condition(s) on d and k bind together families of isomorphic SQ classifiers in general, if such families exist. Second, even if such families do exist, w no longer cancels in the expression for R_{i/j}, as it does for both linear and full quadratic classifiers. In cases where w is unknown, therefore, evaluation of R_{i/j} is prevented. Regarding the criterion for classifier isomorphism, we found empirically that error rates for SQ classifiers i and j in dimensions d_i and d_j could be brought into register across dimension with a simple capacity scaling, as
Figure 5: Performance of SQ classifiers in 10 dimensions. (A) Performance curves (classification error versus number of training patterns) for various values of k, from k = 0 (linear) through k = 10, 20, 30, and 40 to k = 55 (full quadratic). (B) Ratio of the memory capacity of SQ classifiers in 10 dimensions relative to a linear classifier (k = 0), plotted as a function of the error rate.
long as the relation

$$\frac{k_i}{K_i} = \frac{k_j}{K_j} \qquad (2.5)$$
was obeyed. This suggests that a key geometric invariance in the SQ family involves the proportion of higher-order terms included in the classifier (see Figure 6B). This rule also operates correctly for the special cases of linear (k = 0) and full quadratic (k = K) classifiers. By contrast, another plausible condition on d and k failed to bind together sets of isomorphic SQ classifiers: classifiers that use the same fraction of their complete parameter set (see Figure 6A). Having established that two SQ classifiers with equal product term ratios p = k/K produce performance curves that are isomorphic up to a scaling of their capacities, the key question remaining was whether the scaling factor would be predicted by R_{i/j}. Assuming w_i = w_j as for the linear and full quadratic cases, we note that the relative capacity of two isomorphic SQ classifiers is again independent of w and approaches (d_i/d_j)^2 for large d_i and d_j. This occurs because as d_i and d_j grow large, the ratio of the two value capacities considered separately, and the two choice capacities considered separately, both individually approach (d_i/d_j)^2. However, for the lower-dimensional cases included in our empirical studies, the capacity ratio R_{i/j} does not entirely shed its dependence on w and p. We found that a value of w = 4 produced good fits to the empirically determined scaling factors for p-matched SQ
Capacity of a Subsampled Quadratic Classifier

[Figure 6 appears here. (A) "Comparison of Parameter Bits Used for Isomorphic Capacity Curves": parameters used in the target dimension (% of total parameters) versus parameters used in 10 dimensions (% of total parameters), for d = 20 and d = 30. (B) "Comparison of 2nd order Terms Used for Isomorphic Capacity Curves": product term ratio in the target dimension (%) versus product term ratio p in 10 dimensions (%), for d = 20 and d = 30.]
Figure 6: Relationship between SQ classifiers in different dimensions whose performance curves were isomorphic. (A) The nonlinear relation shown indicates that no simple proportionality exists between the total parameter counts (1 + d + k) used by two SQ classifiers whose performance curves were found to be isomorphic up to an x-axis scaling. 20-d (crosses) and 30-d (squares) classifiers were compared to a 10-d reference case. (B) In contrast, a linear relation (equality) holds between the product term ratios p = k/K for two isomorphic SQ classifiers.
classifiers in 10, 20, and 30 dimensions, for a wide range of p values tested, and for training sets drawn from both gaussian and uniform distributions² (see Figure 7). To eliminate uncertainty in the value of w, we trained binary-valued SQ classifiers so that w could be assigned a value of 1 in the expression for R_{i/j}. In this case, SQ classifiers were trained as before, but prior to testing, the real-valued coefficients were mapped to ±1 according to their sign. This radical quantization of weight values led to large increases in error rates and gross alterations in the form of the error curves; compare Figure 8A to Figure 7A. However, R_{i/j} could now be evaluated with no free parameters and continued to produce nearly perfect superpositions of error rate curves within isomorphic families (see Figure 8). The nonmonotonicities seen in these binary-valued cases are of unknown origin. However, given that they scale across dimension, we infer they arise from consistent, family-specific idiosyncrasies in the number, and independence, of the dichotomies realizable for particular training set distributions.
² Fits using w = 3 were nearly as good; the scaling factor's weak dependence on w is explained by the fact that w appears in both the numerator and denominator of R_{i/j}.
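To make the bookkeeping concrete, the parameter counts above can be sketched in a few lines. This is our illustration, not code from the article; the function names are ours. An SQ classifier in d dimensions has 1 + d + k parameters, K = d(d+1)/2 second-order terms are available (K = 55 for d = 10, the full quadratic case of Figure 5), and equation 2.5 matches classifiers across dimensions by their product term ratio p = k/K:

```python
def num_quadratic_terms(d):
    """K: number of second-order terms (squares and pairwise products)
    available in d input dimensions; K = 55 for d = 10."""
    return d * (d + 1) // 2

def total_parameters(d, k):
    """Total parameter count of an SQ classifier: 1 + d + k."""
    return 1 + d + k

def matched_k(d_target, d_ref, k_ref):
    """Equation 2.5: choose k in the target dimension so that the product
    term ratio p = k/K matches the reference classifier (quantization of
    k makes the match approximate)."""
    p = k_ref / num_quadratic_terms(d_ref)
    return round(p * num_quadratic_terms(d_target))

print(num_quadratic_terms(10))        # 55 (full quadratic for d = 10)
print(matched_k(20, 10, 12))          # p-matched partner of (d = 10, k = 12)
```

Note that rounding k to an integer is exactly the quantization the Figure 7 caption mentions, so matched values are approximate.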
[Figure 7 appears here: classification error (%) versus training patterns per parameter bit. (A) "SQ Performance Curves with Analog Weights Scale Across Dimension": gaussian-trained curves superimpose within the p ≈ 21% and p ≈ 45% groups. (B) "SQ Curves with Analog Weights Have Different Shapes for Uniform vs. Gaussian Distributions": uniform-trained curves (p ≈ 18%) also superimpose across dimension but differ in shape from the gaussian case.]
Figure 7: Error rates for SQ classifiers in different dimensions could be superimposed when matched for product term ratio p = k/K. (A) Error rates were equivalent for isomorphic SQ classifiers in 10, 20, and 30 dimensions when training set size was normalized by capacity as given by equation 2.1, with w = 4. The two groups of curves shown are for p ≈ 21% and p ≈ 45%. The approximation arises from quantization of k values. The different shapes of the two groups of curves imply that classifiers with different p-ratios perform differently per bit in learning the same training set. (B) When training patterns were drawn from a uniform distribution, normalized capacity curves for isomorphic SQ classifiers continued to match across dimension, although the shapes of the curves were different from those produced by the same classifiers trained with gaussian samples.

3 Discussion
We derived a simple expression for the total parameter flexibility of a subsampled quadratic classifier, with separate factors for value versus choice capacity. To validate the form of the expression, we (i) identified families of geometrically isomorphic SQ classifiers, with linear and full quadratic discriminant functions as special cases; (ii) measured relative capacities within these families; (iii) demonstrated that equation 2.2 predicted capacity scaling factors for SQ classifiers in 10 to 30 dimensions with one free parameter for classifiers with real-valued coefficients, and with no free parameters for classifiers with binary-valued coefficients; and (iv) showed that for classifiers with real-valued or binary-valued coefficients, both the predicted and the empirically derived capacity scaling factors were consistent for two very different training set distributions, suggesting that the expression for capacity scaling in equation 2.2 may hold for any well-behaved input distribution. The quantification of capacity in equation 2.1 is not yet supported by an exact function counting theorem as has been worked out for linear and full
[Figure 8 appears here: classification error (%) versus training patterns per parameter bit for binary-weight SQ classifiers. (A) "SQ Performance Curves with Binary Weights Scale Across Dimension": gaussian-trained curves at k/K ≈ 21% and ≈ 45%. (B) "SQ Curves with Binary Weights Have Different Shapes for Uniform vs. Gaussian Distributions": uniform-trained curves at k/K ≈ 18%.]
Figure 8: Error curves for p-matched SQ classifiers trained with gaussian-distributed samples and constrained to use binary-valued ±1 weights. Superposition of curves was achieved using capacity scaling factor R_{i/j} with no free parameters, since w could be set to 1 in this case.
rth-order polynomials (Cover, 1965), nor has it been derived from a fundamental statistical mechanical perspective as has been done, for example, for the simple perceptron (Riegler & Seung, 1997)—and we cannot predict how difficult it will be to see through either of these approaches. In its favor, however, the simple form of equation 2.1, coupled with the condition for classifier isomorphism given in equation 2.5, together account for much of the empirical variance in model capacity expressed by this rather complex class of learning machines. It is therefore likely that these two quantitative relations will be echoed in future theorems and proofs bearing on the class of subsampled polynomial classifiers. While choice capacity is in some sense invisible, it is nonetheless very real. Thus, when a single optimally chosen nonlinear basis function is added to a linear model, the system behaves as if it gained not only an additional w bits of value, or coefficient, capacity, but a bonus capacity of log₂(K) bits owing to the choice that was made. When K is large and w is small, this bonus can be highly significant. It seems plausible that the bit-counting approach of equation 2.1 could be useful in quantifying the capacity of other types of classifiers that adaptively subsample from large families of nonlinear basis functions—for example, a classifier that draws an optimal set of k radial basis functions from a complete set of K basis functions tiling an input space. The practical significance of choice capacity is most acute for hardware-based learning systems, possibly including the brain, where in the limits of low-resolution weight values (Petersen, Malenka, Nicoll, & Hopfield,
1998) and very high-dimensional input spaces, learning capacity may be dominated by the flexibility to choose from a huge set of candidate nonlinear basis functions. As a corollary, the capacity of such learning systems could be significantly underestimated if based solely on counts of existing units or synaptic connections—a strategy that, by contrast, is appropriate when used to estimate the capacity of a fully connected neural network (Baum & Haussler, 1989).

Appendix: Derivation of Formula for Error Rate of Spherical Perceptron in Limit of Large N
We derived an expression for perceptron error rates using a simple geometric argument valid in the limit of large, randomly labeled training sets. When N patterns are drawn from a zero-mean spherical gaussian distribution G and randomly labeled with ±1, the resulting positive and negative training sets T_pos and T_neg may each be modeled as spherical gaussian blobs G_pos and G_neg, with means X_pos and X_neg slightly shifted from the origin. The sample means X_pos and X_neg are themselves distributed as d-dimensional gaussians, but with reduced variance given by

    σ′ = σ / √(N/2).    (A.1)

Given that most of the mass of a high-dimensional spherical gaussian with variance σ² is contained in a thin shell around radius σ√d, we may view X_pos and X_neg as randomly oriented vectors of length

    L = σ′ √d = σ √(2d/N).    (A.2)

Since in high dimension X_pos and X_neg are also orthogonal, the expected distance between them is given by ΔX = √2 L. The optimal linear discriminant function is thus a hyperplane that bisects, and is perpendicular to, the line connecting X_pos and X_neg. Given that T_pos and T_neg have equal prior probability, the expected classification error is given by the integrated probability contained in G_pos (or G_neg) lying beyond the discriminating hyperplane. Since G_pos is factorial, this calculation reduces to a complementary error function—the total probability lying beyond the distance ΔX/2 = σ √(d/N) from the origin of a one-dimensional gaussian. Thus, in the limit of high dimension with N ≫ d, the expected classification error is

    CE = (1/2) erfc( √(d/(2N)) ),    (A.3)

using the standard substitution z = x/(√2 σ), and the complementary error function defined by erfc(z) = (2/√π) ∫_z^∞ e^(−t²) dt.
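Equation A.3 is easy to check by direct simulation. The sketch below is ours, not the authors' code: it draws N standard-normal patterns, labels them at random, places the hyperplane bisecting (and perpendicular to) the segment between the two class sample means, and compares the resulting training error to the closed-form prediction.

```python
import numpy as np
from math import erfc, sqrt

def predicted_error(d, n):
    """Equation A.3: expected error of the optimal hyperplane on a
    randomly labeled spherical gaussian training set (valid for n >> d)."""
    return 0.5 * erfc(sqrt(d / (2.0 * n)))

def monte_carlo_error(d, n, seed=0):
    """Empirical counterpart: random +/-1 labels, hyperplane bisecting
    the line between the class sample means, training error measured."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    y = rng.choice([-1, 1], size=n)
    mu_pos = x[y == 1].mean(axis=0)
    mu_neg = x[y == -1].mean(axis=0)
    w = mu_pos - mu_neg                   # normal to the hyperplane
    b = -0.5 * (mu_pos + mu_neg) @ w      # offset of the bisecting plane
    pred = np.where(x @ w + b > 0, 1, -1)
    return float(np.mean(pred != y))

d, n = 50, 4000
print(predicted_error(d, n), monte_carlo_error(d, n))
```

For N ≫ d the two numbers agree to within Monte Carlo noise; both sit just below chance (0.5), since random dichotomies are only weakly learnable by a single hyperplane.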
In a variant of this classification problem, the N training samples drawn from G are all assigned to T_pos, while T_neg is taken to be G itself. This problem thus requires responding yes to the largest possible fraction of training samples, while saying no to the largest possible fraction of the probability density contained in G. Assuming equal priors for positive and negative exemplars during testing, the optimal separating hyperplane in this case cuts halfway through, and perpendicular to, the line connecting X_pos to the origin. This reduces ΔX, and hence the z score in equation A.3, by a factor of √2, to give a classification error of CE = (1/2) erfc( √(d/(4N)) ). This was dubbed the "déjà vu" learning problem and analyzed by Pearlmutter (1992) using a somewhat different approach.

Acknowledgments
Thanks to Barak Pearlmutter, Dan Ruderman, Chris Williams, and Lyle Borg-Graham for helpful comments and suggestions. Funding for this work was provided by the NSF, the ONR, and a Myronis fellowship.

References

Adelson, E., & Bergen, J. (1985). Spatiotemporal energy models for the perception of motion. J. Opt. Soc. Amer., A2, 284–299.
Barron, R., Mucciardi, A., Cook, F., Craig, J., & Barron, A. (1984). Adaptive learning networks: Development and applications in the United States of algorithms related to GMDH. In S. Farlow (Ed.), Self-organizing methods in modeling. New York: Marcel Dekker.
Baum, E., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Borg-Graham, L., & Grzywacz, N. (1992). A model of the direction selectivity circuit in retina: Transformations by neurons singly and in concert. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 347–375). Orlando, FL: Academic Press.
Cover, T. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. Electronic Computers, 14, 326–334.
Ghosh, J., & Shin, Y. (1992). Efficient higher order neural networks for classification and function approximation. International Journal of Neural Systems, 3, 323–350.
Giles, C., & Maxwell, T. (1987). Learning, invariance, and generalization in high-order neural networks. Applied Optics, 26, 4972–4978.
Guyon, I., Boser, B., & Vapnik, V. (1993). Automatic capacity tuning of very large VC-dimension classifiers. In S. Hanson, J. Cowan, & C. Giles (Eds.), Advances in
neural information processing systems, 5 (pp. 147–155). San Mateo, CA: Morgan Kaufmann.
Heeger, D. (1992). Half-squaring in responses of cat striate cells. Visual Neurosci., 9, 427–443.
Johnston, D., Magee, J., Colbert, C., & Christie, B. (1996). Active properties of neuronal dendrites. Ann. Rev. Neurosci., 19, 165–186.
Karpinsky, M., & Werther, T. (1993). VC dimension and uniform learnability of sparse polynomials and rational functions. SIAM J. Comput., 22, 1276–1285.
Koch, C., & Poggio, T. (1992). Multiplying with synapses and neurons. In T. McKenna, J. Davis, & S. Zornetzer (Eds.), Single neuron computation (pp. 315–345). Orlando, FL: Academic Press.
Koch, C., Poggio, T., & Torre, V. (1982). Retinal ganglion cells: A functional interpretation of dendritic morphology. Phil. Trans. R. Soc. Lond. B, 298, 227–264.
Mel, B. W. (1992a). The clusteron: Toward a simple abstraction for a complex neuron. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 35–42). San Mateo, CA: Morgan Kaufmann.
Mel, B. W. (1992b). NMDA-based pattern discrimination in a modeled cortical neuron. Neural Comp., 4, 502–516.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70(3), 1086–1101.
Mel, B. W. (1994). Information processing in dendritic trees. Neural Computation, 6, 1031–1085.
Mel, B. W. (1999). Why have dendrites? A computational perspective. In G. Stuart, N. Spruston, & M. Häusser (Eds.), Dendrites (pp. 271–289). Oxford: Oxford University Press.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998a). Toward a single cell account of binocular disparity tuning: An energy model may be hiding in your dendrites. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 208–214). Cambridge, MA: MIT Press.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998b). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. J. Neurosci., 17, 4325–4334.
Ohzawa, I., DeAngelis, G., & Freeman, R. (1997). Encoding of binocular disparity by complex cells in the cat's visual cortex. J. Neurophysiol., 77.
Pearlmutter, B. A. (1992). How selective can a linear threshold unit be? In International Joint Conference on Neural Networks. Beijing, PRC: IEEE.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3–CA1 synapses. Proc. Natl. Acad. Sci. USA, 95, 4732–4737.
Poggio, T. (1975). Optimal nonlinear associative recall. Biol. Cybern., 9, 201.
Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. Sys. Man Cybern., 13, 907–916.
Rall, W., & Segev, I. (1987). Functional possibilities for synapses on dendrites and on dendritic spines. In G. Edelman, W. Gall, & W. Cowan (Eds.), Synaptic
function (pp. 605–636). New York: Wiley.
Riegler, P., & Seung, H. S. (1997). Vapnik-Chervonenkis entropy of the spherical perceptron. Phys. Rev. E, 55, 3283–3287.
Schurmann, J. (1996). Pattern classification: A unified view of statistical and neural approaches. New York: Wiley.
Segev, I., & Rall, W. (1998). Excitable dendrites and spines—earlier theoretical insights elucidate recent direct observations. Trends Neurosci., 21, 453–460.
Shepherd, G., & Brayton, R. (1987). Logic operations are properties of computer-simulated interactions between excitable dendritic spines. Neurosci., 21, 151–166.
Stuart, G., Spruston, N., Sakmann, B., & Häusser, M. (1997). Action potential initiation and backpropagation in neurons of the mammalian CNS. TINS, 20, 125–131.
Stuart, G., Spruston, N., & Häusser, M. (Eds.). (1999). Dendrites. Oxford: Oxford University Press.
Vapnik, V. N., & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16, 264–280.
Yuste, R., & Tank, D. W. (1996). Dendritic integration in mammalian neurons, a century after Cajal. Neuron, 16, 701–716.
Zador, A., Claiborne, B., & Brown, T. (1992). Nonlinear pattern separation in single hippocampal neurons with active dendritic membrane. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 51–58). San Mateo, CA: Morgan Kaufmann.
Zador, A., & Pearlmutter, B. A. (1996). VC dimension of an integrate-and-fire model neuron. Neural Computation, 8, 611–624.

Received July 1, 1998; accepted July 20, 1999.
LETTER
Communicated by John Platt
New Support Vector Algorithms*

Bernhard Schölkopf
Alex J. Smola
GMD FIRST, 12489 Berlin, Germany, and Department of Engineering, Australian National University, Canberra 0200, Australia

Robert C. Williamson
Department of Engineering, Australian National University, Canberra 0200, Australia

Peter L. Bartlett
RSISE, Australian National University, Canberra 0200, Australia

We propose a new class of support vector algorithms for regression and classification. In these algorithms, a parameter ν lets one effectively control the number of support vectors. While this can be useful in its own right, the parameterization has the additional benefit of enabling us to eliminate one of the other free parameters of the algorithm: the accuracy parameter ε in the regression case, and the regularization constant C in the classification case. We describe the algorithms, give some theoretical results concerning the meaning and the choice of ν, and report experimental results.

1 Introduction
Support vector (SV) machines comprise a new class of learning algorithms, motivated by results of statistical learning theory (Vapnik, 1995). Originally developed for pattern recognition (Vapnik & Chervonenkis, 1974; Boser, Guyon, & Vapnik, 1992), they represent the decision boundary in terms of a typically small subset (Schölkopf, Burges, & Vapnik, 1995) of all training examples, called the support vectors. In order for this sparseness property to carry over to the case of SV regression, Vapnik devised the so-called ε-insensitive loss function,

    |y − f(x)|_ε = max{0, |y − f(x)| − ε},    (1.1)

which does not penalize errors below some ε > 0, chosen a priori. His algorithm, which we will henceforth call ε-SVR, seeks to estimate functions,

    f(x) = (w · x) + b,  w, x ∈ R^N, b ∈ R,    (1.2)

* Present address: Microsoft Research, 1 Guildhall Street, Cambridge, U.K.
Neural Computation 12, 1207–1245 (2000)  © 2000 Massachusetts Institute of Technology
based on independent and identically distributed (i.i.d.) data,

    (x_1, y_1), . . . , (x_ℓ, y_ℓ) ∈ R^N × R.    (1.3)

Here, R^N is the space in which the input patterns live, but most of the following also applies for inputs from a set X. The goal of the learning process is to find a function f with a small risk (or test error),

    R[f] = ∫ l(f, x, y) dP(x, y),    (1.4)

where P is the probability measure, which is assumed to be responsible for the generation of the observations (see equation 1.3), and l is a loss function, for example, l(f, x, y) = (f(x) − y)², or many other choices (Smola & Schölkopf, 1998). The particular loss function for which we would like to minimize equation 1.4 depends on the specific regression estimation problem at hand. This does not necessarily have to coincide with the loss function used in our learning algorithm. First, there might be additional constraints that we would like our regression estimation to satisfy, for instance, that it have a sparse representation in terms of the training data. In the SV case, this is achieved through the insensitive zone in equation 1.1. Second, we cannot minimize equation 1.4 directly in the first place, since we do not know P. Instead, we are given the sample, equation 1.3, and we try to obtain a small risk by minimizing the regularized risk functional,

    (1/2) ||w||² + C · R_emp^ε [f].    (1.5)

Here, ||w||² is a term that characterizes the model complexity,

    R_emp^ε [f] := (1/ℓ) Σ_{i=1}^ℓ |y_i − f(x_i)|_ε,    (1.6)

measures the ε-insensitive training error, and C is a constant determining the trade-off. In short, minimizing equation 1.5 captures the main insight of statistical learning theory, stating that in order to obtain a small risk, one needs to control both training error and model complexity—that is, explain the data with a simple model. The minimization of equation 1.5 is equivalent to the following constrained optimization problem (see Figure 1):

    minimize  τ(w, ξ^(*)) = (1/2) ||w||² + C · (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i*),    (1.7)
Figure 1: In SV regression, a desired accuracy ε is specified a priori. It is then attempted to fit a tube with radius ε to the data. The trade-off between model complexity and points lying outside the tube (with positive slack variables ξ) is determined by minimizing the expression 1.5.
subject to

    ((w · x_i) + b) − y_i ≤ ε + ξ_i    (1.8)
    y_i − ((w · x_i) + b) ≤ ε + ξ_i*    (1.9)
    ξ_i^(*) ≥ 0.    (1.10)
Here and below, it is understood that i = 1, . . . , ℓ, and that boldface Greek letters denote ℓ-dimensional vectors of the corresponding variables; (*) is a shorthand implying both the variables with and without asterisks. By using Lagrange multiplier techniques, one can show (Vapnik, 1995) that this leads to the following dual optimization problem.

    Maximize  W(α, α*) = −ε Σ_{i=1}^ℓ (α_i* + α_i) + Σ_{i=1}^ℓ (α_i* − α_i) y_i
                         − (1/2) Σ_{i,j=1}^ℓ (α_i* − α_i)(α_j* − α_j)(x_i · x_j)    (1.11)

    subject to  Σ_{i=1}^ℓ (α_i − α_i*) = 0    (1.12)

                α_i^(*) ∈ [0, C/ℓ].    (1.13)
The resulting regression estimates are linear; however, the setting can be generalized to a nonlinear one by using the kernel method. As we will use precisely this method in the next section, we shall omit its exposition at this point.
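As a concrete illustration (ours, not part of the article), the ε-insensitive loss of equation 1.1 and the empirical risk of equation 1.6 amount to only a few lines of code; the predictor outputs below are arbitrary stand-ins:

```python
import numpy as np

def eps_insensitive_loss(y, f_x, eps):
    """Equation 1.1: |y - f(x)|_eps = max(0, |y - f(x)| - eps)."""
    return np.maximum(0.0, np.abs(y - f_x) - eps)

def empirical_risk(y, f_x, eps):
    """Equation 1.6: mean eps-insensitive training error R_emp^eps."""
    return eps_insensitive_loss(y, f_x, eps).mean()

y   = np.array([0.0, 1.0, 2.0, 3.0])      # targets
f_x = np.array([0.2, 0.9, 2.8, 3.1])      # some predictor's outputs
print(eps_insensitive_loss(y, f_x, eps=0.3))  # residuals inside the tube cost nothing
print(empirical_risk(y, f_x, eps=0.3))
```

Only the third point, whose residual 0.8 exceeds ε = 0.3, contributes to the risk; the other three sit inside the tube.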
To motivate the new algorithm that we shall propose, note that the parameter ε can be useful if the desired accuracy of the approximation can be specified beforehand. In some cases, however, we want the estimate to be as accurate as possible without having to commit ourselves to a specific level of accuracy a priori. In this work, we first describe a modification of the ε-SVR algorithm, called ν-SVR, which automatically minimizes ε. Following this, we present two theoretical results on ν-SVR concerning the connection to robust estimators (section 3) and the asymptotically optimal choice of the parameter ν (section 4). Next, we extend the algorithm to handle parametric insensitivity models that allow taking into account prior knowledge about heteroscedasticity of the noise. As a bridge connecting this first theoretical part of the article to the second one, we then present a definition of a margin that both SV classification and SV regression algorithms maximize (section 6). In view of this close connection between both algorithms, it is not surprising that it is possible to formulate also a ν-SV classification algorithm. This is done, including some theoretical analysis, in section 7. We conclude with experiments and a discussion.

2 ν-SV Regression
To estimate functions (see equation 1.2) from empirical data (see equation 1.3) we proceed as follows (Schölkopf, Bartlett, Smola, & Williamson, 1998). At each point x_i, we allow an error of ε. Everything above ε is captured in slack variables ξ_i^(*), which are penalized in the objective function via a regularization constant C, chosen a priori (Vapnik, 1995). The size of ε is traded off against model complexity and slack variables via a constant ν ≥ 0:

    minimize  τ(w, ξ^(*), ε) = (1/2) ||w||² + C · ( νε + (1/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i*) )    (2.1)

    subject to  ((w · x_i) + b) − y_i ≤ ε + ξ_i    (2.2)

                y_i − ((w · x_i) + b) ≤ ε + ξ_i*    (2.3)

                ξ_i^(*) ≥ 0,  ε ≥ 0.    (2.4)

For the constraints, we introduce multipliers α_i^(*), η_i^(*), β ≥ 0, and obtain the Lagrangian

    L(w, b, α^(*), β, ξ^(*), ε, η^(*))
      = (1/2) ||w||² + Cνε + (C/ℓ) Σ_{i=1}^ℓ (ξ_i + ξ_i*) − βε − Σ_{i=1}^ℓ (η_i ξ_i + η_i* ξ_i*)
        − Σ_{i=1}^ℓ α_i (ξ_i + y_i − (w · x_i) − b + ε)
        − Σ_{i=1}^ℓ α_i* (ξ_i* + (w · x_i) + b − y_i + ε).    (2.5)
To minimize the expression 2.1, we have to find the saddle point of L—that is, minimize over the primal variables w, ε, b, ξ_i^(*) and maximize over the dual variables α_i^(*), β, η_i^(*). Setting the derivatives with respect to the primal variables equal to zero yields four equations:

    w = Σ_i (α_i* − α_i) x_i    (2.6)

    C · ν − Σ_i (α_i + α_i*) − β = 0    (2.7)

    Σ_{i=1}^ℓ (α_i − α_i*) = 0    (2.8)

    C/ℓ − α_i^(*) − η_i^(*) = 0.    (2.9)

In the SV expansion, equation 2.6, only those α_i^(*) will be nonzero that correspond to a constraint, equation 2.2 or 2.3, which is precisely met; the corresponding patterns are called support vectors. This is due to the Karush-Kuhn-Tucker (KKT) conditions that apply to convex constrained optimization problems (Bertsekas, 1995). If we write the constraints as g(x_i, y_i) ≥ 0, with corresponding Lagrange multipliers α_i, then the solution satisfies α_i · g(x_i, y_i) = 0 for all i.

Substituting the above four conditions into L leads to another optimization problem, called the Wolfe dual. Before stating it explicitly, we carry out one further modification. Following Boser et al. (1992), we substitute a kernel k for the dot product, corresponding to a dot product in some feature space related to input space via a nonlinear map Φ,

    k(x, y) = (Φ(x) · Φ(y)).    (2.10)
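As an aside not in the original article, the kernel substitution of equation 2.10 can be verified numerically for a small case. For the degree-2 polynomial kernel k(x, y) = (x · y)² in two dimensions, an explicit feature map is Φ(x) = (x₁², √2 x₁x₂, x₂²); the names below are ours:

```python
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return float(np.dot(x, y)) ** 2

def phi(x):
    """Explicit feature map whose dot product reproduces poly2_kernel
    for 2-d inputs: phi(x) = (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])
print(poly2_kernel(x, y))              # computed entirely in input space
print(float(np.dot(phi(x), phi(y))))   # same value via the feature space
```

This is the point of the kernel method: the left computation never forms the (possibly very high-dimensional) feature vectors, yet produces exactly the feature-space dot product.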
By using k, we implicitly carry out all computations in the feature space that Φ maps into, which can have a very high dimensionality. The feature space has the structure of a reproducing kernel Hilbert space (Wahba, 1999; Girosi, 1998; Schölkopf, 1997), and hence minimization of ||w||² can be understood in the context of regularization operators (Smola, Schölkopf, & Müller, 1998). The method is applicable whenever an algorithm can be cast in terms of dot products (Aizerman, Braverman, & Rozonoer, 1964; Boser et al., 1992; Schölkopf, Smola, & Müller, 1998). The choice of k is a research topic in its
own right that we shall not touch here (Williamson, Smola, & Schölkopf, 1998; Schölkopf, Shawe-Taylor, Smola, & Williamson, 1999); typical choices include gaussian kernels, k(x, y) = exp(−||x − y||² / (2σ²)), and polynomial kernels, k(x, y) = (x · y)^d (σ > 0, d ∈ N).

Rewriting the constraints, and noting that β, η_i^(*) ≥ 0 do not appear in the dual, we arrive at the ν-SVR optimization problem: for ν ≥ 0, C > 0,

    maximize  W(α^(*)) = Σ_{i=1}^ℓ (α_i* − α_i) y_i
                         − (1/2) Σ_{i,j=1}^ℓ (α_i* − α_i)(α_j* − α_j) k(x_i, x_j)    (2.11)

    subject to  Σ_{i=1}^ℓ (α_i − α_i*) = 0    (2.12)

                α_i^(*) ∈ [0, C/ℓ]    (2.13)

                Σ_{i=1}^ℓ (α_i + α_i*) ≤ C · ν.    (2.14)

The regression estimate then takes the form (cf. equations 1.2, 2.6, and 2.10),

    f(x) = Σ_{i=1}^ℓ (α_i* − α_i) k(x_i, x) + b,    (2.15)
where b (and ε) can be computed by taking into account that equations 2.2 and 2.3 (substitution of Σ_j (α_j* − α_j) k(x_j, x) for (w · x) is understood; cf. equations 2.6 and 2.10) become equalities with ξ_i^(*) = 0 for points with 0 < α_i^(*) < C/ℓ, respectively, due to the KKT conditions.

Before we give theoretical results explaining the significance of the parameter ν, the following observation concerning ε is helpful. If ν > 1, then ε = 0, since it does not pay to increase ε (cf. equation 2.1). If ν ≤ 1, it can still happen that ε = 0—for example, if the data are noise free and can be perfectly interpolated with a low-capacity model. The case ε = 0, however, is not what we are interested in; it corresponds to plain L1-loss regression. We will use the term errors to refer to training points lying outside the tube¹ and the term fraction of errors or SVs to denote the relative numbers of

¹ For N > 1, the "tube" should actually be called a slab—the region between two parallel hyperplanes.
errors or SVs (i.e., divided by ℓ). In this proposition, we define the modulus of absolute continuity of a function f as the function ε(δ) = sup Σ_i |f(b_i) − f(a_i)|, where the supremum is taken over all disjoint intervals (a_i, b_i) with a_i < b_i satisfying Σ_i (b_i − a_i) < δ. Loosely speaking, the condition on the conditional density of y given x asks that it be absolutely continuous "on average."

Proposition 1. Suppose ν-SVR is applied to some data set, and the resulting ε is nonzero. The following statements hold:

i. ν is an upper bound on the fraction of errors.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 1.3) were generated i.i.d. from a distribution P(x, y) = P(x) P(y|x) with P(y|x) continuous and the expectation of the modulus of absolute continuity of its density satisfying lim_{δ→0} E ε(δ) = 0. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of errors.

Proof. Ad (i). The constraints, equations 2.13 and 2.14, imply that at most a fraction ν of all examples can have α_i^(*) = C/ℓ. All examples with ξ_i^(*) > 0 (i.e., those outside the tube) certainly satisfy α_i^(*) = C/ℓ (if not, α_i^(*) could grow further to reduce ξ_i^(*)).

Ad (ii). By the KKT conditions, ε > 0 implies β = 0. Hence, equation 2.14 becomes an equality (cf. equation 2.7).² Since SVs are those examples for which 0 < α_i^(*) ≤ C/ℓ, the result follows (using α_i · α_i* = 0 for all i; Vapnik, 1995).

Ad (iii). The strategy of proof is to show that asymptotically, the probability of a point lying on the edge of the tube vanishes. The condition on P(y|x) means that
    sup_{f,t} E[ P( |f(x) + t − y| < γ | x ) ] < δ(γ)    (2.16)

for some function δ(γ) that approaches zero as γ → 0. Since the class of SV regression estimates f has well-behaved covering numbers, we have (Anthony & Bartlett, 1999, chap. 21) that for all t,

    Pr( sup_f [ P̂_ℓ(|f(x) + t − y| < γ/2) − P(|f(x) + t − y| < γ) ] > α ) < c₁ c₂^{−ℓ},

² In practice, one can alternatively work with equation 2.14 as an equality constraint.
where P̂_ℓ is the sample-based estimate of P (that is, the proportion of points that satisfy |f(x) − y + t| < γ), and c₁, c₂ may depend on γ and α. Discretizing the values of t, taking the union bound, and applying equation 2.16 shows that the supremum over f and t of P̂_ℓ(f(x) − y + t = 0) converges to zero in probability. Thus, the fraction of points on the edge of the tube almost surely converges to 0. Hence the fraction of SVs equals that of errors. Combining statements i and ii then shows that both fractions converge almost surely to ν.

Hence, 0 ≤ ν ≤ 1 can be used to control the number of errors (note that for ν ≥ 1, equation 2.13 implies 2.14, since α_i · α_i* = 0 for all i (Vapnik, 1995)). Moreover, since the constraint, equation 2.12, implies that equation 2.14 is equivalent to Σ_i α_i^(*) ≤ Cν/2, we conclude that proposition 1 actually holds for the upper and the lower edges of the tube separately, with ν/2 each. (Note that by the same argument, the number of SVs at the two edges of the standard ε-SVR tube asymptotically agree.)

A more intuitive, albeit somewhat informal, explanation can be given in terms of the primal objective function (see equation 2.1). At the point of the solution, note that if ε > 0, we must have (∂/∂ε) τ(w, ε) = 0, that is, ν + (∂/∂ε) R_emp^ε = 0, hence ν = −(∂/∂ε) R_emp^ε. This is greater than or equal to the fraction of errors, since the points outside the tube certainly do contribute to a change in R_emp^ε when ε is changed. Points at the edge of the tube possibly might also contribute. This is where the inequality comes from. Note that this does not contradict our freedom to choose ν > 1. In that case, ε = 0, since it does not pay to increase ε (cf. equation 2.1).

Let us briefly discuss how ν-SVR relates to ε-SVR (see section 1). Both algorithms use the ε-insensitive loss function, but ν-SVR automatically computes ε.
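The informal derivative argument in proposition 1 can be checked numerically. The sketch below is ours, not the authors': it fixes an arbitrary predictor (so only the residuals matter), evaluates R_emp^ε by finite differences, and confirms that −∂R_emp^ε/∂ε equals the fraction of points strictly outside the tube, with edge points the only possible source of slack in the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
# |y_i - f(x_i)| for some fixed predictor f; the distribution is arbitrary.
residuals = np.abs(rng.standard_normal(1000))

def r_emp(eps):
    """Equation 1.6 for fixed residuals: mean eps-insensitive loss."""
    return np.maximum(0.0, residuals - eps).mean()

eps, h = 0.5, 1e-6
deriv = (r_emp(eps + h) - r_emp(eps - h)) / (2 * h)  # numerical dR_emp/d(eps)
frac_outside = float(np.mean(residuals > eps))       # fraction of "errors"
print(-deriv, frac_outside)
```

For residuals drawn from a continuous distribution, no point sits exactly on the tube edge, so the two printed numbers coincide; ties at the edge are what turn the equality into the inequality of proposition 1(i).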
From a Bayesian perspective, this automatic adaptation of the loss function could be interpreted as adapting the error model, controlled by the hyperparameter ν. Comparing equation 1.11 (substitution of a kernel for the dot product is understood) and equation 2.11, we note that ε-SVR requires an additional term, −ε Σ_{i=1}^{ℓ} (α_i* + α_i), which, for fixed ε > 0, encourages that some of the α_i^(*) will turn out to be 0. Accordingly, the constraint (see equation 2.14), which appears in ν-SVR, is not needed. The primal problems, equations 1.7 and 2.1, differ in the term νε. If ν = 0, then the optimization can grow ε arbitrarily large; hence zero empirical risk can be obtained even when all αs are zero.

In the following sense, ν-SVR includes ε-SVR. Note that in the general case, using kernels, w̄ is a vector in feature space.

Proposition 2. If ν-SVR leads to the solution ε̄, w̄, b̄, then ε-SVR with ε set a priori to ε̄, and the same value of C, has the solution w̄, b̄.
Proof. If we minimize equation 2.1, then fix ε and minimize only over the remaining variables, the solution does not change.
3 The Connection to Robust Estimators
Using the ε-insensitive loss function, only the patterns outside the ε-tube enter the empirical risk term, whereas the patterns closest to the actual regression have zero loss. This, however, does not mean that it is only the outliers that determine the regression. In fact, the contrary is the case.

Proposition 3 (resistance of SV regression). Using support vector regression with the ε-insensitive loss function (see equation 1.1), local movements of target values of points outside the tube do not influence the regression.

Proof. Shifting y_i locally does not change the status of (x_i, y_i) as being a point outside the tube. Then the dual solution α^(*) is still feasible; it satisfies the constraints (the point still has α_i^(*) = C/ℓ). Moreover, the primal solution, with ξ_i transformed according to the movement of y_i, is also feasible. Finally, the KKT conditions are still satisfied, as still α_i^(*) = C/ℓ. Thus (Bertsekas, 1995), α^(*) is still the optimal solution.
The proof relies on the fact that everywhere outside the tube, the upper bound on the α_i^(*) is the same. This is precisely the case if the loss function increases linearly outside the ε-tube (cf. Huber, 1981, for requirements on robust cost functions). Inside, we could use various functions, with a derivative smaller than that of the linear part.

For the case of the ε-insensitive loss, proposition 3 implies that essentially, the regression is a generalization of an estimator for the mean of a random variable that does the following:

- Throws away the largest and smallest examples (a fraction ν/2 of either category; in section 2, it is shown that the sum constraint, equation 2.12, implies that proposition 1 can be applied separately for the two sides, using ν/2).
- Estimates the mean by taking the average of the two extremal ones of the remaining examples.

This resistance concerning outliers is close in spirit to robust estimators like the trimmed mean. In fact, we could get closer to the idea of the trimmed mean, which first throws away the largest and smallest points and then computes the mean of the remaining points, by using a quadratic loss inside the ε-tube. This would leave us with Huber's robust loss function.
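Proposition 3 can be probed empirically: moving a point that is already outside the tube should leave the fitted regression numerically unchanged. A minimal sketch with scikit-learn's `SVR` (an external ε-SVR implementation; the data, the outlier placement, and all parameters are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, (80, 1))
y = 0.5 * X.ravel() + rng.normal(0, 0.05, 80)
y[0] = 4.0                      # a clear outlier, far outside any reasonable tube

def fit_predict(y_train):
    m = SVR(kernel="rbf", C=1.0, epsilon=0.2, tol=1e-6).fit(X, y_train)
    return m.predict(X)

p1 = fit_predict(y)
y_shifted = y.copy()
y_shifted[0] = 6.0              # move the outlier even further out
p2 = fit_predict(y_shifted)

# The outlier's multiplier is at its upper bound before and after the
# shift, so the fitted regression should be (numerically) unchanged.
print(float(np.max(np.abs(p1 - p2))))
```

The printed discrepancy reflects only solver tolerance, not a change in the solution.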
Note, moreover, that the parameter ν is related to the breakdown point of the corresponding robust estimator (Huber, 1981). Because it specifies the fraction of points that may be arbitrarily bad outliers, ν is related to the fraction of some arbitrary distribution that may be added to a known noise model without the estimator failing.

Finally, we add that by a simple modification of the loss function (White, 1994) (weighting the slack variables ξ^(*) above and below the tube in the target function, equation 2.1, by 2λ and 2(1 − λ), respectively, with λ ∈ [0, 1]), one can estimate generalized quantiles. The argument proceeds as follows. Asymptotically, all patterns have multipliers at bound (cf. proposition 1). The λ, however, changes the upper bounds in the box constraints applying to the two different types of slack variables to 2Cλ/ℓ and 2C(1 − λ)/ℓ, respectively. The equality constraint, equation 2.8, then implies that (1 − λ) and λ give the fractions of points (out of those which are outside the tube) that lie on the two sides of the tube, respectively.

4 Asymptotically Optimal Choice of ν
Using an analysis employing tools of information geometry (Murata, Yoshizawa, & Amari, 1994; Smola, Murata, Schölkopf, & Müller, 1998), we can derive the asymptotically optimal ν for a given class of noise models, in the sense of maximizing the statistical efficiency.³

Remark. Denote by P a density with unit variance,⁴ and by 𝒫 the family of noise models generated from P by 𝒫 := {p | p(y) = (1/σ) P(y/σ), σ > 0}. Moreover, assume that the data were generated i.i.d. from a distribution p(x, y) = p(x) p(y − f(x)) with p(y − f(x)) continuous. Under the assumption that SV regression produces an estimate f̂ converging to the underlying functional dependency f, the asymptotically optimal ν, for the estimation-of-location-parameter model of SV regression described in Smola, Murata, Schölkopf, & Müller (1998), is
\[
\nu = 1 - \int_{-\varepsilon}^{\varepsilon} P(t)\,dt,
\quad\text{where}\quad
\varepsilon := \operatorname*{argmin}_{\tau}\,
\bigl(P(-\tau) + P(\tau)\bigr)^{-2}
\left(1 - \int_{-\tau}^{\tau} P(t)\,dt\right).
\tag{4.1}
\]
³ This section assumes familiarity with some concepts of information geometry. A more complete explanation of the model underlying the argument is given in Smola, Murata, Schölkopf, & Müller (1998) and can be downloaded from http://svm.first.gmd.de.
⁴ P is a prototype generating the class of densities 𝒫. Normalization assumptions are made for ease of notation.
To see this, note that under the assumptions stated above, the probability of a deviation larger than ε, Pr{|y − f̂(x)| > ε}, will converge to

\[
\Pr\bigl\{|y - f(x)| > \varepsilon\bigr\}
= \int_{X \times \{\mathbb{R} \setminus [-\varepsilon, \varepsilon]\}} p(x)\,p(\xi)\,dx\,d\xi
= 1 - \int_{-\varepsilon}^{\varepsilon} p(\xi)\,d\xi.
\tag{4.2}
\]
This is also the fraction of samples that will (asymptotically) become SVs (proposition 1, iii). Therefore an algorithm generating a fraction ν = 1 − ∫_{−ε}^{ε} p(ξ) dξ of SVs will correspond to an algorithm with a tube of size ε. The consequence is that given a noise model p(ξ), one can compute the optimal ε for it, and then, by using equation 4.2, compute the corresponding optimal value ν.

To this end, one exploits the linear scaling behavior between the standard deviation σ of a distribution p and the optimal ε. This result, established in Smola, Murata, Schölkopf, & Müller (1998) and Smola (1998), cannot be proved here; instead, we shall merely try to give a flavor of the argument. The basic idea is to consider the estimation of a location parameter using the ε-insensitive loss, with the goal of maximizing the statistical efficiency. Using the Cramér-Rao bound and a result of Murata et al. (1994), the efficiency is found to be
\[
e\!\left(\frac{\varepsilon}{\sigma}\right) = \frac{Q^2}{G I}.
\tag{4.3}
\]
Here, I is the Fisher information, while Q and G are information-geometric quantities computed from the loss function and the noise model. This means that one only has to consider distributions of unit variance, say, P, to compute an optimal value of ν that holds for the whole class of distributions 𝒫. Using equation 4.3, one arrives at

\[
\frac{1}{e(\varepsilon)} \propto \frac{G}{Q^2}
= \bigl(P(-\varepsilon) + P(\varepsilon)\bigr)^{-2}
\left(1 - \int_{-\varepsilon}^{\varepsilon} P(t)\,dt\right).
\tag{4.4}
\]
The minimum of equation 4.4 yields the optimal choice of ε, which allows computation of the corresponding ν and thus leads to equation 4.1.

Consider now an example: arbitrary polynomial noise models (∝ e^{−|ξ|^p}) with unit variance can be written as

\[
P(\xi) = \frac{1}{2}\,\frac{p}{\Gamma(1/p)}
\sqrt{\frac{\Gamma(3/p)}{\Gamma(1/p)}}\;
\exp\!\left(-\left(\sqrt{\frac{\Gamma(3/p)}{\Gamma(1/p)}}\,|\xi|\right)^{\!p}\right),
\tag{4.5}
\]
where Γ denotes the gamma function. Table 1 shows the optimal value of ν for different polynomial degrees. Observe that the more “lighter-tailed”
Table 1: Optimal ν for Various Degrees of Polynomial Additive Noise.

Polynomial degree p   1     2     3     4     5     6     7     8
Optimal ν             1.00  0.54  0.29  0.19  0.14  0.11  0.09  0.07
the distribution becomes, the smaller the optimal ν; that is, the tube width increases. This is reasonable, as only for very long tails of the distribution (data with many outliers) does it appear reasonable to use an early cutoff on the influence of the data (by basically giving all data equal influence via α_i = C/ℓ). The extreme case of Laplacian noise (ν = 1) leads to a tube width of 0, that is, to L₁ regression.

We conclude this section with three caveats: first, we have made only an asymptotic statement; second, for nonzero ε, the SV regression need not necessarily converge to the target f: measured using |·|_ε, many other functions are just as good as f itself; third, the proportionality between ε and σ has only been established in the estimation-of-location-parameter context, which is not quite SV regression.

5 Parametric Insensitivity Models
We now return to the algorithm described in section 2. We generalized ε-SVR by estimating the width of the tube rather than taking it as given a priori. What we have so far retained is the assumption that the ε-insensitive zone has a tube (or slab) shape. We now go one step further and use parametric models of arbitrary shape. This can be useful in situations where the noise is heteroscedastic, that is, where it depends on x.

Let {f_q^{(*)}} (here and below, q = 1, ..., p is understood) be a set of 2p positive functions on the input space X. Consider the following quadratic program: for given ν_1^{(*)}, ..., ν_p^{(*)} ≥ 0, minimize

\[
\tau(w, \xi^{(*)}, \varepsilon^{(*)})
= \frac{\|w\|^2}{2}
+ C \cdot \left( \sum_{q=1}^{p} (\nu_q \varepsilon_q + \nu_q^* \varepsilon_q^*)
+ \frac{1}{\ell} \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \right)
\tag{5.1}
\]

subject to

\[
((w \cdot x_i) + b) - y_i \le \sum_q \varepsilon_q f_q(x_i) + \xi_i,
\tag{5.2}
\]
\[
y_i - ((w \cdot x_i) + b) \le \sum_q \varepsilon_q^* f_q^*(x_i) + \xi_i^*,
\tag{5.3}
\]
\[
\xi_i^{(*)} \ge 0, \quad \varepsilon_q^{(*)} \ge 0.
\tag{5.4}
\]
A calculation analogous to that in section 2 shows that the Wolfe dual consists of maximizing the expression 2.11 subject to the constraints 2.12 and 2.13 and, instead of 2.14, the modified constraints, still linear in α^{(*)},

\[
\sum_{i=1}^{\ell} \alpha_i^{(*)} f_q^{(*)}(x_i) \le C \cdot \nu_q^{(*)}.
\tag{5.5}
\]
In the experiments in section 8, we use a simplified version of this optimization problem, where we drop the term ν_q* ε_q* from the objective function, equation 5.1, and use ε_q and f_q in equation 5.3. By this, we render the problem symmetric with respect to the two edges of the tube. In addition, we use p = 1. This leads to the same Wolfe dual, except for the last constraint, which becomes (cf. equation 2.14)

\[
\sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) f(x_i) \le C \cdot \nu.
\tag{5.6}
\]
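A primal-side sketch of this simplified problem (p = 1, symmetric tube) in plain NumPy/SciPy. The 1-D linear model wx + b, the tube-shape function φ(x) = 1 + x² (written f in the text), the concrete data, and the direct elimination of the slacks via hinge terms are all illustrative assumptions; a proper solver would attack the Wolfe dual instead:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = 0.8 * x + (1 + x**2) * rng.normal(0, 0.15, 60)   # heteroscedastic noise
phi = 1 + x**2                                        # parametric tube shape
C, nu = 10.0, 0.5

def objective(theta):
    w, b, e_raw = theta
    eps = abs(e_raw)                     # keep the tube width nonnegative
    r = (w * x + b) - y                  # residuals
    # Slacks eliminated: xi + xi* = max(0, |r| - eps * phi) pointwise,
    # giving the objective of equation 5.1 for p = 1, symmetric tube.
    slack = np.maximum(0.0, np.abs(r) - eps * phi)
    return 0.5 * w**2 + C * (nu * eps + slack.mean())

theta0 = np.array([0.0, y.mean(), 1.0])
res = minimize(objective, theta0, method="Nelder-Mead",
               options={"xatol": 1e-8, "fatol": 1e-10, "maxiter": 20000})
w_hat, b_hat, eps_hat = res.x[0], res.x[1], abs(res.x[2])
print(res.fun, eps_hat)
```

The hinge substitution is valid because at the optimum at most one of ξ_i, ξ_i* is positive for each i.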
Note that the optimization problem of section 2 can be recovered by using the constant function f ≡ 1.⁵ The advantage of this setting is that since the same ν is used for both sides of the tube, the computation of ε and b is straightforward: for instance, by solving a linear system, using two conditions such as those described following equation 2.15. Otherwise, general statements are harder to make; the linear system can have a zero determinant, depending on whether the functions f_q^{(*)}, evaluated on the x_i with 0 < α_i^{(*)} < C/ℓ, are linearly dependent. The latter occurs, for instance, if we use constant functions f^{(*)} ≡ 1. In this case, it is pointless to use two different values ν, ν*; for the constraint (see equation 2.12) then implies that both sums Σ_{i=1}^{ℓ} α_i^{(*)} will be bounded by C · min{ν, ν*}. We conclude this section by giving, without proof, a generalization of proposition 1 to the optimization problem with constraint (see equation 5.6):

Proposition 4. Suppose we run the above algorithm on a data set with the result that ε > 0. Then

i. νℓ / Σ_i f(x_i) is an upper bound on the fraction of errors.

ii. νℓ / Σ_i f(x_i) is an upper bound on the fraction of SVs.
⁵ Observe the similarity to semiparametric SV models (Smola, Frieß, & Schölkopf, 1999), where a modification of the expansion of f led to similar additional constraints. The important difference in the present setting is that the Lagrange multipliers α_i and α_i* are treated equally and not with different signs, as in semiparametric modeling.
iii. Suppose the data in equation 1.3 were generated i.i.d. from a distribution P(x, y) = P(x) P(y|x) with P(y|x) continuous and the expectation of its modulus of continuity satisfying lim_{δ→0} E ε(δ) = 0. With probability 1, asymptotically, the fractions of SVs and errors equal ν · (∫ f(x) dP̃(x))^{−1}, where P̃ is the asymptotic distribution of SVs over x.

6 Margins in Regression and Classification

The SV algorithm was first proposed for the case of pattern recognition (Boser et al., 1992) and then generalized to regression (Vapnik, 1995). Conceptually, however, one can take the view that the latter case is actually the simpler one, providing an a posteriori justification for why we started this article with the regression case. To explain this, we will introduce a suitable definition of a margin that is maximized in both cases.

At first glance, the two variants of the algorithm seem conceptually different. In the case of pattern recognition, a margin of separation between two pattern classes is maximized, and the SVs are those examples that lie closest to this margin. In the simplest case, where the training error is fixed to 0, this is done by minimizing ‖w‖² subject to y_i · ((w · x_i) + b) ≥ 1 (note that in pattern recognition, the targets y_i are in {±1}). In regression estimation, on the other hand, a tube of radius ε is fitted to the data, in the space of the target values, with the property that it corresponds to the flattest function in feature space. Here, the SVs lie at the edge of the tube. The parameter ε does not occur in the pattern recognition case. We will show how these seemingly different problems are identical (cf. also Vapnik, 1995; Pontil, Rifkin, & Evgeniou, 1999), how this naturally leads to the concept of canonical hyperplanes (Vapnik, 1995), and how it suggests different generalizations to the estimation of vector-valued functions.

Let (E, ‖·‖_E) and (F, ‖·‖_F) be normed spaces, and X ⊆ E. We define the ε-margin of a function f : X → F as follows.

Definition 1 (ε-margin).

\[
m_\varepsilon(f) := \inf\bigl\{\|x - y\|_E : x, y \in X,\ \|f(x) - f(y)\|_F \ge 2\varepsilon\bigr\}.
\tag{6.1}
\]
m_ε(f) can be zero, even for continuous functions, an example being f(x) = 1/x on X = ℝ₊. There, m_ε(f) = 0 for all ε > 0. Note that the ε-margin is related (albeit not identical) to the traditional modulus of continuity of a function: given δ > 0, the latter measures the largest difference in function values that can be obtained using points within a distance δ in E. The following observation characterizes the functions for which the margin is strictly positive.

Lemma 1 (uniformly continuous functions). With the above notations, m_ε(f) is positive for all ε > 0 if and only if f is uniformly continuous.
Proof. By definition of m_ε, we have

\[
\bigl(\|f(x) - f(y)\|_F \ge 2\varepsilon \implies \|x - y\|_E \ge m_\varepsilon(f)\bigr)
\tag{6.2}
\]
\[
\iff
\bigl(\|x - y\|_E < m_\varepsilon(f) \implies \|f(x) - f(y)\|_F < 2\varepsilon\bigr),
\tag{6.3}
\]
that is, if m_ε(f) > 0, then f is uniformly continuous. Similarly, if f is uniformly continuous, then for each ε > 0, we can find a δ > 0 such that ‖f(x) − f(y)‖_F ≥ 2ε implies ‖x − y‖_E ≥ δ. Since the latter holds uniformly, we can take the infimum to get m_ε(f) ≥ δ > 0.

We next specialize to a particular set of uniformly continuous functions.

Lemma 2 (Lipschitz-continuous functions). If there exists some L > 0 such that for all x, y ∈ X, ‖f(x) − f(y)‖_F ≤ L · ‖x − y‖_E, then m_ε(f) ≥ 2ε/L.

Proof. Take the infimum over ‖x − y‖_E ≥ ‖f(x) − f(y)‖_F / L ≥ 2ε/L.
Example 1 (SV regression estimation). Suppose that E is endowed with a dot product (· ⋅ ·) (generating the norm ‖·‖_E). For linear functions (see equation 1.2), the margin takes the form m_ε(f) = 2ε/‖w‖. To see this, note that since |f(x) − f(y)| = |(w · (x − y))|, the distance ‖x − y‖ will be smallest, given |(w · (x − y))| = 2ε, when x − y is parallel to w (due to Cauchy-Schwarz), that is, if x − y = ±2εw/‖w‖². In that case, ‖x − y‖ = 2ε/‖w‖. For fixed ε > 0, maximizing the margin hence amounts to minimizing ‖w‖, as done in SV regression: in the simplest form (cf. equation 1.7 without slack variables ξ_i), the training on data (equation 1.3) consists of minimizing ‖w‖² subject to

\[
|f(x_i) - y_i| \le \varepsilon.
\tag{6.4}
\]
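The closed form m_ε(f) = 2ε/‖w‖ of example 1 can be checked numerically; the random directions below only illustrate the Cauchy-Schwarz step, and all concrete numbers are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
w = rng.normal(size=5)
eps = 0.3
margin = 2 * eps / np.linalg.norm(w)     # claimed value of m_eps(f)

# The infimum is attained when x - y is parallel to w:
d_star = 2 * eps * w / np.linalg.norm(w) ** 2
assert abs(abs(w @ d_star) - 2 * eps) < 1e-12        # |f(x) - f(y)| = 2 eps
assert abs(np.linalg.norm(d_star) - margin) < 1e-12  # length equals the margin

# Any difference vector with |w . d| = 2 eps is at least as long, by
# Cauchy-Schwarz: ||d|| >= |w . d| / ||w|| = 2 eps / ||w||.
for _ in range(1000):
    d = rng.normal(size=5)
    d *= (2 * eps) / abs(w @ d)          # rescale so that |w . d| = 2 eps
    assert np.linalg.norm(d) >= margin - 1e-12
print("margin:", margin)
```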
Example 2 (SV pattern recognition; see Figure 2). We specialize the setting of example 1 to the case where X = {x_1, ..., x_ℓ}. Then m_1(f) = 2/‖w‖ is equal to the margin defined for Vapnik's canonical hyperplane (Vapnik, 1995). The latter is a way in which, given the data set X, an oriented hyperplane in E can be uniquely expressed by a linear function (see equation 1.2), requiring that

\[
\min\{|f(x)| : x \in X\} = 1.
\tag{6.5}
\]

Vapnik gives a bound on the VC-dimension of canonical hyperplanes in terms of ‖w‖. An optimal margin SV machine for pattern recognition can be constructed from data

\[
(x_1, y_1), \ldots, (x_\ell, y_\ell) \in X \times \{\pm 1\}
\tag{6.6}
\]
Figure 2: 1D toy problem: separate x from o. The SV classification algorithm constructs a linear function f(x) = w · x + b satisfying equation 6.5 (ε = 1). To maximize the margin m_ε(f), one has to minimize |w|.
as follows (Boser et al., 1992): minimize ‖w‖² subject to

\[
y_i \cdot f(x_i) \ge 1.
\tag{6.7}
\]

The decision function used for classification takes the form

\[
f^*(x) = \operatorname{sgn}((w \cdot x) + b).
\tag{6.8}
\]

The parameter ε is superfluous in pattern recognition, as the resulting decision function,

\[
f^*(x) = \operatorname{sgn}((w \cdot x) + b),
\tag{6.9}
\]

will not change if we minimize ‖w‖² subject to

\[
y_i \cdot f(x_i) \ge \varepsilon.
\tag{6.10}
\]
Finally, to understand why the constraint (see equation 6.7) looks different from equation 6.4 (e.g., one is multiplicative, the other additive), note that in regression, the points (x_i, y_i) are required to lie within a tube of radius ε, whereas in pattern recognition, they are required to lie outside the tube (see Figure 2), and on the correct side. For the points on the tube, we have 1 = y_i · f(x_i) = 1 − |f(x_i) − y_i|.

So far, we have interpreted known algorithms only in terms of maximizing m_ε. Next, we consider whether we can use the latter as a guide for constructing more general algorithms.

Example 3 (SV regression for vector-valued functions). Assume E = ℝ^N. For linear functions f(x) = Wx + b, with W an N × N matrix and b ∈ ℝ^N,
we have, as a consequence of lemma 2,

\[
m_\varepsilon(f) \ge \frac{2\varepsilon}{\|W\|},
\tag{6.11}
\]
where ‖W‖ is any matrix norm of W that is compatible (Horn & Johnson, 1985) with ‖·‖_E. If the matrix norm is the one induced by ‖·‖_E, that is, if there exists a unit vector z ∈ E such that ‖Wz‖_E = ‖W‖, then equality holds in 6.11. To see the latter, we use the same argument as in example 1, setting x − y = 2εz/‖W‖.

For the Hilbert-Schmidt norm ‖W‖₂ = (Σ_{i,j=1}^{N} W_{ij}²)^{1/2}, which is compatible with the vector norm ‖·‖₂, the problem of minimizing ‖W‖ subject to separate constraints for each output dimension separates into N regression problems. In Smola, Williamson, Mika, & Schölkopf (1999), it is shown that one can specify invariance requirements that imply that the regularizers act on the output dimensions separately and identically (i.e., in a scalar fashion). In particular, it turns out that under the assumption of quadratic homogeneity and permutation symmetry, the Hilbert-Schmidt norm is the only admissible one.

7 ν-SV Classification
We saw that ν-SVR differs from ε-SVR in that it uses the parameters ν and C instead of ε and C. In many cases, this is a useful reparameterization of the original algorithm, and thus it is worthwhile to ask whether a similar change could be incorporated in the original SV classification algorithm (for brevity, we call it C-SVC). There, the primal optimization problem is to minimize (Cortes & Vapnik, 1995)

\[
\tau(w, \xi) = \frac{1}{2}\|w\|^2 + \frac{C}{\ell} \sum_i \xi_i
\tag{7.1}
\]

subject to

\[
y_i \cdot ((x_i \cdot w) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0.
\tag{7.2}
\]
The goal of the learning process is to estimate a function f* (see equation 6.9) such that the probability of misclassification on an independent test set, the risk R[f*], is small.⁶

Here, the only parameter that we can dispose of is the regularization constant C. To substitute it by a parameter similar to the ν used in the regression case, we proceed as follows. As a primal problem for ν-SVC, we

⁶ Implicitly we make use of the {0, 1} loss function; hence the risk equals the probability of misclassification.
consider the minimization of

\[
\tau(w, \xi, \rho) = \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{\ell}\sum_i \xi_i
\tag{7.3}
\]

subject to (cf. equation 6.10)

\[
y_i \cdot ((x_i \cdot w) + b) \ge \rho - \xi_i,
\tag{7.4}
\]
\[
\xi_i \ge 0, \quad \rho \ge 0.
\tag{7.5}
\]
For reasons we shall explain, no constant C appears in this formulation. To understand the role of ρ, note that for ξ = 0, the constraint (see 7.4) simply states that the two classes are separated by the margin 2ρ/‖w‖ (cf. example 2).

To derive the dual, we consider the Lagrangian

\[
L(w, \xi, b, \rho, \alpha, \beta, \delta)
= \frac{1}{2}\|w\|^2 - \nu\rho + \frac{1}{\ell}\sum_i \xi_i
- \sum_i \bigl(\alpha_i (y_i ((x_i \cdot w) + b) - \rho + \xi_i) + \beta_i \xi_i\bigr)
- \delta\rho,
\tag{7.6}
\]
using multipliers α_i, β_i, δ ≥ 0. This function has to be minimized with respect to the primal variables w, ξ, b, ρ and maximized with respect to the dual variables α, β, δ. To eliminate the former, we compute the corresponding partial derivatives and set them to 0, obtaining the following conditions:

\[
w = \sum_i \alpha_i y_i x_i,
\tag{7.7}
\]
\[
\alpha_i + \beta_i = 1/\ell, \qquad
0 = \sum_i \alpha_i y_i, \qquad
\sum_i \alpha_i - \delta = \nu.
\tag{7.8}
\]
In the SV expansion (see equation 7.7), only those α_i can be nonzero that correspond to a constraint (see 7.4) that is precisely met (KKT conditions; cf. Vapnik, 1995).

Substituting equations 7.7 and 7.8 into L, using α_i, β_i, δ ≥ 0, and incorporating kernels for dot products leaves us with the following quadratic optimization problem: maximize

\[
W(\alpha) = -\frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j)
\tag{7.9}
\]

subject to

\[
0 \le \alpha_i \le 1/\ell,
\tag{7.10}
\]
\[
0 = \sum_i \alpha_i y_i,
\tag{7.11}
\]
\[
\sum_i \alpha_i \ge \nu.
\tag{7.12}
\]

The resulting decision function can be shown to take the form

\[
f^*(x) = \operatorname{sgn}\left(\sum_i \alpha_i y_i k(x, x_i) + b\right).
\tag{7.13}
\]
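The dual 7.9 through 7.12 is small enough to hand to a generic solver. A sketch with SciPy's SLSQP on a tiny two-class problem and a linear kernel (the data, kernel choice, and solver are illustrative assumptions, not the paper's implementation):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Tiny balanced 2-class problem; linear kernel k(x, y) = x . y
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
l, nu = len(y), 0.3
K = X @ X.T
Q = (y[:, None] * y[None, :]) * K        # Q_ij = y_i y_j k(x_i, x_j)

obj = lambda a: 0.5 * a @ Q @ a          # minimizing -W(alpha), eq. 7.9
cons = [{"type": "eq",   "fun": lambda a: a @ y},         # eq. 7.11
        {"type": "ineq", "fun": lambda a: a.sum() - nu}]  # eq. 7.12
res = minimize(obj, np.full(l, nu / l), jac=lambda a: Q @ a,
               bounds=[(0.0, 1.0 / l)] * l,               # eq. 7.10
               constraints=cons, method="SLSQP")
alpha = res.x
print(res.success, alpha.sum())
```

The uniform starting point α_i = ν/ℓ is feasible for balanced classes, which keeps SLSQP well behaved on this convex QP.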
Compared to the original dual (Boser et al., 1992; Vapnik, 1995), there are two differences. First, there is an additional constraint, 7.12, similar to the regression case, 2.14. Second, the linear term Σ_i α_i of Boser et al. (1992) no longer appears in the objective function 7.9. This has an interesting consequence: 7.9 is now quadratically homogeneous in α. It is straightforward to verify that one obtains exactly the same objective function if one starts with the primal function τ(w, ξ, ρ) = ‖w‖²/2 + C · (−νρ + (1/ℓ) Σ_i ξ_i) (i.e., if one does use C), the only difference being that the constraints 7.10 and 7.12 would have an extra factor C on the right-hand side. In that case, due to the homogeneity, the solution of the dual would be scaled by C; however, it is straightforward to see that the corresponding decision function will not change. Hence we may set C = 1.

To compute b and ρ, we consider two sets S±, of identical size s > 0, containing SVs x_i with 0 < α_i < 1/ℓ and y_i = ±1, respectively. Then, due to the KKT conditions, 7.4 becomes an equality with ξ_i = 0. Hence, in terms of kernels,

\[
b = -\frac{1}{2s} \sum_{x \in S_+ \cup S_-} \sum_j \alpha_j y_j k(x, x_j),
\tag{7.14}
\]
\[
\rho = \frac{1}{2s} \left( \sum_{x \in S_+} \sum_j \alpha_j y_j k(x, x_j)
- \sum_{x \in S_-} \sum_j \alpha_j y_j k(x, x_j) \right).
\tag{7.15}
\]
As in the regression case, the ν parameter has a more natural interpretation than the one we removed, C. To formulate it, let us first define the term margin error: by this, we denote points with ξ_i > 0, that is, points that are either errors or lie within the margin. Formally, the fraction of margin errors is

\[
R_{\mathrm{emp}}^{\rho}[f] := \frac{1}{\ell}\,\bigl|\{i : y_i \cdot f(x_i) < \rho\}\bigr|.
\tag{7.16}
\]
Here, f is used to denote the argument of the sgn in the decision function, equation 7.13; that is, f* = sgn ∘ f. We are now in a position to modify proposition 1 for the case of pattern recognition:

Proposition 5. Suppose k is a real analytic kernel function, and we run ν-SVC with k on some data with the result that ρ > 0. Then

i. ν is an upper bound on the fraction of margin errors.

ii. ν is a lower bound on the fraction of SVs.

iii. Suppose the data (see equation 6.6) were generated i.i.d. from a distribution P(x, y) = P(x) P(y|x) such that neither P(x, y = 1) nor P(x, y = −1) contains any discrete component. Suppose, moreover, that the kernel is analytic and non-constant. With probability 1, asymptotically, ν equals both the fraction of SVs and the fraction of margin errors.
Proof. Ad (i): By the KKT conditions, ρ > 0 implies δ = 0. Hence, inequality 7.12 becomes an equality (cf. equation 7.8). Thus, at most a fraction ν of all examples can have α_i = 1/ℓ. All examples with ξ_i > 0 do satisfy α_i = 1/ℓ (if not, α_i could grow further to reduce ξ_i).

Ad (ii): SVs can contribute at most 1/ℓ to the left-hand side of 7.12; hence there must be at least νℓ of them.

Ad (iii): It follows from the condition on P(x, y) that apart from some set of measure zero (arising from possible singular components), the two class distributions are absolutely continuous and can be written as integrals over distribution functions. Because the kernel is analytic and non-constant, it cannot be constant in any open set; otherwise it would be constant everywhere. Therefore, the functions f constituting the argument of the sgn in the SV decision function (see equation 7.13; essentially functions in the class of SV regression functions) transform the distribution over x into distributions such that for all f and all t ∈ ℝ, lim_{γ→0} P(|f(x) + t| < γ) = 0. At the same time, we know that the class of these functions has well-behaved covering numbers; hence we get uniform convergence: for all γ > 0, sup_f |P(|f(x) + t| < γ) − P̂_ℓ(|f(x) + t| < γ)| converges to zero in probability, where P̂_ℓ is the sample-based estimate of P (that is, the proportion of points that satisfy |f(x) + t| < γ). But then for all α > 0, lim_{γ→0} lim_{ℓ→∞} P(sup_f P̂_ℓ(|f(x) + t| < γ) > α) = 0. Hence, sup_f P̂_ℓ(|f(x) + t| = 0) converges to zero in probability. Using t = ±ρ thus shows that almost surely the fraction of points exactly on the margin tends to zero; hence the fraction of SVs equals that of margin errors.
Combining (i) and (ii) shows that both fractions converge almost surely to ν.

Moreover, since equation 7.11 means that the sums over the coefficients of positive and negative SVs, respectively, are equal, we conclude that proposition 5 actually holds for both classes separately, with ν/2 each. (Note that by the same argument, the numbers of SVs at the two sides of the margin asymptotically agree.)

A connection to standard SV classification, and a somewhat surprising interpretation of the regularization parameter C, is described by the following result:

Proposition 6. If ν-SV classification leads to ρ > 0, then C-SV classification, with C set a priori to 1/ρ, leads to the same decision function.

Proof. If one minimizes the function 7.3 and then fixes ρ to minimize only over the remaining variables, nothing will change. Hence the obtained solution w₀, b₀, ξ₀ minimizes the function 7.1 for C = 1, subject to the constraint 7.4. To recover the constraint 7.2, we rescale to the set of variables w′ = w/ρ, b′ = b/ρ, ξ′ = ξ/ρ. This leaves us, up to a constant scaling factor ρ², with the objective function 7.1 using C = 1/ρ.
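The SV-fraction bound of proposition 5 (ii) can be observed with an off-the-shelf implementation; the sketch below uses scikit-learn's `NuSVC` (its availability and all data and parameter choices are assumptions of this illustration):

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(5)
n = 200
# Two overlapping Gaussian classes, so margin errors necessarily occur.
X = np.vstack([rng.normal(-1, 1.0, (n // 2, 2)), rng.normal(1, 1.0, (n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])

nu = 0.3
clf = NuSVC(nu=nu, kernel="rbf", gamma=0.5).fit(X, y)

# Proposition 5 (ii): nu is a lower bound on the fraction of SVs.
sv_fraction = len(clf.support_) / n
print(sv_fraction >= nu)
```

The margin ρ itself is internal to the solver, so only the SV count is checked here.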
As in the case of regression estimation (see proposition 3), linearity of the target function in the slack variables ξ^{(*)} leads to "outlier" resistance of the estimator in pattern recognition. The exact statement, however, differs from the one in regression in two respects. First, the perturbation of the point is carried out in feature space. What it precisely corresponds to in input space therefore depends on the specific kernel chosen. Second, instead of referring to points outside the ε-tube, it refers to margin error points: points that are misclassified or fall into the margin. Below, we use the shorthand z_i for Φ(x_i).

Proposition 7 (resistance of SV classification). Suppose w can be expressed in terms of the SVs that are not at bound, that is,

\[
w = \sum_i c_i z_i,
\tag{7.17}
\]

with c_i ≠ 0 only if α_i ∈ (0, 1/ℓ) (where the α_i are the coefficients of the dual solution). Then local movements of any margin error z_m parallel to w do not change the hyperplane.

Proof. Since the slack variable of z_m satisfies ξ_m > 0, the KKT conditions (e.g., Bertsekas, 1995) imply α_m = 1/ℓ. If δ is sufficiently small, then transforming the point into z′_m := z_m + δ · w results in a slack that is still nonzero,
that is, ξ′_m > 0; hence we have α′_m = α_m = 1/ℓ. Updating ξ_m and keeping all other primal variables unchanged, we obtain a modified set of primal variables that is still feasible. We next show how to obtain a corresponding set of feasible dual variables. To keep w unchanged, we need to satisfy

\[
\sum_i \alpha_i y_i z_i = \sum_{i \ne m} \alpha_i' y_i z_i + \alpha_m y_m z_m'.
\]

Substituting z′_m = z_m + δ · w and equation 7.17, we note that a sufficient condition for this to hold is that for all i ≠ m, α′_i = α_i − δ c_i y_i α_m y_m. Since by assumption c_i is nonzero only if α_i ∈ (0, 1/ℓ), α′_i will be in (0, 1/ℓ) if α_i is, provided δ is sufficiently small, and it will equal 1/ℓ if α_i does. In both cases, we end up with a feasible solution α′, and the KKT conditions are still satisfied. Thus (Bertsekas, 1995), (w, b) are still the hyperplane parameters of the solution.

Note that the assumption 7.17 is not as restrictive as it may seem. Although the SV expansion of the solution, w = Σ_i α_i y_i z_i, often contains many multipliers α_i that are at bound, it is nevertheless conceivable that, especially when discarding the requirement that the coefficients be bounded, we can obtain an expansion (see equation 7.17) in terms of a subset of the original vectors. For instance, if we have a 2D problem that we solve directly in input space, with k(x, y) = (x · y), then it already suffices to have two linearly independent SVs that are not at bound in order to express w. This holds for any overlap of the two classes, even if there are many SVs at the upper bound.

For the selection of C, several methods have been proposed that could probably be adapted for ν (Schölkopf, 1997; Shawe-Taylor & Cristianini, 1999). In practice, most researchers have so far used cross validation. Clearly, this could be done also for ν-SVC. Nevertheless, we shall propose a method that takes into account specific properties of ν-SVC.

The parameter ν lets us control the number of margin errors, the crucial quantity in a class of bounds on the generalization error of classifiers using covering numbers to measure the classifier capacity. We can use this connection to give a generalization error bound for ν-SVC in terms of ν. There are a number of complications in doing this the best possible way, and so here we will indicate the simplest one.
It is based on the following result:

Proposition 8 (Bartlett, 1998). Suppose ρ > 0, 0 < δ < 1/2, and P is a probability distribution on X × {−1, 1} from which the training set, equation 6.6, is drawn. Then with probability at least 1 − δ, for every f in some function class F, the probability of error of the classification function f* = sgn ∘ f on an independent test set is bounded according to

\[
R[f^*] \le R_{\mathrm{emp}}^{\rho}[f]
+ \sqrt{\frac{2}{\ell}\Bigl(\ln \mathcal{N}(F, l_\infty^{2\ell}, \rho/2) + \ln(2/\delta)\Bigr)},
\tag{7.18}
\]
where N(F, l_∞^ℓ, ρ) = sup_{X = x_1, ..., x_ℓ} N(F|_X, l_∞, ρ), F|_X = {(f(x_1), ..., f(x_ℓ)) : f ∈ F}, and N(F|_X, l_∞, ρ) is the ρ-covering number of F|_X with respect to l_∞, the usual l_∞ metric on a set of vectors.

To obtain the generalization bound for ν-SVC, we simply substitute the bound R_emp^ρ[f] ≤ ν (proposition 5, i) and some estimate of the covering numbers in terms of the margin. The best available bounds are stated in terms of the functional inverse of N, hence the slightly complicated expressions in the following.
Denote by B_R the ball of radius R around the origin in some Hilbert space F. Then the covering number N of the class of functions
    F = {x ↦ (w · x) : ‖w‖ ≤ 1, x ∈ B_R}    (7.19)
at scale ρ satisfies

    log₂ N(F, l_∞^ℓ, ρ) ≤ inf { n ∈ ℕ : (c² R² / ρ²) (1/n) log₂(1 + ℓ/n) ≥ 1 } − 1,    (7.20)
where c < 103 is a constant. This is a consequence of a fundamental theorem due to Maurey. For ℓ ≥ 2, one thus obtains

    log₂ N(F, l_∞^ℓ, ρ) ≤ (c² R² / ρ²) log₂ ℓ − 1.    (7.21)
To apply these results to ν-SVC, we rescale w to length 1, thus obtaining a margin ρ/‖w‖ (cf. equation 7.4). Moreover, we have to combine propositions 8 and 9. Using ρ/2 instead of ρ in the latter yields the following result.

Proposition 10. Suppose ν-SVC is used with a kernel of the form k(x, y) = k(‖x − y‖) with k(0) = 1. Then all the data points Φ(xᵢ) in feature space live in a ball of radius 1 centered at the origin. Consequently, with probability at least 1 − δ over the training set (see equation 6.6), the ν-SVC decision function f* = sgn ∘ f,
with f(x) = Σᵢ αᵢ yᵢ k(x, xᵢ) (cf. equation 7.13), has a probability of test error bounded according to

    R[f*] ≤ R^ρ_emp[f] + √( (2/ℓ) ( (4c²‖w‖²/ρ²) log₂(2ℓ) − 1 + ln(2/δ) ) )
         ≤ ν + √( (2/ℓ) ( (4c²‖w‖²/ρ²) log₂(2ℓ) − 1 + ln(2/δ) ) ).
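As a quick numerical illustration (ours, not part of the original paper), the right-hand side of this bound is easy to evaluate. The sketch below assumes the constant c of proposition 9 is set to its stated ceiling of 103; the function name and argument choices are ours:

```python
import math

def nu_svc_bound(nu, w_norm_sq, rho, ell, delta, c=103.0):
    """Right-hand side of the proposition 10 bound (illustrative only).

    nu        : margin-error bound from proposition 5, R_emp <= nu
    w_norm_sq : squared norm of the weight vector in feature space
    rho       : margin parameter
    ell       : number of training examples
    delta     : confidence parameter
    c         : constant of proposition 9 (only known to satisfy c < 103)
    """
    covering = 4.0 * c ** 2 * w_norm_sq / rho ** 2 * math.log2(2 * ell) - 1.0
    return nu + math.sqrt(2.0 / ell * (covering + math.log(2.0 / delta)))
```

As is typical of covering-number bounds, the capacity term dominates unless ℓ is very large, so the bound is better suited to qualitative guidance than to exact model selection.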
Notice that in general, w is a vector in feature space. Note that the set of functions in the proposition differs from support vector decision functions (see equation 7.13) in that it comes without the +b term. This leads to a minor modification (for details, see Williamson et al., 1998). Better bounds can be obtained by estimating the radius or even optimizing the choice of the center of the ball (cf. the procedure described by Schölkopf et al., 1995; Burges, 1998). However, in order to get a theorem of the above form in that case, a more complex argument is necessary (see Shawe-Taylor, Bartlett, Williamson, & Anthony, 1998, sec. VI, for an indication).

We conclude this section by noting that a straightforward extension of the ν-SVC algorithm is to include parametric models f_k(x) for the margin, and thus to use Σ_q ρ_q f_q(xᵢ) instead of ρ in the constraint (see equation 7.4), in complete analogy to the regression case discussed in section 5.

8 Experiments

8.1 Regression Estimation. In the experiments, we used the optimizer LOQO.⁷ This has the serendipitous advantage that the primal variables b and ε can be recovered as the dual variables of the Wolfe dual (see equation 2.11) (i.e., the double dual variables) fed into the optimizer.
8.1.1 Toy Examples. The first task was to estimate a noisy sinc function, given ℓ examples (xᵢ, yᵢ), with xᵢ drawn uniformly from [−3, 3] and yᵢ = sin(πxᵢ)/(πxᵢ) + υᵢ, where the υᵢ were drawn from a gaussian with zero mean and variance σ². Unless stated otherwise, we used the radial basis function (RBF) kernel k(x, x′) = exp(−|x − x′|²), ℓ = 50, C = 100, ν = 0.2, and σ = 0.2. Whenever standard deviation error bars are given, the results were obtained from 100 trials. Finally, the risk (or test error) of a regression estimate f was computed with respect to the sinc function without noise, as (1/6) ∫₋₃³ |f(x) − sin(πx)/(πx)| dx. Results are given in Table 2 and Figures 3 through 9.
⁷ Available online at http://www.princeton.edu/~rvdb/.
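The experimental setup is easy to reproduce. The sketch below (ours, not the authors' original LOQO-based code) generates the noisy sinc training data and implements the risk functional just described; any regression estimate f can be plugged in:

```python
import numpy as np

def make_sinc_data(ell=50, sigma=0.2, seed=0):
    """Sample ell points x ~ U[-3, 3], y = sinc(x) + gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-3.0, 3.0, size=ell)
    y = np.sinc(x) + rng.normal(0.0, sigma, size=ell)  # np.sinc(x) = sin(pi x)/(pi x)
    return x, y

def risk(f, grid=None):
    """Test error (1/6) * integral_{-3}^{3} |f(x) - sinc(x)| dx, on a grid."""
    if grid is None:
        grid = np.linspace(-3.0, 3.0, 2001)
    # (1/6) times an integral over an interval of length 6 is just the average
    return float(np.mean(np.abs(f(grid) - np.sinc(grid))))
```

By construction, risk of the noise-free sinc itself is zero, and any constant estimate incurs a positive risk.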
Figure 3: ν-SV regression with ν = 0.2 (top) and ν = 0.8 (bottom). The larger ν allows more points to lie outside the tube (see section 2). The algorithm automatically adjusts ε to 0.22 (top) and 0.04 (bottom). Shown are the sinc function (dotted), the regression f, and the tube f ± ε.
Figure 4: ν-SV regression on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ν = 0.2. The tube width automatically adjusts to the noise (top: ε = 0; bottom: ε = 1.19).
Figure 5: ε-SV regression (Vapnik, 1995) on data with noise σ = 0 (top) and σ = 1 (bottom). In both cases, ε = 0.2. This choice, which has to be specified a priori, is ideal for neither case. In the upper figure, the regression estimate is biased; in the lower figure, ε does not match the external noise (Smola, Murata, Schölkopf, & Müller, 1998).
Figure 6: ν-SVR for different values of the error constant ν. Notice how ε decreases when more errors are allowed (large ν), and that over a large range of ν, the test error (risk) is insensitive toward changes in ν.
Figure 7: ν-SVR for different values of the noise σ. The tube radius ε increases linearly with σ (largely due to the fact that both ε and the ξᵢ enter the cost function linearly). Due to the automatic adaptation of ε, the number of SVs and points outside the tube (errors) are, except for the noise-free case σ = 0, largely independent of σ.
Figure 8: ν-SVR for different values of the constant C. (Top) ε decreases when the regularization is decreased (large C). Only very little, if any, overfitting occurs. (Bottom) ν upper-bounds the fraction of errors and lower-bounds the fraction of SVs (cf. proposition 1). The bound gets looser as C increases; this corresponds to a smaller number of examples ℓ relative to C (cf. Table 2).
Table 2: Asymptotic Behavior of the Fraction of Errors and SVs.

ℓ                   10    50    100   200   500   1000  1500  2000
ε                   0.27  0.22  0.23  0.25  0.26  0.26  0.26  0.26
Fraction of errors  0.00  0.10  0.14  0.18  0.19  0.20  0.20  0.20
Fraction of SVs     0.40  0.28  0.24  0.23  0.21  0.21  0.20  0.20
Notes: The ε found by ν-SV regression is largely independent of the sample size ℓ. The fraction of SVs and the fraction of errors approach ν = 0.2 from above and below, respectively, as the number of training examples ℓ increases (cf. proposition 1).
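The sandwich property of proposition 1 (fraction of errors ≤ ν ≤ fraction of SVs) is easy to check empirically. The sketch below uses scikit-learn's NuSVR as a stand-in for the authors' implementation; parameter names follow that library, and the data generation matches the toy setting above:

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(1)
ell, nu = 500, 0.2
X = rng.uniform(-3.0, 3.0, size=(ell, 1))
y = np.sinc(X[:, 0]) + rng.normal(0.0, 0.2, size=ell)

# RBF kernel k(x, x') = exp(-|x - x'|^2)  ->  gamma = 1
model = NuSVR(nu=nu, C=100.0, gamma=1.0).fit(X, y)

# nu lower-bounds the fraction of SVs; the paper reports 0.21 at ell = 500
frac_svs = len(model.support_) / ell
print(f"fraction of SVs: {frac_svs:.2f} (nu = {nu})")
```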
Figure 10 gives an illustration of how one can make use of parametric insensitivity models as proposed in section 5. Using the proper model, the estimate gets much better. In the parametric case, we used ν = 0.1 and f(x) = sin²((2π/3)x), which, due to ∫ f(x) dP(x) = 1/2, corresponds to our standard choice ν = 0.2 in ν-SVR (cf. proposition 4). Although this relies on the assumption that the SVs are uniformly distributed, the experimental findings are consistent with the asymptotics predicted theoretically: for ℓ = 200, we got 0.24 and 0.19 for the fraction of SVs and errors, respectively.

8.1.2 Boston Housing Benchmark. Empirical studies using ε-SVR have reported excellent performance on the widely used Boston housing regression benchmark set (Stitson et al., 1999). Due to proposition 2, the only difference between ν-SVR and standard ε-SVR lies in the fact that different parameters, ε versus ν, have to be specified a priori. Accordingly, the goal of the following experiment was not to show that ν-SVR is better than ε-SVR, but that ν is a useful parameter to select. Consequently, we are interested only in ν and ε, and hence kept the remaining parameters fixed. We adjusted C and the width 2σ² in k(x, y) = exp(−‖x − y‖²/(2σ²)) as in Schölkopf et al. (1997). We used 2σ² = 0.3 · N, where N = 13 is the input dimensionality, and C/ℓ = 10 · 50 (i.e., the original value of 10 was corrected since in the present case, the maximal y-value is 50 rather than 1). We performed 100 runs, where each time the overall set of 506 examples was randomly split into a training set of ℓ = 481 examples and a test set of 25 examples (cf. Stitson et al., 1999). Table 3 shows that over a wide range of ν (note that only 0 ≤ ν ≤ 1 makes sense), we obtained performances that are close to the best performances that can be achieved by selecting ε a priori by looking at the test set.
Finally, although we did not use validation techniques to select the optimal values for C and 2σ², the performances are state of the art (Stitson et al., 1999, report an MSE of 7.6 for ε-SVR using ANOVA kernels, and 11.7 for bagging regression trees). Table 3, moreover, shows that in this real-world application, ν can be used to control the fraction of SVs/errors.
Table 3: Results for the Boston Housing Benchmark (Top: ν-SVR; Bottom: ε-SVR).

ν            0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9   1.0
automatic ε  2.6   1.7   1.2   0.8   0.6   0.3   0.0   0.0   0.0   0.0
MSE          9.4   8.7   9.3   9.5   10.0  10.6  11.3  11.3  11.3  11.3
STD          6.4   6.8   7.6   7.9   8.4   9.0   9.6   9.5   9.5   9.5
Errors       0.0   0.1   0.2   0.2   0.3   0.4   0.5   0.5   0.5   0.5
SVs          0.3   0.4   0.6   0.7   0.8   0.9   1.0   1.0   1.0   1.0

ε            0     1     2     3     4     5     6     7     8     9     10
MSE          11.3  9.5   8.8   9.7   11.2  13.1  15.6  18.2  22.1  27.0  34.3
STD          9.5   7.7   6.8   6.2   6.3   6.0   6.1   6.2   6.6   7.3   8.4
Errors       0.5   0.2   0.1   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
SVs          1.0   0.6   0.4   0.3   0.2   0.1   0.1   0.1   0.1   0.1   0.1
Note: MSE: mean squared errors; STD: standard deviations thereof (100 trials); Errors: fraction of training points outside the tube; SVs: fraction of training points that are SVs.
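The parameter settings described above translate directly into kernel-width and C choices. The helper below (ours; it takes any feature matrix, since the Boston data themselves are not included here) configures scikit-learn's NuSVR accordingly. Note that exp(−‖x − y‖²/(2σ²)) with 2σ² = 0.3·N corresponds to gamma = 1/(0.3·N) in that library's parameterization:

```python
from sklearn.svm import NuSVR

def boston_style_nu_svr(n_features, n_train, nu=0.3):
    """nu-SVR configured as in the Boston housing experiment (illustrative).

    2*sigma^2 = 0.3 * N   ->  gamma = 1 / (0.3 * N)
    C / ell   = 10 * 50   ->  C = 500 * ell
    """
    return NuSVR(nu=nu,
                 C=500.0 * n_train,
                 gamma=1.0 / (0.3 * n_features))
```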
8.2 Classification. As in the regression case, the difference between C-SVC and ν-SVC lies in the fact that we have to select a different parameter a priori. If we are able to do this well, we obtain identical performances. In other words, ν-SVC could be used to reproduce the excellent results obtained on various data sets using C-SVC (for an overview, see Schölkopf, Burges, & Smola, 1999). This would certainly be a worthwhile project; however, we restrict ourselves here to showing some toy examples illustrating the influence of ν (see Figure 11). The corresponding fractions of SVs and margin errors are listed in Table 4.

Figure 9: Facing page. ν-SVR for different values of the gaussian kernel width 2σ², using k(x, x′) = exp(−|x − x′|²/(2σ²)). Using a kernel that is too wide results in underfitting; moreover, since the tube becomes too rigid as 2σ² gets larger than 1, the ε needed to accommodate a fraction (1 − ν) of the points increases significantly. In the bottom figure, it can again be seen that the speed of the uniform convergence responsible for the asymptotic statement given in proposition 1 depends on the capacity of the underlying model. Increasing the kernel width leads to smaller covering numbers (Williamson et al., 1998) and therefore faster convergence.

Figure 10: Toy example, using prior knowledge about an x-dependence of the noise. Additive noise (σ = 1) was multiplied by the function sin²((2π/3)x). (Top) The same function was used as f, a parametric insensitivity tube (section 5). (Bottom) ν-SVR with standard tube.

Table 4: Fractions of Errors and SVs, Along with the Margins of Class Separation, for the Toy Example Depicted in Figure 11.

ν                   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8
Fraction of errors  0.00   0.07   0.25   0.32   0.39   0.50   0.61   0.71
Fraction of SVs     0.29   0.36   0.43   0.46   0.57   0.68   0.79   0.86
Margin 2ρ/‖w‖       0.009  0.035  0.229  0.312  0.727  0.837  0.922  1.092

Note: ν upper-bounds the fraction of errors and lower-bounds the fraction of SVs, and increasing ν, i.e., allowing more errors, increases the margin.

9 Discussion

We have presented a new class of SV algorithms, which are parameterized by a quantity ν that lets one control the number of SVs and errors. We described ν-SVR, a new regression algorithm that has been shown to be rather useful in practice. We gave theoretical results concerning the meaning and the choice of the parameter ν. Moreover, we have applied the idea underlying ν-SV regression to develop a ν-SV classification algorithm. Just like its regression counterpart, the algorithm is interesting from both a practical and a theoretical point of view. Controlling the number of SVs has consequences for (1) run-time complexity, since the evaluation time of the estimated function scales linearly with the number of SVs (Burges, 1998); (2) training time, e.g., when using a chunking algorithm (Vapnik, 1979) whose complexity increases with the number of SVs; (3) possible data compression applications—ν characterizes the compression ratio: it suffices to train the algorithm only on the SVs, leading to the same solution (Schölkopf et al., 1995); and (4) generalization error bounds: the algorithm directly optimizes a quantity using which one can give generalization bounds. These, in turn, could be used to perform structural risk minimization over ν. Moreover, asymptotically, ν directly controls the number of support vectors, and the latter can be used to give a leave-one-out generalization bound (Vapnik, 1995).
Figure 11: Toy problem (task: separate circles from disks) solved using ν-SV classification, with parameter values ranging from ν = 0.1 (top left) to ν = 0.8 (bottom right). The larger we select ν, the more points are allowed to lie inside the margin (depicted by dotted lines). As a kernel, we used the gaussian k(x, y) = exp(−‖x − y‖²).
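The qualitative behavior in this toy problem can be reproduced with scikit-learn's NuSVC, a reimplementation of ν-SVC. The sketch below (ours, on synthetic two-class data rather than the figure's): larger ν recruits more support vectors, and ν lower-bounds the SV fraction:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
n = 100
# Two overlapping gaussian clouds (playing the role of circles vs. disks)
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(+1.0, 1.0, size=(n, 2))])
y = np.array([-1] * n + [+1] * n)

frac_svs = {}
for nu in (0.1, 0.5, 0.8):
    model = NuSVC(nu=nu, gamma=1.0).fit(X, y)  # gaussian kernel exp(-|x-y|^2)
    frac_svs[nu] = len(model.support_) / len(y)
```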
In both the regression and the pattern recognition case, the introduction of ν has enabled us to dispose of another parameter. In the regression case, this was the accuracy parameter ε; in pattern recognition, it was the regularization constant C. Whether we could have as well abolished C in the regression case is an open problem. Note that the algorithms are not fundamentally different from previous SV algorithms; in fact, we showed that for certain parameter settings, the results coincide. Nevertheless, we believe there are practical applications where it is more convenient to specify a fraction of points that is allowed to become errors, rather than quantities that are either hard to adjust a priori (such as the accuracy ε) or do not have an intuitive interpretation (such as C). On the other hand, desirable properties of previous SV algorithms, including the formulation as a definite quadratic program, and the sparse SV representation of the solution, are retained. We are optimistic that in many applications, the new algorithms will prove to be quite robust. Among these should be the reduced set algorithm of Osuna and Girosi (1999), which approximates the SV pattern recognition decision surface by ε-SVR. Here, ν-SVR should give a direct handle on the desired speed-up.

Future work includes the experimental test of the asymptotic predictions of section 4 and an experimental evaluation of ν-SV classification on real-world problems. Moreover, the formulation of efficient chunking algorithms for the ν-SV case should be studied (cf. Platt, 1999). Finally, the additional freedom to use parametric error models has not been exploited yet. We expect that this new capability of the algorithms could be very useful in situations where the noise is heteroscedastic, such as in many problems of financial data analysis, and general time-series analysis applications (Müller et al., 1999; Mattera & Haykin, 1999).
If a priori knowledge about the noise is available, it can be incorporated into an error model f; if not, we can try to estimate the model directly from the data, for example, by using a variance estimator (e.g., Seifert, Gasser, & Wolf, 1993) or quantile estimator (section 3).

Acknowledgments
This work was supported in part by grants of the Australian Research Council and the DFG (Ja 379/7-1 and Ja 379/9-1). Thanks to S. Ben-David, A. Elisseeff, T. Jaakkola, K. Müller, J. Platt, R. von Sachs, and V. Vapnik for discussions and to L. Almeida for pointing us to White's work. Jason Weston has independently performed experiments using a sum inequality constraint on the Lagrange multipliers, but declined an offer of coauthorship.

References

Aizerman, M., Braverman, E., & Rozonoer, L. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821–837.
Anthony, M., & Bartlett, P. L. (1999). Neural network learning: Theoretical foundations. Cambridge: Cambridge University Press.
Bartlett, P. L. (1998). The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2), 525–536.
Bertsekas, D. P. (1995). Nonlinear programming. Belmont, MA: Athena Scientific.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2), 1–47.
Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273–297.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Horn, R. A., & Johnson, C. R. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Mattera, D., & Haykin, S. (1999). Support vector machines for dynamic reconstruction of a chaotic system. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 211–241). Cambridge, MA: MIT Press.
Müller, K.-R., Smola, A., Rätsch, G., Schölkopf, B., Kohlmorgen, J., & Vapnik, V. (1999). Predicting time series with support vector machines. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 243–253). Cambridge, MA: MIT Press.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—determining the number of hidden units for artificial neural network models. IEEE Transactions on Neural Networks, 5, 865–872.
Osuna, E., & Girosi, F. (1999). Reducing run-time complexity in support vector machines. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 271–283). Cambridge, MA: MIT Press.
Platt, J. (1999). Fast training of SVMs using sequential minimal optimization. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 185–208). Cambridge, MA: MIT Press.
Pontil, M., Rifkin, R., & Evgeniou, T. (1999). From regression to classification in support vector machines. In M. Verleysen (Ed.), Proceedings ESANN (pp. 225–230). Brussels: D Facto.
Schölkopf, B. (1997). Support vector learning. Munich: R. Oldenbourg Verlag.
Schölkopf, B., Bartlett, P. L., Smola, A., & Williamson, R. C. (1998). Support vector regression with automatic accuracy control. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (pp. 111–116). Berlin: Springer-Verlag.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (1999). Advances in kernel methods—Support vector learning. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., & Vapnik, V. (1995). Extracting support data for a given task. In U. M. Fayyad & R. Uthurusamy (Eds.), Proceedings, First International Conference on Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.
Schölkopf, B., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Kernel-dependent support vector error bounds. In Ninth International Conference on Artificial Neural Networks (pp. 103–108). London: IEE.
Schölkopf, B., Smola, A., & Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., & Vapnik, V. (1997). Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45, 2758–2765.
Seifert, B., Gasser, T., & Wolf, A. (1993). Nonparametric estimation of residual variance revisited. Biometrika, 80, 373–383.
Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5), 1926–1940.
Shawe-Taylor, J., & Cristianini, N. (1999). Margin distribution bounds on generalization. In Computational Learning Theory: 4th European Conference (pp. 263–273). New York: Springer.
Smola, A. J. (1998). Learning with kernels. Doctoral dissertation, Technische Universität Berlin. Also: GMD Research Series No. 25, Birlinghoven, Germany.
Smola, A., Frieß, T., & Schölkopf, B. (1999). Semiparametric support vector and linear programming machines. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 585–591). Cambridge, MA: MIT Press.
Smola, A., Murata, N., Schölkopf, B., & Müller, K.-R. (1998). Asymptotically optimal choice of ε-loss for support vector machines. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks (pp. 105–110). Berlin: Springer-Verlag.
Smola, A., & Schölkopf, B. (1998). On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22, 211–231.
Smola, A., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Smola, A., Williamson, R. C., Mika, S., & Schölkopf, B. (1999). Regularized principal manifolds. In Computational Learning Theory: 4th European Conference (pp. 214–229). Berlin: Springer-Verlag.
Stitson, M., Gammerman, A., Vapnik, V., Vovk, V., Watkins, C., & Weston, J. (1999). Support vector regression with ANOVA decomposition kernels. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 285–291). Cambridge, MA: MIT Press.
Vapnik, V. (1979). Estimation of dependences based on empirical data [in Russian]. Moscow: Nauka. (English translation: Springer-Verlag, New York, 1982.)
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., & Chervonenkis, A. (1974). Theory of pattern recognition [in Russian]. Moscow: Nauka. (German translation: W. Wapnik & A. Tscherwonenkis, Theorie der Zeichenerkennung, Akademie-Verlag, Berlin, 1979.)
Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In B. Schölkopf, C. Burges, & A. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 69–88). Cambridge, MA: MIT Press.
White, H. (1994). Parametric statistical estimation with artificial neural networks: A condensed discussion. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks. Berlin: Springer.
Williamson, R. C., Smola, A. J., & Schölkopf, B. (1998). Generalization performance of regularization networks and support vector machines via entropy numbers of compact operators (Tech. Rep. 19, NeuroCOLT Series). London: Royal Holloway College. Available online at http://www.neurocolt.com.

Received December 2, 1998; accepted May 14, 1999.
Neural Computation, Volume 12, Number 6

CONTENTS

Article

Separating Style and Content with Bilinear Models
Joshua B. Tenenbaum and William T. Freeman    1247

Notes

Relationships Between the A Priori and A Posteriori Errors in Nonlinear Adaptive Neural Filters
Danilo P. Mandic and Jonathon A. Chambers    1285

The VC Dimension for Mixtures of Binary Classifiers
Wenxin Jiang    1293

The Early Restart Algorithm
Malik Magdon-Ismail and Amir F. Atiya    1303

Letters

Attractor Dynamics in Feedforward Neural Networks
Lawrence K. Saul and Michael I. Jordan    1313

Visualizing the Function Computed by a Feedforward Neural Network
Tony A. Plate, Joel Bert, John Grace, and Pierre Band    1337

Discriminant Pattern Recognition Using Transformation-Invariant Neurons
Diego Sona, Alessandro Sperduti, and Antonina Starita    1355

Observable Operator Models for Discrete Stochastic Time Series
Herbert Jaeger    1371

Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons
Shun-ichi Amari, Hyeyoung Park, and Kenji Fukumizu    1399

Nonmonotonic Generalization Bias of Gaussian Mixture Models
Shotaro Akaho and Hilbert J. Kappen    1411

Efficient Block Training of Multilayer Perceptrons
A. Navia-Vázquez and A. R. Figueiras-Vidal    1429

An Optimization Approach to Design of Generalized BSB Neural Associative Memories
Jooyoung Park and Yonmook Park    1449

Nonholonomic Orthogonal Learning Algorithms for Blind Source Separation
Shun-ichi Amari, Tian-Ping Chen, and Andrzej Cichocki    1463
ARTICLE
Communicated by John Platt
Separating Style and Content with Bilinear Models

Joshua B. Tenenbaum¤
Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139, U.S.A.

William T. Freeman
MERL, a Mitsubishi Electric Research Lab, 201 Broadway, Cambridge, MA 02139, U.S.A.

Perceptual systems routinely separate "content" from "style," classifying familiar words spoken in an unfamiliar accent, identifying a font or handwriting style across letters, or recognizing a familiar face or object seen under unfamiliar viewing conditions. Yet a general and tractable computational model of this ability to untangle the underlying factors of perceptual observations remains elusive (Hofstadter, 1985). Existing factor models (Mardia, Kent, & Bibby, 1979; Hinton & Zemel, 1994; Ghahramani, 1995; Bell & Sejnowski, 1995; Hinton, Dayan, Frey, & Neal, 1995; Dayan, Hinton, Neal, & Zemel, 1995; Hinton & Ghahramani, 1997) are either insufficiently rich to capture the complex interactions of perceptually meaningful factors such as phoneme and speaker accent or letter and font, or do not allow efficient learning algorithms. We present a general framework for learning to solve two-factor tasks using bilinear models, which provide sufficiently expressive representations of factor interactions but can nonetheless be fit to data using efficient algorithms based on the singular value decomposition and expectation-maximization. We report promising results on three different tasks in three different perceptual domains: spoken vowel classification with a benchmark multispeaker database, extrapolation of fonts to unseen letters, and translation of faces to novel illuminants.

1 Introduction
Perceptual systems routinely separate the "content" and "style" factors of their observations, classifying familiar words spoken in an unfamiliar accent, identifying a font or handwriting style across letters, or recognizing a familiar face or object seen under unfamiliar viewing conditions. These and many other basic perceptual tasks have in common the need to process separately two independent factors that underlie a set of observations. This article shows how perceptual systems may learn to solve these crucial two-factor tasks using simple and tractable bilinear models. By fitting such models to a training set of observations, the influences of style and content factors can be efficiently separated in a flexible representation that naturally supports generalization to unfamiliar styles or content classes. Figure 1 illustrates three abstract tasks that fall under this framework: classification, extrapolation, and translation. Examples of these abstract tasks in the domain of typography include classifying known characters in a novel font, extrapolating the missing characters of an incomplete novel font, or translating novel characters from a novel font into a familiar font. The essential challenge in all of these tasks is the same. A perceptual system observes a training set of data in multiple styles and content classes and is then presented with incomplete data in an unfamiliar style, missing either content labels (see Figure 1A) or whole observations (see Figure 1B) or both (see Figure 1C). The system must generate the missing labels or observations using only the available data in the new style and what it can learn about the interacting roles of style and content from the training set of complete data.

We describe a unified approach to the learning problems of Figure 1 based on fitting models that discover explicit parameterized representations of what the training data of each row have in common independent of column, what the data of each column have in common independent of row, and what all data have in common independent of row and column—the interaction of row and column factors. Such a modular representation naturally supports generalization to new styles or content.

¤ Current address: Department of Psychology, Stanford University, Stanford, CA 94305, U.S.A.

Neural Computation 12, 1247–1283 (2000) © 2000 Massachusetts Institute of Technology
For example, we can extrapolate a new style to unobserved content classes (see Figure 1B) by combining content and interaction parameters learned during training with style parameters estimated from available data in the new style.

Several models for the underlying factors of observations have recently been proposed in the literature on unsupervised learning. These include essentially additive factor models, as used in principal component analysis (Mardia et al., 1979), independent component analysis (Bell & Sejnowski, 1995), and cooperative vector quantization (Hinton & Zemel, 1994; Ghahramani, 1995), and hierarchical factorial models, as used in the Helmholtz machine and its descendants (Hinton et al., 1995; Dayan et al., 1995; Hinton & Ghahramani, 1997).

We model the mapping from style and content parameters to observations as a bilinear mapping. Bilinear models are two-factor models with the mathematical property of separability: their outputs are linear in either factor when the other is held constant. Their combination of representational expressiveness and efficient learning procedures enables bilinear models to overcome two principal drawbacks of existing factor models that might be applied to learning the tasks in Figure 1. In contrast to additive factor models, bilinear models provide for rich factor interactions by allowing factors to modulate each other's contributions multiplicatively (see section 2).
Figure 1: Given a labeled training set of observations in multiple styles (e.g., fonts) and content classes (e.g., letters), we want to (A) classify content observed in a new style, (B) extrapolate a new style to unobserved content classes, and (C) translate from new content observed only in new styles into known styles or content classes.
Model dimensionality can be adjusted to accommodate data that arise from arbitrarily complex interactions of style and content factors. In contrast to hierarchical factorial models, model fitting can be carried out by efficient
techniques well known from the study of linear models, such as the singular value decomposition (SVD) and the expectation-maximization (EM) algorithm, without having to invoke extensive stochastic (Hinton et al., 1995) or deterministic (Dayan et al., 1995) approximations.

Our approach is also related to the "learning-to-learn" research program (Thrun & Pratt, 1998)—also known as task transfer or multitask learning (Caruana, 1998). The central insight of learning to learn is that learning problems often come in clusters of related tasks, and thus learners may automatically acquire useful biases for a novel learning task by training on many related ones. Rather than families of related tasks, we focus on how learners can exploit the structure in families of related observations, bound together by their common styles, content classes, or style × content interaction, to acquire general biases useful for carrying out tasks on novel observations from the same family. Thus, our work is closest in spirit to the family discovery approach of Omohundro (1995), differing primarily in our focus on bilinear models to parameterize the style × content interaction.

Section 2 explains and motivates our bilinear modeling approach. Section 3 describes how these models are fit to a training set of observations. Sections 4, 5, and 6 present specific applications of these techniques to the three tasks of classification, extrapolation, and translation, using realistic data from three different perceptual domains. Section 7 suggests directions for future work, and section 8 offers some concluding comments.

We will use the terms style and content generically to refer to any two independent factors underlying a set of perceptual observations. For tasks that require generalization to novel classes of only one factor (see Figures 1A and 1B), we will refer to the variable factor (which changes during generalization) as style and the invariant factor (with a fixed set of classes) as content.
For example, in a task of recognizing familiar words spoken in an unfamiliar accent, we would think of the words as content and the accent as style. For tasks that require generalization across both factors (see Figure 1C), the labels style and content are arbitrary, and we will use them as seems most natural.

2 Bilinear Models
We have explored two bilinear models, closely related to each other, which we distinguish by the labels symmetric and asymmetric. This section describes the two models and illustrates them on a simple data set of face images.

2.1 Symmetric Model. In the symmetric model, we represent both style s and content c with vectors of parameters, denoted a^s and b^c and with dimensionalities I and J, respectively. Let y^{sc} denote a K-dimensional observation vector in style s and content class c. We assume that y^{sc} is a bilinear
Separating Style and Content
1251
function of a^s and b^c given most generally by the form

y_k^{sc} = \sum_{i=1}^{I} \sum_{j=1}^{J} w_{ijk} a_i^s b_j^c.    (2.1)
Here i, j, and k denote the components of style, content, and observation vectors, respectively.^1 The w_{ijk} terms are independent of style and content and characterize the interaction of these two factors. Their meaning becomes clearer when we rewrite equation 2.1 in vector form. Letting W_k denote the I × J matrix with entries {w_{ijk}}, equation 2.1 can be written as

y_k^{sc} = (a^s)^T W_k b^c.    (2.2)
In equation 2.2, the K matrices W_k describe a bilinear map from the style and content vector spaces to the K-dimensional observation space. The interaction terms have another interpretation, which can be seen by writing the symmetric model in a different vector form. Letting w_{ij} denote the K-dimensional vector with components {w_{ijk}}, equation 2.1 can be written as

y^{sc} = \sum_{i,j} w_{ij} a_i^s b_j^c.    (2.3)
In equation 2.3, the w_{ij} terms represent I × J basis vectors of dimension K, and the observation y^{sc} is generated by mixing these basis vectors with coefficients given by the tensor product of a^s and b^c. Of course, all of these interpretations are formally equivalent, but they suggest different intuitions, which we will exploit later. As a concrete example, Figure 2 illustrates a symmetric model of face images of different people in different poses (sampled from the complete data in Figure 6). Here the basis vector interpretation of the w_{ij} terms is most natural, by analogy to the well-known work on eigenfaces (Kirby & Sirovich, 1990; Turk & Pentland, 1991). Each pose is represented by a vector of I parameters, a_i^{pose}, and each person by a vector of J parameters, b_j^{person}. To render an image of a particular person in a particular pose, a set of I × J basis images w_{ij} is linearly mixed with coefficients given by the tensor product of these two parameter vectors (see equation 2.3). The symmetric model can exactly reproduce the observations when I and J equal the numbers of observed styles S and content classes C, respectively, as is the case in Figure 2. The model provides coarser but more compact representations as these dimensionalities are decreased.
1. The model in equation 2.1 may appear trilinear, but we view the w_{ijk} terms as describing a fixed bilinear mapping from a^s and b^c to y^{sc}.
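The three forms of the symmetric model (equations 2.1 through 2.3) are easy to check numerically. Below is a minimal NumPy sketch (Python rather than the article's MATLAB); the dimensions and random tensors are purely illustrative:

```python
import numpy as np

# Hypothetical dimensions: I-dim style, J-dim content, K-dim observations.
I, J, K = 4, 3, 5
rng = np.random.default_rng(0)
a = rng.standard_normal(I)          # style vector a^s
b = rng.standard_normal(J)          # content vector b^c
W = rng.standard_normal((I, J, K))  # interaction tensor w_ijk

# Equation 2.1: y_k = sum_ij w_ijk a_i b_j
y1 = np.einsum('ijk,i,j->k', W, a, b)

# Equation 2.2: y_k = (a^s)^T W_k b^c, with W_k the I x J slice of W
y2 = np.array([a @ W[:, :, k] @ b for k in range(K)])

# Equation 2.3: y = sum_ij (a_i b_j) w_ij, mixing K-dim basis vectors
y3 = sum(a[i] * b[j] * W[i, j, :] for i in range(I) for j in range(J))

assert np.allclose(y1, y2) and np.allclose(y1, y3)
```

All three computations produce the same observation vector; they differ only in which factor is grouped with the interaction terms.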
Figure 2: Illustration of a symmetric bilinear model for a small set of faces (a subset of Figure 6). The two factors for this example are person and pose. One vector of coefficients, a_i^{pose}, describes the pose, and a second vector, b_j^{person}, describes the person. To render a particular person under a particular pose, the vectors a_i^{pose} and b_j^{person} multiply along the four rows and three columns of the array of basis images w_{ij}. The weighted sum of basis images yields the reconstructed faces y^{pose,person}.
2.2 Asymmetric Model. Sometimes linear combinations of a few basis styles learned during training may not describe new styles well. We can obtain more flexible, asymmetric models by letting the interaction terms w_{ijk}
themselves vary with style. Then equation 2.1 becomes y_k^{sc} = \sum_{i,j} w_{ijk}^s a_i^s b_j^c. Without loss of generality, we can combine the style-specific terms of equation 2.1 into

a_{jk}^s = \sum_i w_{ijk}^s a_i^s,    (2.4)

giving

y_k^{sc} = \sum_j a_{jk}^s b_j^c.    (2.5)
Again, there are two interpretations of the model, corresponding to different vector forms of equation 2.5. First, letting A^s denote the K × J matrix with entries {a_{jk}^s}, equation 2.5 can be written as

y^{sc} = A^s b^c.    (2.6)
Here, we can think of the a_{jk}^s terms as describing a style-specific linear map from content space to observation space. Alternatively, letting a_j^s denote the K-dimensional vector with components {a_{jk}^s}, equation 2.5 can be written as

y^{sc} = \sum_j a_j^s b_j^c.    (2.7)
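The reduction from the symmetric to the asymmetric model (equations 2.4 through 2.6) can likewise be checked numerically. A hedged NumPy sketch with illustrative dimensions:

```python
import numpy as np

# Collapse style into a style-specific map A^s per equation 2.4:
# a_jk^s = sum_i w_ijk a_i^s, so that y^sc = A^s b^c (equation 2.6).
I, J, K = 4, 3, 5
rng = np.random.default_rng(1)
a_s = rng.standard_normal(I)          # style vector a^s
b_c = rng.standard_normal(J)          # content vector b^c
W = rng.standard_normal((I, J, K))    # interaction tensor w_ijk

A_s = np.einsum('ijk,i->kj', W, a_s)  # K x J style-specific linear map
y_asym = A_s @ b_c                    # equation 2.6
y_sym = np.einsum('ijk,i,j->k', W, a_s, b_c)  # equation 2.1
assert np.allclose(y_asym, y_sym)
```

This is the sense in which the asymmetric model "sums out" the style vector: a full K × J matrix per style replaces the shared interaction tensor.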
Now we can think of the a_j^s terms as describing a set of J style-specific basis vectors that are mixed according to content-specific coefficients b_j^c (independent of style) to produce the observations. Figure 3 illustrates an asymmetric bilinear model applied to the face database, with head pose as the style factor. Each pose is represented by a set of J basis images a_j^{pose} and each person by a vector of J parameters b_j^{person}. To render an image of a particular person in a particular pose, the pose-specific basis images are linearly mixed with coefficients given by the person-specific parameter vector. Note that the basis images for each pose look like eigenfaces (Turk & Pentland, 1991) in the appropriate style of each pose. However, they do not provide a true orthogonal basis for any one pose, as in Moghaddam and Pentland (1997), where a distinct set of eigenfaces is computed for each of several poses. Instead, the factorized structure of the model ensures that corresponding basis vectors play corresponding roles across poses (e.g., the first vector holds roughly the mean face for that pose, the second seems to modulate hair distribution, and the third seems to modulate head size), which is crucial for adapting to new styles. Familiar content can be easily
Figure 3: The images of Figure 2 represented by an asymmetric bilinear model with head pose as the style factor. Person-specific coefficients b_j^{person} multiply pose-specific basis images a_j^{pose}; the sum reconstructs a given person in a given pose. The basis images are similar to an eigenface representation within a given pose (Moghaddam & Pentland, 1997), except that in this model the different basis images are constrained to allow one set of person coefficients to reconstruct the same face across different poses.
translated to a new style by mixing the new style-specific basis functions with the old content-specific coefficients. Figure 4 shows the same data represented by an asymmetric model, but with the roles of style and content switched. Now the a_j^s parameters provide a basis set of poses for each person's face. Again, corresponding basis vectors play corresponding roles across styles (e.g., for each person's face, the first
Figure 4: Asymmetric bilinear model applied to the data of Figure 2, treating pose identity as the style factor. Now pose-specific coefficients b_j^{pose} weight person-specific basis images a_j^{person} to reconstruct the face data. Across different faces, corresponding basis images play corresponding roles in rotating head position.
vector holds roughly the mean pose, the second modulates head orientation, the third modulates amount of hair showing, and the fourth adds in facial detail), allowing ready stylistic translation. Finally, we note that because the asymmetric model can be obtained by summing out redundant degrees of freedom in the symmetric model (see equation 2.4),^2 the three sets of basis images in Figures 2 through 4 are not at all independent. Both the pose- and person-specific basis images in Figures 3 and 4 can be expressed as linear combinations of the symmetric model basis images in Figure 2, mixed according to the pose- or person-specific coefficients (respectively) from Figure 2. The asymmetric model's high-dimensional matrix representation of style may be too flexible in adapting to data in new styles and cannot support
2. This holds exactly only when the dimensionalities of style and content vectors are equal to the number of observed styles and content classes, respectively.
translation tasks (see Figure 1C) because it does not explicitly model the structure of observations that is independent of both style and content (represented by w_{ijk} in equation 2.1). However, if overfitting can be controlled by limiting the model dimensionality J or imposing some additional constraint, asymmetric models may solve classification and extrapolation tasks (see Figures 1A and 1B) that could not be solved using symmetric models with a realistic number of training styles.
3 Model Fitting
In conventional supervised learning situations, the data are divided into complete training patterns and incomplete (e.g., unlabeled) test patterns, which are assumed to be sampled randomly from the same distribution (Bishop, 1995). Learning then consists of fitting a model to the training data that allows the missing aspects of the test patterns (e.g., class labels in a classification task) to be filled in given the available information. The tasks in Figure 1, however, require that the learner generalize from training data sampled according to one distribution (i.e., in one set of styles and content classes) to test data drawn from a different but related distribution (i.e., in a different set of styles or content classes). Because of the need to adapt to a different but related distribution of data during testing, our approach to these tasks involves model fitting during both training and testing phases. In the training phase, we learn about the interaction of style and content factors by fitting a bilinear model to a complete array of observations of C content classes in S styles. In the testing or generalization phase, we adapt the same model to new observations that have something in common with the training set, in either content or style or in their interaction. The model parameters corresponding to the assumed commonalities are fixed at the values learned during training, and new parameters are estimated for the new styles or content using algorithms similar to those used in training. New and old parameters are then combined to accomplish the desired classification, extrapolation, or translation task. This section presents the basic algorithms for model fitting during training. The algorithms for model fitting during testing are essentially variants of these training algorithms but are task specific; they will be presented in the appropriate sections below.
The goal of model fitting during training is to minimize the total squared error over the training set for the symmetric or asymmetric models. This goal is equivalent to maximum likelihood estimation of the style and content parameters given the training data, under the assumption that the data were generated from the models plus independently and identically distributed gaussian noise.
3.1 Asymmetric Model. Because the procedure for fitting the asymmetric model is simpler, we discuss it first. Let y(t) denote the tth training observation (t = 1, ..., T). Let the indicator variable h^{sc}(t) = 1 if y(t) is in style s and content class c, and 0 otherwise. Then the total squared error E over the training set for the asymmetric model (in the form of equation 2.6) can be written as
E = \sum_{t=1}^{T} \sum_{s=1}^{S} \sum_{c=1}^{C} h^{sc}(t) \| y(t) - A^s b^c \|^2.    (3.1)
If the training set contains equal numbers of observations in each style and in each content class, there exists a closed-form procedure to fit the asymmetric model using the SVD. Although we are the first to use this procedure as the basis for a learning algorithm, it is mathematically equivalent to a family of computer vision algorithms (Koenderink & van Doorn, 1997) best known in the context of recovering structure from motion of tracked points under orthographic projection (Tomasi & Kanade, 1992). Let

\bar{y}^{sc} = \frac{\sum_t h^{sc}(t) y(t)}{\sum_t h^{sc}(t)},

the mean observation in style s and content class c. These observations are most naturally represented in a three-way array, but in order to work with standard matrix algorithms, we must stack these SC K-dimensional (column) vectors into a single (SK) × C observation matrix:
\bar{Y} = \begin{bmatrix} \bar{y}^{11} & \cdots & \bar{y}^{1C} \\ \vdots & \ddots & \vdots \\ \bar{y}^{S1} & \cdots & \bar{y}^{SC} \end{bmatrix}.    (3.2)
We can then write the asymmetric model (see equation 2.6) in compact matrix form,^3

\bar{Y} = AB,    (3.3)
identifying the (SK) × J matrix A and the J × C matrix B as the stacked style and content parameters, respectively:

A = \begin{bmatrix} A^1 \\ \vdots \\ A^S \end{bmatrix},    (3.4)
3. Strictly speaking, equation 3.3 is a model of the mean observations. However, when the data are evenly distributed across styles and content classes, the parameter values that minimize the total squared error for equation 3.3 will also minimize E in equation 3.1.
B = \begin{bmatrix} b^1 & \cdots & b^C \end{bmatrix}.    (3.5)
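Assuming equally distributed data, the closed-form SVD fit of equation 3.3 can be sketched as follows (NumPy rather than the article's MATLAB; the shapes and the synthetic rank-J matrix are illustrative, not from the article):

```python
import numpy as np

# Sketch of the closed-form fit of equation 3.3 (assumes balanced data).
# Ybar: (S*K) x C matrix of mean observations; J: model dimensionality.
S, K, C, J = 4, 6, 5, 2
rng = np.random.default_rng(2)
Ybar = rng.standard_normal((S * K, J)) @ rng.standard_normal((J, C))  # rank-J data

U, s, Vt = np.linalg.svd(Ybar, full_matrices=False)
A = U[:, :J] * s[:J]   # style parameters: first J columns of U S
B = Vt[:J, :]          # content parameters: first J rows of V^T

assert np.allclose(A @ B, Ybar)  # exact here because Ybar has rank J
# A stacks the S style matrices: A.reshape(S, K, J)[s] maps b^c to y^sc.
```

With real data of higher rank, truncating to J components gives the least-squares optimal rank-J approximation rather than an exact reconstruction.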
To find the least-squares optimal style and content parameters for equation 3.3, we compute the SVD of \bar{Y} = USV^T. (By convention, we always take the diagonal elements of S to be ordered by decreasing singular value.) We then define the style parameter matrix A to be the first J columns of US and the content parameter matrix B to be the first J rows of V^T. The model dimensionality J can be chosen in various ways: from prior knowledge, by requiring a sufficiently good approximation to the data, or by looking for an "elbow" in the singular value spectrum. If the training data are not distributed equally across the different styles and content classes, we must minimize equation 3.1 directly. There are many ways to do this. We use a quasi-Newton method (BFGS; Press, Teukolsky, Vetterling, & Flannery, 1992) with initial parameter estimates determined by the SVD of the mean observation matrix \bar{Y}, as described above. If there happen to be no observations in a particular style s and content class c, \bar{Y} will have some indeterminate (0/0) entries. Before taking the SVD of \bar{Y}, we replace any indeterminate entries by the mean of the observations in the appropriate style s (across all content classes) or content class c (across all styles). In our experience, this method has yielded satisfactory results, although it is at least an order of magnitude slower than the closed-form SVD solution. Note that if the training data are almost equally distributed across styles and content classes, then the closed-form SVD solution found by assuming that the data are exactly balanced will almost minimize equation 3.1, and improving this solution using quasi-Newton optimization may not be worth the much greater effort involved. Because all of the examples presented below have equally distributed training observations, we will need only the closed-form procedure for the remainder of this article.

3.2 Symmetric Model.
The total squared error E over the training set for the symmetric model (in the form of equation 2.2) can be written as

E = \sum_{t=1}^{T} \sum_{s=1}^{S} \sum_{c=1}^{C} \sum_{k=1}^{K} h^{sc}(t) \left( y_k(t) - (a^s)^T W_k b^c \right)^2.    (3.6)
Again, if we assume that the training set consists of an equal number of observations in each style and content class, there are efficient matrix algorithms for minimizing E. The algorithm we use was described for scalar observations by Magnus and Neudecker (1988) and adapted to vector observations by Marimont and Wandell (1992), in the context of characterizing color surface and illuminant spectra. Essentially, we repeatedly apply the SVD algorithm for fitting the asymmetric model, alternating the role of style and content factors within each iteration until convergence. First we need a few matrix definitions. Recall that \bar{Y} consists of the SC K-dimensional mean observation vectors \bar{y}^{sc} stacked into a single (SK) × C
Figure 5: Schematic illustration of the vector transpose (following Marimont & Wandell, 1992). For a matrix Y considered as a two-dimensional array of stacked vectors (A), its vector transpose Y^{VT} is defined to be the same stacked vectors with their row and column positions in the array transposed (B).
matrix (see equation 3.2). In general, for any AK × B matrix X constructed by stacking AB K-dimensional vectors A down and B across, we can define its vector transpose X^{VT} to be the BK × A matrix consisting of the same K-dimensional vectors stacked B down and A across, where the vector a across and b down in X becomes the vector b across and a down of X^{VT} (see Figure 5). In particular, \bar{Y}^{VT} consists of the means \bar{y}^{sc} stacked into a single (KC) × S matrix:
\bar{Y}^{VT} = \begin{bmatrix} \bar{y}^{11} & \cdots & \bar{y}^{S1} \\ \vdots & \ddots & \vdots \\ \bar{y}^{1C} & \cdots & \bar{y}^{SC} \end{bmatrix}.    (3.7)
Finally, we define the (IK) × J stacked weight matrix W, consisting of the IJ K-dimensional basis functions w_{ij} (see equation 2.3) in the form

W = \begin{bmatrix} w_{11} & \cdots & w_{1J} \\ \vdots & \ddots & \vdots \\ w_{I1} & \cdots & w_{IJ} \end{bmatrix}.    (3.8)

Its vector transpose W^{VT} is also defined accordingly.
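The vector transpose is a pure reindexing and can be implemented with a reshape and an axis swap. A minimal NumPy sketch (the function name and test sizes are our own):

```python
import numpy as np

def vector_transpose(X, K):
    """Vector transpose (Figure 5): X stacks A*B K-dim vectors, A down and
    B across; the result stacks the same vectors B down and A across."""
    A = X.shape[0] // K
    B = X.shape[1]
    # view as (A, K, B), swap the A and B axes, restack into (B*K, A)
    return X.reshape(A, K, B).transpose(2, 1, 0).reshape(B * K, A)

K = 3
x = np.arange(2 * K * 4).reshape(2 * K, 4)  # A=2, B=4 stacked vectors
xt = vector_transpose(x, K)
assert xt.shape == (4 * K, 2)
# vector at (row a, col b) of X equals vector at (row b, col a) of X^VT
a, b = 1, 2
assert np.array_equal(x[a*K:(a+1)*K, b], xt[b*K:(b+1)*K, a])
# applying it twice recovers the original stacking
assert np.array_equal(vector_transpose(xt, K), x)
```

Because it only permutes vectors, the vector transpose is its own inverse, which the last assertion checks.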
We can then write the symmetric model (see equation 2.2) in either of these two equivalent matrix forms,^4

\bar{Y} = [W^{VT} A]^{VT} B,    (3.9)

\bar{Y}^{VT} = [WB]^{VT} A,    (3.10)
identifying the I × S matrix A and the J × C matrix B as the stacked style and content parameter vectors, respectively:

A = \begin{bmatrix} a^1 & \cdots & a^S \end{bmatrix}, \quad B = \begin{bmatrix} b^1 & \cdots & b^C \end{bmatrix}.    (3.11)
The iterative procedure for estimating least-squares optimal values of A and B proceeds as follows. We initialize B using the closed-form SVD procedure described above for the asymmetric model (see equation 3.5). Note that this initial B is an orthogonal matrix (BB^T is the J × J identity matrix), so that [\bar{Y} B^T]^{VT} = W^{VT} A (from equation 3.9). Thus, given this initial estimate for B, we can compute the SVD of [\bar{Y} B^T]^{VT} = USV^T and update our estimate of A to be the first I rows of V^T. This A is also orthogonal, so that [\bar{Y}^{VT} A^T]^{VT} = WB (from equation 3.10). Thus, given this estimate of A, we can compute the SVD of [\bar{Y}^{VT} A^T]^{VT} = USV^T and update our estimate of B to be the first J rows of V^T. This completes one iteration of the algorithm. Typically, convergence occurs within five iterations (around 10 SVD operations). Convergence is also guaranteed. (See Magnus & Neudecker, 1988, for a proof for the scalar case, K = 1, which can be extended to the vector case considered here.) Upon convergence, we solve for W = [[\bar{Y} B^T]^{VT} A^T]^{VT} to obtain the basis vectors independent of both style and content. As with the asymmetric model, if the training data are not distributed equally across the different styles and content classes, we minimize equation 3.6 starting from the same initial estimates for A and B but using a more costly quasi-Newton method.

4 Classification
Many common classification problems involve multiple observations likely to be in one style, for example, recognizing the handwritten characters on an envelope or the accented speech of a telephone voice. People are significantly better at recognizing a familiar word spoken in an unfamiliar voice or a familiar letter written in an unfamiliar font when it is embedded in the context of other words or letters in the same novel style

4. As with the asymmetric model, when the data are evenly distributed across styles and content classes, the parameter values that minimize the total squared error for the mean observations in equations 3.9 and 3.10 will also minimize E in equation 3.6.
(van Bergem, Pols, & van Beinum, 1988; Sanocki, 1992; Nygaard & Pisoni, 1998), presumably because the added context allows the perceptual system to build a model of the new style and factor out its influence. In this section, we show how a perceptual system may use bilinear models and assumptions of style consistency to factor out the effects of style in content classification and thereby significantly improve classification performance on data in novel styles. We first describe the two concrete tasks investigated. We then present the general classification algorithm and the results of several experiments comparing this algorithm to standard techniques from the pattern recognition literature, such as nearest neighbor classification, which do not explicitly model the effects of style on content classification.

4.1 Task Specifics. We report experiments with two data sets: a benchmark speech data set and a face data set that we collected ourselves. The speech data consist of six samples of each of 11 vowels (content classes) uttered by 15 speakers (styles) of British English.^5 Each data vector consists of K = 10 log area parameters, a standard vocal tract representation computed from a linear predictive coding analysis of the digitized speech. The specific task we must learn to perform is classification of spoken vowels (content) for new speakers (styles). The face data were introduced in section 2 (see Figures 2-4). Figure 6 shows the complete data set, consisting of images of 11 people's faces (styles) viewed under 15 different poses (content classes). The poses span a grid of three vertical positions (up, level, down) and five horizontal positions (far left, left, straight ahead, right, far right). The pictures were shifted to align the nose tip position, found manually. The images were then blurred and cropped to 22 × 32 pixels, and represented simply as vectors of K = 704 pixel values each.
The specific task we must learn to perform is classification of head pose (content) for new people's faces (styles).

4.2 Algorithm. For both speech and face data sets, we train on observations in all content classes but in a subset of the available styles (the "training" styles). We fit asymmetric bilinear models (see equation 3.3) to these training data using the closed-form SVD procedure described in section 3.1. This yields a K × J matrix A^s representing each style s and a J-dimensional vector b^c representing each content class c. The model dimensionality J is a free parameter, which we discuss in depth at the end of this section. Fitting these asymmetric models to either data set required less than 1 second. This and all other execution times reported below were obtained for nonoptimized MATLAB code running on a Silicon Graphics O2 workstation.
5. The data were originally collected by David Deterding and are now available online from the UC Irvine machine learning repository (http://www.ics.uci.edu/~mlearn).
Figure 6: Classification of familiar content in a new style, illustrated on a data set of face images. During training, we observe the faces of different people (style) in different poses (content), resulting in the matrix of faces shown. During generalization, we observe a new person's face in each of these familiar poses, and we seek to classify the pose of each new image. The separable mixture model allows us to build up a model for the new face at the same time that we classify the new poses, thereby improving classification performance.
The generalization task is to classify observations in the remaining styles (the "test" styles) by filling in the missing content labels for these novel observations using the style-invariant content vectors b^c fixed during training (see Figure 6). Trying to estimate both content labels and style parameters for the new data presents a classic chicken-and-egg problem, very much like the problems of k-means clustering or mixture modeling (Duda & Hart, 1973). If the content class assignments were known, then it would be easy to estimate parameters for a new style \tilde{s} by simply inserting into the basic asymmetric bilinear model (i.e., equation 2.6) all the observation vectors in style \tilde{s}, along with the appropriate content vectors b^c, and solving for the style matrix A^{\tilde{s}}. Similarly, if we had a model A^{\tilde{s}} of new style \tilde{s}, then we could classify any test observation from this new style based simply on its distance to each of the known content vectors b^c multiplied by A^{\tilde{s}}.^6 Initially, however, both style models and content class assignments are unknown for the test data. To handle this uncertainty, we embed the bilinear model within a gaussian mixture model to yield a separable mixture model (SMM) (Tenenbaum &
6. This is equivalent to maximum likelihood classification, assuming gaussian likelihood functions centered on the predictions of the bilinear model A^{\tilde{s}} b^c.
Freeman, 1997), which can then be fit efficiently to new data using the EM algorithm (Dempster, Laird, & Rubin, 1977). The mixture model has S × C gaussian components, one for each pair of S styles and C content classes, with means given by the predictions of the bilinear model. However, only on the order of (S + C) parameters are needed to represent the means of these S × C gaussians, because of the bilinear model's separable structure. To classify known content in new styles and estimate new style parameters simultaneously, the EM algorithm alternates between calculating the probabilities of all content labels given current style parameter estimates (E-step) and estimating the most likely style parameters given current content label probabilities (M-step), with likelihood determined by the gaussian mixture model. If, in addition, the test data are not segmented according to style, style label probabilities can be calculated simultaneously as part of the E-step. More formally, after training on labeled data from S styles and C content classes, we are given test data from the same C content classes and \tilde{S} new styles, with labels for content (and possibly also style) missing. The new style matrices A^{\tilde{s}} are also unknown, but the content vectors b^c are known, having been fixed in the training phase. We assume that the probability of a new unlabeled observation y being generated in new style \tilde{s} and old content c is given by a gaussian distribution with variance \sigma^2 centered at the prediction of the bilinear model: p(y | \tilde{s}, c) \propto \exp\{-\|y - A^{\tilde{s}} b^c\|^2 / (2\sigma^2)\}. The total probability of y is then given by the mixture distribution p(y) = \sum_{\tilde{s},c} p(y | \tilde{s}, c) p(\tilde{s}, c). Here we assume equal prior probabilities p(\tilde{s}, c), unless the observations are otherwise labeled.^7 The EM algorithm (Dempster et al., 1977) alternates between two steps in order to find new style matrices A^{\tilde{s}} and style-content labels p(\tilde{s}, c | y) that best explain the test data. In the E-step, we compute the probabilities p(\tilde{s}, c | y) = p(y | \tilde{s}, c) p(\tilde{s}, c) / p(y) that each test vector y belongs to style \tilde{s} and content class c, given the current style matrix estimates. In the M-step, we estimate new style matrices by setting A^{\tilde{s}} to maximize the total log-likelihood of the test data, L* = \sum_y \log p(y). The M-step can be computed in closed form by solving the equations \partial L* / \partial A^{\tilde{s}} = 0, which are linear in A^{\tilde{s}} given the probability estimates from the E-step and the quantities m_{\tilde{s}c} = \sum_y p(\tilde{s}, c | y) y and n_{\tilde{s}c} = \sum_y p(\tilde{s}, c | y):
A^{\tilde{s}} = \left[ \sum_c m_{\tilde{s}c} (b^c)^T \right] \left[ \sum_c n_{\tilde{s}c} b^c (b^c)^T \right]^{-1}.    (4.1)
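One EM iteration of the SMM, with gaussian E-step responsibilities and the closed-form M-step of equation 4.1, can be sketched as follows (NumPy rather than the article's MATLAB; the data, sizes, and uniform prior are illustrative assumptions, not the article's experiments):

```python
import numpy as np

# Y: N x K test vectors; B: J x C known content vectors; A: list of K x J
# style matrices for the S~ new styles; sigma2: gaussian variance.
rng = np.random.default_rng(4)
N, K, J, C, S_new = 20, 6, 2, 3, 2
Y = rng.standard_normal((N, K))
B = rng.standard_normal((J, C))
A = [rng.standard_normal((K, J)) for _ in range(S_new)]
sigma2 = 1.0

# E-step: responsibilities p(s~,c|y) from gaussian likelihoods (uniform prior)
logp = np.stack([[-np.sum((Y - (A[s] @ B[:, c]))**2, axis=1) / (2 * sigma2)
                  for c in range(C)] for s in range(S_new)])  # (S~, C, N)
logp -= logp.max(axis=(0, 1), keepdims=True)   # stabilize before exponentiating
p = np.exp(logp)
p /= p.sum(axis=(0, 1), keepdims=True)         # p[s, c, n] sums to 1 per vector

# M-step: closed-form update of each style matrix (equation 4.1)
for s in range(S_new):
    m = np.einsum('cn,nk->ck', p[s], Y)        # m_sc = sum_y p(s~,c|y) y
    n = p[s].sum(axis=1)                       # n_sc = sum_y p(s~,c|y)
    lhs = sum(np.outer(m[c], B[:, c]) for c in range(C))            # K x J
    rhs = sum(n[c] * np.outer(B[:, c], B[:, c]) for c in range(C))  # J x J
    A[s] = lhs @ np.linalg.inv(rhs)

# content classification: argmax_c sum_s~ p(s~,c|y)
labels = p.sum(axis=0).argmax(axis=0)
```

In practice this pair of steps is repeated until the log-likelihood converges or the iteration cap t_max discussed below is reached.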
The EM algorithm is guaranteed to converge to a local maximum of L*, and this typically takes around 20 iterations for our problems. After each

7. That is, if the style identity s* of y is labeled but the content class is unknown, we let p(\tilde{s}, c) = 1/C if \tilde{s} = s* and 0 if \tilde{s} \neq s*. If both the style and content identities of y are unlabeled, we let p(\tilde{s}, c) = 1/(\tilde{S}C) for all \tilde{s} and c.
E-step, test vectors in new styles can be classified by selecting the content class c that maximizes p(c | y) = \sum_{\tilde{s}} p(\tilde{s}, c | y).^8 Classification performance is determined by the percentage of test data for which the probability of content class c, as given by EM, is greatest for the actual content class. Note that because content classification is determined as a by-product of the E-step, the standard practice of running EM until convergence to a local maximum in likelihood will not necessarily lead to optimal classification performance. In fact, we have often observed overfitting behavior, in which optimal classification is obtained after only two or three iterations of EM, but the likelihood continues to increase in subsequent iterations as observations are assigned to incorrect content classes. This classification algorithm thus has three free parameters for which good values must somehow be determined. In addition to the model dimensionality J mentioned above, these parameters include the model variance \sigma^2 and the maximum number of iterations for which EM is run, t_max. In general, we set these parameters using a leave-one-style-out cross-validation procedure with the training data. That is, given S complete training styles, we train separate bilinear models on each of the S subsets of S - 1 styles and evaluate these models' classification performance on the one style that each was not trained on. For each of these S training set splits, we try a range of values for the parameters J, \sigma^2, and t_max. Those values that yield the best average performance over the S training set splits are then used in fitting a new bilinear model to the full training data and generalizing to the designated test set. The main drawback of this cross-validation procedure is that it can be time-consuming. Although each run of EM is quite fast, on the order of seconds, an exhaustive search of the full parameter space in a leave-one-out design may take hours.
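The leave-one-style-out splits themselves are simple to generate; a small sketch (the function name and style labels are hypothetical):

```python
# Hedged sketch of the leave-one-style-out splits used for setting J,
# sigma^2, and t_max: each training style serves once as the held-out style.
def leave_one_style_out(styles):
    styles = sorted(set(styles))
    for held_out in styles:
        train = [s for s in styles if s != held_out]
        yield train, held_out

style_labels = [0, 1, 2, 3]          # four hypothetical training styles
splits = list(leave_one_style_out(style_labels))
assert len(splits) == 4
assert all(held not in train for train, held in splits)
# For each split: fit a bilinear model on `train`, score classification on
# the held-out style for each candidate (J, sigma^2, t_max) setting, and
# keep the setting with the best average score across splits.
```

The expensive part is not the splitting but the grid of (J, \sigma^2, t_max) candidates evaluated inside each split, which is what makes the exhaustive search slow.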
In real applications, such a search would hopefully need to be done only once in a given domain, if at all, and might also be guided by domain-specific knowledge that could narrow the search space significantly. Initialization is an important factor in determining the success of the EM algorithm. Because we are primarily interested in good classification, we initialize EM in the E-step, using the results of a simple 1-nearest-neighbor classifier to set the content-class assignments p(c | y). That is, we initially assign each test observation to the content class of the most similar training observation (for which the content labels are known).

4.3 Results. We conducted four experiments: three using the benchmark speech data to investigate different aspects of the algorithm's behavior and one using the face data to provide evidence from a separate domain. In each case, we report results with all three free parameters set using the
8. If the style s* of y is known, then p(\tilde{s}, c | y) = 0 for all \tilde{s} \neq s* (see previous note). In that case, p(c | y) is simply equal to p(s*, c | y).
Table 1: Accuracy of Classifying Spoken Vowels from a Benchmark Multispeaker Database.

Classifier                                                              Percentage Correct on Test Data
Multilayer perceptron (MLP)                                             51%
Radial basis function network (RBF)                                     53%
1-nearest neighbor (1-NN)                                               56%
Discriminant adaptive nearest neighbor (DANN)
  Generic parameter settings                                            59.7%
  Optimal parameter settings                                            61.7%
Separable mixture models (SMM), test speakers labeled
  All parameters set by CV (J = 3, \sigma^2 = 1/64, t_max = 2)          75.8%
  t_max = \infty; J, \sigma^2 set by CV (J = 3, \sigma^2 = 1/32)        68.2%
  t_max = \infty; optimal J, \sigma^2 (J = 4, \sigma^2 = 1/16)          77.3%
Separable mixture models (SMM), test speakers not labeled
  All parameters set by CV (J = 3, \sigma^2 = 1/64, t_max = 2)          69.8% ± 0.3%
  t_max = \infty; J, \sigma^2 set by CV (J = 3, \sigma^2 = 1/32)        59.9% ± 1.1%
  t_max = \infty; optimal J, \sigma^2 (J = 3, \sigma^2 = 1/64)          63.2% ± 1.5%

Notes: MLP, RBF, and 1-NN results were obtained by Robinson (1989). The DANN classifier (Hastie & Tibshirani, 1996) achieves the best performance we know of for an approach that does not adapt to new speakers. The SMM classifiers perform significantly better vowel classification by simultaneously modeling speaker style. "CV" denotes the cross-validation procedure for parameter setting described in the text.
cross-validation procedure described above, as well as for two conditions in which EM was run until convergence (tmax = ∞): both J and σ² determined by cross-validation, and both J and σ² set to their optimal values (as an indicator of best in-principle performance for the maximum likelihood solution).

4.3.1 Train 8, Test 7 on Speech Data: Speakers Labeled. The first experiment with the speech data was the standard benchmark task described in Robinson (1989). Robinson compared many learning algorithms trained to categorize vowels from the first eight speakers (four male and four female) and tested on samples from the remaining seven speakers (four male and three female). The variety and the small number of styles make this a difficult task. Table 1 shows the best results we know of for standard approaches that do not adapt to new speakers. Of the many techniques that Robinson (1989) tested, 1-nearest neighbor (1-NN) performs the best, with 56.3% correct; chance is approximately 9% correct. The discriminant adaptive nearest neighbor (DANN) classifier (Hastie & Tibshirani, 1996) slightly outperforms 1-NN, obtaining 59.7% correct with its generic parameter settings and 61.7% correct with optimal parameter settings.
1266
Joshua B. Tenenbaum and William T. Freeman
After fitting an asymmetric bilinear model to the training data, we tested classification performance using our SMM on the seven new speakers' data. We first assumed that style labels were available for the test data (indicating only a change of speaker, but no information about the new speaker's style). Running EM until convergence (tmax = ∞), our best results of 77.3% correct were obtained with J = 4 and σ² = 1/16. Almost comparable results of 75.8% correct were obtained using cross-validation to set all parameters (J, σ², tmax) automatically (see Table 1). The SMM clearly outperforms the many nonadaptive techniques tested, exploiting extra information available in the speaker labels that nonadaptive techniques make no use of.

4.3.2 Train 8, Test 7 on Speech Data: Speakers Not Labeled. We next repeated the benchmark experiment without the assumption that any style labels were available during testing. Thus, our SMM algorithm and the nonadaptive techniques had exactly the same information available for each test observation (although the SMM had style labels available during training). We used EM to figure out both the speaker assignments and the vowel class assignments for the test data. EM requires that the number of style components S in the mixture model be set in advance; we chose S = 7 (the actual number of distinct speakers) for consistency with the previous experiment. In initializing EM in the E-step, we assigned each test observation equally to each new style component, plus or minus 10% random noise to break symmetry and allow each style component to adapt to a distinct subset of the new data. Using cross-validation to set J, σ², and tmax automatically, we obtained 69.8% ± 0.3% correct. Table 1 presents results for other parameter settings. These scores reflect average performance over 10 random initializations. Not surprisingly, SMM performance here was worse than in the previous section, where the correct style labels were assumed to be known.
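The two E-step initializations described in these experiments (1-NN hard assignments for content classes; near-uniform responsibilities, perturbed by roughly 10% noise, for unlabeled style components) can be sketched as follows. This is an illustrative reconstruction, not the authors' code; all names are ours.

```python
import numpy as np

def init_estep(n_styles, train_X, train_content, test_X, rng=None):
    """Sketch of the E-step initialization described above.

    Content: each test observation is hard-assigned to the content class
    of its nearest training observation (1-NN).
    Style: uniform responsibilities over the S style components, plus
    ~10% multiplicative noise to break symmetry, then renormalized.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_test = len(test_X)
    n_classes = int(train_content.max()) + 1

    # 1-NN content initialization: p(c|x) is one-hot at the neighbor's class
    p_content = np.zeros((n_test, n_classes))
    for i, x in enumerate(test_X):
        nn = np.argmin(((train_X - x) ** 2).sum(axis=1))
        p_content[i, train_content[nn]] = 1.0

    # Near-uniform style responsibilities with +/-10% noise, renormalized
    p_style = np.full((n_test, n_styles), 1.0 / n_styles)
    p_style *= 1.0 + 0.1 * rng.uniform(-1.0, 1.0, size=p_style.shape)
    p_style /= p_style.sum(axis=1, keepdims=True)
    return p_style, p_content
```

The noisy symmetric style initialization lets EM pull each style component toward a distinct subset of the test data, as described above.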
Nonetheless, the SMM still provided a significant improvement over the best nonadaptive approaches tested.

4.3.3 Train 14, Test 1 on Speech Data. We noticed that the performance of the SMM on the speech data varied widely across different test speakers and also depended significantly on the particular speakers chosen for training and testing. In some cases, the SMM did not perform much better than 1-NN, and in other cases the SMM actually did worse. Thus, we decided to conduct a more systematic study of the effects of individual speaker style on generalization. Specifically, we tested the SMM's ability to classify the speech of each of the 15 speakers in the database individually when the other 14 speakers were used for training. Because only one speaker was presented during testing, we used only a single style model in EM, and thus there was no distinction between the "speakers labeled" and "speakers not labeled" conditions investigated in the previous two sections. Averaged over all 15 possible test speakers, 1-NN obtained 63.9% ± 3.4% correct. Using cross-validation to set J, σ², and tmax automatically, our SMM
obtained 74.3% ± 4.2% correct. Running EM until convergence (tmax = ∞), we obtained 75.6% ± 4.0% correct using the best parameter settings of J = 11, σ² = .01, and 73.5% ± 4.4% correct using cross-validation to select J and σ² automatically. These SMM scores are not significantly different from each other but are all significantly higher than 1-NN as measured by paired t-tests (p < .01 in all cases). The results suggest that the superior performance of the SMM over nonadaptive classifiers such as nearest neighbor will hold in general over a range of different test styles. This experiment also provided an opportunity to assess systematically the performance details of EM on the task of speech classification. For all 15 possible test speakers, EM converged within 30 iterations, taking less than 5 seconds. The optimal number of iterations for EM as determined by cross-validation was always between 1 and 10 iterations, with a mean of 3.2 iterations. We can assess the chance of falling into a bad local minimum with EM by comparing the classification performance of the SMM fit using EM with the performance of 1-NN on a speaker-by-speaker basis. For 14 of 15 test speakers, the SMM consistently obtained better classification rates than 1-NN, regardless of which of the three parameter-setting procedures described above was used. The remaining speaker was always classified better by 1-NN. Evidently the vast majority of stylistic variations in this domain can be successfully modeled by the SMM and separated from the relevant content to be classified using the EM algorithm.

4.3.4 Train 10, Test 1 on Face Data. To provide further support for the general advantage of the SMM over nonadaptive approaches, we replicated the previous experiment using the face database instead of the speech data. Although the two databases are of roughly comparable size, the nature of the observations is quite different: 704-dimensional vectors of pixel values versus 10-dimensional vectors of vocal tract log area coefficients.
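The paired t-tests used in these comparisons operate on per-speaker accuracies of two classifiers evaluated on matched test sets. A minimal sketch with hypothetical accuracy numbers (the actual per-speaker scores are not reproduced here):

```python
import numpy as np

# Hypothetical per-speaker accuracies for the SMM and 1-NN (15 test speakers)
smm = np.array([0.81, 0.74, 0.69, 0.78, 0.72, 0.80, 0.66, 0.75,
                0.77, 0.70, 0.73, 0.79, 0.68, 0.76, 0.55])
nn  = np.array([0.70, 0.62, 0.60, 0.66, 0.61, 0.68, 0.57, 0.64,
                0.65, 0.59, 0.63, 0.67, 0.58, 0.64, 0.60])

def paired_t(a, b):
    """Paired t statistic: mean of the matched differences over its
    standard error. A two-sided p-value compares |t| against the
    t distribution with len(a) - 1 degrees of freedom."""
    d = a - b
    return d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))

t_stat = paired_t(smm, nn)
```

The last entry mimics the one speaker classified better by 1-NN; because the test pairs scores within each speaker, a consistent per-speaker advantage yields a large t even when absolute accuracies vary widely across speakers.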
Specifically, we tested the SMM's ability to classify head pose for each of the 11 faces in the database individually when the other 10 people's faces were used for training. There were 15 possible poses, with one image of each face in each pose. Averaged over all 11 possible test faces, 1-NN obtained 53.9% ± 4.3% correct. Using cross-validation to set J, σ², and tmax automatically, our SMM obtained 73.9% ± 6.7% correct. Running EM until convergence (tmax = ∞), we obtained 80.6% ± 7.5% correct using the best parameter settings of J = 6, σ² = 10⁵, and 75.8% ± 6.4% correct using cross-validation to select J and σ² automatically. As on the speech data, these SMM scores are not significantly different from each other but do represent significant improvements over 1-NN as measured by paired t-tests (p < .05 in all cases). The performance details of EM on this face classification task were quite similar to those on the speech classification task reported above. For all 11 possible test faces, EM converged within 30 iterations, taking less than 20 seconds. (The longer convergence times here reflect the higher-dimensional
vectors involved.) The optimal number of iterations for EM as determined by cross-validation was between 9 and 20 iterations, with a mean of 12.6 iterations. For 9 of 11 test faces, the SMM consistently obtained better classification rates than 1-NN over all three parameter-setting procedures we used, indicating that local minima do not prevent the EM algorithm from factoring out the majority of stylistic variations in this domain.

4.4 Discussion. Our approach to style-adaptive content classification involves two significant modeling choices: the use of a bilinear model of the mean observations and the use of a gaussian mixture model, centered on the predictions of the bilinear model, for observations whose content or style assignments are unknown. The mixture model provides a principled probabilistic framework, allowing us to use the EM algorithm to solve the chicken-and-egg problem of simultaneously estimating style parameters for the new data and labeling the data according to content class (and possibly style as well). The bilinear structure of the model allows the M-step to be computed in closed form by solving systems of linear equations. In this sense, bilinear models represent the content of observations independent of their style in a form that can be generalized easily to model data in new styles. To summarize our results, we found that separating style and content with bilinear models improves content classification in new styles substantially over the best nonadaptive approaches to classification, even when no style information is available during testing, and dramatically so when style demarcation is available. Although our SMM classifier has several free parameters that must be chosen a priori, we showed that near-optimal values can be determined automatically using a cross-validation training procedure.
We obtained good results on two very different data sets, low-dimensional speech data and high-dimensional face image data, suggesting that our approach may be widely applicable to many two-factor classification tasks that can be thought of in terms of recognizing invariant content elements under variable style.

5 Extrapolation
The ability to draw analogies across observations in different contexts is a hallmark of human perception and cognition (Hofstadter, 1995; Holyoak & Barnden, 1994). In particular, the ability to produce analogous content in a novel style, and not just recognize it, as in the previous section, has been taken as a severe test of perceptual abstraction (Hofstadter, 1995; Grebert, Stork, Keesing, & Mims, 1992). The domain of typography provides a natural place to explore these issues of analogy and production. Indeed, Hofstadter (1995) has argued that the question, "What is the letter 'a'?" may be "the central problem of AI" (p. 633). Following Hofstadter, we study the task of extrapolating a novel font from an incomplete set of letter observations in that font to the remaining unobserved letters. We first describe the
task specifics and our shape representation for letter observations. We then present our algorithm for stylistic extrapolation based on bilinear models and show results on extrapolating a natural font.

5.1 Task Specifics. Given a training set of C = 62 characters (content) in S = 5 standard fonts (style), the task is to generate characters that are stylistically consistent with letters in a novel sixth font. The initial data were obtained by digitizing the uppercase letters, lowercase letters, and digits 0–9 of the six fonts at 38 × 38 pixels using Adobe Photoshop. Successful shape modeling often depends on having an image representation that makes explicit the appropriate structure that is only implicit in raw pixel values. Specifically, the need to represent shapes of different topologies in comparable forms motivates using a particle-based representation (Szeliski & Tonnesen, 1992). We also want the letters in our representation to behave like a linear vector space, where linear combinations of letters also look like letters. Beymer and Poggio (1996) advocate a dense warp map for related problems. Combining these two ideas, we chose to represent each letter shape by a 2 × 38 × 38 = 2888-dimensional vector of (horizontal and vertical) displacements that a set of 38 × 38 = 1444 ink particles must undergo to form the target shape from a reference grid. With identical particles, there are many possible such warp maps. To ensure that similarly shaped letters are represented by similar warps, we use a physical model. We give each particle of the reference shape (taken to be the full rectangular bitmap) unit positive charge, and each pixel of the target letter negative charge proportional to its gray-level intensity. The total charge of the target letter is set equal to the total charge of the reference shape. We track the electrostatic force lines from each particle of the reference shape to where they intersect the plane of the target letter (see Figure 7A).
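The force field that drives this correspondence step follows directly from Coulomb's law. The sketch below (our own illustrative code, not the authors') computes the net force on each unit-positive reference particle exerted by the negatively charged target pixels; a full implementation would additionally trace field lines from the reference plane to the target plane to obtain the final correspondences.

```python
import numpy as np

def coulomb_forces(ref_pts, tgt_pts, tgt_charge):
    """Net electrostatic force on each unit-positive reference particle
    from the negatively charged target pixels, via Coulomb's law.

    ref_pts: (N, 3) reference particle positions (e.g., plane z = 0).
    tgt_pts: (M, 3) target pixel positions (e.g., plane z = 100).
    tgt_charge: (M,) magnitudes of the (negative) target charges,
        proportional to gray-level intensity. Attraction means the force
        on a reference particle points toward the target charges.
    """
    d = tgt_pts[None, :, :] - ref_pts[:, None, :]   # (N, M, 3) separations
    r2 = (d ** 2).sum(axis=-1)                      # squared distances
    r2 = np.maximum(r2, 1e-9)                       # guard divide-by-zero
    mag = tgt_charge[None, :] / r2                  # |F| ~ q / r^2
    # Sum unit-vector-weighted magnitudes over all target charges
    return (mag[..., None] * d / np.sqrt(r2)[..., None]).sum(axis=1)
```

Following the text, the reference bitmap would carry unit positive charges and the target letter negative charges normalized to the same total, with the two planes separated by 100 pixel units in depth.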
The target shape is positioned opposite the reference shape, 100 pixel units away in depth. The force lines land with uniform density over the target letter, creating a correspondence between each ink particle of the reference shape and a position on the target letter. The result is a smooth, dense warp map specifying a translation that each ink particle of the reference shape must undergo to reassemble the collection of particles into the target letter shape (see Figure 7B). The electrostatic forces used to find the correspondences are easily calculated from Coulomb's law (Jackson, 1975). We call this a Coulomb warp representation. To render a warp map representation of a shape, we first translate each particle of the reference shape by its warp map value, using a grid at four times the linear pixel resolution. We then blur with a gaussian (standard deviation equal to five pixels), threshold the ink densities at 60% of their maximum, and subsample to the original font resolution. By allowing noninteger charge values and subpixel translations, we can preserve font antialiasing information. Figure 8 shows three pairs of shapes of different topologies, and the average of each pair in a pixel representation and in a Coulomb warp representation.

Figure 7: Coulomb warp representation for shape. A target shape is represented by the displacements that a collection of ink particles would undergo to assemble themselves into the target shape. (A) Correspondences between ink particles of the reference shape (positive charges) and those of the target shape (negative charges) are found by tracing the electrostatic force lines between positively and negatively charged ink particles. (B) The vector from the start to the end of the field line specifies the translation for each particle. The result is a smooth warp, with similar shapes represented by similar warps.

Figure 8: The result of averaging shapes together under different representations. Averaging in a pixel representation always gives a "double exposure" of the two shapes. Under the Coulomb warp representation, a circle averaged with a square gives a rounded square; a filled circle averaged with a square gives a very thick rounded square. The letter A averaged with the letter B yields a shape that is arguably intermediate to the shapes of the two letters. This property of the representation makes it well suited to linear algebraic manipulations with bilinear models.

Averaging the shapes in a pixel representation simply yields a "double exposure" of the two images; averaging in a Coulomb warp representation results in a shape intermediate to the two being averaged.

5.2 Algorithm. During training, we fit the asymmetric bilinear model (see equation 3.3) to five full fonts using the closed-form SVD procedure described in section 3.1. This yields a K × J matrix A^s representing each font s and a J-dimensional vector b^c representing each letter class c, with the observation dimensionality K = 2888. Training on this task took approximately 48 seconds. Adapting the model to an incomplete new style s̃ can be carried out in closed form using the content vectors b^c learned during training. Suppose we observe M samples of style s̃, in content classes C = {c_1, ..., c_M}. We find the style matrix A^s̃ that minimizes the total squared error over the test data,

    E* = Σ_{c ∈ C} ||y^{s̃c} − A^{s̃} b^c||².    (5.1)
The minimum of E* is found by solving the linear system ∂E*/∂A^{s̃} = 0. Missing observations in the test style s̃ and known content class c can then be synthesized from y^{s̃c} = A^{s̃} b^c. In order to allow the model sufficient expressive range to produce natural-looking letter shapes, we set the model dimensionality J as high as possible. However, such a flexible model led to overfitting on the available letters of the test font and consequently poor synthesis of the missing letters in that font. To regularize the style fit to the test data and thereby avoid overfitting, we add a prior term to the squared-error cost of equation 5.1 that encourages A^{s̃} to be close to the linear combination of training style parameters A^1, ..., A^S that best fits the test font. Specifically, let A^OLC be the value of A^{s̃} that minimizes equation 5.1 subject to the constraint that A^OLC is a linear combination of the training style parameters A^s, that is, A^OLC = Σ_{s=1}^{S} α_s A^s for some values of α_s. (OLC stands for "optimal linear combination.") Without loss of generality, we can think of the α_s coefficients as the best-fitting style parameters of a symmetric bilinear model with dimensionality I equal to the number of styles S. We then redefine E* to include this new cost,

    E* = Σ_{c ∈ C} ||y^{s̃c} − A^{s̃} b^c||² + λ ||A^{s̃} − A^OLC||²,    (5.2)

and again minimize E* by solving the linear system ∂E*/∂A^{s̃} = 0. The trade-off between these two costs is determined by the free parameter λ, which we set by eye to yield results with the best appearance. For this example, we used a model dimensionality of 60 and λ = 2 × 10⁴.⁹ All model fitting during this testing phase took less than 10 seconds.

9.
This large value of λ was necessary to compensate for the differences in scale between the two error norms in equation 5.2 and does not directly represent the relative importance of these two terms. To assess the relative contributions of the asymmetric and symmetric (OLC) terms to the new style matrix A^{s̃}, we can compare the distances from A^{s̃} to the two style matrices that minimize equation 5.2 for λ = 0 and λ = ∞. Call these two matrices A^{s̃}_asym and A^{s̃}_sym (= A^OLC), respectively. For λ = 2 × 10⁴, the distance ||A^{s̃} − A^{s̃}_asym|| is approximately 2.5 times greater than the distance ||A^{s̃} − A^{s̃}_sym||, suggesting that the symmetric (OLC) term makes a greater, but not vastly greater, contribution to the new style matrix.

Figure 9: Style extrapolation in typography. (A) Rows 1–5: The first 13 (of 62) letters of the training fonts. Row 6: The novel test font, with A–I unseen by the model. (B) Row 1: Font extrapolation results. Row 2: The actual unseen letters of the test font.

5.3 Results. Figure 9 shows the results of extrapolating the unseen letters "A"–"I" of a new font, Monaco, using the asymmetric model with a symmetric model (OLC) prior as described above. All characters in the Monaco font except the uppercase letters "A" through "I" were used to estimate its style parameters (via equation 5.2). In contrast to the objective performance scores on the classification tasks reported in the previous section, here the evaluation of our results is necessarily subjective. Given examples of "A" through "I" in only the five training fonts, the model nonetheless has succeeded in rendering these letters in the test font, with approximately correct shapes for each letter class and with the distinctive stylistic features of Monaco: strict upright posture and uniformly thin strokes. Note that each of these stylistic features appears separately in one or more of the training fonts, but they do not appear together in any one training font.

Figure 10: Result of different methods applied to the font extrapolation problem (Figure 9), where unseen letters in a new font are synthesized. The asymmetric bilinear model has too many parameters to fit, and generalization to new letters is poor (second column). The symmetric bilinear model has only 5 degrees of freedom for our data and fails to represent the characteristics of the new font (third column). We used the symmetric model result as a prior to constrain the flexibility of the asymmetric model, yielding the results shown here (fourth column) and in Figure 9. All of these methods used the Coulomb warp representation for shape. Performing the same calculations in a pixel representation requires blurring the letters so that linear combinations can modify shape, and yields barely passable results (first column). The far right column shows the actual letters of the new font.

5.4 Discussion. We have shown that it is possible to learn the style of a font from observations and extrapolate that style to unseen letters, using a hybrid of asymmetric and symmetric bilinear models. Note that the asymmetric model uses 173,280 degrees of freedom (the 2888 × 60 matrix A^{s̃}) to describe the test style, while the optimal linear combination style model A^OLC uses only five degrees of freedom (the coefficients α_s of the five training styles). Results using only the high-dimensional asymmetric model without the low-dimensional OLC term are far too unconstrained and fail to look like recognizable letters (see Figure 10, second column). Results using only the low-dimensional OLC model without the high-dimensional asymmetric model are clearly recognizable as the correct letters but fail to capture the distinctive style of Monaco (see Figure 10, third column). The combination of these two terms, with a flexible high-dimensional model constrained to lie near the subspace of known style parameters, is capable of successful stylistic extrapolation on this example (see Figure 9 and Figure 10, fourth column). Perceptually, the hybrid model results appear more similar to the low-dimensional OLC results than to the high-dimensional asymmetric model results, which is consistent with the fact that the hybrid style matrix is somewhat closer in Euclidean distance to the OLC matrix than to the pure asymmetric model (see note 9). Using an appropriate representation, such as our Coulomb warp, was also important in obtaining visually satisfying results. Applying the same modeling methodology to a pixel-space representation of letters resulted in significantly less appealing output (see Figure 10, first column). Previous models of extrapolation and abstraction in typography have been restricted to artificial grid-based fonts, for which the grid elements provide a reasonable distributed representation (Hofstadter, 1995; Grebert et al., 1992), or even simpler "grandmother cell" representations of each letter (Polk & Farah, 1997). In contrast, our shape representation allowed us to work directly with natural fonts. Why did this extrapolation task require a hybrid modeling strategy and a specially designed input representation, while the classification tasks of the previous section did not? Two general differences between classification and extrapolation tasks make the latter kind of task objectively more difficult. First, successful classification requires only that the outputs of the bilinear model be close to the test data under the generic metric of mean squared error, while success in extrapolation is judged by the far more subtle metric of visual appearance (Teo & Heeger, 1994).
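The regularized style fit of equation 5.2 has a closed form: setting ∂E*/∂A^{s̃} = 0 gives A^{s̃} (B Bᵀ + λI) = Y Bᵀ + λ A^OLC, where Y and B stack the test observations y^{s̃c} and the corresponding content vectors b^c as columns. A minimal sketch under that formulation (variable names ours):

```python
import numpy as np

def fit_style_olc(Y, B, A_olc, lam):
    """Minimize sum_c ||y_c - A b_c||^2 + lam * ||A - A_olc||_F^2 over A.

    Y:     K x M matrix of observed vectors in the new style (columns y_c).
    B:     J x M matrix of the learned content vectors b_c.
    A_olc: K x J prior style matrix (best linear combination of training
           styles); lam trades data fit against closeness to the prior.
    Setting the gradient to zero gives A (B B^T + lam I) = Y B^T + lam A_olc.
    """
    J = B.shape[0]
    return (Y @ B.T + lam * A_olc) @ np.linalg.inv(B @ B.T + lam * np.eye(J))
```

With λ = 0 this reduces to the unregularized least-squares fit of equation 5.1; as λ → ∞ the solution approaches A^OLC, recovering the two limiting cases compared in Figure 10.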
Second, successful classification requires that the model outputs be close to the data only in a relative sense (the test data must be closer to the model outputs using the correct content classes than to the model outputs using incorrect content classes), while extrapolation tasks require model and data to be close in an absolute sense (the model outputs must actually look like plausible instances of the correct content classes). Although the choice of model and representation turned out to be essential in this example, our results were obtained without any detailed knowledge or processing specific to the domain of typography. Hofstadter (1995) has been critical of approaches to stylistic extrapolation that minimize the role of domain-specific knowledge and processing, in particular, the connectionist model of Grebert et al. (1992), arguing that models that "don't know anything about what they are doing" (p. 408) cannot hope to capture the subtleties and richness of a human font designer's productions. We agree with Hofstadter's general diagnosis. There are many aspects of typographical competence, and we have tried to model only a subset of those. In particular, we have not tried to model the higher-level creative processes of expert font designers, who draw on elaborate knowledge bases, reflect on the results of their work, and engage in multiple revisions of each synthesized character. However, we do think that our approach captures two
essential aspects of the human competence in font extrapolation: people's representations of letter and font characteristics are modular and independent of each other, and people's knowledge of letters and fonts is abstracted from the ability to perform the particular task of character synthesis. Generic connectionist approaches to font extrapolation such as Grebert et al. (1992) do not satisfy these constraints. Letter and font information is mixed together inextricably in the extrapolation network's weights, and an entirely different network would be needed to perform recognition or classification tasks with the same stimuli. Our bilinear modeling approach, in contrast, captures the perceptual modularity of style and content in terms of the mathematical property of separability that characterizes equations 2.1 through 2.7. Knowledge of style s (in equation 2.6, for example) is localized to the matrix parameter A^s, while knowledge of content class c is localized to the vector parameter b^c, and both can be freely combined with other content or style parameters, respectively. Moreover, during training, our models acquire knowledge about the interaction of style and content factors that is truly abstracted from any particular behavior, and thus can support not only extrapolation of a novel style but also a range of other synthesis and recognition tasks, as shown in Figure 1.

6 Translation
Many important perceptual tasks require the perceiver to recover simultaneously two unknown pieces of information from a single stimulus in which these variables are confounded. A canonical example is the problem of separating the intrinsic shape and texture characteristics of a face from the extrinsic lighting conditions, which are confounded in any one image of that face. In this section, we show how a perceptual system may, using a bilinear model, learn to solve this problem from a training set of faces labeled according to identity and lighting condition. The bilinear model allows a novel face viewed under a novel illuminant to be translated to its appearance under known lighting conditions, and the known faces to be translated to the new lighting condition. Such translation tasks are the most difficult kind of two-factor learning task, because they require generalization across both factors at once. That is, what is common across both training and test data sets is not any particular style or any particular content class, but only the manner in which these two factors interact. Thus, only a symmetric bilinear model (see equations 2.1–2.3) is appropriate, because only it represents explicitly the interaction between style and content factors, in the W_k parameters.

6.1 Task Specifics. Given a training set of C = 23 faces (content) viewed under S = 3 different lighting conditions (style) and a novel face viewed under a novel light source, the task is to translate the new face to known lighting conditions and the known faces to the new lighting condition. The
face images, provided by Y. Moses of the Weizmann Institute, were cropped to remove nonfacial features and blurred and subsampled to 48 × 80 pixels. Because these faces were aligned and lacked sharp edge features (unlike the typographical characters of the previous section), we could represent the images directly as 3840-dimensional vectors of pixel brightness values.

6.2 Algorithm. We first fit the symmetric model (see equation 2.2) to the training data using the iterated SVD procedure described in section 3.2. This yields vector representations a^s and b^c of each illuminant s and face c, and a matrix of interaction parameters W (defined in equation 3.8). The dimensionalities for a^s and b^c were set equal to S and C, respectively, allowing the bilinear model maximum expressivity while still ensuring a unique solution. Because these dimensionalities were set to their maximal values, only one iteration of the fitting algorithm (taking approximately 10 seconds) was required for complete training. For generalization from a single test image ỹ, we adapt the model simultaneously to both the new face identity c̃ and the new illuminant s̃, while holding fixed the face × illuminant interaction term W learned during training. Specifically, we first make an initial guess for the new face identity vector b^{c̃} (e.g., the mean of the training set content vectors) and solve for the least-squares optimal estimate of the illuminant vector a^{s̃}:

    a^{s̃} = ([W b^{c̃}]^VT)^{−1} ỹ.    (6.1)

Here [...]^{−1} denotes the pseudoinverse function. Given this new value for a^{s̃}, we then reestimate b^{c̃} from

    b^{c̃} = ([W^VT a^{s̃}]^VT)^{−1} ỹ,    (6.2)

and iterate equations 6.1 and 6.2 until both a^{s̃} and b^{c̃} converge. In the example presented in Figure 11, convergence occurred after 14 iterations (approximately 16 seconds). We can then generate images of the new face under known illuminant s (from y_k^{s c̃} = a^{sT} W_k b^{c̃}) and of known faces under the new illuminant (from y_k^{s̃ c} = a^{s̃T} W_k b^c).

6.3 Results. Figure 11 shows results, with both the old faces translated to the new illuminant and the new face translated to the old illuminants. For comparison, the true images are shown next to the synthetic ones. Again, evaluation of these results must necessarily be subjective. The lighting and shadows for each synthesized image appear approximately correct, as do the facial features of the old faces translated to the new illuminant. The facial features of the new face translated to the old illuminants appear slightly
Figure 11: Translation across style and content in shape-from-shading. (A) Rows 1–3: The first 13 (of 24) faces viewed under the three illuminants used for training. Row 4: The single test image of a new face viewed under a new light source. (B) Column 1: Translation of the new face to known illuminants. Column 2: The actual (unseen) images. (C) Row 1: Translation of known faces to the new illuminant. Row 2: The actual (unseen) images.
blurred, but otherwise resemble the new face more than any of the old faces. One reason the synthesized images of the new face are not as sharp as the synthesized images of the old faces is that the latter are produced by averaging images of a single face under several lighting conditions, across which all the facial features are precisely aligned, while the former are produced by averaging images of many faces under a single lighting condition, across which the facial features vary significantly in their positions.

6.4 Discussion. A history of applying linear models to face images motivates our bilinear modeling approach. The original work on eigenfaces (Kirby & Sirovich, 1990; Turk & Pentland, 1991) showed that images of many different faces taken under fixed lighting and viewpoint conditions occupy a low-dimensional linear subspace of pixel space. Subsequent work by Hallinan (1994) established the complementary result that images of a single face taken under many different lighting conditions also occupy a low-dimensional linear subspace. Thus, the factors of facial identity and illumination have already been shown to satisfy approximately the definition of bilinearity (the effects of one factor are linear when the other is held constant), so it is natural to integrate them into a bilinear model.
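The alternating updates of equations 6.1 and 6.2 can be sketched concretely if we store W as a stack of K matrices W_k with y_k = aᵀ W_k b (our indexing, in place of the paper's VT notation; all names are ours). Given a guess for the content vector b, the model is linear in a, and vice versa, so each half-step is a pseudoinverse solve:

```python
import numpy as np

def render(W, a, b):
    """Synthesize an observation: y_k = a^T W_k b for each output dim k."""
    return np.einsum('kij,i,j->k', W, a, b)

def adapt(W, y, b0, n_iter=100):
    """Alternating least-squares adaptation of style vector a and content
    vector b to a single observation y, with the K x I x J interaction
    stack W held fixed (analogue of iterating eqs. 6.1 and 6.2)."""
    b = b0
    for _ in range(n_iter):
        M_b = np.einsum('kij,j->ki', W, b)   # rows (W_k b)^T, so y = M_b a
        a = np.linalg.pinv(M_b) @ y          # least-squares style update
        M_a = np.einsum('kij,i->kj', W, a)   # rows a^T W_k, so y = M_a b
        b = np.linalg.pinv(M_a) @ y          # least-squares content update
    return a, b
```

Each half-step is the exact least-squares solve for one factor with the other held fixed, so the residual decreases monotonically; in the face example above, the corresponding iteration converged after 14 iterations.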
While the general problem of separating shape, texture, and illumination features in an image is underdetermined (Barrow & Tenenbaum, 1978), here the bilinear model learned during training provides sufficient constraint to approximately recover both face and illumination parameters from a single novel image. Atick, Griffin, & Redlich (1996) proposed a related approach to learning shape-from-shading for face images, based on a linear model of head shape in three dimensions and a nonlinear physical model of the image formation process. In contrast, our bilinear model uses only two-dimensional (i.e., image-based) information and requires no prior knowledge about the physics of image formation. Of course, we have not solved the general shape-from-shading recovery problem for arbitrary objects under arbitrary illumination. Our solution (as well as that of Atick et al., 1996) depends critically on the assumption that the new image, like the images in the training set, depicts an upright face under reasonable lighting conditions. In fact, there is evidence that the brain often does not solve the shape-from-shading problem in its most general form, but rather has learned (or evolved) solutions to important special cases such as face images (Cavanagh, 1991). So-called Mooney faces—brightness-thresholded face images in which shading is the only cue to shape—can be easily recognized as images of three-dimensional surfaces when viewed in upright position, but cannot be discriminated from two-dimensional ink blotches when viewed upside down so that the shading conventions are atypical (Shepard, 1990), or when the underlying three-dimensional structure has been distorted away from a globally facelike shape (Moore & Cavanagh, 1998). More generally, the ability to learn constrained solutions to a priori underconstrained inference problems may turn out to be essential for perception (Poggio & Hurlbert, 1994; Nayar & Poggio, 1996). 
Bilinear models offer one simple and general framework for how biological and artificial perceptual systems may learn to solve a wide range of such tasks. 7 Directions for Future Work
The most obvious extension of our work is to observations and tasks with more than two underlying factors, via multilinear models (Magnus & Neudecker, 1988). For example, a symmetric trilinear model in three factors q, r, and s would take the form:

$$y_l^{qrs} = \sum_{i,j,k} w_{ijkl}\, a_i^q b_j^r c_k^s. \qquad (7.1)$$
The procedures for fitting these models are direct generalizations of the learning algorithms for two-factor models that we describe here. As in the two-factor case, we iteratively apply linear matrix techniques to solve for the parameters of each factor given parameter estimates for the other factors, until all parameter estimates converge (Magnus & Neudecker, 1988).
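Equation 7.1 is a straightforward tensor contraction. A minimal NumPy sketch with assumed toy dimensions, checking the vectorized evaluation against the explicit triple sum:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, L = 2, 3, 4, 5
W = rng.normal(size=(I, J, K, L))                 # w_{ijkl}
a = rng.normal(size=I)                            # a_i^q
b = rng.normal(size=J)                            # b_j^r
c = rng.normal(size=K)                            # c_k^s

# Equation 7.1: y_l = sum_{i,j,k} w_{ijkl} a_i b_j c_k
y = np.einsum('ijkl,i,j,k->l', W, a, b, c)

# Brute-force evaluation of the same triple sum
y_loop = np.zeros(L)
for i in range(I):
    for j in range(J):
        for k in range(K):
            y_loop += W[i, j, k] * a[i] * b[j] * c[k]
print(np.allclose(y, y_loop))
```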
As with any other learning framework, the success of bilinear models depends on having suitable input representations. Further research is needed to determine what kinds of representations will endow specific kinds of observations with the most nearly bilinear structure. In particular, the font extrapolation task might benefit from a representation that is better tailored to the important features of letter shapes. Also, it would be of interest to develop general procedures for incorporating available domain-specific knowledge into bilinear models, for example, via a priori constraints on the model parameters (Simard, LeCun, & Denker, 1993). Finally, we would like to explore the relevance of bilinear models for research in neuroscience and psychophysics. The essential representational feature of bilinear models is their multiplicative factor interactions. At the physiological level, multiplicative neuronal interactions (Andersen, Essick, & Siegel, 1985; Olshausen, Anderson, & van Essen, 1993; Riesenhuber & Dayan, 1997), arising from nonlinear synaptic (Koch, 1997) or population-level (Salinas & Abbott, 1993) mechanisms, have been proposed for visual computations that require the synergistic combination of two inputs, such as modulating spatial attention (Andersen et al., 1985; Olshausen et al., 1993; Riesenhuber & Dayan, 1997; Salinas & Abbott, 1993) or estimating motion (Koch, 1997). Might these same kinds of circuits be co-opted to perform some of the tasks we study here? It has been suggested that SVD, the essential computational procedure for learning in our framework, can be implemented naturally in neural networks using Hebb-like learning rules (Sanger, 1989, 1994; Oja, 1989). Could these or similar mechanisms support a biologically plausible learning algorithm for bilinear models? 
At the level of psychophysics, many studies have demonstrated that the ability of the human visual and auditory systems to factor out contextually irrelevant dimensions of variation, such as lighting conditions or speaker accent, is neither perfect nor instantaneous. Observations in unusual styles are frequently more time-consuming to process and more likely to be processed incorrectly (van Bergem et al., 1988; Sanocki, 1992; Nygaard & Pisoni, 1998; Moore & Cavanagh, 1998). Do bilinear models have difficulty on the same kinds of stimuli that people do? Do the dynamics of adaptation in bilinear models (e.g., EM for classification, or the iterative SVD-based procedure for translation) take longer to converge on stimuli that people are slower to process? Can our approach provide a framework for designing and modeling perceptual learning experiments, in which subjects learn to separate the effects of style and content factors in a novel stimulus domain? These are just a few of the empirical questions motivated by a bilinear modeling approach to studying perceptual inference. 8 Conclusions
In one sense, bilinear models are not new to perceptual research. It has previously been shown that several core vision problems, such as the recovery
of structure from motion under orthographic projection (Tomasi & Kanade, 1992) and color constancy under multiple illuminants (Brainard & Wandell, 1991; Marimont & Wandell, 1992; D'Zmura, 1992), are solvable efficiently because they are fundamentally bilinear at the level of geometry or physics (Koenderink & van Doorn, 1997). These results are important but of limited usefulness, because many two-factor problems in perception do not have this true bilinear character. More commonly, perceptual inferences based on accurate physical models of factor interactions are either very complex (as in speech recognition), underdetermined (as in shape-from-shading), or simply inappropriate (as in typography). Here we have proposed that perceptual systems may often solve such challenging two-factor tasks without detailed domain knowledge, using bilinear models to learn approximate solutions rather than to describe explicitly the intrinsic geometry or physics of the problem. We presented a suite of simple and efficient learning algorithms for bilinear models, based on the familiar techniques of SVD and EM. We then demonstrated the scope of this approach with applications to three two-factor tasks—classification, extrapolation, and translation—using three kinds of signals—speech, typography, and face images. With their combination of broad applicability and ready learning algorithms, bilinear models may prove to be a generally useful component in the tool kits of engineers and brains alike. Acknowledgments
We thank E. Adelson, B. Anderson, M. Bernstein, D. Brainard, P. Dayan, W. Richards, and Y. Weiss for helpful discussions. Joshua Tenenbaum was a Howard Hughes Medical Institute predoctoral fellow. Preliminary reports of this material appeared in Tenenbaum and Freeman (1997) and Freeman and Tenenbaum (1997). References Andersen, R., Essick, G., & Siegel, R. (1985). The encoding of spatial location by posterior parietal neurons. Science, 230, 456–458. Atick, J. J., Griffin, P. A., & Redlich, A. N. (1996). Statistical approach to shape from shading: Reconstruction of 3D face surfaces from single 2D images. Neural Computation, 8, 1321–1340. Barrow, H. G., & Tenenbaum, J. M. (1978). Recovering intrinsic scene characteristics from images. In A. R. Hanson & E. M. Riseman (Eds.), Computer vision systems (pp. 3–26). New York: Academic Press. Bell, A., & Sejnowski, T. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6), 1129–1159. Beymer, D., & Poggio, T. (1996). Image representations for visual learning. Science, 272, 1905–1909.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press. Brainard, D., & Wandell, B. (1991). A bilinear model of the illuminant's effect on color appearance. In M. S. Landy and J. A. Movshon (Eds.), Computational models of visual processing (chap. 13). Cambridge, MA: MIT Press. Caruana, R. (1998). Multitask learning. In S. Thrun and L. Pratt (Eds.), Learning to learn (pp. 95–134). Norwell, MA: Kluwer. Cavanagh, P. (1991). What's up in top-down processing? In A. Gorea (Ed.), Representations of vision: Trends and tacit assumptions in vision research (pp. 295–304). Cambridge: Cambridge University Press. Dayan, P., Hinton, G., Neal, R., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904. Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of Royal Statist. Soc. B, 39, 1–38. Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley-Interscience. D'Zmura, M. (1992). Color constancy: Surface color from changing illumination. Journal of the Optical Society of America A, 9, 490–493. Freeman, W. T., & Tenenbaum, J. B. (1997). Learning bilinear models for two-factor problems in vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (pp. 554–560). San Juan, PR. Ghahramani, Z. (1995). Factorial learning and the EM algorithm. In G. Tesauro, D. Touretzky, and T. Leen (Eds.), Advances in neural information processing systems (Vol. 7, pp. 617–624). Cambridge, MA: MIT Press. Grebert, I., Stork, D. G., Keesing, R., & Mims, S. (1992). Connectionist generalization for production: An example from gridfont. Neural Networks, 5, 699–710. Hallinan, P. W. (1994). A low-dimensional representation of human faces for arbitrary lighting conditions. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (pp. 995–999). Hastie, T., & Tibshirani, R. (1996). 
Discriminant adaptive nearest neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18, 607–616. Hinton, G., Dayan, P., Frey, B., & Neal, R. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161. Hinton, G., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Phil. Trans. Royal Soc. B, 352, 1177–1190. Hinton, G. E., & Zemel, R. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In J. Cowan, G. Tesauro, and J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 3–10). San Mateo, CA: Morgan Kaufmann. Hofstadter, D. (1985). Metamagical themas. New York: Basic Books. Hofstadter, D. (1995). Fluid concepts and creative analogies. New York: Basic Books. Holyoak, K., & Barnden, J. (Eds.). (1994). Advances in connectionist and neural computation theory. Norwood, NJ: Ablex. Jackson, J. D. (Ed.). (1975). Classical electrodynamics (2nd ed.). New York: Wiley.
Kirby, M., & Sirovich, L. (1990). Application of the Karhunen-Loeve procedure for the characterization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(1), 103–108. Koch, C. (1997). Computation and the single neuron. Nature, 385, 207–211. Koenderink, J. J., & van Doorn, A. J. (1997). The generic bilinear calibration-estimation problem. International Journal of Computer Vision, 23(3), 217–234. Magnus, J. R., & Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. New York: Wiley. Mardia, K., Kent, J., & Bibby, J. (1979). Multivariate analysis. London: Academic Press. Marimont, D. H., & Wandell, B. A. (1992). Linear models of surface and illuminant spectra. Journal of the Optical Society of America A, 9(11), 1905–1913. Moghaddam, B., & Pentland, A. P. (1997). Probabilistic visual learning for object representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 696–710. Moore, C., & Cavanagh, P. (1998). Recovery of 3D volume from 2-tone images of novel objects. Cognition, 67(1,2), 45–71. Nayar, S., & Poggio, T. (Eds.). (1996). Early visual learning. New York: Oxford University Press. Nygaard, L. C., & Pisoni, D. B. (1998). Talker-specific learning in speech perception. Perception and Psychophysics, 60(3), 355–376. Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1, 61–68. Olshausen, B., Anderson, C., & van Essen, D. (1993). A neural model of visual attention and invariant pattern recognition. Journal of Neuroscience, 13, 4700–4719. Omohundro, S. M. (1995). Family discovery. In D. Touretzky, M. Moser, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 402–408). Cambridge, MA: MIT Press. Poggio, T., & Hurlbert, A. (1994). Observations on cortical mechanisms for object recognition and learning. In C. Koch & J. Davis (Eds.), Large scale neuronal theories of the brain (pp. 
152–182). Cambridge, MA: MIT Press. Polk, T. A., & Farah, M. J. (1997). A simple common contexts explanation for the development of abstract letter identities. Neural Computation, 9(6), 1277–1289. Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. New York: Cambridge University Press. Riesenhuber, M., & Dayan, P. (1997). Neural models for part-whole hierarchies. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 17–23). Cambridge, MA: MIT Press. Robinson, A. (1989). Dynamic error propagation networks. Unpublished doctoral dissertation, Cambridge University. Salinas, E., & Abbott, L. (1993). A model of multiplicative neural responses in parietal cortex. In Proc. Nat. Acad. Sci. USA (pp. 11956–11961). Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459–473. Sanger, T. (1994). Two algorithms for iterative computation of the singular value decomposition from input/output samples. In J. Cowan, G. Tesauro, &
J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 144–151). San Mateo, CA: Morgan Kaufmann. Sanocki, T. (1992). Effects of font- and letter-specific experience on the perceptual processing of letters. American Journal of Psychology, 105(3), 435–458. Shepard, R. N. (1990). Mind sights. New York: Freeman. Simard, P. Y., LeCun, Y., & Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, & L. Giles (Eds.), Advances in neural information processing systems, 5. San Mateo, CA: Morgan Kaufmann. Szeliski, R., & Tonnesen, D. (1992). Surface modeling with oriented particle systems. In Proc. SIGGRAPH 92 (Vol. 26, pp. 185–194). Tenenbaum, J. B., & Freeman, W. T. (1997). Separating style and content. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 662–668). Cambridge, MA: MIT Press. Teo, P., & Heeger, D. (1994). Perceptual image distortion. In First IEEE International Conference on Image Processing (Vol. 2, pp. 982–986). New York: IEEE. Thrun, S., & Pratt, L. (Eds.). (1998). Learning to learn. Norwell, MA: Kluwer. Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: A factorization method. International Journal of Computer Vision, 9(2), 137–154. Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71–86. van Bergem, D. R., Pols, L. C., & van Beinum, F. J. K. (1988). Perceptual normalization of the vowels of a man and a child in various contexts. Speech Communication, 7(1), 1–20. Received October 27, 1998; accepted June 8, 1999.
NOTE
Communicated by Simon Haykin
Relationships Between the A Priori and A Posteriori Errors in Nonlinear Adaptive Neural Filters Danilo P. Mandic School of Information Systems, University of East Anglia, Norwich, U.K. Jonathon A. Chambers Signal Processing Section, Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, London, U.K. The lower bounds for the a posteriori prediction error of a nonlinear predictor realized as a neural network are provided. These are obtained for a priori adaptation and a posteriori error networks with sigmoid nonlinearities trained by gradient-descent learning algorithms. A contractivity condition is imposed on a nonlinear activation function of a neuron so that the a posteriori prediction error is smaller in magnitude than the corresponding a priori one. Furthermore, an upper bound is imposed on the learning rate $\eta$ so that the approach is feasible. The analysis is undertaken for both feedforward and recurrent nonlinear predictors realized as neural networks. 1 Introduction
A posteriori techniques have been considered in the area of linear adaptive filters (Treichler, Johnson, & Larimore, 1987; Ljung & Soderstrom, 1983; Douglas & Rupp, 1997). However, in the area of neural networks, the use of a posteriori techniques is still in its infancy. Recently it has been shown that an a posteriori approach in the neural networks framework exhibits behavior corresponding to the normalized least mean square (NLMS) algorithm in the linear adaptive filters case (Mandic & Chambers, 1998). Consequently, it is expected that the instantaneous a posteriori output error $\bar{e}(k)$ is smaller in magnitude than the corresponding a priori error $e(k)$ (Treichler et al., 1987; Mandic & Chambers, 1998). However, little is known about the relationships between the a posteriori and a priori error, the learning rate, the slope of the nonlinear activation function of a neuron, and the feasibility of such a neural predictor.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1285–1292 (2000)

In the case of a single-node neural network, with a nonlinear activation function of a neuron $\Phi$, the a priori output of a network $y$ is given by

$$y(k) = \Phi\!\left( \mathbf{x}^T(k) \mathbf{w}(k) \right), \qquad (1.1)$$

where $\mathbf{x}(k)$, $\mathbf{w}(k)$, and $(\cdot)^T$ denote, respectively, the input vector to a network,
the weight vector, and the vector transpose operator. Function $\Phi$ is assumed to belong to the class of sigmoid functions. The updated weight vector $\mathbf{w}(k+1)$ is available from the learning algorithm before the next updated input vector $\mathbf{x}(k+1)$; therefore, an a posteriori estimate $\bar{y}$ can be calculated as

$$\bar{y}(k) = \Phi\!\left( \mathbf{x}^T(k) \mathbf{w}(k+1) \right). \qquad (1.2)$$

The corresponding instantaneous a priori and a posteriori errors at the output neuron of a neural network are given respectively as $e(k) = d(k) - y(k)$ and $\bar{e}(k) = d(k) - \bar{y}(k)$, where $d(k)$ is some teaching signal. Our aim is to preserve

$$|\bar{e}(k)| \le \gamma\, |e(k)|, \quad 0 < \gamma < 1, \qquad (1.3)$$
at each iteration, for both feedforward and recurrent neural networks acting as a nonlinear predictor. This a priori learning a posteriori error algorithm corresponds to the normalized gradient-descent algorithm for neural networks (Haykin, 1996; Mandic & Chambers, 1998). In this work we seek to guarantee that the a posteriori error $\bar{e}$ is uniformly smaller in magnitude than the corresponding a priori error $e$. The problem can be represented in the gradient-descent setting as

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \nabla_{\mathbf{w}} E\big(e(k)\big)$$
$$\bar{e}(k) = d(k) - \Phi\!\left( \mathbf{x}^T(k) \mathbf{w}(k+1) \right)$$
subject to
$$|\bar{e}(k)| \le \gamma\, |e(k)|, \quad 0 < \gamma < 1, \qquad (1.4)$$

where the cost function $E(e(k))$ is some nonlinear function of the instantaneous a priori output error $e(k)$, typically a quadratic function of $e(k)$. Here, we provide relationships between the a priori prediction error $e(k)$, the a posteriori prediction error $\bar{e}(k)$, the learning rate of a gradient-descent learning algorithm $\eta(k)$, and the slope $\beta$ of a nonlinear activation function of a neuron $\Phi$, for both the feedforward and recurrent case, with respect to objective 1.4. Constraints are imposed on the nonlinear activation function $\Phi$ so that equation 1.3 holds. Moreover, conditions on the learning rate $\eta$ are given so that approach 1.4 is feasible. In that case, by way of example, we derive the relationship between the learning rate $\eta$ and the slope $\beta$ for the logistic nonlinear activation function of a neuron.

2 Contraction Mapping and Nonlinear Activation Functions
By the contraction mapping theorem (CMT), a function $K$ is a contraction on $[a, b] \subset \mathbb{R}$ if (Gill, Murray, & Wright, 1981; Zeidler, 1986): i. $x \in [a, b] \Rightarrow K(x) \in [a, b]$
Figure 1: Contraction mapping.
ii. $\exists\, \gamma \in \mathbb{R}^+$, $\gamma < 1$, s.t. $|K(x) - K(y)| \le \gamma\, |x - y|$, $\forall x, y \in [a, b]$,
as shown in Figure 1. Applying the mean value theorem (MVT) (Luenberger, 1969) to the definition of the CMT, for all $x, y \in [a, b]$ there exists $\xi \in (a, b)$ such that

$$|\Phi(x) - \Phi(y)| = |\Phi'(\xi)(x - y)| = |\Phi'(\xi)|\, |x - y|. \qquad (2.1)$$
Now, the clause $\gamma < 1$ in part ii of the CMT becomes $\gamma \ge |\Phi'(\xi)|$, $\xi \in (a, b)$. For the example of the logistic nonlinear activation function of a neuron,

$$\Phi(x) = \frac{1}{1 + e^{-\beta x}}, \qquad (2.2)$$

whose first derivative is

$$\Phi'(x) = \frac{\beta e^{-\beta x}}{\left( 1 + e^{-\beta x} \right)^2}, \qquad (2.3)$$

$\gamma < 1 \Leftrightarrow \beta < 4$ is the condition for function $\Phi$ to be a contraction on any $[a, b] \subset \mathbb{R}$ (Mandic & Chambers, 1999a).

3 The Case of a Feedforward Neural Filter
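Before turning to the feedforward case, the section 2 result can be checked numerically: the supremum of $\Phi'$ for the logistic function is $\beta/4$ (attained at $x = 0$), so the contraction constant $\gamma$ drops below 1 exactly when $\beta < 4$. A quick sketch, assuming NumPy:

```python
import numpy as np

def logistic_slope(x, beta):
    """First derivative of the logistic function, equation 2.3."""
    t = np.exp(-beta * x)
    return beta * t / (1.0 + t) ** 2

x = np.linspace(-10.0, 10.0, 100001)          # dense grid containing x = 0
for beta in (1.0, 3.9, 4.1, 8.0):             # straddling the beta = 4 boundary
    gamma = logistic_slope(x, beta).max()     # numerically equals beta / 4
    print(beta, gamma, gamma < 1.0)           # contraction exactly when beta < 4
```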
The gradient-descent algorithm for single-node a priori adaptation a posteriori error networks, with the cost function in the form $E(k) = \frac{1}{2} e^2(k)$, is given by (Narendra & Parthasarathy, 1990, 1991):

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta(k) e(k) \Phi'\!\left( \mathbf{x}^T(k) \mathbf{w}(k) \right) \mathbf{x}(k)$$
$$\bar{e}(k) = d(k) - \Phi\!\left( \mathbf{x}^T(k) \mathbf{w}(k+1) \right). \qquad (3.1)$$

This case represents a generalization of finite impulse response (FIR) linear adaptive filters. Multiplying the first equation in equation 3.1 from the left by $\mathbf{x}^T(k)$ and applying the nonlinear activation function $\Phi$ to both sides, we obtain

$$\Phi\!\left( \mathbf{x}^T(k) \mathbf{w}(k+1) \right) = \Phi\!\left( \mathbf{x}^T(k) \mathbf{w}(k) + \eta(k) e(k) \Phi'\!\left( \mathbf{x}^T(k) \mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2 \right). \qquad (3.2)$$

Further analysis depends on the function $\Phi$, which can exhibit either contractive or expansive behavior.
3.1 Contractive Activation Function. If function $\Phi$ is a contraction, then

$$\Phi(a + b) \le \Phi(a) + \Phi(b). \qquad (3.3)$$

Theorem 1. The lower bound for the a posteriori error obtained by algorithm 3.1 with constraint 1.3 and a contractive nonlinear activation function $\Phi$ is

$$\bar{e}(k) > \left[ 1 - \eta(k) \Phi'\!\left( \mathbf{x}^T(k) \mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2 \right] e(k). \qquad (3.4)$$

Proof. With $a = \mathbf{x}^T(k)\mathbf{w}(k)$ and $b = \eta(k) e(k) \Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2$, applying inequality 3.3 to equation 3.2 and subtracting $d(k)$ from both sides of the resulting equation, due to the contractivity of $\Phi$ we obtain

$$\bar{e}(k) \ge \left[ 1 - \Phi\!\left( \eta(k) \Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2 \right) \right] e(k). \qquad (3.5)$$

For $\Phi$ a contraction, $|\Phi(\xi)| < |\xi|$, $\forall \xi \in \mathbb{R}$, and equation 3.5 finally becomes

$$\bar{e}(k) > \left[ 1 - \eta(k) \Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2 \right] e(k), \qquad (3.6)$$

which is the lower bound for the a posteriori error for a contractive nonlinear activation function.

Corollary 1. The range allowed for the learning rate $\eta(k)$ in an a priori adaptation a posteriori error algorithm, 3.1, with constraint 1.3, for the conditions given in theorem 1, is
$$0 < \eta(k) < \frac{1}{\Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2}. \qquad (3.7)$$
3.1.1 Some Simplifications. For $\Phi$ a contraction, $|\Phi'(\xi)| < 1$, $\forall \xi \in \mathbb{R}$. Therefore, even stricter conditions than those given in theorem 1 and corollary 1 are as follows.

Theorem 2. The lower bound for the a posteriori error obtained by algorithm 3.1 with constraint 1.3 and a contractive nonlinear activation function $\Phi$ is

$$\bar{e}(k) > \left[ 1 - \eta(k) \|\mathbf{x}(k)\|_2^2 \right] e(k). \qquad (3.8)$$

Corollary 2. The range allowed for the learning rate $\eta(k)$, for the conditions given in theorem 2, is

$$0 < \eta(k) < \frac{1}{\|\mathbf{x}(k)\|_2^2}. \qquad (3.9)$$
If, for convenience, the input data are normalized to within a unit norm ($\|\mathbf{x}\|_2^2 < 1$), condition 3.9 simplifies further to $0 < \eta(k) < 1$.

3.2 Expansive Activation Function. If function $\Phi$ is an expansion, then $\Phi(a + b) \ge \Phi(a) + \Phi(b)$, $|\Phi(\xi)| > |\xi|$, $\forall \xi \in \mathbb{R}$, and $|\Phi'(\xi)| \ge 1$. In this case, equation 3.2 becomes

$$\bar{e}(k) \le \left[ 1 - \Phi\!\left( \eta(k) \Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2 \right) \right] e(k). \qquad (3.10)$$

For $\Phi$ an expansion, $|\Phi(\xi)| > |\xi|$, and equation 3.10 finally becomes

$$\bar{e}(k) < \left[ 1 - \eta(k) \Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \|\mathbf{x}(k)\|_2^2 \right] e(k), \qquad (3.11)$$

which holds only for negative $\eta$, which is not feasible.

4 The Case of a Recurrent Neural Filter
In this case, the gradient-descent updating equation for the recurrent neuron can be symbolically expressed as (Haykin, 1994)

$$\frac{\partial y(k)}{\partial \mathbf{w}(k)} = \mathbf{\Pi}(k+1) = \Phi'\!\left( \mathbf{x}^T(k)\mathbf{w}(k) \right) \left[ \mathbf{x}(k) + \mathbf{w}(k)\mathbf{\Pi}(k) \right], \qquad (4.1)$$

where vector $\mathbf{\Pi}$ denotes the set of corresponding gradients of the output neuron, and vector $\mathbf{x}(k)$ encompasses both the external and feedback inputs to the recurrent neuron. This case is a generalization of linear adaptive infinite impulse response (IIR) filters. The correction to the weight vector at time instant $k$ becomes

$$\Delta \mathbf{w}(k) = \eta(k) e(k) \mathbf{\Pi}(k). \qquad (4.2)$$

The real-time recurrent learning (RTRL) (Williams & Zipser, 1989) based learning algorithm for single-node a priori adaptation a posteriori error networks is now given by

$$\mathbf{w}(k+1) = \mathbf{w}(k) + \eta(k) e(k) \mathbf{\Pi}(k)$$
$$\bar{e}(k) = d(k) - \Phi\!\left( \mathbf{x}^T(k)\mathbf{w}(k+1) \right). \qquad (4.3)$$
In the spirit of algorithm 1.4, following the same principle as for feedforward networks, we obtain the lower error bound for the a priori adaptation a posteriori error algorithm in single-node recurrent neural networks acting as nonlinear adaptive filters.
Theorem 3. The lower bound for the a posteriori error obtained by algorithm 4.3 with constraint 1.3 and a contractive nonlinear activation function $\Phi$ is

$$\bar{e}(k) > \left[ 1 - \eta(k)\, \mathbf{x}^T(k) \mathbf{\Pi}(k) \right] e(k), \qquad (4.4)$$

whereas the range allowed for the learning rate $\eta(k)$ is given in the following corollary.

Corollary 3. The range allowed for the learning rate $\eta(k)$ in an a priori adaptation a posteriori error algorithm, 4.3, with constraint 1.3 and the conditions given in theorem 3, is

$$0 < \eta(k) < \frac{1}{\mathbf{x}^T(k) \mathbf{\Pi}(k)}. \qquad (4.5)$$
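A minimal sketch of a recurrent perceptron trained with the gradient recursion 4.1 and update 4.3 on an assumed one-step-ahead sinusoidal prediction task (not from this note). The learning rate is kept at half the corollary 3 limit, and steps where $\mathbf{x}^T(k)\mathbf{\Pi}(k) \le 0$ (so that the range 4.5 is empty) are skipped:

```python
import numpy as np

beta = 2.0                         # contractive logistic slope (beta < 4)
rng = np.random.default_rng(3)

def phi(v):
    return 1.0 / (1.0 + np.exp(-beta * v))

def dphi(v):
    t = np.exp(-beta * v)
    return beta * t / (1.0 + t) ** 2

m = 3                              # delayed external inputs; slot 0 carries feedback
w = rng.normal(scale=0.1, size=m + 1)
P = np.zeros(m + 1)                # gradient vector Pi(k) from recursion (4.1)
y_prev = 0.0
s = np.sin(0.3 * np.arange(401))   # assumed signal to be predicted one step ahead

for k in range(m, 400):
    x = np.concatenate(([y_prev], s[k - m + 1:k + 1][::-1]))  # feedback + taps
    v = x @ w
    P = dphi(v) * (x + w[0] * P)   # RTRL gradient recursion (4.1)
    e = s[k + 1] - phi(v)          # a priori prediction error
    g = x @ P
    if g > 0:                      # eta(k) = 0.5 / (x' Pi), inside range (4.5)
        w = w + (0.5 / g) * e * P  # update (4.3)
    e_post = s[k + 1] - phi(x @ w) # a posteriori error on the same input
    assert abs(e_post) <= abs(e) + 1e-12
    y_prev = phi(v)                # network output fed back at the next step
print("a posteriori error never exceeded the a priori error")
```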
4.1 The Case of a General Recurrent Neural Network. For recurrent neural networks of the Williams-Zipser type (Williams & Zipser, 1989), with $N$ neurons and one output neuron, the weight matrix update for an RTRL training algorithm can be expressed as

$$\Delta \mathbf{W}(k) = \eta(k) e(k) \frac{\partial y_1(k)}{\partial \mathbf{W}(k)} = \eta(k) e(k) \mathbf{\Pi}_1(k), \qquad (4.6)$$

where $\mathbf{W}(k)$ represents the weight matrix and $\mathbf{\Pi}_1(k) = \frac{\partial y_1(k)}{\partial \mathbf{W}(k)}$ is the matrix of gradients at the output neuron, with entries $\pi^1_{n,l}(k) = \frac{\partial y_1(k)}{\partial w_{n,l}}$, where index $n$ runs along the $N$ neurons in the network and index $l$ runs along the inputs to the network. This equation is similar to equation 4.2, with the only difference being that weight matrix $\mathbf{W}$ replaces weight vector $\mathbf{w}$ and gradient matrix $\mathbf{\Pi} = [\mathbf{\Pi}_1, \ldots, \mathbf{\Pi}_N]$ replaces the gradient vector. Notice that in order to update matrix $\mathbf{\Pi}_1$, a modified version of equation 4.1 has to update the gradient matrices $\mathbf{\Pi}_i$, $i = 1, \ldots, N$. More details about this procedure can be found in Williams and Zipser (1989) and Haykin (1994). Undertaking the analysis in the same manner as for a recurrent perceptron, we obtain the following conditions imposed on the learning rate $\eta$ and the a posteriori error $\bar{e}$ for an a priori learning a posteriori error recurrent neural predictor of arbitrary size.

Corollary 4. The lower bound for the a posteriori error obtained by an a priori learning a posteriori error RTRL algorithm, 4.6, with constraint 1.3 and a contractive nonlinear activation function $\Phi$, is

$$\bar{e}(k) > \left[ 1 - \eta(k)\, \mathbf{x}^T(k) \mathbf{\Pi}_1(k) \right] e(k), \qquad (4.7)$$

whereas the range of allowable learning rates $\eta(k)$ is

$$0 < \eta(k) < \frac{1}{\mathbf{x}^T(k) \mathbf{\Pi}_1(k)}. \qquad (4.8)$$
4.2 The Case of a Linear Activation Function. In the case of a linear activation function, which is neither contractive nor expansive, the nonlinear networks with a single neuron, for both the feedforward and recurrent case, degenerate respectively into linear adaptive finite impulse response (FIR) and IIR filters. A comprehensive analysis of such cases is given in Treichler et al. (1987) and Ljung and Soderstrom (1983). 5 The Case of the Logistic Activation Function
We provide a simple example of our analysis for the case of a logistic nonlinear activation function of a neuron. In section 2, we showed that the condition for the logistic activation function to be a contraction is $\beta < 4$. As such a function is monotone and increasing, the bound on its first derivative, 2.3, is $\Phi'(\xi) \le \frac{\beta}{4}$, $\forall \xi \in \mathbb{R}$. That being the case, the conditions from theorem 1 and corollary 1 become, respectively,

$$\bar{e}(k) > \frac{1}{4} \left[ 4 - \eta(k)\, \beta\, \|\mathbf{x}(k)\|_2^2 \right] e(k) \qquad (5.1)$$

and

$$0 < \eta(k) < \frac{4}{\beta\, \|\mathbf{x}(k)\|_2^2}. \qquad (5.2)$$
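A one-step numerical check of bounds 5.1 and 5.2 on assumed toy values (the input, weights, and teaching signal below are chosen for illustration only, with a positive a priori error):

```python
import numpy as np

beta = 3.0                         # contractive logistic slope (beta < 4)

def phi(v):
    return 1.0 / (1.0 + np.exp(-beta * v))

def dphi(v):
    t = np.exp(-beta * v)
    return beta * t / (1.0 + t) ** 2

x = np.array([0.5, -0.3, 0.2])     # assumed input vector
w = np.array([0.1, 0.4, -0.2])     # assumed current weights
d = 0.9                            # teaching signal chosen so that e(k) > 0
eta = 0.8 * 4 / (beta * (x @ x))   # inside the range (5.2)

e = d - phi(x @ w)                 # a priori error
w_new = w + eta * e * dphi(x @ w) * x           # one step of update (3.1)
e_post = d - phi(x @ w_new)                     # a posteriori error
lower = 0.25 * (4 - eta * beta * (x @ x)) * e   # right-hand side of bound (5.1)
print(e_post > lower)              # True: the a posteriori error obeys the bound
```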
Based on theorem 3 and corollary 3, similar conditions can be derived for the recurrent case. These relationships considerably extend and shed additional light on the recently derived relations between the learning rate $\eta$ and the slope $\beta$ of the nonlinear activation function for a general a priori neural network (Thimm, Moerland, & Fiesler, 1996; Mandic & Chambers, 1999b). 6 Conclusions
We have provided relationships between the a priori and a posteriori prediction errors, the learning rate, and the slope of the nonlinear activation function of a nonlinear adaptive filter realized as a neural network. This leads to a lower bound on the a posteriori error in a priori learning a posteriori error neural predictors, whose a posteriori output error is uniformly smaller in magnitude than the a priori one. The lower bound is derived based on the learning rate $\eta$, the first derivative of a general nonlinear activation function of a neuron around the current point on the error performance surface, and the $L_2$ norm of the input vector. This has been achieved for learning algorithms based on gradient descent for both the feedforward and recurrent cases. In both cases, further conditions on the learning rate $\eta$ are imposed so that the approach is feasible. In that case, a general nonlinear activation function of a neuron has to exhibit contractive behavior. The relationship
between the learning rate $\eta$ and the slope $\beta$ is further evaluated for the example of the logistic activation function. References Douglas, S. C., & Rupp, M. (1997). A posteriori updates for adaptive filters. In Conference Record of the Thirty-First Asilomar Conference on Signals, Systems and Computers (Vol. 2, pp. 1641–1645). Gill, P. E., Murray, W., & Wright, M. H. (1981). Practical optimization. London: Academic Press. Haykin, S. (1994). Neural networks—A comprehensive foundation. Englewood Cliffs, NJ: Prentice Hall. Haykin, S. (1996). Adaptive filter theory (3rd ed.). Englewood Cliffs, NJ: Prentice Hall. Ljung, L., & Soderstrom, T. (1983). Theory and practice of recursive identification. Cambridge, MA: MIT Press. Luenberger, D. G. (1969). Optimization by vector space methods. New York: Wiley. Mandic, D. P., & Chambers, J. A. (1998). A posteriori real time recurrent learning schemes for a recurrent neural network based non-linear predictor. IEE Proceedings—Vision, Image and Signal Processing, 145(6), 365–370. Mandic, D. P., & Chambers, J. A. (1999a). Global asymptotic stability of nonlinear relaxation equations realised through a recurrent perceptron. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP-99) (Vol. 2, pp. 1037–1040). Mandic, D. P., & Chambers, J. A. (1999b). Relationship between the slope of the activation function and the learning rate for the RNN. Neural Computation, 11(5), 1069–1077. Narendra, K. S., & Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1), 4–27. Narendra, K. S., & Parthasarathy, K. (1991). Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Transactions on Neural Networks, 2(2), 252–262. Thimm, G., Moerland, P., & Fiesler, E. (1996). The interchangeability of learning rate and gain in backpropagation neural networks. Neural Computation, 8, 451–460. Treichler, J. 
R., Johnson, Jr., C. R., & Larimore, M. G. (1987). Theory and design of adaptive lters. New York: Wiley. Williams, R., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280. Zeidler, E. (1986). Nonlinear functional analysis and its applications (Vol. 1). Berlin: Springer-Verlag. Received October 9, 1998; accepted May 26, 1999.
NOTE
Communicated by Steven Nowlan
The VC Dimension for Mixtures of Binary Classifiers
Wenxin Jiang
Department of Statistics, Northwestern University, Evanston, IL 60208, U.S.A.

The mixtures-of-experts (ME) methodology provides a tool for classification in which experts given by logistic regression models or Bernoulli models are mixed according to a set of local weights. We show that the Vapnik-Chervonenkis dimension of the ME architecture is bounded below by the number of experts m and above by O(m^4 s^2), where s is the dimension of the input. For mixtures of Bernoulli experts with a scalar input, we show that the lower bound m is attained, in which case we obtain the exact result that the VC dimension is equal to the number of experts.

1 Introduction
The Vapnik-Chervonenkis (VC) dimension is a central concept in recent developments in computational learning theory (see Anthony & Biggs, 1992). The VC dimension is a combinatorial parameter defined on a set of binary functions, or a system of classifiers, that measures the expressive power of the system. In risk minimization (Vapnik, 1982, 1992, 1998), the VC dimension bounds the rate of uniform convergence of the empirical risk to the actual risk. In the formalism of PAC (probably approximately correct) learning (Valiant, 1984), the VC dimension is used for a number of purposes, including planning the size of the training sample and estimating the computational efficiency. (See Anthony & Biggs, 1992, for a review of the concept and its applications.) A special issue of Discrete Applied Mathematics (Volume 86, Number 1, 1998) is dedicated to the study of the VC dimension. As Shawe-Taylor (1998) pointed out in the preface of that special issue, "The VC dimension seems to have a knack of cropping up in very different contexts." Unfortunately, the VC dimension is rarely estimated exactly, except for very simple classification systems (see theorem 7.4.1 of Anthony & Biggs, 1992, for an exact result for linear threshold perceptrons). More often, effort goes into finding upper and lower bounds on the VC dimension. There has been considerable study of the VC dimension for various neural network systems—for example, Koiran and Sontag (1997; lower bounds, for networks with continuous activations), Baum and Haussler (1989; upper and lower bounds, for hard-threshold networks), and Karpinski and Macintyre (1997; upper bounds, for sigmoidal and Pfaffian networks).

Neural Computation 12, 1293–1301 (2000) © 2000 Massachusetts Institute of Technology
Mixtures-of-experts (ME) (Jacobs, Jordan, Nowlan, & Hinton, 1991) and hierarchical mixtures-of-experts (HME) (Jordan & Jacobs, 1994) are neural network systems that generalize the usual mixture models in the statistics literature (Titterington, Smith, & Makov, 1985) by mixing simple probability models with weights that depend on the inputs. ME and HME have had wide application in examining relationships among variables (Cacciatore & Nowlan, 1994; Meilă & Jordan, 1995; Ghahramani & Hinton, 1996; Tipping & Bishop, 1997; Jaakkola & Jordan, 1998). Depending on the experts employed, ME and HME can be used for either regression or classification. When the experts are probability models of binary outcomes, ME and HME yield tools for binary classification. In this context, it is important to study the VC dimension of the ME and HME systems. However, although there has been considerable study of the VC dimension for ordinary neural networks, the VC dimensions of the ME and HME systems are as yet unknown. This article focuses on bounds for the VC dimension of ME models. The extension to the HME situation is discussed in a remark following the main results. We consider the mixture of logistic regression experts, as well as its important special case: the mixture of Bernoulli experts. An upper bound is shown to be O(m^4 s^2), where m is the number of experts and s is the dimension of the input. A lower bound on the VC dimension is found to be m. Furthermore, this lower bound is achieved for mixtures of Bernoulli experts with a scalar input, in which case we obtain an exact result: the VC dimension is equal to the number of experts. To be self-contained, we review the VC dimension in the next section.

2 The Classifiers and the VC Dimension
Denote the input as x, an element of an input space X. We take X to be the s-dimensional Euclidean space ℝ^s. Denote the response as y, an element of the output space Y = {0, 1}. A classifier C_θ on X is a mapping from X to Y, often labeled by a parameter θ lying in a parameter space Θ. The space of classifiers on X is called a hypothesis space, which is simply H = {C_θ : θ ∈ Θ}. A sample S of inputs of length n is a set S = {x_1, x_2, ..., x_n} in X^n, where each input x_j is from X. A sample S is said to be shattered by a hypothesis space H if for each of the 2^n possible sets of responses of length n (of the form O = {y_1, y_2, ..., y_n} ∈ Y^n), there is a classifier C_θ in H such that C_θ(x_j) = y_j for all j = 1, ..., n (in short, C_θ(S) = O, where C_θ(S) ≡ {C_θ(x_1), C_θ(x_2), ..., C_θ(x_n)}). The VC dimension VCdim(H) of the hypothesis space H is the maximum length of any sample S that can be shattered by H; VCdim(H) = ∞ if samples of any length may be shattered by H. The following are some obvious properties related to the VC dimension:

1. If one can find a sample {x_1, x_2, ..., x_n} of length n that can be shattered by H, then VCdim(H) ≥ n.
2. If no sample of length r can be shattered by H (i.e., if for each S = {x_1, x_2, ..., x_r} of length r, we can find a response sequence O = {y_1, y_2, ..., y_r} such that C_θ(S) ≠ O for every classifier C_θ in H), then no sample of length r' > r can be shattered by H either.

3. If no sample of length r can be shattered by H, then VCdim(H) < r.

4. If one can find a sample {x_1, x_2, ..., x_n} of length n that can be shattered by H, but no sample of length n + 1 can be shattered by H, then VCdim(H) = n.

5. For any two hypothesis spaces H and H', if H ⊂ H' and H shatters a sample S, then H' also shatters S.

6. For any two hypothesis spaces H and H', if H ⊂ H', then VCdim(H) ≤ VCdim(H').

In the following, we first describe the "experts" to be mixed, then the "gating functions" (the mixing weights) and the ME models.

3 The Binary Experts and Mixtures of Binary Experts
In statistical pattern recognition and in supervised learning (Vapnik, 1992, 1998), one considers a random input-response pair (x, y) ∈ X × Y, where y takes values in Y = {0, 1} and is random (noisy) conditional on x. Learning is based on a data set of pairs (x, y) and looks for a classifier C_θ from a hypothesis space such that C_θ(x) and y are closest in the sense of risk minimization. The ME is constructed in this context. (This differs from probably approximately correct [PAC] learning, where y conditional on x is noiseless. However, we note that PAC learning can still be performed based on the ME, since, as we will see, the ME generates a hypothesis space, based on which one can certainly consider the learning of a noiseless response (target function) y(x).)

A binary expert proposes a probability model of the response conditional on the input: p(y = 1 | x) = p(x), p(y = 0 | x) = 1 − p(x). One example is the logistic regression expert, where p(x) = {1 + exp(−α − β^T x)}^{−1}, parameterized by an (s + 1)-vector (α, β^T). Another example is the Bernoulli expert, where p(x) = p, parameterized by a scalar p in (0, 1). A mixture of m ≥ 2 experts uses a local mixture of the simple experts and proposes

p(x) = Σ_{k=1}^m g_k(x) p_k(x),    (3.1)

where the g_k(x) are local weights, called the gating functions, and the p_k(x) are the probability functions of the simple experts. For example, when Bernoulli experts are mixed, p_k(x) = p_k, p_k ∈ (0, 1), for each k; when logistic regression experts are mixed, p_k(x) = {1 + exp(−α_k − β_k^T x)}^{−1}, (α_k, β_k^T) ∈ ℝ^{s+1}, for each k. The gating functions take a multinomial logit form,

g_k(x) = e^{v_k + u_k^T x} / Σ_{j=1}^m e^{v_j + u_j^T x},  k = 1, ..., m,    (3.2)

where (v_k, u_k^T) ∈ ℝ^{s+1} for each k, and v_m = u_m = 0 is often imposed for the sake of identifiability.

A probability model such as equation 3.1 generates a classifier C_θ by C_θ(x) = I[t(x; θ) ≥ 0], where the discriminant function t(x; θ) can be taken as p(x) − c, or as p(x) − c multiplied by a positive function (such as Σ_{j=1}^m e^{v_j + u_j^T x}), for some c in (0, 1) such as 0.5. Here I[·] is the indicator function defined by I[true] = 1 and I[false] = 0. When m logistic regression experts are mixed, the grand parameter θ of the classifier includes all components of (v_k, u_k^T), k = 1, ..., m − 1, and (α_j, β_j^T), j = 1, ..., m; θ takes values in the parameter space Θ = ℝ^{(2m−1)(s+1)}. The hypothesis space {C_θ : θ ∈ Θ} is denoted H_L^m(X), where X denotes the space of inputs x and L stands for "logistic." When Bernoulli experts are used, the parameter θ of the classifier includes all components of (v_k, u_k^T), k = 1, ..., m − 1, and p_j, j = 1, ..., m; θ takes values in the parameter space Θ = ℝ^{(m−1)(s+1)} × (0, 1)^m. The hypothesis space {C_θ : θ ∈ Θ} is denoted H_B^m(X), where B stands for "Bernoulli." Note that by taking the β_k to be zero for C_θ in H_L^m(X), and by reparameterizing {1 + exp(−α_k)}^{−1} as p_k, k = 1, ..., m, we can produce any classifier in H_B^m(X). Therefore,

H_B^m(X) ⊂ H_L^m(X).    (3.3)
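Equations 3.1 through 3.3 translate directly into code. Below is a minimal sketch of ours (not from the article) for a mixture of Bernoulli experts with a scalar input; the particular parameter values are arbitrary assumptions chosen only to make the gate switch between experts.

```python
import math

def gating(x, v, u):
    """Multinomial-logit gating weights g_k(x) of equation 3.2 (scalar input)."""
    scores = [vk + uk * x for vk, uk in zip(v, u)]
    zmax = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - zmax) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_prob(x, v, u, p):
    """p(x) = sum_k g_k(x) p_k for a mixture of Bernoulli experts (equation 3.1)."""
    return sum(g * pk for g, pk in zip(gating(x, v, u), p))

def classify(x, v, u, p, c=0.5):
    """C_theta(x) = I[p(x) >= c]."""
    return int(mixture_prob(x, v, u, p) >= c)

# Two Bernoulli experts on a scalar input; v_m = u_m = 0 for identifiability.
v, u, p = [0.0, 0.0], [8.0, 0.0], [0.9, 0.1]
print(classify(2.0, v, u, p))   # gate favors expert 1 (p = 0.9) -> 1
print(classify(-2.0, v, u, p))  # gate favors expert 2 (p = 0.1) -> 0
```

Even with constant experts, the input-dependent gate makes the classifier's output vary with x, which is the source of the expressive power analyzed next.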
4 Main Results
Lemma 1. The hypothesis space H_B^m(ℝ) can shatter a sample of one-dimensional inputs of length m.
Lemma 2. The hypothesis space H_B^m(ℝ) cannot shatter any sample of one-dimensional inputs of length m + 1 or more.
Theorem 1. VCdim(H_B^m(ℝ)) = m. (This is an exact result.)

Theorem 2. VCdim(H_L^m(ℝ^s)) ≥ VCdim(H_B^m(ℝ^s)) ≥ m.

Theorem 3. VCdim(H_B^m(ℝ^s)) ≤ VCdim(H_L^m(ℝ^s)) = O(m^4 s^2).
Remark 1. The methods of proof can also be applied to study HMEs of binary classifiers. For example, for HMEs having a tree structure with a fixed number of splits at each level, a lower bound m and an upper bound O(m^4 s^2) can be similarly obtained for the VC dimension.
Remark 2. We suspect that the upper bound in theorem 3 can be lowered. To indicate how the VC dimension may change with the dimension of the input, we present three more exact results in the special cases where the number of experts is one or two:

1. VCdim(H_B^1(ℝ^s)) = 1.

2. VCdim(H_B^2(ℝ^s)) = s + 1.

3. VCdim(H_L^1(ℝ^s)) = s + 1.

Result 1 is obvious, since the classifiers in H_B^1(ℝ^s) map the entire input space to either 0 or 1, and H_B^1(ℝ^s) can therefore shatter only a sample of length 1. Results 2 and 3 are obtained by recognizing that the hypothesis spaces are equivalent to that of the linear threshold perceptrons with s real-valued inputs and then quoting theorem 7.4.1 of Anthony and Biggs (1992, p. 78). In all three cases, the VC dimension increases with s at most linearly. Together with the exact result in theorem 1, they naturally lead to a conjecture that an upper bound could be of the form O(ms) = O(dim(θ)), for VCdim(H_L^m(ℝ^s)) and VCdim(H_B^m(ℝ^s)) for general m and s.

A referee commented that some of our results can be obtained from the proof techniques used in Dudley (1978), Haussler (1992), and Vapnik (1998). We note that some techniques in Dudley (1978, theorem 7.2) and Haussler (1992, theorem 4) can be extended to the hypothesis space in this article with discriminant functions {t(·; θ): θ ∈ Θ} not forming a vector space. A lower bound n on the VC dimension is established if we can find examples {x_1, . . . , x_n} and a zero point θ_0 of the mapping τ̃ : Θ →
5 Proofs

Proof of Lemma 1. Consider a sample S = {ξ_1, ξ_2, ..., ξ_m} of length m, where ξ_k = (k − 1/2)/m, k = 1, 2, ..., m. Consider any potential response O = {y_1, y_2, ..., y_m}, y_j ∈ {0, 1}, j = 1, ..., m. Construct a classifier C_θ in H_B^m(ℝ) by choosing the parameter as follows. Let θ have "expert" components p_j = y_j + (−1)^{y_j} ε ∈ (0, 1), where ε is some small positive number, for j = 1, ..., m. Let the "gating" components of θ be v_k = {m(m − 1) − k(k − 1)}(b/2m) and u_k = (k − m)b, for k = 1, ..., m, where b is some large positive number. The corresponding gating functions g_k(x) ≡ f_k(x; b) (constructed from equation 3.2) are identified as the functions f_k(x; b) used in the proposition in Jiang and Tanner (1999), after canceling a common factor exp{m(m − 1)(b/2m) − mbx} appearing in both the denominator and the numerator of g_k(x). By the proof of the proposition in Jiang and Tanner (1999), we know that at x = ξ_k, k = 1, ..., m, f_j(x; b) converges to the characteristic function of the set B_j = [(j − 1)/m, j/m), for each j = 1, ..., m, as b → ∞. (We denote the characteristic function as χ_{B_j}(x), which is equal to I[x ∈ B_j] for each j.) Then the probability function p(x) = Σ_{j=1}^m f_j(x; b) p_j (from equation 3.1) converges to p_∞(x) = Σ_{j=1}^m χ_{B_j}(x) p_j at each x = ξ_k, k = 1, ..., m. Note that ξ_k is in B_j only when j = k, for each j, k ∈ {1, ..., m}. Therefore p(ξ_k) converges to p_∞(ξ_k) = p_k = y_k + (−1)^{y_k} ε as b → ∞. This implies that for sufficiently large b and sufficiently small ε, p(ξ_k) can be made arbitrarily close to the {0, 1}-valued y_k, so that for any preselected cutoff point c in (0, 1), C_θ(ξ_k) = I[p(ξ_k) ≥ c] = I[y_k ≥ c] = y_k, which can be established for all k. This implies that S can be shattered by H_B^m(ℝ).

Proof of Lemma 2. Consider any sample of length r: S = {x_1, ..., x_r}, where r > m. If any two of the x_j's are the same, then obviously S cannot be shattered by H_B^m(ℝ). Therefore, we can, without loss of generality, assume that x_1 < x_2 < ... < x_r. Consider a particular sequence of potential outcomes O = {η_1, ..., η_r}, where the η_i's alternately take the values 0 and 1. We are going to show that no C_θ in H_B^m(ℝ) can map S to O.

Suppose the contrary: that there is a classifier C_θ in H_B^m(ℝ) such that C_θ(x_i) equals η_i for all i = 1, ..., r. Then the function p(x_i) − c alternately takes negative and nonnegative values r times as x_i increases from x_1 to x_r. Define a continuously differentiable function

f(x) ≡ (Σ_{j=1}^m e^{v_j + u_j x}) {p(x) − c} = Σ_{j=1}^m (p_j − c) e^{v_j + u_j x}.

Following p(x_i) − c, f(x_i) also alternately takes negative and nonnegative values r times as x_i increases from x_1 to x_r. This implies that f(x) changes sign at least (r − 1) times as x increases, and consequently the derivative f′(x) has to have at least (r − 2) zeros. Denote by ρ the number of roots of f′(x). Then we have ρ ≥ r − 2 > m − 2.

However, f′(x) can be written as f′(x) = Σ_{j=1}^{m−1} A_j e^{u_j x}, where A_j = (p_j − c) u_j e^{v_j}, by noting that u_m = 0. Therefore, f′(x) is a sum of (m − 1) exponential functions, which can have at most (m − 2) roots (see the proof of lemma 2 in Hole, 1996, which quotes Braess, 1986, chapter 4). We then have ρ ≤ m − 2, contradicting ρ ≥ r − 2 > m − 2. We have therefore proved that O ≠ C_θ(S) for any C_θ in H_B^m(ℝ). In other words, H_B^m(ℝ) cannot shatter S, proving the lemma.

Proof of Theorem 1.
Lemma 1 shows that VCdim(H_B^m(ℝ)) ≥ m, while lemma 2 shows that VCdim(H_B^m(ℝ)) < m + 1. Combining these results proves the theorem.

Proof of Theorem 2. Note that a classifier C_θ in H_B^m(ℝ) can also be regarded as a classifier on ℝ^s, if C_θ, now as a classifier on ℝ^s, acts on the first component of an input x in ℝ^s. In this sense, the hypothesis space H_B^m(ℝ) is also a space of classifiers on ℝ^s. Then we obviously have H_B^m(ℝ) ⊂ H_B^m(ℝ^s), since a classifier in H_B^m(ℝ) is also a classifier in H_B^m(ℝ^s), with a parameter θ in which the second through sth components of each u_k are all zero. Combined with equation 3.3, this gives H_B^m(ℝ) ⊂ H_B^m(ℝ^s) ⊂ H_L^m(ℝ^s). Then, by property 6 in section 2 and theorem 1, we have

m = VCdim(H_B^m(ℝ)) ≤ VCdim(H_B^m(ℝ^s)) ≤ VCdim(H_L^m(ℝ^s)),

proving the theorem.

Proof of Theorem 3. The inequality was proved in the previous theorem. To prove the upper bound for VCdim(H_L^m(ℝ^s)), we apply the results of Karpinski and Macintyre (1997). In our situation, any classifier in H_L^m(ℝ^s) has a decision boundary p(x) − c = 0 or, equivalently,
t = (Σ_{j=1}^m e^{v_j + u_j^T x}) {Π_{k=1}^m (1 + e^{−α_k − β_k^T x})} {p(x) − c} = 0.
Note that t is a polynomial in the q = 2m − 1 quantities e^{L_j}, j = 1, ..., 2m − 1, where each L_j is linear in the parameter θ. The L_j include −α_1 − β_1^T x, ..., −α_m − β_m^T x, v_1 + u_1^T x, ..., v_{m−1} + u_{m−1}^T x. (Note that v_m = u_m = 0.) The degree d of the polynomial t is d = m + 1. The total number of parameters is l = dim(θ) = (2m − 1)(s + 1). For hypothesis spaces with this type of classifier, Karpinski and Macintyre (1997), based on differential topology arguments, obtained an upper bound O((ql)^2) for the VC dimension (see their application 3.2, with one "atomic formula" t). In our situation, O((ql)^2) is the same as O(m^4 s^2), which proves the theorem.
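As a sanity check on the bookkeeping in this proof, the quantities q, d, and l can be computed directly for given m and s. This is a small illustration of ours, not from the article; the helper name is hypothetical.

```python
def vc_upper_bound_terms(m, s):
    """Quantities entering the Karpinski and Macintyre bound for H_L^m(R^s):
    q exponential atoms e^{L_j}, polynomial degree d, parameter count l,
    and the resulting (ql)^2 = O(m^4 s^2) bound (our sketch)."""
    q = 2 * m - 1                # number of quantities e^{L_j}
    d = m + 1                    # degree of the polynomial t
    l = (2 * m - 1) * (s + 1)    # l = dim(theta)
    return q, d, l, (q * l) ** 2

print(vc_upper_bound_terms(2, 1))   # (3, 3, 6, 324)
```

For m = 2 experts and scalar input, (ql)^2 = 324, far above the exact value s + 1 = 2 from remark 2, which is why the authors suspect the bound can be lowered.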
6 Conclusion
This article investigates the VC dimension of mixtures of experts for binary classifiers and provides some bounds and exact results that can be useful for the study of risk minimization and PAC learning. An alternative approach to VC theory uses statistical mechanics to analyze the generalization performance of mixtures-of-experts (see Kang & Oh, 1997), who show that the generalization error undergoes a second-order phase transition related to symmetry breaking and asymptotically follows an inverse power law as the sample size increases. They, however, consider hard-boundary gating functions and only a small number of experts (e.g., m = 2). As Seung, Sompolinsky, and Tishby (1992) pointed out, the two different approaches—VC theory, which employs inequalities and bounds, versus statistical physics, which uses approximations—provide complementary perspectives on the study of generalization.

Acknowledgments
I am grateful to Martin A. Tanner for suggesting this research problem and to the referees for providing interesting references and insightful comments.

References

Anthony, M., & Biggs, N. (1992). Computational learning theory: An introduction. Cambridge: Cambridge University Press.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1, 151–160.
Braess, D. (1986). Nonlinear approximation theory. Berlin: Springer-Verlag.
Cacciatore, T. W., & Nowlan, S. J. (1994). Mixtures of controllers for jump linear and non-linear plants. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 6 (pp. 719–726). San Mateo, CA: Morgan Kaufmann.
Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability, 6, 899–929.
Ghahramani, Z., & Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers (Tech. Rep. No. CRG-TR-96-1). Toronto: Department of Computer Science, University of Toronto.
Haussler, D. (1992). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100, 78–150.
Hole, A. (1996). Vapnik-Chervonenkis generalization bounds for real valued neural networks. Neural Computation, 8, 1277–1299.
Jaakkola, T. S., & Jordan, M. I. (1998). Improving the mean field approximation via the use of mixture distributions. In M. I. Jordan (Ed.), Learning in graphical models. Norwell, MA: Kluwer.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jiang, W., & Tanner, M. A. (1999). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Computation, 11, 1183–1198.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Kang, K., & Oh, J. (1997). Statistical mechanics of the mixture of experts. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 183–189). Cambridge, MA: MIT Press.
Karpinski, M., & Macintyre, A. (1997). Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Science, 54, 169–176.
Koiran, P., & Sontag, E. D. (1997). Neural networks with quadratic VC dimension. Journal of Computer and System Science, 54, 190–198.
Meilă, M., & Jordan, M. I. (1995). Learning fine motion by Markov mixtures of experts (A.I. Memo No. 1567). Cambridge, MA: Artificial Intelligence Laboratory, MIT.
Seung, H. S., Sompolinsky, H., & Tishby, N. (1992). Statistical mechanics of learning from examples. Physical Review A, 45, 6056–6091.
Shawe-Taylor, J. (1998). Preface: Special issue of DAM on the Vapnik-Chervonenkis dimension. Discrete Applied Mathematics, 86, 1–2.
Tipping, M. E., & Bishop, C. M. (1997). Mixtures of probabilistic principal component analysers (Tech. Rep. No. NCRG-97-003). Birmingham, UK: Department of Computer Science and Applied Mathematics, Aston University.
Titterington, D. M., Smith, A. F. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27, 1134–1142.
Vapnik, V. N. (1982). Estimation of dependences based on empirical data. New York: Springer-Verlag.
Vapnik, V. N. (1992). Principles of risk minimization for learning theory. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 831–838). San Mateo, CA: Morgan Kaufmann.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.

Received November 24, 1998; accepted April 21, 1999.
NOTE
Communicated by Todd Leen
The Early Restart Algorithm
Malik Magdon-Ismail
Amir F. Atiya
Learning Systems Group, Electrical Engineering Department, California Institute of Technology, Pasadena, CA 91125, U.S.A.

Consider an algorithm whose time to convergence is unknown (because of some random element in the algorithm, such as a random initial weight choice for neural network training). Consider the following strategy. Run the algorithm for a specific time T. If it has not converged by time T, cut the run short and rerun it from the start (repeating the same strategy for every run). This so-called restart mechanism was proposed by Fahlman (1988) in the context of backpropagation training. It is advantageous in problems that are prone to local minima or when there is large variability in convergence time from run to run, and it may lead to a speedup in such cases. In this article, we analyze the restart mechanism theoretically and obtain conditions on the probability density of the convergence time under which restart will improve the expected convergence time. We also derive the optimal restart time. We apply the derived formulas to several cases, including steepest-descent algorithms.

1 Introduction
One of the difficulties encountered when training multilayer networks using the backpropagation method, or any other training method, is slow convergence and uncertainty in the length of the training time. There is large variability in the convergence time from run to run, since this time depends strongly on the initial conditions, which are generated randomly. To avoid such uncertainty, Fahlman (1988) proposed the restart mechanism. In this mechanism, a training algorithm is cut short if it has not converged after a certain number of iterations and is restarted with new random weights; the hope is that this "fresh start" will find its way faster to the solution. The restart mechanism has also been applied in a number of other optimization applications (e.g., Ghannadian & Alford, 1996; Hu, Shonkwiler, & Spruill, 1997). The restart algorithm can reduce the overall average and standard deviation of the training time, since it avoids waiting for excessively long runs. In addition, the restart algorithm becomes even more indispensable if there are local minima. If in such a situation we do not restart, the training algorithm could spend forever approaching an unacceptable local minimum

Neural Computation 12, 1303–1312 (2000) © 2000 Massachusetts Institute of Technology
and never reach the global minimum. In this article we present a theoretical analysis of the restart algorithm and derive conditions under which restart becomes advantageous. The results we develop indicate that whenever an algorithm displays fat-tailed behavior, multimodality, or potential lack of convergence, restart can be useful and, in fact, absolutely necessary when there is a nonzero probability of not converging. Our goal is to gain some insight into the value of restart by providing such conditions and a general framework for understanding when restart can be helpful. The analysis applies to any type of algorithm; it is not restricted to training algorithms in the neural network field.

2 Analysis of the Restart Algorithm
Because the weights are generated randomly, the training time to convergence, t, will obey a certain probability density, p(t). In our case, converging to a local minimum is not considered convergence; thus, the density p(t) might have a delta function at ∞, which represents the probability that the algorithm does not converge at all. Let T̄ = E[t] be the expected convergence time for the algorithm. Let T be the restart time, and let T_RES(T) = E_RES[t] be the expected convergence time for the algorithm when applying restart. We will denote the probability that we restart by a_T:

a_T = probability of restart = 1 − ∫_0^T p(t) dt.    (2.1)
The optimal restart time, T*, can be obtained by finding the value of T that minimizes T_RES(T). The following theorem gives T_RES(T) in terms of p(t) and T.

Theorem 1. The optimal time to restart is the time T* that minimizes the expression

T_RES(T) = (a_T / (1 − a_T)) T + ∫_0^T t p(t)/(1 − a_T) dt    (2.2)
with respect to T.

Proof. The expectation of the convergence time can be derived as follows (let T be the restart time):

T_RES(T) = ∫_0^T t p(t) dt + a_T (T_RES(T) + T).    (2.3)
The first term represents the contribution due to the case that the algorithm converged before the restart time. The second term represents the case that
restart occurred. Note that the term in parentheses reflects the fact that if restart occurs, the expected convergence time equals the time already spent until restart (= T), plus the expectation of another application of the restart algorithm (= T_RES(T)). Of course, the term in parentheses is weighted by the probability that restart occurs (a_T). Solving for T_RES(T) in equation 2.3, we obtain

T_RES(T) = (a_T / (1 − a_T)) T + ∫_0^T t p(t)/(1 − a_T) dt.    (2.4)

To obtain the optimal restart time, we need to find the minimum of equation 2.4 with respect to T. This completes the proof.

The optimal restart time can be obtained by taking the derivative of equation 2.2 and equating it to zero. We get

(1 − a_{T*}) / p(T*) = T* + (∫_0^{T*} t p(t) dt) / a_{T*}    (2.5)
as the condition that has to be satisfied by the optimal restart time.

Note. Equation 2.2 for T_RES(T) can also be viewed in a different light, which gives additional insight. The term a_T/(1 − a_T) equals the infinite series a_T + a_T^2 + ··· (a_T^i represents restarting i times). This series equals the expected number of restarts (which is evaluated as Σ_{i=1}^∞ i (a_T^i − a_T^{i+1}) and hence equals a_T + a_T^2 + ···). This term is multiplied by the length of the restart time to obtain the mean amount of time spent restarting. The second term represents the mean time until convergence given that no restart occurs. The division of p(t) by 1 − a_T is a normalization factor, so that the probability density in the expectation (in the second term) represents the density of the training time given that no restart occurs.
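A discrete-time version of equation 2.2 is easy to evaluate numerically. The sketch below is our illustration (the toy stopping-time density is an assumption, not from the article): it computes T_RES(T) for a fat-tailed distribution and shows that restarting early beats never restarting.

```python
def t_res(T, pmf):
    """Expected convergence time under restart at time T: a discrete version of
    equation 2.2. pmf[t] is the probability of converging at exactly t steps."""
    converged = sum(p for t, p in pmf.items() if t <= T)   # 1 - a_T
    if converged == 0.0:
        return float("inf")                                # always restarting: no progress
    a_T = 1.0 - converged
    mean_given_no_restart = sum(t * p for t, p in pmf.items() if t <= T) / converged
    return (a_T / converged) * T + mean_given_no_restart

# A fat-tailed toy density: converge fast with probability 0.6, very slowly otherwise.
pmf = {5: 0.6, 1000: 0.4}
no_restart = sum(t * p for t, p in pmf.items())            # plain expected time, 403.0
best_T = min(pmf, key=lambda T: t_res(T, pmf))             # grid search over candidate times
print(no_restart, t_res(best_T, pmf))                      # restarting at T = 5 gives about 8.33
```

The first term of the return value is (expected number of restarts) × T, and the second is the mean time of the final, successful run, matching the interpretation in the note above.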
Note. Several studies have proposed methods for estimating p(t) for backpropagation networks (e.g., Leen & Moody, 1993; Orr & Leen, 1993; Heskes, Slijpen, & Kappen, 1992). These estimates of p(t) can be used to estimate T_RES according to equation 2.2.
The real value of restart arises when the algorithm in question is prone to local minima. Then restarting can be interpreted as repeated attempts to reach the global minimum. For such a case, p(t) will have a delta function at ∞. Let

p(t) = p_0(t) + P_∞ δ(t − ∞).    (2.6)

We obtain the following theorem.
Theorem 2. For the problem considered above with 0 < P_∞ < 1, it is always advantageous to restart; that is, there exists a T such that T_RES(T) < T̄.

Proof. It is easy to see that T̄ = ∞, because of the delta function at ∞. Now consider T_RES(T). It is always possible to choose T such that T is finite and ∫_0^T p(t) dt ≠ 0, because of the condition P_∞ ≠ 1. From equation 2.2, since the numerator is finite and the denominator is nonzero, T_RES(T) is finite and hence strictly less than T̄.
It is interesting to note that P_∞ does not appear explicitly in the expression for the optimal restart time. However, it affects the restart indirectly, through the integrals of p(t) and t p(t), because the density integrates to 1.

3 Application of the Theory
To illustrate the analysis, we have applied the theory to a number of examples.

3.1 Steepest-Descent Algorithms. Here, we consider steepest- and gradient-descent algorithms and investigate the effect of restart on their expected convergence time. We first derive an expression for the density of the convergence time p(t) for general steepest-descent algorithms. Consider for simplicity the error function associated with a linear hypothesis function,

E = Σ_{i=1}^N (w^T x_i − y_i)²,    (3.1)
where the x_i's are the input examples, the y_i's are the target outputs, and w is the weight vector. For nonlinear hypothesis functions, the error surface is approximately quadratic close to the minimum; thus, the analysis will be approximately valid in the nonlinear case if we assume we start relatively close to the minimum. For simplicity, consider the case of a three-dimensional weight vector. The more general N-dimensional case, though algebraically more cumbersome, can be treated using identical techniques. Suitably normalizing the x_i's, the probability P[k] of stopping at iteration k can be derived (see the appendix). For the case where the weight initialization is around the true weight vector, P[k] is given by equation A.9. Figure 1a shows the probability distribution P[k] as given in equation A.9, together with the probability distribution as estimated from a numerical simulation of the steepest-descent algorithm. The figure shows the accurate agreement between theory and simulation (in this linear case). Figure 1b shows T_RES(k) as a function of k, calculated from equation 2.2. For this case, we can see that the optimal restart time is T = 2.
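A toy version of this kind of experiment is easy to reproduce. The sketch below is our construction, not the paper's setup: the one-dimensional quadratic error, learning rate, and tolerance are all assumptions, chosen only to show how an empirical stopping-time distribution P[k] arises from random initializations.

```python
import random

def converge_time(lr=0.3, tol=1e-6, w_true=1.5, spread=2.0, max_iter=10_000):
    """Iterations for 1-D gradient descent on E = (w - w_true)^2 to reach E < tol,
    starting from a random initial weight."""
    w = w_true + random.uniform(-spread, spread)
    for k in range(1, max_iter + 1):
        w -= lr * 2.0 * (w - w_true)    # gradient step; the error gap shrinks by (1 - 2*lr) each step
        if (w - w_true) ** 2 < tol:
            return k
    return max_iter                      # treated as "did not converge"

random.seed(0)
times = [converge_time() for _ in range(2000)]
# Empirical stopping-time distribution P[k], analogous to Figure 1a.
pmf = {k: times.count(k) / len(times) for k in sorted(set(times))}
print(pmf)
```

Feeding such an empirical P[k] into equation 2.2 then gives the T_RES(k) curve of Figure 1b, whose minimizer is the estimated optimal restart time.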
Figure 1: Steepest descent. (a) The experimental and theoretical probability of stopping at iteration k for the specific choices of the parameters σ, δ, η shown. (b) T_RES(k) as a function of k obtained using the theoretical P(k).
Example: XOR Problem. We have applied the restart method to the backpropagation algorithm. We chose the XOR problem using a 2-2-1 network, a widely studied benchmark. It is well known that the XOR problem with the 2-2-1 network possesses no local minima (see Sprinkhuizen-Kuyper & Boers, 1996, 1998; Hamey, 1995, 1998). First, we estimated the probability density of the convergence time p(t) by generating the initial weights according to a gaussian density with Σ = σ²I and repeatedly training the network (with learning rate η) until the prespecified tolerance δ for the error function was achieved. We implemented over 150,000 training runs. The resulting estimated density is shown in Figure 2a. As is well known, steepest descent exhibits a large tail, so in spite of the fact that convergence is expected at 1000 iterations, many training runs did not converge until after 3000 iterations. Applying the restart formula, equation 2.2, we obtain the expected convergence time as a function of the restart time T. This relationship is shown in Figure 2b. One can see that restart improves convergence speed. In particular, the optimal restart time is at about 950 iterations.
1308
M. Magdon-Ismail and A. F. Atiya
Figure 2: Two-dimensional XOR problem. (a) The experimental probability of stopping at time t for the specific choices of the parameters σ, δ, γ as shown. (b) T_RES(k) as a function of k obtained using the density in (a).

3.2 A Sum of Gamma Functions Density. As another example, we consider a more sophisticated density: the sum of two gamma functions (see Figure 3a). Many algorithms exhibit several peaks in their convergence time density. In multilayer training, this can happen in particular if the minimum lies between a steep and a flat surface. In other optimization applications, such as maximization of likelihood functions, the same situation is frequently encountered because of the nature of the problem (see Magdon-Ismail & Abu-Mostafa, 1998). Figure 3b shows the expected convergence time as a function of the restart time. The best restart time is T* = 6 in this example. We can see the importance of choosing the right restart time for such a case; choosing T slightly smaller or larger would be detrimental.

4 Conclusion
In this study we have analyzed the restart mechanism. We have shown that it is often faster on average to stop a running algorithm and restart it with new initial conditions. The restart algorithm is essential when there are local minima, because it reduces the expected convergence time from infinity to
Figure 3: Sum of two gamma functions. (a) The probability of stopping at time t. (b) T_RES(T) as a function of T.
a finite number. However, we have shown in the simulations that restart is also advantageous in some applications where no local minima exist, partly because of the large tail or the multimodality of the convergence-time density. We have also derived the optimal restart time in an attempt to understand the phenomenon. Although the convergence time density is often unknown (except to algorithm developers, who can afford to run the algorithm many times), the formula derived gives guidance on how to choose a good estimate of the restart time. For example, if it is known that the minimum lies between a steep and a flat surface, then the density is multimodal, and an estimate of the modes will lead to a better estimate of the restart time.

Appendix A: Distribution of the Number of Iterations for Gradient Descent on the Squared Error with Linear Models
Suppose the input vectors are {x_i}, i = 1, …, N. Corresponding to each input is the output y_i, the ith element of the column vector y. Let the input matrix X be
given by X = [x_1 x_2 ··· x_N]. The error E can be put in the following form:

E = Σ_{i=1}^N (w^T x_i − y_i)² = w^T XX^T w − 2 w^T X y + y^T y.  (A.1)
We use the gradient-descent algorithm:

w(k+1) = w(k) − (γ/2) ∂E/∂w(k),  (A.2)
with ∂E/∂w(k) = 2(XX^T w − Xy). Assuming without loss of generality that XX^T = I, we find that E depends on the iteration number as

E(k) = ‖w(k) − w*‖² = (1 − γ)^{2k} ‖w(0) − w*‖².  (A.3)
We suppose the condition for stopping is that E ≤ δ. Letting P[k] denote the probability of stopping at iteration k, we have that

P[k] = P[ √δ/(1 − γ)^{k−1} ≤ ‖w(0) − w*‖ < √δ/(1 − γ)^k ].  (A.4)
Thus, we need the distribution of ‖w(0) − w*‖. We restrict ourselves to a three-dimensional weight space; the more general case, though algebraically more complex, is treated with exactly the same techniques. Suppose that w(0) has a normal distribution with mean 0 and variance σ². Let F(r) = P[‖w(0) − w*‖ ≤ r]:

F(r) = (2πσ²)^{−3/2} ∫_{‖w−w*‖² ≤ r²} d³w e^{−‖w‖²/(2σ²)}.  (A.5)
Transforming to spherical coordinates with the orientation such that the z-axis is aligned with w*, doing the azimuthal integration, and making the substitution v = cos θ, we have

F(r) = [2π/(2πσ²)^{3/2}] ∫_0^∞ dw ∫_{−1}^{1} dv w² e^{−w²/(2σ²)},  (A.6)

where the region of integration is restricted to w² − 2ww*v ≤ r² − w*².
One needs to consider separately the cases w* ≥ r and w* ≤ r. Then, differentiating the expression for F(r), one obtains the density. After considerable calculation, one finds that
f(r) = [2/(w* √(2πσ²))] r sinh(w*r/σ²) e^{−(w*² + r²)/(2σ²)}.  (A.7)
Let l₁(k) = √δ/(1 − γ)^{k−1} for k > 0 and l₁(k) = 0 for k = 0. Let l₂(k) = √δ/(1 − γ)^k. Then, from equation A.4 we see that

P[k] = ∫_{l₁(k)}^{l₂(k)} dr f(r).  (A.8)
When w* is close to 0, equation A.8 reduces to

P[k] = [erf(l₂/√(2σ²)) − erf(l₁/√(2σ²))] + (2/√(2πσ²)) [l₁ e^{−l₁²/(2σ²)} − l₂ e^{−l₂²/(2σ²)}],  (A.9)

where erf(x) = (2/√π) ∫_0^x dt exp(−t²). The case where w(0) is distributed normally about the true weight vector is also equivalent to the case w* → 0.

Acknowledgments
We thank Yaser Abu-Mostafa and the Caltech Learning Systems Group for their useful input. We also acknowledge the support of NSF's Engineering Research Center at Caltech.

References

Fahlman, S. (1988). An empirical study of learning speed in back-propagation networks (CMU Tech. Rep. No. CMU-CS-88-162). Pittsburgh: Carnegie Mellon University.
Ghannadian, F., & Alford, C. (1996). Application of random restart to genetic algorithms. Intelligent Systems, 95, 81–102.
Hamey, L. (1995). Analysis of the error surface of the XOR network with two hidden nodes (Computing Rep. No. 96/167C). Department of Computing, Macquarie University.
Hamey, L. (1998). XOR has no local minima: A case study in neural network error surface analysis. Neural Networks, 11, 669–681.
Heskes, T., Slijpen, E., & Kappen, B. (1992). Learning in neural networks with local minima. Physical Review A, 46(8), 5221–5231.
Hu, X., Shonkwiler, R., & Spruill, M. (1997). Random restart in global optimization (Preprint 1192-015). School of Mathematics, Georgia Institute of Technology.
Leen, T., & Moody, J. (1993). Weight space probability densities in stochastic learning: I. Dynamics and equilibria. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems (pp. 451–458). San Mateo, CA: Morgan Kaufmann.
Magdon-Ismail, M., & Abu-Mostafa, Y. (1998). Validation of volatility models. Journal of Forecasting, 17, 349–368.
Orr, G., & Leen, T. (1993). Weight space probability densities in stochastic learning: II. Transients and basin hopping times. In C. L. Giles, S. J. Hanson, & J. D. Cowan (Eds.), Advances in neural information processing systems (pp. 507–514). San Mateo, CA: Morgan Kaufmann.
Sprinkhuizen-Kuyper, I., & Boers, E. (1996). The error surface of the simplest XOR network has only global minima. Neural Computation, 8, 1301–1320.
Sprinkhuizen-Kuyper, I., & Boers, E. (1998). The error surface of the 2-2-1 XOR: The finite stationary points. Neural Networks, 11, 683–690.

Received October 2, 1998; accepted April 13, 1999.
LETTER
Communicated by Radford Neal
Attractor Dynamics in Feedforward Neural Networks

Lawrence K. Saul
AT&T Labs—Research, Florham Park, NJ 07932, U.S.A.

Michael I. Jordan
University of California, Berkeley, CA 94720, U.S.A.

We study probabilistic generative models parameterized by feedforward neural networks. An attractor dynamics for probabilistic inference in these models is derived from a mean field approximation for large, layered sigmoidal networks. Fixed points of the dynamics correspond to solutions of the mean field equations, which relate the statistics of each unit to those of its Markov blanket. We establish global convergence of the dynamics by providing a Lyapunov function and show that the dynamics generate the signals required for unsupervised learning. Our results for feedforward networks provide a counterpart to those of Cohen-Grossberg and Hopfield for symmetric networks.

1 Introduction
Attractor neural networks lend a computational purpose to continuous dynamical systems. Celebrated uses of these networks include the storage of associative memories (Amit, 1989), the reconstruction of noisy images (Koch, Marroquin, & Yuille, 1986), and the search for shortest paths in the traveling salesman problem (Hopfield & Tank, 1986). In all of these examples, a distributed computation is performed by an attractor dynamics and its flow to stable fixed points. These examples can also be formulated as problems in probabilistic reasoning; indeed, it is well known that symmetric neural networks can be analyzed as statistical mechanical ensembles or Markov random fields (MRFs). Attractor neural networks and MRFs are connected by the idea of an energy surface. This connection has led to new algorithms for probabilistic inference in symmetric networks, a problem traditionally addressed by stochastic sampling procedures (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Geman & Geman, 1984). For example, in one of the first unsupervised learning algorithms for neural networks, Ackley, Hinton, and Sejnowski (1985) applied Gibbs sampling to estimate the statistics of binary MRFs. Known as Boltzmann machines, these networks relied on time-consuming Monte Carlo simulation and simulated annealing as an inner loop of their learning procedure. Subsequently, Peterson and Anderson

Neural Computation 12, 1313–1335 (2000) © 2000 Massachusetts Institute of Technology
(1987) introduced a faster deterministic method for probabilistic inference. Their method, based on the so-called mean field approximation from statistical mechanics, transformed the binary-valued MRF into a network with continuous-valued units. The continuous network, endowed with dynamics given by the mean field equations, is itself an attractor network; in particular, it possesses a Lyapunov function (Cohen & Grossberg, 1983; Hopfield, 1984). Thus, one can perform approximate probabilistic inference in a binary MRF by relaxing a deterministic, continuous network. In this article, we show that this linkage of attractor dynamics and probabilistic inference is not limited to symmetric networks or (equivalently) to models represented as undirected graphs. We investigate an attractor dynamics for feedforward networks, or directed acyclic graphs (DAGs); these are networks with directed edges but no directed loops. The probabilistic models represented by DAGs are known as Bayesian networks, and together with MRFs, they comprise the class of probabilistic models known as graphical models (Lauritzen, 1996). Like their undirected counterparts, Bayesian networks have been proposed as models of both artificial and biological intelligence (Pearl, 1988). The units in Bayesian networks represent random variables, while the links represent assertions of conditional independence. These independence relations endow DAGs with a precise probabilistic semantics. Any joint distribution over a fixed, finite set of random variables can be represented by a Bayesian network, just as it can be represented by an MRF. What is compactly represented by one type of graphical model, however, may be quite clumsily represented by the other. MRFs arise naturally in statistical mechanics, where they describe the Gibbs distributions for systems in thermal equilibrium.
Bayesian networks, on the other hand, are designed to model causal or generative processes; hidden Markov models, Kalman filters, soft-split decision trees—these are all examples of Bayesian networks. The connection between Bayesian networks and neural network models of learning was pointed out by Neal (1992). Neal studied Bayesian networks whose units represented binary random variables and whose conditional probability tables were parameterized by sigmoid functions. He showed that these probabilistic networks have gradient-based learning rules that depend only on locally available information (Buntine, 1994; Binder, Koller, Russell, & Kanazawa, 1997). These observations led Dayan, Hinton, Neal, and Zemel (1995) and Hinton, Dayan, Frey, and Neal (1995) to propose the Helmholtz machine—a multilayered probabilistic network that learns hierarchical generative models of sensory inputs. Helmholtz machines were conceived not only as tools for statistical pattern recognition but also as abstract models of top-down and bottom-up processing in the brain. Following the work on Helmholtz machines, a number of researchers began to investigate unsupervised learning in large, layered Bayesian networks (Lewicki & Sejnowski, 1996; Saul, Jaakkola, & Jordan, 1996). As in undirected MRFs, probabilistic inference in these networks is generally intractable, and approximations are required. Saul et al. (1996) proposed a mean field approximation for these networks, analogous to the existing one for Boltzmann machines. Their approximation transformed the binary-valued network into a continuous-valued network whose statistics were described by a set of mean field equations. These equations related the statistics of each unit to a weighted sum of statistics from its Markov blanket (Pearl, 1988), a natural generalization of the notion of neighborhood in undirected MRFs. This earlier work did not, however, exhibit the solutions of these equations as fixed points of a simple continuous dynamical system. In particular, Saul et al. (1996) did not provide an attractor dynamics or a Lyapunov function for their mean field equations. In this article, we bring this sequence of ideas full circle by forging a link between attractor dynamics and probabilistic inference for directed networks. The link is achieved via mean field theory, just as in the undirected case. In particular, we describe an attractor dynamics whose stable fixed points correspond to solutions of the mean field equations. We also establish global convergence of these dynamics by providing a Lyapunov function. Our results thus provide an understanding of feedforward (Bayesian) networks that parallels the usual understanding of symmetric (MRF) networks. In both cases, we have a satisfying semantics for the set of allowed probability distributions; in both cases, we have a mean field theory that sidesteps the intractability of exact probabilistic inference; and in both cases, we have an attractor dynamics that transforms a discrete-valued network into a continuous dynamical system. While this article builds on previous work, we have tried to keep it self-contained. The organization is as follows. In section 2, we introduce the probabilistic models represented by DAGs and review the problems of inference and learning.
In section 3, we present the mean field theory for these networks: the mean field equations, the attractor dynamics, and the learning rule. In section 4, we describe some experiments on a database of handwritten digits and compare our results to known benchmarks. Finally, in section 5, we present our conclusions, as well as directions for future research.

2 Probabilistic DAGs
Consider a feedforward network—or equivalently, a directed acyclic graph—in which each unit represents a binary random variable S_i ∈ {0, 1} and each link corresponds to a nonzero, real-valued weight, W_ij, to unit i from unit j. Thus, W_ij is a weight matrix whose zeros indicate missing links in the underlying DAG. Note that by assumption, W_ij is zero for j ≥ i. We can view this network as defining a probabilistic model in which missing links correspond to statements of conditional independence. In particular, suppose that the instantiations of the random variables S_i are generated by a causal process in which each unit is activated or inhibited (i.e., set to
one or zero) depending on the values of its parents. This generative process is modeled by the joint distribution

P(S) = ∏_i P(S_i | S_1, S_2, …, S_{i−1}) = ∏_i P(S_i | pa(S_i)),  (2.1)
where pa(S_i) denotes the parents of the ith unit. Equation 2.1 states that given the values of its parents, the ith unit is conditionally independent of its other ancestors in the graph. These qualitative statements of conditional independence are encoded by the structure of the graph and hold for arbitrary values of the weights W_ij attached to nonmissing links. The quantitative predictions of the model are determined by the conditional distributions, P(S_i | pa(S_i)), in equation 2.1. In this article, we consider sigmoid networks for which

P(S_i = 1 | pa(S_i)) = σ( Σ_j W_ij S_j ),  (2.2)
where σ(z) = [1 + e^{−z}]^{−1}; thus the sign of W_ij, positive or negative, determines whether unit j excites or inhibits unit i in the generative process. Although we have not done so here, it is straightforward to include a bias term in the argument of the sigmoid function. Note that the weights in the network induce correlations between the units, with higher-order correlations arising as information is propagated through one or more layers. The sigmoid nonlinearity ensures that the multilayer network does not have a single-layer equivalent. In what follows, we denote by s_i = σ(Σ_j W_ij S_j) the squashed weighted sum on the right-hand side of equation 2.2; this top-down signal, received by each unit from its parents, can also be regarded as a random variable in its own right.

Layered networks of this form (see Figure 1) were introduced as hierarchical generative models by Hinton et al. (1995). In typical applications, one imagines the units in the bottom layer to encode sensory data and the units in the top layers to encode different dimensions of variability. Thus, for example, in networks for image recognition, the bottom units might encode pixel values, while the top units encode higher-order features such as orientation and occlusion. The promise of these networks lies in their ability to parameterize hierarchical, nonlinear models of multiple interacting causes.

Effective use of these networks requires the ability to make probabilistic inferences. Essentially these are queries to ascertain likely values for certain units in the network, given values—or evidence—for other units. Let V denote the visible units for which values are known and H the hidden units for which values must be inferred. In principle, inferences can be made from the posterior distribution,

P(H|V) = P(H, V) / P(V),  (2.3)
Figure 1: A layered Bayesian network parameterizes a hierarchical generative model for the data encoded by the units in its bottom layer.
where P(H, V) is the joint distribution over hidden and visible units, as given by equation 2.1, and P(V) = Σ_H P(H, V) is the marginal distribution obtained by summing over all configurations of hidden units. Exact probabilistic inference, however, is generally intractable in large Bayesian networks (Cooper, 1990). In particular, if there are many hidden units, then the sum to compute P(V) involves an exponentially large number of terms. The same difficulty makes it impossible to compute statistics of the posterior distribution, P(H|V). Besides the problem of inference, one can also consider the problem of learning, or parameter estimation, in these networks. Unsupervised learning algorithms in probabilistic networks are designed to maximize the log-likelihood of observed data (for simplicity of exposition, we do not consider forms of regularization, e.g., penalized likelihoods or cross-validation, that may be necessary to prevent overfitting). The likelihood of each data vector is given by the marginal distribution, P(V) = Σ_H P(H, V). Local learning rules are derived by computing the gradients of the log-likelihood, ln P(V), with respect to the weights of the network (Neal, 1992; Binder et al., 1997). For each data vector, this gives the on-line update:

ΔW_ij ∝ E[(S_i − s_i) S_j],  (2.4)
where E[···] denotes an expectation with respect to the conditional distribution, P(H|V), and s_i = σ(Σ_j W_ij S_j) is the top-down signal from the parents of the ith unit. Note that the update takes the form of a delta rule, with the top-down signal s_i being matched to the target value provided by S_i. Intuitively, the learning rule adjusts the weights to bring each unit's expected value in line with an appropriate target value. These target values are specified explicitly by the evidence for the visible units in the network. For the other units in the network—the hidden units—appropriate target values must be computed by running an inference algorithm.

Generally, in large, layered networks, we can compute neither the log-likelihood ln P(V) nor the statistics of P(H|V) that appear in the learning rule, equation 2.4. A learning procedure can finesse these problems in two ways: (1) by optimizing the weights with respect to a more tractable cost function, or (2) by substituting approximate values for the statistics of the hidden units. As we shall see, both strategies are employed in the mean field theory for these networks.

3 Mean Field Theory
Mean field theory is a general method from statistical mechanics for estimating the statistics of correlated random variables (Parisi, 1988). The name arises from physical models in which weighted sums of random variables, such as Σ_j W_ij S_j, are interpreted as local magnetic fields. Roughly speaking, under certain conditions, a central limit theorem may be applied to these sums, and a useful approximation is to ignore the fluctuations in these fields and replace them by their mean value—hence the name, "mean field" theory. More sophisticated versions of the approximation also exist, in which one incorporates the leading terms in an expansion about the mean. The mean field approximation was originally developed for Gibbs distributions, as arise in MRFs. In this article we develop a mean field approximation for large, layered networks whose probabilistic semantics are given by equations 2.1 and 2.2. As in MRFs, our approximation exploits the averaging phenomena that occur at units whose conditional distributions, P(S_i | pa(S_i)), are parameterized in terms of weighted sums, such as Σ_j W_ij S_j. Addressing the twin issues of inference and learning, the approximation enables one to compute effective substitutes for the log-likelihood, ln P(V), and the statistics of the posterior distribution, P(H|V). The organization of this section is as follows. Section 3.1 describes the general approach behind the mean field approximation. Among its many interpretations, mean field theory can be viewed as a principled way of
approximating an intractable probabilistic model by a tractable one. A variational principle chooses the parameters of the tractable model to minimize an entropic measure of error. The parameters of the tractable model are known as the mean field parameters, and they serve as placeholders for the true statistics of the posterior distribution, P(H|V). Our most important result for feedforward neural networks is a compact set of equations for determining the mean field parameters. These mean field equations relate the statistics of each unit to those of its Markov blanket. Section 3.2 gives a succinct statement of the mean field equations, along with a number of useful intuitions. A more detailed derivation is given in the appendix. The mean field equations are a coupled set of nonlinear equations whose solutions cannot be expressed in closed form. Naturally, this raises the following concern: Have we merely replaced one intractable problem—that of calculating averages over the posterior distribution, P(H|V)—by an equally intractable one—that of solving the mean field equations? In section 3.3, we show how to solve the mean field equations using an attractor dynamics. This makes it quite straightforward to solve the mean field equations, typically at much less computational cost than (say) sampling the statistics of P(H|V). Finally, in section 3.4, we present a mean field learning algorithm for these networks. Weights are adapted by a regularized delta rule that depends only on locally available information. Interestingly, the attractor dynamics for solving the mean field equations generates precisely those signals required for unsupervised learning.

3.1 A Variational Principle. We now return to the problem of probabilistic inference in layered feedforward networks. Our goal is to obtain the statistics of the posterior distribution, P(H|V), for some full or partial instantiation V of the units in the network.
Since it is generally intractable to compute these statistics exactly, we adopt the following two-step approach: (1) introduce a parameterized family of simpler distributions whose statistics are easily computed; (2) approximate P(H|V) by the member of this family that is "closest," as determined by some entropic measure of distance. The starting point of the mean field approximation is to consider the family of factorial distributions:

Q(H|V) = ∏_{i∈H} μ_i^{S_i} (1 − μ_i)^{1−S_i}.  (3.1)
The parameters μ_i represent the mean values of the hidden units under the factorial distribution, Q(H|V). Note that by design, most statistics of Q(H|V) are easy to compute because the distribution is factorial.
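As a concrete illustration of how such a factorial family is used, the sketch below builds a tiny three-unit sigmoid network (the weights are illustrative assumptions), enumerates the exact posterior over its two hidden units, and fits the parameters μ_i of equation 3.1 by brute-force minimization of the Kullback-Leibler divergence KL(Q||P) defined below in equation 3.2:

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy sigmoid DAG: hidden units S1, S2 (no parents) feed visible unit S3.
w31, w32 = 2.0, -1.0                           # illustrative weights

def joint(s1, s2, s3):
    p3 = sigmoid(w31 * s1 + w32 * s2)          # equation 2.2
    return 0.5 * 0.5 * (p3 if s3 == 1 else 1 - p3)

# Exact posterior P(S1, S2 | S3 = 1) by enumeration (equation 2.3).
pv = sum(joint(s1, s2, 1) for s1, s2 in itertools.product([0, 1], repeat=2))
post = {(s1, s2): joint(s1, s2, 1) / pv
        for s1, s2 in itertools.product([0, 1], repeat=2)}

def kl(mu1, mu2):
    """KL(Q||P) for the factorial Q of equation 3.1."""
    total = 0.0
    for (s1, s2), p in post.items():
        q = (mu1 if s1 else 1 - mu1) * (mu2 if s2 else 1 - mu2)
        if q > 0:
            total += q * math.log(q / p)
    return total

grid = [i / 200 for i in range(1, 200)]
mu1, mu2 = min(((a, b) for a in grid for b in grid), key=lambda m: kl(*m))
print((mu1, mu2), kl(mu1, mu2))
```

Note that the minimizer of KL(Q||P) is not in general the vector of exact posterior marginals; for this weakly correlated posterior, however, the fitted μ_i land close to them.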
We can measure the distance between the distribution Q(H|V) in equation 3.1 and the true posterior distribution P(H|V) by the Kullback-Leibler (KL) divergence:

KL(Q||P) = Σ_H Q(H|V) ln [ Q(H|V) / P(H|V) ].  (3.2)
The KL divergence is strictly nonnegative, vanishing only if Q(H|V) = P(H|V). The idea behind the mean field approximation is to find the parameters {μ_i} that minimize KL(Q||P) and then to use the statistics of Q(H|V) as a substitute for the statistics of P(H|V). The first step of this calculation is to rewrite the posterior distribution P(H|V) using equation 2.3, thus breaking the right-hand side of equation 3.2 into three terms:

KL(Q||P) = Σ_H Q(H|V) ln Q(H|V) − Σ_H Q(H|V) ln P(H, V) + ln P(V).  (3.3)
The first two terms on the right-hand side of this equation depend on properties of the approximate distribution, Q(H|V). The first measures the (negative) entropy, and the second measures the expected value of ln P(H, V). The last term in equation 3.3 is simply the log-likelihood of the evidence, which—importantly—does not depend on the statistics of Q(H|V). Thus, this last term can be ignored when we minimize KL(Q||P) with respect to the parameters {μ_i}. It nevertheless has important consequences for learning, a subject to which we return in section 3.4.

3.2 Mean Field Equations. The first-order statistics of Q(H|V) that minimize KL(Q||P) naturally depend on the weights of the network, W_ij, and the evidence, V. This dependence is captured by the mean field equations, which are derived by evaluating and minimizing the right-hand side of equation 3.3. In this work, we make two simplifying assumptions to derive the mean field equations: first, that the weighted sum of inputs to each unit can be modeled by a gaussian distribution in large networks, and second, that certain intractable averages over Q(H|V) can be approximated by the use of an additional variational principle. Details of these calculations are given in the appendix. In what follows, we present the mean field equations as a fait accompli so that we can emphasize the main intuitions that emerge from the approximation of P(H|V) by Q(H|V). For sigmoid DAGs, the mean field approximation works by keeping track of two parameters, {μ_i, ξ_i}, for each unit in the network. Although
only the first of these appears explicitly in equation 3.1, it turns out that both are needed to evaluate and minimize the right-hand side of equation 3.3. Roughly speaking, these parameters are stored as approximations to the statistics of the true posterior distribution. In particular, μ_i ≈ E[S_i|V] approximates each unit's posterior mean, while ξ_i ≈ E[s_i|V] approximates the expected top-down signal in equation 2.2. In some trivial cases, these statistics can be computed exactly. For visible units, E[S_i|V] is identically zero or one, as determined by the evidence, and for units with no parents, E[s_i|V] is constant, independent of the evidence. More generally, though, these statistics cannot be computed exactly, and the parameters {μ_i, ξ_i} represent approximate values. The values of the mean field parameters {μ_i, ξ_i} are computed by solving a coupled set of nonlinear equations. For large, layered networks, these mean field equations are:

μ_i = σ[ Σ_j W_ij μ_j + Σ_j W_ji (μ_j − ξ_j) − (1/2)(1 − 2μ_i) Σ_j W_ji² ξ_j (1 − ξ_j) ],  (3.4)

ξ_i = σ[ Σ_j W_ij μ_j + (1/2)(1 − 2ξ_i) Σ_j W_ij² μ_j (1 − μ_j) ].  (3.5)
In certain cases, these equations may have multiple solutions. Roughly speaking, each solution then corresponds to the statistics of a different mode (or peak) of the posterior distribution. The mean field equations couple the parameters of each unit to those of its parents and children. In layered networks, this amounts to a direct coupling between units in adjacent layers. The terms inside the brackets of equations 3.4 and 3.5 can be viewed as effective influences on each unit in the network. Let us examine these influences, concentrating for the moment on the leading-order terms linear in the weights, W_ij. In equation 3.4, we see that the parameter μ_i depends on the statistics of its Markov blanket (Pearl, 1988)—that is, on its parents through the weighted sum Σ_j W_ij μ_j, on its children through the weighted sum Σ_j W_ji μ_j, and on the parents of its children through the weighted sum Σ_j W_ji ξ_j. To some extent, the difference Σ_j W_ji (μ_j − ξ_j) captures the effect of explaining away, in which units in one layer are coupled by evidence in the layers below. In equation 3.5, we see that the parameter ξ_i depends only on the statistics of its parents, with the leading dependence coming through the weighted sum Σ_j W_ij μ_j. Thus, we can interpret ξ_i as an approximation to the expected top-down signal, E[s_i|V]. The quadratic terms in equations 3.4 and 3.5, proportional to W_ij², capture higher-order corrections to the dependencies already noted. For example, in equation 3.5, these terms cause any variance in the parents of unit i to push ξ_i ≈ E[s_i|V] away from the extreme values of zero or one.
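A direct way to look for a solution of equations 3.4 and 3.5 is damped fixed-point iteration. The sketch below is one such scheme (the three-unit network, weights, damping factor, and sweep count are illustrative assumptions; the attractor dynamics of section 3.3 is the route taken in the article):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_field_iterate(W, mu, xi, clamped, n_steps=500, damp=0.5):
    """Damped fixed-point iteration of equations 3.4 and 3.5.
    W[i][j] is the weight to unit i from unit j (zero for j >= i);
    `clamped` maps visible units to their evidence (mu held fixed)."""
    n = len(W)
    for _ in range(n_steps):
        for i in range(n):
            # Argument of the sigmoid in equation 3.4.
            g = sum(W[i][j] * mu[j] for j in range(n)) \
                + sum(W[j][i] * (mu[j] - xi[j]) for j in range(n)) \
                - 0.5 * (1 - 2 * mu[i]) * sum(W[j][i] ** 2 * xi[j] * (1 - xi[j])
                                              for j in range(n))
            # Argument of the sigmoid in equation 3.5.
            h = sum(W[i][j] * mu[j] for j in range(n)) \
                + 0.5 * (1 - 2 * xi[i]) * sum(W[i][j] ** 2 * mu[j] * (1 - mu[j])
                                              for j in range(n))
            if i not in clamped:
                mu[i] = (1 - damp) * mu[i] + damp * sigmoid(g)
            xi[i] = (1 - damp) * xi[i] + damp * sigmoid(h)
    return mu, xi

# Hidden units 0 and 1 are parents of the visible unit 2 (evidence S2 = 1).
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [2.0, -1.0, 0.0]]
mu, xi = mean_field_iterate(W, mu=[0.5, 0.5, 1.0], xi=[0.5, 0.5, 0.5],
                            clamped={2: 1.0})
print([round(m, 3) for m in mu], [round(x, 3) for x in xi])
```

The converged μ for the positively weighted parent rises above 0.5 and the negatively weighted one falls below it, reflecting the bottom-up and explaining-away influences described above.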
These directed probabilistic networks have twice as many mean field parameters as their undirected counterparts. For this we can offer the following intuition. Whereas the parameters μ_i are determined by top-down and bottom-up influences, the parameters ξ_i are determined only by top-down influences. The distinction—essentially one between parents and children—is meaningful only for directed graphical models.

3.3 Attractor Dynamics. The mean field equations provide a self-consistent description of the statistics μ_i ≈ E[S_i|V] and ξ_i ≈ E[s_i|V] in terms of the corresponding statistics for the ith unit's Markov blanket. Except in special cases, however, the solutions to these equations cannot be expressed in closed form. Thus, in general, the values of the parameters {μ_i, ξ_i} must be found by numerically solving equations 3.4 and 3.5. This is greatly facilitated by expressing the solutions to these equations as fixed points of an attractor dynamics; we can then solve the mean field equations by integrating a set of differential equations. To this end, we associate with each unit the conjugate parameters:
g_i = Σ_j W_ij μ_j + Σ_j W_ji (μ_j − ξ_j) − (1/2)(1 − 2μ_i) Σ_j W_ji² ξ_j (1 − ξ_j),  (3.6)

h_i = Σ_j W_ij μ_j + (1/2)(1 − 2ξ_i) Σ_j W_ij² μ_j (1 − μ_j),  (3.7)
whose values are simply equal to the arguments of the sigmoid functions in the mean field equations. The variables g_i and h_i summarize the influences of the ith unit's Markov blanket. We consider the dynamics:

τ_μ μ̇_i = −[μ_i − σ(g_i)],  (3.8)

τ_h ḣ_i = +[ξ_i − σ(h_i)],  (3.9)
where τ_μ and τ_h are (positive) time constants, and μ̇_i and ḣ_i are the time derivatives of μ_i and h_i. Note that equation 3.9 specifies the time derivative of h_i, not ξ_i. As we show below, however, this does not present any difficulty in integrating the dynamics. By construction, the fixed points of this dynamics correspond to solutions of the mean field equations. To prove the stability of these fixed points, we introduce the Lyapunov function,

L = Σ_ij [ −W_ij μ_i μ_j + (1/2) W_ij² μ_i μ_j (1 − μ_j) ] + Σ_i [ ∫_0^{μ_i} σ^{−1}(m) dm + ∫_{−∞}^{h_i} σ(h) dh ],  (3.10)
Attractor Dynamics in Feedforward Neural Networks
1323
where σ^{-1}(μ) is the inverse sigmoid function, that is, σ^{-1}(σ(h)) = h. The first and third terms in this Lyapunov function are identical to what Hopfield (1984) considered for symmetric networks; the others are peculiar to sigmoid DAGs. Consider the time derivative of this Lyapunov function under the dynamics of equations 3.8 and 3.9. Note that this dynamics does not correspond to a strict gradient descent in L, which would trivially give rise to a proof of convergence. With some straightforward algebra, however, one can show that

$$\dot L = -\sum_i \left\{\left[\sigma^{-1}(\mu_i + \tau_\mu\dot\mu_i) - \sigma^{-1}(\mu_i)\right]\dot\mu_i + \tau_h \dot h_i^2\right\} \le 0, \tag{3.11}$$
where the inequality follows from the observation that the sigmoid function is monotonically increasing. Thus, the function L always decreases under the attractor dynamics. As we discuss in the appendix, the flow to stable fixed points can be viewed as computing the approximate distribution, Q(H|V), that best matches the true posterior distribution, P(H|V). In practice, one solves the mean field equations by discretizing the attractor dynamics in equations 3.8 and 3.9. We experimented with two simple schemes to compute updated values {μ̃_i, ξ̃_i} at time t + Δt based on current values {μ_i, ξ_i} at time t. One of these was a first-order Euler method:

$$\tilde\mu_i = \mu_i + \dot\mu_i\,\Delta t, \tag{3.12}$$

$$\tilde\xi_i = \frac{1}{2} - \left[\sum_j W_{ij}^2\,\tilde\mu_j(1-\tilde\mu_j)\right]^{-1}\left(h_i + \dot h_i\,\Delta t - \sum_j W_{ij}\tilde\mu_j\right), \tag{3.13}$$
followed (when necessary) by clipping operations that projected {μ_i, ξ_i} into the interval [0, 1]. The other scheme we tried was a slight variant that sidestepped the division operation in equation 3.13. This was done by making additional use of the sigmoid squashing function, replacing equation 3.13 by

$$\tilde\xi_i = \sigma(h_i + \dot h_i\,\Delta t). \tag{3.14}$$
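As an illustration only (this is our own sketch, not the authors' code), the conjugate parameters of equations 3.6 and 3.7 and the discretized updates of equations 3.12 and 3.14 can be written for a small, randomly weighted DAG; the network size, weight scale, and variable names below are our own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n = 12                                            # total units, topologically ordered
W = np.tril(rng.normal(0.0, 0.1, (n, n)), k=-1)   # W[i, j]: parent j -> unit i (a DAG)

clamped = np.zeros(n, dtype=bool)                 # treat the last 4 units as evidence
clamped[-4:] = True
evidence = np.array([1.0, 0.0, 1.0, 1.0])

mu = rng.uniform(0.1, 0.9, n)                     # mean field parameters mu_i, xi_i
xi = rng.uniform(0.1, 0.9, n)
mu[clamped] = evidence
xi[clamped] = evidence

def conjugates(mu, xi):
    # conjugate parameters g_i, h_i of equations 3.6 and 3.7
    g = (W @ mu + W.T @ (mu - xi)
         - 0.5 * (1 - 2 * mu) * ((W**2).T @ (xi * (1 - xi))))
    h = W @ mu + 0.5 * (1 - 2 * xi) * ((W**2) @ (mu * (1 - mu)))
    return g, h

def lyapunov(mu, xi):
    # equation 3.10, with both integrals done in closed form (A.12, A.13)
    _, h = conjugates(mu, xi)
    eps = 1e-12
    quad = -mu @ (W @ mu) + 0.5 * (xi**2) @ ((W**2) @ (mu * (1 - mu)))
    ent = np.sum(mu * np.log(mu + eps) + (1 - mu) * np.log(1 - mu + eps))
    return quad + ent + np.sum(np.logaddexp(0.0, h))

dt, tau_mu, tau_h = 0.25, 1.0, 1.0
free = ~clamped
L_trace = []
for _ in range(300):
    g, h = conjugates(mu, xi)
    mu_dot = -(mu - sigmoid(g)) / tau_mu                     # equation 3.8
    h_dot = (xi - sigmoid(h)) / tau_h                        # equation 3.9
    mu[free] = np.clip((mu + mu_dot * dt)[free], 0.0, 1.0)   # equation 3.12
    xi[free] = sigmoid(h + h_dot * dt)[free]                 # equation 3.14
    L_trace.append(lyapunov(mu, xi))

g, h = conjugates(mu, xi)
res_mu = np.abs(mu - sigmoid(g))[free].max()   # residual of mu_i = sigma(g_i)
res_xi = np.abs(xi - sigmoid(h))[free].max()   # residual of xi_i = sigma(h_i)
```

With weights this small the mean field map is a contraction, so both residuals shrink to essentially zero and the trace of L settles, mirroring the behavior shown in Figure 2.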
This second method does not strictly reduce to the continuous attractor dynamics in the limit Δt → 0; however, empirically it tended to converge more rapidly to solutions of the mean field equations. For this reason, and also because of its naturalness, we favored this method in practice. Figure 2 shows typical traces of L versus time for both methods. The traces were computed from one of the networks learned in section 4.

3.4 Mean Field Learning. The Lyapunov function in equation 3.10 has another interpretation that is important for unsupervised learning. Noting that the KL divergence between two distributions is strictly nonnegative, it
Figure 2: Typical convergence of the Lyapunov function, L, under the discretized attractor dynamics. The top curve shows the trace using equation 3.13 and the bottom curve using equation 3.14.
follows from equation 3.3 that

$$\ln P(V) \ge -\sum_H Q(H|V)\,\ln\left[\frac{Q(H|V)}{P(H,V)}\right]. \tag{3.15}$$

Equation 3.15 gives a lower bound on the log-likelihood of the evidence, ln P(V), in terms of an average over the tractable distribution, Q(H|V). The lower bound on ln P(V) can be used as an objective function for unsupervised learning in generative models (Hinton et al., 1995). Whereas in tractable networks, one adapts the weights to maximize the log-likelihood ln P(V), as in equation 2.4, in intractable networks, one adapts the weights to maximize the lower bound. In general, it is not possible to evaluate the right-hand side of equation 3.15 exactly; further approximations are required. In the appendix, we evaluate equation 3.15 assuming that the weighted sum of inputs to each unit has a gaussian distribution. We also make use of an additional variational principle to estimate intractable averages over Q(H|V). Evaluating equation 3.15 in this way leads to the Lyapunov function in equation 3.10. With this interpretation, we can view equation 3.10 as a surrogate objective
function for unsupervised learning. Thus, in addition to computing approximate statistics of the posterior distribution, P(H|V), the attractor dynamics in equations 3.8 and 3.9 also computes a useful objective function. (Under certain limiting conditions, the Lyapunov function in equation 3.10 actually provides a lower bound on the log-likelihood, ln P(V) ≥ −L, as opposed to merely an estimate.) Note the dual role of the Lyapunov function in the mean field approximation: the attractor dynamics minimizes L with respect to the mean field parameters {μ_i, ξ_i}, while the learning rule minimizes L with respect to the weights W_ij. A useful picture is to imagine these two minimizations occurring on vastly different timescales, with the mean field parameters {μ_i, ξ_i} tracking changes in the evidence much more rapidly than the weights, W_ij. Put another way, short-term memories are stored by the mean field parameters and long-term memories by the weights. We derive a mean field learning rule by computing the gradients of the Lyapunov function L with respect to the weights, W_ij. Applying the chain rule gives

$$\frac{dL}{dW_{ij}} = \frac{\partial L}{\partial W_{ij}} + \sum_k \frac{\partial L}{\partial \mu_k}\frac{\partial \mu_k}{\partial W_{ij}} + \sum_k \frac{\partial L}{\partial \xi_k}\frac{\partial \xi_k}{\partial W_{ij}}, \tag{3.16}$$
where the last two terms account for the fact that the mean field parameters depend implicitly on the weights through equations 3.4 and 3.5. (Here we have assumed that the attractor dynamics are allowed to converge fully before adapting the weights to new evidence.) We can simplify this expression by noting that the mean field equations describe fixed points at which ∂L/∂μ_k = ∂L/∂ξ_k = 0; thus the last two terms in equation 3.16 vanish. Evaluating the first term in equation 3.16 gives rise to the on-line learning rule:

$$\Delta W_{ij} \propto (\mu_i - \xi_i)\,\mu_j - W_{ij}\,\xi_i(1-\xi_i)\,\mu_j(1-\mu_j). \tag{3.17}$$
Comparing this learning rule to equation 2.4, we see that the mean field parameters fill in for the statistics of S_i and s_i. This is, of course, what makes the learning algorithm tractable. Whereas the statistics of P(H|V) cannot be efficiently computed, the parameters {μ_i, ξ_i} are found by solving the mean field equations. Note that the right-most term of equation 3.17 has no counterpart in equation 2.4. This term, a regularizer induced by the mean field approximation, causes W_ij to be decayed according to the mean field statistics of s_i and S_j. In particular, the weight decay is suppressed if either ξ_i or μ_j is saturated near zero or one; in effect, weights between highly correlated units are burned in to their current values.
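A minimal sketch of this update follows (our own illustration; the weight convention W[i, j], the learning rate, and the helper name are not from the paper):

```python
import numpy as np

def mean_field_weight_update(W, mu, xi, lr=0.05):
    """One on-line step of the learning rule in equation 3.17.

    W[i, j] is the weight from parent j to unit i; mu and xi are the
    converged mean field parameters for a single training example."""
    hebb = np.outer(mu - xi, mu)                          # (mu_i - xi_i) mu_j
    decay = W * np.outer(xi * (1 - xi), mu * (1 - mu))    # W_ij xi_i(1-xi_i) mu_j(1-mu_j)
    return W + lr * (hebb - decay)
```

Note how the decay factor ξ_i(1 − ξ_i)μ_j(1 − μ_j) vanishes whenever either statistic saturates at zero or one, which is exactly the burn-in effect described above.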
Figure 3: (Left) Actual images from the training set. (Middle) Images sampled from the generative models of trained networks. (Right) Images whose bottom halves were inferred from their top halves.
4 Experimental Results
We used a database of handwritten digits to evaluate the computational abilities of unsupervised neural networks represented by DAGs. The database consisted of 11,000 examples of handwritten digits compiled by the U.S. Postal Service Office of Advanced Technology. The examples were preprocessed to produce 8 × 8 binary images, as in Figure 3. For each digit, we divided the data into a training set of 700 examples and a test set of 400 examples. The partition of data into training and test sets was the same as used in previous studies (Hinton et al., 1995; Saul et al., 1996). We used the mean field algorithm from the previous section to learn generative models of each digit class. The generative models were parameterized by three-layer networks with 8 × 24 × 64 architectures. In 100 independent experiments,² we trained 10 networks, one for each digit class, and then used these networks to classify the images in the test set. The test images were labeled by whichever network returned the highest value of −L, used as a stand-in for the true log-likelihood, ln P(V). The mean classification error rate in these experiments was 4.4%, with a standard deviation of 0.2%. These results are considerably better than standard benchmarks on this database (Hinton et al., 1995), such as k-nearest neighbors (6.7%) and backpropagation (5.6%). They also improve slightly on results from the wake-sleep learning rule (4.8%) in Helmholtz machines (Hinton et al., 1995) and from an earlier version (4.6%) of the mean field learning rule (Saul et al., 1996).

² The experimental details were as follows. Each network was trained by five passes through the training examples. The weights were adapted using a fixed learning rate of 0.05. Mean field parameters were computed by 16 iterations of the discretized attractor dynamics, equations 3.12 and 3.14, with τ_μ = τ_h = 1 and a step size of Δt = 0.25. The mean field parameters were initialized by a top-down pass through the network, setting ξ_i = σ(Σ_j W_ij ξ_j) and μ_i = ξ_i for the hidden units. The weights W_ij were initialized by random draws from a gaussian distribution with zero mean and small variance.
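The top-down initialization described in footnote 2 can be sketched as follows (an illustration under our own conventions for layer ordering; `weights` and `top_down_init` are hypothetical names, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def top_down_init(weights):
    """Initialize mean field parameters by one top-down pass.

    weights[l][i, j] connects unit j in layer l to unit i in layer l + 1,
    with layer 0 at the top of the generative model."""
    xi = [np.full(weights[0].shape[1], 0.5)]   # parentless top units: sigma(0) = 1/2
    for Wl in weights:
        xi.append(sigmoid(Wl @ xi[-1]))        # xi_i = sigma(sum_j W_ij xi_j)
    mu = [x.copy() for x in xi]                # mu_i = xi_i for the hidden units
    return mu, xi
```

Such a pass gives the attractor dynamics a starting point that already respects the generative flow of the network, which is one reason so few (16) relaxation iterations sufficed.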
The classification results show that the mean field networks have learned noisy but essentially accurate models of each digit class. This is confirmed visually by looking at images sampled from the generative model of each network (see Figure 3). The three columns in this figure show, from left to right, actual images from the training set, fantasies sampled from the generative models of trained networks, and images whose top halves were taken from those in the first column and whose bottom halves were inferred, or filled in, by the attractor dynamics. These last images show that probabilistic DAGs can function as associative memories in the same way as symmetric neural networks, such as the Hopfield model (1984).

5 Discussion
In this article we have extended the attractor paradigm of neural computation to feedforward networks parameterizing probabilistic generative models. The probabilistic semantics of these networks (Lauritzen, 1996; Neal, 1992; Pearl, 1988) differ in useful respects from those of symmetric neural networks, for which the attractor paradigm was first established (Cohen & Grossberg, 1983; Hopfield, 1982; Geman & Geman, 1984; Ackley et al., 1985; Peterson & Anderson, 1987). Borrowing ideas from statistical mechanics, we have derived a mean field theory for approximate probabilistic inference. We have also exhibited an attractor dynamics that converges to solutions of the mean field equations and that generates the signals required for unsupervised learning. While learning and dynamics have been twin themes of neural network research since its inception, it often appears that the field is divided into two camps: one studying symmetric networks with energy functions, the other studying feedforward networks that do not involve iterative forms of relaxation. In our view, this split has prevented researchers from combining the benefits of both approaches to computation. We note that despite the strong convergence results available for symmetric networks, there have been few applications for these networks involving any significant element of learning. Likewise, despite the powerful learning abilities of feedforward networks, there have been few applications involving more complex forms of inference and decision making. In the remainder of this section, we discuss the many compelling reasons for combining these two approaches and suggest how this might be done using the ideas in this article. Let us begin by considering feedforward networks. Many practical learning algorithms have been developed for feedforward networks, and numerous theoretical results are available to characterize their properties for approximation and estimation.
The usual framework for feedforward networks is one of supervised learning, or function approximation. In particular, a network induces a functional relationship between x and y based on a training set consisting of ( x, y ) pairs. Subsequent x inputs can be used
as queries, and the network interpolates or extrapolates to provide a response y. Although useful and general, this framework also has limitations. In particular, it is not always the case that the form of future queries is known in advance of training, and indeed, as in the classical setting of associative memory, it can be useful to allow arbitrary components of the joint (x, y) vector to serve as queries. For example, in control and optimization applications, one would like to use y as a query and extract a corresponding x. In missing data problems, one would like to fill in components of the x vector given y or other components of x. In applications involving diagnosis, model critiquing, explanation, and sensitivity analysis, one would often like to find values of hidden units that correspond to particular input or output patterns. Finally, in problems with unlabeled examples, one would like to do some form of unsupervised learning. In our view, these manifold problems are best treated as general inference problems on the database of knowledge stored by the network. Moreover, as is suggested by the heuristic iterative techniques that have been employed to “invert” feedforward networks (Hoskins, Hwang, & Vagners, 1992; Jordan & Rumelhart, 1992), we expect issues in dynamical systems to become relevant when inference is performed in an “upstream” direction. Even in the classical setting, where feedforward networks are used for function approximation, an inferential perspective can be useful. Consider two logistic hidden units with strong, positive connections to a logistic output unit. If the output unit has a target value of one, then we can exploit the fact that only one hidden unit suffices to activate the output unit.
In particular, if we have additional evidence that (say) the first hidden unit is activated, perhaps via its connection to another output unit, then we can infer that the second hidden unit is not required to be activated, and thus can be used for other purposes. This explaining-away phenomenon reflects an induced correlation between the hidden units, and it is natural in many diagnostic settings involving hidden causes (Pearl, 1988). It and other induced correlations between hidden units can be exploited if we augment our view of feedforward network learning to include an “upstream” inferential component. While classical feedforward networks are powerful learning machines and weak inference engines, the opposite can be said of symmetric neural networks. Properly configured, symmetric networks can perform inferences as complex as solving the traveling salesman problem (Hopfield & Tank, 1986), yet few have emerged in applications involving a significant element of learning. In our view, the reasons for this are twofold (Pearl, 1988). First, it is a general fact that undirected graphical models—of which symmetric neural networks, such as the Boltzmann machine, are a special case—are less modular than directed graphical models. In a directed model, units that are downstream from the queried and observed units can simply be deleted; they have no effect on the query. In undirected networks no such
modularity generally exists; units are generally coupled via the partition function. Second, in a directed network, it is possible to use causal intuitions to understand representation and processing in the model. This can often be an overwhelming advantage. Moreover, if the domain being modeled has a natural causal structure, then it is natural to use a directed model that accords with the observed direction of causality. We take two lessons from the previous successes of neural computation: (1) from the abilities of symmetric networks, that complex forms of inference require an element of iterative processing; and (2) from the abilities of feedforward networks, that the capacity to learn is greatly enhanced by the element of directionality. We believe that the formalism in this article combines the best aspects of symmetric and feedforward neural networks. The models we study are represented by directed acyclic graphs and thus have the natural advantages of modularity and causality that accrue to feedforward networks. Moreover, because they are endowed with probabilistic semantics, they also support complex types of inference and reasoning. This allows them to be applied to a broad range of problems involving diagnosis, explanation, control, optimization, and missing data. Our formalism also reconciles the problems of unsupervised and supervised learning in a manner reminiscent of the Boltzmann machine (Ackley et al., 1985). The supervised case simply emerges as the limiting case in which all of the input and output units are contained in the set of visible units. Finally, as in symmetric neural networks, approximate probabilistic inference is performed by relaxing a continuous dynamical system. Our formalism thus preserves the many compelling features of the attractor paradigm, including the guarantees of stability and convergence, the potential for massive parallelism, and the physical analogy of an energy surface. 
We have contrasted the networks in this article to standard backpropagation networks, which do not make use of probabilistic semantics or attractor dynamics. Another representational difference is that the units in backpropagation networks take on continuous values, whereas the units in sigmoid Bayesian networks represent binary random variables. Our focus on binary random variables, as opposed to continuous ones, however, should not be construed as a fundamental limitation of our methods. Ideas from mean field theory can be applied to probabilistic models of continuous random variables, and such applications may be of interest for more sophisticated generative models (Hinton & Ghahramani, 1997). Note that our analysis transforms a feedforward network into a recurrent network that possesses a Lyapunov function. This recurrent network (essentially equations 3.8 and 3.9 viewed as a recurrent network) is not a symmetric network, and its Lyapunov function does not follow directly from the theorems of Cohen and Grossberg (1983) and Hopfield (1984). We have derived the attractor dynamics for these networks by combining ideas from statistical mechanics with the probabilistic machinery of directed graphical models. Of course, one can also study recurrent networks that possess
a Lyapunov function, independent of any underlying probabilistic formulation. In fact, Seung, Richardson, Lagarias, and Hopfield (1998) recently exhibited a Lyapunov function for excitatory-inhibitory neural networks with a mixture of symmetric and antisymmetric interactions. Interestingly, their Lyapunov function has a similar structure to the one in equation 3.10. A general concern with dynamical approaches to computation involves the amount of time required to relax to equilibrium. Although we found empirically that this relaxation time was not long for the problem of recognizing handwritten digits (16 iterations of the discretized differential equations), the issue requires further attention. Beyond general numerical methods for speeding convergence, one obvious approach is to consider methods for providing better initial estimates of the mean field parameters. This general idea is suggestive of the Helmholtz machine of Hinton et al. (1995). The Helmholtz machine is a pair of feedforward networks, a top-down generative model that corresponds to the Bayesian network in Figure 1, and a bottom-up recognition model that computes the conditional statistics of the hidden units induced by the input vector. This latter network replaces the mean field equations in our approach. The recognition model is itself learned, essentially as a probabilistic inverse to the generative model. This approach obviates the need for the iterative solution of mean field equations. The trade-off for this simplicity is a lack of theoretical guarantees, and the fact that the recognition model cannot handle missing data or support certain types of reasoning, such as explaining away, that rely on the combination of top-down and bottom-up processing. One attractive idea, however, is to use a bottom-up recognition model to make initial guesses for the mean field parameters, then to use an attractor dynamics to refine these guesses.
Even without such enhancements, however, we believe that the attractor paradigm in directed graphical models is worthy of further investigation. Attractor neural networks have provided a viable approach to probabilistic inference in undirected graphical models (Peterson & Anderson, 1987), particularly when combined with deterministic annealing. We attribute the lack of learning-based applications for symmetric neural networks to their representational limitations for modeling causal processes (Pearl, 1988) and the peculiar instabilities arising from the sleep phase of Boltzmann learning (Neal, 1992; Galland, 1993). By combining the virtues of attractor dynamics with the probabilistic semantics of feedforward networks, we feel that a more useful and interesting model emerges. Appendix: Details of Mean Field Theory
In this appendix we derive the mean field approximation for large, layered networks whose probabilistic semantics are given by equations 2.1 and 2.2. Starting from the factorized distribution for Q(H|V), equation 3.1, our goal is to minimize the KL divergence in equation 3.3, with respect to the parameters {μ_i}. Note that this is equivalent to maximizing the lower bound on ln P(V), given in equation 3.15. The first term on the right-hand side of equation 3.3 is simply minus the entropy of the factorial distribution, Q(H|V), or:

$$\sum_H Q(H|V)\ln Q(H|V) = \sum_i \left[\mu_i\ln\mu_i + (1-\mu_i)\ln(1-\mu_i)\right]. \tag{A.1}$$
Here, for notational convenience, we have introduced parameters μ_i for all the units in the network, hidden and visible. For the visible units, we use these parameters simply as placeholders for the evidence. Thus, the visible units are clamped to either zero or one, and they do not contribute to the entropy in equation A.1. Evaluating the second term on the right-hand side of equation 3.3 is not as straightforward as the entropy. In particular, for each unit, let

$$z_i = \sum_j W_{ij} S_j \tag{A.2}$$

denote its weighted sum of parents, and let s_i = σ(z_i) denote its squashed top-down signal. From equations 2.1 and 2.2, we can write the joint distribution in these networks as

$$\ln P(S) = \sum_i \left[S_i\ln s_i + (1-S_i)\ln(1-s_i)\right] \tag{A.3}$$

$$\phantom{\ln P(S)} = \sum_i \left(S_i z_i - \ln\left[1 + e^{z_i}\right]\right). \tag{A.4}$$
Note that to evaluate the second term in equation 3.3, we must average the right-hand side of equation A.4 over the factorial distribution, Q(H|V). The logarithm term in equation A.4, however, makes it impossible to compute this average in closed form. Clearly, another approximation is needed to compute the expected value of ln[1 + e^{z_i}], averaged over the distribution, Q(H|V). We can make progress by studying the sum of inputs, z_i, as a random variable in its own right. Under the distribution Q(H|V), the right-hand side of equation A.2 is a weighted sum of independent random variables with means μ_j and variances μ_j(1 − μ_j). The number of terms in this sum is equal to the number of hidden units in the preceding layer. In large networks, we expect the statistics of this sum—or, more precisely, the distribution Q(z_i|V)—to be governed by a central limit theorem. In other words, to a very good approximation, Q(z_i|V) assumes a gaussian distribution with mean and variance:

$$\langle z_i\rangle = \sum_j W_{ij}\mu_j, \tag{A.5}$$

$$\langle \delta z_i^2\rangle = \sum_j W_{ij}^2\,\mu_j(1-\mu_j), \tag{A.6}$$
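These moments are easy to confirm by simulation. The sketch below (our own, with arbitrary sizes and weight scale) draws Bernoulli parents and compares the sample statistics of z_i against equations A.5 and A.6:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 500                                          # hidden units in the preceding layer
mu = rng.uniform(0.1, 0.9, N)                    # parent means under Q(H|V)
w = rng.normal(0.0, 1.0, N) / np.sqrt(N)         # weights obeying sqrt(N)|W_ij| < c

S = (rng.random((20000, N)) < mu).astype(float)  # independent Bernoulli(mu_j) draws
z = S @ w                                        # z_i = sum_j W_ij S_j  (equation A.2)

mean_theory = w @ mu                             # equation A.5
var_theory = (w**2) @ (mu * (1 - mu))            # equation A.6
```

A histogram of the samples z would also look visibly gaussian at this width, illustrating the central limit argument for the distribution Q(z_i|V).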
where ⟨·⟩ is used to denote the expected value. The gaussianity of Q(z_i|V) emerges in the thermodynamic limit of large, layered networks where each unit receives an infinite number of inputs from the hidden units in the preceding layer. In particular, suppose that unit i has N_i parents and that the weights W_ij are bounded by √N_i |W_ij| < c for some constant c. Then in the limit N_i → ∞, the third- and higher-order cumulants of Σ_j W_ij S_j vanish for any distribution under which the S_j are independently distributed binary variables. The assumption that √N_i |W_ij| < c implies that the weights are uniformly small and evenly distributed throughout the network; it is a natural assumption to make for robust, fault-tolerant networks whose computing abilities do not degrade catastrophically with random “lesions” in the weight matrix. Although only an approximation for finite networks, in what follows we make the simplifying assumption that Q(z_i|V) is a gaussian distribution. This assumption—specifically tailored to large, layered networks whose evidence arrives in the bottom layer—leads to the simple mean field equations and attractor dynamics in section 3.3. The asymptotic form of Q(z_i|V) and the logarithm term in equation A.4 motivate us to consider the following lemma. Let z denote a gaussian random variable with mean ⟨z⟩ and variance ⟨δz²⟩, and consider the expected value, ⟨ln[1 + e^z]⟩. Then, for any real number ξ, we have the upper bound (Seung, 1995):

$$\langle\ln[1+e^z]\rangle = \langle\ln[e^{\xi z}e^{-\xi z}(1+e^z)]\rangle \tag{A.7}$$

$$\phantom{\langle\ln[1+e^z]\rangle} = \xi\langle z\rangle + \langle\ln[e^{-\xi z} + e^{(1-\xi)z}]\rangle \tag{A.8}$$

$$\phantom{\langle\ln[1+e^z]\rangle} \le \xi\langle z\rangle + \ln\langle e^{-\xi z} + e^{(1-\xi)z}\rangle, \tag{A.9}$$

where the last line follows from Jensen's inequality. Since z is gaussian, it is straightforward to perform the averages on the right-hand side. This gives us an upper bound on ⟨ln[1 + e^z]⟩ expressed in terms of the mean and variance:

$$\langle\ln[1+e^z]\rangle \le \frac{1}{2}\xi^2\langle\delta z^2\rangle + \ln\left[1 + e^{\langle z\rangle + (1-2\xi)\langle\delta z^2\rangle/2}\right]. \tag{A.10}$$
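This bound is easy to check numerically. The sketch below is our own; it uses Gauss-Hermite quadrature (not anything from the paper) to evaluate the left-hand side accurately and compares it with the right-hand side of equation A.10 over a grid of ξ:

```python
import numpy as np

# Probabilists' Gauss-Hermite rule: expectations over z ~ N(0, 1)
nodes, wts = np.polynomial.hermite_e.hermegauss(60)
wts = wts / wts.sum()

def expected_softplus(m, v):
    """Accurate numerical value of <ln(1 + e^z)> for gaussian z ~ N(m, v)."""
    return wts @ np.logaddexp(0.0, m + np.sqrt(v) * nodes)

def seung_bound(m, v, xi):
    """Right-hand side of equation A.10."""
    return 0.5 * xi**2 * v + np.logaddexp(0.0, m + (1 - 2 * xi) * v / 2)

xis = np.linspace(-0.5, 1.5, 41)
gaps = {}
for m, v in [(-2.0, 0.5), (0.0, 1.0), (1.5, 2.0)]:
    lhs = expected_softplus(m, v)
    gaps[(m, v)] = np.array([seung_bound(m, v, x) for x in xis]) - lhs
```

In every case the gap stays nonnegative for all ξ, and the ξ that makes the bound tightest falls inside [0, 1], consistent with the convexity remark that follows.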
The right-hand side of equation A.10 is a convex function of ξ whose minimum occurs in the interval ξ ∈ [0, 1]. We can use this lemma to compute an approximate value for ⟨ln[1 + e^{z_i}]⟩, where the average is performed with respect to the distribution, Q(H|V). This is done by introducing an extra parameter, ξ_i, for each unit in the network, then substituting ξ_i and the statistics of z_i into equation A.10. Note that the terms ln[1 + e^{z_i}] appear in equation A.4 with an overall minus sign; thus, to the extent that Q(z_i|V) is well approximated by a gaussian distribution, the upper bound in equation A.10 translates into a lower bound on ⟨ln P(S)⟩. In particular, from equation A.4, we have:

$$\langle\ln P(S)\rangle \approx \sum_{ij} W_{ij}\mu_i\mu_j - \frac{1}{2}\sum_{ij}W_{ij}^2\,\xi_i^2\,\mu_j(1-\mu_j) - \sum_i \ln\left\{1 + e^{\sum_j\left[W_{ij}\mu_j + \frac{1}{2}(1-2\xi_i)W_{ij}^2\,\mu_j(1-\mu_j)\right]}\right\}. \tag{A.11}$$

3 One can also proceed without making this assumption, as in Saul et al. (1996), to derive approximations for nonlayered networks. The resulting mean field equations, however, do not appear to lend themselves to a simple attractor dynamics.
The right-hand side of equation A.11 becomes a lower bound on ⟨ln P(S)⟩ in the thermodynamic limit where Q(z_i|V) is described by a gaussian distribution. The objective function for the mean field approximation is the difference between equations A.1 and A.11; these expressions correspond to the first two terms of equation 3.3. The difference of these two equations is in fact the Lyapunov function, L, from equation 3.10. This can be shown by appealing to the definition of h_i in equation 3.7 and by noting that

$$\int_{-\infty}^{h}\sigma(h')\,dh' = \ln\left[1 + e^{h}\right], \tag{A.12}$$

$$\int_0^{\mu}\sigma^{-1}(m)\,dm = \mu\ln\mu + (1-\mu)\ln(1-\mu), \tag{A.13}$$

where σ^{-1}(μ) = ln[μ/(1−μ)] is the inverse sigmoid function. Thus we have derived the Lyapunov function by evaluating the KL divergence in equation 3.2. It follows that the Lyapunov function measures the discrepancy between the distributions Q(H|V) and P(H|V) in terms of the mean field parameters, {μ_i, ξ_i}. Optimal values for these parameters are found by minimizing L; in particular, computing the gradients ∂L/∂μ_i and ∂L/∂ξ_i and equating them to zero leads to the mean field equations, equations 3.4 and 3.5.
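On a network small enough to enumerate, the variational bound of equation 3.15 that anchors this whole derivation can be checked end to end. The toy weights below are our own choices; nothing here is from the paper's experiments:

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy sigmoid belief network: two parentless hidden units -> two visible units.
W = np.array([[1.0, -0.5],
              [0.7,  1.2]])            # W[i, j]: hidden unit j -> visible unit i
V = np.array([1.0, 0.0])               # clamped evidence

def log_joint(H):
    """ln P(H, V) for one hidden configuration H."""
    lp = 2 * np.log(0.5)               # parentless units fire with probability 1/2
    p = sigmoid(W @ H)
    return lp + np.sum(V * np.log(p) + (1 - V) * np.log(1 - p))

states = [np.array(h, dtype=float) for h in product([0, 1], repeat=2)]
log_pv = np.logaddexp.reduce([log_joint(H) for H in states])   # exact ln P(V)

def lower_bound(q):
    """Right-hand side of equation 3.15 for a factorized Q with parameters q."""
    total = 0.0
    for H in states:
        qh = np.prod(np.where(H == 1.0, q, 1.0 - q))
        total += qh * (log_joint(H) - np.log(qh))
    return total

grid = np.linspace(0.05, 0.95, 19)
bounds = [lower_bound(np.array([a, b])) for a in grid for b in grid]
```

Every factorized Q yields a value at or below the exact ln P(V), and the gap is the KL divergence being minimized; the attractor dynamics is a scalable way of tightening this bound when enumeration is impossible.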
Acknowledgments
We are grateful to H. Seung for many useful discussions about attractor dynamics and Lyapunov functions. We also thank Hinton et al. (1995) for sharing their preprocessing software for images of handwritten digits.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Amit, D. J. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Binder, J., Koller, D., Russell, S. J., & Kanazawa, K. (1997). Adaptive probabilistic networks with hidden variables. Machine Learning, 29, 213–244.
Buntine, W. (1994). Operations for learning with graphical models. Journal of Artificial Intelligence Research, 2, 159–225.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13, 815–826.
Cooper, G. (1990). Computational complexity of probabilistic inference using Bayesian belief networks. Artificial Intelligence, 42, 393–405.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. (1995). The Helmholtz machine. Neural Computation, 7, 889–904.
Galland, C. C. (1993). The limitations of deterministic Boltzmann machine learning. Network: Computation in Neural Systems, 4, 355–379.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Hinton, G. E., Dayan, P., Frey, B. J., & Neal, R. M. (1995). The wake-sleep algorithm for unsupervised neural networks. Science, 268, 1158–1161.
Hinton, G. E., & Ghahramani, Z. (1997). Generative models for discovering sparse distributed representations. Philosophical Transactions of the Royal Society B, 352, 1177–1190.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Hopfield, J. J. (1984). Neurons with graded responses have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81, 3088–3092.
Hopfield, J. J., & Tank, D. W. (1986). Computing with neural circuits: A model. Science, 233, 625–633.
Hoskins, D., Hwang, J., & Vagners, J. (1992). Iterative inversion of neural networks and its application to adaptive control. IEEE Transactions on Neural Networks, 3, 292–301.
Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal teacher. Cognitive Science, 16, 307–354.
Koch, C., Marroquin, J., & Yuille, A. (1986). Analog “neuronal” networks in early vision. Proceedings of the National Academy of Sciences, USA, 83, 4263–4267.
Lauritzen, S. L. (1996). Graphical models. Oxford: Oxford University Press.
Lewicki, M. S., & Sejnowski, T. J. (1996). Bayesian unsupervised learning of higher order structure. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21, 1087–1092.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56, 71–113.
Parisi, G. (1988). Statistical field theory. Redwood City, CA: Addison-Wesley.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo, CA: Morgan Kaufmann.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.
Saul, L. K., Jaakkola, T. S., & Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4, 61–76.
Seung, H. S. (1995). Annealed theories of learning. In J.-H. Oh, C. Kwon, & S. Cho (Eds.), Neural networks: The statistical mechanics perspective: Proceedings of the CTP-PRSRI Joint Workshop on Theoretical Physics. Singapore: World Scientific.
Seung, H. S., Richardson, T. J., Lagarias, J. C., & Hopfield, J. J. (1998). Minimax and Hamiltonian dynamics of excitatory-inhibitory networks. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing systems, 10. Cambridge, MA: MIT Press.

Received October 28, 1998; accepted May 14, 1999.
LETTER
Communicated by Halbert White and Jude Shavlik
Visualizing the Function Computed by a Feedforward Neural Network

Tony A. Plate
Bios Group LP, Santa Fe, NM 87501, U.S.A.

Joel Bert
John Grace
Department of Chemical Engineering, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada

Pierre Band
Environmental Health Centre, Health Canada, Ottawa, Ontario, K1A 0L2, Canada

A method for visualizing the function computed by a feedforward neural network is presented. It is most suitable for models with continuous inputs and a small number of outputs, where the output function is reasonably smooth, as in regression and probabilistic classification tasks. The visualization makes readily apparent the effects of each input and the way in which the functions deviate from a linear function. The visualization can also assist in identifying interactions in the fitted model. The method uses only the input-output relationship and thus can be applied to any predictive statistical model, including bagged and committee models, which are otherwise difficult to interpret. The visualization method is demonstrated on a neural network model of how the risk of lung cancer is affected by smoking and drinking.

1 Introduction
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1337–1353 (2000)

In application settings, neural networks are commonly regarded as "black boxes" because of the difficulty in understanding both the function computed and the way in which it was computed. Even if neural networks perform better than more interpretable models, their "black box" nature can make them less valuable. For example, in a recent series on medical applications of neural networks in the Lancet, many authors and correspondents expressed their fear of trusting important decisions to systems they do not understand (Wyatt, 1995; Sharp, 1995; Dodds, 1995). Thus, the development of good methods for interpreting neural networks could make them far more useful. Researchers have proposed a number of methods for understanding feedforward neural networks. These fall at various points along a "how-what" spectrum. "How" methods aim to provide an understanding of how
the network computes a function, usually by interpreting weights and hidden unit activations. "What" methods ignore the internal workings and aim to describe the function computed by the network. Methods such as contribution analysis (Shultz, Oshima-Takane, & Takane, 1995; Shultz & Elman, 1994; Sanger, 1989), which are designed to provide an interpretable view of the internal representational space, fall firmly at the "how" end of the spectrum. Rule extraction methods that translate hidden unit connectivities into rules (Towell & Shavlik, 1993; Blasig, 1994) fall somewhere in the middle of the spectrum. Other rule extraction methods that treat the network as a black box and derive a description of its function in the form of rules (Saito & Nakano, 1988) fall at the "what" end of the spectrum. In general, rule extraction methods appear to be more suited to understanding networks for clear-cut classification tasks. For continuous-output tasks, including classification tasks in which classes overlap so much that outputs are best described probabilistically, it seems more appropriate to use "what" methods that provide some quantitative or graphical indication of the effect of input variables on the output. Baxt (1992), Baxt and White (1995), and Moseholm, Taudorf, and Frosig (1993) describe some such methods. The method proposed in this article is similar to these latter methods, but is intended to overcome some of their limitations by adapting the graphical plots of generalized additive models (Hastie & Tibshirani, 1990) to the display of functions computed by neural networks and other flexible statistical models.

2 Background
Before proceeding further we review generalized additive models and previously proposed methods for graphically displaying the effects of inputs on neural networks. 2.1 Plots of Generalized Additive Models. Part of the attraction of Hastie and Tibshirani’s (1990) generalized additive models (GAMs) is that although they are nonlinear, they are easily interpreted from simple graphical displays. A generalized additive model can be expressed as
$$g(\mathbf{x}) = h(h_1(x_1) + h_2(x_2) + \cdots + h_k(x_k)),$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_k)$ is the vector of input values and $g(\mathbf{x})$ is the predicted output value. The functions $h_1, h_2, \ldots$ can be arbitrary nonlinear functions. The function $h$ (the inverse link function in the terminology of GAMs) relates the sum of the $h_i$ to the output variable. The choice of $h$ depends on the distribution of the target variable and the distribution of errors in the target values. Ideally, $h$ should be chosen so that the probability of observing a particular output value $y$ given some input vector $\mathbf{x}$ depends only on the difference between $y$ and the average value of $y$ for that input vector. For example, for a continuous output variable with gaussian error having variance independent of the predicted value of the output, the identity function
is appropriate for $h$, because the probability of observing a particular output value depends only on the difference between the observed and predicted (i.e., mean) values, and not directly on the predicted value. For binary-valued target variables, the logistic function $h(a) = 1/(1 + e^{-a})$ is usually appropriate; it gives the probability of the target's being 1. The possibility of different functions for $h$ makes GAMs "generalized"; appropriate functions exist for a wide range of error distributions. The way the functions of the inputs combine makes GAMs "additive." Although the effect of a particular input can be nonlinear, there are no interactions between inputs; the contribution to the sum of the $h_i$ of a particular input does not depend on the values of other inputs. (However, one can use functions of more than one variable in a GAM, which allows for interactions between the inputs in the same function.) A GAM is easily interpreted from plots of the effects of inputs on the additive scale of the model, that is, plots of the $h_i$. The $h_i$ are usually functions of one variable, and hence are simple to display. Note that the effects of the inputs on the response scale of the model, that is, the range of $h$, may not be additive and cannot be easily displayed. For the remainder of this article, we refer to the domain of $h$—the range of the $h_i$—as the natural additive scale of the model.

2.2 Plots and Quantitative Summaries of the Effects of Inputs in Neural Networks. One straightforward way to illustrate graphically the function computed by a neural network is to plot two-dimensional surfaces showing the response of the network to varying values of two particular inputs, while the rest of the inputs are held at some constant value. Moseholm et al. (1993) show pairwise surface plots for their model of how various conditions affect lung function. Each plot shows the effect of different values of a pair of variables while the others are held at their mean values. This type of display can illustrate main effects and pairwise interactions. However, it fails to illustrate whether other variables are interacting with the pair displayed. It is possible that the shape of the surface varies greatly with the values of the other inputs. Thus, this type of display could give misleading impressions of the effects of inputs. An additional drawback is that the number of pairwise plots becomes unwieldy when there are more than a few inputs. Baxt (1992) introduced a technique for quantitatively summarizing, in a single number, the effect of a particular input on the output of a neural network. He defines the average effect of a particular input as the average change in the output when that input is perturbed, for the examples in a sample. The average effect for input $i$ is defined as
$$\bar{\Delta}_i \hat{f} = \frac{1}{n} \sum_{p=1}^{n} \operatorname{sign}\left(d_i(x_i^{(p)}) - x_i^{(p)}\right) \cdot \left[\hat{f}\left(x_1^{(p)}, \ldots, x_{i-1}^{(p)}, d_i(x_i^{(p)}), x_{i+1}^{(p)}, \ldots, x_k^{(p)}\right) - \hat{f}(\mathbf{x}^{(p)})\right],$$
where $\hat{f}$ is the function computed by the network, $d_i$ is a function that perturbs values of the $i$th input variable $X_i$, $n$ is the number of training cases, and $\mathbf{x}^{(p)}$ is the vector of $k$ input values $(x_1^{(p)}, \ldots, x_k^{(p)})$ for the $p$th training case. The sign function has a value of 1 or $-1$ depending on whether its argument is positive or negative. The role of the sign function is to switch the sign of the change in $\hat{f}$ when the change in $x_i$ is negative, making $\bar{\Delta}_i \hat{f}$ in effect the average sign-corrected change in $\hat{f}$ that occurs for a change in $x_i$. The input-perturbation function $d_i$ must be chosen appropriately. When $X_i$ is a binary variable, the choice is unproblematic: $d_i$ should swap the value. However, for continuous $X_i$, the best definition of $d_i(x_i^{(p)})$ is not clear. Baxt uses a randomly chosen value of $X_i$. Baxt's technique is well suited to summarizing the effects of binary variables. However, like some other variants of contribution and sensitivity analysis, it is not so well suited to describing the effects of continuous variables, because it does not relate the magnitude of the effect to the magnitude of the variable; it calculates only an average effect. Furthermore, the average effect can be zero when there are strong effects of opposite sign in different parts of the input space.

3 Adapting GAM-Style Plots to Functions with Interactions
The additive-scale GAM-style plots can be adapted to nonadditive functions (i.e., ones with interactions) by plotting the effect on the output of a particular input variable for some selection of points in the input space. The plot for input variable $i$ is no longer a function as with GAMs, but is a set of two-dimensional (2D) points (effects). Each point records the effect of changing $x_i$ from some baseline value $b_i$, in the context of fixed values for the other inputs. The set of points is $\{(x_i, \Delta_i(\mathbf{x})) \mid \mathbf{x} \in X\}$, where $x_i$ is the $i$th element of $\mathbf{x}$, and $X$ is some suitable set of points in the input space. $\Delta_i(\mathbf{x})$ is the effect on $\hat{f}$ of changing $X_i$ from $b_i$ to $x_i$ in the context of $\mathbf{x}$, where $b_i$ is some baseline value for $X_i$. As with GAMs, effects are best plotted on the natural additive scale of the model, and it will be assumed that this is $\hat{f}(\cdot)$; for a neural network, this will usually be the total input of the output unit. $\Delta_i$ is defined as
$$\Delta_i(\mathbf{x}) = \hat{f}(\mathbf{x}) - \hat{f}(x_1, \ldots, x_{i-1}, b_i, x_{i+1}, \ldots, x_k).$$

The plot of these points can be enhanced by drawing each point as a short line segment, where the slope of the segment represents the partial gradient $\partial \hat{f} / \partial x_i$ evaluated at $\mathbf{x}$. As with GAMs, the plots should be made so that the vertical axis on each plot covers the same range, in order to allow easy comparison of the magnitudes of effects of different variables.
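As a concrete sketch of these definitions (the model `f_hat`, the baseline, and the sample points below are hypothetical stand-ins, not the article's fitted network), the effects and the finite-difference partial gradients could be computed as:

```python
import numpy as np

def additive_effects(f, X, i, baseline, eps=1e-5):
    """For each row x of X, compute the effect Delta_i(x) of changing input i
    from the baseline to x[i] (other inputs held fixed), and a central-difference
    estimate of the partial gradient of f with respect to input i at x."""
    effects, grads = [], []
    for x in X:
        x_base = x.copy()
        x_base[i] = baseline                      # x with input i at its baseline
        effects.append(f(x) - f(x_base))          # Delta_i(x)
        x_lo, x_hi = x.copy(), x.copy()
        x_lo[i] -= eps
        x_hi[i] += eps
        grads.append((f(x_hi) - f(x_lo)) / (2 * eps))
    return np.array(effects), np.array(grads)

# Hypothetical fitted model standing in for a network's output function.
f_hat = lambda x: x[0] + 2 * x[1] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
eff, grad = additive_effects(f_hat, X, i=1, baseline=0.0)
# Each (X[p, 1], eff[p]) pair is one point in the effects plot for input 1;
# grad[p] gives the slope of the short line segment drawn at that point.
```

For this toy `f_hat`, the effects of input 1 trace the curve $2x_1^2$ with no vertical spread, since the function is additive.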
Figure 1: Effects of $x_1$ and $x_2$ on the four example functions. The effect of the $i$th input variable at a particular input point ($\Delta_i(\mathbf{x})$) is the change in $f$ resulting from changing $X_i$ to $x_i$ from $b_i$ (the baseline value, shown by a vertical dotted line) while keeping the other inputs constant. The effects are plotted as short line segments, centered at $(x_i, \Delta_i(\mathbf{x}))$, where the slope of the segment is given by the partial derivative. Variables that strongly influence the function value have a large total vertical range of effects. Functions without interactions appear as possibly broken straight lines (linear functions) or curves (nonlinear functions). Interactions show up as vertical spread at a particular horizontal location, that is, a vertical scattering of segments. Interactions are present when the effect of a variable depends on the values of other variables.
Figure 1 shows plots of the effects of variables for four 2D functions from Hwang, Lay, Maechler, Martin, and Schimert (1994). These functions are used here because they illustrate the different effects of nonlinearities and interactions: $g^{(0)}$ is linear, $g^{(4)}$ and $g^{(5)}$ have nonlinearities, and $g^{(1)}$ and $g^{(5)}$ have interactions. The definitions of the functions are as follows.

Linear additive: $g^{(0)}(x_1, x_2) = x_1 + 2x_2$

Simple interaction: $g^{(1)}(x_1, x_2) = 10.391((x_1 - 0.4)(x_2 - 0.6) + 0.36)$

Additive function: $g^{(4)}(x_1, x_2) = 1.3356\left(1.5(1 - x_1) + e^{2x_1 - 1}\sin(3\pi(x_1 - 0.6)^2)\right) + 1.3356\, e^{3(x_2 - 0.5)}\sin(4\pi(x_2 - 0.9)^2)$

Complex interaction: $g^{(5)}(x_1, x_2) = 1.9\left(1.35 + e^{x_1}\sin(13(x_1 - 0.6)^2)\, e^{-x_2}\sin(7x_2)\right)$
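For readers who want to reproduce Figure 1, the four test functions translate directly into code (a straightforward transcription of the definitions; `np` is NumPy):

```python
import numpy as np

def g0(x1, x2):  # linear additive
    return x1 + 2 * x2

def g1(x1, x2):  # simple interaction
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

def g4(x1, x2):  # additive but nonlinear
    return (1.3356 * (1.5 * (1 - x1)
                      + np.exp(2 * x1 - 1) * np.sin(3 * np.pi * (x1 - 0.6) ** 2))
            + 1.3356 * np.exp(3 * (x2 - 0.5)) * np.sin(4 * np.pi * (x2 - 0.9) ** 2))

def g5(x1, x2):  # complex interaction
    return 1.9 * (1.35 + np.exp(x1) * np.sin(13 * (x1 - 0.6) ** 2)
                  * np.exp(-x2) * np.sin(7 * x2))
```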
The effects are plotted at random locations in $x_1 \in [0, 1]$, $x_2 \in [0, 1]$, and the "additive scale" of each function is the function value. The vertical dotted lines show the baseline values. The nonlinearities in these functions show up as curves in the pattern of effects, and the interactions show up as vertical spread at a particular horizontal location. These plots have a number of informative features:

Trend. An overall trend in a plot indicates a trend in the effect of the variable. For example, the plots for $g^{(0)}$ show that the value of the function increases with increasing $x_1$ and $x_2$.

Overall vertical range. The overall vertical range reflects the importance of the variable—that is, the degree to which the variable can affect the output. Plots for variables with no effect will be a horizontal line. The overall vertical range can be affected by the choice of baseline value for the variable, but the degree of this effect will be small except in pathological cases. For the function $g^{(0)}$, the greater range in effects of $x_2$ reflects its greater importance: $g^{(0)}$ is defined as $g^{(0)} = x_1 + 2x_2$.

Local vertical spread. Vertical spread of values at a particular horizontal position indicates interaction between the variable plotted and one or more other variables. Care must be taken not to read too much into the horizontal location of points of maximum or minimum spread in these plots, for example, $x_1 = 0$ for $g^{(1)}$, or $x_2 = 0.45$ for $g^{(5)}$. These locations are entirely determined by the choice of baseline values: spread is necessarily zero at the baseline value.

Nonlinearity. In the absence of spread, nonlinearity is easily detected as curves in the patterns of effects, as with $g^{(4)}$. However, in the presence of spread, nonlinearity can be masked. Even so, the angles of segments can provide clues as to the absence and presence of nonlinearities, as with $g^{(5)}$.
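These diagnostics can be checked numerically. The sketch below (a hypothetical sampling scheme, using the linear-additive and simple-interaction test functions, redefined here so the fragment is self-contained) measures the vertical spread of effects of $x_1$ at a single horizontal location: for an additive function the spread is zero, while an interaction produces nonzero spread.

```python
import numpy as np

def effect_x1(f, x1, x2, b1=0.0):
    # Effect of moving input 1 from the baseline b1 to x1, with x2 held fixed.
    return f(x1, x2) - f(b1, x2)

g_add = lambda x1, x2: x1 + 2 * x2                                # no interaction
g_int = lambda x1, x2: 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)  # interaction

rng = np.random.default_rng(1)
x2 = rng.uniform(0, 1, 500)
x1 = 0.8  # one horizontal location in the effects plot

# Vertical spread at x1: variance of effects over the sampled x2 values.
spread_add = np.var([effect_x1(g_add, x1, v) for v in x2])
spread_int = np.var([effect_x1(g_int, x1, v) for v in x2])
# spread_add is (numerically) zero; spread_int is clearly nonzero.
```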
These plots of effects differ from a projection onto one axis in that the additive effects of the other variables are removed. This results in a single curve for additive functions, as with $g^{(0)}$ and $g^{(4)}$, whereas projections of the same functions would have vertical spread. Once it is clear that a certain variable, say $x_1$, is involved in interactions, it is possible to infer whether the interaction involves another variable, say $x_2$, by coloring the effects in the plot for $x_1$ according to their value of $x_2$ (or by some other means of identification, such as separate panels). If the effects of $x_1$
differ markedly for different values of $x_2$, this indicates an interaction between the two variables. An example with real data is given in the next section.

3.1 Choosing the Points for Visualization. In order not to overload the additive-effects plots with too much information, one must plot the additive effects at only a limited number of points (the set $X$). This is especially true when the input space is high-dimensional. In such cases, a reasonable number of points can populate the input space only very sparsely, and consequently the points must be chosen carefully. The set of examples (training or testing or both) is an obvious choice for the points at which to plot effects. The advantage of this over using random points is that the relative density of points in different parts of the plot is apparent. When plotting at points corresponding to examples, isolated points may indicate either outliers in the input space or regions of the function in which there is very little support for the value the function is producing. However, one disadvantage of plotting only at points corresponding to examples is that one may fail to spot undesirable or anomalous behavior of the fitted model in regions of the input space not populated by any examples. One possibly informative approach would be to make a color-coded plot at training examples, test examples, and random points.

3.2 Scatterplots of Partial Gradients. The partial gradients at particular points, which are shown as slopes of line segments in the additive-effects plots, can also be illustrated in a scatterplot. Such plots contain useful information, though they are not as straightforward to interpret as additive-effects plots. Scatterplots of partial gradients for the four test functions are shown in Figure 2. As with additive-effects plots, the points for additive functions form single curves in the plot. The points for linear functions form horizontal lines. Interactions show up as spread at a horizontal position.
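A sketch of how the data for such a scatterplot might be assembled (the function `f_hat` is a hypothetical stand-in; the "screen gradient" rescaling anticipates the normalization discussed next and is only one reasonable choice):

```python
import numpy as np

def partial_grad(f, x, i, eps=1e-5):
    # Central-difference estimate of the partial derivative of f
    # with respect to input i, evaluated at the point x.
    lo, hi = x.copy(), x.copy()
    lo[i] -= eps
    hi[i] += eps
    return (f(hi) - f(lo)) / (2 * eps)

f_hat = lambda x: x[0] ** 2 + x[0] * x[1]   # hypothetical fitted function

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 2))
grads = np.array([partial_grad(f_hat, x, 0) for x in X])

# Scale to "screen gradients": the diagonal of the corresponding effects plot
# (vertical range of effects over horizontal range of x_0) is given slope 1.
effects = np.array([f_hat(x) - f_hat(np.array([0.0, x[1]])) for x in X])
diag_slope = (effects.max() - effects.min()) / (X[:, 0].max() - X[:, 0].min())
screen_grads = grads / diag_slope
# Scatter of (X[:, 0], screen_grads): vertical spread reveals the x0*x1 interaction.
```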
These gradient scatterplots have one advantage over additive-effects plots: they do not depend on any baseline value. However, the vertical axis of these plots is more difficult to interpret, as it shows the gradient of the effect rather than the effect itself. Furthermore, the vertical scales for different inputs are somewhat arbitrary, as the absolute value of a partial gradient with respect to input $x_i$ depends on the range of values of $x_i$ and thus can be changed by scaling the input. Consequently, all the gradient scatterplots in this article have been scaled in terms of "screen gradients"—a line running from the bottom left to the top right of the corresponding additive-effects plot is considered to have a gradient of 1. This is not entirely satisfactory; outliers can have a large effect on the range of a particular input, and thus omitting or including them can affect the relative values of plotted partial gradients for different inputs.

3.3 Measuring Importance of Variables and Degrees of Interaction.
Figure 2: Scatterplots of partial gradients for the four example functions. The vertical scale is in terms of "screen gradients"; the vertical axis is scaled so that a gradient of one corresponds to the slope from the bottom left corner to the top right corner of the plot of additive effects. As with plots of additive effects, functions without interactions appear as straight lines or curves, and interactions result in vertical scattering.

Although the importance of variables and the extent to which they are involved in interactions (i.e., degree of nonadditivity) can be estimated from plots of additive effects, it is also useful to have objective measures. There are two reasonable measures of the importance of a particular variable:

1. The variance of additive effects for $x_i$

2. The variance of partial gradients for $x_i$

The advantage of the first is that it relates directly to the vertical range in additive-effects plots and is not affected by scaling of the input variables. However, it is affected by the choice of baseline value. The second does not depend on any choice of baseline value, but does suffer from the disadvantage that relative importances for different variables are affected by scaling the input values. One possible measure of the degree to which $x_i$ interacts with other variables is the average squared error in a smooth fit to the partial gradients of $f$ with respect to $x_i$. This corresponds to the error in smooth fits to the points in the various plots in Figure 2. For smooth additive functions, such as $g^{(0)}$ and $g^{(4)}$, the error in the fit will be very small. The error will be large when there is a large vertical spread at a particular value of the input variable. A smooth fit to the gradient points can be calculated with any one-dimensional
smoother; the degrees of interaction reported in this article were based on the smooth fit computed by the BRUTO curve fitter described in Hastie and Tibshirani (1990). This measure of degree of interaction is independent of the baseline value, but is affected by scaling of the input variables. Confidence bounds on these measures of the importance of a feature, or of the degree of interaction, could be calculated using a bootstrap procedure, in the same manner as Baxt and White (1995) construct confidence intervals on the average effects of variables.

4 Visualizing a Neural Network Model for the Risk of Cancer
We have developed a neural network model for the risk of developing squamous cell lung cancer versus developing other nonsmoking-related cancers, based on knowledge of age and smoking and drinking habits (see Plate, Bert, Grace, & Band, 1997). The network has 19 inputs, 2 tanh hidden units, and 1 logistic output unit (see Bishop, 1995, for details of this type of network). The inputs are age and various measures of tobacco and alcohol use. For each of six products—cigarettes (C), cigars (G), pipes (P), beer (B), wine (W), and spirits (S)—there are three consumption measures. For cigarettes these are the number of years of cigarette smoking (CYR), the number of cigarettes smoked per day (CDAY), and the number of years since the subject quit smoking cigarettes (CYQUIT). There are analogous variables for the other products, though alcohol consumption rate is measured per week (e.g., BWK—number of standard beers per week). The network was trained using MacKay's (1992) evidence framework to choose optimal values for weight penalties. The effects of variables in the trained network are shown in Figure 3. The vertical scale in these plots is the total input of the output unit. This value is the natural log of the predicted relative odds of developing squamous cell lung cancer; thus, a movement of one unit on the vertical axis corresponds to a change in relative odds by a factor of $e \approx 2.718$. The baseline for variables in these plots is the minimum value of the variable (20 for age, 0 for all others). Tests of the predictive accuracy of this network indicate that its predictions are slightly better, at a moderate level of statistical significance, than those of a stepwise logistic-regression model of the data (i.e., an additive linear model).
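To fix ideas, a network of the shape described (19 inputs, 2 tanh hidden units, 1 logistic output unit) and the deviance used to compare models can be sketched as follows. The weights here are random placeholders, not the fitted values, and the variable names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid = 19, 2                      # 19 inputs, 2 tanh hidden units

# Placeholder weights; the actual values would come from training under
# MacKay's evidence framework, which also sets the weight penalties.
W1, b1 = rng.normal(size=(n_hid, n_in)), rng.normal(size=n_hid)
w2, b2 = rng.normal(size=n_hid), rng.normal()

def log_odds(x):
    # Total input of the logistic output unit: the natural additive scale
    # of the model (log relative odds of squamous cell lung cancer).
    return w2 @ np.tanh(W1 @ x + b1) + b2

def predict(x):
    return 1.0 / (1.0 + np.exp(-log_odds(x)))  # logistic output unit

def deviance(y, p):
    # Negative log-likelihood of binary targets y given probabilities p
    # (the sense in which "deviance" is used here).
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = rng.uniform(size=(10, n_in))
y = rng.integers(0, 2, size=10).astype(float)
p = np.array([predict(x) for x in X])
d = deviance(y, p)
```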
Figure 3: Effects of variables on a fitted neural network model for squamous cell lung cancer. The input variables are AGE (age in years) and consumption measures of cigarettes (C), cigars (G), pipes (P), beer (B), and wine (W) (YR is total years of consumption, DAY is units consumed per day, WK is units consumed per week, and YQUIT is years elapsed since quitting). The vertical scale in these plots is the total input of the logistic output unit (i.e., the log-odds of developing this type of cancer). The plots show that CYR (years of cigarette smoking) and GDAY (cigars per day), for example, have a strong, positive effect on the risk (i.e., more cigarettes per day or more years of smoking cigars increases the risk of cancer), while CYQUIT and WYR have a negative effect on the risk (i.e., a protective effect). Other variables, such as BYR (years of beer drinking), have very little effect on the risk.

Figure 4: Effects of years of cigarette smoking (CYR) partitioned by number of cigars smoked per day (GDAY). These plots show that CYR interacts with GDAY: the effect of CYR diminishes for greater values of GDAY. If CYR did not interact with GDAY, the plots would show similar patterns of effects of CYR for different values of GDAY.

On a held-out test set of size 1450, the neural network has a deviance (i.e., negative log-likelihood) of 905.5; the stepwise logistic-regression model, which selects the variables AGE (age in years), CDAY, CYR, CYQUIT, and WYR (years of wine drinking), has a deviance of 925.3; and the null model, which predicts the average training-case value, has a deviance of 1210.1. A bootstrap test revealed that the predictions of the neural network were better than those of the logistic-regression model in 96.5% of 50,000 resampled test sets. Despite the only slight superiority of the neural network's predictions, the fitted models are quite different functions. For the fitted logistic-regression model, the plots of the additive effects would all be straight lines. However, Figure 3 shows that the neural network deviates strongly from an additive linear model. The neural network identifies more variables as important (the variables whose effects have a large vertical range) and also involves CYR, GDAY (cigars per day), and PYR (years of pipe smoking) in strong interactions. The neural network does agree with the logistic-regression model in the sign of effects: in both, AGE, CYR, and CDAY are positively associated with risk, and CYQUIT and WYR are negatively associated with risk. We can begin to identify interactions by partitioning examples into sets according to one input variable and looking at whether the effects of another variable are distinct for each partition. The examples in each partition can be shown in separate plots or identified by color. Figure 4 shows separate plots of the effect of CYR for different ranges of GDAY. These plots strongly suggest an interaction between years of cigarette smoking and number of cigars smoked per day: for people who smoke more cigars, the additional risk due to cigarette smoking is less. Indeed, this interaction tested as highly significant when added to a logistic regression model. Color plots are a more economical way to present this information. Figure 5 shows the effects of eight of the more important variables color-coded by ranges of GDAY, CYR, and CYQUIT. The bands of color in the effects of CYR in Figure 5a reproduce Figure 4 and show that CYR interacts with GDAY; for higher values of GDAY, CYR has a lesser effect. In fact, Figure 5a shows that GDAY is involved in interactions with all of the plotted variables. Figure 5b shows that CYR is involved in interactions with the plotted variables, but not as strongly as GDAY; the bands of colors in Figure 5b are not as clear as in Figure 5a. As expected, the plots in Figure 5b show an interaction between GDAY and CYR, and between PYR and CYR, but not between other variables and CYR. In contrast, no interactions for CYQUIT stand out in Figure 5c.
This does not mean that CYQUIT is not involved in interactions—obviously it is—but rather that the interactions are of a complex nature or involve variables not plotted here.
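The partitioning technique is easy to mechanize. In this synthetic sketch (an invented risk function with a built-in interaction, standing in for the fitted model; the variables `u` and `w` are hypothetical analogs of years smoked and cigars per day), the effects of one input are grouped by the value of another, and systematically different effects across groups expose the interaction:

```python
import numpy as np

# Synthetic stand-in: "risk" depends on u with a strength that diminishes
# as w grows -- a built-in interaction between the two inputs.
f = lambda u, w: u / (1.0 + w) + 0.5 * w

def effect_u(u, w, b=0.0):
    return f(u, w) - f(b, w)            # effect of moving u from baseline b

rng = np.random.default_rng(4)
u = rng.uniform(0, 40, 1000)
w = rng.integers(0, 5, 1000).astype(float)

# Partition the effects of u by the value of w (cf. separate panels or colors).
slopes = {}
for level in np.unique(w):
    mask = w == level
    # Average effect per unit of u within this partition.
    slopes[level] = np.mean(effect_u(u[mask], w[mask]) / u[mask])
# The per-partition slopes decrease as w grows, exposing the interaction;
# for an additive function they would be identical across partitions.
```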
Figure 6: Scatterplots of partial gradients, colored by GDAY (cigars per day). As with color-coded plots of additive effects, vertical differentiation of colors indicates an interaction, as in the plot for CYR (years of cigarette smoking), for example. A measure of the degree to which a particular variable interacts with others can be had by fitting a smooth curve to these points and calculating the average squared error of the fit.
Figure 5: (Facing page.) Plots of additive effects enhanced by color coding by the range of a particular variable. Only the variables with larger effects are shown here. The three panels are color coded by the ranges of three different variables: (A) GDAY (cigars per day), (B) CYR (years of cigarette smoking), and (C) CYQUIT (years since quitting cigarettes). One plot in each panel shows the effects of the variable being color coded for. This plot shows the ranges of the color coding; for example, the plot for GDAY in A shows that blue codes for points with GDAY = 0, green codes for points with GDAY = 1, yellow codes for points with GDAY between 2 and 4, and red codes for points with GDAY between 5 and 10. These plots allow one to spot interactions between the color-coded variables and others: interactions appear as vertical differentiation of color—as in CYR in A (implying that CYR interacts with GDAY) and GDAY in B (the same interaction).

Figure 7: Importance of effects versus strength of interaction. The vertical axis is the importance of the effect of a particular variable, as measured by the variance of effects for that variable (i.e., the vertical variance in effects plots). The horizontal axis is the strength of interaction for a particular variable, as measured by the average squared error of a smooth fit to the partial derivatives. This plot shows that in the function computed by this neural network, the strength of interactions for a variable is roughly proportional to the importance of effects for that variable. There are no variables that have a distinctly additive effect (these would appear in the top-left quadrant).

The color-coded scatterplots of partial gradients (see Figure 6) are also informative. Again, the interactions between GDAY and CYR and between GDAY and CDAY stand out clearly. The protective effects of GYQUIT (years since quitting cigars), CYQUIT, and WYR can also be seen. Finally, Figure 7 shows where the variables fall in a 2D space defined by overall importance (vertical axis) and degree of interaction (horizontal axis). Overall importance is measured as the variance of the additive effects, that is, the variance in the vertical direction in the plots in Figures 3 and 5. Degree of interaction is measured as the variance of the (vertical) error in a smooth fit to the partial gradients, that is, a smooth fit to the points in Figure 6. Variables that lie toward the bottom left of this plot have little overall effect and do not have strong degrees of interaction. Variables that lie toward the top right have a large effect and are strongly interactive. Variables that lie toward the top left have a large additive effect. Interestingly, the variables for the neural network lie near the diagonal; none of the variables in the neural network has a large effect that is mainly additive. Bearing in mind that an additive linear function provides a model of these data that is very nearly as good as the neural network, one must be suspicious that the function computed by the trained neural network has gratuitous interactions—ones that are not supported by the data. This raises the question of whether one could construct flexible statistical models in which there is some control over
the degree of interaction allowed in the function. This possibility is pursued in Plate (1999), in which gaussian process models with a user-controllable degree of interaction are applied to a task on which neural networks have an order-of-magnitude lower error than linear models. One final point to note is that the proposed method for calculating additive effects can require the model to predict an output for an "absurd" input. For example, in calculating the additive effect of age for a 50-year-old who has been smoking for 30 years, we ask the network to calculate an output for a 20-year-old who has been smoking for 30 years. While this may seem suspicious, a possible defense is that we are not really interested in the actual output for the absurd case; we only need it to estimate how much 50 years of age contributes in the real case. This type of situation is likely to be more of a problem for nearest-neighbor-style models, in which an "absurd" input may not have any close neighbors.

5 Discussion
Plots of additive effects provide rapid insight into the function computed by a feedforward neural network. The plots show the overall importance and direction of effects of individual inputs and whether inputs are involved in interactions. They also make apparent the way in which a fitted model differs from a linear or generalized linear model. These differences can be of two kinds: nonlinearities in the response to single variables and interactions among variables. It is both interesting and valuable to have some idea of how a particular neural network differs from a linear model, especially as linear and logistic regression models are surprisingly effective in many real-world applications. The method for constructing these plots does not depend on the internal structure of the network. Consequently, it can be applied to understanding any fitted statistical model that can be used in a predictive mode, including committee and bagged models, which are difficult to interpret by any method that requires more knowledge about the fitted model than merely its black-box functionality.

Acknowledgments
This research was partially funded by grants from the Workers Compensation Board of British Columbia, NSERC, and IRIS (Canada). The research was conducted while T. P. was a postdoctoral fellow at the University of British Columbia and the BC Cancer Agency, and then a lecturer at Victoria University, Wellington, New Zealand. S-plus software for producing the plots described in this article is available from the Statlib archive of software for S-plus at http://lib.stat.cmu.edu/S/ in the file ploteff.tar, or directly from T. P. (
[email protected]).
T. Plate, J. Bert, J. Grace, and P. Band
References

Baxt, W. G. (1992). Analysis of the clinical variables driving decision in an artificial neural network trained to identify the presence of myocardial infarction. Annals of Emergency Medicine, 21(12), 1439–1444.
Baxt, W. G., & White, H. (1995). Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction. Neural Computation, 7, 624–638.
Bishop, C. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Blasig, R. (1994). GDS: Gradient descent generation of symbolic classification rules. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 1093–1100). San Mateo, CA: Morgan Kaufmann.
Dodds, S. R. (1995, Dec. 2). Neural networks (letter). Lancet, 346, 1500–1501.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman and Hall.
Hwang, J.-N., Lay, S.-R., Maechler, M., Martin, R. D., & Schimert, J. (1994). Regression modeling in back-propagation and projection pursuit learning. IEEE Transactions on Neural Networks, 5(3), 342–353.
MacKay, D. J. C. (1992). The evidence framework applied to classification networks. Neural Computation, 4(5), 698–714.
Moseholm, L., Taudorf, E., & Frosig, A. (1993). Pulmonary function changes in asthmatics associated with low-level SO2 and NO2, air pollution, weather, and medicine intake. Allergy, 48, 334–344.
Plate, T. A. (1999). Accuracy versus interpretability in flexible modeling: Implementing a tradeoff using gaussian process models. Behaviourmetrika, 26, 29–50.
Plate, T., Bert, J., Grace, J., & Band, P. (1997). A comparison between neural networks and other statistical techniques for modeling the relationship between tobacco and alcohol and cancer. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing, 9. Cambridge, MA: MIT Press.
Saito, K., & Nakano, R. (1988).
Medical diagnostic expert system based on PDP model. In IEEE International Conference on Neural Networks (pp. 255–262). San Diego, CA.
Sanger, D. (1989). Contribution analysis: A technique for assigning responsibilities to hidden units in connectionist networks. Connection Science, 1, 115–138.
Sharp, D. (1995, Oct. 21). From "black box" to bedside, one day? Lancet, 346, 1050.
Shultz, T. R., & Elman, J. L. (1994). Analyzing cross connected networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 1117–1124). San Mateo, CA: Morgan Kaufmann.
Shultz, T., Oshima-Takane, Y., & Takane, Y. (1995). Analysis of unstandardized contributions in cross connected networks. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 601–608). Cambridge, MA: MIT Press.
Visualizing Neural Networks
Towell, G., & Shavlik, J. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1), 71–101.
Wyatt, J. (1995, Nov. 4). Nervous about artificial neural networks? Lancet, 346, 1175–1177.

Received June 25, 1998; accepted April 2, 1999.
LETTER
Communicated by Laurenz Wiskott and Patrice Simard
Discriminant Pattern Recognition Using Transformation-Invariant Neurons

Diego Sona, Alessandro Sperduti, Antonina Starita
Dipartimento di Informatica, Università di Pisa, 56125 Pisa, Italy

To overcome the problem of invariant pattern recognition, Simard, LeCun, and Denker (1993) proposed a successful nearest-neighbor approach based on tangent distance, attaining state-of-the-art accuracy. Since this approach needs great computational and memory effort, Hastie, Simard, and Säckinger (1995) proposed an algorithm (HSS) based on singular value decomposition (SVD) for the generation of nondiscriminant tangent models. In this article we propose a different approach, based on a gradient-descent constructive algorithm, called TD-Neuron, that develops discriminant models. We present as well comparative results of our constructive algorithm versus HSS and learning vector quantization (LVQ) algorithms. Specifically, we tested the HSS algorithm using both the original version based on the two-sided tangent distance and a new version based on the one-sided tangent distance. Empirical results over the NIST-3 database show that the TD-Neuron is superior to both SVD- and LVQ-based algorithms, since it reaches a better trade-off between error and rejection.

1 Introduction
In several pattern recognition systems, the principal and most desired feature is robustness against transformations of patterns. Simard, LeCun, and Denker (1993) partially solved this problem by proposing the tangent distance as a classification function invariant to small transformations. They used the concept in a nearest-neighbor algorithm, achieving state-of-the-art accuracy on isolated handwritten character recognition. However, this approach has a quite high computational complexity due to the large number of Euclidean and tangent distances that need to be calculated. Different researchers have shown how such complexity can be reduced at the cost of increased space complexity. Simard (1994) proposed a filtering method based on multiresolution and a hierarchy of distances, while Sperduti and Stork (1995) devised a graph-based method for rapid and accurate search through prototypes.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1355–1370 (2000)

Different approaches to the problem, aiming at the reduction of classification time and space requirements, while trying to preserve the same
accuracy, have been studied. Specifically, Hastie, Simard, and Säckinger (1995) developed rich models for representing large subsets of the prototypes through a singular value decomposition (SVD)–based algorithm, while Schwenk and Milgram (1995b) proposed a modular classification system (Diabolo) based on several autoassociative multilayer perceptrons, which use tangent distance as the error reconstruction measure. A different but related approach has been pursued by Hinton, Dayan, and Revow (1997), who propose two different methods for modeling the manifolds of data. Both methods are based on locally linear low-dimensional approximations to the underlying data manifolds. All the above models are nondiscriminant.¹ Although nondiscriminant models have some advantages over discriminant models (Hinton et al., 1997), the amount of computation during recognition is usually higher for nondiscriminant models, especially if a good trade-off between error and rejection is required. On the other hand, discriminant models take more time to be trained. In several applications, however, it is more convenient to spend extra time for training, which is usually performed only once or a few times, so as to have a faster recognition process, which is repeated millions of times. In these cases, discriminant models should be preferred. In this article, we discuss a constructive algorithm for the generation of discriminant tangent models.² The proposed algorithm, which is an improved version of the algorithm previously presented in Sona, Sperduti, and Starita (1997), is based on the definition of the TD-Neuron (TD stands for "tangent distance"), where the net input is computed by using the one-sided tangent distance instead of the standard dot product. Using this definition, we have devised a constructive algorithm, which we compare here with HSS and learning vector quantization (LVQ) algorithms.
In particular, we report results obtained for the HSS algorithm using both the original version based on the two-sided tangent distance and a new version based on the one-sided tangent distance. For comparison, we also present the results of the LVQ2.1 algorithm, which turned out to be the best among the LVQ algorithms. The one-sided version of the HSS algorithm was derived in order to have a fair comparison against the TD-Neuron, which exploits the one-sided tangent distance. Empirical results over the NIST-3 database of handwritten digits show that the TD-Neuron is superior to both the HSS and LVQ algorithms, since it reaches a better trade-off between error and rejection. More surprisingly, our results show that the one-sided version of the HSS algorithm is superior to the two-sided version, which performs poorly when introducing a rejection class. An additional advantage of the proposed algorithm is its constructive approach.
¹ Schwenk and Milgram (1995a) proposed a discriminant version of Diabolo as well.
² In the sense that the model for each class is generated taking into account negative examples also, that is, examples of patterns belonging to the other classes.
In sections 2 and 3, we give an overview of tangent distance and tangent distance models, respectively, and define a novel version of the HSS algorithm based on the one-sided tangent distance. A new formulation for discriminant tangent distance models is proposed in section 4, while the proposed TD-Neuron model is presented in section 5, which also includes details on the training algorithm. Comparative empirical results on a handwritten digit recognition task between our algorithm, the HSS algorithms, LVQ, and nearest-neighbor algorithms with Euclidean distance are presented in section 6. Finally, a discussion of the results and conclusions are reported in section 7.

2 Tangent Distance Overview
Consider a pattern recognition problem where invariance for a set of $n$ different transformations is required. Given an image $X_i$, the function $X_i(\theta)$ is a manifold of at most $n$ dimensions, representing the set of patterns that can be obtained by transforming the original image through the chosen transformations, where $\theta$ is the amount of transformation and $X_i = X_i(0)$. The ideal would be to use the transformation-invariant distance,

$$D_I(X_i, X_j) = \min_{\alpha, \theta} \| X_i(\alpha) - X_j(\theta) \|.$$
However, the formalization of the manifold equation and, in particular, the computation of the distance between the two manifolds, is very hard. For this reason, Simard et al. (1993) proposed an approach based on the local linear approximation of the manifold by

$$\tilde{X}_i(\theta) = X_i + \sum_{j=1}^{n} T_{X_i}^{j} \theta_j,$$

where the $T_{X_i}^{j}$ are $n$ different tangent vectors at the point $X_i(0)$, which can easily be computed by finite difference. The distance between the two manifolds is then approximated by the so-called tangent distance (Simard et al., 1993):

$$D_T(X_i, X_j) = \min_{\alpha, \theta} \| \tilde{X}_i(\alpha) - \tilde{X}_j(\theta) \|. \tag{2.1}$$
Of course, the approximation is accurate only for local transformations; however, in character recognition problems, global invariance may not be desired, since it can cause confusion between patterns such as n and u. The tangent distance defined by equation 2.1 is called the two-sided tangent distance, since it is computed between two subspaces. There exists also a less computationally expensive version, the one-sided tangent distance (Schwenk
& Milgram, 1995a), where the distance is computed between a subspace and a pattern in the following way:

$$D_T^{\text{1-sided}}(X_i, X_j) = \min_{\alpha} \| \tilde{X}_i(\alpha) - X_j \|. \tag{2.2}$$
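Equation 2.2 has a closed-form solution: once the tangent vectors are orthonormalized, the minimizing $\alpha$ is just the projection of $X_j - X_i$ onto the tangent subspace. A minimal NumPy sketch (the function name and the row-wise tangent layout are our own conventions, not from the paper):

```python
import numpy as np

def one_sided_tangent_distance(x_i, x_j, tangents):
    """One-sided tangent distance (equation 2.2): the distance from the
    pattern x_j to the linearized manifold through x_i, minimized over
    the transformation amount alpha in closed form.

    x_i      : flattened prototype pattern, shape (d,)
    x_j      : flattened input pattern, shape (d,)
    tangents : tangent vectors at x_i, shape (n, d), not necessarily orthonormal
    """
    # Orthonormalizing the tangent vectors turns the minimization over
    # alpha into subtracting the projection onto the tangent subspace.
    q, _ = np.linalg.qr(tangents.T)        # columns of q: orthonormal basis
    diff = x_j - x_i
    residual = diff - q @ (q.T @ diff)     # component outside the subspace
    return float(np.linalg.norm(residual))
```

With an empty tangent set this reduces to the plain Euclidean distance, and it can never exceed it, since projecting out the tangent component only shrinks the residual.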
3 Tangent Distance Models
The main drawback of tangent distance is its high computational requirements compared with Euclidean distance. For this reason, several authors have tried to devise compact models, based on tangent distance, that can summarize the relevant information conveyed by a set of patterns. To address this problem, Hastie et al. (1995) proposed an algorithm for the generation of rich models representing large subsets of patterns. Given a set of patterns $\{X_1, \ldots, X_{N_C}\}$ of class $C$, they proposed the tangent subspace model,
$$M(\theta) = W + \sum_{i=1}^{n} T_i \theta_i,$$
where $W$ is the centroid and the set $\{T_i\}$ constitutes the associated invariant subspace of dimension $n$. According to this definition, for each class $C$, the model $M_C$ can be computed as

$$M_C = \arg\min_{M} \sum_{p=1}^{N_C} \min_{\theta_p, \alpha_p} \| M(\theta_p) - X_p(\alpha_p) \|^2, \tag{3.1}$$
minimizing the error function over $W$ and the $T_i$. The above definition constitutes a difficult optimization problem; however, it can be solved for a fixed value of $n$ (i.e., the subspace dimension) by an iterative algorithm based on singular value decomposition, as proposed by Hastie et al. (1995). If the problem is formulated using the one-sided tangent distance, then equation 3.1 becomes

$$M_C = \arg\min_{M} \sum_{p=1}^{N_C} \min_{\theta_p} \| M(\theta_p) - X_p \|^2, \tag{3.2}$$
which can easily be solved by principal component analysis theory, also called the Karhunen-Loève expansion. In fact, equation 3.2 can be minimized by choosing $W$ as the average over all available samples $X_p$ and the $T_i$ as the most representative eigenvectors (principal components) of the covariance matrix $S$, where

$$S = \frac{1}{N_C} \sum_{p=1}^{N_C} (X_p - W)(X_p - W)^T.$$
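The PCA solution of equation 3.2 can be sketched in a few lines; the function name and array layout below are illustrative, not from the paper:

```python
import numpy as np

def hss_one_sided_model(patterns, n_tangents):
    """Solve equation 3.2 for one class: the centroid W is the mean of the
    class patterns, and the tangent vectors T_i are the leading
    eigenvectors (principal components) of the covariance matrix S.

    patterns : (N_C, d) array of flattened patterns of class C
    returns  : (W, T) with T of shape (n_tangents, d), rows orthonormal
    """
    w = patterns.mean(axis=0)                    # centroid W
    centered = patterns - w
    s = (centered.T @ centered) / len(patterns)  # covariance matrix S
    evals, evecs = np.linalg.eigh(s)             # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:n_tangents]   # indices of the largest ones
    return w, evecs[:, top].T
```

Classification then assigns a test pattern to the class whose model yields the smallest one-sided tangent distance.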
We will refer to the two versions of the algorithm as HSS, and when necessary we will specify which one is used (one-sided or two-sided). By construction, the HSS algorithms return nondiscriminant models. In fact, they use only the evidence provided by positive examples of the target class. Moreover, the two-sided HSS algorithm can be used only if a priori knowledge of the invariant transformations is available. If this knowledge is not available, the introduction of invariance with respect to an arbitrary transformation can be risky, since it can remove information relevant for the classification task. In this situation, it is preferable to use the one-sided version, which does not commit to any specific transformation.

4 A General Formulation
There are good reasons for using discriminant models. Although Schwenk and Milgram (1995a) suggested how to modify the learning rule of Diabolo to obtain discriminant models, they never proposed a formalization of discriminant models using tangent distance. In this section, we present a general formulation that allows the user to develop discriminant or nondiscriminant tangent models. To be able to devise discriminant models, equation 3.1 must be modified in such a way as to take into account that all available data must be used during the generation process. The basic idea is to define a model for class $C$ that minimizes the tangent distances from patterns belonging to $C$ and maximizes the tangent distances from patterns not in $C$ (i.e., in $\bar{C}$). Mathematically this can be expressed, for each class $C$, by

$$M_C = \arg\min_{M} \left[ \sum_{p=1}^{N_C} D_T(M, X_p^C) - \lambda \sum_{p=1}^{N_{\bar{C}}} D_T(M, X_p^{\bar{C}}) \right], \tag{4.1}$$
where $M$ is the generic model $\{W, T_1, \ldots, T_n\}$, $N_C$ is the number of patterns $X_p^C$ belonging to the class $C$, and $N_{\bar{C}}$ is the number of patterns $X_p^{\bar{C}}$ not belonging to $C$. Note that the second sum is multiplied by a constant $\lambda$, which identifies how discriminant the model should be. If $\lambda = 0$, equation 4.1 becomes equal to equation 3.1 (or 3.2 when considering the one-sided tangent distance). On the other hand, if $\lambda$ is large, the resulting model may not be a good descriptive model for class $C$. In any case, no bounded solution to equation 4.1 may exist if the term associated with $\lambda$ is not bounded.
The TD-Neuron is so called because it can be considered as a neural computational unit that computes (as net input) the square of the one-sided tangent distance of the input vector $X_k$ from a prototype model defined by a set of
Figure 1: Geometric interpretation of equation 5.2. Note that $W$ and the $T_i$ span the invariance manifold, $d$ is the Euclidean distance between the pattern $X$ and the centroid $W$, and $\text{net} = (D_T^{\text{1-sided}})^2$ is the squared one-sided tangent distance.
internal parameters (weights). Specifically, it is characterized by a set of $n + 1$ vectors of the same dimension as the input vectors. One vector ($W$) is used as the reference vector (centroid), while the remaining vectors $\{T_1, \ldots, T_n\}$ are used as tangent vectors. Moreover, the set of tangent vectors constitutes an orthonormal basis. This set of parameters is organized in such a way as to form a tangent model. Formally, the net input of a TD-Neuron for a pattern $k$ is

$$\text{net}_k = \min_{\theta} \| M(\theta) - X_k \|^2 + b, \tag{5.1}$$
where $b$ is the offset. A good model should return a small net input for patterns belonging to the learned class. Since the tangent vectors constitute an orthonormal basis, equation 5.1 can be computed exactly and easily by using the projections of the input vector onto the model subspace (see Figure 1):

$$\text{net}_k = \| \underbrace{X_k - W}_{d_k} \|^2 - \sum_{i=1}^{n} [(X_k - W)^t T_i]^2 + b = d_k^t d_k - \sum_{i=1}^{n} [\underbrace{d_k^t T_i}_{\gamma_k^i}]^2 + b, \tag{5.2}$$
where, for the sake of notation, $d_k$ denotes the difference between the input pattern $X_k$ and the centroid $W$, and the projection of $d_k$ onto the $i$th tangent vector is denoted by $\gamma_k^i$. Note that the right side of equation 5.2 mainly involves dot products, just as in a standard neuron. The output of the TD-Neuron is then computed by transforming the net input through a nonlinear monotone function $f$. In our experiments, we have used the symmetric sigmoidal function

$$o_k = \frac{2}{1 + e^{\text{net}_k}} - 1. \tag{5.3}$$
We have used a monotonically decreasing function so that the output corresponding to patterns belonging to the target class will be close to 1.

5.1 Training the TD-Neuron. A discriminant model based on the TD-Neuron can be obtained by adapting equation 4.1. Given a training set $\{(X_1, t_1), \ldots, (X_N, t_N)\}$, where
$$t_i = \begin{cases} 1 & \text{if } X_i \in C \\ -1 & \text{if } X_i \in \bar{C} \end{cases}$$

is the $i$th desired output for the TD-Neuron, and $N = N_C + N_{\bar{C}}$ is the total number of patterns in the training set, an error function can be defined as
$$E = \frac{1}{2} \sum_{k=1}^{N} (t_k - o_k)^2, \tag{5.4}$$
where $o_k$ is the output of the TD-Neuron for the $k$th input pattern. Equation 5.4 can be written in a form similar to equation 4.1 by making the target values $t_i$ explicit, by splitting the sum so as to group together patterns belonging to $C$ and patterns belonging to $\bar{C}$, and by weighting the negative examples with a constant $\lambda \ge 0$:

$$E = \frac{1}{2} \left[ \sum_{k=1}^{N_C} (1 - o_k)^2 + \lambda \sum_{j=1}^{N_{\bar{C}}} (1 + o_j)^2 \right]. \tag{5.5}$$

In our experiments we have chosen $\lambda = N_C / N_{\bar{C}}$, thus balancing the strength of the patterns belonging to $C$ and those belonging to $\bar{C}$.
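Equations 5.2, 5.3, and 5.5 together define the forward pass and the training objective. A small sketch, assuming the tangent rows are already orthonormal (function names are illustrative):

```python
import numpy as np

def td_neuron_output(x, w, tangents, b):
    """Net input by equation 5.2 (squared one-sided tangent distance plus
    offset b) and output by the decreasing sigmoid of equation 5.3.
    `tangents` is an (n, d) array whose rows are orthonormal."""
    d = x - w
    gamma = tangents @ d                    # projections gamma_k^i
    net = d @ d - gamma @ gamma + b
    return 2.0 / (1.0 + np.exp(net)) - 1.0

def discriminant_error(outputs, targets, lam):
    """Error of equation 5.5: targets are +1 for class C and -1 otherwise;
    lam weights the negative examples (lam = N_C / N_Cbar in the paper)."""
    pos = targets == 1
    return 0.5 * (np.sum((1.0 - outputs[pos]) ** 2)
                  + lam * np.sum((1.0 + outputs[~pos]) ** 2))
```

At the centroid itself (with $b = 0$) the net input is zero and the output is exactly 0; perfect outputs ($+1$ on the positive class, $-1$ elsewhere) give zero error.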
Using equations 5.2 and 5.3, it is trivial to compute the changes for the centroid, the tangent vectors, and the offset by using a gradient-descent approach over equation 5.5:

$$\Delta W = -\eta \frac{\partial E}{\partial W} = -2\eta \sum_{k=1}^{N} \left[ (t_k - o_k) f_k' \left( d_k - \sum_{i=1}^{n} \gamma_k^i T_i \right) \right] \tag{5.6}$$

$$\Delta T_i = -\eta \frac{\partial E}{\partial T_i} = -2\eta \sum_{k=1}^{N} (t_k - o_k) f_k' \gamma_k^i d_k \tag{5.7}$$

$$\Delta b = -\eta_b \frac{\partial E}{\partial b} = \eta_b \sum_{k=1}^{N} (t_k - o_k) f_k' \tag{5.8}$$
where $\eta$ and $\eta_b$ are learning parameters, and $f_k' = \partial o_k / \partial \text{net}_k$. Before training the TD-Neuron by gradient descent, however, the tangent subspace dimension must be decided. To solve this problem, we developed a constructive algorithm that adds tangent vectors one by one, according to computational needs. This idea is also justified by the observation that using equations 5.6 through 5.8 leads to the sequential convergence of the tangent vectors according to their relative importance. This means that, to a first approximation, all the tangent vectors remain random vectors while the centroid converges first. Then one of the tangent vectors converges to the most relevant transformation (while the remaining tangent vectors are still immature), and so on until all the tangent vectors converge, one by one, to less and less relevant transformations. This behavior suggests starting the training using only the centroid (i.e., without tangent vectors) and then adding tangent vectors as needed. Under this learning scheme, since there are no tangents when the centroid is computed, equation 5.6 becomes
$$\Delta W = -\eta \frac{\partial E}{\partial W} = -2\eta \sum_{k=1}^{N} (t_k - o_k) f_k' d_k. \tag{5.9}$$
The constructive algorithm has two phases (see Table 1). First is the centroid computation, based on the iterative use of equations 5.8 and 5.9. Then the centroid is frozen, and one by one all tangent vectors $T_i$ are trained using equations 5.7 and 5.8. At each iteration in the learning phase, the tangent vector $T_i$ must be orthonormalized with respect to the already computed tangent vectors.³ If after a fixed number of iterations (we have used 300 iterations) the total error variation is less than a fixed threshold (0.01%) and no change in the classification performance over the training set occurs, a new tangent vector is added. Tangent vectors are iteratively added until changes in the classification accuracy become irrelevant.

³ The computational complexity of the orthonormalization is linear in the number of already computed tangent vectors.
Table 1: Constructive Algorithm for the TD-Neuron.

Initialize the centroid W.
Update b and W by equations 5.8 and 5.9 until they converge. Freeze W.
REPEAT
    Initialize a new tangent vector T_i.
    Update T_i and b with equations 5.7 and 5.8, and orthonormalize T_i with respect to {T_1, ..., T_{i-1}}, until it converges. Freeze T_i.
UNTIL new T_i gives little accuracy change.
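The two phases of Table 1 can be sketched as follows. This is a simplified reading of the algorithm: fixed epoch counts replace the paper's convergence and accuracy tests, and the learning rates are illustrative, not the authors' settings.

```python
import numpy as np

def train_td_neuron(X, t, n_tangents, eta=1e-3, eta_b=1e-3, epochs=50):
    """Constructive training of a TD-Neuron (Table 1, simplified).
    Phase 1 fits the centroid W and offset b (equations 5.8 and 5.9);
    phase 2 adds tangent vectors one at a time (equations 5.7 and 5.8),
    keeping each new vector orthonormal to the frozen ones."""
    rng = np.random.default_rng(0)
    dim = X.shape[1]
    w = X[t == 1].mean(axis=0)          # init centroid from positive class
    b, T = 0.0, np.zeros((0, dim))

    def fprime(o):                      # derivative of the sigmoid, eq. 5.3
        return -0.5 * (1.0 + o) * (1.0 - o)

    # Phase 1: centroid and offset only (equations 5.8 and 5.9).
    for _ in range(epochs):
        dw, db = np.zeros(dim), 0.0
        for x, tk in zip(X, t):
            d = x - w
            o = 2.0 / (1.0 + np.exp(d @ d + b)) - 1.0
            e = (tk - o) * fprime(o)
            dw += e * d
            db += e
        w -= 2.0 * eta * dw             # Delta W, equation 5.9
        b += eta_b * db                 # Delta b, equation 5.8
    # Phase 2: grow the tangent set one vector at a time.
    for _ in range(n_tangents):
        ti = rng.normal(size=dim)
        ti -= T.T @ (T @ ti)            # orthogonalize vs frozen tangents
        ti /= np.linalg.norm(ti)
        for _ in range(epochs):
            dt, db = np.zeros(dim), 0.0
            for x, tk in zip(X, t):
                d = x - w
                g = T @ d
                gi = ti @ d
                net = d @ d - g @ g - gi * gi + b
                o = 2.0 / (1.0 + np.exp(net)) - 1.0
                e = (tk - o) * fprime(o)
                dt += e * gi * d
                db += e
            ti -= 2.0 * eta * dt        # Delta T_i, equation 5.7
            ti -= T.T @ (T @ ti)        # re-orthonormalize each iteration
            ti /= np.linalg.norm(ti)
            b += eta_b * db
        T = np.vstack([T, ti])          # freeze T_i
    return w, T, b
```

The per-iteration re-orthonormalization mirrors the requirement stated in the text, and its cost is linear in the number of frozen tangent vectors.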
The initialization of the internal vectors of the TD-Neuron can be done in many different ways. On the basis of empirical evidence, we have concluded that the learning phase of the centroid can be considerably reduced by initializing the centroid with the mean value of the patterns belonging to the positive class. We have also devised a "good" initialization algorithm for tangent vectors (see Table 2) that tries to minimize the drop in the net input for all the patterns due to the increase in the tangent subspace dimension (see Figure 2). This is obtained by introducing a new tangent vector that mainly spans the residual subspace between the patterns in the positive class and the current model. In this way, patterns in the negative class will be only mildly affected by the newly introduced tangent vector. We have also devised a "better" initialization algorithm based on the principal components of the difference vectors for all classes. However, the observed training speed-up is not justified by the additional computational overhead due to the SVD computation needed at each tangent vector insertion. In our experiments, we have also used a simple form of regularization over the parameters (weight decay with penalty equal to 0.997, i.e., all parameters are multiplied by the penalty before adding the gradient), obtaining a better convergence.

Table 2: Initialization Procedure for the Tangent Vectors.

For each class $c \in (C \cup \bar{C})$, compute the mean value of the differences between patterns and model: $d_c = \frac{1}{N_c} \sum_{p=1}^{N_c} (X_p - M(\theta))$.
Orthonormalize the vector $d_c$ of the class $C$ with respect to the mean difference vectors of all the other classes (those in $\bar{C}$), and return it as the new initial tangent vector.

Figure 2: Total error variation during the learning phase for pattern 0. At each new tangent insertion, there is a reduction of the distance between patterns and the model (also for patterns belonging to class $\bar{C}$). This affects the output of the neuron for all patterns, increasing the total output squared error.

6 Results
In order to obtain comparable results, we had to use the same number of parameters for all algorithms. In particular, for tangent-based algorithms, the number of vectors is given by the number of tangent vectors incremented by 1 (the centroid). For this reason, the LVQ algorithms are compared to tangent distance-based algorithms using a number of reference vectors equal to the number of tangent vectors plus 1. Furthermore, with the two-sided HSS algorithm, we also have to consider the tangent vectors corresponding to the input patterns. Specifically, we used six transformations (tangent vectors) for each input pattern: clockwise and counterclockwise rotations and translations in the four cardinal directions.⁴ We tested our constructive algorithm versus the two versions of the HSS algorithm and the LVQ algorithms in the LVQ PAK package (Kohonen, Hynninen, Kangas, Laaksonen, & Torkkola, 1996) (optimized-learning-rate LVQ1, original LVQ1, LVQ2.1, LVQ3), using 10,704 binary digits taken from the NIST-3 data set. The binary 128 × 128 digits were transformed into 64-gray-level 16 × 16 images by a simple local counting procedure.⁵ The only preprocessing transformation performed was the elimination of empty borders. The training set consisted of 5000 randomly chosen digits; the remaining digits were used in the test set. For each tangent distance algorithm, a single tangent model for each class of digit was computed. With the LVQ algorithms, a set of reference vectors was used for each class; the number of reference vectors was chosen so as to have as many parameters as in the tangent distance algorithms. In particular, we tested all algorithms based on tangent distance using a different number of vectors for each experiment, starting from 1 vector per class (centroid without tangent vectors) up to 16 vectors per class (centroid plus 15 tangent vectors). The number of reference vectors for the LVQ algorithms was chosen accordingly. Concerning the LVQ algorithms, here we report just the results obtained by using LVQ2.1 with 1-NN based on Euclidean distance as the classification rule, since this algorithm reached the best performance over an extended set of experiments involving LVQ algorithms with different settings for the learning parameters. The classification of the test digits was performed using the label of the closest model for HSS, the 1-NN rule for the LVQ algorithms, and the highest output for the TD-Neuron algorithm. For the sake of comparison, we also performed a classification using the nearest-neighbor rule (1-NN) with the Euclidean distance as the classification metric. In this case, we classified each pattern by looking at the label of the nearest vector in the learning set. In Figure 3 we report the results obtained on the test set for different numbers of tangent vectors for all models.

⁴ We preferred to approximate the exact tangents by finite differences; in this way, the local shape of the manifold is interpolated in a better way.
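The local counting preprocessing (see note 5) can be sketched as follows; the 8 × 8 window size is inferred from the 128 → 16 reduction rather than stated explicitly in the text:

```python
import numpy as np

def binary_to_gray(img):
    """Reduce a 128x128 binary digit to a 16x16 gray-level image by local
    counting: each (inferred) 8x8 window contributes the number of its
    1-pixels, giving gray values in the range 0..64."""
    assert img.shape == (128, 128)
    # Reshape so axes 1 and 3 index positions inside each 8x8 window,
    # then count the 1-pixels in every window.
    return img.reshape(16, 8, 16, 8).sum(axis=(1, 3))
```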
The best classification result (96.84%) was given by 1-NN with Euclidean distance, followed by the two-sided HSS with 9 tangent vectors (96.6%).⁶ With the same number of parameters (15 tangent vectors), the TD-Neuron and the one-sided HSS gave performance rates of 96.51% and 96.42%, respectively. Finally, LVQ2.1 obtained a performance rate of 96.48% using 15 vectors per class. From these results, it can be noted that the two-sided HSS algorithm does overfit the data after the ninth tangent vector; this is not true for the remaining algorithms. Nevertheless, all the models reach a similar performance with the same number of parameters, which is slightly below the performance attained by the 1-NN classifier using Euclidean distance. However, both the tangent models and the LVQ algorithms have the advantage of being less demanding in both space and response time than 1-NN.

⁵ The original image is partitioned into 16 × 16 windows, and the number of pixels with value equal to 1 is used as the gray value for the corresponding pixel in the new image.
⁶ There are also six tangent vectors corresponding to the input pattern.
Figure 3: Results obtained on the test set by models generated by both versions of HSS, LVQ2.1, TD-Neuron, and 1-NN with Euclidean distance.
It is interesting that the recognition performance of the TD-Neuron system is basically monotone in the number of parameters (tangent vectors), while all the other methods present overfitting and high variance in the results. Although from Figure 3 it seems that the tangent models and LVQ2.1 are equivalent, when a rejection criterion is introduced, the model generated by the TD-Neuron outperforms the other algorithms (see Figure 4). We used the same rejection criterion for all algorithms: a pattern is rejected when the difference between the first- and second-best outputs belonging to different classes is less than a fixed threshold.⁷ Furthermore, introducing the rejection criterion also leads to the surprising result that the one-sided version of HSS performs better than the two-sided version.

⁷ Obviously the threshold is different for each algorithm.

Figure 4: Error-rejection curves for all algorithms: one-sided HSS with 15 tangent vectors, two-sided HSS with 9 and 15 tangent vectors, TD-Neuron with 9 and 15 tangent vectors, LVQ2.1 with 15 reference vectors, and 1-NN with Euclidean distance. The boxed diagram shows a detail of the curves, demonstrating that the TD-Neuron with 15 tangent vectors has the best trade-off between error and rejection.

In order to assess whether the better performance exhibited by the TD-Neuron was due to the specific rejection criterion or to the discriminant capability of the model, we performed some experiments using a different rejection criterion: reject when the best output value is smaller (TD-Neuron) or greater (HSS, LVQ, and 1-NN) than a threshold. In Figure 5 we report the curves obtained for the best models, the TD-Neuron and the one-sided HSS, together with the corresponding curves of Figure 4 for comparison. These results demonstrate that the improvement shown by the TD-Neuron is not tied to a specific rejection criterion. Moreover, after removing the sigmoidal function from the TD-Neuron during the test phase, the rejection curves do not change significantly for either rejection criterion. Thus, we can conclude that the improvement of the TD-Neuron is mainly due to the discriminant training procedure. In Figure 6, four examples of TD-Neuron models are reported. In the left-most column, the centroids of patterns 0, 1, 2, and 3 are shown. The remaining columns contain the first (and most important) four tangent vectors for each model.

7 Discussion and Conclusion
We introduced the TD-Neuron, which implements the one-sided version of the tangent distance, and gave a constructive learning algorithm for building a tangent subspace with discriminant capabilities. There are many advantages to using the proposed computational model versus the HSS model and the LVQ algorithms. Specifically, we believe that the proposed approach is particularly useful in applications where it is very important to have a classification system that is both discriminant and fast in recognition.
Figure 5: Detail of the error-rejection curves of the one-sided HSS and the TD-Neuron for two different rejection criteria: threshold on the absolute output value (abs) and threshold on the difference between the best and the second-best output models (diff). The curves for the TD-Neuron model with the nonlinear output removed are identified by the additional label (linear).
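The difference-based rejection rule used in the experiments above (reject when the first- and second-best class outputs are too close) can be sketched as follows; the function name and the higher-is-better score convention are illustrative:

```python
import numpy as np

def classify_with_rejection(outputs, threshold):
    """Return the index of the best class, or -1 to reject when the gap
    between the best and second-best per-class scores falls below the
    threshold. Scores follow a higher-is-better convention (TD-Neuron
    outputs; distance-based methods would negate their distances)."""
    order = np.argsort(outputs)[::-1]          # best class first
    gap = outputs[order[0]] - outputs[order[1]]
    return -1 if gap < threshold else int(order[0])
```

Sweeping the threshold from 0 upward traces out an error-rejection curve of the kind shown in Figures 4 and 5.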
We also compared the TD-Neuron constructive algorithm to two different versions of the HSS algorithm, the LVQ2.1 algorithm, and the 1-NN classification criterion. The results obtained over the NIST-3 database of handwritten digits show that the TD-Neuron is superior to both the HSS algorithms based on singular value decomposition and the LVQ algorithms, since it reaches a better trade-off between error and rejection. Moreover, we have verified that the better trade-off is mainly due to the discriminant capabilities of our model. Concerning the proposed neural formulation, we believe that the nonlinear sigmoidal transformation is useful, since removing it and using the inverted target ($f^{-1}(t_k)$) for training would drastically reduce the size of the space of solutions in the weight space. In fact, very high or very low values of the net input would not minimize the error function with the inverted target, while being fully acceptable in the proposed model. Moreover, the nonlinearity increases the stability of learning because of the saturated output. During model generation, for a fixed number of tangent vectors, the HSS algorithm is faster than ours because it needs only a fraction of the training examples (only one class). However, our algorithm is remarkably more efficient than the HSS algorithms when a family of tangent models, with an increasing number of tangent vectors, must be generated. Also the LVQ
Discriminant Pattern Recognition
Figure 6: Tangent models obtained by the TD-Neuron for digits 0, 1, 2, and 3. The centroids are shown in the left-most column; the remaining columns show the first four tangent vectors.
algorithms are faster than the TD-Neuron, but they have poor rejection performance as a drawback. An additional advantage of the TD-Neuron model is that, because the training algorithm is based on a gradient-descent technique, several TD-Neurons can be arranged to form a hidden layer in a feedforward network with standard output neurons, which can be trained by a trivial extension of backpropagation. This may lead to a remarkable increase in the transformation-invariant features of the system. Furthermore, it should be possible to easily extract information from the network regarding the most important features used during classification (see Figure 6).

References

Hastie, T., Simard, P. Y., & Säckinger, E. (1995). Learning prototype models for tangent distance. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 999–1006). Cambridge, MA: MIT Press.

Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifold of images of handwritten digits. IEEE Transactions on Neural Networks, 8, 65–74.
Kohonen, T., Hynninen, J., Kangas, J., Laaksonen, J., & Torkkola, K. (1996). LVQ PAK: The learning vector quantization program package (Tech. Rep. No. A30). Espoo, Finland: Helsinki University of Technology, Laboratory of Computer and Information Science. Available online at: http://www.cis.hut.fi/nnrc/nnrcc-programs.html.

Schwenk, H., & Milgram, M. (1995a). Learning discriminant tangent models for handwritten character recognition. In International Conference on Artificial Neural Networks (pp. 985–988). Berlin: Springer-Verlag.

Schwenk, H., & Milgram, M. (1995b). Transformation invariant autoassociation with application to handwritten character recognition. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 991–998). Cambridge, MA: MIT Press.

Simard, P. Y. (1994). Efficient computation of complex distance metrics using hierarchical filtering. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 168–175). San Mateo, CA: Morgan Kaufmann.

Simard, P. Y., LeCun, Y., & Denker, J. (1993). Efficient pattern recognition using a new transformation distance. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 50–58). San Mateo, CA: Morgan Kaufmann.

Sona, D., Sperduti, A., & Starita, A. (1997). A constructive learning algorithm for discriminant tangent models. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 786–792). Cambridge, MA: MIT Press.

Sperduti, A., & Stork, D. G. (1995). A rapid graph-based method for arbitrary transformation-invariant pattern classification. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 665–672). Cambridge, MA: MIT Press.

Received April 7, 1998; accepted April 8, 1999.
LETTER
Communicated by Michael Jordan
Observable Operator Models for Discrete Stochastic Time Series

Herbert Jaeger
German National Research Center for Information Technology, Institute for Intelligent Autonomous Systems, D-53754 Sankt Augustin, Germany

A widely used class of models for stochastic systems is hidden Markov models. Systems that can be modeled by hidden Markov models are a proper subclass of linearly dependent processes, a class of stochastic systems known from mathematical investigations carried out over the past four decades. This article provides a novel, simple characterization of linearly dependent processes, called observable operator models. The mathematical properties of observable operator models lead to a constructive learning algorithm for the identification of linearly dependent processes. The core of the algorithm has a time complexity of O(N + nm^3), where N is the size of the training data, n is the number of distinguishable outcomes of observations, and m is the model state-space dimension.

1 Introduction
Hidden Markov models (HMMs) (Bengio, 1996) of stochastic processes were investigated mathematically long before they became a popular tool in speech processing (Rabiner, 1990) and engineering (Elliott, Aggoun, & Moore, 1995). A basic mathematical question was to decide when two HMMs are equivalent, that is, describe the same distribution (of a stochastic process) (Gilbert, 1959). This problem was tackled by framing HMMs within a more general class of stochastic processes, now termed linearly dependent processes (LDPs). Deciding the equivalence of HMMs amounts to characterizing HMM-describable processes as LDPs. This strand of research came to a successful conclusion in Ito, Amari, and Kobayashi (1992), where equivalence of HMMs was characterized algebraically and a decision algorithm was provided. That article also gives an overview of the work done in this area. It should be emphasized that linearly dependent processes are unrelated to linear systems in the standard sense, that is, systems whose state sequences are generated by some linear operator (e.g., Narendra, 1995). The term linearly dependent processes refers to certain linear relationships between conditional distributions that arise in the study of general stochastic processes. LDPs are thoroughly "nonlinear" in the standard sense of the word.

Neural Computation 12, 1371–1398 (2000) © 2000 Massachusetts Institute of Technology
H. Jaeger
Figure 1: (a) Standard view of trajectories. A time-step operator T yields a sequence ABAA of states. (b) OOM view. Operators T_A, T_B are concatenated to yield a sequence ABAA of observables.
The class of LDPs has been characterized in various ways. The most concise description was developed by Heller (1965), using methods from category theory and algebra. This approach was taken up and elaborated in a recent comprehensive account of LDPs and the mathematical theory of HMMs, viewed as a subclass of LDPs (Ito, 1992). All of this work on HMMs and LDPs was mathematically oriented and did not bear on the practical question of learning models from data.

In this article, I develop an alternative, simpler characterization of LDPs, called observable operator models (OOMs). OOMs require only concepts from elementary linear algebra. The linear algebra nature of OOMs gives rise to a constructive learning procedure, which makes it possible to estimate models from data very efficiently.

The name observable operator models arises from the way in which stochastic trajectories are mathematically modeled in this approach. Traditionally, trajectories of discrete-time systems are seen as a sequence of states generated by the repeated application of a single (possibly stochastic) operator T (see Figure 1a). Metaphorically, a trajectory is seen as a sequence of locations in state-space, which are visited by the system due to the action of a time-step operator. In OOM theory, trajectories are perceived in a complementary fashion. From a set of operators (say, {T_A, T_B}), one operator is stochastically selected for application at every time step. The system trajectory is then identified with the sequence of operators. Thus, an observed piece of trajectory ...ABAA... would correspond to a concatenation of operators ...T_A(T_A(T_B(T_A ...)))... (see Figure 1b). The fact that the observables are the operators themselves led to the naming of this kind of stochastic model. An appropriate metaphor would be to view trajectories as sequences of actions.
Stochastic sequences of operators are a well-known object of mathematical investigation (Iosifescu & Theodorescu, 1969). OOM theory grows out of the novel insight that the probability of selecting an operator can be computed using the operator itself. This article covers the following topics: how a matrix representation of OOMs can be construed as a generalization of HMMs (section 2); how OOMs are used to generate and predict stochastic time series (section 3); how an
Observable Operator Models
Figure 2: HMM with two hidden states s_1, s_2 and two outcomes O = {a, b}. Fine arrows indicate admissible hidden state transitions, with their corresponding probabilities marked beside them. The initial state distribution (2/3, 1/3) of the Markov chain is indicated inside the state circles. Emission probabilities P[a_i | s_j] of outcomes are annotated beside bold gray arrows.
abstract information-theoretic version of OOMs can be obtained from any stochastic process (section 4); how these abstract OOMs can be used to prove a fundamental theorem that reveals when two OOMs in matrix representation are equivalent (section 5); that some low-dimensional OOMs can model processes that can be modeled either only by arbitrarily high-dimensional HMMs or by no HMM at all, and that one can model a conditional rise and fall of probabilities in processes timed by "probability oscillators" (section 6); how one can use the fundamental equivalence theorem to obtain OOMs whose state-space dimensions can be interpreted as probabilities of certain future events (section 7); and how these interpretable OOMs directly yield a constructive procedure to estimate OOMs from data (section 8). Section 9 gives a brief conclusion.

2 From HMMs to OOMs
In this section, OOMs are introduced by generalization from HMMs. In this way it becomes immediately clear why the latter are a subclass of the former. A basic HMM specifies the distribution of a discrete-time stochastic process (Y_t)_{t ∈ N}, where the random variables Y_t have finitely many outcomes from a set O = {a_1, ..., a_n}. HMMs can be defined in several equivalent ways. In this article we adhere to the definition that is customary in the speech recognition community and other application areas. The specification is done in two stages. First, a Markov chain (X_t)_{t ∈ N} produces sequences of hidden states from a finite state set {s_1, ..., s_m}. Second, when the Markov chain is in state s_j at time t, it "emits" an observable outcome, with a time-invariant probability P[Y_t = a_i | X_t = s_j]. Figure 2 presents an exemplary HMM.

Formally, the state transition probabilities can be collected in an m × m matrix M that at place (i, j) contains the transition probability from state s_i to s_j (i.e., M is a Markov matrix, or stochastic matrix). For every a ∈ O, we collect the emission probabilities P[Y_t = a | X_t = s_j] in a diagonal observation matrix O_a of size m × m. O_a contains, in its diagonal, the probabilities
P[Y_t = a | X_t = s_1], ..., P[Y_t = a | X_t = s_m]. For the example from Figure 2, this gives

M = \begin{pmatrix} 1/4 & 3/4 \\ 1 & 0 \end{pmatrix}, \quad O_a = \begin{pmatrix} 1/2 & 0 \\ 0 & 1/5 \end{pmatrix}, \quad O_b = \begin{pmatrix} 1/2 & 0 \\ 0 & 4/5 \end{pmatrix}.  (2.1)
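The two-stage specification above can be made concrete with a short sampling sketch. This is my own illustrative transcription of the Figure 2 example in NumPy (all function and variable names are mine, not the article's):

```python
import numpy as np

# The HMM of Figure 2 / equation 2.1.
M = np.array([[1/4, 3/4],
              [1.0, 0.0]])            # M[i, j] = P[next state s_j | current state s_i]
E = np.array([[1/2, 1/2],             # E[j, :] = emission probs of (a, b) in state s_j,
              [1/5, 4/5]])            # i.e., the diagonals of O_a and O_b, stacked
w0 = np.array([2/3, 1/3])             # initial state distribution

def sample_path(T, rng):
    """Draw hidden states from the Markov chain, then emit one outcome per state."""
    x = rng.choice(2, p=w0)
    ys = []
    for _ in range(T):
        ys.append('ab'[rng.choice(2, p=E[x])])   # emit from the current hidden state
        x = rng.choice(2, p=M[x])                # then take a Markov transition
    return ''.join(ys)

rng = np.random.default_rng(1)
paths = [sample_path(5, rng) for _ in range(20000)]
# Empirical P[Y_0 = a] should approach 2/3 * 1/2 + 1/3 * 1/5 = 0.4.
print(np.mean([p[0] == 'a' for p in paths]))
```

The empirical frequency of an initial a converges to 0.4, the value the forward computation of equation 2.3 below returns exactly.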
In order to characterize an HMM fully, one must also supply an initial distribution w_0 = (P[X_0 = s_1], ..., P[X_0 = s_m])^T. (Superscript ^T denotes the transpose of vectors and matrices. Vectors are assumed to be column vectors throughout this article, unless noted otherwise.) The process described by the HMM is stationary if w_0 is an invariant distribution of the Markov chain, that is, if it satisfies

M^T w_0 = w_0.  (2.2)

See Doob (1953) for details on Markov chains.

It is well known that the matrices M and O_a (a ∈ O) can be used to compute the probability of finite-length observable sequences. Let 1 = (1, ..., 1) denote the m-dimensional row vector of units, and let T_a := M^T O_a. Then the probability of observing the sequence a_{i_0}, ..., a_{i_k} among all possible sequences of length k + 1 is equal to the number obtained by applying T_{a_{i_0}}, ..., T_{a_{i_k}} to w_0 and summing up the components of the resulting vector by multiplying it with 1:

P[a_{i_0}, ..., a_{i_k}] = 1 T_{a_{i_k}} ··· T_{a_{i_0}} w_0.  (2.3)

The term P[a_{i_0}, ..., a_{i_k}] in equation 2.3 is a shorthand notation for P[X_0 = a_{i_0}, ..., X_k = a_{i_k}], which will be used throughout this article. Equation 2.3 is a matrix notation of the well-known forward algorithm for determining probabilities of observation sequences in HMMs. Proofs of equation 2.3 may be found in Ito et al. (1992) and Ito (1992). M can be recovered from the operators T_a by observing that

M^T = M^T · id = M^T (O_{a_1} + ··· + O_{a_n}) = T_{a_1} + ··· + T_{a_n}.  (2.4)
Equation 2.3 shows that the distribution of the process (Y_t) is specified by the operators T_{a_i} and the vector w_0. Thus, the matrices T_{a_i} and w_0 contain the same information as the original HMM specification in terms of M, O_{a_i}, and w_0. One can rewrite an HMM as a structure (R^m, (T_a)_{a ∈ O}, w_0), where R^m is the domain of the operators T_a. The HMM from the example, written in this way, becomes

M = \left( R^2, (T_a, T_b), w_0 \right) = \left( R^2, \left( \begin{pmatrix} 1/8 & 1/5 \\ 3/8 & 0 \end{pmatrix}, \begin{pmatrix} 1/8 & 4/5 \\ 3/8 & 0 \end{pmatrix} \right), (2/3, 1/3)^T \right).  (2.5)
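The operator form of the forward algorithm (equations 2.3 and 2.4) can be sketched in a few lines of NumPy. This is an illustrative sketch for the Figure 2 example, not code from the article:

```python
import numpy as np

# HMM of Figure 2, transcribed from equation 2.1.
M = np.array([[1/4, 3/4],
              [1.0, 0.0]])           # Markov matrix: row i holds P[s_j | s_i]
O_a = np.diag([1/2, 1/5])            # emission probabilities of outcome a
O_b = np.diag([1/2, 4/5])            # emission probabilities of outcome b
w0 = np.array([2/3, 1/3])            # initial state distribution

# Observable operators T_a := M^T O_a, as defined before equation 2.3.
T = {'a': M.T @ O_a, 'b': M.T @ O_b}

def prob(seq, T, w0):
    """Equation 2.3: P[seq] = 1 T_{a_ik} ... T_{a_i0} w0 (forward algorithm)."""
    w = w0
    for x in seq:                    # apply operators in temporal order
        w = T[x] @ w
    return w.sum()                   # multiplying with the row vector 1

# Equation 2.4: the observable operators sum back to M^T.
assert np.allclose(T['a'] + T['b'], M.T)

print(prob('a', T, w0))                                     # P[a] = 2/5
print(sum(prob(p + q, T, w0) for p in 'ab' for q in 'ab'))  # length-2 probs sum to 1
```

Because T_a + T_b = M^T preserves component sums of probability vectors, the probabilities of all sequences of a fixed length always sum to 1.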
Now, one arrives at the definition of an OOM by (1) relaxing the requirement that M^T be the transpose of a stochastic matrix to the weaker requirement that the columns of M^T each sum to 1, and by (2) requiring from w_0 merely that it has a component sum of 1. That is, negative entries are now allowed in matrices and vectors, which are forbidden in stochastic matrices and probability vectors. Using the letter τ in OOMs in places where T appears in HMMs, and introducing μ = Σ_{a ∈ O} τ_a in analogy to equation 2.4, yields:

Definition 1. An m-dimensional OOM is a triple A = (R^m, (τ_a)_{a ∈ O}, w_0), where w_0 ∈ R^m and the τ_a : R^m → R^m are linear operators, satisfying

1. 1 w_0 = 1.

2. μ = Σ_{a ∈ O} τ_a has column sums equal to 1.

3. For all sequences a_{i_0}, ..., a_{i_k} it holds that 1 τ_{a_{i_k}} ··· τ_{a_{i_0}} w_0 ≥ 0.

Conditions 1 and 2 reflect relaxations (1) and (2) mentioned previously, while condition 3 ensures that one obtains nonnegative values when the OOM is used to compute probabilities. Unfortunately, condition 3 is useless for deciding or constructing OOMs. An alternative to condition 3, which is suitable for constructing OOMs, will be introduced in section 6.

Since function concatenations of the form τ_{a_{i_k}} ∘ ··· ∘ τ_{a_{i_0}} will be used often in the sequel, we introduce a shorthand notation for handling sequences of symbols. Following the conventions of formal language theory, we shall denote the empty sequence by ε (i.e., the sequence of length 0 that "consists" of no symbol at all); the set of all sequences of length k of symbols from O by O^k; the union ∪_{k ≥ 1} O^k by O^+; and the set {ε} ∪ O^+ by O^*. We shall write ā ∈ O^+ to denote any finite sequence a_0, ..., a_n, and τ_ā to denote τ_{a_n} ∘ ··· ∘ τ_{a_0}.

An OOM, as defined here, specifies a stochastic process, if one makes use of an analog of equation 2.3:

Proposition 1. Let A = (R^m, (τ_a)_{a ∈ O}, w_0) be an OOM according to the previous definition. Let Ω = O^∞ be the set of all infinite sequences over O, and A be the σ-algebra generated by all finite-length initial events on Ω. Then, if one computes the probabilities of initial finite-length events in the following way,

P_0[ā] := 1 τ_ā w_0,  (2.6)

the numerical function P_0 can be uniquely extended to a probability measure P on (Ω, A), giving rise to a stochastic process (Ω, A, P, (X_t)_{t ∈ N}), where X_n(a_1 a_2 ···) = a_n. If w_0 is an invariant vector of μ, that is, if μ w_0 = w_0, the process is stationary.

The proof is given in appendix A. Since we introduced OOMs here by generalizing from HMMs, it is clear that every process whose distribution can be specified by an HMM can also be characterized by an OOM.
I conclude this section with a remark on LDPs and OOMs. It is known that the distributions of LDPs can be characterized through matrix multiplications in a fashion very similar to equation 2.6 (cf. Ito, 1992, theorem 1.8):

P[a_{i_0} ... a_{i_k}] = 1 Q I_{a_{i_k}} ··· Q I_{a_{i_0}} w_0.  (2.7)

The matrix Q does not depend on a, while the "projection matrices" I_a do. If one puts Q = id, I_a = τ_a, one easily sees that the class of processes that can be described by OOMs is the class of LDPs.

3 OOMs as Generators and Predictors
This section explains how to generate and predict the paths of a process (X_t)_{t ∈ N} whose distribution is specified by an OOM A = (R^k, (τ_a)_{a ∈ O}, w_0). We describe the procedures mathematically and illustrate them with an example.

More precisely, the generation task consists of randomly producing, at times t = 0, 1, 2, ..., outcomes a_{i_0}, a_{i_1}, a_{i_2}, ..., such that (1) at time t = 0, the probability of producing b is equal to 1 τ_b w_0 according to equation 2.6, and (2) at every time step t > 0, the probability of producing b (after a_{i_0}, ..., a_{i_{t-1}} have already been produced) is equal to P[X_t = b | X_0 = a_{i_0}, ..., X_{t-1} = a_{i_{t-1}}]. Using equation 2.6, the latter amounts to calculating at time t, for every b ∈ O, the conditional probability

P[X_t = b | X_0 = a_{i_0}, ..., X_{t-1} = a_{i_{t-1}}]
= P[X_0 = a_{i_0}, ..., X_{t-1} = a_{i_{t-1}}, X_t = b] / P[X_0 = a_{i_0}, ..., X_{t-1} = a_{i_{t-1}}]
= 1 τ_b τ_{a_{i_{t-1}}} ··· τ_{a_{i_0}} w_0 / 1 τ_{a_{i_{t-1}}} ··· τ_{a_{i_0}} w_0
= 1 τ_b ( τ_{a_{i_{t-1}}} ··· τ_{a_{i_0}} w_0 / 1 τ_{a_{i_{t-1}}} ··· τ_{a_{i_0}} w_0 )
=: 1 τ_b w_t,  (3.1)

and producing at time t the outcome b with this conditional probability (the symbol =: denotes that the term on the right-hand side is being defined by the equation). The calculations of equation 3.1 can be carried out incrementally if one observes that for t ≥ 1, w_t can be computed from w_{t-1}:

w_t = τ_{a_{t-1}} w_{t-1} / (1 τ_{a_{t-1}} w_{t-1}).  (3.2)
Note that all vectors w_t thus obtained have a component sum equal to 1. Observe that the operation 1 τ_b · in equation 3.1 can be done effectively by pre-computing the vector v_b := 1 τ_b. Computing equation 3.1 then amounts
to multiplying the (row) vector v_b with the (column) vector w_t or, equivalently, evaluating the inner product ⟨v_b, w_t⟩.

The prediction task is completely analogous to the generation task. Given an initial realization a_{i_0}, ..., a_{i_{t-1}} of the process up to time t − 1, one has to calculate the probability with which an outcome b is going to occur at the next time step t. This is again an instance of equation 3.1, the only difference being that now the initial realization is not generated by oneself but is externally given.

Many-time-step probability predictions of collective outcomes can be calculated by evaluating inner products, too. Let the collective outcome A = {b̄_1, ..., b̄_n} consist of n sequences of length s + 1 of outcomes (i.e., outcome A is recorded when any of the sequences b̄_i occurs). Then the probability that A is going to occur after an initial realization ā of length t − 1 can be computed as follows:

P[(X_t, ..., X_{t+s}) ∈ A | (X_0, ..., X_{t-1}) = ā]
= Σ_{b̄ ∈ A} P[(X_t, ..., X_{t+s}) = b̄ | (X_0, ..., X_{t-1}) = ā]
= Σ_{b̄ ∈ A} 1 τ_b̄ w_t =: Σ_{b̄ ∈ A} ⟨v_b̄, w_t⟩ = ⟨ Σ_{b̄ ∈ A} v_b̄, w_t ⟩ =: ⟨v_A, w_t⟩.  (3.3)
To calculate the future probability of a collective outcome A repeatedly, utilization of equation 3.3 reduces the computational load considerably, because the vector v_A needs to be (pre)computed only once.

The generation procedure shall now be illustrated using the exemplary OOM M from equation 2.5. We first compute the vectors v_a, v_b:

v_a = 1 τ_a = 1 \begin{pmatrix} 1/8 & 1/5 \\ 3/8 & 0 \end{pmatrix} = (1/2, 1/5),

v_b = 1 τ_b = 1 \begin{pmatrix} 1/8 & 4/5 \\ 3/8 & 0 \end{pmatrix} = (1/2, 4/5).

Starting with w_0 = (2/3, 1/3)^T, we obtain probabilities ⟨v_a, w_0⟩ = 2/5 and ⟨v_b, w_0⟩ = 3/5 of producing a versus b at the first time step. We make a random decision for a versus b, weighted according to these probabilities. Let us assume the dice fall for b. We now compute w_1 = τ_b w_0 / (1 τ_b w_0) = (7/12, 5/12)^T. For the next time step, we repeat these computations with w_1 in place of w_0.
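The generation and prediction procedures of this section can be sketched as follows. This is my own NumPy transcription of the example OOM (helper names like `step` and `v_seq` are mine, not the article's):

```python
import numpy as np

# Operators tau_a, tau_b and start vector w0 of the OOM M from equation 2.5.
tau = {'a': np.array([[1/8, 1/5], [3/8, 0.0]]),
       'b': np.array([[1/8, 4/5], [3/8, 0.0]])}
w0 = np.array([2/3, 1/3])

# Precompute the row vectors v_x := 1 tau_x (the column sums of tau_x).
v = {x: t.sum(axis=0) for x, t in tau.items()}   # v['a'] = (1/2, 1/5), v['b'] = (1/2, 4/5)

def step(w, x):
    """Equation 3.2: w_t = tau_x w_{t-1} / (1 tau_x w_{t-1})."""
    u = tau[x] @ w
    return u / u.sum()

def generate(w, T, rng):
    """Produce T outcomes, each with the conditional probability <v_x, w_t> of eq. 3.1."""
    out = []
    for _ in range(T):
        x = 'a' if rng.random() < v['a'] @ w else 'b'
        out.append(x)
        w = step(w, x)
    return ''.join(out)

# The worked numbers from the text: <v_a, w0> = 2/5, <v_b, w0> = 3/5, and after
# observing a b, w1 = (7/12, 5/12).
w1 = step(w0, 'b')

# Collective outcomes (equation 3.3): v_A is the sum of the vectors v_bbar.
def v_seq(seq):
    m = np.eye(2)
    for x in seq:
        m = tau[x] @ m
    return m.sum(axis=0)                         # the row vector 1 tau_seq

v_A = v_seq('aa') + v_seq('ab')                  # A = {aa, ab}: "the next outcome is a"
```

Since aa and ab partition the event "the next outcome is a", the inner product ⟨v_A, w_0⟩ reproduces P[a] = 2/5, illustrating why v_A can be precomputed once and reused.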
4 From Stochastic Processes to OOMs
This section introduces OOMs again, but this time in a top-down fashion, starting from general stochastic processes. This alternative route clarifies the fundamental nature of observable operators. Furthermore, the insights obtained in this section will yield a short and instructive proof of the central theorem of OOM equivalence, to be presented in the next section. The material presented here is not required after the next section; readers without a strong interest in probability theory can skip it.

In section 2, we described OOMs as structures (R^m, (τ_a)_{a ∈ O}, w_0). In this section, we arrive at isomorphic structures (G, (t_a)_{a ∈ O}, g_ε), where again G is a vector space, (t_a)_{a ∈ O} is a family of linear operators on G, and g_ε ∈ G. However, the vector space G is now a space of certain numerical prediction functions. In order to discriminate OOMs characterized on spaces G from the "ordinary" OOMs, we shall call (G, (t_a)_{a ∈ O}, g_ε) a predictor-space OOM.

Let (X_t)_{t ∈ N}, or for short (X_t), be a discrete-time stochastic process with values in a finite set O. Then the distribution of (X_t) is uniquely characterized by the probabilities of finite initial subsequences, that is, by all probabilities of the kind P[ā], where ā ∈ O^+. We introduce a shorthand for conditional probabilities by writing P[ā | b̄] for P[(X_n, ..., X_{n+s}) = ā | (X_0, ..., X_{n-1}) = b̄]. We shall formally write the unconditional probabilities as conditional probabilities, too, with the empty condition ε; we use the notation P[ā | ε] := P[(X_0 ... X_s) = ā] = P[ā]. Thus, the distribution of (X_t) is uniquely characterized by its conditional continuation probabilities, that is, by the conditional probabilities P[ā | b̄], where ā ∈ O^+, b̄ ∈ O^*. For every b̄ ∈ O^*, we collect all conditioned continuation probabilities of b̄ into a numerical function
g_b̄ : O^+ → R,   ā ↦ P[ā | b̄], if P[b̄] ≠ 0,
                  ā ↦ 0, if P[b̄] = 0.  (4.1)
The set {g_b̄ | b̄ ∈ O^*} uniquely characterizes the distribution of (X_t), too. Intuitively, a function g_b̄ describes the future distribution of the process after an initial realization b̄. Let D denote the set of all functions from O^+ into the real numbers, that is, the numerical functions on nonempty sequences. D canonically becomes a real vector space if one defines scalar multiplication and vector addition as follows: for d_1, d_2 ∈ D, α, β ∈ R, ā ∈ O^+, put (α d_1 + β d_2)(ā) := α(d_1(ā)) + β(d_2(ā)). Let G = ⟨{g_b̄ | b̄ ∈ O^*}⟩ denote the linear subspace spanned in D by the conditioned continuations. Intuitively, G is the (linear closure of the) space of future distributions of the process (X_t).
We are now halfway done with our construction of (G, (t_a)_{a ∈ O}, g_ε). We have constructed the vector space G, which corresponds to R^m in the "ordinary" OOMs from section 2, and we have defined the initial vector g_ε, which is the counterpart of w_0. It remains to define the family of observable operators. In order to specify a linear operator on a vector space, it suffices to specify the values the operator takes on a basis of the vector space. Choose O_0^* ⊆ O^* such that the set {g_b̄ | b̄ ∈ O_0^*} is a basis of G. Define, for every a ∈ O, a linear function t_a : G → G by putting

t_a(g_b̄) := P[a | b̄] g_b̄a  (4.2)

for all b̄ ∈ O_0^* (b̄a denotes the concatenation of the sequence b̄ with a). It turns out that equation 4.2 carries over from basis elements b̄ ∈ O_0^* to all b̄ ∈ O^*:

Proposition 2. For all b̄ ∈ O^*, a ∈ O, the linear operator t_a satisfies the condition

t_a(g_b̄) = P[a | b̄] g_b̄a.  (4.3)
The proof is given in appendix B. Intuitively, the operator t_a describes the change of knowledge about a process due to an incoming observation of a. More precisely, assume that the process has been observed up to time n; that is, an initial observation b̄ = b_0, ..., b_n has been made. Our knowledge about the state of the process at this moment is tantamount to the predictor function g_b̄. Then assume that at time n + 1, an outcome a is observed. After that, our knowledge about the process state is expressed by g_b̄a. But this is (up to scaling by P[a | b̄]) just the result of applying t_a to the old state, g_b̄.

The operators (t_a)_{a ∈ O} are the analog of the observable operators (τ_a)_{a ∈ O} in ordinary OOMs and can be used to compute probabilities of finite sequences:

Proposition 3. Let {g_b̄ | b̄ ∈ O_0^*} be a basis of G. Let ā := a_{i_0}, ..., a_{i_k} be an initial realization of (X_t) of length k + 1. Let Σ_{i=1,...,n} α_i g_b̄_i = t_{a_{i_k}} ··· t_{a_{i_0}} g_ε be the linear combination of t_{a_{i_k}} ··· t_{a_{i_0}} g_ε from basis vectors. Then it holds that

P[ā] = Σ_{i=1,...,n} α_i.  (4.4)

Note that equation 4.4 is valid for any basis {g_b̄ | b̄ ∈ O_0^*}. The proof can be found in appendix C. Equation 4.4 corresponds exactly to equation 2.6, since left-multiplication of a vector with 1 amounts to summing the vector components, which in turn are the coefficients of that vector with respect to a vector space basis.
Due to equation 4.4, the distribution of the process (X_t) is uniquely characterized by the observable operators (t_a)_{a ∈ O}. Conversely, these operators are uniquely defined by the distribution of (X_t). Thus, the following definition makes sense:

Definition 2. Let (X_t)_{t ∈ N} be a stochastic process with values in a finite set O. The structure (G, (t_a)_{a ∈ O}, g_ε) is called the predictor-space observable operator model of the process. The vector space dimension of G is called the dimension of the process and is denoted by dim(X_t).
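The process dimension can be probed numerically. By the definition of g_b̄, a matrix with entries P[b̄ā] (prefixes b̄ as rows, suffixes ā as columns) has as rows the functions P[b̄] · g_b̄ restricted to the chosen suffixes, so its rank can never exceed dim(X_t). The following is my own illustration for the example process of equation 2.5, not code from the article:

```python
import numpy as np
from itertools import product

# Example OOM from equation 2.5.
tau = {'a': np.array([[1/8, 1/5], [3/8, 0.0]]),
       'b': np.array([[1/8, 4/5], [3/8, 0.0]])}
w0 = np.array([2/3, 1/3])

def prob(seq):
    """P[seq] = 1 tau_seq w0 (equation 2.6); the empty sequence has probability 1."""
    w = w0
    for x in seq:
        w = tau[x] @ w
    return w.sum()

# All sequences of length 0, 1, 2 serve as prefixes bbar and suffixes abar.
seqs = [''] + [''.join(s) for k in (1, 2) for s in product('ab', repeat=k)]

# W[i, j] = P[bbar_i abar_j]; row i is P[bbar_i] times g_{bbar_i} on these suffixes.
W = np.array([[prob(b + a) for a in seqs] for b in seqs])

print(np.linalg.matrix_rank(W, tol=1e-10))
```

Here the rank already attains the process dimension 2; with too few prefixes or suffixes it could come out smaller, so in general this construction yields a lower bound on dim(X_t).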
I remarked in section 1 that stochastic processes have previously been characterized in terms of vector spaces. Although the vector spaces were constructed in ways other than G, they led to equivalent notions of process dimension. Heller (1965) called finite-dimensional (in our sense) stochastic processes finitary; in Ito et al. (1992) the process dimension (if finite) was called the minimum effective degree of freedom.

Equation 4.3 clarifies the fundamental character of observable operators: t_a describes how the knowledge about the process's future after an observed past b̄ (i.e., the predictor function g_b̄) changes through an observation of a. The power of the observable operator idea lies in the fact that these operators turn out to be linear (see proposition 2). I have treated only the discrete-time, discrete-value case here. However, predictor-space OOMs can be defined in a similar way for continuous-time, arbitrary-valued processes (detailed in Jaeger, 1999). It turns out that in those cases, the resulting observable operators are linear, too. In the sense of updating predictor functions, the change of knowledge about a process due to incoming observations is a linear phenomenon.

In the remainder of this section, I describe how the dimension of a process is related to the dimensions of ordinary OOMs of that process.

Proposition 4.
1. If (X_t) is a process with finite dimension m, then an m-dimensional ordinary OOM of this process exists.

2. A process (X_t) whose distribution is described by a k-dimensional OOM A = (R^k, (τ_a)_{a ∈ O}, w_0) has a dimension m ≤ k.

The proof is given in appendix D. Thus, if a process (X_t) has dimension m and we have a k-dimensional OOM A of (X_t), we know that an m-dimensional OOM A′ exists that is equivalent to A in the sense of specifying the same distribution. Furthermore, A′ is minimal-dimensional in its equivalence class. A minimal-dimensional OOM A′ can be constructively obtained from A in several ways, all of which amount to an implicit
construction of the predictor-space OOM of the process specified by A. Since the learning algorithm presented in later sections can be used for this construction too, I do not present a dedicated procedure for obtaining minimal-dimensional OOMs here.

5 Equivalence of OOMs
Given two OOMs A = (R^k, (τ_a)_{a ∈ O}, w_0) and B = (R^l, (τ′_a)_{a ∈ O}, w′_0), when are they equivalent in the sense that they describe the same distribution? This question can be answered using the insights gained in the previous section. First, construct minimal-dimensional OOMs A′, B′, which are equivalent to A and B, respectively. If the dimensions of A′, B′ are not equal, then A and B are not equivalent. We can therefore assume that the two OOMs whose equivalence we wish to ascertain have the same (and minimal) dimension. Then the answer to our question is given in the following proposition:

Proposition 5. Two minimal-dimensional OOMs A = (R^m, (τ_a)_{a ∈ O}, w_0), B = (R^m, (τ′_a)_{a ∈ O}, w′_0) are equivalent iff there exists a bijective linear map ϱ : R^m → R^m satisfying the following conditions:

1. ϱ(w_0) = w′_0.

2. τ′_a = ϱ τ_a ϱ^{-1} for all a ∈ O.

3. 1 v = 1 ϱ v for all (column) vectors v ∈ R^m.

Sketch of proof. ⇐: trivial. ⇒: We have done all the hard work in the previous section. Let σ_A, σ_B be the canonical projections from A, B on the predictor-space OOM of the process specified by A (and hence by B). Observe that σ_A, σ_B are bijective linear maps that preserve the component sum of vectors. Define ϱ := σ_B^{-1} ∘ σ_A. Then condition 1 in proposition 5 follows from σ_A(w_0) = σ_B(w′_0) = g_ε, condition 2 follows from ∀c̄ ∈ O^+ : σ_A(τ_c̄ w_0) = σ_B(τ′_c̄ w′_0) = P[c̄] g_c̄, and condition 3 from the fact that σ_A, σ_B preserve the component sum of vectors.

6 A Non-HMM Linearly Dependent Process
The question of when an LDP can be captured by an HMM has been fully answered in the literature (original result in Heller, 1965; refinements in Ito, 1992), and examples of non-HMM LDPs have been given. I briefly restate the results and then elaborate on a simple example of such a process, first described in a slightly different version in Heller (1965). The aim is to provide an intuitive insight into how the class of LDPs is "larger" than the class of processes that can be captured by HMMs.
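Before developing the example, the equivalence criterion of proposition 5 is easy to check numerically: conjugating an OOM by any invertible ϱ whose columns each sum to 1 (so that 1v = 1ϱv) produces a different-looking but equivalent OOM. A sketch under these assumptions, using the example OOM of equation 2.5 (my own illustration, not code from the article):

```python
import numpy as np

# Example OOM from equation 2.5.
tau = {'a': np.array([[1/8, 1/5], [3/8, 0.0]]),
       'b': np.array([[1/8, 4/5], [3/8, 0.0]])}
w0 = np.array([2/3, 1/3])

# An invertible rho with unit column sums satisfies condition 3 (1 v = 1 rho v).
rho = np.array([[0.7, 0.4],
                [0.3, 0.6]])
rho_inv = np.linalg.inv(rho)

tau2 = {x: rho @ t @ rho_inv for x, t in tau.items()}   # condition 2
w0_2 = rho @ w0                                         # condition 1

def prob(seq, tau, w):
    """P[seq] = 1 tau_seq w0 (equation 2.6)."""
    for x in seq:
        w = tau[x] @ w
    return w.sum()

# Both OOMs assign identical probabilities to every sequence:
# 1 tau2_seq w0_2 = 1 rho tau_seq rho^{-1} rho w0 = 1 tau_seq w0.
for seq in ['a', 'b', 'ab', 'ba', 'abba', 'bbab']:
    assert np.isclose(prob(seq, tau, w0), prob(seq, tau2, w0_2))
```

The cancellation of ϱ^{-1}ϱ inside the operator product, together with 1ϱ = 1, is exactly the "trivial" direction of the proposition.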
Characterizing HMMs as LDPs draws heavily on the theory of convex cones and nonnegative matrices. I first introduce some concepts, following the notation of a standard textbook (Berman & Plemmons, 1979). With a set S ⊆ R^n, we associate the set S_G, the set generated by S, which consists of all finite nonnegative linear combinations of elements of S. A set K ⊆ R^n is defined to be a convex cone if K = K_G. A convex cone K_G is called n-polyhedral if K has n elements. A cone K is pointed if for every nonzero v ∈ K, the vector −v is not in K. A cone is proper if it is pointed and closed and its interior is not empty. Using these concepts, propositions 6.a1 and 6.a2 give two conditions that individually are equivalent to condition 3 in definition 1, and 6.b refines condition 6.a1 for determining when an OOM is equivalent to an HMM. Finally, 6.c states necessary conditions that every τ_a in an OOM must satisfy.

Proposition 6.

a1. Let A = (R^m, (τ_a)_{a ∈ O}, w_0) be a structure consisting of linear maps (τ_a)_{a ∈ O} on R^m and a vector w_0 ∈ R^m. Let μ := Σ_{a ∈ O} τ_a. Assume that the first two conditions from definition 1 hold: 1 w_0 = 1, and μ has column sums equal to 1. Then A is an OOM if and only if there exist pointed convex cones (K_a)_{a ∈ O} satisfying the following conditions:

1. 1 v ≥ 0 for all v ∈ K_a (where a ∈ O).

2. w_0 ∈ (∪_{a ∈ O} K_a)_G.

3. ∀a, b ∈ O : τ_b K_a ⊆ K_b.

a2. Using the same assumptions as before, A is an OOM if and only if there exists a pointed convex cone K satisfying the following conditions:

1. 1 v ≥ 0 for all v ∈ K.

2. w_0 ∈ K.

3. ∀a ∈ O : τ_a K ⊆ K.

b. Assume that A is an OOM. Then there exists an HMM equivalent to A if and only if a pointed convex cone K according to condition a2 exists that is n-polyhedral for some n. n can be selected such that it is not greater than the minimal state number for HMMs equivalent to A.

c. Let A be a minimal-dimensional OOM, τ_a be one of its observable operators, and K be a cone according to a2. Then (i) the spectral radius ρ(τ_a) of τ_a is an eigenvalue of τ_a, (ii) the degree of ρ(τ_a) is greater than or equal to the degree of any other eigenvalue λ with |λ| = ρ(τ_a), and (iii) an eigenvector of τ_a corresponding to ρ(τ_a) lies in K. (The degree of an eigenvalue λ of a matrix is the size of the largest diagonal block in the Jordan canonical form of the matrix that contains λ.)
The proof of parts a1 and b goes back to Heller (1965) and has been reformulated in Ito (1992).¹ The equivalence of proposition 6.a1 with 6.a2 is an easy exercise. The conditions collected in proposition 6.c are equivalent to the statement that τ_a K ⊆ K for some proper cone K (proof in theorems 3.2 and 3.5 in Berman & Plemmons, 1979). It is easily seen that for a minimal-dimensional OOM, the cone K required in proposition 6.a2 is proper. Thus, proposition 6.c is a direct implication of 6.a2.

Proposition 6 has two simple but interesting implications. First, every two-dimensional OOM is equivalent to an HMM (since all cones in two dimensions are polyhedral). Second, every nonnegative OOM (i.e., one whose matrices τ_a have only nonnegative entries) is equivalent to an HMM (since nonnegative matrices map the positive orthant, which is a polyhedral cone, on itself). Proposition 6.c is sometimes useful to rule out a structure A as an OOM by showing that some τ_a fails to satisfy the conditions given. Unfortunately, even if every τ_a of a structure A satisfies the conditions in proposition 6.c, A need not be a valid OOM. Imperfect as it is, however, proposition 6.c is the strongest result available at this moment in the direction of characterizing OOMs.

Proposition 6 is particularly useful to build OOMs from scratch, starting with a cone K and constructing observable operators satisfying τ_a K ⊆ K. Note, however, that the theorem provides no means to decide, for a given structure A, whether A is a valid OOM, since the theorem is nonconstructive with respect to K. More specifically, proposition 6.b yields a construction of OOMs that are not equivalent to any HMM. This will be demonstrated in the remainder of this section.

Let t_φ : R^3 → R^3 be the linear mapping that right-rotates R^3 by an angle φ around the first unit vector e_1 = (1, 0, 0). Select some angle φ that is not a rational multiple of 2π. Then put τ_a := α t_φ, where 0 < α < 1.
Any three-dimensional process described by a three-dimensional OOM containing such a τ_a is not equivalent to any HMM: let K_A be a convex cone corresponding to A according to proposition 6.a1. Due to condition 3, K_A must satisfy τ_a K_A ⊆ K_A. Since τ_a rotates any set of vectors around (1, 0, 0) by θ, this implies that K_A is rotation symmetric around (1, 0, 0) by θ. Since θ is a nonrational multiple of 2π, K_A cannot be polyhedral. According to proposition 6.b, this implies that an OOM that features this τ_a cannot be equivalent to any HMM.

I describe now such an OOM with O = {a, b}. The operator τ_a is fixed, according to the previous considerations, by selecting α = 0.5 and θ = 1.0. For τ_b, we take an operator that projects every v ∈ ℝ³ on a multiple of
¹ Heller and Ito use a different definition for HMMs, which yields a different version of the minimality statement in proposition 6.b.
1384
H. Jaeger
Figure 3: Rise and fall of the probability to obtain a from the probability clock C. The horizontal axis represents time steps t, and the vertical axis represents the probabilities P[a | aᵗ] = 1τ_a(τ_aᵗ w_0 / 1τ_aᵗ w_0), which are rendered as dots. The solid line connecting the dots is given by f(t) = 1τ_a(τ_t w_0 / 1τ_t w_0), where τ_t is the rotation by angle t described in the text.
(.75, 0, .25), such that μ = τ_a + τ_b has column vectors with component sums equal to 1 (cf. definition 1.2). The circular convex cone K whose border is obtained from rotating (.75, 0, .25) around (1, 0, 0) obviously satisfies the conditions in proposition 6.a2. Thus, we obtain a valid OOM provided that we select w_0 ∈ K. Using the abbreviations s := sin(1.0), c := cos(1.0), the matrices read as follows:

τ_a = 0.5 · ( 1    0    0
              0    c    s
              0   −s    c ),

τ_b = ( .75 · .5    .75(1 − .5c + .5s)    .75(1 − .5s − .5c)
        0           0                     0
        .25 · .5    .25(1 − .5c + .5s)    .25(1 − .5s − .5c) ).   (6.1)

As starting vector w_0, we take (.75, 0, .25) to obtain an OOM C = (ℝ³, (τ_a, τ_b), w_0).

I will briefly describe the phenomenology of the process generated by C. The first observation is that every occurrence of b "resets" the process to the initial state w_0. Thus, we only have to understand what happens after uninterrupted sequences of a's. We should look at the conditional probabilities P[· | ε], P[· | a], P[· | aa], ..., that is, at P[· | aᵗ], where t = 0, 1, 2, .... Figure 3 gives a plot of P[a | aᵗ]. The process could aptly be called a probability clock or probability oscillator.

Rotational operators can be exploited for timing effects. In our example, for instance, if the process were started in the state according to t = 4 in Figure 3, there would be a high chance that two initial a's would occur, with a rapid drop in probability for a third or fourth. Such nonexponential-decay duration patterns for identical sequences are difficult to achieve with HMMs. Essentially HMMs offer two possibilities for identical sequences: (1) recurrent transitions into a state where a is emitted, or (2) transitions along sequences of states, each of which can emit a. Option 1 is cumbersome because recurrent transitions imply exponential decay of state, which is unsuitable for many empirical processes. Option 2 blows up model size. A more detailed discussion of this problem in HMMs can be found in Rabiner (1990).

If one defines a three-dimensional OOM C′ in a similar manner but with a rational fraction of 2π as the angle of rotation, one obtains a process that can be modeled by an HMM. It follows from proposition 6.b that the minimal HMM state number in this case is at least k, where k is the smallest integer such that kθ is a multiple of 2π. Thus, the smallest HMM equivalent to a suitably chosen three-dimensional OOM can have an arbitrarily great number of states. In a similar vein, any HMM that would give a reasonable approximation to the process depicted in Figure 3 would require at least six states, since in this process it takes approximately six time steps for one full rotation.

7 Interpretable OOMs
Some minimal-dimension OOMs have a remarkable property: their state-space dimensions can be interpreted as probabilities of certain future outcomes. These interpretable OOMs are described in this section.

Let (X_t)_{t∈ℕ} be an m-dimensional LDP. For some suitably large k, let O^k = A_1 ∪ ··· ∪ A_m be a partition of the set of sequences of length k into m disjoint nonempty subsets. The collective outcomes A_i are called characteristic events if some sequences b̄_1, ..., b̄_m exist such that the m × m matrix (P[A_i | b̄_j])_{i,j} is nonsingular, where

P[A_i | b̄_j] denotes ∑_{ā ∈ A_i} P[ā | b̄_j].   (7.1)

Every LDP has characteristic events:

Proposition 7. Let (X_t)_{t∈ℕ} be an m-dimensional LDP. Then there exists some k ≥ 1 and a partition O^k = A_1 ∪ ··· ∪ A_m of O^k into characteristic events.
The proof is given in appendix E.

Let A = (ℝ^m, (τ_a)_{a∈O}, w_0) be an m-dimensional OOM of the process (X_t). Using the characteristic events A_1, ..., A_m, we shall now construct from A an equivalent, interpretable OOM A(A_1, ..., A_m), which has the property that the m state vector components represent the probabilities that the m characteristic events will occur. More precisely, if during the generation procedure described in section 3, A(A_1, ..., A_m) is in state w_t = (w_t¹, ..., w_t^m) at time t, the probability P[(X_{t+1}, ..., X_{t+k}) ∈ A_i | w_t] that the collective outcome A_i is generated in the next k time steps is equal to w_t^i. In shorthand notation:

P[A_i | w_t] = w_t^i.   (7.2)

We shall use proposition 5 to obtain A(A_1, ..., A_m). Define τ_{A_i} := ∑_{ā ∈ A_i} τ_ā, and define a mapping ϱ: ℝ^m → ℝ^m by

ϱ(x) := (1τ_{A_1}x, ..., 1τ_{A_m}x).   (7.3)

The mapping ϱ is obviously linear. It is also bijective, since the matrix (P[A_i | b̄_j]) = (1τ_{A_i}x_j), where x_j := τ_{b̄_j}w_0 / 1τ_{b̄_j}w_0, is nonsingular. Furthermore, ϱ preserves component sums of vectors, since for j = 1, ..., m it holds that 1x_j = 1 = 1(P[A_1 | x_j], ..., P[A_m | x_j]) = 1(1τ_{A_1}x_j, ..., 1τ_{A_m}x_j) = 1ϱ(x_j). (Note that a linear map preserves component sums if it preserves component sums of basis vectors.) Hence, ϱ satisfies the conditions of proposition 5. We therefore obtain an OOM equivalent to A by putting

A(A_1, ..., A_m) = (ℝ^m, (ϱτ_aϱ⁻¹)_{a∈O}, ϱw_0) =: (ℝ^m, (τ′_a)_{a∈O}, w′_0).   (7.4)
In A(A_1, ..., A_m), equation 7.2 holds. To see this, let w′_t be a state vector obtained in a generation run of A(A_1, ..., A_m) at time t. Then conclude w′_t = ϱϱ⁻¹w′_t = (1τ_{A_1}(ϱ⁻¹w′_t), ..., 1τ_{A_m}(ϱ⁻¹w′_t)) = (P[A_1 | ϱ⁻¹w′_t], ..., P[A_m | ϱ⁻¹w′_t]) (computed in A) = (P[A_1 | w′_t], ..., P[A_m | w′_t]) (computed in A(A_1, ..., A_m)).

The m × m matrix corresponding to ϱ can easily be obtained from the original OOM A by observing that

ϱ = (1τ_{A_i}e_j),   (7.5)

where e_j is the jth unit vector. The following fact lies at the heart of the learning algorithm presented in the next section:

Proposition 8. In an interpretable OOM A(A_1, ..., A_m) it holds that

1. w_0 = (P[A_1], ..., P[A_m]).
2. τ_b̄ w_0 = (P[b̄A_1], ..., P[b̄A_m]).
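These relations can be checked numerically. The sketch below (numpy assumed) builds the probability clock operators from equation 6.1 and transforms them with the characteristic events A_1 = {aa}, A_2 = {ab}, A_3 = {ba, bb} that are used for this example later in the section:

```python
import numpy as np

# Probability clock operators (equation 6.1); starting vector w0 = (.75, 0, .25).
s, c = np.sin(1.0), np.cos(1.0)
ta = 0.5 * np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])
tb = np.outer([0.75, 0.0, 0.25],
              [0.5, 1 - 0.5 * c + 0.5 * s, 1 - 0.5 * s - 0.5 * c])
w0 = np.array([0.75, 0.0, 0.25])
one = np.ones(3)

# Operators of the collective outcomes (note tau_{a0 a1} = tau_{a1} tau_{a0}):
tA = [ta @ ta,            # A1 = {aa}
      tb @ ta,            # A2 = {ab}
      ta @ tb + tb @ tb]  # A3 = {ba, bb}

# Equation 7.5: the matrix of the mapping rho, entry (i, j) = 1 tau_{A_i} e_j.
rho = np.array([one @ t for t in tA])

# Equation 7.4: the equivalent interpretable OOM.
rho_inv = np.linalg.inv(rho)
ta_int, tb_int, w0_int = rho @ ta @ rho_inv, rho @ tb @ rho_inv, rho @ w0

# Proposition 8.1: the new starting vector lists P[A1], P[A2], P[A3].
assert np.allclose(w0_int, [one @ t @ w0 for t in tA])
print(np.round(w0_int, 3))        # -> [0.218 0.329 0.452]
```

The transformed operators ta_int, tb_int agree (up to rounding) with the matrices reported for this example in equation 7.6.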
The proof is trivial.

The state dynamics of interpretable OOMs can be graphically represented in a standardized fashion, which makes it possible to compare the dynamics of different processes visually. Any state vector w_t occurring in a generation run of an interpretable OOM is a probability vector. It lies in the nonnegative hyperplane H_{≥0} := {(v_1, ..., v_m) ∈ ℝ^m | v_1 + ··· + v_m = 1, v_i ≥ 0 for i = 1, ..., m}. Therefore, if one wishes to depict a state sequence w_0, w_1, w_2, ..., one needs only to render the bounded area H_{≥0}. Specifically, in the case m = 3, H_{≥0} is the triangular surface shown in Figure 4a. We can use it as the drawing plane, putting the point (1/3, 1/3, 1/3) in the origin. For our orientation, we include the contours of H_{≥0} into the graphical representation. This is an equilateral triangle whose edges have length √2. If w = (w_1, w_2, w_3) ∈ H_{≥0} is a state vector, its components can be recovered from its position within this triangle by exploiting w_i = √(2/3) d_i, where the d_i are the distances to the edges of the triangle. A similar graphical representation of states was first introduced in Smallwood and Sondik (1973) for HMMs.

Figure 4: (a) Positioning of H_{≥0} within state-space. (b) State sequence of the probability clock, corresponding to Figure 3.

When one wishes to represent graphically states of higher-dimensional, interpretable OOMs (i.e., where m > 3), one can join some of the characteristic events until three merged events are left. State vectors can then be plotted in a way similar to the one just outlined.

To see an instance of interpretable OOMs, consider the probability clock example from equation 6.1. With k = 2, the following partition (among others) of {a, b}² yields characteristic events: A_1 := {aa}, A_2 := {ab}, A_3 := {ba, bb}. Using equation 7.5, one can calculate the matrix ϱ, which we omit here, and compute the interpretable OOM C(A_1, A_2, A_3) using equation 7.4. These are the observable operators τ_a and τ_b thus obtained:

τ_a = ( 0.645   −0.395    0.125
        0.355    0.395   −0.125
        0        1        0     ),

τ_b = ( 0   0   0.218
        0   0   0.329
        0   0   0.452 ).   (7.6)
Figure 4b shows a 30-step state sequence obtained by iterated applications of τ_a of this interpretable equivalent of the probability clock. This sequence corresponds to Figure 3.
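The rise and fall of Figure 3 can be reproduced directly from the operators of equation 6.1 (a sketch, numpy assumed): iterate the conditioned continuation w_{t+1} = τ_a w_t / 1τ_a w_t and read off P[a | aᵗ] = 1τ_a w_t.

```python
import numpy as np

# Probability clock (equation 6.1): only tau_a is needed for P[a | a^t].
s, c = np.sin(1.0), np.cos(1.0)
tau_a = 0.5 * np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])
one = np.ones(3)

w = np.array([0.75, 0.0, 0.25])       # w0
p = []                                # p[t] = P[a | a^t]
for t in range(30):
    pa = one @ tau_a @ w              # probability that another a is produced
    p.append(pa)
    w = tau_a @ w / pa                # state after observing that a

print([round(x, 3) for x in p[:6]])   # -> [0.548, 0.399, 0.308, 0.369, 0.731, 0.792]
```

The normalization cancels the damping factor α = 0.5 at every step, so only the rotation remains visible and the oscillation does not die out, matching the sustained pattern of Figure 3.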
8 Learning OOMs
This section describes a constructive algorithm for learning OOMs from data. For demonstration, it is applied to learning the probability clock from data.

There are two standard situations where one wants to learn a model of a stochastic system: (1) from a single long data sequence (or a few of them), one wishes to learn a model of a stationary process, and (2) from many short sequences, one wants to induce a model of a nonstationary process. OOMs can model both stationary and nonstationary processes, depending on the initial state vector w_0 (cf. proposition 1). The learning algorithms presented here are applicable in both cases. For the sake of notational convenience, we treat only the stationary case.

We shall address the following learning task. Assume that a sequence S = a_0 a_1 ... a_N is given and that S is a path of an unknown stationary LDP (X_t). We assume that the dimension of (X_t) is known to be m (the question of how to assess m from S is discussed after the presentation of the basic learning algorithm). We select m characteristic events A_1, ..., A_m of (X_t) (again, selection criteria are discussed after the presentation of the algorithm). Let A(A_1, ..., A_m) be an OOM of (X_t) that is interpretable with respect to A_1, ..., A_m. The learning task, then, is to induce from S an OOM Ã that is an estimate of A(A_1, ..., A_m) = (ℝ^m, (τ_a)_{a∈O}, w_0). We require that the estimation be consistent almost surely; that is, for almost every infinite path S_∞ = a_0 a_1 ... of (X_t), the sequence (Ã_n)_{n≥n_0} obtained from estimating OOMs from initial sequences S_n = a_0 a_1 ··· a_n of S_∞ converges to A(A_1, ..., A_m) (in some matrix norm).

An algorithm meeting these requirements shall now be described. As a first step, we estimate w_0. Proposition 8.1 states that w_0 = (P[A_1], ..., P[A_m]). Therefore, a natural estimate of w_0 is w̃_0 = (P̃[A_1], ..., P̃[A_m]), where P̃[A_i] is the estimate for P[A_i] obtained by counting frequencies of occurrence of A_i in S, as follows:

P̃[A_i] = (number of ā ∈ A_i occurring in S) / (number of ā occurring in S)
        = (number of ā ∈ A_i occurring in S) / (N − k + 1),   (8.1)

where k is the length of events A_i.

In the second step, we estimate the operators τ_a. According to proposition 8.2, for any sequence b̄_j it holds that

τ_a(τ_{b̄_j} w_0) = (P[b̄_j a A_1], ..., P[b̄_j a A_m]).   (8.2)
An m-dimensional linear operator is uniquely determined by the values it takes on m linearly independent vectors. This basic fact from linear algebra
leads us directly to an estimation of τ_a, using equation 8.2. We estimate m linearly independent vectors v_j := τ_{b̄_j} w_0 by putting ṽ_j = (P̃[b̄_j A_1], ..., P̃[b̄_j A_m]) (j = 1, ..., m). For the estimation we use a similar counting procedure as in equation 8.1:

P̃[b̄_j A_i] = (number of b̄_j ā (where ā ∈ A_i) occurring in S) / (N − l − k + 1),   (8.3)

where l is the length of b̄_j. Furthermore, we estimate the results v′_j := τ_a(τ_{b̄_j} w_0) of applying τ_a to v_j by ṽ′_j = (P̃[b̄_j a A_1], ..., P̃[b̄_j a A_m]), where

P̃[b̄_j a A_i] = (number of b̄_j a ā (where ā ∈ A_i) occurring in S) / (N − l − k).   (8.4)

Thus we obtain estimates (ṽ_j, ṽ′_j) of m argument-value pairs (v_j, v′_j) = (v_j, τ_a v_j) of applications of τ_a. From these estimated pairs, we can compute an estimate τ̃_a of τ_a through an elementary linear algebra construction: first collect the vectors ṽ_j as columns in a matrix Ṽ, and the vectors ṽ′_j as columns in a matrix W̃_a; then obtain τ̃_a = W̃_a Ṽ⁻¹.

This basic idea can be augmented in two respects:

1. Instead of simple sequences b̄_j, one can just as well take collective events B_j of some common length l to construct Ṽ = (P̃[B_j A_i]), W̃_a = (P̃[B_j a A_i]) (exercise). We will call the B_j indicative events.

2. Instead of constructing Ṽ, W̃_a as described above, one can also use raw count numbers, which saves the divisions on the right-hand side in equations 8.1, 8.3, and 8.4. That is, use V^# = (#_butlast B_j A_i), W_a^# = (#B_j a A_i), where #_butlast B_j A_i is the raw number of occurrences of B_j A_i in S_butlast := s_1 ... s_{N−1}, and #B_j a A_i is the raw number of occurrences of B_j a A_i in S. This gives the same matrices τ̃_a as the original procedure.

Assembled in an orderly fashion, the entire procedure works as follows (assume that model dimension m, indicative events B_j, and characteristic events A_i have already been selected):

Step 1. Compute the m × m matrix V^# = (#_butlast B_j A_i).

Step 2. Compute, for every a ∈ O, the m × m matrix W_a^# = (#B_j a A_i).

Step 3. Obtain τ̃_a = W_a^#(V^#)⁻¹.
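The three steps can be sketched with plain substring counting (numpy assumed; the function and argument names below are illustrative, not from the article):

```python
import numpy as np

def learn_oom_operators(S, alphabet, char_events, ind_events):
    """Steps 1-3: estimate the operators tau_a from a path S, given
    characteristic events A_i and indicative events B_j (two lists of m
    lists of strings, each event consisting of strings of one common length)."""
    m = len(char_events)

    def occ(s, word):
        # overlapping occurrences of `word` in the string `s`
        return sum(1 for i in range(len(s) - len(word) + 1)
                   if s.startswith(word, i))

    def count_matrix(s, middle):
        # entry (i, j) = number of occurrences of B_j + middle + A_i in s
        return np.array([[sum(occ(s, b + middle + a)
                              for b in ind_events[j] for a in char_events[i])
                          for j in range(m)] for i in range(m)], dtype=float)

    V_inv = np.linalg.inv(count_matrix(S[:-1], ""))           # Step 1 (on S_butlast)
    return {a: count_matrix(S, a) @ V_inv for a in alphabet}  # Steps 2 and 3
```

On the 20-symbol toy path analyzed next (S = abbbaaaabaabbbabbbbb, with A_1 = B_1 = {a} and A_2 = B_2 = {b}), this returns τ̃_a = (3/8 1/8; 5/8 −1/8) and τ̃_b = (−1/16 5/16; 1/16 11/16).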
The computational demands of this procedure are modest compared to today’s algorithms used in HMM parameter estimation. The counting for
V^# and W_a^# can be done by a single sweep of an inspection window (of length k + l + 1) over S. Multiplying or inverting m × m matrices essentially has a computational cost of O(m³/p) (this can be slightly improved, but the effects become noticeable only for very large m), where p is the degree of parallelization. The counting and inverting/multiplying operations together give a time complexity of this core procedure of O(N + nm³/p), where n is the size of O.

We shall now demonstrate the mechanics of the algorithm with an artificial toy example. Assume that the following path S of length 20 is given: S = abbbaaaabaabbbabbbbb. We estimate a two-dimensional OOM. We choose the simplest possible characteristic events A_1 = {a}, A_2 = {b} and indicative events B_1 = {a}, B_2 = {b}. First we estimate the invariant vector w_0, by putting w̃_0 = (#a, #b)/N = (8/20, 12/20). Then we obtain V^# and W_a^#, W_b^# by counting occurrences of subsequences in S:

V^#   = ( #_butlast aa    #_butlast ba )   =  ( 4   3 )
        ( #_butlast ab    #_butlast bb )      ( 4   7 ),

W_a^# = ( #aaa   #baa )   =  ( 2   2 )
        ( #aab   #bab )      ( 2   1 ),

W_b^# = ( #aba   #bba )   =  ( 1   2 )
        ( #abb   #bbb )      ( 3   5 ).

From these raw counting matrices we obtain estimates of the observable operators by

τ̃_a = W_a^#(V^#)⁻¹ = (  3/8     1/8  )
                      (  5/8    −1/8  ),

τ̃_b = W_b^#(V^#)⁻¹ = ( −1/16    5/16 )
                      (  1/16   11/16 ).

That is, we have arrived at an estimate

Ã = ( ℝ², ( ( 3/8  1/8; 5/8  −1/8 ), ( −1/16  5/16; 1/16  11/16 ) ), (8/20, 12/20) ).   (8.5)
This concludes the presentation of the learning algorithm in its core version. Obviously, before one can start the algorithm, one has to fix the model dimension and select characteristic and indicative events. These two questions shall now be briefly addressed.

In practice, the problem of determining the "true" model dimension seldom arises. Empirical systems quite likely are very high-dimensional or even infinite-dimensional. Given finite data, one cannot capture all of the true dimensions. Rather, the task is to determine m such that learning an m-dimensional model reveals m significant process dimensions, while any contributions of higher process dimensions are insignificant in the face of estimation error. Put bluntly, the task is to fix m such that the data are neither overfitted nor underexploited. A practical solution to this task capitalizes on the idea that a size-m model does not overfit the data in S if the "noisy" matrix V^# has full rank. Estimating the true rank of a noisy matrix is a common task in numerical linear algebra, for which heuristic solutions are known (Golub & Loan, 1996). A practical procedure for determining an appropriate model dimension, which exploits these techniques, is detailed in Jaeger (1998).

Selecting optimal characteristic and indicative events is an intricate issue. The quality of the estimates Ã varies considerably with different characteristic and indicative events. The ramifications of this question are not fully understood, and only some preliminary observations can be supplied here. A starting point toward a principled handling of the problem is certain conditions on A_i, B_j reported in Jaeger (1998). If A_i, B_j are chosen according to these conditions, it is guaranteed that the resulting estimate Ã is itself interpretable with respect to the chosen characteristic events A_1, ..., A_m; that is, it holds that Ã = Ã(A_1, ..., A_m). In turn, this warrants that Ã yields an unbiased estimate of certain parameters that define (X_t).
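A minimal illustration of such a rank heuristic (a crude threshold rule of my own for illustration, not the procedure of Jaeger, 1998; numpy assumed):

```python
import numpy as np

def numerical_rank(V, rel_tol=1e-2):
    """Count singular values above rel_tol times the largest one.
    (Illustrative only; a real procedure would derive the cutoff from an
    estimate of the counting noise in V.)"""
    sv = np.linalg.svd(np.asarray(V, dtype=float), compute_uv=False)
    return int(np.sum(sv > rel_tol * sv[0]))

# A counting matrix whose third process dimension is lost in noise
# is judged to support only a two-dimensional model:
print(numerical_rank([[4.0, 3.0, 0.01],
                      [4.0, 7.0, 0.02],
                      [0.03, 0.01, 0.04]]))   # -> 2
```

If `numerical_rank(V)` falls below the chosen m, the data do not support an m-dimensional model and m should be lowered.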
A second observation is that the variance of the estimates Ã depends on the selection of A_i, B_j. For instance, characteristic events should be chosen such that sequences ā_x, ā_y ∈ A_i collected in the same characteristic event have a high correlation in the sense that the probabilities P̃_S[ā_x | B_j], P̃_S[ā_y | B_j] (where j = 1, ..., m) correlate strongly. The better this condition is met, the smaller is the variance of P̃_S[A_i] and, accordingly, the variance of Ṽ and W̃_a. Several other rules of thumb similar to this one are discussed in Jaeger (1998).

A third and last observation is that characteristic events can be chosen such that available domain knowledge is exploited. For instance, in estimating an OOM of a written English text string S, one might collect in each characteristic or indicative event such letter strings as belong to the same linguistic (phonological or morphological) category. This would result in an interpretable model whose state dimensions represent the probabilities for these categories to be realized next in the process.
We now apply the learning procedure to the probability clock C introduced in sections 6 and 7.² (We provide only a sketch here. A more detailed account is contained in Jaeger, 1998.) C was run to generate a path S of length N = 30,000. C was started in an invariant state (cf. proposition 1); therefore S is stationary. We shall construct from S an estimate C̃ of C.

Assume that we know that the process dimension is m = 3 (this value was also obtained by the dimension estimation heuristics mentioned above). The rules of thumb noted above led to characteristic events A_1 = {aa}, A_2 = {ab}, A_3 = {ba, bb} and indicative events B_1 = {aa}, B_2 = {ab, bb}, B_3 = {ba} (details of how these events were selected are documented in Jaeger, 1998). Using these characteristic and indicative events, an OOM C̃ = (ℝ³, (τ̃_a, τ̃_b), w̃_0) was estimated using the algorithm described in this section.

We briefly highlight the quality of the model thus obtained. First, we compare the operators of C̃ with the operators of the interpretable version C(A_1, A_2, A_3) = (ℝ³, (τ_a, τ_b), w_0) of the probability clock (cf. equation 7.6). The average absolute error of matrix entries in τ̃_a, τ̃_b versus τ_a, τ_b was found to be approximately .0038 (1.7% of the average correct value). A comparison of the matrices τ̃_a, τ̃_b versus τ_a, τ_b is not too illuminating, however, because some among the matrix entries have little effect on the process (i.e., varying them greatly would only slightly alter the distribution of the process). Conversely, information in S contributes only weakly to estimating these matrix entries, which are therefore likely to deviate considerably from the corresponding entries in the original matrices. Indeed, taken as a sequence generator, the estimated model is much closer to the original process than a matrix entry deviation of 1.7% might suggest. Figure 5 illustrates the fit of probabilities and states computed with C̃ versus the true values.

Figure 5a is similar to Figure 3, plotting the probabilities P̃[a | aᵗ] against time steps t. The solid line represents the probabilities of the original probability clock. Even after 30 time steps, the predictions are very close to the true values. Figure 5b is an excerpt of Figure 4b and shows the eight most frequently taken states of the original probability clock. Figure 5c is an overlay of Figure 5b with a 100,000-step run of the estimated model C̃. Every hundredth step of the 100,000-step run was additionally plotted into Figure 5b to obtain Figure 5c. The original states closely agree with the states of the estimated model.
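A rough end-to-end replication of this experiment can be sketched as follows (numpy assumed; the random seed, helper names, and error tolerance are mine, so the exact error figure will differ from the .0038 reported above):

```python
import numpy as np

# Probability clock (equation 6.1).
s_, c_ = np.sin(1.0), np.cos(1.0)
ta = 0.5 * np.array([[1.0, 0.0, 0.0], [0.0, c_, s_], [0.0, -s_, c_]])
tb = np.outer([0.75, 0.0, 0.25],
              [0.5, 1 - 0.5 * c_ + 0.5 * s_, 1 - 0.5 * s_ - 0.5 * c_])
ops, one = {"a": ta, "b": tb}, np.ones(3)

# Start in an invariant state: eigenvector of mu = ta + tb for eigenvalue 1.
vals, vecs = np.linalg.eig(ta + tb)
w = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
w /= one @ w

# Generate a stationary path of length 30,000.
rng = np.random.default_rng(1)
out = []
for _ in range(30000):
    x = "a" if rng.random() < one @ ta @ w else "b"
    out.append(x)
    w = ops[x] @ w / (one @ ops[x] @ w)
S = "".join(out)

# Re-estimate with A1={aa}, A2={ab}, A3={ba,bb}; B1={aa}, B2={ab,bb}, B3={ba}.
A, B = [["aa"], ["ab"], ["ba", "bb"]], [["aa"], ["ab", "bb"], ["ba"]]

def occ(seq, word):
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i))

def cmat(seq, mid):
    return np.array([[sum(occ(seq, b + mid + a) for b in B[j] for a in A[i])
                      for j in range(3)] for i in range(3)], dtype=float)

V_inv = np.linalg.inv(cmat(S[:-1], ""))
ta_est, tb_est = cmat(S, "a") @ V_inv, cmat(S, "b") @ V_inv

# Compare with the interpretable version C(A1, A2, A3) of equation 7.6.
ta_true = np.array([[0.645, -0.395, 0.125], [0.355, 0.395, -0.125], [0.0, 1.0, 0.0]])
tb_true = np.array([[0.0, 0.0, 0.218], [0.0, 0.0, 0.329], [0.0, 0.0, 0.452]])
err = np.mean(np.abs(np.concatenate([ta_est - ta_true, tb_est - tb_true])))
assert err < 0.05   # loose bound; the article reports about .0038 at this N
```

Since the estimator targets the interpretable OOM directly, the estimated matrices can be compared entry-by-entry with equation 7.6 without any further coordinate change.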
Figure 5a is similar to Figure 3, plotting the probabilities P Q [a | at ] against time steps t. The solid line represents the probabilities of the original probability clock. Even after 30 time steps, the predictions are very close to the true values. Figure 5b is an excerpt of Figure 4b and shows the eight most frequently taken states of the original probability clock. Figure 5c is an overlay of Figure 5b with a 100,000-step run of the estimated model C Q. Every hundredth step of the 100,000-step run was additionally plotted into Figure 5b to obtain Figure 5c. The original states closely agree with the states of the estimated model. I conclude this section by emphasizing a shortcoming of the learning algorithm in its present form. An OOM is characterized by the three conditions stated in denition 1. The learning algorithm guarantees only that 2
The calculations were done using the Mathematica software package. Data and Mathematica programs are available online at www.gmd.de/People/Herbert.Jaeger/.
Figure 5: Comparison of estimated model versus original probability clock. (a) Probabilities P(a | ā) as captured by the learned model (dots) versus the original (solid line). (b) The most frequent states of the original. (c) States visited by the learned model. See the text for details.
the estimated structure Ã satisfies the first two conditions (an easy exercise). Unfortunately, it may happen that Ã does not satisfy condition 3 and thus might not be a valid OOM. That is, Ã might produce negative "probabilities" P[ā]. For instance, the structure Ã estimated in equation 8.5 is not a valid OOM; one obtains 1τ_bτ_aτ_bτ_aτ_b w_0 = −0.00029. A procedure for transforming an "almost-OOM" into a nearest (in some matrix norm) valid OOM would be highly desirable. Progress is currently blocked by the lack of a decision procedure for determining whether an OOM-like structure A satisfies condition 3. In practice, however, even an "almost-OOM" can be useful. If the model dimension is selected properly, the learning procedure yields either valid OOMs or pseudo-OOMs that are close to valid ones. Even the latter yield good estimates P̃[ā] if ā is not too long and P[ā] is not too close to zero. Further examples and discussion can be found in Jaeger (1997b, 1998).
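The failure can be reproduced directly from the estimate of equation 8.5 (a sketch; numpy assumed):

```python
import numpy as np

# Estimated operators and starting vector from equation 8.5.
ta = np.array([[3/8, 1/8], [5/8, -1/8]])
tb = np.array([[-1/16, 5/16], [1/16, 11/16]])
w0 = np.array([8/20, 12/20])
one = np.ones(2)

# "Probability" assigned to the sequence babab: 1 tb ta tb ta tb w0.
p_babab = one @ tb @ ta @ tb @ ta @ tb @ w0
print(p_babab)       # about -0.00029 (exactly -3/10240): negative, so the
                     # estimated structure violates condition 3 of definition 1
assert p_babab < 0
```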
9 Conclusion
The results reported in this article are all variations on a single insight: the change of predictive knowledge that we have about a stochastic system is a linear phenomenon. This leads to the concept of observable operators, which has been detailed here for discrete-time, finite-valued processes but is applicable to every stochastic process (Jaeger, 1999). The linear nature of observable operators makes it possible to cast the system identification task purely in terms of numerical linear algebra.

The proposed system identification technique proceeds in two stages: first, determine the appropriate model dimension and choose characteristic and indicative events; second, construct the counting matrices and from them the model. While the second stage is purely mechanical, the first involves some heuristics. The situation is reminiscent of HMM estimation, where in a first stage a model structure has to be determined (usually by hand), or of neural network learning, where a network topology has to be designed before the mechanical parameter estimation can start. Although the second stage of model construction for OOMs is transparent and efficient, it remains to be seen how well the subtleties involved with the first stage can be mastered in actual applications. Furthermore, the problem remains to be solved of how to deal with the fact that the learning algorithm may produce invalid OOMs. One reason for optimism is that these questions can be posed in terms of numerical linear algebra, which arguably is the best understood of all areas of applied mathematics.

Appendix A: Proof of Proposition 1
A numerical function P on the set of finite initial sequences can be uniquely extended to the probability distribution of a stochastic process if the following two conditions are met:

1. P is a probability measure on the set of initial sequences of length k + 1, for all k ≥ 0. That is, (a) P[a_{i0} ... a_{ik}] ≥ 0 and (b) ∑_{a_{i0}···a_{ik} ∈ O^{k+1}} P[a_{i0} ... a_{ik}] = 1.

2. The values of P on the initial sequences of length k + 1 agree with continuation of the process in the sense that P[a_{i0} ... a_{ik}] = ∑_{b∈O} P[a_{i0} ... a_{ik} b].

The process is stationary if additionally the following condition holds:

3. P[a_{i0} ... a_{ik}] = ∑_{b_0...b_s ∈ O^{s+1}} P[b_0 ... b_s a_{i0} ... a_{ik}] for all a_{i0} ... a_{ik} ∈ O^{k+1} and s ≥ 0.

Point 1a is warranted by virtue of condition 3 from definition 1. Point 1b is a consequence of conditions 1 and 2 from the definition (exploit that condition 2 implies that left-multiplying a vector by μ does not change the sum of components of the vector):

∑_{ā ∈ O^{k+1}} P[ā] = ∑_{ā ∈ O^{k+1}} 1τ_ā w_0
  = 1 (∑_{a∈O} τ_a) ··· (∑_{a∈O} τ_a) w_0     (k + 1 factors ∑_{a∈O} τ_a)
  = 1 μ ··· μ w_0 = 1 w_0 = 1.

For proving point 2, again exploit condition 2:

∑_{b∈O} P[āb] = ∑_{b∈O} 1τ_b τ_ā w_0 = 1 μ τ_ā w_0 = 1 τ_ā w_0 = P[ā].

Finally, the stationarity criterion 3 is obtained by exploiting μw_0 = w_0:

∑_{b̄ ∈ O^{s+1}} P[b̄ā] = ∑_{b̄ ∈ O^{s+1}} 1 τ_ā τ_b̄ w_0 = 1 τ_ā μ ··· μ w_0     (s + 1 factors μ)
  = 1 τ_ā w_0 = P[ā].

Appendix B: Proof of Proposition 2
Let b̄ ∈ O*, and let g_b̄ = ∑_{i=1}^n α_i g_{c̄_i} be the linear combination of g_b̄ from basis elements of G. Let d̄ ∈ O⁺. Then we obtain the statement of the proposition through the following calculation:

(t_a(g_b̄))(d̄) = (t_a(∑_{i=1}^n α_i g_{c̄_i}))(d̄) = (∑_i α_i t_a(g_{c̄_i}))(d̄)
  = (∑_i α_i P[a | c̄_i] g_{c̄_i a})(d̄) = ∑_i α_i P[a | c̄_i] P[d̄ | c̄_i a]
  = ∑_i α_i P[a | c̄_i] · P[c̄_i a d̄] / (P[a | c̄_i] P[c̄_i]) = ∑_i α_i · P[c̄_i] P[ad̄ | c̄_i] / P[c̄_i]
  = g_b̄(ad̄) = P[ad̄ | b̄] = P[a | b̄] P[d̄ | b̄a] = P[a | b̄] g_{b̄a}(d̄).
Appendix C: Proof of Proposition 3
From an iterated application of equation 4.2, it follows that t_{a_{ik}} ··· t_{a_{i0}} g_ε = P[a_{i0} ... a_{ik}] g_{a_{i0}...a_{ik}}. Therefore, it holds that

g_{a_{i0}...a_{ik}} = ∑_{i=1,...,n} (α_i / P[a_{i0} ... a_{ik}]) g_{b̄_i}.

Interpreting the vectors g_{b̄_i} and g_{a_{i0}...a_{ik}} as probability distributions (cf. equation 4.1), it is easy to see that ∑_{i=1,...,n} α_i / P[a_{i0} ... a_{ik}] = 1, from which the statement immediately follows.
Appendix D: Proof of Proposition 4
To see proposition 4.1, let (G, (t_a)_{a∈O}, g_ε) be the predictor-space OOM of (X_t). Choose {b̄_1, ..., b̄_m} ⊂ O* such that the set {g_{b̄_i} | i = 1, ..., m} is a basis of G. Then it is an easy exercise to show that an OOM A = (ℝ^m, (τ_a)_{a∈O}, w_0) and an isomorphism π: G → ℝ^m exist such that (1) π(g_ε) = w_0, (2) π(g_{b̄_i}) = e_i, where e_i is the ith unit vector, and (3) π(t_a d) = τ_a π(d) for all a ∈ O, d ∈ G. These properties imply that A is an OOM of (X_t).

In order to prove proposition 4.2, let again (G, (t_a)_{a∈O}, g_ε) be the predictor-space OOM of (X_t). Let C be the linear subspace of ℝ^k spanned by the vectors {w_0} ∪ {τ_ā w_0 | ā ∈ O⁺}. Let {τ_{b̄_1} w_0, ..., τ_{b̄_l} w_0} be a basis of C. Define a linear mapping σ from C to G by putting σ(τ_{b̄_i} w_0) := P[b̄_i] g_{b̄_i}, where i = 1, ..., l. σ is called the canonical projection of A on the predictor-space OOM. By a straightforward calculation, it can be shown that σ(w_0) = g_ε and that for all c̄ ∈ O⁺ it holds that σ(τ_c̄ w_0) = P[c̄] g_c̄ (cf. Jaeger, 1997a, for these and other properties of σ). This implies that σ is surjective, which in turn yields m ≤ k.

Appendix E: Proof of Proposition 7
From the definition of process dimension (see definition 2), it follows that m sequences ā_1, ..., ā_m and m sequences b̄_1, ..., b̄_m exist such that the m × m matrix (P[ā_j | b̄_i]) is nonsingular. Let k be the maximal length occurring in the sequences ā_1, ..., ā_m. Define collective events C_j of length k by using the sequences ā_j as initial sequences; that is, put C_j := {ā_j c̄ | c̄ ∈ O^{k−|ā_j|}}, where |ā_j| denotes the length of ā_j. It holds that (P[C_j | b̄_i]) = (P[ā_j | b̄_i]).

We transform the collective events C_j in two steps in order to obtain characteristic events. In the first step, we make them disjoint. Observe that due to their construction, two collective events C_{j1}, C_{j2} are either disjoint or one is properly included in the other. We define new, nonempty, pairwise disjoint collective events C′_j := C_j \ ∪_{C_x ⊊ C_j} C_x by taking away from C_j all collective events properly included in it. It is easily seen that the matrix (P[C′_j | b̄_i]) can be obtained from (P[C_j | b̄_i]) by subtracting certain rows from others. Therefore, this matrix is nonsingular too.

In the second step, we enlarge the C′_j (while preserving disjointness) in order to arrive at collective events A_j that exhaust O^k. Put C′_0 = O^k \ (C′_1 ∪ ··· ∪ C′_m). If C′_0 = ∅, then A_j := C′_j (j = 1, ..., m) are characteristic events. If C′_0 ≠ ∅, consider the m × (m + 1) matrix (P[C′_j | b̄_i])_{i=1,...,m, j=0,...,m}. It has rank m and column vectors v_j = (P[C′_j | b̄_1], ..., P[C′_j | b̄_m]). If v_0 is the null vector, put A_1 := C′_0 ∪ C′_1, A_2 := C′_2, ..., A_m := C′_m to obtain characteristic events. If v_0 is not the null vector, let v_0 = ∑_{ν=1,...,m} α_ν v_ν be its linear combination from the other column vectors. Since all v_ν are nonnull, nonnegative vectors, some α_{ν0} must be properly greater than 0. A basic linear algebra argument (exercise) shows that the m × m matrix made from the column vectors v_1, ..., v_{ν0} + v_0, ..., v_m has rank m. Put A_1 := C′_1, ..., A_{ν0} := C′_{ν0} ∪ C′_0, ..., A_m := C′_m to obtain characteristic events.

Acknowledgments
I am deeply grateful to Thomas Christaller for his confidence and great support. This article has benefited enormously from advice given by Hisashi Ito, Shun-ichi Amari, Rafael Núñez, Volker Mehrmann, and two extremely constructive anonymous reviewers, whom I heartily thank. The research described in this article was carried out under a postdoctoral grant from the German National Research Center for Information Technology (GMD).

References

Bengio, Y. (1996). Markovian models for sequential data (Tech. Rep. No. 1049). Montreal: Département d'Informatique et Recherche Opérationnelle, Université de Montréal. Available online at: http://www.iro.umontreal.ca/labs/neuro/pointeurs/hmmsTR.ps.
Berman, A., & Plemmons, R. (1979). Nonnegative matrices in the mathematical sciences. San Diego, CA: Academic Press.
Doob, J. (1953). Stochastic processes. New York: Wiley.
Elliott, R., Aggoun, L., & Moore, J. (1995). Hidden Markov models: Estimation and control. New York: Springer-Verlag.
Gilbert, E. (1959). On the identifiability problem for functions of finite Markov chains. Annals of Mathematical Statistics, 30, 688–697.
Golub, G., & Loan, C. van. (1996). Matrix computations (3rd ed.). Baltimore: Johns Hopkins University Press.
Heller, A. (1965). On stochastic processes derived from Markov chains. Annals of Mathematical Statistics, 36, 1286–1291.
Iosifescu, M., & Theodorescu, R. (1969). Random processes and learning. New York: Springer-Verlag.
Ito, H. (1992). An algebraic study of discrete stochastic systems. Unpublished doctoral dissertation, University of Tokyo, Bunkyo-ku, Tokyo. Available by ftp from http://kuro.is.sci.toho-u.ac.jp:8080/english/D/.
Ito, H., Amari, S.-I., & Kobayashi, K. (1992). Identifiability of hidden Markov information sources and their minimum degrees of freedom. IEEE Transactions on Information Theory, 38(2), 324–333.
Jaeger, H. (1997a). Observable operator models and conditioned continuation representations (Arbeitspapiere der GMD No. 1043). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1997b). Observable operator models II: Interpretable models and model induction (Arbeitspapiere der GMD No. 1083). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1998). Discrete-time, discrete-valued observable operator models: A tutorial (GMD Report No. 42). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1999). Characterizing distributions of stochastic processes by linear operators (GMD Report No. 62). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Narendra, K. (1995). Identification and control. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 477–480). Cambridge, MA: MIT Press/Bradford Books.
Rabiner, L. (1990). A tutorial on hidden Markov models and selected applications in speech recognition. In A. Waibel & K.-F. Lee (Eds.), Readings in speech recognition (pp. 267–296). San Mateo, CA: Morgan Kaufmann.
Smallwood, R., & Sondik, E. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1071–1088.
1398
H. Jaeger
Jaeger, H. (1997a). Observable operator models and conditioned continuation representations (Arbeitspapiere der GMD No. 1043). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1997b). Observable operator models II: Interpretable models and model induction (Arbeitspapiere der GMD No. 1083). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1998). Discrete-time, discrete-valued observable operator models: A tutorial (GMD Report No. 42). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Jaeger, H. (1999). Characterizing distributions of stochastic processes by linear operators (GMD Report No. 62). Sankt Augustin: GMD. Available online at: http://www.gmd.de/People/Herbert.Jaeger/Publications.html.
Narendra, K. (1995). Identification and control. In M. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 477–480). Cambridge, MA: MIT Press/Bradford Books.
Rabiner, L. (1990). A tutorial on hidden Markov models and selected applications in speech recognition. In A. Waibel & K.-F. Lee (Eds.), Readings in speech recognition (pp. 267–296). San Mateo, CA: Morgan Kaufmann.
Smallwood, R., & Sondik, E. (1973). The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21, 1071–1088.

Received January 5, 1998; accepted April 8, 1999.
LETTER
Communicated by David Saad
Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons Shun-ichi Amari Hyeyoung Park Kenji Fukumizu RIKEN Brain Science Institute, Wako-shi, Hirosawa, Saitama 351-0198, Japan

The natural gradient learning method is known to have ideal performance for on-line training of multilayer perceptrons. It avoids plateaus, which give rise to slow convergence of the backpropagation method, and it is Fisher efficient, whereas the conventional method is not. However, implementing the method requires calculating the Fisher information matrix and its inverse, which is practically very difficult. This article proposes an adaptive method of directly obtaining the inverse of the Fisher information matrix. It generalizes the adaptive Gauss-Newton algorithms and provides a solid theoretical justification for them. Simulations show that the proposed adaptive method works very well for realizing natural gradient learning.

1 Introduction
Natural gradient learning (Amari, 1998; Yang & Amari, 1998) is an on-line learning algorithm that takes the Riemannian metric of the parameter space into account. The method is based on information geometry (Amari, 1985; Amari & Nagaoka, 2000) and uses the Riemannian metric of the parameter space to define the steepest descent direction of a cost function. It may be regarded as a version of the stochastic descent method that uses a matrix learning rate equal to the inverse of the Fisher information matrix, which plays the role of the Riemannian metric in the space of perceptrons. It overcomes two shortcomings of backpropagation learning: statistical inefficiency and slow convergence.

On-line learning uses each training example once, when it is observed, whereas batch learning stores all the examples so that any of them can be reused later. Hence the performance of on-line learning is in general worse than that of batch learning, and this is indeed true for the conventional backpropagation method. Amari (1998), however, shows that natural gradient learning achieves Fisher efficiency: according to the Cramér-Rao bound, this is the best asymptotic performance that any unbiased learning algorithm can achieve.

The backpropagation method is also known to be very slow in convergence. This is because of plateaus, in which the parameter is trapped in the process of learning, taking a long time to get free of them.

Neural Computation 12, 1399–1409 (2000) © 2000 Massachusetts Institute of Technology

1400

S. Amari, H. Park, and K. Fukumizu

Statistical-mechanical analysis (Saad & Solla, 1995) has made it clear that plateaus are ubiquitous in backpropagation learning and that the various acceleration methods proposed so far cannot avoid or escape them. Amari (1998) and Yang and Amari (1998) suggested that the natural gradient algorithm has the possibility of avoiding plateaus or quickly escaping from them. This has been confirmed theoretically by Rattray, Saad, and Amari (1998), who demonstrated the ideal performance of the natural gradient method in the thermodynamic limit.

However, it is difficult to calculate the Fisher information matrix of multilayer perceptrons, and even when it is obtained, its inversion is computationally expensive. Yang and Amari (1998) gave an explicit form of the Fisher information matrix and the computational complexity of its inversion under the assumption that the input signals are gaussian. Rattray et al. (1998) gave it in terms of statistical-mechanical order parameters. These results show that natural gradient learning is difficult to implement for practical large-scale problems, however excellent its performance.

This article proposes an adaptive method of obtaining the inverse of the Fisher information matrix directly, without any matrix inversion, by applying the Kalman filter technique. The proposed adaptive method generalizes the adaptive Gauss-Newton method (LeCun, Bottou, Orr, & Müller, 1998) and provides a solid theoretical justification based on different philosophical ideas. Computer experiments demonstrate that the proposed method has almost the same performance as the original natural gradient method and that its convergence is surprisingly faster than that of the conventional backpropagation method. This is a preliminary study; more general performance results will be reported in the future by Park, Amari, and Fukumizu (2000).

2 Natural Gradient for MLP
Let us consider a multilayer perceptron (MLP) that receives an n-dimensional input signal $x$ and emits a scalar output signal $y$. When it has $m$ hidden units and one linear output unit, its input-output behavior is written as

$$ y = \sum_{a=1}^{m} v_a \varphi(w_a \cdot x + b_a) + b_0 + \xi, \qquad (2.1) $$
where $w_a$ is an n-dimensional connection weight vector from the input to the $a$th hidden unit ($a = 1, \ldots, m$); $v_a$ is the connection weight from the $a$th hidden unit to the output unit; $b_a$ and $b_0$ are the biases of the $a$th hidden node and the output node, respectively; and $\xi$ is random noise subject to $N(0, \sigma^2)$. The function $\varphi$ is a sigmoidal-type activation function. The parameters $\{w_1, \ldots, w_m; b; v\}$ can be summarized in a single $\{m(n+2)+1\}$-dimensional vector $\theta$. The network is stochastic because of the noise $\xi$. The set
Natural Gradient Learning for Multilayer Perceptrons
1401
$S$ of all such stochastic multilayer perceptrons forms a manifold in which $\theta$ plays the role of a coordinate system. A multilayer perceptron having parameter $\theta$ is associated with the conditional probability distribution of the output $y$ given the input $x$,

$$ p(y \mid x, \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \bigl( y - f(x, \theta) \bigr)^2 \right\}, \qquad (2.2) $$

where

$$ f(x, \theta) = \sum_{a=1}^{m} v_a \varphi(w_a \cdot x + b_a) + b_0 \qquad (2.3) $$

is the mean value of $y$ given input $x$. The space $S$ of multilayer perceptrons is identified with the set of all conditional probability distributions of the form of equation 2.2. The logarithm of equation 2.2,

$$ l(y \mid x, \theta) = -\frac{1}{2\sigma^2} \bigl\{ y - f(x, \theta) \bigr\}^2 - \log\bigl(\sqrt{2\pi}\,\sigma\bigr), \qquad (2.4) $$

is called the log-likelihood. Except for a scale factor and a constant term, it is the negative of the squared error when $y$ is the target value for input $x$. Hence, maximizing the likelihood is equivalent to minimizing the squared error,

$$ l^*(y, x, \theta) = \frac{1}{2} \bigl\{ y - f(x, \theta) \bigr\}^2. \qquad (2.5) $$
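As a concrete illustration, the model of equations 2.1 through 2.5 can be sketched in a few lines of Python. The parameter layout, the choice of tanh for the sigmoidal activation $\varphi$, and all names below are illustrative assumptions, not part of the article.

```python
import numpy as np

def mlp_forward(x, W, b, v, b0):
    """Mean output f(x, theta) of the stochastic MLP (equation 2.3).

    W  : (m, n) array of hidden weight vectors w_a
    b  : (m,)   hidden biases b_a
    v  : (m,)   hidden-to-output weights v_a
    b0 : scalar output bias
    """
    # phi is taken to be tanh here (an assumption; the article only
    # requires a sigmoidal-type activation).
    return v @ np.tanh(W @ x + b) + b0

def squared_error(y, x, W, b, v, b0):
    """Loss l*(y, x, theta) of equation 2.5."""
    return 0.5 * (y - mlp_forward(x, W, b, v, b0)) ** 2
```

A stochastic output in the sense of equation 2.1 is then obtained by adding gaussian noise of variance $\sigma^2$ to `mlp_forward`.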
Let us consider the training of a perceptron. Given a sequence of training examples $\{(x_1, y_1^*), (x_2, y_2^*), \ldots\}$, the conventional on-line backpropagation learning algorithm is written as

$$ \theta_{t+1} = \theta_t - \eta_t \nabla l^*(x_t, y_t^*; \theta_t), \qquad (2.6) $$

where $\theta_t$ is the network parameter at time $t$, $\eta_t$ is a learning rate that may depend on $t$,

$$ \nabla l^*(x, y^*, \theta) = \left\{ \frac{\partial}{\partial \theta_i}\, l^*(x, y^*; \theta) \right\} \qquad (2.7) $$

is the gradient of the loss function $l^*$, and $y_t^*$ is the desired output signal given by the teacher.

The gradient $\nabla l^*$ is in general believed to be the steepest descent direction of the scalar function $l^*$. This is true only when the space $S$ is Euclidean and $\theta$ is an orthonormal coordinate system. In our case of stochastic perceptrons, the
parameter space $S$ has a Riemannian structure (Amari, 1998; Yang & Amari, 1998), so the ordinary gradient does not represent the steepest descent direction of the loss function. The steepest descent direction of the loss function $l^*(\theta)$ in a Riemannian space is given (Amari, 1998) by

$$ -\tilde\nabla l^*(\theta) = -G^{-1}(\theta)\, \nabla l^*(\theta), \qquad (2.8) $$

where $G^{-1}$ is the inverse of the matrix $G = (g_{ij})$, called the Riemannian metric tensor. This gradient is called the natural gradient of the loss function $l^*(\theta)$ in the Riemannian space, and it suggests the natural gradient descent algorithm of the form

$$ \theta_{t+1} = \theta_t - \eta_t\, \tilde\nabla l^*(\theta_t) \qquad (2.9) $$

(Amari, 1998; Edelman, Arias, & Smith, 1998). In the case of a stochastic multilayer perceptron, the Riemannian metric tensor $G(\theta) = (g_{ij}(\theta))$ is given by the Fisher information matrix (Amari, 1998),

$$ g_{ij}(\theta) = E\left[ \frac{\partial l(y \mid x; \theta)}{\partial \theta_i}\, \frac{\partial l(y \mid x; \theta)}{\partial \theta_j} \right], \qquad (2.10) $$
where $E$ denotes expectation with respect to the input-output pair $(x, y)$ given by equation 2.2. Note that the teacher signal $y^*$ is not used in this definition.

It was proved that on-line learning based on the natural gradient is asymptotically as efficient as the optimal batch algorithm. It was also suggested that natural gradient learning escapes from plateaus much more easily than ordinary gradient learning (Amari, 1998; Yang & Amari, 1998). The statistical-physics approach elucidated the ideal, plateau-avoiding behavior of natural gradient learning (Rattray et al., 1998).

2.1 Adaptive Estimation of the Natural Gradient. The Fisher information matrix of equation 2.10 at $\theta_t$ can be rewritten, by using equation 2.4, as

$$ G_t = E\left[ \frac{\partial l(y \mid x; \theta_t)}{\partial \theta_t} \left( \frac{\partial l(y \mid x; \theta_t)}{\partial \theta_t} \right)' \right] \qquad (2.11) $$
$$ = \frac{1}{\sigma^4}\, E\bigl[ \{y - f(x; \theta_t)\}^2 \bigr]\, E\left[ \frac{\partial f(x; \theta_t)}{\partial \theta_t} \left( \frac{\partial f(x; \theta_t)}{\partial \theta_t} \right)' \right] \qquad (2.12) $$
$$ = \frac{1}{\sigma^2}\, E\left[ \frac{\partial f(x; \theta_t)}{\partial \theta_t} \left( \frac{\partial f(x; \theta_t)}{\partial \theta_t} \right)' \right], \qquad (2.13) $$
where the prime denotes transposition of a vector or matrix. To calculate the expectation, we have to know the probability distribution $q(x)$ of the input $x$. Such knowledge of the input distribution, however, is hardly available in practical problems. In addition, it is difficult to obtain an explicit form of
$G$ even if we know it. Moreover, inversion of $G$ is computationally costly when the number $m$ of hidden units is large. All of this suggests that natural gradient learning is not practical. To overcome these difficulties, we propose an adaptive method of directly obtaining an estimate of $G^{-1}(\theta_t)$ that does not require costly inversion of $G$.

Since the Fisher information at $\theta_t$ is obtained by the expectation over inputs $x$ of $\nabla f (\nabla f)'$, we use the adaptive estimate $\hat G_t$ of $G(\theta_t)$ given by

$$ \hat G_t = (1 - \varepsilon_t)\, \hat G_{t-1} + \varepsilon_t\, \nabla f(x_{t-1}, \theta_{t-1})\, \nabla f(x_{t-1}, \theta_{t-1})', \qquad (2.14) $$

where $\varepsilon_t$ is a time-dependent learning rate. Typical examples are $\varepsilon_t = c/t$ and $\varepsilon_t = \varepsilon_0$. When we put $\varepsilon_t = 1/t$, $\hat G_t$ is equal to the arithmetic mean of $\nabla f_i (\nabla f_i)'$ except for the scalar factor $1/\sigma^2$, where $\nabla f_i = \nabla f(x_i, \theta_i)$, $i = 1, \ldots, t$. When $\theta_t$ converges to $\theta^*$, $\hat G_t$ converges to $G(\theta^*)$.

Our purpose is to obtain $\hat G_t^{-1}$ directly. To this end, we use the well-known Kalman filter technique, applying the following identity, which holds for a nonsingular symmetric $n \times n$ matrix $A$ and an $n \times k$ matrix $B$ ($k < n$),
$$ (aA + bBB')^{-1} = \frac{1}{a} A^{-1} - \frac{1}{a} \left( \sqrt{\tfrac{b}{a}}\, A^{-1}B \right) \left( I + \tfrac{b}{a}\, B' A^{-1} B \right)^{-1} \left( \sqrt{\tfrac{b}{a}}\, A^{-1}B \right)', \qquad (2.15) $$
where $a$ and $b$ are scalars (see Bottou, 1998). We apply the identity to the right-hand side of equation 2.14 with $A = \hat G_t$, $B = \nabla f_t$, $a = 1 - \varepsilon_t$, and $b = \varepsilon_t$. We then have the following adaptive algorithm for estimating the inverse of the Fisher information:

$$ \hat G_{t+1}^{-1} = \frac{1}{1 - \varepsilon_t}\, \hat G_t^{-1} - \frac{\varepsilon_t}{1 - \varepsilon_t}\, \frac{\hat G_t^{-1}\, \nabla f_t\, (\nabla f_t)'\, \hat G_t^{-1}}{(1 - \varepsilon_t) + \varepsilon_t\, (\nabla f_t)'\, \hat G_t^{-1}\, \nabla f_t}. \qquad (2.16) $$

When $\varepsilon_t$ is small, we may approximate this by the simpler

$$ \hat G_{t+1}^{-1} = (1 + \varepsilon_t)\, \hat G_t^{-1} - \varepsilon_t\, \hat G_t^{-1}\, \nabla f_t\, (\nabla f_t)'\, \hat G_t^{-1}. \qquad (2.17) $$
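Equation 2.16 follows from identity 2.15 with $a = 1 - \varepsilon_t$, $b = \varepsilon_t$, $A = \hat G_t$, and $B = \nabla f_t$, and it can be checked numerically against direct inversion. The sketch below is illustrative; the function and variable names are not from the article.

```python
import numpy as np

def inv_update(G_inv, grad_f, eps):
    """Equation 2.16: the exact inverse of
    (1 - eps) * G + eps * grad_f grad_f', given G_inv = G^{-1}."""
    Gg = G_inv @ grad_f                      # G^{-1} grad_f
    denom = (1.0 - eps) + eps * grad_f @ Gg  # scalar denominator
    return G_inv / (1.0 - eps) - (eps / (1.0 - eps)) * np.outer(Gg, Gg) / denom
```

Because $\nabla f$ is a single vector ($k = 1$), this rank-one update costs $O(\dim^2)$ per step instead of the $O(\dim^3)$ of a full matrix inversion.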
The natural gradient learning algorithm based on this estimate is given by

$$ \theta_{t+1} = \theta_t - \eta_t\, \hat G_t^{-1}\, \nabla l^*(x_t, y_t^*; \theta_t). \qquad (2.18) $$
Equations 2.17 and 2.18 constitute the new method that we propose for implementing natural gradient learning.
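Putting equations 2.17 and 2.18 together, one on-line step of the proposed adaptive natural gradient learning can be sketched as follows. The learning rates, the linear toy model used for testing, and all names are illustrative assumptions.

```python
import numpy as np

def angl_step(theta, G_inv, x, y_star, f, grad_f, eta=0.01, eps=0.02):
    """One step of adaptive natural gradient learning.

    G_inv is updated by the small-eps approximation of equation 2.17;
    theta is updated by equation 2.18, with grad l* = -(y* - f) grad f
    for the squared error of equation 2.5.
    """
    gf = grad_f(x, theta)                                 # gradient of f
    Gg = G_inv @ gf
    G_inv = (1.0 + eps) * G_inv - eps * np.outer(Gg, Gg)  # equation 2.17
    grad_loss = -(y_star - f(x, theta)) * gf              # gradient of l*
    theta = theta - eta * G_inv @ grad_loss               # equation 2.18
    return theta, G_inv
```

For a linear model $f(x, \theta) = \theta \cdot x$ with $\nabla f = x$, iterating `angl_step` on noiseless teacher data drives $\theta$ to the teacher's parameters.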
Since we have introduced the learning rate $\varepsilon_t$ in addition to $\eta_t$, we need to study how to choose $\varepsilon_t$. The value of $\varepsilon_t$ determines how quickly and how accurately $\hat G_t^{-1}$ converges to $G^{-1}(\theta_t)$. In the final stage, where $\theta_t$ is close to $\theta^*$, an error in $G^{-1}$ is not serious, since the estimator $\theta_t$ is consistent even when $G^{-1}$ is misspecified; here an adequate choice of $\eta_t$ is much more important. There are many theories concerning the determination of $\eta_t$ (see, for example, Amari, 1998, for its adaptive determination).

Avoidance of plateaus is an important benefit of the natural gradient. If $\varepsilon_t$ is too small, there is a serious delay in adjusting $\hat G^{-1}$, so the parameter may approach plateaus, even though it eventually escapes from them. On the other hand, large $\varepsilon_t$ causes numerical instability. All in all, we suggest using a constant $\varepsilon_t$ that is not too small; the performance is insensitive to $\varepsilon_t$ within a reasonable constant range. These observations are confirmed by the computer simulations below.

It should be noted that the natural gradient method given by equations 2.8 and 2.9 is completely different from the Newton-Raphson method, in which $G(\theta)$ is replaced by the Hessian

$$ H(\theta) = E\bigl[ \nabla \nabla l^*(x, y^*; \theta) \bigr]. \qquad (2.19) $$
$H(\theta)$ depends on the specific cost function $l^*$, which includes the teacher signal $y^*$ explicitly, whereas $G(\theta)$ is the metric of the underlying space $S$, independent of the target function to be approximated. However, when the log-likelihood is used as the cost function and the teacher signal $y$ is generated by a network having parameter $\theta^*$,

$$ H(\theta^*) = G(\theta^*) \qquad (2.20) $$

holds at $\theta^*$. Hence, the natural gradient is equivalent to the Newton method near the optimal point.

The Hessian $H(\theta)$ is not necessarily positive-definite. Moreover, it includes second derivatives. To avoid calculating second derivatives and to keep the matrix positive-definite, the Gauss-Newton method approximates $H(\theta)$ by the sample average, neglecting the second-derivative terms:

$$ G_N(\theta) = \frac{1}{T} \sum_{t=1}^{T} \nabla f(x_t, \theta)\, \nabla f(x_t, \theta)'. \qquad (2.21) $$
LeCun et al. (1998) also describe a stochastic or adaptive Gauss-Newton method for the case in which $l^*$ is a quadratic error function. The proposed method generalizes the adaptive Gauss-Newton algorithms and provides a solid theoretical justification even when $l^*$ is not quadratic, and its adaptive update is computationally cheap. It should also be noted that when $\theta$ is close to a plateau, $G_N(\theta)$ is almost singular, and its inverse diverges.
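The Gauss-Newton matrix of equation 2.21 is just the sample average of outer products of $\nabla f$, which makes its near-singularity on plateaus easy to see numerically. A sketch with illustrative names:

```python
import numpy as np

def gauss_newton_matrix(xs, theta, grad_f):
    """G_N(theta) of equation 2.21: the average of
    grad_f(x_t, theta) grad_f(x_t, theta)' over the sample."""
    grads = np.stack([grad_f(x, theta) for x in xs])  # shape (T, dim)
    return grads.T @ grads / len(xs)
```

When the gradients $\nabla f(x_t, \theta)$ become nearly collinear, as on a plateau, $G_N(\theta)$ is nearly singular and its inverse blows up; the damped recursive estimate of equation 2.16 avoids forming and inverting this matrix explicitly.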
3 Experimental Results

3.1 Simple Toy Model. We conducted an experiment comparing the convergence speeds of the conventional gradient, the exact natural gradient, and the proposed adaptive method. We used a simple MLP model,

$$ y = \sum_{a=1}^{2} v_a \varphi(w_a \cdot x) + \xi, \qquad (3.1) $$
having two-dimensional input and no bias terms. Assuming that the input is subject to the gaussian distribution $N(0, I)$, where $I$ is the identity matrix, we can obtain the Fisher information matrix and its inverse explicitly, so that the exact natural gradient learning algorithm is available. Training data $(x_t, y_t^*)$, $t = 1, 2, \ldots$, are given at each time $t$ by the noisy output $y_t^*$ of a teacher network. The teacher network is an orthogonal committee machine having the same structure as the student network, with the true parameters defined by

$$ |w_1| = |w_2| = 1, \qquad w_1^{\mathsf T} w_2 = 0, \qquad v_a = 1. \qquad (3.2) $$
The variance of the output noise is set to 0.1. We chose initial values of the parameters randomly, subject to the uniform distribution on small intervals:

$$ w_{ai} \sim 0.3 + U[-10^{-8}, 10^{-8}], \qquad v_a \sim 0.2 + U[-10^{-4}, 10^{-4}], \qquad a, i = 1, 2. \qquad (3.3) $$

We ran the three learning algorithms simultaneously with the same initial parameter values and the same training data sequence under various conditions. A typical result is shown in Figure 1: the proposed learning algorithm converges as fast as the exact natural gradient learning algorithm, and both escape from plateaus much faster than the conventional algorithm in this example. The learning rate was selected empirically for each algorithm to obtain an accurate estimator and a good convergence speed. Figure 1 shows a typical case with a small, constant learning rate $\eta$; we also tried $\eta_t = c/t$ and other schedules, as well as various choices of $\varepsilon_t$. When $G$ is close to a singular matrix, $G^{-1}$, and hence the natural gradient, becomes too large; in such cases, we made $\eta_t$ smaller to avoid too large an adjustment of the parameter vector in one step. The adaptive method works very well in this example but has a tendency to be trapped in a plateau when $\varepsilon_t$ is chosen too small, while the exact natural gradient method does not; this is due to the delay in adjusting $\hat G^{-1}$. Within an adequate range of $\varepsilon_t$, roughly 0.05–0.5, there is little change in performance. When $\varepsilon_t$ is too large, numerical instability emerges; when it is too small, for example 0.01, the plateau phenomenon becomes pronounced.
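The toy setting of equations 3.1 through 3.3 (an orthogonal committee-machine teacher, gaussian inputs, output-noise variance 0.1) can be reproduced as follows; the choice of tanh for $\varphi$ and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: orthogonal committee machine (equation 3.2).
w_teacher = np.array([[1.0, 0.0],
                      [0.0, 1.0]])   # |w_1| = |w_2| = 1, w_1' w_2 = 0
v_teacher = np.array([1.0, 1.0])     # v_a = 1

def teacher_output(x):
    """Noisy target y* from the teacher network (equation 3.1),
    with output-noise variance 0.1."""
    return v_teacher @ np.tanh(w_teacher @ x) + rng.normal(0.0, np.sqrt(0.1))

# Student initialization near a singular configuration (equation 3.3).
w_student = 0.3 + rng.uniform(-1e-8, 1e-8, size=(2, 2))
v_student = 0.2 + rng.uniform(-1e-4, 1e-4, size=2)
```

Because the student's two hidden weight vectors start out almost identical, the learning trajectory begins near the plateau region where the comparison of the three algorithms is most informative.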
Figure 1: Average learning curves for the three algorithms: OGL, ordinary gradient learning; NGL, natural gradient learning; ANGL, adaptive natural gradient learning.

3.2 Extended XOR Problem. To show that the proposed learning algorithm can be applied to practical problems, we treated a pattern classification problem: the extended exclusive-OR problem, a benchmark problem for pattern classifiers. Figure 2 shows the pattern set to be classified into two categories. It consists of nine clusters, each assigned to one of the two classes, marked by the symbols ○ and +. Each cluster is generated from a gaussian distribution specified by a mean and a common covariance matrix. Each cluster has 200 elements, and there is some overlap between clusters of the two classes. We used an MLP with 12 hidden units and 1 nonlinear output unit. Since it is difficult to obtain an explicit form of the natural gradient here, we conducted experiments with ordinary gradient learning and the proposed learning method. The desired outputs $y$ are set to 1 and 0 for the two classes. Since the size of the training set is fixed, we used the examples repeatedly, and the mean square training error over the set was calculated at each cycle. The algorithms started from initial values generated randomly, subject to
$$ w_{ai},\ v_a,\ b_a,\ b_0 \sim U[-10^{-1}, 10^{-1}], \qquad a, i = 1, \ldots, m. \qquad (3.4) $$
We conducted 10 independent runs. The generalization error sometimes fails to decrease to the desired level in the case of ANGL, but such failures are far more common for the conventional method. We do not know whether the state converged
Figure 2: Extended XOR problem.
to a poor local minimum or was trapped at a plateau. Figure 3 shows the best learning curve for each algorithm. Since zero training error is barely attainable, we stopped learning when there was no further increase in the classification rate on the training data set. The resulting training errors were 0.0754 and 0.0675 for ordinary gradient learning and the proposed learning method, respectively. The numbers of learning cycles needed to reach the same classification rate (98.11%) were 154,300 and 1,500, respectively. The convergence of the proposed method was thus surprisingly fast in this example.

4 Conclusions and Discussion
We have proposed an adaptive calculation of the inverse of the Fisher information matrix that requires no matrix inversion, together with an adaptive natural gradient learning algorithm that uses it. On a simple toy problem, we showed that the proposed algorithm is as fast as the exact natural gradient learning algorithm, which is proved to be Fisher efficient, and remarkably faster than the conventional learning method. The experiment with a benchmark pattern recognition problem showed that the proposed
Figure 3: Learning curves for extended XOR problem: OGL, ordinary gradient learning; ANGL, adaptive natural gradient learning.
learning algorithm can be applied successfully to practical problems as well; the improvement in convergence speed over ordinary gradient learning is remarkable. This article is a preliminary study of the topological and metrical structure of the parameter space of MLPs. More detailed theoretical and experimental results will be published in the future.

Acknowledgments
We express our gratitude to Léon Bottou and Noboru Murata for their discussions and suggestions.

References

Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Nagaoka, H. (2000). Information geometry. New York: American Mathematical Society and Oxford University Press.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning in neural networks (pp. 9–42). Cambridge: Cambridge University Press.
Edelman, A., Arias, T., & Smith, S. T. (1998). The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications,
20, 303–353.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient backprop. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 5–50). Berlin: Springer-Verlag.
Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Submitted.
Rattray, M., Saad, D., & Amari, S. (1998). Natural gradient descent for on-line learning. Physical Review Letters, 81, 5461–5464.
Saad, D., & Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52, 4225–4243.
Yang, H. H., & Amari, S. (1998). Complexity issues in natural gradient descent method for training multilayer perceptrons. Neural Computation, 10, 2137–2157.

Received December 9, 1998; accepted May 14, 1999.
LETTER
Communicated by Shun-ichi Amari and Todd Leen
Nonmonotonic Generalization Bias of Gaussian Mixture Models Shotaro Akaho Electrotechnical Laboratory, Information Science Division, Ibaraki 305-8568, Japan Hilbert J. Kappen RWCP Theoretical Foundation SNN, Department of Medical Physics and Biophysics, University of Nijmegen, NL 6525 EZ Nijmegen, The Netherlands

Theories of learning and generalization hold that the generalization bias, defined as the difference between the training error and the generalization error, increases on average with the number of adaptive parameters. This article, however, shows that this general tendency is violated for a gaussian mixture model. For temperatures just below the first symmetry breaking point, the effective number of adaptive parameters increases while the generalization bias decreases. We compute the dependence of the neural information criterion on temperature around the symmetry breaking. Our results are confirmed by numerical cross-validation experiments.

1 Introduction
An important problem in learning is to optimize the model so that the best generalization performance is obtained. Generalization performance depends on the number of adaptive parameters in the model. We can reduce the training error as much as needed by increasing the number of adaptive parameters, but the performance of the model should be evaluated by the generalization error, which measures the error on test samples. The best generalization performance is usually not obtained by maximizing the number of adaptive parameters (Vapnik, 1984; Rissanen, 1986).

The generalization bias is defined as the difference between the generalization error and the training error. The expected generalization bias can be measured approximately by the neural information criterion (NIC) or numerically through cross-validation on test data (Moody, 1992; Amari, 1993; Murata, Yoshizawa, & Amari, 1991, 1994). Typically, when the number of model parameters increases, the training error decreases because a better fit can be obtained; at the same time, the generalization bias is expected to increase due to the larger model variability. In this article we present an example in which this intuition is violated: a case in which the generalization bias decreases while the number of model parameters increases.

Neural Computation 12, 1411–1427 (2000) © 2000 Massachusetts Institute of Technology
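The quantity under study can be estimated numerically just as in the article's cross-validation experiments: fit on a training set, evaluate on held-out data, and subtract. A minimal sketch with an illustrative least-squares polynomial model (not the mixture model studied in this article); all names are assumptions.

```python
import numpy as np

def generalization_bias(train, test, degree):
    """Generalization error minus training error of a least-squares
    polynomial fit (an illustrative model, not the article's RBBM)."""
    (x_tr, y_tr), (x_te, y_te) = train, test
    coef = np.polyfit(x_tr, y_tr, degree)
    mse = lambda x, y: np.mean((np.polyval(coef, x) - y) ** 2)
    return mse(x_te, y_te) - mse(x_tr, y_tr)
```

For a fixed data set, raising `degree` lowers the training error while the bias typically grows; the article's point is that this typical behavior can be reversed near a symmetry breaking.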
1412
Shotaro Akaho and Hilbert J. Kappen
We consider radial basis Boltzmann machines (RBBMs), a special class of gaussian mixture models (Kappen, 1995). Gaussian mixture models have attracted much attention in the neural network community (Barkai, Seung, & Sompolinsky, 1993; Titterington, 1985) because they are closely related to several neural network models, such as radial basis function (RBF) networks (Poggio & Girosi, 1990) and hierarchical mixture of experts (HME) networks (Jacobs & Jordan, 1991; Jordan & Jacobs, 1994). For gaussian mixtures, complexity and generalization performance are controlled by the number of mixture components. In RBBMs, the complexity of the model is controlled by a continuous parameter $\beta$. When $\beta$ is small, the maximum likelihood (ML) solution of the RBBM degenerates into one gaussian. At a critical value of $\beta$, the ML solution becomes a mixture of several gaussians, and this phenomenon of symmetry breaking is repeated recursively as $\beta$ increases. Thus, $\beta$ controls the effective number of mixture components and, in this way, the generalization performance.

In this article we study the generalization bias of RBBMs around the first symmetry breaking point. In section 2, we define RBBMs and describe the symmetry breaking mechanism. In section 3, we derive the condition under which the symmetry breaking is two-way or h-way, where $h$ is the number of mixture components in the RBBM. In section 4, we show analytically that the generalization bias of the model, as measured by the NIC, decreases if the symmetry breaking is two-way, even though the superficial effective number of parameters increases. If we can assume that the training error does not change significantly around the symmetry breaking point, this anomaly implies that the ML solution just below the critical temperature is expected to achieve a smaller generalization error than just above it. In section 5, we confirm this effect numerically.

2 Radial Basis Boltzmann Machines
Let us consider a gaussian mixture model with equal priors in which the variance of all components is spherically symmetric and identical: h 1X p( x | WI b ) D h iD1
r
b exp(¡b kx ¡ w i k2 ). p
(2.1)
$W$ denotes the set of adaptive parameters $\{w_1, \ldots, w_h\}$. The complexity of the model is determined by $h$, the number of gaussian components, and by $\beta$, a control parameter called the inverse temperature in physics; statistically, $\beta$ equals half the inverse variance. We chose spherical covariance matrices because we are interested in the relation between complexity and symmetry breaking; we expect that our results generalize to mixture models with full covariance matrices.
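The density of equation 2.1 is straightforward to evaluate; the sketch below (with illustrative names) also checks the normalization numerically in one dimension.

```python
import numpy as np

def rbbm_density(x, W, beta):
    """p(x | W; beta) of equation 2.1: an equal-prior gaussian mixture
    with spherical variance 1/(2 beta).  W has shape (h, d), x shape (d,)."""
    sq = np.sum((x - W) ** 2, axis=1)  # ||x - w_i||^2 for each kernel
    d = W.shape[1]
    return np.mean((beta / np.pi) ** (d / 2) * np.exp(-beta * sq))
```

In one dimension the normalization reduces to the article's $\sqrt{\beta/\pi}$; for $d > 1$ this sketch uses the spherical-covariance generalization $(\beta/\pi)^{d/2}$.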
Generalization Bias of Gaussian Mixture Models
1413
This unsupervised model was originally proposed by Rose, Gurewitz, and Fox (1990) and generalized by Kappen (1993, 1995; Nijman & Kappen, 1997) to the supervised case. The model is also related to fuzzy clustering (Bezdek, 1980). In this article, we refer to model 2.1 as an RBBM.

Equation 2.1 can be written as

$$ p(x) = \sum_{i=1}^{h} p(x, i) = \sum_{i=1}^{h} p(x \mid i)\, p(i), $$

with $i$ labeling the individual clusters and $p(i) = 1/h$ the prior probability of each cluster; $p(x \mid i)$ is simply a gaussian distribution in $x$ centered on $w_i$. For fixed $\beta$, the ML solution for the cluster means is easily derived and is given by (Rose et al., 1990)

$$ w_i = \frac{\langle x\, p(i \mid x) \rangle}{\langle p(i \mid x) \rangle}, $$

where $\langle \cdot \rangle$ denotes the expectation value with respect to $q(x)$,

$$ \langle \cdot \rangle \equiv \int \cdot \; q(x)\, dx, $$

and $q(x)$ is the target distribution from which the training and test samples are generated. These coupled equations can be solved by a variety of methods, such as the expectation-maximization (EM) algorithm.

An example of the ML solutions as a function of temperature is shown in Figure 1. The training data are generated with equal probability from two distributions, $u[0.5, 1.5]$ and $N[-1, 0.3^2]$, where $u[a, b]$ is the uniform distribution on $[a, b]$ and $N[m, v]$ is the gaussian distribution with mean $m$ and variance $v$. The number of training samples is 100, and $h = 100$. For small $\beta$, the ML solution is of the form $w_1 = w_2 = \cdots = w_h$; that is, the solution corresponds to one cluster. At a critical value of $\beta$, the cluster splits into smaller parts, and these symmetry breakings recur at higher values of $\beta$. Therefore, the number of gaussian components can be controlled by adjusting the temperature in this model: at any $\beta$, although the total number of kernels is $h$, only a smaller effective number of kernels is used. For this reason, we assume $h$ to be sufficiently large; the complexity is then controlled by $\beta$ only.

3 Symmetry Breaking Point of the RBBM
Since the effective number of parameters changes at the symmetry breaking points (SBPs), we study the symmetry breaking process in detail. Although the behavior of symmetry breaking is complicated to analyze, we have some results on the first SBP, the highest temperature at which a phase transition occurs. These results are expected to apply qualitatively to the other SBPs as well.
Figure 1: An example of the ML solution and the symmetry breaking phenomenon. Horizontal axis: $\log \beta$; vertical axis: $w$ or $x$. Dots show the ML solution at each temperature; dashes on the right show the training samples.
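The behavior shown in Figure 1 can be reproduced by iterating the fixed-point equation $w_i = \langle x\, p(i \mid x) \rangle / \langle p(i \mid x) \rangle$ on a sample, with the expectation replaced by a sample average (an EM-style update). All settings below are illustrative.

```python
import numpy as np

def update_means(X, W, beta):
    """One fixed-point (EM-style) step for the RBBM cluster means:
    w_i <- <x p(i|x)> / <p(i|x)>, with expectations replaced by
    averages over the sample X of shape (N, d); W has shape (h, d)."""
    sq = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)  # (N, h)
    resp = np.exp(-beta * sq)
    resp /= resp.sum(axis=1, keepdims=True)                  # p(i | x_n)
    return resp.T @ X / resp.sum(axis=0)[:, None]
```

At high temperature (β below the first critical value) the iteration collapses all means onto the sample mean, reproducing the one-cluster phase on the left of Figure 1.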
When the temperature is higher than $1/\beta_c$, all the gaussian components degenerate into one gaussian, and the ML solution is given by $w_i = \langle x \rangle$. One can compute the first SBP analytically as

$$ \beta = \beta_c \equiv \frac{1}{2\lambda_1}, \qquad (3.1) $$

where $\lambda_1$ is the maximal eigenvalue of the covariance matrix of $x$. Since $1/(2\beta)$ is the variance of each gaussian component, this means that the first symmetry breaking occurs when the component variance equals the variance of the samples along the first principal component axis (Rose et al., 1990). This result carries over to the other SBPs as follows. After breaking, the two clusters drift apart as a function of $\beta$. If the distance between the clusters is sufficiently large, each data point contributes only to the covariance matrix of the closest cluster, and the eigenvalues of this reduced matrix determine the next critical temperature, as in equation 3.1.

3.1 Below the First SBP. We can characterize the behavior of the symmetry breaking under the following assumption:
Assumption. The target distribution $q(x)$ is defined on $R$ (one dimension) and is assumed to be symmetric. The number of gaussian components $h$ in the RBBM is taken to be even.

Let $\sigma_2 \equiv \langle (x - \langle x \rangle)^2 \rangle$ and $\sigma_4 \equiv \langle (x - \langle x \rangle)^4 \rangle$ denote, respectively, the second- and fourth-order moments of the distribution $q(x)$. The fourth-order cumulant is defined as $\kappa_4 \equiv \sigma_4 - 3(\sigma_2)^2$ ($\kappa_4 / \sigma_4$ is called the kurtosis in statistics). Under the assumption, the behavior of the first symmetry breaking can be classified into the following two cases:

Proposition 1.
6 0 the symmetry breaking is two-way: the components split into two 1. If k4 D clusters. The relation between the ML solution wi and b in the neighborhood of the rst SBP bc is given by
Db ’
s4 ( D w i )2 , 6(s 2 )4
(3.2)
where D b D b ¡ b c > 0, D wi D wi ¡ hxi.
2. If k4 D 0 the symmetry breaking is h-way: the individual components D wi are underdetermined up to the third-order approximation that we computed. This suggests that no preferred breaking pattern exists, although fourth-order corrections may somewhat limit this freedom. The relation between the ML solution wi and b in the neighborhood of the rst SBP bc is written as
Db ’ where sw2 D
1 s2 , 2(s 2 )2 w 1 h
P
i D wi
2
(3.3)
, D wi D wi ¡ hxi.
$k_4$ is equal to zero when $q(x)$ is gaussian; hence the condition measures the similarity between $q(x)$ and a gaussian in the sense of the fourth-order cumulant. An outline of the proof of proposition 1 is given in appendix A. This result can be intuitively understood as follows. When $\beta$ increases toward the critical point, the stability of the symmetric solution $w_i = \langle x\rangle$ decreases. The stability is measured by the eigenvalues of the Hessian, the matrix of second derivatives of the likelihood. At $\beta = \beta_c$ some eigenvalues of the Hessian become zero, and the symmetric solution becomes unstable. When $k_4 \neq 0$, only one eigenvalue becomes zero. The corresponding eigenvector is in the direction of symmetry breaking, and the breaking is twofold: one cluster moves in the positive eigendirection and one in the negative eigendirection. When $k_4 = 0$, however, all eigenvalues become zero simultaneously, and thus all directions become unstable. The breaking is h-fold, because the initial movement of each cluster center can be in any arbitrary direction.
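Equation 3.1 is easy to check numerically: estimate the covariance of the samples, take its largest eigenvalue, and invert. A minimal Python sketch (the function name is ours, not from the paper):

```python
import numpy as np

def first_sbp_beta(samples):
    """Critical inverse temperature beta_c = 1/(2*lambda_1) of equation 3.1,
    with lambda_1 the largest eigenvalue of the sample covariance."""
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]                      # shape (n_samples, 1)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    lam1 = np.linalg.eigvalsh(cov)[-1]      # eigvalsh sorts ascending
    return 1.0 / (2.0 * lam1)

# For unit-variance one-dimensional data this gives beta_c ~ 0.5,
# i.e. log(beta_c) ~ -0.693, the value quoted in section 5.
rng = np.random.default_rng(0)
print(first_sbp_beta(rng.standard_normal(100_000)))
```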
Equation 3.3 can be interpreted as follows. Suppose that we have an infinite number of kernels. The model $p(x\,|\,\mathbf{w}) = (1/h)\sum_{i=1}^{h} \varphi(x - w_i, \beta)$, with $\varphi(x, \beta)$ a gaussian density function with mean 0 and variance $1/2\beta$, can be written as
$$p(x\,|\,\mathbf{w}) = \int \rho(w)\,\varphi(x - w, \beta)\, dw,$$
with $\rho(w)$ some distribution over the kernel means. As an example of $k_4 = 0$, consider the case that the target distribution $q(x)$ is gaussian. At large $\beta$, $p(x\,|\,\mathbf{w})$ will then approach a gaussian distribution. Since $p(x\,|\,\mathbf{w})$ is the convolution of $\rho(w)$ with a gaussian, it follows that $\rho(w)$ for large $\beta$ will also approach a gaussian distribution.

4 Nonmonotonic Generalization Bias
Since for the RBBM the effective number of adaptive parameters changes as a function of temperature, we must find the temperature that gives the best generalization. Overfitting results in a suboptimal effective number of components and a suboptimal clustering. Although the generalization bias is a statistical quantity and unknown in general, we can estimate it through cross-validation. Alternatively, we can approximate the mean generalization bias, which is known as the NIC (Murata et al., 1991, 1994) or the effective number of parameters (Moody, 1992). Using this value, we can select the model that minimizes the sum of the training likelihood and the mean generalization bias. Another well-known criterion is Akaike's information criterion (AIC), in which the bias is given by the number of independent parameters. However, AIC assumes that the target distribution is an RBBM, which is a poor assumption around the SBPs.

4.1 Neural Information Criterion. Given a set of P training samples $X^{(P)} = \{x_1, \ldots, x_P\}$ from a target probability distribution $q(x)$, the ML solution $W = W^{(P)}$ maximizes the empirical likelihood over the training samples,

$$R_{emp}^{(P)}(W; \beta) \equiv \frac{1}{P} \sum_{i=1}^{P} \log p(x_i \,|\, W; \beta). \qquad (4.1)$$

The true likelihood is defined as the expectation value of the likelihood over the target distribution $q(x)$:

$$R_{exp}(W; \beta) \equiv \langle \log p(x \,|\, W; \beta) \rangle. \qquad (4.2)$$

The generalization bias can be approximated by making a Taylor expansion of $R_{emp}^{(P)}(W; \beta)$ around $W^{(P)}$ and using the fact that the asymptotic distribution of
$W^{(P)}$ is locally a gaussian distribution (Murata et al., 1991, 1994). Consequently, when P is large enough, the mean generalization bias is asymptotically given by

$$\left\langle R_{emp}^{(P)}(W^{(P)}; \beta) \right\rangle - \left\langle R_{exp}(W^{(P)}; \beta) \right\rangle \simeq \frac{h_{NIC}}{P}, \qquad (4.3)$$

where the average is taken over all training sets $X^{(P)}$ of size P. $h_{NIC}$ is the NIC, defined by

$$h_{NIC}(\beta) \equiv \mathrm{Tr}[H(W^*)^{-1} D(W^*)], \qquad (4.4)$$

where $W^*$ denotes the ML solution of the true likelihood. $H(W)$ and $D(W)$ are the matrices

$$H_{ij}(W) \equiv -\langle \partial_i \partial_j \log p(x \,|\, W, \beta) \rangle, \qquad (4.5)$$
$$D_{ij}(W) \equiv \langle \partial_i \log p(x \,|\, W, \beta)\, \partial_j \log p(x \,|\, W, \beta) \rangle, \qquad (4.6)$$

where $\partial_i \equiv \partial/\partial w_i$. If $q(x)$ belongs to the model set, NIC is equal to AIC because $H(W^*) = D(W^*)$. In practice, when we apply the NIC to model selection, we need to evaluate the bias using only one training set. In this case, the bias is given by

$$R_{emp}^{(P)}(W^{(P)}; \beta) - R_{exp}(W^{(P)}; \beta) \simeq \frac{h_{NIC}}{P} + \frac{U}{\sqrt{P}}, \qquad (4.7)$$
where $U = \sqrt{P}\,\{R_{emp}^{(P)}(W^*; \beta) - R_{exp}(W^*; \beta)\}$ is a random variable of order 1 with zero mean. It can be shown that U is the same for all models within a nested set of models (Murata et al., 1994), which holds for the RBBM. Therefore, although the $U/\sqrt{P}$ term dominates the $h_{NIC}/P$ term, it does not affect the model selection.

In the following sections, we compute the behavior of the NIC near the SBP for the RBBM. The effective number of adjustable parameters in the RBBM is constant between symmetry breakings and increases stepwise with increasing $\beta$. Since the NIC measures the complexity of the model, it is expected to increase as $\beta$ increases. Our analysis indeed shows a linear increase of the NIC with $\beta$ just before the symmetry breaking ($\beta < \beta_c$). However, just after the symmetry breaking ($\beta > \beta_c$), our analysis shows a decrease of the NIC, depending on $k_4$.

4.2 Above the First SBP. When there is only one cluster ($\beta < \beta_c$), we can compute $h_{NIC}$ explicitly. The following proposition shows that the generalization bias increases linearly in proportion to $\beta$:
Proposition 2. If $\beta < \beta_c$, the NIC is given by

$$h_{NIC}(\beta) = 2\beta\, \mathrm{Tr}[V_x], \qquad (4.8)$$

where $V_x$ is the covariance matrix of $q(x)$.

An outline of the proof of proposition 2 is given in appendix B. Intuitively, the NIC measures the sensitivity of the likelihood to sample fluctuation. Proposition 2 matches this intuition because for small $\beta$ one has broad kernels, whose centers are less sensitive to sample fluctuations.

4.3 Below the First SBP. Because the situation below the critical temperature is very complicated, we analyze the NIC under the same assumption as in section 3.1:

Proposition 3. Under the assumption, and also if $k_4 \neq 0$ and $\sigma_4 \neq (\sigma^2)^2$,

$$\lim_{\beta \downarrow \beta_c} \frac{\partial}{\partial \beta}\, h_{NIC}(\beta) = -\infty. \qquad (4.9)$$

The condition $\sigma_4 \neq (\sigma^2)^2$ applies to all distributions except a mixture distribution of two $\delta$-functions. If $q(x) = (\delta(x-1) + \delta(x+1))/2$, one obtains $\partial h_{NIC}(\beta_c)/\partial\beta = -4$.
An outline of the proof of proposition 3 is given in appendix C. Proposition 3 states that the NIC of the RBBM decreases, even though the effective number of parameters increases, when the symmetry breaking at the first SBP is two-way. Since the training error is approximately constant around the SBP, we conclude that this model gives slightly better generalization error (in terms of NIC) just below the critical temperature than at or just above the critical temperature. Whether this anomalous behavior affects the optimal value of $\beta$ depends on the functional dependence of the training error on $\beta$. Although this effect causes a local minimum in the generalization error as a function of $\beta$ around the SBP, the global minimum might be attained at much higher or lower values of $\beta$. It is not easy to analyze the case $k_4 = 0$, since the symmetry breaking is h-way and the ML solution is underdetermined at the third-order approximation, as shown in proposition 1.

5 Experiments
In this section, we show computer simulation results for the two cases with different fourth-order cumulants presented in proposition 1. In both cases, the target distribution $q(x)$ is created so that the mean is 0.0 and the variance is 1.0. Therefore, $\log \beta_c \simeq -0.693$, and the ML solution for training
Figure 2: ML solution versus log(β). Data are generated from a distribution with $k_4 \neq 0$. Horizontal axis: log(β); vertical axis: w or x. Dots show the ML solution for each temperature; dashes on the right show the training samples.
samples breaks around this value. Both the number of training samples (P) and the number of gaussian components (h) are set to 100. We use the EM algorithm (Dempster, Laird, & Rubin, 1977) to optimize the empirical likelihood. We initialize the EM algorithm with one gaussian component on each of the training samples. The test set consists of 100,000 samples, generated from the same distribution as the training samples.

5.1 The Case of $k_4 \neq 0$. The target distribution is a mixture of two gaussians,

$$q(x) = \frac{1}{2}\sqrt{\frac{C_1}{\pi}} \left[ \exp\{-C_1 (x - C_2)^2\} + \exp\{-C_1 (x + C_2)^2\} \right],$$

where $C_1 = 12.5$ and $C_2 = 0.96$ are chosen such that the variance of $q(x)$ is 1.0. An example of the ML solution as a function of the temperature around the first SBP is shown in Figure 2. The approximate generalization bias, as given by the NIC, as a function of temperature is shown in Figure 3. Note that the vertical tangent at the breaking point is in agreement with proposition 3.
Figure 3: NIC as a function of log(β) (magnified around the symmetry breaking point). Data are generated from a distribution with $k_4 \neq 0$. Horizontal axis: log(β); vertical axis: NIC multiplied by the number of training samples; dashed line: $\beta_c$.
The generalization bias averaged over 50 experiments with different training sets is shown in Figure 4. It differs from Figure 3 in two ways. First, it does not show the vertical tangent seen in the individual runs, because the vertical tangent appears only very close to the symmetry breaking point, and the symmetry breaking point fluctuates between training sets. Second, the term U in equation 4.7 is rather different for different training sets, giving a smoothing effect on the average generalization bias.

5.2 The Case of $k_4 = 0$. The target distribution is a single gaussian with unit variance,

$$q(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2).$$

In this case, the convergence of the EM algorithm is much more unstable than in the previous section, for the reasons mentioned in proposition 1. A typical ML solution as a function of temperature is shown in Figure 5.
Figure 4: Generalization bias from cross-validation, averaged over 50 experiments, as a function of log(β). Data are generated from a distribution with $k_4 \neq 0$. Horizontal axis: log(β); vertical axis: generalization bias multiplied by the number of training samples; dashed line: average of the empirical values of $\beta_c$.
The generalization bias averaged over 50 experiments with different random numbers is shown in Figure 6. We did not observe a single instance where the likelihood is maximal around the first SBP, as in the case of $k_4 \neq 0$.

6 Conclusion
We have shown the nonmonotonic behavior of the generalization bias of a special class of gaussian mixture models, the radial basis Boltzmann machines. At high temperature (large variance of the gaussian kernels), the generalization bias increases linearly with $\beta$. Below the critical temperature, on the other hand, the symmetry breaking phenomenon depends critically on the value of the fourth cumulant $k_4$.

If $k_4 \neq 0$, the generalization bias decreases with $\beta$ below the critical temperature. This means that the NIC decreases during the symmetry breaking process. After symmetry breaking is complete, the NIC increases again. While the NIC decreases, the effective number of adaptive parameters increases, because the kernels split up. It is normally assumed that the NIC measures the
Figure 5: ML solution versus log(β). Data are generated from a gaussian distribution ($k_4 = 0$). Horizontal axis: log(β); vertical axis: w or x. Dots show the ML solution for each temperature; dashes on the right show the training samples.
effective number of adaptive parameters. We conclude that this relation is violated around the SBPs. This anomalous behavior affects optimal model selection, which in this article is the choice of the optimal $\beta$. When the optimal generalization error occurs around an SBP, increasing $\beta$ will decrease (or at least not increase) the training error as well as decrease the NIC.

If $k_4 \simeq 0$, we predict theoretically that symmetry breaking is h-way, but this is not observed numerically. We attribute this discrepancy to the delicacy of the symmetry breaking process and the numerical instability of the optimization procedure.

Although our analysis was restricted to one dimension, we expect our results to hold in higher dimensions as well. If the number of kernels is large, the condition of an even number of kernels is not very restrictive. If the target distribution is nonsymmetric and contains odd moments, other results could be observed.

Appendix A: Outline of the Proof of Proposition 1
We assume $\langle x\rangle = 0$ without loss of generality. Since we assume an even number of gaussian components, let $h = 2h'$.
Figure 6: Generalization bias from cross-validation, averaged over 50 experiments, as a function of log(β). Data are generated from a gaussian distribution ($k_4 = 0$). Horizontal axis: log(β); vertical axis: generalization bias multiplied by the number of training samples; dashed line: average of the experimental values of $\beta_c$.
Since the target distribution is symmetric and the number of gaussians is even, the symmetry breaking will be symmetric, and we can write

$$p(x \,|\, W; \beta) = \frac{1}{2h'} \sqrt{\frac{\beta}{\pi}} \sum_{i=1}^{h'} p_i(x \,|\, w_i, \beta), \qquad (A.1)$$

where

$$p_i(x \,|\, w_i, \beta) \equiv \exp(-\beta(x - w_i)^2) + \exp(-\beta(x + w_i)^2). \qquad (A.2)$$

The derivative of the log-likelihood is defined by

$$L_i(x \,|\, W, \beta) \equiv \frac{\partial}{\partial w_i} \log p(x \,|\, W, \beta) = \frac{\partial_i p_i(x \,|\, w_i, \beta)}{p(x \,|\, W, \beta)}. \qquad (A.3)$$

The expectation value of $L_i$ under the target distribution $q$ is equal to zero at the ML solution,

$$\partial_i R_{exp}(W^*, \beta) = \langle L_i(x \,|\, W^*, \beta) \rangle = 0. \qquad (A.4)$$
Above the critical temperature, $W = 0$; below it, $W$ becomes nonzero. Let $\Delta w_i$ denote the mean of kernel $i$ (relative to $\langle x\rangle = 0$). In order to obtain a nontrivial solution for $\Delta w_i$ as a function of $\Delta\beta = \beta - \beta_c$, we expand $L_i(x\,|\,W, \beta)$ to third order in $W$ and to first order in $\beta$ around $\beta = \beta_c$ and $W = 0$:

$$\langle L_i(x\,|\,W, \beta)\rangle = \frac{2}{h'}\,\Delta\beta\,\Delta w_i - \frac{1}{h'^2(\sigma^2)^2}\left(\frac{\sigma_4}{2(\sigma^2)^2} - 1\right)\Delta w_i \sum_{j\neq i}\Delta w_j^2 + \frac{1}{6h'(\sigma^2)^2}\left\{3 - \frac{\sigma_4}{(\sigma^2)^2} - \frac{3}{h'}\left(\frac{\sigma_4}{(\sigma^2)^2} - 1\right)\right\}\Delta w_i^3 + \text{higher-order terms.} \qquad (A.5)$$
Setting $\langle L_i(x\,|\,W, \beta)\rangle = 0$ and neglecting higher-order terms, we obtain $h'$ simultaneous equations for $\Delta\beta$ and the $\Delta w_i$. Since the solution $\Delta w_i = 0$ gives a local minimum of the likelihood, we can assume $\Delta w_i \neq 0$. If $\sigma_4 \neq 3(\sigma^2)^2$, we obtain $\Delta w_i^2 = \Delta w_j^2$ and $\Delta\beta = \{\sigma_4/6(\sigma^2)^4\}\,\Delta w_i^2$, which is the first case of proposition 1. On the other hand, if $\sigma_4 = 3(\sigma^2)^2$, the simultaneous equations degenerate to the single equation $\Delta\beta = \sum_i \Delta w_i^2 / \{2h'(\sigma^2)^2\}$, which means that the $w_i$ cannot be determined uniquely from $\beta$ when higher-order terms are neglected.

Appendix B: Outline of the Proof of Proposition 2
If the temperature is higher than the first SBP, all gaussians degenerate to one gaussian; therefore we only need to calculate the NIC for the one-gaussian model.¹ We derive $D(W^*)$ and $H(W^*)$ for the one-gaussian model as follows:

$$D_{ij}(W^*) = 4\beta^2 V_{ij}, \qquad (B.1)$$
$$H_{ij}(W^*) = 2\beta\, \delta_{ij}, \qquad (B.2)$$

where $V_{ij}$ is the covariance between $x_i$ and $x_j$ and $\delta_{ij}$ is Kronecker's delta. Therefore the NIC is given by

$$h_{NIC}(\beta) = \mathrm{Tr}[H(W^*)^{-1} D(W^*)] = 2\beta\, \mathrm{Tr}[V_x]. \qquad (B.3)$$
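Equations B.1 through B.3 are easy to verify by Monte Carlo in one dimension. The sketch below (ours) estimates D and H from samples of $q(x)$ and recovers the value $2\beta\,\mathrm{Tr}[V_x]$ of proposition 2:

```python
import numpy as np

def nic_one_gaussian(x, beta):
    """Monte Carlo estimate of h_NIC = Tr[H^-1 D] (equation 4.4) for the
    one-gaussian model p(x|w) = sqrt(beta/pi) * exp(-beta (x - w)^2).
    Score: d/dw log p = 2*beta*(x - w); Hessian of -log p: 2*beta."""
    w_star = x.mean()              # ML solution above the first SBP
    score = 2.0 * beta * (x - w_star)
    D = np.mean(score ** 2)        # D matrix (1x1 in one dimension)
    H = 2.0 * beta                 # H is constant for this model
    return D / H                   # Tr[H^-1 D]

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)   # Tr[V_x] = 1
beta = 0.3                         # below beta_c = 0.5
# proposition 2 predicts h_NIC = 2 * beta * Tr[V_x] = 0.6
print(nic_one_gaussian(x, beta))
```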
Appendix C: Outline of the Proof of Proposition 3
¹ This is not strictly true, because the matrices H and D are still h-dimensional. However, one can show that this h dependence drops out in equation 4.4.

Similarly to appendix A, we assume $\langle x\rangle = 0$ without loss of generality. The symmetry breaking is two-way by the assumption of proposition 3. As a
result, one can show that the expansion of $h_{NIC}$ around the SBP is identical to the case of only two gaussian kernels. Therefore, we analyze the NIC of the two-gaussian model,

$$p(x \,|\, w_1, w_2; \beta) = \frac{1}{2}\sqrt{\frac{\beta}{\pi}} \left[ \exp\{-\beta(x - w_1)^2\} + \exp\{-\beta(x + w_2)^2\} \right]. \qquad (C.1)$$

$D(w_1, w_2)$ and $H(w_1, w_2)$ can be calculated from their definitions. At the ML solution, we have $w_1 = w_2 = w$. Therefore we obtain

$$D(w, w) = \begin{pmatrix} d_0 & d_2 \\ d_2 & d_0 \end{pmatrix}, \qquad (C.2)$$
$$H(w, w) = \begin{pmatrix} d_1 & d_2 \\ d_2 & d_1 \end{pmatrix}, \qquad (C.3)$$

where

$$d_0 = 4\beta^2 \left\langle (x - w)^2\, \frac{(p_1)^2}{p^2} \right\rangle, \qquad (C.4)$$
$$d_1 = d_0 + 2\beta \left\langle \frac{p_1}{p} \right\rangle - 4\beta^2 \left\langle (x - w)^2\, \frac{p_1}{p} \right\rangle, \qquad (C.5)$$
$$d_2 = -4\beta^2 \left\langle (x - w)(x + w)\, \frac{p_1 p_2}{p^2} \right\rangle, \qquad (C.6)$$
where $p_1 = \exp(-\beta(x - w)^2)$, $p_2 = \exp(-\beta(x + w)^2)$, and $p = p_1 + p_2$. Let

$$\hat{h}_{NIC}(\beta, w) = \mathrm{Tr}[H(w, w)^{-1} D(w, w)], \qquad (C.7)$$

which is equal to $h_{NIC}(\beta)$ for $w = w^*$. Evaluating the trace gives

$$\hat{h}_{NIC}(\beta, w) = \mathrm{Tr}[H^{-1} D] = 2\,\frac{d_0 d_1 - d_2^2}{d_1^2 - d_2^2}. \qquad (C.8)$$

Expanding $\hat{h}_{NIC}(\beta, w)$ around the first SBP ($\beta = \beta_c$, $w^* = 0$) with respect to $\beta$ and $w$, we obtain

$$\hat{h}_{NIC}(\beta, w) = \hat{h}_{NIC}(\beta_c, 0) + \left\{\frac{\partial}{\partial\beta}\hat{h}_{NIC}(\beta_c, 0)\right\}\Delta\beta + \left\{\frac{1}{2}\,\frac{\partial^2}{\partial w^2}\hat{h}_{NIC}(\beta_c, 0)\right\}\Delta w^2 + \text{higher-order terms}, \qquad (C.9)$$
where both the second and the third terms on the right-hand side of equation C.9 are of order $\Delta\beta$, since $\Delta w^2 \simeq \{6(\sigma^2)^4/\sigma_4\}\,\Delta\beta$ at the ML solution, from equation 3.2. Substituting $d_0$, $d_1$, $d_2$ by their values, the coefficient of the second term is given by

$$\frac{\partial}{\partial\beta}\hat{h}_{NIC}(\beta_c, 0) = 2\sigma^2, \qquad (C.10)$$

and the coefficient of the third term, before substituting $\beta = \beta_c$, is given by

$$\frac{1}{2}\,\frac{\partial^2}{\partial w^2}\hat{h}_{NIC}(\beta, 0) = 4\beta\left(1 - 2\beta\sigma^2 - \frac{1 - 4\beta^2\sigma_4}{1 - 2\beta\sigma^2}\right). \qquad (C.11)$$

When $\sigma_4 \neq (\sigma^2)^2$, using the facts that $\beta_c = 1/(2\sigma^2)$ and $\sigma_4 \geq (\sigma^2)^2$, we infer that equation C.11 diverges to $-\infty$ as $\beta$ converges to $\beta_c$ from the right. The only case in which $\sigma_4 = (\sigma^2)^2$ is when $q(x)$ equals $\delta(x)$ or $(\delta(x-a) + \delta(x+a))/2$. Since there is no SBP in the former case, we need to consider only the latter. Without loss of generality we assume $a = 1$; the right derivative then follows from a simple calculation:

$$\frac{\partial}{\partial\beta}\hat{h}_{NIC}(\beta_c, 0) = -4. \qquad (C.12)$$
References

Amari, S. (1993). A universal theorem on learning curves. Neural Networks, 6, 161–166.
Barkai, N., Seung, H. S., & Sompolinsky, H. (1993). Scaling laws in learning of classification tasks. Physical Review Letters, 70, 3167–3170.
Bezdek, J. C. (1980). A convergence theorem for the fuzzy ISODATA clustering algorithms. IEEE Trans. on PAMI, 2, 1–8.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Royal Statistical Soc. Ser. B, 39, 1–38.
Jacobs, R. A., & Jordan, M. I. (1991). A competitive modular connectionist architecture. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3 (pp. 767–773). San Mateo, CA: Morgan Kaufmann.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Kappen, B. (1993). Using Boltzmann machines for probability estimation: A general framework for neural network learning. In S. Gielen et al. (Eds.), Proc. of ICANN'93 (pp. 521–526). Berlin: Springer-Verlag.
Kappen, B. (1995). Deterministic learning rules for Boltzmann machines. Neural Networks, 8, 537–548.
Moody, J. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S.
J. Hanson, & R. P. Lippman (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Murata, N., Yoshizawa, S., & Amari, S. (1991). A criterion for determining the number of parameters in an artificial neural network model. In T. Kohonen et al. (Eds.), Artificial neural networks (ICANN) (pp. 9–14). Amsterdam: Elsevier.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion: Determining the number of parameters for an artificial neural network model. IEEE Trans. on Neural Networks, 5, 865–872.
Nijman, M. J., & Kappen, H. J. (1997). Symmetry breaking and training from incomplete data with radial basis Boltzmann machines. International Journal of Neural Systems, 8, 301–316.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
Rissanen, J. (1986). Stochastic complexity and modeling. Annals of Statistics, 14, 1080–1100.
Rose, K., Gurewitz, E., & Fox, G. (1990). Statistical mechanics of phase transitions in clustering. Physical Review Letters, 65, 945–948.
Titterington, D. M. (1985). Statistical analysis of finite mixture distributions. New York: Wiley.
Vapnik, V. N. (1984). Estimation of dependences based on empirical data. New York: Springer-Verlag.

Received October 14, 1998; accepted May 5, 1999.
LETTER
Communicated by Chris Williams
Efficient Block Training of Multilayer Perceptrons

A. Navia-Vázquez
A. R. Figueiras-Vidal
DTC, Universidad Carlos III de Madrid, 28911-Leganés (Madrid), Spain

The attractive possibility of applying layerwise block training algorithms to multilayer perceptrons (MLPs), which offers initial advantages in computational effort, is refined in this article by introducing a sensitivity correction factor in the formulation. This results in a clear performance advantage, which we verify in several applications. The reasons for this advantage are discussed and related to implicit connections with second-order techniques, natural gradient formulations through Fisher's information matrix, and sample selection. Extensions to recurrent networks and other research lines are suggested at the close of the article.

1 Introduction
Neural Computation 12, 1429–1447 (2000). © 2000 Massachusetts Institute of Technology.

Multilayer perceptrons (MLPs) offer a powerful approach for solving estimation and classification problems, and gradient descent with error backpropagation (BP) is the classical algorithm for training them. The BP algorithm has problems not only with local minima (also present in other search methods) but also with the speed of convergence. Similar drawbacks exist to some extent in many BP variants. The so-called block learning methods represent an interesting alternative because they proceed blockwise, decomposing the global problem into simpler layer-by-layer minimizations, such that training consists of iteratively solving those minimizations by means of efficient methods. This approach makes gradient descent unnecessary (hence avoiding its problems) and has foreseeable additional benefits, such as lower computational cost along with easier analysis and control of the training process. We will disregard as block training techniques those hybrid (between block and gradient) algorithms, such as the moving targets (MT) algorithm (Rohwer, 1991), the LS-based algorithm proposed by Di Claudio, Parisi, and Orlandi (1993), and the dynamic hidden targets (DHT) algorithm (Song & Hassoun, 1990), because they use gradient descent to some extent. We will therefore consider as pure block training methods either those that rely on linearization of the network (Douglas & Meng, 1991; Deller & Hunt, 1992) or those that use the inverse of the activation function to propagate target values to the previous layer (Pethel & Bowden, 1992; Biegler-König & Bärmann, 1993). Furthermore, we will focus here on the latter because of
their proved superior convergence. Basically, these methods decompose the global problem into minimization problems at every layer, mainly least-squares (LS) minimizations, with a twofold purpose: obtaining optimal weight values for every layer and backpropagating target values to previous layers.

In the following section, we present the block training framework in more detail and qualitatively analyze a direct implementation known as least squares backpropagation (LSB) (Biegler-König & Bärmann, 1993). In section 3, we propose a modification to LSB by solving reduced-sensitivity minimizations, thereby obtaining the reduced-sensitivity least-squares block (RS-LSB) algorithm. Section 4 compares the performance of the proposed algorithm with some other well-known training methods, using some standard simulation examples. In section 5, we compare the computational cost per epoch of these algorithms. In section 6, we give some theoretical insights about the relationship between RS-LSB and other efficient minimization techniques, such as natural gradient or second-order methods; additionally, we comment on its relationship with sample selection and regularization strategies. Finally, we present our conclusions and propose some further work.

2 Block Training

2.1 General Formulation. To preserve generality, a feedforward neural network (FFNN) scheme with K layers has been adopted, layer k containing $N_k$ neurons. The goal of the network is to learn the associations $(\mathbf{x}_p, \mathbf{d}_p)$ for a total of P patterns, where $\mathbf{x}_p$ is the input and $\mathbf{d}_p$ the corresponding desired output. From now on, matrix notation is adopted by storing the input and output patterns as the rows of matrices $\mathbf{X}$ ($P \times N_0$) and $\mathbf{D}$ ($P \times N_K$), respectively. Using this notation and taking into account that every layer contains linear (weighted sum) and nonlinear (sigmoid-like activation function) components, the update equations for layer k are:
$$\mathbf{Z}^{(k)} = \mathbf{O}^{(k-1)} \mathbf{W}^{(k)}, \qquad (2.1)$$
$$\mathbf{O}^{(k)} = [\mathbf{1} \,|\, \tanh \mathbf{Z}^{(k)}], \qquad (2.2)$$

where | denotes augmentation, $\mathbf{O}^{(k-1)}$ is the output of the preceding layer, $\mathbf{W}^{(k)}$ is the weight matrix interconnecting layers k−1 and k, and $\mathbf{Z}^{(k)}$ is the matrix formed with the state values. In equation 2.2, a hyperbolic tangent function, which maps state values to the range (−1, +1), has been used without loss of generality, and a unity vector ($\mathbf{1}$) has been appended to $\mathbf{O}^{(k)}$ so that offset values for layer k+1 can be stored as the first row of $\mathbf{W}^{(k+1)}$. Additionally, matrices $\mathbf{T}^{(k)}$ and $\mathbf{Q}^{(k)}$ store output and state target values, respectively. Thus, $\mathbf{O}^{(0)} = \mathbf{X}$ (at layer 0) and $\mathbf{T}^{(K)} = \mathbf{D}$ (at layer K).

In block training, the global minimization problem is suboptimally decomposed by solving at every layer (and iteratively until convergence)
"double" least-squares problems, that is, with respect to the weights (to solve the linear part, equation 2.1, optimally) and with respect to the previous-layer outputs (to propagate errors into the preceding layer). These adjustments are usually carried out by solving LS minimizations because, for linear problems, they lead to well-known robust and efficient algorithms. A more detailed description of a block training framework is given in the next section, where we describe a basic implementation.

2.2 A Qualitative Analysis of Direct Application. One of the simplest block training methods is the LSB algorithm (Biegler-König & Bärmann, 1993), which applies at every layer the following three-step procedure. First, the inverse of the activation function is applied to the target outputs in order to compute target states,

$$\mathbf{Q}^{(k)} = \tanh^{-1}\left(\mathbf{T}^{(k)}\right). \qquad (2.3)$$

Then the optimal weight matrix $\widehat{\mathbf{W}}^{(k)}$ for that layer can be obtained by solving

$$\min_{\widehat{\mathbf{W}}^{(k)}} \left\| \mathbf{O}^{(k-1)} \widehat{\mathbf{W}}^{(k)} - \mathbf{Q}^{(k)} \right\|_2. \qquad (2.4)$$

Once the new weights are computed, a second minimization problem is solved to obtain output target values for the previous layer ($\mathbf{T}^{(k-1)}$),

$$\min_{\mathbf{T}^{(k-1)}} \left\| \mathbf{T}^{(k-1)} \widehat{\mathbf{W}}^{(k)} - \mathbf{Q}^{(k)} \right\|_2. \qquad (2.5)$$

A training epoch is completed after applying this three-step procedure at layers K, K−1, ..., 1, and the procedure is repeated until an appropriate convergence criterion is reached.

A final consideration has to be taken into account when implementing LSB. Specifically, the target values obtained for the previous layer usually lie outside the range (−1, +1), requiring a normalization procedure to map them into the appropriate interval before the inverse activation function, equation 2.3, can be applied. A scaling matrix $\mathbf{C}^{(k)}$ has to be computed at every layer if normalization is necessary (a detailed description of the computation of $\mathbf{C}^{(k)}$, as well as its inverse, can be found in Biegler-König & Bärmann, 1993). The new matrices are

$$\widetilde{\mathbf{T}}^{(k-1)} = \mathbf{T}^{(k-1)} \left(\mathbf{C}^{(k)}\right)^{-1}, \qquad \widetilde{\mathbf{W}}^{(k)} = \mathbf{C}^{(k)} \mathbf{W}^{(k)},$$

which ensures that $\widetilde{\mathbf{T}}^{(k-1)} \widetilde{\mathbf{W}}^{(k)} = \mathbf{T}^{(k-1)} \mathbf{W}^{(k)}$ and that the values in $\widetilde{\mathbf{T}}^{(k-1)}$ lie in (−1, +1), so that equation 2.3 can be applied.
Figure 1: Contour plot and comparison of convergence trajectories of LSB, BP, and RS-LSB for the simple test case.
In some practical situations, we have found that LSB has problems related to the final error and weight values. To analyze the behavior of LSB, a very simple model has been selected which, in spite of its simplicity, allows us to highlight the weak points of LSB and propose an improved block training algorithm. The test problem is the following: given a network with a single hidden neuron, linear activation in the second layer,¹ and specific target outputs, find the optimal weights using LSB with a random weight initialization. A test case can be created with random input values chosen in the interval (−1, 1) and desired outputs computed using a model with $w^{(1)} = 4$ and $w^{(2)} = 1.5$.²

¹ In this case, no offset values have been used, and the weight matrices reduce to the real numbers $w^{(1)}$ and $w^{(2)}$.
² The choice of these values is not critical as long as $w^{(1)}$ is not very small; otherwise, the model would be nearly linear and the experiment irrelevant.

The associated error surface can easily be computed for this case and is depicted in Figure 1 as a contour plot with respect to $w^{(1)}$ and $w^{(2)}$. As can be observed, there exists a minimum at (1.5, 4), and the surface resembles a valley with a curved bottom (marked with a dashed line); in this area partial derivatives become very small, and the convergence of
gradient-descent algorithms slows dramatically. This effect is corroborated by the gradient-descent trajectory in Figure 1, where the first 20 weight values are represented by circles. When LSB is used, it converges to the bottom of the valley in a few iterations (mainly by adjusting $w^{(2)}$). Once there, it moves along the bottom, not toward the minimum but in the opposite direction (see Figure 1). We observe that LSB is not able to find a reasonable solution; furthermore, it exhibits the undesirable behavior of moving away from the minimum.³ The final situation reached by LSB is a model in which the hidden neuron works in its linear region (small weight values in the first layer and, hence, small state values for all the patterns). Similar effects can be observed in networks with many neurons, where the LSB algorithm fails to obtain a good solution, as will be shown in section 4 by means of several real-world applications.

3 Reduced-Sensitivity Block Training
In an MLP architecture, an error in state z_p^(k) is propagated to the output of the neuron with a different gain s_p^(k) (referred to as sensitivity) for every pattern. This "error gain" can be approximately computed as the derivative of the activation function at z_p^(k); that is, sensitivity is high at the linear region of the activation function and low at the saturation regions. In a block training framework, the inverse of the activation function is used to compute target states from target outputs (see equation 2.3), and some problems appear when s_p^(k) is very small, because target state values (Q^(k)) show high variability even for small perturbations in T^(k) (the "inverse error gain" is 1/s_p^(k)). We propose to take advantage of a weighted LS (WLS) cost function (the weighting values being the sensitivities s_p) to obtain an improved solution:4

E(w) = Σ_{p=1}^{P} [ s_p e_p(w) ]^2 = Σ_{p=1}^{P} [ s_p ( o_p^T w - q_p ) ]^2.   (3.1)
Minimizing E(w) with respect to w can be carried out using classical LS techniques (such as the Moore-Penrose pseudoinverse, QR or SVD decomposition, and Gaussian elimination) by redefining q and O as follows,

q_s = diag(s_p) q,   (3.2)
O_s = diag(s_p) O,   (3.3)

3 This happens even when weights are initialized very close to the optimal values.
4 Indexes referring to neuron and layer have been dropped for convenience. This error is local, and the WLS minimization has to be repeated for every neuron in the network.
1434
A. Navia-V´azquez and A. R. Figueiras-Vidal
and solving

min_w || O_s w - q_s ||^2.   (3.4)
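Once O and q have been scaled by the sensitivities (equations 3.2 and 3.3), the minimization in equation 3.4 is an ordinary least-squares problem. A minimal NumPy sketch of one neuron's WLS update follows; the function name and array shapes are our own, not the authors':

```python
import numpy as np

def wls_neuron_update(O, q, s):
    """Weighted least-squares fit of one neuron's weights (cf. eqs. 3.1-3.4).

    O : (P, N) matrix of neuron inputs, one pattern per row
    q : (P,) vector of target states (inverse activation of the targets)
    s : (P,) per-pattern sensitivities used as weights
    """
    Os = O * s[:, None]   # O_s = diag(s_p) O   (eq. 3.3)
    qs = q * s            # q_s = diag(s_p) q   (eq. 3.2)
    # Minimum-norm solution of min_w ||O_s w - q_s||^2   (eq. 3.4)
    w, *_ = np.linalg.lstsq(Os, qs, rcond=None)
    return w
```

Note that `lstsq` returns the minimum-norm solution, which matches the minimal-norm property the authors rely on in section 6.2.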
If we apply this WLS formulation to the LSB algorithm, we obtain the RS-LSB algorithm. In RS-LSB, a generalized expression for the sensitivity can be used:

s_{p,j}^(k) = ( tanh'( z_{p,j}^(k) ) )^b,   (3.5)

where tanh' is the derivative of the activation function. Expression 3.5 is qualitatively equivalent to simply evaluating the derivative, but it provides an additional adjustable parameter b. Some interesting characteristics arise with this sensitivity definition, because every neuron will selectively focus on patterns with maximal sensitivity.5 At the first layer, the maximal sensitivity locus for neuron j is a strip in the N_0-dimensional input space, centered around the hyperplane o^(0) w_j^(1) = 0 and vanishing as the distance to the hyperplane increases. In following layers, this locus can take different shapes, because states are computed using nonlinear combinations of previous-layer outputs. For illustration, we present a simple surface-modeling problem. We constructed a training set using 400 samples from the surface depicted in Figure 2a and trained a 2-4-1 MLP using RS-LSB. The resulting surface is depicted in Figure 2b; in Figure 2c we represent the four sensitivity strips at the input space using gray shades (one strip for every neuron in the hidden layer, drawn superimposed), where every hyperplane (the center of every strip) has been marked using dashed lines. Several trials yielded different arrangements of strips at the first layer, but in every case a circular sensitivity strip at the output layer (see Figure 2d) was obtained, and the problem was solved. If we regard the problem represented in Figure 2a as a binary classification task (with a circular boundary between classes as solution), we observe that the network concentrates on patterns closer to such a boundary (see Figure 2d), thereby implementing a pattern selection strategy (this aspect will be discussed further in section 6.4).
We have observed through extensive simulation that the final performance is not very dependent on the exact location of the strips and that the block training method manages to generate a strip disposition which, combined in following layers, leads to a satisfactory solution (the low variance of the experiments described in section 4 also suggests this fact). Note that

5 Standard BP behaves in an analogous manner (neurons are more strongly sensitive to being trained by patterns that fall in their "active" region), while LSB does not; hence, one of the benefits of RS-LSB is to incorporate this pattern discrimination behavior into a block training framework.
Figure 2: Visualization of the sensitivity strips in a simple case: (a) original surface and (b) NN approximation using a 2-4-1 MLP trained with RS-LSB. The sensitivity strips for the input layer can be observed in (c) (four superimposed, one for every neuron); the maximal sensitivity locus at the output layer is depicted in (d).
if b = 0, the algorithm reduces to LSB (strips having infinite width) and the network loses approximation capabilities, as will be shown in section 4 by means of several examples. If small, random values are used to initialize the weights, the neuron states are usually close to zero (z_p ≈ 0) at initial stages, yielding sensitivity values very close to unity. In this case, there is very little discrimination (as in the LSB algorithm), and ill conditioning in the next layer may arise (especially when N_k > N_{k-1}). To avoid this, sensitivity is computed using modified state values in matrix Z^(k), scaled and shifted to the range (-a, a):

s_{p,j}^(k) = ( tanh'( a ( 2 z_{p,j}^(k) - (m1 + m2) ) / (m2 - m1) ) )^b,   (3.6)
where

m1 = min_{p,j} { z_{p,j}^(k) ;  p = 1, ..., P;  j = 1, ..., N_k },
m2 = max_{p,j} { z_{p,j}^(k) ;  p = 1, ..., P;  j = 1, ..., N_k }.
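As a rough illustration of equations 3.5 and 3.6, the sketch below computes the scaled sensitivities for a layer's state matrix. The function name is ours; the min/max are taken over the whole state matrix and tanh'(z) = 1 - tanh(z)^2, following our reading of the text, and m2 > m1 is assumed:

```python
import numpy as np

def sensitivities(Z, a=3.0, b=1.0):
    """Scaled sensitivities of eq. 3.6 for a state matrix Z of shape (P, N).

    States are affinely mapped to the range (-a, a) before evaluating the
    derivative tanh'(z) = 1 - tanh(z)**2, which is then raised to b.
    With b = 0 all sensitivities are 1 and the LSB algorithm is recovered.
    """
    m1, m2 = Z.min(), Z.max()                     # assumes m2 > m1
    Zs = a * (2.0 * Z - (m1 + m2)) / (m2 - m1)    # rescale to (-a, a)
    return (1.0 - np.tanh(Zs) ** 2) ** b
```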
Sensitivity computation depends not only on b but also on the weight values of every neuron (s = (tanh'(o^T w))^b), and therefore the algorithm is not severely constrained by a particular choice of b (whenever it is greater than zero; otherwise, the LSB algorithm is obtained). In fact, we have found that similar sensitivity values can be obtained with different values of b given appropriate weight values. Because the method is not highly sensitive to the selection of b, we did not consider it necessary to explore this further analytically (a nontrivial task). We suggest, instead, trying several values and retaining the best results (usually it is enough to try b = 1, 2, 3), a procedure that we have followed in the experiments. It is possible to simplify the RS method using incomplete data set analysis (Rao & Toutenburg, 1995), where some patterns (those with sensitivity below a threshold) can be purposely discarded or truncated. Analogous simplified procedures have been applied in Azimi-Sadjadi and Liou (1992) and Wang and Chen (1996); unfortunately, patterns are discarded if they lie outside the inverse activation function domain, which clearly is a suboptimal criterion.

4 Application Examples
Before presenting the experimental results in detail, we will describe some features (common to all of the experiments) such as the NN architecture, error measures, experimental setup, convergence criterion, and particular implementations of the classical training algorithms used for comparison. Most of the specifications conform with those defined in Proben, an NN benchmark specification document and database (Prechelt, 1994), although we will explicitly state them for further clarity. With respect to the NN architecture, we use throughout an MLP with a single hidden layer and feedforward connections only (no intralayer connections or shortcuts); the number of inputs and hidden neurons is different in every experiment. A hyperbolic tangent is used in the hidden layer, and the output layer uses linear activation functions for estimation problems and sigmoidal ones for classification problems. Initialization is random, and weights are chosen in [-0.1, 0.1]. We restrict ourselves to single-hidden-layer MLPs due to their universal approximation capabilities, with the number of hidden neurons chosen without much care. This may lead to suboptimal solutions but poses no problem for efficiency comparisons
of the training algorithms given a fixed architecture, the matter of main concern here. Performance is evaluated using the normalized mean square error (NMSE = MSE / s^2, s^2 being the variance of the output training values) for approximation problems, and the classification error (CE) for classification problems. The convergence criterion used is early stopping by cross-validation. In particular, we interrupt the training process when the generalization loss GL(n) = 100 ( E_valid(n) / E_opt(n) - 1 ), with E_opt(n) = min{ E_valid(i), i = 1, ..., n }, as defined in Prechelt (1994), exceeds 1%. Data sets are split into training, validation, and test sets using a specified number of patterns, and the data are normalized to the interval (-1, 1). Ten runs are conducted in every experiment, and resulting data are collected using mean values and standard deviations. We have chosen Matlab to set up the experiments because it is a widely used tool and many NN training algorithms are readily available. Although faster implementations of these algorithms are usually available in C or C++, the use of Matlab is reasonable for comparison purposes if we measure the number of floating-point operations (FLOPS) consumed by every algorithm until convergence. The gradient descent with error BP, fast backpropagation (FBP), and Levenberg-Marquardt (LM) algorithms are available in the Neural Networks toolbox (trainbp, trainbpx, and trainlm routines, respectively), while the conjugate gradient (CG) and quasi-Newton (with Broyden-Fletcher-Goldfarb-Shanno update, BFGS) methods were not available there and had to be extracted from Bishop's NETLAB library (routines conjgrad and quasinew, respectively, also coded in Matlab).6 Block learning methods are also implemented in Matlab, using Gaussian elimination to solve every LS minimization. Unless otherwise specified, all of these algorithms have been run with the default parameter settings included in the original libraries.
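The GL(n) early-stopping rule above can be sketched as follows. This is a hypothetical helper (name and interface are ours); the default 1% threshold corresponds to the criterion used in the experiments:

```python
def gl_stopping_epoch(val_errors, threshold=1.0):
    """Early stopping by generalization loss (Prechelt, 1994).

    Stops at the first epoch n with GL(n) = 100*(E_valid(n)/E_opt(n) - 1)
    above the threshold, where E_opt(n) is the best validation error seen
    so far. Returns the stopping epoch, or None if never triggered.
    """
    e_opt = float("inf")
    for n, e in enumerate(val_errors):
        e_opt = min(e_opt, e)
        if 100.0 * (e / e_opt - 1.0) > threshold:
            return n
    return None
```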
We now describe the data sets used, as well as the results obtained with each of the algorithms.

4.1 The "Cancer" Data Set. This data set has been extracted from the Proben benchmark data set collection (Prechelt, 1994) and is also part of the UCI machine learning repository under the description "breast cancer Wisconsin."7 The purpose here is to use some cell characteristics obtained by microscopic examination (nine in total) in order to classify a tumor as benign or malignant. The total number of patterns is 699, and we have chosen a data partition that uses the first 50% of the patterns for training, 25% for validation, and 25% for testing (cancer1 partition in Proben). The MLP architecture used is 9-8-1 for all cases, and the parameters used are the default ones in the implementation, except for the learning rate in BP, which was set to 10^-3 (otherwise training was exceedingly slow). In RS-LSB we used a = 3 and b = 1. The number of runs is 10, and the results of the experiment are shown in Table 1, where we present mean and standard deviation values of the training and test classification errors. We observe that the best-performing methods are BP, CG, LM, and RS-LSB, LSB yielding a suboptimal solution (CE almost twice that of RS-LSB). Although further explanation will be given in the next section, we can see that block learning methods are advantageous in computational load with respect to the other algorithms, RS-LSB being about three times faster than CG and six times faster than LM, two of the fastest algorithms yielding similar performance.

6 NETLAB can be obtained online at http://www.ncrg.aston.ac.uk/netlab.
7 Proben is available for anonymous FTP from the Neural Bench archive at Carnegie Mellon University (ftp.cs.cmu.edu, /afs/cs/project/connect/bench/contrib/prechelt).

Table 1: Performance Comparison for the "Cancer" Data Set.

              BP           FBP        CG          BFGS       LM         LSB        RS-LSB
Train CE %    3.7 (0.12)   4 (0.3)    3.7 (0.2)   4.6 (0.7)  3.9 (0.5)  4.9 (0.4)  3.8 (0.3)
Test CE %     1.7 (0.1)    2.2 (0.5)  1.8 (0.2)   2.8 (0.9)  1.8 (0.4)  2.8 (0.4)  1.7 (0.2)
No. Iters.    2517 (887)   79 (36)    12.8 (0.5)  8.8 (4.2)  5.7 (2.9)  8 (1.2)    9.2 (0.9)
Load % BP     100%         3.5%       5.5%        3.5%       11%        1.1%       1.7%

Notes: We show classification error (mean values and standard deviations, in parentheses), corresponding to 10 runs. The average number of epochs to converge and the (measured) computational cost have also been included, the latter expressed as a percentage with respect to the BP load.

4.2 The "Boston Housing" Data Set. This data set has been extracted from the Delve8 database and is also part of the StatLib database at Carnegie Mellon University.9 It contains information about housing in the Boston area. The purpose here (prototask described as "price") is to use some variables (for example, per capita crime rate by town, nitric oxides concentration, average number of rooms per dwelling, and pupil-teacher ratio by town; up to 13 mixed categorical and continuous variables) to predict house prices. The total number of patterns is 506 in this case, and we also used a data partition with the first half of the patterns for training, 25% for validation, and 25% for testing. The MLP architecture used is 13-20-1, and the training parameters are the default ones, except for the learning rate in BP, which was set to 10^-3. In RS-LSB we have used a = 3 and b = 1. The number of runs is 10, and the results are shown in Table 2, where we present mean and standard deviation values of the training and test NMSE. We observe that the best-performing methods are BP, LM, and RS-LSB, LSB yielding a suboptimal solution (NMSE almost double that of RS-LSB). In this case, RS-LSB is about 5 times faster than BP and 20 times faster than LM, the two algorithms yielding similar performance.

8 Delve (Data for Evaluating Learning in Valid Experiments) is a standardized environment designed to evaluate the performance of methods that learn relationships based primarily on empirical data; it can be accessed online at the Computer Science Department of the University of Toronto (http://www.cs.toronto.edu/~delve/).
9 The location is http://lib.stat.cmu.edu/datasets/boston.

Table 2: Results for the "Boston Housing" Data Set.

              BP           FBP          CG           BFGS         LM           LSB          RS-LSB
Train NMSE    0.09 (0.05)  0.41 (0.12)  0.17 (0.05)  0.26 (0.06)  0.12 (0.2)   0.21 (0.02)  0.12 (0.03)
Test NMSE     0.18 (0.01)  0.45 (0.1)   0.27 (0.05)  0.3 (0.07)   0.21 (0.11)  0.31 (0.01)  0.19 (0.05)
No. Iters.    688 (68)     51 (35)      15 (4.7)     7.6 (3.9)    7.4 (2.9)    10 (2.7)     31 (6.5)
Load % BP     100%         7%           24%          15%          363%         7%           19%

Notes: We illustrate performance by means of the NMSE (mean values and standard deviations, in parentheses), corresponding to 10 runs. The average number of epochs to converge and the (measured) computational cost have also been included, the latter expressed as a percentage with respect to the BP load.

4.3 The "Computer Activity" Data Set. This data set is also included in the Delve collection and contains information about computer system measures (e.g., number of system calls per second, number of "write" calls) in order to predict CPU consumption. The number of available patterns is 8192, and we have used the first half of them for training, 25% for validation, and the rest for testing (with the original ordering). The MLP architecture used is 12-8-1, and the parameters are the default ones, except for the learning rate in BP, which was set to 5 x 10^-5 (the default value was too high for this application and caused instabilities). In RS-LSB we have used a = 3 and b = 3. The number of runs is 10, and the results are shown in Table 3, where we present mean and standard deviation values of the training and test NMSE. We observe that the best-performing methods are BP, FBP, CG, and RS-LSB, LSB yielding a suboptimal solution (NMSE about four times bigger than in the RS-LSB case). This time, RS-LSB is about 13 times faster than BP, 6 times faster than FBP, and 11 times faster than CG.
Table 3: Experiment Results for the "Computer Activity" Data Set.

              BP            FBP          CG            BFGS         LM           LSB          RS-LSB
Train NMSE    0.04 (0.004)  0.07 (0.01)  0.04 (0.002)  0.15 (0.14)  0.3 (0.2)    0.26 (0.05)  0.08 (0.02)
Test NMSE     0.05 (0.005)  0.07 (0.01)  0.05 (0.002)  0.15 (0.13)  0.26 (0.19)  0.25 (0.03)  0.06 (0.01)
No. Iters.    395 (120)     153 (57)     30 (5.4)      13.6 (7.9)   16 (5.6)     6.2 (1.7)    10.9 (4)
Load % BP     100%          39%          85%           37%          187%         4%           8%

Notes: We show the NMSE (mean values and standard deviations, in parentheses) corresponding to 10 runs. The average number of epochs to converge and the (measured) computational cost have also been included, the latter expressed as a percentage with respect to the BP load.
5 Computational Considerations
In order to evaluate the speed gain of block learning methods with respect to more classical approaches (BP, FBP, CG, BFGS, and LM), we have registered the number of floating-point operations per epoch consumed by every algorithm and the number of epochs needed to converge. In Tables 1 through 3 we show the computational load with respect to BP. It can be observed that, from this point of view, block methods always appear to be advantageous, RS-LSB being from 3 to 13 times faster than the fastest competitor (among those with comparable final performance); the moderate computational excess needed by RS-LSB with respect to LSB is justified by the improved quality of the solutions. Although the numbers collected in Tables 1 through 3 reflect actual computational cost, it is convenient to give approximate expressions for the computational load per epoch of every one of the algorithms under test. It is rather complicated to derive exact analytical expressions because of the numerous operations involved, and therefore we have obtained empirical expressions assuming a computational cost linear in the number of weights for the BP, FBP, CG, and BFGS cases, quadratic in the number of weights for the LM case, and quadratic in the number of neurons for LSB and RS-LSB; the free (multiplying) parameters are adjusted afterward by LS fitting in a logarithmic scale (i.e., computational loads are first converted to a log scale in order to obtain a similar fitting error for problems with different complexity). This estimation gives, as shown in Figure 3, estimates (-o-) close to the measured numbers (-*-) (the analytical expressions are also included in Figure 3). The accuracy of these expressions serves to confirm that, as we had guessed, BP, FBP, CG, and BFGS behave linearly in the number of weights (N_W), LM is quadratic in N_W, and block training methods are quadratic in the number of neurons (N_N).
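With the exponent fixed in advance (1 for the linear-cost algorithms, 2 for LM and the block methods), only the multiplying constant c in load ≈ c·x^k needs fitting, and in log scale the LS solution is simply the mean log-residual. A hypothetical helper sketching this fit (function name and interface are ours):

```python
import numpy as np

def fit_multiplier(x, load, k):
    """Fit c in load ≈ c * x**k by least squares in log scale.

    x    : array of predictor values (e.g., number of weights or neurons)
    load : measured per-epoch costs
    k    : fixed exponent (1 or 2, as assumed in the text)
    In log scale, log(load) ≈ log(c) + k*log(x), so the LS estimate of
    log(c) is the mean of log(load) - k*log(x).
    """
    logc = np.mean(np.log(load) - k * np.log(x))
    return float(np.exp(logc))
```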
Figure 3: (a) Computational load (number of floating-point operations per epoch) as a function of P N_W for the three data sets under study (o) and the empirical estimation (*). The estimates are accurate enough. (b) Estimated load expressions for every algorithm, where N_W is the number of weights in the network (including biases) and N_N the number of neurons (including the input layer).
6 Some Theoretical Insights

6.1 Points of Tangency with Second-Order Searching. The basic local technique to find the minimum of a nonquadratic error function using
second-order information is the well-known Newton's method. It accepts that the error function E(w) can be locally approximated by a second-order polynomial and computes the Newton step Δw_N (such that E(w + Δw_N) is minimal) by solving

H(w) Δw_N = -∇E(w),   (6.1)

where ∇E(w) is the gradient vector and H(w) the Hessian matrix. In our case, vector w represents one of the columns of W^(k); we are solving a local minimization problem at every neuron. If we consider the error term r(w) = tanh(Ow) - t, then E(w) = (1/2) r(w)^T r(w), the gradient vector is

∇E(w) = J(w)^T r(w),   (6.2)

and the Hessian matrix can be reasonably approximated (Battiti, 1992) using the Jacobian matrix J(w) = ∂r(w)/∂w,

H(w) ≈ J(w)^T J(w).   (6.3)

We obtain the Gauss-Newton step by solving equation 6.1 using these values,

Δw_GN = -( J(w)^T J(w) )^-1 J(w)^T r(w),   (6.4)
such that strong similarities (and also qualitative differences) appear between the Gauss-Newton method and the RS-LSB algorithm (with b = 1 in equation 3.5). To facilitate their comparison, we obtain the analogous RS-LSB step by conveniently rewriting equation 3.4 and minimizing with respect to the weight update Δw_RS-LSB,10

min_{Δw_RS-LSB} || O_s w + O_s Δw_RS-LSB - q_s ||^2,   (6.5)
and, if we use the Moore-Penrose generalized inverse to solve equation 6.5, we obtain

Δw_RS-LSB = -( O_s^T O_s )^-1 O_s^T ( O_s w - q_s )
          = -( O_s^T O_s )^-1 O_s^T r_RS-LSB(w).   (6.6)
10 Note that block learning methods can be formulated in direct or incremental form, the former yielding a new set of weights in every step (solving equation 3.4) and the latter obtaining weight increments (solving equation 6.5) to be added to the present weight values, both formulations leading to identical solutions.
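The equivalence of the direct and incremental formulations claimed in the footnote can be checked numerically; when O_s has full column rank, both forms yield the same weights (the function names below are ours, not the authors'):

```python
import numpy as np

def direct_solution(Os, qs):
    """Direct form: w_new = argmin_w ||O_s w - q_s||^2   (eq. 3.4)."""
    w, *_ = np.linalg.lstsq(Os, qs, rcond=None)
    return w

def incremental_solution(Os, qs, w_old):
    """Incremental form (eq. 6.5): solve for the update dw and add it
    to the current weights; identical to the direct form when O_s has
    full column rank."""
    dw, *_ = np.linalg.lstsq(Os, qs - Os @ w_old, rcond=None)
    return w_old + dw
```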
It is straightforward to verify that O_s = J(w) (using equation 3.5, with b = 1), such that, to compare the RS-LSB and Gauss-Newton methods, only the following residual terms have to be considered:

r_GN(w) = r(w) = tanh(Ow) - t,   (6.7)

r_RS-LSB(w) = ( O_s w - q_s ) = diag(s) ( Ow - q ).   (6.8)
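A sketch of one Gauss-Newton step for a single tanh neuron (equation 6.4), which also exhibits the identity O_s = J(w) for b = 1 used in this comparison; the function name and array shapes are our own assumptions:

```python
import numpy as np

def gauss_newton_step(O, t, w):
    """One Gauss-Newton step for r(w) = tanh(O w) - t   (eq. 6.4).

    Rows of the Jacobian J are tanh'(o_p^T w) * o_p^T, which for b = 1
    coincide with the rows of the sensitivity-scaled matrix O_s.
    """
    z = O @ w
    r = np.tanh(z) - t                            # residual, eq. 6.7
    J = (1.0 - np.tanh(z) ** 2)[:, None] * O      # Jacobian = O_s for b = 1
    dw, *_ = np.linalg.lstsq(J, -r, rcond=None)   # solves (J^T J) dw = -J^T r
    return dw
```

Iterating `w = w + gauss_newton_step(O, t, w)` from a point near a zero-residual solution converges rapidly, as expected of Gauss-Newton on small-residual problems.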
The multiplication by the sensitivity in equation 6.8 emphasizes those error terms o_p^T w - q_p with a low value of |o_p^T w|, because s_p becomes negligible when |o_p^T w| → ∞, thereby yielding the localization properties mentioned in section 3. Thus, the RS-LSB algorithm, when considered as a whole, combines an accurate local method to adjust the weights of every neuron with the layerwise training approach that decomposes the global optimization problem into smaller LS problems. When we use the general expression 3.6 for the sensitivity, which means b ≠ 1 and the scaling procedure, the RS-LSB and Gauss-Newton methods are no longer directly comparable, but the qualitative behavior of RS-LSB is maintained.

6.2 Relationship with Regularization Techniques. Taking advantage of the LS formulation, it would be easy to include an additional regularization term a ||w||^2 in equation 3.4, such that the minimization becomes

min_w { || O_s w - q_s ||^2 + a ||w||^2 },   (6.9)
which leads to well-known regularization techniques that either add a small value (a) to the diagonal of O_s^T O_s or discard those singular values smaller than a. We have observed through experimentation that this extra regularization mechanism sometimes excessively limits the approximation capabilities of the network (being also highly sensitive to the choice of a). As the RS-LSB algorithm already relies on minimal-norm solutions, it generally obtains solutions with good generalization, and we found no reason to include this additional mechanism. (Further comments on regularization are included in section 6.4.)

6.3 Relationship with Natural Gradient. It has recently been shown that the parameter space of perceptrons is not Euclidean but has a Riemannian structure (Amari, 1998). Thus, the ordinary gradient does not provide the steepest direction, which is given by the natural (contravariant) gradient. Preliminary results (Amari, 1998) show that the performance of this method is good, releasing BP from many of its convergence problems. The Riemannian steepest descent direction of E(w) is
-∇̃E(w) = G^-1(w) ( -∇E(w) ),   (6.10)
where G(w) is the Riemannian metric tensor. If we formulate every neuron using the statistical model t = tanh(o^T w) + n, n being an N(0, σ_n^2) random variable and p_o(o^T) being the distribution of inputs, the input-output joint probability is

p(o^T, t; w) = p_o(o^T) ( 1 / (√(2π) σ_n) ) exp{ -( tanh(o^T w) - t )^2 / σ_n^2 },   (6.11)
and the geometry of the space is given by the Fisher information matrix

G(w) = E{ ( ∂ log p(o^T, t; w) / ∂w ) ( ∂ log p(o^T, t; w) / ∂w )^T },   (6.12)

which becomes (Amari, 1998)

G(w) = (1 / σ_n^2) E{ ( tanh'(o^T w) )^2 o o^T }.   (6.13)
The expectation in equation 6.13 has to be estimated using an average over the finite training set of P patterns:

Ĝ(w) = ( 1 / (σ_n^2 P) ) Σ_{p=1}^{P} ( tanh'(o_p^T w) o_p ) ( tanh'(o_p^T w) o_p )^T
     = ( 1 / (σ_n^2 P) ) J(w)^T J(w).   (6.14)
Block methods assume normally distributed errors in state-space and, in particular, RS-LSB uses the weighted error terms e_p = s_p ( o_p^T w - q_p ) (see equation 3.1), such that the new joint probability function is

p(o^T, t; w) = p_o(o^T) ( 1 / (√(2π) σ_n) ) exp{ -( diag(s) ( o^T w - q ) )^2 / σ_n^2 }.   (6.15)
It is also easy to find the Fisher information matrix in this case:

Ĝ_RS-LSB(w) = ( 1 / (σ_n^2 P) ) Σ_{p=1}^{P} ( s_p o_p ) ( s_p o_p )^T = ( 1 / (σ_n^2 P) ) O_s^T O_s,   (6.16)

which is completely equivalent to equation 6.14, given the relationship J(w) = O_s. We observe that the RS-LSB update, equation 6.6, contains a factor proportional to Ĝ_RS-LSB^-1(w). Therefore, it truly takes into account the nonorthonormal geometry of the space, benefiting from the good convergence properties attributed to the natural gradient. On the other hand, note that in LSB, the estimated Fisher matrix is Ĝ_LSB(w) = ( 1 / (σ_n^2 P) ) O^T O (it lacks the tanh'(o_p^T w) terms), which is clearly different from equation 6.14.
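The two estimated Fisher matrices can be sketched numerically. With sensitivities s_p = tanh'(o_p^T w) (i.e., b = 1), the RS-LSB estimate of equation 6.16 coincides with the empirical Fisher of equation 6.14, while the LSB one lacks the tanh' factors; the function names below are ours:

```python
import numpy as np

def fisher_rs_lsb(O, w, sigma2=1.0):
    """G_hat_RS-LSB = O_s^T O_s / (sigma2 * P), with s_p = tanh'(o_p^T w)."""
    P = O.shape[0]
    s = 1.0 - np.tanh(O @ w) ** 2     # sensitivities for b = 1
    Os = s[:, None] * O
    return Os.T @ Os / (sigma2 * P)

def fisher_lsb(O, sigma2=1.0):
    """G_hat_LSB = O^T O / (sigma2 * P): it lacks the tanh' factors."""
    P = O.shape[0]
    return O.T @ O / (sigma2 * P)
```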
6.4 On Implicit Sample Selection and Regularization. Sample selection is a well-known strategy for improving the efficiency of neural networks. The first proposal for introducing sample selection (Munro, 1992) demonstrated this, as have many other modifications (Cachin, 1994), some using other cost objectives (Telfer & Szu, 1994), according to the discussion in Cid-Sueiro, Arribas-Sánchez, Urbán-Muñoz, and Figueiras-Vidal (1999). Recently, powerful learning architectures, the support vector machines (SVM), have appeared as a result of formulations including generalization objectives (Vapnik, 1995; Vapnik, Golowich, & Smola, 1997; Schölkopf et al., 1997). These architectures also include (implicit) sample selection. Our approach has some differences from these. Specifically, the selection is carried out independently at every neuron, and none of the patterns is (fully) discarded at any point. These characteristics can be advantageous. For example, the second approach seems to be a softer way of doing things, more adequate when training data may include misclassified patterns or outliers. Finally, note that in the above references for SVM (and also in Bartlett, 1997), good performance is attributed to controlling the weight sizes (as in the first proposal of regularization for pruning purposes; Hinton, 1987). In this sense, since LS minimizations yield minimum-norm solutions, LSB, and in particular RS-LSB, will also follow this principle.
7 Conclusions
We have seen that block training methods constitute a powerful approach to MLP training, promising a great reduction in computational effort with respect to other techniques. Some problems arise in direct implementations such as LSB, mainly concerning the final error and generalization capability. We have proposed and developed a reduced sensitivity approach, which has proved to be advantageous with respect to other well-known training algorithms, as shown by means of several real-world examples. Our approach is also related to some other efficient training techniques, such as second-order approaches or natural gradient descent. Additionally, reducing the sensitivity of the model to some of the training patterns can be considered an implicit sample selection strategy, which also yields benefits for the global learning process. Extending this approach to other scenarios, such as recurrent neural networks, could yield simpler and practically applicable designs, overcoming one of the limitations of this type of neural network. Applying other cost functions and additional sample selection strategies are also interesting tasks, as are analyzing sensitivities and variable selection under these new training procedures.
Acknowledgments
This work has been partially supported by the III National R&D Plan of the Spanish Government (CICYT), under grant TIC96-0500-C10-03.

References

Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Azimi-Sadjadi, M., & Liou, R. (1992). Fast learning process of multilayer neural networks using recursive least squares method. IEEE Trans. on Signal Processing, 40, 446–450.
Bartlett, P. (1997). For valid generalization, the size of the weights is more important than the size of the network. In M. C. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 134–140). San Mateo, CA: Morgan Kaufmann.
Battiti, R. (1992). First- and second-order methods for learning: Between steepest descent and Newton's method. Neural Computation, 4, 141–166.
Biegler-König, F., & Bärmann, F. (1993). A learning algorithm for multilayered neural networks based on linear least squares problems. Neural Networks, 6, 127–131.
Cachin, C. (1994). Pedagogical pattern selection strategies. Neural Networks, 7, 175–181.
Cid-Sueiro, J., Arribas-Sánchez, I., Urbán-Muñoz, S., & Figueiras-Vidal, A. R. (1999). Cost functions to estimate "a posteriori" probabilities in multi-class problems. IEEE Trans. on Neural Networks, 10(3), 645–656.
DiClaudio, E. D., Parisi, R., & Orlandi, G. (1993). LS-based training algorithms for neural networks. In C. Kamm & G. Kuhn (Eds.), Neural networks for signal processing III (pp. 22–29). New York: IEEE Press.
Deller, J., & Hunt, S. (1992). A simple "linearized" algorithm which outperforms back-propagation. In Proc. Intl. Joint Conf. on Neural Networks (Vol. 3, pp. 133–138).
Douglas, S., & Meng, T. (1991). Linearized least-squares training of multilayer feedforward neural networks. In Proc. Intl. Joint Conference on Neural Networks (Vol. 1, pp. 307–312).
Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. Bakker, A. Nijman, & P. Treleaven (Eds.), Proc. PARLE Conference on Parallel Architectures and Languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Munro, P. (1992). Repeat until bored: A pattern selection strategy. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 1001–1008). San Mateo, CA: Morgan Kaufmann.
Pethel, S., & Bowden, C. (1992). Neural network applications to nonlinear time series analysis. In Proc. Intl. Conf. on Industrial Electronics, Control, Instrumentation and Automation, Power Electronics and Motion Control (Vol. 2, pp. 1094–1098).
Prechelt, L. (1994). Proben1: A set of neural network benchmark problems and benchmarking rules (Tech. Rep. No. 21/94). Fakultät für Informatik, Universität Karlsruhe, Karlsruhe, Germany. Available online via anonymous FTP: ftp.ira.uka.de, /pub/papers/techreports/1994/1994-21.ps.Z.
Rao, C., & Toutenburg, H. (1995). Linear models: Least squares and alternatives. New York: Springer-Verlag.
Rohwer, R. (1991). The moving targets training algorithm. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 558–565). San Mateo, CA: Morgan Kaufmann.
Schölkopf, B., Sung, K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., & Vapnik, V. (1997). Comparing support vector machines with gaussian kernels to radial basis function classifiers. IEEE Trans. on Signal Processing, 45, 2758–2765.
Song, J., & Hassoun, M. (1990). Learning with hidden targets. In Proc. Intl. Joint Conf. on Neural Networks (Vol. 3, pp. 93–98).
Telfer, B., & Szu, H. (1994). Energy functions for minimizing misclassification error with minimum-complexity networks. Neural Networks, 7, 809–818.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., Golowich, S., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 281–287). Cambridge, MA: MIT Press.
Wang, G.-J., & Chen, C.-C. (1996). A fast multilayer neural-network training algorithm based on the layer-by-layer optimizing procedures. IEEE Trans. Neural Networks, 7, 768–775.

Received June 26, 1998; accepted May 15, 1999.
LETTER
Communicated by Richard Golden
An Optimization Approach to Design of Generalized BSB Neural Associative Memories

Jooyoung Park
Department of Control and Instrumentation Engineering, Korea University, Chochiwon, Chungnam, 339-800, Korea

Yonmook Park
Department of Information Engineering, Graduate School, Korea University, Chochiwon, Chungnam, 339-800, Korea

This article is concerned with the synthesis of the optimally performing GBSB (generalized brain-state-in-a-box) neural associative memory given a set of desired binary patterns to be stored as asymptotically stable equilibrium points. Based on some known qualitative properties and newly observed fundamental properties of the GBSB model, the synthesis problem is formulated as a constrained optimization problem. Next, we convert this problem into a quasi-convex optimization problem called a GEVP (generalized eigenvalue problem). This conversion is particularly useful in practice, because GEVPs can be efficiently solved by recently developed interior point methods. Design examples are given to illustrate the proposed approach and to compare it with existing synthesis methods.

1 Introduction
Recently, the problem of realizing associative memories via recurrent neural networks has received much attention (Hassoun, 1993; Michel & Farrell, 1990). In general, the key properties required of these neural associative memories are the following (Lillo, Miller, Hui, & Zak, 1994):

The desired binary patterns, which will be called the prototype patterns in this article, are stored as asymptotically stable equilibrium points of the system.

When a vector sufficiently close to a stored prototype pattern is applied as an initial condition, the system trajectory converges to that prototype pattern.

The number of asymptotically stable equilibrium points that do not correspond to the prototype patterns (i.e., spurious states) is minimal.

The system is globally stable; every trajectory of the system converges to some equilibrium point.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1449–1462 (2000)
Among the various kinds of promising neural models suitable for achieving these properties are the so-called brain-state-in-a-box (BSB) neural networks. This model was first proposed by Anderson, Silverstein, Ritz, and Jones (1977) and has attracted substantial interest among researchers. Cohen and Grossberg (1983) proved a theorem on the global stability of continuous-time, continuous-state BSB dynamical systems with real symmetric weight matrices. Golden (1986) proved that all trajectories of BSB dynamical systems with real symmetric weight matrices approach the set of equilibrium points under a certain condition, which will be used as the global stability condition in this article. Marcus and Westervelt (1989) reported results on global stability for a large class of BSB-type systems. Perfetti (1995) analyzed qualitative properties of the BSB model and formulated the design of BSB-based associative memories as a constrained optimization in the form of a linear program with an additional nonlinear constraint. He also proposed an ad hoc iterative algorithm to solve the constrained optimization and illustrated the algorithm with some design examples. Park, Cho, and Park (1999) recast Perfetti's formulation into a semidefinite programming problem (SDP) by converting its nonlinear constraints into a set of linear matrix inequalities (LMIs). We are concerned here with developing a synthesis procedure for associative memories based on an advanced form of the BSB model, often referred to as the GBSB (generalized BSB). Our synthesis procedure will be given in the form of an LMI-based optimization problem called a generalized eigenvalue problem (GEVP). This article may be viewed as extending the design method of Park et al. (1999) to GBSB memories. Both articles use readily available software (the MATLAB LMI Control Toolbox) rather than implementing ad hoc algorithms for the memory synthesis.
The most significant difference between them is that the synthesis procedure we present is based on the GEVP, while the method of Park et al. (1999) uses the SDP. A design example in section 4 will demonstrate the superiority of the proposed GEVP-based method over the SDP method. The GBSB model was proposed and studied by Hui and Zak (1992) and Golden (1993). Lillo, Miller, Hui, and Zak (1994) analyzed the dynamics of the GBSB model and presented a novel synthesis procedure for GBSB-based associative memories. Their procedure uses a decomposition of the weight matrix, which results in an asymmetric interconnection structure, asymptotic stability of the prototype patterns, and a small number of spurious states. Later, Zak, Lillo, and Hui (1996) incorporated learning and forgetting capabilities into the synthesis method of Lillo et al. (1994). Also, Chan and Zak (1997) proposed a "designer" neural network to solve the problem of realizing associative memories via GBSB neural networks. In this article, attention is focused on how to find the parameters of the optimally performing GBSB given a set of prototype patterns. After introducing fundamental properties of the GBSB model, we formulate the synthesis of the GBSB that can store the prototype patterns with optimal
performance as a constrained optimization problem. Next, we convert the synthesis problem into an optimization problem called a generalized eigenvalue problem (GEVP). This conversion is particularly useful in practice, because GEVPs can be efficiently solved by recently developed interior point methods (Boyd, El-Ghaoui, Feron, & Balakrishnan, 1994; Vandenberghe & Balakrishnan, 1997). For specific algorithms belonging to this class of interior point methods, one can refer to Boyd and El-Ghaoui (1993) and Nesterov and Nemirovskii (1994), for example. An implementation of the Nesterov and Nemirovskii algorithm is also provided in the LMI Control Toolbox (Gahinet, Nemirovskii, Laub, & Chilali, 1995). Since GEVPs can be efficiently solved by interior point methods, converting the synthesis problem into a GEVP is in effect equivalent to finding a solution of the original problem. As the optimum searcher for the GEVP appearing in the proposed approach, we use the function "gevp" of the LMI Control Toolbox.

Throughout this article, we use the following definitions and notation. R^n denotes the set of real n-vectors. A symmetric matrix A ∈ R^{n×n} is positive definite if x^T A x > 0 for any x ≠ 0, and A > 0 denotes this. Also, A > B denotes that A − B is positive definite. H^n denotes the hypercube [−1, +1]^n. By a binary vector, we mean a bipolar binary vector whose elements are either +1 or −1, and the set of all such binary vectors in H^n is denoted by B^n. The usual Hamming distance between two vectors x* and x in B^n is denoted by H(x*, x). By an l-neighborhood of a vertex c ∈ B^n, we mean the set {c* ∈ B^n | H(c*, c) ≤ l}.

In section 2, we present some known qualitative properties and newly observed fundamental properties of the GBSB model. Based on these properties, we formulate the synthesis of the GBSB with optimal performance as a constrained optimization problem. Section 3 describes how this synthesis problem can be converted into a GEVP.
In section 4, we present numerical examples showing the superiority of the proposed method over existing synthesis methods. Finally, section 5 contains concluding remarks.

2 Background Results
The dynamics of the GBSB model are described by the following state equation:

x(k + 1) = g( x(k) + α (W x(k) + b) ),  (2.1)

where x(k) ∈ R^n is the state vector at time k, α > 0 is the step size, W ∈ R^{n×n} is the weight matrix, b ∈ R^n is the bias vector, and g: R^n → R^n is a linear saturating function whose ith component is defined by

g_i([x_1 ··· x_i ··· x_n]^T) =  1 if x_i ≥ 1;  x_i if −1 < x_i < 1;  −1 if x_i ≤ −1.
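For concreteness, one synchronous update of equation 2.1 can be sketched in plain Python (an illustration of ours; the function name is not from the article):

```python
def gbsb_step(x, W, b, alpha=0.3):
    """One update of the GBSB model: x(k+1) = g(x(k) + alpha*(W x(k) + b)),
    where g is the componentwise linear saturation onto [-1, 1]."""
    n = len(x)
    out = []
    for i in range(n):
        v = x[i] + alpha * (sum(W[i][j] * x[j] for j in range(n)) + b[i])
        out.append(max(-1.0, min(1.0, v)))   # linear saturating function g_i
    return out
```

Iterating gbsb_step from an initial state in the hypercube H^n traces a trajectory of system 2.1.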
This GBSB model is a generalized version of the BSB network proposed by Anderson et al. (1977); it differs from the original network in the presence of the bias vector b. In discussing the stability of the GBSB model, we use the following definitions:

A point x_e ∈ R^n is an equilibrium point of system 2.1 if x(0) = x_e implies x(k) = x_e for all k > 0.

An equilibrium point x_e of system 2.1 is stable if for any ε > 0, there exists δ > 0 such that ‖x(0) − x_e‖ < δ implies ‖x(k) − x_e‖ < ε for all k > 0.

An equilibrium point x_e of system 2.1 is asymptotically stable if it is stable and there exists δ > 0 such that x(k) → x_e as k → ∞ whenever ‖x(0) − x_e‖ < δ.
System 2.1 is globally stable if every trajectory of the system converges to some equilibrium point.

The criteria for the stability of the GBSB model are now well established (Lillo et al., 1994; Golden, 1993; Perfetti, 1995):

A vertex c of the hypercube H^n is an equilibrium point of system 2.1 if and only if

c_i ( Σ_{j=1}^{n} w_ij c_j + b_i ) ≥ 0,  ∀i ∈ {1, ..., n}.  (2.2)
A vertex c of the hypercube H^n is an asymptotically stable equilibrium point of system 2.1 if

c_i ( Σ_{j=1}^{n} w_ij c_j + b_i ) > 0,  ∀i ∈ {1, ..., n}.  (2.3)
If condition 2.3 holds for a vertex c ∈ B^n and w_ii = 0, ∀i ∈ {1, ..., n}, then other asymptotically stable equilibrium points of system 2.1 can exist only at vertices c* ∈ B^n such that H(c, c*) > 1.¹

¹ In fact, this was proved only for the BSB model in theorem 1 and corollary 2 of Perfetti (1995). However, one can easily prove that it still holds true for the GBSB model by following the arguments of Perfetti (1995).
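Conditions 2.2 and 2.3 are easy to test numerically. The sketch below (our illustration; the helper names and the toy one-pattern memory are our own constructions, not from the article) checks condition 2.3 for a vertex and spot-checks the remark above for the case W = 0:

```python
from itertools import product

def hamming(u, v):
    """Hamming distance H(u, v) between two bipolar vectors."""
    return sum(a != b for a, b in zip(u, v))

def is_asymptotically_stable_vertex(c, W, b):
    """Condition 2.3: c_i (sum_j w_ij c_j + b_i) > 0 for all i."""
    n = len(c)
    return all(c[i] * (sum(W[i][j] * c[j] for j in range(n)) + b[i]) > 0
               for i in range(n))

# a toy memory with w_ii = 0 storing the single vertex c:
n = 4
c = (1, -1, 1, -1)
W = [[0.0] * n for _ in range(n)]
b = list(c)                       # gives c_i (sum_j w_ij c_j + b_i) = 1 > 0
assert is_asymptotically_stable_vertex(c, W, b)
# consistent with the remark above: no vertex at Hamming distance 1 from c
# is an asymptotically stable equilibrium point here
for v in product([-1, 1], repeat=n):
    if hamming(v, c) == 1:
        assert not is_asymptotically_stable_vertex(v, W, b)
```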
System 2.1 is globally stable if the weight matrix W is symmetric and

λ_min > −1,  (2.4)

where λ_min is the smallest eigenvalue of I + αW.

Designing GBSB neural associative memories on the basis of the above criteria alone is in general not good enough; guidelines addressing the attraction domains of the prototype patterns and the number of spurious states should be added. We now present our main theorem, which will be used to provide an additional guideline for performance optimization:

Theorem. Let c ∈ B^n be an asymptotically stable equilibrium point of system 2.1, and let l be an integer in {1, ..., n}. If W and b satisfy

w_ii = 0,  ∀i ∈ {1, ..., n},  (2.5)

and

c_i ( Σ_{j=1}^{n} w_ij c_j + b_i ) > 2(l − 1) max_j |w_ij|,  i = 1, ..., n,  (2.6)

then any binary vector c* ∈ B^n such that 1 ≤ H(c*, c) ≤ l has the following properties:

c* is not an equilibrium point.

If x(0) = c* and c_i* ≠ c_i, then at the next time step, x_i moves toward c_i (i.e., x_i(1) is closer to c_i than x_i(0) is).
Proof. Let c* ∈ B^n be any binary vector satisfying 1 ≤ H(c*, c) ≤ l, and let i be any index with c_i* ≠ c_i (i.e., c_i* = −c_i). Since d := c* − c satisfies

| Σ_{j=1}^{n} w_ij d_j | = | w_i1 d_1 + ··· + 0·d_i + ··· + w_in d_n | ≤ 2(l − 1) max_j |w_ij|,

we have

c_i* ( Σ_{j=1}^{n} w_ij c_j* + b_i ) = −c_i ( ( Σ_{j=1}^{n} w_ij c_j + b_i ) + ( Σ_{j=1}^{n} w_ij d_j ) )
  ≤ −c_i ( Σ_{j=1}^{n} w_ij c_j + b_i ) + | Σ_{j=1}^{n} w_ij d_j |
  ≤ −c_i ( Σ_{j=1}^{n} w_ij c_j + b_i ) + 2(l − 1) max_j |w_ij|.

This, together with inequality 2.6, implies that

c_i* ( Σ_{j=1}^{n} w_ij c_j* + b_i ) < 0.  (2.7)

Thus condition 2.2 does not hold for c* ∈ B^n, and therefore c* cannot be an equilibrium point of system 2.1. Now let system 2.1 start from x(0) = c*. From inequality 2.7, we see that c_i* and ( Σ_{j=1}^{n} w_ij c_j* + b_i ) have different polarities. Since

x_i(1) = g_i( x_i(0) + α ( Σ_{j=1}^{n} w_ij x_j(0) + b_i ) ) = g_i( c_i* + α ( Σ_{j=1}^{n} w_ij c_j* + b_i ) ),

we have |x_i(1) − c_i| < |x_i(0) − c_i| = 2.
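The theorem can be checked numerically in a degenerate but instructive case (our construction, not from the article): take W = 0, so that conditions 2.5 and 2.6 reduce to c_i b_i > 0 for every l, and choose b = c. Then any vertex with a flipped coordinate violates the equilibrium condition 2.2 there, and the flipped coordinate moves back toward c in one step:

```python
def update_coord(x, W, b, i, alpha=0.3):
    """x_i(1) from equation 2.1, i.e., g_i applied to the ith pre-activation."""
    n = len(x)
    v = x[i] + alpha * (sum(W[i][j] * x[j] for j in range(n)) + b[i])
    return max(-1.0, min(1.0, v))

n = 4
c = [1.0, -1.0, 1.0, -1.0]
W = [[0.0] * n for _ in range(n)]   # w_ii = 0 and max_j |w_ij| = 0
b = c[:]                            # makes c_i (sum_j w_ij c_j + b_i) = 1 > 0

c_star = c[:]
c_star[0] = -c[0]                   # a vertex at Hamming distance 1 from c
# c* violates condition 2.2 at the flipped coordinate ...
assert c_star[0] * (sum(W[0][j] * c_star[j] for j in range(n)) + b[0]) < 0
# ... and x_0 moves toward c_0 at the next step, as the theorem asserts
x0_next = update_coord(c_star, W, b, 0)
assert abs(x0_next - c[0]) < abs(c_star[0] - c[0])
```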
From our theorem, we see that W and b satisfying conditions 2.5 and 2.6 ensure that the vertices in the l-neighborhood of the prototype pattern c cannot become spurious states and are very likely to lie in the attraction domain of c. Thus, a possible strategy for decreasing the number of spurious states and enlarging the attraction domains of the prototype patterns is to maximize l in constraints 2.6 for each prototype pattern c ∈ B^n. This observation and the other background results lead to the following formulation of the optimal design: given a set of prototype patterns x^(1), ..., x^(m), the parameters of the optimally performing GBSB can be found by solving

max  δ (> 0)
s.t. x_i^(k) ( Σ_{j=1}^{n} w_ij x_j^(k) + b_i ) > 2δ max_j |w_ij|,  i = 1, ..., n, k = 1, ..., m,
     w_ii = 0,  i = 1, ..., n,
     W = W^T,
     λ_min > −1,  (2.8)
where λ_min is the smallest eigenvalue of I + αW. This optimization problem has two groups of nonlinear constraints, which prevent us from applying readily available optimization techniques. In the next section, we show that the problem can be converted into an optimization problem that can be solved efficiently.

3 A GEVP Approach to Design of GBSBs
In this section, we convert problem 2.8 into an optimization problem called a GEVP. The general form of a GEVP is as follows (Boyd et al., 1994):

min  λ
s.t. λB(z) − A(z) > 0,  B(z) > 0,  C(z) > 0,

where A(z), B(z), and C(z) are symmetric matrices that are affine functions of the variable z. This GEVP minimizes the maximum generalized eigenvalue of the pair (A(z), B(z)) subject to the linear matrix inequality (LMI) constraints B(z) > 0 and C(z) > 0. It is a quasi-convex optimization problem and can be efficiently solved by recently developed interior point methods (Boyd et al., 1994; Vandenberghe & Balakrishnan, 1997). Before proceeding further, note that multiple LMIs C^(1)(z) > 0, ..., C^(p)(z) > 0 can be expressed as the single LMI diag(C^(1)(z), ..., C^(p)(z)) > 0. In order to convert problem 2.8 into a GEVP, we first introduce additional variables q_i, i = 1, ..., n, satisfying
x_i^(k) ( Σ_{j=1}^{n} w_ij x_j^(k) + b_i ) > 2δ q_i > 2δ max_j |w_ij|,  i = 1, ..., n, k = 1, ..., m.  (3.1)

Note that inequalities 3.1 are only a restatement of the first constraint set of problem 2.8 with the deliberately chosen terms 2δq_i placed between x_i^(k) ( Σ_{j=1}^{n} w_ij x_j^(k) + b_i ) and 2δ max_j |w_ij|. Also note that inequalities 3.1 can be divided into the LMIs

q_i − w_ij > 0,  i = 1, ..., n, j = 1, ..., n,
w_ij + q_i > 0,  i = 1, ..., n, j = 1, ..., n,

and the set of inequalities

−2δq_i + x_i^(k) ( Σ_{j=1}^{n} w_ij x_j^(k) + b_i ) > 0,  i = 1, ..., n, k = 1, ..., m.
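That the bound q_i > max_j |w_ij| splits exactly into the 2n² elementwise LMIs above can be confirmed directly, since q_i > |w_ij| is equivalent to the pair q_i − w_ij > 0 and q_i + w_ij > 0. A small check of ours (not from the article):

```python
def bound_holds(q, W):
    """q_i > max_j |w_ij| for every row i."""
    return all(q[i] > max(abs(w) for w in W[i]) for i in range(len(q)))

def lmis_hold(q, W):
    """q_i - w_ij > 0 and w_ij + q_i > 0 for all i, j."""
    return all(q[i] - W[i][j] > 0 and W[i][j] + q[i] > 0
               for i in range(len(q)) for j in range(len(q)))
```

The two predicates always agree, which is why the elementwise form can replace the max in the constraint set.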
Next, we consider the eigenvalue condition, inequality 2.4. Since I + αW is real symmetric, its eigenvalues are real, and the corresponding eigenvectors can be chosen to be real and orthonormal (Strang, 1988). Thus its spectral decomposition can be written as

I + αW = U Λ U^T,

where the eigenvalues of I + αW appear on the diagonal of Λ, and U, whose columns are the real orthonormal eigenvectors of I + αW, satisfies U U^T = U^T U = I (Strang, 1988). Note that λ_min > −1 is equivalent to

Λ > −I.  (3.2)

By pre- and postmultiplying matrix inequality 3.2 by U and U^T, respectively, we obtain the equivalent condition I + αW > −I, which is in turn equivalent to the following LMI:

2I + αW > 0.

As a result of these conversions, the problem of finding the GBSB that stores the prototype patterns x^(1), ..., x^(m) with optimal performance can now be reformulated as the following GEVP:

min  (−δ)
s.t. (−δ)(2q_i) + x_i^(k) ( Σ_{j=1}^{n} w_ij x_j^(k) + b_i ) > 0,  i = 1, ..., n, k = 1, ..., m,
     { q_i − w_ij > 0,  i = 1, ..., n, j = 1, ..., n,
       w_ij + q_i > 0,  i = 1, ..., n, j = 1, ..., n,
       2I + αW > 0,
       w_ii = 0,  i = 1, ..., n,
       W = W^T }.  (3.3)
Note that the constraints inside "{·}" are LMIs in q_i, i = 1, ..., n, and in the independent scalar entries of W satisfying w_ii = 0, i = 1, ..., n, and W = W^T. As a further requirement that should be considered in designing GBSB neural associative memories, the magnitudes of the weights |w_ij| should have a reasonable upper bound so that the resulting networks can be implemented without difficulty. To reflect this practical concern, the following LMIs are added to the list of constraints of problem 3.3:

L < q_i < U,  i = 1, ..., n.  (3.4)

Note that the resulting optimization problem is still in the form of a GEVP.
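The article solves this GEVP with the function gevp of the MATLAB LMI Control Toolbox. To see why quasi-convexity matters, note that for any fixed δ all constraints of problem 3.3 become LMIs, so the optimal δ can in principle be found by bisection over an LMI feasibility oracle. The sketch below is ours, with a hypothetical stand-in oracle; a real implementation would call an SDP/LMI solver for each fixed δ:

```python
def maximize_delta(feasible, lo=0.0, hi=10.0, tol=1e-9):
    """Bisection for a quasi-convex problem: `feasible(delta)` reports whether
    the LMI constraints of problem 3.3 admit a solution for that fixed delta."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            lo = mid    # feasible: try a larger delta
        else:
            hi = mid    # infeasible: shrink the search interval
    return lo
```

With a monotone oracle (feasible for all δ below some threshold and infeasible above it), the returned value approaches that threshold.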
4 Design Examples
To demonstrate the applicability of the GEVP approach proposed in this article and to compare it with existing synthesis methods, we consider three design examples. In these examples, we consider the GBSB model or the BSB model with n = 10 neurons. The performance of each designed memory system is evaluated by simulations in which every possible binary vector is applied as an initial condition and the system is allowed to evolve from that initial condition to a final state. From the simulation results, one can obtain the average recall probabilities and the convergence rate to correct patterns, defined as follows:

Given a prototype pattern x^(k) and an integer l ∈ {0, 1, ..., n}, the probability that a binary initial condition vector at Hamming distance l from x^(k) is attracted to x^(k) is denoted by P(x^(k), l).

By the average recall probability Prob(l), we mean the average of P(x^(k), l) over all the prototype patterns x^(1), ..., x^(m), that is, Prob(l) = { Σ_{k=1}^{m} P(x^(k), l) } / m.

By the convergence rate to correct patterns, we mean the percentage of binary initial condition vectors that are attracted to the prototype pattern nearest in the sense of Hamming distance.
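The evaluation procedure just described can be sketched end to end in Python (our illustration; for brevity the test at the bottom uses a toy one-pattern memory rather than a GBSB designed by the GEVP):

```python
from itertools import product

def gbsb_step(x, W, b, alpha=0.3):
    """One update of equation 2.1 with componentwise saturation onto [-1, 1]."""
    n = len(x)
    return [max(-1.0, min(1.0, x[i] + alpha * (
        sum(W[i][j] * x[j] for j in range(n)) + b[i]))) for i in range(n)]

def settle(x, W, b, alpha=0.3, iters=200):
    """Iterate equation 2.1 from x until a fixed point is reached."""
    for _ in range(iters):
        nxt = gbsb_step(x, W, b, alpha)
        if nxt == x:
            break
        x = nxt
    return x

def convergence_rate(prototypes, W, b, alpha=0.3):
    """Fraction of all binary initial vectors attracted to a nearest prototype
    (in the sense of Hamming distance), per the evaluation procedure above."""
    n = len(prototypes[0])
    hits, total = 0, 0
    for x0 in product([-1.0, 1.0], repeat=n):
        final = settle(list(x0), W, b, alpha)
        dists = [sum(a != q for a, q in zip(x0, p)) for p in prototypes]
        nearest = [p for p, d in zip(prototypes, dists) if d == min(dists)]
        hits += any(final == list(p) for p in nearest)
        total += 1
    return hits / total
```

For n = 10 as in the examples, the loop visits all 2^10 = 1024 binary initial conditions.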
In the first example, which is taken from Lillo et al. (1994), we wish to store the following six prototype patterns:

x^(1) = [ −1 +1 +1 +1 +1 +1 −1 −1 −1 −1 ]^T
x^(2) = [ +1 +1 −1 −1 −1 +1 −1 −1 +1 −1 ]^T
x^(3) = [ −1 +1 +1 +1 −1 −1 +1 −1 −1 −1 ]^T
x^(4) = [ −1 +1 −1 −1 −1 −1 +1 −1 +1 +1 ]^T
x^(5) = [ +1 −1 −1 +1 +1 −1 +1 +1 +1 −1 ]^T
x^(6) = [ +1 +1 −1 +1 −1 +1 +1 +1 −1 −1 ]^T
The step size of the GBSB model was set to α = 0.3 as in Lillo et al. (1994). Solving the corresponding GEVP with the bounds in inequalities 3.4 set to L = 0.1 and U = 0.2, we obtained a GBSB memory for the first example. According to the simulation results, the GBSB designed by the proposed method has no spurious states, while the GBSB of Lillo et al. (1994) has two spurious states in B^n. The convergence rates to correct patterns and the average recall probabilities of these GBSBs are shown in Table 1 and Figure 1, respectively.

In the second example, which is taken from Chan and Zak (1997), we want to store the following five prototype patterns in the GBSB model:

x^(1) = [ −1 +1 −1 +1 +1 +1 −1 +1 +1 +1 ]^T
x^(2) = [ +1 +1 −1 −1 +1 −1 +1 −1 +1 +1 ]^T
Table 1: Convergence Rates to Correct Patterns of the GBSBs Designed for the First Example.

    Method                Convergence Rate to Correct Patterns (%)
    Lillo et al. (1994)   13.6
    Proposed method       92.7

x^(3) = [ −1 +1 +1 +1 −1 −1 +1 −1 +1 −1 ]^T
x^(4) = [ +1 +1 −1 +1 −1 +1 −1 +1 +1 +1 ]^T
x^(5) = [ +1 −1 −1 −1 +1 +1 +1 −1 −1 −1 ]^T
As in Chan and Zak (1997), the step size of the GBSB model was set to α = 0.3. Solving the corresponding GEVP with L = 0.1 and U = 0.2, we obtained a GBSB memory for the second example. From the simulations, we found that neither the GBSB designed by the proposed method nor the GBSB of Chan and Zak (1997) has any spurious states. The convergence rates to correct patterns and the average recall probabilities of these GBSBs are shown in Table 2 and Figure 2, respectively.
Figure 1: Average recall probabilities of the GBSBs designed for the first example.
Table 2: Convergence Rates to Correct Patterns of the GBSBs Designed for the Second Example.

    Method                 Convergence Rate to Correct Patterns (%)
    Chan and Zak (1997)    80.1
    Proposed method        85.3
In the third and final example, which was considered in Perfetti (1995) and Park et al. (1999), the prototype patterns of the second example are considered again. The design objective of this example is to store the prototype patterns in the BSB model instead of the GBSB model. For a fair comparison, the bias vector and the step size were fixed at b = 0 and α = 1, respectively, as in Perfetti (1995) and Park et al. (1999). Solving the corresponding GEVP with L = 0.1 and U = 0.2, we obtained a BSB memory. From the simulations for the BSB designed by the proposed method, the BSB of Perfetti (1995), and BSB I of Park et al. (1999), we observed that each BSB has five spurious states, which consist of the negatives of the prototype patterns. The
Figure 2: Average recall probabilities of the GBSBs designed for the second example.
Table 3: Convergence Rates to Correct Patterns of the BSBs Designed for the Third Example.

    Method                       Convergence Rate to Correct Patterns (%)
    Perfetti (1995)              46.7
    BSB I of Park et al. (1999)  46.4
    Proposed method              48.9
convergence rates to correct patterns and the average recall probabilities of these BSBs are shown in Table 3 and Figure 3, respectively. The performance comparisons in the above design examples show that the memories designed by the proposed method generally have higher average recall probabilities, higher convergence rates to correct patterns, and fewer spurious states than (or a number comparable to) the memories designed by other recently developed synthesis methods.
Figure 3: Average recall probabilities of the BSBs designed for the third example.
5 Conclusion
We have addressed the problem of finding the parameters of GBSB neural networks that can store given prototype patterns with optimal performance. Based on known qualitative properties and newly derived fundamental properties of the GBSB model, the problem was formulated as a nonlinear constrained optimization problem. This problem was then converted into a quasi-convex optimization problem called a GEVP. The conversion is particularly useful in practice, because GEVPs can be efficiently solved by recently developed interior point methods. Three design examples were presented to illustrate the proposed method and to compare it with existing methods. The simulation results showed that the memories obtained by the proposed method perform better than the memories designed by existing methods.
References

Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Boyd, S., & El-Ghaoui, L. (1993). Method of centers for minimizing generalized eigenvalues. Linear Algebra and Its Applications, 188, 63–111.
Boyd, S., El-Ghaoui, L., Feron, E., & Balakrishnan, V. (1994). Linear matrix inequalities in system and control theory. Philadelphia: SIAM.
Chan, H. Y., & Zak, S. H. (1997). On neural networks that design neural associative memories. IEEE Transactions on Neural Networks, 8, 360–372.
Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13, 815–826.
Gahinet, P., Nemirovskii, A., Laub, A. J., & Chilali, M. (1995). LMI Control Toolbox. Natick, MA: MathWorks.
Golden, R. M. (1986). The brain-state-in-a-box neural model is a gradient descent algorithm. Journal of Mathematical Psychology, 30, 73–80.
Golden, R. M. (1993). Stability and optimization analyses of the generalized brain-state-in-a-box neural network model. Journal of Mathematical Psychology, 37, 282–298.
Hassoun, M. H. (Ed.). (1993). Associative neural memories: Theory and implementation. New York: Oxford University Press.
Hui, S., & Zak, S. H. (1992). Dynamical analysis of the brain-state-in-a-box (BSB) neural models. IEEE Transactions on Neural Networks, 3, 86–100.
Lillo, W. E., Miller, D. C., Hui, S., & Zak, S. H. (1994). Synthesis of brain-state-in-a-box (BSB) based associative memories. IEEE Transactions on Neural Networks, 5, 730–737.
Marcus, C. M., & Westervelt, R. M. (1989). Dynamics of iterated-map neural networks. Physical Review A, 40, 501–504.
Michel, A. N., & Farrell, J. A. (1990). Associative memories via artificial neural networks. IEEE Control Systems Magazine, 10, 6–17.
Nesterov, Y., & Nemirovskii, A. (1994). Interior-point polynomial algorithms in convex programming. Philadelphia: SIAM.
Park, J., Cho, H., & Park, D. (1999). On the design of BSB neural associative memories using semidefinite programming. Neural Computation, 11, 1985–1994.
Perfetti, R. (1995). A synthesis procedure for brain-state-in-a-box neural networks. IEEE Transactions on Neural Networks, 6, 1071–1080.
Strang, G. (1988). Linear algebra and its applications (3rd ed.). Orlando, FL: HBJ Publishers.
Vandenberghe, L., & Balakrishnan, V. (1997). Algorithms and software for LMI problems in control. IEEE Control Systems Magazine, 17, 89–95.
Zak, S. H., Lillo, W. E., & Hui, S. (1996). Learning and forgetting in generalized brain-state-in-a-box (GBSB) neural associative memories. Neural Networks, 9, 845–854.

Received February 4, 1999; accepted April 20, 1999.
LETTER
Communicated by Klaus Obermayer
Nonholonomic Orthogonal Learning Algorithms for Blind Source Separation

Shun-ichi Amari
Tian-Ping Chen
Andrzej Cichocki
RIKEN Brain Science Institute, Brain-Style Information Systems Group, Japan

Independent component analysis or blind source separation extracts independent signals from their linear mixtures without assuming prior knowledge of the mixing coefficients. It is known that the independent signals in the observed mixtures can be successfully extracted except for their order and scales. In order to resolve the indeterminacy of scales, most learning algorithms impose some constraints on the magnitudes of the recovered signals. However, when the source signals are nonstationary and their average magnitudes change rapidly, the constraints force a rapid change in the magnitude of the separating matrix. This is the case in most applications (e.g., speech sounds, electroencephalogram signals), and it is known to cause numerical instability in some cases. In order to resolve this difficulty, this article introduces new nonholonomic constraints into the learning algorithm. These are motivated by the geometrical consideration that the directions of change in the separating matrix should be orthogonal to the equivalence class of separating matrices induced by the scaling indeterminacy. The constraints are proved to be nonholonomic, so that the proposed algorithm is able to adapt to rapid or intermittent changes in the magnitudes of the source signals. The proposed algorithm works well even when the number of sources is overestimated (assuming the sensor noise is negligibly small), whereas existing algorithms do not, because they amplify the null components not included in the sources. Computer simulations confirm this desirable property.

1 Introduction
Blind source separation is the problem of recovering a number of independent random or deterministic signals when only their linear mixtures are available. Here, "blind" means that the mixing coefficients, the original signals, and their probability distributions are not known a priori. The problem is also called independent component analysis (ICA) or blind extraction of source signals. It has become increasingly important because of its many potential applications, such as speech recognition and enhancement, telecommunications, and biomedical signal analysis and processing. However, there still remain open theoretical problems to be solved. The source signals can be identified only up to a certain ambiguity, that is, up to a permutation and arbitrary scaling of the original signals. In order to resolve the scaling indeterminacy, most learning algorithms developed so far impose some constraints on the power or amplitude of the separated signals (Jutten & Hérault, 1991; Hérault & Jutten, 1986; Bell & Sejnowski, 1995; Comon, 1994; Douglas, Cichocki, & Amari, 1997; Sorouchyari, 1991; Chen & Chen, 1994; Cichocki, Unbehauen, & Rummert, 1994; Cichocki, Unbehauen, Moszczynski, & Rummert, 1994; Cichocki, Amari, Adachi, & Kasprzak, 1996; Cichocki, Bogner, Moszczynski, & Pope, 1997; Cardoso & Laheld, 1996; Girolami & Fyfe, 1997a, 1997b; Amari, Chen, & Cichocki, 1997; Amari, Cichocki, & Yang, 1995, 1996; Amari & Cardoso, 1997; Amari, 1967; Amari & Cichocki, 1998). When the source signals are stationary and their average magnitudes do not change over time, these constraints work well to extract the original signals at fixed magnitudes. However, in many practical applications, such as evoked potentials and human speech, the source signals change their local average magnitudes. In such cases, when a component source signal suddenly becomes very small, the corresponding row of the separating matrix tends to grow large during learning in order to compensate for this change and to emit an output signal of the same amplitude. In particular, when one source signal becomes silent, the separating matrix diverges. This causes serious numerical instability.

This article proposes an efficient learning algorithm with equivariant properties that works well even when the magnitudes of the source signals change quickly. To this end, we first introduce an equivalence relation on the set of all demixing matrices.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1463–1484 (2000)
Two matrices are defined to be equivalent when the recovered signals differ only by scaling. When the trajectory of the learning dynamics of the demixing matrix moves within a single equivalence class, the recovered signals remain the same except for scaling; such motion is therefore ineffective for learning. In order to reduce these ineffective movements, we modify the natural gradient learning rule (Amari, Cichocki, & Yang, 1996; Amari, 1998) so that the trajectory of learning is orthogonal to the equivalence class. This imposes constraints on the directions of movement during learning. However, these constraints are proved to be nonholonomic, implying that there are no integral submanifolds corresponding to them. Consequently, the trajectory can reach any demixing matrix in spite of the constraints, a property that is also useful in robotic design; nonholonomic dynamics is a current research topic in robotics. Since the trajectory governed by the proposed algorithm does not include useless (redundant) components, the algorithm can be more efficient (with respect to convergence and efficiency) than algorithms with other
constraints. It is also robust against sudden changes in the magnitudes of the source signals. Even when some components disappear, the algorithm still works well; therefore, it can be used even when the number of source signals is overestimated. These features have been confirmed by computer simulations. The stability conditions for the proposed algorithm are also derived.

2 Critical Review of Existing Adaptive Learning Algorithms
The problem of blind source separation is formulated as follows (Hérault & Jutten, 1986; Jutten & Hérault, 1991). Let n source signals s_1(t), ..., s_n(t) at time t be represented by a vector s(t) = [s_1(t) ··· s_n(t)]^T, where the superscript T denotes transposition of a vector or matrix. The sources are assumed to be stochastically independent. We assume that the source signals are not directly accessible; instead we can observe n signals x_1(t), ..., x_n(t), summarized in a vector x(t) = [x_1(t) ··· x_n(t)]^T, which are instantaneous linear mixtures of the n source signals. Therefore, we have

x(t) = A s(t),  (2.1)
where A = [a_ij]_{n×n} is an unknown nonsingular mixing matrix. The source signals s(1), s(2), ..., s(T) are generated at discrete times t = 1, 2, ..., T subject to unknown probability distributions. In order to recover the original source signals s(t), we form new linear mixtures of the sensor signals x(t) by

y(t) = W(t) x(t),  (2.2)
where W = [w_ij]_{n×n} is called the demixing or separating matrix. If W(t) = A^{−1}, the original source signals are recovered exactly. However, A (and hence the true W) is unknown, so we need to estimate W from the available observations x(1), ..., x(t); W(t) denotes this estimate. It is well known that the mixing matrix A, and hence the source signals, are not completely identifiable. There are two types of indeterminacy or ambiguity: the order of the original source signals is indefinite, and their scales are indefinite. The first indeterminacy implies that a permutation or reordering of the recovered signals remains possible, but this is not critical, because what is wanted is to extract n independent signals; their ordering is not important. The second indeterminacy can be explained as follows. When the signals s are transformed as s′ = Λs, where Λ is a diagonal scaling matrix with nonzero diagonal entries, the observed signals are kept invariant if the mixing matrix A is changed into A′ = AΛ^{−1}, because x = As = A′s′. This implies that W and W′ = ΛW are equivalent in the sense that they give the same estimated source signals except for their scales. Taking these indeterminacies into account, we can estimate
1466
Shun-ichi Amari, Tian-Ping Chen, and Andrzej Cichocki
W = A^{-1} P Λ at best, where P is a permutation matrix and Λ is a diagonal matrix. These intrinsic (unavoidable) ambiguities are not a severe limitation, because in most applications (for instance, speech signals and multisensor biomedical recordings) most of the information is contained in the waveforms of the signals rather than in their magnitudes or the order in which they are arranged.

In order to make the scaling indeterminacies clearer, we introduce an equivalence relation in the n²-dimensional space of nonsingular matrices Gl(n) = {W}. We define W and ΛW as equivalent,

W ~ ΛW,   (2.3)
where Λ is an arbitrary nonsingular diagonal matrix. Given W, its equivalence class

C_W = {W′ | W′ = ΛW}   (2.4)
is an n-dimensional subspace of Gl(n) consisting of all the matrices equivalent to W. Therefore, a learning algorithm need not search for the nonidentifiable W = A^{-1}; it can instead search for the equivalence class C_W that contains the true W = A^{-1} up to permutations.

In existing algorithms, one of two kinds of constraints is imposed, explicitly or implicitly, in order to determine the scales. The first is a rigid constraint. Let us introduce n restrictions on W,

f_i(W) = 0,   i = 1, ..., n.   (2.5)
A typical one is

f_i(W) = w_ii − 1,   i = 1, ..., n.
Define the set Q_r of W's satisfying the above constraints to be

Q_r = {W | f_i(W) = 0, i = 1, ..., n}.   (2.6)
The set Q_r is supposed to form an (n² − n)-dimensional manifold transversal to C_W. In order to resolve the scaling indeterminacies, we require that W(t) = [w_ij(t)] always be included in Q_r. Such a rigid constraint is implemented, for example, by the learning algorithm

w_ij(t + 1) = w_ij(t) − η(t) φ_i(y_i(t)) y_j(t),   i ≠ j,   (2.7)

where the φ_i(y_i) are suitably chosen nonlinear functions, η(t) is the step size, and all the diagonal elements w_ii are fixed equal to 1.
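The scale and permutation indeterminacies above are easy to check numerically. The following sketch (illustrative only; the laplacian source distribution, dimensions, and seed are our choices, not from the text) verifies that x = As is unchanged when (A, s) is replaced by (AΛ^{-1}, Λs), so that W and ΛW are equivalent separators:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 1000
s = rng.laplace(size=(n, T))        # independent sources (illustrative data)
A = rng.normal(size=(n, n))         # unknown nonsingular mixing matrix
x = A @ s                           # observations, x = A s  (eq. 2.1)

# Scale indeterminacy: rescaling the sources and counter-scaling A
# leaves the observations unchanged, x = A s = A' s'.
L = np.diag([2.0, -0.5, 3.0])       # nonsingular diagonal matrix Lambda
assert np.allclose(x, (A @ np.linalg.inv(L)) @ (L @ s))

# Hence W and Lambda W give the same separated signals up to row scales.
W = np.linalg.inv(A)
assert np.allclose((L @ W) @ x, L @ (W @ x))
```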
This idea of slightly reducing the complexity of a learning algorithm is a simple one and has been used in several papers (Hérault & Jutten, 1986; Jutten & Hérault, 1991; Sorouchyari, 1991; Chen & Chen, 1994; Moreau & Macchi, 1995; Matsuoka, Ohya, & Kawamoto, 1995; Oja & Karhunen, 1995). However, the constraints (see equation 2.5) do not allow equivariant properties (Cardoso & Laheld, 1996). Therefore, when A is ill conditioned, the algorithm does not work well.

The second kind of constraint specifies the power or energy of the recovered (estimated) signals. Such soft constraints can be expressed in the general form

E[h_i(y_i)] = 0,   i = 1, ..., n,   (2.8)
where E denotes the expectation and h_i(y_i) is a nonlinear function; typically

h_i(y_i) = 1 − φ_i(y_i) y_i   (2.9)

or

h_i(y_i) = 1 − y_i².   (2.10)
The latter guarantees that the variances or powers of the recovered signals y_i are 1. Such soft constraints are implemented automatically in the adaptive learning rule

W(t + 1) = W(t) + η(t) F[y(t)] W(t),   (2.11)
where F[y(t)] is an n × n matrix of the form (Cichocki, Unbehauen, & Rummert, 1994; Amari et al., 1996)

F[y(t)] = I − φ(y(t)) y^T(t),   (2.12)
or (Amari et al., 1997)

F[y(t)] = I − a φ(y(t)) y^T(t) + b y(t) φ^T(y(t))   (2.13)
with adaptively or adequately chosen constants a and b, or (Cardoso & Laheld, 1996)

F[y(t)] = I − y(t) y^T(t) − φ(y(t)) y^T(t) + y(t) φ^T(y(t)).   (2.14)
The diagonal elements of the matrix F(y) in equation 2.11 can be regarded as specifying the constraints

E[h_i(y_i)] = 0.   (2.15)
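One update of the soft-constraint rule, equations 2.11 and 2.12, fits in a few lines. The sketch below is our illustration; the choice φ = tanh and the step size are assumptions, not prescribed by the text:

```python
import numpy as np

def soft_constraint_step(W, x, eta=0.01, phi=np.tanh):
    """One step of W <- W + eta * [I - phi(y) y^T] W  (eqs. 2.11-2.12)."""
    y = W @ x
    # The diagonal of F implements the soft constraint E[1 - phi(y_i) y_i] = 0.
    F = np.eye(len(y)) - np.outer(phi(y), y)
    return W + eta * F @ W
```

At an equilibrium of the averaged dynamics, E[F(y)] = 0, which is exactly the constraint set of equation 2.15 together with the off-diagonal independence conditions.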
Then, although W(t) can move freely in C_W during the learning process, any equilibrium point of the averaged learning equation satisfies the constraints of equation 2.15. Let us define the set Q_s of W's by

Q_s = {W | E[h_i(y_i)] = 0, i = 1, ..., n}.   (2.16)
Then Q_s is an (n² − n)-dimensional attracting submanifold, and W(t) is attracted toward Q_s, although it does not stay exactly in Q_s because of stochastic fluctuations.

Summarizing, rigid constraints do not permit algorithms with equivariant properties (Cardoso & Laheld, 1996), whereas soft constraints do. However, for source signals whose amplitudes or powers change over time (e.g., signals with long silent periods, spikes, or rapid jumps, such as speech), rapid convergence to equilibria is harder to achieve than in the stationary case, because the constraints are inadequately specified. Special tricks or techniques may be employed, such as random selection of training samples and dilation of silent periods (Bell & Sejnowski, 1995; Amari, Douglas, & Cichocki, 1998), but performance may still be poorer for such nonstationary signals. Moreover, it is easy to show that algorithms 2.11 through 2.14 are unstable (in the sense that the weights w_ij(t) diverge to infinity) when the number of true sources is smaller than the number of observed signals, W being a square n × n matrix. Such a case arises often in practice, when one or more source signals are silent or decay to zero in an unpredictable way.

In this article, we elucidate the roles of such constraints geometrically. We then show that these constraints, and the drawbacks above, can be lifted or at least relaxed by introducing a new algorithm. In other words, we impose no explicit constraints on the separating matrix W and search for the equivalence class C_W directly. This article is theoretical in nature, and the simulations are limited; larger-scale simulation studies and applications to real data are necessary to establish the practical usefulness of the proposed method.

3 Derivation of a New Learning Algorithm
In order to derive the new algorithm, we recapitulate the derivation of the natural gradient algorithm due to Amari, Chen, and Cichocki (1997). To begin, we consider the loss function

l(y, W) = − log |det(W)| − Σ_{i=1}^{n} log p_i(y_i),   (3.1)
where the p_i(y_i) are positive probability density functions (p.d.f.'s) of the output signals and det(W) denotes the determinant of the matrix W. We can then apply the stochastic gradient descent method to derive a learning rule. In order to calculate the gradient of l, we derive the total differential dl of l when W is changed from W to W + dW. In component form,

dl = l(y, W + dW) − l(y, W) = Σ_{i,j} (∂l/∂w_ij) dw_ij,   (3.2)
where the coefficients ∂l/∂w_ij of dw_ij represent the gradient of l. Simple algebraic and differential calculus yields

dl = −tr(dW W^{-1}) + φ(y)^T dy,   (3.3)
where tr is the trace of a matrix and φ(y) is a column vector whose components are

φ_i(y_i) = − d log p_i(y_i)/dy_i = − p_i′(y_i)/p_i(y_i),   (3.4)
the prime denoting differentiation. From y = Wx, we have

dy = dW x = dW W^{-1} y.   (3.5)
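For concreteness, equation 3.4 gives familiar score functions: a gaussian density yields φ(y) = y, and a laplacian density yields φ(y) = sgn(y). The small check below is our illustration (not from the text); it compares the gaussian case against a numerical derivative of −log p:

```python
import numpy as np

y = np.linspace(-2.0, 2.0, 9)
# Gaussian p(y) ∝ exp(-y²/2)  gives  phi(y) = y;
# Laplacian p(y) ∝ exp(-|y|)  gives  phi(y) = sgn(y).
phi_gauss = y
phi_laplace = np.sign(y)

# Numerical check of phi = -d log p / dy for the gaussian case (eq. 3.4):
log_p = lambda u: -0.5 * u**2          # up to an additive constant
h = 1e-6
phi_num = -(log_p(y + h) - log_p(y - h)) / (2 * h)
assert np.allclose(phi_num, phi_gauss, atol=1e-5)
```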
Hence, we define

dX = dW W^{-1},   (3.6)
whose components dx_ij are linear combinations of the dw_ij. The differentials dx_ij form a basis of the tangent space of Gl(n) at W, since they are linear combinations of the basis dw_ij. The space Gl(n) of all nonsingular matrices is a Lie group. An element W is mapped to the identity I when W^{-1} is multiplied from the right. A small deviation dW at W is mapped to dX = dW W^{-1} at I. By using dX, we can compare deviations dW_1 and dW_2 at two different points W_1 and W_2, through dX_1 = dW_1 W_1^{-1} and dX_2 = dW_2 W_2^{-1}. The magnitude of a deviation dX is measured by the inner product

⟨dX, dX⟩_I = tr(dX dX^T) = Σ_{i,j} (dx_ij)².   (3.7)
This implies that the inner product of dW at W is given by

⟨dW, dW⟩_W = tr{(dW W^{-1})(dW W^{-1})^T},   (3.8)
which is different from Σ_{i,j} (dw_ij)². Therefore, the space Gl(n) has a Riemannian metric. It should be noted that dX = dW W^{-1} is a nonintegrable differential form, so there is no matrix function X(W) whose differential gives equation 3.6. Nevertheless, the nonholonomic basis dX is useful. Since the basis dX is orthonormal in the sense that ‖dX‖² is given by equation 3.7, it is effective to analyze the learning equation in terms of dX: the natural Riemannian gradient (Amari, Cichocki, & Yang, 1995; Amari et al., 1996, 1998) is automatically implemented by it, and the equivariant properties automatically hold. It is easy to rewrite the results in terms of dW by using equation 3.6. The gradient dl in equation 3.3 is expressed in the differential form

dl = −tr(dX) + φ(y)^T dX y.   (3.9)
This leads to the stochastic gradient learning algorithm

ΔX(t) = X(t + 1) − X(t) = −η(t) dl/dX = η(t) [I − φ(y(t)) y^T(t)],   (3.10)

in terms of ΔX(t) = ΔW(t) W^{-1}(t). This is rewritten as

W(t + 1) = W(t) + η(t) [I − φ(y(t)) y^T(t)] W(t)   (3.11)
in terms of W(t). It was shown that the diagonal term in equation 3.11 can be set arbitrarily (Cichocki, Unbehauen, Moszczynski, & Rummert, 1994; Cichocki & Unbehauen, 1996; Amari & Cardoso, 1997). The problem has also been analyzed on the basis of the information geometry of semiparametric statistical models (Amari & Kawanabe, 1997a, 1997b; Bickel, Klaassen, Ritov, & Wellner, 1993). Therefore, the above algorithm can be generalized in a more flexible and universal form as

W(t + 1) = W(t) + η(t) [Λ(t) − φ(y(t)) y^T(t)] W(t),   (3.12)

where Λ(t) is any positive definite diagonal scaling matrix. By determining Λ appropriately depending on y, we obtain various algorithms. We propose

Λ(t) = diag{y_1(t) φ_1(y_1(t)), ..., y_n(t) φ_n(y_n(t))}.

This choice is effective when the nonlinear activation functions φ_i(y_i) are suitably chosen. It corresponds to setting the new constraints

dx_ii(t) = 0,   i = 1, ..., n.   (3.13)
The new constraints lead to a new learning algorithm, written as

W(t + 1) = W(t) − η(t) F[y(t)] W(t),   (3.14)

where all the elements on the main diagonal of the n × n matrix F(y) are set equal to zero, that is,

f_ii = 0   (3.15)

and

f_ij = φ_i(y_i) y_j,   i ≠ j.   (3.16)
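Equations 3.14 through 3.16 amount to the natural gradient update 3.11 with the diagonal of φ(y)y^T removed. A minimal sketch (function name ours; φ = tanh is an arbitrary choice):

```python
import numpy as np

def nonholonomic_step(W, x, eta=0.01, phi=np.tanh):
    """One step of rule 3.14-3.16:
    W <- W - eta * F(y) W, with f_ii = 0 and f_ij = phi(y_i) y_j for i != j."""
    y = W @ x
    F = np.outer(phi(y), y)
    np.fill_diagonal(F, 0.0)   # nonholonomic constraints dx_ii = 0 (eq. 3.13)
    return W - eta * F @ W
```

Because f_ii = 0, the increment ΔX = ΔW W^{-1} has a zero diagonal by construction; as shown in section 4, this means the update never moves W along the equivalence class C_W.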
The learning rule 3.14 is called the orthogonal nonholonomic natural gradient descent algorithm, for the reason shown in the next section. Algorithm 3.14 can be generalized as follows (Amari et al., 1997):

W(t + 1) = W(t) − η(t) F{y(t)} W(t),   (3.17)

with the entries of the matrix F(y) given by

f_ij = δ_ij λ_ii + a_1 y_i y_j + a_2 φ_i(y_i) y_j − a_3 y_i φ_j(y_j),   (3.18)

where δ_ij is the Kronecker delta, λ_ii = −a_1 y_i² − a_2 φ_i(y_i) y_i + a_3 y_i φ_i(y_i), and a_1, a_2, and a_3 are parameters that may be determined adaptively.

4 Nonholonomicity and Orthogonality
The space Gl(n) = {W′} is partitioned into equivalence classes C_W, where all W′ belonging to the same C_W are regarded as equivalent demixing matrices. Let dW_C be a direction tangent to C_W; that is, W + dW_C and W belong to the same equivalence class. The learning equation determines ΔW(t) depending on the current y and W. When ΔW(t) includes components in the tangent directions of C_W, such components are ineffective, because they merely drive W within the equivalence class. Therefore, it is better to design a learning rule whose increments ΔW(t) are always orthogonal to the equivalence classes. Since the C_W are n-dimensional subspaces, if we could find a family of (n² − n)-dimensional subspaces Q such that C_W and Q are orthogonal, we could impose the constraint that the learning trajectories belong to Q. One interesting question therefore arises: Is there a family of (n² − n)-dimensional submanifolds Q that are orthogonal to the equivalence classes C_W? The answer is no. We can prove that there does
not exist a submanifold that is orthogonal to the family of submanifolds C_W.

Theorem 1. The direction dW is orthogonal to C_W if and only if

dx_ii = 0,   i = 1, ..., n,   (4.1)
where dX = dW W^{-1}.

Proof. Since the equivalence class C_W consists of the matrices ΛW, Λ is regarded as a coordinate system in C_W. A small deviation of W within C_W is written as

dW_C = dΛ W,   (4.2)
where dΛ = diag(dλ_1, ..., dλ_n). These deviations span the tangent space of C_W. The inner product of dW and dW_C is given by

⟨dW, dW_C⟩_W = ⟨dW W^{-1}, dW_C W^{-1}⟩_I = ⟨dX, dΛ⟩_I = Σ_{i,j} dx_ij (dΛ)_ij = Σ_i dx_ii dλ_i.   (4.3)
Therefore, dW is orthogonal to C_W if and only if dx_ii = 0 for all i.

We next show that dX is not integrable; that is, there is no matrix function G(W) such that

dX = (∂G/∂W) ∘ dW,   (4.4)

where

(∂G/∂W) ∘ dW = Σ_{i,j} (∂G/∂w_ij) dw_ij.   (4.5)
If such a G existed, X = G(W) would define another coordinate system in Gl(n). Even though no such G exists, dX is well defined, and it forms a basis of the tangent space of Gl(n) at W. Such a basis is called a nonholonomic basis (Schouten, 1954; Frankel, 1997).

Theorem 2. The basis defined by dX = dW W^{-1} is nonholonomic.
Proof. Let us assume that there exists a function G(W) such that
dX = dG(W) = dW W^{-1}.   (4.6)

We now consider another small deviation, from W to W + δW. We then have

δdX = δdG = dW δ(W^{-1}).   (4.7)

We have

δ(W^{-1}) = (W + δW)^{-1} − W^{-1} ≈ −W^{-1} δW W^{-1}.   (4.8)

Hence,

δdX = −dW W^{-1} δW W^{-1}.   (4.9)

Since matrices are in general noncommutative, we have

δdX ≠ dδX.   (4.10)
This shows that no such matrix function exists, because δdG = dδG must always hold when a matrix function G exists.

Our constraints

dx_ii = 0,   i = 1, ..., n,   (4.11)
restrict the possible directions of ΔW and define (n² − n)-dimensional movable directions at each point W of Gl(n). These directions are orthogonal to C_W. However, by the same reasoning, there do not exist functions g_i(W) such that

dx_ii = dg_i(W) = Σ_{j,k} (∂g_i(W)/∂w_jk) dw_jk.   (4.12)
This implies that there exist no subspaces defined by g_i(W) = 0. Such constraints are said to be nonholonomic. The learning equation 3.14 with equation 3.15 defines a learning dynamics with nonholonomic constraints. At each point W, ΔW is constrained to (n² − n) directions. However, the trajectories are not confined to any (n² − n)-dimensional subspace; they can reach any point in Gl(n).
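The noncommutativity behind equations 4.9 and 4.10 is easy to observe numerically. The sketch below (random matrices and the seed are our choices) evaluates the two second-order cross terms of dX = dW W^{-1} and confirms that they differ:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
W = rng.normal(size=(n, n))
dW = rng.normal(size=(n, n))    # deviation "d"
eW = rng.normal(size=(n, n))    # deviation "delta"

Winv = np.linalg.inv(W)
# Second-order cross terms of dX = dW W^{-1} (eqs. 4.8-4.9):
delta_dX = -dW @ Winv @ eW @ Winv     # delta d X
d_deltaX = -eW @ Winv @ dW @ Winv     # d delta X
# Matrices do not commute, so the two differ (eq. 4.10):
assert not np.allclose(delta_dX, d_deltaX)
```

If a matrix function X = G(W) existed, these mixed second derivatives would have to coincide, which is exactly the contradiction used in the proof of theorem 2.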
This property is important when the amplitudes of the s_i(t) change over time. If the constraints are

E[h_i(y_i)] = 0,   (4.13)

for example,

E[y_i² − 1] = 0,   (4.14)
then when E[s_i²] suddenly becomes 10 times smaller, the learning dynamics with the soft constraints makes the ith row of W 10 times larger in order to compensate for this change and keep E[y_i²] = 1. Therefore, even when W converges to the true C_W, large fluctuations emerge from the ineffective movements caused by changes in the amplitudes of the s_i. This sometimes causes numerical instability. Our nonholonomic dynamics, on the other hand, is always orthogonal to C_W, so such ineffective fluctuations are suppressed. This is confirmed by computer simulations, a few of which are presented in section 6.

Remark 1. There is a great deal of research on nonholonomic dynamics. Classical analytical dynamics uses nonholonomic bases to analyze the spinning gyro (Whittaker, 1940); they were also used in relativity (Whittaker, 1940). Kron (1948) used nonholonomic constraints to present a general theory of the electromechanical dynamics of generators and motors. Nonholonomic properties play a fundamental role in the continuum mechanics of distributed dislocations (see, e.g., Amari, 1962). An excellent exposition is found in Brockett (1993), where the controllability of nonlinear dynamical systems is established by means of the related Lie algebra. Recently, robotics researchers have been eager to analyze dynamics with nonholonomic constraints (Suzuki & Nakamura, 1995).
Remark 2. Theorem 2 shows that the orthogonal natural gradient descent algorithm evolves along a trajectory that contains no redundant (useless) components in the directions of C_W. Therefore, the orthogonal algorithm is likely to be more efficient than other algorithms, as has been confirmed by preliminary computer simulations.
5 Stability Analysis
In this section, we discuss the local stability of the orthogonal natural gradient descent algorithm 3.14, defined by

ΔW(t) = −η(t) F{y(t)} W(t),   (5.1)
with f_ii = 0 and f_ij = φ_i(y_i) y_j for i ≠ j, in the vicinity of the desired equilibrium submanifold C_{A^{-1}}. Similar to a theorem established in Amari, Chen, and Cichocki (1997), we formulate the following theorem.

Theorem 3.
When the zero-mean independent source signals satisfy the conditions

E(φ_i(s_i) s_j) = 0,   if i ≠ j,   (5.2)

E(φ_i′(s_i) s_j s_k) = 0,   if k ≠ j,   (5.3)

and when the following inequalities hold,

E(φ_i′(s_i) s_j²) E(φ_j′(s_j) s_i²) > E(s_i φ_i(s_i)) E(s_j φ_j(s_j)),   i, j = 1, ..., n,   (5.4)
then the desired equivalence class C_W, W = A^{-1} P Λ, is a stable equilibrium. However, if conditions 5.2 and 5.3 are still satisfied but

E(φ_i′(s_i) s_j²) E(φ_j′(s_j) s_i²) < E(s_i φ_i(s_i)) E(s_j φ_j(s_j)),   (5.5)

then replacing f_ij = φ_i(y_i) y_j by

f_ij = y_i φ_j(y_j),   i ≠ j,   (5.6)

guarantees that the class C_W, W = A^{-1} P Λ, contains stable separating equilibria.

Proof. Let W = A^{-1}, let (dW)_i denote the ith row of dW, and let (W^{-1})_j be the jth column of W^{-1}. Then, taking into account that dx_ij = φ_i(y_i) y_j and dx_ii = 0, we evaluate the second differentials in the Hessian matrix as follows:
d²x_ij = E{d(φ_i(y_i) y_j)}
       = E{φ_i′(y_i) y_j dy_i} + E{φ_i(y_i) dy_j}
       = E{φ_i′(y_i) y_j (dW)_i (W^{-1} s)} + E{φ_i(y_i) (dW)_j (W^{-1} s)}
       = E{φ_i′(y_i) y_j (dW)_i (W^{-1})_j} + E{φ_i(y_i) y_i (dW)_j (W^{-1})_i}
       = −E{φ_i′(s_i) s_j²} dx_ij − E{φ_i(s_i) s_i} dx_ji.   (5.7)
Hence,

d²x_ij = −E{φ_i′(s_i) s_j²} dx_ij − E{φ_i(s_i) s_i} dx_ji,   (5.8)

d²x_ji = −E{φ_j(s_j) s_j} dx_ij − E{φ_j′(s_j) s_i²} dx_ji,   (5.9)

d²x_ii = 0.   (5.10)
By the assumptions made in theorem 3, we see that the equilibrium point W = A^{-1} is stable. The proof for W = A^{-1} P Λ is the same and is omitted here for brevity; the proof for the other cases is similar.

From the proof of theorem 3, we have the following corollary.

Corollary. Let us consider the special case in which all neurons have the same activation function, the cubic nonlinearity φ_i(y_i) = y_i³. In this case, if all the source signals are subgaussian (with negative kurtosis), then the learning algorithm 3.14, with f_ii = 0 and f_ij = y_i³ y_j for i ≠ j, ensures local stability of the desired equilibrium points W = A^{-1} P Λ. Analogously, if all the source signals are supergaussian (with positive kurtosis), then algorithm 3.14, with f_ii = 0 and f_ij = y_i y_j³ for i ≠ j, ensures that all the desired equilibrium points W = A^{-1} P Λ are stable separating equilibria.
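The corollary can be checked by Monte Carlo estimation of inequality 5.4 with the cubic nonlinearity. This is our illustration: uniform and laplacian samples are stand-ins for sub- and supergaussian sources, and independence is used to factor the mixed expectations:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200_000
phi = lambda s: s**3                  # cubic nonlinearity
dphi = lambda s: 3 * s**2             # its derivative

def stable_54(si, sj):
    """Monte Carlo check of inequality 5.4 for one pair of sources.
    Independence lets E(phi'(s_i) s_j^2) factor into E(phi'(s_i)) E(s_j^2)."""
    lhs = (np.mean(dphi(si)) * np.mean(sj**2)) * (np.mean(dphi(sj)) * np.mean(si**2))
    rhs = np.mean(si * phi(si)) * np.mean(sj * phi(sj))
    return lhs > rhs

sub = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(2, N))  # subgaussian, unit variance
sup = rng.laplace(scale=1 / np.sqrt(2.0), size=(2, N))       # supergaussian, unit variance
assert stable_54(sub[0], sub[1])       # cubic phi stabilizes subgaussian sources
assert not stable_54(sup[0], sup[1])   # and violates 5.4 for supergaussian ones
```

For unit-variance uniform sources, E(s⁴) = 1.8, so the left side is about 9 against 3.24 on the right; for unit-variance laplacian sources, E(s⁴) = 6, reversing the inequality, in agreement with the corollary.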
6 Computer Simulations
In our computer experiments, we have assumed that the source signals have generalized gaussian distributions, so that the corresponding nonlinear activation functions have the form φ(y_i) = |y_i|^{r_i − 1} sgn(y_i) for r_i ≥ 1 and φ(y_i) = y_i / (|y_i|^{2 − r_i} + ε) for 0 ≤ r_i < 1, where ε is a small positive constant added to prevent singularity of the nonlinear function (Cichocki, Bogner, Moszczynski, & Pope, 1997; Cichocki, Sabala, & Amari, 1998). For spiky supergaussian sources, the optimal parameter r_i lies between zero and one. In the illustrative examples that follow, we have taken r_i = 0.5 and a fixed learning rate η = 0.001. Preliminary computer simulation experiments have confirmed the high performance of the proposed algorithm. We give two illustrative examples to show the efficiency of the orthogonal natural gradient adaptive algorithm and to verify the validity of our theorems. We also show that the orthogonal natural gradient descent algorithm is usually more efficient than soft-constraint natural gradient descent algorithms for nonstationary
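The generalized gaussian activation just described can be sketched as follows (the function name is ours, and ε = 0.01 is an arbitrary small constant of the kind the text mentions):

```python
import numpy as np

def phi_gg(y, r, eps=0.01):
    """Activation for generalized gaussian sources (section 6):
    |y|^(r-1) sgn(y) for r >= 1;  y / (|y|^(2-r) + eps) for 0 <= r < 1."""
    y = np.asarray(y, dtype=float)
    if r >= 1:
        return np.abs(y) ** (r - 1) * np.sign(y)
    return y / (np.abs(y) ** (2 - r) + eps)
```

For r = 2 this reduces to the linear (gaussian) score φ(y) = y; with r = 0.5, as used in the examples, φ is bounded near the origin, which suits spiky supergaussian sources.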
Figure 1: Source and sensor signals for example 1. (a) Original source signals. (b) Mixed (sensor) signals.
signals, especially when the number of sources changes dynamically in time.

6.1 Example 1. To show the effectiveness of our algorithm, we assume that the number of sources is unknown (but not greater than the number of sensors). Three natural speech signals, shown in Figure 1a, are mixed into four sensors by an ill-conditioned 4 × 3 mixing matrix. It is assumed that the mixing matrix and the number of sources are unknown. It should be noted that the sensor signals look almost identical (see Figure 1b). The recovered signals are shown in Figure 2a.
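A scaled-down version of this experiment can be sketched as follows. This is our toy setup, not the paper's exact simulation: a random 4 × 3 mixing matrix, laplacian stand-ins for the speech sources, and φ = tanh; it only illustrates the mechanics of rule 3.14 in the rank-deficient case:

```python
import numpy as np

rng = np.random.default_rng(3)
n_src, n_sen, T = 3, 4, 10000
s = rng.laplace(size=(n_src, T))      # laplacian stand-ins for speech sources
A = rng.normal(size=(n_sen, n_src))   # unknown 4 x 3 mixing matrix
x = A @ s                             # four sensor signals

W, eta, phi = np.eye(n_sen), 0.003, np.tanh
for t in range(T):                    # on-line nonholonomic rule 3.14-3.16
    y = W @ x[:, t]
    F = np.outer(phi(y), y)
    np.fill_diagonal(F, 0.0)          # f_ii = 0
    W -= eta * F @ W

# The text reports that rule 3.14 keeps the weights bounded in this
# rank-deficient, noiseless setting, while the standard rule 3.11 diverges.
assert np.isfinite(W).all()
```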
Figure 2: Separated signals for example 1. (a) Separated source signals for the zero-noise case, using the ONGD algorithm. (b) Separated signals using the standard algorithm, 3.11.
After 5000 iterations, the algorithm was able to estimate not only the waveforms of the original speech signals but also their number: one output signal quickly decays to zero. For comparison, we applied the standard independent component analysis (ICA) learning algorithm, 3.11, to the same noiseless data. The standard algorithm also recovered the original three signals well, but it creates a false fourth signal, as shown in Figure 2b. Moreover, the norm of its separating matrix W increases exponentially, as illustrated in Figure 3a. This problem can be circumvented by first computing the rank of the covariance matrix of the sensor signals and then retaining only that number of sensors, provided noise
Figure 3: Convergence behavior of the Frobenius norm of the separating matrix W as a function of the number of iterations for (a) the standard algorithm, 3.11, in the noiseless case, (b) the standard algorithm, 3.11, with 1% additive gaussian noise, and (c) the orthogonal natural gradient descent algorithm, 3.14.
is negligible. However, such an approach is not very practical in a nonstationary environment, where the number of sources changes continuously in time. When a small amount (1%) of additive noise is added to the sensors, the standard algorithm can be stabilized, as shown in Figure 3b. However, the fluctuation of the norm of the separating matrix W is larger than for the orthogonal natural gradient descent algorithm (compare Figures 3b and 3c).

6.2 Example 2. In this simulation, we separated seven highly nonstationary natural speech signals, shown in Figure 4a, using seven sensors (microphones), with two of the sources temporarily zero (silence periods). It is again assumed that the number of active source signals and the randomly chosen mixing matrix A are completely unknown. The new learning algorithm, 3.14, was able to separate all the signals, as illustrated in Figure 4c. Again we found by simulation that the standard learning algorithm, 3.11, exhibits larger fluctuations of the synaptic weights of the separating matrix. Moreover, the absolute values of the weights increase exponentially to infinity in the noiseless case, so we are not able to compare the performance of the two algorithms: the standard ICA algorithm, 3.11, is fundamentally unstable in the noiseless case (without computing the rank of the covariance matrix of the sensor signals). Even for small noise, the weights of the separating matrix take very large values and are unstable with the standard ICA method.

7 Conclusion
We have pointed out several drawbacks of existing adaptive algorithms for blind separation of signals: for example, the problem of separating nonstationary signals and the problem of stability when some sources decay to zero. To avoid these problems, we developed a novel orthogonal natural gradient descent algorithm with the equivariant property. We have demonstrated by computer simulation experiments that the proposed algorithm ensures good convergence speed even for ill-conditioned problems. The algorithm solves the problem of on-line estimation of the waveforms of the source signals even when the number of sensors is larger than the (unknown) number of sources. Moreover, the algorithm works properly and precisely for highly nonstationary signals such as speech and biomedical signals. It is proved theoretically that the trajectories governed by the algorithm contain no redundant (useless) components, so the algorithm is potentially more efficient than algorithms with soft or rigid constraints. The main features of the algorithm are its orthogonality and the fact that it converges to an equivalence class of separating matrices rather than to isolated critical points. Local stability conditions for the developed algorithm are also given.
Figure 4: Computer simulation for example 2. (a) Original active speech signals. (b) Mixed (microphone) signals. (c) Separated (estimated) source signals.
References

Amari, S. (1962). On some primary structures of non-Riemannian elasticity theory. In K. Kondo (Ed.), RAAG memoirs of the unifying study of basic problems in engineering and physical sciences (Vol. 3, pp. 99–108). Tokyo: Gakujutsu Bunken Fukyu-kai.
Amari, S. (1967). A theory of adaptive pattern classifiers. IEEE Trans. on Electronic Computers, 16, 299–307.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S., & Cardoso, J.-F. (1997). Blind source separation—semiparametric statistical approach. IEEE Trans. on Signal Processing, 45(11), 2692–2700.
Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10(8), 1345–1351.
Amari, S., & Cichocki, A. (1998). Blind signal processing—neural network approaches. Proceedings of the IEEE, special issue on blind identification and estimation, 86, 2026–2048.
Amari, S., Cichocki, A., & Yang, H. H. (1995). Recurrent neural networks for blind separation of sources. In Proceedings of the International Symposium on Nonlinear Theory and Its Applications (NOLTA-95) (pp. 37–42).
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 757–763). Cambridge, MA: MIT Press.
Amari, S., Douglas, S. C., & Cichocki, A. (1998). Multichannel blind deconvolution and equalization using the natural gradient. Manuscript submitted for publication.
Amari, S., & Kawanabe, M. (1997a). Information geometry of estimating functions in semiparametric statistical models. Bernoulli, 3, 29–54.
Amari, S., & Kawanabe, M. (1997b). Estimating functions in semiparametric statistical models. In I. V. Basawa, V. P. Godambe, & R. L. Taylor (Eds.), Estimating functions (Vol. 3, pp. 65–81).
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Bickel, P. J., Klaassen, C. A. J., Ritov, Y., & Wellner, J. A. (1993). Efficient and adaptive estimation for semiparametric models. Baltimore, MD: Johns Hopkins University Press.
Brockett, R. W. (1993). Differential geometry and the design of gradient algorithms. In R. E. Green & S. T. Yau (Eds.), Differential geometry: Partial differential equations on manifolds (pp. 69–92). Providence, RI: American Mathematical Society.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.
Chen, T.-P., & Chen, R. (1994). Blind extraction of stochastic and deterministic signals by a neural network approach. In Proc. of the 28th Asilomar Conference on Signals, Systems and Computers (pp. 892–896).
Cichocki, A., Amari, S., Adachi, S., & Kasprzak, W. (1996). Self-adaptive neural networks for blind separation of sources. In Proceedings of the Int. Symp. on Circuits and Systems (ISCAS-96) (Vol. 2, pp. 157–161).
Cichocki, A., Bogner, R. E., Moszczynski, L., & Pope, K. (1997). Modified Hérault–Jutten algorithms for blind separation of sources. Digital Signal Processing, 7(2), 80–93.
Cichocki, A., Sabala, I., & Amari, S. (1998). Intelligent neural networks for blind signal separation with unknown number of sources. In Proc. of the Conf. on Engineering of Intelligent Systems (ESI-98) (pp. 148–154).
Cichocki, A., Sabala, I., Choi, S., Orsier, B., & Szupiluk, R. (1997). Self-adaptive independent component analysis for sub-gaussian and super-gaussian mixtures with unknown number of sources and additive noise. In Proceedings of the 1997 International Symposium on Nonlinear Theory and Its Applications (NOLTA-97) (Vol. 2, pp. 731–734).
Cichocki, A., & Unbehauen, R. (1996). Robust neural networks with on-line learning for blind identification and blind separation of sources. IEEE Transactions on Circuits and Systems—I: Fundamental Theory and Applications, 43, 894–906.
Cichocki, A., Unbehauen, R., Moszczynski, L., & Rummert, E. (1994). A new on-line adaptive algorithm for blind separation of source signals. In Proc. of the 1994 Int. Symposium on Artificial Neural Networks (ISANN-94) (pp. 406–411).
Cichocki, A., Unbehauen, R., & Rummert, E. (1994). Robust learning algorithm for blind separation of signals. Electronics Letters, 30(17), 1386–1387.
Comon, P. (1994). Independent component analysis, a new concept? Signal Processing, 36, 287–314.
Douglas, S. C., Cichocki, A., & Amari, S. (1997). Multichannel blind separation and deconvolution of sources with arbitrary distributions. In Proc. of the IEEE Workshop on Neural Networks for Signal Processing (pp. 436–444). New York: IEEE Press.
Frankel, T. (1997). The geometry of physics. Cambridge: Cambridge University Press.
Girolami, M., & Fyfe, C. (1997a). Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition. IEE Proceedings on Vision, Image and Signal Processing, 14, 299–306.
Girolami, M., & Fyfe, C. (1997b). Independence is far from normal. In Proceedings of the European Symposium on Artificial Neural Networks (pp. 297–302).
Hérault, J., & Jutten, C. (1986). Space or time adaptive signal processing by neural network models. In J. S. Denker (Ed.), Neural networks for computing: Proceedings of the AIP Conference (pp. 206–211). New York: American Institute of Physics.
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–20.
Kron, G. E. (1948). Geometrical theory of electrodynamics. New York: Wiley.
Matsuoka, K., Ohya, M., & Kawamoto, M. (1995). A neural net for blind separation of nonstationary signals. Neural Networks, 8(3), 411–419.
Moreau, E., & Macchi, O. (1995). High order contrasts for self-adaptive source separation. International Journal of Adaptive Control and Signal Processing, 10(1), 19–46.
Oja, E., & Karhunen, J. (1995). Signal separation by nonlinear Hebbian learning. In M. Palaniswami et al. (Eds.), Computational intelligence—a dynamic system perspective (pp. 83–97). New York: IEEE Press.
Schouten, J. A. (1954). Ricci calculus. Berlin: Springer-Verlag.
Sorouchyari, E. (1991). Blind separation of sources, part III: Stability analysis. Signal Processing, 24, 21–30.
Suzuki, T., & Nakamura, Y. (1995). Spiral motion of nonholonomic space robots. Journal of the Robotics Society of Japan, 13, 1020–1029.
Whittaker, E. T. (1940). Analytical dynamics. Proc. Roy. Soc., Edinburgh.

Received September 15, 1997; accepted June 4, 1998.
Neural Computation, Volume 12, Number 7
CONTENTS Article
The Minimal Local-Asperity Hypothesis of Early Retinal Lateral Inhibition Rosario M. Balboa and Norberto M. Grzywacz
1485
Note
Multidimensional Encoding Strategy of Spiking Neurons Christian W. Eurich and Stefan D. Wilke
1519
Letters
Synergy in a Neural Code Naama Brenner, Steven P. Strong, Roland Koberle, William Bialek, and Rob R. de Ruyter van Steveninck
1531
Increased Synchrony with Increase of a Low-Threshold Calcium Conductance in a Model Thalamic Network: A Phase-Shift Mechanism Elizabeth Thomas and Thierry Grisar
1553
Multispikes and Synchronization in a Large Neural Network with Temporal Delays Jan Karbowski and Nancy Kopell
1573
Synchrony in Heterogeneous Networks of Spiking Neurons L. Neltner, D. Hansel, G. Mato, and C. Meunier
1607
Dynamics of Spiking Neurons with Electrical Coupling Carson C. Chow and Nancy Kopell
1643
A Model for Fast Analog Computation Based on Unreliable Synapses Wolfgang Maass and Thomas Natschläger
1679
Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces Aapo Hyvärinen and Patrik Hoyer
1705
Tilt Aftereffects in a Self-Organizing Model of the Primary Visual Cortex James A. Bednar and Risto Miikkulainen
1721
Errata
1741
ARTICLE
Communicated by J. van Hateren
The Minimal Local-Asperity Hypothesis of Early Retinal Lateral Inhibition Rosario M. Balboa and Norberto M. Grzywacz Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115-1813, U.S.A. Recently we found that the theories related to information theory existent in the literature cannot explain the behavior of the extent of the lateral inhibition mediated by retinal horizontal cells as a function of background light intensity. These theories can explain the fall of the extent from intermediate to high intensities, but not its rise from dim to intermediate intensities. We propose an alternate hypothesis that accounts for the extent's bell-shape behavior. This hypothesis proposes that the lateral-inhibition adaptation in the early retina is part of a system to extract several image attributes, such as occlusion borders and contrast. To do so, this system would use prior probabilistic knowledge about the biological processing and relevant statistics in natural images. A key novel statistic used here is the probability of the presence of an occlusion border as a function of local contrast. Using this probabilistic knowledge, the retina would optimize the spatial profile of lateral inhibition to minimize attribute-extraction error. The two significant errors that this minimization process must reduce are due to the quantal noise in photoreceptors and the straddling of occlusion borders by lateral inhibition. 1 Introduction
Theories for lateral inhibition existent in the literature that are based on information-theory-like ideas1 are not consistent with the behavior of lateral inhibition in the retina's outer plexiform layer (OPL) (Balboa & Grzywacz, 2000). These theories can only explain qualitatively the fall of the lateral-inhibition extent as a function of intensity beginning at intermediate intensities (Baldridge & Ball, 1991; Myhr, Dong, & McReynolds, 1994; Lankheet, Rowe, Van Wezel, & van der Grind, 1996). To account for this intensity behavior, three of these theories use second-order statistics (autocorrelation or power spectrum) in natural images (Srinivasan, Laughlin, & Dubs, 1982; 1 We call these theories information-theory-like because, although they essentially deal with signal-to-noise ratio, as does conventional information theory, they do not quantify information per se, as that theory does.
Neural Computation 12, 1485–1517 (2000) © 2000 Massachusetts Institute of Technology
Atick & Redlich, 1992; McCarthy & Owen, 1996), while the fourth (Field, 1994) uses a fourth-order statistic (the image kurtosis). These theories explain this behavior, because as intensity decreases, the relative noise due to random photon absorption increases. Hence, the theories must increase the lateral-inhibition extent to average out the noise. No one can rule out that some of the computations performed in the OPL are related to the statistics used by these theories. However, there is something missing to explain the increase of the extent of the OPL's lateral inhibition with intensity at low intensities (Mangel & Dowling, 1985; Baldridge, 1993; Jamieson, 1994; Xin, Bloomfield, & Persky, 1994). The complete behavior from low to high intensities was only recently published (Xin & Bloomfield, 1999); Figure 1 shows this behavior. (Previous experiments were not sufficiently broad to cover carefully the full range of intensities, eliminating artifacts due to adaptation to the test stimuli.) The upper panel shows the electrophysiological measurements of the horizontal cells' receptive-field sizes for two different cell types in the rabbit's retina. The lower panel shows the extent of the neurobiotin coupling for the same horizontal-cell classes. In both panels, the extent is small at low intensities, grows with intensity up to a maximum, and then falls as the intensity increases further. This same intensity behavior was also observed in goldfish (Jamieson, Baldridge, & Ball, 1994). From these results, one can see that different experiments in different species lead to the same bell-shape behavior for horizontal cells and, thus quite likely, for the lateral inhibition that they mediate. Moreover, given that it was observed in different types of horizontal cells, one can say that this is a general behavior, which is not explained by previous theories. In this work, we propose a new information-theory-like hypothesis for the role of the OPL's lateral inhibition.
We focus on the OPL, because it is the first stage of visual processing and therefore must be dealing with general aspects of natural images. The new hypothesis will be shown to be consistent with the OPL's intensity behavior. Relatedly, the hypothesis can account for the disappearance of lateral inhibition at extremely low intensities (Barlow, Fitzhugh, & Kuffler, 1957; Bowling, 1980) and the shrinkage of its extent for stimuli that are not spatially homogeneous (Reifsnider & Tranchina, 1995; Kamermans, Haak, Habraken, & Spekreijse, 1996). Moreover, the hypothesis explains the inhibition's division-like mechanism of action (Merwine, Amthor, & Grzywacz, 1995; Verweij, Kamermans, & Spekreijse, 1996) and linearity under low contrasts (Tranchina, Gordon, Shapley, & Toyoda, 1981). Finally, the hypothesis is consistent with the psychophysics of Stevens's power law of intensity perception (Stevens, 1970) and the asymmetry of positive and negative Mach bands at the two sides of an intensity border (Fiorentini & Radici, 1957; Ratliff, 1965). The hypothesis and its predictions have already appeared in abstract form (Balboa & Grzywacz, 1998).
Figure 1: (A) Electrophysiological receptive-field-extent measurements for two different types of horizontal cells in the rabbit retina: type A (HA; empty circles) and type B (HB; filled circles). When dark adapted (−∞), the receptive-field extent is small. It then rises with background intensity up to a maximum and falls again as the intensity continues to increase. (B) Tracer-coupling extent with neurobiotin in the same cell types. The coupling has the same behavior as the receptive fields in A. (Data from Xin & Bloomfield, 1999.)
2 Hypothesis Rationale 2.1 Philosophy. There is a major philosophical difference between the new hypothesis and some of the theories in the literature (Atick & Redlich, 1992; McCarthy & Owen, 1996). Past theories prescribe extracting the largest
amount possible of nonredundant signal. The new hypothesis prescribes extracting as well as possible the signal that is relevant for the animal. An example of what we mean by "relevant" is provided by the work of Kamermans and colleagues with the goldfish's OPL (Kamermans, Kraaij, & Spekreijse, 1998). If it were to extract all the color information from the scene, then it should obtain information about the color of the objects and illuminant. However, the goldfish's OPL appears to discard, or at least pay much less attention to, the illuminant's color. Why would the visual system play down some types of information so early in the processing? We propose two answers to this question. First, by devoting the available output channels of information to fewer types of signal, larger quantities of these signals can be transmitted. Second, if one assumes that the central nervous system has limited capacity, then the system might be better off dealing with fewer types of signal than by dividing its energy among many types, some of which may not be relevant for the animal's survival. What types of signal should lateral inhibition play down or stress? The most obvious type is illumination. For example, the predictive coding theory (Srinivasan et al., 1982) proposes to reduce information about the intensity of illumination of points whose neighbors have the same intensity. The goal is to fit the wide range of intensities in natural scenes into the narrow dynamic range of the second-order neuron, for instance, the bipolar cell. However, the retina must achieve this goal before lateral inhibition, since the photoreceptor itself has a narrow dynamic range (Fain & Dowling, 1973; Baylor, Lamb, & Yau, 1979). Hence, unlike the predictive coding theory, we suggest that the main goal of lateral inhibition is not codification of intensity.
Rather, we propose that the nonchromatic goals of the OPL's lateral inhibition are to aid in the optimization of border detection (Shapley & Tolhurst, 1973), and contrast (Campbell & Robson, 1968) and intensity (Stevens, 1970) estimation. Contrast would be particularly important at borders, and intensity would be important away from them. Border position is an important piece of information, because it is from it that the visual system would characterize objects. As such, lateral inhibition is the first step of image segmentation, by being the first step of edge detection. Because we emphasize object segmentation, we place particular importance on occluding edges. However, this does not mean that we think that the retina discriminates different types of edges (such as occlusions and shadows). Edge discrimination requires relatively high-level cues (Adelson, 1993; Wishart, Frisby, & Buckley, 1997) beyond the scope of the retina. All that we are proposing is that the retina contributes directly to border extraction, with the retinal circuitries designed specifically to deal with the statistics of occluding borders. The idea of lateral inhibition being important for border extraction is not new (Ratliff, 1965). What is new here is quantifying the cost of missing borders and specifying the information obtained from them and away from them.
2.2 Lateral Inhibition and Image-Attribute Extraction. Figure 2A illustrates the structure of the new hypothesis for lateral inhibition as a process in the extraction of image attributes. From the stimulus (I(r), where r is position), the retina wishes to estimate a few attributes (the functions H1, H2, . . .), particularly the position and contrast of edges and the intensity outside edges. To do so, it first transduces the stimulus through a noisy process, getting the photoreceptor responses (T(r)). Then lateral inhibition modulates these responses, giving rise to bipolar responses (B(r)). Consider the spatial layout of these responses across the population of the bipolar cells. Wherever there are edges in the image, there are fluctuations in the layout caused by the lateral inhibition—the so-called Mach bands (Ratliff,
1965). The contrast of an edge and the amplitude of its Mach band (or largest spatial gradient) are linearly related. Because the probability that there is an edge due to occlusions at a point of the image increases with the contrast at that point (Balboa & Grzywacz, 1999c), the spatial bipolar gradient at any given point codes the probability that there is an edge there. Therefore, for a point in the image with a high spatial gradient across bipolar cells, the position of these cells indicates the position of a likely edge, and the gradient is proportional to the presumed edge's contrast. For points with low gradient, the bipolar cells would code an amplitude-compressed representation of T(r). We propose that the visual system implements recovery functions (R1(B(r)), R2(B(r)), . . .) to go from the bipolar responses to estimates (H̃1, H̃2, . . .) of the image attributes. For each attribute, the visual system computes the expected error (E1, E2, . . .) by taking into account the prior probability of the attribute in natural images (P(H1), P(H2), . . .), the probabilistic model relating the attribute to the estimate (P(H̃1 | H1), P(H̃2 | H2), . . .), and Bayesian theory. These models and prior probabilities would come from learning during development or evolution. The models amount to knowledge about the biology linking the stimulus to the estimates of the attributes. The prior probabilities amount to knowledge of the attributes in natural images. The new hypothesis postulates that the profile of lateral inhibition changes to minimize the total error, that is,

E = C1 E1 + C2 E2 + · · · ,    (2.1)
Figure 2: Facing page. Rationale of the minimal local-asperity hypothesis. (A) Schematics of the hypothesis. Images are transduced and transformed to bipolar cell output with the participation of lateral inhibition. Several recovery functions (R1, R2, . . .) implemented neurally transform the bipolar output to a new signal (H̃1, H̃2, . . .), which represents estimates of desired attributes from the image (H1, H2, . . .). The system then computes the estimation error by using prior knowledge about images and the underlying biology. A signal representing the error feeds back to the OPL to modulate lateral inhibition. (B) Schematics of stimuli and OPL. The dashed line represents the intensities of objects in the world, while the solid line adds the quantal noise to these intensities. Photoreceptors (whose responses are the solid lines) and horizontal and bipolar cells are represented by the shapes labeled P, H, and B, respectively. The figure uses three labeled sites to illustrate three cases of the relationship between these cells and the stimulus. In site 1, the lateral inhibition mediated by the horizontal cells is confined to one object, helping to compress its intensity to the neural signal. This signal suffers from the quantal-noise error. In site 2, the local circuit straddles one border, producing a Mach band signal, which helps to code the border's location and contrast. Finally, in site 3, lateral inhibition straddles more than one border, leading to an erroneous border-detection process (the local-asperity error).
where the Ci's are constants, which weigh the cost associated with each error. They are inversely related to "fitness" in population genetics2 (Ridley, 1993) and directly related to the "loss function" in economics (Berger, 1985) and computer science (Gallant, 1995). Figure 2B is a one-dimensional (1D) scheme, which illustrates what the main errors in the estimation of the desired visual attributes are. This scheme shows the intensity levels of five 1D objects (dashed line), with the photon-absorption noise superimposed (solid line). Lateral inhibition mediated by horizontal cells (H) acts directly on the photoreceptors (P), modulating their intensity estimate before it reaches the bipolar cell (B). For reasons of clarity, we show only three nonoverlapping sites of lateral inhibition to describe the dominant errors. Depending on the site, what would the response of the bipolar cell be? In site 1, the bipolar response would provide an estimate of the intensity inside the object. The only error in this site would be due to the photon-absorption noise. In site 2, the bipolar cell and its neighbors would be coding something like the position of a border and its contrast. Again, the main source of error for contrast estimation is the photon-absorption noise. In site 3, the borders of the small object might be altogether missed, since the bipolar cell would be receiving inputs related to three different objects, corrupting its signals. Consequently, there are two dominant sources of errors: the photon-absorption noise and the straddling of multiple borders by lateral inhibition. We call the errors due to the first source the quantal-noise errors and the error due to the second source the local-asperity error. (Asperity means "roughness of a surface." If one thinks of images as three-dimensional representations of the intensities at each point, then an asperous image would have a harsh profile.) 2.3 Lateral-Inhibition Extent.

2 Population genetics quantifies the evolution of a population of certain genes, starting from their initial frequencies, and how fit they are for reproduction and survival. In our case, the measure of "fitness" is related to the contribution of visual behavior to these genes' success.
The spatial extent of the inhibitory filter is designed to minimize the error in equation 2.1 given a model like that in Figure 2A. Mathematical analysis confirms that the two dominant sources of errors are those illustrated in Figure 2B (see section 3.2). Theories in the literature all take into account the quantal-noise error (Rushton, 1961). Its process is well known, and the probability of photon absorption follows a Poisson distribution (Fuortes & Yeandle, 1964; Baylor et al., 1979; Grzywacz, Hillman, & Knight, 1988). We estimate the error of this process (section 3) as a function of intensity and the extent of the lateral-inhibition filter. Reducing the intensity produces a relatively noisier signal. To compensate for the error being larger, the system should average the signal over a wider sample of photoreceptor responses and therefore increase the
extent of the lateral-inhibition filter. The problem with such an average is that more borders would be straddled, causing the local-asperity error. Hence, ideally, to minimize the local-asperity error, the retina would have to use narrow lateral-inhibition extents. It follows that the quantal-noise and local-asperity errors impose opposite demands on lateral inhibition. To alleviate this problem, the minimal local-asperity hypothesis proposes that the retina chooses the extent of lateral inhibition to minimize a weighted sum of these errors (see equation 2.1), that is, that the retina reaches a compromise. 2.4 Lateral Inhibition's Mechanism. We chose a division-like mechanism of lateral inhibition to implement the hypothesis. In other words, the inhibition corresponds not to the subtraction of the signal from the surround (as in a difference-of-gaussians model; Rodieck & Stone, 1965) but rather to a division (reduction of gain) of the center's signal. We chose this form of inhibition for three reasons. First, it would be useful to maintain intensity information for regions away from borders. A division-like inhibition ensures the maintenance of this information, whereas a subtractive inhibition may not (Srinivasan et al., 1982). Second, mathematical analyses of subtractive and division-like models (Grzywacz & Balboa, 1999) reveal that the latter but not the former allows the extraction of contrast locally. In the linear model, a bipolar cell's signals confound contrast, mean intensity, and amplitude of inhibition. The only way to disambiguate these signals is to obtain global information about the mean intensity. Third, biophysical evidence points out that the OPL's lateral inhibition works through a Ca2+-dependent Cl− conductance at the photoreceptors' synapse (Verweij, Kamermans, & Spekreijse, 1996). Such a conductance operates in a division-like manner. 3 Mathematical Formulation 3.1 Lateral Inhibition's Mechanism.
Let the intensity-dependent input and output of the photoreceptor's synapse at position (i, j) be T′i,j and Bi,j, respectively. One can think of T′i,j as the current generated by transduction and Bi,j as the voltage at the bipolar cell or presynaptic site. (The variable Bi,j can be pre- or postsynaptic, since we assume here for simplicity that the synapse is linear, but our arguments do not depend on that.) The output is
Bi,j = T′i,j / (g′r + g′I),    (3.1)
where g′r > 0 is inversely proportional to the resting gain of the photoreceptor's synapse (the resting conductance) and g′I is inversely proportional to the gain change due to lateral inhibition (varying conductance). (For the justification of using a division-like inhibition, see the text following
equation 2.1.) Next, we simplify this equation by dividing the numerator and denominator by g′r, and defining Ti,j = T′i,j/g′r and gI = g′I/g′r. We propose to set gI = (hi,j ∗ Bi,j)^n, where ∗ stands for convolution, hi,j ≥ 0 is the lateral-inhibition filter, and n ≥ 1 is a constant. From this proposition and equation 3.1, we have
Bi,j = Ti,j / (1 + (hi,j ∗ Bi,j)^n).    (3.2)
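As a concrete illustration (ours, not the authors'), equation 3.2 can be solved numerically by fixed-point iteration. The sketch below assumes a 1D photoreceptor array, a normalized box filter for h, and n = 1; all of these choices are ours:

```python
import numpy as np

def bipolar_response(T, h, n=1, iters=60):
    """Solve B = T / (1 + (h * B)^n) (equation 3.2) by fixed-point iteration.

    T     : transduced photoreceptor input, 1D nonnegative array
    h     : lateral-inhibition filter, nonnegative weights
    n     : cooperativity exponent, n >= 1
    iters : number of fixed-point sweeps
    """
    B = T.astype(float).copy()
    for _ in range(iters):
        pooled = np.convolve(B, h, mode="same")  # linear pooling h * B
        B = T / (1.0 + pooled ** n)              # division-like inhibition
    return B

# For a spatially uniform input T0, with sum(h) = 1 and n = 1, the interior
# steady state solves B0 = T0 / (1 + B0), i.e. B0 = (-1 + sqrt(1 + 4*T0)) / 2.
T = np.full(101, 2.0)
h = np.ones(5) / 5.0
B = bipolar_response(T, h)   # interior values approach B0 = 1 for T0 = 2
```

Consistent with the division-like mechanism argued for in section 2.4, this scheme compresses uniform regions (site 1 of Figure 2B) while leaving gradients at borders.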
That this equation uses a convolution means that lateral inhibition pools the photoreceptor outputs linearly. It is also imposed, for simplicity and because there is no evidence against it, that if i² + j² = k² + l², then hi,j = hk,l, that is, the inhibitory filter is isotropic. Finally, that equation 3.2 raises hi,j ∗ Bi,j to the power of n could mean the cooperation of n molecules of inhibitory transmitter or, nonexclusively, nonlinearities in the calcium or chloride conductances. 3.2 Lateral Inhibition's Extent. The goal is to find the lateral-inhibition filter that minimizes the error in equation 2.1 in a model like that in Figure 2A. Elsewhere (Grzywacz & Balboa, 1999), we performed a mathematical analysis to estimate this error. The results of the analysis are similar to the intuition provided in Figure 2B, which shows that formulas for errors similar to those presented below are expected from models like that in Figure 2A no matter what approximations one uses. Briefly, the analysis used the following mathematical procedures: Errors were typically quantified as spatial integrals of the square of the relative differences between desired attributes and their estimates. We used a Bayesian formulation to weigh the various possible images and the estimated attributes from these images. For this purpose, it was assumed that the attribute estimation from a given image was corrupted by Poisson noise. We then linearized the errors with Taylor expansions around the mean intensity, assuming that the images did not have intensities that were too low (relatively little noise) and contrasts that were too high (Ruderman & Bialek, 1994; Vu, McCarthy, & Owen, 1997; Zhu & Mumford, 1997; Balboa & Grzywacz, 1999b). Finally, it was assumed that the width of the inhibitory filter is much narrower than the mean size of objects in the images, making simultaneous crossing of two borders when centering the filter on a point rare but not impossible.
This was a simplifying assumption that was probably at best rough for "cluttered" images (Zhu & Mumford, 1998). However, this assumption probably held for most images, since the receptive fields of retinal cells are typically narrow. This assumption allowed approximating the number of local-asperity errors (multiple-border crossings) as a Poisson distribution (large number of positions and low probability of crossings). After this analysis, a good
approximation for the error (see equation 2.1) was

E = EQ + EA,    (3.3)
where EQ and EA are the quantal-noise and local-asperity errors, respectively. (Compared to equation 2.1, equation 3.3 arbitrarily left out the Ci coefficients, because EQ and EA have free, multiplicative gain parameters.) The analysis (Grzywacz & Balboa, 1999) also revealed that the quantal-noise error is to a good approximation
EQ = QQ ( Σi,j ĥ²i,j / ⟨Ii,j⟩^(1/2) )^(1/2),    (3.4)
where QQ > 0 is a free parameter of the hypothesis, ĥi,j = hi,j / Σi,j hi,j is the normalized filter, and ⟨Ii,j⟩ is the mean input to the photoreceptor's synapse from the photoreceptor's outer segment averaged over all positions (i, j). Although we do not derive this equation here, understanding its interpretation will help clarify its appropriateness. The equation's parameter QQ indicates the noisiness of the photoreceptors. The quantal noisiness varies with the amplitude and time course of the single-photon response (Fuortes & Yeandle, 1964; Lillywhite, 1977; Baylor et al., 1979; Grzywacz et al., 1988), and thus QQ has a wide range of values in nature. Expectedly, as the noisiness increases, the quantal-noise error (EQ) becomes larger. Also as expected, the quantal-noise error falls with the square root of the input. This is because the quantal-noise error is a relative error, and the noise (standard deviation)-to-signal (mean) ratio in a Poisson process is inversely proportional to the square root of the signal (Burington, 1973). Finally, the error in equation 3.4 can be reduced by changing ĥi,j. As ĥi,j becomes more spatially extended, its values become smaller, since Σi,j ĥi,j = 1 and ĥi,j ≤ 1 (since hi,j ≥ 0). This causes ĥ²i,j to fall rapidly, reducing Σi,j ĥ²i,j and thus EQ. In other words, the larger the spatial extent of the inhibitory filter (hi,j) is, the smaller the quantal-noise error is. We also analyzed the local-asperity error (EA in equation 3.3; Grzywacz & Balboa, 1999). Because this error depends on the size of the inhibitory filter, one needs to estimate its width. One example of such an estimate is the spatial standard deviation (r) of ĥi,j. (Because Σi,j ĥi,j = 1 and 0 ≤ ĥi,j ≤ 1, ĥi,j is a probability function and thus has a well-defined standard deviation.) It should not be surprising that, under the assumptions described above, the
local-asperity error is to a good approximation

EA = QA Σk Σl ( Σ{(i,j) : (i−k)² + (j−l)² ≤ πr²} P̂O(|∇I/I|i,j) ),    (3.5)
where QA > 0 is a constant and P̂O(|∇I/I|i,j) is the probability that there is an occluding border at position (i, j) given that the contrast there is |∇I/I|i,j (footnote 3). This equation is proportional to the mean number of occluding borders found when a window of radius r is placed over an arbitrary point in the image. The "k" and "l" sums perform the calculation of the mean over the retina (with the normalization constant included in QA). The "i" and "j" sum is proportional (also through QA) to the number of borders around position (k, l). This is because P̂O gives the probability (fraction) of a border at each position. Elsewhere (Balboa & Grzywacz, 1999b), we present results of measurements of P̂O(|∇I/I|i,j), which are needed for estimating the local-asperity error through equation 3.5. In that work, we show that one of the main properties of this function is that it is a sigmoidal function of local contrast. Because the vast majority of contrasts are low (Ruderman & Bialek, 1994; Vu et al., 1997; Zhu & Mumford, 1997; Balboa & Grzywacz, 1999b), we here neglect the saturation portion of the sigmoid and approximate

P̂O(|∇I/I|i,j) = (|∇I/I|i,j)^m,    (3.6)
where m ≥ 1. A multiplicative constant that should appear on the right-hand side of this equation is set to 1 without loss of generality, since the balance between EQ and EA is controlled by the parameters QQ and QA (see equations 3.4 and 3.5). 3.3 Lateral Inhibition's Amplitude. Why does the amplitude of h not matter much for the computation of error? Although the lateral-inhibition extent matters for the quantal-noise error, the amplitude of hi,j does not, since equation 3.4 depends only on the normalized filter, ĥi,j. (However, this amplitude matters for the OPL's output.) The amplitude is eliminated in the derivation of equation 3.4, since the quantal-noise error is relative, that is, it is standard deviation over mean. Similarly, errors made by the

3 This is the most natural definition of local contrast for the problem at hand. One thinks of local contrast as the ratio between the difference of intensities at neighboring points and the mean intensity. However, for an image on a continuous domain, neighboring points are not well defined. Therefore, a differential definition of contrast using the gradient operation is natural.
local-asperity term do not depend on h's amplitude. They are just a count of how many multiple borders are crossed on average when centering the filter on a point in the image. This is because the error depends only on the probability that borders pass through points inside the filter, and this probability is independent of the filter's amplitude. 4 Results 4.1 Bell-Shape Behavior. We simulated our hypothesis with 89 natural underwater and atmospheric images collected from the World Wide Web (Balboa, 1997). In the simulations, we found the inhibitory filter that minimized the attribute-estimation error expressed in equation 3.3, using equations 3.4 and 3.5 and the methods in the appendix. The main result of the simulations appears in Figure 3. It reveals a bell-shape behavior of the lateral-inhibition extent as a function of intensity. The rise at low intensities begins at around 3 to 10 photons per integration time of the photoreceptor. Therefore, the hypothesis predicts that the bell-shape behavior is due to cone processing, because rods are saturated at these intensity levels (Rao, Buchsbaum, & Sterling, 1994). This behavior contrasts with that stemming from other theories of lateral inhibition in the literature (Srinivasan et al., 1982; Atick & Redlich, 1992; Field, 1994). Those theories predicted that the lateral-inhibition extent falls with intensity (Balboa & Grzywacz, 2000). The stepwise appearance of this extent as a function of intensity in Figure 3 was due to our searching the extent of the optimal filter in jumps of powers of 2 (see the appendix). However, averaging the lateral-inhibition extent across images eliminated this stepwise appearance. Figure 3 also presents the dependence of the bell-shape behavior of the lateral-inhibition extent on Q, which is proportional to the ratio between the weight of the quantal-noise error (QQ; see equation 3.4) and that of the local-asperity error (QA; see equation 3.5) (see the appendix).
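For illustration only (the authors' simulation code and parameter values are not given here), the error terms of equations 3.4 through 3.6 can be sketched for a 1D image; the box filter, the parameter values, and the toy image are our assumptions:

```python
import numpy as np

def quantal_noise_error(h_hat, mean_I, Q_Q):
    """E_Q of equation 3.4: shrinks as the normalized filter h_hat spreads
    out (its sum of squares falls) and as the mean input intensity rises."""
    return Q_Q * np.sqrt(np.sum(h_hat ** 2) / np.sqrt(mean_I))

def local_asperity_error(I, radius, Q_A, m=4):
    """E_A of equations 3.5-3.6: mean count of likely occluding borders
    inside a window of the filter's radius, with P_O ~ |grad I / I|^m."""
    contrast = np.abs(np.gradient(I)) / np.maximum(I, 1e-12)
    P_O = contrast ** m                      # equation 3.6, low-contrast regime
    window = np.ones(2 * radius + 1)         # 1D stand-in for the disc of radius r
    return Q_A * np.mean(np.convolve(P_O, window, mode="same"))

# A toy 1D "image": one bright object on a darker background.
I = np.concatenate([np.full(40, 100.0), np.full(20, 160.0), np.full(40, 100.0)])

widths = [1, 2, 4, 8, 16]
EQs = [quantal_noise_error(np.ones(w) / w, I.mean(), Q_Q=30.0) for w in widths]
EAs = [local_asperity_error(I, radius=w, Q_A=5000.0) for w in widths]
totals = [eq + ea for eq, ea in zip(EQs, EAs)]
```

As the text argues, the two errors pull in opposite directions: EQ falls and EA grows as the filter widens, so the total error of equation 3.3 is minimized at an intermediate extent.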
The effect of a rise in Q is to increase the lateral-inhibition extent. Figure 4 illustrates the effect of the shape of P̂O(|∇I/I|i,j) on the bell-shape curves. The main free parameter quantifying this shape is m, which gives the power of the rise of P̂O(|∇I/I|i,j) as a function of |∇I/I|i,j (see equation 3.6). The effect of this parameter is to increase the lateral-inhibition extent as m rises. Moreover, for small m (little curvature), the new hypothesis no longer predicts a bell-shape behavior but only a fall of the lateral-inhibition extent as a function of intensity. For m = 1, there never was a bell-shape behavior; for m = 2, this behavior occurred in 80 images out of 89; and for m = 4, this behavior was always there.
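The appendix's powers-of-2 search and the Q dependence shown in Figure 3 can be caricatured with a toy stand-in for equation 3.3; the functional forms EQ ~ Q/sqrt(w) and EA ~ w below are our simplification, not the paper's:

```python
import numpy as np

def optimal_extent(Q, widths=(1, 2, 4, 8, 16, 32, 64)):
    """Search lateral-inhibition extents in jumps of powers of 2,
    minimizing a toy total error E = E_Q + E_A (cf. equation 3.3):
    E_Q ~ Q / sqrt(w)  (averaging over a wider filter damps quantal noise),
    E_A ~ w            (a wider filter straddles more borders)."""
    errors = [Q / np.sqrt(w) + float(w) for w in widths]
    return widths[int(np.argmin(errors))]

# Raising the relative weight Q of the quantal-noise error widens the
# optimal filter, as in Figure 3; the returned extents are powers of 2,
# which is what produces the stepwise curves mentioned in the text.
extents = [optimal_extent(Q) for Q in (1.0, 10.0, 100.0)]
```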
4.2 Habitats. Besides depending on the hypothesis parameters, the lateral inhibition’s bell-shape behavior also depends on the visual habitat.
Figure 3: Lateral-inhibition extent as a function of intensity and parametric on Q, which weighs the quantal-noise error relative to the local-asperity error. Standard curve (solid): Q = 1000, m = 4. The other curves vary only Q. There is a bell-shape behavior in the lateral-inhibition extent for the two images shown. By increasing Q, the lateral-inhibition extent increases.
There is a small statistical difference in the lateral-inhibition extent between underwater and atmospheric images (one-sided Mann-Whitney test, DF = 87, U = 727, z = 2.12, p < 0.0172). The difference is such that the mean maximal extent of lateral inhibition is 33% larger in underwater than in atmospheric images. One can explain this by observing that light scattering and absorption are so high in water (Mobley, 1995) that underwater images
Figure 4: Lateral-inhibition extent as a function of intensity and parametric on m, which models the curvature of the probability-of-real-borders-as-a-function-of-contrast curve. Standard curve (solid): Q = 1000, m = 4. The other curves vary only m. There is a bell-shape behavior in the lateral-inhibition extent for the two images shown, but only when m > 1, that is, when there is a curvature in the probability-of-real-borders distribution.
often have large regions of relatively constant intensity. Relatedly, another aspect of visual habitats that seems relevant is whether they have many small objects (as in tropical foliage) or large regions of relatively constant reflectance (as in Arctic scenes). To test whether such habitats lead to different receptive fields according to the minimal local-asperity hypothesis,
we reclassified the images into smooth and rugged categories.4 The maximal extent of lateral inhibition for smooth images is larger than for rugged images (one-sided Mann-Whitney test, DF = 34, U = 101, p < 0.05). The mean maximal extent was 45% larger for smooth than for rugged images. (Such a difference was observed in horizontal cells when using smooth and rugged backgrounds; Reifsnider & Tranchina, 1995; Kamermans et al., 1998.)

4.3 Other Results Consistent with Biology. As defined in equation 3.2, lateral inhibition would help with luminance adaptation, complementing mechanisms such as pupil-size change and photoreceptor adaptation (for a review, see Dowling, 1987). To see this, let T_{i,j} = T_0 and B_{i,j} = B_0 be constants. (This simplistic example corresponds to a steady-state stimulus with no borders or noise.) Then,
B_0 = \frac{T_0}{1 + h_0^n B_0^n}, \qquad \text{where } h_0 = \sum_{i,j} h_{i,j}.  (4.1)

From this equation one gets

B_0 + h_0^n B_0^{n+1} = T_0.  (4.2)

This equation shows that B_0 grows monotonically with T_0, since

\frac{dT_0}{dB_0} = 1 + (n+1)\, h_0^n B_0^n > 0,  (4.3)

and that if T_0 = 0, then B_0 = 0. At low light levels, B_0 \gg h_0^n B_0^{n+1} (because B_0 grows with T_0 and B_0(T_0 = 0) = 0), and therefore

B_0 = T_0.  (4.4)
Consequently, the response grows proportionally with intensity at low intensities, which is consistent with the biological data (Ashmore & Falk, 1980; Robson & Frishman, 1995). Another biologically correct consequence of \lim_{T_0 \to 0} B_0 = 0 is that lateral inhibition disappears at low intensities (Barlow et al., 1957; Bowling, 1980).

4 Four individuals (two were the authors) performed a classification of all the images according to the following instruction: "We are giving you a set of eighty-nine natural images. Imagine that at each pixel, one inserts a column perpendicular to the plane of the image to a height proportional to the pixel's gray level. This would form a three-dimensional surface. Please classify the images into two sets of approximately equal sizes, such that one set contains those images with the largest regions of relatively low roughness in the imaginary three-dimensional surface." After the individuals performed this classification, we placed into the "smooth" and "rugged" categories only those images that all the individuals judged to have low and high roughnesses, respectively.
At high intensities, h_0^n B_0^{n+1} \gg B_0, and

B_0 = \frac{T_0^{1/(n+1)}}{h_0^{n/(n+1)}},  (4.5)
which, if h_0 does not depend strongly on intensity, corresponds to a strong intensity compression, that is, luminance adaptation. This compression is caused by the division in equation 3.1, not by n, since there is also compression for n = 1. However, the compression is stronger when n > 1. This compression leads to a power-law behavior with intensity and is thus directly consistent with Stevens's power law (Stevens, 1970). A connection between this law and a similar feedback mechanism was noted previously for invertebrate photoreceptors (Grzywacz & Hillman, 1988). Therefore, one of the roles proposed by the minimal local-asperity hypothesis for lateral inhibition in the OPL is the compression of the wide range of natural intensities into the narrow dynamic range of the bipolar cells (their total range is about 20 mV (Werblin, 1977), and their noise is of the order of 1 mV). This compression should occur in regions of the image away from borders (see Figure 2.2B, site 1). We observe such a compression, which is strengthened as the inhibitory feedback's power (n) increases (see Figure 5). As mentioned before, this form of compression is identical to Stevens's power law (Stevens, 1970). However, observe that there is no compression until, on average, about one photon is absorbed per integration time of the photoreceptor (see Figure 5). Besides the bell-shape behavior (see Figures 3 and 4) and the adaptation-related Stevens's power law behavior (see Figure 5), the minimal local-asperity hypothesis is also consistent with at least four other aspects of retinal biology (two of which were discussed above but are recapitulated here to condense all the biological implications of the hypothesis in a single place). First, at low background intensity, lateral inhibition disappears, as discussed after equation 4.4 (Barlow et al., 1957; Bowling, 1980).
Second, the mechanism of action should be division-like, as discussed after equation 2.1 (Merwine et al., 1995; Verweij, Kamermans, & Spekreijse, 1996). Third, the same mathematical analysis that led to equation 3.3 (Grzywacz & Balboa, 1999) shows that despite being division-like, lateral inhibition would lead to responses proportional to contrast even when relatively high gradients of illumination exist (Tranchina et al., 1981). Fourth, that analysis also shows that positive Mach bands tend to be larger than negative ones, particularly at low background intensities (Fiorentini & Radici, 1957; Ratliff, 1965). An intuition for this asymmetry arises when one considers the extreme case of an edge for which only one of its sides has zero intensity. Because inhibition is division-like, the zero-intensity side will not respond (zero divided by another number is still zero) and thus not develop a Mach band (see equation 3.1), whereas the other side will.
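The steady-state relation of equation 4.2 and its two regimes (equations 4.4 and 4.5) are easy to check numerically. The following sketch is our illustration, not the authors' program; h_0 and the intensity range are arbitrary unit choices. It solves equation 4.2 for B_0 by bisection and measures the log-log slope of B_0 against T_0 at both ends of the range:

```python
import numpy as np

def steady_state_B(T0, h0=1.0, n=2, iters=200):
    # Solve equation 4.2, B0 + h0^n * B0^(n+1) = T0, by bisection.
    # The left-hand side is increasing in B0; it equals 0 < T0 at B0 = 0
    # and is >= T0 at B0 = T0, so the root is bracketed by [0, T0].
    lo, hi = 0.0, T0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + h0 ** n * mid ** (n + 1) < T0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 2
T = np.logspace(-4, 6, 11)  # arbitrary illustrative intensity range
B = np.array([steady_state_B(t, n=n) for t in T])

# Log-log slopes at the two ends of the intensity range.
low_slope = np.log(B[1] / B[0]) / np.log(T[1] / T[0])       # equation 4.4
high_slope = np.log(B[-1] / B[-2]) / np.log(T[-1] / T[-2])  # equation 4.5
print(round(low_slope, 2), round(high_slope, 2))  # prints: 1.0 0.33
```

For n = 2, the slope is 1 at low intensities (the linear regime of equation 4.4) and 1/(n+1) = 1/3 at high intensities (the compressive, Stevens-like regime of equation 4.5).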
Figure 5: Output at the bipolar-cell level as a function of intensity and parametric on n, which models the stoichiometry of the horizontal-cell feedback. The units of intensity are arbitrary, since the curves were obtained from equation 4.2 with an arbitrary but constant value of h_0. At low intensities, there is no compression (linear intensity-output relationship; slope = 1), but at higher intensities, there is (sublinear intensity-output relationship; slope < 1). The compression increases with n (lower slopes). In the no-compression regime, lateral inhibition is negligible.
5 Discussion
This article shows that the minimal local-asperity hypothesis can account for the bell-shape behavior with light intensity of the OPL's lateral-inhibition extent (see Figures 3 and 4; Baldridge, 1993; Jamieson, 1994; Xin et al., 1994; Xin & Bloomfield, 1999). Moreover, this hypothesis is consistent with several other physiological and psychophysical behaviors (see section 4.3). A testable prediction of this hypothesis is that the details of the bell-shape behavior will depend on the habitat where the animal lives (see section 4.2). We also studied the role of three independent parameters of the hypothesis: the weight of the quantal-noise error relative to the local-asperity error (see Figure 3), the power of the dependence on contrast of the probability that an occluding border passes through a point (see Figure 4), and the exponential power of the inhibitory action (see Figure 5).

5.1 Limitations. Before embarking on a discussion of the implications of the results, we address some of the limitations of the minimal local-asperity hypothesis. Perhaps the most important variable that this hypothesis did
not consider is time. Adaptation of lateral inhibition must be performed taking into account the different ranges of visual speeds of different animals, and thus possibly different timescales. Therefore, the hypothesis may have to be reformulated to incorporate the animals' speeds. In particular, one may have to study the distribution of occluding borders in images with movement at different speeds (movement may cause border blurring and therefore losses of contrast). Besides movement, another issue that the hypothesis does not address, which is particularly relevant for the OPL, is color processing. This issue may not be major for primates, for which the OPL's lateral inhibition is not color coded (Calkins & Sterling, 1996; Dacey, 1996), but may be key for other vertebrates (Ogden, Mascetti, & Pierantoni, 1985; Ammermüller, Muller, & Kolb, 1995). Teleost fish, for example, have complexly color-coded lateral inhibition mediated by three types of horizontal cells (for a review, see Dowling, 1987). And these fish's lateral inhibition appears to contribute to color constancy, discarding the illuminant information (Kamermans et al., 1998). Hence, it may be simplistic to consider the spatial processing role of lateral inhibition without considering its chromatic one. The data of Verweij, Kamermans, van den Aker, and Spekreijse (1996) highlight this conclusion by showing that the receptive field's spatial profiles in the goldfish's outer retina are color dependent. In our defense, we were trying to explain a spatial behavior, and we succeeded where previous theories failed (Balboa & Grzywacz, 2000). Moreover, the ideas developed here can be extended from the spatial domain to formulate a spatiochromatic version of the minimal local-asperity hypothesis. Another limitation of the hypothesis comes from a top-down consideration.
Estimation of the local-asperity error requires prior knowledge about the probability of occluding borders as a function of contrast in natural images, and knowledge about the computations performed by the neural processes involved (see Figure 2.2). Although the hypothesis specifies the type of knowledge needed (probability distributions) and its mode of use (Bayesian estimation), it does not tell us how to obtain the information. All we can say is that it should probably come from evolutionary or developmental learning, or both. Studies on the learning of prior distributions are under way in biological and computer vision (Weiss, Edelman, & Fahle, 1993; Zhu & Mumford, 1997; Jones, Sinha, Vetter, & Poggio, 1997; Bartlett & Sejnowski, 1998). Another computationally related limitation is that we implement the hypothesis using individual images but propose that local asperity is an error learned over many images. The assumption is that an individual image provides a sufficiently rich sample of its habitat. In practice, though, the retina may not make this assumption, instead pooling a large sample of images to control adaptation. This would explain why the timescale of adaptation is so slow (for a review, see Dowling, 1987). One more limitation that we wish to address has to do with our natural images. The images come from the World Wide Web, and thus lack intensity calibration and may miss energy at high spatial frequencies due
to JPEG (Joint Photographic Experts Group) compression (Fernandez, 1997) and the photographic lens's modulation transfer function (Kodak Professional Photographic Catalog, Kodak, Rochester, NY). We next argue that lack of intensity calibration and high-frequency loss are not serious problems for our results. To test whether intensity calibration limits our conclusions, we applied the hypothesis to nine calibrated images (Mackeown, 1994) from the Sowerby Image Database (Sowerby Research Centre, British Aerospace). The simulations yielded a bell-shape behavior for all the images. The reason that lack of intensity calibration should not affect the bell-shape behavior is that the main effect of passing images through photographic media is to raise the intensities to a power (the gamma of the emulsion; Altman, 1995). According to our definition of contrast (see note 3), the effect of raising the image to a power would be to multiply the contrast by that power. Hence, from equations 3.5 and 3.6, the local-asperity error would simply be multiplied by a constant, which could be absorbed in Q_A, with no change in the bell-shape behavior (see Figure 3). To test the implications of high-frequency loss, we saved the nine Sowerby images in JPEG with the highest compression factor allowed in Photoshop (Adobe, San Jose, CA), which produced files approximately 9.5 times smaller than before compression. As before, the simulations showed the bell-shape behavior for all the images. The only effect of JPEG storage was the generation of bell-shape curves with the same peak but slightly broader tuning, with the mean difference between the integrals over intensity before and after JPEG conversion being smaller than 25%. The explanation for the loss of tuning may be related to losses of contrast due to the compression algorithm (Fernandez, 1997). Loss of high spatial frequencies would smear edges and thus cause the loss of some of them due to noise.
With fewer borders, the images would be less rugged and would thus lead to larger lateral-inhibition extents than if the images did not lose high frequencies (see section 4.2). Finally, the approximations made to simulate the hypothesis may also limit the validity of the results. For instance, it would be interesting to include more realistic statistics about natural scenes (Zhu & Mumford, 1997) rather than approximate their distribution as homogeneous in our mathematical derivations (see section 3.2). Moreover, although we argue that the amplitude does not matter (see section 3.3), we could eliminate the square-function approximation for the inhibitory filter (see the appendix). Below, we will provide arguments for the bell-shape behavior that are essentially independent of the square-function assumption for the filter.

5.2 Why Does the Hypothesis Account for the Bell-Shape Behavior?
Using a simple 1D stepwise model of the image and a 1D approximation of the hypothesis, we now illustrate what accounts for the bell-shape behavior of the lateral-inhibition extent observed in the biology. This model and the approximation simplify the mathematics and consequently will
help explain the bell-shape behavior and the meaning of the parameters involved. The 1D image model to which we apply the new hypothesis is illustrated by the dashed line in Figure 2.2B. The model specifies the distributions of intensity (around a mean intensity) and object size (step length), but these distributions are not important here. By adding Poisson noise to the intensity, we can compute the quantal-noise and local-asperity errors needed for equation 3.3. The approximation here uses a continuum version of the image intensity and inhibitory filter. Substituting equation 3.6 for P_O in equation 3.5 and setting Q_A = 1 (see the appendix for a justification) in the 1D version of equation 3.5, we obtain

E_A = \sum_k \left( \sum_{i=k-r}^{k+r} \left| \frac{\nabla I}{I} \right|_i^m \right).  (5.1)
The internal sum in this 1D simplification of the model can be approximated as the sum of two terms (E_A = E_{A1} + E_{A2}). The first (E_{A1}) is due to occluding borders (with no noise) and the second (E_{A2}) to fluctuations caused by noise (with no occluding borders). We use a mean-field approximation for the first term; that is, we substitute the mean value of |\nabla I / I|_i^m for its particular value at each border. In this case, if we define \langle |\nabla I / I|_i^m \rangle = \Lambda / 4L, where \Lambda > 0 is a constant and L is the image's size, then from equation 5.1, the first term is

E_{A1} = \frac{\Lambda r}{2}.  (5.2)

(See Balboa, 1997, for further justification of equation 5.2.) The second term is the quantal effect in E_A without the effect of the occluding borders. Therefore, to calculate this term, we can assume a homogeneous illumination of the retina and compute the quantal error only for the internal sum in equation 5.1, since in this case the effect of the external sum is only to insert a multiplicative constant. For the internal sum, we approximately have a variance

\mathrm{Var}\left( \sum_{i=k-r}^{k+r} \left| \frac{\nabla I}{I} \right|_i^m \right) = \frac{c\, r\, \sigma^{2m}}{\langle I_i \rangle^{2m}},  (5.3)
where c > 0 is a constant and \sigma^2 is the variance corresponding to the mean intensity \langle I_i \rangle if one adds Poisson noise. The explanation for this equation is as follows: The variance of the sum of independent variables is the sum of the variances, which explains the factor r. To approximate the variance of |\nabla I / I|_i^m, we can write both the numerator and denominator inside the absolute value
as a mean plus a small error, and extract the first-order term of the Taylor expansion. Because here the mean of \nabla I is zero, the expansion yields the error of |\nabla I|^m divided by the mean of |I|^m. This explains the \langle I_i \rangle^{2m} term in equation 5.3 ("m" for the power of |I| and "2" for the variance). Next, we need to evaluate the variance of the error of |\nabla I|^m. If we approximate I's Poisson noise as gaussian (not too low intensities), then the distribution of \nabla I is also gaussian, with zero mean and variance proportional to \sigma^2. Transforming such distributions to |\nabla I|^m and calculating the variance (Gradshtein & Ryzhik, 1994) yields a variance proportional to \sigma^{2m}. This explains the \sigma^{2m} term in the numerator of equation 5.3. Hence, by taking the square root of this equation and remembering that for Poisson noise, \sigma^2 is proportional to \langle I_i \rangle, the E_{A2} error is

E_{A2} = \frac{g\, r^{1/2}}{\langle I \rangle_x^{m/2}},  (5.4)
where g > 0 is a constant. Besides these two terms of the local-asperity error (E_{A1} and E_{A2}), one needs the 1D quantal-noise error. To estimate this error, we use for the inhibitory filter a 1D version of the square profile defined in equation A.1 in the appendix (\hat{h}_i = \frac{1}{2r} if |i| \le r; 0 otherwise). Substituting this profile for \hat{h}_i in equation 3.4, we obtain

E_Q = \frac{Q^*}{(r \langle I \rangle_x)^{1/2}},  (5.5)
where Q^* > 0 is a constant. The error function to minimize (the approximation of equation 3.3) is the sum of equations 5.2, 5.4, and 5.5, that is,

E = \frac{\Lambda r}{2} + \frac{g\, r^{1/2}}{\langle I \rangle_x^{m/2}} + \frac{Q^*}{(r \langle I \rangle_x)^{1/2}}.  (5.6)
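The \sigma^{2m} scaling invoked in equations 5.3 and 5.4 can be spot-checked with a short Monte Carlo simulation. This is our illustration, not part of the original analysis: a zero-mean gaussian stands in for the gaussian-approximated distribution of \nabla I, and doubling its standard deviation should multiply \mathrm{Var}(|\nabla I|^m) by 2^{2m}:

```python
import numpy as np

rng = np.random.default_rng(0)

def var_abs_pow(sigma, m, samples=1_000_000):
    # grad(I) is approximated as gaussian with zero mean and standard
    # deviation proportional to sigma; measure the variance of |grad(I)|^m.
    x = rng.normal(0.0, sigma, samples)
    return np.var(np.abs(x) ** m)

for m in (1, 2):
    ratio = var_abs_pow(2.0, m) / var_abs_pow(1.0, m)
    # Equation 5.3 predicts sigma^(2m) scaling, so the ratio should be
    # close to 2^(2m): about 4 for m = 1 and about 16 for m = 2.
    print(m, round(ratio, 1))
```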
To find the optimal lateral-inhibition filter, one must find the optimal r. From equation 5.6, the optimal r varies with intensity (\langle I \rangle_x) and with the power (m) of the probability that there is an occluding border at a point as a function of its contrast. To study the dependence of the optimal r on intensity, one has to take the derivative of equation 5.6 with respect to r, equate the result to zero, and make the variable change z = 1 / \langle I \rangle_x^{1/2}, to obtain

g r z^m - Q^* z + \Lambda r^{3/2} = 0.  (5.7)

Let us first consider m = 2. The equation has the roots

z = \frac{Q^* \pm \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r}.  (5.8)
Therefore, for a sufficiently small extent (Q^{*2} > 4 \Lambda g r^{5/2}), there are two intensities that minimize the error. We will now demonstrate that with the lower of these two intensities, the extent increases with intensity, while with the other one, the extent falls with intensity. Let us study the roots when r \to 0, where Q^{*2} > 4 \Lambda g r^{5/2}. In this case, there are two positive roots of equation 5.7.

Case 1: z = \frac{Q^* + \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r} (high z corresponds to low intensity). Using the first two terms of the Taylor expansion of z with respect to 4 \Lambda g r^{5/2}, one can show

\lim_{r \to 0} \frac{Q^* + \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r} = \lim_{r \to 0} \frac{Q^* + Q^* - \frac{1}{2} \frac{4 \Lambda g r^{5/2}}{Q^*}}{2 g r} = \infty.

Therefore, at low intensities, z increases when r decreases, and thus r increases with I.

Case 2: z = \frac{Q^* - \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r} (low z corresponds to high intensity). At the limit, we obtain

\lim_{r \to 0} \frac{Q^* - \sqrt{Q^{*2} - 4 \Lambda g r^{5/2}}}{2 g r} = \lim_{r \to 0} \frac{Q^* - Q^* + \frac{1}{2} \frac{4 \Lambda g r^{5/2}}{Q^*}}{2 g r} = 0.

Therefore, at high intensities, z decreases when r decreases, and thus r decreases as I increases. Hence, if one sets m = 2 in this 1D approximation, then the filter behaves with intensity as desired when minimizing the error. Furthermore, we will show that this 1D approximation accounts for essentially all the other behaviors of lateral inhibition obtained with full natural images. To understand why the extent of the lateral-inhibition filter always falls with intensity when m = 1 (see Figure 4), one must consider equation 5.7. In this case,

z = \frac{1}{\langle I \rangle_x^{1/2}} = \frac{-\Lambda r^{3/2}}{g r - Q^*},  (5.9)

where, because \langle I \rangle_x > 0, it follows that g r - Q^* < 0. To show that r falls with \langle I \rangle_x, one takes the reciprocal of equation 5.9 and differentiates the result with respect to r to get

\frac{d \langle I \rangle_x^{1/2}}{d r} = \frac{g r - 3 Q^*}{2 \Lambda r^{5/2}}.

Hence, because g r - Q^* < 0, d \langle I \rangle_x / d r < 0. This inequality demonstrates that the extent of the lateral-inhibition filter (r) falls with mean intensity (\langle I \rangle_x) when m = 1.
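These limit arguments can also be checked by minimizing equation 5.6 directly on a grid of extents. The sketch below is our illustration; \Lambda, g, and Q^* are arbitrary unit values rather than quantities fitted to the images. It recovers both predicted behaviors: a bell-shaped optimal extent for m = 2 and a monotonic fall for m = 1.

```python
import numpy as np

def optimal_extent(I, m, Lam=1.0, g=1.0, Qstar=1.0):
    # Minimize equation 5.6,
    #   E(r) = Lam*r/2 + g*sqrt(r)/I^(m/2) + Qstar/sqrt(r*I),
    # over a dense grid of filter extents r.
    r = np.logspace(-4, 4, 2001)
    E = Lam * r / 2 + g * np.sqrt(r) / I ** (m / 2) + Qstar / np.sqrt(r * I)
    return r[np.argmin(E)]

I_vals = np.logspace(-3, 5, 17)  # arbitrary illustrative intensity range
r_m2 = [optimal_extent(I, m=2) for I in I_vals]
r_m1 = [optimal_extent(I, m=1) for I in I_vals]

peak = int(np.argmax(r_m2))
print(0 < peak < len(I_vals) - 1)                   # bell shape for m = 2
print(all(b <= a for a, b in zip(r_m1, r_m1[1:])))  # monotonic fall for m = 1
```

Both checks print True with these unit parameters: the optimal extent for m = 2 rises and then falls with intensity, while for m = 1 it only falls.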
One can also use equation 5.7 to understand why the bell-shape behavior becomes more pronounced as m increases (see Figure 4). The only negative term in equation 5.7 (the second one) must balance the other two. Let us say that the balance occurs at some low intensity (high z). If we were to lower the intensity further, then the first term would rapidly rise if m \ge 1, and more so with larger m. Hence, r would have to fall steeply to lower the positive terms of equation 5.7 and compensate for the rise due to the first term. Consequently, the effect of raising m is to produce a faster increase of the lateral-inhibition extent as a function of intensity at low intensities. We can also explain the increase of the lateral-inhibition extent with Q (see Figure 3) by using the 1D equation 5.7. As Q^* increases (more weight to the quantal-noise error), the negative term of equation 5.7 also increases. Therefore, r must rise in the positive terms to balance this increase in negativity. From the success of these results, we believe that we can use the 1D approximation in this section to explain why, even when using full natural images, the new hypothesis accounts for the bell-shape behavior of the lateral-inhibition extent. It turns out that this behavior arises from the dependence of the quantal-noise and local-asperity errors on intensity. Not surprisingly, if the lateral-inhibition filter were fixed, then the quantal-noise error would fall slowly as the inverse of the square root of intensity (see Figure 6 and the third term of equation 5.6). In turn, the local-asperity error would fall rapidly with intensity at low intensities (see Figure 6 and the second term of equation 5.6) and would stabilize at high intensities (see Figure 6 and the first term of equation 5.6). As a result, at low and high intensities, the local-asperity error would dominate, forcing the filter to be narrow to minimize the total error by straddling fewer boundaries.
In contrast, at intermediate intensities, the most important error would be the quantal-noise error, which would force a wide filter to reduce the total error by averaging out the noise. Why does the local-asperity error fall fast with intensity at low intensities and stabilize at high intensities? At high intensities, the quantal-noise error is negligible, and the only errors are due to the straddling of borders, which does not depend on intensity. At low intensities, the quantal-noise error is large, producing many spurious high-contrast points that reduce the chance that a true Mach band is detected as an outlier, causing local-asperity errors. This explanation for the bell-shape behavior is in terms of the dependence of the local-asperity and quantal-noise errors on intensity. The explanation follows from the 1D analysis and its equation 5.6. Consequently, as the analysis of this equation showed, the existence of the bell-shape behavior is independent of parameters for m \ge 2. In other words, no change of parameters can cause the two curves in Figure 6 to stop intersecting at two points.

5.3 Retinal Biology. To discuss whether the minimal local-asperity hypothesis is biologically plausible, we must address two issues. First, can the
Figure 6: Relationship between the two types of errors used to compute the lateral-inhibition extent as a function of intensity. The units of intensity are arbitrary, since the curves were obtained from equation 5.6 with an arbitrary choice of parameters and lateral-inhibition filter radius (r). The local-asperity curve corresponds to the first two terms of that equation and the quantal-noise curve to the third term. At low and high intensities, the local-asperity error dominates, but its behavior is different in each zone. This error falls fast with intensity at low intensities but is stable at high intensities. In contrast, the quantal-noise error falls steadily and slowly with intensity and dominates at intermediate intensities. When the local-asperity error dominates, the lateral-inhibition filter tends to be narrow to avoid straddling boundaries, and when the quantal-noise error dominates, the filter tends to be wide to average out the noise.
retina implement the mathematical operations specified in equations 3.4 and 3.5 to measure the error, and, relatedly, given that the postulated inhibition is nonlinear (division-like), can the retina access the image intensities or photoreceptor outputs necessary to perform these operations? Second, where in the retina are the various computations performed, and what cellular mechanisms does the retina use for them? We address the first issue elsewhere (Balboa, 1997). There, we describe simple and plausible neurophysiological mechanisms that would take care of it. We will not address it any further here, instead devoting the rest of this section to retinal-location and cellular-mechanism issues. The minimal local-asperity hypothesis explains spatial modulations of the horizontal cell receptive field when it adapts to light intensity. There are at least four known mechanisms that could participate in this modulation. One of the mechanisms is the variation of the gap junctions between horizontal cells (Mangel & Dowling, 1985; Weiler & Akopian, 1992; Myhr et al., 1994). This variation seems to be partially controlled by dopamine (Lasater & Dowling, 1985; Witkovsky & Dearry, 1991; Hampson, Weiler, & Vaney, 1994). Dopamine comes to the horizontal cell from interplexiform cells (for a review, see Dowling, 1987) or by diffusion from the inner plexiform layer (Hare & Owen, 1995). In both cases, amacrine cells control the dopamine level. Therefore, the place where the retina computes the local-asperity and quantal-noise errors may be the inner plexiform layer (where amacrine cells synapse). Another modulatory neurotransmitter system that contributes to horizontal cell coupling adaptation involves nitric oxide (NO; Miyachi, Murakami, & Nakaki, 1990; McMahon & Ponomareva, 1996; Pottek, Schultz, & Weiler, 1997). This agent appears to complement dopamine in its effects on the gap junctions (Pottek et al., 1997). Perhaps most interesting, while NO reduces the sensitivity of horizontal cells to glutamate, dopamine increases this sensitivity (McMahon & Ponomareva, 1996). Hence, the NO system may add to the mechanism proposed here for the modulation of lateral-inhibition strength in light and dark adaptation (see Figure 5). Another such mechanism may be the modulation of pH (Hampson et al., 1994). A third mechanism contributing to horizontal cell adaptation involves voltage-dependent conductances in this cell (Winslow & Knapp, 1991; McMahon, 1994). In this case, the control of these conductances does not come from the inner plexiform layer (IPL) but rather from the photoreceptors. By increasing the intensity, the horizontal cells are hyperpolarized, modulating these conductances. When the conductances increase, it becomes harder to propagate current from cell to cell, thereby narrowing the receptive field of the horizontal cell (Winslow & Knapp, 1991).
Because there are many types of conductances in each cell, they could contribute to a complex behavior such as the bell-shape behavior of the lateral-inhibition extent as a function of intensity. The last mechanism that we will discuss for lateral-inhibition adaptation is a specialization of teleost fish. For these animals, increasing the intensity causes the horizontal cell to send finger-like processes to the pedicles of cones. These processes, which are called spinules (Wagner, 1980; De Juan, Iñiguez, & Dowling, 1991; Weiler & Janssen-Bienhold, 1993), increase the area of contact between the cells, and thus spinules can increase the feedback strength in the OPL. It is interesting that the control of spinule growth is partially through extraretinal efferent fibers (De Juan, García, & Cuenca, 1996; De Juan & García, 1998). This suggests that an extraretinal structure may implement some of the computations of our hypothesis in teleost fish. In particular, the external sums in equation 3.5 to estimate the local-asperity error require a computation over the entire retina, and therefore may require an integration over larger distances than are available in retinal processes.
5.4 Multiple Tasks. Is lateral inhibition in the inner plexiform layer, in extraretinal parts of the brain, or in invertebrates doing a different thing from that in the OPL? We ask this question because other lateral-inhibition models fit data from these systems well (Srinivasan et al., 1982; Atick & Redlich, 1992). It is possible and even likely that these systems use lateral inhibition for different tasks. These tasks may include extraction of different visual attributes such as direction of motion (Barlow & Levick, 1965), orientation of edges (Sillito, 1975, 1979), and temporal modulation of intensity (Werblin, Maguire, Lukasiewicz, Eliasof, & Wu, 1988). And because these tasks are different, they may require different mechanisms of lateral inhibition, as observed experimentally (Merwine et al., 1995; Cook & McReynolds, 1998). These differences may explain why lateral inhibition of ganglion cells may not display a bell-shape behavior (Van Ness & Bouman, 1967). Similarly, lateral inhibition in insects does not appear to have a bell-shape behavior, since one can account for the properties of this inhibition with the predictive-coding theory (Srinivasan et al., 1982; van Hateren, 1992; Balboa & Grzywacz, 2000). Why would the tasks be different in insects? Perhaps the answer is that their "retina" must achieve in one or two layers many of the goals of several layers of the vertebrate system. Moreover, the brains of insects are much smaller than those of vertebrates, making neural resources more constraining (Brooke, Hanley, & Laughlin, 1998) and thus forcing "careful consideration" of tasks. Our philosophy is that for each system one must analyze tasks, limitations, and internal knowledge. There is no reason to believe that these factors are identical across systems, therefore allowing for different behavior in different systems.

Appendix
The first section of this appendix describes the computational approximations performed to implement the minimal local-asperity hypothesis. The second section describes the error-minimization procedure and the choice of the parameters.

A.1 Approximations in the Error Estimation.
A.1.1 Approximation for the Lateral-Inhibition Filter. To simplify the computation of $h_{i,j}$ from the properties of $P_O(|\nabla I/I|_{i,j})$ (see equation 3.5) and the quantal noise (see equation 3.4), we approximate $h_{i,j}$ with a step (square) function. In other words, we use a disk in two dimensions defined as

$$\hat{h}_{i,j} = \begin{cases} \dfrac{1}{\pi r^{*2}}, & \text{if } (i^2 + j^2)^{1/2} \leq r^*, \\ 0, & \text{otherwise.} \end{cases} \tag{A.1}$$

From this approximation, the radius $r$ in equation 3.5 is $r = r^*/2$.
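A minimal numerical sketch of this approximation (our own illustration, not the authors' code; the radius `r_star = 10` pixels and the grid size are arbitrary choices) builds the step filter on a pixel grid and checks that it integrates to approximately 1:

```python
import numpy as np

def disk_filter(r_star, half_width):
    """Step ("disk") approximation of the lateral-inhibition filter:
    1/(pi r*^2) inside radius r*, 0 outside (equation A.1)."""
    coords = np.arange(-half_width, half_width + 1)
    i, j = np.meshgrid(coords, coords, indexing="ij")
    inside = np.sqrt(i**2 + j**2) <= r_star
    return np.where(inside, 1.0 / (np.pi * r_star**2), 0.0)

h = disk_filter(r_star=10.0, half_width=15)
# On a unit-spaced pixel grid, the filter should sum to roughly 1.
print(h.sum())
```

The discretization error in the normalization shrinks as $r^*$ grows, since the number of pixels inside the disk approaches $\pi r^{*2}$.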
Early Retinal Lateral Inhibition
1511
A.1.2 Approximation for $|\nabla I/I|_{i,j}$. The discrete approximation for the absolute value of the intensity's gradient used in the simulations is

$$|\nabla I|_{i,j} = \left( \frac{(I_{i+1,j} - I_{i-1,j})^2}{4\,\Delta x^2} + \frac{(I_{i,j+1} - I_{i,j-1})^2}{4\,\Delta y^2} \right)^{1/2},$$

where $\Delta x = \Delta y$ is the distance between neighboring pixels. Without loss of generality, we set $\Delta x = \Delta y = 1$. Because we have to calculate $|\nabla I/I|_{i,j}$, we estimate the local intensity as

$$I_{i,j} = \frac{I_{i,j+1} + I_{i,j-1} + I_{i+1,j} + I_{i-1,j}}{4},$$

to reduce the effect of noise. When $I_{i,j+1} = I_{i,j-1} = I_{i+1,j} = I_{i-1,j} = 0$, there is a problem with the computation of $\nabla I/I$. We assume that this occurs at very dim intensities, such that the probability of each photoreceptor's absorbing a photon is small but not zero. One can approximate this case as if one of the photoreceptors absorbs a photon but the others do not. Therefore, approximately,

$$\left| \frac{\nabla I}{I} \right|_{i,j} = \frac{(1/4)^{1/2}}{1/4} = 2,$$

if the four values are zero. We justify this approximation by noting that if we wait a sufficiently long time, some photons will be absorbed, but the probability of absorbing more than one photon in the integration time of the four photoreceptors is very small.

A.2 Optimal Lateral-Inhibition Extent.
A.2.1 Error-Minimization Process. The program has to find the lateral-inhibition extent ($r = r^*/2$; see equations 3.4, 3.5, and A.1) that minimizes equation 3.3. The inputs to the program are the natural image, the mean intensity, and the parameters (see the next section). The program first adds noise to the image (Balboa & Grzywacz, 2000) and then computes $|\nabla I/I|_{i,j}$. Next, the program plugs $|\nabla I/I|_{i,j}^m$ into equation 3.6, uses it to calculate equation 3.5, and with the aid of this equation and equation 3.4 minimizes equation 3.3. The minimization procedure is an exhaustive search over $r$ in geometric steps of 2, starting at 2 pixels. After completion of the exhaustive search, the program estimates the minimal $r$ through interpolation with a parabola around the minimum. We use a parabola as the simplest polynomial interpolation (Burden & Faires, 1989). This minimization was performed over the $[10^{-1}, 10^6]$ range of intensities in steps of 0.5 log units.
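The two computational ingredients just described can be sketched as follows. This is our own minimal reimplementation, not the authors' program: `grad_ratio` computes the discrete $|\nabla I/I|$ of Section A.1 (with the single-photon fallback value of 2), and `geometric_search` performs an exhaustive search in geometric steps of 2 followed by parabolic refinement. The error function passed to the search stands in for equation 3.3.

```python
import numpy as np

def grad_ratio(I):
    """|grad I / I| via central differences and a 4-neighbor local mean
    (Section A.1.2); returned for interior pixels only."""
    gx = (I[2:, 1:-1] - I[:-2, 1:-1]) / 2.0
    gy = (I[1:-1, 2:] - I[1:-1, :-2]) / 2.0
    mean = (I[2:, 1:-1] + I[:-2, 1:-1] + I[1:-1, 2:] + I[1:-1, :-2]) / 4.0
    ratio = np.full(mean.shape, 2.0)   # fallback where all 4 neighbors are 0
    np.divide(np.hypot(gx, gy), mean, out=ratio, where=mean > 0)
    return ratio

def geometric_search(err, r0=2.0, n_steps=8):
    """Exhaustive search over r in geometric steps of 2 starting at r0,
    then parabolic interpolation around the sampled minimum (Section A.2.1)."""
    rs = r0 * 2.0 ** np.arange(n_steps)
    es = np.array([err(r) for r in rs])
    k = int(np.argmin(es))
    if k in (0, len(rs) - 1):
        return rs[k]
    # Fit a parabola through the minimum and its two neighbors (in log2 r).
    x = np.log2(rs[k - 1:k + 2])
    a, b, _ = np.polyfit(x, es[k - 1:k + 2], 2)
    return 2.0 ** (-b / (2.0 * a))
```

For example, an image patch whose four neighbors are all zero yields the fallback ratio 2, and a toy error curve with a minimum between two sampled radii is recovered by the parabolic step.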
1512
Rosario M. Balboa and Norberto M. Grzywacz
A.2.2 Parameters. There are four parameters in the equations ($n$, $Q_Q$, $Q_A$, and $m$), but they are not all independent. Because the parameters $Q_Q$ and $Q_A$ weigh two terms whose sum is to be minimized, these parameters are not independent: multiplying them by a common factor does not change the minimizing filter. Hence, without loss of generality, we set the parameter $Q_A = 1$. We also define the parameter $Q = Q_Q / (2\pi^{1/2})$ so that for the filter in equation A.1, $E_Q$ has the simple form $E_Q = Q / (r \langle I \rangle^{1/2})$. Thus, the only true free parameters are $n$, $Q$, and $m$. To learn their effects on the theory, we vary them in the ranges [1, 3], [100, 10,000], and [1, 4], respectively, in the simulations.

Acknowledgments
We thank Stewart Bloomfield for sending us his data on horizontal cell adaptation before publication and Alexander Ball for loaning us his only copy of Jamieson's doctoral dissertation. Thanks are also due to Andrew Wright for allowing us to use the Sowerby Image Database (British Aerospace) and to Alan Yuille and Scott Konishi for sharing with us their TIFF version of the images from this database. We thank Joaquín De Juan, Russell Hamer, and the reviewers of this article for critical comments. Alan Yuille, Pierre-Yves Burgi, and Davi Geiger wrote letters of support for R.B.'s work before it was defended as a thesis in computer science. Finally, we thank José Oncina for financial support during R.B.'s Ph.D. work at the University of Alicante, Alicante, Spain. This work was supported by National Eye Institute grants EY08921 and EY11170 and the William A. Kettlewell Chair to N.M.G., by Grant TIC97-0941 to Dr. José Oncina, and by a National Eye Institute Core grant to Smith-Kettlewell.

References

Adelson, E. H. (1993). Perceptual organization and the judgment of brightness. Science, 262, 2042–2044.
Altman, J. H. (1995). Photographic films. In M. Bass (Ed.), Handbook of optics, Vol. 1: Fundamentals, techniques and design. New York: McGraw-Hill.
Ammermüller, J., Müller, J. F., & Kolb, H. (1995). The organization of the turtle inner retina. II. Analysis of color-coded and directionally selective cells. J. Comp. Neurol., 358, 35–62.
Ashmore, J. F., & Falk, G. (1980). The single-photon signal in rod bipolar cells of the dogfish retina. J. Physiol. Lond., 300, 151–166.
Atick, J. J., & Redlich, A. N. (1992). What does the retina know about natural scenes? Neural Comp., 4, 196–210.
Balboa, R. M. (1997). Teorías para la inhibición lateral en la retina basadas en autocorrelación, curtosis y aspereza local de las imágenes naturales. Unpublished doctoral dissertation, University of Alicante, Alicante, Spain.
Balboa, R. M., & Grzywacz, N. M. (1998). The minimal-local asperity theory of retinal lateral inhibition. Invest. Ophthalmol. Vis. Sci., 39, S562.
Balboa, R. M., & Grzywacz, N. M. (1999a). Occlusions contribute to scaling in natural images. Manuscript submitted for publication.
Balboa, R. M., & Grzywacz, N. M. (1999b). Occlusions and their relationship with the distribution of contrasts in natural images. Manuscript submitted for publication.
Balboa, R. M., & Grzywacz, N. M. (1999c). Biological evidence for an ecological-based theory of early retinal lateral inhibition. Invest. Ophthalmol. Vis. Sci., 40, S386.
Balboa, R. M., & Grzywacz, N. M. (2000). The role of early retinal lateral inhibition: More than maximizing luminance information. Vis. Neurosci., 17:1.
Baldridge, W. H. (1993). Effect of background illumination on horizontal cell receptive-field size in the retina of the goldfish (Carassius auratus). Unpublished doctoral dissertation, McMaster University, Hamilton, Ontario, Canada.
Baldridge, W. H., & Ball, A. K. (1991). Background illumination reduces horizontal cell receptive field size in both normal and 6-hydroxydopamine-lesioned goldfish retinas. Visual Neurosci., 7, 441–450.
Barlow, H. B., Fitzhugh, R., & Kuffler, S. W. (1957). Change of organisation in the receptive fields of the cat's retina during dark adaptation. J. Physiol., 137, 338–354.
Barlow, H. B., & Levick, W. R. (1965). The mechanism of directionally selective units in the rabbit's retina. J. Physiol., 178, 477–504.
Bartlett, M. S., & Sejnowski, T. J. (1998). Learning viewpoint-invariant face representations from visual experience in an attractor network. Network, 9, 399–417.
Baylor, D. A., Lamb, T. D., & Yau, K.-W. (1979). Responses of retinal rods to single photons. J. Physiol., 288, 613–634.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis. New York: Springer-Verlag.
Bowling, D. B. (1980). Light responses of ganglion cells in the retina of the turtle. J. Physiol., 209, 173–196.
Brooke, M. de L., Hanley, S., & Laughlin, S. B. (1998). The scaling of eye size with body mass in birds. Proc. R. Soc. Lond. B, 266, 405–412.
Burden, R. L., & Faires, J. D. (1989). Numerical analysis (4th ed.). Boston: PWS-KENT Publishing Company.
Burington, R. S. (1973). Handbook of mathematical tables and formulas. New York: McGraw-Hill.
Calkins, D. J., & Sterling, P. (1996). Absence of spectrally specific lateral inputs to midget ganglion cells in primate retina. Nature, 381(6583), 613–615.
Campbell, F. W., & Robson, J. G. (1968). Application of Fourier analysis to the visibility of gratings. J. Physiol. (Lond.), 197, 551–566.
Cook, P. B., & McReynolds, J. S. (1998). Modulation of sustained and transient lateral inhibitory mechanisms in the mudpuppy retina during light adaptation. J. Neurophysiol., 79, 197–204.
Dacey, D. M. (1996). Circuitry for color coding in the primate retina. Proc. Natl. Acad. Sci. USA, 93, 582–588.
De Juan, J., & García, M. (1998). Interocular effect of actin depolymerization on spinule formation in teleost retina. Brain Res., 792, 173–177.
De Juan, J., García, M., & Cuenca, N. (1996). Formation and dissolution of spinules and changes in nematosome size require optic nerve integrity in black bass (Micropterus salmoides) retina. Brain Res., 707, 213–220.
De Juan, J., Iñiguez, C., & Dowling, J. E. (1991). Nematosomes in external horizontal cells of white perch (Roccus americana) retina: Changes during dark and light adaptation. Brain Res., 546, 176–180.
Dowling, J. E. (1987). The retina: An approachable part of the brain. Cambridge, MA: Belknap Press of Harvard University Press.
Fain, G., & Dowling, J. E. (1973). Intracellular recordings from single rods and cones in the mudpuppy retina. Science, 180, 1178–1181.
Fernandez, J. N. (1997). GIFs, JPEGs and BMPs: Handling Internet graphics. Cambridge, MA: MIT Press.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comp., 6, 559–601.
Fiorentini, A., & Radici, T. (1957). Binocular measurements of brightness on a field presenting a luminance gradient. Atti. Fond. Giorgio Ronchi, XII, 453–461.
Fuortes, M. G. F., & Yeandle, S. (1964). Probability of occurrence of discrete potential waves in the eye of the Limulus. J. Physiol., 47, 443–463.
Gallant, S. I. (1995). Expert systems and decision systems using neural networks. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 377–389). Cambridge, MA: MIT Press.
Gradshtein, I. S., & Ryzhik, I. M. (1994). Table of integrals, series, and products (5th ed.). San Diego, CA: Academic Press.
Grzywacz, N. M., & Balboa, R. M. (1999). The minimal-local asperity theory of early retinal lateral inhibition. Manuscript submitted for publication.
Grzywacz, N. M., & Hillman, P. (1988). Biophysical evidence that light adaptation in Limulus photoreceptors is due to a negative feedback. Biophys. J., 53, 337–348.
Grzywacz, N. M., Hillman, P., & Knight, B. W. (1988). The quantal source of area supralinearity of flash responses in Limulus photoreceptors. J. Gen. Physiol., 91, 659–684.
Hampson, E. C., Weiler, R., & Vaney, D. I. (1994). pH-gated dopaminergic modulation of horizontal cell gap junctions in mammalian retina. Proc. R. Soc. Lond. B, 255, 67–72.
Hare, W. A., & Owen, W. G. (1995). Similar effects of carbachol and dopamine on neurons in the distal retina of the tiger salamander. Vis. Neurosci., 12, 443–455.
Jamieson, M. S. (1994). Mechanisms regulating horizontal cell receptive field size in goldfish retina. Unpublished doctoral dissertation, McMaster University, Hamilton, Ontario, Canada.
Jamieson, M. S., Baldridge, W. H., & Ball, A. K. (1994). Modulation of horizontal cell receptive field size in light-adapted goldfish retinas. In Proceedings of the Association for Research in Vision and Ophthalmology (pp. 1821, 2623).
Jones, M. J., Sinha, P., Vetter, T., & Poggio, T. (1997). Top-down learning of low-level vision tasks. Curr. Biol., 7, 991–994.
Kamermans, M., Haak, J., Habraken, J. B., & Spekreijse, H. (1996). The size of the horizontal cell receptive fields adapts to the stimulus in the light-adapted goldfish retina. Vision Res., 36, 4105–4119.
Kamermans, M., Kraaij, D. A., & Spekreijse, H. (1998). The cone/horizontal cell network: A possible site for color constancy. Vis. Neurosci., 15, 787–797.
Lankheet, M. J., Rowe, M. H., Van Wezel, R. J., & van de Grind, W. A. (1996). Spatial and temporal properties of cat horizontal cells after prolonged dark adaptation. Vision Res., 36, 3955–3967.
Lasater, E. M., & Dowling, J. E. (1985). Dopamine decreases conductance of the electrical junctions between cultured retinal horizontal cells. Proc. Natl. Acad. Sci. USA, 82, 3025–3029.
Lillywhite, P. G. (1977). Single photon signals and transduction in an insect eye. Journal of Comparative Physiology, 122, 189–200.
Mackeown, W. P. J. (1994). Labelling image database and its application to outdoor scene analysis. Unpublished doctoral dissertation, University of Bristol.
Mangel, S. C., & Dowling, J. E. (1985). Responsiveness and receptive field size of carp horizontal cells are reduced by prolonged darkness and dopamine. Science, 229, 1107–1109.
McCarthy, S. T., & Owen, W. G. (1996). Preferential representation of natural scenes in the salamander retina. Invest. Ophthalmol. Vis. Sci., 37, S674.
McMahon, D. G. (1994). Modulation of electrical synaptic transmission in zebrafish retinal horizontal cells. J. Neurosci., 14, 1722–1734.
McMahon, D. G., & Ponomareva, L. V. (1996). Nitric oxide and cGMP modulate retinal glutamate receptors. J. Neurophysiol., 76, 2307–2315.
Merwine, D. K., Amthor, F. R., & Grzywacz, N. M. (1995). Interaction between center and surround in rabbit retinal ganglion cells. J. Neurophysiol., 73, 1547–1567.
Miyachi, E., Murakami, M., & Nakaki, T. (1990). Arginine blocks gap junctions between retinal horizontal cells. Neuroreport, 1, 107–110.
Mobley, C. (1995). Optical properties of water. In M. Bass (Ed.), Handbook of optics, Vol. 1: Fundamentals, techniques and design. New York: McGraw-Hill.
Myhr, K. L., Dong, C. J., & McReynolds, J. S. (1994). Cones contribute to light-evoked, dopamine-mediated uncoupling of horizontal cells in the mudpuppy retina. J. Neurophysiol., 72, 56–62.
Ogden, T. E., Mascetti, G. G., & Pierantoni, R. (1985). The outer horizontal cell of the frog retina: Morphology, receptor input, and function. Invest. Ophthalmol. Vis. Sci., 26, 643–656.
Pottek, M., Schultz, K., & Weiler, R. (1997). Effects of nitric oxide on the horizontal cell network and dopamine release in the carp retina. Vision Res., 37, 1091–1102.
Rao, R., Buchsbaum, G., & Sterling, P. (1994). Rate of quantal transmitter release at the mammalian rod synapse. Biophys. J., 67, 57–63.
Ratliff, F. (1965). Mach bands: Quantitative studies on neural networks in the retina. San Francisco: Holden-Day.
Reifsnider, E. S., & Tranchina, D. (1995). Background contrast modulates kinetics and lateral spread of responses to superimposed stimuli in outer retina. Vis. Neurosci., 12, 1105–1126.
Ridley, M. (1993). Evolution. Boston: Blackwell Scientific.
Robson, J. G., & Frishman, L. J. (1995). Response linearity and kinetics of the cat retina: The bipolar cell component of the dark-adapted electroretinogram. Vis. Neurosci., 12, 837–850.
Rodieck, R. W., & Stone, J. (1965). Analysis of receptive fields of cat retinal ganglion cells. J. Neurophysiol., 28, 832–849.
Ruderman, D. L., & Bialek, W. (1994). Statistics of natural images: Scaling in the woods. Phys. Rev. Lett., 73, 814–817.
Rushton, W. A. H. (1961). The intensity factor in vision. In W. D. McElroy & H. B. Glass (Eds.), Light and life (pp. 706–722). Baltimore: Johns Hopkins University Press.
Shapley, R. M., & Tolhurst, D. J. (1973). Edge-detectors in human vision. J. Physiol. (Lond.), 229, 165–183.
Sillito, A. M. (1975). The contribution of inhibitory mechanisms to the receptive field properties of neurones in the striate cortex of the cat. J. Physiol. (Lond.), 250, 305–322.
Sillito, A. M. (1979). Inhibitory mechanisms influencing complex cell orientation selectivity and their modification at high resting discharge levels. J. Physiol. (Lond.), 289, 33–53.
Srinivasan, M. V., Laughlin, S. B., & Dubs, A. (1982). Predictive coding: A fresh view of inhibition in the retina. Proc. R. Soc. Lond. B, 216, 427–459.
Stevens, S. S. (1970). Neural events and the psychophysical law. Science, 170, 1043–1050.
Tranchina, D., Gordon, J., Shapley, R., & Toyoda, J. (1981). Linear information processing in the retina: A study of horizontal cell responses. Proc. Natl. Acad. Sci. USA, 78, 6540–6542.
van Hateren, J. H. (1992). Theoretical predictions of spatiotemporal receptive fields of fly LMCs, and experimental validation. J. Comp. Physiol. A, 171, 157–170.
Van Ness, F. L., & Bouman, M. A. (1967). Spatial modulation transfer in the human eye. J. Opt. Soc. Am., 57, 401–406.
Verweij, J., Kamermans, M., & Spekreijse, H. (1996). Horizontal cells feed back to cones by shifting the cone calcium-current activation range. Vision Res., 36(24), 3943–3953.
Verweij, J., Kamermans, M., van den Aker, E. C., & Spekreijse, H. (1996). Modulation of horizontal cell receptive fields in the light-adapted goldfish retina. Vision Res., 36, 3913–3923.
Vu, T. Q., McCarthy, S. T., & Owen, W. G. (1997). Linear transduction of natural stimuli by dark-adapted and light-adapted rods of the salamander, Ambystoma tigrinum. J. Physiol., 505, 193–204.
Wagner, H. J. (1980). Light-dependent plasticity of the morphology of horizontal cell terminals in cone pedicles of fish retinas. J. Neurocytol., 9, 573–590.
Weiler, R., & Akopian, A. (1992). Effects of background illuminations on the receptive field size of horizontal cells in the turtle retina are mediated by
dopamine. Neurosci. Lett., 140, 121–124.
Weiler, R., & Janssen-Bienhold, U. (1993). Spinule-type neurite outgrowth from horizontal cells during light adaptation in the carp retina: An actin-dependent process. J. Neurocytol., 22, 129–139.
Weiss, Y., Edelman, S., & Fahle, M. (1993). Models of perceptual learning in vernier hyperacuity. Neural Comp., 5, 695–718.
Werblin, F. S. (1977). Synaptic interactions mediating bipolar response in the retina of the tiger salamander. In H. B. Barlow & P. Fatt (Eds.), Vertebrate photoreception (pp. 205–230). San Diego, CA: Academic Press.
Werblin, F. S., Maguire, G., Lukasiewicz, P., Eliasof, S., & Wu, S. M. (1988). Neural interactions mediating the detection of motion in the retina of the tiger salamander. Vis. Neurosci., 1, 317–329.
Winslow, R. L., & Knapp, A. G. (1991). Dynamic models of the retinal horizontal cell network. Prog. Biophys. Mol. Biol., 56, 107–133.
Wishart, K. A., Frisby, J. P., & Buckley, D. (1997). The role of 3-D surface slope in a lightness/brightness effect. Vis. Res., 37, 467–473.
Witkovsky, P., & Dearry, A. (1991). Functional roles of dopamine in the vertebrate retina. Prog. Ret. Res., 11, 247–292.
Xin, D., & Bloomfield, S. A. (1999). Dark- and light-induced changes in coupling between horizontal cells in mammalian retina. J. Comp. Neurol., 405, 75–87.
Xin, D., Bloomfield, S. A., & Persky, S. E. (1994). Effect of background illumination on receptive-field and tracer-coupling size of horizontal cells and AII amacrine cells in the rabbit retinas. In Proceedings of the Association for Research in Vision and Ophthalmology (p. 1363).
Zhu, S. C., & Mumford, D. (1997). Prior learning and Gibbs reaction-diffusion. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19, 1236–1250.
Zhu, S. C., & Mumford, D. (1998). GRADE: Gibbs reaction and diffusion equations: A framework for pattern synthesis, image denoising, and removing clutter. In Proceedings of the 6th International Conference on Computer Vision (pp. 847–854).
Received October 21, 1998; accepted August 6, 1999.
NOTE
Communicated by Emilio Salinas
Multidimensional Encoding Strategy of Spiking Neurons

Christian W. Eurich, Stefan D. Wilke
Institut für Theoretische Physik, Universität Bremen, D-28334 Bremen, Germany

Neural responses in sensory systems are typically triggered by a multitude of stimulus features. Using information theory, we study the encoding accuracy of a population of stochastically spiking neurons characterized by different tuning widths for the different features. The optimal encoding strategy for representing one feature most accurately consists of narrow tuning in the dimension to be encoded, to increase the single-neuron Fisher information, and broad tuning in all other dimensions, to increase the number of active neurons. Extremely narrow tuning without sufficient receptive-field overlap will severely worsen the coding. This implies the existence of an optimal tuning width for the feature to be encoded. Empirically, only a subset of all stimulus features will normally be accessible. In this case, relative encoding errors can be calculated that yield a criterion for the function of a neural population based on the measured tuning curves.

1 Introduction
The question of an optimal tuning width for the representation of a stimulus by a neural population is still controversial. On the one hand, narrowly tuned cells are frequently encountered, emphasizing the importance of single neurons for perception and motor control (Lettvin, Maturana, McCulloch, & Pitts, 1959; Barlow, 1972). On the other hand, theoretical arguments suggest that in most cases, broadly tuned units and distributed information processing are better suited for accurate representations (Hinton, McClelland, & Rumelhart, 1986; Georgopoulos, Schwartz, & Kettner, 1986; Baldi & Heiligenberg, 1988; Snippe & Koenderink, 1992; Seung & Sompolinsky, 1993; Salinas & Abbott, 1994; Snippe, 1996; Eurich & Schwegler, 1997; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Zhang & Sejnowski, 1999). A useful measure of the information content of a set of spike trains emitted by a population of neurons is the Fisher information matrix (Deco & Obradovic, 1997; Brunel & Nadal, 1998), which yields a lower bound on the mean squared error for unbiased estimators of the encoded quantity (Cramér-Rao inequality). In a recent approach using Fisher information, Zhang and Sejnowski (1999) derived an expression for the encoding accuracy of a population of spiking neurons as a function of the tuning width,

Neural Computation 12, 1519–1529 (2000) © 2000 Massachusetts Institute of Technology
$\sigma$, for radially symmetric receptive fields in a D-dimensional space. The calculation shows that the encoding accuracy increases with $\sigma$ if $D \geq 3$. This result can be complemented by the following consideration. Neurons generally respond to many stimulus features. The main function of a neural population, however, may be the processing of only a subset of these features. In the following, we derive an optimal coding strategy of a population of neurons whose tuning widths differ in the different dimensions. We also study the case of extremely small receptive fields where the population approach breaks down, and demonstrate the existence of an optimal tuning width if only one of the stimulus features is to be encoded accurately. Furthermore, we consider the situation that only a part of all encoded stimulus properties is accessible to an observer. General formulas will be illustrated by the example of a population of neurons with gaussian tuning functions and Poissonian spike statistics. Since part of this article is a continuation of Zhang and Sejnowski (1999), we adopt much of their formalism.

2 Model
Consider a stimulus characterized by a position $\mathbf{x} = (x_1, \dots, x_D)$ in a D-dimensional stimulus space, where $x_i$ ($i = 1, \dots, D$) is measured relative to the total range of values in the ith dimension such that it is dimensionless. Furthermore, consider a population of N identical stochastically spiking neurons that fire $\mathbf{n} = (n^{(1)}, \dots, n^{(k)}, \dots, n^{(N)})$ spikes in a time interval $\tau$ following the presentation of the stimulus. The joint probability distribution, $P(\mathbf{n} \mid \mathbf{x})$, is assumed to take the form

$$P(\mathbf{n} \mid \mathbf{x}) = \prod_{k=1}^{N} P^{(k)}(n^{(k)} \mid \mathbf{x}), \tag{2.1}$$
that is, the neurons have independent spike generation mechanisms. Note that the neural firing rates may still be correlated; the neurons may have common input or even share the same tuning function. The tuning function of neuron k, $f^{(k)}(\mathbf{x})$, gives the mean firing rate of neuron k in response to the stimulus at position $\mathbf{x}$. Unlike Zhang and Sejnowski (1999), we assume here a form of the tuning function that is not necessarily radially symmetric,

$$f^{(k)}(\mathbf{x}) = F\,\phi\!\left( \sum_{i=1}^{D} \frac{(x_i - c_i^{(k)})^2}{\sigma_i^2} \right) =: F\,\phi\!\left(\xi^{(k)2}\right), \tag{2.2}$$

where $\xi_i^{(k)2} := (x_i - c_i^{(k)})^2 / \sigma_i^2$ for $i = 1, \dots, D$, and $\xi^{(k)2} := \xi_1^{(k)2} + \cdots + \xi_D^{(k)2}$. $F > 0$ denotes the maximal firing rate of the neurons, which requires that $\max_z \phi(z) = 1$. The firing rates depend on the stimulus only through the local values of the tuning functions, so that $P^{(k)}(n^{(k)} \mid \mathbf{x})$ can be written in the form $P^{(k)}(n^{(k)} \mid \mathbf{x}) = S(n^{(k)}, f^{(k)}(\mathbf{x}), \tau)$. The function $S\colon \mathbb{N}_0 \times [0, F]\, \times\, ]0, \infty[\ \longrightarrow\ ]0, 1]$
is required to be logarithmically differentiable with respect to its second argument but is otherwise arbitrary. For a population of tuning functions with centers $c^{(1)}, \dots, c^{(N)}$, a density $g(\mathbf{x})$ is introduced according to $g(\mathbf{x}) := \sum_{k=1}^{N} \delta(\mathbf{x} - c^{(k)})$. The Fisher information matrix, $(J_{ij})$, is defined as

$$J_{ij}(\mathbf{x}) := E\left[ \left( -\frac{\partial}{\partial x_i} \ln P(\mathbf{n} \mid \mathbf{x}) \right) \left( -\frac{\partial}{\partial x_j} \ln P(\mathbf{n} \mid \mathbf{x}) \right) \right] \tag{2.3}$$
(Deco & Obradovic, 1997), where $E[\dots]$ denotes the expectation value over the probability distribution $P(\mathbf{n} \mid \mathbf{x})$. The Cramér-Rao inequality gives a lower bound on the expected estimation error in the ith dimension, $\epsilon^2_{i,\min}$ ($i = 1, \dots, D$), provided that the estimator is unbiased. In the case of a diagonal Fisher information matrix, it is given by $\epsilon^2_{i,\min} = 1 / J_{ii}(\mathbf{x})$.

3 Information Content of Neural Responses

3.1 Population Fisher Information. For a single model neuron k as described in the previous section, the Fisher information (see equation 2.3) reduces to

$$J_{ij}^{(k)}(\mathbf{x}) = \frac{1}{\sigma_i \sigma_j} A_\phi\!\left(\xi^{(k)2}, F, \tau\right) \xi_i^{(k)} \xi_j^{(k)}. \tag{3.1}$$
The function $A_\phi$, which is independent of k, abbreviates the expression

$$A_\phi(z, F, \tau) := 4F^2 \phi'(z)^2 \sum_{n=0}^{\infty} S(n, F\phi(z), \tau)\, T^2[n, F\phi(z), \tau], \tag{3.2}$$

where $T[n, z, \tau] := \frac{\partial}{\partial z} \ln S(n, z, \tau)$ and $\phi'(z) := \frac{d}{dz} \phi(z)$. The independence assumption (see equation 2.1) implies that the population Fisher information, $J_{ij}(\mathbf{x})$, is the sum of the contributions of the individual neurons, $J_{ij}(\mathbf{x}) = \sum_{k=1}^{N} J_{ij}^{(k)}(\mathbf{x})$. For a constant distribution of tuning curves, $g(\mathbf{x}) \equiv g \equiv \text{const.}$, the population Fisher information becomes independent of $\mathbf{x}$, and the off-diagonal elements vanish (Zhang & Sejnowski, 1999). In this case, the diagonal elements $J_i := J_{ii}$ are given by
$$J_i = g\, D\, K_\phi(F, \tau, D)\, \frac{\prod_{k=1}^{D} \sigma_k}{\sigma_i^2}, \tag{3.3}$$

where $K_\phi$ is defined to be

$$K_\phi(F, \tau, D) := \frac{1}{D} \int_{-\infty}^{\infty} d\xi_1 \cdots \int_{-\infty}^{\infty} d\xi_D\, A_\phi(\xi^2, F, \tau)\, \xi_1^2. \tag{3.4}$$
For identical tuning widths in all dimensions, $\sigma_i \equiv \sigma$ ($i = 1, \dots, D$), the total Fisher information, $J := \left(\sum_{i=1}^{D} J_i^{-1}\right)^{-1}$, is given by $J = g\, K_\phi(F, \tau, D)\, \sigma^{D-2}$; that is, equation 2.3 from Zhang and Sejnowski (1999) is recovered. Equation 3.3 shows that the Fisher information in the ith dimension is determined by a trade-off between the product of the tuning widths in the remaining dimensions, $\prod_{k \neq i} \sigma_k$, and the tuning width in dimension i, $\sigma_i$. In order to assess the consequences of equation 3.3 for neural encoding strategies, we provide an intuitive interpretation of the ratio of tuning widths by introducing effective receptive fields.

3.2 Effective Receptive Fields and Encoding Subpopulation. The tuning functions $f^{(k)}(\mathbf{x})$ encountered empirically typically have a single maximum. For such curves, large values of the single-neuron Fisher information (see equation 3.1) are generally restricted to a region around the center of the tuning function, $c^{(k)}$. The fraction $p(b)$ of Fisher information that falls into a region $\sqrt{\xi^{(k)2}} \leq b$ around $c^{(k)}$ is given by
$$p(b) := \frac{\int_{E_D} d^D x\, \sum_{i=1}^{D} J_{ii}(\mathbf{x})}{\int_{\mathbb{R}^D} d^D x\, \sum_{i=1}^{D} J_{ii}(\mathbf{x})}, \tag{3.5}$$

where $E_D$ denotes the region $\sqrt{\xi^2} \leq b$, and where the index (k) was dropped because the tuning curves are assumed to have identical forms. A straightforward calculation shows that $p(b)$ is given by

$$p(b) = \frac{\int_0^b d\xi\, \xi^{D+1} A_\phi(\xi^2, F, \tau)}{\int_0^\infty d\xi\, \xi^{D+1} A_\phi(\xi^2, F, \tau)}. \tag{3.6}$$
k Equation 3.6 allows the denition of an effective receptive eld, RFeff , inside of which neuron k conveys a major fraction p0 of Fisher information, (k) RF eff
» q ¼ ( )2 k :D x | j · b0 ,
(3.7)
where $b_0$ is chosen such that $p(b_0) = p_0$. The Fisher information a neuron k carries is small unless $\mathbf{x} \in RF_{\text{eff}}^{(k)}$. This has the consequence that a fixed stimulus $\mathbf{x}$ is actually encoded by only a subpopulation of neurons. If the distribution of tuning functions does not vary in the proximity of the stimulus position $\mathbf{x}$, $g(\mathbf{x}') \equiv g = \text{const.}$ for $|x_i - x_i'| < b_0 \sigma_i$ ($i = 1, \dots, D$), the point $\mathbf{x}$ in stimulus space is covered by

$$N_{\text{code}} := g\, \frac{2\pi^{D/2} (b_0)^D}{D\, \Gamma(D/2)} \prod_{j=1}^{D} \sigma_j \tag{3.8}$$
receptive fields. With the help of equation 3.8, the population Fisher information (see equation 3.3) can be rewritten as

$$J_i = \frac{D^2\, \Gamma(D/2)}{2\pi^{D/2} (b_0)^D}\, K_\phi(F, \tau, D)\, \frac{N_{\text{code}}}{\sigma_i^2}. \tag{3.9}$$
Equation 3.9 can be interpreted as follows. We assume that the population of neurons encodes stimulus dimension i accurately, while all other dimensions are of secondary importance. The minimal encoding error for dimension i, $J_i^{-1}$, is determined by the shape of the tuning curve, which enters through $b_0$ and $K_\phi(F, \tau, D)$; by the tuning width in dimension i, $\sigma_i$; and by the size of the active subpopulation, $N_{\text{code}}$. There is a trade-off between $\sigma_i$ and $N_{\text{code}}$. On the one hand, the encoding error can be decreased by decreasing $\sigma_i$, which enhances the Fisher information carried by each single neuron. Decreasing $\sigma_i$, on the other hand, will also shrink the active subpopulation via equation 3.8. This impairs the encoding accuracy, because the stimulus position is evaluated by fewer independent estimators. If equation 3.9 is valid owing to a sufficient receptive-field overlap, $N_{\text{code}}$ can be increased by increasing the tuning widths, $\sigma_j$, in all other dimensions $j \neq i$. This effect is illustrated in Figure 1.

3.3 Narrow Tuning. In order to study the effects of narrow tuning in dimension i, $\sigma_i \longrightarrow 0$, we consider a constant distribution of stimuli, $\rho(\mathbf{x}) = \text{const.}$, in the region of stimulus space containing the receptive fields of the neural population. A straightforward calculation shows that in this case, the stimulus-averaged Fisher information,
$$\langle J_i \rangle := \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_D\, \rho(\mathbf{x})\, J_i(\mathbf{x}), \tag{3.10}$$
is given by $\langle J_i \rangle = J_i$; that is, the average Fisher information for arbitrary distributions of tuning functions $g(\mathbf{x})$ is equal to the Fisher information (see equation 3.3) for the uniformly distributed population. Even if $\sigma_i$ becomes so small that gaps appear between the receptive fields, the mean Fisher information still increases with decreasing $\sigma_i$. This property is due to those few stimuli that can be localized extremely accurately by the firing of the narrowly tuned cells. The majority of stimuli, however, fall into the gaps and are no longer well represented. A neural system with these properties is obviously a bad encoder, which shows that $\langle J_i \rangle$ is not a suitable measure of system performance. Consider instead the stimulus-averaged squared minimal encoding error for unbiased estimators,

$$\langle \epsilon^2_{i,\min} \rangle := \int_{-\infty}^{\infty} dx_1 \cdots \int_{-\infty}^{\infty} dx_D\, \rho(\mathbf{x}) \left[ J(\mathbf{x})^{-1} \right]_{ii}. \tag{3.11}$$
Figure 1: Population response to the presentation of a stimulus characterized by parameters $x_{1,s}$ and $x_{2,s}$. Feature $x_1$ is to be encoded accurately. Effective receptive-field shapes are indicated for both populations. If neurons are narrowly tuned in $x_2$ (A), the active population (shaded) is small (here, $N_{\text{code}} = 3$). Broadly tuned receptive fields for $x_2$ (B) yield a much larger population (here, $N_{\text{code}} = 15$), thus increasing the encoding accuracy.
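The effect shown in Figure 1 can be reproduced in a few lines. In this sketch (our own construction; the grid spacing, $b_0$, and tuning widths are arbitrary test values), neurons sit on a regular grid of centers and a neuron is counted as active when the stimulus falls inside its effective receptive field (equation 3.7). Broadening $\sigma_2$ enlarges the active subpopulation, as equation 3.8 predicts ($N_{\text{code}} \propto \sigma_1 \sigma_2$ in $D = 2$):

```python
import numpy as np

def n_code(sigma1, sigma2, b0=1.0, spacing=0.1, x=(0.0, 0.0), extent=5.0):
    """Count neurons whose effective receptive field (xi^2 <= b0^2) covers x."""
    c = np.arange(-extent, extent, spacing)
    c1, c2 = np.meshgrid(c, c, indexing="ij")
    xi2 = ((x[0] - c1) / sigma1) ** 2 + ((x[1] - c2) / sigma2) ** 2
    return int(np.count_nonzero(xi2 <= b0 ** 2))

narrow = n_code(sigma1=0.5, sigma2=0.5)
broad = n_code(sigma1=0.5, sigma2=2.0)   # broader tuning in the x2 dimension
print(narrow, broad)  # quadrupling sigma_2 recruits roughly 4x more neurons
```

The count approximates the ellipse area $\pi b_0^2 \sigma_1 \sigma_2$ divided by the grid-cell area, so the ratio of the two counts is close to the ratio of the $\sigma_2$ values.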
For uniformly distributed tuning curves and a sufficient receptive-field overlap, equation 3.11 can be simplified using equation 3.9, which yields $\langle \epsilon^2_{i,\min} \rangle = 1 / J_i$, as expected. For narrowly tuned cells, however, the condition $g(\mathbf{x}) \equiv g = \text{const.}$ breaks down, and equation 3.9 is no longer valid. The following argument shows that, in contrast to the high-overlap approximation $1 / J_i$, $\langle \epsilon^2_{i,\min} \rangle$ will diverge for $\sigma_i \to 0$. Let $\lambda_k(\mathbf{x}) \geq 0$ be the kth eigenvalue of the real-valued, symmetric Fisher information matrix, and $U(\mathbf{x})$ the orthogonal transformation that diagonalizes $J(\mathbf{x})$, that is, $U(\mathbf{x})\, J(\mathbf{x})\, U(\mathbf{x})^T = \text{diag}[\lambda_1(\mathbf{x}), \dots, \lambda_D(\mathbf{x})]$. If $\lambda_{\max}(\mathbf{x}) := \max_j \{\lambda_j(\mathbf{x})\}$, one has

$$\langle \epsilon^2_{i,\min} \rangle = \left\langle \sum_{j=1}^{D} \left[ U(\mathbf{x})_{ij} \right]^2 \frac{1}{\lambda_j(\mathbf{x})} \right\rangle \geq \left\langle \frac{1}{\lambda_{\max}(\mathbf{x})} \sum_{j=1}^{D} \left[ U(\mathbf{x})_{ij} \right]^2 \right\rangle = \left\langle \frac{1}{\lambda_{\max}(\mathbf{x})} \right\rangle \tag{3.12}$$
Multidimensional Encoding Strategy of Spiking Neurons
1525
independent of $i$. For $\sigma_i \to 0$, regions of stimulus space emerge that are not covered by any receptive fields. These gaps are characterized by very small eigenvalues of $J(x)$, that is, $\lambda_{\max}(x) \to 0$. Thus, the minimal encoding error $\langle \epsilon^2_{i,\min} \rangle$ diverges in the presence of unrepresented areas in stimulus space. This implies that the total error $\langle \epsilon^2_{\min} \rangle := \sum_{i=1}^{D} \langle \epsilon^2_{i,\min} \rangle$ will also become infinite if any of the tuning widths approaches zero. Equation 3.3 shows that the accuracy also decreases for large $\sigma_i$. Consequently, there must be an optimal tuning width between the two regimes of broad and small tuning. In the next section, we calculate this optimal tuning width in a specific example.

Example: Poissonian Spiking and Gaussian Tuning Functions. For Poissonian spike generation and gaussian tuning functions, the single-neuron Fisher information, equation 3.1, becomes

$$J^{(k)}_{ij}(x) = \frac{F\tau}{\sigma_i \sigma_j} \exp\left(-\xi^{(k)2}/2\right) \xi^{(k)}_i \xi^{(k)}_j. \tag{3.13}$$
The population Fisher information, equation 3.3, reduces to

$$J_i = \frac{g F \tau \, (2\pi)^{D/2} \prod_{k=1}^{D} \sigma_k}{\sigma_i^2}. \tag{3.14}$$
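Equation 3.14 makes the proposed encoding strategy quantitative: $J_i$ grows with the product of all tuning widths but falls with $\sigma_i^2$, so broadening the tuning of the other features raises the information about feature $i$. A minimal numerical check (the parameter values below are arbitrary illustrations):

```python
import numpy as np

def J_pop(i, sigma, g=1.0, F_tau=1.0):
    # Eq. 3.14: J_i = g * F * tau * (2*pi)^(D/2) * prod_k sigma_k / sigma_i^2
    sigma = np.asarray(sigma, dtype=float)
    D = sigma.size
    return g * F_tau * (2 * np.pi) ** (D / 2) * np.prod(sigma) / sigma[i] ** 2

# Narrowing sigma_1 or broadening sigma_2 both increase J_1:
print(J_pop(0, [0.3, 0.5]))
print(J_pop(0, [0.3, 2.0]))   # broader tuning in the other dimension
print(J_pop(0, [0.1, 2.0]))   # narrower tuning in the encoded dimension
```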
The definition of the effective receptive field is obtained from $p(\beta)$ in equation 3.6, for which we have

$$p(\beta) = \begin{cases} 1 - e^{-\beta^2/2} \displaystyle\sum_{k=0}^{D/2} \frac{\beta^{2k}}{2^k k!}, & D \text{ even} \\[1.5ex] \dfrac{2}{2^{D/2}\,\Gamma(1 + D/2)} \left[ -\dfrac{e^{-\beta^2/2}}{2} \displaystyle\sum_{k=0}^{(D-1)/2} \frac{\beta^{D-2k}\, D!!}{(D-2k)!!} + \dfrac{D!!}{2} \sqrt{\dfrac{\pi}{2}}\, \mathrm{Erf}\!\left(\dfrac{\beta}{\sqrt{2}}\right) \right], & D \text{ odd}, \end{cases} \tag{3.15}$$

where $\mathrm{Erf}(x) := \frac{2}{\sqrt{\pi}} \int_0^x dt \exp(-t^2)$ is the gaussian error function. For a specific distribution of tuning curves in the stimulus space, the optimal tuning width mentioned in the previous paragraph can be calculated explicitly. Consider a population of neurons with tuning curves aligned on a regular grid with fixed spacing $\Delta$, such that $c^{(k)} = \sum_{l=1}^{D} \Delta k_l e_l$, where $k_l \in \mathbb{Z}$ and $e_l$ is the $l$th unit vector. For simplicity, we restrict the calculation to the case $D = 2$ and study the stimulus-averaged minimal encoding error $\langle \epsilon^2_{1,\min} \rangle$
defined in equation 3.11 as a function of the tuning widths $\hat\sigma_1 := \sigma_1/\Delta$ and $\hat\sigma_2 := \sigma_2/\Delta$. As a consequence of the grid regularity, it can be written in the form

$$\langle \epsilon^2_{1,\min} \rangle = \Delta^2\, \frac{4 \hat\sigma_1^3 \hat\sigma_2}{F\tau} \int_0^{1/(2\hat\sigma_1)} d\xi_1 \int_0^{1/(2\hat\sigma_2)} d\xi_2 \left[ E_2(\xi_1, \hat\sigma_1) E_0(\xi_2, \hat\sigma_2) - \frac{E_1(\xi_1, \hat\sigma_1)^2 E_1(\xi_2, \hat\sigma_2)^2}{E_2(\xi_2, \hat\sigma_2) E_0(\xi_1, \hat\sigma_1)} \right]^{-1}, \tag{3.16}$$
where $E_l(\xi, \hat\sigma) := \sum_{k \in \mathbb{Z}} \exp[-(\xi - k/\hat\sigma)^2/2]\,(\xi - k/\hat\sigma)^l$. For large tuning width in either direction, $\hat\sigma \gg 1$, the sum in $E_l(\xi, \hat\sigma)$ may be approximated by an integral that is independent of $\xi$, and one finds $\lim_{\hat\sigma \to \infty} E_1(\xi, \hat\sigma) \to 0$ and $\lim_{\hat\sigma \to \infty} E_l(\xi, \hat\sigma) \to \sqrt{2\pi}\,\hat\sigma$ for $l = 0, 2$. Thus, the broad-tuning limit $\hat\sigma_1 \gg 1$ and $\hat\sigma_2 \gg 1$ yields $\langle \epsilon^2_{1,\min} \rangle = \Delta^2 \hat\sigma_1 / (2\pi F\tau \hat\sigma_2) = 1/J_i$; one recovers the broad-tuning Fisher information $J_i$ from (3.14) with $D = 2$ and $1/g = \Delta^2$. The encoding error (see equation 3.16) is plotted in Figure 2 as a function of $\hat\sigma_1$ for different values of $\hat\sigma_2$. The agreement with the broad-tuning approximation for large tuning widths is clearly visible. If one of the tuning widths is small, however, the minimal encoding error $\langle \epsilon^2_{1,\min} \rangle$ no longer equals the inverse of equation 3.14. The deviations can be observed in Figure 2 for small $\hat\sigma_1$ (right part of the plotted functions) and small $\hat\sigma_2$ (upper curve). Thus, Figure 2 reflects two main results of this article. First, the broader the tuning is for feature 2, the smaller the encoding error of feature 1. This illustrates our proposed encoding strategy. Second, the encoding error for feature 1 has a unique minimum as a function of $\hat\sigma_1$. From the arguments of the previous paragraph, one expects the optimal tuning width $\sigma_1^{\mathrm{opt}}$ to be approximately equal to the smallest possible tuning width for which no gaps appear between the receptive fields of neighboring neurons. Indeed, the numerical calculation yields an optimal tuning curve width of $2\sigma_1^{\mathrm{opt}} \approx 0.8\Delta$, which corresponds to the array of tuning curves shown in the inset of Figure 2. For very small tuning widths $\hat\sigma_i$, one practically leaves the validity range of our Fisher information analysis. If the stimulus falls into a gap between receptive fields such that the cells do not fire within a reasonable time interval $\tau$, there is no possibility of making an unbiased estimate of the stimulus features. Thus, the Fisher information measure cannot be applied for $\hat\sigma_i \to 0$. At the calculated optimal tuning width in our example, however, there is still sufficient receptive field overlap to allow unbiased estimates (cf. the inset of Figure 2). Therefore, we argue that $\sigma_1^{\mathrm{opt}}$ is well within the range where Fisher information is a valid measure of encoding accuracy.
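The limits quoted for $E_l$ are easy to verify numerically. The sketch below evaluates the lattice sums directly (truncating $k$ at a range wide enough to cover the Gaussian support, an implementation choice) and checks that $E_1 \to 0$ and $E_0, E_2 \to \sqrt{2\pi}\,\hat\sigma$ for $\hat\sigma \gg 1$:

```python
import numpy as np

def E(l, xi, sigma_hat, kmax=2000):
    # E_l(xi, sigma_hat) = sum_{k in Z} exp[-(xi - k/sigma_hat)^2 / 2] (xi - k/sigma_hat)^l,
    # truncated to |k| <= kmax (kmax must cover the Gaussian support).
    k = np.arange(-kmax, kmax + 1)
    u = xi - k / sigma_hat
    return np.sum(np.exp(-u**2 / 2) * u**l)

s = 50.0                                      # broad tuning: sigma_hat >> 1
print(E(1, 0.3, s))                           # -> approximately 0
print(E(0, 0.3, s) / (np.sqrt(2*np.pi)*s))    # -> approximately 1
print(E(2, 0.3, s) / (np.sqrt(2*np.pi)*s))    # -> approximately 1
```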
Figure 2: Numerical results for the stimulus-averaged squared minimal encoding error from equation 3.16. Dashed lines: $\langle \epsilon^2_{1,\min} \rangle / (\Delta^2 F^{-1} \tau^{-1})$ as a function of $\hat\sigma_1$ for $D = 2$ and different values of $\hat\sigma_2$ (from top to bottom): $\hat\sigma_2 = 0.25$, 0.5, 1, 2. Solid line: analytical broad-tuning result, that is, $\langle \epsilon^2_{1,\min} \rangle / (\Delta^2 F^{-1} \tau^{-1}) = (1/J_1)/(\Delta^2 F^{-1} \tau^{-1})$ from equation 3.14. The inset shows gaussian tuning curves of optimal width, $\sigma^{\mathrm{opt}} \approx 0.4\Delta$.
3.4 The Problem of Hidden Dimensions. A situation that is frequently encountered empirically is that of incomplete knowledge of the neural behavior. Usually an observer will study a neural population with respect to a number of well-defined stimulus dimensions and be unable to find all relevant dimensions or unable to find a suitable parameterization for certain stimulus properties. We assume that only $d \leq D$ of the $D$ total dimensions are known. The Fisher information (see equation 3.3) can be written as $J_i = X \left( \prod_{j=1}^{d} \sigma_j \right) / \sigma_i^2$, where $X := g D K_w(F, \tau, D) \prod_{j=d+1}^{D} \sigma_j$ is an experimentally unknown constant. If the neurons' tuning curves have been measured in more than one dimension, $X$ can be eliminated by considering the tuning widths as a relative measure of information content,

$$\frac{\langle \epsilon^2_{i,\min} \rangle}{\sum_{j=1}^{d} \langle \epsilon^2_{j,\min} \rangle} = \frac{\sigma_i^2}{\sum_{j=1}^{d} \sigma_j^2}, \tag{3.17}$$
where $i$ is one of the known dimensions $1, \ldots, d$. Equation 3.17 is independent of the unknown (or experimentally ignored) stimulus dimensions $d+1, \ldots, D$, which are to be held fixed during the measurement of the $d$ tuning widths. On the basis of the accessible tuning widths only, it gives a relative quantitative measure of how accurately the individual stimulus dimensions are encoded by the neural population. Equation 3.17 states that if $\sigma_i^2$ is small compared to the sum of the squares of all measured tuning widths, the population activity allows an accurate reconstruction of the corresponding stimulus feature: the population contains much information about $x_i$. A large ratio, on the other hand, indicates that the population response is unspecific with respect to feature $x_i$. As an example, consider the different pathways of signal processing that have been suggested for the visual system (Livingstone & Hubel, 1988). The model states that information about form, color, movement, and depth is in part processed separately in the visual cortex. Based on the physiological properties of visual cortical neurons, equation 3.17 yields a quantitative assessment of the specificity of their encoding with respect to the abovementioned properties; our method thus provides a test criterion for the validity of the pathway model.
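As a sketch of how equation 3.17 could be used in practice, suppose tuning widths have been measured along $d$ stimulus dimensions (the numbers below are hypothetical). The relative share of the summed minimal error attributable to each dimension then follows directly:

```python
import numpy as np

def relative_error_share(tuning_widths):
    # Eq. 3.17: <eps^2_{i,min}> / sum_j <eps^2_{j,min}> = sigma_i^2 / sum_j sigma_j^2,
    # independent of the unknown constant X and of any hidden dimensions.
    s2 = np.asarray(tuning_widths, dtype=float) ** 2
    return s2 / s2.sum()

# Hypothetical tuning widths measured along three stimulus dimensions:
shares = relative_error_share([0.2, 1.0, 1.0])
print(shares)   # the first dimension carries the smallest error share,
                # i.e., it is encoded most accurately
```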
4 Conclusion
We have calculated the Fisher information for a population of stochastically spiking neurons encoding a multitude of stimulus features. If a single feature is to be encoded accurately, narrow tuning for this dimension and broad tuning in all other dimensions is found to be the optimal encoding strategy. The narrow tuning increases the information content of individual neurons for the feature to be encoded, while the broad tuning in the other dimensions increases the population of neurons that actively take part in the representation. However, extremely narrow tuning functions impair performance because the tuning functions must always have sufficient overlap. For an accurate representation of a single feature, there is an optimal tuning width, no matter how many stimulus dimensions are encoded in addition. Our results are suitable for application to sensory or motor systems. Equation 3.17 provides a criterion that can be used to derive quantitative statements on the functional significance of neuron classes based on the measurement of tuning curves.
Acknowledgments
We thank Klaus Pawelzik, Matthias Bethge, and Helmut Schwegler for fruitful discussions. This work was supported by Deutsche Forschungsgemeinschaft, SFB 517.
References

Baldi, P., & Heiligenberg, W. (1988). How sensory maps could enhance resolution through ordered arrangements of broadly tuned receivers. Biological Cybernetics, 59, 313–318.

Barlow, H. B. (1972). Single units and sensation: A neuron doctrine for perceptual psychology? Perception, 1, 371–394.

Brunel, N., & Nadal, J.-P. (1998). Mutual information, Fisher information, and population coding. Neural Computation, 10, 1731–1757.

Deco, G., & Obradovic, D. (1997). An information-theoretic approach to neural computing (2nd ed.). New York: Springer-Verlag.

Eurich, C. W., & Schwegler, H. (1997). Coarse coding: Calculation of the resolution achieved by a population of large receptive field neurons. Biological Cybernetics, 76, 357–363.

Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.

Hinton, G. E., McClelland, J. L., & Rumelhart, D. E. (1986). Distributed representations. In D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 77–109). Cambridge, MA: MIT Press.

Lettvin, J. Y., Maturana, H. R., McCulloch, W. S., & Pitts, W. H. (1959). What the frog's eye tells the frog's brain. Proceedings of the Institute of Radio Engineers (New York), 47, 1940–1951.

Livingstone, M. S., & Hubel, D. H. (1988). Segregation of form, color, movement, and depth: Anatomy, physiology, and perception. Science, 240, 740–749.

Salinas, E., & Abbott, L. F. (1994). Vector reconstruction from firing rates. Journal of Computational Neuroscience, 1, 89–107.

Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences USA, 90, 10749–10753.

Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8, 511–539.

Snippe, H. P., & Koenderink, J. J. (1992). Discrimination thresholds for channel-coded systems. Biological Cybernetics, 66, 543–551.

Zhang, K., Ginzburg, I., McNaughton, B. L., & Sejnowski, T. J. (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. Journal of Neurophysiology, 79, 1017–1044.

Zhang, K., & Sejnowski, T. J. (1999). Neuronal tuning: To sharpen or broaden? Neural Computation, 11, 75–84.

Received April 13, 1999; accepted August 13, 1999.
LETTER
Communicated by Laurence Abbott
Synergy in a Neural Code

Naama Brenner
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Steven P. Strong
Institute for Advanced Study, Princeton, NJ 08544, and NEC Research Institute, Princeton, NJ 08540, U.S.A.

Roland Koberle
Instituto de Física de São Carlos, Universidade de São Paulo, 13560-970 São Carlos, SP, Brazil, and NEC Research Institute, Princeton, NJ 08540, U.S.A.

William Bialek
Rob R. de Ruyter van Steveninck
NEC Research Institute, Princeton, NJ 08540, U.S.A.

We show that the information carried by compound events in neural spike trains (patterns of spikes across time or across a population of cells) can be measured, independent of assumptions about what these patterns might represent. By comparing the information carried by a compound pattern with the information carried independently by its parts, we directly measure the synergy among these parts. We illustrate the use of these methods by applying them to experiments on the motion-sensitive neuron H1 of the fly's visual system, where we confirm that two spikes close together in time carry far more than twice the information carried by a single spike. We analyze the sources of this synergy and provide evidence that pairs of spikes close together in time may be especially important patterns in the code of H1.

1 Introduction
Neural Computation 12, 1531–1552 (2000) © 2000 Massachusetts Institute of Technology

Throughout the nervous system, information is encoded in sequences of identical action potentials or spikes. The representation of sense data by these spike trains has been studied for 70 years (Adrian, 1928), but there remain many open questions about the structure of this code. A full understanding of the code requires that we identify its elementary symbols and characterize the messages that these symbols represent. Many different possible elementary symbols have been considered, implicitly or explicitly, in previous work on neural coding. These include the numbers of spikes in time windows of fixed size and individual spikes themselves. In cells that
produce bursts of action potentials, these bursts might be special symbols that convey information in addition to that carried by single spikes. Yet another possibility is that patterns of spikes, across time in one cell or across a population of cells, can have a special significance, a possibility that has received renewed attention as techniques emerge for recording the activity of many neurons simultaneously. In many methods of analysis, questions about the symbolic structure of the code are mixed with questions about what the symbols represent. Thus, in trying to characterize the feature selectivity of neurons, one often makes the a priori choice to measure the neural response as the spike count or rate in a fixed window. Conversely, in trying to assess the significance of synchronous spikes from pairs of neurons, or bursts of spikes from a single neuron, one might search for a correlation between these events and some particular stimulus features. In each case, conclusions about one aspect of the code are limited by assumptions about another aspect. Here we show that questions about the symbolic structure of the neural code can be separated out and answered in an information-theoretic framework, using data from suitably designed experiments. This framework allows us to address directly the significance of spike patterns or other compound spiking events. How much information is carried by a compound event? Is there redundancy or synergy among the individual spikes? Are particular patterns of spikes especially informative? Methods to assess the significance of spike patterns in the neural code share a common intuitive basis: Patterns of spikes can play a role in representing stimuli if and only if the occurrence of patterns is linked to stimulus variations. The patterns have a special role only if this correlation between sensory signals and patterns is not decomposable into separate correlations between the signals and the pieces of the pattern, such as the individual spikes. We believe that these statements are not controversial. Difficulties arise when we try to quantify this intuitive picture: What is the correct measure of correlation? How much correlation is significant? Can we make statements independent of models and elaborate null hypotheses? The central claim of this article is that many of these difficulties can be resolved using ideas from information theory. Shannon (1948) proved that entropy and information provide the only measures of variability and correlation that are consistent with simple and plausible requirements. Further, while it may be unclear how to interpret, for example, a 20% increase in correlation between spike trains, an extra bit of information carried by patterns of spikes means precisely that these patterns provide a factor-of-two increase in the ability of the system to distinguish among different sensory inputs. In this work, we show that there is a direct method
of measuring the information (in bits) carried by particular patterns of spikes under given stimulus conditions, independent of models for the stimulus features that these patterns might represent. In particular, we can compare the information conveyed by spike patterns with the information conveyed by the individual spikes that make up the pattern and determine quantitatively whether the whole is more or less than the sum of its parts. While this method allows us to compute unambiguously how much information is conveyed by the patterns, it does not tell us what particular message these patterns convey. Making the distinction between two issues, the symbolic structure of the code and the transformation between inputs and outputs, we address only the first of these two. Constructing a quantitative measure for the significance of compound patterns is an essential first step in understanding anything beyond the single-spike approximation, and it is especially crucial when complex multi-neuron codes are considered. Such a quantitative measure will be useful for the next stage of modeling the encoding algorithm, in particular as a control for the validity of models. This is a subject of ongoing work and will not be discussed here.

2 Information Conveyed by Compound Patterns
In the framework of information theory (Shannon, 1948), signals are generated by a source with a fixed probability distribution and encoded into messages by a channel. The coding is in general probabilistic, and the joint distribution of signals and coded messages determines all quantities of interest; in particular, the information transmitted by the channel about the source is an average over this joint distribution. In studying a sensory system, the signals generated by the source are the stimuli presented to the animal, and the messages in the communication channel are sequences of spikes in a neuron or a population of neurons. Both the stimuli and the spike trains are random variables, and they convey information mutually because they are correlated. The problem of quantifying this information has been discussed from several points of view (MacKay & McCulloch, 1952; Optican & Richmond, 1987; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Strong, Koberle, de Ruyter van Steveninck, & Bialek, 1998). Here we address the question of how much information is carried by particular "events" or combinations of action potentials.

2.1 Background. A discrete event E in the spike train is defined as a specific combination of spikes. Examples are a single spike, a pair of spikes separated by a given time, spikes from two neurons that occur in synchrony, and so on. Information is carried by the occurrence of events at particular times and not others, implying that they are correlated with some stimulus features and not with others. Our task is to express the information
conveyed, on average, by an event E in terms of quantities that are easily measured experimentally.¹ In experiments, as in nature, the animal is exposed to stimuli at each instant of time. We can describe this sensory input by a function $s(t')$, which may have many components to parameterize the time dependence of multiple stimulus features. In general, the information gained about $s(t')$ by observing a set of neural responses is

$$I = \sum_{\text{responses}} \int \mathcal{D}s(t')\, P[s(t')\ \&\ \text{response}] \log_2 \left( \frac{P[s(t')\ \&\ \text{response}]}{P[s(t')]\, P[\text{response}]} \right), \tag{2.1}$$

where information is measured in bits. It is useful to note that this mutual information can be rewritten in two complementary forms:

$$I = -\int \mathcal{D}s(t')\, P[s(t')] \log_2 P[s(t')] + \sum_{\text{responses}} P[\text{response}] \int \mathcal{D}s(t')\, P[s(t')|\text{response}] \log_2 P[s(t')|\text{response}] = S[P(s)] - \langle S[P(s|\text{response})] \rangle_{\text{response}}, \tag{2.2}$$

or

$$I = -\sum_{\text{responses}} P(\text{response}) \log_2 P(\text{response}) + \int \mathcal{D}s(t')\, P[s(t')] \sum_{\text{responses}} P[\text{response}|s(t')] \log_2 P[\text{response}|s(t')] = S[P(\text{response})] - \langle S[P(\text{response}|s)] \rangle_s, \tag{2.3}$$
where $S$ denotes the entropy of a distribution; by $\langle \cdots \rangle_s$ we mean an average over all possible values of the sensory stimulus, weighted by their probabilities of occurrence, and similarly for $\langle \cdots \rangle_{\text{responses}}$. In the first form, equation 2.2, we focus on what the responses are telling us about the sensory stimulus (de Ruyter van Steveninck & Bialek, 1988): different responses point more or less reliably to different signals, and our average uncertainty about the sensory signal is reduced by observing the neural response. In the second form, equation 2.3, we focus on the variability and reproducibility of the neural response (de Ruyter van Steveninck, Lewen, Strong, Koberle, & Bialek, 1997; Strong et al., 1998). The range of possible responses provides the system with a capacity to transmit information, and the variability that remains when the sensory stimulus is specified constitutes noise; the difference between the capacity and the noise is the information. We would like to apply this second form to the case where the neural response is a particular type of event. When we observe an event E, information is carried by the fact that it occurs at some particular time $t_E$. The range of possible responses is then the range of times $0 < t_E < T$ in our observation window. Alternatively, when we observe the response in a particular small time bin of size $\Delta t$, information is carried by the fact that the event E either occurs or does not. The range of possible responses then includes just two possibilities. Both of these points of view have an arbitrary element: the choice of bin size $\Delta t$ and the window size $T$. Characterizing the properties of the system, as opposed to our observation power, requires taking the limit of high time resolution ($\Delta t \to 0$) and long observation times ($T \to \infty$). As will be shown in the next section, in this limit the two points of view give the same answer for the information carried by an event.

¹ In our formulation, an event E is a random variable that can have several outcomes; for example, it can occur at different times in the experiment. The information conveyed by the event about the stimulus is an average over the joint distribution of stimuli and event outcomes. One can also associate an information measure with individual occurrences of events (DeWeese & Meister, 1998). The average of this measure over the possible outcomes is the mutual information, as defined by Shannon (1948) and used in this article.

2.2 Information and Event Rates. A crucial role is played by the event rate $r_E(t)$, the probability per unit time that an event of type E occurs at time $t$, given the stimulus history $s(t')$. Empirical construction of the event rate $r_E(t)$ requires repetition of the same stimulus history many times, so that a histogram can be formed (see Figure 1). For the case where events are single spikes, this is the familiar time-dependent firing rate or poststimulus time histogram (see Figure 1c); the generalization to other types of events is illustrated by Figures 1d and 1e.
Intuitively, a uniform event rate implies that no information is transmitted, whereas the presence of sharply defined features in the event rate implies that much information is transmitted by these events (see, for example, the discussion by Vaadia et al., 1995). We now formalize this intuition and show how the average information carried by a single event is related quantitatively to the time-dependent event rate. Let us take the first point of view about the neural response variable, in which the range of responses is described by the possible arrival times $t$ of the event.² What is the probability of finding the event at a particular time $t$? Before we know the stimulus history $s(t')$, all we can say is that the event can occur anywhere in our experimental window of size $T$, so that the probability is uniform, $P(\text{response}) = P(t) = 1/T$, with an entropy of $S[P(t)] = \log_2 T$. Once we know the stimulus, we also know the event rate $r_E(t)$, and so our uncertainty about the occurrence time of the event is reduced. Events will occur preferentially at times where the event rate is large, so the probability distribution should be proportional to $r_E(t)$; with proper
² We assume that the experiment is long enough so that a typical event is certain to occur at some time.
Figure 1: Generalized event rates in the stimulus-conditional response ensemble. A time-dependent visual stimulus is shown to the fly (a), with the time axis defined to be zero at the beginning of the stimulus. This stimulus runs for 10 s and is repeatedly presented 360 times. The responses of the H1 neuron to 60 repetitions are shown as a raster (b), in which each dot represents a single spike. From these responses, time-dependent event rates $r_E(t)$ are estimated: (c) the firing rate (poststimulus time histogram); (d) the rate for spike pairs with interspike time $\tau = 3 \pm 1$ ms; and (e) the rate for pairs with $\tau = 17 \pm 1$ ms. These rates allow us to compute directly the information transmitted by the events, using equation 2.5.
normalization, $P(\text{response}|s) = P(t|s) = r_E(t)/(T \bar{r}_E)$. Then the conditional entropy is

$$S[P(t|s)] = -\int_0^T dt\, P(t|s) \log_2 P(t|s) = -\frac{1}{T} \int_0^T dt\, \frac{r_E(t)}{\bar{r}_E} \log_2 \left( \frac{r_E(t)}{\bar{r}_E T} \right). \tag{2.4}$$
In principle, one should average this quantity over the distribution of stimuli $s(t')$; however, if the time $T$ is long enough and the stimulus history sufficiently rich, the ensemble average and the time average are equivalent. The validity of this assumption can be checked experimentally (see Figure 2c). The reduction in entropy is then the gain in information, so

$$I(E; s) = S[P(t)] - S[P(t|s)] = \frac{1}{T} \int_0^T dt\, \frac{r_E(t)}{\bar{r}_E} \log_2 \left( \frac{r_E(t)}{\bar{r}_E} \right), \tag{2.5}$$

or equivalently

$$I(E; s) = \left\langle \frac{r_E(t)}{\bar{r}_E} \log_2 \left( \frac{r_E(t)}{\bar{r}_E} \right) \right\rangle_s, \tag{2.6}$$
where the average is over all possible values of $s$, weighted by their probabilities $P(s)$. In the second view, the neural response is a binary random variable, $\sigma_E \in \{0, 1\}$, marking the occurrence or nonoccurrence of an event of type E in a small time bin of size $\Delta t$. Suppose, for simplicity, that the stimulus takes on a finite set of values $s$ with probabilities $P(s)$. These in turn induce the event E with probability $p_E(s) = P(\sigma_E = 1|s) = r_E(s) \Delta t$, with an average probability for the occurrence of the event $\bar{p}_E = \sum_s P(s)\, r_E(s) \Delta t = \bar{r}_E \Delta t$. The information is the difference between the prior entropy and the conditional entropy: $I(E; s) = S(s) - \langle S(s|\sigma_E) \rangle$, where the conditional entropy is an average over the two possible values of $\sigma_E$. The conditional probabilities are found from Bayes' rule,

$$P(s|\sigma_E = 1) = \frac{P(s)\, p_E(s)}{\bar{p}_E}, \qquad P(s|\sigma_E = 0) = \frac{P(s)\,(1 - p_E(s))}{1 - \bar{p}_E}, \tag{2.7}$$

and with these one finds the information,

$$I(E; s) = -\sum_s P(s) \log_2 P(s) + \sum_{\sigma_E = 0,1} P(\sigma_E) \sum_s P(s|\sigma_E) \log_2 P(s|\sigma_E) = \sum_s P(s) \left[ p_E(s) \log_2 \left( \frac{p_E(s)}{\bar{p}_E} \right) + (1 - p_E(s)) \log_2 \left( \frac{1 - p_E(s)}{1 - \bar{p}_E} \right) \right]. \tag{2.8}$$
Taking the limit $\Delta t \to 0$, consistent with the requirement that the event can occur at most once, one finds the average information conveyed in a small time bin; dividing by the average probability of an event, one obtains equation 2.6 as the information per event. Equation 2.6 and its time-averaged form, equation 2.5, are exact formulas that are valid in any situation where a rich stimulus history can be presented
repeatedly. They enable the evaluation of the information for arbitrarily complex events, independent of assumptions about the encoding algorithm. The information is an average over the joint distribution of stimuli and responses. But rather than constructing this joint distribution explicitly and then averaging, equation 2.5 expresses the information as a direct empirical average: by estimating the function $r_E(t)$ as a histogram, we are sampling the distribution of the responses given a stimulus, whereas by integrating over time, we are sampling the distribution of stimuli. This formula is general and can be applied to different systems under different experimental conditions. The numerical result (measured in bits) will, of course, depend on the neural system as well as on the properties of the stimulus distribution. The error bars on the measurement of the information are affected by the finiteness of the data; for example, sampling must be sufficient to construct the rates $r_E(t)$ reliably. These and other practical issues in using equation 2.5 are illustrated in detail in Figure 2.

2.3 The Special Case of Single Spikes. Let us consider in more detail the simple case where events are single spikes. The average information conveyed by a single spike becomes an integral over the time-dependent spike rate $r(t)$:

$$I(1\ \text{spike}; s) = \frac{1}{T} \int_0^T dt\, \frac{r(t)}{\bar{r}} \log_2 \left( \frac{r(t)}{\bar{r}} \right). \tag{2.9}$$
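Given a histogram-based estimate of the firing rate, equation 2.9 is a direct average over time bins. The sketch below applies it to two synthetic rate profiles (the waveforms are illustrative, not data from the experiment): a uniform rate carries no information, while modulation at the same mean rate does.

```python
import numpy as np

def info_per_spike(rate):
    # Eq. 2.9: I = (1/T) * int_0^T dt (r(t)/r_bar) log2(r(t)/r_bar),
    # with the time average taken as a mean over PSTH bins.
    x = rate / rate.mean()
    safe = np.where(x > 0, x, 1.0)      # treat 0 * log 0 as 0
    return np.mean(x * np.log2(safe))

t = np.arange(0.0, 10.0, 0.002)                 # 10 s at 2-ms resolution
flat = np.full_like(t, 37.0)                    # uniform rate, 37 spikes/s
peaked = 37.0 * (1.0 + np.cos(2 * np.pi * t))   # modulated, same mean rate

print(info_per_spike(flat))     # 0.0 bits: no information
print(info_per_spike(peaked))   # > 0 bits
```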
It makes sense that the information carried by single spikes should be related to the spike rate, since this rate as a function of time gives a complete description of the "one-body" statistics of the spike train, in the same way that the single-particle density describes the one-body statistics of a gas or liquid. This is true independent of any models or assumptions, and we suggest the use of equations 2.9 and 2.5 to test directly, in any given experimental situation, whether many-body effects increase or reduce information transmission. We note that considerable attention has been given to the problem of making accurate statistical models for spike trains. In particular, the simplest model, the modulated Poisson process, is often used as a standard against which real neurons are compared. It is tempting to think that Poisson behavior is equivalent to independent transmission of information by successive spikes, but this is not the case (see below). Of course, no real neuron is precisely Poisson, and it is not clear which deviations from Poisson behavior are significant. Rather than trying to fit statistical models to the data and then trying to understand the significance of spike train statistics for neural coding, we see that it is possible to quantify directly the information carried by single spikes and compound patterns. Thus, without reference to statistical models, we can determine whether successive spikes carry independent, redundant, or synergistic information. Several previous works have noted the relation between spike rate and information, and the formula in equation 2.9 has an interesting history. For a spike train generated by a modulated Poisson process, equation 2.9 provides an upper bound on information transmission (per spike) by the spike train as a whole (Bialek, 1990). Even in this simple case, spikes do not necessarily carry independent information: slow modulations in the firing rate can cause them to be redundant.

Figure 2: Facing page. Finite size effects in the estimation of the information conveyed by single spikes. (a) Information as a function of the bin size $\Delta t$ used for computing the time-dependent rate $r(t)$ from all 360 repetitions (circles) and from 100 of the repetitions (crosses). A linear extrapolation to the limit $\Delta t \to 0$ is shown for the case where all repetitions were used (solid line). (b) Information as a function of the inverse number of repetitions $N$, for a fixed bin size $\Delta t = 2$ ms. (c) Systematic errors due to the finite duration of the repeated stimulus $s(t)$. The full 10-second length of the stimulus was subdivided into segments of duration $T$. Using equation 2.9, the information was calculated for each segment and plotted as a function of $1/T$ (circles). If the stimulus is sufficiently long that an ensemble average is well approximated by a time average, the convergence of the information to a stable value as $T \to \infty$ should be observable. Further, the standard deviation of the information measured from different segments of length $T$ (error bars) should decrease as a square root law, $\sigma \propto 1/\sqrt{T}$ (dashed line). These data provide an empirical verification that, for the distribution used in this experiment, the repeated stimulus time is long enough to approximate ergodic sampling.
In studying the coding of location by cells in the rat hippocampus, Skaggs, McNaughton, Gothard, and Markus (1993) assumed that successive spikes carried independent information, and that the spike rate was determined by the instantaneous location. They obtained equation 2.9 with the time average replaced by an average over locations. DeWeese (1996) showed that the rate of information transmission by a spike train could be expanded in a series of integrals over correlation functions, where successive terms would be small if the number of spikes per correlation time were small; the leading term, which would be exact if spikes were uncorrelated, is equation 2.9. Panzeri, Biella, Rolls, Skaggs, and Treves (1996) show that equation 2.9, multiplied by the mean spike rate to give an information rate (bits per second), is the correct information rate if we count spikes in very brief segments of the neural response, which is equivalent to asking for the information carried by single spikes. For further discussion of the relation to previous work, see appendix A. A crucial point here is the generalization to equation 2.5, and this result applies to the information content of any point events in the neural response—pairs of spikes with a certain separation, coincident spikes from two cells, and so on—not just single spikes. In the analysis of experiments, we will emphasize the use of this formula as an exact result for the infor-
mation content of single events, rather than an approximate result for the spike train as a whole, and this approach will enable us to address questions concerning the structure of the code and the role played by various types of events.

3 Experiments in the Fly Visual System
In this section, we use our formalism to analyze experiments on the movement-sensitive cell H1 in the visual system of the blowfly Calliphora vicina. We address the issue of the information conveyed by pairs of spikes in this neuron, as compared to the information conveyed independently by single spikes. The quantitative results of this section (information content in bits, effects of synergy, and redundancy among spikes) are specific to this system and to the stimulus conditions used. The theoretical approach, however, is valid generally and can be applied similarly to other experimental systems to find out the significance of various patterns in single cells or in a population.

3.1 Synergy Between Spikes. In the experiment (see appendix B for details), the horizontal motion across the visual field is the input sensory stimulus s(t), drawn from a probability distribution P[s(t)]. The spike train recorded from H1 is the neural response. Figure 1a shows a segment of the stimulus presented to the fly, and Figure 1b illustrates the response to many repeated presentations of this segment. The histogram of spike times across the ensemble of repetitions provides an estimate of the spike rate r(t) (see Figure 1c), and equation 2.5 gives the information carried by a single spike, I(1 spike; s) = 1.53 ± 0.05 bits. Figure 2 illustrates the details of how the formula was used, with an emphasis on the effects of the finiteness of the data. In this experiment, a stimulus of length T = 10 s was repeated 360 times. As seen from Figure 2, the calculation converges to a stable result within our finite data set. If each spike were to convey information independently, then with the mean spike rate r̄ = 37 spikes per second, the total information rate would be R_info = 56 bits per second.
We used the variability and reproducibility of continuous segments in the neural response (de Ruyter van Steveninck et al., 1997; Strong et al., 1998) to estimate the total information rate in the spike train in this experiment and found that R_info = 75 bits per second. Thus, the information conveyed by the spike train as a whole is larger than the sum of contributions from individual spikes, indicating cooperative information transmission by patterns of spikes in time. This synergy among spikes motivates the search for especially informative patterns in the spike train. We consider compound events that consist of two spikes separated by a time τ, with no constraints on what happens between them. Figure 1 shows segments of the event rate r_τ(t) for τ = 3 (±1) ms (see Figure 1d),
Figure 3: Information about the signal transmitted by pairs of spikes, computed from equation 2.5, as a function of the time separation between the two spikes. The dotted line shows the information that would be transmitted by the two spikes independently (twice the single-spike information).
and for τ = 17 (±1) ms (see Figure 1e). The information carried by spike pairs as a function of the interspike time τ, computed from equation 2.5, is shown in Figure 3. For large τ, spikes contribute independent information, as expected. This independence is established within approximately 30 to 40 ms, comparable to the behavioral response times of the fly (Land & Collett, 1974). There is a mild redundancy (about 10–20%) at intermediate separations and a very large synergy (up to about 130%) at small τ. Related results were obtained using the correlation of spike patterns with stimulus features (de Ruyter van Steveninck & Bialek, 1988). There the information carried by spike patterns was estimated from the distribution of stimuli given each pattern, thus constructing a statistical model of what the patterns “stand for” (see the details in appendix A). Since the time-dependent stimulus is in general of high dimensionality, its distribution cannot be sampled directly, and some approximations must be made. de Ruyter van Steveninck and Bialek (1988) made the approximation that patterns of a few spikes encode projections of the stimulus onto low-dimensional subspaces, and further that the conditional distributions P[s(t)|E] are Gaussian. The information obtained with these approximations provides a lower bound to the true information carried by the patterns, as estimated directly with the methods presented here.
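A minimal sketch of how such a pair analysis can be implemented: compound events are extracted as pairs of spikes separated by τ (within a tolerance), and the same information functional, equation 2.5, is applied to the resulting event times. The function names, tolerance, and binning below are our own assumptions, not the code used for Figure 3:

```python
import numpy as np

def event_times_pairs(spikes, tau, dtau):
    """Times of compound events: a spike at time t preceded by another
    spike at t - tau (within +/- dtau/2), with no constraint on what
    happens between them. The event is labeled by the later spike."""
    s = np.asarray(spikes)
    d = s[:, None] - s[None, :]                  # all pairwise separations
    i, j = np.where(np.abs(d - tau) <= dtau / 2)
    return s[i]

def event_info(trials, T, dt):
    """Eq. 2.5: I(E; s) = (1/T) * integral of (r_E/rbar_E) log2(r_E/rbar_E) dt."""
    edges = np.arange(0.0, T + dt / 2, dt)
    counts = sum(np.histogram(t, bins=edges)[0] for t in trials)
    rE = counts / (len(trials) * dt)             # event rate r_E(t)
    x = rE / rE.mean()
    return np.where(x > 0, x * np.log2(np.where(x > 0, x, 1.0)), 0.0).mean()

# Hypothetical usage on a list of spike-time arrays `trials`:
# pair_trials = [event_times_pairs(t, tau=0.003, dtau=0.002) for t in trials]
# I_pair = event_info(pair_trials, T=10.0, dt=0.002)
```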
3.2 Origins of Synergy. Synergy means, quite literally, that two spikes together tell us more than two spikes separately. Synergistic coding is often discussed for populations of cells, where extra information is conveyed by patterns of coincident spikes from several neurons (Abeles, Bergmann, Margalit, & Vaadia, 1993; Hopfield, 1995; Meister, 1995; Singer & Gray, 1995). Here we see direct evidence for extra information in pairs of spikes across time. The mathematical framework for describing these effects is the same, and a natural question is: What are the conditions for synergistic coding? The average synergy Syn[E1, E2; s] between two events E1 and E2 is the difference between the information about the stimulus s conveyed by the pair and the information conveyed by the two events independently:
Syn[E1, E2; s] = I[E1, E2; s] − (I[E1; s] + I[E2; s]).   (3.1)

We can rewrite the synergy as

Syn[E1, E2; s] = I[E1; E2 | s] − I[E1; E2].   (3.2)
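As a consistency check, the equivalence of equations 3.1 and 3.2 (a standard identity for interaction information) can be verified numerically on an arbitrary discrete joint distribution. The distribution below is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Joint distribution p(s, e1, e2) over a 4-valued stimulus and two
# binary events; the values are arbitrary, for illustration only.
p = rng.random((4, 2, 2))
p /= p.sum()

def H(q):
    """Entropy in bits of a probability array."""
    q = q[q > 0]
    return -(q * np.log2(q)).sum()

def MI(pxy):
    """Mutual information from a 2-D joint distribution."""
    return H(pxy.sum(1)) + H(pxy.sum(0)) - H(pxy.reshape(-1))

# Equation 3.1: Syn = I[E1,E2; s] - I[E1; s] - I[E2; s]
syn_31 = MI(p.reshape(4, 4)) - MI(p.sum(2)) - MI(p.sum(1))

# Equation 3.2: Syn = I[E1; E2 | s] - I[E1; E2]
I_cond = sum(p[s].sum() * MI(p[s] / p[s].sum()) for s in range(4))
syn_32 = I_cond - MI(p.sum(0))

print(np.allclose(syn_31, syn_32))   # True
```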
The first term is the mutual information between the events, computed across an ensemble of repeated presentations of the same stimulus history. It describes the gain in information due to the locking of the compound event (E1, E2) to particular stimulus features. If events E1 and E2 are correlated individually with the stimulus but not with one another, this term will be zero, and these events cannot be synergistic on average. The second term is the mutual information between events when the stimulus is not constrained or, equivalently, the predictability of event E2 from E1. This predictability limits the capacity of E2 to carry information beyond that already conveyed by E1. Synergistic coding (Syn > 0) thus requires that the mutual information among the spikes is increased by specifying the stimulus, which makes precise the intuitive idea of “stimulus-dependent correlations” (Abeles et al., 1993; Hopfield, 1995; Meister, 1995; Singer & Gray, 1995). Returning to our experimental example, we identify the events E1 and E2 as the arrivals of two spikes and consider the synergy as a function of the time τ between them. In terms of event rates, we compute the information carried by a pair of spikes separated by a time τ (equation 2.5), as well as the information carried by two individual spikes. The difference between these two quantities is the synergy between two spikes, which can be written as
Syn(τ) = −log₂( r̄_τ / r̄² )
  + (1/T) ∫₀ᵀ dt (r_τ(t)/r̄_τ) log₂[ r_τ(t) / ( r(t) r(t − τ) ) ]
  + (1/T) ∫₀ᵀ dt [ r_τ(t)/r̄_τ + r_τ(t + τ)/r̄_τ − 2 r(t)/r̄ ] log₂[ r(t)/r̄ ].   (3.3)

The first term in this equation is the logarithm of the normalized correlation function, and hence measures the rarity of spike pairs with separation τ;
the average of this term over τ is the mutual information between events (the second term in equation 3.2). The second term is related to the local correlation function and measures the extent to which the stimulus modulates the likelihood of spike pairs. The average of this term over τ gives the mutual information conditional on knowledge of the stimulus (the first term in equation 3.2). The average of the third term over τ is zero, and numerical evaluation of this term from the data shows that it is negligible at most values of τ. We thus find that the synergy between spikes is approximately a sum of two terms, whose averages over τ are the terms in equation 3.2. A spike pair with a separation τ then has two types of contributions to the extra information it carries: the two spikes can be correlated conditional on the stimulus, or the pair could be a rare and thus surprising event. The rarity of brief pairs is related to neural refractoriness, but this effect alone is insufficient to enhance information transmission; the rare events must also be related reliably to the stimulus. In fact, conditional on the stimulus, the spikes in rare pairs are strongly correlated with each other, and this is visible in Figure 1a: from trial to trial, adjacent spikes jitter together as if connected by a stiff spring. To quantify this effect, we find for each spike in one trial the closest spike in successive trials and measure the variance of the arrival times of these spikes. Similarly, we measure the variance of the interspike times. Figure 4a shows the ratio of the interspike time variance to the sum of the arrival time variances of the spikes that make up the pair. For large separations, this ratio is unity, as expected if spikes are locked independently to the stimulus, but as the two spikes come closer, it falls below one-quarter.
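The nearest-spike procedure just described can be sketched as follows. This is a toy implementation with synthetic jitter, not the analysis behind Figure 4a; it illustrates how the variance ratio falls well below 1 when the two spikes of a pair share their timing noise:

```python
import numpy as np

def closest_spikes(ref_time, trials):
    """For a reference spike time, the closest spike in each trial."""
    return np.array([np.asarray(t)[np.argmin(np.abs(np.asarray(t) - ref_time))]
                     for t in trials])

def jitter_ratio(t1, t2, trials):
    """Var(interspike time) / (Var(first spike) + Var(second spike)).
    Near 1 if the two spikes jitter independently across trials;
    well below 1 if they move together, like a stiff spring."""
    a = closest_spikes(t1, trials)
    b = closest_spikes(t2, trials)
    return (b - a).var() / (a.var() + b.var())

rng = np.random.default_rng(1)

# Shared timing noise shifts both spikes of the pair together: ratio ~ 0.
common = rng.normal(0.0, 1e-4, 200)
locked = [[0.050 + c, 0.053 + c] for c in common]
print(jitter_ratio(0.050, 0.053, locked) < 0.25)   # True

# Independent jitter on each spike: ratio ~ 1.
ja, jb = rng.normal(0.0, 1e-4, (2, 200))
indep = [[0.050 + x, 0.053 + y] for x, y in zip(ja, jb)]
print(round(jitter_ratio(0.050, 0.053, indep), 1))
```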
Both the conditional correlation among the members of the pair (see Figure 4a) and the relative synergy (see Figure 4b) depend strongly on the interspike separation. This dependence is nearly invariant to changes in image contrast, although the spike rate and other statistical properties are strongly affected by such changes. Brief spike pairs seem to retain their identity as specially informative symbols over a range of input ensembles. If particular temporal patterns are especially informative, then we would lose information if we failed to distinguish among different patterns. Thus, there are two notions of time resolution for spike pairs: the time resolution with which the interspike time is defined and the absolute time resolution with which the event is marked. Figure 5 shows that for small interspike times, the information is much more sensitive to changes in the interspike time resolution (open symbols) than to the absolute time resolution (filled symbols). This is related to the slope in Figure 2: in regions where the slope is large, events should be finely distinguished in order to retain the information.

3.3 Implications of Synergy. The importance of spike timing in the neural code has been under debate for some time now. We believe that some issues in this debate can be clarified using a direct information-theoretic approach. Following MacKay and McCulloch (1952), we know that marking
spike arrival times with higher resolution provides an increased capacity for information transmission. The work of Strong et al. (1998) shows that for the fly’s H1 neuron, the increased capacity associated with spike timing indeed is used with nearly constant efficiency down to millisecond resolution. This efficiency can be the result of a tight locking of individual spikes to a rapidly varying stimulus, and it could also be the result of temporal patterns providing information beyond rapid rate modulations. The analysis given here shows that for H1, pairs of spikes can provide much more information than two individual spikes, and information transmission is much more sensitive to the relative timing of spikes than to their absolute timing. The synergy is an inherent property of particular spike pairs, as it persists when averaged over all occurrences of the pair in the experiment. One may ask about the implications of synergy for behavior. Flies can respond to changes in a motion stimulus over a time scale of approximately 30 to 40 ms (Land & Collett, 1974). We have found that the spike train of H1 as a whole conveyed about 20 bits per second more than the information conveyed independently by single spikes (see Section 3.1). This implies that over a behavioral response time, synergistic effects provide about 0.8 bit of additional information, equivalent to almost a factor of two in resolving power for distinguishing different trajectories of motion across the visual field.

Figure 4: (a) Ratio between the variance of the interspike time and the sum of the variances of the two spike times. Variances are measured across repeated presentations of the same stimulus, as explained in the text. This ratio is plotted as a function of the interspike time τ for two experiments with different image contrast. (b) Extra information conveyed cooperatively by pairs of spikes, expressed as a fraction of the information conveyed by the two spikes independently. While the single-spike information varies with contrast (1.5 bits per spike for C = 0.1 compared to 1.3 bits per spike for C = 1), the fractional synergy is almost contrast independent.

Figure 5: Information conveyed by spike pairs as a function of time resolution. An event (a pair of spikes) can be described by two times: the separation between the spikes (relative time) and the occurrence time of the event with respect to the stimulus (absolute time). The information carried by the pair depends on the time resolution in these two dimensions, both specified by the bin size Δt along the abscissa. Open symbols are measurements of the information for a fixed 2 ms resolution of absolute time and a variable resolution Δt of relative time. Closed symbols correspond to a fixed 2 ms resolution of relative time and a variable resolution Δt of absolute time. For short intervals, the sensitivity to coarsening of the relative time resolution is much greater than to coarsening of the absolute time resolution. In contrast, sensitivity to relative and absolute time resolutions is the same for the longer, nonsynergistic interspike separations.
4 Summary
Information theory allows us to measure synergy and redundancy among spikes independent of the rules for translating between spikes and stimuli. In particular, this approach tests directly the idea that patterns of spikes are special events in the code, carrying more information than expected by adding the contributions from individual spikes. It is of practical importance that the formulas rely on low-order statistical measures of the neural response and hence do not require enormous data sets to reach meaningful conclusions. The method is of general validity and is applicable to patterns of spikes across a population of neurons, as well as across time. Specific results (existence of synergy or redundancy, number of bits per spike, etc.) depend on the neural system as well as on the stimulus ensemble used in each experiment. In our experiments on the fly visual system, we found that an event composed of a pair of spikes can carry far more than the information carried independently by its parts. Two spikes that occur in rapid succession appear to be special symbols that have an integrity beyond the locking of individual spikes to the stimulus. This is analogous to the encoding of sounds in written English: each of the compound symbols th, sh, and ch stands for sounds that are not decomposable into sounds represented by each of the constituent letters. For spike pairs to act effectively as special symbols, mechanisms for “reading” them must exist at subsequent levels of processing. Synaptic transmission is sensitive to interspike times in the 2–20 ms range (Magleby, 1987), and it is natural to suggest that synaptic mechanisms on this timescale play a role in such reading. Recent work on the mammalian visual system (Usrey, Reppas, & Reid, 1998) provides direct evidence that pairs of spikes close together in time can be especially efficient in driving postsynaptic neurons.
Appendix A: Relation to Previous Work
Patterns of spikes and their relation to sensory stimuli have been quantified in the past using correlation functions. The event rates that we have defined here, which are connected directly to the information carried by patterns of spikes through equation 2.5, are just properly normalized correlation functions. The event rate for pairs of spikes from two separate neurons is related to the joint poststimulus time histogram (JPSTH) defined by Aertsen, Gerstein, Habib, and Palm (1989; see also Vaadia et al., 1995). Making this connection explicit is also an opportunity to see how the present formalism applies to events defined across two cells. Consider two cells, A and B, generating spikes at times {t_i^A} and {t_i^B}, respectively. It will be useful to think of the spike trains as sums of unit
impulses at the spike times,

ρ_A(t) = Σ_i δ(t − t_i^A),   (A.1)
ρ_B(t) = Σ_i δ(t − t_i^B).   (A.2)
Then the time-dependent spike rates for the two cells are

r_A(t) = ⟨ρ_A(t)⟩_trials,   (A.3)
r_B(t) = ⟨ρ_B(t)⟩_trials,   (A.4)
where ⟨· · ·⟩_trials denotes an average over multiple trials in which the same time-dependent stimulus s(t′) is presented or the same motor behavior is observed. These spike rates are the probabilities per unit time for the occurrence of a single spike in either cell A or cell B, also called the PSTH. We can define the probability per unit time for a spike in cell A to occur at time t and a spike in cell B to occur at time t′, and this will be the joint PSTH,

JPSTH_AB(t, t′) = ⟨ρ_A(t) ρ_B(t′)⟩_trials.   (A.5)
Alternatively, we can consider an event E defined by a spike in cell A at time t and a spike in cell B at time t − τ, with the relative time τ measured with a precision of Δt. Then the rate of these events is

r_E(t) = ∫_{−Δt/2}^{Δt/2} dt′ JPSTH_AB(t, t − τ + t′)   (A.6)
       ≈ Δt · JPSTH_AB(t, t − τ),   (A.7)
where the last approximation is valid if our time resolution is sufficiently high. Applying our general formula for the information carried by single events, equation 2.5, the information carried by pairs of spikes from the two cells can be written as an integral over diagonal “strips” of the JPSTH matrix,

I(E; s) = (1/T) ∫₀ᵀ dt [ JPSTH_AB(t, t − τ) / ⟨JPSTH_AB(t, t − τ)⟩_t ]
          × log₂[ JPSTH_AB(t, t − τ) / ⟨JPSTH_AB(t, t − τ)⟩_t ],   (A.8)
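Equations A.5 through A.8 can be sketched on a discrete time grid. The code below is illustrative only, with hypothetical trial data and our own function names, not the analysis used in the papers cited:

```python
import numpy as np

def jpsth(trials_a, trials_b, T, dt):
    """Equation A.5 on a grid: trial-averaged product of the
    single-trial rate estimates of cells A and B."""
    edges = np.arange(0.0, T + dt / 2, dt)
    J = np.zeros((len(edges) - 1, len(edges) - 1))
    for sa, sb in zip(trials_a, trials_b):
        ca = np.histogram(sa, bins=edges)[0] / dt   # rho_A on the grid
        cb = np.histogram(sb, bins=edges)[0] / dt   # rho_B on the grid
        J += np.outer(ca, cb)
    return J / len(trials_a)

def pair_info_at_lag(J, k):
    """Equation A.8: information in the compound event (spike in A at t,
    spike in B at t - tau), integrating the diagonal strip at lag k bins."""
    strip = np.diagonal(J, offset=-k)               # JPSTH_AB(t, t - tau)
    x = strip / strip.mean()
    return np.where(x > 0, x * np.log2(np.where(x > 0, x, 1.0)), 0.0).mean()

# Toy check: perfectly repeatable spikes at a fixed lag of 2 bins.
A = [[0.25]] * 2
B = [[0.05]] * 2
J = jpsth(A, B, T=1.0, dt=0.1)
print(pair_info_at_lag(J, k=2))   # 3.0  (= log2 of the 8 strip bins)
```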
where ⟨JPSTH_AB(t, t − τ)⟩_t is an average of the JPSTH over time; this average is equivalent to the standard correlation function between the two spike trains. The discussion by Vaadia et al. (1995) emphasizes that modulations of the JPSTH along the diagonal strips allow correlated firing events to convey information about sensory signals or behavioral states, and this information
is quantified by equation A.8. The information carried by the individual cells is related to the corresponding integrals over spike rates, equation 2.9. The difference between the information conveyed by the compound spiking events E and the information conveyed by spikes in the two cells independently is precisely the synergy between the two cells at the given time lag τ. For τ = 0, it is the synergy, or extra information, conveyed by synchronous firing of the two cells. We would like to connect this approach with previous work that focused on how events reduce our uncertainty about the stimulus (de Ruyter van Steveninck & Bialek, 1988). Before we observe the neural response, all we know is that stimuli are chosen from a distribution P[s(t′)]. When we observe an event E at time t_E, this should tell us something about the stimulus in the neighborhood of this time, and this knowledge is described by the conditional distribution P[s(t′)|t_E]. If we go back to the definition of the mutual information between responses and stimuli, we can write the average information conveyed by one event in terms of this conditional distribution,

I(E; s) = ∫ Ds(t′) ∫ dt_E P[s(t′), t_E] log₂( P[s(t′), t_E] / (P[s(t′)] P[t_E]) )
        = ∫ dt_E P[t_E] ∫ Ds(t′) P[s(t′)|t_E] log₂( P[s(t′)|t_E] / P[s(t′)] ).   (A.9)
If the system is stationary, then the coding should be invariant under time translations:

P[s(t′) | t_E] = P[s(t′ + Δt′) | t_E + Δt′].   (A.10)
This invariance means that the integral over stimuli in equation A.9 is independent of the event arrival time t_E, so we can simplify our expression for the information carried by a single event,

I(E; s) = ∫ dt_E P[t_E] ∫ Ds(t′) P[s(t′)|t_E] log₂( P[s(t′)|t_E] / P[s(t′)] )
        = ∫ Ds(t′) P[s(t′)|t_E] log₂( P[s(t′)|t_E] / P[s(t′)] ).   (A.11)
This formula was used by de Ruyter van Steveninck and Bialek (1988). To connect with our work here, we express the information in equation A.11 as an average over the stimulus,

I(E; s) = ⟨ ( P[s(t′)|t_E] / P[s(t′)] ) log₂( P[s(t′)|t_E] / P[s(t′)] ) ⟩_s.   (A.12)
Using Bayes’ rule,

P[s(t′)|t_E] / P[s(t′)] = P[t_E|s(t′)] / P[t_E] = r_E(t_E) / r̄_E,   (A.13)
where the last equality follows because the distributions of event arrival times are proportional to the event rates, as defined above. Substituting this back into equation A.11, one finds equation 2.6.

Appendix B: Experimental Setup
In the experiment, we used a female blowfly, which was a first-generation offspring of a wild fly caught outside. The fly was put inside a plastic tube and immobilized with wax, with the head protruding out. The proboscis was left free so that the fly could be fed regularly with some sugar water. A small hole was cut in the back of the head, close to the midline on the right side. Through this hole, a tungsten electrode was advanced into the lobula plate. This area, which is several layers back from the compound eye, includes a group of large motion-detector neurons with wide receptive fields and strong direction selectivity. We recorded spikes extracellularly from one of these, the contralateral H1 neuron (Franceschini, Riehle, & le Nestour, 1989; Hausen & Egelhaaf, 1989). The electrode was positioned such that spikes from H1 could be discriminated reliably and converted into pulses by a simple threshold discriminator. The pulses were fed into a CED 1401 interface (Cambridge Electronic Design), which digitized the spikes at 10 μs resolution. To keep exact synchrony over the duration of the experiment, the spike timing clock was derived from the same internal CED 1401 clock that defined the frame times of the visual stimulus. The stimulus was a rigidly moving bar pattern, displayed on a Tektronix 608 high-brightness display. The radiance at average intensity Ī was about 180 mW/(m² · sr), which amounts to about 5 × 10⁴ effectively transduced photons per photoreceptor per second (Dubs, Laughlin, & Srinivasan, 1984). The bars were oriented vertically, with intensities chosen at random to be Ī(1 ± C), where C is the contrast. The distance between the fly and the screen was adjusted so that the angular subtense of a bar equaled the horizontal interommatidial angle in the stimulated part of the compound eye. This setting was found by determining the eye’s spatial Nyquist frequency through the reverse reaction (Götz, 1964) in the response of the H1 neuron.
For this fly, the horizontal interommatidial angle was 1.45 degrees, and the distance to the screen 105 mm. The fly viewed the display through a round 80 mm diameter diaphragm, showing approximately 30 bars. From this we estimate the number of stimulated ommatidia in the eye’s hexagonal raster to be about 612. Frames of the stimulus pattern were refreshed every 2 ms, and with each new frame, the pattern was displayed at a new position. This resulted in an apparent horizontal motion of the bar pattern, which is suitable to excite the H1 neuron. The pattern position was defined by a pseudorandom sequence, simulating independent random displacements at each time step, uniformly distributed between −0.47 degrees and +0.47 degrees (equivalent to −0.32 to +0.32 omm, horizontal ommatidial spacings). This corresponds to a diffusion constant of 18.1 (°)² per second, or 8.6 omm² per second. The sequence of pseudorandom numbers contained a repeating part and a nonrepeating part, each 10 seconds long, with the same statistical parameters. Thus, in each 20 second cycle, the fly saw a 10 second movie that it had seen 20 seconds before, followed by a 10 second movie that was generated independently.

Acknowledgments
We thank G. Lewen and A. Schweitzer for their help with the experiments and N. Tishby for many helpful discussions. Work at the IAS was supported in part by DOE grant DE–FG02–90ER40542, and work at the IFSC was supported by the Brazilian agencies FAPESP and CNPq.

References

Abeles, M., Bergmann, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiol., 70, 1629–1638.
Adrian, E. (1928). The basis of sensation. London: Christophers.
Aertsen, A., Gerstein, G., Habib, M., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of “effective connectivity.” J. Neurophysiol., 61(5), 900–917.
Bialek, W. (1990). Theoretical physics meets experimental neurobiology. In E. Jen (Ed.), 1989 lectures in complex systems (pp. 513–595). Menlo Park, CA: Addison-Wesley.
de Ruyter van Steveninck, R., & Bialek, W. (1988). Real-time performance of a movement-sensitive neuron in the blowfly visual system: Coding and information transfer in short spike sequences. Proc. R. Soc. Lond. Ser. B, 234, 379.
de Ruyter van Steveninck, R. R., Lewen, G., Strong, S., Koberle, R., & Bialek, W. (1997). Reproducibility and variability in neural spike trains. Science, 275, 275.
DeWeese, M. (1996). Optimization principles for the neural code. Network, 7, 325–331.
DeWeese, M., & Meister, M. (1998). How to measure information gained from one symbol. APS Abstracts.
Dubs, A., Laughlin, S., & Srinivasan, M. (1984). Single photon signals in fly photoreceptors and first order interneurones at behavioural threshold. J. Physiol., 317, 317–334.
Franceschini, N., Riehle, A., & le Nestour, A. (1989). Directionally selective motion detection by insect neurons. In D. Stavenga & R. Hardie (Eds.), Facets of vision (p. 360). Berlin: Springer-Verlag.
Götz, K. (1964). Optomotorische Untersuchung des visuellen Systems einiger Augenmutanten der Fruchtfliege Drosophila. Kybernetik, 2, 77–92.
Hausen, K., & Egelhaaf, M. (1989). Neural mechanisms of visual course control in insects. In D. Stavenga & R. Hardie (Eds.), Facets of vision (p. 391). Berlin: Springer-Verlag.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Land, M. F., & Collett, T. S. (1974). Chasing behavior of houseflies (Fannia canicularis): A description and analysis. J. Comp. Physiol., 89, 331–357.
MacKay, D., & McCulloch, W. S. (1952). The limiting information capacity of a neuronal link. Bull. Math. Biophys., 14, 127–135.
Magleby, K. (1987). Short-term synaptic plasticity. In G. M. Edelman, V. E. Gall, & K. M. Cowan (Eds.), Synaptic function (pp. 21–56). New York: Wiley.
Meister, M. (1995). Multineuronal codes in retinal signaling. Proc. Nat. Acad. Sci. (USA), 93, 609–614.
Optican, L. M., & Richmond, B. J. (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III: Information theoretic analysis. J. Neurophys., 57, 162–178.
Panzeri, S., Biella, G., Rolls, E. T., Skaggs, W. E., & Treves, A. (1996). Speed, noise, information and the graded nature of neuronal responses. Network, 7, 365–370.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Sys. Tech. J., 27, 379–423, 623–656.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Ann. Rev. Neurosci., 18, 555–586.
Skaggs, W. E., McNaughton, B. L., Gothard, K. M., & Markus, E. J. (1993). An information-theoretic approach to deciphering the hippocampal code. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 1030–1037). San Mateo, CA: Morgan Kaufmann.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998). Entropy and information in neural spike trains.
Phys. Rev. Lett., 80, 197–200.
Usrey, W. M., Reppas, J. B., & Reid, R. C. (1998). Paired-spike interactions and synaptic efficiency of retinal inputs to the thalamus. Nature, 395, 384.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neural interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.

Received February 22, 1999; accepted May 20, 1999.
LETTER
Communicated by Alain Destexhe
Increased Synchrony with Increase of a Low-Threshold Calcium Conductance in a Model Thalamic Network: A Phase-Shift Mechanism

Elizabeth Thomas
Thierry Grisar
Université de Liège, Institut Léon Fredericq, Department of Neurochemistry, Liège, Belgium

A computer model of a thalamic network was used in order to examine the effects of an isolated augmentation in a low-threshold calcium current. Such an isolated augmentation has been observed in the reticular thalamic (RE) nucleus of the genetic absence epilepsy rat from Strasbourg (GAERS) model of absence epilepsy. An augmentation of the low-threshold calcium conductance in the RE neurons (g_Ts) of the model thalamic network was found to lead to an increase in the synchronized firing of the network. This supports the hypothesis that the isolated increase in g_Ts may be responsible for epileptic activity in the GAERS rat. The increase of g_Ts in the RE neurons led to a slight increase in the period of the isolated RE neuron firing. In contrast, the low-threshold spike of the RE neuron remained relatively unchanged by the increase of g_Ts. This suggests that the enhanced synchrony in the network was primarily due to a phase shift in the firing of the RE neurons with respect to the thalamocortical neurons. The ability of this phase-shift mechanism to lead to changes in synchrony was further examined using the model thalamic network. A similar increase in the period of RE neuron oscillations was obtained through an increase in the conductance of the calcium-mediated potassium channel. This change was once again found to increase synchronous firing in the network.

1 Introduction
Absence epilepsy is a generalized epilepsy characterized by the presence of bilateral and synchronous spike-and-wave discharges in the electroencephalogram of the patient. Studies of the disease have been greatly facilitated by the development of animal models of absence epilepsy. One such successful model has been the genetic absence epilepsy rat of Strasbourg (GAERS). Neurophysiological, behavioral, pharmacological, and genetic studies have demonstrated that GAERS is a good animal model for the study of human absence epilepsy (Marescaux, Vergnes, & Depaulis, 1992).

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1553–1571 (2000)
In vivo studies have revealed the presence of synchronized oscillations in the thalamus during paroxysmal activity in the cortex (Buszaki, 1991; Steriade & Contreras, 1995). Many attempts to understand epilepsy have therefore focused on the generation of synchronous, rhythmic activity in the thalamus (von Krosigk, Ball, & McCormick, 1993; Huguenard, Coulter, & Prince, 1990; Crunelli, Turner, Williams, Guyon, & Leresche, 1994; Steriade & Llinas, 1988). Investigations of the generation of rhythmic activity within the thalamus have revealed that the relay nuclei and the reticular thalamic nucleus are involved in the activity (Steriade, Jones, & Llinas, 1990; von Krosigk et al., 1993). Of particular importance in the generation of rhythmic activity in the thalamus is the presence of a low-threshold calcium current that allows the thalamic neurons to fire a low-threshold calcium spike (LTS). This spike is characterized by a slowly rising and falling triangle-like potential crowned by one to several fast spikes (Jahnsen & Llinas, 1984; Crunelli, Lightowler, & Pollard, 1989; Huguenard & McCormick, 1992).

The availability of the GAERS model of epilepsy allows for the comparison of neuronal membrane properties between the epileptic and nonepileptic rats. A study by Tsakiridou, Bertollini, de Curtis, Avanzini, & Pape (1995) found in comparing the voltage-dependent calcium channels of both rats that the amplitude of the low-threshold calcium current of the GAERS rat was higher. The whole-cell patch clamp analysis revealed that the amplitude of the low-threshold calcium current in the reticular thalamic (RE) neurons (ITs) of the GAERS rat was −198 ± 19 pA as opposed to −128 ± 14 pA for the nonepileptic rat. This augmentation was found to be isolated to particular areas of the thalamus. While the ITs current was found augmented in the RE nucleus, the low-threshold current in the thalamocortical (TC) neurons of the two rats was not significantly different.
No significant differences were found in the high-threshold Ca²⁺ current, IL, between the nonepileptic and GAERS rat. More detailed tests, conducted to understand the reason for the increased amplitude of ITs in the GAERS rat, revealed no significant differences in the activation and inactivation curves for ITs between the normal and epileptic rat. The kinetics of the current for the two rats was also similar. The augmentation of the ITs current therefore reflected an increase in the number of T-type Ca²⁺ channels or an increase in single-channel conductance.

The purpose of our project was to investigate the effects of an isolated increase of ITs with the use of a computer model of a thalamic network. This was done by examining the effects of increasing the maximum conductance of the low-threshold calcium current in the RE neurons (ḡTs) of the model. The transition we observed of increased synchronous firing with the increased ḡTs suggests that this may be responsible for the epileptic activity of the GAERS rat. The second part of the investigation focused on the mechanism by which the increase in ḡTs caused the transition in network activity. Much of this was done by examining the effects of an increase in ḡTs on isolated RE neurons.
The low-threshold spike of the RE neurons was found not to be significantly altered by the increase. Instead, the frequency of the RE neuron oscillations was found to decrease slightly with each increase in ḡTs. There was now a phase lag in the points at which the RE neurons fired an LTS. In other computational studies of thalamic networks, enhanced synchrony mediated by a decrease in network GABAA has been accompanied by changes in the LTS of the thalamic neurons and large changes in frequency (Golomb, Wang, & Rinzel, 1994; Destexhe, Bal, McCormick, & Sejnowski, 1996). In this study, however, the nature of the low-threshold spikes was relatively unchanged with the enhanced synchrony. The changes in frequency accompanying the enhanced synchrony were slight, but led to changes in the phase relationship between the network RE and TC neurons. This strongly suggests that the phase relationship between the RE and TC neurons is an important control parameter in the network synchrony. These studies may therefore contain some insights for understanding experimental investigations in which changes in network synchrony were accompanied by only small changes in the LTS and frequency of the constituent neurons (Huguenard & Prince, 1994).

The ability of this mechanism to bring about other enhancements in synchrony was tested. An increase in the calcium-mediated potassium conductance (ḡK[Ca]) also led to a similar increase in the period of RE neuron firing. An increase in synchronous firing was once again observed.

Various computer models of the thalamocortical anatomy as well as of low-frequency, synchronous oscillations by the system have already been constructed (Thomas, Patton, & Wyatt, 1991; Patton, Thomas, & Wyatt, 1992; Thomas & Wyatt, 1995; Thomas & Lytton, 1997; Lytton & Thomas, 1997). These models have been of varying degrees of complexity in anatomical structure and in physiological activity.
In particular, Lytton and Thomas (1997) review Hodgkin-Huxley models constructed by numerous investigators in the field. These models have succeeded in reproducing various experimental results, among them intrinsic oscillations of various frequencies in isolated thalamic neurons, low-frequency synchronous oscillations in the thalamic network, and enhanced synchrony with a decrease in network GABAA conductance. The success of these models in reproducing various experimental results encourages their use in order to gain further insight into the underlying ionic channel and single-cell activities that finally lead to bifurcations in network activity.

2 Methods

2.1 Model Construction. The simulated network consisted of 13 RE neurons and 39 TC neurons in a columnar arrangement. Each column consisted of 1 RE neuron reciprocally connected to 3 TC neurons. Each RE neuron also received input from the RE and TC neurons in the column immediately adjacent to it, and the TC neurons received input from the RE neurons of the adjacent column.
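As a concrete reading of the columnar wiring just described, the connectivity can be sketched as follows. This is not the authors' code: the names, the list representation, and the choice of the next column (col + 1) as the "immediately adjacent" one are illustrative assumptions.

```python
# Hypothetical sketch of the network topology of section 2.1: 13 columns,
# each with 1 RE neuron reciprocally connected to 3 TC neurons, plus input
# from the immediately adjacent column (assumed here to be col + 1).

N_COLS = 13      # one RE neuron per column
TC_PER_COL = 3   # three TC neurons per column

re_inputs = []   # (presynaptic, postsynaptic) pairs onto RE neurons
tc_inputs = []   # (presynaptic, postsynaptic) pairs onto TC neurons

for col in range(N_COLS):
    re = ("RE", col)
    own_tcs = [("TC", col, k) for k in range(TC_PER_COL)]
    # reciprocal connections within the column
    for tc in own_tcs:
        re_inputs.append((tc, re))   # TC -> RE (excitatory, AMPA)
        tc_inputs.append((re, tc))   # RE -> TC (inhibitory, GABA)
    # input from the adjacent column
    nb = col + 1
    if nb < N_COLS:
        re_inputs.append((("RE", nb), re))          # RE -> RE
        for k in range(TC_PER_COL):
            re_inputs.append((("TC", nb, k), re))   # TC -> RE
            tc_inputs.append((re, ("TC", nb, k)))   # RE -> TC

n_tc = N_COLS * TC_PER_COL
print(n_tc, "TC neurons,", N_COLS, "RE neurons")  # 39 TC neurons, 13 RE neurons
```

With these assumptions the network has the 39 TC and 13 RE neurons stated in the text; edge columns simply receive fewer inputs.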
Each neuron was represented by a single compartment with a 1000 μm² area. The physiology of the neurons was constructed using the Hodgkin-Huxley formalism. Most of the parameters and equations necessary to describe the ionic currents of the model were obtained from previous computer models of the thalamic network.

The RE neurons were constructed using the following intrinsic currents: a low-threshold calcium current ITs, the calcium-mediated potassium current IK[Ca], the leak current Il, the fast sodium current INa, and the fast potassium current IK. Inhibition for the RE neurons was mediated by the GABAA synapse acted on by the neurotransmitter γ-aminobutyric acid (GABA). The excitatory current IAMPA was due to the fast-acting α-amino-3-hydroxy-5-methyl-4-isoxazolepropionic acid (AMPA). The TC neurons were constructed using the low-threshold calcium current IT, the hyperpolarization-activated cationic current Ih, the fast sodium current INa, the fast potassium current IK, and a leak current Il. The synaptic currents for the TC neurons were the two inhibitory IGABAA and IGABAB currents. The internal calcium concentration of each neuron was also computed.

The equivalent circuit representation for the neuronal membrane comprises a capacitor and several resistors in parallel (Johnston & Wu, 1994). In accordance with this parallel conductance model for ionic currents in a membrane, the following equations describe the activity of the neurons:

Cm dVRE/dt = −ITs − IK[Ca] − INa − IK − Il − IGABAA − IAMPA   (2.1)
Cm dVTC/dt = −IT − Ih − INa − IK − Il − IGABAA − IGABAB.   (2.2)

VRE is the membrane potential of the RE neuron, and VTC is the membrane potential of the TC neuron; Cm is the capacitance of the neuron and had a value of 1 μF/cm². The value of each intrinsic current I was obtained from Ohm's law,

I = g · ḡ · (V − Eeq),   (2.3)
where g is the conductance of the channel, ḡ is the maximum conductance of the channel, and Eeq is the reversal potential of the current. The magnitude of the current depends on the state of the gates controlling the channels. The opening of these gates is a voltage- and time-dependent process. With the exception of the fast sodium, fast potassium, and calcium-mediated potassium currents, it was not necessary to obtain the forward and reverse rate constants for the gates. Instead, the state of the gate at time infinity, gate∞, and the gate time constant τgate were directly obtained from available experimental results. It was then possible to compute the state of the activation or inactivation gate via the following equation (Hodgkin & Huxley, 1952):

gate_t = gate∞ − (gate∞ − gate_{t−1}) exp[−Δt / τgate].   (2.4)
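The update rule of equation 2.4 is one exponential-Euler step: over a time step Δt, a gate relaxes exponentially toward its steady-state value. A minimal sketch, with hypothetical function and parameter names and an arbitrary time constant:

```python
import math

def gate_update(gate_prev, gate_inf, tau_gate, dt):
    """One step of equation 2.4:
    gate_t = gate_inf - (gate_inf - gate_{t-1}) * exp(-dt / tau_gate)."""
    return gate_inf - (gate_inf - gate_prev) * math.exp(-dt / tau_gate)

# With a fixed steady state the gate relaxes exponentially toward it:
g = 0.0
for _ in range(1000):
    g = gate_update(g, gate_inf=1.0, tau_gate=5.0, dt=0.1)  # times in ms
# after 100 ms (20 time constants) g is essentially 1
```

In the model, gate_inf and tau_gate would be the voltage-dependent (or calcium-dependent) functions given in the sections that follow, re-evaluated at every step.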
The gate controlling a channel could be an activation gate m or an inactivation gate h. The conductance of a channel g depended on the state of these gates,

g = m^N h,   (2.5)
where N is the number of activation gates. In the following sections we describe the intrinsic currents that were used to model the neurons. In some cases, the equations and parameters used to model the currents were the same for both the TC and the RE neurons. The neuron for which the equation applies is put in parentheses next to the name of the current.

2.2 Low-Threshold Calcium Current, IT (TC Neuron). The presence of a low-threshold Ca²⁺ current responsible for the LTS has been reported in various investigations of thalamocortical cells (Coulter, Huguenard, & Prince, 1989; Jahnsen & Llinas, 1984; Huguenard & McCormick, 1992). Equations for the low-threshold current that were used in this model were obtained from voltage clamp experiments conducted by Huguenard and McCormick (1992) on acutely dissociated rat thalamic relay neurons:
IT = ḡT m²h (V − ECa)   (2.6)
m∞(V) = 1.0 / {1.0 + exp[−(V + 57.0) / 6.2]}   (2.7)
τm(V) = 0.612 + 1.0 / {exp[−(V + 132.0) / 16.7] + exp[(V + 16.8) / 18.2]}   (2.8)
h∞(V) = 1.0 / {1.0 + exp[(V + 81) / 4.0]}   (2.9)
τh(V) = exp[(V + 467.0) / 66.6],   V < −80 mV   (2.10)
τh(V) = 28.0 + exp[−(V + 22.0) / 10.5],   V ≥ −80 mV.   (2.11)
The reversal potential for IT, ECa, was fixed at 120 mV, as had been done in the simulation by Golomb et al. (1994). The maximum IT conductance was ḡT = 1.75 mS/cm².

2.3 Low-Threshold Calcium Current, ITs (RE Neuron). The RE neurons have also been observed to undergo burst activity during rhythmic activity (Contreras, Curro Dossi, & Steriade, 1993; Avanzini et al., 1992). The low-threshold calcium current that mediates this burst has been described from voltage clamp experiments by Huguenard and Prince (1992). The equations that describe this current were obtained from Destexhe, Contreras, Sejnowski, and Steriade (1994). Unlike the work of Destexhe et al. (1994), the reversal potential of Ca²⁺ was not computed; it was fixed at 120 mV, as had been done in the thalamic model of Golomb et al. (1994):
ITs = ḡTs m²h (V − ECa)   (2.12)
m∞(V) = 1.0 / {1.0 + exp[−(V + 52.0) / 7.4]}   (2.13)
τm(V) = 0.44 + 0.15 / {exp[(V + 27) / 10] + exp[−(V + 102) / 15]}   (2.14)
h∞(V) = 1.0 / {1.0 + exp[(V + 80) / 5.0]}   (2.15)
τh(V) = 22.7 + 0.27 / {exp[(V + 48) / 4] + exp[−(V + 407) / 50]}.   (2.16)
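For illustration, the gating functions of equations 2.13 through 2.16 and the instantaneous current of equation 2.12 can be evaluated directly. This is a sketch, not the authors' code; the default ḡTs (1.75 mS/cm²) and ECa (120 mV) are the values quoted in the surrounding text.

```python
import math

# Steady-state gating curves and time constants for the RE-neuron
# low-threshold current I_Ts (equations 2.13-2.16).  V in mV, tau in ms.

def m_inf(V):
    return 1.0 / (1.0 + math.exp(-(V + 52.0) / 7.4))

def tau_m(V):
    return 0.44 + 0.15 / (math.exp((V + 27.0) / 10.0)
                          + math.exp(-(V + 102.0) / 15.0))

def h_inf(V):
    return 1.0 / (1.0 + math.exp((V + 80.0) / 5.0))

def tau_h(V):
    return 22.7 + 0.27 / (math.exp((V + 48.0) / 4.0)
                          + math.exp(-(V + 407.0) / 50.0))

def I_Ts(V, m, h, g_bar=1.75, E_Ca=120.0):
    # equation 2.12: I_Ts = g_bar * m^2 * h * (V - E_Ca)
    return g_bar * m * m * h * (V - E_Ca)

# h_inf is near 1 at -100 mV and near 0 at -60 mV: the current is
# inactivated at depolarized potentials and de-inactivated (made available
# again) by hyperpolarization, the property invoked in the discussion.
```

Inactivation is also much slower than activation (tau_h >> tau_m over the relevant voltage range), which is what lets a hyperpolarized neuron accumulate available T-current before firing an LTS.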
Unlike the case for the low-threshold current of the TC neurons, ḡTs was varied in order to simulate the increases in ITs found in the RE nucleus. In the simulations involving changes of IK[Ca], ḡTs was fixed at 1.75 mS/cm².

2.4 Hyperpolarization-Activated Current, Ih (TC Neuron). The critical role played by this current in the rhythmic activity of thalamic neurons has been observed by several investigators in the field. Soltesz et al. (1991) observed that the application of Cs⁺ could transform spindle oscillations in some thalamic neurons into pacemaker oscillations. In other thalamic neurons, they found that the same procedure eliminated the low-frequency pacemaker oscillations. McCormick and Pape (1990) found that the results of the application of Cs⁺ in thalamic slices were dose dependent. The application could lead to an increased number of spikes in a burst or to the elimination of the bursts and oscillations. The equations for this current were obtained from voltage clamp measurements made by Huguenard and McCormick (1992) on guinea pig thalamic slices:
Ih = ḡh m (V − Eh)   (2.17)
m∞(V) = 1.0 / {1.0 + exp[(V + 75) / 5.5]}   (2.18)
τm(V) = 1 / (1.25 {exp[−14.59 − 0.086V] + exp[−1.87 + 0.0701V]}).   (2.19)
The value of ḡh was 0.15 mS/cm².

2.5 Calcium-Mediated Potassium Current, IK[Ca] (RE Neurons). Equations for the Ca²⁺-mediated potassium channel came from Destexhe, Mainen, and Sejnowski (1994) and Wallenstein (1994):

IK[Ca] = ḡK[Ca] m² (V − EK)   (2.20)
m∞([Ca²⁺]) = a[Ca²⁺]^n / (a[Ca²⁺]^n + b)   (2.21)
τm([Ca²⁺]) = 1 / (a[Ca²⁺]^n + b).   (2.22)

The value for n was 2, a = 48 ms⁻¹ mM⁻², and b = 0.03 ms⁻¹. The value of ḡK[Ca], unless otherwise specified, was 10 mS/cm².

2.6 Leak Current (RE and TC Neuron). McCormick and Huguenard (1992) found the rhythmic activity of their model TC neurons to depend
critically on the leak currents. The following equation was used to model the leak current Il:

Il = gl (V − El).   (2.23)
The value of gl was 0.01 mS/cm² for the TC neurons and 0.05 mS/cm² for the RE neurons.

2.7 Fast Na⁺ and K⁺ Currents, INa and IK (RE and TC Neuron). The fast spikes riding on top of the depolarizing envelope during a low-threshold spike were found by Jahnsen and Llinas (1984) to be mediated by a fast Na⁺ current. Both a fast Na⁺ current and a delayed rectifier current were included in the model in order to reproduce this feature of the low-threshold spikes. The equations for the fast sodium current INa were obtained from a model by McCormick and Huguenard (1992). The equations for the delayed rectifier IK were obtained from a thalamic model by Wallenstein (1994). The maximum conductance of the fast sodium channel ḡNa was 20 mS/cm² for the RE neurons, and a value of 30 mS/cm² was used for the TC neurons. For ḡK, a value of 8 mS/cm² was used for both the RE and TC neurons.

2.8 Intrinsic Calcium Dynamics. The internal calcium dynamics and the values for internal calcium concentration [Ca²⁺] were obtained from a model of the TC neuron constructed by McCormick and Huguenard (1992). The value of [Ca²⁺] was calculated for a depth of 100 nm in a single-compartment model of area 1000 μm². The following discretized equation was used:
[Ca²⁺]_t = [Ca²⁺]_{t−1} + Δt {[−5.18 × 10⁻³ (IT or ITs)] / (area · depth) − b[Ca²⁺]_{t−1}}.   (2.24)

The constant −5.18 × 10⁻³ was used to convert current (in nanoamperes), time (in milliseconds), and volume (in cubic micrometers) to concentration of Ca²⁺ ions. The value of b used in the simulations was 1.

2.9 Synaptic Currents. The synaptic currents were modeled using a kinetic scheme of the binding of transmitter to postsynaptic receptor developed by Destexhe, Mainen, and Sejnowski (1994). According to this scheme, the fraction of postsynaptic receptors in the open state, r, can be obtained from the following equation:
dr/dt = a[T](1 − r) − br,   (2.25)
where [T] is the concentration of neurotransmitter in the synapse and a and b are the forward and backward binding rates, respectively. Both the GABAA and AMPA receptors were affected by neurotransmitters that were released in a pulse of 1 ms duration and 1 mM amplitude. For the GABAB receptor, the neurotransmitter was released in a pulse of 85 ms duration and 1.0 μM amplitude. The following parameters were used for the GABAA synapses: a = 0.53, b = 0.1, and the reversal potential EGABAA = −80 mV. For the GABAB synapses, a = 0.016, b = 0.0047, and EGABAB = −103 mV. The AMPA conductances were computed from the values a = 2.0, b = 0.3, and EAMPA = 0 mV. The following maximum conductances were used for the synapses on the RE neurons, ḡGABAA = 5 nS and ḡAMPA = 0.5 nS, and for the TC neurons, ḡGABAA = 5 nS and ḡGABAB = 0.01 nS.

2.10 Autocorrelation. The frequency of neuronal firing was computed with the use of an autocorrelation function. The membrane potential V of a neuron was measured at intervals of 0.1 ms. The autocorrelation function C(τ) was then calculated for each delay τ:

C(τ) = (1/Nτ) Σ_{n=1..nτ} V(n) V(n + τ),

where Nτ is the number of bins available for the particular delay τ that is being used. The value of τ that gave a maximum for C(τ) was then taken as the period of the neuronal oscillation.

2.11 Rastergrams. Rastergrams were used to display the synchronized activity of the network (see Figures 1a–d). Each black dot on the rastergram marks a point where a neuron had a membrane potential above 0 mV. Time is on the x-axis. On the y-axis, the TC neurons are numbered 0 to 38, and the RE neurons are numbered 39 to 51.

2.12 Counting the Number of Cycles of Synchronized Activity. The number of cycles of synchronized activity was computed by constructing autocorrelograms from the discrete activity represented in the rastergrams. Each value of C(τ) greater than 200 was counted as one cycle of synchronized activity. (Note that one action potential can be represented by more than one dot, since neuronal activity was sampled every 0.1 ms.) Using this method, the number of cycles of synchronized activity in Figure 1a was 3, while the number of cycles of synchronized activity in Figure 1d was 13.

This simulation was developed using Microsoft Visual C++. The equations were integrated using a variable time-step Runge-Kutta method (Press, Teukolsky, Vetterling, & Flannery, 1992). The simulations were run on a Silicon Graphics IRIX machine as well as a 233 MHz Pentium.
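The period estimate of section 2.10 can be sketched as follows, using a synthetic sinusoid in place of a simulated membrane potential; the signal, its period, and the scanned delay range are illustrative assumptions, not the authors' code.

```python
import math

# Autocorrelation-based period estimate (section 2.10) on a synthetic
# "membrane potential" sampled every 0.1 ms, as in the paper.

DT = 0.1          # sampling interval (ms)
PERIOD = 125.0    # ms, in the range reported for the RE oscillations
V = [math.sin(2.0 * math.pi * i * DT / PERIOD) for i in range(10000)]  # 1000 ms

def autocorr(V, lag):
    """C(tau) = (1 / N_tau) * sum_n V(n) V(n + lag), where N_tau is the
    number of usable bins at delay lag."""
    n_tau = len(V) - lag
    return sum(V[i] * V[i + lag] for i in range(n_tau)) / n_tau

# Scan delays from 50 ms to 200 ms (in 0.5 ms steps, for speed) and take
# the delay that maximizes C(tau) as the oscillation period.
lags = range(int(50.0 / DT), int(200.0 / DT), 5)
period_est = max(lags, key=lambda lag: autocorr(V, lag)) * DT
```

The cycle-counting procedure of section 2.12 is the same computation applied to the binary rastergram activity, with peaks of the autocorrelogram above a fixed threshold counted as cycles.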
Figure 1: Increase in network synchrony as the value of ḡTs is increased. The values of ḡTs for panels A–D are: (A) 1.7 mS/cm², (B) 1.9 mS/cm², (C) 2.1 mS/cm², (D) 2.3 mS/cm².
Figure 2: Number of cycles of synchronous activity in the network for each value of ḡTs. The number of cycles of synchronous activity was computed as described in section 2.
3 Results

3.1 Enhancement of Synchrony with Increased ḡTs. We investigated the effects of an isolated increase in the low-threshold current of the RE neurons by progressively increasing the maximum conductance of the low-threshold calcium channel in the RE neurons (ḡTs). The value of ḡTs was progressively increased from 1.7 to 2.7 mS/cm². The initial stimulus to the network in each case was a pulse in the RE layer of the model. Figure 1 demonstrates the progressive increase in the number of cycles of synchronized activity by the TC neurons. Each rastergram displays the activity of the neuronal populations for 2000 ms. In Figure 1a, at the ḡTs value of 1.7 mS/cm², the TC neurons fire approximately three cycles of synchronized activity. This number progressively increases until, at the ḡTs value of 2.3 mS/cm², the TC neurons fire in a synchronized manner for every cycle of oscillatory activity (see Figure 1d). The number of cycles of synchronized activity for each value of ḡTs can also be seen in Figure 2. The number of cycles of synchronized activity was computed as described in section 2. Although there was an overall increase in the cycles of synchronized activity as ḡTs was increased, some increases of ḡTs were accompanied by a slight decrease in network synchrony.

In Figure 1a, the TC neurons fire approximately three cycles of synchronized activity and subsequently stop participating in the firing activity of the network. The TC neurons participate in the network firing in the time interval between 150 and 640 ms. An explanation for their failure to reach threshold following 640 ms may be found by examining the average period of the RE neuron population oscillations within this interval and comparing it to the times outside the interval. It was found that the average period of the RE neuron population firing was approximately 122 ms during the interval when the TC neurons were firing. Outside this interval, however, when the TC neuron oscillations were subthreshold, the RE neuron population was firing with an average period of approximately 67 ms. This reduced period in the oscillations of the RE neuron population probably led to the inhibition of the TC neurons before threshold was reached.

3.2 Single-Unit Changes in RE Neurons with Increase in ḡTs. We attempted to understand the mechanism by which the increase in ḡTs led to enhanced synchrony in the network. This was done by studying the effects of the increased ḡTs on the activity of the isolated RE neurons. Figure 3 displays the activity of one isolated RE neuron for each of the ḡTs values in Figure 1. The changes affecting this RE neuron are representative of the changes to the other RE neurons of the network. The nature of the LTS in the isolated RE neuron remained essentially unchanged as ḡTs was increased. The number of fast spikes riding on top of each depolarized hump remained one to two. Figure 3 displays the oscillatory activity of this RE neuron for the ḡTs values 1.7, 1.9, 2.1, and 2.3 mS/cm². A very small increase in the amplitude of the LTS was observed with the alterations in ḡTs. The maximum amplitude of the RE neuron LTS displayed in Figure 3a was 41.9 mV, and the maximum amplitude of the same RE neuron LTS in Figure 3d was 45.7 mV. More notable than the changes to the LTS were the changes in the period of the isolated RE neuron oscillations with each increase in ḡTs. The number of cycles in the RE oscillations of Figure 3a was 17, while in Figure 3d the number had decreased to 14.
The period of the isolated RE neuron oscillations had therefore increased with each increase in ḡTs. An autocorrelogram was constructed for the RE neuron oscillations of Figure 3 at each value of ḡTs. The peaks of the autocorrelograms are displayed in Figure 4. The period of the RE oscillations increased from a value of 128.7 ms at a ḡTs value of 1.7 mS/cm² to a period of 157.63 ms at a ḡTs value of 2.7 mS/cm².

3.3 Altered Phasic Relationships with Increased ḡTs. The change in the firing frequency of the isolated RE neurons also affected their phasic relationship with the TC neurons in the connected network. Figure 5 displays the change in the relationship of RE neuron to TC neuron firing as ḡTs is changed. It displays the changing phasic relationship between the RE neuron of Figure 3 and a connected TC neuron. The time at which the TC neuron reached threshold in the third cycle of oscillations (TC phase) is displayed along with the time at which the RE neuron reached threshold in the fourth cycle of oscillations (RE phase). At the ḡTs value of 1.7 mS/cm², the TC neuron reaches threshold for the third cycle of oscillations at 148.28 ms.
Figure 3: Activity of an isolated RE neuron at the ḡTs values (A) 1.7 mS/cm², (B) 1.9 mS/cm², (C) 2.1 mS/cm², (D) 2.3 mS/cm².
Figure 4: Change in the period of oscillations of an isolated RE neuron as ḡTs is increased. The period of the oscillations was computed as described in section 2.
Figure 5: Phase of TC and RE neuron firing. The TC neuron firing phase was measured at the point at which the TC neuron reached threshold in the third cycle of network activity (the TC neuron activity was subthreshold until then). The firing phase of a connected RE neuron was measured at the point where the RE neuron reached threshold in the fourth cycle of network activity. The phase difference is the difference in the timing of the two events described above.
The delay to this event had decreased to 143.66 ms at the ḡTs value of 2.7 mS/cm². In contrast to this small change in timing, the RE neuron reaches threshold for its fourth cycle of oscillations at 253.11 ms for the ḡTs value of 1.7 mS/cm², while for the ḡTs value of 2.7 mS/cm², threshold is reached at 332.53 ms. The difference in the effect of changes in ḡTs on RE neurons as opposed to TC neurons is best demonstrated by the difference in the slopes of the firing phase versus ḡTs curves for the two types of neurons. The slope of the RE phase versus ḡTs curve is 52.75 ms cm²/mS. On the other hand, the slope of the TC phase versus ḡTs curve is much lower: −4.49 ms cm²/mS. The increasing delay of RE neuron firing with the increase in ḡTs, along with the smaller changes in TC neuron firing, led to an increase in the phase lag of RE neuron firing with respect to TC neuron firing. Also displayed in Figure 5 is the increasing phase lag of RE neuron firing with respect to the TC neuron firing as ḡTs is increased (phase difference).

Figure 6: Number of cycles of synchronized activity (left) and the period of an isolated RE neuron oscillation (right) as the value of ḡK[Ca] is varied from 10 to 18 mS/cm².

3.4 Enhanced Synchrony by Increased ḡK[Ca] of RE Neurons. The ability of this increased phase lag in the firing of the RE neurons to increase the synchrony of the network was also tested with alterations to the calcium-mediated potassium channels of the RE neurons. The value of this conductance was increased from 10 to 18 mS/cm² in the RE neurons of the model. Figure 6 displays the gradual increase in the period of an isolated RE neuron in the model as well as the accompanying increase in synchrony of the network. The number of cycles of synchronized activity in the network increased from three to nine as the value of ḡK[Ca] was increased from 10 to 18 mS/cm². The frequency of the population oscillations lay between 6 and 7 Hz. Although there was an overall increase in the number of cycles of synchronized activity, there was in some instances a slight decrease as ḡK[Ca] was increased. At the ḡK[Ca] value of 15 mS/cm², the number of cycles of synchronized activity dropped to 5 from a former value of 7 at the ḡK[Ca] value of 14 mS/cm². The period of oscillations for an isolated RE neuron of the network was measured for each increment in ḡK[Ca]. The period of RE neuron oscillations increased from 128.7 ms to 145.09 ms as ḡK[Ca] increased from 10 to 18 mS/cm².

4 Discussion
We found in our model of low-frequency oscillations in the thalamic network that an isolated increase of the low-threshold calcium conductance in the RE neurons (ḡTs) was able to enhance synchrony in the network. The thalamus has been found to be an important area in the maintenance of synchronous, oscillatory firing during epilepsy. Electrophysiological recordings from the GAERS have demonstrated that the spike and wave discharges (SWD) in the cortex are accompanied by paroxysmal activity in the thalamus. In some cases, paroxysmal activity in the thalamus was found to precede that in the cortex (Marescaux et al., 1992). The enhanced synchrony found in the model thalamic network with increased ḡTs provides support for the hypothesis that the increased ITs observed in the GAERS may be responsible for the initiation of epileptic activity in the rat. This enhanced synchrony in the thalamus may then be responsible for increased feedback from the cortex and the further synchronization that leads to SWD.

The increased ḡTs was not found to cause any notable changes in the number of fast spikes riding on top of the low-threshold spike of the isolated RE neuron (see Figure 3). The relatively unaltered timing at which the TC neuron reached threshold in Figure 5 is further evidence of an almost unchanged inhibitory postsynaptic potential (IPSP) in the TC neuron, and therefore of the unvaried nature of the LTS in the model RE neuron as ḡTs is increased. The frequency of the RE neuron oscillations, however, undergoes a more notable decrease with the increase in ḡTs. The ability of such a mechanism to enhance synchrony is further demonstrated by Figure 6. The increased ḡK[Ca] causes a lowering of the firing frequency of the RE neurons and an accompanying increase in network synchrony. A possible explanation for the increased synchrony in the network with increasing ḡTs may lie in the increasing phase lag of the RE neuron firing (see Figure 5).
The slight delay in RE neuron firing ensures that a TC neuron is now able to fire. Much of the TC neuron activity that had previously been subthreshold is now able to reach threshold before inhibition by an RE neuron. The increase in the period of RE neuron oscillations does not increase synchrony in a monotonic fashion. Around the RE oscillation period of 140 ms, the increase in both ḡTs and ḡK[Ca] led to a slight decrease in the number of cycles of synchronized oscillations. This is evidence that other factors play a role in the dynamics of the system. One possibility is a
slight increase in the amplitude of the RE neuron LTS (even if the number of fast spikes on the LTS remains much the same) and a concurrent increase in the TC neuron IPSP. This may then lead to a delay in the TC neuron's firing that once again subjects it to inhibition by a preceding RE neuron LTS. The overall increase in the number of cycles of synchronous activity with the increase in the period of RE neuron oscillations, however, establishes that this period is an important control parameter that can be increased in isolation to enhance the synchrony of the thalamic network.

Another indication that the increasing period of the RE neurons is not the only factor determining the state of the system may come from comparing Figures 2 and 5. In Figure 2, an abrupt transition in the number of cycles of synchrony occurs at the ḡTs value of 2.3 mS/cm². In Figure 5, however, an abrupt transition in the phase difference between the RE and TC neurons occurs at the ḡTs value of 2.6 mS/cm². A more likely explanation for the difference in these transition points, however, is that the values in Figure 2 represent the state of the network over 2000 ms. Figure 5, on the other hand, examines the phase difference between the RE and TC neurons occurring within the first 350 ms of the simulation. (This had to be done because at some values of ḡTs, the TC neurons do not reach threshold for a very long time after this point.) An examination of Figure 1, and especially Figure 1a, shows that the early dynamics of the system (such as the firing interval between the RE neurons) are different from the later dynamics of the system.

We found in the model that the RE neuron oscillation frequency decreased with the increase in ḡTs. This may not seem plausible in light of the depolarizing nature of the ITs current. The ITs current, however, requires hyperpolarization for the deinactivation of the channel.
The higher depolarization achieved with a higher level of ḡTs may finally lead to a lower level of deinactivation of the current. A more likely explanation for the decrease in the frequency of the RE neurons may be an increase in the conductance of the calcium-mediated potassium channels (gK[Ca]) with the increase in ḡTs. The increase in ḡTs may lead to an increase in the internal calcium concentration, which may then lead to an increase in gK[Ca] and a concomitant increase in the afterhyperpolarization of the RE neurons. In a model of the TC neuron constructed by McCormick and Huguenard (1992), an increase of IT was found to increase the firing frequency of the model neurons. This was found to be the case in the TC neurons of our model as well. For the RE neurons of our model, the latency to the first LTS following the initial stimulus was decreased by the increase in ḡTs, but the overall frequency of the oscillations was decreased.

From experimental work, indications of the effects of changes in ḡTs could come from studies of the effects of ethosuximide on thalamic slices (Huguenard & Prince, 1994). Huguenard and Prince (1994) found that the addition of ethosuximide to the rat thalamic slice increased the latency to the crowning fast spikes of the LTS for the RE neurons. Leresche et al. (1998) found that the use of ethosuximide increased the latency of the first action potential in the burst of a TC neuron.
Thalamic Synchrony via Phase Shifts
1569
They also found, however, that in some cases, the frequency of tonic firing in the TC neurons was increased by the addition of ethosuximide. The studies by Leresche et al. (1998), however, have cast doubts on the question of which channel ethosuximide actually exerts its effects on. It may also appear from both of these studies that an increase in the delay to peak of the LTS in the thalamic neurons should cause desynchrony rather than synchrony, since ethosuximide is an antiepileptic. In our study, however, the increased delay in the LTS occurred only in an isolated manner (only in the RE neurons) and would therefore have a different network effect than a delay for all the thalamic neurons. The enhanced synchrony in the model with increased gNTs is compatible with the results of a previous model that displayed a lowered threshold for SWD with an augmentation of the same conductance (Destexhe, 1998). Direct comparisons between this work and the previous investigation are difficult, however. The model by Destexhe (1998) included a cortical network. It was unclear if the transition to SWD had taken place due to the effects of cortical feedback on the altered RE neurons, or if an enhanced synchrony could be seen in the isolated thalamic network. The results from the model predict that mechanisms that lead to a phase lag in the firing of the RE neurons with respect to the TC neurons would lead to enhanced synchrony. The capacity of this mechanism to lead to an augmentation in synchrony was further demonstrated by the increasing period in the oscillations of the RE neurons with increase in gK[Ca] and the concomitant enhancement of network synchrony. An increasing phase lag between the firing of the RE neurons and TC neurons can also be achieved by a phase advance of the TC neurons. The results of the model therefore predict that within certain limits, changes in the TC neuron that lead to a decrease in the delay to reach threshold would enhance synchrony in the network.
If the increasing phase lag between the firing of the TC and RE neurons leads to enhanced network synchrony, then drugs that cause the opposite effect, a decrease in the phase lag between RE and TC neuron oscillations, have the potential to bring about antiepileptic effects.

Acknowledgments
This work was partly supported by a grant from the National Research Foundation (FNRS) of Belgium.

References

Avanzini, G., de Curtis, M., Marescaux, C., Panzica, F., Spreafico, R., & Vergnes, M. (1992). Role of the thalamic reticular nucleus in the generation of rhythmic thalamo-cortical activities subserving spike and waves. J. Neural Transm., 35, 85–95.
1570
Elizabeth Thomas and Thierry Grisar
Buszaki, G. (1991). The thalamic clock: Emergent network properties. Neuroscience, 41, 351–364.
Contreras, D., Curro Dossi, R., & Steriade, M. (1993). Electrophysiological properties of cat reticular thalamic neurones in vivo. J. Physiol. (Lond.), 470, 273–294.
Coulter, D. A., Huguenard, J. R., & Prince, D. A. (1989). Calcium currents in rat thalamocortical relay neurons: Kinetic properties of the transient, low-threshold current. J. Physiol. (Lond.), 414, 587–604.
Crunelli, V., Lightowler, S., & Pollard, C. E. (1989). A type of T-type Ca2+ current underlies low-threshold Ca2+ potentials in cells of cat and rat thalamic geniculate nucleus. J. Physiol. (Lond.), 413, 543–561.
Crunelli, V., Turner, J. P., Williams, S. R., Guyon, A., & Leresche, N. (1994). Potential role of intrinsic and GABAB IPSP-mediated oscillations of thalamic neurons in absence epilepsy. In A. Malafosse, P. Genton, E. Hirsch, C. Marescaux, D. Broglin, & R. Bernasconi (Eds.), Idiopathic generalized epilepsies: Clinical, experimental and genetic aspects (pp. 201–213). London: John Libbey & Company.
Destexhe, A. (1998). Spike-and-wave oscillations based on the properties of GABAB receptors. J. Neurosci., 18, 9099–9111.
Destexhe, A., Bal, T., McCormick, D. A., & Sejnowski, T. J. (1996). Ionic mechanisms underlying synchronized oscillations and propagating waves in a model of ferret thalamic slices. J. Neurophysiol., 76, 2049–2070.
Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1994). A model of spindle rhythmicity in the isolated thalamic reticular nucleus. J. Neurophysiol., 72, 803–818.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). An efficient method for computing synaptic conductances based on a kinetic model of receptor binding. Neural Computation, 6, 1–4.
Golomb, D., Wang, X. J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticularis nucleus model. J. Neurophysiol., 72, 1109–1126.
Hodgkin, A. L., & Huxley, A. F. (1952).
A quantitative description of membrane current and its application to conduction and excitation in nerve. J. Physiol. (Lond.), 117, 500–544.
Huguenard, J. R., Coulter, D. A., & Prince, D. A. (1990). Physiology of thalamic relay neurons: Properties of calcium currents involved in burst-firing. In M. Avoli, P. Gloor, G. Kostopoulos, & R. Naquet (Eds.), Generalized epilepsy: Neurobiological approaches (pp. 181–189). Boston: Birkhauser.
Huguenard, J. R., & McCormick, D. A. (1992). Simulation of the currents involved in rhythmic oscillations in thalamic relay neurons. J. Neurophysiol., 68, 1373–1383.
Huguenard, J. R., & Prince, D. A. (1992). A novel T-type current underlies prolonged Ca2+-dependent burst firing in GABAergic neurons of rat thalamic reticular neurons. J. Neurosci., 12, 3804–3817.
Huguenard, J. R., & Prince, D. A. (1994). Nominal T-current modulation causes robust antioscillatory effects. J. Neurosci., 14, 5485–5502.
Jahnsen, H., & Llinas, R. (1984). Ionic basis for the electroresponsiveness and
oscillatory properties of guinea-pig thalamic neurons in vitro. J. Physiol. (Lond.), 349, 227–247.
Johnston, D., & Wu, S. (1994). Foundations of cellular neurophysiology. Cambridge, MA: The MIT Press.
Leresche, N., Parri, H. R., Erdemli, G., Guyon, A., Turner, J. P., Williams, S. R., Asprodini, E., & Crunelli, V. (1998). On the action of the anti-absence drug ethosuximide in the rat and cat thalamus. J. Neurosci., 18, 4842–4853.
Lytton, W., & Thomas, E. (1997). Modeling thalamocortical oscillations. In P. S. Ulinski & E. G. Jones (Eds.), Cerebral cortex. New York: Plenum Press.
Marescaux, C., Vergnes, M., & Depaulis, A. (1992). Genetic absence epilepsy rat from Strasbourg—A review. J. Neural Transm. [Suppl.], 35, 37–69.
McCormick, D. A., & Huguenard, J. R. (1992). A model of the electrophysiological properties of thalamocortical relay neurons. J. Neurophysiol., 68, 1384–1400.
McCormick, D. A., & Pape, H. C. (1990). Properties of a hyperpolarization-activated cation current and its role in rhythmic oscillation in thalamic relay neurones. J. Physiol., 431, 291–318.
Patton, P., Thomas, E., & Wyatt, R. (1992). A computational model of vertical signal propagation in the primary visual cortex. Biol. Cybern., 68, 43–52.
Press, W., Teukolsky, S., Vetterling, W., & Flannery, B. (1992). Numerical recipes in FORTRAN. Cambridge: Cambridge University Press.
Soltesz, I., Lightowler, S., Leresche, N., Jassik-Gerschenfeld, D., Pollard, C., & Crunelli, V. (1991). Two inward currents and the transformation of low-frequency oscillations of rat and cat thalamocortical cells. J. Physiol. (Lond.), 441, 175–197.
Steriade, M., & Contreras, D. (1995). Relations between cortical and thalamic cellular events during transition from sleep patterns to paroxysmal activity. J. Neurosci., 15, 623–642.
Steriade, M., Jones, E. G., & Llinas, R. R. (1990). Thalamic oscillations and signalling. New York: Wiley.
Steriade, M., & Llinas, R. R. (1988).
The functional states of the thalamus and the associated neuronal interplay. Physiological Reviews, 68, 649–742.
Thomas, E., & Lytton, W. (1997). Computer model of antiepileptic effects mediated by alterations in GABAA-mediated inhibition. NeuroReport, 9, 691–696.
Thomas, E., Patton, P., & Wyatt, R. (1991). A computational model of the vertical anatomical organization of primary visual cortex. Biol. Cybern., 65, 189–202.
Thomas, E., & Wyatt, R. (1995). A computational model of thalamocortical spindle oscillations. Mathematics and Computers in Simulation, 40, 35–69.
Tsakiridou, E., Bertollini, L., de Curtis, M., Avanzini, G., & Pape, H. C. (1995). Selective increase in T-type calcium conductance of reticular thalamic neurons in a rat model of absence epilepsy. J. Neurosci., 15(4), 3110–3117.
von Krosigk, M., Bal, T., & McCormick, D. A. (1993). Cellular mechanisms of a synchronized oscillation in the thalamus. Science, 261, 361–364.
Wallenstein, G. V. (1994). A model of the electrophysiological properties of nucleus reticularis thalami neurons. Biophysical J., 66, 978–988.
Received April 20, 1999; accepted August 26, 1999.
LETTER
Communicated by Roger Traub
Multispikes and Synchronization in a Large Neural Network with Temporal Delays

Jan Karbowski
Center for Biodynamics, Department of Mathematics, Boston University, Boston, MA 02215, U.S.A., and Center for Theoretical Physics, Polish Academy of Sciences, 02-668 Warsaw, Poland

Nancy Kopell
Center for Biodynamics, Department of Mathematics, Boston University, Boston, MA 02215, U.S.A.

Coherent rhythms in the gamma frequency range are ubiquitous in the nervous system and thought to be important in a variety of cognitive activities. Such rhythms are known to be able to synchronize with millisecond precision across distances with significant conduction delay; it is mysterious how this can operate in a setting in which cells receive many inputs over a range of time. Here we analyze a version of a previously proposed mechanism, that the synchronization in the CA1 region of the hippocampus depends on the firing of "doublets" by the interneurons. Using a network of local circuits that are arranged in a possibly disordered lattice, we determine the conditions on parameters for existence and stability of synchronous solutions in which the inhibitory interneurons fire single spikes, doublets, or triplets per cycle. We show that the synchronous solution is only marginally stable if the interneurons fire singlets. If they fire doublets, the synchronous state is asymptotically stable in a larger subset of parameter space than if they fire triplets. An unexpected finding is that a small amount of disorder in the lattice structure enlarges the parameter regime in which the doublet solution is stable. Synaptic noise reduces the regime in which the doublet configuration is stable, but only weakly.

1 Introduction
Coherent rhythms in the gamma range of frequencies (30–80 Hz) have been found in many parts of the cortex (Gray, Konig, Engel, & Singer, 1989; Llinas & Ribary, 1993; Bragin et al., 1995; Singer & Gray, 1995; Steriade, Amzica, & Contreras, 1996) and are hypothesized to be important in the creation of cell assemblies (Engel, Konig, & Singer, 1991; Singer & Gray, 1995; Vaadia et al., 1995; Stopfer, Bhagavan, Smith, & Laurent, 1997) and in cognitive function (Bressler, Coppola, & Nakamura, 1993; Joliot, Ribary, & Llinas, 1994; Fries, Roelfsema, Engel, Konig, & Singer, 1997).

Neural Computation 12, 1573–1606 (2000) © 2000 Massachusetts Institute of Technology

Coherent rhythms with no
phase lags among the participating cells have a potentially important role to play in plasticity, since synchronous activity is known to encourage the strengthening of mutual connections (Hebb, 1949; Ahissar et al., 1992; Bliss & Collingridge, 1993; Magee & Johnston, 1997; Markram, Lubke, Frotscher, & Sakmann, 1997); though synchronous behavior is possible without rhythms, for the visual cortex it has been reported that cells separated by more than 2 mm synchronize their activity only in the presence of rhythms (Konig, Engel, & Singer, 1995). In addition, coherent activity of a set of cells potentiates their effects downstream in the processing. Synchronization is known to exist between areas of the neocortex that are widely separated (Engel, Kreiter, Konig, & Singer, 1991; Bressler et al., 1993; Konig et al., 1995; Roelfsema, Engel, Konig, & Singer, 1997). In hippocampal slice preparations, synchronization of local field potentials and single-unit recordings in the gamma frequency range is observed, with areas up to several millimeters apart synchronized within a millisecond in spite of conduction delays that are many times that long (Andersen, Silfvenius, Sundberg, Sveen, & Wigstrom, 1978; Traub, Whittington, Stanford, & Jefferys, 1996). This article takes up the question of how such synchronization over long distances can take place in a stable manner. For networks of spiking neurons, this question has been addressed in two closely related articles. In a pioneering article about the CA1 region of the hippocampus, Traub, Whittington, Stanford, and Jefferys (1996) used large-scale simulations of models containing excitatory pyramidal cells and inhibitory interneurons. They showed that synchronization was invariably accompanied by double spikes (or "doublets") per cycle in many of the interneurons. The role of the doublets in the synchronization at a pair of separated sites was deduced by Ermentrout and Kopell (1998).
In a much reduced setting, the latter showed that in order for the doublets to aid in the synchronization process, it was important that the first spike of the doublet be due to excitation from the local pyramidal cells and the second spike be due to excitation from the remote site. The delay between the two spikes of a doublet encoded not just the conduction and synaptic delays but also the longer time to spiking that a cell experiences when it is not fully recovered from a previous spike. Ermentrout and Kopell (1998) demonstrated that the timing between the spikes could be used by the system to adjust the timing of the pyramidal cells automatically on the next cycle, in such a way as to make the coherent rhythm stable. (Related articles not as directly relevant are referred to in section 5.) The hypothesis that the first spike be due to excitation leads to a phase lag between the local excitatory and inhibitory cells. The value of this lag plays no role in the analysis here or in Ermentrout and Kopell (1998) and can be arbitrarily small. Indeed, as in Whittington, Traub, and Jefferys (1995), the I-cells can be in a parameter regime in which they would fire by themselves in the absence of excitation; the only restriction is that they not fire before the E-cells, a situation that would produce suppression of the E-cells.
The work of Ermentrout and Kopell (1998) gave an answer to how the mechanism of Traub, Whittington, Stanford, and Jefferys (1996) could work if all the remote inputs to a local set of cells arrive with the same conduction delay. It left open the question of how the mechanism could work if a cell is to receive many inputs with a wide range of conduction delays. The role of doublets, in particular, is mysterious in such a larger network because there are now many sources of excitation, not just two. This article addresses that question and shows, surprisingly, that doublets still provide information crucial for synchronization, and that in large parameter regimes, bursts with more than two spikes cannot support stable synchronization. The network investigated in this article is a one-dimensional array of N identical local circuits of excitatory and inhibitory cells that mimics the CA1 region of the hippocampus. There are two different architectures: a spatially ordered network and one with some disorder in the positions of the local circuits. Each local circuit receives input from a large number of others, with conduction delays proportional to the distances traveled; thus, the inputs to a local circuit come over a range of times. The connection strengths are scaled with the number of inputs, so it takes many inputs to create a response, unlike the network of Ermentrout and Kopell (1998), in which all distant inputs are lumped, and the strengths are such that a single signal from the distant circuit creates a response in the local one. The distributed network is potentially much more flexible in its response; depending on conduction times, strengths of synapses, and other parameters like the voltage threshold, the amount of disorder, and the range of connections, the number of spikes fired by an inhibitory cell in a cycle (defined by the periodic firing of the excitatory cells) can be one, two, or more, and there can be multiple solutions.
The actual outcome of the network therefore depends on the stability of the solutions. In order to perform the complicated analytical calculations associated with this extended network, we introduce a simple model of a neuron that mimics nonlinear behavior caused by ionic channels and is somewhat more complex than the standard "integrate-and-fire" neuron: a large input causes an immediate spike (and reset), and a smaller one that (with the help of previous small inputs) causes the voltage to pass some threshold eventually causes a spike, but possibly with a delay. This delay, which depends on the previous spiking history of the neuron as well as the timing of the inputs, contains the information used by the network to create the coherence. We also introduce a probability of inputs arriving at a given cell, depending on distance from that cell. With the explicit neuronal dynamics and a lattice-like (though possibly disordered) structure, we are able to carry out computations for the probabilistic behavior. The computations lead to explicit (though complicated) formulas for the time difference between spikes of the excitatory cells at different points of the lattice as a function of their differences at the previous cycle. This is exactly the information needed to decide if synchronous behavior is stable when the I-cells fire single spikes, doublets, or more spikes per cycle. That is, we do a linear stability analysis near configurations in which the spiking order and the idea of cycles are well defined. From the analysis, we can get bounds on the parameter regions in which one or more of the above synchronous solutions exists and is stable. We are able to see from this that doublets appear to provide more stability than either single spikes or a large number of spikes per cycle. Furthermore, we get the unintuitive result that a small amount of disorder in the lattice structure helps to stabilize the synchronous state. The article gives the central ideas and a broad outline of the computation; a more detailed outline is given in the appendixes. Finally, we analyze the effect of synaptic noise and find that it reduces the parameter range in which the doublet configuration has stable synchrony, although rather weakly.

The list of symbols used in this article is as follows:

R: the linear size of the network, measured as a conduction time
N: the number of local circuits
L_{i,j}: distance, in time units, between circuits i and j
v_l(t): membrane potential of the lth excitatory or inhibitory cell, specified by the context
v_th: threshold potential for firing
v_0: reset potential after a spike
k_l: number of inputs to the lth inhibitory cell, after it fires its first spike, until its voltage crosses v_l = 0
n_l: number of inputs to the lth inhibitory cell, after its first spike, until it fires the doublet spike (n_l > k_l)
a: strength of excitatory-to-inhibitory synaptic coupling between circuits
b: potential value of an inhibitory cell after receiving the k_l-th pulse
c: driving current in excitatory cells
g: amplitude of inhibition in excitatory cells
s: range of synaptic excitatory-to-inhibitory connections
f: measure of spatial disorder in positions of circuits
τ_m^i: membrane time constant for inhibitory cells
τ_s^i: inhibitory synaptic time constant between local inhibitory and excitatory cells
τ_m^e: membrane time constant for excitatory cells
τ_s^e: excitatory synaptic time constant between nonlocal excitatory and inhibitory cells
t_l^E: spiking time of the lth excitatory cell in a previous cycle
t̄_l^E: spiking time of the lth excitatory cell in a next cycle
W_l^I: delay time for firing of the lth inhibitory cell
W_l^E: delay time for firing of the lth excitatory cell
Figure 1: Network architecture. Local circuits, containing one excitatory (E) and one inhibitory (I) cell, are connected only via E→I synapses. These connections are probabilistic in nature. The distances between neighboring circuits are constant for the ordered network. For the disordered network, these distances are slightly distorted.
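As an illustrative sketch (not code from the paper), the ordered and slightly disordered conduction-delay lattices of Figure 1 could be generated as follows. The function name, the `jitter` disorder parameter, and all numeric values here are hypothetical, and periodic boundary conditions (used later in the text) are assumed.

```python
import numpy as np

def ring_distances(N, R, jitter=0.0, seed=None):
    """Hypothetical helper: N x N matrix of conduction delays L_ij (in time
    units) for N local circuits on a ring of linear size R. jitter > 0
    perturbs the circuit positions, giving the disordered network."""
    rng = np.random.default_rng(seed)
    pos = np.arange(N) * (R / N)          # evenly spaced circuit positions
    if jitter > 0:
        pos = pos + rng.uniform(-jitter, jitter, size=N)  # spatial disorder
    d = np.abs(pos[:, None] - pos[None, :])
    return np.minimum(d, R - d)           # shorter arc around the ring

L = ring_distances(N=8, R=16.0)                          # ordered lattice
L_dis = ring_distances(N=8, R=16.0, jitter=0.2, seed=0)  # disordered lattice
print(L[0, 1], L[0, 4])  # neighbor delay 2.0, antipodal delay 8.0
```

A connection probability P(L_ij) that depends only on these matrix entries then realizes the statistical translational invariance described in section 2.1.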
2 Network Model
Assumptions concerning our neural network model can be decomposed into two classes: assumptions about the network architecture and assumptions about the network dynamics.

2.1 Network Architecture. The network we analyze is a one-dimensional array of linear size R composed of N identical local circuits, each containing one excitatory (E) and one inhibitory (I) cell with synaptic connections from E to I and I to E (see Figure 1). For nonlocal connections, we consider only the synapses from excitatory to inhibitory cells. As in the smaller network analogue of this work (Ermentrout & Kopell, 1998), we neglect the E→E connections, which are sparse in the CA1 region of the hippocampus (Knowles & Schwartzkroin, 1981). The I→I local and nonlocal connections are treated as effectively decreasing the E→I connections, and hence are not introduced explicitly. One simplification from the connectivity in Ermentrout and Kopell (1998) is the neglect of nonlocal I→E connections, whose range in the CA1 area is smaller than that of the E→I connections (Tamamaki & Nojyo, 1990; Freund & Buzsaki, 1996). Ermentrout and Kopell (1998) showed that the nonlocal E→I connections sufficed to produce synchrony between a pair of distant circuits (though the added connection made this synchrony more robust); we therefore investigated whether such connections would also suffice in a much larger extended network. We treat the network architecture as probabilistic and assume that the probability P(L_ij) of a connection between the ith and jth circuits separated by distance L_ij depends on only the distance between them (translational invariance in a statistical sense). In what we describe as an ordered network, that distance is |i − j|; in what we shall call the disordered network, the distances are a perturbation of that quantity. In the calculations, only the times of arrival, not actual distances, are relevant, so what we call "distances"
will carry units of time. Finally, for computational reasons, we use periodic boundary conditions for our network architecture.

2.2 Network Dynamics. We denote by v_l(t) the membrane potential of either the lth I-cell or the lth E-cell, as specified by context. Its dynamics are modeled by
$$\tau_m^a \frac{dv_l}{dt} = |v_l| + I_l, \qquad a = e, i, \tag{2.1}$$
where l = 1, …, N, and I_l represents the input current. Symbols τ_m^i and τ_m^e are effective membrane time constants for the I- and E-cells. These are much smaller than the passive membrane time constant determined by the capacitance and leak current, since there are many active conductances that dominate the leak conductance (see section 5 for estimates showing that for the part of the cycle relevant to the calculation, the effective time constant of the I-cell is a fraction of a msec). We add to the membrane dynamics represented by equation 2.1 the following property. When the potential crosses the threshold value v_th (> 0), a cell immediately fires, and its potential is reset to v_0 < 0, which mimics the effect of the AHP (afterhyperpolarization) current. We assume that if the I-cell is close to recovery, a single E-spike elicits a spike from the I-cell. Thus, in the local network, the E-I circuit forms an oscillator with one spike per cycle in each cell; the I-cell has recovered by the time it receives excitation in the next cycle. When the I-cell is partially refractory, the same input (or several inputs) to that cell may cause the potential to cross only a resting zero value; equation 2.1 then implies that the cell will still fire, but with a variable delay time that depends on the difference between the times of the excitation and the time of the last spike of that cell. This delay is a central feature of the analysis in Ermentrout and Kopell (1998) for a two-circuit network; for our more complicated case, we chose an explicit model that has this delay. We chose the following form for the I_l current. For an I-cell, the current is completely of synaptic origin and is given by

$$I_l = a \sum_k \delta(t - t_{l(k)}). \tag{2.2}$$
Here the delta functions represent pulses (inputs) of amplitude a coming with delays from nonlocal E-cells, arriving at times t_{l(k)} from cell l + k to cell l. For an E-cell, the current has the form

$$I_l = c - g \sum_j e^{-(t - t_{lj})/\tau_s^i}\, h(t - t_{lj}), \tag{2.3}$$
where c is a driving current (e.g., driven by a stimulus) and the second term represents inhibitory synaptic inputs with a positive amplitude g coming
from the local I-cell at times t_{lj}, with synaptic I→E time constant τ_s^i. The function h(x) is a standard step function defined as h(x) = 1 for x ≥ 0 and h(x) = 0 for x < 0. We assume that for I-cells, the synaptic E→I amplitude a is smaller than both |v_0| and v_th. For E-cells, we assume the condition g ≫ c > v_th + |v_0|. The first inequality means that the local inhibition g is strong enough to prevent firing (Traub, Jefferys, & Whittington, 1999); the second implies that the E-cells fire frequently in its absence. Values τ_m^e, τ_m^i, v_0, and v_th can be different for I- and E-cells, but we do not use this explicitly. We assume that the synaptic E→I time course τ_s^e is not important. This assumption is consistent with experimental studies on the CA1 area (Buhl, Halasy, & Somogyi, 1994) that reveal that excitatory postsynaptic potentials to I-cells decay rapidly. As a consequence, incoming excitatory inputs to I-cells are taken as structureless (cf. equation 2.2). As we will show, the effective membrane time constant τ_m^i for the I-cells plays a role in determining parameter regimes in which the doublets are stable. The corresponding time constant τ_m^e for E-cells does not play as important a role. For some calculations, notably of W^E defined below, we cannot get specific answers in full generality. However, we do the computation in the complementary regimes τ_m^e/τ_s^i ≪ 1 and τ_m^e/τ_s^i ≫ 1. Because of the large number of active conductances that lower τ_m^e, we believe τ_m^e/τ_s^i ≪ 1 to be the more relevant regime. The qualitative features are shown to be the same in each, with different functional dependence of the period of oscillations on τ_s^i in the two regimes. The qualitative shape of the stability diagrams does not change at all. We do not put any constraints on the relative values of τ_s^i and τ_m^i.
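The membrane model of equation 2.1, together with the fire-and-reset rule, can be sketched in a few lines. This is a minimal illustration; the Euler discretization and all parameter values are ours, not the paper's.

```python
def step(v, I, tau_m, dt, v_th, v_reset):
    """One Euler step of tau_m * dv/dt = |v| + I, with immediate
    fire-and-reset at v_th (the reset mimics the AHP current)."""
    v = v + dt * (abs(v) + I) / tau_m
    if v >= v_th:
        return v_reset, True     # spike
    return v, False

# For v < 0 the model is leaky and relaxes toward v = 0; for v > 0 it is
# explosive, mimicking the sodium-driven upstroke toward threshold.
v, spiked, dt = -1.0, False, 0.001
for _ in range(5000):
    v, spiked = step(v, 0.0, tau_m=1.0, dt=dt, v_th=1.0, v_reset=-1.0)
print(round(v, 3))  # with no input the cell only decays toward 0; no spike
```

A suprathreshold positive voltage, by contrast, grows roughly exponentially and triggers the reset, which is the "immediate spike" behavior assumed for a recovered I-cell receiving a local E-spike.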
Note that the membrane dynamics represented by equation 2.1 is piecewise linear and slightly different from standard integrate-and-fire models (McCulloch & Pitts, 1943; Knight, 1972; Abbott & van Vreeswijk, 1993; Usher, Stemmler, Koch, & Olami, 1994; Gerstner, 1995) and from the Hopfield model (Hopfield, 1984; Amit, Gutfreund, & Sompolinsky, 1985; Hertz, Krogh, & Palmer, 1991). Our simple model mimics some nonlinear behavior caused by ionic channels. For negative voltages, it acts as a standard leaky integrate-and-fire neuron. For positive voltages, it models the explosive depolarization generated by the opening of active voltage-gated sodium channels. Though we take the same time constants for v < 0 and v > 0, we note that the analysis below has the same form in the more general case in which these constants are allowed to be different. As in "type 1 neurons" (Ermentrout, 1996), for neurons obeying equation 2.1 there can be a significant delay between receipt of a suprathreshold signal and the response of the neuron. In summary, in our model, the E-cells fire continuously in the absence of local inhibitory connections. With inhibition, an E-cell can fire only when the level of inhibition drops below a certain critical value. I-cells fire when they are sufficiently recovered from their last spike that the excitation (from
whatever source) pushes them above their first (v = 0) threshold. Thus, if the excitation arrives long after the last spike of an I-cell, the cell fires almost immediately because it is recovered. However, the same excitation arriving shortly after a spike will not elicit another spike. We are assuming that the parameter ranges are such that an I-cell recovers well before the end of a cycle; recovery time is shorter than the decay of inhibition, since the recovery time depends on the membrane time constant of the cell, which is small.

3 Network Properties Are Revealed by Maps
The above assumptions enable us to obtain analytical results concerning synchronous oscillations and their stability. We associate the synchronous state with the situation in which the E-cells fire simultaneously. Below we construct maps that show how the network uses the timing of spikes to accomplish synchronization. The full cycle for E-cells is as follows. An E-cell fires and sends a pulse to its local I-cell and many nonlocal I-cells. The local I-cell fires almost immediately after receiving the pulse, since the I-cell is recovered from the last cycle. Simultaneously (or very quickly) it sends inhibition to the original E-cell, prohibiting it from further firing. In the meantime, the I-cell gets pulses from many nonlocal E-cells, but it must wait to fire, because it is not yet recovered from the first spike. We assume that the range of connectivity of the nonlocal E→I connections (measured in time units) is significantly smaller than the inhibitory time constant τ_s^i, which is about 10 msec. This implies that all the nonlocal EPSPs arrive when the local E-cell is not yet recovered from its local inhibition. The parameter range relevant to the doublet case is one in which the combined excitation from the nonlocal E-cells is adequate to cause the I-cell to fire again, after a delay we call W^I, before the E-cell recovers from inhibition and fires again. In that case, the doublet spike sends a second inhibitory pulse to the local E-cell, which will fire in the next cycle when the level of inhibition falls sufficiently. We denote the time from the second inhibitory spike to the E-cell firing by W^E. Thus for the doublet case, we have the following equation:

$$\bar{t}_l^E = t_l^E + W_l^I(t_l^E, t_{l(1)}, \ldots, t_{l(n_l)}) + W_l^E(W_l^I), \tag{3.1}$$
for l = 1, …, N, where t_l^E and t̄_l^E are the times of spiking of the lth E-cell in the previous and next cycles. W_l^I is a function of t_{l(1)}, …, t_{l(n_l)}, which are the times of arriving pulses from nonlocal E-cells enumerated as l(1), l(2), …, l(n_l). We have t_{l(k)} = t_{l(k)}^E + L_{l,l(k)}, where L_{l,l(k)} is the distance between the lth and l(k)th circuits. The I-cell fires at some moment between the arrival of the n_l-th and (n_l + 1)-th pulses. We look for a synchronous solution, one in which t̄_i^E = t̄_j^E provided t_i^E = t_j^E for every i and j. Equation 3.1 is the map that gives the
timing of the E-spikes in one cycle as a function of spikes in the previous cycle. Note that the hypothesis on the extent of the excitatory connectivity is critical for well-defined cycles, which are needed to define t̄_l^E. In principle, one can generate multispikes in I-cells when the decay of inhibition is sufficiently slow and the range of the E→I connectivity is sufficiently large. In the case of triplets, equation 3.1 is modified to

$$\bar{t}_l^E = t_l^E + W_l^{I,1}(t_l^E, t_{l(1)}, \ldots, t_{l(n_l)}) + W_l^{I,2}(t_l^E + W_l^{I,1}, t_{l(n_l+1)}, \ldots, t_{l(n_l+m_l)}) + W_l^E(W_l^{I,1}, W_l^{I,2}), \tag{3.2}$$
where W^{I,1} and W^{I,2} denote the waiting times of an I-cell for the second and third spikes, measured from the previous spike. The integer m_l denotes the number of inputs between the occurrence of the second and third spikes. Generalization to higher-order multispikes is straightforward. The same formalism also applies to the singlet case. In the latter, the I-cells do not receive enough excitation to overcome their partial refractoriness and reach the threshold v = 0. Recall that the decay time of the inhibition is long enough for the I-cell to recover. Thus, an E-cell spike elicits a spike from the now-recovered I-cell, which in turn inhibits the E-cell. The formalism is as in equation 3.1, with the waiting time W^I set to zero. The first goal is to compute the W^I and W^E functions. We compute W_l^I (the firing time of an I-cell, measured from time t_l^E) using equations 2.1 and 2.2. At time t = t_l^E the potential of the I-cell is instantaneously brought to the value v_th by the local E-cell and equally instantaneously reset to the negative value v_0 after firing a spike. Now the I-cell begins to get excitatory pulses coming from nonlocal E-cells. The time, measured from t_l^E, at which the I-cell crosses v_th equals W_l^I, which takes the form

$$W_l^I = t_{l(n_l)} - t_l^E - \tau_m^i \ln\left[\frac{b}{v_{th}}\, e^{(t_{l(n_l)} - t_l^E - \delta t_l)/\tau_m^i} + \frac{a}{v_{th}} \sum_{i=k_l}^{n_l} e^{(t_{l(n_l)} - t_{l(i)})/\tau_m^i}\right], \tag{3.3}$$
where k_l is the number of inputs needed for the I-cell to cross v = 0, and δt_l is the time at which this happens. The k_l th pulse takes the voltage from a negative value to some positive value b satisfying 0 < b < a. The I-cell fires after receiving n_l excitatory inputs. The details of the derivation of equation 3.3 are presented in appendix A. The computation of W_l^E using equations 2.1 and 2.3 is presented in appendix B. Here we briefly sketch (Ermentrout & Kopell, 1998) how to derive an approximate formula for W_l^E. In the limit of fast membrane dynamics of the E-cell relative to the synaptic I→E time course, that is, when τ_m^e/τ_s^i ≪ 1,
Jan Karbowski and Nancy Kopell
we can set the left-hand side of equation 2.1 (with the superscript a = e) to zero. The inhibitory current felt by an E-cell decreases with time as g exp(−t/τ_s^i). In the case of doublets, one has two inhibitory spikes separated by the time W_l^I; the total inhibition is g[exp(−t/τ_s^i) + exp(−(t − W^I)/τ_s^i)]. Similarly, for triplets one has three exponential terms. Our W_l^E is the time interval from the occurrence of the second inhibitory spike to the moment when the total inhibition felt by the local E-cell drops below a certain critical value. That value is the one for which the total current I_l in equation 2.3 becomes positive. After the current I_l becomes positive, equation 2.1 implies that the membrane potential starts to grow, reaching the threshold very quickly, at a time of the order of τ_m^e. In that limit W_l^E takes the form

$$W_l^E = \tau_s^i \ln\left[\frac{g}{c}\left(1 + e^{-W_l^I/\tau_s^i}\right)\right]; \tag{3.4}$$
at that time, the E-cell fires in the next cycle. In the opposite limit, τ_m^e/τ_s^i ≫ 1, W_l^E is of the order of the membrane time constant τ_m^e, with a similar functional dependence (cf. appendix B).

In the disordered network, the arrival times of pulses from E-cells to the lth I-cell form an irregular pattern. Therefore, we must introduce averaging over those times. We assume that the inputs are independent of one another and that their temporal pattern depends only on the initial conditions (t_l^E) and the architecture of the network {L_{l,l(k)}}. We average equations 3.1 and 3.2 over the network architecture in which l(k) = l + k (mod N); this corresponds to the assumption of periodic boundary conditions. Since the arrival times do not depend on each other, our probability distribution ρ of {L_{l,l+k}} is of the form

$$\rho(L_{l,l+1}, \ldots, L_{l,l+N-1}) = \prod_{k=1}^{N-1} P(L_{l,l+k}). \tag{3.5}$$

We choose the probability P(L_{l,l+k}) of an input from the (l + k)th circuit arriving at the lth circuit at time L_{l,l+k} to be

$$P(L_{l,l+k}) \sim \frac{1}{2}\left[\delta\!\left(L_{l,l+k} - \frac{kR}{N} - f\right) + \delta\!\left(L_{l,l+k} - \frac{kR}{N} + f\right)\right] e^{-L_{l,l+k}/s}, \tag{3.6}$$
where R is the linear size of the system, f is a measure of the spatial disorder in the positions of the neurons and is the same for every pair of neurons, and s represents the range of synaptic connections between neurons. s is chosen so that if the E-cells are synchronous, all the excitation from the nonlocal E-cells to a local circuit arrives during the recovery of a local E-cell from its local inhibition. Note that ρ does not depend on the position of the lth circuit, which is a consequence of the assumption of translational invariance. The probability P(L_{l,l+k}) is a product of two terms: the one with
the delta functions represents the probability of arrival of an input at a certain time, and the one with the exponential decay represents the probability of a synaptic connection between circuits. The latter factor mimics the locality of connections. When s = ∞, we have a network with global E→I connections. The advantage of formula 3.6 lies in the fact that it captures different types of network architecture, yet is very simple and suitable for analytical investigation. When f = 0, one has an ordered network with regularly arriving pulses separated by the time R/N. The case f ≠ 0 corresponds to a disordered network in which the positions of the neurons are irregular or, alternatively, in which the positions of the neurons are regular but the lengths of the axonal connections between them are irregular. In such a network, pulses come in at times that are random but symmetrically distributed around kR/N for every k. In principle, the range of variability of f extends from zero to R/2N. However, in order to simplify our calculations, we adopted the weak disorder limit, corresponding to fN/R ≪ 1. In this limit, the moments of distribution 3.6 are the same as those of a gaussian distribution of arrival times (cf. appendix C). Since the results of this article depend only on the moments, the results are equivalent for the two distributions. For convenience we define Δ_{l−1} ≡ ⟨t_l^E⟩ − ⟨t_1^E⟩, where the symbol ⟨…⟩ denotes averaging with respect to the probability distributions 3.5 and 3.6, that is,

$$\langle H \rangle = \int_0^\infty dL_{l,l+1} \cdots \int_0^\infty dL_{l,l+N-1}\; \rho(L_{l,l+1}, \ldots, L_{l,l+N-1})\cdot H, \tag{3.7}$$
for any function H depending on {L_{l,l+k}}. The next step is to rewrite equations 3.1 and 3.2 in terms of the {Δ}. This gives us (N − 1) equations for Δ̄_l ≡ ⟨t̄_{l+1}^E⟩ − ⟨t̄_1^E⟩, which constitute the multidimensional map

$$\bar{\Delta}_l = f_l(\Delta_1, \ldots, \Delta_{N-1}). \tag{3.8}$$
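To make the role of the map 3.8 concrete, the following sketch iterates a stand-in for f toward the synchronous fixed point Δ_l = 0. The stand-in map and its parameters are illustrative assumptions only; the true f of this article involves the averaged waiting times ⟨W^I⟩ and ⟨W^E⟩ given below.

```python
import numpy as np

# Hypothetical stand-in for the map f in equation 3.8: a smooth map with a
# fixed point at Delta = 0.  A simple contraction is used so that the
# fixed-point iteration itself can be illustrated.
def f(delta, coupling=0.4):
    return coupling * np.sin(delta)

def iterate_to_fixed_point(delta0, n_steps=200):
    """Iterate the map Delta_bar = f(Delta) from an initial phase-lag vector."""
    delta = np.asarray(delta0, dtype=float)
    for _ in range(n_steps):
        delta = f(delta)
    return delta

# For a contraction (|f'(0)| < 1) the iterates approach the synchronous
# state Delta_l = 0 for all l.
delta_final = iterate_to_fixed_point(np.array([0.3, -0.2, 0.1]))
```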
(Recall that the overbar denotes a quantity associated with the next cycle.) In the case of the doublet configuration, the nonlinear function f = (f_1, f_2, …, f_{N−1}) is given by

$$f_l(\Delta_1, \ldots, \Delta_{N-1}) = \langle \bar{t}_{l+1}^E(\{\Delta\})\rangle - \langle \bar{t}_1^E(\{\Delta\})\rangle, \tag{3.9}$$

where

$$\langle \bar{t}_l^E(\{\Delta\})\rangle = \langle t_l^E\rangle + \langle W_l^I\rangle + \langle W_l^E\rangle. \tag{3.10}$$
This means that the average time of the next E-cell firing is a function only of the average time of the last E-cell firing, with no dependence on the earlier history. In
equation 3.10, W_l^I(L_{l,l(1)}, …, L_{l,l(n_l)}; {Δ}) and W_l^E(L_{l,l(1)}, …, L_{l,l(n_l)}; {Δ}) are now functions of {L_{l,l(k)}} and {Δ} instead of the arrival times {t_{l(k)}}. The formulas for ⟨W_l^I⟩ and ⟨W_l^E⟩ are complicated; their sum has the form

$$\langle W_l^I\rangle + \langle W_l^E\rangle = \frac{\tau_s^i}{2^{n_l}\cosh^{n_l}(f/s)} \sum_{a_1,\ldots,a_{n_l}=1}^{2} \exp\left(-\sum_{j=1}^{n_l}(-1)^{a_j}\frac{f}{s}\right) \times \ln\left[1 + \frac{g}{c}\exp\left(W_l^I\!\left(\frac{R}{N}+(-1)^{a_1}f,\ldots,\frac{n_l R}{N}+(-1)^{a_{n_l}}f;\{\Delta\}\right)\Big/\tau_s^i\right)\right], \tag{3.11}$$

with W_l^I(R/N + (−1)^{a_1} f, …, n_l R/N + (−1)^{a_{n_l}} f; {Δ}) given by equation 3.3. The derivation of equation 3.11, including details of how to compute the average values of k_l and n_l (which we denote k and n, respectively), is given in appendix D.

We are interested in the existence and stability of a synchronous state, which corresponds to the fixed point Δ̄_l = Δ_l with Δ_l = 0 for every l. For the triplet case, the function f can be obtained explicitly as well but is more complex. In principle, the I-cells may produce many spikes per cycle, provided the size R of the system is sufficiently large. Some of these spiking configurations may be stable; some may not (see below). In the case of multistability, the actual configuration depends on the values of the parameters used in the system. Especially critical is the interplay between the synaptic strength a and the linear size of the system R. The synaptic strength determines how many inputs from E-cells are needed to trigger the firing of an I-cell. To have only doublets, there must be enough inputs to trigger a second spike in the I-cell, that is, n < N, but not enough to trigger a third, which means (n + k) > N. Using the formulas for k_l and n_l in appendix D, we can turn these inequalities into bounds on the synaptic amplitude a such that the system has only doublets. When R/Nτ_m^i ≫ 1 and 0 < f/τ_m^i ≲ 1, which imply weak disorder, these bounds may be written as
$$\frac{\cosh(f/s)}{\cosh(f/s - f/\tau_m^i)}\, e^{-R/\tau_m^i} < \frac{ab}{v_{th}|v_0|} < \left(\frac{b}{v_{th}}\right)^{1/2} \left(\frac{\cosh(f/s + f/\tau_m^i)}{\cosh(f/s - f/\tau_m^i)}\right)^{1/2} e^{-R/2\tau_m^i}. \tag{3.12}$$
For larger a it is possible to have both doublets and higher-order multispikes. In that case, the system would choose a stable configuration (see section 5). For stronger (but still weak) disorder, when R/Nτ_m^i ≫ f/τ_m^i ≫ 1, the formula is slightly more complex, but the conclusion about a remains valid.
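As a numerical illustration, one can check whether the doublet-only window between the two bounds above is nonempty for a given geometry. The sketch below restates the bounds of equation 3.12 directly in code; all parameter values are assumptions chosen only for illustration.

```python
import math

# Doublet-only window of the form (cf. equation 3.12):
#   L(f, s, tau_m) * exp(-R/tau_m)  <  a*b/(v_th*|v0|)  <  U(f, s, tau_m) * exp(-R/(2*tau_m))
# All numbers below are illustrative assumptions, not fitted values.
def doublet_window(R, tau_m, f, s, b_over_vth=0.5):
    lower = (math.cosh(f / s) / math.cosh(f / s - f / tau_m)) * math.exp(-R / tau_m)
    upper = (math.sqrt(b_over_vth)
             * math.sqrt(math.cosh(f / s + f / tau_m) / math.cosh(f / s - f / tau_m))
             * math.exp(-R / (2.0 * tau_m)))
    return lower, upper

lo, hi = doublet_window(R=10.0, tau_m=1.0, f=0.3, s=5.0)
# The window is nonempty (lo < hi) because exp(-R/tau_m) decays twice as
# fast in R as exp(-R/(2*tau_m)).
```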
We check the stability of the synchronous solutions using the (N − 1) × (N − 1) stability matrix A defined as A_{l,m} = ∂Δ̄_l/∂Δ_m |_{Δ=0}. The synchronized fixed point is stable when the moduli of all eigenvalues λ_l of A are smaller than unity. Invoking the Gershgorin theorem (Horn & Johnson, 1985), one can bound |λ_l| from above by Σ_{m=1}^{N−1} |A_{l,m}|. If this upper limit is smaller than unity, we obtain a sufficient condition for stability, which allows us to compute the stability diagrams presented in Figure 2. In these figures, we plot the region in parameter space in which the sufficient condition holds. We have chosen to vary R/Nτ_m^i and τ_m^i/τ_s^i, since they emerge naturally as suitable coordinates; the other parameters are fixed. Details of this computation are included in appendix E. For the singlet case we find that the modulus of each eigenvalue of A is one; that is, the synchronized oscillations with singlets are marginally stable. In the case of doublets, we obtain stability regions for both the ordered and the disordered network. A crucial factor for this stability is the presence of the delay time function ⟨W^I⟩ in equation 3.10. To make the calculated stability diagrams plausible, we now show that diagrams such as the ones depicted in Figure 2 are to be expected for solutions with doublets. Both figures show that for sufficiently small values of R/Nτ_m^i, the doublets are unstable. To gain some insight into this behavior, notice that in order for an I-cell to fire, it must receive a sufficient number of inputs to cross the zero value of its potential. Therefore, the size R of the system must be large enough to produce enough inputs. The value of R/N plays the same role in this network as does the delay time d in the two-circuit network (Ermentrout & Kopell, 1998); in the latter network, stability requires d to be above some minimum magnitude, provided the coupling is not too strong.
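The Gershgorin sufficient condition described above can be checked numerically once the map is available: form the Jacobian at the synchronous point and test whether every row sum of absolute values stays below one. The map used below is a hypothetical stand-in with a fixed point at zero, for illustration only.

```python
import numpy as np

def jacobian(f, delta0, eps=1e-6):
    """Numerical Jacobian A[l, m] = d f_l / d delta_m at delta0 (forward differences)."""
    delta0 = np.asarray(delta0, dtype=float)
    f0 = f(delta0)
    A = np.zeros((delta0.size, delta0.size))
    for m in range(delta0.size):
        d = delta0.copy()
        d[m] += eps
        A[:, m] = (f(d) - f0) / eps
    return A

def gershgorin_stable(A):
    """Sufficient condition for |lambda| < 1: by the Gershgorin theorem every
    eigenvalue modulus is bounded by the largest row sum of |A|."""
    return np.abs(A).sum(axis=1).max() < 1.0

# Stand-in map with a fixed point at Delta = 0 (assumed, for illustration):
# a local term plus a weak nearest-neighbor coupling.
f = lambda d: 0.3 * np.tanh(d) + 0.1 * np.roll(d, 1)
A = jacobian(f, np.zeros(5))
# Row sums of |A| are about 0.4 < 1, so the synchronous fixed point of this
# stand-in map satisfies the sufficient condition.
```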
Similarly, when the effective membrane time constant τ_m^i (from v_0 to zero) is too large relative to the number of inputs, the I-cell will not fire; hence doublets cannot exist. Figure 2 shows that ordered and disordered networks differ with respect to stability for a large τ_m^i/τ_s^i ratio. This can be made plausible, again invoking existence arguments. If the synaptic time constant τ_s^i of an I-cell is too small, the inhibition felt by an E-cell falls off faster than inputs arrive to produce a second spike in the I-cell. This means that to get doublets, the inequality τ_s^i > ⟨W^I⟩ must be satisfied. For the ordered network, we have ⟨W^I⟩ ∼ τ_m^i, since the time between the spikes of the I-cell doublet is governed by its effective membrane time constant τ_m^i (cf. appendix D). Therefore the ratio τ_m^i/τ_s^i must have an upper bound, because ⟨W^I⟩ is bounded from below and positive (see Figure 2A). For the disordered network, ⟨W^I⟩ depends additionally on the disorder parameter f. This dependence creates a dramatic change, since now ⟨W^I⟩ can assume arbitrarily small values for sufficiently large f. That, in turn, implies that τ_m^i/τ_s^i does not have an upper bound, because τ_m^i/τ_s^i ∼ 1/⟨W^I⟩. Using the stability matrix A, we can calculate the critical value (R/Nτ_m^i)_cr; above that value the sufficient condition for stability is satisfied
[Figure 2 appears here: two panels, (A) and (B), each plotting τ_m^i/τ_s^i against R/Nτ_m^i; the values x_0 and y_0 are marked on the axes, and panel B additionally marks (R/Nτ_m^i)_cr.]
Figure 2: Stability diagrams for the doublet configuration in the regime τ_m^e/τ_s^i ≪ 1. The shaded area indicates the stability region. (A) Ordered network. The values x_0 and y_0 on the axes are given by equations E.9 and E.10 in appendix E, respectively. There is stability only for τ_m^i/τ_s^i ≪ 1 and for R/Nτ_m^i > x_0. For the values b = a/2 and |v_0|/a = 5, one finds x_0 ≈ 2. (B) Disordered network. Note the appearance of the vertical asymptote and the increase in the stability region. The values x_0 and y_0 are of the same order of magnitude as for the ordered network. In the complementary regime (i.e., when τ_m^e/τ_s^i ≫ 1), the shapes of the diagrams remain the same with the substitution τ_s^i → τ_m^e in the coordinates.
for arbitrary values of τ_m^i/τ_s^i (see Figure 2B). In the limit R/Nτ_m^i ≫ f/τ_m^i ≫ 1, this critical value takes the form

$$\left(\frac{R}{N\tau_m^i}\right)_{cr} \approx \frac{R}{Ns} + \ln\left[\frac{8|v_0|}{a}\,\frac{R}{Nf}\,\cosh^2\!\left(\frac{f}{s}\right)\right], \tag{3.13}$$
which is the point where the vertical asymptote crosses the horizontal axis in Figure 2B. Notice that for f = 0, namely, for the ordered network, this asymptote shifts to infinity, reducing the stability region. Also note that the stability region grows with an increasing range of connections s and synaptic strength a. Details of how to compute (R/Nτ_m^i)_cr are given in appendix E. For the triplet case in the ordered network, we find that the upper limit of |λ_l| is always greater than unity (computation not given). This suggests that the triplet configuration in such a network is unstable. In the disordered network, on the contrary, we find that synchronized oscillations with triplets are stable for all τ_m^i/τ_s^i provided the ratio R/Nτ_m^i is sufficiently large, that is, R/Nτ_m^i ≫ 1, although we are unable to determine the exact values analytically. We see from the above that spatial disorder in a network helps to stabilize the synchronized oscillations. So does increasing the range of synaptic connections (see equation 3.13). Based on the comparison between doublets and triplets, we conjecture that higher-order multispikes are less stable than doublets, if they exist at all. The analysis also allows us to determine the period T of the oscillations, which is equal to ⟨W^I⟩ + ⟨W^E⟩ (see equation 3.10). In the ordered network with the doublet configuration, the formula is relatively simple. In the limit τ_m^e/τ_s^i ≪ 1 we obtain

$$T = \tau_s^i \ln\left[\frac{g}{c}\left(1 + \left\{\frac{[(v_{th}-a)(1-e^{-R/N\tau_m^i})+a]\,[a+|v_0|(1-e^{-R/N\tau_m^i})]}{a\,[b+(a-b)e^{-R/N\tau_m^i}]}\right\}^{\tau_m^i/\tau_s^i}\right)\right]. \tag{3.14}$$
Note that the period of oscillations of the whole network is larger than the period of oscillations of an isolated local circuit, τ_s^i ln(g/c), but both are of the same order of magnitude. In the opposite limit, τ_m^e/τ_s^i ≫ 1, one should substitute τ_m^e for τ_s^i and g(1 + v_th/c)τ_s^i/τ_m^e for g. The period grows with the increasing size of the system, saturating as R → ∞ (for fixed N). This means that the frequency of oscillations is bounded from below. The dependence on the membrane time constant τ_m^i is essentially the same as on R. T decreases with the increasing strength of the E→I
synapses, which corresponds to increasing the amplitude a. The same conclusions are valid in the case of the disordered network with the doublet solution, but the expression for the period is more complex, with similar logarithmic terms. In the regime τ_m^e/τ_s^i ≪ 1, we obtain the following expressions for the period of oscillations. For the ordered network, in the limits τ_m^i/τ_s^i ≪ 1 and R/Nτ_m^i ≫ 1, that is, in the parameter range where there is a stable doublet configuration, the period is governed only by the inhibitory synaptic time course τ_s^i (Whittington et al., 1995; Jefferys, Traub, & Whittington, 1996; Traub, Whittington, Colling, Buzsaki, & Jefferys, 1996) and reads T ≈ τ_s^i ln(2g/c). (Proportionality between the period and τ_s^i has also been obtained before in a network of inhibitory neurons; Chow, White, Ritt, & Kopell, 1998.) In the above limits, we also obtain a very simple formula for the ratio of the average timing between the doublet spikes ⟨W^I⟩ to the period, ⟨W^I⟩/T ≈ (τ_m^i/τ_s^i) ln(v_th|v_0|/ab)/ln(2g/c), which depends logarithmically on the nonlocal (E→I) synaptic strength. In the same limit, for the disordered network with the additional constraint f/τ_m^i ≲ 1, we obtain the period

$$T \approx \tau_s^i \ln\left(\frac{2g}{c}\right) - \frac{f}{2}\tanh\frac{f}{s}. \tag{3.15}$$
This implies that spatial disorder reduces the period and that, in addition, the longer the range of connections s, the greater the period. In the complementary regime, τ_m^e/τ_s^i ≫ 1, we obtain the following expression for the period:

$$T \approx \tau_m^e \ln\left[\frac{2g\,\tau_s^i}{c\,\tau_m^e}\left(1 + \frac{v_{th}}{c}\right)\right] - \frac{f}{2}\tanh\frac{f}{s}. \tag{3.16}$$
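The trend stated for the period—disorder shortens it, a longer connection range s restores it—can be checked directly from the two period formulas; the sketch below evaluates them under assumed, purely illustrative parameter values.

```python
import math

# Illustrative evaluation of the period formulas (cf. equations 3.15 and 3.16);
# the parameter values are assumptions chosen only to exhibit the trends.
def period_slow_membrane(tau_s, g, c, f, s):
    # Regime tau_m^e / tau_s^i << 1, disordered network (equation 3.15).
    return tau_s * math.log(2.0 * g / c) - (f / 2.0) * math.tanh(f / s)

def period_fast_membrane(tau_m_e, tau_s, g, c, v_th, f, s):
    # Complementary regime tau_m^e / tau_s^i >> 1 (equation 3.16).
    return (tau_m_e * math.log(2.0 * g * tau_s / (c * tau_m_e) * (1.0 + v_th / c))
            - (f / 2.0) * math.tanh(f / s))

T_ordered = period_slow_membrane(tau_s=10.0, g=2.0, c=0.5, f=0.0, s=5.0)
T_disordered = period_slow_membrane(tau_s=10.0, g=2.0, c=0.5, f=1.0, s=5.0)
T_fast = period_fast_membrane(tau_m_e=2.0, tau_s=10.0, g=2.0, c=0.5, v_th=1.0, f=1.0, s=5.0)
# Disorder (f > 0) shortens the period; for f = 0 the slow-membrane period
# reduces to tau_s * ln(2g/c).
```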
Note that now the period depends only weakly (i.e., logarithmically) on the synaptic time constant τ_s^i. The triplet case occurs only in the disordered network. Here too an analytical expression for T can be obtained, and again we find that the leading dependence is logarithmic, although the formula is more complicated than for doublets. Nevertheless, the qualitative dependences on R, τ_s^i, τ_m^i, a, and the disorder remain valid.

4 Inclusion of Synaptic Noise
In this section we estimate the effect of synaptic noise on the stability of the doublet configuration in the regime τ_m^e/τ_s^i ≪ 1. To simplify the considerations, we consider a network of N = 2 local circuits with the same type of connections as before. For N = 2 circuits, the assumption of small excitatory E→I synaptic strength a is removed; in this case, a single nonlocal excitatory input
must trigger the firing of the local I-cell. We assume, however, that a < v_th, so that an I-cell has a delay between the receipt of an input and firing. We can make use of equations 3.3 and 3.4, derived before, for the W^I and W^E functions. The formula for W^I is modified because now there is only one strong source of excitation instead of many weak sources. Thus, the sum in the logarithmic term of equation 3.3 disappears. We note that the time δt_l at which the membrane potential crosses zero is exactly the moment when the nonlocal excitatory pulse arrives, δt_l = t_2^E − t_1^E + R, where R is now the conduction delay time between the circuits. The formula for the value b of the membrane potential right after the reception of an input can also be substituted into equation 3.3. This value is now b = v_0 exp(−δt_l/τ_m^i) + a. Taking all these changes into account, we obtain for the first circuit

$$W_1^I = t_2^E + R - t_1^E + \tau_m^i \ln\left(\frac{v_{th}}{v_0\, e^{-(t_2^E + R - t_1^E)/\tau_m^i} + a}\right). \tag{4.1}$$
A similar formula holds for the second circuit, with the substitution t_1^E ↔ t_2^E. The equation for W_{1,2}^E remains the same as equation 3.4. In the presence of synaptic noise, the synaptic strengths a and g fluctuate rapidly around their average values, reducing the precision of the values of W^I and W^E. To estimate the effect of synaptic noise, we treat a and g as random variables. Repeating the steps from section 3, we have

$$\bar{\Delta} = \Delta + \tau_s^i \ln\left(\frac{h(-\Delta)}{h(\Delta)}\right), \tag{4.2}$$

where now Δ ≡ t_2^E − t_1^E and Δ̄ ≡ t̄_2^E − t̄_1^E, and

$$h(\Delta) = \left(v_0\, e^{-(R+\Delta)/\tau_m^i} + a\right)^{\tau_m^i/\tau_s^i} + \left(v_{th}\, e^{(R+\Delta)/\tau_m^i}\right)^{\tau_m^i/\tau_s^i}. \tag{4.3}$$
Notice that the synaptic I→E strength g cancels out. Averaging the above equation over the possible values of a with the simple probability distribution (cf. appendix C) P(a)_noise = (1/2)[δ(a − a_0 − δa) + δ(a − a_0 + δa)], where δa is a measure of the synaptic noise, and solving the stability inequality |∂Δ̄/∂Δ|_{Δ=0}| < 1 in terms of R, in the limit δa/a_0 ≪ 1 we obtain, after some tedious algebra,

$$\frac{R}{\tau_m^i} > \ln\left(\frac{2|v_0|}{a_0}\left[1 + r\left(\frac{\delta a}{a_0}\right)^2\right]\right), \tag{4.4}$$

with

$$r = \frac{2\left[(4|v_0|v_{th})^{\tau_m^i/\tau_s^i} + a_0^{2\tau_m^i/\tau_s^i}\,(1 + \tau_m^i/\tau_s^i)\right]}{(4|v_0|v_{th})^{\tau_m^i/\tau_s^i} + a_0^{2\tau_m^i/\tau_s^i}}. \tag{4.5}$$
When the synaptic noise is absent (δa = 0), the right-hand side of equation 4.4 is smaller than when δa > 0. Hence, the inclusion of synaptic noise shrinks the parameter regime in which there is stability. However, the effect of the noise is rather weak, since the dependence in equation 4.4 is logarithmic. Finally, let us note that since g dropped out of equations 4.4 and 4.5, synaptic noise in a local I→E inhibitory synapse is irrelevant. This is a consequence of the fact that we assumed slow I→E inhibition, that is, τ_m^e/τ_s^i ≪ 1. When inhibition is slow, fast synaptic fluctuations do not have a great influence on synaptic transmission, since the synapse has time to average over the fluctuations. This is not the case for the E→I nonlocal synaptic connections, since in that case excitation is fast (Buhl et al., 1994) and the system lacks the time to average. Therefore, synaptic fluctuations in the latter case are important.

5 Discussion
The literature on synchronization is very large; some related papers using the properties of inhibition are Lytton and Sejnowski (1991), Wang and Rinzel (1992, 1993), Friesen (1994), van Vreeswijk, Abbott, and Ermentrout (1994), Bush and Sejnowski (1996), Gerstner, van Hemmen, and Cowan (1996), Wang and Buzsaki (1996), Terman, Kopell, and Bose (1998), Terman and Rubin (1998), White, Chow, Ritt, Soto, and Kopell (1998), and Chow et al. (1998). In most articles on synchronization, conduction delays do not play a part. One that explicitly addresses this question is Konig and Schillen (1991), which gives simulations of a firing-rate model showing that synchronization can be accomplished with delays up to roughly a third of the period. It differs in two respects from our article. First, we consider a spiking-neuron model, in which the timing between single spikes is very important; in firing-rate models, the fine temporal structure is lost. Second, our synchronous state is stable even for infinite-range delays, that is, for s → ∞ (see equation 3.13 and Figure 2B); if there is heterogeneity in the driving current, the range could be reduced (Ermentrout & Kopell, 1998). Crook, Ermentrout, Vanier, and Bower (1997) consider temporal delays caused by axons in a continuous model of excitatory oscillators. They found that the synchronous state is stable provided the delays are not too large. These authors do not consider the effect of inhibition. Other articles, like Campbell and Wang (1998) and Ernst, Pawelzik, and Geisel (1995), deal with pulse-coupled relaxation oscillators. The first of these uses excitation with homogeneous global inhibition. The second uses either excitatory or inhibitory homogeneous connections, but not both, as a mechanism for the generation of long-range synchronization. Neither of these articles takes into account the time structure of inhibition. Thus, they are quite different in spirit from the models of Whittington et al. (1995), Traub, Whittington, Stanford, and Jefferys (1996), and Ermentrout and Kopell (1998), which deal with synchronization of excitatory and inhibitory cells that spike once or twice per cycle,
and make central use of the time course of the inhibition. Another recent related article is Traub, Whittington, Stanford, and Jefferys (1999), which includes a more detailed model than the one in Traub, Whittington, Stanford, and Jefferys (1996) and summarizes a body of work on gamma and beta rhythms. (See also Traub, Jefferys, & Whittington, 1999.) This article extends the work of Ermentrout and Kopell (1998) on gamma oscillations in a smaller network. We have analyzed the mechanism for producing long-range synchrony in a distributed network of spiking neurons, in which each interneuron receives a large number of inputs arriving at different times. We showed that even with many inputs, a firing configuration in which interneurons produce doublets yields stable synchronization over a larger parameter range than other configurations. These doublets have been seen in large-scale simulations, as well as in in vitro studies (Traub, Whittington, Stanford, & Jefferys, 1996; Traub, Whittington, Buhl, Jefferys, & Faulkner, 1999), and might provide additional support for a temporal coding hypothesis (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). We note that (as in Tsodyks, Mitkov, & Sompolinsky, 1993; van Vreeswijk et al., 1994; and Gerstner et al., 1996) our calculations deal only with linear stability and hence do not give information about global stability. In Ermentrout and Kopell (1998), the synchronization mechanism depends on the properties of the waiting time function called W^I in this article and T_I in Ermentrout and Kopell (1998). The analysis in that article was valid for all functions T_I that have the qualitative property that they decay in time. That is, the longer the time from the spike of an I-cell until it receives further excitation, the shorter the wait until it then fires. This property was shown to hold for a class of biophysical models reduced from those of Traub and Miles (Ermentrout & Kopell, 1998; Traub & Miles, 1991).
The analysis in this article uses a particular form for the dynamics (see equation 2.1): a one-dimensional system having two thresholds. This form was chosen as the simplest equation for which the waiting time function T_I has a shape similar to that computed from the reduced model of Traub and Miles (1991). This was used in the computation of equation 3.3. The essential role in our mechanism is played by the local inhibitory neurons, which filter the inputs coming from the nonlocal excitatory cells to a local E-cell, inhibiting it from firing by producing appropriately timed multispikes. Especially in the regime τ_m^e/τ_s^i ≪ 1, this inhibition governs the period of oscillations of the excitatory cells, as well as stabilizing the synchronous state. For the ordered network (see Figure 2A), this inhibition must be slower than the effective membrane time constant of the inhibitory cells in order to preserve stability. In the case of the disordered network (see Figure 2B), this inhibition does not have to be slow for sufficiently large R/Nτ_m^i. In the complementary regime, τ_m^e/τ_s^i ≫ 1, the time course of the inhibition does not play any role in the stability. Our stability diagrams and the period of oscillations scale with the factor R/N, the average time needed for a pulse to travel between neighboring
circuits. This scaling is a direct consequence of equation 3.6 and indicates that the stability diagrams, as well as the period, do not depend on the network size R, provided the density of circuits is fixed (fixed R/N). The stability diagrams indicate that below a minimal value x_0 of R/Nτ_m^i, the doublet configuration is not stable. From equation E.9 (in appendix E), using the parameter values b = a/2 and |v_0|/a = 5, we get x_0 ≈ 2, with x_0 only weakly dependent on a, b, and v_0. This gives a minimal size for the distance R/N between the local circuits: R/N > x_0 τ_m^i. If the effective time constant τ_m^i is very small (from the conductances in Traub, Whittington, Stanford, & Jefferys, 1999, we estimate τ_m^i to be a fraction of a msec), the local circuits can be very close. This estimate is made as follows: the active conductances include an excitatory metabotropic conductance, the AHP conductance, and the synaptic conductances from local interactions (Traub, Whittington, Stanford, & Jefferys, 1999). For interneurons, the input conductance is about 10 nS. After the first spike, there is a large, fast AHP. At the peak of the fast AHP, the input conductance is at least 500 to 1000 nS (Traub & Miles, 1995). This means that after the first action potential, the input conductance increases by at least a factor of 50, which implies that the time constant goes down by the same factor. An estimate of the baseline time constant in Traub and Miles (1995) is 37.5 msec; after the first action potential and before the doublet, the time constant would then be less than a msec. This takes into account only the AHP current. If one also includes the metabotropic glutamate conductances activated by the tetanic stimulation and the GABA_A conductances from the other interneurons, as documented in Whittington, Stanford, Colling, Jefferys, and Traub (1997), one has an effective time constant that is considerably lower.
We note that all the computations done in this article concern the time between the first and second spikes of a doublet (or the third, in the triplet configuration). Thus we are using an effective time constant that reflects only a small part of the oscillation cycle. Another property of equation 3.6 is that the disorder parameter f is the same for every pair of circuits. This is a consequence of the assumed translational invariance in the system and corresponds to the situation in which the disorder is not too strong. We showed that for weak spatial disorder, the results of this article are identical if equation 3.6 is replaced by a gaussian distribution. In the limit of very strong disorder, however, one should not expect the assumption of translational invariance to be correct. In that case, the proper approach would require taking f as a space-dependent variable, that is, different for every pair of circuits. Our analysis concerns perfect (probabilistic) synchrony, in which every cell (excitatory and inhibitory) participates in the synchronization process. For such complete synchrony, we find that weak disorder (fN/R ≪ 1) enlarges the stability region of the synchronous state in parameter space in comparison to the ordered network. Our argument probably does not work for very strong disorder; the existence argument would be especially difficult to make, since the pattern of inputs arriving at each cell would be very different. Nevertheless, in such a network there may exist a partially coherent state in which only a fraction of the cells participate at a given time (Tsodyks et al., 1993; van Vreeswijk, 1996). In vitro and in large-scale simulations (Traub, Whittington, Stanford, & Jefferys, 1996), doublets are produced by only some interneurons in some cycles. It remains to be understood how the mechanism explained in this article can work in the more disordered situation in which only a fraction of the interneurons that fire in a cycle also fire a doublet spike. Synaptic noise in I- and E-cells could also have a drastic effect on the stability of the various multispike configurations. The analysis for the two-circuit case suggests that noise in general reduces stability, although rather weakly. Thus, noise has the opposite effect from weak spatial randomness in the network, which in general makes long-range synchronization more robust. The results of this article seem to be consistent with recent experimental in vitro studies concerning the effect of morphine on long-range synchrony in the hippocampal CA1 area (Whittington, Traub, Faulkner, Jefferys, & Chettiar, 1998). These authors found that adding morphine to the system, which effectively reduces the level of inhibition, causes bursting of inhibitory interneurons accompanied by a loss of long-range synchrony. They suggest that this loss of synchrony may be the reason that morphine causes cognitive dysfunction clinically (Whittington et al., 1998). In our model, a decrease in I→I inhibition corresponds to decreasing |v_0| and/or v_th (so that less input is required to trigger firing of an I-cell). This means that the ratio ab/|v_0|v_th would increase in equation 3.12, opening the possibility of generating higher-order multispikes in inhibitory neurons.
Because such higher-order multispikes are probably unstable (or stable only in a limited parameter region), one could observe bursts without the synchrony. Also notice that the stability regions in Figure 2 shrink with decreasing strength of the nonlocal synaptic E→I coupling. This is in qualitative agreement with large-scale simulations involving multicompartmental realistic neurons of the hippocampal CA1 area (Traub, Whittington, Stanford, & Jefferys, 1996). In those simulations, it was found that reduction of the E→I AMPA conductance resulted in disruption of the coherent global oscillations. For large values of the synaptic coupling, there is some chance that more than one multispike configuration may exist and be stable. This may be undesirable for the network if a fixed configuration is needed for downstream processing. We note that architecture-dependent synaptic depression (Thomson & Deuchars, 1994; Markram & Tsodyks, 1996; Abbott, Sen, Varela, & Nelson, 1997; Koch, 1997) might then help to decrease the synaptic coupling to values at which only the doublets are stable. We also note that since the period of the network oscillation depends on the extra inhibition produced by the doublet spike, modulation of the percentage of interneurons firing doublets may provide a mechanism for modulating the period.
Jan Karbowski and Nancy Kopell
Appendix A
In this appendix we show how to derive equation 3.3. The local $l$th I-cell receives a spike train of nonlocal excitatory cells at times $t_{l(1)}, \ldots, t_{l(k_l)}, \ldots, t_{l(n_l)}$, after which it fires, where $l(k) = l + k$. Solving equations 2.1 and 2.2 with an initial condition $v(t_l^E) = v_0$, we obtain recursive formulas for the values of the potential $v(t)$:

$$v(t_{l(1)}) = v_0\, e^{-(t_{l(1)} - t_l^E)/\tau_m^i} + a,$$
$$v(t_{l(2)}) = v(t_{l(1)})\, e^{-(t_{l(2)} - t_{l(1)})/\tau_m^i} + a,$$
$$\cdots$$
$$v(t_{l(k_l)}) = v(t_{l(k_l-1)})\, e^{-(t_{l(k_l)} - t_{l(k_l-1)})/\tau_m^i} + a, \qquad (A.1)$$

up to the time $\delta t_l \equiv t_{l(k_l)}$ when $v(\delta t_l) = 0$. This time is given by

$$\delta t_l = \tau_m^i \ln\left[\frac{|v_0|}{a} - \sum_{i=1}^{k_l-1} e^{(t_{l(i)} - t_l^E)/\tau_m^i}\right]. \qquad (A.2)$$
At that time $v(t)$ in equation 2.1 changes its sign. This causes exponential growth described by the recursive equations

$$v(t_{l(k_l+1)}) = b \cdot e^{(t_{l(k_l+1)} - t_{l(k_l)})/\tau_m^i} + a,$$
$$v(t_{l(k_l+2)}) = v(t_{l(k_l+1)})\, e^{(t_{l(k_l+2)} - t_{l(k_l+1)})/\tau_m^i} + a,$$
$$\cdots$$
$$v(t_{l(n_l)}) = v(t_{l(n_l-1)})\, e^{(t_{l(n_l)} - t_{l(n_l-1)})/\tau_m^i} + a. \qquad (A.3)$$

The I-cell fires between receiving the spikes enumerated as $n_l$ and $n_l + 1$. Thus the time of firing $t_f$ is determined from the condition

$$v(t_f) = v(t_{l(n_l)})\, e^{(t_f - t_{l(n_l)})/\tau_m^i} \equiv v_{th}. \qquad (A.4)$$
This yields $t_f = t_{l(n_l)} + \tau_m^i \ln(v_{th}/v(t_{l(n_l)}))$. Inserting the values of $v(t_{l(n_l)})$ and $t_{l(k_l)}$, and using the fact that $W_l^I = t_f - t_l^E$, we obtain equation 3.3. We should mention that $v(t_{l(n_l)}) \approx v_{th}$; hence the logarithmic term in the formula for $t_f$ is very small. Thus, an approximate formula for $W_l^I$ is $W_l^I \approx t_{l(n_l)} - t_l^E$. We will make use of this fact in appendix D. Nevertheless, one should be cautious about dropping the logarithmic term too quickly; this term is important in stability considerations, that is, in the calculation of the stability matrix $A$.
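The two-stage recursion above is easy to exercise numerically. The sketch below (our own check, with arbitrary illustrative parameter values and regularly spaced inputs $t_{l(i)} - t_l^E = i\,\Delta t$, as in the synchronous state) iterates the decay recursion of equation A.1 until $v$ crosses zero, then the growth recursion of equation A.3 until threshold, and verifies that the crossing index agrees with the closed-form condition extracted from equation A.2:

```python
import math

# Illustrative parameters (our choice, not taken from the paper).
tau = 10.0   # tau_m^i, membrane time constant of the I-cell
a = 0.12     # depolarization per excitatory input while v < 0
b = 0.3      # value just after the sign change at delta t_l
v0 = -1.0    # reset potential (negative)
vth = 1.0    # firing threshold
dt = 1.0     # spacing of the incoming spikes (R/N in the synchronous state)

# Decay phase, equation A.1: v(t_{l(k)}) = v(t_{l(k-1)}) e^{-dt/tau} + a.
v, k = v0, 0
while v < 0:
    v = v * math.exp(-dt / tau) + a
    k += 1
k_l = k  # number of inputs needed to cross zero

# Growth phase, equation A.3: v(t_{l(k)}) = v(t_{l(k-1)}) e^{+dt/tau} + a,
# starting from v = b just after the sign change.
v, n = b, k_l
while v < vth:
    v = v * math.exp(dt / tau) + a
    n += 1
n_l = n  # index of the last input received before firing

# Closed-form crossing condition from equation A.2 with regular spacing:
# the first k with e^{k dt/tau} >= |v0|/a - sum_{i=1}^{k-1} e^{i dt/tau}.
def crossing_index():
    s, k = 0.0, 1
    while math.exp(k * dt / tau) < abs(v0) / a - s:
        s += math.exp(k * dt / tau)
        k += 1
    return k

print(k_l, n_l, crossing_index())
```

The recursion and the closed-form condition return the same crossing index $k_l$.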
Multispikes and Synchronization for Networks with Delays
Appendix B
In this appendix we show how to derive the $W_l^E$ function in equation 3.4. At time $t = t_l^E$ the $l$th E-cell fires and its potential is reset to $v_0$ ($< 0$). We solve equations 2.1 and 2.3 with two inhibitory inputs coming from the local I-cell, separated by the time $W_l^I$, with an initial condition $v(t_l^E) = v_0$. For simplicity, we assume that the time $t_0$ when the potential of the E-cell crosses zero is larger than $t_l^E + W_l^I$. The former is determined from the condition $v(t_0) = 0$, which produces

$$0 = v_0\, e^{-(t_0 - t_l^E)/\tau_m^e} + c\left(1 - e^{-(t_0 - t_l^E)/\tau_m^e}\right) - \frac{g\tau_s^i}{\tau_s^i - \tau_m^e}\left[e^{-(t_0 - t_l^E)/\tau_s^i} - e^{-(t_0 - t_l^E)/\tau_m^e} + e^{-(t_0 - t_l^E - W_l^I)/\tau_s^i} - e^{-(t_0 - t_l^E - W_l^I)/\tau_m^e}\right]. \qquad (B.1)$$

In the limit $\tau_m^e/\tau_s^i \ll 1$ this equation yields

$$t_0 - t_l^E \approx \tau_s^i \ln\left[\frac{g}{c}\left(1 + e^{W_l^I/\tau_s^i}\right)\right], \qquad (B.2)$$
and in the limit $\tau_m^e/\tau_s^i \gg 1$ (with $c/g \ll 1$),

$$t_0 - t_l^E \approx \tau_m^e \ln\left[\frac{g\tau_s^i}{c\tau_m^e}\left(1 + e^{W_l^I/\tau_m^e}\right)\right]. \qquad (B.3)$$

At $t = t_0$, $v(t)$ changes sign and begins to grow exponentially, reaching the threshold $v_{th}$. Solving equations 2.1 and 2.3 as before with an initial condition $v(t_0) = 0$, we obtain in the limit $\tau_m^e/\tau_s^i \ll 1$ the time $t_{th}$ at which $v(t_{th}) = v_{th}$:

$$t_{th} - t_0 \approx \tau_m^e \ln\left(1 + \frac{v_{th}\tau_s^i}{c\tau_m^e}\right). \qquad (B.4)$$

In the limit $\tau_m^e/\tau_s^i \gg 1$, the time $t_{th}$ is given by

$$t_{th} - t_0 \approx \tau_m^e \ln\left(1 + \frac{v_{th}}{c}\right). \qquad (B.5)$$
Since $W_l^E = t_{th} - t_l^E - W_l^I$, we can neglect the $t_{th} - t_0$ part in the limit $\tau_m^e/\tau_s^i \ll 1$, which then yields

$$W_l^E \approx t_0 - t_l^E - W_l^I = \tau_s^i \ln\left[\frac{g}{c}\left(1 + e^{-W_l^I/\tau_s^i}\right)\right]. \qquad (B.6)$$
We cannot neglect the $t_{th} - t_0$ part in the limit $\tau_m^e/\tau_s^i \gg 1$, because both parts are of the same order. As a result, in this limit we obtain

$$W_l^E \approx \tau_m^e \ln\left[\frac{g\tau_s^i}{c\tau_m^e}\left(1 + \frac{v_{th}}{c}\right)\left(1 + e^{-W_l^I/\tau_m^e}\right)\right]. \qquad (B.7)$$

Formulas B.6 and B.7 differ mainly by the substitution $\tau_s^i \leftrightarrow \tau_m^e$; this is the reason that the stability diagrams in the two regimes look the same. Both formulas are similar in form to that adopted by Ermentrout and Kopell (1998).

Appendix C
In this appendix we demonstrate that the probability distribution $P_1(x)$ with delta functions (see equation 3.6) and the gaussian distribution $P_2 = \exp[-(x - x_0)^2/2f^2]$ are equivalent in the limit $f/x_0 \ll 1$, which is the limit of weak disorder. To prove this statement, it is sufficient to show that the $n$th moment of $x$ computed using both distributions is the same in this limit. For the gaussian distribution we have

$$\langle x^n \rangle_{P_2} = Z^{-1}\int_{-\infty}^{\infty} dx\, P_2(x)\, x^n = Z^{-1}\int_{-\infty}^{\infty} dx\, x^n\, e^{-(x - x_0)^2/2f^2}, \qquad (C.1)$$

where $Z$ is the normalization factor. This integral can be computed using

$$x^n\, e^{-(x - x_0)^2/2f^2} \equiv e^{-x_0^2/2f^2}\left(f^2\frac{\partial}{\partial x_0}\right)^n e^{x_0^2/2f^2}\, e^{-(x - x_0)^2/2f^2}. \qquad (C.2)$$

The result is

$$\langle x^n \rangle_{P_2} = f^{2n}\, e^{-x_0^2/2f^2}\left(\frac{\partial}{\partial x_0}\right)^n e^{x_0^2/2f^2}, \qquad (C.3)$$
which yields $\langle x \rangle_{P_2} = x_0$, $\langle x^2 \rangle_{P_2} = x_0^2(1 + f^2/x_0^2)$, $\langle x^3 \rangle_{P_2} = x_0^3(1 + 3f^2/x_0^2)$, and so forth. On the other hand, for the $P_1$ distribution we have

$$\langle x^n \rangle_{P_1} = \frac{1}{2}\int_{-\infty}^{\infty} dx\, x^n\left[\delta(x - x_0 + f) + \delta(x - x_0 - f)\right] = x_0^n\left[1 + \frac{n(n-1)f^2}{2x_0^2}\right] + O(f^4/x_0^4). \qquad (C.4)$$
Thus, both probability distributions give the same results in the limit $f/x_0 \ll 1$, which in our case corresponds to the limit $fN/R \ll 1$. This is the reason that we have used the simple $P_1$ distribution with delta functions instead of the more complicated gaussian distribution. In that way we avoided complex multidimensional integrals in equation 3.7. Notice that our proof is valid for a random variable $x$ assuming both positive and negative values. One can also show that the above conclusion is valid for a random variable defined only on the positive interval $0 < x < \infty$, although the calculations in this case are more tedious.

Appendix D
In this appendix we show how to derive the average value of any function when the probability distribution is given by equations 3.5 and 3.6. We also show how to obtain the average numbers of inputs $k_l$ and $n_l$ needed for crossing the zero potential value and the threshold, respectively. For an arbitrary function $H(L_{l,l(1)}, \ldots, L_{l,l(k)}; \{\Delta_l\})$ we want to calculate its average value $\langle H \rangle$, that is, carry out the multidimensional integrals in equation 3.7. We obtain

$$\langle H \rangle = \frac{1}{2^k\cosh^k(f/s)}\sum_{a_1,\ldots,a_k=1}^{2} H\!\left(\frac{R}{N} + (-1)^{a_1}f, \ldots, \frac{kR}{N} + (-1)^{a_k}f;\ \{\Delta_l\}\right)\exp\!\left(-\sum_{j=1}^{k}(-1)^{a_j}f/s\right), \qquad (D.1)$$

where the prefactor comes from the normalization of the probability distribution $\rho(\{L\})$. Using this formula and the facts that $t_{l(k)} = t_{l(k)}^E + L_{l,l(k)}$ and $t_k^E - t_1^E \equiv \Delta_{k-1}$, we can easily write formulas for $\langle W^I \rangle$ and $\langle W^E \rangle$ based on equations 3.3 and 3.4. Their forms are as follows:

$$\langle W_l^I \rangle = \langle t_{l(n_l)}^E \rangle - \langle t_l^E \rangle + \frac{1}{2\cosh(f/s)}\sum_{a=1}^{2}\left[\frac{\langle n_l \rangle R}{N} + (-1)^a f\right]e^{-(-1)^a f/s} - \frac{\tau_m^i}{2^{n_l}\cosh^{n_l}(f/s)}\sum_{a_1,\ldots,a_{n_l}=1}^{2}\exp\!\left(-\sum_{j=1}^{n_l}(-1)^{a_j}f/s\right) \times \ln\!\Bigg(\frac{b}{v_{th}}\exp\!\left[\left(n_l R/N + (-1)^{a_{n_l}}f - \delta t_l + \Delta_{l+n_l-1} - \Delta_{l-1}\right)/\tau_m^i\right] + \frac{a}{v_{th}}\sum_{i=k_l}^{n_l}\exp\!\left[\left((n_l - i)R/N + [(-1)^{a_{n_l}} - (-1)^{a_i}]f + \Delta_{l+n_l-1} - \Delta_{l+i-1}\right)/\tau_m^i\right]\Bigg), \qquad (D.2)$$
and

$$\langle W_l^E \rangle = \frac{\tau_s^i}{2^{n_l}\cosh^{n_l}(f/s)}\sum_{a_1,\ldots,a_{n_l}=1}^{2}\exp\!\left(-\sum_{j=1}^{n_l}(-1)^{a_j}f/s\right) \times \ln\!\left[\frac{g}{c}\left(1 + \exp\!\left(-W_l^I\!\left(\frac{R}{N} + (-1)^{a_1}f, \ldots, \frac{n_l R}{N} + (-1)^{a_{n_l}}f;\ \{\Delta\}\right)\Big/\tau_s^i\right)\right)\right], \qquad (D.3)$$

where $\delta t_l$ is given by equation A.2. We can also write the sum $\langle W^I \rangle + \langle W^E \rangle = \left\langle \tau_s^i \ln\!\left[\frac{g}{c}\left(1 + \exp(W_l^I/\tau_s^i)\right)\right]\right\rangle$, which is equation 3.11 in the text. Note that $\langle W^E \rangle$ and $\langle W^I \rangle + \langle W^E \rangle$ differ only by the sign of $W^I$ present in the logarithmic term; compare equations D.3 and 3.11. We can also derive the average values of $k_l$ and $n_l$, defined below, in the synchronous state using equation D.1. From equation A.1, we obtain
$$\langle v(t_{l(k_l)}) \rangle = v_0\left\langle e^{-L_{l,l(k_l)}/\tau_m^i}\right\rangle + a\sum_{i=1}^{k_l}\left\langle e^{-(L_{l,l(k_l)} - L_{l,l(i)})/\tau_m^i}\right\rangle \qquad (D.4)$$

for the synchronous state with $\Delta_1 = \cdots = \Delta_{N-1} = 0$. In deriving equation D.4, we have again used the relation $t_{l(k)} = t_{l(k)}^E + L_{l,l(k)}$. Invoking equation D.1 yields

$$\left\langle e^{-L_{l,l(k_l)}/\tau_m^i}\right\rangle = \frac{\cosh\!\left(\frac{f}{s} + \frac{f}{\tau_m^i}\right)}{\cosh\frac{f}{s}}\left\langle e^{-k_l R/N\tau_m^i}\right\rangle \qquad (D.5)$$

and

$$\left\langle e^{-(L_{l,l(k_l)} - L_{l,l(i)})/\tau_m^i}\right\rangle = \frac{\cosh\!\left(\frac{f}{s} + \frac{f}{\tau_m^i}\right)\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)}{\cosh^2\frac{f}{s}}\left\langle e^{-(k_l - i)R/N\tau_m^i}\right\rangle. \qquad (D.6)$$
The condition $\langle v(t_{l(k_l)}) \rangle = 0$ determines the average value of $k_l$, denoted by $\bar{k}$, which we define as $\bar{k} = (N\tau_m^i/R)\ln[\langle \exp(k_l R/N\tau_m^i) \rangle]$. The formula for $\bar{k}$ takes the form

$$\bar{k} = \frac{N\tau_m^i}{R}\ln\left(\frac{\left[|v_0|\left(1 - e^{-R/N\tau_m^i}\right)\cosh\frac{f}{s} + a\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)\right]\cosh\!\left(\frac{f}{s} + \frac{f}{\tau_m^i}\right)}{a\left[\cosh^2\frac{f}{s} + e^{-R/N\tau_m^i}\sinh^2\frac{f}{\tau_m^i}\right]}\right). \qquad (D.7)$$
In a similar manner we obtain the average value of $n_l$, denoted by $\bar{n}$, from the condition $\langle v(t_{l(n_l)}) \rangle = v_{th}$. Our $\bar{n}$, defined as $\bar{n} = (N\tau_m^i/R)\ln[\langle \exp(n_l R/N\tau_m^i) \rangle]$, takes the form

$$\bar{n} = \frac{N\tau_m^i}{R}\ln\left(\frac{\left[(v_{th} - a)\left(1 - e^{-R/N\tau_m^i}\right)\cosh^2\frac{f}{s} + a\cosh\!\left(\frac{f}{s} + \frac{f}{\tau_m^i}\right)\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)\right]}{\left[b + (a - b)e^{-R/N\tau_m^i}\right]\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)} \times \frac{\left[|v_0|\left(1 - e^{-R/N\tau_m^i}\right)\cosh\frac{f}{s} + a\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)\right]}{a\left[\cosh^2\frac{f}{s} + e^{-R/N\tau_m^i}\sinh^2\frac{f}{\tau_m^i}\right]}\right). \qquad (D.8)$$

From the average values $\bar{k}$ and $\bar{n}$, we can compute $\langle W_l^I(\{\Delta = 0\}) \rangle$ of a synchronous state depending only on the parameters of the network architecture. This function plays an important role in the discussion of stability. Making use of the fact that the fourth, logarithmic term in equation D.2 is small (cf. appendix A), we can neglect it. Thus, in the synchronous state, equation D.2 has the approximate form

$$\langle W_l^I \rangle \approx \frac{\langle n_l \rangle R}{N} - f\tanh\frac{f}{s}. \qquad (D.9)$$
We expect that for large $n_l$ its higher-order cumulants are small, so the relation $\langle n_l \rangle \approx \bar{n}$ holds. Using this fact, we can substitute $\bar{n}$ for $\langle n_l \rangle$ in the above equation. We obtain

$$\langle W^I \rangle \approx \tau_m^i\ln\left(\frac{\left[(v_{th} - a)\left(1 - e^{-R/N\tau_m^i}\right)\cosh^2\frac{f}{s} + a\cosh\!\left(\frac{f}{s} + \frac{f}{\tau_m^i}\right)\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)\right]}{\left[b + (a - b)e^{-R/N\tau_m^i}\right]\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)} \times \frac{\left[|v_0|\left(1 - e^{-R/N\tau_m^i}\right)\cosh\frac{f}{s} + a\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)\right]}{a\left[\cosh^2\frac{f}{s} + e^{-R/N\tau_m^i}\sinh^2\frac{f}{\tau_m^i}\right]}\right) - f\tanh\frac{f}{s}. \qquad (D.10)$$

Note that $\langle W^I \rangle$ is of the order of the membrane time constant $\tau_m^i$, and additionally that $\langle W^I \rangle$ can be arbitrarily small provided the disorder parameter $f$ is sufficiently large.
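The averaging rule D.1 can also be checked by direct sampling. The sketch below (ours, with arbitrary parameter values) draws the delay $L = kR/N + \varepsilon f$, $\varepsilon = \pm 1$, with the weight $e^{-\varepsilon f/s}/2\cosh(f/s)$ used in equation D.1, and compares the Monte Carlo estimate of $\langle e^{-L/\tau_m^i}\rangle$ with the closed form D.5:

```python
import math, random

# Monte Carlo check (our own, arbitrary parameter values) of the weighted
# two-point average in equation D.1 against the closed form in equation D.5.
f, s, tau = 0.5, 1.0, 10.0
kRN = 2.0  # k_l * R / N
rng = random.Random(1)

p_minus = math.exp(f / s) / (2.0 * math.cosh(f / s))  # P(eps = -1)
n_samples = 100_000
acc = 0.0
for _ in range(n_samples):
    eps = -1.0 if rng.random() < p_minus else 1.0
    L = kRN + eps * f
    acc += math.exp(-L / tau)
mc = acc / n_samples

# Closed form D.5: cosh(f/s + f/tau) / cosh(f/s) * exp(-kR/(N tau)).
closed = math.cosh(f / s + f / tau) / math.cosh(f / s) * math.exp(-kRN / tau)
print(mc, closed)
```

The sampled average matches the closed form to well within the Monte Carlo error.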
Appendix E
In this appendix we sketch how to compute the stability diagrams depicted in Figure 2, as well as how to derive equation 3.13. The formulas below are provided only in the regime $\tau_m^e/\tau_s^i \ll 1$. In the opposite regime, $\tau_m^e/\tau_s^i \gg 1$, the formulas are the same with the substitution $\tau_s^i \mapsto \tau_m^e$. According to the Gershgorin theorem (Horn & Johnson, 1985), the $l$th eigenvalue $\lambda_l$ of the stability matrix $A$ lies in a disk such that

$$|\lambda_l - A_{l,l}| \le \sum_{m \ne l}|A_{l,m}|, \qquad (E.1)$$

where $A_{l,m} \equiv \partial\bar{\Delta}_l/\partial\Delta_m|_{\Delta=0}$. Using the inequality $|\lambda_l - A_{l,l}| \ge |\lambda_l| - |A_{l,l}|$, we obtain $|\lambda_l| \le \sum_{m=1}^{N-1}|A_{l,m}|$. The system is stable when $|\lambda_l| < 1$ for every $l$. Thus, a sufficient condition for stability is

$$\sum_{m=1}^{N-1}|A_{l,m}| < 1 \qquad (E.2)$$
for every $l = 1, \ldots, N - 1$. After some algebra the sum in equation E.2 can be found as

$$\sum_{m=1}^{N-1}|A_{l,m}| = \frac{2}{1 + e^{-\bar{n}R/N\tau_s^i}}\left|1 + \frac{b|v_0|}{av_{th}}\left\langle e^{[L_{l,l(n_l)} - 2L_{l,l(k_l)}]/\tau_m^i}\right\rangle - \frac{b}{v_{th}}\left\langle e^{[L_{l,l(n_l)} - L_{l,l(k_l)}]/\tau_m^i}\right\rangle\right| + \frac{b}{v_{th}}\left\langle e^{[L_{l,l(n_l)} - L_{l,l(k_l)}]/\tau_m^i}\right\rangle - \frac{1}{2}e^{-\bar{n}R/N\tau_s^i}. \qquad (E.3)$$
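The logic of the Gershgorin bound is easy to illustrate numerically. In the sketch below the matrix is an arbitrary stand-in, not the stability matrix of the model: whenever every absolute row sum is below 1 (condition E.2), all eigenvalues lie inside the unit circle and an iterated perturbation decays:

```python
# Illustration (ours) of the sufficient stability condition E.2: if every
# absolute row sum of A is below 1, the Gershgorin theorem puts every
# eigenvalue inside the unit circle, so perturbations Delta -> A Delta decay.
# The matrix below is an arbitrary example, not the A of the paper.
A = [
    [0.4, -0.2, 0.1],
    [0.1, 0.5, -0.3],
    [-0.2, 0.1, 0.4],
]

row_sums = [sum(abs(a) for a in row) for row in A]
assert all(s < 1 for s in row_sums)  # condition E.2 holds for this example

# Iterate Delta -> A Delta; the perturbation shrinks geometrically,
# bounded by (max row sum)^k in the infinity norm.
delta = [1.0, -1.0, 1.0]
for _ in range(50):
    delta = [sum(A[i][j] * delta[j] for j in range(3)) for i in range(3)]

print(max(abs(d) for d in delta))  # well below the bound 0.9**50
```

The same argument applied to the model's $A_{l,m}$ is what turns the algebraic condition E.3 into the stability diagrams of Figure 2.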
Next, invoking equations D.1, D.5, and D.6, we can find the averages in the above equation. Combining all this, equation E.2 decomposes into two complementary cases, each described by a pair of inequalities, because of the modulus present in equation E.3. We discuss the first case in detail; later we comment on the complementary second case. The inequalities for the first case are given by

$$e^{(\bar{n} - \bar{k})R/N\tau_m^i} > \frac{v_{th}\cosh^2\frac{f}{s}}{b\cosh\!\left(\frac{f}{s} + \frac{f}{\tau_m^i}\right)\cosh\!\left(\frac{f}{s} - \frac{f}{\tau_m^i}\right)}\, e^{-\bar{n}R/N\tau_s^i} \qquad (E.4)$$
and " e
nR / Ntsi
1C2
f f f b|v0 | cosh( s C 2 tmi ) cosh( s ¡ avth cosh2 sf
f f f b cosh( s C tmi ) cosh( s ¡ ¡ vth cosh2 sf
f
tmi
)
f
tmi
)
e( n¡2k)R / Ntm i
# ( n¡k )R / Ntmi
e
< 2,
(E.5)
where $\bar{k}$ and $\bar{n}$ are given by equations D.7 and D.8. Inequalities E.4 and E.5 must be satisfied simultaneously. Notice that after inserting the equations for $\bar{k}$ and $\bar{n}$, $\exp(\bar{n}R/N\tau_s^i)$ becomes a function of the ratios $\tau_m^i/\tau_s^i$ and $R/N\tau_m^i$, and $\exp(\bar{n}R/N\tau_m^i)$ becomes a function of the ratio $R/N\tau_m^i$, for given values of the other parameters. This means that we can express a sufficient condition for stability in terms of $\tau_m^i/\tau_s^i$ and $R/N\tau_m^i$. In the ordered network case, $f = 0$, equations E.4 and E.5 reduce to

$$\tau_m^i/\tau_s^i > \frac{1}{K}\ln\left(\frac{v_{th}\left[b + (a - b)e^{-R/N\tau_m^i}\right]}{b\left[(v_{th} - a)\left(1 - e^{-R/N\tau_m^i}\right) + a\right]}\right) \qquad (E.6)$$
and

$$\tau_m^i/\tau_s^i < -\frac{1}{K}\ln\left(\frac{1}{2}\left[1 + \frac{b\left[(v_{th} - a)\left(1 - e^{-R/N\tau_m^i}\right) + a\right]}{v_{th}\left[b + (a - b)e^{-R/N\tau_m^i}\right]} \times \frac{\left[|v_0|\left(1 + e^{-R/N\tau_m^i}\right) - a\right]}{\left[|v_0|\left(1 - e^{-R/N\tau_m^i}\right) + a\right]}\right]\right), \qquad (E.7)$$

with

$$K = \ln\left(\frac{\left[(v_{th} - a)\left(1 - e^{-R/N\tau_m^i}\right) + a\right]\left[|v_0|\left(1 - e^{-R/N\tau_m^i}\right) + a\right]}{a\left[b + (a - b)e^{-R/N\tau_m^i}\right]}\right). \qquad (E.8)$$
The two curves on the right-hand sides of equations E.6 and E.7 cross at the point with coordinates

$$R/N\tau_m^i \approx \ln\left[(a + 2b)|v_0|/2ab\right] \qquad (E.9)$$

and

$$\tau_m^i/\tau_s^i \approx -\ln\left[1 - 2a^2/(a + 2b)|v_0|\right]\big/\ln\left(v_{th}|v_0|/ab\right), \qquad (E.10)$$

denoted as $x_0$ and $y_0$ in Figure 2, respectively. Above this value of $R/N\tau_m^i$ the set of inequalities E.6 and E.7 is satisfied. The complementary case differs from the first by inverting the inequality sign in equation E.4, multiplying the third term in the brackets in equation E.5 by 3, and setting the right side of that inequality equal to zero. In this case we also obtain a stability region for $R/N\tau_m^i$ greater than the value given by equation E.9. Combining these two cases, we find the stability region shaded in Figure 2A. Notice that the ratio $\tau_m^i/\tau_s^i$ must be small in order to preserve stability.
The disordered network case with $f > 0$ is algebraically more complicated to analyze. In this case there also exists some minimum value of $R/N\tau_m^i$, of the order of that in equation E.9 for weak disorder, below which the doublet configuration is unstable. The disordered case differs from the ordered case by the appearance of a vertical asymptote in Figure 2B, which crosses the horizontal axis at the $(R/N\tau_m^i)_{cr}$ given by equation 3.13. The reason for the appearance of the asymptote is that the bracketed term in equation E.5 can now change sign. For small $R/N\tau_m^i$ the bracketed term is positive, whereas for $R/N\tau_m^i \to \infty$ it is negative. This does not happen for the ordered network; in that case the bracketed term is always positive. When the bracketed term is negative, inequality E.5 is satisfied for every $\tau_m^i/\tau_s^i$ (recall that this ratio comes from the $\exp(\bar{n}R/N\tau_s^i)$ term). Setting the bracketed term equal to zero and substituting for $\bar{n}$ and $\bar{k}$ the values given by equations D.8 and D.7 determines the vertical asymptote in Figure 2B. In the limit $R/N\tau_m^i \gg f/\tau_m^i \gg 1$ we obtain the $(R/N\tau_m^i)_{cr}$ given by equation 3.13. Notice that for the ordered network with $f = 0$ the asymptote shifts to infinity (that is, it disappears), which corresponds to a decreased stability region. For the triplet configuration, analysis shows that in the ordered network case the condition E.2 cannot be satisfied for any $\tau_m^i/\tau_s^i$ and $R/N\tau_m^i$. In the disordered network case, we can show that there is some stability region for $R/N\tau_m^i \gg 1$.

Acknowledgments
We thank B. Ermentrout, C. Chow, and particularly R. Traub for discussions. The work was supported by NSF DMS 9706694 and NIMH R01 MH47150. J. K. thanks the Fulbright Foundation for partial support.

References

Abbott, L. F., Sen, K., Varela, J. A., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Physical Review E, 48, 1483–1490.
Ahissar, E., Vaadia, E., Ahissar, M., Bergman, H., Arieli, A., & Abeles, M. (1992). Dependence of cortical plasticity on correlated activity of single neurons and on behavioral context. Science, 257, 1412–1415.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1985). Spin-glass models of neural networks. Physical Review A, 32, 1007–1018.
Andersen, P., Silfvenius, H., Sundberg, S., Sveen, O., & Wigstrom, H. (1978). Functional characteristics of unmyelinated fibres in the hippocampal cortex. Brain Res., 144, 11–18.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31–34.
Bragin, A., Jando, G., Nadasdy, Z., Hetke, J., Wise, K., & Buzsaki, G. (1995). Gamma (40–100 Hz) oscillations in the hippocampus of the behaving rat. J. Neurosci., 15(1), 47–60.
Bressler, S. L., Coppola, R., & Nakamura, R. (1993). Episodic multiregional cortical coherence at multiple frequencies during visual task performance. Nature, 366, 153–156.
Buhl, E. H., Halasy, K., & Somogyi, P. (1994). Diverse sources of hippocampal unitary inhibitory postsynaptic potentials and the number of synaptic release sites. Nature, 368, 823–828.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns of realistic network models. J. Comp. Neurosci., 3, 91–110.
Campbell, S. R., & Wang, D. (1998). Relaxation oscillators with time delay coupling. Preprint.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comp. Neurosci., 5, 407–420.
Crook, S. M., Ermentrout, G. B., Vanier, M. C., & Bower, J. M. (1997). The role of axonal delay in the synchronization of networks of coupled cortical oscillators. J. Comp. Neurosci., 4, 161–172.
Engel, A. K., Konig, P., & Singer, W. (1991). Direct physiological evidence for scene segmentation by temporal coding. Proc. Natl. Acad. Sci. USA, 88, 9136–9140.
Engel, A. K., Kreiter, A. K., Konig, P., & Singer, W. (1991). Synchronization of oscillatory neuronal responses between striate and extrastriate visual cortical areas of the cat. Proc. Natl. Acad. Sci. USA, 88, 6048–6052.
Ermentrout, G. B. (1996). Type I membranes, phase resetting curves and synchrony. Neural Comp., 8, 979–1002.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Natl. Acad. Sci. USA, 95, 1259–1264.
Ernst, U., Pawelzik, K., & Geisel, T. (1995). Synchronization induced by temporal delays in pulse-coupled oscillators. Physical Review Lett., 74, 1570–1573.
Freund, T. F., & Buzsaki, G. (1996). Interneurons of the hippocampus. Hippocampus, 6, 347–470.
Fries, P., Roelfsema, P. R., Engel, A. K., Konig, P., & Singer, W. (1997). Synchronization of oscillatory responses in visual cortex correlates with perception in interocular rivalry. Proc. Natl. Acad. Sci. USA, 94, 12699–12704.
Friesen, W. (1994). Reciprocal inhibition, a mechanism underlying oscillatory animal movements. Neurosci. Behavior, 18, 547–553.
Gerstner, W. (1995). Time structure of the activity in neural network models. Physical Review E, 51, 738–758.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comp., 8, 1653–1676.
Gray, C. M., Konig, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Horn, R. A., & Johnson, C. A. (1985). Matrix analysis. Cambridge: Cambridge University Press.
Jefferys, J. G. R., Traub, R. D., & Whittington, M. A. (1996). Neuronal networks for induced "40 Hz" rhythms. Trends Neurosci., 19, 202–208.
Joliot, M., Ribary, U., & Llinas, R. (1994). Human oscillatory brain activity near 40 Hz coexists with cognitive temporal binding. Proc. Natl. Acad. Sci. USA, 91, 11748–11751.
Knight, B. W. (1972). Dynamics of encoding in a population of neurons. J. Gen. Physiol., 59, 734–766.
Knowles, W., & Schwartzkroin, P. (1981). Local circuit synaptic interactions in hippocampal brain slices. J. Neurosci., 1, 318–322.
Koch, C. (1997). Computation and the single neuron. Nature, 385, 207–210.
Konig, P., Engel, A. K., & Singer, W. (1995). Relation between oscillatory activity and long-range synchronization in cat visual cortex. Proc. Natl. Acad. Sci. USA, 92, 290–294.
Konig, P., & Schillen, T. B. (1991). Stimulus-dependent assembly formation of oscillatory responses: I. Synchronization. Neural Comp., 3, 155–166.
Llinas, R., & Ribary, U. (1993). Coherent 40-Hz oscillation characterizes dream state in humans. Proc. Natl. Acad. Sci. USA, 90, 2078–2081.
Lytton, W., & Sejnowski, T. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory neurons. J. Neurophysiol., 66, 1059–1097.
Magee, J. C., & Johnston, D. (1997). A synaptically controlled, associative signal for Hebbian plasticity in hippocampal neurons. Science, 275, 209–213.
Markram, H., Lubke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Markram, H., & Tsodyks, M. (1996). Redistribution of synaptic efficacy between neocortical pyramidal neurons. Nature, 382, 807–810.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 5, 115–133.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Roelfsema, P. R., Engel, A. K., Konig, P., & Singer, W. (1997). Visuomotor integration is associated with zero time-lag synchronization among cortical areas. Nature, 385, 157–161.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Steriade, M., Amzica, F., & Contreras, D. (1996). Synchronization of fast (30–40 Hz) spontaneous cortical rhythms during brain activation. J. Neurosci., 16(1), 392–417.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74.
Tamamaki, N., & Nojyo, T. (1990). Disposition of the slab-like modules formed by axon branches originating from single CA1 pyramidal neurons in the rat hippocampus. J. Comp. Neurol., 291, 509–519.
Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled slow inhibitory neurons. Physica D, 117, 241–275.
Terman, D., & Rubin, J. (1998). Geometric analysis of population rhythms in synaptically coupled neuronal networks. Neural Comput. (in press).
Thomson, A. M., & Deuchars, J. (1994). Temporal and spatial properties of local circuits in neocortex. Trends Neurosci., 17, 119–126.
Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1999). Fast oscillations in cortical circuits. Cambridge, MA: MIT Press.
Traub, R. D., & Miles, R. (1991). Neuronal networks of the hippocampus. Cambridge: Cambridge University Press.
Traub, R. D., & Miles, R. (1995). Pyramidal cell-to-inhibitory cell spike transduction explicable by active dendritic conductances in inhibitory cell. J. Comp. Neurosci., 2, 291–298.
Traub, R. D., Whittington, M. A., Buhl, E. H., Jefferys, J. G. R., & Faulkner, H. J. (1999). On the mechanism of the gamma-beta frequency shift in neuronal oscillations induced in rat hippocampal slices by tetanic stimulation. J. Neurosci., 19, 1088–1105.
Traub, R. D., Whittington, M. A., Colling, S., Buzsaki, G., & Jefferys, J. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
Traub, R. D., Whittington, M. A., Stanford, M., & Jefferys, J. G. R. (1996). A mechanism for generation of long-range synchronous fast oscillations in the cortex. Nature, 383, 621–624.
Tsodyks, M., Mitkov, I., & Sompolinsky, H. (1993). Pattern of synchrony in inhomogeneous networks of oscillators with pulse interactions. Physical Review Lett., 71, 1280–1283.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comp., 6, 795–836.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
van Vreeswijk, C. (1996). Partial synchrony in populations of pulse-coupled oscillators. Physical Review E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neuronal firing. J. Comp. Neurosci., 1, 313–321.
Wang, X. J., & Buzsaki, G. (1996). Gamma oscillations by synaptic inhibition in an interneuronal network. J. Neurosci., 16, 6402–6413.
Wang, X. J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Comp., 4, 84–97.
Wang, X. J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalami nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.
White, J., Chow, C., Ritt, J., Soto, C., & Kopell, N. (1998). Synchronous oscillation in heterogeneous, mutually inhibited neurons. J. Comp. Neurosci., 5, 5–16.
Whittington, M. A., Stanford, I. M., Colling, S. B., Jefferys, J. G. R., & Traub, R. D. (1997). Spatiotemporal patterns of γ frequency oscillations tetanically induced in the rat hippocampal slice. J. Physiol., 502, 591–607.
Whittington, M. A., Traub, R. D., Faulkner, H. J., Jefferys, J. G. R., & Chettiar, K. (1998). Morphine disrupts long-range synchrony of gamma oscillations in hippocampal slices. Proc. Natl. Acad. Sci. USA, 95, 5807–5811.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.

Received January 11, 1999; accepted September 10, 1999.
LETTER
Communicated by Carl van Vreeswijk
Synchrony in Heterogeneous Networks of Spiking Neurons

L. Neltner, D. Hansel
Laboratoire de Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France

G. Mato
Comisión Nacional de Energía Atómica and CONICET, Centro Atómico Bariloche and Instituto Balseiro (CNEA and UNC), 8400 San Carlos de Bariloche, R.N., Argentina

C. Meunier
Laboratoire de Neurophysique et Physiologie du Système Moteur, Université René Descartes, 75270 Paris Cedex 06, France

The emergence of synchrony in the activity of large, heterogeneous networks of spiking neurons is investigated. We define the robustness of synchrony by the critical disorder at which the asynchronous state becomes linearly unstable. We show that at low firing rates, synchrony is more robust in excitatory networks than in inhibitory networks, but excitatory networks cannot display any synchrony when the average firing rate becomes too high. We introduce a new regime where all inputs, external and internal, are strong and have opposite effects that cancel each other when averaged. In this regime, the robustness of synchrony is strongly enhanced, and robust synchrony can be achieved at a high firing rate in inhibitory networks. On the other hand, in excitatory networks, synchrony remains limited in frequency due to the intrinsic instability of strong recurrent excitation.

1 Introduction
Neural Computation 12, 1607–1641 (2000) © 2000 Massachusetts Institute of Technology

Synchrony in large-scale neural systems is evidenced by significant fluctuations in time of the ensemble activity. Complementary experimental techniques—EEG, magnetoencephalography, recording of local field potentials, and multiunit and multielectrode recordings—have shown it to be a ubiquitous feature of neural activity in the central nervous system of vertebrates. It occurs, for instance, in the limbic, motor, olfactory, and visual systems. Different manifestations of synchrony have been shown to be associated with specific physiological states. Synchrony is also observed in pathological conditions (e.g., epileptic seizures, Parkinsonian disease) (Gray, 1994; Ritz & Sejnowski, 1997). Recent studies have largely focused on spike-to-spike synchrony and strong oscillatory activity, but the concept of synchrony actually covers a
very large span of phenomena, as is evidenced by a few examples. During epileptic seizures, hippocampal pyramidal cells may exhibit global low-frequency synchronous bursting. In the same structure, high-frequency episodes, at about 200 Hz, are observed during sharp-wave events (Buzsáki & Chrobak, 1995). Oscillatory synchrony, involving neural assemblies that evolve in time, was demonstrated in the olfactory system of insects (Laurent, 1996) and shown to play a role in odor recognition. Indeed, disrupting synchrony in the antennal lobe through GABAA antagonists impedes the discrimination between similar odors (Stopfer, Bhagavan, Smith, & Laurent, 1997). This remains the most convincing evidence on the function of synchrony in neural systems. In the past 10 years, following the pioneering work of Gray and Singer (1989) and Eckhorn et al. (1988), many experiments have been devoted to the investigation of stimulus-induced oscillatory activity in the visual system, from the retina to the primary visual cortex. It has been suggested that the γ rhythm might play an important function in the processing of visual stimuli by providing a flexible way to implement the concepts of feature binding and segmentation (for a review, see Gray, 1994), but the functional significance of synchrony in this framework remains a subject of debate (Gray, 1994). A fundamental issue is to understand when synchrony can emerge and how cell properties, synaptic interactions, and network architecture interact to determine the nature of synchronous states in a large neural network. Much theoretical work in the past few years has focused on the emergence of synchrony through the collective dynamics of neural networks. There, the correlations of pairs of neurons originate neither from the direct influence of one neuron on another nor from a common external input to the system. Rather, they stem from the common synaptic input generated by the very activity of the network (Hansel & Sompolinsky, 1996).
In this mean-field framework, the respective roles of synaptic interactions and cell properties in building synchrony were largely unraveled by studying networks of integrate-and-fire neurons (Abbott & van Vreeswijk, 1993; Gerstner, Van Hemmen, & Cowan, 1996) and by analyzing the dynamics of conductance-based models in the weak coupling limit (Hansel, Mato, & Meunier, 1993a, 1993b). In this limit, the collective behavior of the system is determined only by the linear response of the neurons to small depolarizations and the nature of the interaction between neurons. It was thus shown, for a large class of homogeneous neural systems, that slow excitatory interactions could not lead to full synchrony, while inhibitory interactions did (van Vreeswijk, Abbott, & Ermentrout, 1994; Hansel, Mato, & Meunier, 1995). These systems are characterized by the fact that the neurons respond to a brief depolarization by advancing the firing of their next spike, with a phase excitability curve that peaks near the end of the interspike interval. This type of response to perturbations is exhibited by integrate-and-fire models, as well as by many conductance-based models, such as the
model of hippocampal inhibitory interneuron introduced in Wang and Buzsáki (1996). It was experimentally demonstrated that regular spikers in slices of neocortex have such response properties (Reyes & Fetz, 1993a, 1993b). On the basis of these results, it was suggested that mutual inhibition between neurons might play an important role in the emergence of synchrony in large neuronal systems (van Vreeswijk et al., 1994; Hansel, Mato, & Meunier, 1993b; Wang & Buzsáki, 1996). This theoretical idea is supported by several experimental works. In particular, Whittington, Traub, and Jefferys (1995) have shown that a collective oscillation around 40 Hz emerges in hippocampal and neocortical slices due to fast inhibition between interneurons, when slow GABAB inhibition and ionotropic glutamate excitation are pharmacologically blocked. Therefore, the rhythm observed in pyramidal cells could result from their entrainment by the inhibitory network (Cobb, Buhl, Halasy, Paulsen, & Somogyi, 1995). Most theoretical work on synchrony has focused on homogeneous or mildly heterogeneous neural systems. For heterogeneous systems of weakly coupled phase oscillators (Kuramoto, 1984), a transition to the asynchronous state is found at a value of the disorder proportional to the strength of the synaptic coupling. This suggests that increasing the coupling strength might help to synchronize heterogeneous systems. However, increasing the coupling also has adverse effects, such as silencing low-frequency neurons in inhibitory networks (suppression) or strongly increasing the firing rates of neurons in excitatory networks. Therefore, understanding how synchrony can be achieved in neural populations when the spread of the intrinsic properties of neurons is large (see Gustafsson & Pinter, 1984a, 1984b, for a thorough study of the spread in the intrinsic properties of lumbar α motoneurons in anesthetized cats) remains an open issue.
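The statement above about weakly coupled phase oscillators can be illustrated with a toy Kuramoto simulation (our own construction; Euler integration, arbitrary parameter values): the order parameter stays near 1 when the frequency disorder is small compared with the coupling, and collapses when the disorder dominates:

```python
import math, random

# Toy illustration (ours, not from the letter): in the all-to-all Kuramoto
# model, synchrony survives while the frequency disorder is small relative
# to the coupling strength K, and is lost when the disorder dominates.
def order_parameter(K, spread, N=100, dt=0.01, steps=2000, seed=0):
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, spread) for _ in range(N)]       # intrinsic frequencies
    theta = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(N)]
    for _ in range(steps):
        rx = sum(math.cos(t) for t in theta) / N
        ry = sum(math.sin(t) for t in theta) / N
        r, psi = math.hypot(rx, ry), math.atan2(ry, rx)
        # mean-field form: dtheta_i/dt = omega_i + K r sin(psi - theta_i)
        theta = [t + dt * (w + K * r * math.sin(psi - t))
                 for t, w in zip(theta, omega)]
    rx = sum(math.cos(t) for t in theta) / N
    ry = sum(math.sin(t) for t in theta) / N
    return math.hypot(rx, ry)

print(order_parameter(K=2.0, spread=0.1))   # weak disorder: r close to 1
print(order_parameter(K=2.0, spread=10.0))  # strong disorder: r stays small
```

This is only a caricature of the mean-field analyses cited in the text, but it captures the disorder-versus-coupling competition they formalize.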
The dependence of synchrony on the synaptic time course in inhibitory systems was recently investigated by several authors (Golomb & Rinzel, 1993; Hansel & Mato, 1993; Wang & Buzsáki, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Chow, 1998). When inhibition is fast with respect to the typical firing rate of neurons (i.e., the phasic regime of White et al., 1998), the network settles in cluster states that are easily destabilized when heterogeneities are introduced. In contrast, when inhibition is slow, homogeneous systems display stable in-phase synchrony (White et al., 1998). Yet it was shown that when mild heterogeneities in the cell properties throughout the network are taken into account, a transition to asynchrony occurs in this tonic regime for disorder levels on the order of 5% (White et al., 1998). This is easily understood since the modulation of the synaptic input is small in this regime. Thus, it seems a priori difficult to explain synchrony phenomena at high frequency on the basis of mutual inhibition. Accordingly, recent works on hippocampal high-frequency ripples have investigated the possible role of gap junctions in the emergence of synchrony (Draguhn, Traub, Schmitz, & Jefferys, 1998).
1610
L. Neltner, D. Hansel, G. Mato, and C. Meunier
The goal of this article is to investigate the robustness of synchrony (the maximal spread of cell properties that is compatible with synchrony) in the framework of fully coupled one-population networks. We define robustness as the value of the disorder at which the asynchronous state becomes linearly stable. For lesser values of the disorder, the system will then evolve at large times to some synchronous state. This is a conservative definition of robustness; synchrony may still be observed at higher disorder if the stable asynchronous state coexists with some synchronous state (multistability). We especially address the question of which interactions, excitatory or inhibitory, are better able to synchronize neurons at a given firing rate. After reviewing the models of neurons and the methods we use to investigate network dynamics, we study the impact of heterogeneities on synchrony. The question is first addressed in the framework of phase-reduction techniques, where analytical results on the onset of synchrony can be derived. These results are extended to finite coupling by performing numerical simulations. We then study a strong coupling regime that has never been investigated by previous authors, where balancing the synaptic current experienced by the neurons by an external drive of opposite sign prevents the firing rates from drifting. Finally, we discuss our results and their implications.

2 Methods

2.1 Models. We analyze the collective behavior of large networks consisting of only one population of neurons, either excitatory or inhibitory. The neurons are coupled all-to-all with the same coupling strength throughout the network. All the neurons are the same, but we assume that the external input varies from neuron to neuron in such a way that the distribution of the firing rates is either gaussian or Lorentzian when the neurons are uncoupled. Firing frequencies of neurons must be positive.
Therefore, we use firing-rate distributions restricted between 0 and 2f̄, obtained by truncating the original gaussian or Lorentzian distributions symmetrically with respect to the average frequency, f̄, and modifying accordingly the normalization constant (see Figure 1). The standard deviation of the truncated distribution is different from the standard deviation of the original distribution. However, for a gaussian distribution, the relative difference between these two quantities is less than 1% as long as the standard deviation of the original distribution is less than 0.3 f̄. Individual neurons are described by a single-compartment model: either a simple Lapicque integrate-and-fire (IF) model or a Hodgkin-Huxley-like model, namely the Wang-Buzsáki (WB) model, introduced in Wang and Buzsáki (1996). The WB model was designed as a minimal model of
Figure 1: Probability density of the firing frequency f. Symmetrically truncated gaussian distribution: σ = 0.3 f̄ (full line) and σ = 0.5 f̄ (dashed line). Symmetrically truncated Lorentzian distribution of half-width 0.5 f̄ (dotted line).
hippocampal inhibitory interneurons and presents a wide firing frequency range, from 0 to several hundred Hz. The membrane potential of the IF model evolves in the subthreshold regime according to the differential equation

$$C \frac{dV}{dt} = I_\ell + I_{syn}(t) + I, \qquad (2.1)$$
where I_ℓ = g_ℓ(V_ℓ − V) is a passive leak current of conductance g_ℓ and reversal potential V_ℓ, and I_syn is the synaptic current due to the action of the other neurons of the network. The current I is a constant external drive. The membrane capacitance C is set to 1 μF/cm². Requiring a passive membrane time constant τ = C/g_ℓ of 10 ms leads to a leak conductance g_ℓ = 0.1 mS/cm². The resting membrane potential V_rest = V_ℓ is equal to −60 mV in all our simulations. Whenever the membrane potential, V, reaches the threshold θ (−40 mV in our simulations), a spike is fired and V is instantaneously reset to the resting level. In the WB model, the membrane potential evolves according to

$$C \frac{dV}{dt} = I_{Na}(t) + I_K(t) + I_\ell(t) + I_{syn}(t) + I, \qquad (2.2)$$
where I_Na and I_K are voltage-dependent sodium and potassium currents. Their kinetics as well as the passive parameters of the model are the same as in Wang and Buzsáki (1996) and are recalled in appendix A. In both cases (IF and WB models) the dynamics of the networks are integrated using an adaptive second-order Runge-Kutta algorithm, as explained in Hansel, Mato, Neltner, and Meunier (1998). The synaptic current induced by a given presynaptic neuron into a postsynaptic neuron is modeled as

$$I_{syn}(t) = g \sum_{\text{spikes}} f(t - t_{spike}), \qquad (2.3)$$

where summation is performed over all the spikes emitted prior to time t by the presynaptic neuron. Each synaptic activation then induces a single event, described by the function

$$f(t) = \frac{1}{\tau_1 - \tau_2}\left(e^{-t/\tau_1} - e^{-t/\tau_2}\right) H(t), \qquad (2.4)$$
where H(t) is the Heaviside function. The larger of the time constants, τ₁ and τ₂, characterizes the rate of exponential decay of the synaptic potentials, while their time to peak is equal to τ_peak = (τ₁τ₂/(τ₁ − τ₂)) ln(τ₁/τ₂). The normalization adopted here ensures that the time integral of f(t) is always unity. The coupling strength g is then proportional to the maximal synaptic conductance but not equal to it; it is actually the total charge transferred during a single synaptic event. In a fully connected network of size N, g must scale as 1/N, so that the total synaptic input remains finite in the limit N → +∞. Inhibition corresponds to a negative coupling, g < 0, and excitation to a positive coupling, g > 0. Our description of the synaptic coupling does not take into account the shunting effect of the synapses. However, we have checked that the conclusions of our study still hold when synapses are described in terms of conductances.

2.2 Synchrony. We measure the degree of synchrony as it was proposed and developed in Hansel and Sompolinsky (1992), Golomb and Rinzel (1994), and Ginzburg and Sompolinsky (1994): by analyzing the temporal fluctuations of the network activity. One evaluates the average membrane potential in the network at a given time t,
$$A_N(t) = \frac{1}{N}\sum_{i=1}^{N} V_i(t). \qquad (2.5)$$

Its time fluctuations can be characterized by the variance

$$\Delta_N = \left\langle A_N(t)^2 \right\rangle_t - \left\langle A_N(t) \right\rangle_t^2. \qquad (2.6)$$
When this variance is normalized to the population-averaged variance of single-cell activity,
$$\Delta = \frac{1}{N}\sum_{i=1}^{N}\left(\left\langle V_i(t)^2 \right\rangle_t - \left\langle V_i(t) \right\rangle_t^2\right), \qquad (2.7)$$

the resulting parameter,

$$S_N = \frac{\Delta_N}{\Delta}, \qquad (2.8)$$

generally behaves for large N as

$$S_N = \chi + \frac{a}{N} + O\!\left(\frac{1}{N^2}\right). \qquad (2.9)$$
The constant χ, between 0 and 1, measures the degree of coherence in the system in the infinite-size limit, while a yields an estimate of finite-size effects. The coherence parameter χ is equal to 1 if the system is totally synchronous (i.e., V_i(t) = V(t) for all i) and vanishes if the state of the system is asynchronous in the large N limit. This definition of coherence encompasses all possible forms of synchrony in the network. In practice, χ was estimated by calculating S_N for large networks (N ranging from 128 to 4096) and checking that finite-size effects were small (in some cases, we have checked for finite-size effects for network sizes up to N = 65,536).

2.3 Phase Models. The dynamics of a conductance-based neuron involves several variables, including the membrane potential and gating variables. However, if all the neurons in the network fire periodically when they are uncoupled, with their firing rates, f_i, lying in a narrow range, and if they are weakly coupled together, the dynamics of each neuron in the network can be described by a single angle variable, φ_i, varying on [0, 2π]. This variable is the phase of the neuron along its unperturbed limit cycle (i.e., when decoupled from the network). Thanks to this reduction (Hansel et al., 1995), the network dynamics is then described by the simpler differential system

$$\frac{d\varphi_i}{dt} = 2\pi f_i + \frac{g}{N}\sum_{j=1}^{N} \Gamma(\varphi_i - \varphi_j), \qquad (2.10)$$
where g is the coupling strength and Γ is the effective interaction between any two neurons. This interaction function is obtained by convolving the synaptic current generated by the presynaptic neuron, j, with the "response function" Z of the target neuron, i, over one period of the single-neuron dynamics,

$$\Gamma(\varphi_i - \varphi_j) = \frac{1}{2\pi}\int_0^{2\pi} Z(u)\, I_{syn}(u + \varphi_j - \varphi_i)\, du. \qquad (2.11)$$
The function Z is nothing more than the linear phase response of the neuron to very brief current pulses. It indicates what proportionality exists between the amplitude of a small current pulse and the phase advance (or delay) it induces in the neuron. In the special case of IF neurons, the response function can be derived analytically. It reads

$$Z(\varphi) = \frac{1}{I}\, e^{\varphi/(2\pi f \tau)}. \qquad (2.12)$$
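The qualitative content of the IF response function (the phase advance caused by a small depolarization is positive and grows toward the end of the interspike interval, as noted in the introduction) can be checked without any network simulation: the subthreshold trajectory of the IF model of section 2.1 is known in closed form, so the spike-time advance produced by an instantaneous voltage kick can be computed exactly. A minimal sketch, using the parameter values of section 2.1 (the kick size is an arbitrary illustrative choice):

```python
import math

# IF parameters from section 2.1; voltages measured relative to rest,
# so the threshold sits 20 mV above rest (-40 mV vs. -60 mV).
tau = 10.0          # membrane time constant (ms)
g_leak = 0.1        # leak conductance (mS/cm^2)
theta = 20.0        # threshold relative to rest (mV)
I = 3.0             # constant drive (muA/cm^2), arbitrary suprathreshold value
v_inf = I / g_leak  # asymptotic depolarization (mV)

# Unperturbed interspike interval: V(t) = v_inf*(1 - exp(-t/tau)) reaches theta.
T = tau * math.log(v_inf / (v_inf - theta))

def advance(t0, dv=0.1):
    """Spike-time advance caused by an instantaneous kick dv applied at time t0."""
    v = v_inf * (1.0 - math.exp(-t0 / tau)) + dv              # voltage just after the kick
    t_rest = tau * math.log((v_inf - v) / (v_inf - theta))    # remaining time to threshold
    return T - (t0 + t_rest)

early = advance(0.2 * T)
late = advance(0.8 * T)
print(early, late)   # both positive, and late > early
```

The advance grows roughly exponentially with the phase of the kick, which is the behavior equation 2.12 expresses.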
For conductance-based models, Z and its Fourier expansion have to be determined numerically from the single-neuron dynamics (Hansel et al., 1993a; Ermentrout, 1996). Large networks may display a transition from the asynchronous state to some synchronous state when varying control parameters. In particular, such a transition may occur as the level of disorder is decreased or the coupling strength is changed. This can be analytically investigated in networks of phase oscillators, as shown in appendix B (see Strogatz and Mirollo, 1991). Let h(ω) be the distribution of the natural angular frequencies of the phase oscillators, ω = 2πf. Let the Fourier expansion of Γ(φ) be

$$\Gamma(\varphi) = \sum_{n=0}^{+\infty} \left[ a_n \sin(n\varphi) + b_n \cos(n\varphi) \right]. \qquad (2.13)$$

For a fixed value of the disorder, the asynchronous state becomes linearly unstable, when the coupling strength, g, is changed, as soon as one of the Fourier modes, n, of the interaction function, Γ, satisfies the two conditions

$$h(-g b_0 - \Omega_n) = -\frac{2\, a_n}{\pi g \left(a_n^2 + b_n^2\right)} \qquad (2.14)$$

$$\lim_{\epsilon \to 0} \int_{\epsilon}^{+\infty} \frac{h(\omega - g b_0 - \Omega_n) - h(-\omega - g b_0 - \Omega_n)}{\omega}\, d\omega = \frac{2\, b_n}{g \left(a_n^2 + b_n^2\right)}. \qquad (2.15)$$
These equations determine the critical value of the disorder, σ_c, at which the mode n is marginally stable, and the corresponding purely imaginary eigenvalue inΩ_n. The mode n is then linearly unstable for all levels of disorder smaller than the critical value (σ < σ_c) (see appendix B). Alternatively, equations 2.14 and 2.15 give the critical value, g_c, of the coupling strength at which the mode n becomes unstable, when the level of disorder σ is kept fixed. Note
that condition 2.14 can be fulfilled only if a_n and g have opposite signs. This means that the mode n is always linearly stable in an inhibitory network if a_n < 0 and cannot be involved in the destabilization of the asynchronous state. Conversely, the mode n is linearly stable in an excitatory network if a_n > 0. Note that the functions Z and I_syn depend on the intrinsic firing rates of presynaptic and postsynaptic neurons. In equations 2.14 and 2.15, the Fourier coefficients are computed for the effective interaction of neurons with intrinsic angular frequency ω̄ = 2πf̄. Taking into account the spread of frequencies would result only in higher-order corrections.

2.4 Robustness of Synchrony. For phase models, σ_c is strictly proportional to g. Beyond the weak coupling limit, the critical disorder also depends on the coupling strength, albeit in a more complex way. In all cases we define, for any given value of the coupling strength, g, the robustness of synchrony, R(g), as the critical disorder at which the asynchronous state becomes linearly stable. This critical disorder is always expressed in what follows as a fraction of the average frequency, f̄, in the uncoupled network. Meaningful comparisons of the robustness of the synchronous state for two neuron models require that numerical simulations be performed in equivalent conditions. In such a situation, we do not choose the same value of the coupling strength for both models, as the effect of a given synaptic current will largely depend on the membrane properties of the postsynaptic neuron. We set the coupling strength at different values for the two models, chosen so that a single synaptic event leads in both cases to a postsynaptic potential of 0.5 mV in a neuron at rest. This guarantees that the effects of synaptic interactions between neurons are similar in the two models.

3 Results

3.1 Weak Coupling.
Figures 2A and 2B, respectively, display the robustness of synchrony in an excitatory network (R_E) of IF neurons and in an inhibitory network (R_I) of IF neurons as a function of the average firing frequency, f̄. These results were obtained in the weak coupling limit and for a gaussian distribution of firing rates. Both R_E and R_I go to infinity as the firing rate goes to zero, which means that synchrony can always be achieved at very low rates in a network of IF neurons, even if heterogeneities are strong. This result is also valid for a Lorentzian distribution of frequencies. In this simpler case, this can be analytically derived as shown in appendix C. In the excitatory case, the robustness of synchrony, R_E, monotonically decreases with increasing frequency (see Figure 2A) and eventually vanishes at a finite frequency. This means that no synchrony is possible at a high firing rate, even in a homogeneous network. On the contrary, in the inhibitory case, R_I is a nonmonotonic function of the firing frequency and never vanishes as
Figure 2: Weakly coupled IF neurons: Robustness of synchrony, R, versus the average firing frequency, f̄. Firing rates in the uncoupled network are spread according to a symmetrically truncated gaussian distribution (see section 2). Results are displayed for three sets of synaptic time constants: τ₁ = 1 ms, τ₂ = 3 ms (full line); τ₁ = 1 ms, τ₂ = 7 ms (dashed line); τ₁ = 6 ms, τ₂ = 7 ms (dotted line). (A) Excitatory network. (B) Inhibitory network. Results obtained for a symmetrically truncated Lorentzian distribution are also plotted in the inhibitory case (τ₁ = 6 ms, τ₂ = 7 ms, empty squares).
can be seen in Figure 2B. This entails that a mildly heterogeneous inhibitory network always displays some degree of synchrony. The discontinuities in the derivative visible on all the curves correspond to the frequencies at which changes in the order of the critical mode, n_c, occur. As can be seen in Figures 3A and 3B, the order of the critical mode monotonically decays with increasing firing rate, in both the excitatory and the inhibitory cases, and eventually it becomes 1. By contrast, it diverges at low frequency, as shown above. This means that at low frequency, destabilization of the asynchronous state will lead to cluster states where the system splits into several groups of in-phase locked neurons (Golomb, Hansel, Shraiman, & Sompolinsky, 1992). To compare the efficiency of inhibitory and excitatory couplings at synchronizing neuronal activity in IF networks for given values of the synaptic time constants, we have computed the ratio R_E/R_I. In Figure 4, this ratio is plotted as a function of the average firing rate, f̄. At high frequency, R_E/R_I vanishes because the asynchronous state is stable for excitatory interactions but not for inhibitory ones. When the frequency is decreased, R_E/R_I rapidly becomes larger than 1. This means that at low firing rate, excitation is much more efficient than inhibition at synchronizing neural activity in heterogeneous IF networks. Note that R_E/R_I remains finite in the limit of vanishing firing rates, in spite of the low-frequency divergence of both R_E and R_I. We now analyze how synchrony depends on the synaptic time course. For the sake of simplicity, we assume that τ₁ = τ₂ = τ_syn. The synaptic interaction is then described by an "alpha function,"

$$f(t) = \frac{t}{\tau_{syn}^2}\, e^{-t/\tau_{syn}} H(t). \qquad (3.1)$$
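Two properties of the synaptic kernels used here are easy to verify numerically: the double exponential of equation 2.4 integrates to one whatever the time constants and peaks at τ_peak = (τ₁τ₂/(τ₁ − τ₂)) ln(τ₁/τ₂), while the alpha function of equation 3.1 also has unit area and peaks at τ_syn. A short check (the time constants follow Figure 2; the integration grid is an arbitrary choice):

```python
import math

def f_double_exp(t, tau1, tau2):
    """Normalized double-exponential synaptic kernel (equation 2.4), t >= 0."""
    return (math.exp(-t / tau1) - math.exp(-t / tau2)) / (tau1 - tau2)

def f_alpha(t, tau_syn):
    """Alpha-function kernel (equation 3.1), the tau1 = tau2 limit."""
    return (t / tau_syn**2) * math.exp(-t / tau_syn)

tau1, tau2 = 1.0, 3.0          # ms, as in Figure 2 (full line)
dt, t_max = 0.001, 100.0       # integration grid (ms)
ts = [i * dt for i in range(int(t_max / dt))]

# Unit total charge: the time integral of f is 1.
area = sum(f_double_exp(t, tau1, tau2) for t in ts) * dt
# Time to peak matches the closed-form expression.
t_peak = max(ts, key=lambda t: f_double_exp(t, tau1, tau2))
t_peak_theory = (tau1 * tau2 / (tau1 - tau2)) * math.log(tau1 / tau2)

# Same checks for the alpha function, whose peak sits at tau_syn.
tau_syn = 2.0
area_alpha = sum(f_alpha(t, tau_syn) for t in ts) * dt
t_peak_alpha = max(ts, key=lambda t: f_alpha(t, tau_syn))

print(round(area, 3), round(t_peak, 2), round(t_peak_theory, 2))
print(round(area_alpha, 3), round(t_peak_alpha, 2))
```

Unit area is what makes g the total charge transferred per synaptic event, as stated in section 2.1.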
Comparing Figures 4 and 5, we see that the ratio R_E/R_I diverges when f̄τ_syn goes to 0 at fixed f̄, while it tends to a constant value when f̄τ_syn goes to 0 at fixed τ_syn. Thus, f̄τ_syn is not a scaling variable for R_E/R_I. This is because R_E/R_I depends not only on the temporal modulation of the synaptic current (controlled by f̄τ_syn), but also on the phase response of neurons, which explicitly depends on f̄. This can be analytically shown for a Lorentzian distribution (see appendix C). Are the above results modified when active properties of neurons are taken into account? To investigate this issue, we have studied networks of WB neurons. We have numerically computed the response function for various values of the external drive, I, leading to different firing rates of the neurons. Using the Fourier expansion of the response function, we have analytically computed the robustness of synchrony. The results are shown for excitatory and inhibitory couplings, respectively, in Figures 6A and 6B. Qualitatively these results are similar to those obtained for the IF model, except that both R_E and R_I remain finite in the limit of vanishing firing rates
Figure 3: Weakly coupled IF neurons: Order of the critical mode versus the average firing frequency f̄. (A) Excitatory network. (B) Inhibitory network. Same parameters and conventions as in Figure 2.
for the WB model. In particular, R_I never vanishes; R_E is zero above some critical frequency, f_c; and R_E is larger than R_I at low frequency (compare Figures 6A and 6B).
Figure 4: Weakly coupled IF neurons: Relative robustness R_E/R_I versus the average firing frequency f̄. Same parameters and conventions as in Figure 2.
Figure 5: Weakly coupled IF neurons: Relative robustness R_E/R_I versus the synaptic time constant τ_syn for a fixed average firing frequency f̄ = 20 Hz and an "alpha function" synaptic interaction.
Figure 6: Weakly coupled WB neurons: Robustness R of synchrony versus the average firing frequency f̄. (A) Excitatory network. (B) Inhibitory network. Results are displayed for two sets of synaptic time constants: τ₁ = 1 ms, τ₂ = 3 ms (full line and filled squares); τ₁ = 1 ms, τ₂ = 7 ms (dashed line and empty squares).
3.2 Beyond Weak Coupling. The results of the previous section were derived under the assumption of weak coupling. To assess their range of validity, we numerically compute R_E and R_I as a function of the coupling strength for both IF and WB neurons and compare those results to the predictions obtained using the phase-reduction technique.

Figure 7: IF neurons: Coherence χ as a function of the relative frequency spread σ/f̄ (gaussian disorder) for excitatory interactions (filled squares) and inhibitory interactions (empty squares and full line); τ₁ = 3 ms, τ₂ = 1 ms, g = 0.05 nC/cm², f̄ = 80 Hz.

Figure 7 displays the coherence, χ, versus the spread, σ, of firing rates for excitatory and inhibitory networks of IF neurons. The average firing rate was kept fixed at 80 Hz. For that value, partial synchrony (χ ≈ 0.36) is achieved in the homogeneous (σ = 0) excitatory case, and the fully synchronous state (χ = 1) is stable in the homogeneous inhibitory case. Remarkably, χ decays to 0 much faster for inhibition than for excitation when disorder is increased, although it starts from much higher values. Networks of WB neurons display a qualitatively similar behavior, although the range over which χ decays to zero is larger by one order of magnitude. Figures 8A and 8B, respectively, display R_E and R_I as a function of the coupling strength for IF networks and a fixed value of the average firing rate in the uncoupled network. Note in Figure 8B the multistability present in the inhibitory case, as the transition to the asynchronous state occurs at different values of the disorder when starting from an initial condition that is close to the fully synchronous state and from an initial condition where the membrane voltages of neurons are widely spread. For small coupling strength, g, the robustness varies linearly with g, with a slope that is correctly predicted by the phase-model limit. For larger values of g, R_E and R_I monotonically decrease with increasing g. In excitatory networks, this stems from
Figure 8: IF neurons: Robustness versus coupling strength g (in nC/cm²). The average free firing rate is f̄ = 50 Hz. Filled squares: numerical simulations. Solid line: results from the phase model. (A) Excitatory interactions, τ₁ = 3 ms, τ₂ = 1 ms. (B) Inhibitory interactions, τ₁ = 7 ms, τ₂ = 1 ms. Results are shown in that latter case for two initial conditions differing by the spread of the initial voltage in the neuronal population. Filled squares: spread of 50% of the subthreshold voltage range. Empty squares: spread of 1% of the subthreshold voltage range.
the increase of the firing rate with the coupling; in inhibitory networks, this is related to clustering (see appendix D). However, in the inhibitory case, the robustness of synchrony may eventually increase again when the coupling strength is further increased. This is because low-frequency neurons are progressively suppressed when the coupling strength is increased, which decreases the frequency spread of the spiking population and increases the maximum level of disorder required to stabilize the asynchronous state. WB networks behave in a similar way, although synchrony can be achieved in heterogeneous excitatory networks only in a much more restricted range of firing rates (see Figures 9A and 9B). The results of this section show that for both models and both types of interactions, excitatory and inhibitory, when the external input is kept fixed and the strength of the interaction is varied, the conclusions derived using the phase-reduction method remain qualitatively valid over a finite range of coupling strengths. In this range, excitation is more efficient than inhibition at synchronizing heterogeneous networks at low firing rates. The opposite is true at large firing rates, where only inhibition may synchronize heterogeneous networks.

3.3 Synchrony at Strong Coupling. In the weak coupling limit, the interactions between neurons have a negligible effect on the average firing rate. This is not so at strong coupling. This is particularly obvious in excitatory networks, where recurrent excitation strongly increases the average firing rate with respect to its value in the uncoupled network. Consequently, the synaptic interaction becomes closer and closer to a constant as the oscillation period becomes much smaller than the synaptic time constants, and it can no longer destabilize the asynchronous state. This is one reason why no synchrony can be achieved at large coupling strengths in an excitatory network.
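The loss of rate control in strongly coupled excitatory networks can be illustrated with a deliberately crude mean-field caricature (not the IF or WB dynamics of section 2): if the population f-I curve is approximated by a threshold-linear function G(I) = β·max(I, 0), the self-consistent rate f = G(I + g f) has a finite solution only for βg < 1, and fixed-point iteration runs away beyond that. All names and parameter values below are illustrative.

```python
def G(I, beta=1.0):
    """Threshold-linear caricature of a population f-I curve (illustrative only)."""
    return beta * max(I, 0.0)

def self_consistent_rate(I_ext, g, n_iter=200):
    """Iterate f <- G(I_ext + g*f); returns None if the rate runs away."""
    f = 0.0
    for _ in range(n_iter):
        f = G(I_ext + g * f)
        if f > 1e6:          # recurrent excitation no longer controllable
            return None
    return f

print(self_consistent_rate(2.0, 0.5))   # converges: f = I/(1 - beta*g) = 4.0
print(self_consistent_rate(2.0, 1.5))   # runaway: None
```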
In inhibitory networks, increasing the coupling strength decreases the firing rate and would seem to favor synchrony. However, this also leads to suppression of low-frequency neurons. Moreover, the states of partial synchrony exhibited by inhibitory networks depend on both the firing rate and the coupling strength, and synchrony may also be reduced due to a higher degree of clustering when the coupling strength becomes larger. To analyze the sole effect of the coupling strength, g, on the robustness of synchrony in excitatory and inhibitory networks, independently of the change of firing rate it provokes, we shall simultaneously change g and the external drive, I, acting on neurons so that the average frequency in the network remains constant, whatever the value of g. This can be achieved by making the drive I_i on neuron i depend on g as
$$I_i(f, g) = \hat{I}(f) - g\bar{f}, \qquad (3.2)$$

where Î(f) is the constant drive required for neuron i to fire at frequency f, and f̄ is the average frequency of neurons across the uncoupled network. In
Figure 9: WB neurons: Robustness of synchrony versus coupling strength, g (in nC/cm²). τ₁ = 3 ms, τ₂ = 1 ms, and f̄ = 3 Hz. (A) Excitatory interactions. Only the increase of R_E with g is displayed; at higher values of g (outside the range of the figure), R_E actually decreases, as in Figure 8A for IF neurons, and eventually vanishes. (B) Inhibitory interactions. R_I again increases at higher values of g, as in Figure 8B (not shown). As in Figure 8, the line indicates the predictions of the phase model.
the asynchronous state, the total synaptic input on a neuron due to the rest of the network is then exactly balanced by the coupling-dependent term of the applied current, so that the frequency remains that of the single neuron. The results then obtained for excitatory and inhibitory WB networks under the above constraint (see equation 3.2) are shown in Figure 10 for a constant low firing rate, f̄ = 3 Hz. Increasing the synaptic coupling allows us to improve substantially the robustness for both excitation and inhibition, within certain limits. Also note that a larger robustness can be achieved in the excitatory case than in the inhibitory case. Similar, though quantitatively different, results are obtained for IF networks (results not shown). For large enough values of f̄, the situation changes. Indeed, for f̄ larger than a critical value that depends on the synaptic time constants, the asynchronous state is the only stable active state of the excitatory WB network, even in the homogeneous case (it coexists with the stable quiescent state of the network). In other words, R_E = 0. By contrast, partial synchrony is achieved in an inhibitory network, even at a high rate: R_I monotonically increases with g up to values of more than 20% when f̄ = 50 Hz (results not shown). Thus, the conclusions previously established for weak couplings generalize to all coupling strengths, when the average frequency is kept fixed by appropriately tuning the external drive. Excitation synchronizes neurons more efficiently at low firing rates, but above some threshold frequency, only inhibition can synchronize neural activity.

3.4 The Balanced Synchronous State. The previous section might give the sense that a precise tuning of the external drive, so that it exactly compensates for the shift of the average frequency in the asynchronous state that is induced by the network dynamics, is required to achieve synchrony at large coupling.
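Applying the constraint of equation 3.2 requires the inverse f-I relation Î(f). For the IF model of section 2.1 this inverse is available in closed form, since the interspike interval of an isolated neuron is T = τ ln(I / (I − g_ℓ Δθ)), with Δθ = θ − V_rest = 20 mV. The sketch below (a hypothetical helper, not code from the article) inverts this relation and checks the round trip:

```python
import math

tau = 10.0      # membrane time constant (ms)
g_leak = 0.1    # leak conductance (mS/cm^2)
dtheta = 20.0   # threshold minus rest (mV)

def rate(I):
    """Firing rate (kHz) of an isolated IF neuron under constant drive I."""
    if I <= g_leak * dtheta:
        return 0.0                     # subthreshold: no firing
    return 1.0 / (tau * math.log(I / (I - g_leak * dtheta)))

def drive_for_rate(f):
    """Inverse f-I relation: constant drive giving firing rate f (kHz)."""
    return g_leak * dtheta / (1.0 - math.exp(-1.0 / (f * tau)))

f_target = 0.05                        # 50 Hz expressed in kHz
I_hat = drive_for_rate(f_target)
print(rate(I_hat))                     # round trip recovers 0.05
# With coupling g, equation 3.2 then prescribes I_i = I_hat - g * f_mean.
```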
Here we show that when both the external drive and the coupling strength are strong enough, and under some restrictions that are elaborated on in appendix D, this balancing automatically takes place in the asynchronous state, without any need to fine-tune the parameters. For simplicity, we define all currents relative to the current threshold, I_c, so that any current above (respectively below) this threshold is positive (respectively negative). We shall assume that both the coupling strength, g, and the average external drive, I, depend linearly on a scaling parameter, K, and we shall investigate the large K regime where both coupling and average drive are strong. More specifically, we shall assume that

$$g = -KJ, \qquad (3.3)$$

where J is some positive constant, and that the drive I_i on neuron i satisfies

$$I_i = K I_0 + \delta_i, \qquad (3.4)$$
Figure 10: Robustness of synchrony versus coupling strength, g (in nC/cm²), in a WB network for a constant average firing rate f̄ = 3 Hz in the coupled network. τ₁ = 3 ms, τ₂ = 1 ms. (A) Excitatory interactions. Here again, as in Figure 8A, only the increase of R_E with g is shown. (B) Inhibitory interactions. Results are shown in that latter case for two initial conditions differing by the (uniform) spread of the initial voltage in the neuronal population. Filled squares: spread of ±10 mV. Empty squares: spread of ±0.1 mV. In both figures the line indicates the prediction of the phase model (as in Figure 8).
where I_0 is some constant current, I_0 > 0, and the δ_i are independent, identically distributed random variables with zero mean. For an excitatory network, K is negative, so that the external drive is below the current threshold and opposes the frequency increase due to recurrent excitation. In inhibitory networks, K is positive. The external drive is then positive and opposes the frequency decrease due to inhibitory synaptic interactions. In both cases, when the network is in the asynchronous state with an average firing rate f̄, the total current input on neuron i is equal to

$$I_i + g\bar{f} = K(I_0 - J\bar{f}) + \delta_i. \qquad (3.5)$$

Both the firing rate, f_i, of neuron i and the average firing rate, f̄, are then determined by the self-consistent equations

$$f_i = G\!\left[K(I_0 - J\bar{f}) + \delta_i\right] \qquad (3.6)$$

$$\bar{f} = \int P(\delta)\, G\!\left[K(I_0 - J\bar{f}) + \delta\right] d\delta, \qquad (3.7)$$

where f = G[I] is the frequency-current relation of the neurons and P is the probability density of the random variable δ. In the large K limit, the average firing rate of the neurons, f̄, remains finite and is equal to

$$\bar{f} = \frac{I_0}{J} + O(1/K). \qquad (3.8)$$

The correction of order 1/K to f̄, −A/K, is determined by the nonlinear equation

$$\frac{I_0}{J} = \int P(\delta)\, G\!\left[JA + \delta\right] d\delta. \qquad (3.9)$$

This correction to the average frequency is positive for excitatory networks (K < 0) and negative for inhibitory networks (K > 0). Substituting the order 1/K approximation of f̄ thus obtained in equation 3.6 gives the firing rates, f_i, of the neurons up to order 1/K:

$$f_i = G\!\left[JA + \delta_i\right]. \qquad (3.10)$$
Note that the firing rates depend on the single-neuron dynamics only through the frequency-current relation. Also note that in the asynchronous state, the width of the frequency distribution depends only on the probability density, P, and the frequency-current relation, G. It is independent of the average drive on neurons.
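Equations 3.7 and 3.8 can be illustrated numerically. The sketch below uses a threshold-linear f-I curve G(I) = β·max(I, 0) and a discrete set of offsets δ_i as stand-ins (neither is the WB model of the article), solves the self-consistent equation f̄ = ⟨G[K(I₀ − J f̄) + δ]⟩ by bisection, and checks that f̄ approaches I₀/J as the scaling parameter K grows.

```python
# Illustrative stand-ins: threshold-linear f-I curve, three equiprobable offsets.
beta = 1.0
I0, J = 2.0, 0.04            # balanced-state rate I0/J = 50
deltas = [-0.1, 0.0, 0.1]    # zero-mean heterogeneity in the drive

def G(I):
    return beta * max(I, 0.0)

def mean_rate(f_bar, K):
    """Right-hand side of equation 3.7 for the discrete offset distribution."""
    return sum(G(K * (I0 - J * f_bar) + d) for d in deltas) / len(deltas)

def solve_f_bar(K, lo=0.0, hi=1000.0, n_iter=80):
    """Bisection on f_bar = mean_rate(f_bar, K); the RHS decreases with f_bar."""
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mean_rate(mid, K) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for K in (100.0, 1000.0, 10000.0):
    print(K, round(solve_f_bar(K), 3))   # approaches I0/J = 50 as K grows
```

For this linear G the fixed point is K·I₀/(1 + K·J·β), so the O(1/K) approach to I₀/J predicted by equation 3.8 can be read off directly.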
Equation 3.8 implies that although both the external drive on neurons and the synaptic couplings are large, the net current input on neurons and their firing rates remain finite in the limit of large K. Moreover, in this limit, the average firing rate no longer depends on K. Therefore, if the external drive and the coupling strength are sufficiently strong and have opposite effects on the firing rate, the system settles into a "balanced state" where the average firing rate depends only on the ratio of these two quantities (up to small corrections). The equations satisfied by the average drive and the average firing rate in this state are the same as in the "chaotic balanced state" studied in van Vreeswijk and Sompolinsky (1996, 1998). However, at variance with the chaotic balanced state, the neurons fire periodically in the asynchronous balanced state we consider here. This is because the network is fully connected in our study (see also Golomb & Hansel, in press). The robustness, R_I, of an inhibitory WB network in the balanced state is shown in Figures 11A and 11B as a function of the scaling parameter K for two values of the average frequency in the large K limit, f̄ = 50 Hz (A) and f̄ = 200 Hz (B). In both cases, a much larger robustness than at weak coupling can be achieved by sufficiently increasing K. This is because, in spite of the averaging effect of the high firing rate, the synaptic current remains strongly modulated as the coupling strength is large. One can even show analytically that as the firing rate increases, the index of the mode that destabilizes the asynchronous state decreases and synchrony becomes more robust. The behavior of excitatory networks is different, as the firing rate can no longer be controlled when the coupling gets too large. This can be analytically proved for a homogeneous IF network (see appendix D). As a consequence, there is a limiting value of the firing rate beyond which a purely excitatory network cannot exhibit synchrony.
Therefore, we are led in the strong coupling regime to the same conclusions as before: excitatory couplings give rise to more robust synchrony at low firing rates, while only inhibition can synchronize networks at high rates.

4 Discussion
This study has addressed the issue of synchrony in heterogeneous neural networks made up of only excitatory or only inhibitory spiking neurons. Previous work has stressed the role of inhibitory interactions, showing that mutually inhibiting neurons can synchronize (Wang & Buzsáki, 1996), even in the presence of mild heterogeneities (White et al., 1998). We have shown here, in contrast with the results of White et al. (1998), that synchrony is also possible in heterogeneous inhibitory networks when the spread in the firing rates of the neurons is a substantial fraction of the average firing rate, a situation that is probably frequent in biological preparations. Synchrony is even possible at high firing rates, provided that the average
Synchrony in Heterogeneous Networks
1629
Figure 11: Inhibitory network of WB neurons: robustness, $R_I$, versus the scaling factor K (see text). $\tau_1 = 7$ ms, $\tau_2 = 1$ ms, $\bar{f} = 50$ Hz (A) and 200 Hz (B). In both cases $R_I$ saturates with increasing K, as is slightly visible in A. Same conventions as in Figure 8.
synaptic current is balanced by a constant external input of the same order of magnitude. In this way, the coupling strength can be strongly increased, and modulation of the synaptic current maintained, while suppression is
avoided. This situation, which can be achieved without finely tuning the parameters, is different from the asymptotic regimes defined in Chow, White, Ritt, and Kopell (1998), except at very low frequency, where it corresponds to the fast regime. It is neither the phasic regime, because the membrane time constant and the inhibition decay time are of the same order, nor the tonic regime, because the oscillation period is not much longer than the inhibition decay time. By contrast, balancing cannot control the firing rate in excitatory networks when the average firing rate gets too high. As a consequence, synchrony at high firing rates cannot be achieved in excitatory networks. Still, excitation is more efficient than inhibition at synchronizing neurons at sufficiently low firing rates. Note that this robustness of synchrony in excitatory networks relative to the inhibitory situation does not entail that excitatory networks can display synchrony at low firing rates in spite of a large spread of parameters. Indeed, near the bifurcation point, the firing rates of the neurons are strongly sensitive to intrinsic neuronal properties (e.g., the leak conductance) and to the value of the external drive.

We have reached these conclusions by analyzing under which conditions the asynchronous state is unstable in large networks, so that the system settles into some synchronous state. This is in contrast with most previous theoretical works (see, however, Abbott & van Vreeswijk, 1993, and van Vreeswijk, 1996), which mainly focused on synchronous states in homogeneous networks, in particular on the fully synchronous state, and on what these states become when heterogeneities are introduced. More specifically, we have studied the destabilization of the asynchronous state to the benefit of some coherent state when network parameters are varied.
To address this question, we have used the weak coupling limit, in which the stability of the asynchronous state and its dependence on synaptic parameters and on the average firing frequency can be investigated analytically. Thus, we could assess the robustness of synchrony, defined as the relative spread in the natural frequencies of the neurons beyond which the incoherent state becomes stable, for excitatory and inhibitory networks. This procedure was performed for both the IF model, in which the inputs are passively integrated below the firing threshold, and the WB model, which incorporates voltage-dependent sodium and potassium currents. We reached the same qualitative conclusions for both models. At sufficiently low frequency, synchrony can be achieved in both excitatory and inhibitory networks but is more robust in the excitatory case. At high frequency, only inhibition can lead to synchrony, provided that the degree of heterogeneity is sufficiently small. The critical frequency at which excitatory couplings can no longer synchronize neural activity depends on the synaptic time constants and the neuron model. Quantitatively, the IF model nonetheless behaves differently from the WB model in several important respects. First, for excitatory couplings and given synaptic time constants, the critical frequency is larger, by one order
of magnitude, in IF networks. Second, the robustness of synchrony is one order of magnitude smaller in inhibitory IF networks than in inhibitory WB networks. Third, the robustness of synchrony diverges as the average firing rate goes to zero in IF networks, while it decreases in WB networks. Many conductance-based models start to fire at vanishingly low frequency at the current threshold, like the WB model. All such models that we have studied so far behave qualitatively like the WB model, but the value of the critical frequency may change strongly with the model. The atypical behavior of the IF model stresses the importance of active properties for the emergence of synchrony in networks. A similar conclusion was reached by Golomb and Hansel (in press) in a study of synchrony in sparsely connected neuronal networks.

The small robustness of synchrony at low firing rates in inhibitory WB networks is due to the fact that the order of the Fourier mode responsible for the linear instability of the asynchronous state increases with decreasing firing rate. Robustness then decreases with the amplitude of the corresponding component in the Fourier spectrum of the interaction function. For the WB model (and more generally for neurons with similar frequency–current relations), all Fourier components of the response function vanish with the firing rate (Ermentrout, 1996). On the other hand, the modulation of the synaptic current becomes too small at high frequency to entrain neurons efficiently, and robustness becomes low. Between these two regimes, an optimal value of the average firing rate can be found where robustness is maximal. Similarly, for a given firing frequency, synaptic time constants can be found that maximize the robustness of synchrony in inhibitory networks. Note, however, that robustness is not determined solely by the ratios $\bar{f}\tau_1$ and $\bar{f}\tau_2$ (see Figures 4 and 5).
Using numerical simulations of large networks, we have shown that the picture that emerges from the weak coupling limit remains valid in a finite range of coupling strengths. Moreover, for a given external drive, the robustness increases linearly with the coupling strength in this range. However, when the coupling strength becomes too large, robustness again decreases in both excitatory and inhibitory networks, albeit for different reasons. In excitatory networks, the firing rate dramatically increases with the coupling due to recurrent excitation between neurons, and the synaptic current is no longer modulated in time. By contrast, the firing rate decreases with increasing coupling strength in inhibitory networks, which favors cluster states. The higher the order of these cluster states, the more easily they are destabilized and replaced by the asynchronous state when heterogeneities are introduced. These desynchronizing effects can be prevented by introducing an additional current proportional to the coupling and of opposite sign. When both the synaptic interactions and the external drive are strong, the synaptic current on the neurons automatically adjusts so that their firing rate remains nearly constant when the coupling strength is varied. This leads
to robust synchrony in inhibitory networks, even at a high firing rate. By contrast, the firing frequency cannot be controlled in excitatory networks beyond some limiting value of the coupling strength, due to an instability arising from recurrent excitation (see appendix D). The behavior of the networks in the regime where both synaptic currents and external drive are strong still depends on the coupling strength. For instance, the frequency at which the maximum robustness is achieved in the inhibitory case shifts to higher frequencies when the scaling factor, K, increases. For synaptic times $\tau_1 = 3$ ms, $\tau_2 = 1$ ms, the maximum robustness for the WB model is observed around 40 Hz for K = 16, and around 60 Hz for K = 32 (for a reference value of the coupling constant, J, always set to unity). Moreover, the transients become shorter with increasing K. For a WB network firing at 50 Hz, around 100 cycles are required to reach the asymptotic state when K = 1, whereas only 4 or 5 are required when K = 32.

In conclusion, we have brought to light a strong coupling regime where the robustness of synchrony is strongly enhanced. In particular, we have shown that significant synchrony can be achieved in heterogeneous inhibitory networks, even at high firing rates, thanks to the balancing mechanism that occurs when the external drive is also strong and scales with the coupling strength. However, the physiological relevance of this regime remains to be established. The values of the coupling strength we use in our numerical studies of fully connected networks correspond to relevant values of the coupling strength in a network with more realistic connectivity. Consider, for instance, a network of WB neurons, where each neuron is contacted by K = 100–1000 other neurons firing incoherently with an average frequency of f Hz. Assume that each synapse evokes postsynaptic potentials on the order of 0.2 mV.
This corresponds, in the WB model, to a synaptic strength between two neurons on the order of $g_s = 0.4\ \mathrm{nC/cm^2}$. In a state in which the neurons fire incoherently with an average frequency of f Hz, the total synaptic input received by any neuron is of the order of $K \times g_s \times f$. To obtain a comparable total synaptic input in the fully connected WB networks we have investigated, we must set the coupling parameter g such that $g \sim K g_s \sim 40$–$400\ \mathrm{nC/cm^2}$. This range of coupling corresponds to the strong coupling regime we have studied in this article. Note, however, that if K is too small compared to the total number of neurons in the system, the stability of the asynchronous state can be significantly affected by the sparseness of the connectivity pattern (see Golomb & Hansel, in press, for a discussion of this issue). In that case, an asynchronous state can be stable even in the strong coupling regime we have studied (Golomb & Hansel, in press). Finally, we also stress that our work did not cover more realistic situations where two populations, one excitatory and the other inhibitory, interact. Our preliminary results (not shown here) on heterogeneous two-population systems show that the loop between excitatory and inhibitory neurons can lead to an astonishingly high robustness of synchrony.
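The order-of-magnitude argument can be restated in a couple of lines of Python ($g_s$ and the range of K are the values quoted above; the script itself is only an illustration):

```python
# Order-of-magnitude check of the strong-coupling estimate g ~ K * g_s
# (g_s and the range of K are the values quoted in the text).
g_s = 0.4  # nC/cm^2, synaptic strength between two neurons

for K in (100, 1000):
    g = K * g_s
    print(f"K = {K:4d}  ->  g ~ K * g_s = {g:.0f} nC/cm^2")
```

This reproduces the quoted range of 40–400 nC/cm² for the effective fully connected coupling.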
Appendix A: The Wang–Buzsáki Model
Parameters of the Wang–Buzsáki model are as follows. The capacitance is $C = 1\,\mu\mathrm{F/cm^2}$. The leak current is given by $I_\ell = g_\ell (V_\ell - V)$, where $g_\ell = 0.1\,\mathrm{mS/cm^2}$ and $V_\ell = -65\,\mathrm{mV}$. The transient sodium current is given by $I_{\mathrm{Na}} = g_{\mathrm{Na}}\, m_\infty^3 h\, (V_{\mathrm{Na}} - V)$, where $g_{\mathrm{Na}} = 35\,\mathrm{mS/cm^2}$ and $V_{\mathrm{Na}} = 55\,\mathrm{mV}$. The activation variable $m$ instantaneously adjusts to its steady-state value $m_\infty = \alpha_m/(\alpha_m + \beta_m)$, where $\alpha_m(V) = -0.1(V+35)/(\exp(-0.1(V+35)) - 1)$ and $\beta_m(V) = 4\exp(-(V+60)/18)$. The inactivation variable $h$ evolves according to $\frac{dh}{dt} = \alpha_h (1-h) - \beta_h h$, where $\alpha_h(V) = 0.35\exp(-(V+58)/20)$ and $\beta_h(V) = 5/(\exp(-0.1(V+28)) + 1)$. The delayed rectifier current is given by $I_K = g_K n^4 (V_K - V)$, where $g_K = 9\,\mathrm{mS/cm^2}$ and $V_K = -90\,\mathrm{mV}$. The activation variable $n$ evolves according to $\frac{dn}{dt} = \alpha_n (1-n) - \beta_n n$, with $\alpha_n(V) = -0.05(V+34)/(\exp(-0.1(V+34)) - 1)$ and $\beta_n(V) = 0.625\exp(-(V+44)/80)$.

Appendix B: Instability of the Asynchronous State in Heterogeneous Networks of Phase Oscillators
We consider a population of neurons in which the intrinsic frequencies are distributed according to $h(\omega)$. The phases of the neurons evolve according to
$$\dot{\varphi}_i = \omega_i + \frac{g}{N} \sum_{j=1}^{N} \Gamma(\varphi_i - \varphi_j). \tag{B.1}$$
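Equation B.1 is straightforward to simulate directly. The sketch below uses the illustrative choice $\Gamma(x) = -\sin x$ (a single Fourier mode, which reduces B.1 to the classical Kuramoto model) and a gaussian spread of intrinsic frequencies; none of the numerical values are taken from the paper:

```python
import cmath
import math
import random

# Direct simulation of the heterogeneous phase model of equation B.1,
#   dphi_i/dt = omega_i + (g/N) sum_j Gamma(phi_i - phi_j),
# with the illustrative interaction Gamma(x) = -sin(x).
random.seed(0)
N, g, dt, steps = 200, 1.0, 0.01, 3000
sigma = 0.05  # spread of intrinsic frequencies (illustrative)
omega = [random.gauss(1.0, sigma) for _ in range(N)]
phi = [random.uniform(0.0, 2.0 * math.pi) for _ in range(N)]

for _ in range(steps):
    z = sum(cmath.exp(1j * p) for p in phi) / N  # mean field
    # (g/N) * sum_j -sin(phi_i - phi_j) = g * Im(z * exp(-i * phi_i))
    phi = [p + dt * (w + g * (z * cmath.exp(-1j * p)).imag)
           for p, w in zip(phi, omega)]

# |r| is ~1/sqrt(N) in the incoherent state and approaches 1 at synchrony.
r = abs(sum(cmath.exp(1j * p) for p in phi) / N)
print(f"order parameter after transient: r = {r:.2f}")
```

With this coupling well above the incoherence threshold, the population locks and the order parameter ends up close to 1; lowering g or widening the frequency spread leaves the incoherent state stable instead.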
The evolution of the one-phase probability density, $\rho(\varphi, t, \omega)$, is then given by the continuity equation,
$$\frac{\partial \rho}{\partial t} + \frac{\partial}{\partial \varphi}(\rho v) = 0, \tag{B.2}$$
where
$$v = \omega + g \sum_{n=0}^{+\infty} \left[ r_n \sin(n(\varphi - \psi_n)) + s_n \cos(n(\varphi - \psi_n)) \right], \tag{B.3}$$
and the coefficients $r_n$, $s_n$, and $\psi_n$ can be expressed in terms of the Fourier coefficients, $a_n$ and $b_n$, of the interaction function:
$$r_n e^{i n \psi_n} = a_n \int_0^{2\pi} \int_{-\infty}^{+\infty} \rho(\varphi, t, \omega)\, h(\omega)\, e^{i n \varphi}\, d\omega\, d\varphi \tag{B.4}$$
$$s_n e^{i n \psi_n} = b_n \int_0^{2\pi} \int_{-\infty}^{+\infty} \rho(\varphi, t, \omega)\, h(\omega)\, e^{i n \varphi}\, d\omega\, d\varphi. \tag{B.5}$$
To analyze the linear stability of the incoherent state, $\rho_0(\varphi, t, \omega) = \frac{1}{2\pi}$, we write $\rho(\varphi, t, \omega)$ as
$$\rho(\varphi, t, \omega) = \rho_0(\varphi, t, \omega) + \epsilon\, g(\varphi, t, \omega), \tag{B.6}$$
where
$$g(\varphi, t, \omega) = \sum_{n=-\infty}^{+\infty} c_n(t, \omega)\, e^{i n \varphi}, \tag{B.7}$$
and linearize the continuity equation around $\rho_0$. Thus, we get
$$\frac{\partial c_n}{\partial t} = -i(\omega + g b_0)\, n\, c_n - g\, \frac{n a_n}{2} \int_{-\infty}^{+\infty} h(\omega)\, c_n(t, \omega)\, d\omega + g\, \frac{i n b_n}{2} \int_{-\infty}^{+\infty} h(\omega)\, c_n(t, \omega)\, d\omega. \tag{B.8}$$
We then seek eigenvectors of the form
$$c_n(t, \omega) = d_n(\omega)\, e^{\lambda_n t}, \tag{B.9}$$
where $\lambda_n = \mu_n + i n \Omega_n$. Equation B.8 then leads to
$$1 = g\, \frac{n}{2}\, (-a_n + i b_n) \int_{-\infty}^{+\infty} \frac{h(\omega)}{\lambda_n + i(\omega + g b_0)\, n}\, d\omega. \tag{B.10}$$
This equation determines the eigenvalues $\lambda_n$ for $n \geq 1$. Separating the real and imaginary parts of equation B.10, we obtain two equations that determine the bifurcation point and the value of $\Omega_n$:
$$\int_{-\infty}^{+\infty} h(\omega)\, \frac{\mu_n}{\mu_n^2 + n^2 (\omega + g b_0 + \Omega_n)^2}\, d\omega = -\frac{2}{g n}\, \frac{a_n}{a_n^2 + b_n^2} \tag{B.11}$$
$$\int_{-\infty}^{+\infty} h(\omega)\, \frac{n (\omega + g b_0 + \Omega_n)}{\mu_n^2 + n^2 (\omega + g b_0 + \Omega_n)^2}\, d\omega = \frac{2}{g n}\, \frac{b_n}{a_n^2 + b_n^2}. \tag{B.12}$$
When the stability of the mode $n$ changes, the associated eigenvalue is purely imaginary: $\lambda_n = i n \Omega_n$. Taking the corresponding limit, $\mu_n \to 0$, in equations B.11 and B.12, we get:
$$h(-g b_0 - \Omega_n) = -\frac{2}{\pi g}\, \frac{a_n}{a_n^2 + b_n^2} \tag{B.13}$$
$$\lim_{\epsilon \to 0} \int_{\epsilon}^{+\infty} \frac{h(\omega - g b_0 - \Omega_n) - h(-\omega - g b_0 - \Omega_n)}{\omega}\, d\omega = \frac{2}{g}\, \frac{b_n}{a_n^2 + b_n^2}. \tag{B.14}$$
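The $\mu_n \to 0$ limit used to pass from B.11 to B.13 can be checked numerically: the kernel $\mu/(\mu^2 + n^2(\omega + c)^2)$ concentrates at $\omega = -c$, and the integral tends to $(\pi/n)\, h(-c)$. A sketch with a gaussian $h$ (all numerical values, and the quadrature grid, are illustrative):

```python
import math

# Numerical check of the mu -> 0 limit behind equation B.13:
#   integral of h(w) * mu / (mu^2 + n^2 (w + c)^2) dw  ->  (pi/n) * h(-c).
def h(w, mean=0.0, sigma=1.0):
    """Gaussian frequency distribution (equation B.15 with mean 0, sigma 1)."""
    return math.exp(-((w - mean) ** 2) / (2.0 * sigma**2)) \
        / math.sqrt(2.0 * math.pi * sigma**2)

def smeared(mu, n, c, lo=-20.0, hi=20.0, steps=200_000):
    """Midpoint-rule evaluation of the left-hand side for finite mu."""
    dw = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        w = lo + (k + 0.5) * dw
        total += h(w) * mu / (mu**2 + n**2 * (w + c) ** 2) * dw
    return total

n, c = 2, 0.5
limit = math.pi / n * h(-c)
for mu in (0.1, 0.01):
    print(f"mu = {mu:5.2f}: integral = {smeared(mu, n, c):.5f} (limit {limit:.5f})")
```

As $\mu$ shrinks, the smeared integral converges to the delta-function value that appears on the left-hand side of B.13.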
In the case of a gaussian distribution of frequencies,
$$h(\omega) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(\omega - \bar{\omega})^2}{2\sigma^2} \right), \tag{B.15}$$
equations B.13 and B.14 lead to
$$\sqrt{\frac{2}{\pi}} \int_0^{(\Omega_n + g b_0 + \bar{\omega})/\sigma_c} \exp\left( \frac{x^2}{2} \right) dx = -\frac{b_n}{a_n} \tag{B.16}$$
$$\sigma_c = -g \sqrt{\frac{\pi}{8}}\, \frac{a_n^2 + b_n^2}{a_n} \exp\left( -\frac{(\Omega_n + g b_0 + \bar{\omega})^2}{2\sigma_c^2} \right), \tag{B.17}$$
where $\sigma_c$ is the critical value of the disorder at which the mode $n$ becomes unstable. For a Lorentzian distribution,
$$h(\omega) = \frac{1}{\pi}\, \frac{\sigma}{\sigma^2 + (\omega - \bar{\omega})^2}, \tag{B.18}$$
equations B.13 and B.14 take the simpler form:
$$\frac{\sigma}{\sigma^2 + (\Omega_n + g b_0)^2} = -\frac{2}{g}\, \frac{a_n}{a_n^2 + b_n^2} \tag{B.19}$$
$$\frac{\Omega_n + g b_0}{\sigma^2 + (\Omega_n + g b_0)^2} = \frac{2}{g}\, \frac{b_n}{a_n^2 + b_n^2}. \tag{B.20}$$
Therefore:
$$\sigma_c = -g\, \frac{a_n}{2} \tag{B.21}$$
$$\Omega_n + g b_0 = \pm \frac{b_n}{2}. \tag{B.22}$$
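The algebra behind B.21 and B.22 is compact enough to spell out (a reconstruction of the intermediate step, which the text leaves implicit). Writing $\Omega' = \Omega_n + g b_0$, squaring and adding B.19 and B.20 gives

```latex
% Squaring and adding B.19 and B.20, with \Omega' = \Omega_n + g b_0:
\frac{\sigma^2 + \Omega'^2}{\left(\sigma^2 + \Omega'^2\right)^2}
  = \frac{4}{g^2}\,\frac{a_n^2 + b_n^2}{\left(a_n^2 + b_n^2\right)^2}
\quad\Longrightarrow\quad
\sigma^2 + \Omega'^2 = \frac{g^2}{4}\left(a_n^2 + b_n^2\right).
```

Substituting this back into B.19 yields $\sigma_c = -g a_n/2$, which is equation B.21; substituting into B.20 then fixes $\Omega_n + g b_0$ up to sign, as in equation B.22.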
In that case, the critical level of disorder, $\sigma_c$, depends only on the Fourier coefficient $a_n$.

Appendix C: Lorentzian Distribution of Intrinsic Frequencies
In this appendix we calculate the robustness at low frequency, or for fast synaptic interactions, for excitatory or inhibitory IF networks and a Lorentzian distribution of heterogeneities. Let us first investigate the robustness in the low-frequency limit. We write the Fourier expansion of the response function as:
$$Z(\varphi) = \sum_{n=0}^{+\infty} A_n \sin(n\varphi) + B_n \cos(n\varphi). \tag{C.1}$$
A straightforward calculation using equations 2.4 and 2.11 shows that the Fourier coefficients of the phase interaction are given by:
$$a_n = \frac{A_n (1 - n^2 \omega^2 \tau_1 \tau_2) - B_n\, n \omega\, (\tau_1 + \tau_2)}{(1 + n^2 \omega^2 \tau_1^2)(1 + n^2 \omega^2 \tau_2^2)} \tag{C.2}$$
$$b_n = \frac{A_n\, n \omega\, (\tau_1 + \tau_2) + B_n (1 - n^2 \omega^2 \tau_1 \tau_2)}{(1 + n^2 \omega^2 \tau_1^2)(1 + n^2 \omega^2 \tau_2^2)}. \tag{C.3}$$
For the IF model, the coefficients $A_n$ and $B_n$ are
$$B_n = \frac{\omega^2 \tau}{2\pi I} \left( e^{2\pi/\omega\tau} - 1 \right) \frac{1}{1 + (n \omega \tau)^2} \tag{C.4}$$
$$A_n = -n \omega \tau\, B_n. \tag{C.5}$$
The Fourier coefficient $a_n$ of $\Gamma$ can be written in the form $a_n = K(n\omega, \tau, \tau_1, \tau_2)\, L(\omega, \tau)$, where
$$K(x, \tau, \tau_1, \tau_2) = -\frac{x^3 \tau \tau_1 \tau_2 - x (\tau + \tau_1 + \tau_2)}{\left(1 + x^2 \tau_1^2\right)\left(1 + x^2 \tau_2^2\right)\left(1 + x^2 \tau^2\right)} \tag{C.6}$$
and
$$L(\omega, \tau) = \frac{\omega^2 \tau}{2\pi I} \left( e^{2\pi/\omega\tau} - 1 \right). \tag{C.7}$$
The function $K(x)$ reaches its (negative) minimum, $K_{\min}$, at $x = x_{\min}$, and its (positive) maximum, $K_{\max}$, at $x = x_{\max}$, while $L$ keeps a constant sign. For a Lorentzian distribution of angular frequencies, the critical level of disorder depends only on $a_n$ (see equation B.21). For excitatory coupling ($g > 0$), the first mode, $n_c$, that destabilizes the asynchronous state is then given by the best integer approximation of $x_{\min}/\bar{\omega}$. Indeed, $K$ is then closest to $K_{\min}$, and $a_n$ takes its most negative value there. When the average frequency goes to zero, $K(n_c \bar{\omega}, \tau, \tau_1, \tau_2)$ goes to $K_{\min}$, and the instability condition becomes $\sigma_c = -g K_{\min} L(\bar{\omega}\tau)/2$. The quantity $L(\bar{\omega}\tau)$ diverges in this limit, and so does the critical disorder, $\sigma_c$. By contrast, for inhibitory coupling ($g < 0$), the critical mode, $n_c$, is given by the best integer approximation of $x_{\max}/\bar{\omega}$. The critical value of the disorder, $\sigma_c$, then diverges at low firing rate as $g K_{\max} L(\bar{\omega}\tau)/2$. In both cases, excitatory and inhibitory, the critical mode, $n_c$, goes to infinity at low frequency as $1/\bar{\omega}$. This stems from the fact that in that limit, the membrane potential of an IF neuron remains exponentially close to the threshold over a large part of the period, so that small depolarizations can induce large changes in the phase of the neuron. This is a particular feature of IF neurons, which passively integrate their inputs over the whole subthreshold potential range.
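The low-frequency analysis can be reproduced numerically from equations C.6 and C.7: evaluate $a_n = K(n\omega)\,L(\omega)$ over a range of modes and pick out the extremal values. A sketch (all parameter values are illustrative, not taken from the paper):

```python
import math

# Numerical sketch of the appendix C low-frequency analysis for the IF
# model: a_n = K(n*omega) * L(omega) (equations C.6 and C.7), and the
# critical mode carries the extremal a_n. Illustrative parameter values.
tau, tau1, tau2, I = 1.0, 0.3, 0.1, 1.5

def K(x):
    num = x**3 * tau * tau1 * tau2 - x * (tau + tau1 + tau2)
    den = (1 + x**2 * tau1**2) * (1 + x**2 * tau2**2) * (1 + x**2 * tau**2)
    return -num / den                                    # eq. C.6

def L(omega):
    return (omega**2 * tau / (2 * math.pi * I)) \
        * (math.exp(2 * math.pi / (omega * tau)) - 1.0)  # eq. C.7

omega = 0.5  # low average angular frequency
a = {n: K(n * omega) * L(omega) for n in range(1, 200)}
n_exc = min(a, key=a.get)  # most negative a_n: critical mode for g > 0
n_inh = max(a, key=a.get)  # most positive a_n: critical mode for g < 0
print(f"excitatory critical mode n_c = {n_exc} (x_min ~ {n_exc * omega:.1f})")
print(f"inhibitory critical mode n_c = {n_inh} (x_max ~ {n_inh * omega:.1f})")
```

As the text argues, the excitatory critical mode sits at the (large-$x$) minimum of $K$, while the inhibitory one sits at the (small-$x$) maximum, so the excitatory critical mode is of much higher order at low frequency.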
To investigate the limit of fast synaptic interactions, we proceed in a similar way. For simplicity, we assume that $\tau_1 = \tau_2 \equiv \tau_{\mathrm{syn}}$. The Fourier coefficient $a_n$ can then be written as $a_n = K(n\omega, \tau, \tau_{\mathrm{syn}})\, L(\omega, \tau)$, where
$$K(x, \tau, \tau_{\mathrm{syn}}) = -\frac{x^3 \tau \tau_{\mathrm{syn}}^2 - x\, (\tau + 2\tau_{\mathrm{syn}})}{\left(1 + x^2 \tau^2\right)\left(1 + x^2 \tau_{\mathrm{syn}}^2\right)^2}. \tag{C.8}$$
When $\tau_{\mathrm{syn}}$ is much smaller than $\tau$, $K$ can be rewritten as $K_1(x\tau)\, K_2(x\tau_{\mathrm{syn}})$, where
$$K_1(y) = \frac{y}{1 + y^2} \tag{C.9}$$
$$K_2(y) = -\frac{y^2 - 1}{\left(1 + y^2\right)^2}. \tag{C.10}$$
Consequently, the critical mode, $n_c$, which destabilizes the incoherent state, satisfies $n_c \omega \tau \sim 1$ in the excitatory case. It keeps a constant value, independent of $\tau_{\mathrm{syn}}$, when $\tau_{\mathrm{syn}}$ goes to 0. On the contrary, the critical mode is larger than $1/(\omega \tau_{\mathrm{syn}})$ for fast inhibitory interactions, so that it diverges when $\tau_{\mathrm{syn}}$ goes to 0. As a result, $R_E/R_I$ then scales like $\tau/\tau_{\mathrm{syn}}$ for fixed firing rate $\bar{f}$.
Appendix D: Stability of the Asynchronous State in IF Networks
The stability of the asynchronous state of a fully connected homogeneous ($d_i = 0$) IF network was studied by Abbott and van Vreeswijk (1993). They showed that the eigenvalues, $\lambda$, that determine the linear stability of the asynchronous state are the solutions of the equation
$$f_0 \left( e^{\lambda/f_0} - 1 \right) (\lambda + \alpha_1)(\lambda + \alpha_2) = \lambda\, \alpha_1 \alpha_2 \int_0^1 dy\, F(y)\, e^{\lambda y / f_0}, \tag{D.1}$$
where $f_0$ is the firing rate of the neurons in units of $1/\tau$, $\alpha_1 = \tau/\tau_1$, $\alpha_2 = \tau/\tau_2$, and
$$F(y) = \frac{g f_0\, e^{y/f_0}}{g f_0 + I}. \tag{D.2}$$
In the balanced state (see section 3.4), any neuron in the network is subjected to the external drive $K I_0$ and the synaptic drive $-K J f_0$, so that it receives the total current $K(I_0 - J f_0)$. The interaction function then reads
$$F(y) = \frac{f_0 J\, e^{y/f_0}}{I_0 - f_0 J}. \tag{D.3}$$
Integrating the evolution equation of any neuron of the network over one period, one also derives the equation
$$\frac{1}{K\,(I_0 - J f_0)} = 1 - e^{-1/f_0}, \tag{D.4}$$
which enables us to express $I_0$ as a function of $f_0$. The eigenvalue equation can then be rewritten as
$$\left( e^{\lambda/f_0} - 1 \right)(\lambda + \alpha_1)(\lambda + \alpha_2) = -\alpha_1 \alpha_2\, f_0\, K J\, \frac{\lambda}{\lambda + 1} \left( e^{(\lambda+1)/f_0} - 1 \right) \left( 1 - e^{-1/f_0} \right). \tag{D.5}$$
Let us investigate the high-frequency limit $f_0 \gg 1$. Considering the complex part of the spectrum and looking for solutions of equation D.5 of the form
$$\lambda_n = 2\pi i n f_0 + \gamma_n, \tag{D.6}$$
where $\gamma_n \ll f_0$, one obtains
$$\gamma_n = \frac{1}{\dfrac{4\pi^2 n^2 f_0^2}{\alpha_1 \alpha_2 K J} - 1}. \tag{D.7}$$
The ansatz D.6, with $K = a f_0^2$, is valid when $a \neq 4\pi^2 n^2/(\alpha_1 \alpha_2 J)$. Then the bounded quantity $\gamma_n$ is just the real part of the $n$th eigenvalue and determines the linear stability of the asynchronous state with respect to the Fourier mode of order $n$. For inhibitory interactions ($K > 0$), $\gamma_n$ is positive for modes of order $n$ larger than
$$n_c = \frac{\sqrt{\alpha_1 \alpha_2\, a J}}{2\pi}, \tag{D.8}$$
while low-order Fourier modes are stable. Accordingly, the homogeneous network settles into cluster states. Let us also note that the coupling strength at which the mode $n$ becomes stable scales as $n^2$, so that it is difficult to stabilize high-order Fourier modes by increasing the coupling. Higher-order cluster states are easily destabilized in favor of the asynchronous state when the disorder is increased. This is why the robustness of synchrony, $R_I$, in heterogeneous inhibitory networks may be small, although the asynchronous state is never stable in homogeneous networks.
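Equations D.7 and D.8 can be illustrated with a few lines of Python (parameter values are illustrative; the coupling is scaled as $K = a f_0^2$ as in the text):

```python
import math

# Sketch of the high-frequency result D.7 for the balanced IF network:
#   gamma_n = 1 / (4 pi^2 n^2 f0^2 / (alpha1 alpha2 K J) - 1),
# with K = a * f0^2, so gamma_n > 0 (instability of the asynchronous
# state) exactly for n > n_c = sqrt(alpha1 alpha2 a J) / (2 pi).
# All parameter values are illustrative, not taken from the paper.
alpha1, alpha2, J, a, f0 = 10.0, 2.0, 1.0, 40.0, 5.0
K = a * f0**2

def gamma(n):
    """Real part of the n-th eigenvalue (equation D.7)."""
    return 1.0 / (4.0 * math.pi**2 * n**2 * f0**2
                  / (alpha1 * alpha2 * K * J) - 1.0)

n_c = math.sqrt(alpha1 * alpha2 * a * J) / (2.0 * math.pi)
print(f"n_c = {n_c:.2f}")
for n in range(1, 8):
    status = "unstable" if gamma(n) > 0 else "stable"
    print(f"n = {n}: gamma_n = {gamma(n):+.3f}  ({status})")
```

With these numbers $n_c \approx 4.5$: modes $n \le 4$ are stable and modes $n \ge 5$ are unstable, illustrating why the homogeneous inhibitory network settles into high-order cluster states.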
The behavior of excitatory networks ($K < 0$) is quite different. Indeed, it is not possible to increase the coupling strength to compensate for the decrease in the modulation of the synaptic current that occurs at high frequency. If we consider the real part of the spectrum, the eigenvalues satisfy
$$\lambda^2 + \lambda\, (\alpha_1 + \alpha_2) + \alpha_1 \alpha_2\, (1 + K J) = 0 \tag{D.9}$$
in the high-frequency limit. When the actual coupling strength, $KJ$, becomes smaller than $-1$, the larger real eigenvalue becomes positive, and an instability develops. Balancing is then impossible, as the inhibitory external drive cannot oppose recurrent excitation. For this reason, excitatory interactions cannot synchronize neuronal systems at high firing rates.

Acknowledgments
G. M. was partially supported by grant PICT97 03-00000-00131 from ANPCyT and by Fundación Antorchas. We thank D. Golomb and C. van Vreeswijk for fruitful discussions.

References

Abbott, L. F., & van Vreeswijk, C. (1993). Asynchronous states in networks of pulse-coupled oscillators. Phys. Rev. E, 48, 1483–1490.
Buzsáki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuronal networks. Curr. Opin. Neurobiol., 5, 504–510.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Comp. Neuro., 5, 407–420.
Cobb, S. R., Buhl, E. H., Halasy, K., Paulsen, O., & Somogyi, P. (1995). Synchronization of neuronal activity in hippocampus by individual GABAergic interneurons. Nature, 378, 75–78.
Draguhn, A., Traub, R. D., Schmitz, D., & Jefferys, J. G. (1998). Electrical coupling underlies high-frequency oscillations in the hippocampus in vitro. Nature, 394, 189–192.
Eckhorn, R., Bauer, R., Jordan, W., Brosch, M., Kruse, W., Munk, M., & Reitboeck, H. J. (1988). Coherent oscillations: A mechanism of feature linking in the visual cortex? Multiple electrode and correlation analyses in the cat. Biol. Cybern., 60, 121–130.
Ermentrout, B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Comp., 8, 979–1001.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Ginzburg, I., & Sompolinsky, H. (1994). Theory of correlations in stochastic neural networks. Phys. Rev. E, 50, 3171–3191.
Golomb, D., & Hansel, D. (in press). The number of synaptic inputs and the synchrony of large sparse neuronal networks. Neural Computation.
Golomb, D., Hansel, D., Shraiman, B., & Sompolinsky, H. (1992). Clustering in globally coupled phase oscillators. Physical Review A, 45, 3516–3530.
Golomb, D., & Rinzel, J. (1993). Dynamics of globally coupled inhibitory neurons with heterogeneity. Phys. Rev. E, 48, 4810–4814.
Golomb, D., & Rinzel, J. (1994). Clustering in globally coupled inhibitory neurons. Physica D, 72, 259–282.
Gray, C. M. (1994). Synchronous oscillations in neuronal systems: Mechanisms and functions. J. Comp. Neuro., 1, 11–38.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Natl. Acad. Sci. USA, 86, 1698–1702.
Gustafsson, B., & Pinter, M. (1984a). Relations among passive electrical properties of lumbar α-motoneurones of the cat. J. Physiol., 356, 401–434.
Gustafsson, B., & Pinter, M. (1984b). An investigation of threshold properties among cat spinal α-motoneurones. J. Physiol., 357, 453–483.
Hansel, D., & Mato, G. (1993). Patterns of synchrony of a Hodgkin–Huxley neural network at weak coupling. Physica A, 200, 662.
Hansel, D., Mato, G., & Meunier, C. (1993a). Phase dynamics for weakly coupled Hodgkin–Huxley neurons. Europhysics Letters, 23, 367–372.
Hansel, D., Mato, G., & Meunier, C. (1993b). Phase reduction and neural modeling. Concepts Neurosci., 4, 193–210.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comp., 7, 307–337.
Hansel, D., Mato, G., Neltner, L., & Meunier, C. (1998). On numerical simulations of integrate-and-fire neural networks. Neural Comp., 10, 467–484.
Hansel, D., & Sompolinsky, H. (1992). Synchronization and computation in a chaotic neural network. Phys. Rev. Lett., 68, 718–721.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comp. Neurosc., 3, 7–34.
Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.
Laurent, G. (1996). Dynamical representation of odors by oscillating and evolving neural assemblies. Trends Neurosci., 19, 489–496.
Reyes, A. D., & Fetz, E. E. (1993a). Two modes of interspike interval shortening by brief transient depolarizations in cat neocortical neurons. J. Neurophysiol., 69, 1661–1672.
Reyes, A. D., & Fetz, E. E. (1993b). Effects of transient depolarizing potentials on the firing rate of cat neocortical neurons. J. Neurophysiol., 69, 1673–1683.
Ritz, R., & Sejnowski, T. J. (1997). Synchronous oscillatory activity in sensory systems: New vistas on mechanisms. Curr. Opin. Neurobiol., 7, 536–546.
Stopfer, M., Bhagavan, S., Smith, B. H., & Laurent, G. (1997). Impaired odour discrimination on desynchronization of odour-encoding neural assemblies. Nature, 390, 70–74.
Strogatz, S. H., & Mirollo, R. E. (1991). Stability of incoherence in a population of coupled oscillators. J. Stat. Phys., 63, 613–635.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Comput. Neurosci., 1, 313–321.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comp., 10, 1321–1372.
Wang, X.-J., & Buzsáki, G. (1996). Gamma oscillations in a hippocampal interneuronal network model. J. Neurosci., 16, 6402–6413.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comp. Neuro., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.

Received April 7, 1999; accepted August 4, 1999.
LETTER
Communicated by Roger Traub
Dynamics of Spiking Neurons with Electrical Coupling

Carson C. Chow
Department of Mathematics, University of Pittsburgh, Pittsburgh, PA 15206, U.S.A.

Nancy Kopell
Department of Mathematics and Center for BioDynamics, Boston University, Boston, MA 02215, U.S.A.

We analyze the existence and stability of phase-locked states of neurons coupled electrically with gap junctions. We show that spike shape and size, along with driving current (which affects network frequency), play a large role in determining which phase-locked modes exist and are stable. Our theory makes predictions about biophysical models using spikes of different shapes, and we present simulations to confirm the predictions. We also analyze a large system of all-to-all coupled neurons and show that the splay-phase state can exist only for a certain range of frequencies.
1 Introduction
Electrical coupling between neurons has long been thought to have the effect of synchronizing oscillatory neurons, especially if the neurons involved are similar to one another. Here we analyze in more detail the effects of coupling periodically spiking cells by electrical synapses and show that gap junctions can actively foster asynchrony. We focus on the effects of spike shape and size, along with driving current (which influences the network frequency). Even when the spikes are very thin, the current flow during the spike is shown to have a significant effect on the self-organization of the network, and the current flow during the afterpotentials is an important part of the synchronizing process. Indeed, the frequency of the network plays a significant role in whether the circuit will synchronize, changing the balance of these processes by altering the percentage of time occupied by the spike in a cycle. We analyze what modes of stable locking are possible for the network and show that synchronization is possible at much higher frequencies than for coupling via inhibitory synapses. However, at very high frequencies, gap junctions can be asynchronizing if the strength of the synapse is sufficiently low; at intermediate frequencies, asynchronous modes can stably coexist with the synchronous ones. The models we use are described by an integrate-and-fire formalism, with the addition of action potentials that are inserted when a cell reaches

Neural Computation 12, 1643–1678 (2000) © 2000 Massachusetts Institute of Technology
1644
Carson C. Chow and Nancy Kopell
threshold. The electrical synapses are modeled as giving currents proportional to the differences in the voltages of the two cells. The analysis is done by means of the spike response method (Gerstner & van Hemmen, 1992; Gerstner, 1995; Gerstner, van Hemmen, & Cowan, 1996; Chow, 1998), in which the effects of the coupling and the spikes are encoded in response kernels. Bressloff and Coombes (1998, 2000) have a similar formalism for analyzing these types of networks. Although the spike response method was invented for synaptic interactions, we show here that it can be used for electrical interactions as well. Indeed, the spike response method allows us to consider the simultaneous effects of both kinds of coupling in a uniform formalism. We consider the consequences of the interaction of electrical and inhibitory synapses in a future publication. Here, we consider only electrical coupling and focus on the different effects of spike shape on synchronization in different regimes.

We start in section 2 with the equations of the neurons and the synapses as coupled differential equations. Because the equations are piecewise linear between spikes, they can be explicitly integrated to compute the response kernels. We compute those kernels in terms of the spike parameters, the strength of the electrical synapses, and the intrinsic recovery rate of the uncoupled neuron. This gives explicit solutions for the coupled equations in terms of those parameters and the driving currents to each of the cells (which need not be the same). These explicit solutions are the basis for the analysis in the rest of the article. We note that the solutions have the same form as those analyzed in Chow (1998) describing interactions of cells via synaptic interactions. Thus, the general stability criteria developed there can also be used for the current problem.

In section 3, we consider phase-locked solutions between two spiking neurons connected by gap junctions.
We find that, depending on the parameters of the neurons and the gap junction strength, the neurons can either synchronize, antisynchronize, be phase locked at an arbitrary phase, or lose periodic firing. One nonintuitive result is that changing the shape of the spikes may have different effects on the network in different parameter regimes. For example, increasing the amplitude and width of the spikes diminishes the range over which synchrony is stable when the frequency is relatively high. However, at low frequencies in which there is bistability, it can enhance synchrony by diminishing the range of parameters over which the competing antisynchronous solutions are stable. For nonsynchronous modes such as antisynchrony or splay phase, electrical coupling can change the period of the network. Not only does the period of each cell change in such nonsynchronous modes, but the network frequency also increases significantly; for example, in the antisynchronous mode, the network frequency is twice the cell frequency. Thus, relatively weak electrical coupling may be functionally important in creating appropriate frequency ranges for oscillations in a coupled network. Whether a given mode of locking has a stable existence also depends on the network frequency.
Dynamics of Spiking Neurons
1645
The theory predicts a sequence of bifurcations as the driving current to the cells, and hence the frequency of the network, is changed. We test that theory against the two biophysical models and find agreement for a model that has large spikes and another model that has small spikes (see the appendix for details of the models).

In section 4 we consider an all-to-all electrically coupled network of neurons. We show that a large family of periodic phase-locked states can exist, including synchronous and clustered states. We also show that the splay-phase state, in which all the neurons fire in sequence, can exist and be stable over only a finite range of periods. Finally, we review some related modeling work and place our results in the context of physiological systems that are electrically coupled.

2 Equations and Methods

2.1 Spiking Neuron Model. We consider a simple integrate-and-fire model of the neuron with the form
$$C\,\frac{dV}{d\tilde t} = \tilde I - g_L V + \sum_l \tilde A(\tilde t - \tilde t_l), \qquad (2.1)$$
where V is the voltage, C is the capacitance, Ĩ is the applied current, g_L is an effective passive leak conductance, Ã(t̃) is a term representing spiking and restoring currents, and t̃_l represents the times that V(t̃) reaches a threshold V = V_T from below. When V reaches threshold, a new term Ã(t̃ − t̃_l) is added, which generates a spike (action potential) and resets the potential V to V_0. The current Ã(t̃) associated with an individual spike and recovery is chosen to be
$$\tilde A(\tilde t) = \begin{cases} g_L V_A\,e^{\tilde j \tilde t}, & 0 < \tilde t \le \tilde D, \\ -C\,(V_T - V_0 + V_M)\,\delta(\tilde t - \tilde D), & \tilde D < \tilde t. \end{cases} \qquad (2.2)$$
Here V_A is a spiking amplitude scale, j̃ is the rise rate of the spiking current, D̃ is the width of the spike from threshold to the peak, V_M is the maximum amplitude above threshold that the spike reaches before the potential is reset, and δ(·) is the Dirac delta function (which is used to reset the potential). Ã(t̃) represents nonlinear currents that mediate a fast activation to generate a spike, followed by an even faster inactivation and hyperpolarization to bring the potential back to V_0. V_M is completely determined by the membrane dynamics and Ã(t̃). The potential can be shifted and rescaled with v = (V − V_0)/(V_T − V_0) so that the threshold has a value of v = 1 and the reset potential is v = 0. We
1646
Carson C. Chow and Nancy Kopell
also rescale time by t = t̃/τ_m, where τ_m = C/g_L. We then arrive at a rescaled system of the form
$$\frac{dv}{dt} = I - v + A(t), \qquad (2.3)$$
where
$$I = \frac{\tilde I - g_L V_0}{g_L\,(V_T - V_0)}, \qquad (2.4)$$
and
$$A(t) = \begin{cases} v_A\,e^{j t}, & 0 < t \le D, \\ -(1 + v_M)\,\delta(t - D), & D < t, \end{cases} \qquad (2.5)$$
with v_A = V_A/(V_T − V_0), v_M = V_M/(V_T − V_0), D = D̃/τ_m, and j = j̃τ_m. The membrane equation has four dimensionless parameters: the applied current I, the spike rise rate j, the spike width D, and the spike amplitude scale v_A. The latter three parameters control the shape and amplitude of the spike. In this form, I must be larger than unity in order for the potential to reach the threshold for firing. To compute the scaled maximum amplitude v_M, we integrate equation 2.3 over the width of one spike. Taking initial conditions to be the potential at threshold (v = 1 and t = 0), we obtain
$$v(t) = 1 + I\,(1 - e^{-t}) + \frac{v_A}{1 + j}\left[e^{j t} - e^{-t}\right], \qquad t < D, \qquad (2.6)$$
with v_M defined by v_M = v(D) − 1. For a fast-rising and narrow spike (compared to the membrane timescale), we can assume v_M ≈ v_A e^{jD}/(1 + j). Figure 1 shows four examples of the voltage traces for different parameters of the model given in equation 2.3.

2.2 Spike Response Model for Coupled Neurons. We use the spike response formalism for a system of two neurons coupled with resistive gap junctions. (We can also include synaptic coupling within this formalism.) In doing so, the effects of the gap junction coupling will be separated from the intrinsic dynamics. After the equations are derived, we will apply the techniques previously used to understand the dynamics of synaptically coupled neurons. We will also show how this method can be generalized to a network of N all-to-all coupled neurons. The equations are
$$\frac{dv_1}{dt} = I_1 - v_1 - g\,(v_1 - v_2) + \sum_l A(t - t_1^l), \qquad (2.7)$$
$$\frac{dv_2}{dt} = I_2 - v_2 - g\,(v_2 - v_1) + \sum_l A(t - t_2^l), \qquad (2.8)$$
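As a quick numerical sanity check of the single-cell expressions, the sketch below evaluates v(t) from equation 2.6, obtains v_M = v(D) − 1, and compares it with the narrow-spike approximation v_M ≈ v_A e^{jD}/(1 + j). The parameter values are those listed for panel (a) of Figure 1; the function names are ours, not the paper's.

```python
import math

# Sketch: check eq. 2.6 and the narrow-spike approximation for v_M.
# Parameters are those of Figure 1a; function names are illustrative.

def v_spike(t, I, j, vA):
    """Membrane potential during a spike, eq. 2.6 (valid for 0 < t < D)."""
    return 1.0 + I * (1.0 - math.exp(-t)) + vA / (1.0 + j) * (math.exp(j * t) - math.exp(-t))

def vM_exact(I, j, D, vA):
    """Scaled peak amplitude above threshold, v_M = v(D) - 1."""
    return v_spike(D, I, j, vA) - 1.0

def vM_approx(j, D, vA):
    """Fast-rising, narrow-spike approximation v_M ~ v_A e^{jD}/(1 + j)."""
    return vA * math.exp(j * D) / (1.0 + j)

I, j, D, vA = 1.3, 50.0, 0.1, 1.0
print(vM_exact(I, j, D, vA), vM_approx(j, D, vA))
```

For these fast-rising parameters the two values agree to within a few percent; the residual comes from the leak and drive terms neglected in the approximation.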
Figure 1: Examples of voltage traces for the integrate-and-fire model with parameters (a) j = 50, D = 0.1, v_A = 1, I = 1.3; (b) j = 12, D = 0.1, v_A = 1, I = 1.3; (c) j = 12, D = 0.5, v_A = 1, I = 1.55; (d) j = 12, D = 0.5, v_A = 0.1, I = 1.55.
where g is a gap junction strength (scaled by g_L) and t_i^l represents the times when v_i crosses the threshold from below. The spiking kernel A(t) generates a spike and resets the potential to zero. Our strategy is to express the dynamics for v_i in terms of a set of response kernels, which we will explicitly calculate. We first transform into normal modes v_+ = v_1 + v_2 and v_− = v_1 − v_2 to obtain
$$\frac{dv_+}{dt} = I_+ - v_+ + \sum_l A(t - t_1^l) + \sum_m A(t - t_2^m), \qquad (2.9)$$
$$\frac{dv_-}{dt} = I_- - r\,v_- + \sum_l A(t - t_1^l) - \sum_m A(t - t_2^m), \qquad (2.10)$$
where r = 1 + 2g, I_+ = I_1 + I_2, and I_− = I_1 − I_2. Integrating gives
$$v_+ = I_+\,(1 - e^{-t}) + \sum_l g_+(t - t_1^l) + \sum_m g_+(t - t_2^m), \qquad (2.11)$$
$$v_- = \frac{I_-}{r}\,(1 - e^{-rt}) + \sum_l g_-(t - t_1^l) - \sum_m g_-(t - t_2^m). \qquad (2.12)$$
Since we are interested in the steady state, we can start the interaction at any initial conditions; we have taken initial conditions of v_+ = v_− = 0. The kernels are nonzero only for positive argument. They are given by
$$g_\alpha(t) = \int_0^t e^{-r_\alpha (t - t')}\,A(t')\,dt', \qquad (2.13)$$
and after integrating
$$g_\alpha(t) = \begin{cases} v_A\,(r_\alpha + j)^{-1}\left[e^{j t} - e^{-r_\alpha t}\right], & 0 < t \le D, \\ -(1 + v_M - g_\alpha(D))\,e^{-r_\alpha (t - D)}, & D < t, \end{cases} \qquad (2.14)$$
where α = ±, r_+ = 1, and r_− = r. We note that for fast-rising and narrow spikes, v_M ≈ g_+(D). Returning to the original coordinates and assuming that the neurons have been spiking for a long time so that initial conditions have decayed away, we obtain the spike response equations,
$$v_1(t) = \hat I_1 + \sum_l c_s(t - t_1^l) + \sum_m c_c(t - t_2^m), \qquad (2.15)$$
$$v_2(t) = \hat I_2 + \sum_l c_s(t - t_2^l) + \sum_m c_c(t - t_1^m), \qquad (2.16)$$
where
$$\hat I_1 = \frac{1}{2}\left(1 + \frac{1}{r}\right) I_1 + \frac{1}{2}\left(1 - \frac{1}{r}\right) I_2, \qquad (2.17)$$
$$\hat I_2 = \frac{1}{2}\left(1 - \frac{1}{r}\right) I_1 + \frac{1}{2}\left(1 + \frac{1}{r}\right) I_2, \qquad (2.18)$$
and
$$c_s(t) = \frac{1}{2}\left[g_+(t) + g_-(t)\right], \qquad (2.19)$$
$$c_c(t) = \frac{1}{2}\left[g_+(t) - g_-(t)\right]. \qquad (2.20)$$
c_s is the spike generation and reset kernel, and c_c is the coupling kernel in the spike response method (Gerstner et al., 1996; Chow, 1998). We note that Σ_m c_c(t − t_i^m) can be interpreted as the contribution to the membrane potential due to the gap junction coupling. It is the gap junction analog of the postsynaptic potential.
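The kernel construction above can be sketched numerically: build g_± from the closed form of equation 2.14 (with v_M = g_+(D), the narrow-spike identification used in the text), form c_s and c_c from equations 2.19–2.20, and confirm that for t > D they reduce to the explicit expressions of equations 2.21 and 2.23 with d_c = 1 + 2 c_c(D). Parameter values are illustrative; all function names are ours.

```python
import math

# Sketch: response kernels from eqs. 2.14 and 2.19-2.20, checked against
# the explicit falling-branch forms of eqs. 2.21/2.23. Illustrative values.

def make_kernels(j, D, vA, g):
    r = 1.0 + 2.0 * g

    def g_alpha(t, ra):
        """Normal-mode kernel g_±, eq. 2.14, with v_M = g_+(D)."""
        if t <= 0.0:
            return 0.0
        if t <= D:
            return vA / (ra + j) * (math.exp(j * t) - math.exp(-ra * t))
        gD = vA / (ra + j) * (math.exp(j * D) - math.exp(-ra * D))
        vM = vA / (1.0 + j) * (math.exp(j * D) - math.exp(-D))  # = g_+(D)
        return -(1.0 + vM - gD) * math.exp(-ra * (t - D))

    c_s = lambda t: 0.5 * (g_alpha(t, 1.0) + g_alpha(t, r))  # eq. 2.19
    c_c = lambda t: 0.5 * (g_alpha(t, 1.0) - g_alpha(t, r))  # eq. 2.20
    d_c = 1.0 + 2.0 * c_c(D)                                 # eq. 2.22
    return c_s, c_c, d_c, r

c_s, c_c, d_c, r = make_kernels(j=50.0, D=0.1, vA=1.0, g=0.5)
t = 1.0  # a time on the falling branch (t > D)
print(c_s(t), -0.5 * (math.exp(-(t - 0.1)) + d_c * math.exp(-r * (t - 0.1))))
```

The two printed numbers coincide, which is exactly the statement that equation 2.21 is the falling branch of (g_+ + g_−)/2.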
Figure 2: (a) Two examples of the c_s kernel with parameters D = 0.1, j = 50, v_A = 1, and gap junction conductances g = 0.5 (solid line) and g = 5 (dashed line). (b) Two examples of the c_c kernel with D = 0.1, j = 50, v_A = 1, and gap junction conductances g = 0.5 (solid line) and g = 5 (dashed line).
Written out explicitly, the c_s kernel is given by
$$c_s(t) = \begin{cases} \frac{v_A}{2}\left\{(1 + j)^{-1}\left[e^{j t} - e^{-t}\right] + (r + j)^{-1}\left[e^{j t} - e^{-r t}\right]\right\}, & 0 < t \le D, \\ -\frac{1}{2}\left[e^{-(t - D)} + d_c\,e^{-r (t - D)}\right], & D < t, \end{cases} \qquad (2.21)$$
where
$$d_c = 1 + 2\,c_c(D). \qquad (2.22)$$
The c_s kernel has the shape of a single spike, as seen in Figure 2a. Increasing the gap junction strength through r mainly affects the recovery phase.
The c_c kernel is given by
$$c_c(t) = \begin{cases} \frac{v_A}{2}\left\{(1 + j)^{-1}\left[e^{j t} - e^{-t}\right] - (r + j)^{-1}\left[e^{j t} - e^{-r t}\right]\right\}, & 0 < t \le D, \\ -\frac{1}{2}\left[e^{-(t - D)} - d_c\,e^{-r (t - D)}\right], & D < t. \end{cases} \qquad (2.23)$$
The c_c kernel is the contribution to the electrical coupling due to a single spike. Examples of c_c are shown in Figure 2b. For short times, c_c is positive, rising until a cusp, which occurs at the peak of the spike. After the cusp, c_c begins to decrease, becoming negative for long times. The sum over the c_c kernels can be thought of as setting a background potential level on which the spiking takes place. The parameter d_c, a measure of the maximum amount of excitatory input received, is important for controlling the background potential level. Later, it will be seen that d_c is the critical parameter for determining the synchronizing behavior of the network. For fast-rising spikes, we can assume exp(jD) ≫ exp(−D), leading to
$$d_c \approx 1 + v_M\left(1 - \frac{1 + j}{r + j}\right), \qquad (2.24)$$
where we have used v_M ≡ g_+(D) ≈ v_A e^{jD}/(1 + j). d_c can be increased by increasing r through the gap junction strength g or by increasing the maximum spike amplitude v_M. (Note that v_M is measured in reference to a fixed postspike hyperpolarization.) While increasing the spike rise rate j increases v_M, if v_M is kept fixed, increasing j actually decreases d_c. For weak coupling strength (g ≪ 1), d_c has the form
$$d_c \approx 1 + \frac{2 g\,v_M}{1 + j}. \qquad (2.25)$$
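The chain of approximations for d_c can be checked directly: the exact value from equation 2.22, the narrow-spike form of equation 2.24, and the weak-coupling form of equation 2.25 should agree for fast spikes and small g. The sketch below does this with illustrative parameter values (the function names are ours).

```python
import math

# Sketch: d_c computed three ways (eqs. 2.22, 2.24, 2.25). Illustrative values.

def dc_exact(j, D, vA, g):
    r = 1.0 + 2.0 * g
    ccD = 0.5 * vA * ((math.exp(j * D) - math.exp(-D)) / (1.0 + j)
                      - (math.exp(j * D) - math.exp(-r * D)) / (r + j))
    return 1.0 + 2.0 * ccD                           # eq. 2.22

def dc_narrow(j, D, vA, g):
    r = 1.0 + 2.0 * g
    vM = vA * math.exp(j * D) / (1.0 + j)            # v_M ~ g_+(D)
    return 1.0 + vM * (1.0 - (1.0 + j) / (r + j))    # eq. 2.24

def dc_weak(j, D, vA, g):
    vM = vA * math.exp(j * D) / (1.0 + j)
    return 1.0 + 2.0 * g * vM / (1.0 + j)            # eq. 2.25

print(dc_exact(50, 0.1, 1.0, 0.05), dc_narrow(50, 0.1, 1.0, 0.05), dc_weak(50, 0.1, 1.0, 0.05))
```

Increasing g in `dc_exact` also illustrates the claim that d_c grows with the gap junction strength.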
In the ensuing text, we will refer to spikes with d_c ≫ 1 as large spikes and spikes with d_c ∼ 1 as small spikes.

3 Phase-Locked States
In this section we apply the formalism of the previous section and analyze the existence and stability of phase-locked states for two neurons coupled through gap junctions. We focus on the states of synchrony (S) and antisynchrony (AS), although a third phase-locked state can also exist. We show that by changing the frequency of the network, the neurons can make transitions between these phase-locked states. In some instances, bistability is possible. 3.1 Conditions for Periodic Phase Locking. We are interested in the existence and stability of phase-locked periodic solutions. Here, we describe
a self-consistent method for determining the period T of a periodic solution and the phase difference w between the cells in that periodic solution. We show that there is a function G(w, T) whose zeros for fixed T give the phase difference; there is a companion equation that is used to determine T. We consider two neurons firing in a phase-locked pattern. We analyze this situation by supposing that the neurons fire periodically at t_1^l = −lT and t_2^l = (w − l)T. Without loss of generality, we assume that cell 1 is ahead of or at the same position as cell 2, that is, w ≥ 0. The next time the ith neuron fires is when the potential v_i reaches threshold from below (i.e., v̇_i > 0). Thus, using the spike response equations, we can derive a condition for phase-locked firing. Noting that neuron 1 will next fire at t = 0 and neuron 2 will next fire at t = wT, we obtain from equations 2.15 and 2.16
$$v_1(0) = 1 = \hat I_1 + \sum_{l \ge 1} c_s(lT) + \sum_{l \ge 1} c_c(lT - wT), \qquad (3.1)$$
$$v_2(wT) = 1 = \hat I_2 + \sum_{l \ge 1} c_s(lT) + \sum_{l \ge 0} c_c(lT + wT). \qquad (3.2)$$
This is a set of two equations with two unknowns. Note that the second sum in equation 3.2 starts with the index l = 0 because cell 2 feels the effect of cell 1 in the first cycle, but not vice versa. We can rewrite this system in a more convenient form. Subtracting equation 3.2 from 3.1 yields
$$G(w, T) = \hat I_1 - \hat I_2 = \frac{I_1 - I_2}{1 + 2g}, \qquad (3.3)$$
where
$$G(w, T) = c_c(wT) + \sum_{l \ge 1}\left[c_c(lT + wT) - c_c(lT - wT)\right]. \qquad (3.4)$$
This equation can be interpreted as a condition for the phase w given a fixed period T (provided a solution with that T is possible). Adding equation 3.2 to 3.1 and dividing by two yields
$$F(w, T) = 1 - \bar I, \qquad (3.5)$$
where
$$F(w, T) = \frac{1}{2}\,c_c(wT) + \sum_{l \ge 1}\left[c_s(lT) + \frac{1}{2}\left(c_c(lT - wT) + c_c(lT + wT)\right)\right] \qquad (3.6)$$
and Ī = (Î_1 + Î_2)/2 = (I_1 + I_2)/2. This equation gives a condition for the period T given a fixed phase w.
Although equation 3.5 could possibly allow multiple solutions for a given phase w, there is only one "physical" solution, which is given by the smallest solution for T. This is because the neuron is considered to fire and reset immediately on crossing the threshold from below. A condition that also must be satisfied is that a neuron must remain subthreshold throughout its period. We examine this condition in detail for AS in section 3.3. For homogeneous systems (I_1 = I_2), S (w = 0) and AS (w = 0.5) both satisfy equation 3.3. For S, the period condition, equation 3.5, is the same as that for a single neuron. As we will show in section 3.2, a solution to equation 3.5 can almost always be found for S. However, the existence of AS is not guaranteed. As will be shown in section 3.3, a neuron may not remain subthreshold while the other neuron fires. Stability of these periodic phase-locked solutions can also be analyzed. In Chow (1998), a sufficient condition and a separate necessary condition for stable phase locking were derived. The sufficient condition for stability of a locked solution with phase w is that the slopes of the kernels (c_s and c_c) are both positive at times t = lT ± wT; that is, they are rising at the time of firing (Gerstner et al., 1996; Chow, 1998). The necessary condition for stability is that the derivative of G(w, T) with respect to w must be positive (van Vreeswijk, Abbott, & Ermentrout, 1994; Chow, 1998).

3.2 Period of Phase Locking. The network period depends on the parameters and the phase-locked state. For a given fixed phase w, the period T is determined implicitly from equation 3.5. Here we will show that for S, a solution for the period can always be found if the input current is suprathreshold and not too large. However, for AS, condition 3.5 may not always have a solution for T. We show that if I is too small (depending on the other parameters), there is no solution, implying that AS does not exist in that regime.
As will be seen in section 3.3, even if a solution exists to equation 3.5, AS still may not exist if the value obtained for the period is too large. When both S and AS exist, we analyze equation 3.5 to see the relationship between their periods, T_S and T_AS, respectively. The result is that for large spikes, T_AS < T_S for fixed parameter values, but the opposite is true for small spikes. We also look at how the period changes with parameters. We show that increasing the gap junction strength g decreases T_AS for large spikes but increases it for small spikes. Since equation 3.5 is a continuous function of w, T changes continuously with changing w. For S, using equations 2.19 and 2.20, equation 3.6 becomes
$$F(0, T_S) = \sum_{l \ge 1} g_+(lT_S). \qquad (3.7)$$
Inserting the kernel g_+ from equation 2.14 into 3.7 and evaluating the sums, condition 3.5 becomes
$$F(0, T_S) = \frac{e^{D}}{1 - e^{T_S}} = 1 - \bar I \qquad (3.8)$$
for T_S > D. It can be easily shown that this is the period condition for an uncoupled neuron (Chow, 1998; Chow, White, Ritt, & Kopell, 1998). The period is defined implicitly by equation 3.8. The function F(0, T_S) is negative and monotone increasing in T_S. For 1 < Ī ≤ 1 + (1 − e^{−D})^{−1}, there is always a solution for T_S. (We note that the period is well defined only for T_S > D.) Solving equation 3.8 for T_S yields
$$T_S = \ln\left(\frac{\bar I - 1 + e^{D}}{\bar I - 1}\right). \qquad (3.9)$$
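Equation 3.9 can be checked numerically: for a few illustrative suprathreshold values of Ī (our choices, not the paper's), the closed-form period should satisfy the implicit condition of equation 3.8 to machine precision, and the period should shrink as the drive grows.

```python
import math

# Sketch: the synchronous/uncoupled period, eq. 3.9, checked against the
# implicit condition eq. 3.8, F(0, T_S) = e^D / (1 - e^{T_S}) = 1 - Ibar.

def TS(Ibar, D):
    return math.log((Ibar - 1.0 + math.exp(D)) / (Ibar - 1.0))   # eq. 3.9

def F0(T, D):
    return math.exp(D) / (1.0 - math.exp(T))                     # eq. 3.8

D = 0.1
for Ibar in (1.2, 1.5, 2.0):          # illustrative suprathreshold drives
    T = TS(Ibar, D)
    print(Ibar, T, F0(T, D) - (1.0 - Ibar))   # residual should be ~0
```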
The threshold for firing is Ī = 1. The period decreases with increasing Ī. For AS, equation 3.6 gives
$$F(1/2, T_{AS}) = \sum_{l \ge 1} c_s(lT_{AS}) + \sum_{m \ge 0} c_c(mT_{AS} + T_{AS}/2). \qquad (3.10)$$
Inserting the kernels 2.21 and 2.23 for t > D into equation 3.10 and evaluating the sums leads to the period condition, equation 3.5:
$$F(1/2, T_{AS}) = \frac{d_c}{2}\,\frac{e^{rD}}{e^{rT_{AS}/2} + 1} - \frac{1}{2}\,\frac{e^{D}}{e^{T_{AS}/2} - 1} = 1 - \bar I \qquad (3.11)$$
for T_AS > 2D. The smallest T_AS satisfying equation 3.11 is the only physical solution. There are situations where a solution to equation 3.11 does not exist. From that equation, we see that for a given Ī, if d_c is large enough and r is small enough, then T_AS does not exist, since F(1/2, T_AS) stays above 1 − Ī. By the same reasoning, solutions for T_AS can then exist for subthreshold input (Ī < 1); the excitation provided by each neuron through the gap junction would sustain firing. Except for cases where d_c and r are very large, near the smallest solution of equation 3.11 we have ∂F(1/2, T_AS)/∂T > 0. This implies that increasing Ī decreases the period. From equation 3.11 we see that increasing d_c increases F(1/2, T_AS) and thus decreases T_AS. This can be understood heuristically, since the spike contributes an effective excitation through the gap junction and d_c increases with increasing width and amplitude of the spike. We note that d_c increases when r = 1 + 2g increases. On the other hand, the factor e^{rD}/(e^{rT_{AS}/2} + 1) increases only if T_AS < 2D and decreases otherwise. However, for relatively small r and T_AS (recall that small is in comparison to the effective leak time), the factor decreases slowly. Thus, for cells with large spikes and periods that are fast compared to the leak time, increasing r generally decreases the period for AS, while for cells with small spikes and long periods, increasing r increases the period. We can compare the periods of AS to S by rewriting F(1/2, T) as
$$F(1/2, T) = F(0, T) + y(T), \qquad (3.12)$$
where
$$y(T) = \frac{d_c}{2}\,\frac{e^{rD}}{e^{rT/2} + 1} - \frac{1}{2}\,\frac{e^{D}}{e^{T/2} + 1}. \qquad (3.13)$$
Comparing to equation 3.8, we find T_S = T_AS when y = 0. For a given period, there is a set of possible parameters d_c and r for which y = 0. The trivial solution is d_c = 1 and r = 1, which corresponds to uncoupled neurons. For weak coupling and a small spike amplitude, T_S ≈ T_AS. If y > 0, then T_S > T_AS, and vice versa. From equation 3.13, we find that y can be positive if d_c is large enough. Again, this can be understood heuristically. For strong spikes and weak coupling, AS has a shorter period, but for weak spikes and strong coupling, the opposite is true.

3.3 Existence of the Antisynchronous State. In order for AS to exist, the potential of the neurons must remain below threshold for the duration of a period. This can be violated if the spikes have large enough amplitude or the electrical coupling is strong enough so that when one of the neurons spikes, it induces the second neuron to cross threshold; that is, as soon as one of the neurons fires, the other will be induced to fire nearly immediately, and the neurons will tend to synchronize. This effect is similar to fast threshold modulation (Somers & Kopell, 1993). The spike mediated through the gap junction acts like a brief excitatory pulse synchronizing the two neurons, prohibiting antisynchrony. We can make this more concrete by considering AS where t_1^l = −lT and t_2^l = T/2 − lT. The potential of neuron 1 obeys
$$v_1(t) = I_1 + \sum_{l \ge 0}\left[c_s(t + lT) + c_c(t - T/2 + lT)\right], \qquad t > 0. \qquad (3.14)$$
We require v_1(t) < 1 for 0 < t < T. When the spike of neuron 2 is felt by neuron 1 at t = T/2 + D, there is a chance that neuron 1 may be induced to fire. To prevent this, we must have
$$v_1(T/2 + D) = I_1 + \sum_{l \ge 0}\left[c_s(T/2 + D + lT) + c_c(D + lT)\right] < 1. \qquad (3.15)$$
Using equations 2.21 and 2.23 in 3.15 and evaluating the sums yields
$$d_c < \left(3 - 2 I_1 + \frac{1}{e^{T/2} - 1}\right)\left(1 + e^{-rT/2}\right). \qquad (3.16)$$
The applied current and the period are related through condition 3.11. For homogeneous neurons, I_1 = Ī. Solving equation 3.11 for Ī and substituting into 3.16 yields
$$d_c < \left(1 + \frac{1 - e^{D}}{e^{T/2} - 1} + \frac{d_c\,e^{rD}}{e^{rT/2} - 1}\right)\left(1 + e^{-rT/2}\right). \qquad (3.17)$$
For very narrow spikes (D ≈ 0), this inequality is approximately
$$d_c < \frac{\sinh(rT/2)}{\sinh(rT/2) - 1}. \qquad (3.18)$$
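The behavior of the bound in equation 3.18 is easy to tabulate: as the period grows, the admissible d_c shrinks toward unity. The sketch below does this for an illustrative gap junction strength (our choice of values).

```python
import math

# Sketch: narrow-spike existence bound for AS, eq. 3.18:
# AS requires d_c < sinh(rT/2) / (sinh(rT/2) - 1). Illustrative values.

def dc_max(T, g):
    s = math.sinh((1.0 + 2.0 * g) * T / 2.0)
    return s / (s - 1.0)      # valid once sinh(rT/2) > 1

g = 0.5
bounds = [dc_max(T, g) for T in (2.0, 4.0, 8.0, 16.0)]
print(bounds)   # monotonically decreasing toward 1
```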
As the period T increases, the right-hand side of equation 3.18 decreases toward unity. Hence, as the period gets longer, d_c must be smaller in order for AS to exist. In the limit T → ∞, condition 3.18 becomes d_c < 1. Recall that d_c ≥ 1 and decreases with decreasing spike amplitude. Thus, as the period approaches infinity, the spike amplitude must approach zero for AS to exist. The behavior is the same for increasing the gap junction strength r. For a fixed spike amplitude and gap junction strength, there is a maximum period allowable for AS to exist. The larger the spike or the stronger the gap junction, the smaller this maximum is.

3.4 Global Behavior. A global view of the existence and stability of phase-locked states can be obtained from condition 3.3 if we treat the period T as a bifurcation parameter. For a fixed period T, equation 3.3 provides the phase w of any locked solution; by Chow (1998), that solution is unstable if the slope of G(w, T) is negative. In section 3.7, we relate the bifurcation unfolding for T to the network parameters I and g. We can compute G(w, T) explicitly by evaluating the sums in equation 3.4 to obtain, for T ≥ 2D,
$$G(w, T) = c_c(wT) - \frac{1}{2}\left[\frac{e^{-(T + wT - D)} - e^{-(T - wT - D)}}{1 - e^{-T}} - d_c\,\frac{e^{-r(T + wT - D)} - e^{-r(T - wT - D)}}{1 - e^{-rT}}\right], \quad wT \le D, \qquad (3.19)$$
$$G(w, T) = -\frac{1}{2}\left[\frac{e^{-(wT - D)} - e^{-(T - wT - D)}}{1 - e^{-T}} - d_c\,\frac{e^{-r(wT - D)} - e^{-r(T - wT - D)}}{1 - e^{-rT}}\right], \quad D < wT < T/2, \qquad (3.20)$$
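Equations 3.19–3.20 are straightforward to evaluate. The sketch below (illustrative parameters; the helper names are ours) builds G(w, T) and checks two structural facts used in the text: w = 0 (S) and w = 0.5 (AS) are always zeros of G, and G is odd about w = 1/2.

```python
import math

# Sketch: G(w, T) from eqs. 3.19-3.20 (valid for T >= 2D). Zeros of G are
# phase-locked solutions; a negative slope at a zero implies instability.

def make_G(j, D, vA, g):
    r = 1.0 + 2.0 * g

    def cc(t, dc):
        """Coupling kernel, eq. 2.23."""
        if t <= 0.0:
            return 0.0
        if t <= D:
            return 0.5 * vA * ((math.exp(j * t) - math.exp(-t)) / (1.0 + j)
                               - (math.exp(j * t) - math.exp(-r * t)) / (r + j))
        return -0.5 * (math.exp(-(t - D)) - dc * math.exp(-r * (t - D)))

    dc = 1.0 + 2.0 * cc(D, 0.0)        # eq. 2.22 (rising branch; dc unused there)

    def G(w, T):
        if w > 0.5:                    # G is odd about w = 1/2
            return -G(1.0 - w, T)
        if w * T <= D:                 # eq. 3.19
            a = (math.exp(-(T + w*T - D)) - math.exp(-(T - w*T - D))) / (1.0 - math.exp(-T))
            b = (math.exp(-r*(T + w*T - D)) - math.exp(-r*(T - w*T - D))) / (1.0 - math.exp(-r*T))
            return cc(w * T, dc) - 0.5 * (a - dc * b)
        a = (math.exp(-(w*T - D)) - math.exp(-(T - w*T - D))) / (1.0 - math.exp(-T))    # eq. 3.20
        b = (math.exp(-r*(w*T - D)) - math.exp(-r*(T - w*T - D))) / (1.0 - math.exp(-r*T))
        return -0.5 * (a - dc * b)

    return G

G = make_G(j=50.0, D=0.1, vA=1.0, g=0.5)
print(G(0.0, 2.0), G(0.5, 2.0))   # S and AS are always zeros of G
```

Scanning G on a grid of w for several periods reproduces the qualitative progression of Figure 3.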
and G(w, T) is an odd function about w = 1/2. The functional form of G for T < 2D is much more complicated. Four examples of G(w, T) versus w for a progression of different fixed periods T are shown in Figure 3. Phase-locked solutions are given by the zero crossings of G(w, T). If the slope at the zero crossing is negative, then the solution is unstable. Although the condition G′(w, T) > 0 is only a necessary condition for stability, it allows us to use the graphs of G to give insight into the regimes for which various locked solutions are stable. Bifurcations take place at critical points T_C that satisfy G(w, T_C) = 0. The four parts of Figure 3 capture the qualitative dynamics of a system coupled with weak gap junctions. The bifurcation sequence is summarized in Figure 4. For
Figure 3: Four examples of G(w) with parameters D = 0.1, j = 50, g = 0.5, v_A = 1 and period (a) T = 2, (b) T = 1, (c) T = 0.25, and (d) T = 0.2.
Figure 4: Bifurcation sequence for phase-locked states with phase w and period T for weak gap junctions. Solid lines indicate stable states and dashed lines unstable states. Moving from left to right corresponds to increasing period, and moving from right to left corresponds to increasing frequency. There are four bifurcation points: T_C^{AS1}, T_C^{AS2}, T_C^{AS3}, and T_C^{S}.
strong gap junctions, synchrony is always stable, as one would expect. As the gap junction strength is reduced, the state will make a transition into the corresponding weak gap junction state for that particular period.
We consider the bifurcation unfolding for weak gap junctions as the period is decreased (see Figure 4). For a long period, as in Figure 3a, both w = 0 and w = 0.5 are zeros with G′(w, T) > 0. Thus, the necessary conditions for stable S and AS are satisfied. However, AS may not exist for long periods, as shown in section 3.3. There is also an unstable third mode that appears symmetrically around AS. (This third mode also exists for synaptic coupling; van Vreeswijk et al., 1994; Hansel, Mato, & Meunier, 1995; Chow, 1998.) G(w, T) has a cusp between w = 0 and the third mode, which comes from the cusp in c_c. As the period is decreased, the third mode approaches w = 0.5 until an inverse pitchfork bifurcation at G′(0.5, T) = 0, when AS loses stability and the third mode disappears (see Figure 3b). This corresponds to the value T_C^{AS1} in Figure 4. With a further decrease in period, S loses stability through a pitchfork bifurcation, and a stable third mode reappears (see Figure 3c). This occurs at T_C^{S} in Figure 4. As the period decreases even further, the cusp approaches w = 0.5. At T = T_C^{AS2} ≡ 2D, the cusp crosses w = 0.5, and stable AS appears (see Figure 3d). This bifurcation is discontinuous because of the cusp. For a smooth spike, the cusp would be smoothed, and AS would gain stability through a pitchfork bifurcation. For even shorter periods, as will be seen in section 3.6, the AS state can lose stability to the third mode at T_C^{AS3}. In the synchronous state, the period of the network is given by the neurons' intrinsic dynamics because the effect of the electrical coupling is zero. For AS, the contribution from the electrical coupling affects the period (although this effect may be small). In the two following sections, we show how the critical points change as the network parameters are varied. We also consider the sufficient conditions for stability of S and AS in these sections.
With these results and those of section 3.2, we can construct the bifurcation sequence for changes in the applied current I and the gap junctional strength g. This bifurcation diagram gives information about necessary conditions for the stability of S and AS. In the next two sections we also consider the sufficient conditions.

3.5 Synchrony. As seen in section 3.4, S can become unstable if the period is too short. From equation 3.4 we find
$$G'(0, T)/T = \dot c_c(0) + 2\sum_{l \ge 1} \dot c_c(lT), \qquad (3.21)$$
where the prime indicates a derivative with respect to w and the dot indicates the derivative with respect to t. For c_c given by equation 2.23, ċ_c(0) = 0. This reflects the fact that the neurons do not immediately affect each other upon firing. Inserting equation 2.23 into 3.21 and evaluating the sum, we find that for T > D,
$$G'(0, T)/T = \frac{e^{D}}{e^{T} - 1} - d_c\,\frac{r\,e^{rD}}{e^{rT} - 1}. \qquad (3.22)$$
Note that d_c > 1, since c_c(D) > 0 and r > 1. Thus, for T small enough, d_c large enough, and r not too large, the second term will dominate the first term in equation 3.22, and G′(0, T) will become negative, indicating the instability of S. The lower bound of the period for stable S is given by the critical period T_C^{S}, which satisfies G′(0, T_C^{S}) = 0. For T below T_C^{S}, the necessary condition for stability is violated, and stable S cannot exist. We examine how T_C^{S} behaves when we change the parameters. The critical condition (see equation 3.22) can be rearranged to take the form
$$\frac{e^{r T_C^{S}} - 1}{e^{T_C^{S}} - 1} = r\,d_c\,e^{(r - 1)D}. \qquad (3.23)$$
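The critical period can be found numerically by bisecting on the sign of G′(0, T)/T from equation 3.22. The sketch below uses illustrative parameter values and assumes a single sign change on the search bracket (true here, since for r = 2 equation 3.23 reduces to e^T + 1 = r d_c e^{(r−1)D}, giving an explicit cross-check).

```python
import math

# Sketch: solve for T_C^S (eq. 3.23) by bisection on G'(0,T)/T (eq. 3.22).
# Illustrative parameter values; assumes one sign change on the bracket.

def Gp0(T, D, r, dc):
    """G'(0, T)/T from eq. 3.22 (valid for T > D)."""
    return math.exp(D) / (math.exp(T) - 1.0) - dc * r * math.exp(r * D) / (math.exp(r * T) - 1.0)

def TcS(D, r, dc, lo=1e-3, hi=50.0):
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Gp0(mid, D, r, dc) > 0.0:   # G' > 0: stable side, root lies below
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

D, r, dc = 0.1, 2.0, 2.0
T = TcS(D, r, dc)
# For r = 2, eq. 3.23 gives the explicit solution T = ln(r d_c e^{(r-1)D} - 1):
print(T, math.log(r * dc * math.exp((r - 1.0) * D) - 1.0))
```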
For large r, T_C^{S} decreases with increasing r. Thus for very strong gap junctions, the lower bound is the width of the spike, implying that synchrony can be stable at any allowable period (periods smaller than the width of the spike are not allowable in our model). This synchronizing tendency is the general presumption of gap junctions. However, if the gap junction is not strong, then there can be a range of frequencies for which S is unstable. The left-hand side of equation 3.23 is monotone increasing in T_C^{S}. Thus, T_C^{S} increases with increasing d_c or D. The conditions for stable synchrony depend importantly on the spike width D and on d_c. As either of these is increased, a longer period (lower frequency) is required for stable S. A larger-amplitude spike leads to a larger d_c and hence makes it more difficult to synchronize two neurons, in the sense that the range of frequencies where they can synchronize is reduced. We now consider sufficient conditions for stability. In Chow (1998), it was shown that ċ_s(lT) > 0, ċ_c(0) ≥ 0, and ċ_c(mT) > 0, for l ≥ 0, m ≥ 1, is sufficient for stability. From equation 2.21 and Figure 2a, the slope of c_s is always positive. From equation 2.23 and Figure 2b, the slope of c_c(t) is negative between the maximum at t = D and the minimum at t = t_min but positive everywhere else. Minimizing equation 2.23 for t > D gives
$$t_{\min} = D + \frac{1}{2g}\,\ln\left[d_c\,(1 + 2g)\right]. \qquad (3.24)$$
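Equation 3.24 can be checked against a brute-force minimization of the falling branch of c_c (equation 2.23), as in the sketch below (illustrative parameter values; our helper names).

```python
import math

# Sketch: check eq. 3.24 for t_min against brute-force minimization of
# the falling branch of c_c (eq. 2.23). Illustrative values.

def tmin_formula(D, g, dc):
    return D + math.log(dc * (1.0 + 2.0 * g)) / (2.0 * g)   # eq. 3.24

def cc_fall(t, D, r, dc):
    """c_c(t) for t > D, eq. 2.23."""
    return -0.5 * (math.exp(-(t - D)) - dc * math.exp(-r * (t - D)))

D, g, dc = 0.1, 0.5, 2.0
r = 1.0 + 2.0 * g
grid = [D + 1e-4 * (k + 1) for k in range(100000)]   # t in (D, D + 10]
t_num = min(grid, key=lambda t: cc_fall(t, D, r, dc))
print(tmin_formula(D, g, dc), t_num)   # should agree to grid resolution
```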
Recall that ċ_c(0) = 0. For T > t_min, we have ċ_c(mT) > 0, ensuring stability. For T < t_min, the slope of c_c(T) can be negative, and stability is no longer ensured. (Note that T > D; the period must be longer than the width of a spike.) For strong gap junctions, t_min approaches D. This implies that as the gap junction becomes stronger, S is guaranteed to be stable at higher and higher frequencies. For g ≪ 1, we can use equation 2.25 for d_c in 3.24 and expand to obtain
$$t_{\min} \approx D + 1 + \frac{v_M}{1 + j}. \qquad (3.25)$$
Thus, for weak gap junction strength, g has no effect on t_min at linear order.

3.6 Antisynchrony. The behavior of G(w, T) indicates that AS could be stable for either long periods or very short periods. We now investigate these possibilities in more detail. A necessary condition for stability of AS is that G′(0.5, T) > 0. The sufficient conditions for stability are given by ċ_s(lT + T/2) > 0 and ċ_c(lT + T/2) > 0. From equation 3.4, we find that
$$G'(0.5, T)/T = 2\sum_{l \ge 0} \dot c_c(lT + T/2). \qquad (3.26)$$
After evaluating the sum, we obtain
$$G'(0.5, T)/T = \begin{cases} 2\,\dot c_c(T/2) + \dfrac{e^{-(3T/2 - D)}}{1 - e^{-T}} - d_c\,r\,\dfrac{e^{-r(3T/2 - D)}}{1 - e^{-rT}}, & T \le 2D, \\[2mm] \dfrac{e^{-(T/2 - D)}}{1 - e^{-T}} - d_c\,r\,\dfrac{e^{-r(T/2 - D)}}{1 - e^{-rT}}, & T > 2D, \end{cases} \qquad (3.27)$$
where
$$\dot c_c(T/2) = \frac{v_A}{2}\left[\frac{j\,e^{jT/2} + e^{-T/2}}{1 + j} - \frac{j\,e^{jT/2} + r\,e^{-rT/2}}{r + j}\right], \qquad T \le 2D. \qquad (3.28)$$
First consider the behavior for T ≤ 2D ≡ T_C^{AS2}. This is the situation where the spikes are overlapped; one neuron reaches threshold while the other neuron is still in a spiking phase. From equation 3.27 and as seen in Figure 3d, AS can be stable for T = T_C^{AS2}, provided r is small enough. However, ċ_c(0) = 0 and is monotone increasing in T until T = 2D. Thus, if T is reduced, there is a bifurcation at T_C^{AS3} when G′(0.5, T_C^{AS3}) = 0 (not shown in Figure 3). Stability of AS is possible for T_C^{AS3} < T ≤ T_C^{AS2}. For small r, increasing r will increase T_C^{AS3}. Thus, the regime for stable AS at these short periods is reduced and could possibly be eliminated by increasing the gap junction strength. The sufficient conditions for stability are satisfied if T/2 > t_min, where t_min is given in equation 3.24. This sets a minimum on the period that depends on d_c. In the short period regime, we have shown that AS could satisfy the necessary condition for stability when the period is reduced to less than twice the width of the spike. However, at this point, the sufficient conditions are not satisfied, since T/2 < D < t_min. Although stable solutions are possible if the sufficient condition is violated, it may be that for extremely high frequencies, neither S nor AS is stable. For T slightly larger than 2D, G′(0.5, T) is negative. Once the spikes are no longer overlapped, AS immediately loses stability. This bifurcation is seen in Figures 3c and 3d. The bifurcation is discontinuous because the
spike shapes are not smooth. Had a smoother spike shape been chosen, the transition from stability to instability would be continuous. As the period is increased, the necessary condition for AS stability begins to be satisfied at the critical period T_C^{AS1} satisfying G′(0.5, T_C^{AS1}) = 0. Thus, T_C^{AS1} gives a lower bound for stable AS in the long period regime (T > 2D). Applying this to equation 3.27 for this regime gives the following condition for this lower bound:
$$\frac{\sinh\!\left(r\,T_C^{AS1}/2\right)}{\sinh\!\left(T_C^{AS1}/2\right)} = r\,d_c\,e^{(r - 1)D}. \qquad (3.29)$$
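Since the left-hand side of equation 3.29 is monotone increasing in T, the critical period T_C^{AS1} can be found by simple bisection, as in the sketch below (illustrative parameter values; helper names are ours).

```python
import math

# Sketch: solve eq. 3.29 for T_C^{AS1}, the lower bound for stable AS in
# the long-period regime, by bisection on the monotone left-hand side.
# Illustrative parameter values.

def lhs(T, r):
    return math.sinh(r * T / 2.0) / math.sinh(T / 2.0)

def Tc_AS1(D, r, dc, lo=1e-3, hi=50.0):
    target = r * dc * math.exp((r - 1.0) * D)
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if lhs(mid, r) < target:   # lhs increasing: root lies above
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

D, r, dc = 0.1, 1.5, 2.0
T = Tc_AS1(D, r, dc)
print(T, lhs(T, r), r * dc * math.exp((r - 1.0) * D))
```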
The left-hand side of equation 3.29 is monotone increasing in T_C^{AS1}, so increasing d_c or D increases T_C^{AS1}. Thus, large spikes increase the lower bound on the period for stable AS in the long period range. For strong gap junctions, T_C^{AS1} decreases, and in the limit r → ∞, T_C^{AS1} → 2D. The maximum period allowed for existence of AS as given by equation 3.18 also decreases as r is increased, and at a slightly faster rate. This implies that as the gap junction strength increases, AS can be stable at shorter periods in the long period regime, although it likely loses existence before the period can get too short. To summarize, the necessary condition for stability of AS is satisfied for periods longer than T_C^{AS1}, provided the gap junction is not too strong. It is also satisfied for short periods between T_C^{AS3} and T_C^{AS2}, although the sufficient conditions are not satisfied. This region is reduced with increasing gap junction strength. For long periods, existence is lost if the gap junction is too strong or the period is too long. Results from section 3.5 lead to the conclusion that in the short period regime, large spikes diminish the range of periods over which synchrony is stable. From this section we see that in the long period regime, large spikes diminish the range over which AS is stable. These results are borne out in numerical solutions, as will be shown in section 3.9.

3.7 Bifurcations for I and g. In section 3.4, we showed that for weak gap junctions, the bifurcation sequence shown in Figure 4 takes place for changes in the period T. We now relate this sequence to bifurcations using as parameters the applied current I and gap junction strength g. This is accomplished by identifying the locations of the four bifurcation points: T_C^{AS1}, T_C^{AS2}, T_C^{AS3}, and T_C^{S}. In section 3.2 we showed that increasing the applied current I decreases the period T. Thus, moving from right to left on the bifurcation plot in Figure 4 corresponds to increasing I.
However, for a fixed I, the states S, AS, and the third mode will have different periods, so the diagram will not carry over directly. The bifurcation sequence will vary according to whether the period decreases or increases as the phase of the periodic locked state moves
Dynamics of Spiking Neurons
1661
Figure 5: Bifurcation sequence for phase-locked states with phase φ and applied current I for large spikes. Solid lines indicate stable states, dashed lines unstable, and dotted lines unknown behavior. Moving from left to right corresponds to decreasing period. There are four bifurcation points: I^c_AS1, I^c_AS2, I^c_AS3, and I^c_S.
away from φ = 0. For each T bifurcation point in Figure 4, there is a corresponding I bifurcation point, which we label I^c_AS1, I^c_AS2, I^c_AS3, and I^c_S. The bifurcation diagram for large spikes is shown in Figure 5. In this case, the period decreases with increasing phase (i.e., for a fixed I, AS has a shorter period than the third mode, which has a shorter period than S). Thus, increasing I leads to almost the same bifurcation sequence as decreasing T: we simply reverse the plot in Figure 4. The one difference is that for very large I, beyond the bifurcation point T^c_AS3, AS cannot bifurcate to the third mode because the latter has a longer period and may not exist at the bifurcation point. AS may instead lose stability to a nonperiodic or non-phase-locked state that our analysis does not consider. The situation changes for the bifurcation diagram for small spikes (d_c small), shown in Figure 6. In this case, for fixed parameters, S has a shorter period than the third mode and AS. The bifurcation picture given in Figure 4 breaks down, since the S state cannot bifurcate into the third mode if the third mode does not exist at that parameter value. The third mode could exist for higher values of I, but it may not be connected to S through a simple bifurcation. There may also be nonperiodic or non-phase-locked behavior. At very large I, AS could possibly exist and bifurcate into the third mode. However, the sufficient conditions for stable AS are not satisfied in this regime. Recall from the previous two sections that the critical points T^c move toward larger values as d_c and Δ are increased (except for T^c_AS2 = 2Δ, which changes only with Δ). This implies that the corresponding critical points in the I diagram (I^c) move toward lower values. Thus, for larger d_c and Δ, the bifurcation points have lower values of I if the spikes are large and higher values if the spikes are small. The bifurcation sequences for g
1662
Carson C. Chow and Nancy Kopell
Figure 6: Bifurcation sequence for phase-locked states with phase φ and applied current I for small spikes. Solid lines indicate stable states, dashed lines unstable, and dotted lines unknown behavior.
can be deduced by examining Figures 5 and 6. As the gap junction strength g is increased, the critical points T^c will decrease (the I^c will increase). Thus, the bifurcations occur at higher values of I, and the region for stable S increases. The region in which the necessary condition for AS stability is satisfied increases in the low I regime. However, as discussed in section 3.3, AS also loses existence at lower I. Thus, for very strong g, only S will exist stably. As g is decreased, the state can bifurcate into AS or the third mode, depending on the value of I.

3.8 Adding Heterogeneity. With the addition of heterogeneity, the locking condition (see equation 3.3) shows that neither S nor AS is a solution. However, it was shown in Chow (1998) that near synchrony and near antisynchrony are possible for weak enough heterogeneity. As long as G′(φ, T) is positive at the phase-locked solution, stability is possible. The stronger the gap junction is, the more likely the two neurons will synchronize. This is expected heuristically but can also be seen from the behavior of G(φ, T). Recall from equation 3.4 that G(φ, T) is constructed from a sum over C_c(t) at periodic intervals. As the gap junction strength increases, C_c(t) begins to look more and more like g_+(t), which has positive slope everywhere. Since Ċ_c(0) = 0, it does not contribute to stability of the synchronous state at φ = 0. However, Ċ_c(t) increases with t for the duration of the spike. This can cause G(φ, T) to rise fairly steeply for φ > 0. So for φ near to but away from zero, the spike contributes positively to G′(φ, T). This also implies that a fast-rising spike may make synchrony easier to maintain in the presence of heterogeneity.

3.9 Application to Biophysical Neuron Models. We compared the predictions of our analysis on the integrate-and-fire model to biophysical con-
Figure 7: (a) Voltage trace for the interneuron model with Ĩ = 0. (b) Voltage trace for the reduced Traub–Miles model with Ĩ = 2.0.
ductance-based neuron models. We considered two different models that represent the range of possible spike shapes. Figure 7 shows examples of the spike forms for an interneuron model of White, Chow, Ritt, Soto-Trevino, and Kopell (1998) and a reduced Traub–Miles (RTM) model (Traub, Jefferys, & Whittington, 1997; Ermentrout & Kopell, 1998). The dynamical equations are given in the appendix. The interneuron model has a wider and larger spike (the size of the spike is measured relative to the postspike hyperpolarization) than the RTM model. This implies that the interneuron model has a larger d_c than the RTM model. In the RTM model, the spiking currents play a very small role in the recovery phase, so a passive decay to threshold is a good approximation. However, for the interneuron model, the spiking currents seem to play an important role in the recovery. From the figures, it appears that the effective leak time of the interneuron model is longer than that of the RTM model. Recall that time has been scaled by the leak
Table 1: Period T and Stable States for the Interneuron Model with g = 0.025 as a Function of the Applied Current I.

I           | T         | State
< −0.55     | > 68.1    | S (only)
−0.55–1.65  | 68.1–9.6  | S
−0.55–1.65  | 24.0–7.7  | AS
1.65–19.0   | 9.6–3.4   | S
19.0–23.0   | 3.4–3.1   | Third mode
> 23.0      | < 3.1     | AS
25.0        | 2.7       | AS
> 25.4      | —         | Nonperiodic
time in the analysis. Thus, for a fixed period, lengthening the leak time (i.e., reducing the leak rate) effectively shortens the scaled period. We looked for steady-state behavior over a range of applied currents I and gap junction conductances g, and ran for many hundreds of periods to ensure that the observed states were not transient. We first summarize the analysis of our simplified model. For very strong gap junctions, synchrony is the only state that can exist. For weak gap junctions, a bifurcation sequence can be observed as the applied current is changed. The bifurcation sequence will be different for large and small spikes, as seen in Figures 5 and 6. We predict that the interneuron model should behave like a large-spike cell and the RTM model like a small-spike cell. From section 3.2, we predict that increasing the gap junction strength should decrease the period of AS for the interneuron model, since the effective leak time of the interneuron model is long and d_c is large. On the other hand, it should increase the RTM period, since the leak time is short and d_c is small. The same arguments also predict that for a fixed set of neuronal parameters, the period of S will be longer than the period of AS for the interneuron model, but the opposite will hold for the RTM model. For the interneuron model with weak coupling (g = 0.025), we numerically found a bifurcation sequence (see Table 1) that matched our analytical results. For low levels of applied current, only the synchronous (S) state could be found. At a current of I = −0.55, AS appeared. S and AS coexisted for −0.55 ≤ I < 1.65. An example of AS in the bistable regime is shown in Figure 8. At I = 1.65, AS lost stability. S persisted with increasing current until I = 19, where it bifurcated into a stable third mode, as seen in Figure 9. The phase difference of the third mode approached φ = 0.5 until I = 23, when it bifurcated into AS. Figure 10 shows the voltage traces for AS at Ĩ = 25.
At I ≈ 25.4, AS lost stability to a nonperiodic state (see Figure 11).
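A single cell of the interneuron model can be integrated directly from the appendix equations with a short stdlib RK4 loop, as a minimal sketch for readers who want to reproduce a piece of this behavior. The drive Ĩ = 1 and the 200 ms run length are illustrative choices inside the current range where spiking is reported, not bifurcation values from Table 1:

```python
import math

# Single-cell interneuron model of White et al. (1998), parameters from the
# appendix (voltages in mV, time in ms, currents in muA/cm^2).
gNa, gK, gL = 30.0, 20.0, 0.1      # mS/cm^2
VNa, VK, VL = 45.0, -80.0, -60.0   # mV
Cap, I_app = 1.0, 1.0              # I_app = 1.0 is an illustrative suprathreshold drive

def m_inf(v): return 1.0 / (1.0 + math.exp(-0.08 * (v + 26.0)))
def h_inf(v): return 1.0 / (1.0 + math.exp(0.13 * (v + 38.0)))
def tau_h(v): return 0.6 / (1.0 + math.exp(-0.12 * (v + 67.0)))
def n_inf(v): return 1.0 / (1.0 + math.exp(-0.045 * (v + 10.0)))
def tau_n(v): return 0.5 + 2.0 / (1.0 + math.exp(0.045 * (v - 50.0)))

def derivs(state):
    v, h, n = state
    I_Na = gNa * m_inf(v) ** 3 * h * (v - VNa)  # m is slaved to m_inf(v)
    I_K = gK * n ** 4 * (v - VK)
    I_L = gL * (v - VL)
    dv = (I_app - I_Na - I_K - I_L) / Cap
    return (dv, (h_inf(v) - h) / tau_h(v), (n_inf(v) - n) / tau_n(v))

def rk4_step(state, dt):
    k1 = derivs(state)
    k2 = derivs(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = derivs(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = derivs(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

state, dt, spikes, prev_v = (-65.0, 0.9, 0.1), 0.01, 0, -65.0
for _ in range(int(200.0 / dt)):      # 200 ms of simulated time
    state = rk4_step(state, dt)
    if prev_v < 0.0 <= state[0]:      # upward crossing of 0 mV counts as a spike
        spikes += 1
    prev_v = state[0]
```

Counting spikes over a range of I_app values gives the single-cell frequency-current relation that underlies the period column of Table 1.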
Figure 8: Interneuron model: Antisynchrony in the bistable regime, g = 0.025, Ĩ = −0.55, T = 24 ms.
Figure 9: Interneuron model: Third mode, g = 0.025, Ĩ = 23, T = 3.1.
Although they are difficult to perceive in the figure, the amplitudes of the spikes are slowly increasing and decreasing. When the gap junction strength was increased in any of the above states, the cells fell into S, as expected. The period of AS was always shorter than that of S in the bistable regime, also as expected. In the bistable regime, the periods differed
Figure 10: Interneuron model: Antisynchrony in the high-frequency regime, g = 0.025, Ĩ = 25, T = 2.7.
Figure 11: Interneuron model: Nonperiodic state in the high-frequency regime, g = 0.025, Ĩ = 25.4.
by roughly 20%. Increasing the gap junction strength shortened the AS period, also as predicted (see Figure 12). The bifurcation sequence for the RTM model with weak coupling (g = 0.01) is given in Table 2. For very low currents, only the synchronous state
Figure 12: AS period as a function of the gap junction strength for the interneuron model. The S period is 19.0 for I = 0 and 50.8 for I = −0.05 and does not change with g. AS has a shorter period than S for fixed I and g.

Table 2: Period T and Stable States for the RTM Model with g = 0.01 as a Function of the Applied Current I.

I          | T       | State
< 0.08     | > 114   | S (only)
0.08–0.55  | 158–42  | AS
0.08–0.55  | 114–39  | S
> 0.55     | < 39    | S
Very large | —       | Non-S

Note: The non-S behavior found at very high I was for smaller values of g.
existed. At Ĩ = 0.08 and a period of T = 158 ms, the AS state came into existence, and both S and AS were stable. Figure 13 shows an example of the AS state. At Ĩ = 0.55 the AS state became unstable. Our numerics for the RTM model found that synchrony extended to very high frequencies. This was to be expected, since the spike was extremely narrow. (We note that the sufficient conditions for stable AS are not satisfied for T < Δ, so AS need not be observed at high frequencies.) In our analytical model, the width Δ corresponds to the time for the spike to reach the peak from
Figure 13: Reduced Traub–Miles model: Antisynchrony in the bistable regime, g = 0.01, Ĩ = 0.55, T = 42 ms.
threshold. For the RTM model, this was on the order of 0.1 ms, which would require a biologically improbable period near T ≈ 0.2 ms, or 5000 Hz. On the other hand, the time spent above threshold during the spike was approximately 0.5 ms, which would require T ≈ 1 ms, or 1000 Hz. We were unable to drive the RTM model fast enough for S to lose stability. At very high applied current (I ≈ 1370, T ≈ 0.8 ms), a Hopf bifurcation took place, and continuous spiking was replaced by a steady state. (This is often observed in conductance-based models but does not occur in our analytical model.) However, when we lowered the gap junction strength to g = 0.0001, we were able to find nonsynchronous behavior for very large I. Figure 14 shows a third mode state for I = 1000. In the bistable regime, the AS period was longer than the S period, the opposite of the interneuron model and the predicted result for small spikes (i.e., d_c small). The periods of S and AS differed by roughly 10%. Increasing the gap junction strength increased the period of the AS state, also as expected (see Figure 15).

4 Larger Networks
We consider an all-to-all coupled network of N neurons with gap junctions. Although this is probably unrealistic for very large N, it serves to give some qualitative understanding of large-network effects. We consider the
Figure 14: Reduced Traub–Miles model: Third mode for g = 0.0001, Ĩ = 1000, T = 5.761 ms.
Figure 15: AS period as a function of the gap junction strength for the RTM model. The S period is 41.8 for I = 0.3 and 55.8 for I = 0.5. AS has a longer period than S.
model

\frac{dv_i}{dt} = I - v_i - g \sum_{j \neq i} (v_i - v_j) + \sum_l A(t - t_i^l), \qquad (4.1)
where i is the neuron index. We consider a network of homogeneous neurons, although the analysis can be done as well with heterogeneous applied currents. We rewrite equation 4.1 in the form of a matrix equation,

\frac{d\vec{v}}{dt} = M \vec{v} + \vec{f}, \qquad (4.2)

where

\vec{v} = (v_1, v_2, \ldots, v_N)^T, \qquad
\vec{f} = \Bigl(I + \sum_l A(t - t_1^l),\; I + \sum_l A(t - t_2^l),\; \ldots,\; I + \sum_l A(t - t_N^l)\Bigr)^T, \qquad (4.3)

and

M = \begin{pmatrix}
-1 - (N-1)g & g & \cdots & g \\
g & -1 - (N-1)g & \cdots & g \\
\vdots & \vdots & \ddots & \vdots \\
g & g & \cdots & -1 - (N-1)g
\end{pmatrix}. \qquad (4.4)
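The spectral structure of M stated in the text (eigenvalue −1 for the uniform mode and −1 − Ng, (N − 1)-fold degenerate, for the difference modes) can be verified directly on explicit eigenvectors. A small pure-Python check, with arbitrary illustrative N and g:

```python
def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

def build_M(N, g):
    """Coupling matrix of equation 4.4: -1-(N-1)g on the diagonal, g off-diagonal."""
    return [[g if i != j else -1.0 - (N - 1) * g for j in range(N)] for i in range(N)]

N, g = 5, 0.3
M = build_M(N, g)

# Uniform vector -> eigenvalue -1 (every row sums to -1).
u = [1.0] * N
assert all(abs(x + 1.0) < 1e-12 for x in matvec(M, u))

# Any difference vector e_i - e_j -> eigenvalue -(1 + N g); there are N-1
# independent such vectors, giving the (N-1)-fold degeneracy.
d = [1.0, -1.0] + [0.0] * (N - 2)
lam = -(1.0 + N * g)
assert all(abs(x - lam * y) < 1e-12 for x, y in zip(matvec(M, d), d))
```

The degeneracy is what makes the diagonalization below collapse to only the two kernels g_+ and g_−.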
Proceeding as we did for two neurons, we diagonalize system 4.2. Due to the high degree of symmetry, the matrix M has only two distinct eigenvalues: −1 and −1 − Ng (which is (N − 1)-fold degenerate). Transforming to the diagonalized system, integrating, and transforming back gives the spike response form

v_i(t) = I + \sum_l C_s(t - t_i^l) + \sum_{j \neq i} \sum_m C_c(t - t_j^m), \qquad (4.5)

where

C_s(t) = g_+(t) + (N-1) g_-(t), \qquad (4.6)
C_c(t) = g_+(t) - g_-(t). \qquad (4.7)

The kernel g_+(t) is identical to the two-neuron network kernel (see equation 2.14), and g_−(t) has the same form but with r_− = 1 + Ng. The equations are similar to those of the N-neuron synaptically coupled network studied in Chow (1998). We now consider periodic phase-locked states in a network of homogeneous neurons. Suppose the neurons fire at t_i^l = (φ_i − l)T. At threshold they satisfy the condition

1 = I + \sum_l C_s(lT) + \sum_{j,m} C_c\bigl(mT + (\varphi_i - \varphi_j)T\bigr). \qquad (4.8)
Dynamics of Spiking Neurons
1671
Due to the permutation symmetry of the neuron index, any symmetric combination of the phases is a possible phase-locked periodic solution (Chow, 1998). Examples include the synchronous state, where all the neurons fire together; antisynchrony, where half the neurons fire and then the other half fires; the splay-phase state, where the neurons fire periodically in sequence; and clustered states. The synchronous solution always exists if the driving current is above threshold, and its period is given by the single-neuron period. The analysis for the existence of the antisynchronous solution follows as in the two-neuron case. We now examine conditions under which the splay-phase solution can exist. The splay-phase state is defined by φ_i = i/N, where i varies from 0 to N − 1. The splay-phase state can exist if a solution for the period T can be found for

1 = I + \sum_{l \geq 1} C_s(lT) + \sum_{j, m \geq 1} C_c\Bigl(mT - \frac{j}{N} T\Bigr). \qquad (4.9)
For simplicity, suppose the spikes are separated by at least Δ (i.e., spikes do not overlap). Then

C_s(t) = -e^{-(t-\Delta)} - (N-1) d_c e^{-r(t-\Delta)}, \qquad (4.10)
C_c(t) = -e^{-(t-\Delta)} + d_c e^{-r(t-\Delta)}. \qquad (4.11)

We may now substitute this into equation 4.9 and sum the resulting geometric series to obtain

I - 1 = \frac{e^{\Delta}}{e^{T/N} - 1} + \frac{e^{\Delta}}{e^{T} - 1} + (N-1) d_c \frac{e^{r\Delta}}{e^{rT} - 1} - d_c \frac{e^{r\Delta}}{e^{rT/N} - 1}. \qquad (4.12)
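Instead of manipulating the closed form, the threshold condition 4.9 can be solved numerically by truncating the sums over the kernels in equations 4.10 and 4.11 and bisecting on T. The parameter values below (I, N, g, d_c, Δ) are illustrative choices, not values used in the paper:

```python
import math

def C_s(t, N, g, dc, delta):
    r = 1.0 + N * g  # decay rate of the g_-(t) kernel, r_- = 1 + N g
    return -math.exp(-(t - delta)) - (N - 1) * dc * math.exp(-r * (t - delta))

def C_c(t, N, g, dc, delta):
    r = 1.0 + N * g
    return -math.exp(-(t - delta)) + dc * math.exp(-r * (t - delta))

def residual(T, I, N, g, dc, delta, L=400):
    """I + sum_l C_s(lT) + sum_{j,m} C_c(mT - (j/N)T) - 1, equation 4.9 truncated at L terms."""
    total = I - 1.0
    total += sum(C_s(l * T, N, g, dc, delta) for l in range(1, L))
    total += sum(C_c(m * T - (j / N) * T, N, g, dc, delta)
                 for j in range(1, N) for m in range(1, L))
    return total

def splay_period(I, N, g, dc, delta, T_lo=0.5, T_hi=20.0):
    """Bisect for the splay-phase period; the residual is increasing in T on this bracket."""
    for _ in range(100):
        T = 0.5 * (T_lo + T_hi)
        if residual(T, I, N, g, dc, delta) < 0.0:
            T_lo = T  # subthreshold at the proposed firing time: lengthen the period
        else:
            T_hi = T
    return 0.5 * (T_lo + T_hi)
```

Since the residual grows with I at fixed T, a larger drive yields a shorter splay-phase period, consistent with the relation between I and T noted in section 3.2.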
The period of the splay-phase state is given by the smallest solution T of equation 4.12. For a fixed N, a solution can always be found for some I and T. However, there will be a maximum period that can sustain the splay phase due to the fast threshold modulation effect described for antisynchrony in section 3.1; that is, the neurons must remain subthreshold when the other neurons fire. The sufficient conditions for stability are that Ċ_c(mT − (j/N)T) > 0 for all m and j (Chow, 1998). Thus the separation between the neurons must exceed t_min from equation 3.24 to ensure stability. This sets a minimum period of T = N t_min. It also suggests that there is a window of allowed periods for a stable splay phase: if the period is too short, the state loses (sufficient conditions for) stability, and if the period is too long, it loses existence. The allowed period for the splay phase must also scale with the number of neurons N, since the separation between the neurons must remain relatively constant independent of network size. This means that the applied
current must be reduced as N increases in order to sustain a stable splay-phase state. Numerical simulations of the interneuron model confirmed these observations. We simulated networks of up to N = 8 neurons and found the splay-phase state only over a midrange of periods.

5 Discussion

5.1 Related Modeling Work. The first articles to point out that electrical coupling can be antisynchronizing were by Sherman and Rinzel (1992) and Sherman (1994), using simulations done in the context of pancreatic beta cells, which have bursting electrical behavior. Later work on this system (de Vries, Zhu, & Sherman, 1998) used the fact that the envelopes of the bursts were roughly sinusoidal in shape and used an analysis near a Hopf bifurcation to see how the coupling could destabilize synchrony. Two other papers (Han, Kurrer, & Kuramoto, 1995, 1997) showed that electrical coupling can give rise to antisynchronous solutions if the components of the cells have trajectories close to a homoclinic bifurcation. The analysis in this article concerns networks of spiking neurons, focusing on the shapes of the spikes. The most similar work deals with excitatory and inhibitory chemical synapses, using the spike response method (Gerstner & van Hemmen, 1992; Gerstner, 1995; Gerstner et al., 1996; Chow, 1998). These and other related models (van Vreeswijk et al., 1994; Hansel et al., 1995; Bressloff & Coombes, 1998, 1999) showed that for integrate-and-fire models, inhibitory synapses can stabilize synchrony (provided the timescales of rise and fall of the synapse are slow enough), while excitation generally destabilizes the synchronous solution. (See also Terman, Kopell, & Bose, 1998, and Bose, Terman, & Kopell, in press, for related results about inhibition and excitation in bursting neurons.) One way to understand intuitively the result that gap junctions can be antisynchronizing is to think of the effects of the electrical coupling as combining those of excitation and inhibition.
To see this, consider the coupling currents for a nonrectified electrical synapse when the spikes are not (yet) synchronous. In the spike phase of the cycle of one cell, the coupling currents to the other cell are depolarizing, acting like excitation; in the postpolarization phase, these currents can be hyperpolarizing (depending on the phase of the other cell). This intuition suggests that wide and tall spikes should help to destabilize the synchronous state, whereas a long and deep postpolarization phase should encourage synchrony. This is indeed what the mathematics confirms, producing rigor as well as more details about the effects of sizes and shapes. This intuition also suggests why the frequency of the network plays a role in synchronization: as the frequency increases, the size of the interspike interval decreases much more than any change in the shape of the spike, favoring the effects of the spike over those of the postpolarization phase and thus favoring destabilization of the synchronous solution. This argument shows that one should not expect frequency to play an important role in the
stability of sinusoidal-like oscillators coupled electrically. There are several other articles on electrically coupled neurons (or compartments) related to the current work. Kepler, Marder, and Abbott (1990) considered the electrical coupling of a bursting cell and a passive cell to show that the coupling may increase or decrease the frequency of the oscillation, depending on the shape of the waveform of the oscillator. Kopell, Abbott, and Soto-Trevino (1998) showed that if one of the elements of the network is bistable and the other is an oscillator, the network can exhibit much greater complexity; they introduced new geometric techniques very different from the ones in this article or in Kepler et al. (1990). Manor, Rinzel, Segev, and Yarom (1997) showed that cells with heterogeneous properties, none of them oscillators, could produce an oscillation in an electrically coupled system that is the appropriate average of the dynamics. Electrical coupling is also relevant to understanding the dynamics of compartmental models in which the conductances vary between compartments (Booth & Rinzel, 1995; Li, Bertram, & Rinzel, 1996; Mainen & Sejnowski, 1996; Pinsky & Rinzel, 1994). Finally, we point out that the literature on neural oscillators is large and rapidly growing, so the above list constitutes only the work most directly related to the themes of this article. Ritz and Sejnowski (1997) provide a review of some recent work on neural oscillators.

5.2 Gap Junctions in Physiological Systems. Gap junctions are found in many tissues of the body, including the nervous system. (For reviews, see Bennett, 1997, and Dermietzel & Spray, 1993.) Even in the nervous system, electrical coupling is found in a wide variety of cells, including astrocytes and oligodendrocytes, as well as neurons.
Many functions have been attributed to these electrical synapses, including exchanges of metabolites and second messengers and buffering of the K+ activity surrounding active neurons (Dermietzel & Spray, 1993). For neurons, the most common function ascribed to electrical synapses is mediating synchrony among active cells or relaying signals quickly. Although gap junctions are thought to be most prevalent (at least in vertebrates) during early development, they are also known to exist in the adult mammalian central nervous system, for example, in the neocortex (Gibson, Beierlein, & Connors, 1999), the hippocampus (Draguhn, Traub, Schmitz, & Jefferys, 1998), the inferior olive (Llinas, Baker, & Sotelo, 1974), and the retina (Dowling, 1991). In this article, we have shown that electrical coupling can organize a rhythm to be asynchronous, especially at low coupling strengths. The ability of a collection of spiking neurons to synchronize depends on the size and shape of the spike form, as well as on the frequency at which the cells are firing. At low coupling strengths and very high firing rates (dependent on the shape of the spike waveform), the synchronized state is unstable, and a pair of cells fires in antisynchrony. For a lower range of frequencies, the synchronized and antisynchronized states are bistable. For a population,
the network behavior can have the phases splaying out over the circle. One preparation in which there are well-documented gap junctional connections and oscillator cells firing at high rates is the pacemaker nucleus of the weakly electric fish Apteronotus (Dye, 1988, 1991; Moortgat, Bullock, & Sejnowski, 2000a, 2000b). In this tissue, cells fire very precisely and synchronously (Dye, 1988). Evidence for the distribution of phases of the pacemaker cells of that nucleus is given in Figure 8 of Dye (1988). Pharmacological manipulations that increase the internal concentration of Ca2+ and are presumed to decrease the strength of the gap junctional coupling (Spray & Bennett, 1985) do partially desynchronize the population; however, the desynchronization may be due to differences in the natural frequencies rather than to the ability of the weak coupling to actively desynchronize even identical cells, as discussed above. In contrast to Dye (1991), Moortgat et al. (2000a) observe that adding gap junction blockers results in a general reduction of the frequency of oscillation of the pacemaker nucleus. This is predicted above for the model with large spikes. Gap junctions have recently been documented within two distinct subsets of interneurons in the rat neocortex (Gibson et al., 1999), one of which (the fast-spiking interneurons) gets strong inputs from the thalamus. In recent work, Gibson and Beierlein (pers. comm.) have injected depolarizing current into pairs of electrically coupled fast-spiking interneurons to modulate their frequency. In one pair of cells at a frequency of 40 Hz, the pair oscillated synchronously; with further depolarization that raised the frequency of each cell to approximately 100 Hz, the cells fired in antisynchrony. This is in agreement with predictions from our theory. However, other pairs did not show this effect. Another potential application of this work concerns bursting behavior of interneurons in the hippocampus.
In recent simulations of data from the work of Zhang et al. (1998), Skinner, Zhang, Velasquez, and Carlen (1999) investigated a model network of cells coupled by gap junctions and inhibitory synapses. Blocking the inhibition increased the frequency of each of the cells; at the higher frequencies, the spikes within a burst of the electrically coupled cells had an antisynchronous relationship. In a future publication, we will analyze this system and show that one important effect of the inhibition on the network can be to change its frequency, which affects which configurations the electrical coupling stabilizes. The techniques used for the analysis are similar to those used in Chow (1998) to analyze the effects of chemical synapses on model neurons. For these neurons, the effects of the chemical and electrical synapses in the spike-response equations are additive. Hence, the same formalism can be used for situations involving both kinds of synapses acting in parallel. High-frequency spiking is also found in hippocampal interneurons during ripples, and gap junctional coupling is implicated in the coordination of these rhythms (Draguhn et al., 1998). More recent work (Traub, Schmitz, Jefferys, & Draguhn, 1999) suggests that the mechanism of production of the
rhythms involves spontaneous production of spikes in the axons, with transmittal through a sparsely connected network of axo-axonal connections. Although the mechanism that produces the frequency in that case is different (it depends on the network connectivity and the spontaneous rate of action potentials), it remains to be understood whether the mechanism that produces the coherence of the network is similar to the one described in this article. In any physiological situation, the network behavior is affected by heterogeneity, spatial structure of the neurons, connectivity of the network, and interaction with chemical synapses. The analysis presented here considered only connections between somata and/or axons, in which spatial and delay effects are not included. For dendro-dendritic connections, such delays would be important. Although it is not within the scope of this article to present the complete analysis, we report here that a similar (but more complicated) analysis of neurons connected with gap junctions at the end of a passive dendrite leads to a change in the parameter ranges in which different configurations are stable. In particular, the regime in which asynchrony is stable is greatly increased, and that in which synchrony is stable is reduced. We believe that this is due to the filtering properties of the dendrite, which change the shape of the spike at the synapse. The ability to fire in an asynchronous way increases the flexibility of network dynamics. This article provides insight into how such asynchrony can be fostered by electrical coupling, even in the absence of the above complexities, all of which create a much richer dynamical environment. It remains to understand how these extra features might interact with the mechanisms described in this article to allow flexible modulation of network behavior.

Appendix: Neuron Dynamics
In our simulations, we considered a network of N conductance-based single-compartment neuron models coupled electrically with gap junctions. The membrane potential obeyed the current balance equation

C \frac{dV_i}{dt} = \tilde{I} - I_{Na} - I_K - I_L - g \sum_{j \neq i} (V_i - V_j), \qquad (A.1)
where i is the neuron index running from 1 to N, g is the gap junction conductance, Ĩ is the applied current, I_Na = g_Na m³ h (V_i − V_Na) and I_K = g_K n⁴ (V_i − V_K) are the spike-generating currents, I_L = g_L (V_i − V_L) is the leak current, and C = 1 μF/cm². The interneuron model in White et al. (1998) used the following parameters: g_Na = 30 mS/cm², g_K = 20 mS/cm², g_L = 0.1 mS/cm², V_Na = 45 mV, V_K = −80 mV, and V_L = −60 mV. The activation variable m was assumed fast and replaced by its asymptotic value m = m∞(v) = (1 +
exp[−0.08(v + 26)])⁻¹. The gating variables h and n obey

\frac{dh}{dt} = \frac{h_\infty(v) - h}{\tau_h(v)}, \qquad \frac{dn}{dt} = \frac{n_\infty(v) - n}{\tau_n(v)}, \qquad (A.2)

with h∞(v) = (1 + exp[0.13(v + 38)])⁻¹, τ_h(v) = 0.6/(1 + exp[−0.12(v + 67)]), n∞(v) = (1 + exp[−0.045(v + 10)])⁻¹, and τ_n(v) = 0.5 + 2.0/(1 + exp[0.045(v − 50)]). The reduced Traub–Miles model (Traub et al., 1997; Ermentrout & Kopell, 1998) used the following parameters: g_Na = 100 mS/cm², g_K = 80 mS/cm², g_L = 0.05 mS/cm², V_Na = 50 mV, V_K = −100 mV, V_L = −67 mV; m = m∞(v) = α̃_m(v)/(α̃_m(v) + β̃_m(v)), where α̃_m(v) = 0.32(54 + v)/(1 − exp(−(v + 54)/4)) and β̃_m(v) = 0.28(v + 27)/(exp((v + 27)/5) − 1);

\frac{dn}{dt} = \tilde{\alpha}_n(v)(1 - n) - \tilde{\beta}_n(v)\, n \qquad (A.3)
with α̃_n(v) = 0.032(v + 52)/(1 − exp(−(v + 52)/5)) and β̃_n(v) = 0.5 exp(−(v + 57)/40); h = h∞(v) = max[1 − 1.25n, 0]. The ODEs were integrated using the CVODE method in the program XPPAUT, written by G. B. Ermentrout and available online at http://www.pitt.edu/~phase/.

Acknowledgments
We thank John White and Bard Ermentrout for many helpful discussions. This work was supported by NIH grant K01 MH01508 (CC), NIMH grant 47150 (NK), NSF grant 9200131 (NK), and the A. P. Sloan Foundation (CC).

References

Bennett, M. (1997). Gap junctions as electrical synapses. J. Neurocytology, 26, 349–366.
Booth, V., & Rinzel, J. (1995). A minimal compartment model for a dendritic origin of bistability of motoneuron firing patterns. J. Comp. Neurosci., 2, 1–14.
Bose, A., Terman, D., & Kopell, N. (in press). Almost-synchronous solutions in networks of neurons with mutually coupled excitatory synapses. Physica D.
Bressloff, P. C., & Coombes, S. (1998). Desynchronization, mode locking, and bursting in strongly coupled integrate-and-fire oscillators. Phys. Rev. Lett., 81, 2168–2171.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Comp., 12, 91–129.
Chow, C. C. (1998). Phase-locking in weakly heterogeneous neuronal networks. Physica D, 118, 343–370.
Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronous networks of inhibitory neurons. J. Comp. Neurosci., 5, 407–420.
Dermietzel, R., & Spray, D. (1993). Gap junctions in the brain: Where, what type, how many and why? TINS, 16, 186–192.
de Vries, G., Zhu, H.-R., & Sherman, A. (1998). Diffusively coupled bursters: Effects of heterogeneity. Bull. Math. Biol., 60, 1167–1200.
Dowling, J. E. (1991). Retinal neuromodulation: The role of dopamine. Visual Neuroscience, 7, 87–97.
Draguhn, A., Traub, R. D., Schmitz, D., & Jefferys, J. G. R. (1998). Electrical coupling underlies high-frequency oscillations in the hippocampus in vitro. Nature, 394, 189–192.
Dye, J. (1988). An in vitro physiological preparation of a vertebrate communicatory behavior: Chirping in the weakly electric fish, Apteronotus. J. Comp. Physiol. A, 163, 445–458.
Dye, J. (1991). Ionic and synaptic mechanisms underlying a brainstem oscillator: An in vitro study of the pacemaker nucleus of Apteronotus. J. Comp. Physiol. A, 168, 521–532.
Ermentrout, G. B., & Kopell, N. (1998). Fine structure of neural spiking and synchronization in the presence of conduction delays. Proc. Nat. Acad. Sci. USA, 95, 1259–1264.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gerstner, W., & van Hemmen, J. L. (1992). Associative memory in a network of "spiking" neurons. Network, 3, 139–164.
Gerstner, W., van Hemmen, J. L., & Cowen, J. (1996). What matters in neuronal locking? Neural Comput., 8, 1653–1676.
Gibson, J. R., Beierlein, M., & Connors, B. (1999). Two networks of electrically coupled neurons in the neocortex. Nature, 402, 75–79.
Han, S. K., Kurrer, C., & Kuramoto, Y. (1995). Dephasing and bursting in coupled neural oscillators. Phys. Rev. Lett., 75, 3190–3193.
Han, S. K., Kurrer, C., & Kuramoto, Y. (1997). Diffusive interaction leading to dephasing of coupled neural oscillators. Int. J. Bifurcations and Chaos, 7, 869–876.
Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Comput., 7, 307–337. Kepler, T. B., Marder, E., & Abbott, L. F. (1990). The effect of electrical coupling on the frequency of model neuronal oscillators. Science, 6, 83–85. Kopell, N., Abbott, L. F., & Soto-Trevino, C. (1998). On the behavior of a neural oscillator electrically coupled to a bistable element. Physica D 121, 367– 395. Li, Y.-X., Bertram, R., & Rinzel, J. (1996). Modeling N-methyl-D-aspartate– induced bursting in dopamine neurons. Neuroscience, 71, 397–410. Llinas, R., Baker, R., & Sotelo, C. (1974). Electrotonic coupling between neurons in the cat inferior olive. J. Neurophysiol., 37, 560–571. Mainen, Z. F., & Sejnowski, T. J. (1996). Inuence of dendritic structure on ring pattern in model neocortical neurons. Nature, 382, 363–366.
1678
Carson C. Chow and Nancy Kopell
Manor, Y., Rinzel, J., Segev, I., & Yarom, Y. (1997). Low amplitude oscillation in the inferior olive: A model based on electrical coupling of neurons with heterogeneous channel densities. J. Neurophysiol., 77, 2736–2752. Moortgat, K. T., Bullock, T. H., & Sejnowski, T. J., (2000a). Precision of the pacemaker nucleus in a weakly electric sh: network vs. cellular inuences. J. Neurophysiol., 83, 971–983. Moortgat, K. T., Bullock, T. H., & Sejnowski, T. J. (2000b). Gap junction effects on precision and frequency of a model pacemaker network. J. Neurophysiol., 83, 984–997. Pinsky, P. F., & Rinzel, J. (1994). Intrinsic and network rhythmogenesis in a reduced Traub model for CA3 neurons. J. Comp. Neurosci., 1, 39–60. Ritz, R., & Sejnowski, T. J. (1997). Synchronous oscillatory activity in sensory systems: New vistas on mechanisms. Current Opinion in Neurobiology, 7, 536– 546. Sherman, A. (1994). Anti-phase, asymmetric and aperiodic oscillations in excitable cells—I. Coupled bursters. Bull. Math. Biology, 56, 811–835. Sherman, A., & Rinzel, J. (1992). Rhythmogenic effects of weak electrotonic coupling in neuronal models. Proc. Natl. Acad. Sci. U.S.A., 89, 2471–2474. Skinner, F. K., Zhang, L., Velasquez, J. L. P., & Carlen, P. L. (1999). Bursting in inhibitory interneurons: A role for gap-junctional coupling. J. Neurophysiol., 81, 1274–1283. Somers, D., & Kopell, N. (1993). Rapid synchronization through fast threshold modulation. Biol. Cybern., 68, 393–407. Spray, D. C., & Bennett, M. V. (1985). Physiology and pharmacology of gap junctions. Annu. Rev. Physiol., 47, 281–303. Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled slow inhibitory neurons. Physica D, 117, 241–275. Traub, R. D., Jefferys, J. G. R., & Whittington, M. A. (1997). Simulation of gamma rhythms in networks of interneurons and pyramidal cells. J. Comp. Neurosci., 4, 141–150. Traub, R. D., Schmitz, D., Jefferys, J. G. R., & Draguhn, A. 
(1999).High-frequency population oscillations are predicted to occur in hippocampal pyramidal neuronal network interconnected by axo-axonal gap-junctions. Neuroscience, 92, 407–426. van Vreeswijk, C., Abbott, L., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural ring. J. Comp. Neurosci., 1, 313–321. White, J. A., Chow, C. C., Ritt, J., Soto-Trevino, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comp. Neurosci., 5, 5–16. Zhang, Y., Velazquez, J. L. P., Tian, G. F., Wu, C. P., Skinner, F. K., Carlen, P. L., & Zhang, L. (1998). Slow oscillations (· 1 Hz) mediated by GABAergic interneuronal networks in rat hippocampus. J. Neurosci., 18, 9256–9268. Received January 29, 1999; accepted September 2, 1999.
LETTER
Communicated by Zachary Mainen
A Model for Fast Analog Computation Based on Unreliable Synapses

Wolfgang Maass
Thomas Natschläger
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria

We investigate through theoretical analysis and computer simulations the consequences of unreliable synapses for fast analog computations in networks of spiking neurons, with analog variables encoded by the current firing activities of pools of spiking neurons. Our results suggest a possible functional role on the network level for the well-established unreliability of synaptic transmission. We also investigate computations on time series and Hebbian learning in this context of space-rate coding in networks of spiking neurons with unreliable synapses.

1 Introduction
This article explores links between two levels of modeling computation in biological neural systems: the level of individual synapses and spiking neurons, and the network level. Such links are of interest under the hypothesis that important aspects of the computational function of neuronal activity can be understood only on the larger scale of networks consisting of hundreds or more neurons. One particular challenge is to provide models for fast analog computation in neuronal systems that are consistent with experimental data. Thorpe, Fize, and Marlot (1996) and others have demonstrated that biological neural systems involving 10 or more synaptic stages are able to carry out complex computations within 100 to 150 ms. This cannot be explained through models based on an encoding of analog variables through firing rates of spiking neurons, since the firing rates in these neural systems are typically well below 100 Hz and interspike intervals are highly variable (Koch, 1999). In addition, it has recently been argued that synaptic depression makes firing rates above 20 Hz indistinguishable for the postsynaptic neurons (Abbott, Varela, Sen, & Nelson, 1997).

Neural Computation 12, 1679–1704 (2000) © 2000 Massachusetts Institute of Technology

One approach for explaining the possibility of fast analog computation relies on the assumption that the relevant analog variables are encoded in small temporal differences among the firing times of neurons (Thorpe et al., 1996; Hopfield, 1995; Maass, 1997). These models are able to explain the possibility of fast analog computation in networks where neuronal firing and synaptic transmission are highly reliable, or where the average firing
Table 1: Relationships Between Details of Neuronal Hardware and Possible Computational Functions on the Network Level.

Neuron level: Synaptic unreliability
Network level: Graded response of firing activity in pools of neurons, yielding the power to approximate arbitrary continuous functions in space-rate coding

Neuron level: Decaying parts of excitatory and inhibitory postsynaptic potentials in combination with refractory effects and recurrent connections
Network level: Emulation of arbitrary given linear filters with finite and infinite impulse response with regard to space-rate coding

Neuron level: Time-sensitive rule for long-term potentiation
Network level: Hebbian learning for space-rate coding
times of pools of neurons encode analog variables on a timescale of a few milliseconds. Although some evidence for such coding has been found, a more common type of coding encountered in vertebrate cortex is a population coding where information about the stimulus or subsequent responses of the organism is encoded in a space-rate code, that is, in the fractions of neurons in various pools that fire within some short time interval (say, of length Δ between 5 and 10 ms). We refer to this coding scheme as space-rate coding.

We analyze in this article possible links between well-known response characteristics of individual neurons and synapses on one hand, and computational functions of networks of neurons with regard to space-rate coding on the other hand, as indicated in Table 1. In particular we address the possible computational role of the unreliability of synaptic transmission, which has so far been ignored in this context. In section 2 we show that unreliability of synaptic transmission can be used as an essential ingredient of a model for fast analog computation in space-rate coding that spreads firing activity uniformly among all neurons in a pool. We exhibit a suitable tool from probability theory for analyzing large-scale effects of unreliable synaptic transmission (the Berry-Esseen theorem), and we analyze inherent noise sources of this computational model. Our theoretical analysis is complemented by results of computer simulations. In section 3 we turn to the question of what other types of computations (besides approximating arbitrary continuous functions between vectors of analog input and output values) can be induced by synaptic unreliability in combination with other details of the neuronal hardware on the larger scale of networks of neurons with space-rate coding. We show that for computations on time series, the interaction of the time courses of excitatory postsynaptic potentials (EPSPs) and inhibitory postsynaptic potentials (IPSPs),
refractory effects of neurons, and recurrent excitation and inhibition provides, in combination with unreliable synapses, the means for realizing in space-rate coding a rich class of linear filters with finite and infinite impulse response. Finally, in section 4 we briefly address the relationship between temporal learning rules for individual neurons and Hebbian learning in space-rate coding.

2 A Model for Fast Analog Computation in a Space-Rate Code
We say the analog variable x ∈ [0, 1] is encoded by a pool U of N neurons in a space-rate code if during a short time interval of length Δ, a total of Nx neurons fire. If the time interval is short enough, say Δ = 5 ms, one can assume that each neuron fires at most once during this time interval. Although there exists substantial empirical evidence that many cortical systems encode relevant analog variables by such a space-rate code, it has remained unclear how networks of spiking neurons compute in terms of such a code. Some of the difficulties become apparent if one just wants to understand, for example, how the trivial linear function f(x) = x/2 can be computed by such a network if the input x ∈ [0, 1] is encoded by a space-rate code in a pool U of neurons and the output f(x) ∈ [0, 1/2] is supposed to be encoded by a space-rate code in another pool V of neurons. If one assumes that all neurons in V have the same firing threshold and that reliable synaptic connections from all neurons in U to all neurons in V exist with approximately equal weights, a firing of a fraction x of neurons in U during a short time interval will typically trigger almost none or almost all neurons in V to fire, since they all receive about the same input from U. Several mechanisms have already been suggested that could in principle achieve a smooth, graded response in terms of a space-rate code in V instead of a binary all-or-none firing, such as strongly varying firing thresholds or different numbers of synaptic connections from U for different neurons v ∈ V (Wilson & Cowan, 1972). Neither option is completely satisfactory, since firing thresholds of biological neurons appear to be rather homogeneous, and both options would fail to spread average activity over all neurons in V. Hence they would fail to spread the energy consumption in the neuronal ensembles and would also make the computation less robust against failures of individual neurons.
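The all-or-none problem described above is easy to reproduce numerically. The following sketch (a minimal illustration with arbitrary pool size, threshold, and weight values; it is not the paper's simulation) encodes x in a space-rate code and shows that with identical, reliable synapses and a common threshold, the response of pool V jumps abruptly instead of growing gradually with x:

```python
import numpy as np

rng = np.random.default_rng(0)

def space_rate_encode(x, N, rng):
    """Encode an analog value x in [0, 1] over a pool of N neurons:
    a fraction x of the pool fires within the coding interval."""
    fired = np.zeros(N, dtype=bool)
    fired[rng.choice(N, size=int(round(x * N)), replace=False)] = True
    return fired

def pool_response_reliable(fired_U, weight, theta):
    """With reliable synapses and equal weights, every v in V receives the
    same summed input, so the whole pool fires all-or-none depending on
    whether that input crosses the common threshold theta."""
    total_input = weight * fired_U.sum()
    return 1.0 if total_input >= theta else 0.0

N, theta, weight = 200, 50.0, 1.0   # arbitrary illustrative values
for x in (0.2, 0.24, 0.26, 0.3):
    y = pool_response_reliable(space_rate_encode(x, N, rng), weight, theta)
    print(x, y)   # jumps from 0.0 to 1.0 near x = theta / (weight * N)
```

Section 2.2 shows how synaptic unreliability replaces this step by a smooth sigmoidal response.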
In this section we introduce an alternative model for analog computing in space-rate coding that takes the well-known stochastic properties of biological synapses into account. We show that the unreliability of synaptic transmission provides a very useful mechanism for reliable analog computation in space-rate coding. Our model does not require different firing thresholds for the neurons involved, and it automatically spreads the firing activity uniformly over all neurons within a pool. In section 2.3, inherent noise sources of this model for analog computation are investigated. In section 2.5, it is shown that our model induces an analog version of the
familiar model of a synfire chain that is computationally more powerful and simultaneously more consistent with experimental data than the traditional "digital" version of a synfire chain.

2.1 Definition of the Computational Model. We start by addressing the most basic question concerning computations in space-rate coding: How can the firing activity in a pool V of neurons be related to the firing activity in n presynaptic pools U1, ..., Un of neurons? For simplicity, we assume that n pools U1, ..., Un consisting of N neurons each are given, and we also assume that all neurons in these pools have synaptic connections to all neurons in another pool V of N neurons.¹ We write xi for the analog variable ranging over [0, 1] that is encoded by space-rate coding in pool Ui, and y for the analog variable ranging over [0, 1] that is encoded by space-rate coding in pool V. We assume throughout this article that for each pool Ui, all neurons in Ui are excitatory or all neurons in Ui are inhibitory. By choosing the number n of pools Ui sufficiently large, one can approximate with this model another type of computational model where, instead of discrete pools of neurons that encode one analog variable each, one has a continuous spatial pattern of firing activity. We address in section 2.4 the question of what functions can be computed by multilayer feedforward networks in terms of space-rate coding, and we address in section 3 the question of which additional computational capabilities arise when one takes recurrent connections involving excitatory and inhibitory neurons into account.

In accordance with standard results from neurophysiology, we assume in our model that an action potential ("spike") from a neuron u ∈ Ui triggers with a certain probability r_vu ("nonfailure probability") the release of one or several vesicles filled with neurotransmitter at one or several release sites of the synapses between neurons u ∈ Ui and v ∈ V.
The data from Markram, Lübke, Frotscher, Roth, and Sakmann (1997), Larkman, Jack, and Stratford (1997), and Dobrunz and Stevens (1997) strongly suggest that in the case of a release, the amplitude of the resulting EPSP in neuron v is a stochastic quantity. Consequently we model the amplitude of the EPSP (or IPSP) in the case of a release by a random variable a_vu with probability density function φ_vu. Empirical data show that the variance of the distribution φ_vu is typically rather high (Markram, Lübke, Frotscher, Roth, & Sakmann, 1997; Larkman et al., 1997; Stevens & Zador, 1998). This high variability can be traced back to two sources. First, it is uncertain whether a vesicle is released at a single synaptic release site in response to an action potential from the presynaptic neuron. Thus, the amplitude of the postsynaptic potential (PSP) in the postsynaptic neuron v varies in dependence of the actual number of vesicle releases that take place at the different synaptic release sites between neurons u and v. Second, even at a single release site, the amplitude of the EPSP ("quantal size") caused by a vesicle release at this site varies from trial to trial (Dobrunz & Stevens, 1997; Auger, Kondo, & Marty, 1998).

Most of our results, in particular the results in the following section, hold for arbitrary probability density functions φ_vu and arbitrary values of the parameters r_vu. In section 2.3 we address the question of what effect specific probability density functions φ_vu may have on analog computations in terms of space-rate coding. The equation for the probability density function φ_vu for the case of multiple release sites and a gaussian distribution of the quantal size at each release site, which is modeled in our computer simulations, is discussed in the appendix. Figure 1C shows an example of φ_vu for a synapse with five release sites.

1. Our results remain valid for connection patterns given by sparser random graphs.

Figure 1: (A) Computer simulation of the model described in section 2.1 with a time interval of length Δ = 5 ms for space-rate coding and neurons modeled by the spike-response model of Gerstner (1999b). We have chosen n = 6, a pool size N = 200, and ⟨w1, ..., w6⟩ = ⟨10, −20, −30, 40, 50, 60⟩ for the effective weights. Each dot is the result of a simulation with an input ⟨x1, ..., x6⟩ selected randomly from [0, 1]^6 in such a way that Σ_{i=1}^6 wi xi covers the range [−10, 70] almost uniformly. The y-axis shows the fraction y of neurons in pool V that fire during a 5 ms time interval in response to the firing of a fraction xi of neurons in pool Ui for i = 1, ..., 6 during an earlier time interval of length 5 ms. The solid line is a plot of s(Σ_{i=1}^6 wi xi) as described in the text (cf. equation 2.6). (B) Distribution of nonfailure probabilities r_vu for synapses between the pools U4 and V underlying this simulation. (C) Example of a probability density function φ_vu of EPSP amplitudes as used for these simulations. This corresponds to a synapse with five release sites and a release probability of 0.3. For details about the simulation, see the appendix.
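A drastically simplified version of the simulation shown in Figure 1 can be sketched in a few lines (one input pool instead of six, a single release site per synapse with gaussian amplitudes, and no spike-response dynamics; all parameter values are illustrative, not those of the paper):

```python
import numpy as np

def fraction_firing(x, N, r, a_mean, a_std, theta, rng):
    """Monte Carlo sketch of the model of section 2.1 for one excitatory
    input pool U: each of the x*N firing neurons in U triggers a release
    onto each v in V with probability r; release amplitudes are gaussian.
    A neuron v fires if its summed input h_v reaches the threshold theta.
    Returns the fraction y of pool V that fires."""
    n_fired = int(round(x * N))
    released = rng.random((N, n_fired)) < r           # independent failures
    amps = rng.normal(a_mean, a_std, size=(N, n_fired))
    h = (released * amps).sum(axis=1)                 # h_v for every v in V
    return (h >= theta).mean()

rng = np.random.default_rng(1)
# y rises smoothly with x: the gain of the sigmoid comes from the spread
# of the inputs h_v across pool V produced by the release failures
ys = [fraction_firing(x, 400, 0.3, 1.0, 0.6, 30.0, rng)
      for x in np.linspace(0.0, 1.0, 11)]
print(np.round(ys, 2))
```

Because each v ∈ V sees an independent pattern of release failures, the inputs h_v are spread around their mean, and the fraction of V crossing the threshold grows gradually with x instead of all-or-none.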
2.2 Graded Response Through Unreliable Synapses. In this section we introduce analytical tools for estimating the firing activity y in pool V in terms of the firing activities x1, ..., xn in n presynaptic pools U1, ..., Un. We show that y can basically be approximated by a function of a weighted sum of the xi, although some unexpected complications will turn up in this analysis. First, we demonstrate that the model defined above is indeed able to produce a graded output y encoded in a space-rate code in pool V such that the activity is uniformly distributed over all neurons v ∈ V. We prove this fact theoretically by employing a special version of the central limit theorem of probability theory, the Berry-Esseen theorem. Our theoretical results are then compared with computer simulations of our model.

Consider an idealized mathematical model where all neurons that fire in the pool Ui fire synchronously at time T_in, and the probability that a neuron v ∈ V fires (at time T_out) can be described by the probability that the sum h_v of the amplitudes of EPSPs and IPSPs resulting from firing of neurons in the pools U1, ..., Un exceeds the firing threshold θ (which is assumed to be the same for all neurons v ∈ V). We assume in this section that the firing rates of neurons in pool V are relatively low, so that the impact of their refractory period can be neglected. We investigate refractory effects in section 3. The random variable (r.v.) h_v is the sum of random variables h_vu for all neurons u ∈ ∪_{i=1}^n Ui, where h_vu models the contribution of neuron u to h_v. We assume that h_vu is nonzero only if neuron u ∈ Ui fires at time T_in (which occurs with probability xi)² and if the synapse between u and v releases one or several vesicles (which occurs with probability r_vu whenever u fires). If both events occur, then the value of h_vu is chosen according to some probability density function φ_vu.

The functions φ_vu, as well as the parameters r_vu, are allowed to vary arbitrarily for different pairs u, v of neurons. For each neuron v ∈ V, we consider the sum h_v = Σ_{i=1}^n Σ_{u∈Ui} h_vu of the r.v.'s h_vu, and we assume that v fires at time T_out if and only if h_v ≥ θ. Although the r.v.'s h_vu may have quite different distributions (for example, due to different φ_vu and r_vu), their stochastic independence allows us to approximate the firing probability P{h_v ≥ θ} through a normal distribution Φ. The Berry-Esseen theorem (Petrov, 1995) implies that

    |P{h_v ≥ θ} − (1 − Φ(θ; μ_v, σ_v))| ≤ 0.7915 ρ_v / σ_v³,     (2.1)

where Φ(θ; μ_v, σ_v) denotes the normal distribution function with mean μ_v and variance σ_v². The three moments occurring in equation 2.1 can be related to the r.v.'s h_vu through the equations μ_v = Σ_{i=1}^n Σ_{u∈Ui} E[h_vu], σ_v² = Σ_{i=1}^n Σ_{u∈Ui} Var[h_vu], and ρ_v = Σ_{i=1}^n Σ_{u∈Ui} E[|h_vu − E[h_vu]|³].

2. This holds if the pool size is large enough that we can treat xi (y) as the probability that a neuron u ∈ Ui (v ∈ V) will fire once during a certain input (output) interval of length Δ.

According to
the definition of the r.v. h_vu, we have E[h_vu] = xi r_vu ā_vu and E[h_vu²] = xi r_vu â_vu, where ā_vu = ∫ a φ_vu(a) da denotes the mean EPSP (IPSP) amplitude and â_vu = ∫ a² φ_vu(a) da denotes the second moment. Hence we can assign to μ_v and σ_v in equation 2.1 the values

    μ_v = Σ_{i=1}^n Σ_{u∈Ui} xi r_vu ā_vu,     (2.2)

    σ_v² = Σ_{i=1}^n Σ_{u∈Ui} (xi r_vu â_vu − xi² r_vu² ā_vu²).     (2.3)
A closer look reveals that the right-hand side of equation 2.1 scales like N^(−1/2).³ Hence, equation 2.1 implies that for large N, we can approximate the firing probability P{h_v ≥ θ} by the term 1 − Φ(θ; μ_v, σ_v), which grows smoothly with μ_v. The gain of this sigmoidal function depends on the size of σ_v. In particular, if synaptic transmission were reliable, this function would degenerate to a step function. With the definition of the formal weights w_vi := Σ_{u∈Ui} r_vu ā_vu, we have μ_v = Σ_{i=1}^n w_vi xi, and hence 1 − Φ(θ; μ_v, σ_v) grows smoothly with the weighted sum Σ_{i=1}^n w_vi xi of the inputs xi.

So far we have considered only the probability P{h_v ≥ θ} that a single neuron v ∈ V will fire, but we are really interested in the expected fraction y of neurons in pool V that will fire in response to a firing of a fraction xi of neurons in the pools Ui for i = 1, ..., n. According to equation 2.1, one can approximate y for sufficiently large pool sizes N by

    y = (1/N) Σ_{v∈V} P{h_v ≥ θ} = (1/N) Σ_{v∈V} (1 − Φ(θ; μ_v, σ_v)).

Hence y is approximated by an average of the N sigmoidal functions 1 − Φ(θ; μ_v, σ_v). If the weights w_vi have similar values for different v ∈ V, one can expect that y grows smoothly with the weighted sum μ̄ = Σ_{i=1}^n wi xi, where we write wi = Σ_{v∈V} w_vi / N for the "effective weights" wi between the pools of neurons Ui and V.

In order to test these theoretical predictions for the idealized mathematical model, we have carried out computer simulations of a more detailed model consisting of more realistic models for spiking neurons and time intervals
3. More precisely, the right-hand side of equation 2.1 scales like N^(−1/2) if for all N, the average value of the terms E[|h_vu − E[h_vu]|³] is uniformly bounded from above and the average value of the terms Var[h_vu] is uniformly bounded from below by a constant > 0 for i ∈ {1, ..., n} and u ∈ ∪_{i=1}^n Ui. The latter can be achieved only for inputs where xj > 0 for some j, since otherwise Var[h_vu] = 0 for all u ∈ Ui and all i ∈ {1, ..., n}. But in the case xi = 0 for all i ∈ {1, ..., n}, both P{h_v ≥ θ} and 1 − Φ(θ; μ_v, σ_v) have value 0 if θ > 0, and hence the left-hand side of equation 2.1 has value 0.
I_in and I_out of length Δ = 5 ms for space-rate coding. The resulting fraction y of neurons in pool V that fired during I_out was measured for a large variety of different inputs ⟨x1, ..., x6⟩ ∈ [0, 1]^6 encoded through the fractions of firing neurons in U1, ..., U6 during an earlier time interval I_in.⁴ It is shown in Figure 1A that y can be approximated quite well by a suitable sigmoidal "activation function" s (more precisely, by the function s derived in equation 2.6 applied to a weighted sum Σ_{i=1}^6 wi xi of the inputs x1, ..., x6). Note that s has not been implemented explicitly in our computational model, but rather emerges implicitly through the large-scale statistics of the firing activity.

The deviation of the data points in Figure 1A from the sigmoidal function s(Σ_{i=1}^n wi xi) (solid line) can be traced back to two independent sources of noise. One source of noise is stochastic fluctuations due to the finite pool size N = 200. This type of noise is guaranteed to disappear for N → ∞ with N^(−1/2) according to equation 2.1. Another source of noise is of a more systematic nature and is predicted by our preceding theoretical analysis (cf. equation 2.1). For large pool sizes N, the firing probability of a neuron v in pool V converges to 1 − Φ(θ; μ_v, σ_v). This function depends not just on the weighted sum μ_v = Σ_{i=1}^n w_vi xi, but also on another function σ_v of the input vector ⟨x1, ..., xn⟩. This fact, which causes a noise for computing in space-rate coding that dominates the sampling noise already for moderate pool sizes (cf. Figure 2), will be addressed in the next section. In the following, the term systematic noise refers to the fact that the output y in space-rate code does not merely depend on a single variable μ̄ = Σ_{i=1}^n wi xi but is actually a function that depends in a more complex way on the input ⟨x1, ..., xn⟩.

2.3 Analysis of the Systematic Noise.
At the end of the preceding section, we pointed out that the output y of our model for computing in space-rate coding depends not just on the weighted sum Σ_{i=1}^n wi xi of the inputs x1, ..., xn. Here we provide tools for analyzing on what other quantities the output y may depend, and we exhibit a specific type of synaptic connectivity that minimizes the impact of this type of systematic noise.

Equation 2.1 shows that if one wants to approximate the firing probability P{h_v ≥ θ} of an arbitrary neuron v in pool V by a function of the weighted sum μ_v of the firing probabilities xi in pools Ui, it is indeed necessary that σ_v also can be approximated by a function of μ_v. This follows easily from the equation Φ(θ; μ_v, σ_v) = Φ((θ − μ_v)/σ_v; 0, 1). In order to analyze the possibilities for approximating σ_v by a function of μ_v, we restate σ_v² as

    σ_v² = Σ_{i=1}^n ŵ_vi xi − Σ_{i=1}^n xi² Σ_{u∈Ui} r_vu² ā_vu²,  with ŵ_vi := Σ_{u∈Ui} r_vu â_vu.     (2.4)

4. The firing times of the individual neurons are distributed uniformly over I_in.
Figure 2: Quantitative study of the noise underlying the computation in our model. The model is constructed in exactly the same way as described in Figure 1, with the same values of the weights wi. (A) The empirical standard deviation (std) of the outputs y of 200 individual simulations in dependence of the pool size N for two cases. We used the same input x⁽⁰⁾ = ⟨x1⁽⁰⁾, ..., xn⁽⁰⁾⟩ for all 200 simulations and measured the standard deviation λ0 of the observed outputs. Thus, λ0 describes the contribution of the stochastic fluctuations to the total noise. In order to test to what extent the response depends just on μ̄, we generated 200 different inputs x = ⟨x1, ..., xn⟩ such that Σ_{i=1}^n wi xi always was equal to μ̄⁽⁰⁾ = Σ_{i=1}^n wi xi⁽⁰⁾ = 24 and measured the standard deviation λt of the resulting output y, which describes the amount of the total noise. (B) The ratio λt/λ0 between the total noise and the stochastic fluctuations.
We will focus on scenarios where the second term Σ_{i=1}^n xi² Σ_{u∈Ui} r_vu² ā_vu² of equation 2.4 can be neglected for sufficiently large N. This can be achieved if either the nonfailure probabilities r_vu or the mean EPSP (IPSP) amplitudes ā_vu scale such that the weights w_vi remain bounded for large N.⁵ Under this assumption, one can approximate σ_v² by some second-order polynomial Aμ_v² + Bμ_v + C in μ_v if the angle Θ between the weight vector w_v = ⟨w_v1, ..., w_vn⟩ and the "virtual" weight vector ŵ_v = ⟨ŵ_v1, ..., ŵ_vn⟩ that is defined in equation 2.4 is rather small.⁶

One can easily show that σ_v² can be expressed exactly as a function of μ_v if and only if Θ = 0 or Θ = π. Hence it is worthwhile to analyze what type of

5. The term Σ_{i=1}^n xi² Σ_{u∈Ui} r_vu² ā_vu² converges to 0 for N → ∞ if, for example, the empirical standard deviation of the terms r_vu ā_vu over u ∈ Ui does not grow much faster than the mean w_vi/N = Σ_{u∈Ui} r_vu ā_vu / N for N → ∞, for example, if ((1/N) Σ_{u∈Ui} (r_vu ā_vu − w_vi/N)²)^(1/2) ≤ w_vi/N. This appears to be a quite reasonable assumption, and it furthermore implies that Σ_{u∈Ui} r_vu² ā_vu² ≤ (2/N)·w_vi². Hence the second term Σ_{i=1}^n xi² Σ_{u∈Ui} r_vu² ā_vu² in equation 2.4 converges to 0 for N → ∞ if the "weights" w_vi remain bounded for N → ∞.

6. Θ is defined by cos Θ = (w_v · ŵ_v) / (‖w_v‖ · ‖ŵ_v‖).
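The geometry behind this condition is easy to check numerically. The sketch below (all names and values are ours) builds the effective and virtual weight vectors for the single-release-site scenario discussed in the following paragraphs, where ŵ_v is proportional to w_v by construction, and verifies that the angle Θ is 0 no matter how the release probabilities r_vu are chosen:

```python
import numpy as np

def angle_between(w, w_hat):
    """Angle Theta with cos Theta = (w . w_hat) / (|w| |w_hat|)."""
    c = np.dot(w, w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
    return np.arccos(np.clip(c, -1.0, 1.0))

rng = np.random.default_rng(2)
n, N = 6, 100
# one release site per connection and a single stereotyped amplitude
# density with mean a_bar and second moment a_hat; only the release
# probabilities r_vu differ between synapses (all values illustrative)
a_bar, a_hat = 1.0, 1.4
r = rng.uniform(0.05, 0.9, size=(n, N))    # r[i, u] for u in U_i
w = a_bar * r.sum(axis=1)                  # effective weights w_vi
w_hat = a_hat * r.sum(axis=1)              # virtual weights, parallel to w
print(angle_between(w, w_hat))             # ~0: w_hat = (a_hat/a_bar) * w
```

With multiple release sites and heterogeneous amplitude densities, ŵ_v generally points in a different direction than w_v, and Θ > 0.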
synaptic connectivity structure can achieve this. Obviously Θ = 0 (Θ = π) holds if and only if there exists a constant c > 0 (c < 0) such that ŵ_vi = c·w_vi for all i ∈ {1, ..., n}. Note that this implies that all pools Ui consist of excitatory (Θ = 0) or inhibitory (Θ = π) neurons, since ŵ_vi ≥ 0 by definition of ŵ_vi. In the following, we consider the case where all neurons are excitatory.

The condition ŵ_vi = c·w_vi for all i ∈ {1, ..., n} with a common constant c > 0 is quite difficult to achieve in the case of heterogeneous weights w_vi for different i. However, one scenario in which this can in principle be achieved even with heterogeneous nonnegative weights w_vi is that where there exists for each neuron u ∈ ∪_{i=1}^n Ui just a single release site between u and v, with a common stereotyped probability density function φ for the amplitudes a_vu of the EPSPs caused by a synaptic release. For such an architecture, the nonfailure probabilities r_vu (equal to the release probability in that case) can be chosen arbitrarily in order to achieve a desired heterogeneous assignment of nonnegative weights w_vi for i ∈ {1, ..., n}, while the condition ŵ_vi = c·w_vi for all i ∈ {1, ..., n} will automatically be satisfied with a common constant c > 0. This follows easily from the fact that for a common stereotyped probability density function φ, one gets w_vi = ā Σ_{u∈Ui} r_vu and ŵ_vi = â Σ_{u∈Ui} r_vu, with ā = ∫ a φ(a) da and â = ∫ a² φ(a) da.

A connection from a neuron u ∈ Ui to a neuron v ∈ V via a single release site also seems to be advantageous with regard to synaptic plasticity. In the case of a single release site, it is a reasonable assumption that the nonfailure probability r_vu (equal to the release probability in that case) and the distribution of the EPSP amplitudes φ_vu can be chosen independently.⁷ This implies that one can approximate the effective weights w_vi by (1/N)(Σ_{u∈Ui} ā_vu)(Σ_{u∈Ui} r_vu) and the "virtual weights" ŵ_vi by (1/N)(Σ_{u∈Ui} â_vu)(Σ_{u∈Ui} r_vu). Thus, the condition ŵ_vi = c·w_vi for all i ∈ {1, ..., n} reduces to

    (1/N) Σ_{u∈Ui} â_vu = (c/N) Σ_{u∈Ui} ā_vu   for all i ∈ {1, ..., n}.     (2.5)

Assume that for a given set of effective and virtual weights, the relation ŵ_vi = c·w_vi holds for all i ∈ {1, ..., n}, and subsequently the weights w_vi are subject to change due to some plasticity mechanism. In order to maintain the property that ŵ_vi = c·w_vi for all i, such a change is best implemented by changes in the nonfailure probabilities r_vu, since equation 2.5 holds independent of their values. If the weight change were implemented by changes in the mean EPSP amplitudes ā_vu, then the average of the second moments â_vu would have to change proportionally to keep equation 2.5 valid. If one considers the well-known relation â_vu = ā_vu² + ã_vu², where ã_vu² is

7. For synapses with more than one release site, the assumption that r_vu can be chosen independently from φ_vu does not hold (see the appendix for an example).
Figure 3: Quantitative study of the noise underlying the computation in our model if all connections consist of just a single release site. The model is constructed in a similar way as described in Figure 1, with ⟨w1, ..., w6⟩ = ⟨10, 20, 30, 40, 50, 60⟩ for the effective weights. The simulation protocol is the same as described in the caption of Figure 2. (A) Stochastic fluctuations (λ0) and the total noise (λt). (B) The ratio between λ0 and λt has a value of about 1 for N ≥ 200. This shows that the output y of the network depends just on the weighted sum μ̄ = Σ_{i=1}^n wi xi.

the variance of the distribution φ_vu (ã_vu² = ∫ φ_vu(z)(z − ā_vu)² dz), it becomes even clearer that such a change would involve complex changes in the distributions φ_vu. Hence, in the case of connections with single release sites, it is advantageous to implement weight changes through changes in the nonfailure probabilities rather than through changes in the distribution of the EPSP amplitudes.

Thus, our preceding analysis exhibits a possible computational advantage of synaptic connections consisting of single release sites. At first sight, such a connectivity structure appears to be quite undesirable because of the high unreliability of such synaptic connections. However, our preceding arguments show that such a connectivity structure supports precise analog computations in space-rate coding better than multiple release sites. Together with equation 2.5, it induces an angle Θ = 0 between the vectors ⟨w_v1, ..., w_vn⟩ and ⟨ŵ_v1, ..., ŵ_vn⟩, thereby allowing that the resulting firing activity y in pool V depends only on μ_v = Σ_{i=1}^n w_vi xi for arbitrary values x1, ..., xn ∈ [0, 1] of presynaptic pool activities (see Figure 3).

Now we consider the case where the angle Θ between the weight vector w_v = ⟨w_v1, ..., w_vn⟩ and the "virtual" weight vector ŵ_v = ⟨ŵ_v1, ..., ŵ_vn⟩ is rather small but ≠ 0. We will show that in this case, we can still approximate σ_v² well by the expression B0·μ_v + C0 for B0 := (w_v · ŵ_v) / ‖w_v‖² and C0 := (1/2) Σ_{i=1}^n (ŵ_vi − B0·w_vi). In fact, this specific linear expression in μ_v provides an approximation to σ_v² that is in a heuristic sense optimal
1690
W. Maass and T. Natschläger
among all second-order polynomials Aμ_v² + Bμ_v + C in μ_v. To see this, we choose the parameters A, B, and C such that the average error

E = ∫_{[0,1]^n} (σ_v² − (Aμ_v² + Bμ_v + C))² dx

of such an approximation is minimized. As shown in the appendix, the conditions dE/dA = 0, dE/dB = 0, and dE/dC = 0 imply for N → ∞ that A = 0, B = B₀, and C = C₀. For these values of A, B, and C, the average error E has the minimum value

E_min = ∫_{[0,1]^n} (σ_v² − (B₀μ_v + C₀))² dx = ‖ŵ_v‖² (1 − cos²Θ) / 12.

E_min decreases with decreasing Θ; in particular, E_min = 0 if Θ = 0. This observation indicates that one can approximate the firing probability P{h_v ≥ ϑ} of neuron v in pool V by the sigmoidal function 1 − Φ(ϑ; μ_v, (B₀μ_v + C₀)^{1/2}) of the single variable μ_v if the angle Θ is not too large (Φ(·; μ, σ) denotes the gaussian distribution function with mean μ and standard deviation σ, and ϑ the firing threshold). Furthermore, if we assume that the values w_vi and ŵ_vi do not differ too much for different neurons v ∈ V, we can approximate the output y of the network by
y ≈ 1 − Φ(ϑ; μ̄, (B₀μ̄ + C₀)^{1/2})    (2.6)
for μ̄ = Σ_{i=1}^n w_i x_i and w_i = (1/N) Σ_{v∈V} w_vi. This theoretical prediction is supported by our computer simulations. The function of μ̄ = Σ_{i=1}^n w_i x_i that appears on the right-hand side of equation 2.6 is drawn as a solid line in Figure 1A.

Complications arise if some of the weights w_vi are positive and some are negative. In this case, it is impossible to satisfy the condition ŵ_vi = c·w_vi with a common constant c ≠ 0 for all i ∈ {1, ..., n}. Hence σ_v² does not depend only on μ_v, and y does not depend only on μ_v. This indicates that the systematic noise is larger in the case where one considers a combination of inhibitory and excitatory input. However, the simulations reported in Figure 1 show that even in this case, the output y of the network approximates a sigmoidal function of μ̄ = Σ_{i=1}^n w_i x_i.

2.4 Multilayer Computations. The preceding arguments show that approximate computations of functions of the form ⟨x₁, ..., xₙ⟩ → y = σ(Σ_{i=1}^n w_i x_i), with inputs and output in space-rate code, can be carried out within 10 ms by a network of spiking neurons. Hence, the universal approximation theorem for multilayer perceptrons implies that arbitrary continuous functions f: [0, 1]ⁿ → [0, 1]ᵐ can be approximated with a computation time of not more than 20 ms by a network of spiking neurons with three layers. Thus, our model provides a possible theoretical explanation for the empirically observed very fast multilayer computations in biological neural systems that were mentioned in section 1. Results of computer simulations of the computation of a specific function f in space-rate coding that requires a multilayer network because it interpolates the boolean function XOR are shown in Figure 4.
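The graded dependence of the output activity y on the input activities can be illustrated with a small simulation of this model. The sketch below is ours, not the simulation code of the article: pool size, release probabilities, quantal sizes, and the threshold are arbitrary illustrative choices, with a single release site per connection and gaussian quantal sizes as in appendix A.1.

```python
import numpy as np

rng = np.random.default_rng(1)

N     = 200                          # neurons per pool (illustrative)
p     = np.array([0.3, 0.5, 0.4])    # release probabilities for U_1..U_3
q     = np.array([1.0, 2.0, 1.5])    # mean quantal sizes (EPSP amplitudes)
theta = 190.0                        # firing threshold of neurons in V

def network_output(x, trials=20):
    """Fraction y of neurons in pool V whose summed synaptic input
    reaches the threshold, for input pool activities x = (x_1, ..., x_n)."""
    y = 0.0
    for _ in range(trials):
        h = np.zeros(N)                               # membrane potentials in V
        for i in range(len(x)):
            fired    = rng.random((N, 1)) < x[i]      # which neurons u in U_i fire
            released = rng.random((N, N)) < p[i]      # unreliable transmission
            amps     = rng.normal(q[i], 0.05 * q[i], (N, N))
            h += (fired * released * amps).sum(axis=0)
        y += np.mean(h >= theta) / trials
    return y

# The fraction y rises smoothly with the common input activity x (cf. Figure 1A)
for x in (0.1, 0.5, 0.9):
    print(x, network_output(np.full(3, x)))
```

With the threshold placed near the expected input for x ≈ 0.5, the fraction of firing neurons in V rises smoothly from near 0 to near 1 as the common input activity increases.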
Model for Fast Analog Computation
1691
Figure 4: (A) Plot of a function f(x₁, x₂): [0, 1]² → [0, 1], which interpolates XOR. Since f interpolates XOR, it cannot be computed by a single sigmoid unit. (B) Computation of f by a three-layer network in space-rate coding with spike-response model neurons (N = 200) according to our model (for details, see the appendix).
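Since f interpolates XOR, one hidden layer is needed. A minimal sketch with hand-picked weights (these are illustrative values, not the network parameters used for Figure 4):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f(x1, x2):
    """A three-layer sigmoidal network that interpolates XOR on [0, 1]^2."""
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)     # roughly: x1 OR x2
    h2 = sigmoid(20 * x1 + 20 * x2 - 30)     # roughly: x1 AND x2
    return sigmoid(20 * h1 - 20 * h2 - 10)   # OR but not AND

for x1, x2 in ((0, 0), (0, 1), (1, 0), (1, 1)):
    print(x1, x2, round(f(x1, x2)))          # prints the XOR values 0, 1, 1, 0
```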
2.5 Consequences for Synfire Chains. Our computational model also throws new light on the familiar model of a synfire chain (Abeles, 1991). There one considers a longer chain of pools of neurons, with large diverging and converging connectivity between adjacent pools in the chain. Abeles, Bergman, Margalit, and Vaadia (1993) pointed out that this architecture has the property that relatively synchronous firing in the first pool triggers even more synchronous firing in the subsequent pools of the chain. We have shown in this article that if one takes synaptic unreliability into account, a synchronous firing of a certain fraction x of neurons in the first pool will cause a certain fraction y of neurons in the subsequent pools to fire synchronously. Whereas Abeles (1991) considered only the situation where x and y are close to 1, our analysis shows that more subtle information processing can be carried out by a synfire chain: one can arrange that y is a smooth function of x, for a wide range of values for x. Such "graded activation" of synfire chains allows substantially more complex computations in networks of linked and/or reverberating synfire chains, since it supports the implementation of "graded pointers" from one synfire chain to other ones.

The original version and our analog version of a synfire chain (based on unreliable synapses) make slightly different predictions regarding the firing behavior of individual neurons in the synfire chain. In the analog version of a synfire chain, only a certain fraction y of neurons will fire in a pool V of the chain, where y may assume arbitrary values between 0 and 1. Furthermore, for repeated activations of the synfire chain with the same firing activity x in the first pool, the actual set of neurons in V that fire will change from trial to trial. Hence, precisely timed firing patterns among neurons in different pools of the chain would not occur every time the first pool is activated
with the same x, but only with a certain probability that is somewhat higher than for completely randomly firing neurons. This prediction appears to be consistent with published data (Abeles et al., 1993).

3 Analog Computation on Time Series in a Space-Rate Code
We showed in section 2 that biological neural systems with space-rate coding have at least the computational power of multilayer perceptrons. In this section, we demonstrate that they have strictly more computational power. This becomes clear if one considers computations on time series rather than on static batch inputs, as in section 2. We now analyze the behavior of our computational model if the firing probabilities in the pools U_i change with time. Writing x_i(t) (y(t)) for the probability that a neuron in pool U_i (V) fires during the t-th time window of length Δ (e.g., for Δ in the range between 1 and 5 ms), our computational model from section 2 maps a vector of n analog time series {x_i(t)}_{t∈ℕ} onto an output time series {y(t)}_{t∈ℕ} (where ℕ = {0, 1, 2, ...}).

As an example, consider a network that consists of one presynaptic pool U₁ connected to the output pool V with the same type of synapses as discussed in section 2. In addition, there are feedback connections between individual neurons v ∈ V. The results of simulations reported in Figures 5 and 6 show that this network computes an interesting map in the time-series domain: the space-rate code in pool V represents a sigmoidal function σ (as in section 2) applied to the output of a bandpass filter. Figure 5B shows the response of such a network to a sine wave with some bursts of activity added on top (see Figure 5A). Figures 6A and 6B show the frequency response of the bandpass filter that is implemented by this network of spiking neurons (see the appendix for the definition of the frequency response). Figure 5C shows the output of another network of spiking neurons (which approximates a low-pass filter) to the same input (shown in Figure 5A). The theoretical analysis of these networks will be presented in section 3.3.

To analyze theoretically the computational power of such networks of spiking neurons in the time-series domain, we use the spike-response model (Gerstner, 1999b).
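The frequency response referred to here is obtained in appendix A.5 from the coefficients b_k, a_k of the filter that the network implements; it can be evaluated directly. The coefficients below are placeholders, not those of the simulated network:

```python
import numpy as np

def freq_response(b, a, f):
    """H(e^{j 2 pi f}) of the filter y(t) = sum_k b_k x(t-k) - sum_{k>=1} a_k y(t-k),
    with H(z) = (sum_{k>=0} b_k z^-k) / (1 + sum_{k>=1} a_k z^-k)  (appendix A.5)."""
    z = np.exp(2j * np.pi * f)
    num = sum(bk * z ** (-k) for k, bk in enumerate(b))
    den = 1 + sum(ak * z ** (-k) for k, ak in enumerate(a, start=1))
    return num / den

H = freq_response([1.0], [-0.5], 0.0)    # a one-pole filter evaluated at f = 0
print(abs(H), np.angle(H))               # amplitude response and phase response
```

Sweeping f over a grid and plotting `abs(H)` and `np.angle(H)` reproduces the kind of amplitude and phase curves shown in Figure 6.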
We will focus on the situation where one can neglect the nonlinear effects of the sigmoidal function σ; we consider time series of the form x_i(t) = x₀ + x̃_i(t) and y(t) = y₀ + ỹ(t), where the magnitudes of the signals x̃_i(t) and ỹ(t) are small enough that σ can be approximated in this range by a linear function x ↦ Kx.

3.1 Taking the Time Course of Postsynaptic Potentials into Account.
We model the effect of a firing of a neuron u ∈ U_i at time k on the membrane potential of a neuron v ∈ V at time t by a "response function" ε_i(t − k), as in the spike-response model (Gerstner, 1999b). Such a response function has the typical shape of an EPSP or IPSP. It is usually described mathematically as a difference of two exponentially decaying functions (see the appendix).
Figure 5: (A) Response of two different networks to the same input. (B) Response of a network that approximates a bandpass filter. (C) Response of a network that approximates a low-pass filter. The gray-shaded bars in B and C show the actual measured fraction of neurons that fire in pool V of the network during a time interval of length 5 ms in response to the input activity in pool U₁ shown in A. The solid lines in B and C are plots of the output predicted by equation 3.4. For details about the parameters, see the appendix.
Figure 6: (A) Amplitude response |H(f)| and (B) phase response ψ(f) of the filter (bandpass) implicitly implemented by the network of spiking neurons described at the beginning of section 3 (see the appendix for the definition of |H(f)|, ψ(f), and the parameters of the network). Solid lines are plots of the amplitude and phase response predicted by the theory presented in section 3.3. Dots are simulation results for the approximating network of spiking neurons (see the appendix for details).
If one takes this time course of EPSPs and IPSPs into account but ignores the impact of v's firing on its own membrane potential, our arguments from section 2 imply that one can write the expected value μ(t) of the membrane potentials of neurons v ∈ V at time t as

μ(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x_i(t − k).    (3.1)
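Equation 3.1 is a discrete convolution of the pool activities with the response functions. A minimal sketch (the kernel shape follows appendix A.2; the weights, time constants, and inputs are arbitrary illustrative values):

```python
import numpy as np

def eps_kernel(tau1, tau2, k_max):
    """Response function eps(k): difference of two exponentials (appendix A.2)."""
    t = np.arange(k_max, dtype=float)
    return (np.exp(-t / tau1) - np.exp(-t / tau2)) / (tau1 - tau2)

def mu(w, eps, x):
    """mu(t) = sum_i w_i sum_{k=0}^t eps_i(k) x_i(t - k)   (equation 3.1)."""
    T = x.shape[1]
    return sum(w[i] * np.convolve(eps[i], x[i])[:T] for i in range(len(w)))

T   = 100
w   = [10.0, -5.0]                               # effective weights (arbitrary)
eps = [eps_kernel(5.0, 12.0, T), eps_kernel(10.0, 12.0, T)]
x   = np.random.default_rng(2).random((2, T))    # two input activity time series
print(mu(w, eps, x)[:5])
```

Truncating the full convolution to the first T entries yields exactly the causal sums Σ_{k=0}^t ε_i(k) x_i(t − k).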
The "weights" w_i result from the release probabilities and distributions of PSP amplitudes as described in section 2. We will ignore the impact of synaptic dynamics and assume that these weights w_i do not depend on t.

3.2 Taking Refractory Effects into Account. When the resulting firing probabilities in pool V become sufficiently large, the refractory effects of neurons v ∈ V start to affect the resulting computation on time series. We then have to replace equation 3.1 by
μ(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x_i(t − k) + Σ_{k=1}^t g(k) y(t − k),    (3.2)
where g(k) typically assumes a very large negative value for a few ms and then decays exponentially with a relatively small time constant (Gerstner, 1999b). This yields a special case of an infinite impulse response time-invariant linear filter, as we show in the next section in a more general context. Since different specific firing properties of specific neurons, such as adaptation or rhythmic firing, correspond in this spike-response model to different refractory functions g(k) of specific neurons, a rich class of different infinite impulse response filters may arise in this way in biological neural systems.

3.3 Adding Recurrent Connections in Pool V. We now assume that in addition to the model defined in section 2, there are m different kinds of recurrent connections among neurons in pool V, with different response functions ρ_j(k) and weights w̃_j, j = 1, ..., m. We assume that the large-scale structure of these recurrent connections is of the same type as that of the connections between the pools U_i and V; that is, we assume that these are connections with unreliable synapses. Therefore, the weights w̃_j also result from release probabilities and PSP amplitude distributions as described in section 2. We then arrive at the following equation for the average membrane potential of neurons in pool V:
μ(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x_i(t − k) + Σ_{k=1}^t g(k) y(t − k) + Σ_{j=1}^m w̃_j Σ_{k=1}^t ρ_j(k) y(t − k).
If t is sufficiently large such that ε_i(t′) = ρ_j(t′) = 0 for t′ ≥ t, then for x_i(t) = x₀ + x̃_i(t) and y(t) = y₀ + ỹ(t), one can rewrite μ(t) as μ(t) = μ₀ + μ̃(t) with

μ₀ = x₀ Σ_{i=1}^n w_i Σ_{k=0}^∞ ε_i(k) + y₀ Σ_{k=1}^∞ g(k) + y₀ Σ_{j=1}^m w̃_j Σ_{k=1}^∞ ρ_j(k)
and

μ̃(t) = Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x̃_i(t − k) + Σ_{k=1}^t g(k) ỹ(t − k) + Σ_{j=1}^m w̃_j Σ_{k=1}^t ρ_j(k) ỹ(t − k).
We showed in section 2 that under certain conditions, one can approximate the firing activity y(t) in pool V by a function of the form σ(μ(t)), for some suitable sigmoidal function σ (see equation 2.6). Since we consider only small signals x̃_i(t) and ỹ(t), we can make the approximation ỹ(t) = K · μ̃(t) for some constant K, which depends on the sigmoidal function σ. This yields for ỹ(t) the recursive equation

ỹ(t) = K Σ_{i=1}^n w_i Σ_{k=0}^t ε_i(k) x̃_i(t − k) + K Σ_{k=1}^t g(k) ỹ(t − k) + K Σ_{j=1}^m w̃_j Σ_{k=1}^t ρ_j(k) ỹ(t − k).    (3.3)
We now show that equation 3.3 describes a very powerful computational model for computations on time series. This becomes clear if one considers just the special case where x_i(t) = x(t) (hence x̃_i(t) = x̃(t)) for all i ∈ {1, ..., n}. We then have

ỹ(t) = Σ_{k=0}^t b_k x̃(t − k) − Σ_{k=1}^t a_k ỹ(t − k)    (3.4)
with b_k = K Σ_{i=1}^n w_i ε_i(k) and a_k = −K g(k) − K Σ_{j=1}^m w̃_j ρ_j(k). For arbitrary real-valued numbers b_k and a_k for k ∈ ℕ with b_k = a_k = 0 for k > k₀ (where k₀ is some sufficiently large constant),⁸ this is the general form of an infinite impulse response (IIR) time-invariant linear filter (see Haykin, 1996; Back & Tsoi, 1991). Furthermore, if each response function ε_i(·) and ρ_j(·) is represented as a difference of two exponentially decaying functions and g(·) is also represented as an exponentially decaying function, then the kernels K Σ_{i=1}^n w_i ε_i(k) and −K g(k) − K Σ_{j=1}^m w̃_j ρ_j(k) in equation 3.4 are
8 Since in biological systems the responses to individual spikes vanish after some time, it is a reasonable approximation to set ε_i(k) = ρ_j(k) = g(k) = 0 for k > k₀.
weighted sums of exponentially decaying functions of the form e^{−k/τ} for various different values of τ. In that case, one can easily show that the map from arbitrary input functions x̃(t) to the output ỹ(t) can in principle approximate any given time-invariant linear filter with IIR for any finite number t of discrete time steps.⁹ In the case where a_k = 0 for all k ∈ ℕ, that is, ỹ(t) = Σ_{k=0}^t b_k x̃(t − k), the resulting mapping from x̃(t) to ỹ(t) is a time-invariant linear filter with finite impulse response (FIR). Hence, if there are no recurrent connections and one can neglect the refractory effects, our model can still approximate any given FIR filter. One can build a large class of practically relevant filters with the help of such FIR and IIR filters, including good approximations to low-pass and bandpass filters (see Figure 6).

Note that different architectures of recurrent circuits with different delays and different types of neurons involved in the recurrent loop give rise to a large variety of different coefficient vectors ⟨a₁, ..., a_{k₀}⟩. The variety of different vectors ⟨b₀, ..., b_{k₀}⟩ appears to be more limited in comparison, since the time courses of PSPs in isolated neurons are relatively uniform. However, it is shown in Bernander, Douglas, Martin, and Koch (1991) that the level of background activity in cortical neurons may change the time constants of PSPs considerably. Hence, such a mechanism supports the implementation of a large variety of vectors ⟨b₀, ..., b_{k₀}⟩.

The preceding analysis implies that, assuming that one can approximate any given real-valued sequences b_k and a_k on any finite interval, the output of the network of spiking neurons can approximate in space-rate coding on any finite time interval the saturated version of any given linear time-invariant filter with finite and infinite impulse response. One may view this result as a universal approximation theorem for linear filters by networks of spiking neurons.
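The recursion of equation 3.4 can be run directly. In the sketch below the coefficients b_k and a_k are built from exponentially decaying kernels as in the text; K, the weight, and the time constants are arbitrary illustrative values (one input pool, no recurrent pools):

```python
import numpy as np

k0 = 60                                  # kernels are truncated beyond k0
k  = np.arange(k0, dtype=float)

K    = 0.05                              # linearization constant (assumed)
eps  = np.exp(-k / 8.0)                  # feedforward PSP kernel (decaying)
g    = -3.0 * np.exp(-k / 4.0)           # refractory kernel g(k), negative
g[0] = 0.0                               # g enters the sums only for k >= 1

b = K * 2.0 * eps                        # b_k = K * sum_i w_i eps_i(k), one pool, w = 2
a = -K * g                               # a_k = -K g(k); no recurrent pools here

def iir(x):
    """y(t) = sum_{k=0}^t b_k x(t-k) - sum_{k=1}^t a_k y(t-k)   (equation 3.4)."""
    y = np.zeros_like(x)
    for t in range(len(x)):
        acc = sum(b[j] * x[t - j] for j in range(min(t + 1, k0)))
        acc -= sum(a[j] * y[t - j] for j in range(1, min(t + 1, k0)))
        y[t] = acc
    return y

y = iir(np.ones(400))                    # step response of the resulting filter
```

Since the feedback coefficients a_k are positive here (g is negative), the refractory term damps the response, and the step response settles to the fixed point Σb_k / (1 + Σ_{k≥1} a_k).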
Back and Tsoi (1991) have shown that an artificial neural network consisting of IIR filters as "synapses" and two layers of sigmoidal gates as "neurons" can be adjusted (via gradient descent for the parameters involved) to approximate in its output, for low-pass-filtered white noise as input, a very complex given time series. Our combined results from section 2 and the results of this section show that in principle, an arbitrary network of the type considered by Back and Tsoi (1991) can be implemented by a network of spiking neurons in space-rate coding. Note, however, that in this implementation, each IIR filter is implemented not by a single synapse but by a pool of spiking neurons.

9 For a rigorous proof of that result, one just needs to observe that any function g: {0, ..., k₀} → ℝ can be approximated arbitrarily closely by functions of the form f(k) = Σ_{i=1}^n w_i ε_i(k), where each function ε_i(·) is a difference of two exponentially decaying functions. This approximation result relies on two facts: that any continuous function can be approximated arbitrarily closely by a polynomial on any bounded interval (approximation theorem of Weierstrass) and that the variable transformation u = −log x for u ∈ [0, ∞) transforms an exponentially decaying function e^{−u/τ} over [0, ∞) into the function x^{1/τ} over (0, 1]. It becomes clear from this proof that it suffices to consider just time constants τ ≤ 1 for that purpose.

An interesting aspect of this analysis is that these computations in the time-series domain with space-rate coding make essential use of two aspects of neuronal dynamics that had received little attention in preceding computational models for neural systems: the specific forms of the refractory functions g and of the decaying parts of PSPs.

4 Hebbian Learning for Space-Rate Coding
The model for space-rate coding based on unreliable synapses that we analyzed in section 2 has the characteristic property that the actual sets of neurons that fire will vary from trial to trial, even for the same input. In this section, we show that this property is quite desirable from the point of view of learning. Most theoretically attractive learning rules for a computational unit in an artificial neural net, such as the Hebb rule, require a weight change Δw_i that depends on the product x_i·y of the i-th analog input x_i and the analog output y of the unit. If these values x_i and y are encoded in a network of spiking neurons by a space-rate code, it is not clear how this product x_i·y can be "computed" locally by a biological synapse or by the two spiking neurons that the synapse connects. On the other hand, Markram, Lübke, Frotscher, and Sakmann (1997) have shown that biological synapses regulate their efficacies in dependence on the temporal relationship between pre- and postsynaptic firing. For example, a biological synapse may increase its release probability by an amount β if the postsynaptic neuron v fires within 100 ms after the presynaptic neuron u. When we apply this rule in the context of our computational model from section 2, we see that the fraction of synapses between neurons u ∈ U_i and v ∈ V such that v fires within 100 ms after the firing of u has an expected value of x_i·y. Hence the expected fraction of synapses between U_i and V whose release probability is increased by β according to the rule from Markram, Lübke, Frotscher, and Sakmann (1997) is x_i·y. This causes an average change in release probability between the pools U_i and V proportional to x_i·y, although the value of the product x_i·y is nowhere explicitly computed in the network.
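That the expected fraction of potentiated synapses equals x_i·y can be checked with a toy computation in which pre- and postsynaptic neurons fire independently with probabilities x_i and y (a deliberately simplified sketch, not the full spiking simulation described in the appendix):

```python
import numpy as np

rng = np.random.default_rng(3)

N      = 200        # neurons per pool (illustrative)
xi, y  = 0.3, 0.6   # firing probabilities in pool U_i and in pool V
trials = 500

frac = 0.0
for _ in range(trials):
    pre  = rng.random(N) < xi            # neurons u in U_i that fire
    post = rng.random(N) < y             # neurons v in V that fire soon after
    # a synapse u -> v is potentiated iff u fired and v then fired
    frac += np.outer(pre, post).mean() / trials

print(frac, xi * y)   # the two values are close
```

Because the set of firing neurons is resampled on every trial, the potentiation spreads over all synapses between the two pools, which is exactly the trial-to-trial variability the argument below relies on.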
A closer look shows that this average change in release probability has the desired effect of increasing the response y after a few repetitions only under a special condition: if the actual sets of neurons in the pools that fire vary from trial to trial. Such trial-to-trial variability of the specific neurons that fire is inherent in our model for analog computation in space-rate coding presented in section 2, in contrast to previous models, which ignore the large-scale effects of synaptic unreliability.

We have compared the relationship between weight changes resulting from Hebb's learning rule for an artificial neural network and local changes of synaptic efficacy based on the temporal relationship of pre- and postsynaptic firing according to Markram, Lübke, Frotscher, and Sakmann (1997) for
Figure 7: (A, B) Typical change in the distribution of release probabilities at synapses between U₁ and V during learning as described in the text, (A) before and (B) after learning. (C) Distribution of the normalized errors |w_i^new − W_i^new| / |w_max| for all i ∈ {1, ..., 6}, where W_i^new is the ideal value of the updated weight according to the Hebb rule and w_i^new = (1/N) Σ_{v∈V} Σ_{u∈U_i} r_vu ā_vu is the "effective weight" resulting from applications of Markram's rule in our simulated network of spiking neurons (|w_max| = 100 in this case; the mean is 0.0062). See the appendix for details.
our computational model from section 2 through computer simulations (see the appendix). A typical change in the distribution of release probabilities between pools U₁ and V resulting from repeated applications of Markram's rule is shown in Figures 7A and 7B. After 10 learning steps, the weights w_i := (1/N) Σ_{v∈V} Σ_{u∈U_i} r_vu ā_vu for all i ∈ {1, ..., n} resulting from Markram's rule differ on average by 0.0062 (normalized error; see Figure 7C) from the weights W_i resulting from applications of Hebb's rule for an artificial neural network with the corresponding parameter values and the same initial weights. For the application of Hebb's learning rule, y was computed by σ(Σ_{i=1}^n w_i x_i) for the sigmoidal function σ given by the right-hand side of equation 2.6, with the same inputs x_i that are used as firing probabilities for the pools U₁, ..., Uₙ. These results show that due to the large trial-to-trial variability of neuronal firing in our model from section 2, iterated local applications of Markram's temporal learning rule implement with high fidelity Hebb's learning rule for space-rate coding on the network level.

5 Discussion
We have addressed specific links between details of realistic models for biological neurons and synapses on one hand and the resulting large-scale effects for computations with populations of neurons on the other hand. In particular, we have investigated possible macroscopic effects of the well-known stochastic nature of synaptic transmission. This aspect has been neglected in previous analytical studies of the dynamics of populations of spiking neurons (see, for example, the references in Gerstner, 1999a, 1999b).
In section 2 we showed that the unreliability of synaptic transmission suffices to explain the possibility of fast analog computations in space-rate coding on the network level. In fact, theoretically any given bounded continuous function can be approximated arbitrarily closely by such networks, with a computation time of not more than 20 ms. In contrast to other possible explanations of analog computation in space-rate coding, this model spreads firing activity uniformly among the neurons and therefore is able to explain the large trial-to-trial variability in neuronal firing patterns. Whereas it is occasionally conjectured that synaptic unreliability "averages out" on the network level and hence causes no significant macroscopic effects, we showed in section 2 through rigorous theoretical analysis and extensive computer simulations that synaptic unreliability may be essentially involved in generating a "graded response" for computations on the network level. Furthermore, we showed that most of these results are quite robust, since they hold for arbitrary assignments of synaptic failure probabilities and arbitrary distributions of PSP amplitudes.

We also investigated a specific source of systematic noise for analog computation in space-rate coding that arises in this model. In contrast to sampling errors, this systematic noise is not likely to disappear when the pool size N is chosen very large. However, our computer simulations suggest that the computational effect of this systematic noise is rather small. We showed in section 2.3 that this source of systematic noise is removed if synaptic connections consist of single release sites and the "weights" of synaptic connections are encoded in the release probabilities of synapses rather than in the amplitudes of the postsynaptic responses caused by the releases.

We also showed in section 2.5 that our computational model gives rise to an analog version of the familiar model of a synfire chain.
This analog version appears to be computationally more powerful than the classical binary version, and its firing behavior appears to be more consistent with experimental data.

In section 3 we addressed the question of what additional computational operations on the network level are supported by other prominent features of biological neurons and microcircuits, such as the refractory behavior of neurons and local recurrent connections. For that purpose, we have moved from computations on static inputs toward an analysis of computations on time series, which is arguably a particularly relevant computational domain for biology. Various filtering properties of biological neural systems have already been addressed in Koch (1999). We have exhibited specific macroscopic effects for analog computations on time series in space-rate coding that are caused by unreliable synapses in combination with the decaying parts of postsynaptic potentials and refractory effects on the neuron level, and in combination with recurrent connections in microcircuits. We have shown that through these features, a rich repertoire of linear filters, especially arbitrary time-invariant linear filters with finite and infinite impulse response, can be approximated on the level of space-rate coding in populations of neurons. In combination with the results from section 2, this shows that for computations on time series, in principle the full computational power of the artificial neural networks proposed in Back and Tsoi (1991) can be emulated by networks of biologically realistic neurons with space-rate coding.

In section 4 we investigated macroscopic effects of a local learning rule for spiking neurons that is empirically supported by the experiments of Markram, Lübke, Frotscher, and Sakmann (1997). Whereas this learning rule is usually discussed only in the context of temporal coding in small neuronal circuits, we showed that the same learning rule gives rise to a Hebbian learning rule on the network level with space-rate coding (hence without a coding of relevant information in firing times). This link between the Markram rule and Hebbian learning for space-rate coding appears to be obvious at first sight. A closer look, however, shows that it requires the large trial-to-trial variability of neuronal firing that is provided—in contrast to previous models—by the model for analog computation with space-rate coding based on unreliable synapses that was presented in section 2.

Appendix

A.1 Synapse Model. In all the simulations reported in this article, a connection between a neuron u ∈ U_i and v ∈ V is modeled by d_vu independent release sites, where d_vu may vary for different pairs u, v of neurons. The release probability p_vu is assumed to be equal at all d_vu release sites of one connection. In the case of a release at a single release site, the amplitude of the resulting EPSP (IPSP), also known as the quantal size, is drawn from a gaussian distribution with mean q̄_vu and variance q̃²_vu. For such a connection, the probability density function ω_vu of EPSP (IPSP) amplitudes in the case of a release is given by (we skip the indices vu for the sake of brevity)
ω(a) = (1/r) Σ_{k=1}^d (d choose k) p^k (1 − p)^{d−k} φ(a; kq̄, √k q̃),

where φ(·; q̄, q̃) is the normal probability density function with mean q̄ and standard deviation q̃, and r = 1 − (1 − p)^d is the probability that at least at one of the release sites some vesicle is released; that is, r is the "nonfailure probability." The mean ā := ∫ a ω(a) da and the second moment â := ∫ a² ω(a) da of such a distribution are given by

ā = q̄dp/r  and  â = (1/r) (q̄² (dp²(d − 1) + dp) + q̃²dp).

Note that d = 1 yields ā = q̄ and â = q̄² + q̃².

A.2 Neuron Model. For all simulations we have used the spike-response model as described in Gerstner (1999b). The time course of an EPSP (IPSP) is modeled by (a/(τ₁ − τ₂)) (e^{−t/τ₁} − e^{−t/τ₂}), where τ₁ and τ₂ describe the rise and fall
time of the PSP, and a defines the amplitude.¹⁰ If not stated otherwise, we have set τ₁ = 5 ms, τ₂ = 12 ms for the EPSPs and τ₁ = 10 ms, τ₂ = 12 ms for the slower IPSPs. The refractory behavior is modeled by the function g(t) = −R e^{−t/τ_r}, where τ_r = 4 ms was used for all simulations. R was chosen to be equal to the threshold of the neuron.

A.3 Synaptic Parameters. The parameters d_vu, p_vu, q̄_vu, and q̃_vu for a connection between a neuron u ∈ U_i and v ∈ V are chosen independently from distributions reported in the literature. The number of release sites d_vu was chosen randomly from the set {1, 2, 3, 4, 5} (Markram, Lübke, Frotscher, Roth, & Sakmann, 1997). The release probabilities p_vu are drawn from an exponential distribution (Huang & Stevens, 1997) with mean p_i (see Figure 1B for an example). The mean quantal sizes q̄_vu are drawn from a normal distribution with mean q̄_i and a variance of 0.1|q̄_i|. In accordance with the results reported in Auger et al. (1998), we have chosen the standard deviation of the quantal sizes q̃_vu = 0.05 q̄_vu. p_i and q̄_i were chosen to reflect different "effective weights" w_i between pool U_i and pool V, as specified in the caption of Figure 1.

A.4 Second-Order Approximation of σ_v². We define the average error E between σ_v² and the second-order polynomial Aμ_v² + Bμ_v + C as E := ∫_{[0,1]^n} (σ_v² − (Aμ_v² + Bμ_v + C))² dx. We want to find the values A₀, B₀,
and C₀ for the parameters A, B, and C such that E assumes its minimum E_min. Therefore, we simultaneously solve the equations dE/dA = 0, dE/dB = 0, and dE/dC = 0. After some calculations one gets

A₀ = − (Σ_i w_i⁴) / (N [Σ_i w_i⁴ + (5/2) Σ_i Σ_{j≠i} w_i² w_j²]),

B₀ = (Σ_i ŵ_i w_i)/(Σ_i w_i²) − (Σ_i ŵ_i w_i)/(N Σ_i w_i²) − (A₀/(Σ_i w_i²)) (Σ_i w_i³ + Σ_i Σ_{j≠i} w_i² w_j),

C₀ = ½ Σ_i (ŵ_i − B₀w_i) − (1/(3N)) Σ_i w_i² − (A₀/4) Σ_i Σ_{j≠i} w_i w_j − (A₀/3) Σ_i w_i².

In the limit N → ∞ these equations reduce to A₀ = 0, B₀ = (Σ_i ŵ_i w_i)/(Σ_i w_i²), and C₀ = ½ Σ_i (ŵ_i − B₀w_i). These solutions lead to a minimal value of the average error E of

E_min = [(Σ_i ŵ_i²)(Σ_i w_i²) − (Σ_i ŵ_i w_i)²] / (12 Σ_i w_i²).

10 If the two time constants of these two exponentially decaying functions converge to a common value, then their difference converges to an "α-function," which is another common description of the time course of a PSP.
With the definitions w_v = ⟨w_v1, ..., w_vn⟩, ŵ_v = ⟨ŵ_v1, ..., ŵ_vn⟩, and cos Θ = (w_v · ŵ_v)/(‖w_v‖ · ‖ŵ_v‖), this reduces to E_min = ‖ŵ_v‖² (1 − cos²Θ) / 12.

A.5 Frequency Response of a Time-Invariant Linear Filter. The transfer function (in terms of the z-transformation; Haykin, 1996) of a time-invariant linear filter that transforms a time series x(t) into a time series y(t) = Σ_{k=0}^{k₀} b_k x(t − k) − Σ_{k=1}^{k₀} a_k y(t − k) is given by H(z) = (Σ_{k=0}^{k₀} b_k z^{−k}) / (1 + Σ_{k=1}^{k₀} a_k z^{−k}), where z is a complex variable. Setting z = e^{j2πf} (j = √−1), we get the filter frequency response, denoted by H(e^{j2πf}), where f denotes the frequency in hertz. Expressing H(e^{j2πf}) in its polar form |H(f)| e^{jψ(f)}, one defines the frequency response of the filter in terms of two components: the amplitude response |H(f)| and the phase response ψ(f). These two quantities have the following meaning: if one applies a sine wave with frequency f at the input, the output of a linear filter is also a sine wave, but with different amplitude and phase. The amplitude response |H(f)| is the ratio between the amplitudes of the output and the input sine waves. The phase response ψ(f) is the difference between the phases of the output and the input sine waves. In that way, one can measure for each frequency f the amplitude and phase response, as was done in the simulations reported in Figure 6.

A.6 Details of Simulations Reported in Figures 5 and 6. The network that approximates a low-pass filter (cf. Figure 5C) consists of a single excitatory presynaptic pool U₁ and the postsynaptic pool V. The effective weight between U₁ and V is w₁ = 20. For the time constants of the EPSPs we have chosen τ₁ = 25 ms and τ₂ = 26 ms. For the network that approximates a bandpass filter (cf. Figure 5B), there were in addition inhibitory recurrent connections from pool V to pool V with an effective weight w̃₁ = −40 (τ₁ = 30 ms and τ₂ = 26 ms).
The effective weight from pool $U_1$ to pool $V$ was $w_1 = 130$ ($\tau_1 = 10$ ms and $\tau_2 = 11$ ms).

A.7 Details of Simulations Reported in Figure 7. We performed 300 simulations for the model from section 2 for $n = 6$, $N = 200$ with different values of $b$ and different sets of initial release probabilities $p_{vu}$ (and hence different sets of initial weights $w_i = (1/N) \sum_{v \in V} \sum_{u \in U_i} r_{vu} \bar{a}_{vu}$). Throughout these simulations, we used synapses with just a single release site. Each of these 300 simulations consisted of 10 individual trials with different inputs $\langle x_1^{(l)}, \ldots, x_6^{(l)} \rangle$, $l = 1, \ldots, 10$, where the inputs $x_i^{(l)}$ are drawn from normal distributions with mean $\bar{x}_i$ and variance $\hat{x}_i^2$ ($\bar{x}_i \in [0, 1]$ and $\hat{x}_i \in [0.1, 0.3]$ were chosen uniformly from the given interval for each simulation). In each trial, we simulated a network of spiking neurons and applied Markram's rule locally. The release probability $p_{vu}$ of a synapse between neurons $u \in U_i$ and $v \in V$ was increased by $b$ if the postsynaptic neuron fired within 20 ms after the presynaptic neuron. After these 10 trials, we compared the resulting
Model for Fast Analog Computation
effective weights $w_i^{\mathrm{new}} = (1/N) \sum_{v \in V} \sum_{u \in U_i} r_{vu} \bar{a}_{vu}$ with the weights $W_i^{\mathrm{new}}$ predicted by the Hebb rule, $W_i^{\mathrm{new}} = W_i^{\mathrm{old}} + \frac{b}{N}\, \bar{a}_i \sum_{l=1}^{10} x_i^{(l)} y^{(l)}$, where $\bar{a}_i$ is the mean PSP amplitude resulting from spikes of neurons $u \in U_i$ and $y^{(l)}$ is given by $y^{(l)} = \sigma\bigl(\sum_{i=1}^{6} w_i x_i^{(l)}\bigr)$ for the sigmoidal function given by the right-hand side of equation 2.6.

Acknowledgments
We thank Peter Auer for helpful discussions and an anonymous referee for helpful comments. Research for this article was partially supported by the ESPRIT Working Group NeuroCOLT, No. 8556, and the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austria, project P12153.

References

Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and gain control. Science, 275, 220–224.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology, 70(1), 1629–1638.
Auger, C., Kondo, S., & Marty, A. (1998). Multivesicular release at single functional synaptic sites in cerebellar stellate and basket cells. Journal of Neuroscience, 18(12).
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3, 375–385.
Bernander, O., Douglas, R. J., Martin, K. A. C., & Koch, C. (1991). Synaptic background activity determines spatial-temporal integration in pyramidal cells. Proceedings of the National Academy of Sciences, 88, 11569–11573.
Dobrunz, L., & Stevens, C. (1997). Heterogeneity of release probability, facilitation and depletion at central synapses. Neuron, 18, 995–1008.
Gerstner, W. (1999a). Populations of spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 261–293). Cambridge, MA: MIT Press.
Gerstner, W. (1999b). Spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 3–53). Cambridge, MA: MIT Press.
Haykin, S. (1996). Adaptive filter theory. Englewood Cliffs, NJ: Prentice Hall.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Huang, E. P., & Stevens, C. F. (1997). Estimating the distribution of synaptic reliabilities. J. Neurophysiology, 78(6), 2870–2880.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Larkman, A. U., Jack, J. J. B., & Stratford, K. J. (1997). Quantal analysis of excitatory synapses in rat hippocampal CA1 in vitro during low-frequency depression. J. Physiology, 505, 457–471.
Maass, W. (1997). Fast sigmoidal networks via spiking neurons. Neural Computation, 9, 279–304.
Markram, H., Lübke, J., Frotscher, M., Roth, A., & Sakmann, B. (1997). Physiology and anatomy of synaptic connections between thick tufted pyramidal neurons in the developing rat neocortex. J. Physiology, 500(2), 409–440.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213–215.
Petrov, V. V. (1995). Limit theorems of probability theory. New York: Oxford University Press.
Stevens, C. F., & Zador, A. (1998). Input synchrony and the irregular firing of cortical neurons. Nature Neuroscience, 1, 210–217.
Thorpe, S., Fize, D., & Marlot, C. (1996). Speed of processing in the human visual system. Nature, 381, 520–522.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophysical Journal, 12, 1–24.

Received July 2, 1998; accepted March 18, 1999.
LETTER
Communicated by Bartlett Mel
Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces

Aapo Hyvärinen
Patrik Hoyer
Helsinki University of Technology, Laboratory of Computer and Information Science, FIN-02015 HUT, Finland

Olshausen and Field (1996) applied the principle of independence maximization by sparse coding to extract features from natural images. This leads to the emergence of oriented linear filters that have simultaneous localization in space and in frequency, thus resembling Gabor functions and simple cell receptive fields. In this article, we show that the same principle of independence maximization can explain the emergence of phase- and shift-invariant features, similar to those found in complex cells. This new kind of emergence is obtained by maximizing the independence between norms of projections on linear subspaces (instead of the independence of simple linear filter outputs). The norms of the projections on such "independent feature subspaces" then indicate the values of invariant features.

1 Introduction
A fundamental approach in signal processing is to design a statistical generative model of the observed signals. Such an approach is also useful for modeling the properties of neurons in primary sensory areas. Modeling visual data by a simple linear generative model, Olshausen and Field (1996) showed that the principle of maximizing the sparseness (or nongaussianity) of the underlying image components is enough to explain the emergence of Gabor-like filters that resemble the receptive fields of simple cells in mammalian primary visual cortex (V1). Maximizing sparseness is in this context equivalent to maximizing the independence of the image components (Comon, 1994; Bell & Sejnowski, 1997; Olshausen & Field, 1996). We show in this article that this same principle can also explain the emergence of phase- and shift-invariant features: the principal properties of complex cells in V1. Using the method of feature subspaces (Kohonen, 1995, 1996), we model the response of a complex cell as the norm of the projection of the input vector (image patch) onto a linear subspace, which is equivalent to the classical energy models. Then we maximize the independence between the norms of such projections, or energies. Thus we obtain features that are localized

Neural Computation 12, 1705–1720 (2000) © 2000 Massachusetts Institute of Technology
in space, oriented, and bandpass (selective to scale/frequency), like those given by simple cells, or Gabor analysis. In contrast to simple linear filters, however, the obtained features also show the emergence of phase invariance and (limited) shift or translation invariance. Phase invariance means that the response does not depend on the Fourier phase of the stimulus; the response is the same for a white bar and a black bar, as well as for a bar and an edge. Limited shift invariance means that a near-maximum response can be elicited by identical bars or edges at slightly different locations. These two latter properties closely parallel the properties that distinguish complex cells from simple cells in V1. Maximizing the independence or, equivalently, the sparseness of the norms of the projections to feature subspaces thus allows for the emergence of exactly those invariances that are encountered in complex cells, indicating their fundamental importance in image data.

2 Independent Component Analysis of Image Data
The basic models that we consider here express a static monochrome image $I(x, y)$ as a linear superposition of some features or basis functions $b_i(x, y)$,

$$I(x, y) = \sum_{i=1}^{m} b_i(x, y)\, s_i, \qquad (2.1)$$
where the $s_i$ are stochastic coefficients, different for each image $I(x, y)$. The crucial assumption here is that the $s_i$ are nongaussian and mutually independent. This type of decomposition is called independent component analysis (ICA) (Comon, 1994; Bell & Sejnowski, 1997; Hyvärinen & Oja, 1997) or, from an alternative viewpoint, sparse coding (Olshausen & Field, 1996, 1997). Estimation of the model in equation 2.1 consists of determining the values of $s_i$ and $b_i(x, y)$ for all $i$ and $(x, y)$, given a sufficient number of observations of images, or in practice, image patches $I(x, y)$. We restrict ourselves here to the basic case where the $b_i(x, y)$ form an invertible linear system. Then we can invert the system as

$$s_i = \langle w_i, I \rangle, \qquad (2.2)$$

where the $w_i$ denote the inverse filters, and $\langle w_i, I \rangle = \sum_{x,y} w_i(x, y)\, I(x, y)$ denotes the dot product. The $w_i(x, y)$ can then be identified as the receptive fields of the model simple cells, and the $s_i$ are their activities when presented with a given image patch $I(x, y)$. Olshausen and Field (1996) showed that when this model is estimated with input data consisting of patches of natural scenes, the obtained filters $w_i(x, y)$ have the three principal properties of simple cells in V1: they are localized, oriented, and bandpass. van Hateren and van der Schaaf (1998) compared quantitatively the obtained filters $w_i(x, y)$ with those measured by single-cell recordings of the macaque cortex and found a good match for most of the parameters.
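Equation 2.2 amounts to a dot product between each inverse filter and the image patch. A minimal sketch with random arrays standing in for the filters $w_i$ and the patch $I$ (placeholders only, not learned filters):

```python
import numpy as np

rng = np.random.default_rng(0)

patch_size, m = 16, 4                                    # toy sizes
I = rng.standard_normal((patch_size, patch_size))        # image patch I(x, y)
W = rng.standard_normal((m, patch_size, patch_size))     # inverse filters w_i(x, y)

# s_i = <w_i, I> = sum over (x, y) of w_i(x, y) * I(x, y)   (eq. 2.2)
s = np.array([np.sum(w * I) for w in W])

# Equivalent vectorized form: flatten the filters and use a matrix product.
s_vec = W.reshape(m, -1) @ I.ravel()
```

The vector `s` holds the simple-cell activities for this patch; the two forms are numerically identical.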
3 Decomposition into Independent Feature Subspaces
In addition to the essentially linear simple cells, another important class of cells in V1 is complex cells. Complex cells share the above-mentioned properties of simple cells but have the two principal distinguishing properties of phase invariance and (limited) shift invariance (Hubel & Wiesel, 1962; Pollen & Ronner, 1983), at least for the preferred orientation and frequency. Note that although invariance with respect to shift and global Fourier phase are equivalent, they are different properties when phase is computed from a local Fourier transform. Another distinguishing property of complex cells is that the receptive fields are larger than in simple cells, but this difference is only quantitative and of less consequence here. (For more details, see Heeger, 1992; Pollen & Ronner, 1983; Mel, Ruderman, & Archie, 1998.) To date, very few attempts have been made to formulate a statistical model that would explain the emergence of the properties of visual complex cells. It is simple to see why ICA as in equation 2.1 cannot be directly used for modeling complex cells: in that model, the activations of the neurons $s_i$ can be used linearly to reconstruct the image $I(x, y)$, which is not true for complex cells, precisely because of their phase and shift invariance. The responses of complex cells do not give the phase or the exact position of the stimulus, at least not as a linear function, as in equation 2.1. (See von der Malsburg, Shams, & Eysel, 1998, for a nonlinear reconstruction of the image from complex cell responses.) The purpose of this article is to explain the emergence of phase- and shift-invariant features using a modification of the ICA model. The modification is based on combining the technique of multidimensional independent component analysis (Cardoso, 1998) and the principle of invariant-feature subspaces (Kohonen, 1995, 1996). We first describe these two recently developed techniques.
3.1 Invariant Feature Subspaces. The classical approach for feature extraction is to use linear transformations, or filters. The presence of a given feature is detected by computing the dot product of input data with a given feature vector. For example, wavelet, Gabor, and Fourier transforms, as well as most models of V1 simple cells, use such linear features. The problem with linear features, however, is that they necessarily lack any invariance with respect to such transformations as spatial shift or change in (local) Fourier phase (Pollen & Ronner, 1983; Kohonen, 1996). Kohonen (1996) developed the principle of invariant-feature subspaces as an abstract approach to representing features with some invariances. This principle states that one may consider an invariant feature as a linear subspace in a feature space. The value of the invariant, higher-order feature is given by (the square of) the norm of the projection of the given data point on that subspace, which is typically spanned by lower-order features.
A feature subspace, like any linear subspace, can always be represented by a set of orthogonal basis vectors, say, $w_i(x, y)$, $i = 1, \ldots, n$, where $n$ is the dimension of the subspace. Then the value $F(I)$ of the feature $F$ with input vector $I(x, y)$ is given by

$$F(I) = \sum_{i=1}^{n} \langle w_i, I \rangle^2. \qquad (3.1)$$
(For simplicity of notation and terminology, we do not distinguish clearly the norm and the square of the norm in this article.) In fact, this is equivalent to computing the distance between the input vector $I(x, y)$ and a general linear combination of the basis vectors (filters) $w_i(x, y)$ of the feature subspace (Kohonen, 1996). A graphical depiction of feature subspaces is given in Figure 1. Kohonen (1996) showed that this principle, when combined with competitive learning techniques, can lead to the emergence of invariant image features.

3.2 Multidimensional Independent Component Analysis. In multidimensional independent component analysis (Cardoso, 1998), a linear generative model as in equation 2.1 is assumed. In contrast to ordinary ICA, however, the components (responses) $s_i$ are not assumed to be all mutually independent. Instead, it is assumed that the $s_i$ can be divided into couples, triplets, or in general $n$-tuples, such that the $s_i$ inside a given $n$-tuple may be dependent on each other, but dependencies among different $n$-tuples are not allowed. Every $n$-tuple of $s_i$ corresponds to $n$ basis vectors $b_i(x, y)$. We call a subspace spanned by a set of $n$ such basis vectors an independent (feature) subspace. In general, the dimensionality of each independent subspace need not be equal, but we assume so for simplicity. The model can be simplified by two additional assumptions. First, although the components $s_i$ are not all independent, we can always define them so that they are uncorrelated and of unit variance. In fact, linear dependencies inside a given $n$-tuple of dependent components could always be removed by a linear transformation. Second, we can assume that the data are whitened (sphered); this can always be accomplished by, for example, PCA (Comon, 1994). Whitening is a conventional preprocessing step in ordinary ICA, where it makes the basis vectors $b_i$ orthogonal (Comon, 1994; Hyvärinen & Oja, 1997), if we ignore any finite-sample effects.
These two assumptions imply that the $b_i$ are orthonormal and that we can take $b_i = w_i$ as in ordinary ICA with whitened data. In particular, the independent subspaces become orthogonal after whitening. These facts follow directly from the proof in Comon (1994), which applies here as well, due to our above assumptions.
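To make the subspace response of equation 3.1 concrete, the following sketch computes $F(I)$ for two four-dimensional subspaces, as in Figure 1. The orthonormal filter set here is random (a placeholder for learned filters), and the input patch is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_sub, n = 256, 2, 4   # input dimension, number of subspaces, subspace dimension

# Random orthonormal filters; row i is w_i, grouped n per subspace.
W = np.linalg.qr(rng.standard_normal((dim, n_sub * n)))[0].T

I = rng.standard_normal(dim)   # synthetic input "image patch"
s = W @ I                      # lower-order features <w_i, I>

# F(I) = sum over i in S_j of <w_i, I>^2, one value per subspace   (eq. 3.1)
F = (s.reshape(n_sub, n) ** 2).sum(axis=1)
```

Taking `np.sqrt(F)` gives the norms of the projections, that is, the responses of the model complex cells.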
Figure 1: Graphical depiction of the feature subspaces. First, dot products of the input data with a set of basis vectors are taken. Here, we have two subspaces with four basis vectors in each. The dot products are then squared, and the sums are taken inside the feature subspaces. Thus we obtain the (squares of the) norms of the projections on the feature subspaces, which are considered as the responses of the subspaces. Square roots may be taken for normalization. This scheme represents features that go beyond simple linear filters, possibly obtaining invariance with respect to some transformations of the input, for example, shift and phase invariance. The subspaces, or the underlying vectors $w_i$, may be learned by the principle of maximum sparseness, which coincides here with maximum likelihood of a generative model.
Let us denote by $J$ the number of independent feature subspaces and by $S_j$, $j = 1, \ldots, J$, the set of the indices of the $s_i$ belonging to the subspace of index $j$. Assume that the data consist of $K$ observed image patches $I_k(x, y)$, $k = 1, \ldots, K$. Then we can express the likelihood $L$ of the data given the model as follows:

$$L\bigl(I_k(x, y), k = 1 \ldots K;\ w_i(x, y), i = 1 \ldots m\bigr) = \prod_{k=1}^{K} \Bigl[\, |\det W| \prod_{j=1}^{J} p_j\bigl(\langle w_i, I_k \rangle, i \in S_j\bigr) \Bigr], \qquad (3.2)$$
where $p_j(\cdot)$, which is a function of the $n$ arguments $\langle w_i, I_k \rangle$, $i \in S_j$, gives the probability density inside the $j$th $n$-tuple of $s_i$, and $W$ is a matrix containing the filters $w_i(x, y)$ as its columns. The term $|\det W|$ appears here as in any expression of the probability density of a transformation, giving the change in volume produced by the linear transformation (see, e.g., Pham, Garrat, & Jutten, 1992). The $n$-dimensional probability density $p_j(\cdot)$ is not specified in advance in the general definition of multidimensional ICA (Cardoso, 1998).

3.3 Combining Invariant Feature Subspaces and Independent Subspaces. Invariant-feature subspaces can be embedded in multidimensional
independent component analysis by considering probability distributions for the $n$-tuples of $s_i$ that are spherically symmetric, that is, depend only on the norm. In other words, the probability density $p_j(\cdot)$ of the $n$-tuple with index $j \in \{1, \ldots, J\}$ can be expressed as a function of the sum of the squares of the $s_i$, $i \in S_j$, only. For simplicity, we assume further that the $p_j(\cdot)$ are equal for all $j$, that is, for all subspaces. This means that the logarithm of the likelihood $L$ of the data, that is, the $K$ observed image patches $I_k(x, y)$, $k = 1, \ldots, K$, given the model, can be expressed as

$$\log L\bigl(I_k(x, y), k = 1 \ldots K;\ w_i(x, y), i = 1 \ldots m\bigr) = \sum_{k=1}^{K} \sum_{j=1}^{J} \log p\Bigl(\sum_{i \in S_j} \langle w_i, I_k \rangle^2\Bigr) + K \log |\det W|, \qquad (3.3)$$
where $p\bigl(\sum_{i \in S_j} s_i^2\bigr) = p_j(s_i, i \in S_j)$ gives the probability density inside the $j$th $n$-tuple of $s_i$. Recall that prewhitening allows us to consider the $w_i(x, y)$ to be orthonormal, which implies that $\log |\det W|$ is zero. This shows that the likelihood in equation 3.3 is a function of the norms of the projections of $I_k(x, y)$ on the subspaces indexed by $j$, which are spanned by the orthonormal basis sets given by $w_i(x, y)$, $i \in S_j$. Since the norm of the projection of visual data
on practically any subspace has a supergaussian distribution, we need to choose the probability density $p$ in the model to be sparse (Olshausen & Field, 1996), that is, supergaussian (Hyvärinen & Oja, 1997). For example, we could use the following probability distribution,

$$\log p\Bigl(\sum_{i \in S_j} s_i^2\Bigr) = -a \Bigl[\sum_{i \in S_j} s_i^2\Bigr]^{1/2} + b, \qquad (3.4)$$
which could be considered a multidimensional version of the exponential distribution (Field, 1994). The scaling constant $a$ and the normalization constant $b$ are determined so as to give a probability density that is compatible with the constraint of unit variance of the $s_i$, but they are irrelevant in the following. Thus, we see that the estimation of the model consists of finding subspaces such that the norms of the projections of the (whitened) data on those subspaces have maximally sparse distributions. The introduced independent feature subspace analysis is a natural generalization of ordinary ICA. In fact, if the projections on the subspaces are reduced to dot products, that is, projections on one-dimensional subspaces, the model reduces to ordinary ICA, provided that, in addition, the independent components are assumed to have symmetric distributions. It is to be expected that the norms of the projections on the subspaces represent some higher-order, invariant features. The exact nature of the invariances has not been specified in the model but will emerge from the input data, using only the prior information on their independence. When independent feature subspace analysis is applied to natural image data, we can identify the norms of the projections $\bigl(\sum_{i \in S_j} s_i^2\bigr)^{1/2}$ as the responses of the complex cells. If the individual filter vectors $w_i(x, y)$ are identified with the receptive fields of simple cells, this can be interpreted as a hierarchical model where the complex cell response is computed from simple cell responses $s_i$, in a manner similar to the classical energy models for complex cells (Hubel & Wiesel, 1962; Pollen & Ronner, 1983; Heeger, 1992). It must be noted, however, that our model does not specify the particular basis of a given invariant-feature subspace.

3.4 Learning Independent Feature Subspaces. Learning the independent feature subspace representation can be simply achieved by gradient ascent of the log-likelihood in equation 3.3.
Due to whitening, we can constrain the vectors wi to be orthogonal and of unit norm, as in ordinary ICA; these constraints usually speed up convergence. A stochastic gradient ascent of the log-likelihood can be obtained as
$$\Delta w_i(x, y) \propto I(x, y)\, \langle w_i, I \rangle\; g\Bigl(\sum_{r \in S_{j(i)}} \langle w_r, I \rangle^2\Bigr), \qquad (3.5)$$
where $j(i)$ is the index of the subspace to which $w_i$ belongs, and $g = p'/p$ is a nonlinear function that incorporates our information on the sparseness of the norms of the projections. For example, if we choose the distribution in equation 3.4, we have $g(u) = -\frac{1}{2} a u^{-1/2}$, where the positive constant $\frac{1}{2}a$ can be ignored. After every step of equation 3.5, the vectors $w_i$ need to be orthonormalized; for a variety of methods to perform this, see Hyvärinen and Oja (1997) and Karhunen, Oja, Wang, Vigario, and Joutsensalo (1997). The learning rule in equation 3.5 can be considered as "modulated" nonlinear Hebbian learning. If the subspace that contains $w_i$ were just one-dimensional, this learning rule would reduce to the learning rules for ordinary ICA given in Hyvärinen and Oja (1998), which are closely related to those in Bell and Sejnowski (1997), Cardoso and Laheld (1996), and Karhunen et al. (1997). The difference is that in the general case, the Hebbian term is divided by a function of the output of the complex cell, given by $\sum_{r \in S_{j(i)}} \langle w_r, I \rangle^2$, if we assume the terminology of the energy models. In other words, the Hebbian term is modulated by a top-down feedback signal. In addition to this modulation, the neurons interact in the form of the orthogonalizing feedback.

4 Experiments
To test our model, we used patches of natural images as input data $I_k(x, y)$ and estimated the model of independent feature subspace analysis.

4.1 Data and Methods. The data were obtained by taking 16 × 16 pixel image patches at random locations from monochrome photographs depicting wildlife scenes (animals, meadows, forests, etc.). The images were taken directly from PhotoCDs and are available on the World Wide Web.1 The mean gray-scale value of each image patch (i.e., the DC component) was subtracted. The data were then low-pass filtered by reducing the dimension of the data vector by principal component analysis, retaining the 160 principal components with the largest variances. Next, the data were whitened by the zero-phase whitening filter, which means multiplying the data by $C^{-1/2}$, where $C$ is the covariance of the data (after PCA) (see, e.g., Bell & Sejnowski, 1997). These preprocessing steps are essentially similar to those used in Olshausen and Field (1996) and van Hateren and van der Schaaf (1998). The likelihood in equation 3.3 for 50,000 such observations was maximized under the constraint of orthonormality of the filters in the whitened space, using the averaged version of the learning rule in equation 3.5; that is, we used the ordinary gradient of the likelihood instead of the stochastic gradient. The fact that the data were contained in a 160-dimensional subspace meant that the 160 basis vectors $w_i$ now formed an orthonormal system for that subspace and not for the original space, but this did not necessitate any
1 WWW address: http://www.cis.hut.fi/projects/ica/data/images
Figure 2: Linear filter sets associated with the feature subspaces (model complex cells), as estimated from natural image data. Every group of four filters spans a single feature subspace (for whitened data).
changes in the learning rule. The density $p$ was chosen as in equation 3.4. The algorithm was initialized as in Bell and Sejnowski (1997) by taking as $w_i$ the 160 middle columns of the identity matrix. We also tried random initial values for $W$. These yielded qualitatively identical results, but using a localized filter set as the initial value considerably improves the convergence of the method, especially avoiding some of the filters getting stuck in local minima. This initialization led, incidentally, to a weak topographical organization of the filters. The computations took about 10 hours on a single RISC processor. Experiments were made with different dimensions of the subspaces $S_j$: 2, 4, and 8 (in a single run, all the subspaces had the same dimension). The results we show are for four-dimensional subspaces, but the results are similar for other dimensions.

4.2 Results. Figure 2 shows the filter sets of the 40 feature subspaces (complex cells) when the subspace dimension was chosen to be 4. The results are shown in the zero-phase whitened space. Note that due to orthogonality, the filters are equal to the basis vectors. The filters look qualitatively similar in the original, unwhitened space. The only difference is that in the original space, the filters are sharper, that is, more concentrated on higher frequencies.
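The preprocessing and estimation procedure just described can be summarized in a short sketch. This is an illustration only: synthetic gaussian data stand in for the 50,000 natural-image patches, and the dimensions are much smaller than the 160 components used in the article. The learning loop implements equation 3.5 with the nonlinearity from equation 3.4 (constant $a$ dropped), followed by symmetric re-orthonormalization:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for image patches (one patch per row).
X = rng.standard_normal((2000, 64))
n_keep, n_sub, n_dim = 16, 4, 4          # retained PCs, subspaces, subspace dimension

# 1. Subtract the DC component (mean gray value) of each patch.
X = X - X.mean(axis=1, keepdims=True)

# 2. PCA dimension reduction, then whitening by C^{-1/2}.
eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigval)[::-1][:n_keep]
Z = (X @ eigvec[:, order]) / np.sqrt(eigval[order])   # whitened data

# 3. Gradient ascent of the log-likelihood (eq. 3.3) over orthonormal filters.
W = np.linalg.qr(rng.standard_normal((n_keep, n_sub * n_dim)))[0].T
subspaces = [np.arange(j * n_dim, (j + 1) * n_dim) for j in range(n_sub)]
lr = 0.05
for _ in range(20):
    S = W @ Z.T                                       # filter outputs, one column per patch
    G = np.empty_like(S)
    for Sj in subspaces:
        u = (S[Sj] ** 2).sum(axis=0) + 1e-8           # subspace energies
        G[Sj] = S[Sj] * (-0.5 / np.sqrt(u))           # Hebbian term times g(u) = -u^{-1/2}/2
    W = W + lr * (G @ Z) / len(Z)                     # averaged gradient step (eq. 3.5)
    U, _, Vt = np.linalg.svd(W, full_matrices=False)  # re-orthonormalize: W <- (W W^T)^{-1/2} W
    W = U @ Vt
```

After training, the rows of `W`, grouped by subspace, play the role of the filter sets shown in Figure 2; with gaussian toy data no Gabor-like structure emerges, of course.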
Figure 3: Typical stimuli used in the experiments in Figures 4 to 6. The middle column shows the optimal Gabor stimulus. One of the parameters was varied at a time. (Top row) Varying phase. (Middle row) Varying location (shift). (Bottom row) Varying orientation.
The linear filters associated with a single complex cell all have approximately the same orientation and frequency. Their locations are not identical but are close to each other. The phases differ considerably. Every feature subspace can thus be considered a generalization of a quadrature-phase filter pair as found in the classical energy models (Pollen & Ronner, 1983), enabling the cell to be selective to a given orientation and frequency, but invariant to phase and somewhat invariant to shifts. Using four filters instead of a pair greatly enhances the shift invariance of the feature subspace. In fact, when the subspace dimension was 2, we obtained approximately quadrature-phase filter pairs. To demonstrate quantitatively the properties of the model, we compared the responses of a representative feature subspace and the associated linear filters for different stimulus configurations. First, an optimal stimulus for the feature subspace was computed in the set of Gabor filters. One of the stimulus parameters was changed at a time to see how the response changes, while the other parameters were held constant at the optimal values. Some typical stimuli are depicted in Figure 3. The investigated parameters were phase, orientation, and location (shift). Figure 4 shows the results for one typical feature subspace. The four linear filters spanning the feature subspace are shown in Figure 4a. The optimal stimulus values (for the feature subspace) are represented by 0 in these results; that is, the values given here are departures from the optimal values. The responses are in arbitrary units. For different phases, ranging from
Figure 4: Responses elicited from a feature subspace and the underlying linear filters by different stimuli, for stimuli as in Figure 3. (a) The four (arbitrarily chosen) filters spanning the feature subspace. (b) Effect of varying phase. Upper four rows: absolute responses of linear filters (simple cells) to stimuli of different phases. Bottom row: response of the feature subspace (complex cell). (c) Effect of varying location (shift), as in b. (d) Effect of varying orientation, as in b.
$-\pi/2$ to $\pi/2$, we thus obtained Figure 4b. On the bottom row, we have the response curve of the feature subspace; for comparison, the absolute values of the responses of the associated linear filters are also shown in the four plots
above it. Similarly, we obtained Figure 4c for different locations (location was moved along the direction perpendicular to the preferred orientation; the shift units are arbitrary) and Figure 4d for different orientations (ranging from $-\pi/2$ to $\pi/2$). The response curve of the feature subspace shows clear invariance with respect to phase and location; note that the response curve for location consists of an approximately gaussian envelope, as observed in complex cells (von der Heydt, 1995). In contrast, no invariance for orientation was observed. The responses of the linear filters, on the other hand, show no invariances with respect to any of these parameters, which is in fact a necessary consequence of the linearity of the filters (Pollen & Ronner, 1983; von der Heydt, 1995). Thus, this feature subspace shows the desired properties of phase invariance and limited shift invariance, contrasting with the properties of the underlying linear filters. To see how the above results hold for the whole population of feature subspaces, we computed the same response curves for all the feature subspaces and, for comparison, all the underlying linear filters. Optimal stimuli were computed separately for all the feature subspaces. In contrast to the above results, we computed the optimal stimuli separately for each linear filter as well; this facilitates the quantitative comparison. The response values were normalized so that the maximum response for each subspace or linear filter was equal to 1. Figure 5 shows the responses for 10 typical feature subspaces and 10 typical linear filters, whereas Figure 6 shows the median responses of the whole population of 40 feature subspaces and 160 linear filters, together with the 10% and 90% percentiles. In Figures 5a and 6a, the responses are given for varying phases. The top row shows the absolute responses of the linear filters; in the bottom row, the corresponding results for the feature subspaces are depicted.
The figures show that phase invariance is a strong property of the feature subspaces; the minimum response was usually at least two-thirds of the maximum response. Figures 5b and 6b show the results for location shift. Clearly, the receptive field of a typical feature subspace is larger and more invariant than that of a typical linear filter. As for orientation, Figures 5c and 6c depict the corresponding results, showing that orientation selectivity was approximately equivalent in linear filters and feature subspaces. Thus, we see that invariances with respect to translation and especially phase, as well as orientation selectivity, hold for the population of feature subspaces in general.

5 Discussion
An approach closely related to ours is given by the adaptive-subspace self-organizing map (Kohonen, 1996), which is based on competitive learning of an invariant-feature subspace representation similar to ours. Using two-dimensional subspaces, an emergence of filter pairs with phase invariance and limited shift invariance was shown in Kohonen (1996). However, the
Emergence of Phase- and Shift-Invariant Features
Figure 5: Some typical response curves of the feature subspaces and the underlying linear filters when one parameter of the optimal stimulus is varied, as in Figure 3. Top row: responses (in absolute values) of 10 linear filters (simple cells). Bottom row: responses of 10 feature subspaces (complex cells). (a) Effect of varying phase. (b) Effect of varying location (shift). (c) Effect of varying orientation.
emergence of shift invariance in Kohonen (1996) was conditional on restricting consecutive patches to come from nearby locations in the image, giving the input data a temporal structure, as in a smoothly changing image sequence. Similar developments were given by Földiák (1991). In contrast to
Aapo Hyvärinen and Patrik Hoyer
Figure 6: Statistical analysis of the properties of the whole population of feature subspaces, with the corresponding results for linear filters given for comparison. In all plots, the solid line gives the median response in the population of all cells (filters or subspaces), and the dashed lines give the 90 percent and 10 percent percentiles of the responses. Stimuli were as in Figure 3. Top row: responses (in absolute values) of linear filters (simple cells). Bottom row: responses of feature subspaces (complex cells). (a) Effect of varying phase. (b) Effect of varying location (shift). (c) Effect of varying orientation.
these two theories, we formulated an explicit image model, showing that emergence is possible using patches at random, independently selected locations, which proves that there is enough information in static images to explain the properties of complex cells. In conclusion, we have provided a model of the emergence of phase- and shift-invariant features—the principal properties of visual complex cells—using the same principle of maximization of independence (Comon, 1994; Barlow, 1994; Field, 1994) that has been used to explain simple cell properties (Olshausen & Field, 1996). Maximizing the independence or, equivalently, the sparseness of linear filter outputs, the model gives simple cell properties. Maximizing the independence of the norms of the projections on linear subspaces, complex cell properties emerge. This provides further evidence for the fundamental importance of dependence reduction as a strategy for sensory information processing. It is, in fact, closely related to such fundamental objectives as minimizing code length and reducing redundancy
(Barlow, 1989, 1994; Field, 1994; Olshausen & Field, 1996). It remains to be seen if our framework could also account for the movement-related, binocular, and chromatic properties of complex cells; the simple cell model has had some success in this respect (van Hateren & Ruderman, 1998). Extending the model to include overcomplete bases (Olshausen & Field, 1997) may be useful for such purposes.

References

Barlow, H. (1989). Unsupervised learning. Neural Computation, 1, 295–311.

Barlow, H. (1994). What is the computational goal of the neocortex? In C. Koch & J. Davis (Eds.), Large-scale neuronal theories of the brain. Cambridge, MA: MIT Press.

Bell, A., & Sejnowski, T. (1997). The "independent components" of natural scenes are edge filters. Vision Research, 37, 3327–3338.

Cardoso, J.-F. (1998). Multidimensional independent component analysis. In Proc. ICASSP'98 (Seattle, WA).

Cardoso, J.-F., & Laheld, B. H. (1996). Equivariant adaptive source separation. IEEE Trans. on Signal Processing, 44(12), 3017–3030.

Comon, P. (1994). Independent component analysis—a new concept? Signal Processing, 36, 287–314.

Field, D. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.

Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.

Heeger, D. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–198.

Hubel, D., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.

Hyvärinen, A., & Oja, E. (1997). A fast fixed-point algorithm for independent component analysis. Neural Computation, 9(7), 1483–1492.

Hyvärinen, A., & Oja, E. (1998). Independent component analysis by general nonlinear Hebbian-like learning rules. Signal Processing, 64(3), 301–313.

Karhunen, J., Oja, E., Wang, L., Vigario, R., & Joutsensalo, J. (1997). A class of neural networks for independent component analysis.
IEEE Trans. on Neural Networks, 8(3), 486–504.

Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.

Kohonen, T. (1996). Emergence of invariant-feature detectors in the adaptive-subspace self-organizing map. Biological Cybernetics, 75, 281–291.

Mel, B. W., Ruderman, D. L., & Archie, K. A. (1998). Translation-invariant orientation tuning in visual "complex" cells could derive from intradendritic computations. Journal of Neuroscience, 18, 4325–4334.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37, 3311–3325.
Pham, D.-T., Garrat, P., & Jutten, C. (1992). Separation of a mixture of independent sources through a maximum likelihood approach. In Proc. EUSIPCO (pp. 771–774).

Pollen, D., & Ronner, S. (1983). Visual cortical neurons as localized spatial frequency filters. IEEE Trans. on Systems, Man, and Cybernetics, 13, 907–916.

van Hateren, J. H., & Ruderman, D. L. (1998). Independent component analysis of natural image sequences yields spatiotemporal filters similar to simple cells in primary visual cortex. Proc. Royal Society ser. B, 265, 2315–2320.

van Hateren, J. H., & van der Schaaf, A. (1998). Independent component filters of natural images compared with simple cells in primary visual cortex. Proc. Royal Society ser. B, 265, 359–366.

von der Heydt, R. (1995). Form analysis in visual cortex. In M. S. Gazzaniga (Ed.), The cognitive neurosciences (pp. 365–382). Cambridge, MA: MIT Press.

von der Malsburg, C., Shams, L., & Eysel, U. (1998). Recognition of images from complex cell responses. Abstract presented at Proc. of Society for Neuroscience Meeting.

Received September 29, 1998; accepted May 7, 1999.
LETTER
Communicated by Jack Cowan
Tilt Aftereffects in a Self-Organizing Model of the Primary Visual Cortex

James A. Bednar and Risto Miikkulainen
Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, U.S.A.

RF-LISSOM, a self-organizing model of laterally connected orientation maps in the primary visual cortex, was used to study the psychological phenomenon known as the tilt aftereffect. The same self-organizing processes that are responsible for the long-term development of the map are shown to result in tilt aftereffects over short timescales in the adult. The model permits simultaneous observation of large numbers of neurons and connections, making it possible to relate high-level phenomena to low-level events, which is difficult to do experimentally. The results give detailed computational support for the long-standing conjecture that the direct tilt aftereffect arises from adaptive lateral interactions between feature detectors. They also make a new prediction that the indirect effect results from the normalization of synaptic efficacies during this process. The model thus provides a unified computational explanation of self-organization and both the direct and indirect tilt aftereffects in the primary visual cortex.
1 Introduction
The tilt aftereffect (TAE; Gibson & Radner, 1937) is a simple but intriguing visual phenomenon. After staring at a pattern of tilted lines or gratings, subsequent lines appear to have a slight tilt in the opposite direction (see Figure 1). The effect resembles an afterimage from staring at a bright light, but it represents changes in orientation perception rather than in color or brightness.

Figure 1: Tilt aftereffect patterns. Fixate your gaze on the circle inside the central diagram for at least 30 seconds, moving your eye slightly inside the circle to avoid developing strong afterimages. Now fixate on the diagram at the left. The vertical lines should appear slightly tilted clockwise; this phenomenon is called the direct tilt aftereffect. If you instead fixate on the horizontal lines at the right, they should appear barely tilted counterclockwise, due to the indirect tilt aftereffect. (Adapted from Campbell & Maffei, 1971.)

Most modern explanations of the TAE are loosely based on the feature-detector model of the primary visual cortex (V1), which characterizes this area as a set of orientation-detecting neurons. Experiments showed that these neurons became more difficult to excite during repeated presentation of oriented visual stimuli, and the desensitization persisted for some time (Hubel & Wiesel, 1968). This observation led to the fatigue theory of the TAE: perhaps active neurons become fatigued due to repeated firing, causing the response to a test figure to change during adaptation. Assuming the perceived orientation is some sort of average over the orientation preferences of the activated neurons, the final perceived orientation would thus show the direct TAE (Coltheart, 1971).

Neural Computation 12, 1721–1740 (2000) © 2000 Massachusetts Institute of Technology

The fatigue theory has been discredited for a number of reasons, chief among which is that individual V1 neurons actually do not appear to fatigue (Finlayson & Cynader, 1995; McLean & Palmer, 1996). In fact, their response to direct stimulation is essentially unchanged by adaptation to a visual stimulus (Vidyasagar, 1990). The now-popular inhibition theory postulates that the TAE instead results from changing inhibition between orientation-detecting neurons (Tolhurst & Thompson, 1975). The inhibition hypothesis has recently been incorporated into theories of the larger purpose and function of the cortex. Barlow (1990) and Földiák (1990) have proposed that the early cortical regions are acting to reduce the amount of redundant information present in the visual input. They suggest that aftereffects are not flaws in an otherwise well-designed system, but an unavoidable result of a self-organizing process that aims at producing an efficient, sparse encoding of the input through decorrelation (see also Field, 1994; Miikkulainen, Bednar, Choe, & Sirosh, 1997; Sirosh, Miikkulainen, & Bednar, 1996). Based on these theories, Dong (1995) has shown analytically that perfect decorrelation can result in direct tilt aftereffects that are similar to those found in humans. Only very recently, however, has it become computationally feasible to test the inhibition-decorrelation theory of the TAE in a detailed model of cortical function, with limitations similar to those known to be present in the cortex. A Hebbian self-organizing process (the receptive-field laterally interconnected synergetically self-organizing map, or RF-LISSOM; Miikkulainen et al., 1997; Sirosh, 1995; Sirosh & Miikkulainen, 1994a, 1997) has been shown
to develop feature detectors and specific lateral connections that could produce such aftereffects. The RF-LISSOM model gives rise to anatomical and functional characteristics of the cortex such as topographic maps, ocular dominance, orientation, and size preference columns and the patterned lateral connections between them. Although other models exist that explain how the feature detectors and afferent connections could develop by input-driven self-organization, RF-LISSOM is the first model that also shows how the long-range lateral connections self-organize as an integral part of the process. The laterally connected model has also been shown to account for many of the dynamic aspects of the adult visual cortex, such as reorganization following retinal and cortical lesions (Miikkulainen et al., 1997; Sirosh & Miikkulainen, 1994b; Sirosh et al., 1996). These findings suggest that the same self-organization processes that drive development may be acting in the adult (see Fregnac, 1996, and Gilbert, 1998, for review). We explore this hypothesis here with respect to the TAE. The current work is a first study of the functional behavior of the RF-LISSOM model, specifically the response to stimuli similar to those known to cause the TAE in humans. The model permits simultaneous observation of activation and connection patterns between large numbers of neurons. This makes it possible to relate higher-level phenomena to low-level events, which is difficult to do experimentally. The results provide detailed computational support for the idea that the TAE results from a general self-organizing process that aims at producing an efficient, sparse encoding of the input through decorrelation.
2 Architecture
The cortical architecture for the model has been simplified and reduced to the minimum configuration necessary to account for the observed phenomena. Because the focus is on the two-dimensional organization of the cortex, each "neuron" in the model cortex corresponds to a vertical column of cells through the six layers of the primate cortex. The cortical network is modeled with a sheet of interconnected neurons and the retina with a sheet of retinal ganglion cells (see Figure 2). Neurons receive afferent connections from broad overlapping circular patches on the retina. The N × N network is projected onto a central region of the R × R retinal ganglion cells, and each neuron is connected to ganglion cells in an area of radius r around the projection. Thus, neurons at a particular cortical location receive afferents from a corresponding location on the retina. Since the lateral geniculate nucleus (LGN) accurately reproduces the receptive fields of the retina, it has been bypassed for simplicity. Each neuron also has reciprocal excitatory and inhibitory lateral connections with itself and other neurons. Lateral excitatory connections are short range, connecting each neuron with itself and its close neighbors. Lateral inhibitory connections run for comparatively long distances, but also include connections to the neuron itself and to its neighbors.¹

Figure 2: Architecture of the RF-LISSOM network. A small RF-LISSOM network and retina are shown, along with connections to a single neuron (shown as a large circle). The input is an oriented gaussian activity pattern on the retinal ganglion cells (shown by grayscale coding); the LGN is bypassed for simplicity. The afferent connections form a local anatomical receptive field (RF) on the simulated retina. Neighboring neurons have different but highly overlapping RFs. Each neuron computes an initial response as a scalar product of its RF and its afferent weight vector. The responses then repeatedly propagate within the cortex through the lateral connections and evolve into activity bubbles. After the activity stabilizes, weights of the active neurons are adapted.

The input to the model consists of two-dimensional ellipsoidal gaussian patterns representing retinal ganglion cell activations (as in Figures 2 and 3a). For training, the orientations of the gaussians are chosen randomly from the uniform distribution in the full range [0, π), and the positions are chosen randomly within the retina. The elongated spots approximate natural visual stimuli after the edge detection and enhancement mechanisms

¹ For high-contrast inputs, long-range interactions must be inhibitory for proper self-organization to occur (Sirosh, 1995). Recent optical imaging and electrophysiological studies have indeed shown that long-range interactions in the cortex are inhibitory at high contrasts, even though individual lateral connections are primarily excitatory (Grinvald, Lieke, Frostig, & Hildesheim, 1994; Hata, Tsumoto, Sato, Hagihara, & Tamura, 1993; Hirsch & Gilbert, 1991; Weliky, Kandler, Fitzpatrick, & Katz, 1995). The model uses explicit inhibitory connections for simplicity, since all inputs used are high contrast, and since it is the high-contrast inputs that primarily drive adaptation in the Hebbian model (Bednar, 1997).
in the retina. They can also be seen as a model of the intrinsic retinal activity waves that occur in late prenatal development in mammals (Bednar & Miikkulainen, 1998; Shatz, 1990). The RF-LISSOM network models the self-organization of the visual cortex based on these natural sources of elongated features. The afferent weights are initially set to random values, and the lateral weights are preset to a smooth gaussian profile. The connections are organized through an unsupervised learning process. At each training step, neurons start out with zero activity. The initial response η_ij of neuron (i, j) is calculated as a weighted sum of the retinal activations:

\[
\eta_{ij} = \sigma\Bigl(\sum_{a,b} \xi_{ab}\,\mu_{ij,ab}\Bigr),
\tag{2.1}
\]

where ξ_ab is the activation of retinal ganglion cell (a, b) within the anatomical receptive field (RF) of the neuron, μ_ij,ab is the corresponding afferent weight, and σ is a piecewise linear approximation of the sigmoid activation function. The response evolves over a very short timescale through lateral interaction. At each time step, the neuron combines the above afferent activation Σξμ with lateral excitation and inhibition:

\[
\eta_{ij}(t) = \sigma\Bigl(\sum_{a,b} \xi_{ab}\,\mu_{ij,ab}
  + \gamma_e \sum_{k,l} E_{ij,kl}\,\eta_{kl}(t-1)
  - \gamma_i \sum_{k,l} I_{ij,kl}\,\eta_{kl}(t-1)\Bigr),
\tag{2.2}
\]

where E_ij,kl is the excitatory lateral connection weight on the connection from neuron (k, l) to neuron (i, j), I_ij,kl is the inhibitory connection weight, and η_kl(t − 1) is the activity of neuron (k, l) during the previous time step. The scaling factors γ_e and γ_i determine the relative strengths of excitatory and inhibitory lateral interactions. While the cortical response is settling, the retinal activity remains constant. The cortical activity pattern starts out diffuse and spread over a substantial part of the map (as in Figure 3c), but within a few iterations of equation 2.2, converges into a small number of stable, focused patches of activity, or activity bubbles (see Figure 3d). After the activity has settled, the connection weights of each neuron are modified. Both afferent and lateral weights adapt according to the same mechanism: the Hebb rule, normalized so that the sum of the weights is constant:

\[
w_{ij,mn}(t+\delta t) =
  \frac{w_{ij,mn}(t) + \alpha\,\eta_{ij} X_{mn}}
       {\sum_{mn}\bigl[w_{ij,mn}(t) + \alpha\,\eta_{ij} X_{mn}\bigr]},
\tag{2.3}
\]

where η_ij stands for the activity of neuron (i, j) in the final activity bubble,
[Figure 3 panels, facing page: (a) Input pattern. (b) Orientation map. (c) Initial activity, with orientation histogram (axis −90 to +90 degrees; average −0.6). (d) Settled activity, with orientation histogram (average −1.3).]
w_ij,mn is the afferent or lateral connection weight (μ, E, or I), α is the learning rate for each type of connection (α_A for afferent weights, α_E for excitatory, and α_I for inhibitory), and X_mn is the presynaptic activity (ξ for afferent, η for lateral). The larger the product of the presynaptic and postsynaptic activity η_ij X_mn, the larger the weight change is. Therefore, when the presynaptic and postsynaptic neurons fire together frequently, the connection becomes stronger. Both excitatory and inhibitory connections strengthen by correlated activity; normalization then redistributes the changes so that the sum of each weight type for each neuron remains constant. At long distances, very few neurons have correlated activity, and therefore most long-range connections eventually become weak. The weak connections are eliminated periodically, resulting in patchy lateral connectivity similar to that observed in the visual cortex. The radius of the lateral excitatory interactions starts out large, but as self-organization progresses, it is decreased until it covers only the nearest neighbors. Such a decrease is one way of varying the balance between excitation and inhibition to ensure that global topographic order develops while the receptive fields become well tuned at the same time (Miikkulainen et al., 1997; Sirosh, 1995). Similar effects can be achieved by changing the scaling factors γ_e and γ_i over time, or by using connections whose sign depends on the activation level of the neuron (as in Stemmler, Usher, & Niebur, 1995).
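The computations of equations 2.1 through 2.3 can be sketched as a tiny dense-matrix simulation. Everything below is illustrative (toy network and retina sizes, uniform random initial weights, a single learning rate, and guessed sigmoid thresholds), so it shows only the structure of the update, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 8, 8                      # cortex N x N, retina R x R (toy sizes)
n, r = N * N, R * R
gamma_e, gamma_i = 0.9, 0.9      # lateral scaling factors
alpha = 0.01                     # single learning rate for this sketch

def oriented_gaussian(size, center, angle, sigma_major, sigma_minor):
    """Elongated gaussian spot: the model's training input."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    dx, dy = x - center[0], y - center[1]
    u = dx * np.cos(angle) + dy * np.sin(angle)    # along major axis
    v = -dx * np.sin(angle) + dy * np.cos(angle)   # along minor axis
    return np.exp(-(u / sigma_major) ** 2 - (v / sigma_minor) ** 2)

def sigmoid(x, lo=0.1, hi=0.65):
    """Piecewise linear approximation of the sigmoid."""
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

def normalize_rows(w):
    """Keep each neuron's weights of a given type summing to 1."""
    return w / w.sum(axis=1, keepdims=True)

# Afferent (mu), lateral excitatory (E), and lateral inhibitory (I) weights.
mu = normalize_rows(rng.uniform(size=(n, r)))
E = normalize_rows(rng.uniform(size=(n, n)))
I = normalize_rows(rng.uniform(size=(n, n)))

# Training input: random orientation in [0, pi), random position.
xi = oriented_gaussian(R, center=rng.uniform(2, 6, size=2),
                       angle=rng.uniform(0.0, np.pi),
                       sigma_major=3.0, sigma_minor=1.0).ravel()

afferent = mu @ xi               # sum_ab xi_ab mu_ij,ab
eta = sigmoid(afferent)          # initial response (eq. 2.1)
for _ in range(9):               # settling via lateral interaction (eq. 2.2)
    eta = sigmoid(afferent + gamma_e * (E @ eta) - gamma_i * (I @ eta))

# Normalized Hebbian update (eq. 2.3), shown here for the afferent weights.
mu = normalize_rows(mu + alpha * np.outer(eta, xi))
```

In the real model the lateral weights adapt by the same rule with η as the presynaptic activity, and connections outside each neuron's circular radius are simply absent rather than stored densely.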
Figure 3: Facing page. Orientation map activation. After being trained on inputs like the one in (a) with random positions and orientations, the RF-LISSOM network developed an orientation map whose central 96 × 96 region is shown in (b). As in (c, d), short white lines in (b) indicate the orientation preference of selected neurons; longer lines denote a stronger preference. The shading also gives information about the preference; dark shading denotes a preference close to vertical (0 degrees), and light denotes one close to horizontal (±90 degrees). The white outline shows the extent of the patchy self-organized lateral inhibitory connections of one neuron, marked with a white square, which has a vertical orientation preference. The strongest connections of each neuron are extended along its preferred orientation and link columns with similar orientation preferences, avoiding those with orthogonal preferences. The shading in (c, d) shows the strength of activation for each neuron to pattern (a). The initial response of the organized map is spatially broad and diffuse (c, top), like the input, and its cortical location at, above, and below the center of the cortex indicates that the input is vertically extended around the center of the retina. The response is patchy because the network is also encoding orientation, and the neurons that encode orientations far from the vertical do not respond (compare c to b). The histogram (c, bottom) sums up the orientation coding of the response. Each bin represents a range of 5 degrees, which is the precision to which the orientation map was measured. A wide range of neurons preferring orientations around 0 degrees are activated, but the average orientation is approximately 0 degrees (−0.6 degrees for this particular run).
After the network settles through lateral interactions, the activation is much more focused, both spatially (d, top) and in representing orientation (d, bottom), but the spatial and orientation averages continue to match the position and orientation of the input, respectively. The average orientation of the settled response (−1.3 degrees here) is taken to be the perceived orientation for the TAE experiments. Color and animated versions of these figures can be seen online at http://www.cs.utexas.edu/users/nn/pages/research/selforg.html.
3 Experiments
The model consisted of an array of 192 × 192 neurons and a retina of 36 × 36 ganglion cells. The center of the anatomical RF of each neuron was placed at the location in the central 24 × 24 portion of the retina corresponding to the location of the neuron in the cortex, so that every neuron would have a complete set of afferent connections. The RF had a circular shape, consisting of random-strength connections to all ganglion cells within 6 units of the RF center. The cortex was self-organized for 20,000 iterations on oriented gaussian inputs with major and minor axes of half-width σ = 7.5 and 1.5, respectively.² The training took 2.5 hours on 16 processors of a Cray T3E supercomputer at the Texas Advanced Computing Center. The model requires 1.5 gigabytes of physical memory to represent the 400 million connections in this small section of the cortex.

3.1 Orientation Map Organization. In the self-organization process, the neurons developed oriented RFs organized into orientation columns very similar to those observed in the primary visual cortex (see Figure 3b). The strongest lateral connections of highly tuned cells link areas of similar orientation preference and avoid neurons with the orthogonal orientation preference. Furthermore, the connection patterns of highly oriented neurons are typically elongated along the direction in the map that corresponds to the neuron's preferred stimulus orientation (as subsequently found experimentally by Bosking, Zhang, Schofield, & Fitzpatrick, 1997). This organization reflects the activity correlations caused by the elongated gaussian input pattern: such a stimulus activates primarily those neurons that are tuned to the same orientation as the stimulus and located along its length (Sirosh et al., 1996). Since the long-range lateral connections are inhibitory, the net result is decorrelation: redundant activation is removed, resulting in a sparse representation of the novel features of each input (Barlow, 1990; Field, 1994;
² The initial lateral excitation radius was 19 and was gradually decreased to 1. The lateral inhibitory radius of each neuron was 47, and inhibitory connections whose strength was below 0.00005 were pruned away at 20,000 iterations. The lateral inhibitory connections were initialized to a gaussian profile with σ = 100, and the lateral excitatory connections to a gaussian with σ = 15, with no connections outside the nominal circular radius. The lateral excitation strength γ_e and inhibition strength γ_i were both 0.9. The learning rate α_A was gradually decreased from 0.007 to 0.0015, α_E from 0.002 to 0.001, and α_I was a constant 0.00025. The lower and upper thresholds of the sigmoid were increased from 0.1 to 0.24 and from 0.65 to 0.88, respectively. The number of iterations for which the lateral connections were allowed to settle at each training iteration was initially 9 and was increased to 13 over the course of training. These parameter settings were used by Sirosh (pers. comm.) to model development of the orientation map, and were not tuned or tweaked for the TAE simulations. Small variations produce roughly equivalent results (Sirosh, 1995).
Sirosh et al., 1996). As a side effect, illusions and aftereffects may sometimes occur, as will be shown below.

3.2 Aftereffect Simulations. In psychophysical measurements of the TAE, a fixed stimulus is presented at a particular location on the retina. To simulate these conditions in the RF-LISSOM model, the position and angle of the inputs were fixed to a single value for a number of iterations. To permit more detailed analysis of behavior at short timescales, the learning rates were reduced from those used during self-organization to α_A = α_E = α_I = 0.000005. All other parameters remained as in self-organization. To compare with the psychophysical experiments, it is necessary to determine what orientation the model "perceives" for any given input. Precisely how neural responses are interpreted for perception remains quite controversial (see Parker & Newsome, 1998, for review), but results in primates suggest that behavioral performance approaches the statistical optimum given the measured properties of cortical neurons (Geisler & Albrecht, 1997). Accordingly, we extracted the perceived orientation using a vector-sum procedure, which has been shown to be optimal under conditions present in the model (Snippe, 1996). For this procedure, each active neuron was represented by a vector whose magnitude corresponded to the activation level and whose direction corresponded to the orientation preference of the neuron. Perceived orientation was then measured as a vector sum over all neurons that responded to the input. In the model, the perceived orientation was found to match the absolute orientation of the input pattern to within a few degrees (Bednar, 1997). To determine the TAE in the model, the perceived orientation was first computed for inputs of all orientations. The model then adapted to a fixed input for 90 iterations, and the perceived orientation was again computed for all inputs.
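A sketch of the vector-sum readout follows. The paper does not spell out how the 180-degree periodicity of orientation is handled, so this version uses the standard trick of doubling each preference angle before summing and halving the result; treat that detail as an assumption:

```python
import numpy as np

def perceived_orientation(activations, preferences_deg):
    """Vector-sum estimate of perceived orientation, in degrees.

    Each active unit contributes a vector whose length is its activation
    and whose direction is its orientation preference. Orientations are
    doubled before summing so that, e.g., -90 and +90 degrees (the same
    orientation) reinforce rather than cancel.
    """
    angles = np.deg2rad(2.0 * np.asarray(preferences_deg, dtype=float))
    a = np.asarray(activations, dtype=float)
    x = np.sum(a * np.cos(angles))
    y = np.sum(a * np.sin(angles))
    return np.rad2deg(np.arctan2(y, x)) / 2.0

# Units tuned every 5 degrees, activated by a bump centered at 10 degrees.
prefs = np.arange(-90, 90, 5)
acts = np.exp(-((prefs - 10.0) / 15.0) ** 2)
est = perceived_orientation(acts, prefs)   # close to 10 degrees
```

Doubling the angles ensures that preferences of −89 and +89 degrees, which represent nearly the same orientation, add constructively in the sum rather than cancelling.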
The difference between the initial perceived angle and the one perceived after adaptation was taken as the magnitude of the TAE. Figure 4 plots these differences for the different angles. For comparison, it also shows the most detailed data available for the TAE in human foveal vision (Mitchell & Muir, 1976). The results from the RF-LISSOM simulation are strikingly similar to the psychophysical results. For the range 5 to 40 degrees, all subjects in the human study exhibited angle repulsion effects nearly identical to those found in the RF-LISSOM model; the data were most complete for the subject shown. The magnitude of this direct TAE increases very rapidly to a maximum angle repulsion at 8 to 10 degrees, falling off somewhat more gradually to zero as the angular separation increases. The simulations with larger angular separations (from 40 to 85 degrees) show a smaller angle attraction—the indirect effect. Although there is a greater intersubject variability in the psychophysical literature for the indirect effect than the direct
Figure 4: Tilt aftereffect at different angles. The open circles represent the TAE for a single human subject (DEM) from Mitchell and Muir (1976), averaged over 10 trials. For each angle in each trial, the subject adapted for 3 minutes on a sinusoidal grating of a given angle and then was tested for the effect on a horizontal grating. Error bars indicate ±1 standard error of measurement. The subject shown had the most complete data of the four in the study. All four showed very similar effects in the x-axis range ±40 degrees; the indirect TAE for the larger angles varied widely between ±2.5 degrees. The graph is roughly antisymmetric around 0 degrees, so the TAE is essentially the same in both directions relative to the adaptation line. For comparison, the heavy line shows the average magnitude of the TAE in the RF-LISSOM model over 10 trials with different adaptation angles. Error bars indicate ±1 standard error of measurement; in most cases they are too small to be visible since the results were very consistent among runs. The network adapted to an oriented line at the center of the retina for 90 iterations; then the TAE was measured for test lines oriented at each angle. The duration of adaptation was chosen so that the magnitudes of the human data and the model match; this was the only parameter fit to the data. The result from the model closely resembles the curve for humans, showing both direct and indirect TAEs.
effect, those found for the RF-LISSOM model are well within the range seen for human subjects. In parallel with the angular changes in the TAE, its peak magnitude in humans increases logarithmically with adaptation time (Gibson & Radner, 1937), eventually saturating at a level that depends on the experimental protocol used (Greenlee & Magnussen, 1987; Magnussen & Johnsen, 1986).
Figure 5: Direct TAE over time. The circles show the magnitude of the TAE as a function of adaptation time for human subjects MWG (unfilled circles) and SM (filled circles) from Greenlee and Magnussen (1987). They were the only subjects tested in the study. Each subject adapted to a single +12 degree line for the time period indicated on the horizontal axis (bottom). To estimate the magnitude of the aftereffect at each point, a vertical test line was presented at the same location, and the subject was requested to set a comparison line at another location to match it. The plots represent averages of five runs; the data for 0 to 10 minutes were collected separately from the rest. For comparison, the heavy line shows the average TAE in the RF-LISSOM model for a +12 degree test line over 9 trials (with parameters as in Figure 4). The horizontal axis (top) represents the number of iterations of adaptation, and the vertical axis represents the magnitude of the TAE at this time step. The RF-LISSOM results show a similar logarithmic increase in TAE magnitude with time, but do not show the saturation that is seen for the human subjects.
As the number of adaptation iterations is increased in the model, the magnitude of the TAE also increases logarithmically (see Figure 5). (The single curve that best matched the magnitude of the human data was shown in Figure 4, but the ones for different amounts of adaptation all had the same basic shape.) The time course of the TAE in the RF-LISSOM model is qualitatively similar to the human data, but it does not completely saturate over the adaptation amounts tested so far.
1732
James A. Bednar and Risto Miikkulainen
3.3 How Does the TAE Arise in the Model? The TAE seen in Figure 4 must result from changes in the connection strengths between neurons, since no other component of the model changes as adaptation progresses. Simulations performed with only one type of weight adapting (either afferent, lateral excitatory, or lateral inhibitory) show that the inhibitory weight changes determine the shape of the curve for all angles (Bednar, 1997). In what way do the changing inhibitory connections cause these effects? We will demonstrate this process in a simulation where only inhibitory weights adapted, the adaptation period was longer, and a higher learning rate was used, all in order to exaggerate the effect and make its causes more clearly visible. First, the connections change in a way that leaves the perceived orientation of the adaptation line unchanged. Figure 6a shows the initial response (I), the settled response (S), and the settled response after adaptation (A) for a vertical input (0 degrees, marked with a vertical line). The response after adaptation is more diffuse because the most active neurons have become inhibited (see equation 2.3). Throughout adaptation, the distribution of active orientation detectors is centered around approximately the same angle, and a constant angle is perceived. The initial and settled responses to a test line with a slightly different orientation (e.g., 10 degrees, marked with a dotted line in Figure 6b) are again centered around that orientation (6bI and S). However, comparing the histograms of settled activity before and after adaptation (6bS and A), it is clear that fewer neurons close to 0 degrees respond after adaptation, but an increased number of those representing distant angles (over 10 degrees) do. This is because during adaptation, inhibition was strengthened primarily between neurons close to the 0 degree adaptation angle, and not between those that prefer larger orientations.
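The process just described can be illustrated with a minimal numerical sketch. This is not the RF-LISSOM implementation: the Gaussian tuning widths, weight strengths, one-step settling rule, and population-vector readout below are all simplifying assumptions chosen only to exhibit the mechanism.

```python
import numpy as np

theta = np.arange(-90.0, 90.0, 2.0)            # preferred orientations of 90 model units

def tuning(angle, sigma=15.0):
    """Gaussian orientation tuning with 180-degree wraparound."""
    d = (theta - angle + 90.0) % 180.0 - 90.0
    return np.exp(-0.5 * (d / sigma) ** 2)

# initial lateral inhibitory weights: strongest between similarly tuned units
dij = (theta[:, None] - theta[None, :] + 90.0) % 180.0 - 90.0
W0 = 0.01 * np.exp(-0.5 * (dij / 30.0) ** 2)

def settle(angle, W):
    """One-step subtractive inhibition, a crude stand-in for iterative settling."""
    r = tuning(angle)
    return np.maximum(r - W @ r, 0.0)

def perceived(s):
    """Population-vector readout; angles are doubled to respect 180-deg periodicity."""
    t = np.deg2rad(2.0 * theta)
    return 0.5 * np.rad2deg(np.arctan2(s @ np.sin(t), s @ np.cos(t)))

before = perceived(settle(10.0, W0))

# adapt to a 0-degree line: Hebbian growth of inhibition among coactive units,
# then renormalize each unit's total incoming inhibition (weakening the rest)
s_adapt = settle(0.0, W0)
W = W0 + 0.01 * np.outer(s_adapt, s_adapt)
W *= (W0.sum(axis=1) / W.sum(axis=1))[:, None]

after = perceived(settle(10.0, W))
print(f"perceived orientation of 10-deg test line: {before:.2f} -> {after:.2f}")
```

Under these assumptions, the perceived orientation of the 10 degree test line shifts away from the 0 degree adaptation angle, qualitatively reproducing the direct effect: the flank of the response near 0 degrees is suppressed more than the far flank. The same row normalization that weakens the unused inhibitory connections is the step the full model uses to account for the indirect effect.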
Such adaptation increases the ability of the map to detect small differences from the adaptation line. Before adaptation, the settled histograms for 0 degrees and 10 degrees are fairly similar, with averages differing by only 9.2 degrees in this simulation (6aS and 6bS). After adaptation, the histograms are very different, resulting in a 23 degree difference in perceived orientation (compare 6aA with 6bA). This adaptation is manifested at the psychological level as a shift of the perceived orientation away from the adaptation line, that is, the direct TAE. Meanwhile, the response to a very different test line (e.g., 50 degrees, the dotted line in Figure 6c) becomes broader and stronger after adaptation (compare 6cS and A). Adaptation occurred only in activated neurons, so neurons with orientation preferences greater than 50 degrees are unchanged. However, those with preferences less than 50 degrees actually now respond more strongly (6cS and A). The reason is that during adaptation, the inhibitory connections of these neurons with each other became stronger. Because of normalization (see equation 2.3), their connections to other neurons, those representing distant angles such as 50 degrees, became weaker. As a result, the 50 degree line now inhibits them less than before
[Figure 6 panels: activation histograms over preferred orientation (−90 to +90 degrees) for three test lines, with rows I (initial), S (settled), and A (adapted). (a) Adaptation, 0 degree test line: averages −1.3 (I), −2.7 (S), −0.6 (A). (b) Direct effect, 10 degree test line: averages 10.1 (I), 7.9 (S), 20.3 (A). (c) Indirect effect, 50 degree test line: averages 49.2 (I), 47.5 (S), 45.1 (A).]
Figure 6: Explanation of the TAE. This figure shows activation histograms similar to those in Figures 3c and 3d for several test lines. The histograms focus on the orientation-specific aspects of the cortical response by abstracting out the spatial content, and demonstrate how changes in the response cause the TAE. The top row of histograms (marked I, for Initial) shows the initial response to a vertical line (0 degrees), a 10 degree line, and a 50 degree line, each marked by dotted lines. In each case, the initial response is roughly centered around the orientation of the input line. The next row (marked S, Settled) shows the settled response, which has been focused by the lateral connections but is still centered around the input orientation. The bottom row (marked A, Adapted) shows the settled response to the same input after adapting to a vertical (0 degree) line, marked by a vertical line on the plots. To magnify and clarify the effect for explanatory purposes, only the inhibitory weights were modifiable in this simulation, their learning rate was increased to 0.00005, and the adaptation lasted for 256 iterations. (a) The settled response to the 0 degree test line broadens with adaptation, as the inhibition between the active units increases (aS→A). Since the response remains centered around 0 degrees, there is little change in the perceived orientation and the TAE is close to 0 degrees (as can also be seen in Figure 4). (b) In contrast, a dramatic orientation shift is evident for the 10 degree test line: while the settled histogram before adaptation was centered around 7.9 degrees, after adaptation it is centered around 20.3 degrees (bS→A), inducing a direct effect of 12.4 degrees. The direct TAE is caused by the same changes that caused the broadening in (a): the activity around 0 degrees has decreased, while the activity at larger angles has increased. (c) The changes are more subtle for the indirect effect.
For the 50 degree stimulus, only the neurons around 0 degrees in (cI) fall in the range of orientations initially activated by the adaptation line (aI), and thus those are the only ones that change their behavior between (cS) and (cA). During adaptation, their inhibition to and from neurons around 0 degrees was increased, and the weight normalization caused a corresponding decrease to other neurons, including those around 50 degrees. As a result, they are now less inhibited than before adaptation, and the average response shifts toward the adaptation angle (the indirect TAE). Color and animated demos of these examples can be seen online at http://www.cs.utexas.edu/users/nn/pages/research/selforg.html.
adaptation. Thus, they are more active, and the perceived orientation shifts toward 0 degrees, causing the indirect tilt aftereffect. The indirect effect is therefore true to its name, caused indirectly by the strengthening of inhibitory connections during adaptation. This explanation of the indirect effect is novel and emerges automatically from the RF-LISSOM model. The model thus shows computationally how both the direct and indirect effects can be caused by the same activity-dependent adaptation process and that it is the same process that drives the development of the map.

4 Discussion and Future Work
Our results suggest that the same self-organizing principles that result in sparse coding and reduce redundant activation in the visual cortex may also be operating over short time intervals in the adult, with quantifiable psychological consequences such as the TAE. This finding demonstrates a potentially important computational link between development, structure, and function. Although the RF-LISSOM model was not originally developed as an explanation for the TAE, it exhibits TAEs that have nearly all of the features of those measured in humans. The effect of varying angular separation between the test and adaptation lines is similar to human data, the time course is approximately logarithmic, and the TAE is localized to the retinal location that experienced the stimulus. With minor extensions, the model should account for other features of the TAE as well, such as higher variance at oblique orientations, frequency localization, movement direction specificity, and ocular transfer (Bednar, 1997). One difference, however, shows up in prolonged adaptation: the TAE in humans eventually saturates near 4 to 5 degrees (Greenlee & Magnussen, 1987; Magnussen & Johnsen, 1986; Wolfe & O’Connell, 1986). The TAE in the RF-LISSOM model does not saturate quite as quickly (see Figure 5), and it also appears to saturate for different reasons. In the model, saturation occurs only after the inhibitory weights have been strengthened so much that the cortical response to the adaptation line is entirely suppressed. However, in humans, the adaptation line is still easily detectable even after saturation (Magnussen & Johnsen, 1986). Apparently the human visual system has additional constraints on how much adaptation can occur in the short term, and those constraints are currently not part of the model.
Another feature of the human TAE that does not directly follow from the model is the gradual recovery of accurate perception even in complete darkness (Greenlee & Magnussen, 1987; Magnussen & Johnsen, 1986). With the purely Hebbian learning used in the model, only active units are adapted, and thus no weight changes will occur for blank inputs. One possible explanation for both saturation and dark recovery is that the inhibitory weights modified during tilt adaptation are a set of small, temporary weights adding
to or multiplying more permanent connections. Changes to these weights would be limited in magnitude, and the changes would gradually decay in the absence of visual stimulation. Such a mechanism was proposed by von der Malsburg (1987) as an explanation of visual object segmentation, and it was implemented in the RF-LISSOM model of segmentation by Choe and Miikkulainen (1998) and Miikkulainen et al. (1997). A variety of physical substrates for temporary modifications in efficacy have been found (reviewed in Zucker, 1989). Such effects could also be included in the TAE model, which would allow it to replicate the saturation and dark recovery aspects of the human TAE. The main contribution of the RF-LISSOM model of the TAE is its novel explanation of the indirect aftereffect. Proponents of the lateral inhibitory theory of the TAE and tilt illusion have generally ignored indirect effects or postulated that they occur only at higher cortical levels (Wenderoth & Johnstone, 1988; van der Zwan & Wenderoth, 1995), partly because it has not been clear how they could arise through inhibition in V1. However, recent theoretical models have suggested that indirect effects could occur as early as V1 for simultaneous stimuli (Mundel, Dimitrov, & Cowan, 1997), and RF-LISSOM demonstrates that a quite simple, local mechanism in V1 is also sufficient to produce indirect aftereffects: If the total synaptic strength at each neuron is limited, bolstering the lateral inhibitory connections between active neurons eventually weakens their inactive inhibitory connections, causing the indirect aftereffect. Such weight normalization is computationally necessary: weights governed by a pure Hebbian rule would otherwise increase indefinitely or would have to reach a fixed maximum strength (Rochester, Holland, Haibt, & Duda, 1956; von der Malsburg, 1973; Miller & MacKay, 1994), and neither of these outcomes is biologically plausible.
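The temporary-weight proposal above can be made concrete with a small sketch. The functional forms and all constants (ceiling, rate, time constant) below are hypothetical, chosen only to exhibit saturation and dark recovery, not fitted to any data.

```python
import math

class TemporaryWeight:
    """Inhibitory weight = permanent strength * (1 + bounded temporary component).

    The temporary component grows with Hebbian coactivity up to a ceiling
    (giving saturation) and decays toward zero without stimulation (giving
    dark recovery). All constants here are hypothetical.
    """

    def __init__(self, permanent, ceiling=0.5, eta=0.05, tau=600.0):
        self.permanent = permanent  # long-term connection strength
        self.temp = 0.0             # short-term modulation, bounded by ceiling
        self.ceiling = ceiling
        self.eta = eta              # adaptation rate per iteration
        self.tau = tau              # decay time constant in darkness (seconds)

    def adapt(self, pre, post):
        # Hebbian growth that slows as the temporary component nears its ceiling
        self.temp += self.eta * pre * post * (1.0 - self.temp / self.ceiling)

    def decay(self, dt):
        # exponential recovery in the absence of stimulation (e.g., darkness)
        self.temp *= math.exp(-dt / self.tau)

    @property
    def value(self):
        return self.permanent * (1.0 + self.temp)

w = TemporaryWeight(permanent=0.01)
for _ in range(200):       # prolonged adaptation: temp approaches the ceiling
    w.adapt(1.0, 1.0)
saturated = w.temp
w.decay(dt=1800.0)         # a stretch of darkness decays the temporary part
```

Prolonged adaptation drives the temporary component toward its ceiling, so the effective weight saturates even though adaptation continues, while blank input decays the temporary component and returns the weight toward its permanent value.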
Furthermore, very recent experimental results demonstrate that several processes of normalization are actually occurring in visual cortex neurons near the timescales at which the indirect effect is seen (Turrigiano, Abbott, & Marder, 1994; Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Turrigiano, 1999). Such regulation changes the efficacy of each synapse while keeping their relative weights constant (Turrigiano et al., 1998), as in the RF-LISSOM model. The current RF-LISSOM results predict that similar whole-cell regulation of total synaptic strength underlies the indirect tilt aftereffect in V1. The RF-LISSOM results may also help explain why there is relatively large intersubject variability in the shape and magnitude of the angular function of the indirect effect (Mitchell & Muir, 1976). Although the overall fit in Figure 4 is quite good, there is a small discrepancy at very large angles. This may be an interesting artifact of the visual environment of modern humans. As humans orient themselves with respect to man-made objects, such as books and pictures, they will often see angles near 90 degrees, whereas the model was trained with only single lines. If the model were trained on more realistic visual inputs that included corners, crosses, and approximately orthogonal angles, it would develop more connections between neurons with orthogonal preferences. These connections would have an inhibitory effect, and they would decrease through normalization during adaptation. As a result, the model would show an increased indirect effect near ± 90 degrees, matching the human data. The prediction is therefore that the indirect effect differences between subjects arise from differences in their long-term visual experience, due to factors such as attention and different environments. Through mechanisms similar to those causing the TAE, the RF-LISSOM model should also be able to explain tilt illusions between two stimuli presented simultaneously. Such an explanation was originally proposed by Carpenter and Blakemore (1973), and the principles have been demonstrated recently in an abstract model of orientation (Mundel et al., 1997). Testing this hypothesis in RF-LISSOM for spatially separated stimuli would require a much larger inhibitory connection radius than used for these experiments; such simulations are not yet practical because of computational constraints. Bayesian analysis may offer a way to extract the response to two simultaneously presented overlapping lines (Zemel, Dayan, & Pouget, 1998), which could allow the tilt illusion to be measured in the current model. The tilt illusion can also be tested with existing computing hardware by first combining the current orientation map simulation with an ocular dominance simulation (such as that of Sirosh & Miikkulainen, 1997). Then perceived orientations can be computed for inputs in different eyes; if the tilt illusion is occurring, then the perceived orientation for an input in one eye will change when a different orientation is presented to the other eye (as found in humans; Carpenter & Blakemore, 1973). In addition, many similar phenomena such as aftereffects of curvature, motion, spatial frequency, size, position, and color have been documented in humans (Barlow, 1990).
Since specific detectors for most of these features have been found in the cortex, RF-LISSOM should be able to account for them by the same process of decorrelation mediated by self-organizing lateral connections.

5 Conclusion
The experiments reported here lend strong computational support to the theory that tilt aftereffects result from Hebbian adaptation of the lateral connections between neurons. Furthermore, the aftereffects occur as a result of the same decorrelating process that is responsible for the initial development of the orientation map. The same model should also apply to other aftereffects and to simultaneous tilt illusions. Computational models such as RF-LISSOM can demonstrate many visual phenomena in high detail that are difficult to measure experimentally, thus presenting a view of the cortex that is otherwise not available. This
type of analysis can provide an essential complement to both high-level theories and to experimental work with humans and animals, significantly contributing to our understanding of the cortex.
Acknowledgments
Thanks to Joseph Sirosh for supplying the original RF-LISSOM code. This research was supported in part by the National Science Foundation under grants IRI-9309273 and IIS-9811478. Computer time for the simulations was provided by the Texas Advanced Computing Center at the University of Texas at Austin and by the Pittsburgh Supercomputing Center under grant IRI-940004P. Software for the RF-LISSOM model and demos of the data presented here are available online at http://www.cs.utexas.edu/users/nn/.
References

Barlow, H. B. (1990). A theory about the functional role and synaptic mechanism of visual after-effects. In C. Blakemore (Ed.), Vision: Coding and efficiency (pp. 363–375). New York: Cambridge University Press.
Bednar, J. A. (1997). Tilt aftereffects in a self-organizing model of the primary visual cortex. Master’s thesis, The University of Texas at Austin. Technical Report AI97–259.
Bednar, J. A., & Miikkulainen, R. (1998). Pattern-generator-driven development in self-organizing models. In J. M. Bower (Ed.), Computational neuroscience: Trends in research, 1998 (pp. 317–323). New York: Plenum.
Bosking, W. H., Zhang, Y., Schofield, B., & Fitzpatrick, D. (1997). Orientation selectivity and the arrangement of horizontal connections in tree shrew striate cortex. Journal of Neuroscience, 17(6), 2112–2127.
Campbell, F. W., & Maffei, L. (1971). The tilt aftereffect: A fresh look. Vision Research, 11, 833–840.
Carpenter, R. H. S., & Blakemore, C. (1973). Interactions between orientations in human vision. Experimental Brain Research, 18, 287–303.
Choe, Y., & Miikkulainen, R. (1998). Self-organization and segmentation in a laterally connected orientation map of spiking neurons. Neurocomputing, 21, 139–157.
Coltheart, M. (1971). Visual feature-analyzers and aftereffects of tilt and curvature. Psychological Review, 78, 114–121.
Dong, D. W. (1995). Associative decorrelation dynamics: A theory of self-organization and optimization in feedback networks. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 925–932). Cambridge, MA: MIT Press.
Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6, 559–601.
Finlayson, P. G., & Cynader, M. S. (1995). Synaptic depression in visual cortex tissue slices: An in vitro model for cortical neuron adaptation. Experimental Brain Research, 106(1), 145–155.
Földiák, P. (1990). Forming sparse representations by local anti-Hebbian learning. Biological Cybernetics, 64, 165–170.
Frégnac, Y. (1996). Dynamics of functional connectivity in visual cortical networks: An overview. Journal of Physiology (Paris), 90, 113–139.
Geisler, W. S., & Albrecht, D. G. (1997). Visual cortex neurons in monkeys and cats: Detection, discrimination, and identification. Visual Neuroscience, 14(5), 897–919.
Gibson, J. J., & Radner, M. (1937). Adaptation, after-effect and contrast in the perception of tilted lines. Journal of Experimental Psychology, 20, 453–467.
Gilbert, C. D. (1998). Adult cortical dynamics. Physiological Reviews, 78(2), 467–485.
Greenlee, M. W., & Magnussen, S. (1987). Saturation of the tilt aftereffect. Vision Research, 27(6), 1041–1043.
Grinvald, A., Lieke, E. E., Frostig, R. D., & Hildesheim, R. (1994). Cortical point-spread function and long-range lateral interactions revealed by real-time optical imaging of macaque monkey primary visual cortex. Journal of Neuroscience, 14, 2545–2568.
Hata, Y., Tsumoto, T., Sato, H., Hagihara, K., & Tamura, H. (1993). Development of local horizontal interactions in cat visual cortex studied by cross-correlation analysis. Journal of Neurophysiology, 69, 40–56.
Hirsch, J. A., & Gilbert, C. D. (1991). Synaptic physiology of horizontal connections in the cat’s visual cortex. Journal of Neuroscience, 11, 1800–1809.
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195, 215–243.
Magnussen, S., & Johnsen, T. (1986). Temporal aspects of spatial adaptation: A study of the tilt aftereffect. Vision Research, 26(4), 661–672.
McLean, J., & Palmer, L. A. (1996). Contrast adaptation and excitatory amino acid receptors in cat striate cortex. Visual Neuroscience, 13(6), 1069–1087.
Miikkulainen, R., Bednar, J. A., Choe, Y., & Sirosh, J. (1997). Self-organization, plasticity, and low-level visual phenomena in a laterally connected map model of the primary visual cortex. In R. L. Goldstone, P. G. Schyns, & D. L. Medin (Eds.), Psychology of Learning and Motivation, 36 (pp. 257–308). San Diego, CA: Academic Press.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Computation, 6, 100–126.
Mitchell, D. E., & Muir, D. W. (1976). Does the tilt aftereffect occur in the oblique meridian? Vision Research, 16, 609–613.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 887–893). Cambridge, MA: MIT Press.
Parker, A. J., & Newsome, W. T. (1998). Sense and the single neuron: Probing the physiology of perception. Annual Review of Neuroscience, 21, 227–277.
Rochester, N., Holland, J. H., Haibt, L. H., & Duda, W. L. (1956). Tests on a cell assembly theory of the action of the brain, using a large digital computer. Reprinted in J. A. Anderson & E. Rosenfeld (Eds.), 1988, Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.
Shatz, C. J. (1990). Impulse activity and the patterning of connections during CNS development. Neuron, 5, 745–756.
Sirosh, J. (1995). A self-organizing neural network model of the primary visual cortex. Doctoral dissertation, The University of Texas at Austin. Technical Report AI95–237.
Sirosh, J., & Miikkulainen, R. (1994a). Cooperative self-organization of afferent and lateral connections in cortical maps. Biological Cybernetics, 71, 66–78.
Sirosh, J., & Miikkulainen, R. (1994b). Modeling cortical plasticity based on adapting lateral interaction. In J. M. Bower (Ed.), The neurobiology of computation: The Proceedings of the Third Annual Computation and Neural Systems Conference (pp. 305–310). Dordrecht: Kluwer.
Sirosh, J., & Miikkulainen, R. (1997). Topographic receptive fields and patterned lateral interaction in a self-organizing model of the primary visual cortex. Neural Computation, 9, 577–594.
Sirosh, J., Miikkulainen, R., & Bednar, J. A. (1996). Self-organization of orientation maps, lateral connections, and dynamic receptive fields in the primary visual cortex. In J. Sirosh, R. Miikkulainen, & Y. Choe (Eds.), Lateral interactions in the cortex: Structure and function. Austin, TX: UTCS Neural Networks Research Group. Available online at: http://www.cs.utexas.edu/users/nn/web-pubs/htmlbook96.
Snippe, H. P. (1996). Parameter extraction from population codes: A critical assessment. Neural Computation, 8(3), 511–529.
Stemmler, M., Usher, M., & Niebur, E. (1995). Lateral interactions in primary visual cortex: A model bridging physiology and psychophysics. Science, 269, 1877–1880.
Tolhurst, D. J., & Thompson, P. G. (1975). Orientation illusions and aftereffects: Inhibition between channels. Vision Research, 15, 967–972.
Turrigiano, G. G. (1999). Homeostatic plasticity in neuronal networks: The more things change, the more they stay the same. Trends in Neuroscience, 22, 221–227.
Turrigiano, G., Abbott, L. F., & Marder, E. (1994). Activity-dependent changes in the intrinsic properties of cultured neurons. Science, 264, 974–977.
Turrigiano, G. G., Leslie, K. R., Desai, N. S., Rutherford, L. C., & Nelson, S. B. (1998). Activity-dependent scaling of quantal amplitude in neocortical neurons. Nature, 391, 845–846.
van der Zwan, R., & Wenderoth, P. (1995). Mechanisms of purely subjective contour tilt aftereffects. Vision Research, 35(18), 2547–2557.
Vidyasagar, T. R. (1990). Pattern adaptation in cat visual cortex is a co-operative phenomenon. Neuroscience, 36, 175–179.
von der Malsburg, C. (1973). Self-organization of orientation-sensitive cells in the striate cortex. Kybernetik, 15, 85–100. Reprinted in J. A. Anderson & E. Rosenfeld (Eds.), 1988, Neurocomputing: Foundations of research. Cambridge, MA: MIT Press.
von der Malsburg, C. (1987). Synaptic plasticity as basis of brain organization. In J.-P. Changeux & M. Konishi (Eds.), The neural and molecular bases of learning (pp. 411–432). New York: Wiley.
Weliky, M., Kandler, K., Fitzpatrick, D., & Katz, L. C. (1995). Patterns of excitation and inhibition evoked by horizontal connections in visual cortex share a common relationship to orientation columns. Neuron, 15, 541–552.
Wenderoth, P., & Johnstone, S. (1988). The different mechanisms of the direct and indirect tilt illusions. Vision Research, 28, 301–312.
Wolfe, J. M., & O’Connell, K. M. (1986). Fatigue and structural change: Two consequences of visual pattern adaptation. Investigative Ophthalmology and Visual Science, 27(4), 538–543.
Zemel, R. S., Dayan, P., & Pouget, A. (1998). Probabilistic interpretation of population codes. Neural Computation, 10(2), 403–430.
Zucker, R. S. (1989). Short-term synaptic plasticity. Annual Review of Neuroscience, 12, 13–31.

Received November 28, 1998; accepted July 13, 1999.
Errata
In the paper entitled “Synthesis of Generalized Algorithms for the Fast Computation of Synaptic Conductances with Markov Kinetic Models in Large Network Simulations” by Michele Giugliano, which appeared in Neural Computation 12, 903–931 (2000), the subscripts placed under the left-facing arrows used in the description of the computer implementation of a sample algorithm, on pages 924, 926, 927, 928, and 929 (i.e., “x7”, “x12”, “x28”, “x11”), are typos and carry no meaning related to the assignment statements indicated by the arrows themselves.
Neural Computation, Volume 12, Number 8
CONTENTS Article
Neural Systems as Nonlinear Filters Wolfgang Maass and Eduardo D. Sontag
1743
Note
Analytical Model for the Effects of Learning on Spike Count Distributions Gianni Settanni and Alessandro Treves
1773

Letters
Calculation of Interspike Intervals for Integrate-and-Fire Neurons with Poisson Distribution of Synaptic Inputs A. N. Burkitt and G. M. Clark
1789
Statistical Procedures for Spatiotemporal Neuronal Data with Applications to Optical Recording of the Auditory Cortex O. François, L. Mohamed Abdallahi, J. Horikawa, I. Taniguchi, and T. Hervé
1821

Probabilistic Motion Estimation Based on Temporal Coherence Pierre-Yves Burgi, Alan L. Yuille, and Norberto M. Grzywacz
1839
Boosting Neural Networks Holger Schwenk and Yoshua Bengio
1869
Gradient-Based Optimization of Hyperparameters Yoshua Bengio
1889
A Signal-Flow-Graph Approach to On-line Gradient Calculation Paolo Campolucci, Aurelio Uncini, and Francesco Piazza
1901
Bootstrapping Neural Networks Jürgen Franke and Michael H. Neumann
1929
Information Geometry of Mean-Field Approximation Toshiyuki Tanaka
1951
Measuring the VC-Dimension Using Optimized Experimental Design Xuhui Shao, Vladimir Cherkassky, and William Li
1969
ARTICLE
Communicated by James Williamson
Neural Systems as Nonlinear Filters

Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria

Eduardo D. Sontag
Department of Mathematics, Rutgers University, New Brunswick, NJ 08903, U.S.A.

Experimental data show that biological synapses behave quite differently from the symbolic synapses in all common artificial neural network models. Biological synapses are dynamic; their “weight” changes on a short timescale by several hundred percent depending on the past input to the synapse. In this article we address the question of how this inherent synaptic dynamics (which should not be confused with long-term learning) affects the computational power of a neural network. In particular, we analyze computations on temporal and spatiotemporal patterns, and we give a complete mathematical characterization of all filters that can be approximated by feedforward neural networks with dynamic synapses. It turns out that even with just a single hidden layer, such networks can approximate a very rich class of nonlinear filters: all filters that can be characterized by Volterra series. This result is robust with regard to various changes in the model for synaptic dynamics. Our characterization result provides for all nonlinear filters that are approximable by Volterra series a new complexity hierarchy related to the cost of implementing such filters in neural systems.

1 Introduction
Synapses in common artificial neural network models are static: the value wi of a synaptic weight is assumed to change only during “learning.” In contrast to that, the “weight” wi(t) of a biological synapse at time t is known to be strongly dependent on the inputs xi(t − τ) that this synapse has received from the presynaptic neuron i at previous times t − τ. Varela et al. (1997) have shown that a model of the form

wi(t) = wi · D(t) · (1 + F(t))  (1.1)

with a constant wi, a depression term D(t) with values in (0, 1], and a facilitation term F(t) ≥ 0 can be fitted remarkably well to experimental data for synaptic dynamics.

Neural Computation 12, 1743–1772 (2000) © 2000 Massachusetts Institute of Technology

The facilitation term F(t) is usually modeled as a linear filter with exponential decay: if xi(t − τ) is the output of the presynaptic neuron (typically modeled by a sum of δ-functions), then the current value of this facilitation term is of the form

F(t) = r ∫₀^∞ xi(t − τ) · e^(−τ/c) dτ  (1.2)
for certain parameters r, c > 0 that vary from synapse to synapse. A few other models have been proposed for synaptic dynamics (see, e.g., Dobrunz & Stevens, 1997; Murthy, Sejnowski, & Stevens, 1997; Tsodyks, Pawelzik, & Markram, 1998; Koch, 1999; Maass & Zador, 1998, 1999) that are all quite similar. Closely related models had already been proposed and investigated by Grossberg (1969, 1972, 1984) and Francis, Grossberg, and Mingolla (1994). Our analysis in this article is primarily based on the model of Varela et al. (1997). However, we will prove that our results also hold for the somewhat more complex model for synaptic dynamics in a mean-field context of Tsodyks et al. (1998). We show that such inherent synaptic dynamics empower neural networks with a remarkable capability for carrying out computations on temporal patterns (i.e., time series) and spatiotemporal patterns. This computational mode, where inputs and outputs consist of temporal patterns or spatiotemporal patterns—rather than static vectors of numbers—appears to provide a more adequate framework for analyzing computations in biological neural systems. Furthermore, their capability for processing temporal and spatiotemporal patterns in a very efficient manner may be linked to their superior capabilities for real-time processing of sensory input; hence, our analysis may provide new ideas for designing artificial neural systems with similar capabilities. We consider not just computations of neural systems with a single temporal pattern as input, but also characterize their computational power for the case where several different temporal patterns u1(t), . . . , un(t) are presented in parallel as input to the neural system.
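In discrete time, equations 1.1 and 1.2 can be sketched as follows. The article specifies only F explicitly; in this sketch the depression term D(t) is given an assumed standard form (a multiplicative drop at each spike with exponential recovery toward 1), and all constants are illustrative rather than fitted values.

```python
import numpy as np

def dynamic_synapse(spikes, dt=0.001, w=1.0,
                    r=0.2, c=0.1,          # facilitation increment and time constant (eq. 1.2)
                    f_d=0.6, tau_d=0.5):   # depression per spike and recovery time (assumed)
    """Discrete-time sketch of w(t) = w * D(t) * (1 + F(t)) (eq. 1.1).

    F is the exponentially decaying linear filter of eq. 1.2, accumulated
    spike by spike; D drops multiplicatively at each spike and recovers
    toward 1 -- an assumed form, since only F is specified here.
    """
    F, D = 0.0, 1.0
    out = np.empty(len(spikes))
    for i, s in enumerate(spikes):
        F *= np.exp(-dt / c)               # leaky decay of facilitation
        D += (1.0 - D) * (dt / tau_d)      # recovery of depression toward 1
        if s:
            F += r                         # each presynaptic spike facilitates...
            D *= f_d                       # ...and depresses
        out[i] = w * D * (1.0 + F)
    return out

# effective weight of one synapse driven by a regular 100 Hz spike train for 0.2 s
spikes = np.zeros(200)
spikes[::10] = 1
weight = dynamic_synapse(spikes)
```

Under these assumed constants, facilitation dominates the first few spikes and depression the steady state, so the effective “weight” varies severalfold over the course of the train, in the spirit of the short-term dynamics described above.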
Hence we also provide a complete characterization of the computational power of feedforward neural systems for the case where salient information is encoded in temporal correlations of firing activity in different pools of neurons (represented by correlations among the corresponding continuous functions u1(t), . . . , un(t)). Therefore, various informal suggestions for computational uses of such a code can be placed on a rigorous mathematical foundation. It is easy to see that a large variety of computational operations that respond in a particular manner to correlations in temporal input patterns define time-invariant filters with fading memory; hence they can in principle be implemented on each of the various kinds of dynamic networks considered in this article. Previous standard models for computations on temporal patterns in artificial neural networks are time-delay neural networks (where temporal structure is transformed into spatial structure) and recurrent neural networks, both being based on standard “static” synapses (Hertz, Krogh, &
Neural Systems as Nonlinear Filters
1745
Palmer, 1991). Such a transformation makes it impossible to let “time represent itself” (Mead, 1989) in subsequent computations, which tends to result in a loss of computational efficiency. The results of this article suggest that feedforward neural networks with simple dynamic synapses provide an attractive alternative. Various questions regarding artificial neural networks with more general recurrent structure, in which the time-series character of the data plays a central role, were answered within the framework of computational learning theory in Dasgupta and Sontag (1996; hard-threshold filters with a discrete timescale), Koiran and Sontag (1998; discrete-time recurrent networks), and Sontag (1998; continuous-time recurrent networks). Sontag (1997) summarizes some of the approximation capabilities and other properties of these classes of recurrent networks. In section 2, we introduce the formal notion of a dynamic network, which combines biologically realistic synaptic dynamics according to Varela et al. (1997) with standard sigmoidal neurons (modeling firing activity in a population of neurons), and we review some basic concepts regarding filters. In section 3 we characterize the computational power of feedforward dynamic networks for computations on temporal patterns (i.e., functions of time), and we show that our result can be extended to the model of Tsodyks et al. (1998) for synaptic dynamics. The formal proofs of the characterization results in this article rely on standard techniques from mathematical analysis. In section 4 we extend our investigation to computations on spatiotemporal patterns. Section 5 discusses some conclusions.

2 Basic Concepts
In contrast to the static output of gates in feedforward artificial neural networks, the output of biological neurons consists of action potentials (“spikes”)—stereotyped events that mark certain points in time. These spikes are transmitted by synapses to other neurons, where they cause changes in the membrane potential that affect the times when these other neurons fire and thereby emit a spike. We will focus in this article on the implications of one type of temporal dynamics provided by the components of such neural computations: the inherent temporal dynamics of synapses. The empirical data of Varela et al. (1997) describe the amplitudes of excitatory postsynaptic currents (EPSCs) in a neuron in response to a spike train from a presynaptic neuron. These two neurons are likely to be connected by multiple synapses, and the resulting EPSC amplitude can be understood as a population response of these multiple synapses. Therefore it is justified to employ a deterministic model for synaptic dynamics in spite of the stochastic nature of synaptic transmission at a single release site (Dobrunz & Stevens, 1997). The EPSC amplitude in response to a spike is modeled in Varela et al. (1997) by terms of the form w · (1 + F) and w · D · (1 + F), where F
1746
Wolfgang Maass and Eduardo D. Sontag
is a linear filter with impulse response r · e^{−t/c} modeling facilitation, and D is some nonlinear filter modeling depression at synapses. In some versions of the model considered in Varela et al. (1997), this filter D consists of several depression terms. However, it only assumes values > 0 and is always time invariant and has fading memory. We analyze the impact of this synaptic dynamics in the context of common models for computations in populations of neurons, where one can ignore the stochastic aspects of computation in individual neurons in favor of the deterministic response of pools of neurons that receive similar input (“population coding” or “space rate coding”; see Georgopoulos, Schwartz, & Kettner, 1986; Abbott, 1994; Gerstner, 1999). More precisely, our subsequent neural network model is based on a mean-field analysis of networks of biological neurons, where pools P of neurons serve as computational units, whose time-varying firing activity (measured as the number of neurons in P that fire during a short time interval [t, t + Δ]) is represented by a continuous bounded function y(t). In case pool P receives inputs from m other pools of neurons P_1, . . . , P_m, we assume that

y(t) = \sigma\Bigl( \sum_{i=1}^{m} w_i(t)\, x_i(t) + w_0 \Bigr),

where x_i(t) represents the time-varying firing activity in pool P_i and w_i(t) represents the time-varying average “weight” of the synapses from neurons in pool P_i to neurons in pool P.¹ In the context of neural computation with population coding considered in this article, we have to expand the model of Varela et al. (1997) to populations of synapses that connect two pools of neurons, where presynaptic activity is described not by spike trains but by continuous functions x_i(t) ranging over some bounded interval [B_0, B_1] with 0 < B_0 < B_1.
Therefore, we generalize their model for the dynamics of synapses from a nonlinear filter applied to a sequence of δ-functions (i.e., to a spike train) to a corresponding nonlinear filter applied to a continuous input function x_i(t).² Thus if x_i(t) is a continuous function describing the firing
¹ The function σ: ℝ → ℝ is some activation function, for example, σ(x) = 1/(1 + e^{−x}). For the following, it suffices to assume that σ is continuous and not a polynomial. In sections 3.2 and 3.3 we have to assume in addition that σ assumes nonnegative values only. We refer to Maass and Natschläger (in press) for theoretical arguments and computer simulations that support the use of a sigmoidal activation function in this context.
² So far no empirical data are available for the temporal dynamics of a population of synapses (that connects two pools of neurons in a feedforward direction) in dependence on the pool activity of the presynaptic pool of neurons. It is not completely unproblematic to assume that synaptic dynamics can be modeled on the level of pool activity in the same way as for spiking neurons, although this is commonly done. The exact formula for the firing activity y(t) in the postsynaptic pool P of neurons requires, for each presynaptic pool P_i of neurons, multiplying the vector of spike activity of individual neurons ν_{i,k} in pool P_i with the matrix of current synaptic coupling strengths w_{i,k,j}(t) for neurons ν_j in pool P. The resulting firing activity y(t) of pool P is the average of the current firing activities of neurons ν_j in pool P. In our mean-field model, we assume that this average over j can be expressed in terms of products of the average w_i(t) of the synaptic weights w_{i,k,j}(t) over j and k with the average firing activity x_i(t) in the presynaptic pool P_i. In particular this mean-field model ignores that the value of w_{i,k,j}(t) will in general
activity in the ith presynaptic pool P_i of neurons, we model the size of the resulting synaptic input to a subsequent pool P of neurons by terms of the form w_i(t) · x_i(t) with w_i(t) := w_i · (1 + Fx_i(t)) or w_i(t) := w_i · Dx_i(t) · (1 + Fx_i(t)), where the filters F and D are defined as in Varela et al. (1997). The first equation, which models facilitation, gives rise to the definition of the class DN of dynamic networks in definition 1, and the second equation, which models the more common co-occurrence of facilitation and depression, gives rise to the definition of the class DN*.

Definition 1. We define the class DN of dynamic networks (see Figure 1) as the class of arbitrary feedforward networks consisting of sigmoidal gates that map input functions x_1(t), . . . , x_m(t) to a function

y(t) = \sigma\Bigl( \sum_{i=1}^{m} w_i(t)\, x_i(t) + w_0 \Bigr),

with

w_i(t) = w_i \cdot \Bigl( 1 + r \int_0^{\infty} x_i(t - \tau)\, e^{-\tau/c}\, d\tau \Bigr)

for parameters w_i ∈ ℝ and r, c > 0. σ is some “activation function” from ℝ into ℝ, for example, the logistic sigmoid function defined by σ(x) = 1/(1 + e^{−x}). We will assume in the following only that σ is continuous and not a polynomial.³ The slightly different class DN* is defined in the same way, except that w_i(t) is of the form

w_i(t) = w_i \cdot Dx_i(t) \cdot \Bigl( 1 + r \int_0^{\infty} x_i(t - \tau)\, e^{-\tau/c}\, d\tau \Bigr),

where D is some arbitrary given time-invariant fading-memory filter⁴ with values Dx_i(t) ∈ (0, 1].⁵

² (continued) depend on the specific firing history of the specific presynaptic neuron ν_{i,k}. We refer to Tsodyks et al. (1998) for a detailed mathematical analysis of this problem. It is shown in that article through computer simulations and theoretical arguments that, for the slightly different model for synaptic dynamics considered there, the error resulting from generalizing the model from presynaptic individual neurons to presynaptic pools is benign. We will discuss the model from Tsodyks et al. (1998) in sections 3.2 and 3.3, and we will show in theorems 2 and 3 that our results can be extended to their model.
³ According to Leshno, Lin, Pinkus, and Schocken (1993), the subsequent theorems would hold under even weaker conditions on σ.
⁴ See the remainder of this section for a review of these notions.
⁵ This filter D models synaptic depression and can, for example, be defined as in
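The facilitating weight of definition 1 depends on the input only through an exponentially weighted integral of the past activity, so it can be computed online with a single state variable per synapse. A minimal discrete-time sketch (the function name, step size, and parameter values are our illustrative choices, not from the article):

```python
import math

def facilitating_weight(x, w, r, c, dt):
    """Discretized synaptic weight of definition 1:
    w_i(t) = w_i * (1 + r * integral_0^inf x_i(t - tau) e^{-tau/c} dtau).
    x is a list of activity samples spaced dt apart; returns the weight trace."""
    decay = math.exp(-dt / c)
    F = 0.0                      # running value of the facilitation integral
    trace = []
    for xt in x:
        F = decay * F + xt * dt  # one Euler step of the exponential filter
        trace.append(w * (1.0 + r * F))
    return trace
```

For constant input x(t) ≡ B the integral converges to approximately B · c, so the weight saturates near w · (1 + r · B · c).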
Figure 1: A dynamic network with one hidden layer consisting of two hidden neurons G_1 and G_2. The synapse from the ith input to G_1 computes the filter u_i(·) ↦ w_i(·) · u_i(·); the synapse from the ith input to G_2 computes the filter u_i(·) ↦ w̃_i(·) · u_i(·). The output of the network is of the form z(t) = a_1 · σ(Σ_{i=1}^n w_i(t) · u_i(t) + w_0) + a_2 · σ(Σ_{i=1}^n w̃_i(t) · u_i(t) + w̃_0) + a_0 with a_0, a_1, a_2 ∈ ℝ. Thus the network computes a filter that maps the input functions u_1(·), . . . , u_n(·) onto the output function z(·).
Thus dynamic networks in DN or DN* are simply feedforward neural networks consisting of sigmoidal neurons, where the static weights w_i are replaced by biologically realistic history-dependent functions w_i(t). The input to a dynamic network consists of an arbitrary vector of functions u_1(·), . . . , u_n(·). The output of a dynamic network is defined as a weighted sum

z(t) = \sum_{i=1}^{k} a_i\, y_i(t) + a_0
of the time-varying outputs y_1(t), . . . , y_k(t) of certain sigmoidal neurons in the network, where the “weights” a_0, . . . , a_k can be assumed to be static. Thus a dynamic network with n inputs maps n input functions u_1(·), . . . , u_n(·) onto some output function z(·).⁶
⁵ (continued) Varela et al. (1997). Our subsequent results are independent of the specific definition of D.
⁶ In principle one is also interested in a more general type of operators that map vectors u of real-valued functions onto vectors y of m real-valued functions, where m is larger than 1. However, in order to answer the questions that are addressed in this article for the case m > 1, it suffices to focus on the case m = 1. The reason is that operators that output vectors of m real-valued functions can be viewed as vectors of m operators that output one real-valued function each. In this way, our results for the case m = 1 will imply a complete characterization of all operators that can be approximated by a more generalized type of
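Assembled end to end, a network from the class DN is an ordinary feedforward sigmoidal network whose static weights are replaced by facilitating weight filters, with the static weighted sum above at the output. A one-hidden-layer sketch (all names are ours; for simplicity every synapse here shares one r and c, whereas definition 1 allows them to differ per synapse):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dn_forward(u, W, w0, a, a0, r, c, dt):
    """One-hidden-layer dynamic network:
    y_j(t) = sigma(sum_i w_ji(t) u_i(t) + w0_j),   z(t) = sum_j a_j y_j(t) + a0,
    with w_ji(t) = W[j][i] * (1 + r * integral of u_i(t - tau) e^{-tau/c} dtau).
    u: list of time samples, each a list of n input values. Returns the z trace."""
    k, n = len(W), len(W[0])
    decay = math.exp(-dt / c)
    F = [0.0] * n                    # one facilitation state per input line
    z = []
    for ut in u:
        F = [decay * Fi + ui * dt for Fi, ui in zip(F, ut)]
        y = [sigmoid(sum(W[j][i] * (1.0 + r * F[i]) * ut[i] for i in range(n)) + w0[j])
             for j in range(k)]
        z.append(sum(aj * yj for aj, yj in zip(a, y)) + a0)
    return z
```

With two hidden neurons this computes exactly the kind of filter shown in Figure 1.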
A somewhat related network model has been investigated in Back and Tsoi (1991). They exhibited a learning algorithm for this model, but no characterization of the computational power of such networks was given there. Temporal patterns are modeled in mathematics as functions of time. Hence networks that operate on temporal patterns map functions of time onto functions of time. We will refer to such maps from functions to functions (or from vectors of functions to functions) as filters (in mathematics, they are usually referred to as operators). We will reserve the letters F, H, S for filters, and we write Fu for the function resulting from an application of the filter F to a vector u of functions. Notice that when we write Fu(t), we mean (Fu)(t) (that is, the function Fu evaluated at time t). We write C(A, B) for the class of all continuous functions f: A → B. We will consider suitable subclasses U ⊆ C(A, B) for A ⊆ ℝ^k and B ⊆ ℝ, and study filters that map U^n into ℝ^ℝ (where ℝ^ℝ is the class of all functions from ℝ into ℝ), that is, filters that map n functions u_1(·), . . . , u_n(·) onto another function z(·). In this section and in section 3, we will focus on the case k = 1, where the input functions u_1(·), . . . , u_n(·) are functions of a single variable, which we will interpret as time. The case k > 1 will be considered in section 4. A trivial special case of a filter is the shifting filter S_{t_0} with S_{t_0}u(t) = u(t − t_0). An arbitrary filter F: U^n → ℝ^ℝ is called time invariant if a shift of the input functions by a constant t_0 causes a shift of the output function by the same constant t_0, that is, if for any t_0 ∈ ℝ and any u = ⟨u_1, . . . , u_n⟩ ∈ U^n, one has that Fu_{t_0}(t) = Fu(t − t_0), where u_{t_0} = ⟨S_{t_0}u_1, . . . , S_{t_0}u_n⟩. All filters considered in this article will be time invariant. Note that if U is closed under S_{t_0} for all t_0 ∈ ℝ, then a time-invariant filter F: U^n → ℝ^ℝ is fully characterized by the values Fu(0) for u ∈ U^n.
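Time invariance of the synaptic filters used here is easy to check numerically: shifting the input sequence merely shifts the output sequence. A small sanity check with a discretized facilitation filter (our own sketch; the step counts and parameter values are arbitrary illustrative choices):

```python
import math

def facil_filter(x, r=0.5, c=0.1, dt=0.001):
    """H u(t) = u(t) * (1 + r * integral_0^inf u(t - tau) e^{-tau/c} dtau),
    discretized with step dt."""
    decay, F, out = math.exp(-dt / c), 0.0, []
    for xt in x:
        F = decay * F + xt * dt
        out.append(xt * (1.0 + r * F))
    return out

# Shifting the input by 20 steps shifts the output by 20 steps (time invariance):
u = [0.0] * 50 + [1.0] * 50
shifted = [0.0] * 20 + u[:-20]       # S_{t0} u with t0 = 20 steps
ref = facil_filter(u)
out = facil_filter(shifted)
same = all(abs(out[i + 20] - ref[i]) < 1e-12 for i in range(80))
```

The check relies on the filter being causal with zero initial state, so prepending zeros reproduces the unshifted state trajectory exactly.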
Another essential property of the filters considered in this article is fading memory. If a filter F has fading memory, then the value of Fv(0) can be approximated arbitrarily closely by the value of Fu(0) for functions u that approximate the functions v on sufficiently long bounded intervals [−T, 0]. The formal definition is as follows:

Definition 2. We say that a filter F: U^n → ℝ^ℝ has fading memory if for every v = ⟨v_1, . . . , v_n⟩ ∈ U^n and every ε > 0 there exist δ > 0 and T > 0 so that |Fv(0) − Fu(0)| < ε for all u = ⟨u_1, . . . , u_n⟩ ∈ U^n with the property that ‖v(t) − u(t)‖ < δ for all t ∈ [−T, 0].⁷
⁶ (continued) dynamic networks that output m real-valued functions instead of just one.
⁷ We will reserve ‖ · ‖ for the max-norm on ℝ^n; that is, for x = ⟨x_1, . . . , x_n⟩ ∈ ℝ^n we write ‖x‖ for max{|x_i|: i = 1, . . . , n}.
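Fading memory can be illustrated the same way: two inputs that agree on a sufficiently long window [−T, 0] produce nearly identical outputs at time 0, no matter how different their remote past. A numerical sketch using the discretized facilitation filter (parameter values and names are our illustrative choices):

```python
import math

def facil_out_at_end(x, r=0.5, c=0.1, dt=0.001):
    """Output at the final sample of the facilitation filter
    H u(t) = u(t) * (1 + r * integral_0^inf u(t - tau) e^{-tau/c} dtau)."""
    decay, F = math.exp(-dt / c), 0.0
    for xt in x:
        F = decay * F + xt * dt
    return x[-1] * (1.0 + r * F)

# u and v differ maximally in the remote past but agree on the last second
# (10 time constants of the filter):
tail = [0.5] * 1000
u = [1.0] * 1000 + tail
v = [0.0] * 1000 + tail
gap = abs(facil_out_at_end(u) - facil_out_at_end(v))
```

The remote disagreement is attenuated by the factor e^{−T/c} (here e^{−10}), so the gap is tiny.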
Remark 1. Interesting examples of linear and nonlinear filters F: U → ℝ^ℝ can be generated with the help of representations of the form

F u(t) = \int_0^{\infty} \cdots \int_0^{\infty} u(t - \tau_1) \cdots u(t - \tau_k)\, h(\tau_1, \ldots, \tau_k)\, d\tau_1 \cdots d\tau_k
for measurable and essentially bounded functions u: ℝ → ℝ. We will always assume in this article that h ∈ L^1. One refers to such an integral as a Volterra term of order k. Note that for k = 1, it yields the usual representation of a linear time-invariant filter. The class of filters that can be represented by Volterra series—by finite or infinite sums of Volterra terms of arbitrary order—has been investigated for quite some time in neurobiology and engineering (see Palm & Poggio, 1977; Palm, 1978; Marmarelis & Marmarelis, 1978; Schetzen, 1980; Poggio & Reichardt, 1980; Rugh, 1981; Rieke, Warland, Bialek, & de Ruyter van Steveninck, 1997). It is obvious that any filter F that can be represented by a sum of finitely many Volterra terms of any order (i.e., by a Volterra polynomial or finite Volterra series) is time invariant and has fading memory. This holds for any class U of uniformly bounded input functions u. According to the subsequent lemma 1, both of these properties are inherited by filters F that can be approximated by some arbitrary infinite sequence of such filters. This implies that any filter that can be approximated by finite or infinite Volterra series (which converge in the sense used here) is time invariant and has fading memory (over any class U of uniformly bounded functions u). Boyd and Chua (1985) have shown that under some additional assumptions about U (for example, the assumptions in theorem 1 below), the converse also holds: any time-invariant filter F: U → ℝ^ℝ with fading memory can be approximated arbitrarily closely by Volterra polynomials.

Remark 2.
1. It is easy to see that for classes U of functions that are uniformly bounded (i.e., U ⊆ C(A, B) for some bounded set B ⊆ ℝ), our definition of fading memory agrees with that considered in Boyd and Chua (1985). All classes U considered in this article are uniformly bounded.

2. It is obvious that any time-invariant filter F that has fading memory is causal: u(t) = v(t) for all t ≤ t_0 implies that Fu(t_0) = Fv(t_0), for all t_0 ∈ ℝ.

3. All dynamic synapses considered in this article are modeled as filters that map an input function x_i(·) onto an output function w_i(·) · x_i(·). Furthermore, all of these filters turn out to be time invariant with fading memory. This has the consequence that all models for dynamic networks considered in this article compute time-invariant filters with fading memory.
4. If one considers recurrent versions of such networks, then in the absence of noise, such networks can theoretically also compute filters without fading memory. Consider, for example, some filter F with Fu(0) = 0 if u(t) = 0 for all t ≤ 0 and Fu(0) = 1 if there exists some t_0 ≤ 0 so that u(t_0) ≥ 1. It is obvious that such a filter does not have fading memory. But a network where some “self-exciting” recurrent subcircuit is turned on (and stays on permanently) whenever the input u reaches a value ≥ 1 for some t_0 ∈ ℝ can compute such a filter. Alternatively, a feedforward network can also compute a non-fading-memory filter if any of its components (synapses or neurons) have some permanent memory feature.

5. A special case of time-invariant filters F with fading memory are those defined by Fu(0) = f(u(0)) for arbitrary continuous functions f: ℝ^n → ℝ. Therefore the universal approximation theorem for filters that follows from our subsequent theorem 1 contains as a special case the familiar universal approximation theorem for functions from Hornik, Stinchcombe, and White (1989).

6. It is obvious that a filter F on U^n has fading memory if and only if the functional F̃: U^n → ℝ defined by F̃u := Fu(0) is continuous on U^n with regard to the topology T generated by the neighborhoods {u ∈ U^n: ‖v(t) − u(t)‖ < δ for all t ∈ [−T, 0]} for arbitrary v ∈ U^n and δ, T > 0.
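The Volterra terms of remark 1 can be evaluated directly after discretizing time, which makes their time invariance and fading memory concrete: the kernel weights a truncated window of the past. A sketch for an order-2 term with a separable exponential kernel (grid size, horizon, and kernel are our illustrative choices):

```python
import math

def volterra2(u, h, dt):
    """Order-2 Volterra term, truncated to the length of the kernel grid h:
    F u(t) ~ sum_j sum_k u(t - tau_j) u(t - tau_k) h(tau_j, tau_k) dt^2,
    evaluated at the last sample of u. h is an M x M grid with tau_j = j*dt."""
    M, t, acc = len(h), len(u) - 1, 0.0
    for j in range(M):
        for k in range(M):
            if t - j >= 0 and t - k >= 0:
                acc += u[t - j] * u[t - k] * h[j][k] * dt * dt
    return acc

# Separable kernel h(tau1, tau2) = e^{-tau1} e^{-tau2}: the term equals the
# square of a linear exponential filter's output.
dt, M = 0.01, 400
h = [[math.exp(-j * dt) * math.exp(-k * dt) for k in range(M)] for j in range(M)]
val = volterra2([1.0] * 1000, h, dt)
```

For constant unit input the result approximates (∫₀⁴ e^{−τ} dτ)² ≈ 0.96, up to Riemann-sum error.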
Lemma 1. Assume that U is closed under S_{t_0} for all t_0 ∈ ℝ and a sequence (F_n)_{n∈ℕ} of filters converges to a filter F in the sense that for every ε > 0 there exists an n_0 ∈ ℕ so that |F_n u(t) − Fu(t)| < ε for all n ≥ n_0, u ∈ U^n, and t ∈ ℝ. Then the following holds:
1. If all the filters F_n are time invariant, then F is time invariant.

2. If all the filters F_n have fading memory, then F has fading memory.
Proof. The first claim follows immediately from the fact that Fu(t) = lim_{n→∞} F_n u(t) for all u ∈ U^n and t ∈ ℝ. In order to prove the second claim, we can assume that some ε > 0 and some v ∈ U^n have been given. We fix some n_0 ∈ ℕ so that |F_{n_0}u(t) − Fu(t)| < ε/3 for all u ∈ U^n, t ∈ ℝ. Since F_{n_0} has fading memory, there exist some T > 0 and some δ > 0 so that |F_{n_0}u(0) − F_{n_0}v(0)| < ε/3 for all u ∈ U^n with the property that ‖u(t) − v(t)‖ < δ for all t ∈ [−T, 0]. By our choice of n_0, this implies that |Fu(0) − Fv(0)| < ε for all u ∈ U^n with ‖u(t) − v(t)‖ < δ for all t ∈ [−T, 0]. Hence F has fading memory.
3 Computations on Temporal Patterns

3.1 Characterizing the Computational Power of Neural Networks with Dynamic Synapses. Our subsequent theorem 1 shows that simple filters
that model only synaptic facilitation (as considered in the definition of DN) provide the networks with sufficient dynamics to approximate arbitrary given time-invariant filters with fading memory. We show that the simultaneous occurrence of depression (as in DN*) is not needed for that, but it also does not hurt. This appears to be of some interest for the analysis of computations in biological neural systems, since a fairly large variety of different functional roles have already been proposed for synaptic depression: explaining psychological data on conditioning and reinforcement (Grossberg, 1972), boundary formation in vision and visual persistence (Francis et al., 1994), switching between different neural codes (Tsodyks & Markram, 1997), and automatic gain control (Abbott, Varela, Sen, & Nelson, 1997). As a complement to these conjectured roles for synaptic depression, our subsequent theorem 1 points to a possible functional role for synaptic facilitation: it empowers even very shallow feedforward neural systems with the capability to approximate basically any linear or nonlinear filter that appears to be of interest in a biological context. Furthermore, we show that this possible functional role for facilitation can coexist with independent other functional roles for synaptic depression. Our result shows that one can first choose the parameters that control synaptic depression to serve some other purpose and can then still choose the parameters that control synaptic facilitation so that the resulting neural system can approximate any given time-invariant filter with fading memory.⁸

Theorem 1. Assume that U is the class of functions from ℝ into [B_0, B_1] that satisfy |u(t) − u(s)| ≤ B_2 · |t − s| for all t, s ∈ ℝ, where B_0, B_1, B_2 are arbitrary real-valued constants with 0 < B_0 < B_1 and 0 < B_2. Let F be an arbitrary filter that maps vectors u = ⟨u_1, . . . , u_n⟩ ∈ U^n into functions from ℝ into ℝ. Then the following are equivalent:⁹
(a) F can be approximated by dynamic networks S ∈ DN (i.e., for any ε > 0 there exists some S ∈ DN such that |Fu(t) − Su(t)| < ε for all u ∈ U^n and all t ∈ ℝ).

(b) F can be approximated by dynamic networks S ∈ DN with just a single layer of sigmoidal neurons.
⁸ We show in section 3.3 that alternatively one can employ just depressing synapses for approximating any such filter by a neural system.
⁹ The implication (c) ⇒ (d) was already shown in Boyd and Chua (1985).
(c) F is time invariant and has fading memory.

(d) F can be approximated by a sequence of (finite or infinite) Volterra series.

These equivalences remain valid if DN is replaced by DN*.

The following result stems from the proof of theorem 1. It shows that the class of filters that can be approximated by dynamic networks is very stable with regard to changes in the definition of a dynamic network.

Corollary 1. Dynamic networks with just one layer of dynamic synapses and one subsequent layer of sigmoidal gates can approximate the same class of filters as dynamic networks with an arbitrary finite number of layers of dynamic synapses and sigmoidal gates. Even with a sequence of dynamic networks that have an unboundedly growing number of layers, one cannot approximate more filters. Furthermore, if one restricts the synaptic dynamics in the definition of dynamic networks to the simplest form w_i(t) = w_i · (1 + r ∫_0^∞ x_i(t − τ)e^{−τ/c} dτ) with some arbitrarily fixed r > 0 and time constants c from some arbitrarily fixed interval [a, b] with 0 < a < b, the resulting class of dynamic networks can still approximate (with just one layer of sigmoidal neurons) any filter that can be approximated by a sequence of arbitrary dynamic networks considered in definition 1. In the case of DN*, one can likewise arbitrarily fix r > 0 or the interval [a, b] for the value of c.
In addition, we will show in section 3.2 that the claim of theorem 1 remains valid if we replace the model from Varela et al. (1997) for synaptic dynamics (employed in the definition of the classes DN and DN* of dynamic networks) by the model from Tsodyks et al. (1998). Furthermore, we show in section 3.3 that the claim of theorem 1 also holds for networks where synapses exhibit just depression, not facilitation.

Remark 3. The proof of theorem 1 shows that its claim, as well as the claims of corollary 1, hold under much weaker conditions on the class U. Apart from the requirement that U is closed under translation, it suffices to assume that U is some arbitrary class of uniformly bounded and equicontinuous¹⁰ functions that is closed with regard to the topology defined in part 6 of remarks 2, since this assumption is sufficient for the application of the Arzelà-Ascoli theorem (see Dieudonné, 1969, or Boyd & Chua, 1985) in the proof.
Proof of Theorem 1. According to lemma 1, any filter that can be approximated by finite or infinite Volterra series is time invariant and has fading memory. This implies (d) ⇒ (c). Furthermore, it is shown in Boyd and Chua
¹⁰ U is equicontinuous if for any ε > 0 there exists a δ > 0 so that |t − s| ≤ δ implies |u(t) − u(s)| ≤ ε for all t, s ∈ ℝ and all u ∈ U.
(1985) that for the classes U considered in this article, any time-invariant filter F: U → ℝ^ℝ with fading memory can be approximated by a sequence of finite Volterra series (i.e., by Volterra polynomials). This argument can be trivially extended to filters F: U^n → ℝ^ℝ with n ≥ 1. This implies (c) ⇒ (d). Hence we have shown that (c) ⇔ (d). The implication (b) ⇒ (a) is obvious. In order to prove (a) ⇒ (c), we observe that all filters occurring at synapses of a dynamic network (see definition 1) are time invariant and have fading memory. This implies that all filters S defined by dynamic networks (i.e., all S ∈ DN ∪ DN*) are time invariant and have fading memory. According to lemma 1, this implies that any filter F that can be approximated by such networks is time invariant and has fading memory. For the proof of (c) ⇒ (b), we first consider the case n = 1. We assume that F is some arbitrary given filter that is time invariant and has fading memory. We will first show that F can be approximated by filters S ∈ DN. The proof is based on an application of the Stone-Weierstrass theorem (see, e.g., Dieudonné, 1969, or Folland, 1984), similarly as in Boyd and Chua (1985). That article extends earlier arguments by Sussmann (1975), Fliess (1975), and Gallman and Narendra (1976) from a bounded to an unbounded time interval. Furthermore, our proof exploits the fact that any continuous function can be uniformly approximated on any compact set by weighted sums of sigmoidal gates (Hornik et al., 1989; Sandberg, 1991; Leshno et al., 1993). We will apply the Stone-Weierstrass theorem to functionals from U_− := {u|_{(−∞,0]}: u ∈ U} into ℝ. For that purpose, we have to show that the filters H of the form
H u(t) = u(t) \cdot \Bigl( 1 + r \int_0^{\infty} u(t - \tau)\, e^{-\tau/c}\, d\tau \Bigr)

separate points in U_−; that is, for any u, v ∈ U_− with u ≠ v there exists a filter H of this form such that Hu(0) ≠ Hv(0). Thus, we consider some arbitrary given u, v ∈ U with u(t) ≠ v(t) for some t ≤ 0. Then the function u(0) · u(−t) − v(0) · v(−t) assumes a value ≠ 0 for some t ≥ 0. This implies
that

q(\lambda) = \int_0^{\infty} \bigl( u(0) \cdot u(-\tau) - v(0) \cdot v(-\tau) \bigr) e^{-\tau/\lambda}\, d\tau

does not assume a constant value for all arguments λ in [a, b]. This follows because if q(λ) = c for all such λ, then q, being an analytic function of λ ∈ ℂ with real part > 0, would equal c for all real λ > 0. Since the limit of q(λ) as λ → ∞ is zero, this means c = 0. However, the Laplace transform is one-to-one. (This is a standard fact; one way to prove it is using that the Laplace transform of a bounded measurable function w, evaluated at any
point of the form s = 1 + iω, coincides, as a function of ω, with the Fourier transform of w(t)e^{−t}, and the Fourier transform is one-to-one on integrable functions; cf. Hewitt & Stromberg, 1965, corollary 21.47.) Hence, q(λ) does not assume a constant value for all arguments λ in [a, b]. Since q is analytic, it therefore assumes, in any interval [a, b] with 0 < a < b, infinitely many different values. This implies that for any fixed r > 0,

u(0) + r \cdot \int_0^{\infty} u(0) \cdot u(-\tau)\, e^{-\tau/c}\, d\tau \;\neq\; v(0) + r \cdot \int_0^{\infty} v(0) \cdot v(-\tau)\, e^{-\tau/c}\, d\tau
for some c ∈ [a, b]. Therefore we have H_c u(0) ≠ H_c v(0) for the filter H_c defined by

H_c u(t) = u(t) \cdot \Bigl( 1 + r \cdot \int_0^{\infty} u(t - \tau)\, e^{-\tau/c}\, d\tau \Bigr).
In order to apply the Stone-Weierstrass theorem, we also need to show that U_− is a compact metric space with regard to the topology T defined in part 6 of remarks 2. Obviously this topology T coincides with the topology generated on U_− by the metric

d(u, v) := \sup_{t \le 0} \frac{|u(t) - v(t)|}{1 + |t|}
(since all functions in U are assumed to be uniformly bounded). The compactness of U_− with regard to this metric follows by a routine argument, applying the Arzelà-Ascoli theorem successively to the sequence of restrictions U|_{[−T,0]} := {u|_{[−T,0]}: u ∈ U} for T ∈ ℕ and by diagonalizing over converging subsequences of these restrictions (see, for instance, lemma 1 in Boyd & Chua, 1985). The Stone-Weierstrass theorem implies that there exist for every given ε > 0 some m ∈ ℕ, filters H_{c_1}, . . . , H_{c_m} as specified above, and a polynomial p such that

|F u(0) - p(H_{c_1} u(0), \ldots, H_{c_m} u(0))| < \frac{\varepsilon}{2}
for all u ∈ U_−. Since the functionals H̃_{c_i}: U_− → ℝ defined by H̃_{c_i}u := H_{c_i}u(0) are continuous over the compact space U_−, the values H_{c_i}u(0) for i ∈ {1, . . . , m} and u ∈ U_− are contained in some bounded interval [−b, b]. Furthermore, according to Hornik et al. (1989) and Leshno et al. (1993), there
exist sigmoidal gates G_1, . . . , G_k and parameters a_0, . . . , a_k ∈ ℝ such that

\Bigl| p(x) - \Bigl( \sum_{j=1}^{k} a_j\, G_j(x) + a_0 \Bigr) \Bigr| < \frac{\varepsilon}{2}

for all x ∈ [−b, b]^m.¹¹ Note that G_j(H_{c_1}u(0), . . . , H_{c_m}u(0)) is of the form σ(Σ_{i=1}^m w_i(0)u(0) + w_0) with w_0 ∈ ℝ and w_i(t) as in definition 1 (with x_i(·) replaced by u(·)). Hence the previously constructed H_{c_1}, . . . , H_{c_m} together with this layer of k sigmoidal gates G_1, . . . , G_k define a dynamic network S ∈ DN. We then have |Fu(0) − Su(0)| < ε for all u ∈ U_−. Because of the time invariance of F and H_{c_1}, . . . , H_{c_m}, this implies that |Fu(t) − Su(t)| < ε for all u ∈ U and all t ∈ ℝ. This completes the proof of (c) ⇒ (b) for the case of dynamic networks that define filters S ∈ DN and n = 1.

In order to show that for u, v ∈ U_− with u ≠ v we have Hu(0) ≠ Hv(0) also for some filter H that reflects synaptic dynamics with some arbitrary given depression filter D as in the definition of DN*, we consider two cases for the filter H_c with H_c u(0) ≠ H_c v(0) that we have already constructed.

Case 1: u(0) · Du(0) = v(0) · Dv(0). Then the function u(0) · Du(0) · u(−t) − v(0) · Dv(0) · v(−t) assumes a value ≠ 0 for some t ≥ 0. Hence we can apply the same argument as before to the function

\tilde{q}(\lambda) = \int_0^{\infty} \bigl( u(0) \cdot Du(0) \cdot u(-\tau) - v(0) \cdot Dv(0) \cdot v(-\tau) \bigr) e^{-\tau/\lambda}\, d\tau

to show that this function assumes infinitely many different values for λ ∈ [a, b], for any given interval [a, b] with 0 < a < b. This implies that for every given r > 0 there exists some c ∈ [a, b] so that u(0) · Du(0) · (1 + r · ∫_0^∞ u(−τ)e^{−τ/c} dτ) ≠ v(0) · Dv(0) · (1 + r · ∫_0^∞ v(−τ)e^{−τ/c} dτ).

Case 2: u(0) · Du(0) ≠ v(0) · Dv(0). Then the claim follows since r · ∫_0^∞ u(−τ)e^{−τ/c} dτ − r · ∫_0^∞ v(−τ)e^{−τ/c} dτ converges to 0 if r → 0 or c → 0.

The rest of the argument is exactly as in the preceding argument for filters S ∈ DN. This completes the proof of (c) ⇒ (b) also for the case of dynamic networks that define filters S ∈ DN* and n = 1. In the claim of the theorem, we had considered a slightly more general class of filters F that are defined over U^n for some given n ≥ 1. In order to extend the preceding proof of (c) ⇒ (b) to the more general input space for n ≥ 1, one just has to note that U^n is a compact metric space with regard to the product topology generated by the topology T over U as in part 6 of remarks 2, and that our preceding arguments imply that filters over U^n of
¹¹ These approximation results were previously applied in this context by Sandberg (1991).
the form ⟨u_1, . . . , u_n⟩ ↦ Hu_i with i ∈ {1, . . . , n} (and H modeling synaptic dynamics according to definition 1) separate points in U^n_−.

3.2 Extension of the Result to the Model for Synaptic Dynamics by Tsodyks, Pawelzik, and Markram. Tsodyks et al. (1998) propose a slightly
different temporal dynamics for depression and facilitation in populations of synapses. In contrast to the model from Varela et al. (1997) that underlies our definition 1, this model has been explicitly formulated for a mean-field analysis, where the input to a population of synapses consists of a continuous function x_i(t) that models firing activity in a presynaptic pool P_i of neurons rather than a spike train from a single presynaptic neuron. We show in this section that our characterization result from the preceding section also holds for this model for synaptic dynamics. The first difference to the synapse model from Varela et al. (1997) is a use-dependent discount factor e^{−r∫_{−t}^{0} x_i(t')\,dt'} · e^{−t/c} instead of just e^{−t/c} in the model for facilitation, which reduces the facilitating impact of a preceding large input x_i(−t) on the value of the synaptic weight at time 0. In other words, facilitation is no longer modeled by a linear filter; instead, one assumes that facilitation has less impact on a synapse that has already been facilitated by preceding inputs. For a precise definition of the resulting variation DN⁺ of our definition of dynamic networks from definition 1, we replace w_i(t) = w_i · (1 + r ∫_0^∞ x_i(t − τ)e^{−τ/c} dτ) by w_i(t) = w_i · ŵ_i(t), where

\hat{w}_i(t) = r \cdot \int_0^{\infty} x_i(t - \tau) \cdot e^{-r \int_{t-\tau}^{t} x_i(t')\, dt'} \cdot e^{-\tau/c}\, d\tau.   (3.1)
This is the model for facilitation proposed in equation 3.5 of Tsodyks et al. (1998) for a mean-field setting, where $x_i(t-\tau)$ models firing activity at time $t-\tau$ in a presynaptic pool $P_i$ of neurons ($\hat{w}_i(t)$ is denoted by $U_{SE}^1$, $\rho$ is denoted by $U_{SE}$, $c$ is denoted by $\tau_{facil}$, and $w_i$ is denoted by $A_{SE}$). We show in the subsequent result that for any given value of the parameter $\rho$ (which models the normal use of synaptic resources caused by input to a "rested" synapse) and for any given interval $[a, b]$ one can choose the values $w_i \in \mathbb{R}$ and time constants $c$ from $[a, b]$ so that a network consisting of facilitating synapses in combination with one layer of sigmoidal neurons can approximate any time-invariant filter with fading memory.

Tsodyks et al. (1998) also propose a model for populations of synapses that exhibit both depression and facilitation (one substitutes equation 3.5 for $U_{SE}$ in equation 3.3 of Tsodyks et al., 1998). A new feature of this model is that one can no longer express the current synaptic weight $w_i(t)$ as a product of the outputs of two separate filters: one for depression and one for facilitation. Rather, the output of the filter for facilitation (see our equation 3.1) enters the computation of the current output of the filter for depression.
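As a concrete illustration of the facilitation model, equation 3.1 can be approximated by a simple Riemann sum. The sketch below is our own illustration, not code from the article; the function name, step size, and example inputs are arbitrary assumptions.

```python
import math

def facilitation_weight(x, rho, c, dt):
    """Riemann-sum approximation of the facilitation factor of
    equation 3.1, evaluated at the final sample of the history x."""
    n = len(x)
    w_hat = 0.0
    recent = 0.0  # running value of the inner integral int_{t-tau}^{t} x(t') dt'
    for k in range(n):
        tau = k * dt
        # the use-dependent discount exp(-rho * recent) reduces the
        # facilitating impact of inputs arriving at an already
        # facilitated synapse
        w_hat += x[n - 1 - k] * math.exp(-rho * recent) * math.exp(-tau / c) * dt
        recent += x[n - 1 - k] * dt
    return rho * w_hat

# two firing-rate histories with identical current values but
# different pasts receive different facilitation factors
dt = 0.01
quiet = [1.0] * 500
busy = [1.0] * 250 + [8.0] * 249 + [1.0]
w_quiet = facilitation_weight(quiet, rho=0.5, c=1.0, dt=dt)
w_busy = facilitation_weight(busy, rho=0.5, c=1.0, dt=dt)
print(w_quiet, w_busy)
```

Because of the discount factor the filter is nonlinear: doubling the input history does not double $\hat{w}_i(t)$, in contrast to the linear facilitation model of definition 1.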
1758
Wolfgang Maass and Eduardo D. Sontag
This is biologically plausible, since a facilitated synapse spends its resources more quickly, and hence is subject to stronger depression. In our notation, the model from Tsodyks et al. (1998) for depression and facilitation in a mean-field setting (equations 3.3 and 3.5 in Tsodyks et al., 1998) yields the following formula for the value $w_i(t)$ of the current weight of a population of synapses (with $\hat{w}_i(t)$ defined according to equation 3.1):
$$w_i(t) := w_i \cdot \hat{w}_i(t) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{t-\tau}^{t} \hat{w}_i(t') x_i(t')\, dt'}\, d\tau. \tag{3.2}$$
This formula involves another parameter $\tau_{rec}$: the time constant for the recovery from using synaptic resources. We will write $DN^{++}$ for the class of feedforward networks consisting of sigmoidal neurons with dynamic weights $w_i(t)$ according to equation 3.2.

In order to make sure that the integrals in equations 3.1 and 3.2 assume a finite value for bounded synaptic inputs $x_i(\cdot)$, one has to make sure that not only the network inputs but also the outputs of sigmoidal units in networks from the classes $DN^+$ and $DN^{++}$ are always nonnegative. For that purpose, we assume in this section and the next that the sigmoidal activation function $\sigma$ assumes nonnegative values only. This is no real restriction since the output of a sigmoidal unit models the current firing activity in a pool of neurons. Any filter that maps $x_i(\cdot)$ onto $w_i(\cdot) \cdot x_i(\cdot)$ with $w_i(\cdot)$ defined according to equations 3.1 and 3.2 is time invariant and has fading memory. Hence every network in $DN^+$ and $DN^{++}$ computes a time-invariant filter with fading memory.

Theorem 2. Assume that $U$ is the class of functions from $\mathbb{R}$ into $[B_0, B_1]$ that satisfy $|u(t) - u(s)| \leq B_2 \cdot |t - s|$ for all $t, s \in \mathbb{R}$, where $B_0, B_1, B_2$ are arbitrary real-valued constants with $0 < B_0 < B_1$ and $0 < B_2$. Let $F$ be an arbitrary filter that maps vectors $u = \langle u_1, \ldots, u_n \rangle \in U^n$ into functions from $\mathbb{R}$ into $\mathbb{R}$. Then the following are equivalent:
(a) $F$ can be approximated by dynamic networks $S \in DN^+$ (i.e., for any $\varepsilon > 0$ there exists some $S \in DN^+$ such that $|Fu(t) - Su(t)| < \varepsilon$ for all $u \in U^n$ and all $t \in \mathbb{R}$).

(b) $F$ can be approximated by dynamic networks $S \in DN^+$ with just a single layer of sigmoidal neurons.

(c) $F$ is time invariant and has fading memory.

(d) $F$ can be approximated by a sequence of (finite or infinite) Volterra series.

These equivalences remain valid if $DN^+$ is replaced by $DN^{++}$.

It will be obvious from the proof of theorem 2 that in principle quite small ranges suffice for the "free" parameters $c$ and $\tau_{rec}$ that control the synaptic dynamics according to equations 3.1 and 3.2.
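A numerical sketch of the combined model of equation 3.2 makes the coupling explicit: the facilitation output $\hat{w}_i$ enters the exponent of the depression integral, so stronger facilitation uses up synaptic resources faster. This is our own illustration; the discretization, parameter values, and input are arbitrary assumptions.

```python
import math

def facilitation(x, rho, c, dt):
    """w_hat(t) of equation 3.1 at the last sample of the history x."""
    n, w_hat, recent = len(x), 0.0, 0.0
    for k in range(n):
        w_hat += x[n - 1 - k] * math.exp(-rho * recent) * math.exp(-(k * dt) / c) * dt
        recent += x[n - 1 - k] * dt
    return rho * w_hat

def combined_weight(x, w, rho, c, tau_rec, dt):
    """w_i(t) of equation 3.2: facilitation factor times a depression
    integral whose resource-use term is w_hat(t') * x(t')."""
    n = len(x)
    # facilitation factor at every past sample, since it appears
    # inside the depression integral's exponent
    w_hat = [facilitation(x[:j + 1], rho, c, dt) for j in range(n)]
    depress, used = 0.0, 0.0  # used = int_{t-tau}^{t} w_hat(t') x(t') dt'
    for k in range(n):
        depress += math.exp(-(k * dt) / tau_rec) * math.exp(-used) * dt
        used += w_hat[n - 1 - k] * x[n - 1 - k] * dt
    return w * w_hat[-1] * depress

x = [2.0] * 200
w_weak = combined_weight(x, w=1.0, rho=0.1, c=0.5, tau_rec=0.8, dt=0.01)
w_strong = combined_weight(x, w=1.0, rho=1.0, c=0.5, tau_rec=0.8, dt=0.01)
print(w_weak, w_strong)
```

Varying $\rho$ changes both factors at once, which is exactly why $w_i(t)$ can no longer be written as a product of two independent filters.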
Corollary 2. In order to approximate an arbitrary given time-invariant fading memory filter $F$ by dynamic networks $S$ from $DN^+$, one can choose for any given $\rho > 0$ the parameters $c$ of the synapses in $S$ (defined according to equation 3.1) from some arbitrarily given interval $[a, b]$. In order to approximate $F$ by networks $S$ from $DN^{++}$, one can choose for any given $\rho > 0$ the parameters $c$ from some arbitrarily given interval $[a, b]$ and the parameters $\tau_{rec}$ according to equation 3.2 from some arbitrarily given interval $[a', b']$.
Proof of Theorem 2. It suffices to describe how the proof of (c) $\Rightarrow$ (b) from theorem 1 has to be changed. For the case of networks from the class $DN^+$, we have to show that the filters $H_c^+$ of the form
$$H_c^+ u(t) = u(t) \cdot \rho \int_0^\infty u(t-\tau) \cdot e^{-\rho \int_{t-\tau}^{t} u(t')\, dt'} \cdot e^{-\tau/c}\, d\tau$$
separate points in $U_-$. We show that for any given $\rho > 0$, $a, b \in \mathbb{R}$ with $0 < a < b$, and any $u, v \in U_-$ with $u \neq v$ there exists some $c \in [a, b]$ such that $H_c^+ u(0) \neq H_c^+ v(0)$. Thus we consider some arbitrary given $u, v \in U$ with $u(t) \neq v(t)$ for some $t \leq 0$. According to our argument in the proof of theorem 1, it suffices for that to show that
$$u(0) \cdot u(-\tau) \cdot e^{-\rho \int_{-\tau}^{0} u(t')\, dt'} \;\neq\; v(0) \cdot v(-\tau) \cdot e^{-\rho \int_{-\tau}^{0} v(t')\, dt'} \quad \text{for some } \tau \geq 0, \tag{3.3}$$
because this implies that the function $q^+$ defined by
$$q^+(\ell) := \int_0^\infty \Bigl( u(0) \cdot u(-\tau) \cdot e^{-\rho \int_{-\tau}^{0} u(t')\, dt'} - v(0) \cdot v(-\tau) \cdot e^{-\rho \int_{-\tau}^{0} v(t')\, dt'} \Bigr) e^{-\tau/\ell}\, d\tau$$
assumes infinitely many different values for $\ell \in [a, b]$.

Assume for a contradiction that equation 3.3 does not hold. This implies that $u(0) = v(0)$. Consider some $t_0 < 0$ with $u(t_0) \neq v(t_0)$. We assume without loss of generality that $u(t_0) > v(t_0)$. Set
$$t_0^+ := \inf\{t > t_0 : u(t) \leq v(t)\}, \qquad t_0^- := \sup\{t < t_0 : u(t) \leq v(t)\}.$$
We have $t_0^+ \leq 0$ since $u(0) = v(0)$.

Case 1: $t_0^- > -\infty$ (i.e., $\exists\, t < t_0 : u(t) \leq v(t)$). Then $t_0^- < t_0 < t_0^+$, $u(t_0^-) = v(t_0^-)$, $u(t_0^+) = v(t_0^+)$, and $u(t) > v(t)$ for all $t \in (t_0^-, t_0^+)$. According to our assumption this implies that $\int_{t_0^+}^{0} u(t)\, dt = \int_{t_0^+}^{0} v(t)\, dt$ and $\int_{t_0^-}^{0} u(t)\, dt = \int_{t_0^-}^{0} v(t)\, dt$, although $\int_{t_0^-}^{t_0^+} u(t)\, dt > \int_{t_0^-}^{t_0^+} v(t)\, dt$. This yields a contradiction.

Case 2: $u(t) > v(t)$ for all $t < t_0$. Our assumptions imply then that $\int_{t_0^+}^{0} u(t)\, dt = \int_{t_0^+}^{0} v(t)\, dt$ and $u(t) > v(t)$ for all $t < t_0^+$. Therefore there exists some $\varepsilon > 0$ such that $e^{\rho \int_t^{0} (u(t') - v(t'))\, dt'} \geq 1 + \varepsilon$ for all $t \leq t_0$. Hence we can conclude from our assumption that $\frac{u(t)}{v(t)} \geq 1 + \varepsilon$ for all $t \leq t_0$. This implies that $e^{\rho \int_t^{0} (u(t') - v(t'))\, dt'} \to \infty$ for $t \to -\infty$, hence $\frac{u(t)}{v(t)} \to \infty$ for $t \to -\infty$. This provides a contradiction to our definition of the class $U$ of functions to which $u$ and $v$ belong, since all functions in $U$ have values in $[B_0, B_1]$ for $0 < B_0 < B_1$. This completes our proof of the direction (c) $\Rightarrow$ (b). The remainder of the proof for the case of dynamic networks from the class $DN^+$ is the same as for theorem 1.

In order to prove (c) $\Rightarrow$ (b) for networks from the class $DN^{++}$, we have to show that the filters that map an input function $x_i(\cdot)$ onto the output function $w_i(\cdot) \cdot x_i(\cdot)$ with $w_i(t)$ defined according to equation 3.2 separate points in $U_-$. Thus we fix some $u, v \in U$ with $u(t) \neq v(t)$ for some $t \leq 0$. According to the preceding proof for $DN^+$, there exists some $c \in [a, b]$ such that $H_c^+ u(0) \neq H_c^+ v(0)$. We want to show for this $c$ that there exists for any given $a', b'$ with $0 < a' < b'$ some $\tau_{rec} \in [a', b']$ so that the resulting filter defined by the synapse according to equation 3.2 can separate $u$ and $v$. More precisely, we show for the filter $G_c$ that is defined in analogy to equation 3.1 by
$$G_c u(t) := \rho \cdot \int_0^\infty u(t-\tau) \cdot e^{-\rho \int_{t-\tau}^{t} u(t')\, dt'} \cdot e^{-\tau/c}\, d\tau$$
(thus $H_c^+ u(t) = u(t) \cdot G_c u(t)$) that
$$u(0) \cdot G_c u(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} G_c u(t') \cdot u(t')\, dt'}\, d\tau \;\neq\; v(0) \cdot G_c v(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} G_c v(t') \cdot v(t')\, dt'}\, d\tau. \tag{3.4}$$
It is obvious that the function $h: \mathbb{R} \to \mathbb{R}$ defined by
$$h(\tau) := u(0) \cdot G_c u(0) \cdot e^{-\int_{-\tau}^{0} G_c u(t') \cdot u(t')\, dt'} - v(0) \cdot G_c v(0) \cdot e^{-\int_{-\tau}^{0} G_c v(t') \cdot v(t')\, dt'}$$
assumes a value $\neq 0$ for some $\tau \geq 0$, since $h(0) \neq 0$ by our choice of $c$. Hence
by the argument via the Laplace transform from the proof of theorem 1, there exists some $\tau_{rec} \in [a', b']$ (for any given $a', b' \in \mathbb{R}$ with $0 < a' < b'$) so that
$$\int_0^\infty e^{-\tau/\tau_{rec}} \cdot h(\tau)\, d\tau \neq 0,$$
which is equivalent to the desired inequality (see equation 3.4). Thus we have shown that the filters defined by the temporal dynamics of synapses in dynamic networks from the class $DN^{++}$ separate points in $U_-$. The rest of the proof is the same as for theorem 1.

3.3 Universal Approximation of Filters with Depressing Synapses Only.
We show in this section that the computational power of feedforward neural networks with dynamic synapses remains the same if the synapses exhibit just depression, not facilitation. This holds provided that the time constants $\tau_{rec}$ for their recovery from depression can be chosen individually from some interval $[a, b]$ (this holds for any values of $a, b$ with $0 < a < b$). This result is of interest since according to Tsodyks et al. (1998), all synapses between pyramidal neurons exhibit depression, not facilitation. We will employ the model from Tsodyks et al. (1998) for synaptic depression in a mean-field setting, which is specified in equation 3.3 of Tsodyks et al. (1998). We write $DN^-$ for the class of feedforward neural networks consisting of sigmoidal neurons (whose activation function $\sigma$ assumes nonnegative values only) with weights $w_i(t)$ evolving according to
$$w_i(t) = w_i \cdot U_{SE} \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{t-\tau}^{t} U_{SE} \cdot x_i(t')\, dt'}\, d\tau \tag{3.5}$$
depending on the presynaptic pool activity $x_i(t)$, where $U_{SE} > 0$ is some given constant. Note that $w_i(t) \cdot x_i(t)$ agrees with the term $A_{SE} \cdot \langle y(t) \rangle$ with $\langle y(t) \rangle$ defined by equation 3.3 in Tsodyks et al. (1998), which models the average value of the postsynaptic current caused in pool $P$ by the firing activity $x_i(t)$ in the pool $P_i$ in the case of depressing synapses between pools $P_i$ and $P$ (the parameter $A_{SE}$ is denoted by $w_i$ in our notation).

Theorem 3. Assume that $U$ is the class of functions from $\mathbb{R}$ into $[B_0, B_1]$ that satisfy $|u(t) - u(s)| \leq B_2 \cdot |t - s|$ for all $t, s \in \mathbb{R}$, where $B_0, B_1, B_2$ are arbitrary real-valued constants with $0 < B_0 < B_1$ and $0 < B_2$. Let $F$ be an arbitrary filter that maps vectors $u = \langle u_1, \ldots, u_n \rangle \in U^n$ into functions from $\mathbb{R}$ into $\mathbb{R}$. Then the following are equivalent:
(a) $F$ can be approximated by dynamic networks $S \in DN^-$ (i.e., for any $\varepsilon > 0$ there exists some $S \in DN^-$ such that $|Fu(t) - Su(t)| < \varepsilon$ for all $u \in U^n$ and all $t \in \mathbb{R}$).
(b) $F$ can be approximated by dynamic networks $S \in DN^-$ with just a single layer of sigmoidal neurons.

(c) $F$ is time invariant and has fading memory.

(d) $F$ can be approximated by a sequence of (finite or infinite) Volterra series.

Proof. It is obvious that all filters defined by dynamic networks from the class $DN^-$ are time invariant and have fading memory. Hence it suffices to show how the proof of (c) $\Rightarrow$ (b) has to be changed in comparison with the proof of theorem 1. Assume that parameters $a, b, U_{SE} \in \mathbb{R}$ with $0 < a < b$ and $0 < U_{SE}$ have been fixed in some arbitrary manner. We have to show that for any two functions $u, v \in U$ with $u(t_0) > v(t_0)$ for some $t_0 \leq 0$, there exists some $\tau_{rec} \in [a, b]$ so that the filter that models synaptic dynamics according to equation 3.5 differs at time 0 for the two input functions $u, v$ (instead of $x_i$); we have to show that
$$u(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot u(t')\, dt'}\, d\tau \;\neq\; v(0) \cdot \int_0^\infty e^{-\tau/\tau_{rec}} \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot v(t')\, dt'}\, d\tau.$$
According to our argument with the Laplace transform in the proof of theorem 1, it suffices to show that $h(\tau) \neq 0$ for some $\tau \geq 0$, where $h: \mathbb{R} \to \mathbb{R}$ is the function defined by
$$h(\tau) := u(0) \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot u(t')\, dt'} - v(0) \cdot e^{-\int_{-\tau}^{0} U_{SE} \cdot v(t')\, dt'}.$$
If $u(0) \neq v(0)$, this is obvious, since then $h(0) \neq 0$. Hence we assume that $u(0) = v(0)$. Furthermore, we assume for a contradiction that $h(\tau) = 0$ for all $\tau \geq 0$. Set
$$t_0^+ := \inf\{t > t_0 : u(t) \leq v(t)\}.$$
Then we have $t_0 < t_0^+ \leq 0$, $\int_{t_0^+}^{0} u(t')\, dt' = \int_{t_0^+}^{0} v(t')\, dt'$, and $\int_{t_0}^{0} u(t')\, dt' = \int_{t_0}^{0} v(t')\, dt'$. This yields a contradiction to the fact that $u(t') > v(t')$ for all $t' \in [t_0, t_0^+)$, and hence $\int_{t_0}^{t_0^+} u(t')\, dt' > \int_{t_0}^{t_0^+} v(t')\, dt'$.

3.4 Focusing on Excitatory Synapses. In the preceding dynamic network models, we had assumed that the constant factors $w_i$ could be chosen to be positive or negative, thus yielding excitatory or inhibitory synapses in a biological interpretation. This formal symmetry between excitatory and inhibitory synapses is not adequate for most biological neural systems, for example, the cortex of primates, where just 15% of the synapses are inhibitory. We would like to point out that according to Maass (in press) one can replace the dynamic networks considered in the preceding sections
by an alternative type of network where just the dynamics of excitatory synapses matters; these dynamics can be just depressing, just facilitating, or depressing and facilitating, as in the preceding sections. The key observation is that instead of approximating the given polynomial $p$ by a weighted sum of sigmoidal neurons in the proof of theorem 1 (and analogously in the subsequent theorems), one can approximate $p$ by a single soft winner-take-all module applied to several weighted sums of the filters $H_{c_1}, \ldots, H_{c_m}$ with nonnegative weights $w_i$ only.12 The resulting network structure corresponds to a biological neural system where the filters $H_{c_1}, \ldots, H_{c_m}$ are realized exclusively by excitatory dynamic synapses, and the role of inhibitory synapses is restricted to the realization of the subsequent soft winner-take-all module. We refer to Maass (in press) for details of this alternative style of network construction.

3.5 Allowing the Input Functions to Vanish. We had assumed in theorem 1 that all input functions $u_i$ satisfy $u_i(t) \geq B_0$ for all $t \in \mathbb{R}$, where $B_0 > 0$ is some arbitrary constant. This assumption is usually met in the sketched application to biological neural systems, because the minimum firing rate of neurons is larger than 0 (typically in the range of 5 Hz). Alternatively one can assume that all input functions $u_i$ are superimposed with some positive constant input (which could be interpreted as background firing activity). The following result shows that from a mathematical point of view, the assumption $B_0 > 0$ is necessary at least in the case of single-layer networks, since in the case $B_0 = 0$ a strictly smaller class of filters is approximated by dynamic networks with a single layer of sigmoidal neurons. Theorem 4 gives a precise characterization of this smaller class of filters.
Theorem 4. Assume that $U$ is the class of functions $u$ in $C(\mathbb{R}, [0, B_1])$ that satisfy $|u(t) - u(s)| \leq B_2 \cdot |t - s|$ for all $t, s \in \mathbb{R}$, where $B_1$ and $B_2$ are arbitrary real-valued constants with $0 < B_1$ and $0 < B_2$. Let $F$ be an arbitrary filter that maps functions from $U$ into $\mathbb{R}^{\mathbb{R}}$. We consider here only the version of dynamic networks giving rise to filters in $DN$. Then $F$ can be approximated by dynamic networks with a single layer of sigmoidal gates if and only if $F$ is time invariant, has fading memory, and there exists a constant $c_F$ such that $Fu(t) = c_F$ for all $u \in U$ and $t \in \mathbb{R}$ with $u(t) = 0$.
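The constancy condition of theorem 4 can be observed directly in a toy network. In the sketch below (our own construction; the filter form, weights, and inputs are arbitrary assumptions), each DN-style filter outputs $u(t) \cdot w(t)$, so at a time where $u(t) = 0$ every filter output vanishes and a single sigmoidal gate necessarily outputs $\sigma(w_0)$, whatever the input's past was.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def H(u, rho, c, dt):
    """A DN-style synaptic filter: u(t) * w(t) with
    w(t) = 1 + rho * int_0^inf u(t - tau) e^(-tau/c) dtau."""
    n, acc = len(u), 0.0
    for k in range(n):
        acc += u[n - 1 - k] * math.exp(-(k * dt) / c) * dt
    return u[-1] * (1.0 + rho * acc)

def gate(u, weights, w0, rho, cs, dt):
    """One sigmoidal gate applied to several synaptic filters."""
    return sigmoid(w0 + sum(w * H(u, rho, c, dt) for w, c in zip(weights, cs)))

dt, w0 = 0.01, 0.3
weights, cs = [1.5, -0.7], [0.2, 2.0]
# two very different histories, both ending with u(t) = 0
u1 = [5.0] * 99 + [0.0]
u2 = [0.5] * 99 + [0.0]
g1 = gate(u1, weights, w0, rho=0.8, cs=cs, dt=dt)
g2 = gate(u2, weights, w0, rho=0.8, cs=cs, dt=dt)
print(g1, g2, sigmoid(w0))
```

Both outputs equal $\sigma(w_0)$ exactly, which is why a single-layer network can only approximate filters that are constant on inputs vanishing at the current time.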
Proof of Theorem 4. The form of the filters defining the class $DN$ implies that when $u(t) = 0$, all filters output the value 0 at the given time $t$, and hence the sigmoidal gate outputs the value $G(t) = \sigma(w_0)$, irrespective of the values of $u$ at other times. It is easy to see that all filters approximated by
12 If one prefers, one can replace the nonnegative weighted sums of the filters $H_{c_1}, \ldots, H_{c_m}$ in this alternative approximation result by sigmoidal neurons applied to nonnegative weighted sums of the filters $H_{c_1}, \ldots, H_{c_m}$.
such networks must also have the same property. The converse implication is established in almost exactly the same manner as in the proof of theorem 1. The only difference is as follows. It could be that $u(0) = v(0) = 0$, in which case our argument fails to provide a separating filter. However, this separation is not needed if we only need to approximate filters that are constant on the set of inputs with zero value at $t = 0$. This follows from the following lemma, which is a small variation of the Stone-Weierstrass theorem.

Lemma 2. Given a compact Hausdorff topological space $U$ and a closed subset $S$ of $U$, we say that a function $f: U \to \mathbb{R}$ is $S$-constant if the restriction of $f$ to $S$ is a constant function. We say that a class $\mathcal{F}$ of real-valued functions on $U$ is $S$-separating if, for each $u, v \in U$, $u \neq v$, such that not both of $u$ and $v$ are in $S$, there is some $f \in \mathcal{F}$ such that $f(u) \neq f(v)$. Suppose that $\mathcal{F}$ is a class consisting of continuous and $S$-constant functions that is $S$-separating. Then polynomials in elements of $\mathcal{F}$ approximate every $S$-constant continuous function $U \to \mathbb{R}$.
This lemma is proved as follows. We consider the quotient space $U_S := U / S$, where we collapse all points of $S$ to one point, endowed with the quotient topology (its open sets are those open sets $V$ in $U$ for which $S \cap V \neq \emptyset$ implies $S \subseteq V$). The topological space $U_S$ is compact, because the canonical projection onto the quotient is continuous, and $U$ was assumed to be compact. In addition, $U_S$ is a Hausdorff space, as follows from the fact that for each $x \notin S$, there are disjoint neighborhoods of $x$ and $S$ (since $S$ is compact). The lemma is established by noticing that continuous $S$-constant functions induce continuous functions on $U_S$, so that we may apply the standard Stone-Weierstrass theorem to this quotient space.

Now the theorem follows, using as $S$ the set consisting of all inputs $u$ so that $u(0) = 0$. The $S$-separation property is established just as in the proof of theorem 1; we omit the routine details.

3.6 Combining Synaptic Dynamics with Membrane Dynamics. One other source of temporal dynamics in biological neural systems is the dynamics of the membrane potential of neurons. Hence it is of interest to consider a variation of our notion of a dynamic network where a function $x_i(t)$ is processed at a connection of the network first by a filter $H$ that maps $x_i(\cdot)$ onto $w_i(\cdot) \cdot x_i(\cdot)$ (modeling synapse dynamics) and then by another filter $G$ modeling membrane dynamics of the receiving neurons. Since each single excitatory postsynaptic potential and inhibitory postsynaptic potential can be fitted very well by a function of the form $b_1 e^{-t/c_1} - b_2 e^{-t/c_2}$, it appears to be justified to model membrane dynamics in the context of our model for population coding with pools of neurons by first-order linear filters $G$ with an impulse response $g(t)$ consisting of a weighted sum of functions
of the form $e^{-t/c}$. In the resulting variation of our notion of a dynamic network, one replaces the filters $H$ that model synaptic dynamics according to definition 1 at the connections of the network by compositions $G \circ H$ with such linear filters $G$. All preceding results remain valid for this variation of the network model. In order to approximate arbitrary given time-invariant filters with fading memory by such networks, one just has to show that for any two functions $u, v \in U$ with $u(t) \neq v(t)$ for some $t \leq 0$ there exist filters $H, G$ of the desired type such that $(G \circ H) u(0) \neq (G \circ H) v(0)$. This holds even for $u, v \in C(\mathbb{R}, [B_0, B_1])$ with $B_0 = 0$, since we have to find a filter $H$ modeling synaptic dynamics according to definition 1 so that $Hu(t) \neq Hv(t)$ for some $t \leq 0$ (hence we can allow that $u(0) = v(0) = 0$). We then can apply to the functions $Hu(t)$ and $Hv(t)$ for $t \leq 0$ the argument from the proof of theorem 1 to find a linear filter $G$ with impulse response $e^{-t/c}$ so that $(G \circ H) u(0) \neq (G \circ H) v(0)$. The same argument shows that theoretically the same class of filters can be approximated by dynamic networks if one relies on only linear filters $G$ modeling membrane dynamics.

4 Computations on Spatiotemporal Patterns
A closer look shows that many temporal patterns that are relevant for biological neural systems are not just temporal but spatiotemporal patterns. For example, in auditory systems, the additional spatial dimension parameterizes different frequency bands. These are represented by spatial locations of the inner hair cells on the cochlea and corresponding spatial maps in higher parts of the auditory system. In visual systems, it is obvious that the analysis of moving objects (and/or the stabilization of visual input in spite of body movements of the receiving organism) requires the processing of complex spatiotemporal patterns. In this context two spatial dimensions correspond directly to retina locations. But one should note that other "spatial" dimensions in the subsequent definition 3 need not correspond to spatial locations in the outer world (or on the retina), but can also correspond to scales in a more abstract feature space, for example, to spatial frequency or phase. Therefore, we will consider spatiotemporal patterns with an arbitrary finite number $d \in \mathbb{N}$ of spatial dimensions.

The transformation and classification of complex spatiotemporal patterns appear to be relevant also for higher cortical areas, since recordings from larger populations of neurons via voltage-sensitive dyes or MEG, EEG, and so forth suggest that sensory input is encoded in spatiotemporal activation patterns of associated cortical neural systems. These spatiotemporal activation patterns provide the input to higher cortical areas. Also the output of various systems in the motor cortex can be viewed as spatiotemporal patterns (Georgopoulos, 1995). Hence, one may argue that also higher cortical areas carry out computations on spatiotemporal patterns. We will show that one can extend our analysis of computations on temporal patterns to an analysis of computations on spatiotemporal patterns.
For that purpose, we introduce a suitable extension of our definition of a dynamic network that allows for $d$ spatial dimensions in the input functions $u$. We define a spatial dynamic network and the corresponding classes $SDN$ and $SDN^*$ of filters as a variation of definition 2 of a dynamic network.

Definition 3. Fix some arbitrary $d \in \mathbb{N}$. A spatial dynamic network with $d$ spatial input dimensions (in addition to the time dimension) assigns to vectors $u$ of $n$ input functions $u: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ an output function $z: \mathbb{R} \to \mathbb{R}$. The only difference to the preceding definition of a dynamic network is that now there exists for each network a finite set $X$ of vectors $x \in \mathbb{R}^d$ so that the actual input to the network consists of functions of the form $u(x, \cdot)$ for $x \in X$.
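The sampling step in definition 3 can be sketched as follows (a hypothetical example with $d = 1$; the input surface, the sample set $X$, and the exponential filter standing in for the synaptic dynamics are our own choices):

```python
import math

X = [-1.0, 0.0, 1.0]  # the fixed finite set of spatial sample points

def u(x, t):
    """An example spatiotemporal input surface u: R^d x R -> R, d = 1."""
    return 1.0 + 0.5 * math.sin(x + 2.0 * t)

def exp_filter(trace, c, dt):
    """int_0^inf trace(t - tau) e^(-tau/c) dtau by Riemann sum."""
    n, acc = len(trace), 0.0
    for k in range(n):
        acc += trace[n - 1 - k] * math.exp(-(k * dt) / c) * dt
    return acc

# the network never sees u(x, t) as a whole: it only processes the
# temporal traces u(x, .) for the finitely many x in X
dt, steps = 0.01, 300
traces = {x: [u(x, k * dt) for k in range(steps)] for x in X}
features = [exp_filter(traces[x], c=0.5, dt=dt) for x in X]
print(features)
```

These per-location filter outputs would then feed the sigmoidal layers of an ordinary dynamic network.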
According to this definition, any spatial dynamic network samples the input functions $u: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ just at a fixed finite set $X$ of points $x$. Nevertheless we will show in theorem 5 that these networks can approximate a very large class of filters on functions $u: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$.

The notion of a Volterra series (see remark 1) can be readily extended to input functions $u: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ (again we assume that $u$ is measurable and essentially bounded). In this case a $k$th-order Volterra term is of the form
$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \int_0^{\infty} \cdots \int_0^{\infty} u(x_1, \ldots, x_d, t - \tau_1) \cdot \ldots \cdot u(x_1, \ldots, x_d, t - \tau_k) \cdot h(x_1, \ldots, x_d, \tau_1, \ldots, \tau_k)\, dx_1 \ldots dx_d\, d\tau_1 \ldots d\tau_k \tag{4.1}$$
for some function $h \in L_1$. Analogously as before we refer to a series consisting of finitely many such terms as a Volterra polynomial or finite Volterra series.

Theorem 5. Let $U$ be the class of functions in $C(\mathbb{R}^d \times \mathbb{R}, [B_0, B_1])$ that satisfy $|u(x, t) - u(\tilde{x}, \tilde{t})| \leq B_2 \cdot \|\langle x, t \rangle - \langle \tilde{x}, \tilde{t} \rangle\|$ for all $\langle x, t \rangle, \langle \tilde{x}, \tilde{t} \rangle \in \mathbb{R}^d \times \mathbb{R}$, where $d \in \mathbb{N}$ and $B_0, B_1, B_2$ are arbitrary real-valued constants with $0 < B_0 < B_1$ and $0 < B_2$. Then the following holds for any filter $F: U^n \to \mathbb{R}^{\mathbb{R}}$: $F$ can be approximated by spatial dynamic networks (i.e., for any $\varepsilon > 0$ there exists some $S \in SDN$ such that $|Fu(t) - Su(t)| < \varepsilon$ for all $u \in U^n$ and $t \in \mathbb{R}$) if and only if $F$ can be approximated by a sequence of (finite or infinite) Volterra series. The claim remains valid if $SDN$ is replaced by $SDN^*$.
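For concreteness, a first-order ($k = 1$, $d = 1$) instance of the Volterra term 4.1 can be approximated on a finite grid. The input surface and the kernel below are illustrative assumptions, not taken from the article:

```python
import math

def u(x, t):
    """Example input surface, bounded away from 0 as required for U."""
    return 1.0 + 0.5 * math.cos(x - t)

def h(x, tau):
    """A separable kernel in L1 over space and (nonnegative) time lags."""
    return math.exp(-abs(x)) * math.exp(-tau)

def volterra_k1(t, dx=0.05, dtau=0.05, x_max=5.0, tau_max=8.0):
    """Grid approximation of
       int_{-inf}^{inf} int_0^{inf} u(x, t - tau) h(x, tau) dtau dx."""
    total, x = 0.0, -x_max
    while x < x_max:
        tau = 0.0
        while tau < tau_max:
            total += u(x, t - tau) * h(x, tau) * dx * dtau
            tau += dtau
        x += dx
    return total

v = volterra_k1(0.0)
print(v)
```

Since the kernel is integrable and $u$ is bounded, the term is finite; higher-order terms simply multiply $k$ delayed copies of $u$ under the same kind of integral.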
Proof. We show that the two alternative conditions on $F$ in the claim of the theorem are equivalent by proving that both conditions are equivalent to a third condition: the condition that $F$ is time invariant and has fading
memory, for the straightforward extension of this notion to filters $F: U^n \to \mathbb{R}^{\mathbb{R}}$, where $U$ is now a class of functions from $\mathbb{R}^d \times \mathbb{R}$ into $\mathbb{R}$. We say that such a filter $F$ has fading memory if for every $v = \langle v_1, \ldots, v_n \rangle \in U^n$ and every $\varepsilon > 0$ there exist $\delta > 0$, $T > 0$, and $K > 0$ so that $|Fv(0) - Fu(0)| < \varepsilon$ for all $u = \langle u_1, \ldots, u_n \rangle \in U^n$ with the property that $\|v(x, t) - u(x, t)\| < \delta$ for all $t \in [-T, 0]$ and all $x \in [-K, K]^d$. This condition implies "fading influence" of $u(x, t)$ for arguments $\langle x, t \rangle$ where $|t|$ or $\|x\|$ are very large.

It is obvious that any $k$th-order Volterra term of the form 4.1 is time invariant and has fading memory. Furthermore this property is preserved by taking sums and limits (analogously as in lemma 2). Hence also for filters with inputs $u: \mathbb{R}^d \times \mathbb{R} \to \mathbb{R}$ we have that any filter that can be approximated by a sequence of finite or infinite Volterra series is time invariant and has fading memory. On the other hand one can extend the proof of theorem 1 in Boyd and Chua (1985) in a straightforward manner to show that any time-invariant filter that has fading memory and receives inputs from $U^n$ (for a class $U$ with the properties as in the claim of the theorem) can be approximated arbitrarily closely by Volterra polynomials. For this extension of the argument of Boyd and Chua (1985), one just has to verify that this class $U$ is compact with regard to the topology generated by the neighborhoods $\{u \in U^n : \|v(x, t) - u(x, t)\| < \varepsilon \text{ for all } t \in [-T, 0] \text{ and all } x \in [-K, K]^d\}$ for arbitrary $v \in U^n$ and $\varepsilon, T, K > 0$.

It is clear that any spatial dynamic network according to definition 3 is time invariant and has fading memory. Thus it remains only to show that any time-invariant filter $F: U^n \to \mathbb{R}^{\mathbb{R}}$ with fading memory can be approximated arbitrarily closely by spatial dynamic networks.
In order to extend the proof of theorem 1 in Boyd and Chua (1985) to this case, one just has to observe that the proof of theorem 1 implies that for any two functions $u, v \in U$ with $u(x, t) \neq v(x, t)$ for some $t \leq 0$ and $x \in \mathbb{R}^d$ there exists some $x \in \mathbb{R}^d$ and some filter $H$ modeling synaptic dynamics as in definition 2 which satisfies $Hu(x, \cdot)(0) \neq Hv(x, \cdot)(0)$. Thus we have shown that a filter $F: U^n \to \mathbb{R}^{\mathbb{R}}$ can be approximated by spatial dynamic networks if and only if $F$ is time invariant and has fading memory. This completes the proof of theorem 5.

Remark 4.
1. Analogous versions of corollary 1, remark 3, and theorems 2 through 4 also hold for the framework of computations on spatiotemporal patterns considered in theorem 5.

2. If one considers a system consisting of many spatial dynamic networks that provide separate outputs for different spatial output locations, one also can produce spatiotemporal patterns in the output of such systems. Theorem 5 implies that exactly those maps $A$ from spatiotemporal patterns to spatiotemporal patterns can be realized by such systems where the restriction of the output of $A$ to any fixed output location is a time-invariant filter $F$ with fading memory.
5 Conclusion
We have analyzed the power of feedforward models for biological neural systems for computations on temporal patterns and spatiotemporal patterns. We have identified the class of filters that can be approximated by such models and shown that this result is quite stable with regard to changes in the model. In particular, we have shown that all filters that can be approximated by Volterra series can be approximated by models consisting of a single layer of dynamic synapses and neurons. Furthermore, the class of filters that can be approximated does not change if one considers feedforward networks with arbitrarily many layers. In addition, the filters in this class are characterized by a very simple property (time invariance and fading memory) that is in general easy to check for a concrete filter. This class of filters is very rich. In fact, one might argue that any filter that is possibly useful for a function of a biological organism belongs to this class.

Since we have included in our analysis the case where several temporal patterns are presented simultaneously as input to a neural system, our approach also provides a new foundation for analyzing computations that respond in a particular manner to temporal correlations in the firing activity of different pools of neurons. We show that any such computation that can be described by time-invariant filters with fading memory (which is, for example, the case for most conjectured computations involving binding of features belonging to the same object via temporal correlations) can in principle be carried out by a feedforward neural system.

So far the analysis of the possible functional role of short-term synaptic dynamics has found the most convincing computational role for synaptic depression. Our results here point to a possible computational role for the other important dynamical mechanism in biological synapses: facilitation.
We show that via facilitation, models for neural systems gain the power to approximate any filter in the very large class of linear and nonlinear filters described above. Furthermore, we have shown that this possible function of facilitation does not interfere with any other computational role of synaptic depression, since we have shown that for any fixed depression mechanism, one can find parameters for the synaptic facilitation mechanisms that allow the approximation of arbitrary filters from this class. Apart from this, we have also shown that the same very rich class of linear and nonlinear filters can be approximated by models for neural systems whose synapses just exhibit depression, not facilitation.

The view of neural systems as computational models for computations on temporal patterns or spatiotemporal patterns, rather than on static vectors of numbers, is likely to have significant consequences for the analysis of learning in neural systems. It suggests that learning should be analyzed in the context of adaptive filters. Whereas we do not contribute directly to any learning result in this article, our results identify exactly the class of filters within which such filter adaptation would take place, and thereby prepare
the ground for a closer analysis of learning in neural systems from the point of view of adaptive filtering.

Finally we point out that our "universal approximation results" for computations on temporal and spatiotemporal patterns suggest a new complexity measure and a new parameterization for nonlinear filters in this domain, which may be more appropriate in the context of biological neural systems. We show that instead of measuring the complexity of a nonlinear filter $F$ by the degree of the Volterra polynomial or Wiener polynomial that is needed to approximate $F$ within a given $\varepsilon$, one can also measure the complexity of $F$ by the number of sigmoidal gates and dynamic synapses that are needed to approximate $F$ within $\varepsilon$. Our results show that both complexity hierarchies characterize the same class of linear and nonlinear filters. However, the latter measure is more adequate in the context of neural computation, because the approximation of a single sigmoidal gate requires a high-order Volterra polynomial for a good approximation. Hence, the order of the Volterra polynomial required to approximate a given nonlinear filter $F$ is in general a poor measure for the cost of implementing $F$ in neural hardware. On the other hand, the alternative complexity measure for filters $F$ that is suggested by our results is closely related to the cost of implementing $F$ in neural hardware.

In addition, our approach via formal models for dynamic networks provides a new parameterization for all filters that are approximable by Volterra series, in terms of parameters that control the architecture as well as the temporal dynamics and scale of synaptic dynamics. Such a parameterization is particularly of interest for the analysis of learning (if the goal is to learn a map from spatiotemporal to spatiotemporal patterns), especially since the parameters that occur in our new parameterization appear to be related to those parameters that are "plastic" in biological neural systems.
This article also prepares the ground for an investigation of the required complexity of models for neural systems for approximating specific filters that are of particular interest in this context. Preliminary computer simulation results (Natschläger, Maass, & Zador, 1999) suggest that in fact quite small instantiations of the dynamic network models considered in this article suffice to approximate specific quadratic filters. Other topics of current research are the role of noise in this context and the possible role of lateral and recurrent connections in the network (see Maass, 1999).

Acknowledgments
We thank Anthony M. Zador for bringing the problems addressed in this article to our attention. In addition we thank Larry Abbott and two anonymous referees for helpful comments. W. M. thanks the Sloan Foundation (USA), the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austria, project P12153, and the NeuroCOLT project of the EC for partial
1770
Wolfgang Maass and Eduardo D. Sontag
support, and the Computational Neurobiology Lab at the Salk Institute for its hospitality.

References

Abbott, L. F. (1994). Decoding neuronal firing and modelling neural networks. Quarterly Review of Biophysics, 27, 291–331.
Abbott, L. F., Varela, J. A., Sen, K., & Nelson, S. B. (1997). Synaptic depression and cortical gain control. Science, 275, 220–224.
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modeling. Neural Computation, 3, 375–385.
Boyd, S., & Chua, L. O. (1985). Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Trans. on Circuits and Systems, 32, 1150.
Dasgupta, B., & Sontag, E. D. (1996). Sample complexity for learning recurrent perceptron mappings. IEEE Trans. Inform. Theory, 42, 1479–1487.
Dieudonné, J. (1969). Foundations of modern analysis. New York: Academic Press.
Dobrunz, L., & Stevens, C. (1997). Heterogeneous release probabilities in hippocampal neurons. Neuron, 18, 995–1008.
Fliess, M. (1975). Un outil algébrique: les séries formelles non commutatives. In G. Marchesini (Ed.), Mathematical systems theory (pp. 122–148). New York: Springer-Verlag.
Folland, G. B. (1984). Real analysis. New York: Wiley.
Francis, G., Grossberg, S., & Mingolla, E. (1994). Cortical dynamics of feature binding and reset: Control of visual persistence. Vision Research, 34, 1089–1104.
Gallman, P. G., & Narendra, K. S. (1976). Representations of nonlinear systems via the Stone-Weierstrass theorem. Automatica, 12, 618–622.
Georgopoulos, A. P. (1995). Reaching: Coding in motor cortex. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Georgopoulos, A. P., Schwartz, A. B., & Kettner, R. E. (1986). Neuronal population coding of movement direction. Science, 233, 1416–1419.
Gerstner, W. (1999). Spiking neurons. In W. Maass & C. Bishop (Eds.), Pulsed neural networks (pp. 32–53). Cambridge, MA: MIT Press.
Grossberg, S. (1969). On the production and release of chemical transmitters and related topics in cellular control. J. Theor. Biol., 22, 325–364.
Grossberg, S. (1972). A neural theory of punishment and avoidance, II: Quantitative theory. Mathematical Biosciences, 15, 253–285.
Grossberg, S. (1984). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In R. Karrer, J. Cohen, & P. Tueting (Eds.), Brain and information: Event related potentials (pp. 58–142). New York: New York Academy of Science.
Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley.
Hewitt, E., & Stromberg, K. (1965). Real and abstract analysis. Berlin: Springer-Verlag.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. New York: Oxford University Press.
Koiran, P., & Sontag, E. D. (1998). Vapnik-Chervonenkis dimension of recurrent neural networks. Discrete Applied Math., 86, 63–79.
Leshno, M., Lin, V., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Maass, W. (in press). On the computational power of winner-take-all. Neural Computation.
Maass, W. (1999). On the role of noise and recurrent connections in neural networks with dynamic synapses. Unpublished manuscript.
Maass, W., & Natschläger, T. (in press). A model for fast analog computation based on unreliable synapses. Neural Computation.
Maass, W., & Zador, A. (1998). Dynamic stochastic synapses as computational units. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 194–200). Cambridge, MA: MIT Press.
Maass, W., & Zador, A. (1999). Computing and learning with dynamic synapses. In W. Maass & C. Bishop (Eds.), Pulsed neural networks. Cambridge, MA: MIT Press.
Marmarelis, P. Z., & Marmarelis, V. Z. (1978). Analysis of physiological systems: The white-noise approach. New York: Plenum Press.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Murthy, V., Sejnowski, T., & Stevens, C. (1997). Heterogeneous release properties of visualized individual hippocampal synapses. Neuron, 18, 599–612.
Natschläger, T., Maass, W., & Zador, A. (1999). Temporal processing with dynamic synapses (Tech. Rep.). Institute for Theoretical Computer Science, Technische Universitaet Graz.
Palm, G. (1978). On representation and approximation of nonlinear systems. Biol. Cybernetics, 31, 119–124.
Palm, G., & Poggio, T. (1977). The Volterra representation and the Wiener expansion: Validity and pitfalls. SIAM J. Appl. Math., 33(2), 195–216.
Poggio, T., & Reichardt, W. (1980). On the representation of multi-input systems: Computational properties of polynomial algorithms. Biol. Cybernetics, 37, 167–186.
Rieke, F., Warland, D., Bialek, W., & de Ruyter van Steveninck, R. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rugh, W. J. (1981). Nonlinear system theory. Baltimore: Johns Hopkins University Press.
Sandberg, I. W. (1991). Structure theorems for nonlinear systems. Multidimensional Systems and Signal Processing, 2, 267–286.
Schetzen, M. (1980). The Volterra and Wiener theories of nonlinear systems. New York: Wiley.
Sontag, E. D. (1997). Recurrent neural networks: Some systems-theoretic aspects. In M. Karny, K. Warwick, & V. Kurkova (Eds.), Dealing with complexity: A neural network approach (pp. 1–12). Berlin: Springer-Verlag.
Sontag, E. D. (1998). A learning result for continuous-time recurrent neural networks. Systems and Control Letters, 34, 151–158.
Sussmann, H. J. (1975). Semigroup representations, bilinear approximations of input-output maps, and generalized inputs. In G. Marchesini (Ed.), Mathematical systems theory (pp. 172–192). New York: Springer-Verlag.
Tsodyks, M., & Markram, H. (1997). The neural code between neocortical pyramidal neurons depends on neurotransmitter release probability. Proc. Natl. Acad. Sci., 94, 719–723.
Tsodyks, M., Pawelzik, K., & Markram, H. (1998). Neural networks with dynamic synapses. Neural Computation, 10, 821–835.
Varela, J. A., Sen, K., Gibson, J., Fost, J., Abbott, L. F., & Nelson, S. B. (1997). A quantitative description of short-term plasticity at excitatory synapses in layer 2/3 of rat primary visual cortex. Journal of Neuroscience, 17, 7926–7940.

Received October 19, 1998; accepted September 16, 1999.
NOTE
Communicated by Carl van Vreeswijk
Analytical Model for the Effects of Learning on Spike Count Distributions

Gianni Settanni
Alessandro Treves
SISSA, Programme in Neuroscience, Trieste, Italy

The spike count distribution observed when recording from a variety of neurons in many different conditions has a fairly stereotypical shape, with a single mode at zero or close to a low average count, and a long, quasi-exponential tail to high counts. Such a distribution has been suggested to be the direct result of three simple facts: the firing frequency of a typical cortical neuron is close to linear in the summed input current entering the soma, above a threshold; the input current varies on several timescales, both faster and slower than the window used to count spikes; and the input distribution at any timescale can be taken to be approximately normal. The third assumption is violated by associative learning, which generates correlations between the synaptic weight vector on the dendritic tree of a neuron and the input activity vectors it is repeatedly subject to. We show analytically that for a simple feedforward model, the normal distribution of the slow components of the input current becomes the sum of two quasi-normal terms. The term important below threshold shifts with learning, while the term important above threshold does not shift but grows in width. These deviations from the standard distribution may be observable in appropriate recording experiments.

1 Spike Counts and the SCF Model
The variability in the emission of action potentials by nerve cells may be characterized by several measures, one of which is the distribution of spike counts in a time window of fixed length. While other measures, such as the distribution of consecutive interspike intervals, are more direct descriptions of the irregularity in the firing, spike count distributions provide estimates of the entropy of neural codes, inasmuch as the code used by a particular neuron is thought to be expressed simply by its firing frequency. Because of this interpretation, the observation that spike count distributions often have quasi-exponential tails (Abeles, Vaadia, & Bergman, 1990; Barnes, McNaughton, Mizumori, Leonard, & Lin, 1990) has been linked to an optimal coding principle (Levy & Baxter, 1996).

Neural Computation 12, 1773–1787 (2000) © 2000 Massachusetts Institute of Technology

If the spike count is taken to be the symbol coding for the message represented
by the input arriving at the cell at any given time, and the mean count is constrained by a fixed metabolic budget, then optimal usage of symbols occurs, in the noiseless case, when their probabilities are exponentially distributed, that is, when they have maximal entropy under the constraint (Shannon, 1948). It is not clear, however, how this coding principle could apply to the more meaningful situation of noisy coding and, more important, how it could be compatible with deviations from the pure exponential shape often prominent at the low-count end of the distribution, that is, a nonzero mode. An alternative suggestion (Panzeri, Booth, Wakeman, Rolls, & Treves, 1996; Treves, Panzeri, Rolls, Booth, & Wakeman, 1999) is that the stereotypical shape does not reflect any design or optimization, but rather reflects precisely the lack of any principle capable of imparting significant structure to the distribution; that is, it represents a sort of null hypothesis against which any organizing principle could be tested. The suggestion is embodied in a crude model of the variability of the input current into the soma of a typical neuron, which assumes that (1) the input current translates linearly into a spike count (above a firing threshold, of course); (2) its variability has frequency components at several different timescales, both slower and faster (hence the name SCF model) than the time window used to count spikes; and (3) each component is normally distributed. The SCF model generates a formula for the spike count distribution with three free parameters: the position of the mean current relative to firing threshold and the standard deviations of the slow and fast components of its variability. The formula was found to fit adequately spike counts recorded from monkey inferior temporal cortex neurons responding to quasi-ecological stimulation with a video of natural images, counts that could not instead be fitted by the exponential or other simple models (Treves et al., 1999).
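The three ingredients of the SCF picture can be made concrete with a toy Monte Carlo. This is a sketch of the idea, not the closed-form formula of Treves et al. (1999); all names and parameter values below are illustrative. A threshold-linear rate is driven by a current whose slow component is frozen within each counting window, the fast component is summarized by a second gaussian term, and spikes are drawn from the resulting rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def scf_counts(mean_minus_thr, sigma_slow, sigma_fast,
               window=0.5, trials=100_000):
    """Toy Monte Carlo of the SCF picture (illustrative parameters).

    mean_minus_thr : mean current relative to firing threshold (Hz)
    sigma_slow     : s.d. of the slow component, frozen within a window (Hz)
    sigma_fast     : s.d. of the window-averaged fast component (Hz)
    """
    current = (mean_minus_thr
               + rng.normal(0.0, sigma_slow, trials)
               + rng.normal(0.0, sigma_fast, trials))
    rate = np.maximum(current, 0.0)        # threshold-linear, gain = 1
    return rng.poisson(rate * window)      # spike count in the window

counts = scf_counts(-5.0, 10.0, 3.0)
# the histogram of `counts` has its mode at zero and a long tail to high counts
```

With the mean current below threshold, most windows yield zero spikes while the gaussian tail of the slow component produces the quasi-exponential tail of high counts.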
The SCF model was then found to provide satisfactory fits in other experiments, such as with primate hippocampal neurons responding to continuously changing views of the monkey's environment (Panzeri, Rolls, Treves, Robertson, & Georges-François, 1997) or with a class of rat somatosensory neurons spontaneously active under anesthesia (Irina Erchova, pers. comm.). This raises the issue of whether it is at all possible to identify situations in which a well-defined principle or process does affect firing statistics, and its effects can be quantitatively demonstrated in the spike count distributions as deviations from the null hypothesis expectation. We consider here a model of associative learning of discrete input vectors by a (feedforward) network comprising a single output neuron. While the SCF model neglects correlations between afferent input patterns and synaptic weights, the learning process considered here produces precisely such correlations, and we have calculated analytically the resulting changes in spike count distributions. The possibility of observing these changes in experiments is discussed briefly at the end.
2 An SCF Model Refined by Learning
The correlations induced by associative learning could affect any frequency component in the variability of the input current. To stick to a simple model, we consider only the case in which they affect solely the low-frequency components. This corresponds to the simplified scenario in which the slow variability in the input (and output) of our neuron codes for meaningful, slow-varying stimuli, while fast fluctuations reflect only noise. We thus take fast fluctuations to be normally distributed, as in the standard SCF model, and in fact do not include them, leaving them to be added at the end, after deriving the distribution of the slow components. This should not be taken to imply that fast variability is negligible (it is not; see Treves et al., 1999), but the analytical convenience of concentrating on slow variability warrants the minor imprecision in the transparent expression obtained at the end. The single output unit in the model receives N inputs, through synaptic weights J that are modified by a learning rule modeling associative plasticity. Learning is one shot, in that the weights are taken to have been modified by a single presentation of each of p input patterns. The learning rule includes balanced potentiation and depression components, so that the net average change of each weight is zero. The variance in the value of each weight is also taken to remain constant. The input patterns are uncorrelated among themselves and with the preexisting synaptic weight vector. Under these conditions, the distribution of the summed input current over all novel input pattern vectors remains the same (normal in the N → ∞ limit) after learning. What changes, and what we are going to compute, is the distribution of the input current over the p familiar input vectors, those that have been learned. The output distribution is calculated with the mean-field analysis detailed in the appendix.
For the sake of clarity, we try to keep the notation consistent with Treves (1995), where a similar calculation was reported. Storage and retrieval (S and R) here refer to the first presentation of one of the p input patterns, and to the subsequent presentation that generates the output distribution we aim for. g is the input vector during storage. It represents firing rates computed over a time window of, say, a few hundred milliseconds, and may be measured in Hz. We take each of its N components to be distributed independently,

\[ P(g) = \prod_i P_g(g_i), \tag{2.1} \]
according to some distribution P_g, which for consistency should itself be of the stereotypical form mentioned above. One result we find, though, is that the precise form of P_g is not critical. V is the input vector during retrieval. The stimulus that has been previously learned is taken to have been reproduced with some added gaussian
noise δ (with zero mean and variance σ_δ²), followed by rectification:¹

\[ V_i = [g_i + \delta_i]_+ . \tag{2.2} \]
Therefore

\[ P(V \mid g) = \prod_i \left[ \delta(V_i)\, \Phi(-g_i/\sigma_\delta) + \frac{H(V_i)}{\sqrt{2\pi}\,\sigma_\delta} \exp\!\left( -\frac{(V_i - g_i)^2}{2\sigma_\delta^2} \right) \right], \tag{2.3} \]
where δ(x) is the Dirac delta function, H(x) is the Heaviside function, and Φ(x) is the normalized probability integral, that is, the integral of the normal distribution of unit variance: \( \Phi(x) = \int_{-\infty}^{x} \frac{dx'}{\sqrt{2\pi}}\, e^{-x'^2/2} \). The first term in each factor of the product represents the probability that input unit i be below threshold, and the second term that it be above threshold. The noise level σ_δ parameterizes the variability in the firing frequency of input units, measured over a trial of, say, a few hundred milliseconds, among trials with the same stimulus. It is thus a measure of slow components in the noise, which could again be taken to be similar, like the frequency distribution, between input and output units, and it could in principle be evaluated experimentally. Note that the distributions P(g) and P(V) need not be assumed to be identical, even if both take the general stereotypical form; the addition of the noise term δ, which induces a difference between the two, could be thought to reflect the altered attentional state, for example, of successive presentations with respect to the first presentation of a novel stimulus. Z is the output during storage, resulting from the product between input and synaptic vectors followed by thresholding and rectification. Gaussian noise ε^S with zero mean and variance σ_ε² is also added to the output,

\[ Z = \left[ J^S \cdot g + Z_0 + \varepsilon^S \right]_+ . \tag{2.4} \]
The threshold (−Z_0) may lump together a bona fide current threshold, inhibitory terms due to nonselective effects of interneurons and competition among output cells, and the baseline mean value of the product J^S · g, taken to be constant. The gain of the threshold-linear transfer function has been set to 1 by rescaling synaptic weights to pure numbers of order 1/N, so that Z, ε^S, and Z_0, like g, may be measured in Hz. The output distribution during storage is
\[ P(Z \mid g) = \delta(Z)\, \Phi\!\left( -\frac{J^S \cdot g + Z_0}{\sigma_\varepsilon} \right) + \frac{H(Z)}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\!\left( -\frac{(Z - J^S \cdot g - Z_0)^2}{2\sigma_\varepsilon^2} \right), \tag{2.5} \]

¹ [x]_+ = x for x > 0 and 0 otherwise.
where the rectification has been applied directly, without first adding fast fluctuations. This omission, which is carried over in the synaptic modification terms below, is deliberate; it simplifies analytically what was already a rather crude model. u is the steady component of the summed input current to the neuron during retrieval. It is convenient to calculate its distribution before adding fast noise and rectification, after which operations one would have the actual output during retrieval, U. If we take u to include slow noise with the same variance σ_ε², it will follow the conditional probability density

\[ P(u \mid Z, V, g) = \frac{1}{\sqrt{2\pi}\,\sigma_\varepsilon} \exp\!\left( -\frac{(u - J^R \cdot V - U_0)^2}{2\sigma_\varepsilon^2} \right), \tag{2.6} \]
which we have to average appropriately in order to find the target distribution. J^S and J^R are the weight vectors during storage and retrieval. With respect to the pattern being considered, they are assumed to have random components, except for a term, present in J^R but absent in J^S, that reflects the storage of the pattern and produces the effect on the spike count distribution, during retrieval, which is the aim of the calculation. Each component J_i^S is then taken to have zero mean (any baseline value can be incorporated into the constant threshold −Z_0) and fixed variance σ_J² = 1/N (so that the scalar product J^S · g is of order 1). Apart from the relevant pattern, the components of the new vector J_i^R also reflect the storage of many other patterns that intervened between storage and retrieval. A stable regime, in which storing new patterns and gradually forgetting old ones does not alter the mean strength and variance of the vector, can be described (after Treves, 1995) by a pseudo-rotation from the random direction J^S into a new random direction J^N,

\[ J_i^R = \cos(\theta)\, J_i^S + \eta \sqrt{\frac{c}{N}}\, (g_i - \bar{g})(Z - Z_0) + \sin(\theta)\, J_i^N , \tag{2.7} \]
where J_i^S is multiplied by a factor cos(θ) that reduces its relative importance with time, parameterized by θ. J_i^N, again of zero mean and variance σ_J² = 1/N, is multiplied by sin(θ), which grows with the successive learning of new patterns. The relation of θ to real time need not be detailed, except that obviously θ = 0 implies that no other pattern has modified the weights between the storage and retrieval of the one being considered, while θ = π/2 implies complete oblivion of the original weights. The associative modification term is multiplied by a normalizing factor η√(c/N), designed to ensure that the variance of the modification term
is set to c σ_J². Since σ_J² = 1/N, one needs to adjust 1/η² to the value of the variance of the g and Z factors. In this way the plasticity c can be taken to be the average proportion of the synaptic weight variance accounted for by a single learned pattern;² if all that J^R encodes are p patterns, with equal strength, c = 1/p. The quantity calculated, in the large-N limit, is the distribution ⟨P(u)⟩ of the static or low-frequency component of the current into the output neuron. The brackets indicate averaging over all possible values of the weight vectors. To convert u into an output spike count, one would need to consider additional high-frequency components of the noise (as explained by Treves et al., 1999), to multiply by a gain (which, if different among patterns, cannot be taken to equal 1), and to discretize the resulting output frequency U into a number of spikes emitted. These steps are needed when analyzing experimental results, but since they could be handled in a number of alternative ways, they are beyond the scope of this article, which focuses on how ⟨P(u)⟩ differs from a normal distribution.

3 Result and Parameters
The calculation reported in the appendix yields the expression

\[ \langle P(u) \rangle = \frac{1}{\sqrt{2\pi \det T}} \left\{ \Phi\!\left( -\frac{\beta_1 (u - U_0 + Z_0\gamma) + \beta_0 Z_0}{\sqrt{\beta_0}} \right) \frac{1}{\sqrt{\beta_0}} \exp\!\left( -\frac{(u - U_0 + Z_0\gamma)^2}{2\beta_0 \det T} \right) \right. \]
\[ \left. {}+ \Phi\!\left( \frac{(\beta_1 + \beta_2\gamma)(u - U_0) + (\beta_0 + 2\beta_1\gamma + \beta_2\gamma^2) Z_0}{\sqrt{\beta_0 + 2\beta_1\gamma + \beta_2\gamma^2}} \right) \frac{1}{\sqrt{\beta_0 + 2\beta_1\gamma + \beta_2\gamma^2}} \exp\!\left( -\frac{(u - U_0)^2}{2(\beta_0 + 2\beta_1\gamma + \beta_2\gamma^2) \det T} \right) \right\}, \tag{3.1} \]
where the matrix T and the matrix elements β₀, β₁, β₂ are

\[ T = \begin{pmatrix} \sigma_\varepsilon^2 + z_0 & \cos(\theta)\, w_0 \\ \cos(\theta)\, w_0 & \sigma_\varepsilon^2 + y_0 \end{pmatrix}, \tag{3.2} \]

\[ T^{-1} = \begin{pmatrix} \beta_0 & -\beta_1 \\ -\beta_1 & \beta_2 \end{pmatrix}, \tag{3.3} \]
² For consistency, a decay factor cos(θ) should express the gradual forgetting of this learned pattern, along with the others. We assume such a factor to be incorporated into the √c factor, purely to simplify the resulting formulas.
and the averages x_0, y_0, w_0, and z_0 are defined in the appendix. Learning is now parameterized by γ = η x_0 √(Nc), which, apart from the number η x_0 of order unity, can be seen to measure roughly the inverse square root of the storage load, that is, of the effective number of learned patterns 1/c divided by the number of inputs N. The expression is a sum of two terms: the first originating from cases in which the original unthresholded response f was below threshold and the second from those with f > 0.³ Each term is a gaussian modulated by a Φ(x) factor, which suppresses the gaussian for values of u not matching the corresponding f. Thus, the first term is more important for u values below threshold, and once the current is converted into a spike count, its detailed form will be largely uninfluential. The second term is more important above threshold and is more directly observable. The partition into two terms therefore stems from the thresholding (rectification) of the response f, which is the crucial assumption, while the detailed form of each term reflects all other ingredients of the model. In the limit of zero plasticity, c → 0 so that also γ → 0, both terms reduce to normal distributions of mean U_0 and variance β_0 det T ≡ σ_ε² + y_0, with the two prefactors adding up to unity. The model then reduces to a normal distribution of slow fluctuations, as in the standard SCF model, with their variance given by the sum of that of "slow noise", σ_ε², and of "signal" (y_0 can be seen as the product of the variance of synaptic weights and that of the activity of input units). The two terms depend on learning (i.e., on plasticity) in two simple but different ways. One should first realize what range of values is accessible for the parameter γ.
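Equation 3.1 is straightforward to evaluate numerically. The sketch below is a direct transcription of the expression as reconstructed here, with Φ computed from the error function; the β, Z_0, and det T values in the sanity check are illustrative. At γ = 0 the two Φ prefactors are exact complements, so the expression collapses to a normal density of mean U_0 and variance β_0 det T:

```python
import math

def Phi(x):
    """Normalized probability integral (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def P_u(u, U0, Z0, gamma, b0, b1, b2, detT):
    """Transcription of equation 3.1; all argument values are illustrative."""
    pref = 1.0 / math.sqrt(2.0 * math.pi * detT)
    v1 = u - U0 + Z0 * gamma                  # shifted argument, first term
    term1 = (Phi(-(b1 * v1 + b0 * Z0) / math.sqrt(b0)) / math.sqrt(b0)
             * math.exp(-v1 ** 2 / (2.0 * b0 * detT)))
    B = b0 + 2.0 * b1 * gamma + b2 * gamma ** 2   # broadened width, second term
    v2 = u - U0
    term2 = (Phi(((b1 + b2 * gamma) * v2 + B * Z0) / math.sqrt(B)) / math.sqrt(B)
             * math.exp(-v2 ** 2 / (2.0 * B * detT)))
    return pref * (term1 + term2)

# sanity check: gamma = 0 gives a normal density of mean U0, variance b0 * detT
U0, Z0, b0, b1, b2, detT = 0.0, -20.0, 1.0, 0.3, 1.2, 25.0
us = [i * 0.1 for i in range(-1000, 1001)]
total = sum(P_u(u, U0, Z0, 0.0, b0, b1, b2, detT) for u in us) * 0.1
```

The same function also makes the two learning effects discussed below easy to inspect: increasing γ shifts the first term by −Z_0 γ and inflates the width of the second term from β_0 to β_0 + 2β_1 γ + β_2 γ².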
The parameter c in principle can range from 0 (no plasticity) to 1 (the entire synaptic variance is due to the storage of the single memory pattern being examined); a meaningful set of values, though, is around 1/N, since p ≃ O(N) is the memory capacity of associative nets (Rolls & Treves, 1998). Given that, besides √(Nc), the other factors that determine γ reduce to a number of order unity, γ itself is nonnegative and can be considered to range from 0 to values of order 1. As γ increases from 0, the first gaussian has its mean shifted by an amount −Z_0 γ, that is, toward negative values if the mean output during the learning phase, Z_0, is positive, and to positive values in the (also common) case in which the distribution of responses to novel stimuli (ideally derived from the ⟨P(Z)⟩ distribution) corresponds to a negative mean of the slow component of the SCF current. The width of this first gaussian does not change. The modulating factor Φ contributes to making the former effect difficult to detect, as it further suppresses the gaussian peak for Z_0 > 0.
³ Integrating ⟨P(f, u)⟩ first over u and then over f, one can check that the normalization of equation 3.1 is correct.
The second gaussian has its mean unaffected by learning, while its width grows, roughly doubling in value when γ reaches values of order 1 (if the β factors, which are all positive, are also of similar magnitude). The modulating prefactor has a complex dependence on γ, but is in any case larger for Z_0 > 0. The distribution depends on two additional parameters: the noise on input variables, σ_δ, and the input distribution P_g(g), which theoretically has infinite degrees of freedom. However, it depends on both in a rather mild way, only through the averages x_0, y_0, w_0, and z_0. In fact, it can be seen that for σ_δ → 0, that is, negligible input noise, the specific form of P_g becomes irrelevant, and only its first two moments matter. For the sake of simplicity, we use a binary P_g in the graphs, but any other choice gives similar results. The binary P_g reduces to a probability 1 − a for the input activity to be 0, and a probability a to be at the arbitrary level of 50 Hz; a parameterizes the sparseness of the activity (Treves & Rolls, 1991) in that it gives, for such a binary distribution, the fraction of active units.

The graphs show the effect of increasing values of the learning parameter γ for three cases that correspond to negative, zero, and positive mean levels of the output current, and thus reproduce three typical regimes of firing statistics. In each case, we set Z_0 = U_0, and equal, very low noise levels in the input and output, σ_ε = σ_δ = 0.5 Hz. Three values of the learning parameter are included, which correspond to √(Nc) = 0.0, √(Nc) = 1.45, and √(Nc) = 5.8. The resulting γ values differ in each figure, as they depend on the value of the η x_0 factor, and are reported in the legend. In any case the three values correspond to no plasticity, intense plasticity, and very strong plasticity (√(Nc) = 5.8 implies that even taking N ≃ 10⁴, a single pattern accounts for about 1/300 of the variance of synaptic weights, on average). The sparseness is set in each case at a = 0.5; different sparseness values change the distributions quantitatively but not the qualitative effects of learning.

Figure 1: Facing page. The effects of learning on spike count distributions that fall largely (a) below threshold, (b) around threshold, or (c) above threshold. Each graph shows as solid curves the normal distribution P(u) for γ = 0 and the distributions obtained with two different values of the learning parameter, while the dashed lines indicate the gaussian curves that best fit each distribution. (a) The mean value of the original distribution is U_0 = Z_0 = −20 Hz and the two nonzero values of the plasticity correspond to γ = 1.21 and γ = 4.86. (b) The mean value is U_0 = Z_0 = 0 Hz, while the same plasticity values result in γ = 1.44 and γ = 5.77. (c) U_0 = Z_0 = 20 Hz, γ = 1.30 and γ = 5.21. The γ values used in each of the three panels are somewhat different because of the different η factor, resulting from a different ⟨(Z − Z_0)²⟩ average. Other parameters: a = 0.5, θ = 0, σ_δ = σ_ε = 0.5 Hz. Since differences in the portion of the distribution below threshold are difficult to detect in practice, the most favorable situation for experimentally observing learning effects is that exemplified in panel a, with the original distribution largely below threshold and the modified distribution substantially shifted to higher spike counts.

In the first case, Figure 1a, the original output distribution is above threshold only with its upper tail, Z_0 = −20 Hz. The main effect is a shift of the bulk of the distribution to higher values, with its width remaining constant, while the upper tail is broadened. This situation (mode of the original distribution below threshold, that is, a quasi-exponential spike count) corresponds to a fraction of cortical recordings and may be more typical of hippocampal cells (Panzeri et al., 1997). In the second case, Figure 1b, the mean output in the absence of learning is exactly at threshold, Z_0 = 0 Hz. The half distribution below threshold then stays unchanged, while the half distribution above is broadened by learning. The latter effect is the dominant one in the third case, Figure 1c, in which the original output spike count has been taken to have a peak above threshold, Z_0 = 20 Hz. In this case, only the lower tail of the distribution shifts, and in the opposite direction, to lower values below threshold (an effect difficult to detect in practice). The effects on the distribution are somewhat more complex if U_0 ≠ Z_0, and it would take several figures to describe the detailed dependence on θ, σ_δ, σ_ε, and so on. However, Figure 1 provides a useful indication of the main effects (those one could hope to observe in experiments), and the low noise levels used make such effects particularly salient.

4 Can Deviations from Normality Be Observed?
Figure 1 also shows, with dashed lines, the normal distributions in u that most closely match (in the least mean square sense) the actual distribution for the two nonzero values of the learning parameter. Visual inspection of the figure clarifies the relative likelihood of observing the effects of learning in experiments. In experiments in which the only data are the firing statistics in response to presumably well-learned visual stimuli, the effects of learning have to be demonstrated as mismatches between the observed spike count and the closest underlying normal distribution of (slow) fluctuations. Figure 1 indicates that such a mismatch will be substantial only below threshold, and then more so for distributions concentrated below threshold (such as the one in the example of Figure 1a). The shape of the distribution of slow fluctuations below threshold is difficult to extract from the observed spike count, particularly with the limited data available in practice. It is therefore likely that deviations from normality, if they indeed occur as an effect of learning, will be observable only in experiments designed for that purpose, in particular involving extensive sampling (recording sessions). A thorough analysis will be needed to disentangle, among any observed deviations from normality, those due to learning from those due to any of the many simplifying assumptions of our model. The situation is different if the effects of learning can also be gauged by the deviations of the observed spike count from a "control" spike count obtained in equivalent conditions, but with the cell stimulated with novel (nonlearned) stimuli. In that case the parameters of the normal distribution
Spike Count Distributions After Learning
1783
assumed by the S+F model for g = c = 0 can in principle be extracted from the data and used to try to fit the spike count obtained with the familiar (presumably learned) stimuli. Therefore, not only the shape of the distribution of slow fluctuations but also its estimated mean and variance will offer indications about the effects of learning: whether they are consistent with those predicted by the analysis and, if so, what estimates they yield of the learning parameters g and c. Deviations from normality due to other effects can then hopefully be subtracted out. Experiments that in principle allow for such a comparison have been carried out in the laboratory of James Ringo (Ringo & Nowicka, 1996). The idea is to train a monkey to discriminate among a set of 12 to 40 visual images, which are generated as simple combinations of elementary shapes and colors, with a given algorithm. These images are used throughout training and become familiar to the monkey. During testing, the statistics of single-cell responses to such images are contrasted with the statistics of responses to a much larger set of images, each of which is novel but generated by the same algorithm. Under such conditions, any difference in the statistics can be used: in particular, the mean and variance of the spike count (Ringo & Nowicka, 1996) and, after analysis with the S+F model, the estimated mean and variance of the distribution of slow fluctuations. An analysis along these lines is in progress (facing subtleties posed, as usual, by limited sampling) and will be reported elsewhere. In addition to limited sampling, the analysis of experimental recordings has to confront effects inherent in the behavior of real neurons, which have not been considered in this simple model.
For example, for experiments in which steady stimuli are presented in successive trials, each for a fixed time, in principle one has to take into account the variability in the latency of the response, adaptation in the firing, across-trial trends in the response to the same stimulus, and so on. The quantification of these effects requires a study of its own, beyond the scope of this article, but their presence should clearly be borne in mind.

In conclusion, a simple (threshold-linear) model predicts simple effects of learning on the statistics of trial-to-trial fluctuations in the input current to a neuron. These effects can be summarized by the next-order-cumulant rule: below threshold, where the output, fast noise aside, is zero (i.e., constant with respect to the input current), the effect of learning is a linear increase in the first-order cumulant of the input distribution (i.e., its mean value); above threshold, where the output is roughly linear in the input, the effect is on the second-order cumulant (i.e., the variance). Experimental constraints make the validation of these model predictions not quite straightforward; but if the effects turn out to be observable, they will allow an estimate, at least as an order of magnitude, of the plasticity parameter, that is, a measure of the amount of learning stored in the synapses to a neuron.
1784
Settanni and Treves
Appendix
The quantity we have evaluated, ⟨P(u)⟩, is the average, across all possible values of the synaptic weight vector, of the distribution P(u), which is itself already integrated over all values of g, V, and Z. It can thus be written

\[
\langle P(u)\rangle = \int \prod_i \frac{dJ^S_i}{\sqrt{2\pi}\,\sigma_J}\, e^{-\frac{(J^S_i)^2}{2\sigma_J^2}} \prod_i \frac{dJ^N_i}{\sqrt{2\pi}\,\sigma_J}\, e^{-\frac{(J^N_i)^2}{2\sigma_J^2}} \int dV\,dZ\,dg\; P(u \mid Z, V, g)\, P(V \mid g)\, P(Z \mid g)\, P(g). \tag{A.1}
\]

To proceed, one has just to insert the conditional probabilities from equations 2.3, 2.5, and 2.6 and carry out a succession of integrals. It is convenient to write P(V|g) and P(Z|g) as pure gaussian integrals over the dummy variables v_i and f, of which V_i and Z are the rectified parts (e.g., Z = f H(f)). Moreover, the normal distributions of the variables f and u around their mean values can be written as gaussian integrals in the noise terms ε^S and ε^R. One can then integrate over the synaptic weights, obtaining

\[
\begin{aligned}
\langle P(u)\rangle = \int_{-\infty}^{+\infty} \frac{d\epsilon^R}{2\pi\sigma_\epsilon^2}\,
&\exp\!\left[\, i\,\frac{\epsilon^R (u - U_0)}{\sigma_\epsilon^2} - \frac{(\epsilon^R)^2}{2\sigma_\epsilon^2} \right]
\int_{-\infty}^{+\infty} \frac{d\epsilon^S\, df}{2\pi\sigma_\epsilon^2}\,
\exp\!\left[\, i\,\frac{\epsilon^S (f - Z_0)}{\sigma_\epsilon^2} - \frac{(\epsilon^S)^2}{2\sigma_\epsilon^2} \right]
\prod_i \int \frac{d\delta_i\, dv_i}{2\pi\sigma_\delta^2} \int dg_i\, P_g(g_i) \\
&\times \exp\!\left\{ \sum_i \left[\, i\,\frac{\delta_i (v_i - g_i)}{\sigma_\delta^2}
+ i\,\frac{\epsilon^R h}{\sigma_\epsilon^2}\sqrt{\frac{c}{N}}\,(g_i - \bar g)\,[\,f H(f) - Z_0\,]\, v_i H(v_i) \right]\right\} \\
&\times \exp\!\left\{ -\sum_i \left[\, \frac{\delta_i^2}{2\sigma_\delta^2}
+ \frac{\sigma_J^2\,[\,g_i \epsilon^S + \cos(h)\, v_i H(v_i)\, \epsilon^R\,]^2}{2\sigma_\epsilon^4}
+ \frac{\sigma_J^2\,[\,\sin(h)\, v_i H(v_i)\, \epsilon^R\,]^2}{2\sigma_\epsilon^4} \right]\right\}. \tag{A.2}
\end{aligned}
\]

One may then introduce, in order to separate the integration variables, the average input parameters,

\[ x = \frac{1}{N}\sum_i (g_i - \bar g)\, V_i \tag{A.3} \]

\[ y = \frac{1}{N}\sum_i V_i^2 \tag{A.4} \]

\[ w = \frac{1}{N}\sum_i g_i V_i \tag{A.5} \]

\[ z = \frac{1}{N}\sum_i g_i^2, \tag{A.6} \]

which are also integrated over, and are constrained to take the above values by delta functions defined in terms of the conjugate parameters x̃, ỹ, w̃, z̃. The intermediate formula becomes

\[
\begin{aligned}
\langle P(u)\rangle = \int \frac{N\,dx\,d\tilde x}{2\pi} \int \frac{N\,dy\,d\tilde y}{2\pi}
\int \frac{N\,dw\,d\tilde w}{2\pi} \int \frac{N\,dz\,d\tilde z}{2\pi}
&\int_{-\infty}^{+\infty} \frac{d\epsilon^R}{2\pi\sigma_\epsilon^2}
\int_{-\infty}^{+\infty} \frac{d\epsilon^S\, df}{2\pi\sigma_\epsilon^2}\; \exp(N F) \\
&\times \exp\!\left\{ -\frac{(\epsilon^R)^2 + (\epsilon^S)^2}{2\sigma_\epsilon^2}
- \frac{\sigma_J^2 N}{2\sigma_\epsilon^4}\left[ z\,(\epsilon^S)^2 + 2 w \cos(h)\,\epsilon^S\epsilon^R + y\,(\epsilon^R)^2 \right]\right\} \\
&\times \exp\!\left\{ i\,\frac{\epsilon^R (u - U_0)}{\sigma_\epsilon^2}
+ i\,\frac{\epsilon^R h}{\sigma_\epsilon^2}\sqrt{cN}\,[\,f H(f) - Z_0\,]\,x
+ i\,\frac{\epsilon^S (f - Z_0)}{\sigma_\epsilon^2} \right\}, \tag{A.7}
\end{aligned}
\]

where

\[
\exp F = \exp\!\left[\, x\tilde x + y\tilde y + w\tilde w + z\tilde z \,\right]
\int dg\, P_g(g) \int \frac{d\delta\, dv}{2\pi\sigma_\delta^2}\,
\exp\!\left[ -\frac{\delta^2}{2\sigma_\delta^2} + i\,\frac{\delta (v - g)}{\sigma_\delta^2}
- \tilde x\,(g - \bar g)\, v H(v) - \tilde y\, v^2 H(v) - \tilde w\, g\, v H(v) - \tilde z\, g^2 \right]. \tag{A.8}
\]

Having set in the model σ_J² N = 1, it is possible to calculate the integrals in the N → ∞ limit with the saddle point method, which amounts to using the formula

\[
\lim_{N\to\infty} \int d^n x\; g(x)\, \exp(N F(x)) \approx
g(x_{\max})\, \exp[\,N F(x_{\max})\,] \left(\frac{2\pi}{N}\right)^{n/2}
\sqrt{\frac{1}{-\det H[F(x_{\max})]}}, \tag{A.9}
\]

where x is shorthand for the n = 8-dimensional vector (x, y, w, z, x̃, ỹ, w̃, z̃), x_max is the point at which the exponent F attains its maximum, and H[F(x_max)] is the Hessian of F at the maximum. One finds the maximum of F at

\[ \tilde x_0 = \tilde y_0 = \tilde w_0 = \tilde z_0 = 0 \tag{A.10} \]

\[ x_0 = \int dg\, P_g(g)\,(g - \bar g)\left[\, g\,\Phi\!\left(\frac{g}{\sigma_\delta}\right)
+ \frac{\sigma_\delta}{\sqrt{2\pi}}\, \exp\!\left(-\frac{g^2}{2\sigma_\delta^2}\right)\right] \tag{A.11} \]

\[ y_0 = \int dg\, P_g(g)\left[\,(g^2 + \sigma_\delta^2)\,\Phi\!\left(\frac{g}{\sigma_\delta}\right)
+ \frac{g\,\sigma_\delta}{\sqrt{2\pi}}\, \exp\!\left(-\frac{g^2}{2\sigma_\delta^2}\right)\right] \tag{A.12} \]

\[ w_0 = \int dg\, P_g(g)\, g \left[\, g\,\Phi\!\left(\frac{g}{\sigma_\delta}\right)
+ \frac{\sigma_\delta}{\sqrt{2\pi}}\, \exp\!\left(-\frac{g^2}{2\sigma_\delta^2}\right)\right] \tag{A.13} \]

\[ z_0 = \int dg\, P_g(g)\, g^2, \tag{A.14} \]

in which Φ is, as above, the normal distribution function. At the maximum,

\[ F(x_0, y_0, w_0, z_0, \tilde x_0, \tilde y_0, \tilde w_0, \tilde z_0) = 0 \tag{A.15} \]

\[ \det H[F(x_0, y_0, w_0, z_0, \tilde x_0, \tilde y_0, \tilde w_0, \tilde z_0)] = -1, \tag{A.16} \]

so we are left with

\[
\begin{aligned}
\langle P(u)\rangle &\approx g(x_0, y_0, w_0, z_0, \tilde x_0 = 0, \tilde y_0 = 0, \tilde w_0 = 0, \tilde z_0 = 0) \\
&= \int_{-\infty}^{+\infty} \frac{d\epsilon^R}{2\pi\sigma_\epsilon^2}
\int_{-\infty}^{+\infty} \frac{d\epsilon^S\, df}{2\pi\sigma_\epsilon^2}\,
\exp\!\left\{ i\,\frac{\epsilon^R (u - U_0)}{\sigma_\epsilon^2}
+ i\,\frac{\epsilon^R h}{\sigma_\epsilon^2}\sqrt{cN}\,[\,f H(f) - Z_0\,]\, x_0
+ i\,\frac{\epsilon^S (f - Z_0)}{\sigma_\epsilon^2}\right\} \\
&\quad \times \exp\!\left\{ -\frac{(\epsilon^R)^2 + (\epsilon^S)^2}{2\sigma_\epsilon^2}
- \frac{\sigma_J^2 N}{2\sigma_\epsilon^4}\left[ z_0\,(\epsilon^S)^2
+ 2 w_0 \cos(h)\,\epsilon^S\epsilon^R + y_0\,(\epsilon^R)^2 \right]\right\}. \tag{A.17}
\end{aligned}
\]

The above expression is just a gaussian integral in ℝ² in the variables ε^S and ε^R; one then has to carry out the final integral in df to obtain the expression given in the text.
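The rectified-gaussian averages that enter equations A.11 through A.13, namely E[vH(v)] = gΦ(g/σ_δ) + (σ_δ/√(2π)) exp(−g²/2σ_δ²) and E[v²H(v)] = (g² + σ_δ²)Φ(g/σ_δ) + (gσ_δ/√(2π)) exp(−g²/2σ_δ²) for v ~ N(g, σ_δ²), can be checked numerically. A minimal Monte Carlo sketch (the parameter values below are illustrative only, not taken from the article):

```python
import math
import random

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rectified_moments(g, sd):
    """Analytic E[v H(v)] and E[v^2 H(v)] for v ~ N(g, sd^2),
    the averages appearing in equations A.11-A.13 (H = Heaviside)."""
    m1 = g * Phi(g / sd) + sd * phi(g / sd)
    m2 = (g * g + sd * sd) * Phi(g / sd) + g * sd * phi(g / sd)
    return m1, m2

def monte_carlo_moments(g, sd, n=200_000, seed=0):
    """Sample-based estimates of the same two rectified moments."""
    rng = random.Random(seed)
    s1 = s2 = 0.0
    for _ in range(n):
        v = rng.gauss(g, sd)
        if v > 0.0:        # only the rectified part v H(v) contributes
            s1 += v
            s2 += v * v
    return s1 / n, s2 / n
```

With g = 0.5 and σ_δ = 1, the sampled and analytic values agree to within sampling error.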
Acknowledgments
We are grateful to James Ringo, Stefano Panzeri, and Andrea Benucci for discussions of our results. This work was in partial fulfillment of requirements for an M.Sc. thesis (G.S.) and was supported in part by CNR, INFM, and HFSP.

References

Abeles, M., Vaadia, E., & Bergman, H. (1990). Firing patterns of single units in the prefrontal cortex and neural network models. Network, 1, 13–25.

Barnes, C. A., McNaughton, B. L., Mizumori, S. J. Y., Leonard, B. W., & Lin, L.-H. (1990). Comparison of spatial and temporal characteristics of neuronal activity in sequential stages of hippocampal processing. In J. Storm-Mathisen, J. Zimmer, & O. P. Ottersen (Eds.), Understanding the brain through the hippocampus (pp. 287–300). Amsterdam: Elsevier.

Levy, W. B., & Baxter, R. A. (1996). Energy efficient neural codes. Neural Comp., 8, 531–543.

Panzeri, S., Booth, M., Wakeman, E. A., Rolls, E. T., & Treves, A. (1996). Do firing rate distributions reflect anything beyond just chance? Society for Neuroscience Abstracts, 22, 1124.

Panzeri, S., Rolls, E. T., Treves, A., Robertson, R. G., & Georges-François, P. (1997). Efficient encoding by the firing of hippocampal spatial view cells. Society for Neuroscience Abstracts, 23, 195.4.

Ringo, J. L., & Nowicka, A. (1996). Long term memory for visual images in the hippocampus and inferotemporal cortex. Society for Neuroscience Abstracts, 22, 281.

Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.

Shannon, C. E. (1948). A mathematical theory of communication. AT&T Bell Labs. Tech. J., 27, 379–423.

Treves, A. (1995). Quantitative estimate of the information relayed by the Schaffer collaterals. J. Comp. Neurosci., 2, 259–272.

Treves, A., Panzeri, S., Rolls, E. T., Booth, M. C. A., & Wakeman, E. A. (1999). Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli. Neural Comp., 11, 611–641.

Treves, A., & Rolls, E. T. (1991). What determines the capacity of autoassociative memories in the brain? Network, 2, 371–397.

Received November 2, 1998; accepted June 8, 1999.
LETTER
Communicated by Fabrizio Gabbiani
Calculation of Interspike Intervals for Integrate-and-Fire Neurons with Poisson Distribution of Synaptic Inputs

A. N. Burkitt
G. M. Clark
Bionic Ear Institute, East Melbourne, Victoria 3002, Australia

We present a new technique for calculating the interspike intervals of integrate-and-fire neurons. There are two new components to this technique. First, the probability density of the summed potential is calculated by integrating over the distribution of arrival times of the afferent postsynaptic potentials (PSPs), rather than using conventional stochastic differential equation techniques. A general formulation of this technique is given in terms of the probability distribution of the inputs and the time course of the postsynaptic response. The expressions are evaluated in the gaussian approximation, which gives results that become more accurate for large numbers of small-amplitude PSPs. Second, the probability density of output spikes, which are generated when the potential reaches threshold, is given in terms of an integral involving a conditional probability density. This expression is a generalization of the renewal equation, but it holds both for leaky neurons and for situations in which there is no time-translational invariance. The conditional probability density of the potential is calculated using the same technique of integrating over the distribution of arrival times of the afferent PSPs. For inputs with a Poisson distribution, the known analytic solutions for both the perfect integrator model and the Stein model (which incorporates membrane potential leakage) in the diffusion limit are obtained. The interspike interval distribution may also be calculated numerically for models that incorporate both membrane potential leakage and a finite rise time of the postsynaptic response.
Plots of the relationship between input and output firing rates, as well as of the coefficient of variation, are given, and inputs with varying rates and amplitudes, including inhibitory inputs, are analyzed. The results indicate that neurons functioning near their critical threshold, where the inputs are just sufficient to cause firing, display a large variability in their spike timings.

1 Introduction
Neural Computation 12, 1789–1820 (2000) © 2000 Massachusetts Institute of Technology

The stream of action potentials (spikes) generated by a neuron, which is the means by which neurons communicate, plays an important role in neural information processing. The generation of an action potential can be described
in terms of the membrane potential of the neuron reaching a threshold value, as in the integrate-and-fire model. This model, which has both a long history (Lapicque, 1907) and wide application (Tuckwell, 1988a), has recently been compared with the Hodgkin-Huxley model for the case where both models receive a stochastic input current, and it was found that the threshold model provides a good approximation (Kistler, Gerstner, & van Hemmen, 1997). One of the most universal characteristics of neuronal behavior is the apparent stochastic nature of the timing of action potentials (Tuckwell, 1988b, 1989). The traditional view has long been that the mean rate of firing of a neuron provides an adequate description of the information it conveys (Adrian, 1928). There are, however, a number of instances within the nervous system where the temporal information contained in the timing of individual spikes plays an important role, such as the encoding of sound frequency information in the auditory pathway, where spikes become locked to the phase of the incoming sound wave (see Theunissen & Miller, 1995, and Carr & Friedman, 1999, for reviews of temporal coding). There has also been considerable interest in studies of cross-correlational activity of neurons in the visual cortex, which provide data suggesting a functional role for the synchronization of neural activity, as reviewed in Engel, König, Kreiter, Schillen, and Singer (1992) and Singer (1993). This has led to a recent reappraisal of the information that can be contained in the timing of neuronal spikes (Bialek & Rieke, 1992), and a variety of mathematical models of networks of spiking neurons have been proposed (Abeles, 1982a; Judd & Aihara, 1993; Gerstner, 1995) in which the timing of individual spikes encodes information (Hopfield, 1995; Gabbiani & Koch, 1996; Maass, 1996a, 1996b).
Studies suggest that the integrate-and-fire mechanism is unable to account for the irregularity observed in the interspike intervals (ISIs) of spike trains from neurons in the visual cortex of behaving monkeys (Softky & Koch, 1992, 1993). One possible explanation is that the neurons act as coincidence detectors (Abeles, 1982b), which may be made possible by active dendritic conductances (Softky, 1994). However, subsequent studies (Shadlen & Newsome, 1994, 1998) suggest that the observed irregularity may be accounted for by the effect of inhibitory postsynaptic potentials (IPSPs) and that the timing of neuronal spikes conveys little, if any, information. The resolution of such issues requires the development and analysis of neuronal models that are sufficiently complex to capture the essential features of neuronal processing, including stochastic input, while still being mathematically tractable. The earliest solution of the integrate-and-fire model that incorporated stochastic activity was to model the incoming postsynaptic potentials (PSPs) as a random walk (Gerstein & Mandelbrot, 1964). Subsequent developments have largely built on this diffusion approach using stochastic differential equations and the Ornstein-Uhlenbeck process (OUP). Stein formulated the integrate-and-fire model with stochastic input to include the decay of the membrane potential (Stein, 1965), and a number of other authors subsequently have investigated the model using both
stochastic differential equations and numerical techniques (Tuckwell, 1977; Wilbur & Rinzel, 1982; Lánský, 1984). These techniques have been used to examine the role of inhibition in the Stein model (Tuckwell, 1978, 1979; Cope & Tuckwell, 1979; Tuckwell & Cope, 1980) and a number of other features, such as the effect of reversal potentials (Tuckwell, 1979; Hanson & Tuckwell, 1983; Musila & Lánský, 1994). We present here an alternative formulation of the stochastic integrate-and-fire model that allows the analysis of arbitrary synaptic response functions (which describe the time course of the incoming PSP, including a finite rise time and subsequent decay). The first step is the calculation of the probability density of the summed potential, which is given by integrating over the arrival times of the incoming PSPs. The probability density function at any time is calculated in the gaussian approximation (i.e., expanding in powers of the PSP amplitude, a, and neglecting terms of order higher than quadratic in the expansion), and it is found to be a gaussian distribution, with the (time-dependent) mean and width of the distribution depending on the details of the distribution of PSP arrival times. In the case of a Poisson distribution of PSP arrival times, our results reproduce the known expression for the probability density of the membrane potential in the diffusion limit. In order to calculate the probability density of output spikes, it is necessary to know when the potential reaches threshold for the first time. This first-passage density is usually related to the probability density of the potential by a renewal equation (Tuckwell, 1988b), a procedure that works only when there is no decay of the membrane potential (the perfect integrator model).
However, it is possible to formulate a more general integral relationship involving the probability density of output spikes and a conditional probability density of the potential (Plesser & Tanaka, 1997; Burkitt & Clark, 1999a). We show how this conditional probability density can also be calculated using the technique of integrating over the distribution of arrival times of the afferent PSPs. In addition to reproducing the results of the perfect integrator and Stein models, this method allows us to investigate models that incorporate not only the membrane potential leakage but also a finite rise time of the postsynaptic response. Numerical solutions of the spike output distribution (i.e., the interspike interval distribution) of these models are given. We have also applied this technique to the study of the synchronization problem, in which the relationship between a number of approximately synchronized inputs (with a spread of arrival times giving the temporal jitter of the inputs) and the spread in time of the output spike distribution is examined (Burkitt & Clark, 1999a). The analysis supports previous studies (Bernander, Koch, & Usher, 1994; Diesmann, Gewaltig, & Aertsen, 1996; Maršálek, Koch, & Maunsell, 1997) showing that the ratio of the output jitter to the input jitter is consistently less than one and that it decreases for increasing numbers of inputs. Moreover, we identified the variation in the spike-generating thresholds of the neurons and the variation in the number of active inputs as important factors that determine the timing jitter in layered networks, in addition to the previously identified factors of axonal propagation times and synaptic jitter. That study of the synchronization problem is complementary to the present one, since the problems addressed represent two extreme cases of the temporal information that the inputs contain: in the synchronization problem, all of the input arrival times are related, whereas for a Poisson distribution of PSP inputs, the arrival times are completely independent.

Details of the technique are presented in section 2, and it is then applied to various neural models in section 3. The result for the perfect integrator (Gerstein & Mandelbrot, 1964), in which the decay of the membrane potential over time is neglected, is reproduced in section 3.1. The solution for the moments of the firing distribution in the Stein model (Gluss, 1967) is reproduced in section 3.2, and plots of the firing probability, mean interspike interval, width of the interspike interval distribution, and coefficient of variation are given. We also plot the output rate of firing as a function of the input rate for a number of thresholds. More general synaptic response functions, which include not only the decay of the membrane potential but also its rise time, are considered in section 3.3. The effect of a distribution of PSP amplitudes and rates is examined in section 3.4, and the effect of including inhibitory PSPs is examined in section 3.5. The final section contains a discussion of the results and the conclusions.

2 Technique for Calculating Output Spike Density
We calculate the interspike interval for an integrate-and-fire neuron with N afferent fibers. The time of arrival of the inputs on the kth fiber is modeled as a renewal process, and the results in sections 2.1 and 2.2 hold for any renewal process, that is, whenever the successive time intervals between inputs are independent and identically distributed. The times of arrival of the inputs are described by a probability density function, which we denote by p_k(t); this is the probability density of the interspike interval distribution on the kth fiber. In section 2.3 we examine the case in which the arrival times form a
Poisson process with constant rate λ_k. However, the technique outlined here is applicable to probability densities other than the Poisson distribution, which is useful when refractory effects are considered and for nonstationary processes. The membrane potential V(t) is assumed to be reset to its initial value, V(0) = v_0, at time t = 0, after an action potential (AP) has been generated. The potential is the sum of the excitatory and inhibitory postsynaptic potentials (EPSPs and IPSPs, respectively),

\[
V(t) = v_0 + \sum_{k=1}^{N} a_k \sum_{m=1}^{\infty} u_k(t - t_{km}), \qquad t \ge 0, \tag{2.1}
\]
where the index k = 1, . . . , N denotes the afferent fiber and the index m denotes the mth input from that fiber, whose time of arrival is t_km (0 < t_k1 < t_k2 < · · · < t_km < · · ·). The amplitude of the inputs from fiber k is a_k (positive for EPSPs and negative for IPSPs), and the time course of an input at the site of spike generation is described by the synaptic response function u_k(t), which is zero for t < 0 and positive for t ≥ 0, with a peak value normalized to one (the definitions for the three cases examined here are given in equations 3.1, 3.6, and 3.17). In section 3 we examine a number of synaptic response functions corresponding to various neural models, including the perfect integrator model and the Stein model (also called the shot noise model). More general models of the synaptic response function enable us to model the time course of the incoming PSPs, taking into account both the leakage of the potential across the membrane and the effect of propagation along the dendritic tree.

The various versions of the integrate-and-fire neural model have as their key components a passive integration of the synaptic inputs, a voltage threshold for spike generation, and a lack of the specific currents that underlie spiking. The complex mechanisms by which the sodium and potassium conductances cause action potentials to be generated are not part of the model. An important assumption is also that the synaptic inputs interact only in a linear manner, so that phenomena involving nonlinear interactions of synaptic inputs, such as synaptic currents that depend on postsynaptic currents and pulse-paired facilitation and depression (which require higher-order terms in u_k(t)), are neglected. A recent review of the integrate-and-fire neural model, giving details of its various variations and comparisons with both experimental data and other models, is given in Koch (1999).
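As an illustration, equation 2.1 can be simulated directly. A minimal sketch, assuming (for concreteness) a Poisson input on every fiber and an exponential synaptic response u(t) = exp(−t/τ)H(t), as in the Stein (shot noise) model; all parameter values here are illustrative, not taken from the article:

```python
import math
import random

def first_isi(n_fibers=100, rate=60.0, a=0.02, tau=0.010,
              v0=0.0, v_th=1.0, dt=1e-4, t_max=5.0, seed=1):
    """Simulate equation 2.1 with u(t) = exp(-t/tau) H(t) and an
    independent Poisson input of rate `rate` (Hz) on each fiber.
    Because the response is a pure exponential, the summed potential
    can be updated by decaying it each step and adding amplitude `a`
    per arriving PSP.  Returns the first-passage time to threshold
    (one ISI), or None if threshold is not reached within t_max."""
    rng = random.Random(seed)
    decay = math.exp(-dt / tau)
    p = rate * dt                  # P(one PSP per fiber per step), p << 1
    v, t = v0, 0.0
    while t < t_max:
        t += dt
        v = v0 + (v - v0) * decay  # membrane decay toward the reset level
        arrivals = sum(1 for _ in range(n_fibers) if rng.random() < p)
        v += a * arrivals
        if v >= v_th:
            return t
    return None
```

More general response functions (e.g., with a finite rise time) would instead require summing u(t − t_km) over the stored arrival times, exactly as written in equation 2.1.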
In the results that follow, we express the relationship between the inputs and the threshold in terms of the threshold ratio R, which is defined as

\[
R = \frac{h}{N a} = \frac{V_{\mathrm{th}} - v_0}{N a}, \tag{2.2}
\]

where N is the number of afferent fibers that contribute EPSPs, a is the amplitude of the individual PSPs (taken here to be equal in magnitude), and h is the difference between the threshold potential, V_th, and the resting potential, v_0. We choose the units of voltage to be set by the difference between the threshold and reset values, h = V_th − v_0 = 1. The utility of this parameterization of the threshold will become apparent when we consider how the results scale with increasing numbers of inputs, N, and decreasing amplitudes, a, in section 3.

2.1 The Probability Density of the Potential. The probability density of the potential having the value v at time t is evaluated by considering the fraction of inputs for which this holds. This is given by the integral over the
probability distribution of the inputs,

\[
p(v, t \mid v_0) = \prod_{k=1}^{N} \left\{ \int_0^\infty dt_{k1}\, p_k(t_{k1})
\int_{t_{k1}}^\infty dt_{k2}\, p_k(t_{k2} - t_{k1})
\int_{t_{k2}}^\infty dt_{k3}\, p_k(t_{k3} - t_{k2}) \cdots \right\} \delta(V(t) - v), \tag{2.3}
\]

where p(v, t | v_0) is the probability that the potential, V(t), takes the value v at time t, given the initial condition V(0) = v_0. This initial condition corresponds to the reset of the membrane potential to v_0 immediately after the previous spike, which is assumed to occur at time t = 0. The dots indicate the sequence of integrals, each corresponding to the successive arrival of inputs on the particular fiber (note that the integrals are normalized), and δ(x) is the Dirac delta function. Since the contributions from the afferent fibers are independent, the above probability density may be written as

\[
p(v, t \mid v_0) = \int_{-\infty}^{\infty} \frac{dx}{2\pi}\, \exp\{ i x (v - v_0) \} \prod_{k=1}^{N} F_k(x, t)
= \int_{-\infty}^{\infty} \frac{dx}{2\pi}\, \exp\left\{ i x (v - v_0) + \sum_{k=1}^{N} \ln F_k(x, t) \right\}, \tag{2.4}
\]
where the Fourier representation of the Dirac delta function is used:

\[
\delta(z) = \int_{-\infty}^{\infty} \frac{dx}{2\pi}\, \exp\{-i x z\}. \tag{2.5}
\]
The function F_k(x, t) is given by

\[
F_k(x, t) = \left\{ \int_0^\infty dt_{k1}\, p_k(t_{k1})
\int_{t_{k1}}^\infty dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\}
\exp\left\{ -i x a_k \sum_{m=1}^{\infty} u_k(t - t_{km}) \right\}. \tag{2.6}
\]
The exponential in the above expression for F_k(x, t) can be expanded in terms of the individual amplitudes, a_k, and only the linear and quadratic terms are retained. A formal solution to the resulting equations is given (we do not address such questions as the convergence of the expansion). However, in order to obtain a handle on the accuracy of this approximation, the results are compared in the following sections with those obtained by
numerical simulations. The expansion of F_k(x, t) gives

\[
\begin{aligned}
\ln F_k(x, t) &= \ln\left\{ 1 - i x a_k D_k(t) - \frac{x^2 a_k^2}{2} E_k(t) + O(a_k^3) \right\} \\
&= -i x a_k D_k(t) - \frac{x^2 a_k^2}{2}\left( E_k(t) - D_k^2(t) \right) + O(a_k^3), \tag{2.7}
\end{aligned}
\]
where O(y³) denotes a remainder term that is smaller than C × |y|³ when y lies in a neighborhood of 0, for some positive constant C. The functions D_k(t) and E_k(t) are given by

\[
\begin{aligned}
D_k(t) &= \left\{ \int_0^\infty dt_{k1}\, p_k(t_{k1})
\int_{t_{k1}}^\infty dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\}
\sum_{m=1}^{\infty} u_k(t - t_{km}) \\
E_k(t) &= \left\{ \int_0^\infty dt_{k1}\, p_k(t_{k1})
\int_{t_{k1}}^\infty dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\}
\sum_{m, m' = 1}^{\infty} u_k(t - t_{km})\, u_k(t - t_{km'}). \tag{2.8}
\end{aligned}
\]
The expressions for D_k(t) and E_k(t) may be solved using Laplace transforms, as shown in appendix A. Neglecting the O(a_k³) terms in equation 2.7 and retaining only the linear and quadratic terms (the gaussian approximation), the probability density function can then be evaluated as

\[
\begin{aligned}
p(v, t \mid v_0) &= \int_{-\infty}^{\infty} \frac{dx}{2\pi}\,
\exp\left\{ i x \left( v - v_0 - \sum_{k=1}^{N} a_k D_k(t) \right)
- \frac{x^2}{2} \sum_{k=1}^{N} a_k^2 \left( E_k(t) - D_k^2(t) \right) \right\} \\
&= \frac{1}{\sqrt{2\pi\, C(t)}}\, \exp\left\{ -\frac{(v - v_0 - U(t))^2}{2 C(t)} \right\}, \tag{2.9}
\end{aligned}
\]

with

\[
U(t) = \sum_{k=1}^{N} a_k D_k(t), \qquad
C(t) = \sum_{k=1}^{N} a_k^2 \left( E_k(t) - D_k^2(t) \right). \tag{2.10}
\]
In the case of the Poisson process, the expressions for U(t) and C(t) take particularly simple forms, as will be shown in section 2.3. (We assume here
that C(t) > 0, which is indeed the case for Poisson inputs, as shown later in equation 2.23.) The gaussian approximation for F_k(x, t), equation 2.7 (i.e., keeping only terms to second order in the amplitude a_k of the individual postsynaptic contributions), is an approximation that is expected to work best for large values of N and small values of the amplitude (voltage is measured in units of the difference h between the threshold, V_th, and reset, v_0, potentials). We assume that the amplitude, a_k, scales with the number of inputs, N (i.e., that the amplitude is of order 1/N), and that U(t) is therefore of the same order as h, thus giving a reasonable output rate of spikes. Consequently the threshold ratio, equation 2.2, is a useful quantity with which to compare neurons that sum different numbers of inputs of various amplitudes. The results in the following sections do indeed support this assumption. The question of how many inputs and how small the individual synaptic amplitudes must be in order for the analysis to provide an accurate evaluation of the neural response will be addressed in subsequent sections, where we compare the results with those obtained from numerical simulations for a range of numbers of inputs and relative amplitudes. However, the situation in which a large number of small-amplitude postsynaptic potentials are required for the firing threshold to be reached is one that occurs in many neurons (Abeles, 1991; Douglas & Martin, 1991).

We consider here the case in which the amplitudes of the contributions from the inputs are equal in magnitude, |a_k| = a, the rates on each of the incoming fibers are equal, λ_k = λ, and all of the inputs are excitatory. It is straightforward to generalize the results to any distribution of amplitudes, as discussed in section 3.4, since the amplitude dependence of the functions U(t), C(t) is given explicitly in equation 2.10 (note that D_k(t), E_k(t) are independent of the amplitude).
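For Poisson arrival times with rate λ on each of N fibers and an exponential response u(t) = exp(−t/τ), the sums in equation 2.10 reduce, by Campbell's theorem for shot noise, to the closed forms U(t) = Naλτ(1 − e^{−t/τ}) and C(t) = Na²λ(τ/2)(1 − e^{−2t/τ}). These are quoted here as standard results, ahead of the section 2.3 derivation; a minimal numerical check against direct sampling of the arrival times (illustrative parameter values):

```python
import math
import random

def shot_noise_moments(N=200, a=0.01, lam=40.0, tau=0.010, t=0.050,
                       trials=4000, seed=2):
    """Compare the Campbell's-theorem forms of U(t) and C(t) for
    Poisson input and u(s) = exp(-s/tau) against Monte Carlo sums
    over Poisson arrival times (equation 2.1 without a threshold).
    Returns (U, C, sample mean, sample variance)."""
    rng = random.Random(seed)
    U = N * a * lam * tau * (1.0 - math.exp(-t / tau))
    C = N * a * a * lam * (tau / 2.0) * (1.0 - math.exp(-2.0 * t / tau))
    total_rate = N * lam           # superposition of N Poisson processes
    vals = []
    for _ in range(trials):
        v, s = 0.0, 0.0
        while True:
            s += rng.expovariate(total_rate)   # next arrival time
            if s > t:
                break
            v += a * math.exp(-(t - s) / tau)  # PSP evaluated at time t
        vals.append(v)
    mean = sum(vals) / trials
    var = sum((x - mean) ** 2 for x in vals) / (trials - 1)
    return U, C, mean, var
```

The sample mean and variance of the summed potential at time t agree with U(t) and C(t) to within sampling error, as the gaussian approximation requires.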
In section 3.5 the generalization to the case in which there are both excitatory and inhibitory inputs is considered. Since there is no interaction between the contributions of different incoming fibers in this model, it is possible to consider different synaptic response functions on different afferent fibers. This may enable us to model inputs coming from fibers that are situated on the dendritic tree at different distances from the site (typically the axon hillock) at which the summed potential generates the output spike. However, we consider here only the case in which the synaptic response function for each input is the same, u_k(t) = u(t). Consequently all input fibers and their characteristics are equivalent, and the index k is therefore dropped in the analysis of the following section.

2.2 First-Passage Time to Threshold. A spike is generated when the summed membrane potential reaches the threshold, V_th (V_th = v_0 + h), for the first time. The probability density of output spikes, which is the interspike interval (ISI) density, is therefore equivalent to the probability density of the first time that the potential, V(t), reaches the threshold, which
is called the first-passage time to threshold density, denoted by f_h(t). This may be obtained from the integral equation (Plesser & Tanaka, 1997; Burkitt & Clark, 1999a),

\[
p(v, t \mid v_0) = \int_0^t dt'\, f_h(t')\, \hat p(v, t \mid V_{\mathrm{th}}, t', v_0),
\qquad v \ge V_{\mathrm{th}}, \tag{2.11}
\]
where p̂(v_2, t_2 | v_1, t_1, v_0) is an approximation to the conditional probability density of the potential. The conditional probability density, p(v_2, t_2 | v_1, t_1, v_0), is defined as the probability that the potential has the value v_2 at time t_2, given that V(t_1) = v_1 and that the reset value of the potential after a spike is v_0. Our approximation to the conditional probability density, p̂(v_2, t_2 | v_1, t_1, v_0), has the same definition, with the added restriction that the potential is subthreshold in the time interval 0 ≤ t < t'. If the fluctuations in the input currents are not temporally correlated, then the approximate expression is exact, as is the case for both the perfect integrator model and the Stein model with Poisson-distributed inputs. However, the two expressions will differ if there are temporal correlations, which occur if the presynaptic spikes have correlations or the synaptic currents have a finite duration. The integral expression for f_h(t) given above is analogous to the renewal equation (Tuckwell, 1988b),

\[
p(v, t \mid v_0) = \int_0^t dt'\, f_h(t')\, p(v, t - t' \mid V_{\mathrm{th}}). \tag{2.12}
\]
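For the perfect integrator in the diffusion limit, with drift μ and variance σ² per unit time (so that U(t) = μt and C(t) = σ²t), equation 2.12 evaluated at v = V_th becomes a Volterra integral equation of the first kind for f_h(t). It can be solved numerically by product integration, taking f_h piecewise constant on each cell and integrating the 1/√s singularity of the kernel exactly; this discretization is a sketch for illustration, not the procedure of the article (which obtains analytic solutions for this model). The result can be compared with the known inverse gaussian first-passage density f(t) = h(2πσ²t³)^{−1/2} exp(−(h − μt)²/(2σ²t)):

```python
import math

def first_passage_density(h=1.0, mu=1.0, sigma=1.0, dt=0.002, n=1500):
    """Solve  p(V_th, t | v_0) = int_0^t f(t') K(t - t') dt'  for f,
    where p(V_th, t | v_0) is the free gaussian density at threshold
    (mean v_0 + mu t, variance sigma^2 t, with h = V_th - v_0) and the
    kernel K(s) = p(V_th, s | V_th)
               = exp(-mu^2 s / (2 sigma^2)) / sqrt(2 pi sigma^2 s).
    Product integration: f is piecewise constant on each cell, the
    1/sqrt(s) factor is integrated exactly, and the smooth exponential
    factor is evaluated at the cell midpoint.  Returns a list f where
    f[j] approximates f_h at the cell midpoint (j - 0.5) * dt."""
    s2 = sigma * sigma
    c = 2.0 / math.sqrt(2.0 * math.pi * s2)
    sq = [math.sqrt(k * dt) for k in range(n + 1)]   # sqrt(k * dt)
    f = [0.0] * (n + 1)
    for i in range(1, n + 1):
        t = i * dt
        lhs = math.exp(-(h - mu * t) ** 2 / (2.0 * s2 * t)) \
              / math.sqrt(2.0 * math.pi * s2 * t)
        acc = 0.0
        for j in range(1, i):
            s_mid = (i - j + 0.5) * dt               # midpoint of kernel argument
            w = math.exp(-mu * mu * s_mid / (2.0 * s2)) \
                * c * (sq[i - j + 1] - sq[i - j])    # exact sqrt-cell weight
            acc += f[j] * w
        w_ii = math.exp(-mu * mu * 0.5 * dt / (2.0 * s2)) * c * sq[1]
        f[i] = (lhs - acc) / w_ii
    return f
```

With h = μ = σ = 1, the numerical solution closely tracks the inverse gaussian density, and its integral approaches the total first-passage probability.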
This renewal equation is a special case of equation 2.11 that is valid only for the perfect integrator (i.e., nonleaky) neural model, and it arises because for the perfect integrator the conditional probability density is related to the original probability density by p(v, t | V_th, t', v_0) = p(v, t − t' | V_th) for v ≥ V_th (in more general neural models, there is no such spatial and temporal homogeneity). There is also no assumption about the stationarity of the conditional probability density, p̂(v, t | V_th, t', v_0), in equation 2.11; it need not have any time-translational invariance. The conditional probability density takes account of possible multiple crossings of the threshold between times t' and t. The conditional probability density is given by Bayes' theorem,

\[
p(v_2, t_2 \mid v_1, t_1, v_0) = \frac{p(v_2, t_2, v_1, t_1 \mid v_0)}{p(v_1, t_1 \mid v_0)}, \tag{2.13}
\]
A. N. Burkitt and G. M. Clark

and the joint probability density p(v_2, t_2, v_1, t_1 | v_0), which is the probability that V(t) takes the value v_1 at time t_1 and takes the value v_2 at time t_2 (given that the reset value of the potential after a spike is generated is again at v_0), may be evaluated in a similar way to the probability density, equations 2.3 and 2.4:

$$p(v_2, t_2, v_1, t_1 \mid v_0) = \prod_{k=1}^{N} \left\{ \int_0^\infty dt_{k1}\, p_k(t_{k1}) \int_{t_{k1}}^\infty dt_{k2}\, p_k(t_{k2} - t_{k1}) \cdots \right\} \delta(V(t_1) - v_1)\, \delta(V(t_2) - v_2)$$
$$= \int_{-\infty}^{\infty} \frac{dx_2}{2\pi} \int_{-\infty}^{\infty} \frac{dx_1}{2\pi} \exp\left\{ i x_2 (v_2 - v_0) + i x_1 (v_1 - v_0) + N \ln F(x_2, x_1, t_2, t_1) \right\}. \tag{2.14}$$
The function ln F(x_2, x_1, t_2, t_1) is expanded in the amplitude a of the synaptic response function as before (see equation 2.7) and contains cross-terms in x_2 x_1:

$$\ln F(x_2, x_1, t_2, t_1) = \ln\left\{ 1 - i x_2 a D(t_2) - i x_1 a D(t_1) - \frac{x_2^2}{2} a^2 E(t_2) - \frac{x_1^2}{2} a^2 E(t_1) - x_2 x_1 a^2 G(t_2, t_1) + O(a^3) \right\}$$
$$= - i x_2 a D(t_2) - i x_1 a D(t_1) - \frac{x_2^2}{2} a^2 \left( E(t_2) - D^2(t_2) \right) - \frac{x_1^2}{2} a^2 \left( E(t_1) - D^2(t_1) \right) - x_2 x_1 a^2 \left( G(t_2, t_1) - D(t_2) D(t_1) \right) + O(a^3), \tag{2.15}$$
where D(t), E(t) are given by equation 2.8 (the index k is dropped, since all incoming fibers are treated identically) and G(t_2, t_1) is given by

$$G(t_2, t_1) = \left\{ \int_0^\infty dt'_1\, p(t'_1) \int_{t'_1}^\infty dt'_2\, p(t'_2 - t'_1) \cdots \right\} \sum_{m=1}^{\infty} u(t_1 - t'_m) \sum_{m'=1}^{\infty} u(t_2 - t'_{m'}). \tag{2.16}$$
Calculation of Interspike Intervals

This expression for G(t_2, t_1) may also be solved in terms of Laplace transforms (see appendix B). Once again we retain only the linear and quadratic terms in the amplitude, a, in equation 2.14. In order to evaluate the integrals in equation 2.14, we make a change of variable y_1 = x_1 + k(t_2, t_1) x_2, where k(t_2, t_1) is chosen in order to eliminate the cross-term x_2 y_1:

$$k(t_2, t_1) = \frac{N a^2 \left[ G(t_2, t_1) - D(t_2) D(t_1) \right]}{C(t_1)}. \tag{2.17}$$
The y_1 and x_2 integrals are now independent and may be evaluated. The y_1 integral yields exactly p(v_1, t_1 | v_0), and the x_2 integral gives the conditional probability density,

$$p(v_2, t_2 \mid v_1, t_1, v_0) = \frac{1}{\sqrt{2\pi c(t_2, t_1)}} \exp\left\{ - \frac{\left[ v_2 - v_0 - U(t_2) - k(t_2, t_1)\left(v_1 - v_0 - U(t_1)\right) \right]^2}{2\, c(t_2, t_1)} \right\}, \tag{2.18}$$

where

$$c(t_2, t_1) = C(t_2) - k^2(t_2, t_1)\, C(t_1). \tag{2.19}$$
The value of c(t_2, t_1) is guaranteed to be positive (for t_2 > t_1) for Poisson-distributed inputs (see the end of section 2.3). The first-passage-time density is evaluated using equation 2.11. For inputs described by a Poisson process, the equation may be solved explicitly for two neural models: the perfect integrator and the Stein model. In the case of the perfect integrator, the method reproduces the known result of the inverse gaussian density (Tuckwell, 1988b), as discussed in section 3.1. In the case of the Stein model, the technique reproduces the known solution (Gluss, 1967), in which the moments of the first-passage-time density are expressed in terms of the Laplace transforms of the probability density and the conditional probability density (section 3.2). However, apart from these special cases, equation 2.11 has not been solved analytically for f_h(t). We solve equation 2.11 for f_h(t) numerically using standard methods for Volterra integral equations (Press, Flannery, Teukolsky, & Vetterling, 1992; Plesser & Tanaka, 1997). The moments of this distribution are then straightforward to evaluate. The zeroth moment of the distribution is the integral of the probability density itself and gives the probability r of a spike's being produced. The first moment, t_f, is the average time of the first threshold crossing and hence the mean ISI. The second moment, σ_out, is the spread in time of the output spike distribution (i.e., the width of the ISI distribution). Although ISI distributions are frequently characterized by their moments, the numerical results show clearly that they are somewhat skewed, resembling more the generalized inverse gaussian distribution (Iyengar & Liao, 1997).
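The numerical solution can be sketched in a few lines. The scheme below is our own minimal illustration, not the authors' code: g(t) stands for the left-hand side p(v, t | v_0) evaluated at v = V_th, and the kernel K(t, t') for the conditional density p̂(V_th, t | V_th, t', v_0), so that equation 2.11 becomes a Volterra equation of the first kind, solved by stepping forward in t with the trapezoidal product rule.

```python
def solve_volterra_first_kind(kernel, g, T, n):
    """Solve g(t) = integral_0^t f(t') K(t, t') dt' for f on [0, T] by the
    trapezoidal product rule, stepping forward in t (in the spirit of the
    Volterra methods of Press et al., 1992).  Requires K(t, t) != 0."""
    h = T / n
    t = [i * h for i in range(n + 1)]
    f = [0.0] * (n + 1)
    # starting value: differentiating the equation at t = 0 gives
    # g'(0) = f(0) K(0, 0); use a forward difference for g'(0)
    f[0] = (g(t[1]) - g(t[0])) / h / kernel(t[0], t[0])
    for i in range(1, n + 1):
        s = 0.5 * f[0] * kernel(t[i], t[0])
        for j in range(1, i):
            s += f[j] * kernel(t[i], t[j])
        # g(t_i) = h * (s + 0.5 * f_i * K(t_i, t_i))  =>  solve for f_i
        f[i] = (g(t[i]) / h - s) / (0.5 * kernel(t[i], t[i]))
    return t, f
```

The simple trapezoidal scheme oscillates slightly about the true solution; Press et al. (1992) discuss smoother variants.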
2.3 Poisson Process. In a renewal process, the interarrival times of the presynaptic inputs are independent and identically distributed with an arbitrary distribution. The Poisson process is a particular case of a renewal process in which the distribution is exponential; the probability density of the first input on fiber k is

$$p_k(t_k) = \lambda_k e^{-\lambda_k t_k}, \tag{2.20}$$

where t_k is the time since the previous input and λ_k is the constant rate of input on the kth fiber. Likewise, the probability density of the mth input on fiber k is

$$p_k(t_{km} - t_{k(m-1)}) = \lambda_k e^{-\lambda_k (t_{km} - t_{k(m-1)})}, \tag{2.21}$$

where t_{k(m-1)} is the time of the preceding input (i.e., the (m-1)th input). The Laplace transform of p_k(t) is

$$p_{k,L}(s) = \frac{\lambda_k}{s + \lambda_k}, \tag{2.22}$$
and D_{k,L}(s), E_{k,L}(s), equations A.1 and A.2, may be evaluated to yield the functions U(t) and C(t), equation 2.10,

$$U(t) = \sum_k a_k \lambda_k \int_0^t dt'\, u(t') \;\stackrel{(\lambda_k = \lambda)}{=}\; N a \lambda \int_0^t dt'\, u(t')$$
$$C(t) = \sum_k a_k^2 \lambda_k \int_0^t dt'\, u^2(t') \;\stackrel{(\lambda_k = \lambda)}{=}\; N a^2 \lambda \int_0^t dt'\, u^2(t'). \tag{2.23}$$
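The integrals in equation 2.23 need not be analytic: for an arbitrary synaptic response function they can be evaluated by quadrature. A small sketch (the function name and parameter values are ours, for illustration only):

```python
def U_and_C(u, N, a, lam, t, steps=20000):
    """Mean U(t) and variance C(t) of the potential from equation 2.23,
    by trapezoidal quadrature of the synaptic response function u
    (identical rate lam on all N fibers, equal amplitudes a)."""
    h = t / steps
    iu = 0.0   # integral of u(t')
    iu2 = 0.0  # integral of u(t')^2
    for i in range(steps + 1):
        w = 0.5 if i == 0 or i == steps else 1.0
        ui = u(i * h)
        iu += w * ui * h
        iu2 += w * ui * ui * h
    return N * a * lam * iu, N * a * a * lam * iu2
```

For the step response of the perfect integrator (u = 1), this reduces to U(t) = Naλt and C(t) = Na²λt, as derived in section 3.1.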
For a Poisson process, these expressions may also be arrived at by noting that a superposition of Poisson processes gives a Poisson process (with a rate that is the sum of the contributing rates) and by integrating the synaptic response function to obtain the average and variance of the depolarization (Stein, 1967; Abeles, 1982b). The conditional probability density, p(v_2, t_2 | v_1, t_1, v_0), may be obtained in closed analytic form for any synaptic response function, u(t), since the expression for k(t_2, t_1) follows from equation 2.17 and the solution of G(t_2, t_1) in appendix B, equation B.1:

$$k(t_2, t_1) = \frac{N a^2 \lambda \int_0^{t_1} dt'\, u(t')\, u(\Delta + t')}{C(t_1)}, \qquad \Delta = t_2 - t_1 \ge 0. \tag{2.24}$$
Note that the rate, λ, given here is the average rate on each afferent fiber, and therefore the total rate of inputs is Nλ. For a Poisson process, k(t_2, t_1) and C(t_1) are related to the autocovariance of the sequence of Poisson inputs, 𝒞(t_2, t_1), by the relationship (Papoulis, 1991),

$$k(t_2, t_1) = \frac{\mathcal{C}(t_2, t_1)}{\mathcal{C}(t_1, t_1)}, \qquad C(t_1) = \mathcal{C}(t_1, t_1), \tag{2.25}$$

which ensures that the expression for c(t_2, t_1), equation 2.19, is positive definite.

3 Neural Models

3.1 Perfect Integrator Model. The perfect integrator model (also called the leakless integrate-and-fire model) is the simplest of the neural models to analyze, since there is no decay of the potential with time. Although this is clearly an unrealistic assumption, it provides a reasonable approximation for situations in which the integration occurs over a much shorter timescale than the decay constant. The synaptic response function for the perfect integrator simply takes the form

$$u(t) = \begin{cases} 1 & t \ge 0 \\ 0 & t < 0. \end{cases} \tag{3.1}$$
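The superposition property invoked in section 2.3 is easy to confirm by simulation: merging N independent Poisson trains of rate λ gives a Poisson stream of rate Nλ, for which the event count in a window has equal mean and variance. A seeded Monte Carlo sketch (ours, purely illustrative):

```python
import random

def merged_count_stats(n_fibers, lam, T, trials, seed=0):
    """Count events in [0, T] after merging n_fibers independent Poisson
    trains of rate lam each; for the merged stream the count should have
    mean and variance both equal to n_fibers * lam * T."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        c = 0
        for _ in range(n_fibers):
            t = rng.expovariate(lam)  # exponential interarrival times
            while t < T:
                c += 1
                t += rng.expovariate(lam)
        counts.append(c)
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / (trials - 1)
    return mean, var
```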
The probability density of the potential for this model is well known (Tuckwell, 1988b) and may be evaluated as a Wiener process with drift (Gerstein & Mandelbrot, 1964). The resulting expressions are identical to those obtained using the above technique. The expressions for U(t) and C(t), for the case in which all the inputs are excitatory and the input rate on each fiber is identical, follow from equation 2.23:

$$U(t) = N a \lambda t, \qquad C(t) = N a^2 \lambda t. \tag{3.2}$$

Moreover, the expressions for k(t_2, t_1) and c(t_2, t_1) are

$$k(t_2, t_1) = 1, \qquad c(t_2, t_1) = C(t_2) - C(t_1) = N a^2 \lambda (t_2 - t_1), \tag{3.3}$$

and therefore the conditional probability density is related to the probability density in this particular case by

$$p(v_2, t_2 \mid v_1, t_1, v_0) = p(v_2, t_2 - t_1 \mid v_1), \qquad v_2 > v_1. \tag{3.4}$$
Equation 2.11 is therefore a genuine renewal equation, as we would expect for the perfect integrator. The density of the output spike distribution is given by standard Laplace transform techniques as the inverse gaussian density (Gerstein & Mandelbrot, 1964),

$$f_h(t) = \frac{h}{\sqrt{2\pi N a^2 \lambda t^3}} \exp\left\{ - \frac{(h - N a \lambda t)^2}{2 N a^2 \lambda t} \right\}, \tag{3.5}$$

where the threshold is given by V_th = v_0 + h.

3.2 Solution of the Stein Model. The passive decay of the membrane potential may be incorporated in the model by a synaptic response function of the form (Stein, 1965),

$$u(t) = \begin{cases} e^{-t/\tau} & t \ge 0 \\ 0 & t < 0, \end{cases} \tag{3.6}$$
where τ is the time constant of the membrane potential decay. This form of the integrate-and-fire model with stochastic inputs was first studied by Stein (1965) and is known as the Stein model or shot-noise threshold model. For a Poisson process in the case in which all the inputs are excitatory (inhibitory inputs are considered in section 3.5) and the input rate on each fiber is identical, it is straightforward, using equations 2.19, 2.23, and 2.24, to calculate the functions

$$U(t) = N a \lambda \tau \left( 1 - e^{-t/\tau} \right)$$
$$C(t) = \tfrac{1}{2} N a^2 \lambda \tau \left( 1 - e^{-2t/\tau} \right)$$
$$k(t_2, t_1) = e^{-(t_2 - t_1)/\tau}$$
$$c(t_2, t_1) = \tfrac{1}{2} N a^2 \lambda \tau \left( 1 - e^{-2(t_2 - t_1)/\tau} \right) = C(t_2 - t_1). \tag{3.7}$$
Thus for the Stein model the conditional probability density, equation 2.18, possesses temporal homogeneity (i.e., it is stationary),

$$p(v_2, t_2 \mid v_1, t_1, v_0) \equiv q(t_2 - t_1;\, v_2, v_1, v_0), \tag{3.8}$$

where the function q(t; v_2, v_1, v_0), which is introduced here to make clearer the equivalence with the known result (Gluss, 1967), is

$$q(t;\, v_2, v_1, v_0) = \frac{1}{\sqrt{2\pi C(t)}} \exp\left\{ - \frac{\left[ v_2 - v_1 e^{-t/\tau} - v_0 \left( 1 - e^{-t/\tau} \right) - U(t) \right]^2}{2 C(t)} \right\}. \tag{3.9}$$
The equation for the probability density of the output spike distribution, equation 2.11, is therefore a convolution equation that allows a solution in terms of Laplace transforms,

$$f_{hL}(s) = \frac{p_L(v, s \mid v_0)}{q_L(s;\, v, V_{th}, v_0)} = \frac{q_L(s;\, v, v_0, v_0)}{q_L(s;\, v, V_{th}, v_0)}, \tag{3.10}$$

where we have used the following straightforward general identity between the probability density and the conditional probability density,

$$p(v, t \mid v_0) = p(v, t \mid v_0, 0, v_0) = q(t;\, v, v_0, v_0). \tag{3.11}$$
This result for f_{hL}(s), equation 3.10, valid for v ≥ V_th, agrees exactly with the known result obtained for f_{hL}(s) in the case v = V_th in the diffusion limit (Gluss, 1967). The first three moments of the probability density f_h(t) may be obtained from the Laplace transform by the relationships

$$\int_0^\infty dt\, f_h(t) = f_{hL}(0)$$
$$\int_0^\infty dt\, t\, f_h(t) = - \left. \frac{d f_{hL}(s)}{ds} \right|_{s=0}$$
$$\int_0^\infty dt\, t^2 f_h(t) = \left. \frac{d^2 f_{hL}(s)}{ds^2} \right|_{s=0}. \tag{3.12}$$

Denoting derivatives with respect to s by a prime and taking the value of each of the functions at s = 0, and with values of v ≥ V_th, we obtain solutions for r, the probability of a spike's being generated in a finite time interval, t_f, the mean ISI, and σ_out, the standard deviation of the output distribution of spikes (the width of the ISI distribution),

$$r = \int_0^\infty dt\, f_h(t) = \frac{p_L}{q_L}$$
$$t_f = \frac{1}{r} \int_0^\infty dt\, t\, f_h(t) = \frac{q'_L}{q_L} - \frac{p'_L}{p_L}$$
$$\sigma_{out}^2 = \frac{1}{r} \int_0^\infty dt\, (t - t_f)^2 f_h(t) = \frac{p''_L}{p_L} - \frac{q''_L}{q_L} + \frac{q'^2_L}{q_L^2} - \frac{p'^2_L}{p_L^2}, \tag{3.13}$$

where the primes denote derivatives with respect to s and

$$p_L \equiv p_L(v, 0 \mid v_0), \qquad q_L \equiv q_L(0;\, v, V_{th}, v_0). \tag{3.14}$$
The evaluation of the expressions for p'_L, q'_L, p''_L, and q''_L as numerical integrals follows from the standard properties of Laplace transforms, as in equation 3.12.
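The moment formulas in equations 3.12 and 3.13 can be exercised end to end on a toy example where everything is known in closed form. In the sketch below (our own illustration, not the authors' code), the s = 0 Laplace values and their derivatives are computed as numerical integrals; taking q uniform on [0, 1) and p = f ⊛ q with f also uniform on [0, 1), equation 3.13 should return r = 1, t_f = 1/2, and σ²_out = 1/12.

```python
def laplace_moments(p, q, T, steps=20000):
    """r, t_f, and sigma_out^2 from equation 3.13: the s = 0 Laplace
    transforms and their derivatives (equation 3.12) are evaluated as
    numerical integrals (trapezoidal rule) of p(t) and q(t) on [0, T]."""
    h = T / steps
    pL = [0.0, 0.0, 0.0]  # [X_L(0), X_L'(0), X_L''(0)] for p ...
    qL = [0.0, 0.0, 0.0]  # ... and for q
    for i in range(steps + 1):
        w = (0.5 if i == 0 or i == steps else 1.0) * h
        t = i * h
        for acc, fn in ((pL, p), (qL, q)):
            v = fn(t)
            acc[0] += w * v           # integral of X(t)       ->  X_L(0)
            acc[1] -= w * t * v       # -integral of t X(t)    ->  X_L'(0)
            acc[2] += w * t * t * v   # integral of t^2 X(t)   ->  X_L''(0)
    r = pL[0] / qL[0]
    t_f = qL[1] / qL[0] - pL[1] / pL[0]
    var = (pL[2] / pL[0] - qL[2] / qL[0]
           + (qL[1] / qL[0]) ** 2 - (pL[1] / pL[0]) ** 2)
    return r, t_f, var
```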
While it is perfectly possible to evaluate the parameters r, t_f, and σ_out numerically using the above Laplace transform methods, we have used direct numerical methods for Volterra integral equations (Press et al., 1992; Plesser & Tanaka, 1997) to evaluate the first-passage-time density f_h(t) from its definition in equation 2.11, using equation 2.19. This equation has been solved for some typical parameter values in order to illustrate the nature of the solutions and compare them with results in subsequent sections, where we include synaptic response functions with finite rise times (section 3.3) and inhibition (section 3.5). A typical plot of the firing probability, r (see equation 3.13), as a function of the threshold ratio, R (see equation 2.2), is given in Figure 1 for various numbers of afferent fibers, N. The input rate of EPSPs on each of the incoming fibers is λ_in = 1.0, and time is measured in units of the membrane time constant, τ = 1.0 (rates are given as spikes per unit of τ). The curves represent interpolations between solutions that were evaluated at intervals of 0.01 on the R-axis. The steepest curve corresponds to the largest number of inputs (N = 1024) and indicates an abrupt change between a firing regime at low threshold ratios and a quiescent (nonfiring) regime at high threshold ratios. In contrast, where there are fewer inputs (smaller N values), this transition between firing and nonfiring regimes extends over a broader range of threshold ratios, R. The value of the threshold ratio at which this transition from firing to nonfiring behavior takes place we define as the critical threshold ratio, R_crit. The criterion used to define the critical threshold ratio in this plot was that the probability of firing, r, be equal to 0.05, chosen to signify that the probability had risen significantly above zero (in principle, the criterion could be set at any intermediate value between zero and one).
Figure 1: Plots of the probability of firing, r (equation 3.13), versus threshold ratio, R (equation 2.2), for the Stein model. The rate of incoming EPSPs is λ_in = 1.0 on each afferent fiber, with time in units of the membrane time constant, τ (i.e., rates given as spikes per unit of τ), and number of inputs N = 1024 (solid line), 256 (dotted line), 64 (dash-dotted line), and 16 (dashed line).

The dependence of the critical threshold ratio on the rate of inputs, λ_in, is shown in Figure 2, where it is plotted for N = 16, 32, 64, 128, 256, 512, and 1024 inputs, and the results for equal input rates, λ_in, are connected by lines. The results indicate that higher rates of input will generate an output spike over a larger range of threshold ratios, as expected. Moreover, the plots in Figure 2 show that the critical threshold ratio, R_crit, decreases only slightly as the number of inputs, N, increases. Consequently a neuron with a small number of large-amplitude inputs has a similar critical threshold ratio to one with a large number of small-amplitude inputs. In the large-N limit, the behavior becomes deterministic as the variance of the inputs decreases (lim_{N→∞} C(t) = 0), and consequently a spike output is produced when the mean input reaches the threshold. The critical threshold ratio is therefore given by the condition lim_{N→∞} U(t → ∞) = h, which gives a value of lim_{N→∞} R_crit = τλ_in (from the first expression in equation 3.7), consistent with the finite-N results in Figure 2 (for which τ = 1). The mean ISI, t_f (equation 3.13), is plotted in Figure 3 for a number of input rates, λ_in (the lines represent interpolations between closely spaced individual numerical solutions). The solid lines are the results for N = 1024 inputs, which indicate that for all threshold ratios, R, the mean ISI, t_f, decreases as the input rate, λ_in, increases. Likewise, lower threshold ratios, R, cause shorter mean ISIs, t_f, for any fixed input rate, λ_in. The results have only a very slight dependence on the number of inputs, as illustrated by the dash-dotted line for N = 16 inputs at λ_in = 1.5, which is almost indistinguishable from the N = 1024 results at the same input rate. The dotted lines give the values of the threshold ratio, R, where the rate of the incoming PSPs, λ_in, equals the rate of the output spikes, λ_out = 1/t_f, which they generate. In general we expect that in the normal operating regime of a network of such neurons, there would indeed be some approximate equality between the input and output rates, in order for the neural information to be propagated to further stages of processing. The relationship between the input rate, λ_in, and the output rate, λ_out, is illustrated in Figure 4 (joined by solid lines) for the Stein model with N = 1024 inputs and three values of the threshold ratio, R (the dashed lines connect results for the synaptic response function discussed in section 3.3). This relationship for the Stein model is substantially linear, with lower threshold ratios, R, resulting in higher output rates, λ_out. Note that refractory effects, which result in a saturation of the output rate, have not been included in this plot.
Figure 2: Dependence of the critical threshold ratio, R_crit, on the number of inputs, N, and the rate of inputs, λ_in = 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 4.0, and 5.0, for the Stein model (with units τ = 1.0). R_crit was defined as the threshold ratio, R, above which the firing probability, r, fell below 0.05.
The width of the ISI distribution, σ_out (equation 3.13), as a function of the input rate, λ_in, is illustrated in Figure 5 for a threshold ratio value of R = 0.5, input rates in the range 0.5 ≤ λ_in ≤ 2.0, and various numbers of inputs, N. Also plotted is the coefficient of variation of the ISI distribution,

$$CV_{ISI} = \frac{\sigma_{out}}{t_f} = \sigma_{out}\, \lambda_{out}. \tag{3.15}$$
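Simulations of the kind reported below (10,000 Poisson-driven trials) are straightforward to reproduce in miniature. An event-driven sketch of the Stein model (our own illustration, not the authors' code; between Poisson arrivals the potential decays exponentially, and each arrival adds an EPSP of amplitude a):

```python
import math, random

def stein_isi_stats(N, lam_in, a, h_threshold, tau=1.0, trials=2000, seed=1):
    """Draw first-passage times for the Stein model: Poisson arrivals at
    total rate N * lam_in, exponential decay with time constant tau between
    arrivals, EPSP jump a per arrival, spike when the potential (measured
    from reset) reaches h_threshold.  Returns the mean ISI t_f and
    CV_ISI = sigma_out / t_f (equation 3.15)."""
    rng = random.Random(seed)
    total_rate = N * lam_in
    isis = []
    for _ in range(trials):
        v = 0.0   # potential starts from the reset value
        t = 0.0
        while v < h_threshold:
            dt = rng.expovariate(total_rate)
            v = v * math.exp(-dt / tau) + a
            t += dt
        isis.append(t)
    t_f = sum(isis) / trials
    sigma_out = (sum((x - t_f) ** 2 for x in isis) / (trials - 1)) ** 0.5
    return t_f, sigma_out / t_f
```

The illustrative parameter choice in the check below corresponds roughly to N = 64, λ_in = 1.0, a = 1/N, and a threshold ratio R = h/(Na) = 0.5, a regime in which the neuron fires reliably.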
Figure 3: Dependence of the mean ISI, t_f, on the threshold ratio, R, for the Stein model for N = 1024 (solid lines) with various rates of firing, λ_in. Also plotted are results for N = 16 with λ_in = 1.5 (dash-dotted line). The dotted lines indicate the values of the threshold ratio, R, at which the input rate, λ_in, is equal to the output rate, λ_out. Time is measured in units of the membrane time constant, τ (i.e., rates are given as spikes per unit of τ).

Figure 4: Mean output rate, λ_out, of spikes generated as a function of the input rate, λ_in, of afferent PSPs for N = 1024 inputs. The solid lines are the results for the Stein model with the three threshold ratios given. Also plotted, in dashed lines, are the results for the alpha model (equation 3.17) with the two threshold ratios given. Time is measured in units of the membrane time constant, τ (i.e., rates are given as spikes per unit of τ).

The lower value of λ_in on each of the lines in this plot is determined by the critical threshold ratio, R_crit. The error bars joined by dotted lines represent the results, with their statistical errors, of numerical simulations involving 10,000 trials for values of N = 16, 32, and 64. A stream of Poisson-distributed inputs was generated using a pseudorandom number generator; the error bars, which represent the statistical error, are only barely discernible in this plot at large values of λ_in, since the agreement between the simulations and the analytical results is so close. Near the critical threshold, the error bars become larger, and for the smaller values of N there is a small difference in CV between the simulations and the analytical results given here. These results indicate that the width of the output spike distribution, σ_out, decreases with both increasing number of inputs, N, and increasing rate of inputs, λ_in. The coefficient of variation behaves in much the same way for values of the input rate above the critical threshold; that is, CV decreases as N increases in this domain. However, in the neighborhood of the critical threshold (i.e., λ_in ≈ 0.5), the coefficient of variation increases substantially. These results indicate that there are two quite different regimes: (1) above the critical threshold, τλ_in > R_crit, and (2) near the critical threshold, τλ_in ≈ R_crit. In the region substantially above the critical threshold, the coefficient of variation decreases with increasing N in the way observed in Softky and Koch (1993), which led them to conclude that there was a conflict between the values of CV observed experimentally and the results of the integrate-and-fire model. However, in the region near the critical threshold, the coefficient of variation is substantially larger and well within the range of experimental observations (Softky & Koch, 1992; Shadlen & Newsome, 1998). Softky and Koch (1992) analyzed recordings from the primary visual cortex and extrastriate cortex of awake, behaving macaque monkeys and found values of CV in the range 0.5–1.0. Shadlen and Newsome (1998) found somewhat higher values of CV, in the range 1.0–1.5, in their recordings from the visual cortex of alert, behaving rhesus monkeys. However, Gur, Beylin, and Snodderly (1997) found that a large component of the response variability in alert monkeys was caused by fixational eye movements and that, when this was controlled for, the resulting values of CV were in the range 0.1–0.5, with a systematic effect of stimulus strength (the variance was independent of the stimulus strength, whereas the output spike rate increased as the stimulus strength increased, resulting in lower values of CV). Shadlen and Newsome (1998) simulated three models of the integration of synaptic input, which in the units used here correspond to (1) membrane time constant τ = 20 ms, no inhibition, excitatory inputs with a rate λ_in = 50 spikes/sec = 1.0 spikes/τ, N = 300 inputs, and amplitude a = 1/150, giving a threshold ratio of R = 0.5 and an output rate of λ_out = 50 spikes/sec = 1.0 spikes/τ; (2) membrane time constant τ = 1 ms, no inhibition, excitatory inputs with a rate λ_in = 50 spikes/sec = 0.05 spikes/τ, N = 300 inputs, and amplitude a = 1/16, giving a threshold ratio of R = 16/300 ≈ 0.05 and an output rate of λ_out = 50 spikes/sec = 0.05 spikes/τ; and (3) membrane time constant τ = 20 ms, balanced excitatory and inhibitory inputs with input rate λ_in = 50 spikes/sec = 1.0 spikes/τ, a lower barrier to the potential at the reset potential, N = 300 inputs, and amplitude a = 1/15, giving a threshold ratio
of R = 0.05 and an output rate of λ_out = 50 spikes/sec = 1.0 spikes/τ. Their results for case (1) correspond to the value of the threshold ratio, R = 0.5, plotted in Figure 5, and the results plotted in this figure are consistent with theirs and extend them to a whole range of input rates and numbers of inputs (their results are for λ_in = 1.0 spikes/τ, N = 300). These results and their implications are discussed further when we look at the effect of including inhibitory inputs in section 3.5.

3.3 Postsynaptic Response Functions with Finite Rise Times. Both the perfect integrator model and the Stein model analyzed above have discontinuous voltage trajectories in which an arriving EPSP causes an instantaneous jump of amplitude a in the voltage. There are a number of models of the time course of EPSPs that have a smoothly varying voltage similar to that observed in intracellular recordings (Rhode & Smith, 1986; Paolini, Clark, & Burkitt, 1997). One such model is that in which the synaptic input current has a time course given by the alpha function (Jack, Noble, & Tsien, 1985),

$$I(t) = k t e^{-\alpha t}, \qquad \alpha > 0, \tag{3.16}$$

which corresponds to delivering a total charge of k/α² to the cell. The synaptic response function u(t) of this model (which we call the alpha model) may be evaluated by integrating the voltage in the equivalent circuit representation of the integrate-and-fire neural model. The resulting expression, given u(0) = 0, is (Tuckwell, 1988a),

$$u(t) = \begin{cases} \dfrac{k e^{-t/\tau}}{B C} \left[ t e^{B t} - \dfrac{e^{B t} - 1}{B} \right], & B \ne 0, \quad t \ge 0 \\[1ex] \dfrac{k t^2 e^{-t/\tau}}{2 C}, & B = 0, \quad t \ge 0 \\[1ex] 0, & t < 0, \end{cases} \tag{3.17}$$
where τ = RC and B = 1/τ − α. The Stein model may be recovered in the limit α → ∞. Models such as this, which have a finite rise time, provide an approximation of the postsynaptic potential at the soma (or specifically at the site at which the action potential is generated, typically the axon hillock) that incorporates the time course of the diffusion of the current along the dendritic tree (Tuckwell, 1988a). The equation for the density of output spikes, f_h(t) (see equation 2.11), may be solved directly to obtain r (the probability of a spike's being produced), t_f (the mean ISI), and σ_out (the width of the ISI distribution), as described in section 2.2. The effect of using this more general synaptic response function, which incorporates a finite rise time of the EPSP, is illustrated in Figure 4, where the output rate of spikes generated is plotted as a function of the rate of the incoming EPSPs (only excitatory inputs are included here). The solid lines show the results of the Stein model at three threshold ratios, and the dashed lines are the results for the above model with α = 5 at the threshold ratios R = 0.2 and 0.5. The results indicate that the alpha model produces spikes at a lower rate than the Stein model for the same threshold ratio and that the input-output relationship differs substantially from the linear behavior evident for the Stein model. The width of the output spike distribution, σ_out, was also found to increase relative to that for the Stein model at the same threshold ratio. The value of α = 5 chosen here is somewhat on the low side of physiologically realistic values but serves to illustrate the effect of including a finite rise time of the EPSP.

Figure 5: Analytical results for the Stein model using equations 2.11 and 3.9 with threshold ratio R = 0.5 and number of inputs N = 16, 32, 64, 128, 256, 512, 1024 (lines plotted in order from top to bottom), plotted as a function of the input rate, λ_in. (a) Output jitter, σ_out. (b) Coefficient of variation, equation 3.15. The error bars joined by dotted lines represent the results of numerical simulations involving 10,000 trials.

3.4 Distribution of PSP Amplitudes. The expressions for the mean and width of the probability density function, equation 2.10, are written in a form that makes their dependence on the distribution of PSP amplitudes transparent, since the functions D_k(t) and E_k(t) are independent of the amplitude, a_k. In the discussion so far, we have considered only the case in which the amplitudes are positive (i.e., excitatory PSPs) and equal, a_k = a. The amplitude of the PSPs, however, will in general depend on their location on the dendritic tree, quantal fluctuations, and other variables (Softky & Koch, 1993). It is nevertheless straightforward, using equation 2.10, to consider the effect of various distributions of amplitudes. In particular, if we consider a distribution of amplitudes, P(a), with mean ā and standard deviation σ_a,
$$\bar{a} = \int_0^\infty a\, P(a)\, da, \qquad \sigma_a^2 = \int_0^\infty (a - \bar{a})^2 P(a)\, da, \tag{3.18}$$

then the expressions for U(t) and C(t), equation 2.23, which determine the probability density of the potential, become (Stein, 1967)

$$U(t) = N \bar{a} \lambda \int_0^t dt'\, u(t')$$
$$C(t) = N \left( \bar{a}^2 + \sigma_a^2 \right) \lambda \int_0^t dt'\, u^2(t'). \tag{3.19}$$
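Equation 3.19 implies that a spread of amplitudes leaves the mean U(t) governed by ā alone but inflates the variance C(t) by the factor (ā² + σ_a²)/ā² relative to fixed amplitudes ā. A small sketch for a discrete (hypothetical) amplitude distribution:

```python
def amplitude_stats(amps, probs):
    """Mean abar, variance sigma_a^2, and the variance-inflation factor
    (abar^2 + sigma_a^2) / abar^2 by which C(t) in equation 3.19 exceeds
    the equal-amplitude case, for a discrete PSP-amplitude distribution."""
    abar = sum(a * p for a, p in zip(amps, probs))
    var = sum((a - abar) ** 2 * p for a, p in zip(amps, probs))
    return abar, var, (abar * abar + var) / (abar * abar)
```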
3.5 Including Inhibition. IPSPs may be included as a special case of the previous section in which the inhibitory terms are negative, a_k < 0. In the simplest situation, there are N_E afferent fibers contributing EPSPs of amplitude a_E and N_I afferent fibers contributing IPSPs of amplitude a_I:

$$a_k = a_E > 0 \quad \text{for } k = 1, \ldots, N_E$$
$$a_k = -a_I < 0 \quad \text{for } k = N_E + 1, \ldots, N_E + N_I. \tag{3.20}$$

The functions U(t) and C(t), equation 2.10, then take the form

$$U(t) = N_E a_E D_E(t) - N_I a_I D_I(t)$$
$$C(t) = N_E a_E^2 \left( E_E(t) - D_E^2(t) \right) + N_I a_I^2 \left( E_I(t) - D_I^2(t) \right), \tag{3.21}$$
where D_{E,I} and E_{E,I} are given by equation 2.8 for the excitatory and inhibitory inputs, respectively, which may have differing synaptic response functions u_E(t) and u_I(t). The effect of including inhibitory inputs is illustrated in Figure 6 for the Stein model, where the lines join the results for particular numbers of inhibitory inputs. The first plot shows the relationship between the input and output rates at a threshold ratio of R = 0.5 for the case of N_E = 64 excitatory inputs and N_I = 0, 16, 32, and 48 inhibitory inputs, both of the same magnitude, a_E = a_I. The threshold ratio is defined, as before, just in terms of the excitatory inputs, R = h/(N_E a_E) (equation 2.2). The plot indicates that the inhibitory inputs lower the rate of outputs. The second plot shows the coefficient of variation, equation 3.15, as a function of the input rate. The error bars joined by the dotted lines for the N_I = 0 and 16 cases represent the results and associated statistical errors of numerical simulations of Poisson-distributed inputs involving 10,000 trials. It is observed that introducing inhibitory inputs increases the coefficient of variation of the output spikes and that this effect is most pronounced near the critical threshold ratio (the leftmost points of the plots in Figure 6). It has been postulated (Shadlen & Newsome, 1994) that the inclusion of inhibition is crucial in explaining the apparent discrepancy between the observed irregularity of ISIs in the cortex and the regularity that is evident in models of integrate-and-fire neurons (Softky & Koch, 1992, 1993). The experimentally observed values of the coefficient of variation take values in the range 1.0–1.5 (Shadlen & Newsome, 1998), as discussed in section 3.2 (although substantially lower values in the range 0.1–0.5 have been reported when eye movement has been controlled for; Gur et al., 1997).
The integrate-and-fire model with only excitatory inputs gives a train of output spikes that is very regular, as illustrated in Figure 5 by the small values of the coefficient of variation in the region substantially above the critical threshold. However, in the parameter region near the critical threshold, the coefficient of variation for the integrate-and-fire model takes values much closer to one, as discussed in section 3.2 and plotted in Figure 5. The results for including inhibitory inputs, plotted in Figure 6, show an increase in the coefficient of variation, and this increase is largest near the critical threshold. The results plotted here show values of the coefficient of variation and trends that are in agreement with those in the study of a conductance-based integrate-and-fire model (Troyer & Miller, 1997): a decrease in the coefficient of variation as the input rate increases and an increase in the coefficient of variation with more inhibitory inputs (the parameters used in their study
were λ_I/λ_E = 0.75, N_E λ_E ≈ 140 spikes/τ, and N_I λ_I ≈ 65 spikes/τ, and their model includes a reversal potential at the reset voltage, which is not included in this analysis). Of particular interest is the model in which the excitation and inhibition are balanced, which gives rise to values of CV close to one (Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996; Shadlen & Newsome, 1998). In this balanced situation, the amplitude of the EPSPs and IPSPs may be much larger, of order 1/√N, so that the output spike rate is of the same order of magnitude as the input rates and the fluctuations in the voltage remain finite in the large-N limit (Troyer & Miller, 1997; van Vreeswijk & Sompolinsky, 1998). To model inhibitory inputs accurately, it is necessary to take into account the effect of the reversal potential, which gives the intracellular potential a lower bound and causes the amplitude of the inputs to depend on the current value of the potential (i.e., the PSP amplitude decreases the nearer the membrane potential is to the reversal level). In the simulation of Shadlen and Newsome (1998; their case 3, as discussed here in section 3.2), the effect of the inhibitory reversal potential is approximated by introducing a lower barrier below which the potential is not permitted to fall. A more complete investigation of the effect of inhibition, which addresses these questions, is beyond the scope of this article, which deals only with synaptic inputs that are summed linearly. Such questions are addressed elsewhere (Burkitt & Clark, 1999b), where it is shown how the technique presented here for analyzing integrate-and-fire neurons can be extended to include the effect of reversal potentials.

4 Conclusions
We have presented a new technique for analyzing the behavior of integrate-and-fire neurons, which allows both the probability density of the summed potential and the density of output spikes (i.e., the ISI distribution) to be calculated. The technique for calculating the probability density, which involves integrating over the arrival times of the input distribution, provides an alternative to the conventional diffusion approximation methods involving random walks or stochastic differential equations, and this new method allows the analysis of both arbitrary input distributions and synaptic response functions (i.e., including rise time and decay of the membrane potential). The output distribution of spikes is given by the first-passage time to threshold, which is defined by an integral relationship involving the first-passage time and a conditional probability density. This relationship is a generalization of the renewal equation that holds for both leaky neurons and situations in which there is no time-translational invariance. The technique has been formulated in terms of an arbitrary probability distribution of inputs, and results are derived for inputs that have a Poisson distribution.
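As an illustrative aside (not part of the paper's derivation), the Poisson-input perfect-integrator case can be checked by direct simulation: if k identical EPSP steps are needed to reach threshold, the ISI is the waiting time for k Poisson arrivals, that is, a gamma variate with shape k, so the mean ISI is k/λ_in and the CV is 1/√k. The input rate and step count below are arbitrary illustrative values.

```python
import numpy as np

# Monte Carlo check of the perfect integrator with Poisson inputs:
# the ISI is a sum of k exponential interarrival times, i.e. a
# gamma(k, 1/lam_in) variate, so CV = 1/sqrt(k).  This illustrates why
# summation of many small inputs gives very regular output spiking.
rng = np.random.default_rng(0)
lam_in = 100.0   # input rate (events per unit time; illustrative value)
k = 20           # number of inputs needed to reach threshold

isi = rng.exponential(1.0 / lam_in, size=(50000, k)).sum(axis=1)

mean_isi = isi.mean()       # theory: k / lam_in = 0.2
cv = isi.std() / mean_isi   # theory: 1 / sqrt(k) ~ 0.224
print(mean_isi, cv)
```

The small CV for large k is the regime discussed in the conclusions: many small-amplitude inputs must summate, so the output train is far more regular than a Poisson train.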
Figure 6: Results for the Stein model with N_E = 64 excitatory inputs, threshold ratio R = 0.5, and number of inhibitory inputs N_I = 0, 16, 32, and 48, plotted as a function of the input rate, λ_in. The amplitudes of the EPSPs and IPSPs are equal. (a) Mean output rate, λ_out. (b) Coefficient of variation of the output spikes, equation 3.15. The error bars joined by dotted lines represent the results of numerical simulations involving 10,000 trials.
The technique is a formal calculation that gives results that coincide with those using the diffusion method for the two cases studied here: the perfect integrator and the Stein model. For the perfect integrator model, the method reproduces the well-known result (Gerstein & Mandelbrot, 1964). In the case of the Stein model (Stein, 1965), which incorporates decay of the potential, the method reproduces the known solution for the moments of the probability density of output spikes, which are expressed in terms of Laplace transforms of the probability density of the potential and the conditional probability density (Gluss, 1967). Synaptic response functions with finite rise time have also been analyzed, and the moments of the spike distribution, which give, respectively, the probability of an output spike, the average time of spike generation, and the output jitter, have been obtained numerically. The results are expected to be most accurate when the amplitude of the individual synaptic inputs, a, is small in comparison to the threshold, θ (which is the difference between the firing and reset potentials), and the number of inputs, N, is large (we assume that the amplitude, a, scales with 1/N), that is, where a large number of small-amplitude inputs are required to summate in order for the membrane potential to reach the firing threshold, as is the case in many neural systems. The linear summation of the individual contributions from each synapse in equation 2.1 makes it possible to analyze situations in which a single neuron has a variety of synaptic response functions. Such freedom could be used to model the effect of synapses at different parts of the dendritic tree, whereby synapses nearer the soma would have a synaptic response function with larger amplitude and shorter rise time than synapses farther away.
The technique also enables a distribution of PSP amplitudes to be analyzed, including amplitudes that fluctuate randomly about a mean value, which could be used to model the effect of quantal fluctuations. The relationship between the rate of incoming EPSPs, λ_in, and the resultant rate of firing, λ_out, the width of the ISI distribution, σ_out, and the coefficient of variation, CV (equation 3.15), has been analyzed here for the Stein model and the alpha model (which includes both rise time and decay of the membrane potential) for a range of both thresholds and numbers of inputs. The results indicate that the probability of firing, ρ, falls from one to zero as the threshold ratio, R, increases, and that this decline in the firing probability is sharper as the number of inputs, N, increases. This rapid transition from a firing to a nonfiring domain can be characterized by a value of the threshold ratio, R_crit, which depends principally on the input rate. In the cases investigated here, the small value of the coefficient of variation, CV, above the critical threshold indicates that the output distribution is not Poissonian in this range of parameters, but rather is much more regular than typically observed in cerebral cortex (Softky & Koch, 1992, 1993). However, the results here indicate that near the critical threshold, the coefficient of variation, CV, has values much closer to those experimentally observed. An alternative
explanation involves the effect of inhibitory inputs (Shadlen & Newsome, 1994), and studies using this technique do indeed show that the coefficient of variation increases when IPSPs are included. A more comprehensive examination of this issue, incorporating reversal potentials (Tuckwell, 1979; Hanson & Tuckwell, 1983; Musila & Lánský, 1994) for the inhibitory PSPs, is under way to examine this possibility (Burkitt & Clark, 1999b).

Appendix A: Evaluation of D_k(t) and E_k(t)
The functions D_k and E_k, equation 2.8, may be evaluated using Laplace transforms, which are indicated by the subscript L, $p_{k,L}(s) = \mathcal{L}\{p_k(t)\}$, as follows:

$$D_{k,L}(s) = u_{k,L}(s) \sum_{m=1}^{\infty} p_{k,L}^{m}(s) = \frac{p_{k,L}(s)}{1 - p_{k,L}(s)}\, u_{k,L}(s), \tag{A.1}$$
where the convolution property of Laplace transforms is repeatedly applied. The expression for E_k is separated into a diagonal part (m = m′) and an off-diagonal part (m ≠ m′) that are evaluated independently. The diagonal term is evaluated as above for D_{k,L}(s) (except now with u_k²(t)). The lower and upper off-diagonal terms are equal and consist of two infinite sums, each of which may be successively treated in the same way as above:

$$E_{k,L}(s) = \mathcal{L}\{u_k^2(t)\} \sum_{m=1}^{\infty} p_{k,L}^{m}(s) + 2 \sum_{m=1}^{\infty} p_{k,L}^{m}(s)\, \mathcal{L}\Big\{ u_k(t)\, \mathcal{L}^{-1}\Big( u_{k,L}(s) \sum_{m'=1}^{\infty} p_{k,L}^{m'}(s) \Big) \Big\}$$
$$= \frac{p_{k,L}(s)}{1 - p_{k,L}(s)}\, \mathcal{L}\{u_k^2(t)\} + 2\, \frac{p_{k,L}(s)}{1 - p_{k,L}(s)}\, \mathcal{L}\{u_k(t) D_k(t)\}. \tag{A.2}$$
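The geometric series of convolutions that produces the factor p_{k,L}/(1 − p_{k,L}) in equations A.1 and A.2 can be checked numerically in a simple special case (an illustration, not the paper's computation): for Poisson inputs the interarrival density is exponential, and the sum of its m-fold convolutions, the renewal density, is the constant input rate.

```python
import numpy as np

# Numerical check of the series sum_m p^{*m}(t), whose Laplace
# transform is p_L(s) / (1 - p_L(s)).  For an exponential interarrival
# density p(t) = lam * exp(-lam * t) (Poisson inputs), the series sums
# to the constant renewal density lam.  Midpoint sampling keeps the
# discretized densities close to unit mass.
lam, dt, t_max, terms = 2.0, 2e-3, 4.0, 80
t = (np.arange(int(t_max / dt)) + 0.5) * dt
p = lam * np.exp(-lam * t)

conv = p.copy()     # current m-fold convolution p^{*m}
total = p.copy()    # running sum over m
for _ in range(terms - 1):
    conv = np.convolve(conv, p)[:len(t)] * dt
    total += conv

# away from the t = 0 boundary the sum should be flat at lam
print(total[500], total[1500])   # t ~ 1.0 and t ~ 3.0
```

The truncation at 80 terms is harmless on this time window because the m-fold convolution is a gamma density whose mass lies well beyond t_max for large m.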
For a Poisson process, these expressions take particularly simple forms, as shown in section 2.3.

Appendix B: Evaluation of G(t₂, t₁)
As in the case of the above expression for E_{k,L}, equation A.2, we separate G(t₂, t₁) into diagonal (m = m′) and off-diagonal (m ≠ m′) parts that are evaluated independently. The diagonal term is evaluated in a similar fashion to the above, except now the Laplace transforms are carried out with respect to t₁, and the diagonal term is u(t₁)u(t₁ + Δ), where Δ = t₂ − t₁ ≥ 0 is treated as a parameter. The lower and upper off-diagonal terms are equal and consist of two infinite sums, each of which may be successively treated
in the same way as in appendix A:

$$G_L(s; \Delta) = \mathcal{L}\{u(t_1)u(t_1 + \Delta)\} \sum_{m=1}^{\infty} p_L^{m}(s) + 2 \sum_{m=1}^{\infty} p_L^{m}(s)\, \mathcal{L}\Big\{ u(t_1)\, \mathcal{L}^{-1}\Big( u_L(s') \sum_{m'=1}^{\infty} p_L^{m'}(s') \Big) \Big\}$$
$$= \frac{p_L(s)}{1 - p_L(s)}\, \mathcal{L}\{u(t_1)u(t_1 + \Delta)\} + 2\, \frac{p_L(s)}{1 - p_L(s)}\, \mathcal{L}\{u(t_1) D(t_1 + \Delta)\}, \tag{B.1}$$
where s and s′ refer to the Laplace transforms with respect to t₁ and t₂, respectively. For a Poisson process, these expressions take particularly simple forms (see section 2.3).

Acknowledgments
This work was funded by the Cooperative Research Centre for Cochlear Implant, Speech and Hearing Research. We thank the anonymous referees for their useful comments on the manuscript.

References

Abeles, M. (1982a). Local cortical circuits: An electrophysiological study. Berlin: Springer-Verlag.
Abeles, M. (1982b). Role of the cortical neuron: Integrator or coincidence detector? Isr. J. Med. Sci., 18, 83–92.
Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Adrian, E. (1928). The basis of sensation: The action of sense organs. London: Christophers.
Bernander, O., Koch, C., & Usher, M. (1994). The effect of synchronized inputs at the single neuron level. Neural Comput., 6, 622–641.
Bialek, W., & Rieke, F. (1992). Reliability and information transmission in spiking neurons. Trends Neurosci., 15, 428–434.
Burkitt, A. N., & Clark, G. M. (1999a). Analysis of integrate-and-fire neurons: Synchronization of synaptic input and spike output in neural systems. Neural Comput., 11, 871–901.
Burkitt, A. N., & Clark, G. M. (1999b). Balanced networks: Analysis of leaky integrate-and-fire neurons with reversal potentials. Submitted.
Carr, C. E., & Friedman, M. (1999). Evolution of time coding systems. Neural Comput., 11, 1–20.
Cope, D. K., & Tuckwell, H. C. (1979). Firing rates of neurons with random excitation and inhibition. J. Theor. Biol., 80, 1–14.
Diesmann, M., Gewaltig, M. O., & Aertsen, A. (1996). Characterization of synfire activity by propagating "pulse packets". In J. Bower (Ed.), Computational neuroscience: Trends in research (pp. 59–64). San Diego: Academic Press.
Douglas, R., & Martin, K. (1991). Opening the grey box. Trends Neurosci., 14, 286–293.
Engel, A. K., König, P., Kreiter, A. K., Schillen, T. B., & Singer, W. (1992). Temporal coding in the visual cortex: New vistas on integration in the nervous system. Trends Neurosci., 15, 218–226.
Gabbiani, F., & Koch, C. (1996). Coding of time-varying signals in spike trains of integrate-and-fire neurons with random threshold. Neural Comput., 8, 44–66.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophys. J., 4, 41–68.
Gerstner, W. (1995). Time structure of the activity in neural network models. Phys. Rev. E, 51, 738–758.
Gluss, B. (1967). A model for neuron firing with exponential decay of potential resulting in diffusion equations for probability density. Bull. Math. Biophysics, 29, 233–243.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. J. Neurosci., 17, 2914–2920.
Hanson, F. B., & Tuckwell, H. C. (1983). Diffusion approximations for neuronal activity including synaptic reversal potentials. J. Theoret. Neurobiol., 2, 127–153.
Hopfield, J. J. (1995). Pattern recognition computation using action potential timing for stimulus representation. Nature, 376, 33–36.
Iyengar, S., & Liao, Q. (1997). Modeling neural activity using the generalized inverse gaussian distribution. Biol. Cybern., 77, 289–295.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1985). Electric current flow in excitable cells. Oxford: Clarendon Press.
Judd, K. T., & Aihara, K. (1993). Pulse propagation networks: A neural network model that uses temporal coding by action potentials. Neural Networks, 6, 203–215.
Kistler, W. M., Gerstner, W., & van Hemmen, J. L. (1997). Reduction of the Hodgkin-Huxley equations to a single-variable threshold model. Neural Comput., 9, 1015–1045.
Koch, C. (1999). Biophysics of computation: Information processing in single neurons. Oxford: Oxford University Press.
Lánský, P. (1984). On approximations of Stein's neuronal model. J. Theor. Biol., 107, 631–647.
Lapicque, L. (1907). Recherches quantitatives sur l'excitation électrique des nerfs traitée comme une polarization. J. Physiol. (Paris), 9, 620–635.
Maass, W. (1996a). Lower bounds for the computational power of networks of spiking neurons. Neural Comput., 8, 1–40.
Maass, W. (1996b). Networks of spiking neurons. In P. Bartlett, A. Burkitt, & R. C. Williamson (Eds.), Proceedings of the Seventh Australian Conference on Neural Networks. Canberra: Australian National University.
Maršálek, P., Koch, C., & Maunsell, J. (1997). On the relationship between synaptic input and spike output jitter in individual neurons. Proc. Natl. Acad. Sci. USA, 94, 735–740.
Musila, M., & Lánský, P. (1994). On the interspike intervals calculated from diffusion approximations of Stein's neuronal model with reversal potentials. J. Theor. Biol., 171, 225–232.
Paolini, A. G., Clark, G. M., & Burkitt, A. N. (1997). Intracellular responses of rat cochlear nucleus to sound and its role in temporal coding. NeuroReport, 8(15), 3415–3422.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). Singapore: McGraw-Hill International Editions.
Plesser, H. E., & Tanaka, S. (1997). Stochastic resonance in a model neuron with reset. Phys. Lett. A, 225, 228–234.
Press, W. H., Flannery, B. P., Teukolsky, S. A., & Vetterling, W. T. (1992). Numerical recipes in Fortran: The art of scientific computing. Cambridge: Cambridge University Press.
Rhode, W. S., & Smith, P. H. (1986). Encoding timing and intensity in the ventral cochlear nucleus of the cat. J. Neurophysiol., 56, 261–286.
Shadlen, M. N., & Newsome, W. T. (1994). Noise, neural codes and cortical organization. Curr. Opin. Neurobiol., 4, 569–579.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Softky, W. R. (1994). Sub-millisecond coincidence detection in active dendritic trees. Neurosci., 58, 13–41.
Softky, W. R., & Koch, C. (1992). Cortical cells should fire regularly, but do not. Neural Comput., 4, 643–646.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophys. J., 5, 173–194.
Stein, R. B. (1967). Some models of neuronal variability. Biophys. J., 7, 37–68.
Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. J. Comput. Neurosci., 2, 149–162.
Troyer, T. W., & Miller, K. D. (1997). Physiological gain leads to high ISI variability in a simple model of a cortical regular spiking cell. Neural Comput., 9, 971–983.
Tsodyks, M. V., & Sejnowski, T. J. (1995). Rapid state switching in balanced cortical network models. Network, 6, 111–124.
Tuckwell, H. C. (1977). On stochastic models of the activity of single neurons. J. Theor. Biol., 65, 783–785.
Tuckwell, H. C. (1978). Neuronal interspike time histograms for a random input model. Biophys. J., 21, 289–290.
Tuckwell, H. C. (1979). Synaptic transmission in a model for stochastic neural activity. J. Theor. Biol., 77, 65–81.
Tuckwell, H. C. (1988a). Introduction to theoretical neurobiology: Vol. 1, Linear cable theory and dendritic structure. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1988b). Introduction to theoretical neurobiology: Vol. 2, Nonlinear and stochastic theories. Cambridge: Cambridge University Press.
Tuckwell, H. C. (1989). Stochastic processes in the neurosciences. Philadelphia: Society for Industrial and Applied Mathematics.
Tuckwell, H. C., & Cope, D. K. (1980). Accuracy of neuronal interspike times calculated from a diffusion approximation. J. Theor. Biol., 83, 377–387.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wilbur, W. J., & Rinzel, J. (1982). An analysis of Stein's model for stochastic neuronal excitation. Biol. Cybern., 45, 107–114.

Received April 28, 1998; accepted June 7, 1999.
LETTER
Communicated by Misha Tsodyks
Statistical Procedures for Spatiotemporal Neuronal Data with Applications to Optical Recording of the Auditory Cortex

O. François
L. Mohamed Abdallahi
Department of Statistics, LMC-IMAG, BP 53, 38041 Grenoble cedex 9, France

J. Horikawa
I. Taniguchi
Department of Neurophysiology, Tokyo Medical and Dental University, Kanda-surugadai 2-3-10, Tokyo 101, Japan

T. Hervé
TIMC-IMAG, Faculté de Médecine, IAB, 38700 La Tranche, France

This article presents new procedures for multisite spatiotemporal neuronal data analysis. A new statistical model, the diffusion model, is considered, whose parameters can be estimated from experimental data thanks to mean-field approximations. This work has been applied to optical recording of the guinea pig's auditory cortex (layers II–III). The rates of innovation and internal diffusion inside the stimulated area have been estimated. The results suggest that the activity of the layer balances between the alternate predominance of its innovation process and its internal process.

1 Introduction
During the past few years, optical recording has become a prominent technique in analyzing neural network activity (Grinvald, Frostig, Lieke, & Hildesheim, 1988; Cohen et al., 1989; Orback, Cohen, & Grinvald, 1996). However, such processing supplies a huge amount of spatiotemporal data from which the information is sometimes difficult to extract. According to Hebb (1949), the implementation of representations of sensory events in the brain would rely on high-order statistics computed from the temporal neuronal discharge patterns within groups of cells. Perception would be dynamically encoded in the temporal relationship between coinciding active cells. Estimating interaction parameters and their temporal evolution is thus a major issue in multisite data interpretation. Methodologies inspired by the statistical mechanics formalism have been introduced several times. For instance, the concept of random field (RF) has been used to analyze spatial data (Hervé, Dolmazon, & Demongeot, 1990). When the RF parameters vary, a large number of scenarios may be statistically recovered (François, Demongeot, & Hervé, 1992). Martignon, Von Hasseln, Gruen, Aertsen, and Palm (1995) and Makarenko, Welsh, Lang, and Llinas (1997) have also proposed statistical procedures based on Markov random fields (MRF). Nevertheless, the statistics obtained by fitting RF or MRF models on neural data are not intuitive. Although these methods have given useful insights into the structure of the data, there is an obvious need for different procedures. This article presents a new statistical method for multidimensional neural signal processing. The method relies on a parametric spatiotemporal model, the diffusion model (DM), with three parameters: the inverse duration of activity, the frequency of external arrivals, and the rate of diffusion of activity from a site toward its neighboring sites. Section 2 presents the MRF and the DM. The estimation of the diffusion parameter is performed on the basis of standard statistics such as the mean activity and the spatial covariance between neighboring sites. An identification with an MRF model is used to estimate the frequency of external arrivals. The statistical procedures needed to estimate the parameters of the two models are described in section 3. The method has been used on data gathered from the auditory cortex of the guinea pig, for which spatiotemporal patterns have been observed (Horikawa, Hosokawa, Kubota, Nasu, & Taniguchi, 1996). The neural activity of layers II–III has been recorded via an optical method in response to pure tone bursts. The continuous signal of each photodiode has been converted into binary code. The neural activity is then analyzed in the light of the MRF and DM approaches (section 4). These approaches enable an estimate of an input process arriving on layers II–III and an internal process. From these results, the excitatory and inhibitory processes involved in this experiment are discussed.

Neural Computation 12, 1821–1838 (2000) © 2000 Massachusetts Institute of Technology
It is suggested that the activity of layers II–III balances between the alternate predominance of its innovation process and its internal process (section 4.3).

2 Model Description
In what follows, it is assumed that binary data have been obtained from a two-dimensional grid S of optical diodes. The geometrical structure induced by the measurements involves lattice models, the neighbors of a site being the nearest sites on the grid. The neighborhood of site i ∈ S is denoted by N_i and has size m = 4. The number of sites is n (typically n = 144). A site with the value 1 is considered active, whereas the value 0 means that the site is passive. At each instant t, an n-dimensional binary vector x = (x₁, . . . , x_n) is observed and represents the activity of the sites. This section presents two statistical models that account for the spatial nature of the data. They do not model the brain activity directly but rather the data collected by the grid. The first model is the Ising MRF model, which has been used several times in this context. The second model is the DM. In each model, the global activity of the sites is separated into two components,
an internal one and an external one, for which distinct parameters are introduced.

2.1 Ising Model. MRF on the lattice are known to provide pertinent models for interacting binary systems (Besag, 1974; Ripley, 1988; Guyon, 1995). Among MRF models, the most famous is the Ising model, which has been used by Hervé et al. (1990), Martignon et al. (1995), and Makarenko et al. (1997) to analyze neural signals. The Ising model has two parameters, a and b, and the likelihood function is defined as

$$L(a, b; x) = \exp(H(x)) / Z(a, b), \qquad x \in \{0, 1\}^S, \tag{2.1}$$

where H(x) is equal to

$$H(x) = \sum_{i \in S} a x_i + \frac{1}{2} \sum_{i \in S} \sum_{j \in N_i} b x_i x_j, \tag{2.2}$$
and Z(a, b) is the normalization constant. The parameter a is called the singleton potential and can be viewed as the intensity of an external field applied to the network. The parameter b is called the pair potential and can be interpreted as the intensity of the interaction between pairs of neighboring sites. A positive value of b reflects the coactivity of neighboring sites, whereas a negative one inhibits links between sites. Historically, such a model was introduced to explain the phenomenon of magnetism. Though the model is popular, using it for biological interpretation may be difficult because the physical meaning of its parameters in the context of neural data is not always clear. So alternative models are relevant, and new statistical methods are needed to estimate the parameters of the new models.

2.2 Diffusion Model. The DM is based on the principle of diffusion instead of interaction. In contrast to MRF, the description of the DM is dynamical. Data are assumed to be sampled from a multidimensional continuous-time stochastic process. The DM is a Markovian model defined by the rates of transition λ(x, xⁱ) between the vector x and the vector xⁱ obtained by modifying x at site i: xⁱ = (x₁, . . . , x_{i−1}, 1 − x_i, x_{i+1}, . . . , x_n). The probability of modifying a site value during a small time interval dt is given by
$$\mathrm{Prob}(x(t + dt) = x^i \mid x(t) = x) = \lambda(x, x^i)\,dt + o(dt), \tag{2.3}$$

with

$$\lambda(x, x^i) = \begin{cases} \delta & \text{if } x_i = 1 \\ \lambda + \dfrac{\mu}{m} \sum_{j \in N_i} x_j & \text{if } x_i = 0, \end{cases} \tag{2.4}$$
where δ is a positive constant, and the following condition is assumed to hold,

$$\lambda + \frac{\mu}{m} \sum_{j \in N_i} x_j > 0, \tag{2.5}$$
in order to guarantee that the rates (see equation 2.4) are always positive. In this approach, signals are modeled as point processes, as many references have done before (Perkel, Gerstein, & Moore, 1967). The parameters may be interpreted as follows. Processes of arrivals and departures are associated with each site. Innovation on a passive site, which may be viewed as the arrival of spikes, activates this site at rate λ. A departure on an active site stops its activity. The constant 1/δ can be interpreted as the mean duration of activity of a site. An active site returns to its resting state independent of the other sites. An active site may also propagate (or diffuse) its activity to a neighboring site, chosen with uniform probability 1/m, at a rate equal to μ. The parameter μ is called the diffusion rate. If μ = 0, the sites behave independently, and no coherent activity can be observed. If μ ≠ 0, an interaction between the sites exists, and a coherent activity may be observed. During the activity of a site, the influence on the neighbors may be either excitatory or inhibitory, depending on the sign of μ. With large, positive values of μ, a large number of simultaneously active neighbors are observed. Conversely, with negative values, the active sites are isolated. Provided that they are nonnegative, all the parameters have the dimension of a frequency and can be measured from the real data in kHz. When they are negative, the absolute value of the parameters may also be interpreted as a frequency. In this case, the occurrence of an event (arrival or diffusion) on a given site may be viewed as having an inhibitory effect. Indeed, the probability of firing is then decreased instead of increased.

2.3 Mean-Field Analysis of the Diffusion Model. This section presents the analysis of the DM. The Markov process defined by the rates (see equation 2.4) can be considered an interacting particle system. (A reference textbook for such systems is Liggett, 1985.)
Very few estimation procedures exist to identify the parameters of these systems. When λ is positive, the model is additive (Griffeath, 1979). The convergence to equilibrium is exponentially fast (the equilibrium is unique). Moreover, according to Griffeath (1979), the exponential rate of convergence is bounded above by λ uniformly in n. This property allows considering that the system is at equilibrium. However, the equilibrium distribution of the process has no convenient analytical representation. The analysis will then rely on low-order statistics and mean-field approximations. For each instant t ≥ 0, consider the probability that a site is active,
$$u(t) = P(x_i(t) = 1), \tag{2.6}$$
which is independent of i by spatial invariance, and

$$u = \lim_{t \to \infty} u(t). \tag{2.7}$$
The Kolmogorov equations of the DM lead to, for all i in S and j in N_i,

$$\frac{du(t)}{dt} = \lambda - (\delta + \lambda - \mu)u(t) - \mu E[x_i(t) x_j(t)], \tag{2.8}$$

and, under equilibrium,

$$\lambda - (\delta + \lambda - \mu)u - \mu E[x_i x_j] = 0. \tag{2.9}$$
The probability u that a site is active can be estimated by the empirical mean,

$$\hat{u} = \frac{1}{n} \sum_{i=1}^{n} x_i, \tag{2.10}$$

while the expectation E[x_i x_j] can be estimated as

$$\hat{r} = \frac{1}{nm} \sum_{i \in S,\; j \in N_i} x_i x_j. \tag{2.11}$$
The mean-field approximation assumes that the spatial covariance between two neighboring sites is null, that is,

$$\mathrm{Cov}(x_i, x_j) = E[x_i x_j] - E[x_i]E[x_j] \approx 0. \tag{2.12}$$

Since

$$E[x_i] = u, \tag{2.13}$$

equation 2.9 implies that

$$\lambda - (\delta + \lambda - \mu)u - \mu u^2 \approx 0. \tag{2.14}$$
According to the sign of the covariance, an excitatory or an inhibitory effect can be detected directly (or indirectly through MRF). However, such a conclusion fails when the covariance is small. In contrast, mean-field approximations become relevant. Thus, mean-field statistics are complementary to standard ones. The mean-field statistics for the DM are obtained from equation 2.14. An expression for the diffusion rate μ is given by

$$\mu = \frac{(\delta + \lambda)}{u(1 - u)} \left( u - \frac{\lambda}{\lambda + \delta} \right). \tag{2.15}$$
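As a sketch of how the model behaves in practice (not code from the article; periodic boundaries and the small fixed time step are simplifying assumptions of the sketch), the DM defined by equation 2.4 can be simulated on a 12 × 12 grid and the empirical mean activity compared with the positive root of the mean-field equation 2.14, using parameter magnitudes of the order quoted in section 3.2:

```python
import numpy as np

# Discrete-time (tau-leaping) simulation of the diffusion model on a
# 12x12 grid with periodic boundaries.  Active sites deactivate at
# rate delta; passive sites activate at rate lam + (mu/m) * (number of
# active neighbours), as in eq. 2.4.
rng = np.random.default_rng(0)
L, m = 12, 4
delta, lam, mu = 0.2, 0.02, 0.2   # parameter magnitudes from section 3.2
dt, steps, burn = 0.05, 40000, 10000

x = np.zeros((L, L), dtype=int)
u_samples = []
for step in range(steps):
    # active neighbours of each site (periodic wrap-around)
    nb = (np.roll(x, 1, 0) + np.roll(x, -1, 0)
          + np.roll(x, 1, 1) + np.roll(x, -1, 1))
    p_on = (lam + mu / m * nb) * dt   # passive -> active in [t, t+dt)
    p_off = delta * dt                # active -> passive in [t, t+dt)
    flips = rng.random((L, L))
    x = np.where(x == 1,
                 (flips > p_off).astype(int),   # active site survives?
                 (flips < p_on).astype(int))    # passive site activates?
    if step >= burn:
        u_samples.append(x.mean())

u_hat = float(np.mean(u_samples))
# mean-field prediction: positive root of mu*u^2 + (delta+lam-mu)*u - lam = 0
b = delta + lam - mu
u_mf = (-b + np.sqrt(b * b + 4 * mu * lam)) / (2 * mu)
print(u_hat, u_mf)
```

Because μ > 0 induces positive spatial correlations, the simulated mean activity sits close to, but not exactly at, the mean-field root; the gap is the bias that section 3.2 reports as small for these parameter values.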
3 Statistical Procedures
The statistical procedures needed to estimate the parameters of the MRF and the DM are presented in this section. For the DM, the procedures are new. The data x = (x₁, . . . , x_n) consist of n binary variables observed on a lattice. This section also discusses the advantages of each model over the other.

3.1 Markov Random Field. The usual estimation procedures are the maximum likelihood (ML) method, which is computationally heavy, and the pseudo-likelihood (PL) method, due to Besag (1974). In the pseudo-likelihood method, the quantity to maximize is
$$PL(a, b) = \prod_{i=1}^{n} P(x_i \mid x_j,\; j \in N_i). \tag{3.1}$$
Computing the conditional probability leads to

$$P(x_i = 1 \mid x_j;\; j \in N_i) = \frac{1}{1 + e^{-y_i}}, \tag{3.2}$$

with

$$y_i = a + b \sum_{j \in N_i} x_j. \tag{3.3}$$
Hence, the maximization of the PL can be achieved by the gradient algorithm

$$\hat{a}_{k+1} = \hat{a}_k + \eta\, \frac{\partial \log PL}{\partial a}(\hat{a}_k, \hat{b}_k), \qquad \hat{b}_{k+1} = \hat{b}_k + \eta\, \frac{\partial \log PL}{\partial b}(\hat{a}_k, \hat{b}_k), \tag{3.4}$$

with η a small constant, â₀ and b̂₀ arbitrary initial values, and k ≥ 0. This yields the following algorithm:

$$\hat{a}_{k+1} = \hat{a}_k + \eta \sum_{i=1}^{n} \left( x_i - \frac{1}{1 + e^{-y_i^k}} \right), \qquad \hat{b}_{k+1} = \hat{b}_k + \eta \sum_{i=1}^{n} \tilde{y}_i \left( x_i - \frac{1}{1 + e^{-y_i^k}} \right), \tag{3.5}$$

with

$$\tilde{y}_i = \sum_{j \in N_i} x_j, \tag{3.6}$$

and

$$y_i^k = \hat{a}_k + \hat{b}_k \sum_{j \in N_i} x_j. \tag{3.7}$$
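The update rules of equations 3.5 through 3.7 are straightforward to implement. The sketch below is not code from the article: the lattice size, learning rate, iteration count, and periodic boundaries are illustrative assumptions, and as a sanity check it fits data drawn independently with a fixed activation probability, for which PL should recover a near the logit of that probability and b near zero.

```python
import numpy as np

# Pseudo-likelihood gradient ascent for the Ising model (eqs. 3.5-3.7).
# Data are generated i.i.d. with P(x_i = 1) = sigma(a_true), so the PL
# fit should return a_hat ~ a_true and b_hat ~ 0.
rng = np.random.default_rng(1)
L = 40
a_true = -1.0
x = (rng.random((L, L)) < 1.0 / (1.0 + np.exp(-a_true))).astype(float)

# sum over the four nearest neighbours of every site (eq. 3.6),
# with periodic wrap-around
nb = (np.roll(x, 1, 0) + np.roll(x, -1, 0)
      + np.roll(x, 1, 1) + np.roll(x, -1, 1))

a_hat, b_hat, eta = 0.0, 0.0, 1e-3
for _ in range(2000):
    y = a_hat + b_hat * nb                  # eq. 3.7
    resid = x - 1.0 / (1.0 + np.exp(-y))    # x_i - P(x_i = 1 | neighbours)
    a_hat += eta * resid.sum()              # eq. 3.5, singleton potential
    b_hat += eta * (nb * resid).sum()       # eq. 3.5, pair potential

print(a_hat, b_hat)
```

On correlated data the same loop applies unchanged; only the data-generation step (e.g., a Gibbs sampler for the Ising model) would differ.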
The PL approximates the likelihood by assuming that the sites behave independently given the values of the neighboring sites. This approximation may also be viewed as a mean-field one. The PL method is consistent and usually yields good results (Guyon, 1995).

3.2 Diffusion Model. The DM has three parameters: the mean inverse duration of activity δ, the innovation rate λ, and the diffusion rate μ. Unfortunately, the analytical expression of the likelihood is unavailable, and the ML and PL methods are precluded. The emphasis here is on the estimation of λ and μ. It is assumed that the parameter δ has been estimated from the temporal observations; there is no difficulty in doing so. The estimation of λ and μ will rely on the mean-field approximation (see equation 2.14) on one hand and on the identification between MRF and DM on the other hand. A property of MRF is that the ratio of the "birth" rate and the "death" rate at a site must be equal to the ratio of conditional probabilities at this site. For the parameters, this yields
$$\frac{\delta}{\lambda + \delta + \frac{\mu}{m} \sum_{j \in N_i} x_j} \approx \frac{1}{1 + e^{\,a + b \sum_{j \in N_i} x_j}}. \tag{3.8}$$
According to equations 2.14 and 3.8, an expression for λ can be deduced:
$$\lambda \approx \delta u\; \frac{e^{\,a + b \sum_{j \in N_i} x_j} - \dfrac{\sum_{j \in N_i} x_j}{m(1 - u)}}{u - \dfrac{1}{m} \sum_{j \in N_i} x_j}. \tag{3.9}$$
Recall that u is estimated by the empirical mean û. Therefore, the statistic λ̂ is obtained by plugging the values of â, b̂, and û into equation 3.9 and then averaging over all sites. Finally, using the mean-field equation again, μ can be estimated as

$$\hat{\mu} = \frac{(\delta + \hat{\lambda})}{\hat{u}(1 - \hat{u})} \left( \hat{u} - \frac{\hat{\lambda}}{\hat{\lambda} + \delta} \right). \tag{3.10}$$
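A quick closed-loop check of the mean-field algebra (an illustration with exact mean-field values in place of the estimates, using the parameter magnitudes quoted in section 3.2, not a fit to data): choosing (δ, λ, μ), solving equation 2.14 for u, and feeding (δ, λ, u) back through equation 3.10 should return μ.

```python
import numpy as np

# Consistency check: eq. 3.10 is the inversion of the mean-field
# relation eq. 2.14, so solving 2.14 for u and plugging u back in
# must recover mu exactly.
def u_from_params(delta, lam, mu):
    # positive root of mu*u^2 + (delta + lam - mu)*u - lam = 0 (eq. 2.14)
    b = delta + lam - mu
    return (-b + np.sqrt(b * b + 4 * mu * lam)) / (2 * mu)

def mu_hat(delta, lam, u):
    # eq. 3.10, with exact u and lam in place of their estimates
    return (delta + lam) / (u * (1 - u)) * (u - lam / (lam + delta))

delta, lam, mu = 0.2, 0.02, 0.2   # magnitudes from section 3.2
u = u_from_params(delta, lam, mu)
mu_rec = mu_hat(delta, lam, u)
print(mu_rec)   # recovers mu = 0.2
```

In the actual procedure û and λ̂ carry estimation noise, so μ̂ recovers μ only approximately; the exact recovery here verifies the algebra, not the statistics.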
Since u is estimated by û, μ̂ is actually a mean-field statistic. Mean-field statistics do not take into account the topology imposed on the model by the structure of the measurements. The accuracy of the mean-field approximation has been checked numerically for a range of parameters corresponding to plausible values (to be obtained in section 4). The system was numerically
simulated with parameters δ, λ, and μ around 0.2, 0.02, and 0.2. The biases of λ̂ and μ̂ are of the order of magnitude 10⁻³ and 10⁻², respectively. The standard deviations are of the order of magnitude 10⁻². These results are evidence that the model is only weakly dependent on the topology. Models with such a property are useful for accommodating our ignorance of the real topology of the underlying network. Furthermore, the mean-field approximation is justified when the spatial empirical covariance

$$\hat{c} = \frac{1}{nm} \sum_{i \in S,\; j \in N_i} x_i x_j - \hat{u}^2 \tag{3.11}$$
is small (the empirical covariance theoretically ranges between −0.25 and 0.25). Within the range of tested parameters, the covariance remained small (below 0.05), and the estimation bias was acceptable. The bias increases as the covariance increases and becomes significant above 0.05. This happens, for instance, when the parameter λ takes large negative values (see François, Larota, Horikawa, & Hervé, 1999).

3.3 Comparison Between MRF and DM. Although they are of different natures, the MRF and the DM share common features. In the previous section, we gave a static description of the MRF model. The DM, in contrast, has been defined as a dynamical model. In order to compare the models, we also consider a dynamical version of the MRF. In this version, the probability of modifying a site during a small time interval is given by
    \mathrm{Prob}(x(t + dt) = x^i \mid x(t) = x) = \lambda_{MRF}(x, x^i)\,dt + o(dt),    (3.12)

with

    \lambda_{MRF}(x, x^i) = \begin{cases} 1 & \text{if } x_i = 1, \\ \exp\left(\alpha + \beta \sum_{j \in N_i} x_j\right) & \text{if } x_i = 0. \end{cases}    (3.13)

Therefore, the parameters α and β can naturally be viewed as log rates. In this perspective, α is the log rate at which a passive site is activated by an external signal, and β is the log rate at which a site is activated by an active neighbor (given that the other neighbors are passive). In the DM, the rate of transition from 0 to 1 may be seen as a log-linear version of the corresponding rate in the MRF model (this is less clear concerning the transition 1 → 0). Hence, the same kind of relationship exists between the MRF and the DM as between the log-linear formulation of a Poisson process and its "firing-rate" parameterization. However, these models correspond to different statistical structures. Although a log-linear relation between rates exists, it is difficult to identify a relationship between the parameters (α, β) and (λ, μ) themselves. In the DM, a sum of elementary rates is computed. This summation corresponds to the superposition of independent infinitesimal effects. The advantage of
Figure 1: Mean activity value. [Scatter plot of mean activity (0.1–0.4) against time (20–40 ms).]
the DM over the MRF model is that its formulation is more intuitive and hence easier to interpret. The advantage of the MRF over the DM is that its parameters are likely to be better estimated.

4 Data Analysis

4.1 Experiment Description. The animal (guinea pig) preparation and optical recording procedures have been described in detail (Horikawa et al., 1996). Pure tone bursts (duration 50 ms, rise-fall time 10 ms, frequency 7 kHz, intensity 75 dB SPL) were delivered to the ear contralateral to the recording side. The fluorescent signals emitted by the auditory cortex were recorded using a 12 × 12-channel photodiode array at a rate of 0.576 ms per frame. The optical responses were expressed as ΔF/F, where ΔF is the change in intensity due to neural responses and F is the light intensity at rest. The noise due to cortical movements caused by pulsation was cancelled out by synchronizing stimulation and recording with the heartbeat and by subtracting a recording without a stimulus from one with a stimulus. The cortical recording area was 3 mm × 3 mm, covering part of both the anterior (A, corresponding to primary auditory cortex) and dorsocaudal (DC) fields.
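The ΔF/F normalization and heartbeat-synchronized noise subtraction just described amount to a simple per-channel computation. The sketch below illustrates it; all array names, shapes, and the toy values are our illustrative assumptions, not the original acquisition pipeline.

```python
import numpy as np

def delta_f_over_f(stim_trace: np.ndarray, blank_trace: np.ndarray,
                   f_rest: np.ndarray) -> np.ndarray:
    """Express optical responses as dF/F for a photodiode array.

    stim_trace, blank_trace: (time, 12, 12) recordings with and without a
    stimulus, both synchronized to the heartbeat so that pulsation noise
    cancels in the subtraction; f_rest: (12, 12) light intensity at rest.
    Names and shapes are illustrative assumptions."""
    d_f = stim_trace - blank_trace   # pulsation noise cancels here
    return d_f / f_rest              # normalize by the resting intensity

t, h, w = 70, 12, 12
f_rest = np.full((h, w), 100.0)
blank = np.full((t, h, w), 100.0)
stim = blank.copy()
stim[35:45] += 2.0                   # a small response in mid-recording
resp = delta_f_over_f(stim, blank, f_rest)
print(resp.max())
```

With these toy values the maximal response is 2.0 / 100.0, i.e., a 2% ΔF/F.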
Figure 2: Empirical covariance. [Scatter plot of covariance (0.0–0.08) against time (20–40 ms).]
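The empirical covariance shown in Figure 2 is the statistic of equation 3.11. A sketch for a binary 12 × 12 lattice with the 4-nearest-neighbor topology follows; the non-periodic boundary handling and the reading of nm as the number of ordered (site, neighbor) pairs are our assumptions.

```python
import numpy as np

def empirical_covariance(x: np.ndarray) -> float:
    """Spatial empirical covariance (equation 3.11) of a binary lattice x,
    using the 4-nearest-neighbor topology. The normalizer is the number of
    ordered (site, neighbor) pairs; boundary handling is an assumption."""
    u_hat = x.mean()
    # Sum x_i * x_j over neighbor pairs via shifted products.
    horiz = (x[:, :-1] * x[:, 1:]).sum()
    vert = (x[:-1, :] * x[1:, :]).sum()
    pair_sum = 2 * (horiz + vert)    # each undirected pair counted twice
    n_pairs = 2 * (x.shape[0] * (x.shape[1] - 1)
                   + (x.shape[0] - 1) * x.shape[1])
    return pair_sum / n_pairs - u_hat ** 2

# Independent Bernoulli sites give a covariance near 0.
rng = np.random.default_rng(1)
x = (rng.random((12, 12)) < 0.3).astype(float)
print(round(empirical_covariance(x), 4))
```

As a sanity check, an all-active lattice gives exactly 0, and a checkerboard gives the theoretical minimum of −0.25.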
The measuring microscope was focused 200 μm below the surface of the auditory cortex, so the optical responses were those from layers II–III. The responses were smoothed by running averages, and the time of occurrence and the duration of the response at each of 144 cortical locations were measured as the time of the peak of the response and the period of half-peak, respectively. The time of peak is measured when the signal reaches its maximum value above rest + 3 S.D. (the mean and standard deviation are measured during the [0–18 ms] period after stimulus onset). Responses whose peak did not exceed the level of 3 S.D. above the resting potential were discarded. The duration of peak is measured as the half-time of the period during which the response is above rest + 3 S.D. In order to perform the data analysis, the signals were encoded as binary signals in such a way that the signal of a diode is 0 when there is no response and takes the value 1 during the peak duration. The sampling rate of the binary data was changed to 0.5 ms.

4.2 Results. The results presented in this section correspond to a typical response to a single stimulation. The measures resulting from the array of
Figure 3: Estimation of the external field α. [Scatter plot of α (−3.0 to −1.0) against time (20–40 ms).]
diodes are displayed in Table 1. The processed data consist of 40 binary vectors (of size n = 144) corresponding to the activity of the network summarized in Table 1. This activity starts 20 ms after the stimulation and ends 40 ms after the stimulation. Each vector has been treated separately, following the procedures described in section 3.

First, the mean duration of activity was estimated. For the data in Table 1, the value of d is d = 0.148 kHz (0.074 inverse half-milliseconds). The mean activity and the empirical covariance were then computed (see Figures 1 and 2). The mean activity increases from 0 to 40%, reaches a peak after 30 ms, and then decreases to 0. Neighboring sites interact in an excitatory way during activity (the covariance is positive). An oscillation of the interaction is also observed; the period of this oscillation is approximately 10 ms.

The MRF parameters have been computed according to the iterative procedure (see equation 3.5). Results are displayed in Figures 3 and 4. Nonnegative pair potentials confirm the excitatory aspect of the interaction. In Figure 5, the plot of (α, β) is concentrated near the phase transition line of the Ising model (see Preston, 1976), a phenomenon that has been noticed by Makarenko et al. (1997) for data measured from the olivocerebellar cortex. Clearly, a linear relationship between α and β exists, meaning that the fraction of activity due to the external signal is lowered when the excitatory interaction level increases. The information obtained by fitting the MRF is, however, close to that obtained from the standard statistics (see Figure 2).

Figure 4: Estimation of the pair potential β. [Scatter plot of β (0.2–1.2) against time (20–40 ms).]

Mean-field statistics have been computed for λ and μ according to equations 3.9 and 3.10. The results are displayed in Figures 6, 7, and 8. Overall, the results obtained by the MRF analysis are confirmed. The diffusion rate μ̂ is nonnegative; this means that the interaction is excitatory. Negative values of λ̂ have also been obtained (an arrival associated with a negative rate can be viewed as an inhibitory event). Again, a linear dependence between the arrival and diffusion rates is observed. The parameters λ̂ and μ̂ oscillate in opposite phases (but the mean activity does not oscillate). The period of the oscillation is between 8 and 10 ms. The diffusion rate μ̂ ranges between 0.0 and 0.4 kHz. A value equal to 0.2 kHz means that the frequency of excitatory events due to the activity of neighbors is one every 5 ms. When μ̂ increases, the external input is inhibited, and negative values of λ̂ are computed. The analysis supports the presence of oscillations in both the internal and external signals, while the MRF parameters
Figure 5: Plot of α, β, and the phase transition line α + 2β = 0. [Scatter plot of β (0.2–1.2) against α (−3.0 to −1.0).]
stay closer to the standard statistics (mean and covariance). In the MRF, the information contained in the data is summarized in two parameters, α and β, which reflect first- and second-order interactions. Interactions of higher order are actually neglected. Such interactions may nevertheless exist and may alter the significance of α and β. On the other hand, the mean-field approximation is based on a first-order statistic that attenuates the importance of the model geometry. Because all interactions are recorded in the mean activity, the method seems more reliable when the true order of interaction is unknown. The crucial point is that the diffusion parameter (which should be of second order) is well estimated by a first-order statistic.

4.3 Discussion. In layers II–III of the guinea pig auditory cortex, excitatory responses are mediated by both N-methyl-D-aspartate (NMDA) and non-NMDA receptors. Non-NMDA-receptor-mediated responses predominate, while NMDA-receptor-mediated responses are very small. Under normal conditions, the excitatory responses last 40 to 50 ms, starting from 17 to 20 ms after stimulus onset. NMDA-receptor-mediated responses start from 27 to 30 ms, that is, 10 ms after the beginning of the non-NMDA-receptor-mediated
Table 1: Experimental Data.

54 0 55 81 46 77 56 38 35 33 56 59
43 46 52 39 44 44 43 46 36 36 41 51
43 43 51 52 38 62 40 50 37 41 55 59
65 54 45 45 50 37 53 40 36 53 41 40
50 71 61 54 50 41 40 39 38 39 39 43
67 56 59 51 55 47 40 41 36 47 41 56
56 68 61 62 45 56 41 45 39 81 39 47
63 58 54 54 66 47 39 41 40 43 43 41
65 63 62 58 61 46 43 40 48 38 48 45
85 55 58 66 67 55 47 45 41 39 46 48
62 38 60 56 47 54 61 44 40 43 40 46
28 0 63 69 89 56 43 43 46 46 53 65
2 0 5 2 7 7 8 12 14 25 6 3
6 5 17 10 14 14 15 8 9 6 8 6
14 10 24 17 20 5 21 5 15 12 3 3
6 17 39 26 24 33 9 21 22 10 17 6
9 9 10 9 5 28 36 24 18 13 5 12
9 6 15 12 16 15 33 17 14 10 10 10
5 9 18 21 28 17 31 20 14 8 20 8
14 15 25 25 20 29 15 25 15 13 32 18
7 6 22 12 9 16 17 24 16 29 3 7
3 8 8 15 3 21 13 18 7 22 7 3
6 6 8 5 23 14 6 14 8 17 18 21
3 0 13 14 5 7 24 13 10 18 5 3
Notes: The first array gives the instant at which the activity starts, and the second array gives the duration of the activity for each of the 144 sites. The topology is supposed to be a lattice consisting of 144 sites arranged on a 12 × 12 grid. In these two arrays, the unit of time is 0.5 ms (resampled from the original 0.576 ms sampling).
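The binary spatiotemporal vectors analyzed in section 4.2 are built from the two arrays of Table 1 by switching each site on from its onset time for its measured duration. A sketch of this conversion; the array shapes, the 80-step horizon (40 ms at 0.5 ms per step), and the toy values are illustrative assumptions.

```python
import numpy as np

def binary_activity(onset, duration, t_steps=80):
    """Build binary spatiotemporal data as described in the Table 1 notes:
    site (i, j) is active (1) from its onset time for `duration` time units
    (units of 0.5 ms), and 0 otherwise. Shapes and the 80-step horizon are
    illustrative assumptions."""
    onset = np.asarray(onset)
    duration = np.asarray(duration)
    t = np.arange(t_steps)[:, None, None]   # time axis, broadcast over sites
    return ((t >= onset) & (t < onset + duration)).astype(int)

# Toy 2x2 example: site (0, 0) starts at step 54 and stays on for 2 steps;
# a (0, 0) entry pair encodes a discarded (never active) site.
x = binary_activity([[54, 43], [0, 46]], [[2, 6], [0, 5]])
print(x[54, 0, 0], x[56, 0, 0])
```

Each time slice `x[t]` is then one of the 40 binary vectors (flattened to size 144 for the full 12 × 12 grid).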
responses (Horikawa et al., 1996). Some cortical GABAergic inhibition originating from layers II–III occurs, inhibiting the long-lasting NMDA-receptor-mediated responses and shortening the excitatory responses. In layers II–III, 20 to 25% of neurons are inhibitory. The GABAergic inhibition begins about 10 ms after the non-NMDA-receptor-mediated responses.

The innovation process is composed of several inputs. Some specific inputs to layer III come from the stimulus propagation (main auditory inputs). They arise from layer IV, where the dorsal and ventral medial geniculate body (MGB) terminate, or directly from the MGB. Other inputs to layer II come from layer I, which receives them from the medial division of the MGB. These inputs are more conditioning related and probably polymodal. Another input is due to cortico-cortical connections. Such input comes from auditory areas of both hemispheres. Innovation is characterized by the parameters α (MRF) and λ (DM). The innovation process starts at t = 20 ms (see Figures 3 and 6), as the area becomes active (see Figure 1). This
Figure 6: Estimation of λ (kHz). [Scatter plot of λ (−0.04 to 0.06) against time (20–40 ms).]
is consistent with usual recordings. A site is considered active when the response is above a threshold. Since λ is alternately positive and negative, it can be concluded that the innovation process consists of excitatory and inhibitory inputs. Excitatory inputs are directly induced by the stimulus propagation and are weakened by adaptation in the auditory nerve. This phenomenon explains why excitation starts decreasing at t = 23 ms. Subcortical inhibition originating from the MGB is also known to occur. The oscillatory behavior of λ may be related to a superposition of excitatory and inhibitory processes. The second negative peak of λ can be attributed to subcortical inhibition. The 26 ms latency of the first negative peak seems correlated with the occurrence of GABAergic inhibition, though this inhibition belongs to the internal process of layers II–III. Although the curves of α and λ are different, these parameters exhibit some analogous qualitative behavior: they increase and decrease at the same latencies. The estimation of the innovation process in the MRF approach provides a global trend (oscillation) of this process that is consistent with the estimation obtained in the DM approach.

The internal process corresponds to interactions within layers II–III: the MRF and DM approaches show that the internal process is excitatory, although
Figure 7: Estimation of μ (kHz). [Scatter plot of the diffusion rate μ (0.0–0.5) against time (20–40 ms).]
it may be modulated by inhibitory events. This result is consistent with a recent study (Kubota, Sugimoto, Horikawa, Nasu, & Taniguchi, 1997) that outlines a propagation of non-NMDA-receptor-mediated excitation within layers II–III of a rat slice preparation. Again, β and μ have a similar oscillating behavior (see Figures 4 and 7), with a first peak at t = 26 ms. Such a latency is in accordance with the dynamics of the non-NMDA-receptor-mediated excitations, though this layers II–III excitation process is not yet well known under in vivo conditions. The decrease of β and μ up to t = 30 ms is a consequence of the internal GABAergic inhibition that starts at about t = 27 ms. The diffusion increases several milliseconds after the arrival of inputs. This first peak of diffusion can be interpreted as a consequence of the stimulus, which arrives with a large variance and propagates into the studied area. Intracortical interactions may be responsible for the second peak of diffusion at t = 34 ms. These interactions are delayed because of the intrinsic circuitry within each layer. The shape of the diffusion rate μ is closer than β to the covariance (a surprising fact) and is likely to reflect a better picture of the internal process.

In Figure 8, the relation between λ and μ appears to be linear except for the points that correspond to the beginning and the end of the signal. Such
Figure 8: Plot of (λ, μ). [Scatter plot of μ (0.0–0.5) against λ (−0.04 to 0.06).]
a relation suggests that a competition exists between innovation and the internal process. This may explain how the occurrence of internal GABAergic inhibition (at t = 27 ms) influences the innovation process and reduces its effectiveness on layers II–III. It suggests that when innovation becomes inhibitory, the sites engage in the internal diffusion task, and that when innovation becomes excitatory, the sites are less involved in the diffusion task. Conversely, the internal excitatory activity may also mask the innovation process to layers II–III. Our hypothesis is that the innovation process is composed of a single long inhibition and a single excitation that become effective when the internal process is weak. Such ideas need empirical evidence. An extension of this study will be to obtain a confirmation of timings and an identification of excitatory and inhibitory events in both the innovation and the internal process of layers II–III. The question of isofrequency band coding in the auditory cortex has not been addressed because spatial differences have not been taken into account. In a further study, it would be useful to distinguish between the activity of the primary auditory cortex and its dorsocaudal field, which exhibit different response latencies and have reciprocal connections.
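The parameters λ̂ and μ̂ discussed throughout this section come from the mean-field estimation of section 3.2. A one-line sketch of the final step, assuming the mean-field stationarity relation d·u = (λ + μ·u)(1 − u), which is our reading of equations 3.9 and 3.10; the function name and toy inputs are illustrative.

```python
def estimate_mu(u_hat: float, lam_hat: float, d: float) -> float:
    """Mean-field estimate of the diffusion rate mu from the mean activity
    u_hat, the estimated arrival rate lam_hat, and the death rate d,
    assuming the stationarity relation d*u = (lam + mu*u)*(1 - u)."""
    return (u_hat * (d + lam_hat) - lam_hat) / (u_hat * (1.0 - u_hat))

# Toy values in the range reported in section 3.2 (d ~ 0.2, lam ~ 0.02, u ~ 0.2)
mu_hat = estimate_mu(u_hat=0.2, lam_hat=0.02, d=0.2)
print(round(mu_hat, 3))
```

With these toy values the estimate falls in the 0.0–0.4 kHz range reported for μ̂ in Figure 7.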
References

Besag, J. E. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. B, 36, 192–236.
Cohen, L. B., Hopp, H. P., Wu, J. Y., Xiao, C., London, J., & Zecevic, D. (1989). Optical measurement of neuron activity in invertebrate ganglia. Ann. Rev. Physiol., 51, 527–541.
François, O., Demongeot, J., & Hervé, T. (1992). Convergence of a self-organizing stochastic neural network. Neural Networks, 5, 277–282.
François, O., Larota, C., Horikawa, J., & Hervé, T. (1999). Diffusion and innovation rates for multidimensional neuronal data with large spatial covariance. Preprint.
Griffeath, D. (1979). Additive and cancellative interacting particle systems. New York: Springer-Verlag.
Grinvald, A., Frostig, R. D., Lieke, E., & Hildesheim, R. (1988). Optical imaging of neuronal activity. Physiological Reviews, 68, 1285–1366.
Guyon, X. (1995). Random fields on a network: Modelling, statistics, and applications. New York: Springer-Verlag.
Hebb, D. (1949). The organization of behaviour. New York: Wiley.
Hervé, T., Dolmazon, J. M., & Demongeot, J. (1990). Random field and neural information. Proc. Natl. Acad. Sci. USA, 87, 806–810.
Horikawa, J., Hosokawa, Y., Kubota, M., Nasu, M., & Taniguchi, I. (1996). Optical imaging of spatiotemporal patterns of glutamatergic excitation and GABAergic inhibition in the guinea-pig auditory cortex. J. Physiol., 497, 629–638.
Kubota, M., Sugimoto, S., Horikawa, J., Nasu, M., & Taniguchi, I. (1997). Optical imaging of dynamic horizontal spread of excitation in rat auditory cortex slices. Neurosci. Lett., 237, 77–80.
Liggett, T. M. (1985). Interacting particle systems. New York: Springer-Verlag.
Makarenko, V. I., Welsh, J. P., Lang, E. J., & Llinas, R. (1997). A new approach to the analysis of multidimensional neuronal activity: Markov random fields. Neural Networks, 10, 785–789.
Martignon, L., Von Hasseln, H., Gruen, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events in a group of neurons. Biol. Cybern., 73, 69–81.
Orbach, H. S., Cohen, L. B., & Grinvald, A. (1985). Optical monitoring of electrical activity in the mammalian sensory cortex. Journal of Neuroscience, 5, 1886–1895.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Preston, C. J. (1976). Random fields. New York: Springer-Verlag.
Ripley, B. D. (1988). Statistical inference for spatial processes. Cambridge: Cambridge University Press.

Received October 22, 1997; accepted October 5, 1999.
LETTER
Communicated by Steven Nowlan
Probabilistic Motion Estimation Based on Temporal Coherence

Pierre-Yves Burgi
Centre Suisse d'Electronique et Microtechnique, 2007 Neuchâtel, Switzerland

Alan L. Yuille
Norberto M. Grzywacz
Smith-Kettlewell Eye Research Institute, San Francisco, CA 94115, U.S.A.

We develop a theory for the temporal integration of visual motion motivated by psychophysical experiments. The theory proposes that input data are temporally grouped and used to predict and estimate the motion flows in the image sequence. This temporal grouping can be considered a generalization of the data association techniques that engineers use to study motion sequences. Our temporal grouping theory is expressed in terms of the Bayesian generalization of standard Kalman filtering. To implement the theory, we derive a parallel network that shares some properties of cortical networks. Computer simulations of this network demonstrate that our theory qualitatively accounts for psychophysical experiments on motion occlusion and motion outliers. In deriving our theory, we assumed spatial factorizability of the probability distributions and made the approximation of updating the marginal distributions of velocity at each point. This allowed us to perform local computations and simplified our implementation. We argue that these approximations are suitable for the stimuli we are considering (for which spatial coherence effects are negligible).

1 Introduction
Local motion signals are often ambiguous, and many important motion phenomena can be explained by hypothesizing that the human visual system uses temporal coherence to resolve ambiguous inputs. (Temporal coherence is the assumption that motion in natural images is mostly temporally smooth; that is, it rarely changes direction or speed abruptly.) For example, the perceptual tendency to disambiguate the trajectory of an ambiguous motion path by using the time average of its past motion was first reported by Anstis and Ramachandran (Ramachandran & Anstis, 1983; Anstis & Ramachandran, 1987). They called this phenomenon motion inertia. Other motion phenomena involving temporal coherence include the improvement of velocity estimation over time (McKee, Silverman, & Nakayama, 1986; Ascher, 1998), blur removal (Burr, Ross, & Morrone, 1986), motion-outlier

Neural Computation 12, 1839–1867 (2000) © 2000 Massachusetts Institute of Technology
1840
P.-Y. Burgi, A. L. Yuille, and N. M. Grzywacz
detection (Watamaniuk, McKee, & Grzywacz, 1994), and motion occlusion (Watamaniuk & McKee, 1995). Another example of the integration of motion signals over time comes from a recent experiment by Nishida and Johnston (1999) on motion aftereffects, where an orientation shift in the direction of the illusory rotation appears to involve dynamic updating by motion neurons. These motion phenomena pose a serious challenge to motion perception models. For example, in motion-outlier detection, the target dot is indistinguishable from the background noise if one observes only a few frames of the motion. Therefore, the only way to detect the target dot is by means of its extended motion over time. Consequently, explanations based solely on single, large local-motion detectors as in the motion energy and elaborated Reichardt models (Hassenstein & Reichardt, 1956; van Santen & Sperling, 1984; Adelson & Bergen, 1985; Watson & Ahumada, 1985; for a review of motion models, see Nakayama, 1985) seem inadequate to explain these phenomena (Verghese, Watamaniuk, McKee, & Grzywacz, 1999). Also, it has been argued (Yuille & Grzywacz, 1998) that all these experiments could be interpreted in terms of temporal grouping of motion signals involving prediction, observation, and estimation. The detection of targets and their temporal grouping could be achieved by verifying that the observations were consistent with the motion predictions. Conversely, failure in these predictions would indicate that the observations were due to noise, or distractors, which could be ignored. The ability of a system to predict the target’s motions thus represents a powerful mechanism to extract targets from background distractors and might be a powerful cue for the interpretation of visual scenes. There has been little theoretical work on temporal coherence. 
Grzywacz, Smith, and Yuille (1989) developed a theoretical model that could account for motion inertia phenomena by requiring that the direction of motion vary slowly over time while speed could vary considerably faster. More recently, Grzywacz, Watamaniuk, and McKee (1995) proposed a biologically plausible model that extends the work on motion inertia and allows for the detection of motion outliers. It was observed (Grzywacz et al., 1989) that for general three-dimensional motion, the direction, but not the speed, of the image motion is more likely to be constant and hence might be more reliable for motion prediction. This is consistent with the psychophysical experiments on motion inertia (Anstis & Ramachandran, 1987) and temporal smoothing of motion signals (Gottsdanker, 1956; Snowden & Braddick, 1989; Werkhoven, Snippe, & Toet, 1992), which demonstrated that the direction of motion is the primary cue in temporal motion coherence. Other reasons that might make motion direction more reliable than speed are the unreliability of local speed measurement due to the poor temporal resolution of local motion units (Kulikowski & Tolhurst, 1973; Holub & Morton-Gibson, 1981) and involuntary eye movements.

This article presents a new theory for visual motion estimation and prediction that exploits temporal information. This theory builds on our previous work (Grzywacz et al., 1989, 1995; Yuille & Grzywacz, 1998) and on recent work on the tracking of object boundaries (Isard & Blake, 1996). In addition, this article gives a concrete example of the abstract arguments on temporal grouping proposed by Yuille and Grzywacz (1998). To this end, we use a Bayesian formulation of estimation over time (Ho & Lee, 1964), which allows us simultaneously to make predictions over time, update those predictions using new data, and reject data that are inconsistent with the predictions. (We will be updating the marginal distributions of velocity at each image point rather than the full probability distribution of the entire velocity field, which would require spatial coupling.) This rejection of data allows the theory to implement basic temporal grouping (for speculations about more complex temporal grouping, see Yuille & Grzywacz, 1998). This basic form of temporal grouping is related to data association as studied by engineers (Bar-Shalom & Fortmann, 1988) but differs by being probabilistic (following Ho & Lee, 1964). Finally, we show that our temporal grouping theory can be implemented using locally connected networks, which are suggestive of cortical structures.

In deriving our theory, we made an assumption of spatial factorizability of the probability distributions and made the approximation of updating the marginal distributions of velocity at each point. This allowed us to perform local computations and simplified our implementation. We argued that these assumptions and approximations (as embodied in our choice of prior probability) are suitable for the stimuli we are considering but would need to be revised to include spatial coherence effects (Yuille & Grzywacz, 1988, 1989). The prior distribution used in this article would need to be modified to incorporate these effects. (For speculations about how this might be achieved, see Yuille & Grzywacz, 1998.)
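As a concrete illustration of this per-point marginal updating, the sketch below keeps, at each image location, a marginal distribution over a small discretized set of velocities and updates it locally with a temporally coherent ("sticky") transition prior. The discretization, the transition matrix, and all array shapes are our illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def update_pixel(marginal, transition, response):
    """Per-pixel marginal update under the spatial-factorization assumption:
    predict with the temporal-coherence prior, then reweight by the local
    motion-unit responses and renormalize."""
    predicted = transition @ marginal    # prediction step
    posterior = response * predicted     # measurement step (unnormalized)
    return posterior / posterior.sum()   # normalize

n_vel = 5                                              # discretized velocities
marginals = np.full((4, 4, n_vel), 1.0 / n_vel)        # uniform initial field
transition = np.full((n_vel, n_vel), 0.05) + 0.75 * np.eye(n_vel)  # sticky prior
for _ in range(3):                                     # three measurement frames
    responses = np.zeros((4, 4, n_vel))
    responses[..., 2] = 1.0                            # units all vote velocity 2
    responses += 0.05                                  # small noise floor
    for i in range(4):
        for j in range(4):
            marginals[i, j] = update_pixel(marginals[i, j], transition,
                                           responses[i, j])
print(marginals[0, 0].argmax())
```

Because each location is updated independently, the computation is local, which is what makes the network implementation described later plausible.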
There is no difficulty in generalizing the Bayesian theory to deal with complex hierarchical motion, but it is unclear whether the specific implementation in this article can be generalized. The article is organized as follows. The theoretical framework for motion prediction and estimation is introduced in section 2. Section 3 gives a description of the specific model chosen to implement the theory in the continuous time domain. Implementation of the model using a locally connected network is addressed in sections 4 and 5. The model's predictions are tested in section 6 by performing computer simulations on the motion phenomena described previously. Finally, we discuss relevant issues, such as the possible relationship of the model to biology, in section 7.

2 The Theoretical Framework

2.1 Background. The theory of stochastic processes gives us a mathematical framework for modeling motion signals that vary over time. These processes can be used to predict the probabilities of future states of the system (e.g., the future velocities) and are called Markov if the probability of future states depends only on their present state (i.e., not on the time history
of how they arrived at this state). In computer vision, Markov processes have been used to model temporal coherence for applications such as the optimal fusing of data from multiple frames of measurements (Matthies, Kanade, & Szeliski, 1989; Clark & Yuille, 1990; Chin, Karl, & Willsky, 1994). Such optimal fusing is typically defined in terms of least-squares estimates, which reduces to Kalman filtering theory (Kalman, 1960). Kalman filters have been applied to a range of problems in computer vision (see Blake & Yuille, 1992, for several examples) and to neuronal modeling by Rao and Ballard (1997). Because Kalman filters are (recursive) linear estimators that apply only to gaussian densities, their applicability in complex scenes involving several moving objects is questionable. One solution, which we briefly discuss, is the use of data association techniques (Bar-Shalom & Fortmann, 1988). In this article, however, we follow Isard and Blake (1996) and propose a Bayesian formulation of temporal coherence, which is a generalization of standard Kalman filters and can deal with targets moving in complex backgrounds.

We begin by deriving a generalization of Kalman filters that is suitable for describing the temporal coherence of simple motion. (Our development follows the work of Ho & Lee, 1964.) Consider the state vector x_k describing the status of a system at time t_k (in our theory, x_k denotes the velocity field at every point in the image). Our knowledge about how the system might evolve to time k + 1 is described by a probability distribution function p(x_{k+1} | x_k), known as the "prior" (in our theory, this prior will encode the temporal coherence assumption). The measurement process that relates the observation z_k of the state vector to its "true" state is described by the likelihood distribution function p(z_k | x_k) (in our theory, z_k will be the responses of basic motion units at every place in the image).
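The combination of these two distribution functions into a recursive estimator (equations 2.1 and 2.2, derived next) can be sketched on a discrete state space. The two-state system and all numbers below are illustrative assumptions.

```python
import numpy as np

def bayes_filter_step(belief, transition, likelihood):
    """One cycle of recursive Bayesian estimation on a discrete state space:
    `belief` is p(x_k | Z_k), `transition[i, j]` is p(x_{k+1}=i | x_k=j),
    and `likelihood[i]` is p(z_{k+1} | x_{k+1}=i). The discrete-grid
    representation is our illustration, not the paper's implementation."""
    predicted = transition @ belief        # prediction (marginalize over x_k)
    posterior = likelihood * predicted     # measurement update (unnormalized)
    return posterior / posterior.sum()     # normalize by p(z_{k+1} | Z_k)

belief = np.array([0.5, 0.5])                       # uniform initial belief
transition = np.array([[0.9, 0.2], [0.1, 0.8]])     # columns sum to 1
likelihood = np.array([0.7, 0.1])                   # observation favors state 0
belief = bayes_filter_step(belief, transition, likelihood)
print(belief.round(3))
```

When both distributions are gaussian, this recursion collapses to the standard Kalman update; the point of the general form is that the likelihood need not be gaussian, which is how robustness to outliers is obtained.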
From these two distribution functions, it is straightforward to derive the a posteriori distribution function p(x_{k+1} | Z_{k+1}), which is the distribution of x_{k+1} given the whole past set of measurements Z_{k+1} = (z_0, ..., z_{k+1}) (note that although the prediction of the future depends only on the current state, the estimate of the current state does depend on the entire past history of the measurements). Using Bayes' rule and a few algebraic manipulations (Ho & Lee, 1964), we get

    p(x_{k+1} \mid Z_{k+1}) = \frac{p(z_{k+1} \mid x_{k+1})\, p(x_{k+1} \mid Z_k)}{p(z_{k+1} \mid Z_k)},    (2.1)

where p(x_{k+1} | Z_k) is the prediction for the future state x_{k+1} given the current set of past measurements Z_k, and p(z_{k+1} | Z_k) is a normalizing term denoting the confidence in the measurement z_{k+1} given the set of past measurements Z_k. The predictive distribution function can be expressed as

    p(x_{k+1} \mid Z_k) = \int p(x_{k+1} \mid x_k)\, p(x_k \mid Z_k)\, dx_k.    (2.2)

Equations 2.1 and 2.2 are generalizations of the Wiener-Kalman solution for linear estimation in the presence of noise. Their evaluation involves two
stages. In the prediction stage, given by equation 2.2, the probability distribution p(x_{k+1} | Z_k) for the state at time k + 1 is determined. This function, which involves the present state and the Markov transition describing the dynamics of the system, has a variance larger than that of p(x_k | Z_k). This increase in the variance results from the nondeterministic nature of the motion. In the measurement stage, performed by equation 2.1, the new measurements z_{k+1} are combined with the prediction using Bayes' theorem and, if consistent, reinforce the prediction and decrease the uncertainty about the new state. (Inconsistent measurements may increase the uncertainty still further.) It was shown (Ho & Lee, 1964) that if the measurement and prior probabilities are both gaussian, then equations 2.1 and 2.2 reduce to the standard Kalman equations, which update the mean and the covariance of p(x_{k+1} | Z_{k+1}) and p(x_{k+1} | Z_k) over time. Gaussian distributions, however, are nonrobust (Huber, 1981), and an incorrect (outlier) measurement can seriously distort the estimate of the true state. Standard linear Kalman filter models are therefore not able to account for the psychophysical data demonstrating, for example, that human observers are able to track targets despite the presence of inconsistent (outlier) measurements (Watamaniuk et al., 1994; Watamaniuk & McKee, 1995). Various techniques, known collectively as data association (Bar-Shalom & Fortmann, 1988), can be applied to reduce these distortions by using an additional stage that decides whether to reject or accept the new measurements. From the Bayesian perspective, this extra stage is unnecessary; robustness can be ensured by correct choices of the measurement and prior probability models. More specifically, we specify a measurement model that is robust against outliers.

2.2 Prediction and Estimation Equations for a Flow Field.
We intend to estimate the motion flow field, defined over the two-dimensional image array, at each time step. We describe the flow field as $\{\vec{v}(\vec{x},t)\}$, where the braces $\{\cdot\}$ denote variation over each possible spatial position $\vec{x}$ in the image array. The prior model for the motion field and the likelihood for the velocity measurements are described by the probability distribution functions $P(\{\vec{v}(\vec{x},t)\} \mid \{\vec{v}(\vec{x},t-\delta)\})$ and $P(\{\vec{w}(\vec{x},t)\} \mid \{\vec{v}(\vec{x},t)\})$ (see section 3 for details), where $\delta$ is the time increment and $\{\vec{w}(\vec{x},t)\} = \{w_1(\vec{x},t), w_2(\vec{x},t), \ldots, w_M(\vec{x},t)\}$ represents the responses of the local motion measurement units over the image array. Our theory requires that the local velocity measurements are normalized so that $\sum_{i=1}^{M} w_i(\vec{x},t) = 1$ for all $\vec{x}$ and at all times $t$ (see section 3 for details). We let $W(t) = (\{\vec{w}(\vec{x},t)\}, \{\vec{w}(\vec{x},t-\delta)\}, \ldots)$ be the set of all measurements up to and including time $t$, and $P(\{\vec{v}(\vec{x},t-\delta)\} \mid W(t-\delta))$ be the system's estimated probability distribution of the velocity field at time $t-\delta$. Using equations 2.1 and 2.2 described above, we get
$$
P(\{\vec{v}(\vec{x},t)\} \mid W(t)) = \frac{P(\{\vec{w}(\vec{x},t)\} \mid \{\vec{v}(\vec{x},t)\})\, P(\{\vec{v}(\vec{x},t)\} \mid W(t-\delta))}{P(\{\vec{w}(\vec{x},t)\} \mid W(t-\delta))}, \tag{2.3}
$$
for the estimation, and
$$
P(\{\vec{v}(\vec{x},t)\} \mid W(t-\delta)) = \int P(\{\vec{v}(\vec{x},t)\} \mid \{\vec{v}(\vec{x},t-\delta)\})\, P(\{\vec{v}(\vec{x},t-\delta)\} \mid W(t-\delta))\, d[\{\vec{v}(\vec{x},t-\delta)\}], \tag{2.4}
$$
for the prediction.

2.3 The Marginalization Approximation. For simplicity and computational convenience, we assume that the prior and the measurement distributions are chosen to be factorizable in spatial position, so that the probabilities at one spatial point are independent of those at another. This restriction prevents us from including spatial coherence effects, which are known to be important aspects of motion perception (see, for example, Yuille & Grzywacz, 1988). For the types of stimuli we are considering, these spatial effects are of little importance and can be ignored. In mathematical terms, this spatial-factorization assumption means that we can factorize the prior $P_p$ and the likelihood $P_l$ so that
$$
P_p(\{\vec{v}(\vec{x},t)\} \mid \{\vec{v}(\vec{x}\,',t-\delta)\}) = \prod_{\vec{x}} p_p(\vec{v}(\vec{x},t) \mid \{\vec{v}(\vec{x}\,',t-\delta)\})
$$
$$
P_l(\{\vec{w}(\vec{x},t)\} \mid \{\vec{v}(\vec{x},t)\}) = \prod_{\vec{x}} p_l(\vec{w}(\vec{x},t) \mid \vec{v}(\vec{x},t)). \tag{2.5}
$$
A further restriction is to modify equation 2.4 so that it updates the marginal distributions at each point independently. This again reduces spatial coherence because it decouples the estimates of velocity at each point in space. Once again, we argue that this approximation is valid for the class of stimuli we are considering. This approximation will decrease computation time by allowing us to update the predictions of the velocity distributions at each point independently. The assumption that the measurement model is spatially factorizable means that the estimation stage, equation 2.3, can also be performed independently. This gives update rules for the prediction $P_{pred}$:
$$
P_{pred}(\vec{v}(\vec{x},t) \mid W(t-\delta)) = \int [d\vec{v}(\vec{x}\,',t-\delta)]\, P_p(\vec{v}(\vec{x},t) \mid \{\vec{v}(\vec{x}\,',t-\delta)\})\, P_e(\{\vec{v}(\vec{x}\,',t-\delta)\} \mid W(t-\delta)), \tag{2.6}
$$
and for the estimation $P_e$:
$$
P_e(\vec{v}(\vec{x},t) \mid W(t)) = \frac{P_l(\vec{w}(\vec{x},t) \mid \vec{v}(\vec{x},t))\, P_{pred}(\vec{v}(\vec{x},t) \mid W(t-\delta))}{P(\vec{w}(\vec{x},t) \mid W(t-\delta))}. \tag{2.7}
$$
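For a single spatial position with $M$ discrete velocity channels, the two pointwise updates can be sketched as follows. The transition matrix `T` and the likelihood vector here are illustrative placeholders, not the article's specific measurement and prior models (those are given in section 3):

```python
import numpy as np

def predict(p_est, T):
    """Prediction step (cf. equation 2.6): propagate the velocity
    distribution at one point through a Markov transition matrix T,
    where T[j, i] approximates P(v_j at t | v_i at t - delta)."""
    p_pred = T @ p_est
    return p_pred / p_pred.sum()  # guard against numerical drift

def estimate(p_pred, likelihood):
    """Estimation step (cf. equation 2.7): multiply the prediction by
    the measurement likelihood and renormalize. The normalizer is the
    confidence factor P(w | W(t - delta))."""
    joint = likelihood * p_pred
    confidence = joint.sum()
    return joint / confidence, confidence

# Toy example with M = 3 velocity channels.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
p = np.array([1.0, 0.0, 0.0])        # currently certain of channel 0
p_pred = predict(p, T)               # prediction spreads the mass
posterior, conf = estimate(p_pred, np.array([0.5, 0.3, 0.2]))
```

Note that an outlier-robust measurement model (section 3) enters only through the likelihood vector; the recursion itself is unchanged.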
In what follows we will consider a specific likelihood and prior function that allow us to simplify these equations and define probability distributions
on the velocities at each point. This will lead to a formulation for motion prediction and estimation in which the computation at each spatial position can be performed in parallel.

3 The Model
We now describe a specific model for the likelihood and prior functions that, as we will show, can account for many psychophysical experiments involving temporal motion coherence. Based on this specific prior function, we then derive a formulation for motion prediction in the continuous time domain.

3.1 Likelihood Function. The likelihood function gives a probabilistic interpretation of the measurements given a specific motion. It is probabilistic because the measurement is always corrupted by noise at the input stage. (In many vision theories, the likelihood function depends on the image formation process and involves physical constraints, such as the geometry and surface reflectance. Because we are considering psychophysical stimuli, consisting of moving dots, we can ignore such complications.) To determine our likelihood function, we must first specify the input stage. Ideally, this would involve modeling the behavior of cortical cells sensing natural image motion, but this would be too complex to be practical. Instead, we use a simplified model of a bank of receptive fields tuned to various velocities $\vec{v}_i(\vec{x},t)$, $i = 1, \ldots, M$, and positioned at each image pixel. These cells have observation activities $\{w_i(\vec{x},t)\}$ that are intended to represent the output of a neuronally plausible motion model such as Grzywacz and Yuille (1990). (In our implementation, these simple model cells receive contributions from the motion of dots falling within a local neighborhood, typically the four nearest neighbors. Intuitively, the closer the dot is to the center of the receptive field and the closer its velocity is to the preferred velocity of the cell, the larger the response. The spatial profile and velocity tuning curve of these receptive fields are described by gaussian functions whose covariance matrices $\Sigma_{m:x}$ and $\Sigma_{m:v}$ depend on the direction of $\vec{v}(\vec{x},t)$ and are specified in terms of their longitudinal and transverse components $\sigma^2_{m:x,L}$, $\sigma^2_{m:x,T}$ and $\sigma^2_{m:v,L}$, $\sigma^2_{m:v,T}$.
For more details, see the appendix.) The likelihood function specifies the probability of the receptive field responses conditioned on a "true" external motion field. Ideally this should correspond exactly to the way the motion measurements are generated. We make a simplification, however, by assuming that the measurements depend on only the velocity field at that specific position. This simplifies the mathematics at the risk of making the system slightly sensitive to motion blur (i.e., the measurement stage allows for several dots to influence the local measurement, provided the dots lie within the receptive field, but the likelihood function assumes that only a single dot causes the measurement).
We can therefore write $P_l(\{\vec{w}(\vec{x},t)\} \mid \{\vec{v}(\vec{x},t)\}) = \prod_{\vec{x}} P_l(\vec{w}(\vec{x},t) \mid \vec{v}(\vec{x},t))$, where $\{\vec{v}(\vec{x},t)\}$ is the external velocity field. We now specify
$$
P_l(\vec{w}(\vec{x},t) \mid \vec{v}(\vec{x},t)) = \frac{Y_l(w_1(\vec{x},t), w_2(\vec{x},t), \ldots, w_M(\vec{x},t), \vec{v}(\vec{x},t))}{\int \cdots \int P_J(\xi_1, \xi_2, \ldots, \xi_M, \vec{v}(\vec{x},t))\, d\xi_1\, d\xi_2 \cdots d\xi_M}, \tag{3.1}
$$
where $P_J$ is the joint probability distribution and we set $Y_l$ to be
$$
Y_l(w_1(\vec{x},t), w_2(\vec{x},t), \ldots, w_M(\vec{x},t), \vec{v}(\vec{x},t)) = \vec{w}(\vec{x},t) \cdot \vec{f}(\vec{v}(\vec{x},t)), \tag{3.2}
$$
which is attractive (see Yuille, Burgi, & Grzywacz, 1998) because it leads to a simple linear update rule, and with tuning curves $f$ given by
$$
f_i(\vec{v}(\vec{x},t)) = \frac{e^{-(1/2)(\vec{v}_i-\vec{v})^T \Sigma^{-1} (\vec{v}_i-\vec{v})}}{\sum_{j=1}^{M} e^{-(1/2)(\vec{v}_j-\vec{v})^T \Sigma^{-1} (\vec{v}_j-\vec{v})}}, \tag{3.3}
$$
where $\Sigma$ is the covariance matrix that depends on the direction of $\vec{v}(\vec{x},t)$ and is specified in terms of its longitudinal and transverse components $\sigma^2_{l:v,L}$ and $\sigma^2_{l:v,T}$. The experimental data (Anstis & Ramachandran, 1987; Gottsdanker, 1956; Werkhoven et al., 1992) suggest that temporal integration occurs for velocity direction rather than for speed. This is built into our model by choosing the covariances so that the variance is bigger in the direction of motion than in the perpendicular direction (i.e., the velocity component perpendicular to the motion has mean zero and very small variance, so the direction of motion is fairly accurate, but the variance of the velocity component along the direction of motion is bigger, which means that the estimation of the speed is not accurate). We have assumed that the response of the measurement device is instantaneous. It would be possible to adapt the likelihood function to allow for a time lag, but we have not pursued this. Such a model might be needed to account for motion blurring.

3.2 Prior Function. We now turn to the prior function for position and velocity. For a dot moving at approximately constant velocity, the state transitions for position and velocity are given respectively by $\vec{x}(t+\delta) = \vec{x}(t) + \delta \vec{v}(t) + \vec{\omega}_x(t)$ and $\vec{v}(t+\delta) = \vec{v}(t) + \vec{\omega}_v(t)$, where $\vec{\omega}_x(t)$ and $\vec{\omega}_v(t)$ are random variables representing uncertainty in the position and velocity. Extending this model to apply to the flow of dots in an image requires a conditional probability distribution $P_p(\vec{v}(\vec{x},t) \mid \{\vec{v}_j(\vec{x}_i,t-\delta)\})$, where the sets of spatial positions and allowed velocities are discretized, with the spatial positions set to $\{\vec{x}_i : i = 1, \ldots, N\}$ and the velocities at $\vec{x}_i$ to $\{\vec{v}_j : j = 1, \ldots, M\}$. We are assuming that $P_p(\{\vec{v}(\vec{x}_i,t)\} \mid \cdot) = \prod_i P_p(\vec{v}(\vec{x}_i,t) \mid \cdot)$, so that the velocities at each point are predicted independently of each other.
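The constant-velocity state transitions above can be simulated directly; isotropic gaussian noise is assumed here for brevity, whereas the article's covariances are direction dependent:

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(x, v, delta, sigma_x, sigma_v):
    """One state transition of the constant-velocity dot model:
    x(t + delta) = x(t) + delta * v(t) + noise_x,
    v(t + delta) = v(t) + noise_v,
    with isotropic gaussian noise of scales sigma_x and sigma_v."""
    x_new = x + delta * v + rng.normal(0.0, sigma_x, size=2)
    v_new = v + rng.normal(0.0, sigma_v, size=2)
    return x_new, v_new

# Noise-free check: the dot advances exactly delta * v per step.
x, v = transition(np.zeros(2), np.array([1.0, 2.0]), 0.5, 0.0, 0.0)
```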
We also assume that the prior is built out of two components: (1) the probability
$p(\vec{x} \mid \vec{x}_i, \vec{v}_j(\vec{x}_i,t-\delta))$ that a dot at position $\vec{x}_i$ with velocity $\vec{v}_j$ at time $t-\delta$ will be present at $\vec{x}$ at time $t$, and (2) the probability $P(\vec{v}(\vec{x},t) \mid \vec{v}_j(\vec{x}_i,t-\delta))$ that it will have velocity $\vec{v}(\vec{x},t)$. These components are combined to give
$$
P_p(\vec{v}(\vec{x},t) \mid \{\vec{v}_j(\vec{x}_i,t-\delta)\}) = \frac{1}{K(\vec{x},t)} \sum_i \sum_j P(\vec{v}(\vec{x},t) \mid \vec{v}_j(\vec{x}_i,t-\delta))\, p(\vec{x} \mid \vec{x}_i, \vec{v}_j(\vec{x}_i,t-\delta)), \tag{3.4}
$$
where $K(\vec{x},t)$ is a normalization factor chosen to ensure that $P_p(\vec{v}(\vec{x},t) \mid \{\vec{v}_j(\vec{x}_i,t-\delta)\})$ is normalized. The two conditional probability distribution functions in equation 3.4 are assumed to be normally distributed, and thus
$$
p(\vec{x} \mid \vec{x}_i, \vec{v}_j(\vec{x}_i,t-\delta)) \sim e^{-(1/2)(\vec{x}-\vec{x}_i-\delta \vec{v}_j(\vec{x}_i,t-\delta))^T \Sigma_x^{-1} (\vec{x}-\vec{x}_i-\delta \vec{v}_j(\vec{x}_i,t-\delta))} \tag{3.5}
$$
$$
P(\vec{v}(\vec{x},t) \mid \vec{v}_j(\vec{x}_i,t-\delta)) \sim e^{-(1/2)(\vec{v}(\vec{x},t)-\vec{v}_j(\vec{x}_i,t-\delta))^T \Sigma_v^{-1} (\vec{v}(\vec{x},t)-\vec{v}_j(\vec{x}_i,t-\delta))}. \tag{3.6}
$$
These functions express the belief that the dots are, on average, moving along a straight trajectory (see equation 3.5) with constant velocity (see equation 3.6). The covariances $\Sigma_x$ and $\Sigma_v$, which quantify the statistical deviations from the model, are defined in a coordinate system based on the velocity vector $\vec{v}_j$. These matrices are diagonal in this coordinate system and are expressed in terms of their longitudinal and transverse components $\sigma^2_{x,L}$, $\sigma^2_{x,T}$, $\sigma^2_{v,L}$, $\sigma^2_{v,T}$. (This model predicts future positions and velocities independently; it does not take into account the possibility that if the speed increases during the motion, then the dot will travel further. This is an acceptable, and standard, approximation provided that $\delta$ is small enough. It becomes exact in the limit as $\delta \to 0$.)

3.3 Continuous Prediction. Computer simulations of equation 2.6 (even for the simple case of moving dots) require large-kernel convolutions, which require excessive computational time. Instead, we reexpress equation 2.6 in the continuous time domain. (Such a reexpression is not always possible and depends on the specific choices of probability distributions. Note that a similar approach has been applied to the computation of stochastic completion fields by Williams and Jacobs (1997a, 1997b).) As shown in the appendix for gaussian priors, the evolution of $P(\vec{v}(\vec{x},t) \mid W(t-\delta))$ for $\delta \to 0$ satisfies a variant of a partial differential equation known as Kolmogorov's forward equation or the Fokker-Planck equation. Our equation is a nonstandard variant because, unlike standard Kolmogorov theory, our theory involves probability distributions at every point in space that interact over time. It is given by:
$$
\frac{\partial}{\partial t} P(\vec{v}(\vec{x},t)) = \frac{1}{2} \left\{ \frac{\partial^T}{\partial \vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial \vec{v}} \right\} P(\vec{v}(\vec{x},t)) + \frac{1}{2} \left\{ \frac{\partial^T}{\partial \vec{x}} \Sigma_{\vec{x}} \frac{\partial}{\partial \vec{x}} \right\} P(\vec{v}(\vec{x},t)) - \vec{v} \cdot \frac{\partial}{\partial \vec{x}} P(\vec{v}(\vec{x},t)) + \frac{1}{|V|} \frac{\partial}{\partial \vec{x}} \cdot \int \vec{v}\, P(\vec{v}(\vec{x},t))\, d\vec{v}, \tag{3.7}
$$
where $|V|$ is the volume $\int d\vec{v}$ of velocity space, and $\{\frac{\partial^T}{\partial \vec{x}} \Sigma_{\vec{x}} \frac{\partial}{\partial \vec{x}}\}$ and $\{\frac{\partial^T}{\partial \vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial \vec{v}}\}$ are scalar differential operators (for isotropic diffusions, these scalar operators are Laplacian operators). The terms on the right-hand side of this equation stand for velocity diffusion (first term), spatial diffusion (second term) with deterministic drift (third term), and normalization (fourth term). This equation is local and describes the spatiotemporal evolution of the probability distributions $P(\vec{v}(\vec{x},t))$.

4 The Implementation
We now address the implementation of the update rule for motion estimation (see equation 2.7). At each spatial position, two separate cell populations are assumed: one for coding the likelihood function and another for coding motion estimation. If we were to choose the large-kernel convolution for motion prediction (see equation 2.6), then the network implementation would reduce to two positive linear networks multiplying each other followed by a normalization, as illustrated in Figure 1 (see Yuille et al., 1998). But a problem with implementing this network on serial computers is that the range of the connections between motion estimation cells depends on the magnitude of $\vec{v}_j$ (that is, we would need long-range connections for high speed tunings). Such long-range connections occur in biological systems, like the visual cortex, but are not well suited to very large-scale integrated or serial computer implementations. Alternatively, using our differential equation for predicting motion (see equation 3.7) involves only local connections. We now focus on this new implementation (see Figure 2). Kolmogorov's equation can be solved on an arbitrary lattice using a Taylor series expansion. The spatial term, which contains the diffusion and drift terms, can be written so that the partial derivatives at lattice points are functions of the values of the closest neighbor points. For the sake of numerical accuracy, we have chosen a spatial mesh system where the points are arranged hexagonally. Furthermore, we found it convenient to express the velocity vectors in polar coordinates, where their norm $s$ and angle $\theta$ can be represented on a rectangular mesh. The change of variable is given by $\tilde{p}(s,\theta) = s\, p(\vec{v})$. In polar coordinates, the differential operator $\frac{\partial^T}{\partial \vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial \vec{v}}$ becomes
$$
L_v[\tilde{p}] = \frac{1}{2}\, \sigma^2_{v,L}\, \frac{\partial^2 \tilde{p}}{\partial s^2} - \frac{1}{2} \left( \sigma^2_{v,L} - \sigma^2_{v,T} \right) \frac{\partial}{\partial s} \left( s^{-1} \tilde{p} \right) + \frac{1}{2}\, \sigma^2_{v,T}\, \frac{1}{s^2}\, \frac{\partial^2 \tilde{p}}{\partial \theta^2}, \tag{4.1}
$$
where $\sigma^2_{v,L}$ and $\sigma^2_{v,T}$ are the variances in the longitudinal and transverse directions, respectively.
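As a minimal sketch of the local prediction dynamics, the spatial terms of equation 3.7 can be stepped explicitly in one dimension for a single drift velocity; the full model also diffuses in velocity and uses the hexagonal lattice described above:

```python
import numpy as np

def drift_diffusion_step(P, v, sigma2, dx, dt):
    """One explicit finite-difference step of the 1-D sketch
    dP/dt = (sigma2 / 2) d2P/dx2 - v dP/dx, with a final
    renormalization standing in for the fourth term of equation 3.7.
    Periodic boundary conditions, as in the article's implementation."""
    Pm, Pp = np.roll(P, 1), np.roll(P, -1)          # P(x -/+ dx)
    diffusion = 0.5 * sigma2 * (Pp - 2.0 * P + Pm) / dx**2
    drift = -v * (Pp - Pm) / (2.0 * dx)
    P_new = P + dt * (diffusion + drift)            # explicit Euler
    return P_new / P_new.sum()                      # renormalize

# A gaussian bump drifting to the right; dt respects the von Neumann
# stability bound (sigma2 * dt / (2 * dx**2) well below 1/2).
x = np.arange(64.0)
P = np.exp(-0.5 * ((x - 32.0) / 3.0) ** 2)
P /= P.sum()
for _ in range(50):
    P = drift_diffusion_step(P, v=1.0, sigma2=2.0, dx=1.0, dt=0.2)
```

After 50 steps of duration 0.2, the bump has drifted roughly $v \cdot t = 10$ lattice units while spreading, mirroring the diffusion of the prediction in the absence of measurements.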
Figure 1: Network for motion estimation. The network consists of two interacting layers. The top square shows the observation layer, consisting of cells organized in columns on the (x, y) lattice. At each of the 12 spatial positions, there is a velocity column, which we display by two cells, shown as circles, with the adjacent arrows indicating their velocity tunings (here either horizontal or vertical). The lower square represents the estimation layer, which also consists of cells organized as columns on the (x, y) lattice. In the measurement stage, the observation cells influence the estimation cells by excitatory (indicated by the plus sign) connections and multiplicative interactions (indicated by the cross sign). The excitation is higher between cells with similar velocity tuning, which we indicate by strong (solid lines) or weak (dashed lines) connections. In the prediction stage, cells within the estimation layer excite each other (again with the strength of the excitation being largest for cells with similar velocity tuning). Inhibitory connections within the columns are used to ensure normalization.
Figure 2: Network for motion estimation using continuous prediction. (A) Continuous motion prediction (lower part of the network) is accomplished through nearest-neighbor spatial interactions, in this case on a hexagonal lattice. The velocity distribution is represented at each spatial position by a set of velocity cells (small circles) organized according to a polar representation (small panel). Observation cells in the measurement layer (upper part) influence the estimation cells by multiplicative interactions (indicated by ×) to yield the motion estimation activities $a$. (B) Interactions between two populations of velocity cells involve cells tuned to the same velocities; interactions within a population involve cells tuned to different velocities.
We let the activities of the $M$ motion estimation cells at time $t$ be represented by a vector $\vec{a}(\vec{x},t) = (a_1(\vec{x},t), \ldots, a_M(\vec{x},t))$. The iterative method for determining these activities involves a four-step scheme. The first three steps are concerned with solving Kolmogorov's equation. For a cell at position $\vec{x} = (x,y)$ and of velocity tuning $\vec{v}_m = (s,\theta)$, the three steps are:
$$
a^{t+1/4}_{x,y,s,\theta} = a^{t}_{x,y,s,\theta} + \frac{\Delta t}{\Delta s} \sum_{\iota,\kappa} A_{\iota,\kappa}(s)\, a^{t}_{x,y,s+\iota\Delta s,\theta+\kappa\Delta\theta}
$$
$$
a^{t+1/2}_{x,y,s,\theta} = a^{t+1/4}_{x,y,s,\theta} + \Delta t \sum_{i,\chi} B_{i,\chi}(s,\theta)\, a^{t+1/4}_{x+i,y+\chi,s,\theta}
$$
$$
a^{t+3/4}_{x,y,s,\theta} = a^{t+1/2}_{x,y,s,\theta} - \frac{\Delta t}{M} \sum_{s',\theta'} \sum_{i,\chi} B_{i,\chi}(s',\theta')\, a^{t+1/2}_{x+i,y+\chi,s',\theta'} \tag{4.2}
$$
where the coefficients $A_{\iota,\kappa}$ and $B_{i,\chi}$ are functions of $s$, $\theta$, and the covariance matrices. The constants $\Delta s$, $\Delta\theta$ represent the quantization factors for $s$ and $\theta$. The superscripts $t+1/4$, $t+1/2$, $t+3/4$ indicate the order in which these
three steps proceed (a single time step for the complete system is broken down into four substeps for implementational convenience). The first step evaluates the velocity differential operator and involves four neighbor cells. The second step calculates the spatial differential operator and involves six neighbor points (on the hexagonal spatial lattice). Periodic boundary conditions are assumed in space and velocity. The third step performs the normalization. To guarantee stability of the whole iterative method, the time step $\Delta t$ has been determined using von Neumann analysis (see Courant & Hilbert, 1953), which considers the independent solutions, or eigenmodes, of finite-difference equations. Typically the stability conditions involve ratios between the time step and the quantization scales of space and velocity. If we are performing prediction without measurement, then the fourth step, determining the activity of the motion estimation cells, simply reduces to
$$
a^{t+1}_{x,y,s,\theta} = a^{t+3/4}_{x,y,s,\theta}. \tag{4.3}
$$
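On a rectangular lattice (instead of the article's hexagonal mesh), the prediction substeps can be sketched with hypothetical stencil-coefficient arrays `A` (over velocity-space offsets) and `B` (over spatial offsets); the normalization substep is simplified here to a direct renormalization of each velocity column:

```python
import numpy as np

def prediction_substeps(a, A, B, dt):
    """Sketch of the substep scheme of equations 4.2 and 4.3 for pure
    prediction. a has shape (X, Y, S, H): space x space x speed x
    direction. A[di, dk] and B[di, dj] are 3x3 stencils of hypothetical
    coefficients; periodic boundaries in space and velocity."""
    # substep 1: velocity diffusion, a stencil over (s, theta) offsets
    a1 = a.copy()
    for (di, dk), c in np.ndenumerate(A):
        a1 += dt * c * np.roll(a, (di - 1, dk - 1), axis=(2, 3))
    # substep 2: spatial diffusion and drift, a stencil over (x, y)
    a2 = a1.copy()
    for (di, dj), c in np.ndenumerate(B):
        a2 += dt * c * np.roll(a1, (di - 1, dj - 1), axis=(0, 1))
    # substep 3: renormalize each velocity column to sum to one
    return a2 / a2.sum(axis=(2, 3), keepdims=True)

a = np.full((4, 4, 3, 6), 1.0 / 18.0)        # uniform initial columns
out = prediction_substeps(a, np.full((3, 3), 0.01),
                          np.full((3, 3), 0.01), dt=0.1)
```

The stencil coefficients and the subtractive normalization of the article are replaced by placeholders here; only the substep structure and the column-wise probability interpretation are retained.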
If there are measurements, then we apply the motion estimation equation 2.7, which consists of multiplying the motion prediction with the likelihood function (as defined by equations 2.8 and 2.9). Therefore, the complete update rule is
$$
a^{t+1}_{x,y,s,\theta} = \frac{1}{N(\vec{x},t+1)}\; a^{t+3/4}_{x,y,s,\theta} \cdot \sum_j w_j(\vec{x},t) \cdot f_j(\vec{v}_m(\vec{x},t)), \tag{4.4}
$$
where $N(\vec{x},t+1)$ is a normalization constant imposing $\sum_{m=1}^{M} a_m(\vec{x},t+1) = 1$ for all $\vec{x}, t$, so that we can interpret these activities as the probabilities of the true velocities.

5 Methods
Our model has two extreme forms of behavior, depending on how well the input data agree with the model predictions. If the data are generally in agreement with the model's predictions, then it behaves like a standard Kalman filter (i.e., with gaussian distributions) and integrates out the noise. At the other extreme, if the data are extremely different from the prediction, then the model treats the data as outliers or distractors and ignores them. We are mainly concerned here with situations where the model rejects data. The precise situation where noise and distractors start getting confused is very interesting, but we do not address it here. We first checked the ability of our model to integrate out weak noisy stimuli while tracking a single dot moving in an empty background. For this situation, temporal grouping (data association) is straightforward, and so
Figure 3: Speed discrimination for two moving dots. This graph shows the improved ability of our theory to estimate the relative speed of two dots (one of which is moving at 2 jumps per second) as a function of the number of jumps. The level of 80% correct discrimination is reached for diminishing speed differences (measured by dV / V) when the number of jumps increases, consistent with psychophysics (McKee et al., 1986).
standard Kalman filters can be used. To evaluate the model, we ran a set of trials that plotted the relative estimation of two velocities as a function of the number of jumps. The graph we obtained (see Figure 3) is very similar to those reported in the psychophysical literature (McKee et al., 1986). Henceforth, we assume that the model is able to integrate out noisy input. For the remaining experiments, we assumed that the local velocity measurements were "noisy" in the sense that the measurement units specified a probability distribution over velocity measurements rather than a single velocity. However, the "probability of velocities" itself was not noisy. We next tested our model on three psychophysical motion phenomena. First, we considered the improved accuracy of velocity estimation of a single dot over time (Snowden & Braddick, 1989; Watamaniuk, Sekuler, & Williams, 1989; Welch, Macleod, & McKee, 1997; Ascher, 1998). Then we examined the predictions for the motion occlusion experiments (Watamaniuk & McKee, 1995). Finally, we investigated the motion outlier experiments (Watamaniuk et al., 1994). For all simulations, we set the parameters as follows. All dots were moving at a speed of 6 jumps per second (jps), where one jump corresponds to the distance separating two neighboring pixels in the horizontal direction. The longitudinal and transverse components of the covariance matrices were
$\sigma_{m:x,L} = 0.8$ jump, $\sigma_{m:x,T} = 0.4$ jump, $\sigma_{m:v,L} = 3.2$ jps, and $\sigma_{m:v,T} = 2.6$ jps for the measurement; $\sigma_{l:v,L} = 2.2$ jps and $\sigma_{l:v,T} = 1.1$ jps for the likelihood function; and $\sigma_{x,L} = 0.6$ jump, $\sigma_{x,T} = 0.3$ jump, $\sigma_{v,L} = 0.8$ jps, and $\sigma_{v,T} = 0.4$ jps for the prior function. These parameters were chosen by experimenting with the computer implementation, but the results are relatively insensitive to their precise values; detailed fine tuning was not required. Except for one of the occluder experiments (the occluder defined by distractor motion), where we used 18 velocity channels (six equidistant directions and three speeds), we used 30 velocity channels positioned at each spatial position, that is, six equidistant directions (thus $\Delta\theta = 60$ degrees, starting at the origin) and five speeds ($\Delta r = 2$ jps, with the slowest speed channel tuned to 2 jps). To guarantee stability of the numerical method, we set the time increment to $\Delta t = 0.6$ ms. Initial conditions were uniformly distributed density functions. There were $32 \times 32$ spatial positions.

6 Results

6.1 Velocity Estimation in Time. For this experiment, the stimulus was a single dot moving in an empty background. To evaluate how the velocity estimation evolves in time, we measured two numbers. The first number was the sharpness of the velocity probability at each position, which we define to be the negative of the entropy of the distribution (the entropy of the distribution is given by $-\sum_{j=1}^{M} a_j(\vec{x},t) \log a_j(\vec{x},t)$) plus the maximum of the entropy (to ensure that the sharpness is nonnegative). Observe that the sharpness is the Kullback-Leibler divergence $D(a \| U)$ between the velocity distribution $a_j(\vec{x},t)$ and the uniform distribution $U_j = 1/M$ for all $j$ (more precisely, $D(a \| U) = \sum_{j=1}^{M} a_j(\vec{x},t) \log\{a_j(\vec{x},t)/U_j\}$). As is well known (Cover & Thomas, 1991), the Kullback-Leibler divergence is positive semidefinite, taking its minimum value of 0 when $a_j = U_j$ for all $j$, and it increases the more $a_j$ differs from the uniform distribution $U_j$ (i.e., the sharper the distribution $a_j$ becomes). Hence the higher this sharpness is, the more precise the velocity estimate is. Moreover, the sharpness will be lowest in positions where no dots are moving (i.e., for which $w_j(\vec{x},t) = 1/M$, $j = 1, \ldots, M$). The target, a coherently moving dot, should have a relatively high sharpness response surrounded by low sharpness responses in neighboring positions. The second number is the confidence factor $P(\vec{w}(\vec{x},t) \mid W(t-\delta))$ in equation 2.7. This measures how probable the model judges the new input data. The higher this number is, the more the new measurement is consistent with the prediction (and hence the more likely that the measurement is due to coherent motion). The results of motion estimation for this experimental stimulus are shown in Figure 4. This figure shows an initial fast decrease in the entropy of the velocity distribution followed by a slow decrease (i.e., an increase in sharpness of the density function). Also shown is the increase in the confidence
Figure 4: Single dot experiment. A single dot is moving with constant velocity. (A) The increasing accuracy of the velocity estimate with time is visible as an increase in the confidence factor and the sharpness of the density function. Observe how both the confidence and sharpness increase rapidly at first but then quickly start to flatten out. (Our model outputs a data point at each time frame, and our plotting package joins them together by straight-line segments.) (B) Sharpness of the density function shown for each spatial position at successive time frames. The system converges to a single peak in the velocity estimation (though we do not show this here).
factor of the target velocity estimation at each new iteration. These two effects indicate the increased accuracy of the velocity estimation with each new iteration, consistent with the psychophysical literature (Snowden & Braddick, 1989; Watamaniuk et al., 1989; Welch et al., 1997; Ascher, 1998). Note that for coherent motion, the sharpness of the model increases rapidly at first and then appears to slow down. Our results suggest that the model reaches an asymptotic limit, although they do not completely rule out a continual gradual increase. We argue that asymptotic behavior is more likely, as both the predictions and the measurements have inherent noise that can never be eliminated. Moreover, it can be proven that the standard (gaussian) Kalman model will converge to an asymptotic limit depending on the variances of the prior and likelihood functions (we verified this by computer simulations). It seems also that human observers reach a similar asymptotic limit, and their accuracy does not become infinitely good with an increasing number of input stimuli. In addition, we plot the sharpness of the velocity estimates as a function of spatial position in Figure 4B (each column of velocity-tuned cells has a sharpness associated with its velocity estimate). We also tested the estimation of velocity for a dot moving on a circular trajectory, and our model can successfully estimate the dot's speed, although the sharpness and confidence are not as high as they are for straight motion (recall that the prior prefers straight trajectories). This again seems consistent with the psychophysics (Watamaniuk et al., 1994; Verghese et al., 1999).

6.2 Single Dot with Occluders. Next, we explored what would happen if the target dot went behind an occluding region where no input measurements could be made (we call this a black occluder). The results were the same as in the previous case until the target dot reached the occluder.
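The sharpness measure used throughout these experiments, the Kullback-Leibler divergence from the uniform distribution (equivalently $\log M$ minus the entropy), can be sketched as:

```python
import numpy as np

def sharpness(a, eps=1e-12):
    """Sharpness of a discrete velocity distribution a: the KL
    divergence D(a || U) from the uniform distribution U_j = 1/M,
    i.e., log M minus the entropy of a. eps avoids log(0)."""
    a = np.asarray(a, dtype=float)
    return float(np.sum(a * np.log((a + eps) * a.size)))

# The uniform distribution has minimal sharpness (0); a one-hot
# distribution reaches the maximum, log M.
flat = sharpness(np.full(6, 1.0 / 6.0))
peaked = sharpness(np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
```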
In the occluded region, the motion model continued to propagate the target dot, but, lacking supporting observations, the probability distribution started to diffuse in space and in velocity. This dual diffusion can be seen in Figure 5, where the sharpness of velocity estimation is shown after the dot entered the occluding region. Observe the decrease in sharpness of the density function, indicating a degradation of velocity estimation, when the target is behind the occluder. However, the model still has enough "momentum" to propagate its estimate of the target dot's position even though no measurements are available (this will break down if the occluder gets too large). This is consistent with the findings of Watamaniuk and McKee (1995), who showed that observers had a higher-than-average chance of detecting the target when it emerges from the occluder into a noisy background. We then tested our model with motion occluders, that is, occluders defined by motion flows as described in Watamaniuk and McKee (1995) (see Figure 6A). We use the same measures as for the single isolated dot. We also plot the cells' activities as a function of the direction tuning for the cells tuned to the optimal speed (see Figure 6B). The plot indicates that the cells can signal nonzero probabilities for several motions at the same time. After entering the
Figure 5: Single dot experiment with a black occluder. (A) This representation shows how the spatial blur and sharpness of the velocity distribution change with time (measured in time frames) when a single dot moving along the y-axis gets occluded and reappears. The occluding area extends from y = 15 up to y = 22. (B) The y-location of the peak of the velocity distribution as a function of time. Motion inertia (momentum) keeps the peak moving during the occlusion, albeit with a tendency to slow down (visible as the wider plateau around time frame 20).
Figure 6: Single dot experiment with occluders defined by distractor motion. (A) The occluder is defined by the vertical motion of distractor dots, visible as circles in the top-left frame, with arrows indicating velocity (the dots' speed is 6 jumps per second and is represented by the length of the lines). The target, shown at time frame 15 and visible as a diamond, is moving horizontally from left to right. The height and width of the rectangles at each point indicate the confidence and sharpness, respectively (a small rectangle indicates low confidence and sharpness). The lattice, marked with small dots, is hexagonal for the measurement, estimation, and prediction stages. (B) The probability distribution of the velocity cells tuned to 6 jumps per second, plotted as a function of directional tuning at different time steps. The motion-inertia effect of the target motion on the distractors is visible at time frame 18, as the target dot enters the occluder and two peaks start developing in the probability distribution. The bigger the occluder, the more the peak induced by the motion of the distractor dots starts dominating. But as the dot reemerges from the occluder, its peak rapidly becomes sharp again, as visible at time frame 20.
occluding region, the peak corresponding to the target motion gets smaller than the peak due to the occluders. However, the peak of the target remains, and so the target peak can increase rapidly as the target exits from the occluders. Our model predicts that it is easier to deal with black occluders than with motion occluders, which is not consistent with the psychophysics, which shows a rough similarity of effects for both types of occluders. We offer two possible explanations for this inconsistency. First, the model's directionally selective units and/or prediction-propagation mechanism may have too broad a directional tuning; narrowing it may lead to weaker interactions between the motion occluder and the target. Second, in contrast with the black occluders, the motion occluders may group together as a surface, which would help explain why the visual system may be more tolerant of motion occluders. (Such a heightened tolerance might compensate for the initial putative advantage of the black occluders.) Kanizsa (1979) demonstrates several perceptual phenomena that appear to require this type of explanation. In motion, similar effects have been modeled by Wang and Adelson (1993) in their work on layered motion. A limitation of our current model is that it does not include such effects.

6.3 Outlier Detection. In the outlier detection experiments (Watamaniuk et al., 1994; Verghese et al., 1999), the target dot undergoes regular motion, but it is surrounded by distractor dots undergoing random Brownian motion. To show the velocity estimation at each position, we plot the response of our network using a rectangle to display the two properties of sharpness and confidence. The width of the rectangle gives the sharpness of the set of cells at that point, and the height gives the confidence factor. The arrow shows the mean of the estimated velocity.
It can be seen in Figure 7 that the target dot's signal rapidly reaches large sharpness and confidence by comparison to the distractor dots, which are not moving coherently enough to gain confidence or sharpness. The sharpness of the target dot's signal does not grow monotonically because the distractor dots sometimes interfere with the target's motion by causing distracting activity in the measurement cells.

7 Discussion
This work provides a theory for temporal coherence. The theory was formulated in terms of Bayesian estimation using motion measurements and motion prediction. By incorporating robustness in the measurement model, the system could perform temporal grouping (or data association), which enabled it to choose what data should be used to update the state estimation and what data could be ignored. We also derived a continuous form for the prediction equation, and showed that it corresponds to a variant of Kolmogorov’s equations at each node. In deriving our theory, we made an assumption of spatial factorizability of the probability distributions and
Probabilistic Motion Estimation
1859
Figure 7: Outlier detection. A target dot (shown as a diamond with a line pointing rightward, indicating its velocity) is halfway down the stimulus frame, moving from left to right with noise-free motion. It is surrounded by distractor dots undergoing Brownian motion. After a few time steps, the model gives high confidence and sharpness for the target dot. Observe that motion estimates of the distractor dots can sometimes become sharp if their motion direction is not changing too radically between two time frames (see the large rectangles at the bottom of the estimation and prediction panels). Graphic conventions as in Figure 6.
made the approximation of updating the marginal distributions of velocity at each point. This allowed us to perform local computations and simplified our implementation. We argued that these assumptions and approximations (as embodied in our choice of prior probability) are suitable for the stimuli we are considering, but they would need to be revised to include spatial coherence effects (Yuille & Grzywacz, 1988, 1989). The prior distribution used in this article would need to be modified to incorporate these effects. In addition, recent results by McKee and Verghese (personal communication) suggest that a more sophisticated prior may be needed. Their results seem to show that observers are better at predicting where there is likely to be motion rather than what the velocity will be. Recall that our prior probability model has two predictive components based on the current estimation of velocity at each image position. First, the velocity estimation is used to predict the positions in the image where the velocity cells would be excited. Second, we predict that the new velocity is close to the original velocity. By contrast, McKee and Verghese's results seem to show that human observers can cope with reversals of motion direction (such as a ball bouncing off a wall). This is a topic of current research, and we note that priors that allow for this reversal have already been developed by the computer vision tracking community (Isard & Blake, 1998). Simulations of the model seem in qualitative agreement with a number of existing psychophysical experiments on motion estimation over time, motion occluders, and motion outliers (McKee et al., 1986; Watamaniuk & McKee, 1995; Watamaniuk et al., 1994). Such a qualitative agreement is characterized by three main features: (1) the final covariance of the motion estimate, (2) the number of jumps needed to reach such a final covariance, and (3) the extent of motion inertia during an occlusion, or between two measurements. These three features are controlled as follows: the likelihood's covariance matrix determines the initial distribution, and thus the number of jumps to reach the desired covariance, while the prior's covariance matrix affects all three features by imposing an upper bound on the covariance and setting the time constants of the diffusion process between measurements. Implementation of the model's equations led to a network that shares interesting similarities with known properties of the visual cortex. It involves a columnar structure where the columns, at each spatial point, are composed of a set of cells tuned to a range of velocities (columnar structures exist in the cortex, but none has yet been found for velocities). The observation cell layer involves local connections within the columns to compute the likelihood function. The estimated velocity flow is computed in a second layer. If these layers exist, it would be natural for them to lie in cortical areas involved with motion, such as MT and MST (Maunsell & Newsome, 1987; Merigan & Maunsell, 1993). The estimation layer carries out the calculation according to Kolmogorov's equation, or the discrete long-range version described in Yuille et al. (1998), and consists of spatial columns of cells that locally excite each other to predict the expected velocities in the future.
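The prediction stage just described (each velocity hypothesis advects the probability to new positions while the velocity itself is predicted to change slowly) can be sketched discretely. This is a minimal sketch on a one-dimensional lattice; the function names, lattice size, and gaussian widths are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def gaussian(d, sigma):
    """Unnormalized gaussian bump, used for both tuning terms."""
    return np.exp(-0.5 * (d / sigma) ** 2)

def predict(p, positions, velocities, dt=1.0, sig_x=0.5, sig_v=0.5):
    """One prediction step: probability mass at (x', v') is moved to
    positions near x' + dt*v' and to velocities near v', mirroring the
    drift-plus-blur structure of the prior. p[x, v] = P(v at x)."""
    q = np.zeros_like(p)
    for xi, x in enumerate(positions):
        for vi, v in enumerate(velocities):
            for xj, x2 in enumerate(positions):
                for vj, v2 in enumerate(velocities):
                    q[xi, vi] += (gaussian(x - x2 - dt * v2, sig_x)
                                  * gaussian(v - v2, sig_v)
                                  * p[xj, vj])
    return q / q.sum(axis=1, keepdims=True)  # marginals normalized per position

positions = np.arange(8, dtype=float)
velocities = np.array([-1.0, 0.0, 1.0])
p = np.full((8, 3), 1.0 / 3.0)        # uniform everywhere ...
p[2, 2] = 10.0                        # ... except strong rightward motion at x = 2
p /= p.sum(axis=1, keepdims=True)
p_next = predict(p, positions, velocities)
# after one step the rightward peak has drifted toward x = 3
```

The per-position normalization at the end plays the role the text attributes to lateral inhibition within each column.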
The calculation requires excitatory and inhibitory synapses (i.e., no "negative synapses"). The inputs from the observation layer multiply the predictions in the estimation layer. Finally, we postulate that inhibition occurs in each column to ensure normalization of the velocity estimates at each column. This normalization could be implemented by lateral inhibition, as proposed by Heeger (1992). A difficult aspect of this network, as a biological model, is its need for multiplications (such multiplications arise naturally from our probabilistic formulation using the rules for combining probabilities). Neuronal multiplication mechanisms were argued for by Reichardt (1961) and Poggio and Reichardt (1973, 1976) on theoretical grounds. A specific biophysical model to approximate multiplication was proposed by Torre and Poggio (1978). Detailed investigations of this model, however, showed that it provided at best a rough approximation for multiplication (Grzywacz & Koch, 1987). Moreover, experiments showed that motion computations in the rabbit retina, for which the model was originally proposed, were not well approximated by multiplications (Grzywacz, Amthor, & Mistler, 1990). Recent related work (Mel, Ruderman, & Archie, 1988; Mel, 1993), though arguing
for multiplication-like processing, has not attempted to attain good approximations to multiplication. On the other hand, the complexities of neurons make it hard to rule out the existence of domains in which they perform multiplications. Overall, it seems best to be agnostic about neuronal multiplication and trust that more detailed experiments will resolve this issue. More pragmatically, it may be best to favor neuronal models that correspond to known biophysical mechanisms. The biophysics of neurons may prevent them from performing "perfect" Bayesian computations, and perhaps one should instead seek optimal models that respect these biophysical limitations. Finally, we emphasize that simple networks of the type we propose can implement Bayesian generalizations of Kalman filters for estimation and prediction. The popularity of standard linear Kalman filters is often due to pragmatic reasons of computational efficiency. Thus, linear Kalman filters are often used even when they are inappropriate as models of the underlying processes. In contrast, the main drawback of such Bayesian generalizations is the high computational cost, although statistical sampling methods can be used successfully in some cases (see Isard & Blake, 1996). Our study shows that these computational costs may be better dealt with by using networks with local connections, and we are investigating the possibility of implementing our model on parallel architectures in the expectation that we will then be able to do Bayesian estimation of motion in real time.

Appendix

A.1 Receptive Field Tunings. More precisely, we set:
\[
w_i(\vec{x}, t) = \frac{y_i(\vec{x}, t)}{\sum_{j=1}^{N} y_j(\vec{x}, t)},
\]

where

\[
\begin{aligned}
y_i(\vec{x}, t) = A &+ \lambda_1 \sum_{\vec{x}' \in \mathrm{Nbh}(\vec{x})} G\big(\vec{x}' - \vec{x} : \hat{\vec{v}}_T(\vec{x}', t), \sigma^{o}_{x,t}, \sigma^{o}_{x,l}\big) \times G\big(\vec{v}_T(\vec{x}', t) - \vec{v}_i : \vec{v}_T(\vec{x}', t), \sigma^{o}_{v,t}, \sigma^{o}_{v,l}\big) \\
&+ (1 - \lambda_1) \sum_{\vec{x}' \in \mathrm{Nbh}(\vec{x})} G\big(\vec{x}' - \vec{x} : \hat{\vec{v}}_T(\vec{x}', t - \delta), \sigma^{o}_{x,t}, \sigma^{o}_{x,l}\big) \times G\big(\vec{v}_T(\vec{x}', t - \delta) - \vec{v}_i : \vec{v}_T(\vec{x}', t - \delta), \sigma^{o}_{v,t}, \sigma^{o}_{v,l}\big),
\end{aligned}
\]
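As a hypothetical numerical illustration of these normalized weights, here is a sketch with isotropic gaussians standing in for the longitudinal/transverse tuning above; the values of the rest activity, the mixing weight, and the widths are invented, and the memory term is omitted (the mixing weight set to 1).

```python
import numpy as np

def gaussian(d, sigma):
    return np.exp(-0.5 * (d / sigma) ** 2)

def observation_weights(x, dots, v_tuned, A=0.05, lam1=1.0,
                        sig_x=1.0, sig_v=0.5):
    """Normalized response w_i of the column of velocity-tuned cells at
    position x. dots is a list of (position, velocity) pairs; A is the
    rest activity, and lam1 plays the role of lambda_1."""
    y = np.full(len(v_tuned), A)
    for (x0, v0) in dots:
        for i, vi in enumerate(v_tuned):
            y[i] += lam1 * gaussian(x0 - x, sig_x) * gaussian(v0 - vi, sig_v)
    return y / y.sum()  # w_i = y_i / sum_j y_j

v_tuned = np.array([-1.0, 0.0, 1.0])
w_empty = observation_weights(0.0, [], v_tuned)          # no dot: uniform, 1/N each
w_dot = observation_weights(0.0, [(0.0, 1.0)], v_tuned)  # dot at center, v = +1
```

With no dot, the rest activity makes all weights equal; with one centered dot, the cell tuned to the dot's velocity dominates, as the text describes.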
and the three terms of the right-hand side of the equation are explained as follows. The first term, $A$, is the rest activity of the cells. It ensures that we will have observations $w_i(\vec{x}, t) = 1/N, \ \forall i$ if no velocity is present. In other words, if no dot moves across the cell, then all velocities are considered to be equally likely.
The second term contains summations over true velocities (i.e., dots) within the receptive field of the cell. It contains a spatial decay term $G(\vec{x}' - \vec{x} : \cdot, \cdot, \cdot)$ and a velocity tuning curve $G(\vec{v}_T(\vec{x}') - \vec{v}_i : \cdot, \cdot, \cdot)$. $G$ is a gaussian function with longitudinal and transverse variances $\sigma_{.,l}$, $\sigma_{.,t}$ defined in terms of the direction $\hat{\vec{v}}_T(\vec{x}', t)$. In addition, the velocity tuning can depend on the speed $\vec{v}_T(\vec{x}', t)$. The observation model therefore assumes that the cell can measure the components of position and velocity most accurately in the direction perpendicular to the true motion of the stimulus. Intuitively, the closer the dot is to the center of the receptive field and the closer its velocity is to the preferred velocity of the cell, the larger the response is. In realistic motions (i.e., not dots!) we can typically assume that the velocity will be constant within each cell except at motion boundaries and transparencies (though there are receptive fields at multiple scales, and so the velocity may not be constant within the large cells). The third term is similar to the second, except that it contains memory of the true velocity at time $t - \delta$. This "memory" is due to the delayed response of the cell. $0 \leq \lambda_1 \leq 1$, so the third term is always positive. The response of the observation cells at a point $\vec{x}$ at time $t$ can be thought of as a local estimate of the probability of velocity. If there is no true velocity to be observed, then the response of all these cells would be uniform (i.e., $w_i(\vec{x}, t) = 1/N, \ \forall i$). If the cells receive input from a single true velocity, then the observation cell best tuned to this velocity will have the biggest response. The closer the true velocity stimulus (i.e., the dot) is to the center of the cell, the stronger the peak. By contrast, multiple peaks can occur if there are several motions in the receptive field. This can occur for motion transparency, near-motion boundaries, and occluding motion. It can also occur in our model if there is one true motion in the receptive field at time $t - \delta$ and a different motion at time $t$.

A.2 Kolmogorov's Equation. In this appendix, we derive Kolmogorov's equation for motion prediction. Let us consider equation 2.6 for $\delta \to 0$. Taking this limit is tricky and requires Itô calculus (Jazwinski, 1970). Our case is also nonstandard because our theory involves probability distributions at every point in space, which interact over time. This means that we cannot simply use the standard Kolmogorov's equations, which apply to a single distribution only. Here we give a simple derivation that uses delta functions and is appropriate when the prior model is based on a gaussian $G(\cdot)$. The prior specified by equations 3.4 and 3.5 is written as:
\[
P(\vec{v}(\vec{x}, t + \delta) \mid \{\vec{v}'(\vec{x}', t)\}) = \int d\vec{x}' \int d\vec{v}'(\vec{x}') \, G\big(\vec{v}(\vec{x}, t + \delta) - \vec{v}'(\vec{x}', t) : \delta \Sigma_{\vec{v}}\big) \times \frac{G\big(\vec{x} - \vec{x}' - \delta \vec{v}'(\vec{x}', t) : \delta \Sigma_{\vec{x}}\big)}{\int d\vec{x}'' \int d\vec{v}''(\vec{x}'') \, G\big(\vec{x} - \vec{x}'' - \delta \vec{v}''(\vec{x}'', t) : \delta \Sigma_{\vec{x}}\big)}. \tag{A.1}
\]
The normalization term in the denominator is used to ensure that the probabilities of the velocities integrate to one at each spatial point $\vec{x}$. In this equation, we have scaled the covariances $\Sigma$ by $\delta$. This is necessary for taking the limit as $\delta \to 0$ (Jazwinski, 1970). Between observations, we can express the evolution of the prior density $P(\vec{v}(\vec{x}, t))$ as

\[
\frac{\partial P(\vec{v}(\vec{x}, t))}{\partial t} = \lim_{\delta \to 0} \left\{ \frac{P(\vec{v}(\vec{x}, t + \delta) \mid \{\vec{v}'(\vec{x}', t)\}) - P(\vec{v}(\vec{x}, t))}{\delta} \right\} = \lim_{\delta \to 0} \frac{1}{\delta} \left\{ \int d\vec{x}' \int d\vec{v}'(\vec{x}') \, P(\vec{v}(\vec{x}, t + \delta) \mid \{\vec{v}'(\vec{x}', t)\}) \, P(\{\vec{v}'(\vec{x}', t)\}) - P(\vec{v}(\vec{x}, t)) \right\}. \tag{A.2}
\]

We now perform a Taylor series expansion of $P(\vec{v}(\vec{x}, t + \delta) \mid \{\vec{v}'(\vec{x}', t)\})$ in powers of $\delta$, keeping the zeroth- and first-order terms (higher-order terms will vanish when we take the limit as $\delta \to 0$). To perform this expansion, we use the assumption that this distribution is expressed in terms of gaussians. As $\delta \to 0$, these gaussians will tend toward delta functions, and the derivatives of gaussians will tend toward derivatives of delta functions (thereby simplifying the expansion). This derivation can be justified rigorously, paying detailed attention to convergence issues, by the use of distribution theory or the application of Itô calculus. If we expand $G(\vec{v}(\vec{x}, t + \delta) - \vec{v}'(\vec{x}', t) : \delta \Sigma_{\vec{v}})$ about $\delta = 0$, we find that the zeroth-order term is a Dirac delta function with argument $\vec{v}(\vec{x}, t + \delta) - \vec{v}'(\vec{x}', t)$. This term can therefore be integrated out. The derivative with respect to $\delta$ will effectively correspond to differentiating the gaussian with respect to the covariance. By standard properties of the gaussian, this will be equivalent to a second-order spatial derivative. We will get similar terms if we expand $G(\vec{x} - \vec{x}' - \delta \vec{v}'(\vec{x}', t) : \delta \Sigma_{\vec{x}})$, but we will also get an additional drift term from differentiating the $\delta \vec{v}'(\vec{x}', t)$ argument. In addition, we will get other terms from the denominator of equation A.1.
(These are required to normalize the distributions and are nonstandard. They are needed because we have a differential equation for a set of interacting probability distributions, while the standard Kolmogorov's equation is for a single probability distribution.) We collect all these zeroth- and first-order terms in the expansion and substitute into equation A.2. We can then evaluate the integrals using known properties of the delta functions. After some algebra, these integrals yield our variant of Kolmogorov's equation:

\[
\frac{\partial}{\partial t} P(\vec{v}(\vec{x}, t)) = \frac{1}{2} \left\{ \frac{\partial^{T}}{\partial \vec{x}} \Sigma_{\vec{x}} \frac{\partial}{\partial \vec{x}} \right\} P(\vec{v}(\vec{x}, t)) + \frac{1}{2} \left\{ \frac{\partial^{T}}{\partial \vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial \vec{v}} \right\} P(\vec{v}(\vec{x}, t)) - \vec{v} \cdot \frac{\partial}{\partial \vec{x}} P(\vec{v}(\vec{x}, t)) + \frac{1}{|V|} \frac{\partial}{\partial \vec{x}} \cdot \int \vec{v} \, P(\vec{v}(\vec{x}, t)) \, d\vec{v}, \tag{A.3}
\]
where $\left\{ \frac{\partial^{T}}{\partial \vec{x}} \Sigma_{\vec{x}} \frac{\partial}{\partial \vec{x}} \right\}$ and $\left\{ \frac{\partial^{T}}{\partial \vec{v}} \Sigma_{\vec{v}} \frac{\partial}{\partial \vec{v}} \right\}$ are scalar differential operators, and $|V|$ is the volume of velocity space $\int d\vec{v}$.
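A finite-difference sketch of the diffusion-plus-drift structure of equation A.3 can make the dynamics concrete. This is our own sketch in one spatial dimension with periodic boundaries; the nonstandard normalization term of A.3 is replaced by an explicit renormalization over velocity at each position, and all grid and step sizes are invented.

```python
import numpy as np

def kolmogorov_step(P, v_grid, dt=0.01, dx=1.0, dv=1.0, Sx=0.5, Sv=0.5):
    """One Euler step of dP/dt = (1/2) Sx P_xx + (1/2) Sv P_vv - v P_x,
    followed by renormalization over v (standing in for the integral
    term that keeps each column a probability distribution)."""
    Pxx = (np.roll(P, 1, axis=0) - 2 * P + np.roll(P, -1, axis=0)) / dx ** 2
    Pvv = (np.roll(P, 1, axis=1) - 2 * P + np.roll(P, -1, axis=1)) / dv ** 2
    Px = (np.roll(P, -1, axis=0) - np.roll(P, 1, axis=0)) / (2 * dx)
    P_new = P + dt * (0.5 * Sx * Pxx + 0.5 * Sv * Pvv - v_grid[None, :] * Px)
    P_new = np.clip(P_new, 1e-12, None)
    return P_new / P_new.sum(axis=1, keepdims=True)

v_grid = np.array([-1.0, 0.0, 1.0])
P = np.full((16, 3), 1.0 / 3.0)
P[5, 2] = 5.0                        # rightward-motion peak at x = 5
P /= P.sum(axis=1, keepdims=True)
for _ in range(50):
    P = kolmogorov_step(P, v_grid)
# diffusion broadens the peak while the drift term transports it rightward
```

The two second-derivative terms implement the diffusion in position and velocity, and the drift term transports each velocity channel at its own speed, which is the motion-inertia mechanism used in the experiments above.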
Acknowledgments
This work was supported by AFOSR grant F49620-95-1-0265 to A. L. Y. and N. M. G. Further support to A. L. Y. came from ARPA and the Office of Naval Research, grant number N00014-95-1-1022. In addition, N. M. G. received support from grants by the National Eye Institute (EY08921 and EY11170) and the William A. Kettlewell chair. Finally, we thank the National Eye Institute for a core grant to Smith-Kettlewell (EY06883). We appreciate useful comments from anonymous referees.

References

Adelson, E. H., & Bergen, J. R. (1985). Spatio-temporal energy models for the perception of motion. J. Opt. Soc. Am. A, 2, 284–299.
Anstis, S. M., & Ramachandran, V. S. (1987). Visual inertia in apparent motion. Vision Res., 27, 755–764.
Ascher, D. (1998). Human visual speed perception: Psychophysics and modeling. Unpublished doctoral dissertation, Brown University.
Bar-Shalom, Y., & Fortmann, T. E. (1988). Tracking and data association. Orlando, FL: Academic Press.
Blake, A., & Yuille, A. L. (Eds.). (1992). Active vision. Cambridge, MA: MIT Press.
Burr, D. C., Ross, J., & Morrone, M. C. (1986). Seeing objects in motion. Proc. Royal Soc. Lond. B, 227, 249–265.
Chin, T. M., Karl, W. C., & Willsky, A. S. (1994). Probabilistic and sequential computation of optical flow using temporal coherence. IEEE Trans. Image Proc., 3, 773–788.
Clark, J. J., & Yuille, A. L. (1990). Data fusion for sensory information processing systems. Norwell, MA: Kluwer.
Courant, R., & Hilbert, D. (1953). Methods of mathematical physics (Vol. 1). New York: Interscience.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Gottsdanker, R. M. (1956). The ability of human operators to detect acceleration of target motion. Psych. Bull., 53, 477–487.
Grzywacz, N. M., Amthor, F. R., & Mistler, L. A. (1990). Applicability of quadratic and threshold models to motion discrimination in the rabbit retina. Biol. Cybern., 64, 41–49.
Grzywacz, N. M., & Koch, C. (1987). Functional properties of models for direction selectivity in the retina. Synapse, 1, 417–434.
Grzywacz, N. M., Smith, J. A., & Yuille, A. L. (1989). A computational framework for visual motion's spatial and temporal coherence. In Proc. IEEE Workshop on Visual Motion. Irvine, CA.
Grzywacz, N. M., Watamaniuk, S. N. J., & McKee, S. P. (1995). Temporal coherence theory for the detection and measurement of visual motion. Vision Res., 35, 3183–3203.
Grzywacz, N. M., & Yuille, A. L. (1990). A model for the estimate of local image velocity by cells in the visual cortex. Proc. Royal Soc. Lond. B, 239, 129–161.
Hassenstein, B., & Reichardt, W. E. (1956). Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z. Naturforsch., 11b, 513–524.
Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neurosci., 9, 181–197.
Ho, Y.-C., & Lee, R. C. K. (1964). A Bayesian approach to problems in stochastic estimation and control. IEEE Trans. on Automatic Control, 9, 333–339.
Holub, R. A., & Morton-Gibson, M. (1981). Response of visual cortical neurons of the cat to moving sinusoidal gratings: Response-contrast functions and spatiotemporal integration. J. Neurophysiol., 46, 1244–1259.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Isard, M., & Blake, A. (1996). Contour tracking by stochastic propagation of conditional density. In Proc. Europ. Conf. Comput. Vision (pp. 343–356). Cambridge, UK.
Isard, M., & Blake, A. (1998). A mixed-state condensation tracker with automatic model-switching. In Proc. Int. Conf. Comp. Vis. (pp. 107–112). Mumbai, India.
Jazwinski, A. H. (1970). Stochastic processes and filtering theory. Orlando, FL: Academic Press.
Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Trans. ASME, D: J. Basic Engineering, 82, 35–45.
Kanizsa, G. (1979). Organization in vision: Essays on gestalt perception. New York: Praeger.
Kulikowski, J. J., & Tolhurst, D. J. (1973). Psychophysical evidence for sustained and transient detectors in human vision. Journal of Physiology, 232, 519–548.
Matthies, L., Kanade, T., & Szeliski, R. (1989). Kalman filter-based algorithms for estimating depth from image sequences. Int. J. Comput. Vision, 3, 209–236.
Maunsell, J. H. R., & Newsome, W. T. (1987). Visual processing in monkey extrastriate cortex. In W. M. Cowan, E. M. Shooter, C. F. Stevens, & R. F. Thompson (Eds.), Annual reviews of neuroscience (Vol. 10, pp. 363–401). Palo Alto, CA: Annual Reviews.
McKee, S. P., Silverman, G. H., & Nakayama, K. (1986). Precise velocity discrimination despite random variations in temporal frequency and contrast. Vision Res., 26, 609–619.
Mel, B. W. (1993). Synaptic integration in an excitable dendritic tree. J. Neurophysiol., 70, 1086–1101.
Mel, B. W., Ruderman, D. L., & Archie, K. A. (1988). Translation-invariant orientation tuning in visual complex cells could derive from intradendritic computations. J. Neurosci., 1, 4325–4334.
Merigan, W. H., & Maunsell, J. H. (1993). How parallel are the primate visual pathways? In W. M. Cowan, E. M. Shooter, C. F. Stevens, & R. F. Thompson
(Eds.), Annual reviews of neuroscience (Vol. 16, pp. 369–402). Palo Alto, CA: Annual Reviews.
Nakayama, K. (1985). Biological image motion processing: A review. Vision Res., 25, 625–660.
Nishida, S., & Johnston, A. (1999). Influence of motion signals on the perceived position of spatial pattern. Nature, 397, 610–612.
Poggio, T., & Reichardt, W. E. (1973). Considerations on models of movement detection. Kybernetik, 13, 223–227.
Poggio, T., & Reichardt, W. E. (1976). Visual control of orientation behaviour in the fly: Part II: Towards the underlying neural interactions. Q. Rev. Biophys., 9, 377–438.
Ramachandran, V. S., & Anstis, S. M. (1983). Extrapolation of motion path in human visual perception. Vision Res., 23, 83–85.
Rao, R. P. N., & Ballard, D. H. (1997). Dynamic model of visual recognition predicts neural response properties of the visual cortex. Neural Computation, 9, 721–734.
Reichardt, W. (1961). Autocorrelation, a principle for the evaluation of sensory information by the nervous system. In Rosenblith (Ed.), Sensory communication. New York: Wiley.
Snowden, R. J., & Braddick, O. J. (1989). Extension of displacement limits in multiple-exposure sequences of apparent motion. Vision Res., 29, 1777–1787.
Torre, V., & Poggio, T. (1978). A synaptic mechanism possibly underlying directional selectivity to motion. Proc. R. Soc. Lond. B, 202, 409–416.
van Santen, J. P. H., & Sperling, G. (1984). A temporal covariance model of motion perception. J. Opt. Soc. Am. A, 1, 451–473.
Verghese, P., Watamaniuk, S. N. J., McKee, S. P., & Grzywacz, N. M. (1999). Local motion detectors cannot account for the detectability of an extended motion trajectory in noise. Vision Res., 39, 19–30.
Wang, J. Y. A., & Adelson, E. H. (1993). Layered representation for motion analysis. In Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1993 (pp. 361–366). New York.
Watamaniuk, S. N. J., & McKee, S. P. (1995). Seeing motion behind occluders. Nature, 377, 729–730.
Watamaniuk, S. N. J., McKee, S. P., & Grzywacz, N. M. (1994). Detecting a trajectory embedded in random-direction motion noise. Vision Res., 35, 65–77.
Watamaniuk, S. N. J., Sekuler, R., & Williams, D. W. (1989). Direction perception in complex dynamic displays: The integration of direction information. Vision Res., 29, 47–59.
Watson, A. B., & Ahumada, A. J. (1985). Model of human visual motion sensing. J. Opt. Soc. Am. A, 2, 322–342.
Welch, L., MacLeod, D. I., & McKee, S. P. (1997). Motion interference: Perturbing perceived direction. Vision Res., 37, 2725–2736.
Werkhoven, P., Snippe, H. P., & Toet, A. (1992). Visual processing of optic acceleration. Vision Res., 32, 2313–2329.
Williams, L. R., & Jacobs, D. W. (1997a). Local parallel computation of stochastic completion fields. Neural Comp., 9, 859–881.
Williams, L. R., & Jacobs, D. W. (1997b). Stochastic completion fields: A neural
model of illusory contour shape and salience. Neural Comp., 9, 837–858.
Yuille, A. L., Burgi, P.-Y., & Grzywacz, N. M. (1998). Visual motion estimation and prediction: A probabilistic network model for temporal coherence. In Proceedings of the Sixth International Conference on Computer Vision (pp. 973–978). New York: IEEE Computer Society.
Yuille, A. L., & Grzywacz, N. M. (1988). A computational theory for the perception of coherent visual motion. Nature, 333, 71–74.
Yuille, A. L., & Grzywacz, N. M. (1989). A mathematical analysis of the motion coherence theory. International Journal of Computer Vision, 3, 155–175.
Yuille, A. L., & Grzywacz, N. M. (1998). A theoretical framework for visual motion. In T. Watanabe (Ed.), High-level motion processing: Computational, neurobiological, and psychophysical perspectives. Cambridge, MA: MIT Press.

Received November 2, 1998; accepted August 27, 1999.
LETTER
Communicated by Leo Breiman
Boosting Neural Networks

Holger Schwenk
LIMSI-CNRS, 91403 Orsay cedex, France

Yoshua Bengio
DIRO, University of Montréal, Montréal, Quebec, H3C 3J7, Canada

Boosting is a general method for improving the performance of learning algorithms. A recently proposed boosting algorithm, AdaBoost, has been applied with great success to several benchmark machine learning problems using mainly decision trees as base classifiers. In this article we investigate whether AdaBoost also works as well with neural networks, and we discuss the advantages and drawbacks of different versions of the AdaBoost algorithm. In particular, we compare training methods based on sampling the training set and weighting the cost function. The results suggest that random resampling of the training data is not the main explanation of the success of the improvements brought by AdaBoost. This is in contrast to bagging, which directly aims at reducing variance and for which random resampling is essential to obtain the reduction in generalization error. Our system achieves about 1.4% error on a data set of on-line handwritten digits from more than 200 writers. A boosted multilayer network achieved 1.5% error on the UCI letters and 8.1% error on the UCI satellite data set, which is significantly better than boosted decision trees.
1 Introduction
Boosting is a general method for improving the performance of a learning algorithm: a method for finding a highly accurate classifier on the training set by combining "weak hypotheses" (Schapire, 1990), each of which needs only to be moderately accurate on the training set. (Perrone, 1994, is an earlier overview of different ways to combine neural networks.) A recently proposed boosting algorithm is AdaBoost (Freund, 1995), which stands for "adaptive boosting." During the past two years, many empirical studies have been published that use decision trees as base classifiers for AdaBoost (Breiman, 1996; Drucker & Cortes, 1996; Freund & Schapire, 1996a; Quinlan, 1996; Maclin & Opitz, 1997; Grove & Schuurmans, 1998; Bauer & Kohavi, 1999; Dietterich, forthcoming). All of these experiments have shown impressive improvements in the generalization behavior and

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1869–1887 (2000)
1870
Holger Schwenk and Yoshua Bengio
suggest that AdaBoost tends to be robust to overfitting. In fact, in many experiments it has been observed that the generalization error continues to decrease toward an apparent asymptote after the training error has reached zero. Schapire, Freund, Bartlett, and Lee (1997) suggest a possible explanation for this unusual behavior based on the definition of the margin of classification. Other attempts to understand boosting theoretically can be found in Schapire et al. (1997); Breiman (1997a, 1998); Friedman, Hastie, and Tibshirani (1998); and Schapire (1999). AdaBoost has also been linked with game theory (Breiman, 1997b; Grove & Schuurmans, 1998; Freund & Schapire, 1996b, 1999) in order to understand the behavior of AdaBoost and to propose alternative algorithms. Mason and Baxter (1999) propose a new variant of boosting based on the direct optimization of margins. There is recent evidence that AdaBoost may very well overfit if we combine several hundred thousand classifiers (Grove & Schuurmans, 1998). It also seems that the performance of AdaBoost degrades a lot in the presence of significant amounts of noise (Dietterich, forthcoming; Rätsch, Onoda, & Müller, 1998). Although much useful theoretical and experimental work has been done, there is still a lot that is not well understood about AdaBoost's impressive generalization behavior. To the best of our knowledge, all applications of AdaBoost have been to decision trees, and no applications to multilayer artificial neural networks have been reported in the literature. This article extends and provides a deeper experimental analysis of our first experiments with the application of AdaBoost to neural networks (Schwenk & Bengio, 1997, 1998). In this article we consider the following questions: Does AdaBoost work as well for neural networks as for decision trees? The short answer is yes, and sometimes even better. Does it behave in a similar way (as was observed previously in the literature)? The short answer is yes.
Furthermore, are there particulars in the way neural networks are trained with gradient backpropagation that should be taken into account when choosing a particular version of AdaBoost? The short answer is yes, because it is possible to weight the cost function of neural networks directly. Is overfitting of the individual neural networks a concern? The short answer is, not as much as when not using boosting. Is the random resampling used in previous implementations of AdaBoost critical, or can we get similar performance by weighting the training criterion (which can easily be done with neural networks)? The short answer is that it is not critical for generalization but helps to obtain faster convergence of individual networks when coupled with stochastic gradient descent. In the next section, we describe the AdaBoost algorithm and discuss several implementation issues when using neural networks as base classifiers. In section 3, we present results that we have obtained on three medium-sized tasks: a data set of handwritten on-line digits and the letter and satellite data sets of the UCI repository. We finish with a conclusion and perspectives for future research.
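The contrast raised above, resampling the training set according to the boosting distribution versus weighting the cost function directly, can be sketched as follows. The toy data, the linear model standing in for a multilayer network, and all parameter values are our own illustrative choices, not the article's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = rng.normal(size=(N, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable toy labels

D = np.full(N, 1.0 / N)    # boosting's distribution over the examples
D[:10] *= 5.0              # pretend the first ten are "hard" examples
D /= D.sum()

# Variant 1 (resampling): draw a new training set according to D and
# train on it with the usual, unweighted cost.
idx = rng.choice(N, size=N, p=D)
X_res, y_res = X[idx], y[idx]

# Variant 2 (weighted cost): keep the original set but multiply each
# example's loss term by D(i); here, gradient descent on a weighted
# logistic (cross-entropy) cost.
w = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (D * (p - y))     # gradient of sum_i D_i * xent_i
    w -= 1.0 * grad
acc = np.mean((X @ w > 0) == (y > 0.5))
```

In expectation the two variants optimize the same weighted criterion; the difference the article studies is how they interact with stochastic gradient training of the individual networks.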
2 AdaBoost
It is well known that it is often possible to increase the accuracy of a classifier by averaging the decisions of an ensemble of classifiers (Perrone, 1993; Krogh & Vedelsby, 1995). In general, more improvement can be expected when the individual classifiers are diverse and yet accurate. One can try to obtain this result by taking a base learning algorithm and invoking it several times on different training sets. Two popular techniques exist that differ in the way they construct these training sets: bagging (Breiman, 1994) and boosting (Schapire, 1990; Freund, 1995). In bagging, each classifier is trained on a bootstrap replicate of the original training set. Given a training set S of N examples, the new training set is created by resampling N examples uniformly with replacement. Some examples may occur several times, and others may not occur in the sample at all. One can show that, on average, only about two-thirds of the examples occur in each bootstrap replicate. The individual training sets are independent, and the classifiers can be trained in parallel. Bagging is known to be particularly effective when the classifiers are unstable, that is, when perturbing the learning set can cause significant changes in the classification behavior. Formulated in the context of the bias-variance decomposition (Geman, Bienenstock, & Doursat, 1992), bagging improves generalization performance due to a reduction in variance while maintaining or only slightly increasing bias. Note, however, that there is no unique bias-variance decomposition for classification tasks (Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; Tibshirani, 1996). AdaBoost, on the other hand, constructs a composite classifier by sequentially training classifiers while putting more and more emphasis on certain patterns. For this, AdaBoost maintains a probability distribution $D_t(i)$ over the original training set. In each round $t$, the classifier is trained with respect to this distribution.
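The two-thirds figure follows from the fact that each example is missed by one draw with probability 1 - 1/N, so a replicate contains on average a fraction 1 - (1 - 1/N)^N, which tends to 1 - 1/e, roughly 0.632, of the distinct examples. A quick numerical check (the value of N is an arbitrary choice of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000
# one bootstrap replicate: N draws, uniformly with replacement
sample = rng.choice(N, size=N, replace=True)
unique_fraction = len(np.unique(sample)) / N   # close to 1 - 1/e ~ 0.632
```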
Some learning algorithms do not allow training with respect to a weighted cost function. In this case, sampling with replacement (using the probability distribution $D_t$) can be used to approximate a weighted cost function. Examples with high probability would then occur more often than those with low probability, and some examples may not occur in the sample at all, although their probability is not zero. After each AdaBoost round, the probability of incorrectly labeled examples is increased, and the probability of correctly labeled examples is decreased. The result of training the $t$th classifier is a hypothesis $h_t : X \to Y$, where $Y = \{1, \ldots, k\}$ is the space of labels and $X$ is the space of input features. After the $t$th round, the weighted error $\epsilon_t = \sum_{i : h_t(x_i) \neq y_i} D_t(i)$ of the resulting classifier is calculated, and the distribution $D_{t+1}$ is computed from $D_t$ by increasing the probability of incorrectly labeled examples. The probabilities are changed so that the error of the $t$th classifier using these new "weights" $D_{t+1}$ would be 0.5. In this way, the classifiers are optimally decoupled. The global decision $f$ is obtained by weighted voting. This basic AdaBoost algorithm converges (learns the training set) if each classifier yields a weighted error that is less than 50%, that is, better than chance in the two-class case.

In general, neural network classifiers provide more information than just a class label. It can be shown that the network outputs approximate the a posteriori probabilities of classes, and it might be useful to use this information rather than to perform a hard decision for one recognized class. This issue is addressed by another version of AdaBoost, called AdaBoost.M2 (Freund & Schapire, 1997). It can be used when the classifier computes confidence scores for each class.¹ The result of training the $t$th classifier is now a hypothesis $h_t : X \times Y \to [0, 1]$. Furthermore, we use a distribution $D_t(i, y)$ over the set of all miss-labels: $B = \{(i, y) : i \in \{1, \ldots, N\}, y \neq y_i\}$, where $N$ is the number of training examples. Therefore, $|B| = N(k - 1)$. AdaBoost modifies this distribution so that the next learner focuses not only on the examples that are hard to classify, but more specifically on improving the discrimination between the correct class and the incorrect class that competes with it. Note that the miss-label distribution $D_t$ induces a distribution over the examples: $P_t(i) = W_i^t / \sum_i W_i^t$, where $W_i^t = \sum_{y \neq y_i} D_t(i, y)$. $P_t(i)$ may be used for resampling the training set. Freund and Schapire (1997) define the pseudo-loss of a learning machine as:
$$\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y)\,\bigl(1 - h_t(x_i, y_i) + h_t(x_i, y)\bigr). \qquad (2.1)$$
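Equation 2.1 can be checked numerically. Below is a minimal sketch on a hypothetical two-example, three-class problem with uniform mislabel weights (all names are ours, not from the paper); a perfect confidence-rated hypothesis gets pseudo-loss 0, and a constant one gets exactly 0.5:

```python
def pseudo_loss(D, h, X, y):
    """Pseudo-loss of eq. 2.1: (1/2) * sum over mislabels (i, y') of
    D[(i, y')] * (1 - h(x_i, y_i) + h(x_i, y'))."""
    return 0.5 * sum(w * (1.0 - h(X[i], y[i]) + h(X[i], wrong))
                     for (i, wrong), w in D.items())

# Hypothetical data: two examples, three labels, uniform D over B.
X, y, labels = ["x0", "x1"], [0, 1], [0, 1, 2]
B = [(i, lab) for i in range(len(X)) for lab in labels if lab != y[i]]
D = {pair: 1.0 / len(B) for pair in B}

perfect = lambda x, lab: 1.0 if (x, lab) in [("x0", 0), ("x1", 1)] else 0.0
useless = lambda x, lab: 0.5   # a constant hypothesis
```

With these definitions, `pseudo_loss(D, perfect, X, y)` is 0.0 and `pseudo_loss(D, useless, X, y)` is 0.5, matching the remark that convergence requires a pseudo-loss below 50%, that is, better than any constant hypothesis.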
It is minimized if the confidence scores of the correct labels are 1.0 and the confidence scores of all the wrong labels are 0.0. The final decision f is obtained by adding together the weighted confidence scores of all the machines (all the hypotheses h_1, h_2, ...). Table 1 summarizes the AdaBoost.M2 algorithm. This multiclass boosting algorithm converges if each classifier yields a pseudo-loss that is less than 50%; in other words, better than any constant hypothesis. AdaBoost has very interesting theoretical properties; in particular, it can be shown that the error of the composite classifier on the training data decreases exponentially fast to zero as the number of combined classifiers is increased (Freund & Schapire, 1997). Many empirical evaluations of AdaBoost also provide an analysis of the so-called margin distribution. The margin is defined as the difference between the ensemble score of the correct class and the strongest ensemble score of a wrong class. In the case in which there are just two possible labels, {−1, +1}, this is y f(x), where f is the output of the composite classifier and y the correct label. The classification is correct if the margin is positive. Discussions about the relevance of
¹ The scores do not need to sum to one.
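The margin defined above can be computed per example from the ensemble scores; a minimal sketch with hypothetical (integer) score values:

```python
def margin(scores, correct_label):
    """Margin of one example: ensemble score of the correct class minus
    the strongest score among the wrong classes (positive = correct)."""
    wrong = [s for label, s in scores.items() if label != correct_label]
    return scores[correct_label] - max(wrong)

# Hypothetical ensemble scores for a 3-class example.
scores = {"a": 7, "b": 2, "c": 1}
```

Here `margin(scores, "a")` is 5 (correctly classified), while `margin(scores, "b")` is negative (the example would be misclassified).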
Boosting Neural Networks
Table 1: Pseudo-Loss AdaBoost (AdaBoost.M2). Input: Init:
Sequence of N examples (x1 , y1 ), . . . , (xN , yN ) with labels yi 2 Y D f1, . . . , kg. 6 yi g Let B D f(i, y) : i 2 f1, . . . , Ng, y D D1 (i, y) D 1 / |B| for all (i, y) 2 B.
Repeat:
1. 2.
Train neural network with respect to distribution Dt and obtain hypothesis ht : X £ Y ! [0, 1]. Calculate the pseudo-loss of ht : 1 X 2 t D Dt (i, y)(1 ¡ ht (xi , yi ) C ht (xi , y)). 2 (i,y)2B
3. 4.
Set bt D 2 t / (1 ¡ 2 t ).
Update distribution Dt : Dt C 1 (i, y) D
1
Dt (i,y) 2 ((1 C ht (x i ,y i )¡h t (xi , y)) bt , Zt
where Zt is a normalization constant. Output: Final hypothesis:
f (x) D arg max y2Y
X± t
log
1 bt
² ht (x, y).
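The loop of Table 1 can be sketched as follows, assuming a caller-supplied `train(D)` that returns a confidence-rated hypothesis h(x, label) ∈ [0, 1]. The demo weak learner at the bottom is hypothetical and ignores D; it only exercises the update and weighted-voting machinery:

```python
import math

def adaboost_m2(train, X, y, labels, T):
    """Sketch of the pseudo-loss AdaBoost loop of Table 1."""
    B = [(i, lab) for i in range(len(X)) for lab in labels if lab != y[i]]
    D = {pair: 1.0 / len(B) for pair in B}       # D_1(i, y) = 1/|B|
    hyps, betas = [], []
    for t in range(T):
        h = train(D)                              # step 1
        eps = 0.5 * sum(w * (1 - h(X[i], y[i]) + h(X[i], lab))
                        for (i, lab), w in D.items())   # step 2 (eq. 2.1)
        beta = eps / (1 - eps)                    # step 3
        # Step 4: D_{t+1}(i,y) = D_t(i,y) * beta^{(1/2)(1 + h(x_i,y_i) - h(x_i,y))} / Z_t
        new = {(i, lab): w * beta ** (0.5 * (1 + h(X[i], y[i]) - h(X[i], lab)))
               for (i, lab), w in D.items()}
        Z = sum(new.values())
        D = {pair: w / Z for pair, w in new.items()}
        hyps.append(h)
        betas.append(beta)

    def f(x):   # final hypothesis: weighted vote with weights log(1/beta_t)
        return max(labels, key=lambda lab: sum(math.log(1.0 / b) * h(x, lab)
                                               for h, b in zip(hyps, betas)))
    return f

# Hypothetical weak learner: always scores the true class 0.8 and every
# other class 0.2, giving pseudo-loss 0.2 (< 0.5) each round.
X, y, labels = [0, 1, 2], [0, 1, 2], [0, 1, 2]
weak = lambda D: (lambda x, lab: 0.8 if lab == x else 0.2)
f = adaboost_m2(weak, X, y, labels, T=3)
```

The combined hypothesis `f` then classifies all three training points correctly.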
the margin distribution for the generalization behavior of ensemble techniques can be found in Freund and Schapire (1996b), Schapire et al. (1997), Breiman (1997a, 1997b), Grove and Schuurmans (1998), and Rätsch et al. (1998). In this article, an important focus is on whether the good generalization performance of AdaBoost is partially explained by the random resampling of the training sets generally used in its implementation. This issue will be addressed by comparing three versions of AdaBoost in which randomization is used (or not used) in three different ways.

2.1 Applying AdaBoost to Neural Networks. In this article we investigate different techniques of using neural networks as base classifiers for AdaBoost. In all cases, we have trained the neural networks by minimizing a quadratic criterion that is a weighted sum of the squared differences (z_ij − ẑ_ij)², where z_i = (z_i1, z_i2, ..., z_ik) is the desired output vector (with a low target value everywhere except at the position corresponding to the target class) and ẑ_i is the output vector of the network. A score for class j for pattern i can be directly obtained from the jth element ẑ_ij of the output vector ẑ_i. When a class must be chosen, the one with the highest score is selected. Let V_t(i, j) = D_t(i, j) / max_{k ≠ y_i} D_t(i, k) for j ≠ y_i, and V_t(i, y_i) = 1. These weights are used to give more emphasis to certain incorrect labels according to the pseudo-loss AdaBoost.
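The emphasis weights V_t(i, j) can be computed directly from the mislabel distribution; a small sketch with hypothetical weights (names are ours):

```python
def v_weights(D, i, labels, y_i):
    """Emphasis weights for the quadratic criterion:
    V_t(i, j) = D_t(i, j) / max_{k != y_i} D_t(i, k) for j != y_i,
    and V_t(i, y_i) = 1."""
    m = max(D[(i, k)] for k in labels if k != y_i)
    return {j: 1.0 if j == y_i else D[(i, j)] / m for j in labels}

# Hypothetical mislabel weights for one example whose true class is 0 of 3.
D = {(0, 1): 0.3, (0, 2): 0.1}
V = v_weights(D, 0, [0, 1, 2], 0)
```

The strongest competing wrong label (here label 1) always gets weight 1, the correct label gets weight 1, and the remaining wrong labels are scaled down proportionally.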
What we call an epoch is a pass of the training algorithm through all the examples in a training set. We compare three different versions of AdaBoost:

R: Training the tth classifier with a fixed training set obtained by resampling with replacement once from the original training set. Before starting to train the tth network, we sample N patterns from the original training set, each time with a probability P_t(i) of picking pattern i. Training is performed for a fixed number of iterations, always using this same resampled training set. This is basically the scheme that has been used in the past when applying AdaBoost to decision trees, except that we used the pseudo-loss AdaBoost. To approximate the pseudo-loss, the training cost that is minimized for a pattern that is the ith one from the original training set is ½ Σ_j V_t(i, j)(z_ij − ẑ_ij)².
E: Training the tth classifier using a different training set at each epoch, by resampling with replacement after each training epoch. After each epoch, a new training set is obtained by sampling from the original training set with probabilities P_t(i). Since we used an on-line (stochastic) gradient in this case, this is equivalent to sampling a new pattern from the original training set with probability P_t(i) before each forward and backward pass through the neural network. Training continues until a fixed number of pattern presentations has been performed. As for version R, the training cost that is minimized for a pattern that is the ith one from the original training set is ½ Σ_j V_t(i, j)(z_ij − ẑ_ij)².
W: Training the tth classifier by directly weighting the cost function (here the squared error) of the tth neural network; that is, all the original training patterns are in the training set, but the cost is weighted by the probability of each example: ½ Σ_j D_t(i, j)(z_ij − ẑ_ij)². If we used this formula directly, the gradients would be very small, even when all probabilities D_t(i, j) are identical. To avoid having to scale learning rates differently depending on the number of examples, the following "normalized" error function was used:

$$\frac{1}{2} \, \frac{P_t(i)}{\max_k P_t(k)} \sum_j V_t(i, j)\,(z_{ij} - \hat{z}_{ij})^2. \qquad (2.2)$$
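A sketch of this normalized per-pattern cost (equation 2.2), with hypothetical values for P_t, V_t, the targets, and the network outputs:

```python
def weighted_sq_error(P, V, z, z_hat, i):
    """Normalized cost of pattern i for version W (eq. 2.2):
    (1/2) * P_t(i)/max_k P_t(k) * sum_j V_t(i,j) * (z_ij - z_hat_ij)^2."""
    scale = P[i] / max(P)
    return 0.5 * scale * sum(v * (a - b) ** 2
                             for v, a, b in zip(V[i], z[i], z_hat[i]))

P = [0.25, 0.75]                   # hypothetical example probabilities P_t(i)
V = [[1.0, 1.0], [1.0, 0.5]]       # hypothetical emphasis weights V_t(i, j)
z = [[1.0, 0.0], [0.0, 1.0]]       # desired outputs
z_hat = [[1.0, 0.0], [1.0, 1.0]]   # network outputs
```

Pattern 0 is predicted exactly, so its cost is 0; pattern 1 has one unit of squared error on its first output and full scale P_t(1)/max_k P_t(k) = 1, giving cost 0.5. The normalization keeps the cost of the most probable pattern at its unweighted magnitude, so learning rates need not be rescaled with N.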
In versions E and W, what makes the combined networks essentially different from each other is the fact that they are trained with respect to different weightings D_t of the original training set. In R, an additional element of diversity is built in because the criterion used for the tth network is not exactly the errors weighted by P_t(i). Instead, more emphasis is put on certain patterns while completely ignoring others (because of the initial random sampling of the training set). The E version can be seen as a stochastic version of the W version; as the number of iterations through the data increases and the learning rate decreases, E becomes a very good approximation of W. W itself is closest to the recipe mandated by the AdaBoost algorithm (but, as we will see below, it suffers from numerical problems). Note that E is a better approximation of the weighted cost function than R, in particular when many epochs are performed. If random resampling of the training data explained a good part of the generalization performance of AdaBoost, then the weighted training version (W) should perform worse than the resampling versions, and the fixed sample version (R) should perform better than the continuously resampled version (E). Note that for bagging, which directly aims at reducing variance, random resampling is essential to obtain the reduction in generalization error.

3 Results
Experiments have been performed on three data sets: a data set of on-line handwritten digits, the UCI letters data set of off-line machine-printed alphabetical characters, and the UCI satellite data set generated from Landsat multispectral scanner image data. All data sets have a predefined training and test set. All the p-values given in this section concern a pair (p̂_1, p̂_2) of test performance results (on n test points) for two classification systems with unknown true error rates p_1 and p_2. The null hypothesis is that the true expected performance of the two systems is not different, that is, p_1 = p_2. Let p̂ = 0.5(p̂_1 + p̂_2) be the estimator of the common error rate under the null hypothesis. The alternative hypothesis is that p_1 < p_2, so the p-value is obtained as the probability of observing such a large difference under the null hypothesis, that is, P(Z > z) for a standard normal Z, with

$$z = \frac{\sqrt{n}\,(\hat{p}_2 - \hat{p}_1)}{\sqrt{2\hat{p}(1 - \hat{p})}}.$$

This is based on the normal approximation of the binomial, which is appropriate for large n (however, see Dietterich, 1998, for a discussion of this and other tests to compare algorithms).
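This test can be sketched as follows; the identity P(Z > z) = erfc(z/√2)/2 for a standard normal Z is used, and the example error rates below are hypothetical:

```python
import math

def pvalue(p1_hat, p2_hat, n):
    """One-sided p-value for H0: p1 = p2 vs. H1: p1 < p2, using the
    pooled normal approximation of the binomial described in the text."""
    p = 0.5 * (p1_hat + p2_hat)                  # pooled error rate p_hat
    z = math.sqrt(n) * (p2_hat - p1_hat) / math.sqrt(2 * p * (1 - p))
    return 0.5 * math.erfc(z / math.sqrt(2))     # P(Z > z), standard normal

# Hypothetical test errors of 1.5% vs. 3.3% on n = 4000 test points.
p = pvalue(0.015, 0.033, 4000)
```

For equal observed error rates the p-value is 0.5, and for the 1.5% vs. 3.3% example above it is far below 10⁻⁵, illustrating how differences of this size on a few thousand test points are highly significant.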
3.1 Results on the On-line Data Set. The on-line data set was collected at Paris 6 University (Schwenk & Milgram, 1996). A WACOM A5 tablet with a cordless pen was used in order to allow natural writing. Since we wanted to build a writer-independent recognition system, we tried to use many writers and to impose as few constraints as possible on the writing style. In total, 203 students wrote down isolated digits that have been divided into a learning set (1200 examples) and a test set (830 examples). The writers of the two sets are completely distinct. A particular property of this data set is the notable variety of writing styles, which are not all equally frequent. There are, for instance, 100 zeros written counterclockwise, but only 3 written clockwise. Figure 1 gives an idea of the great variety of writing styles in this data set. We applied only a simple preprocessing: the characters were resampled to 11 points, centered, and size normalized to an (x, y)-coordinate sequence in [−1, 1]^22.
Figure 1: Some examples of the on-line handwritten digits data set (test set).
Table 2 summarizes the results on the test set before using AdaBoost. Note that the differences among the test results of the last three networks are not statistically significant (p-value > 23%), whereas the difference with the first network is significant (p-value < 10⁻⁵). Tenfold cross-validation within the training set was used to find the optimal number of training epochs (typically about 200). Note that if training is continued until 1000 epochs, the test error increases by up to 1%. Table 3 shows the results of bagged and boosted multilayer perceptrons with 10, 30, or 50 hidden units, trained for either 100, 200, 500, or 1000 epochs, and using the ordinary resampling scheme (R), resampling with different random selections at each epoch (E), or training with weights D_t on the squared error criterion for each pattern (W). In all cases, 100 neural networks were combined.

Table 2: On-line Digits Data Set Error Rates for Fully Connected MLPs (Not Boosted).

Architecture*   22-10-10   22-30-10   22-50-10   22-80-10
Train:          5.7%       0.8%       0.4%       0.08%
Test:           8.8%       3.3%       2.8%       2.7%

*The notation 22-h-10 designates a fully connected neural network with 22 input nodes, one hidden layer with h neurons, and a 10-dimensional output layer.
Table 3: On-line Digits Test Error Rates for Boosted MLPs.

                         22-10-10            22-30-10            22-50-10
Version                R     E     W       R     E     W       R     E     W
Bagging (500 epochs)   5.4%                2.8%                1.8%
AdaBoost:
  100 epochs           2.9%  3.2%  6.0%    1.7%  1.8%  5.1%    2.1%  1.8%  4.9%
  200 epochs           3.0   2.8   5.6     1.8   1.8   4.2     1.8   1.7   3.5
  500 epochs           2.5   2.7   3.3     1.7   1.5   3.0     1.7   1.7   2.8
  1000 epochs          2.8   2.7   3.2     1.8   1.6   2.6     1.6   1.5   2.2
  5000 epochs          —     —     2.9     —     —     1.6     —     —     1.6
AdaBoost improved the generalization error of the MLPs in all cases: for instance, from 8.8% to about 2.7% for the 22-10-10 architecture. The improvement with 50 hidden units from 2.8% (without AdaBoost) to 1.6% (with AdaBoost) is significant (p-value of 0.38%), despite the small number of examples. Boosting was also always superior to bagging, although the differences are not always very significant because of the small number of examples. Furthermore, it seems that the number of training epochs of each individual classifier has no significant impact on the results of the combined classifier, at least on this data set. AdaBoost with weighted training of MLPs (the W version), however, does not work as well if the learning of each individual MLP is stopped too early (1000 epochs). The networks did not learn the weighted examples well enough, and ε_t rapidly approached 0.5. When training each MLP for 5000 epochs, however, the weighted training (W) version achieved the same low test error rate. AdaBoost is less useful for very big networks (50 or more hidden units for this data) since an individual classifier can achieve zero error on the original training set (using the E or W method). Such large networks probably have a very low bias but high variance. This may explain why bagging, a pure variance-reduction method, can do as well as AdaBoost, which is believed to reduce bias and variance. However, AdaBoost can achieve the same low error rates with the smaller 22-30-10 networks. Figure 2 shows the error rates of some of the boosted classifiers as the number of networks is increased. AdaBoost brings the training error to zero after only a few steps, even with an MLP with only 10 hidden units. The generalization error is also considerably improved, and it continues to decrease to an apparent asymptote after zero training error has been reached.
The surprising effect of continuously decreasing generalization error even after the training error reaches zero has already been observed by others (Breiman, 1996; Drucker & Cortes, 1996; Freund & Schapire, 1996a; Quinlan, 1996). This seems to contradict Occam's razor, but a recent theorem (Schapire et al., 1997) suggests that the margin distribution may be relevant
[Figure 2 appears here: three panels (MLP 22-10-10, 22-30-10, 22-50-10) plotting error in % against the number of networks, for Bagging and AdaBoost (R), (E), (W), with train and test curves and the unboosted test error shown for reference.]
Figure 2: Error rates of the boosted classifiers as the number of networks is increased. For clarity, the training error of bagging is not shown (it overlaps with the test error rates of AdaBoost). The dotted constant horizontal line corresponds to the test error of the unboosted classifier. Small oscillations are not significant since they correspond to few examples.
to the generalization error. Although previous empirical results (Schapire et al., 1997) indicate that pushing the margin cumulative distribution to the right may improve generalization, other recent results (Breiman, 1997a, 1997b; Grove & Schuurmans, 1998) show that "improving" the whole margin distribution can also yield worse generalization. Figures 3 and 4 show several margin cumulative distributions, that is, the fraction of examples whose margin is at most x, as a function of x ∈ [−1, 1]. The networks have been trained for 1000 epochs (5000 for the W version). It is clear in Figures 3 and 4 that the number of examples with high margin increases when more classifiers are combined by boosting. When boosting neural networks with 10 hidden units, for instance, there are some examples with a margin smaller than −0.5 when only two networks are combined. However, all examples have a positive margin when 10 nets are combined, and all examples have a margin higher than 0.5 for 100 networks. Bagging, on the other hand, has no significant influence on the margin distributions. There is almost no difference among the margin distributions of the R, E, and W versions of AdaBoost either.² However, there is a difference between the margin distributions and the test set errors when the complexity of the neural networks is varied (hidden layer size). Finally, it seems that sometimes AdaBoost must allow some examples with very high margins in order to improve the minimal margin. This can best be seen for the 22-10-10 architecture. One should keep in mind that this data set contains only small amounts of noise. In application domains with high amounts of noise, it may be less advantageous to improve the minimal margin at any price (Grove & Schuurmans, 1998; Rätsch et al., 1998), since this would mean putting too much weight on noisy or wrongly labeled examples.

3.2 Results on the UCI Letters and Satellite Data Sets.
Similar experiments were performed with MLPs on the letters data set from the UCI machine learning repository. It has 16,000 training and 4000 test patterns, 16 input features, and 26 classes (A–Z) of distorted machine-printed characters from 20 fonts. A few preliminary experiments on the training set only were used to choose a 16-70-50-26 architecture. Each input feature was normalized according to its mean and variance on the training set. Two types of experiments were performed: (1) resampling after each epoch (E) with stochastic gradient descent and (2) no resampling, but reweighting of the squared error (W) with conjugate gradient descent. In both cases, a fixed number of training epochs (500) was used. The plain, bagged, and boosted networks are compared to decision trees in Table 4. In both cases (E and W), the same final generalization error was obtained (1.50% for E and 1.47% for W), but the training time using the
² The W and E versions achieve slightly higher margins than R.
[Figure 3 appears here: margin cumulative distributions for AdaBoost (R) and AdaBoost (E) applied to MLPs 22-10-10, 22-30-10, and 22-50-10, each plotted for 2, 5, 10, 50, and 100 combined networks.]
Figure 3: Margin distributions using 2, 5, 10, 50, and 100 networks, respectively.
[Figure 4 appears here: margin cumulative distributions for AdaBoost (W) and Bagging applied to MLPs 22-10-10, 22-30-10, and 22-50-10, each plotted for 2, 5, 10, 50, and 100 combined networks.]
Figure 4: Margin distributions using 2, 5, 10, 50, and 100 networks, respectively.
Table 4: Test Error Rates on the UCI Data Sets.

                    CART(a)                    C4.5(b)                    MLP
Data Set    Alone   Bagged  Boosted    Alone   Bagged  Boosted    Alone   Bagged  Boosted
Letter      12.4%    6.4%    3.4%      13.8%    6.8%    3.3%       6.1%    4.3%    1.5%
Satellite   14.8%   10.3%    8.8%      14.8%   10.6%    8.9%      12.8%    8.7%    8.1%

(a) Results from Breiman (1996). (b) Results from Freund and Schapire (1996a).
weighted squared error (W) was about 10 times greater. This shows that using random resampling (as in E or R) is not necessary to obtain good generalization (whereas it is clearly necessary for bagging). However, the experiments show that it is still preferable to use a random sampling method such as R or E for numerical reasons: convergence of each network is faster. For this reason, for the E experiments with stochastic gradient descent, 100 networks were boosted, whereas we stopped training of the W networks after 25 networks (when the generalization error seemed to have flattened out), which took more than a week on a fast processor (SGI Origin-2000). We believe that the main reason for this difference in training time is that the conjugate gradient method is a batch method and is therefore slower than stochastic gradient descent on redundant data sets with many thousands of examples, such as this one. (See comparisons between batch and on-line methods in Bourrely, 1989, and conjugate gradients for classification tasks in particular in Moller, 1992, 1993.) For the W version with stochastic gradient descent, the weighted training error of individual networks does not decrease as much as when using conjugate gradient descent, so AdaBoost itself did not work as well. We believe that this is due to the fact that it is difficult for stochastic gradient descent to approach a minimum when the output error is weighted with very different weights for different patterns (the patterns with small weights make almost no progress). On the other hand, the conjugate gradient descent method can approach a minimum of the weighted cost function more precisely, but inefficiently, when there are thousands of training examples. The results obtained with the boosted network are extremely good (1.5% error, whether using the W version with conjugate gradients or the E version with stochastic gradient) and are, as far as we know, the best published to date for this data set.
In a comparison with the boosted trees (3.3% error), the p-value of the null hypothesis is less than 10⁻⁷. The best performance reported in STATLOG (Feng, Sutherland, King, Muggleton, & Henery, 1993) is 6.4%. Note also that only a few neural networks need to be combined to obtain immediate, important improvements: with the E version, 20 neural networks suffice for the error to fall under 2%, whereas boosted decision trees typically "converge" later. The W version of AdaBoost actually converged faster in terms of the number of networks: after about 7 networks,
[Figure 5 appears here: test and training error in % against the number of networks (log scale) on the UCI letter data set, for Bagging, AdaBoost (SG+E), and AdaBoost (CG+W), with the unboosted test error shown for reference.]
Figure 5: Error rates of the bagged and boosted neural networks for the UCI letter data set (log scale). SG+E denotes stochastic gradient descent and resampling after each epoch. CG+W means conjugate gradient descent and weighting of the squared error. For clarity, the training error of bagging is not shown (it flattens out at about 1.8%). The dotted constant horizontal line corresponds to the test error of the unboosted classifier.
the 2% mark was reached, and after 14 networks the 1.5% apparent asymptote was reached (see Figure 5), but it converged much more slowly in terms of training time. Figure 6 shows the margin distributions for bagging and AdaBoost applied to this data set. Again, bagging has no effect on the margin distribution, whereas AdaBoost clearly increases the number of examples with large margins. Similar conclusions hold for the UCI satellite data set (see Table 4), although the improvements are not as dramatic as in the case of the letter data set. The improvement due to AdaBoost is statistically significant (p-value < 10⁻⁶), but the difference in performance between boosted MLPs and boosted decision trees is not (p-value > 21%). This data set has 6435 examples, with the first 4435 used for training and the last 2000 used for testing generalization. There are 36 inputs and 6 classes, and a 36-30-15-6 network was used. The two best training methods are the epoch resampling (E) with stochastic gradient and the weighted squared error (W) with conjugate gradient descent.

[Figure 6: Margin distributions for the UCI letter data set, for Bagging and AdaBoost (SG+E), each plotted for 2, 5, 10, 50, and 100 combined networks.]

4 Conclusion
As demonstrated here on three real-world applications, AdaBoost can significantly improve neural classifiers. In particular, the results obtained on the UCI letters data set (1.5% test error) are, as far as we know, significantly better than the best published results to date. The behavior of AdaBoost for neural networks confirms previous observations on other learning algorithms (Breiman, 1996; Drucker & Cortes, 1996; Freund & Schapire, 1996a; Quinlan, 1996; Schapire et al., 1997), such as the continued generalization improvement after zero training error has been reached and the associated improvement in the margin distribution. It also seems that AdaBoost is not very sensitive to overtraining of the individual classifiers, so the neural networks can be trained for a fixed (preferably high) number of training epochs. A similar observation was recently made with decision trees (Breiman, 1997b). This apparent insensitivity to overtraining of individual classifiers simplifies the choice of neural network design parameters. Another interesting finding is that the weighted training version (W) of AdaBoost gives good generalization results for MLPs but requires many more training epochs or the use of a second-order (and, unfortunately, batch) method, such as conjugate gradients. We conjecture that this happens because of the weights on the cost function terms (especially when the weights are small), which could worsen the conditioning of the Hessian matrix. So in terms of generalization error, all three methods (R, E, W) gave similar results, but training time was lowest with the E method (with stochastic gradient descent), which samples each new training pattern from the original data with the AdaBoost weights. Although our experiments are insufficient to conclude, it is possible that the weighted training method (W) with conjugate gradients might be faster than the others for small training sets (a few hundred examples). There are various ways to define variance for classifiers (e.g., Kong & Dietterich, 1995; Breiman, 1996; Kohavi & Wolpert, 1996; Tibshirani, 1996). Variance basically represents how the resulting classifier will vary when a different training set is sampled from the true generating distribution of the data. Our comparative results on the R, E, and W versions add credence to the view that randomness induced by resampling the training data is not the main reason for AdaBoost's reduction of the generalization error. This is in contrast to bagging, which is a pure variance-reduction method. For
bagging, random resampling is essential to obtain the observed variance reduction. Another interesting issue is whether the boosted neural networks could be trained with a criterion other than the mean squared error, one that would better approximate the goal of the AdaBoost criterion: minimizing a weighted classification error. Schapire and Singer (1998) address this issue.

Acknowledgments
Most of the work reported here was done while H. S. was doing a postdoctorate at the University of Montreal. We thank the National Science and Engineering Research Council of Canada and the government of Quebec for financial support.

References

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36, 105–139.
Bourrely, J. (1989). Parallelization of a neural learning algorithm on a hypercube. In Hypercube and distributed computers (pp. 219–229). Amsterdam: Elsevier Science, North-Holland.
Breiman, L. (1994). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (1996). Bias, variance, and arcing classifiers (Tech. Rep. No. 460). Berkeley: Statistics Department, University of California at Berkeley.
Breiman, L. (1997a). Arcing the edge (Tech. Rep. No. 486). Berkeley: Statistics Department, University of California at Berkeley.
Breiman, L. (1997b). Prediction games and arcing classifiers (Tech. Rep. No. 504). Berkeley: Statistics Department, University of California at Berkeley.
Breiman, L. (1998). Arcing classifiers. Annals of Statistics, 26(3), 801–849.
Dietterich, T. (1998). Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10(7), 1895–1924.
Dietterich, T. G. (forthcoming). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning. Available online at: ftp://ftp.cs.orst.edu/pub/tgd/papers/tr-randomized-c4.ps.gz.
Drucker, H., & Cortes, C. (1996). Boosting decision trees. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (pp. 479–485). Cambridge, MA: MIT Press.
Feng, C., Sutherland, A., King, R., Muggleton, S., & Henery, R. (1993). Comparison of machine learning classifiers to statistics and neural networks.
In Proceedings of the Fourth International Workshop on Artificial Intelligence and Statistics (pp. 41–52).
Freund, Y. (1995). Boosting a weak learning algorithm by majority. Information and Computation, 121(2), 256–285.
Freund, Y., & Schapire, R. E. (1996a). Experiments with a new boosting algorithm.
In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 148–156).
Freund, Y., & Schapire, R. E. (1996b). Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory (pp. 325–332).
Freund, Y., & Schapire, R. E. (1997). A decision theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Freund, Y., & Schapire, R. E. (1999). Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29, 79–103.
Friedman, J., Hastie, T., & Tibshirani, R. (1998). Additive logistic regression: A statistical view of boosting (Tech. Rep.). Stanford: Department of Statistics, Stanford University.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Grove, A. J., & Schuurmans, D. (1998). Boosting in the limit: Maximizing the margin of learned ensembles. In Proceedings of the Fifteenth National Conference on Artificial Intelligence.
Kohavi, R., & Wolpert, D. H. (1996). Bias plus variance decomposition for zero-one loss functions. In Machine Learning: Proceedings of the Thirteenth International Conference (pp. 275–283).
Kong, E. B., & Dietterich, T. G. (1995). Error-correcting output coding corrects bias and variance. In Machine Learning: Proceedings of the Twelfth International Conference (pp. 313–321).
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation and active learning. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 231–238). Cambridge, MA: MIT Press.
Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (pp. 546–551).
Mason, L., Bartlett, P., & Baxter, J. (1999). Direct optimization of margins improves generalization in combined classifiers. In M. S. Kearns, S. Solla, & D. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Moller, M. (1992). Supervised learning on large redundant training sets. In Neural networks for signal processing 2. New York: IEEE Press.
Moller, M. (1993). Efficient training of feed-forward neural networks. Unpublished doctoral dissertation, Aarhus University, Aarhus, Denmark.
Perrone, M. P. (1993). Improving regression estimation: Averaging methods for variance reduction with extensions to general convex measure optimization. Unpublished doctoral dissertation, Brown University, Providence, RI.
Perrone, M. P. (1994). Putting it all together: Methods for combining neural networks. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 1188–1189). San Mateo, CA: Morgan Kaufmann.
Boosting Neural Networks
1887
Quinlan, J. R. (1996).Bagging, boosting and C4.5. In Machine Learning: Proceedings of the Fourteenth International Conference (pp. 725–730). R¨atsch, G., Onoda, T., & Muller, ¨ K.-R. (1998).Soft margins for AdaBoost (Tech. Rep. No. NC-TR-1998-021).Egham, Surrey: Royal Holloway College, University of London. Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227. Schapire, R. E. (1999). Theoretical views of boosting. In Computational Learning Theory: Fourth European Conference, EuroCOLT (pp. 1–10). Schapire, R. E., Freund, Y., Bartlett, P., & Lee, W. S. (1997).Boosting the margin: A new explanation for the effectiveness of voting methods. In Machine Learning: Proceedings of Fourteenth International Conference (pp. 322–330). Schapire, R. E., & Singer, Y. (1998). Improved boosting algorithms using condence rated predictions. In Proceedings of the 11th Annual Conference on Computational Learning Theory. Schwenk, H., & Bengio, Y. (1997). Adaboosting neural networks: Application to on-line character recognition. In International Conference on Articial Neural Networks (pp. 967–972). Berlin: Springer-Verlag. Schwenk, H., & Bengio, Y. (1998). Training methods for adaptive boosting of neural networks. In M. I. Jordan, M. J. Kearns, & S. A. Solla, (Eds.), Advances in neural information processing systems, 10 (pp. 647–653). Cambridge, MA: MIT Press. Schwenk, H., & Milgram, M. (1996). Constraint tangent distance for online character recognition. In InternationalConference on Pattern Recognition (pp. D:520– 524). New York: IEEE Computer Society Press. Tibshirani, R. (1996). Bias, variance and prediction error for classication rules (Tech. Rep.). Toronto: Department of Statistics, University of Toronto. Received May 13, 1998; accepted March 24, 1999.
LETTER
Communicated by Léon Bottou
Gradient-Based Optimization of Hyperparameters

Yoshua Bengio
Département d'informatique et recherche opérationnelle, Université de Montréal, Montréal, Québec, Canada, H3C 3J7

Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters is efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient involving second derivatives of the training criterion.

1 Introduction
Machine learning algorithms pick a function from a set of functions F in order to minimize something that cannot be measured, only estimated: the expected generalization performance of the chosen function. Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter, kept fixed during this minimization. For example, in the regularization framework (Tikhonov & Arsenin, 1977; Poggio, Torre, & Koch, 1985), one hyperparameter controls the strength of the penalty term: a larger penalty term reduces the "complexity" of the resulting function (it forces the solution to lie in a "smaller" subset of F). A common example is weight decay (Hinton, 1987), used with neural networks and linear regression (also known in that case as ridge regression; Hoerl & Kennard, 1970), to penalize the L2-norm of the parameters. Increasing the penalty term (increasing the weight decay hyperparameter) corresponds to reducing the effective capacity (Guyon, Vapnik, Boser, Bottou, & Solla, 1992) by forcing the solution to be in a zero-centered hypersphere of smaller radius, which may improve generalization. A regularization term can also be interpreted as an a priori probability distribution on F: in that case the weight decay is a scale parameter (e.g., inverse variance) of that distribution. A model selection criterion can be used to select hyperparameters, more generally to compare and choose among models that may have a different capacity. Many model selection criteria have been proposed in the past

Neural Computation 12, 1889–1900 (2000) © 2000 Massachusetts Institute of Technology
(Vapnik, 1982; Akaike, 1974; Craven & Wahba, 1979). When there is only a single hyperparameter, one can easily explore how its value affects the model selection criterion: typically one tries a finite number of values of the hyperparameter and picks the one that gives the lowest value of the model selection criterion. In this article, we present a methodology to select simultaneously many hyperparameters using the gradient of the model selection criterion with respect to the hyperparameters. This methodology can be applied when some differentiability and continuity conditions of the training criterion are satisfied.

The use of multiple hyperparameters has already been proposed in the Bayesian literature, where one hyperparameter per input feature was used to control the prior on the parameters associated with that input feature (MacKay, 1995; Neal, 1998). In this case, the hyperparameters can be interpreted as scale parameters for the prior distribution on the parameters, for different directions in parameter space. Note that independently of this work, Larsen et al. (1998) have proposed a procedure similar to the one presented here for optimizing regularization parameters of a neural network (different weight decays for different layers of the network). In sections 2, 3, and 4, we explain how the gradient with respect to the hyperparameters can be computed. In the conclusion, we briefly describe the results of preliminary experiments performed with the proposed methodology (described in more detail in Latendresse, 1999, and Bengio & Dugas, 1999), and we raise some important open questions concerning the kind of "overfitting" that can occur with the proposed methodology.

2 Objective Functions for Hyperparameters
We are given a set of independent data points, D = {z_1, …, z_T}, all generated by the same unknown distribution P(Z). We are given a set of functions F indexed by a parameter θ ∈ Θ (i.e., each value of the parameter θ corresponds to a function in F). In our applications we will have Θ ⊆ ℝ^s. We would like to choose a value of θ that minimizes the expectation E_Z[Q(θ, Z)] of a given loss functional Q(θ, Z). In supervised learning problems, we have input-output pairs Z = (X, Y), with X ∈ 𝒳 and Y ∈ 𝒴, and θ is associated with a function f_θ from 𝒳 to 𝒴. For example, we will consider the case of the quadratic loss, with real-valued vectors 𝒴 ⊆ ℝ^m and Q(θ, (X, Y)) = ½ (f_θ(X) − Y)′ (f_θ(X) − Y). Note that we use the letter θ to represent parameters and the letter λ to represent hyperparameters. In the next section, we provide a formulation for the cases in which Q is quadratic in θ (e.g., quadratic loss with constant or affine function sets). In section 4, we consider more general classes of functions and losses, which may be applied to the case of multilayer neural networks, for example.

2.1 Training Criteria. In its most general form, a training criterion C is any real-valued function of the set of empirical losses Q(θ, z_i) and of some
hyperparameters λ:

  C(θ, λ, D) = c(λ, Q(θ, z_1), Q(θ, z_2), …, Q(θ, z_T)).

Here λ is assumed to be a real-valued vector, λ = (λ_1, …, λ_q). The proposed method relies on the assumption that C is continuous and differentiable almost everywhere with respect to θ and λ. When the hyperparameters are fixed, the learning algorithm attempts to perform the following minimization:

  θ(λ, D) = argmin_θ C(θ, λ, D).    (2.1)

An example of a training criterion with hyperparameters is

  C = Σ_{(x_i, y_i) ∈ D} w_i(λ) (f_θ(x_i) − y_i)² + θ′ A(λ) θ,    (2.2)
where the hyperparameters provide different quadratic penalties to different parameters (through the matrix A(λ)) and different weights to different training patterns (through the w_i(λ)), as in Bengio and Dugas (1999) and Latendresse (1999).

2.2 Model Selection Criteria. The model selection criterion E is used to select hyperparameters or, more generally, to choose one model among several. Ideally, it should be the expected generalization error (for a fixed λ), but P(Z) is unknown, so many alternatives (approximations, bounds, or empirical estimates) have been proposed. Most model selection criteria have been proposed for selecting a single hyperparameter that controls the "complexity" of the class of functions in which the learning algorithm finds a solution, such as the minimum description length principle (Rissanen, 1990), structural risk minimization (Vapnik, 1982), the Akaike information criterion (Akaike, 1974), or the generalized cross-validation criterion (Craven & Wahba, 1979). Other criteria are based on held-out data, such as the cross-validation estimates of generalization error. These are almost unbiased estimates of generalization error (Vapnik, 1982) obtained by testing f_θ on data not used to choose θ. For example, the K-fold cross-validation estimate uses K partitions of D, S_1^1 ∪ S_1^2, S_2^1 ∪ S_2^2, …, S_K^1 ∪ S_K^2:
  E_cv(λ, D) = (1/K) Σ_{i=1}^{K} (1/|S_i^2|) Σ_{z_t ∈ S_i^2} Q(θ(λ, S_i^1), z_t).
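The K-fold cross-validation estimate can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the article; it assumes ridge regression as the inner learner θ(λ, S_i^1), and the function name `kfold_cv` is hypothetical:

```python
import numpy as np

def kfold_cv(X, y, lam, K=5, seed=0):
    """K-fold cross-validation estimate E_cv(lam, D). Illustrative sketch:
    theta(lam, S_i^1) is taken to be the ridge-regression solution on the
    training split (an assumption; the article leaves the learner open)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)        # the K partitions S_i^1 u S_i^2
    errs = []
    for i in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != i])
        Xtr, ytr = X[train], y[train]
        theta = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(X.shape[1]),
                                Xtr.T @ ytr)
        resid = X[folds[i]] @ theta - y[folds[i]]
        errs.append(np.mean(resid ** 2))  # average loss on held-out S_i^2
    return float(np.mean(errs))
```

With λ fixed, this returns the (almost unbiased) held-out estimate; it is the minimization of this quantity over λ that introduces the optimism discussed next.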
When θ is fixed, the empirical risk (1/T) Σ_t Q(θ, z_t) is an unbiased estimate of the generalization error of f_θ (but it becomes an optimistic estimate when θ is chosen to minimize the empirical risk). Similarly, when λ is fixed, the cross-validation criterion is an almost unbiased estimate (when K approaches |D|) of the generalization error of θ(λ, D). When λ is chosen to minimize the
cross-validation criterion, this minimum value also becomes an optimistic estimate. And when there is a greater diversity of values Q(θ(λ, D), z) that can be obtained for different values of λ, there is more risk of overfitting the hyperparameters. In this sense, the use of hyperparameters proposed in this article can be very different from the common use, in which a hyperparameter helps to control overfitting. Instead, a blind use of the extra freedom brought by many hyperparameters could deteriorate generalization.

3 Optimizing Hyperparameters for a Quadratic Training Criterion
In this section we analyze the simpler case in which the training criterion C is a quadratic polynomial of the parameters θ. The dependence on the hyperparameters λ can be of higher order, as long as it is continuous and differentiable almost everywhere (see, for example, Bottou, 1998, for more detailed technical conditions sufficient for stochastic gradient descent):

  C = a(λ) + b(λ)′ θ + ½ θ′ H(λ) θ,    (3.1)

where θ, b ∈ ℝ^s, a ∈ ℝ, and H ∈ ℝ^{s×s}. For a minimum of equation 3.1 to exist, H must be positive definite. The minimum can be obtained by solving the linear system

  ∂C/∂θ = b + Hθ = 0,    (3.2)

which yields the solution

  θ(λ) = −H^{−1}(λ) b(λ).    (3.3)
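A quick numerical sanity check of equations 3.2 and 3.3 (an illustrative sketch; the random H and b are arbitrary test data, not from the article):

```python
import numpy as np

# Sanity check of eqs. 3.2-3.3: the minimizer of the quadratic criterion
# C = a + b' theta + 1/2 theta' H theta is theta = -H^{-1} b.
# H and b are arbitrary illustrative test data (H made positive definite).
rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
H = A @ A.T + 4.0 * np.eye(4)            # positive definite, as required
b = rng.normal(size=4)

def C(theta):
    # the constant a(lambda) is dropped; it does not affect the minimizer
    return b @ theta + 0.5 * theta @ H @ theta

theta_star = -np.linalg.solve(H, b)      # eq. 3.3
```

The gradient b + Hθ* vanishes at θ*, and any perturbation of θ* can only increase C, since H is positive definite.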
Assuming that E depends on λ only through θ, the gradient of the model selection criterion E with respect to λ is

  ∂E/∂λ = (∂E/∂θ)(∂θ/∂λ).

If there were a direct dependency (not through θ), an extra partial derivative would have to be added. For example, in the case of the cross-validation criteria,

  ∂E_cv/∂θ = (1/K) Σ_{i=1}^{K} (1/|S_i^2|) Σ_{z_t ∈ S_i^2} ∂Q(θ, z_t)/∂θ.
In the quadratic case, the influence of λ on θ is spelled out by equation 3.3, yielding

  ∂θ_i/∂λ = − Σ_j (∂H^{−1}_{i,j}/∂λ) b_j − Σ_j H^{−1}_{i,j} (∂b_j/∂λ).    (3.4)
Although the second sum can be readily computed, ∂H^{−1}_{i,j}/∂λ in the first sum is more challenging; we consider several methods below. One solution is based on the computation of gradients through the inverse of a matrix. This general but inefficient solution is the following:

  ∂H^{−1}_{i,j}/∂λ = Σ_{k,l} (∂H^{−1}_{i,j}/∂H_{k,l}) (∂H_{k,l}/∂λ),

where

  ∂H^{−1}_{i,j}/∂H_{k,l} = −H^{−1}_{i,j} H^{−1}_{l,k} + I_{i≠l, j≠k} H^{−1}_{i,j} minor(H, j, i)^{−1}_{l′,k′},    (3.5)
where minor(H, j, i) denotes the "minor matrix," obtained by removing the jth row and the ith column from H, and the indices (l′, k′) in equation 3.5 refer to the position within the minor matrix that corresponds to the position (l, k) in H (note l ≠ i and k ≠ j). Unfortunately, the computation of this gradient requires O(s^5) multiply-add operations for an s × s matrix H, which is much more than is required by the inversion of H (O(s^3) operations).

A better solution is based on the equality H H^{−1} = I, where I is the s × s identity matrix. Differentiating with respect to λ gives (∂H/∂λ) H^{−1} + H (∂H^{−1}/∂λ) = 0. Isolating ∂H^{−1}/∂λ, we get

  ∂H^{−1}/∂λ = −H^{−1} (∂H/∂λ) H^{−1},    (3.6)
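Equation 3.6 is easy to verify numerically. The sketch below assumes, purely for illustration, the parameterization H(λ) = H_0 + λI (so ∂H/∂λ = I) and compares the formula against a central finite difference:

```python
import numpy as np

# Numerical check of eq. 3.6, dH^{-1}/dlam = -H^{-1} (dH/dlam) H^{-1},
# under the illustrative parameterization H(lam) = H0 + lam * I,
# for which dH/dlam = I (an assumption made here, not from the article).
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H0 = A @ A.T + 4.0 * np.eye(4)           # symmetric positive definite base
lam, eps = 0.7, 1e-6
H = H0 + lam * np.eye(4)

Hinv = np.linalg.inv(H)
analytic = -Hinv @ np.eye(4) @ Hinv      # eq. 3.6 with dH/dlam = I

numeric = (np.linalg.inv(H0 + (lam + eps) * np.eye(4))
           - np.linalg.inv(H0 + (lam - eps) * np.eye(4))) / (2 * eps)
```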
which requires only about 2s^3 multiply-add operations. An even better solution (which was suggested by Léon Bottou) is to return to equation 3.2, which can be solved using about s^3/3 multiply-add operations (when θ ∈ ℝ^s). The idea is to backpropagate gradients through each of the operations performed to solve the linear system. The objective is to compute the gradient of E with respect to H and b through the effect of H and b on θ, in order to compute ∂E/∂λ, as illustrated in Figure 1. The backpropagation costs the same as the linear system solution, about s^3/3 multiply-add operations, so this is the approach that we have kept for our implementation. Since H is the Hessian matrix, it is positive definite and symmetric, and equation 3.2 can be solved through the Cholesky decomposition of H (assuming H is full rank, which is likely if the hyperparameters provide some sort of weight decay). The Cholesky decomposition of a symmetric positive definite matrix H gives H = L L′, where L is a lower triangular matrix (with zeros above the diagonal). It is computed in time O(s^3) as follows:

  for i = 1, …, s
      L_{i,i} = sqrt( H_{i,i} − Σ_{k=1}^{i−1} L_{i,k}² )
      for j = i+1, …, s
          L_{j,i} = ( H_{i,j} − Σ_{k=1}^{i−1} L_{i,k} L_{j,k} ) / L_{i,i}.
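The loops above translate directly into NumPy (an illustrative sketch; `cholesky_lower` is a hypothetical name):

```python
import numpy as np

def cholesky_lower(H):
    """Cholesky factor L with H = L L', following the loops in the text:
    L[i,i] = sqrt(H[i,i] - sum_k L[i,k]^2), then fill column i below it."""
    s = H.shape[0]
    L = np.zeros_like(H, dtype=float)
    for i in range(s):
        L[i, i] = np.sqrt(H[i, i] - L[i, :i] @ L[i, :i])
        for j in range(i + 1, s):
            L[j, i] = (H[i, j] - L[i, :i] @ L[j, :i]) / L[i, i]
    return L
```

In practice one would call a library routine (e.g., `np.linalg.cholesky`); the explicit loops are shown because the reverse-mode sweeps below mirror them operation by operation.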
Figure 1: Illustration of forward paths (full lines) and gradient paths (dashed) for the computation of the model selection criterion E and its derivative with respect to the hyperparameters (λ), when using the method based on the Cholesky decomposition and backsubstitution to solve for the parameters (θ).
Once the Cholesky decomposition is computed, the linear system L L′ θ = −b can be easily solved in two backsubstitution steps: (1) first solve L u = −b for u; (2) then solve L′ θ = u for θ:

  1. Iterate once forward through the rows of L:
     for i = 1, …, s
         u_i = ( −b_i − Σ_{k=1}^{i−1} L_{i,k} u_k ) / L_{i,i}.

  2. Iterate once backward through the rows of L:
     for i = s, …, 1
         θ_i = ( u_i − Σ_{k=i+1}^{s} L_{k,i} θ_k ) / L_{i,i}.
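The two substitution sweeps can be sketched as follows (illustrative, with a hypothetical function name):

```python
import numpy as np

def solve_via_cholesky(L, b):
    """Solve L L' theta = -b with the two sweeps described in the text."""
    s = L.shape[0]
    u = np.zeros(s)
    for i in range(s):                    # step 1: forward, L u = -b
        u[i] = (-b[i] - L[i, :i] @ u[:i]) / L[i, i]
    theta = np.zeros(s)
    for i in range(s - 1, -1, -1):        # step 2: backward, L' theta = u
        theta[i] = (u[i] - L[i + 1:, i] @ theta[i + 1:]) / L[i, i]
    return theta
```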
The computation of the gradient of θ with respect to the elements of H and b proceeds in exactly the reverse order: we first backpropagate through the backsubstitution steps, and then through the Cholesky decomposition. Together, the three algorithms that follow allow computing ∂E/∂b and ∂E/∂H, starting from the "partial" parameter gradient ∂E/∂θ_i |_{θ_1,…,θ_s} (not taking into account the dependencies of θ_i on θ_j for j > i). As intermediate results, the algorithm computes the partial derivatives with respect to u and L, as well as the "full gradients" ∂E/∂θ_i with respect to θ, taking into account all the dependencies between the θ_i's brought by the recursive computation of the θ_i's.

First backpropagate through the solution of L′θ = u:

  initialize dEdtheta ← ∂E/∂θ |_{θ_1,…,θ_s}
  initialize dEdL ← 0
  for i = 1, …, s
      dEdu_i ← dEdtheta_i / L_{i,i}
      dEdL_{i,i} ← dEdL_{i,i} − dEdtheta_i θ_i / L_{i,i}
      for k = i+1, …, s
          dEdtheta_k ← dEdtheta_k − dEdtheta_i L_{k,i} / L_{i,i}
          dEdL_{k,i} ← dEdL_{k,i} − dEdtheta_i θ_k / L_{i,i}.

Then backpropagate through the solution of L u = −b:

  for i = s, …, 1
      ∂E/∂b_i ← −dEdu_i / L_{i,i}
      dEdL_{i,i} ← dEdL_{i,i} − dEdu_i u_i / L_{i,i}
      for k = 1, …, i−1
          dEdu_k ← dEdu_k − dEdu_i L_{i,k} / L_{i,i}
          dEdL_{i,k} ← dEdL_{i,k} − dEdu_i u_k / L_{i,i}.

The above algorithm gives the gradient of the model selection criterion E with respect to the coefficient b(λ) of the training criterion, as well as with respect to the lower triangular matrix L, dEdL_{i,j} = ∂E/∂L_{i,j}. Finally, we backpropagate through the Cholesky decomposition, to convert the gradients with respect to L into gradients with respect to the Hessian H(λ):

  for i = s, …, 1
      for j = s, …, i+1
          dEdL_{i,i} ← dEdL_{i,i} − dEdL_{j,i} L_{j,i} / L_{i,i}
          ∂E/∂H_{i,j} ← dEdL_{j,i} / L_{i,i}
          for k = 1, …, i−1
              dEdL_{i,k} ← dEdL_{i,k} − dEdL_{j,i} L_{j,k} / L_{i,i}
              dEdL_{j,k} ← dEdL_{j,k} − dEdL_{j,i} L_{i,k} / L_{i,i}
      ∂E/∂H_{i,i} ← ½ dEdL_{i,i} / L_{i,i}
      for k = 1, …, i−1
          dEdL_{i,k} ← dEdL_{i,k} − dEdL_{i,i} L_{i,k} / L_{i,i}.
Note that we have computed gradients only with respect to the diagonal and upper diagonal of H because H is symmetric. Once we have the gradients of E with respect to b and H, we use the functional form of b(λ) and H(λ) to compute the gradient of E with respect to λ:

  ∂E/∂λ = Σ_i (∂E/∂b_i)(∂b_i/∂λ) + Σ_{i,j} (∂E/∂H_{i,j})(∂H_{i,j}/∂λ)

(again we assumed that there is no direct dependency from λ to E; otherwise, an extra term must be added). Using this approach rather than the one described in the previous subsection, the overall computation of gradients therefore takes about s^3/3 multiply-add operations rather than O(s^5). The most expensive step is the backpropagation through the Cholesky decomposition itself (three nested s-iteration loops). This step may be shared if there are several linear systems to solve with the same Hessian matrix. For example, this will occur in linear regression with multiple outputs, because H is block-diagonal, with one block for each set of parameters associated with one output, and all blocks being the same (equal to the input "design matrix" Σ_t x_t x_t′, denoting x_t the input training vectors). Only a single Cholesky computation needs to be done, shared across all blocks.

3.1 Weight Decays for Linear Regression. In this section, we illustrate the method in the particular case of multiple weight decays for linear regression, with K-fold cross-validation as the model selection criterion. The hyperparameter λ_j will be a weight decay associated with the jth input variable. The training criterion for the kth partition is
  C_k = (1/|S_k^1|) Σ_{(x_t, y_t) ∈ S_k^1} ½ (Θ x_t − y_t)′ (Θ x_t − y_t) + ½ Σ_j λ_j Σ_i Θ_{i,j}².    (3.7)
The objective is to penalize separately each of the input variables (as in MacKay, 1995; Neal, 1998), a kind of "soft variable selection" (see Latendresse, 1999, for more discussion and experiments with this setup). The training criterion is quadratic, as in equation 3.1, with coefficients

  a = ½ Σ_t y_t′ y_t,
  b_{(ij)} = − Σ_t y_{t,i} x_{t,j},
  H_{(ij),(i′j′)} = δ_{i,i′} Σ_t x_{t,j} x_{t,j′} + δ_{i,i′} δ_{j,j′} λ_j,

where δ_{i,j} = 1 when i = j and 0 otherwise, and (ij) is an index corresponding to indices (i, j) in the weight matrix Θ, for example, (ij) = (i − 1) × s + j. From the above definition of the coefficients of C, we obtain their partial derivatives with respect to λ:

  ∂b/∂λ = 0,    ∂H_{(ij),(i′j′)}/∂λ_k = δ_{i,i′} δ_{j,j′} δ_{j,k}.

By plugging the above definitions of the coefficients and their derivatives into the equations and algorithms of the previous section, we have therefore obtained an algorithm for computing the gradient of the model selection criterion with respect to the input weight decays of a linear regression. Note that here H is block-diagonal, with m identical blocks of size (n + 1), so the Cholesky decomposition (and similarly backpropagating through it) can be performed in about (s/m)^3/3 multiply-add operations rather than s^3/3 operations, where m is the number of outputs (the dimension of the output variable).
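For a single output (m = 1), the coefficients above reduce to H = Σ_t x_t x_t′ + diag(λ) and b = −Σ_t y_t x_t. The sketch below (illustrative data and decay values, not from the article) shows the resulting "soft variable selection": a heavy decay on one input shrinks its weight toward zero:

```python
import numpy as np

# Section 3.1 coefficients for one output (m = 1):
# H = sum_t x_t x_t' + diag(lam), b = -sum_t y_t x_t, theta = -H^{-1} b.
# The data and decay values are illustrative, not from the article.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, 1.5, -1.0]) + 0.01 * rng.normal(size=50)

lam = np.array([0.1, 100.0, 0.1])        # heavy decay on the second input
H = X.T @ X + np.diag(lam)               # H_(ij),(i'j') for a single output
b = -X.T @ y                             # b_(ij) = -sum_t y_{t,i} x_{t,j}
theta = np.linalg.solve(H, -b)           # eq. 3.3
# Note that dH/dlam_k touches only the (k, k) entry, matching the delta
# expression for the partial derivatives above.
```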
4 Nonquadratic Criterion: Hyperparameters Gradient
If the training criterion C is not quadratic in terms of the parameters θ, it will in general be necessary to apply an iterative numerical optimization algorithm to minimize the training criterion. In this section we consider what happens after this minimization is performed, at a value of θ where ∂C/∂θ is approximately zero and ∂²C/∂θ² is positive definite (otherwise we would not be at a minimum of C). The minimization of C(θ, λ, D) defines a function θ(λ, D) (see equation 2.1). With our assumption of smoothness of C, the implicit function theorem tells us that this function exists locally and is differentiable. To obtain this function, we write

  F(θ, λ) = ∂C/∂θ = 0,

evaluated at θ = θ(λ, D) (at a minimum of C). Differentiating the above equation with respect to λ, we obtain

  (∂F/∂θ)(∂θ/∂λ) + ∂F/∂λ = 0,

so we obtain a general formula for the gradient of the fitted parameters with respect to the hyperparameters:

  ∂θ(λ, D)/∂λ = − (∂²C/∂θ²)^{−1} (∂²C/∂λ∂θ) = −H^{−1} (∂²C/∂λ∂θ).    (4.1)
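Equation 4.1 can be checked on a small example. The sketch below uses ordinary ridge regression, C = ½‖Xθ − y‖² + (λ/2)‖θ‖², for which H = X′X + λI and ∂²C/∂λ∂θ = θ (a standard example chosen here for illustration, not from the article):

```python
import numpy as np

# Check eq. 4.1, dtheta/dlam = -H^{-1} d^2C/(dlam dtheta), on ridge
# regression C = 1/2 ||X theta - y||^2 + (lam/2) ||theta||^2, for which
# H = X'X + lam I and d^2C/(dlam dtheta) = theta (illustrative example).
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)
lam, eps = 0.5, 1e-6

def theta_of(l):
    # minimizer of the ridge criterion at hyperparameter value l
    return np.linalg.solve(X.T @ X + l * np.eye(4), X.T @ y)

theta = theta_of(lam)
H = X.T @ X + lam * np.eye(4)            # Hessian of C w.r.t. theta
analytic = -np.linalg.solve(H, theta)    # eq. 4.1 with cross-term = theta
numeric = (theta_of(lam + eps) - theta_of(lam - eps)) / (2 * eps)
```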
Let us see how this result relates to the special case of a quadratic training criterion, C = a + b′θ + ½ θ′Hθ:

  ∂θ/∂λ = −H^{−1} ( ∂b/∂λ + (∂H/∂λ) θ ) = −H^{−1} (∂b/∂λ) + H^{−1} (∂H/∂λ) H^{−1} b,

where we have substituted θ = −H^{−1} b. Using the equality 3.6, we obtain the same formula as in equation 3.4.

Let us consider more closely the case of a neural network with one layer of hidden units with hyperbolic tangent activations, a linear output layer, squared loss, and hidden-layer weights W_{i,j}. For example, if we want to use hyperparameters for penalizing the use of inputs, we have a criterion similar to equation 3.7:

  C_k = (1/|S_k^1|) Σ_{(x_t, y_t) ∈ S_k^1} ½ (f_θ(x_t) − y_t)′ (f_θ(x_t) − y_t) + ½ Σ_j λ_j Σ_i W_{i,j}²,

with C = Σ_k C_k, and the cross-derivatives are easy to compute:

  ∂²C/∂W_{i,j}∂λ_k = δ_{k,j} W_{i,j}.
The Hessian and its inverse require more work, but they can be obtained in at most O(s^2) and O(s^3) operations, respectively. See, for example, Bishop (1992) for the exact computation of the Hessian for multilayer neural networks. See Becker and LeCun (1989) and LeCun, Denker, and Solla (1990) for a diagonal approximation, which can be computed and inverted in O(s) operations.

5 Summary of Experiments and Conclusions
We have presented a new methodology for simultaneously optimizing several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters, taking into account the influence of the hyperparameters on the parameters. We have considered the simpler case of a training criterion that is quadratic with respect to the parameters (θ ∈ ℝ^s) and the more general nonquadratic case. We have shown a particularly efficient procedure in the quadratic case that is based on backpropagating gradients through the Cholesky decomposition and backsubstitutions. This was an improvement: we arrived at this s^3/3-operations procedure after first studying an O(s^5) procedure and then a 2s^3-operations procedure. In the case of input weight decays for linear regression, the computation can even be reduced to about (s/m)^3/3 operations when there are m outputs.

We have performed preliminary experiments with the proposed methodology in several simple cases, using conjugate gradients to optimize the hyperparameters. The application to linear regression with weight decays for each input is described in Latendresse (1999). The hyperparameter optimization algorithm is used to perform a soft selection of the input variables: a large weight decay on one of the inputs effectively forces the corresponding weights to very small values. Comparisons on simulated data sets are made in Latendresse (1999) with ordinary regression as well as with stepwise regression methods, the adaptive ridge (Grandvalet, 1998), and the LASSO (Tibshirani, 1995).

Another type of application of the proposed method has been explored in the context of a real-world problem of nonstationary time-series prediction (Bengio & Dugas, 1999). In this case, an extension of the cross-validation criterion to sequential data that may be nonstationary is used. Because of this nonstationarity, recent data may sometimes be more relevant to current predictions than older data.
The training criterion is a sum of weighted errors over the past examples, with the weights given by a parameterized function of time (as the w_i(λ) in equation 2.2). The parameters of that function are two hyperparameters that control when a transition in the unknown generating process occurred and how strongly that change should be trusted. In these experiments, the weight given to past data points is a sigmoid function of time: the threshold and the slope of the sigmoid are the hyperparameters, representing, respectively, the time of a strong transition and the strength of that transition. Optimizing
these hyperparameters, we obtained statistically significant improvements in predicting the one-month-ahead future volatility of Canadian stocks. The comparisons were made against several linear, constant, and ARMA models of the volatility. The experiments were performed on monthly return data from 473 Canadian stocks from 1976 to 1996. The measure of performance is the average out-of-sample squared error in predicting the squared returns. Single-sided significance tests were performed, taking into account the autocovariance in the temporal series of errors and the covariance of the errors between the compared models. When comparing predictions of the first moment (expected return), no model significantly improved on the historical average of stock returns (constant model). When comparing predictions of the second moment (expected squared returns), the method based on hyperparameter optimization beat all the other methods, with a p-value of 1% or less.

What remains to be done? First, we need more experiments, in particular with the nonquadratic case (e.g., MLPs) and with model selection criteria other than cross-validation (which has large variance; Breiman, 1996). Second, important theoretical questions remain unanswered concerning the amount of overfitting that can occur when too many hyperparameters are optimized. As we outlined in the introduction, the situation with hyperparameters may be compared with that of parameters. However, whereas the form of the training criterion as a sum of independent errors allows defining the capacity of a class of functions and relating it to the difference between generalization error and training error, more work needs to be done to obtain a similar analysis for hyperparameters.

Acknowledgments
The author would like to thank Léon Bottou, Pascal Vincent, François Blanchette, and François Gingras, as well as the NSERC Canadian funding agency.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19, 716–728.
Becker, S., & LeCun, Y. (1989). Improving the convergence of back-propagation learning with second order methods. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 29–37). San Mateo, CA: Morgan Kaufmann.
Bengio, Y., & Dugas, C. (1999). Learning simple non-stationarities with hyper-parameters. Submitted.
Bishop, C. (1992). Exact calculation of the Hessian matrix for the multi-layer perceptron. Neural Computation, 4, 494–501.
Bottou, L. (1998). Online algorithms and stochastic approximations. In D. Saad (Ed.), Online learning in neural networks. Cambridge: Cambridge University Press.
Breiman, L. (1996). Heuristics of instability and stabilization in model selection. Annals of Statistics, 24, 2350–2383.
Craven, P., & Wahba, G. (1979). Smoothing noisy data with spline functions. Numerische Mathematik, 31, 377–403.
Grandvalet, Y. (1998). Least absolute shrinkage is equivalent to quadratic penalization. In L. Niklasson, M. Boden, & T. Ziemske (Eds.), ICANN'98 (pp. 201–206). Berlin: Springer-Verlag.
Guyon, I., Vapnik, V., Boser, B., Bottou, L., & Solla, S. A. (1992). Structural risk minimization for character recognition. In J. Moody, S. Hanson, & R. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 471–479). San Mateo, CA: Morgan Kaufmann.
Hinton, G. (1987). Learning translation invariant recognition in massively parallel networks. In J. de Bakker, A. Nijman, & P. Treleaven (Eds.), Proceedings of PARLE Conference on Parallel Architectures and Languages Europe (pp. 1–13). Berlin: Springer-Verlag.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Larsen, J., Svarer, C., Andersen, L. N., & Hansen, L. K. (1998). Adaptive regularization in neural networks modeling. In G. B. Orr & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 113–132). Berlin: Springer-Verlag.
Latendresse, S. (1999). Utilisation d'hyper-parametres pour la selection de variable. Unpublished master's thesis, University of Montreal.
LeCun, Y., Denker, J., & Solla, S. (1990). Optimal brain damage. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Mateo, CA: Morgan Kaufmann.
MacKay, D. (1995). Probable networks and plausible predictions—a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6, 469–505.
Available online at: http://www.iro.umontreal.ca/~lisa.
Neal, R. (1998). Assessing relevance determination methods using DELVE. In C. Bishop (Ed.), Neural networks and machine learning (pp. 97–129). Berlin: Springer-Verlag.
Poggio, T., Torre, V., & Koch, C. (1985). Computational vision and regularization theory. Nature, 317, 314–319.
Rissanen, J. (1990). Stochastic complexity in statistical inquiry. Singapore: World Scientific.
Tibshirani, R. (1995). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58, 267–288.
Tikhonov, A., & Arsenin, V. (1977). Solutions of ill-posed problems. Washington, DC: W. H. Winston.
Vapnik, V. (1982). Estimation of dependences based on empirical data. Berlin: Springer-Verlag.

Received June 1, 1999; accepted August 30, 1999.
LETTER
Communicated by Andrew Back
A Signal-Flow-Graph Approach to On-line Gradient Calculation

Paolo Campolucci*
Aurelio Uncini
Francesco Piazza
Dipartimento di Elettronica ed Automatica, Università di Ancona, 60121 Ancona, Italy

A large class of nonlinear dynamic adaptive systems, such as dynamic recurrent neural networks, can be effectively represented by signal flow graphs (SFGs). By this method, complex systems are described as a general connection of many simple components, each of them implementing a simple one-input, one-output transformation, as in an electrical circuit. Even if graph representations are popular in the neural network community, they are often used for qualitative description rather than for rigorous representation and computational purposes. In this article, a method for both on-line and batch backward gradient computation of a system output or cost function with respect to system parameters is derived by the SFG representation theory and its known properties. The system can be any causal, in general nonlinear and time-variant, dynamic system represented by an SFG, in particular any feedforward, time-delay, or recurrent neural network. In this work, we use discrete-time notation, but the same theory holds for the continuous-time case. The gradient is obtained in a straightforward way by the analysis of two SFGs, the original one and its adjoint (obtained from the first by simple transformations), without the complex chain rule expansions of derivatives usually employed. This method can be used for sensitivity analysis and for learning, both off-line and on-line. On-line learning is particularly important since it is required by many real applications, such as digital signal processing, system identification and control, channel equalization, and predistortion.

1 Introduction
For many practical problems such as time-series forecasting and classification, identification, and control of nonlinear dynamic systems, feedforward neural networks (NNs) are not adequate; therefore several recurrent neural network (RNN) architectures have been proposed (Tsoi & Back, 1994;
Note: Address reprint requests or comments to Paolo Campolucci at campoluc@tiscalinet.it.
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1901–1927 (2000)
Nerrand, Roussel-Ragot, Personnaz, Dreyfus, & Marcos, 1993; Narendra & Parthasarathy, 1991). RNNs can be essentially divided into three classes: fully recurrent NNs (FRNNs) (Williams & Peng, 1990; Williams & Zipser, 1989; Haykin, 1994), locally recurrent globally feedforward NNs (Back & Tsoi, 1991; Tsoi & Back, 1994; Campolucci, Piazza, & Uncini, 1995; Campolucci, Uncini, Piazza, & Rao, 1999), and NARX architectures (Haykin, 1994; Narendra & Parthasarathy, 1991; Horne & Giles, 1995). In order to train an RNN adaptively, an on-line learning algorithm must be derived for each specific architecture. In many learning algorithms, the gradient of a cost function must be estimated and then used in a parameter updating scheme (e.g., steepest descent or conjugate gradient). By gradient we mean the vector of derivatives of a given variable (here a system output or a function of it, e.g., a cost function) with respect to a set of parameters that influence such a variable (here, the parameters of a subset of network components). When RNNs (or, in general, systems with feedback) are involved, the calculation of the gradient is more complex and difficult than in the case of feedforward neural networks (or systems without feedback). Due to feedback, in fact, the current output depends on the past outputs of the RNN, so that the present output depends not only on the present parameters but also on the past parameter values, and this dependence has to be considered in calculating the gradient. Quite often in the neural network literature, this dependence has been totally or partially neglected to simplify the derivation of the gradient. There are two different schemes to compute the gradient: the forward computation approach and the backward computation approach. These two names stem from the order in time in which the computations take place.
Forward means that both time and propagation of gradient information across the network flow as in the forward pass (i.e., the computation of the network outputs from its inputs); backward indicates the opposite flow. Real-time recurrent learning (RTRL) (Williams & Zipser, 1989) and truncated backpropagation through time (truncated BPTT) (Williams & Peng, 1990) implement the forward and backward on-line computation, respectively. The major drawback of the forward computation over the backward one is its high computational complexity (Williams & Peng, 1990; Williams & Zipser, 1994; Srinivasan, Prasad, & Rao, 1994; Nerrand et al., 1993). On the other hand, the use of the backward technique in the case of RNNs with arbitrary architectures is not straightforward; the mathematics of each particular architecture has to be worked out. For systems with feedback or with internal time delays, the chain rule derivation can easily become complex and sometimes intractable (Werbos, 1990; Williams & Zipser, 1994; Back & Tsoi, 1991; Pearlmutter, 1995; Campolucci et al., 1999). An interesting approach to the computation of gradient information in RNNs with arbitrary architectures was proposed by Wan and Beaufays (1996), who used a diagrammatic derivation to obtain the BPTT batch al-
gorithm. The idea underlying their method is to describe the neural network by a graph model composed of some basic blocks, including summing junctions, branching points, nonlinear functions, weights, and delay operators. Then another graph is introduced to compute the gradient, substituting summing junctions with branching points, and vice versa; nonlinear functions with gains; and delay operators with "advance" operators, while the weights are just replicated in the new graph (named the reciprocal network). This method considers the derivative system obtained by linearizing the original network. Unfolding the network, multiplying the network output by the error at the proper time step, and applying the interreciprocity property, it is possible to get the desired gradient information by the transposed graph corresponding to the unraveled derivative network. The reciprocal network is this transposed network raveled back in time. The derived method can be computed only in batch mode since the reciprocal network is not causal due to the presence of the noncausal advance operators. The advance operators arise from the unfolding and transposing operations on which the method is based. Following the graph approach, now using the signal-flow-graph (SFG) representation theory and its known properties (Mason, 1953, 1956; Lee, 1974; Oppenheim & Schafer, 1975), in this article we propose a backward computation (BC) method that implements the BPTT algorithm for both on-line and batch-backward gradient computation of a function (e.g., a cost function) of the system output with respect to some parameters. This system can be any causal, nonlinear, and time-variant dynamic system represented by an SFG—in particular, any feedforward (static), time-delay, or RNN. In the following, we shall usually consider the NN framework, but the method is fully general. In this work, we use discrete-time notation; however, the theory holds for the continuous-time case.
The new method has been developed mainly for on-line learning. On-line training of adaptive systems is very important in real applications such as digital signal processing, system identification and control, channel equalization, and predistortion, since it allows the model to follow time-varying features of the real system without interrupting the system operation itself (such as a communication channel). It can also be used for off-line learning; in this case, when a mean-squared-error (MSE) cost function is considered, this approach gives the batch BPTT algorithm and is equivalent to the method by Wan and Beaufays (1996). While we provide a proof based on a theorem (Lee, 1974) similar to Tellegen's theorem for analog networks, the proof by Wan and Beaufays (1996) for their diagrammatic derivation is based on the interreciprocity property of transposed graphs, which is a consequence of the theorem used here. They also propose a proof of the equivalence of RTRL and BPTT in batch mode to show that the computed weight variations are the same, though the complexities of the two algorithms are different. Beaufays and Wan (1994) do not
address the problem of graphical derivation of RTRL, as in Wan and Beaufays (1996) for BPTT; nevertheless, the two approaches have been combined for that purpose in Wan and Beaufays (1998), where a diagrammatic derivation of an on-line learning method is reported. However, the resulting RTRL algorithm has a very high computational complexity, O(n²), compared to the O(n) of the on-line backward method presented here. In this comparison, it should be considered that any on-line backward method needs a truncation of the past history; forward methods such as RTRL do not. However, it is known (Williams & Peng, 1990) that RTRL also involves a different kind of approximation due to the fact that the weights change in time during learning. To mitigate this problem, an exponential decay on the contributions from past times can be used (Gherrity, 1989); nevertheless, no reduction of the computational complexity can be achieved as provided by the BPTT truncation strategy. The work presented here also generalizes previous work by Osowski (1994), which proposed an SFG approach to neural network learning. That work was basically developed for feedforward networks, and the proposed extension to recurrent networks is valid only for networks that relax to a fixed point, as in the work of Almeida (1987); moreover, it is not possible to accommodate delay branches in this framework. General recurrent networks able to process temporal sequences cannot be trained by the proposed adjoint equations. Another related work is that by Martinelli and Perfetti (1991), which applies an adjoint transformation to an analog real circuit implementing a multilayer perceptron (MLP), obtaining a circuit that implements backpropagation. Since this work is more related to hardware and the network is feedforward and static, it is less general and can be considered a particular case of the method presented here.
Finally, our formulation allows dealing with both the learning problem and the sensitivity calculation problem, that is, the problem of computing the derivatives of the system output with respect to some internal parameters. These derivatives can be used for designing robust electrical or numerical circuits. The concepts detailed in this article were developed as part of Paolo Campolucci's doctorate research, presented in detail in Campolucci (1998) and briefly introduced in Campolucci, Marchegiani, Uncini, and Piazza (1997) and Campolucci, Uncini, and Piazza (1998).

2 Signal Flow Graphs and Lee's Theorem
A large class of complex systems, such as RNNs, can be represented by an SFG. Therefore in this section, SFG notation and properties will be outlined following Lee's approach (1974). A different notation and properties of SFGs were also introduced by Oppenheim and Schafer (1975); the SFG theory was first developed by Mason (1953, 1956).
Figure 1: Signal-flow-graph definitions. If I_A is the set of indexes of branches that come into node A and Z_A the set of indexes of branches that come out from node A, then the following holds: x_j = Σ_{i∈I_A} v_i, ∀j ∈ Z_A.
An SFG is defined as a set of nodes and oriented branches. A branch is oriented from node a to node b if node a is the initial node and node b is the final node of the branch. An input node is one that has only one outgoing branch, the input branch, and no incoming branch. Associated with the input branch is an input variable u_j. An output node is one that has only one incoming branch, the output branch. Associated with the output branch is an output variable y_j. All other nodes besides the input and output nodes are called n-nodes. All other branches besides the input and output branches are called f-branches. There are two variables related by a function associated with each of the f-branches: the initial variable x_j, at the tail of the branch, and the final variable v_j, at the head of the branch. The relationship between the initial and final variables for branch j is v_j = f_j[x_j], where f_j[.] is a general function describing the operator (or circuit component) of the jth branch. By definition, for each n-node, the value of the x_j variables associated with its outgoing branches is the sum of all the v_j variables associated with its incoming branches (see Figure 1). Let the SFG consist of p + m + r nodes (m input nodes, r output nodes, and p n-nodes) and q + m + r branches (m input branches, r output branches, and q f-branches). Note that only the branches (and not the nodes) are indexed; therefore, all the reported indexes are to be considered as branch indexes. In the case of discrete-time systems, the functional relationship between the variables x_j and v_j of the jth f-branch (expressed as v_j = f_j[x_j]) is usually assumed to be:

v_j(t) = g_j[x_j(t), a_j(t), t]   for a static branch (without memory),   (2.1)
v_j(t) = q⁻¹ x_j(t) = x_j(t − 1)   for a delay branch (one unit memory),   (2.2)
where q⁻¹ is the delay operator and g_j is a general differentiable function, which can depend on an adaptable parameter (or vector of parameters) a_j and also on time t to allow shape adaptation or variation in time of the nonlinearity. For a static branch, the relationship between x_j and v_j often particularizes as

v_j(t) = w_j(t) x_j(t)   for a weight branch,
v_j(t) = f_j(x_j(t))   for a fixed nonlinear branch,   (2.3)

Figure 2: SFG representation of a generic system.
where w_j(t) is the jth (i.e., belonging to the jth branch) weight parameter of the system at time t, and f_j is again a differentiable function of the jth branch. In the neural networks context, w_j(t) is a synaptic weight and f_j a sigmoidal activation function. However, equation 2.1 can represent more general nonlinear adaptable functions controlled by a parameter (or vector of parameters) a_j, such as the recently introduced spline-based activation functions (Uncini, Vecci, Campolucci, & Piazza, 1999). We are now able to state the following definitions:

Definition 1 (SFG). The SFG G_N that represents the system is given by the connection (with or without feedback) of the branches described by equations 2.1, 2.2, and possibly 2.3 (see Figure 2).
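As a concrete illustration of these branch types, the following sketch (our own hypothetical code, not part of the article) implements weight, fixed nonlinear, and delay branches and wires them into the single recurrent loop used later in Section 4.4:

```python
import math

# A minimal sketch (our own, hypothetical) of the branch types of
# equations 2.1-2.3: a weight branch v = w*x, a fixed nonlinear branch
# v = f(x), and a delay branch v(t) = x(t-1) with zero initial condition.

class WeightBranch:
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        return self.w * x

class NonlinearBranch:
    def __init__(self, f):
        self.f = f
    def forward(self, x):
        return self.f(x)

class DelayBranch:
    """One-unit memory, read before it is written; this is what makes a
    loop through the delay branch computable."""
    def __init__(self):
        self.state = 0.0         # zero initial condition
    def read(self):              # v(t) = x(t-1)
        return self.state
    def write(self, x):          # store x(t) for the next step
        self.state = x

# Wiring the recurrent loop s(t) = w1*u(t) + w4*y(t-1), y(t) = tanh(s(t)):
w1, w4 = WeightBranch(0.5), WeightBranch(0.3)
act, delay = NonlinearBranch(math.tanh), DelayBranch()

ys = []
for u in [1.0, 0.5, -1.0]:
    s = w1.forward(u) + w4.forward(delay.read())
    y = act.forward(s)
    delay.write(y)
    ys.append(y)
```

Note that the loop contains one delay branch, satisfying the computability constraint discussed below.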
The branches (components) are conceptually equivalent to the components of an electrical circuit, and the relationship between the SFG and the implemented system is the same as that between the electrical circuit and the system it implements. Not all SFGs can represent real systems. It is known, in fact, that in order to be computable, an SFG has to satisfy the constraint that each loop must include at least one delay branch.

Definition 2 (reversed SFG). A reversed SFG Ĝ_N of a given SFG G_N is a member of the set of SFGs obtained by reversing the orientation of all the branches in G_N, that is, by replacing the summing junctions with branching points, and vice versa.
Let û_j, ŷ_j, x̂_j, and v̂_j be the variables associated with the branches in Ĝ_N corresponding to u_j, y_j, x_j, and v_j in G_N. In Ĝ_N the input variables are ŷ_j, j = 1, ..., r; the output variables are û_j, j = 1, ..., m; and v̂_j and x̂_j are the initial and the final variable of the jth f-branch, respectively. Let the indexes assigned to the branches of Ĝ_N be the same as the indexes assigned to the corresponding branches of G_N. Node indexes are not required by this development. The functional relationships between the variables x̂_j and v̂_j in Ĝ_N are generic; therefore, the graph Ĝ_N effectively belongs to a set of SFGs. For an example of an SFG, see Figures 3a and 3b.

Definition 3 (transposed SFG). The transposed SFG is a particular reversed SFG such that the relationships of the corresponding branches of G_N and Ĝ_N remain the same.
For a generic SFG, the following theorem, similar to Tellegen's theorem for electrical networks (Tellegen, 1952; Penfield, Spence, & Duiker, 1970), was derived by Lee (1974). It relies on the topological properties of the original and reversed graphs, not on the functional relationships between the branch variables, which therefore can be described by any mathematical relation.

Theorem 2.1. Consider a causal dynamic system (Lee, 1974). Let G_N be the corresponding SFG and Ĝ_N a generic reversed SFG. If û_j, ŷ_j, x̂_j, and v̂_j are the variables in Ĝ_N corresponding to u_j, y_j, x_j, and v_j in G_N, respectively, then

Σ_{j=1}^{r} ŷ_j(t) * y_j(t) + Σ_{j=1}^{q} x̂_j(t) * x_j(t) = Σ_{j=1}^{m} û_j(t) * u_j(t) + Σ_{j=1}^{q} v̂_j(t) * v_j(t)   (2.4)
holds true, where "*" denotes the convolution operator, that is,

y(t) * x(t) = Σ_{s=t₀}^{t} y(t − s) x(s),   (2.5)

where t₀ is the initial time. For a static system the following equation holds:

Σ_{j=1}^{r} ŷ_j y_j + Σ_{j=1}^{q} x̂_j x_j = Σ_{j=1}^{m} û_j u_j + Σ_{j=1}^{q} v̂_j v_j.   (2.6)
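Since the identity depends only on the graph topology, it can be checked numerically with arbitrary branch functions. The following sketch (our own small example, not from the article) builds a static graph with one branching point and one summing junction and verifies equation 2.6:

```python
import math

# Static check of Lee's theorem (eq. 2.6) on a small SFG (our own example):
# the input u branches into f-branches 1 and 2, which sum into f-branch 3,
# whose output is y. The reversed graph swaps the branching point and the
# summing junction; its branch functions phi_i are arbitrary, as the
# theorem requires.
f1, f2, f3 = math.tanh, math.sin, lambda x: x ** 2
phi1, phi2, phi3 = math.cos, math.exp, lambda v: 3.0 * v + 1.0

u = 0.8
# Original graph G_N
x1 = x2 = u                  # branching point after the input node
v1, v2 = f1(x1), f2(x2)
x3 = v1 + v2                 # summing junction
v3 = f3(x3)
y = v3

# Reversed graph (arbitrary input y_hat and arbitrary branch functions)
y_hat = 1.7
v3_hat = y_hat
x3_hat = phi3(v3_hat)
v1_hat = v2_hat = x3_hat     # the summing junction becomes a branching point
x1_hat, x2_hat = phi1(v1_hat), phi2(v2_hat)
u_hat = x1_hat + x2_hat      # the branching point becomes a summing junction

lhs = y_hat * y + (x1_hat * x1 + x2_hat * x2 + x3_hat * x3)
rhs = u_hat * u + (v1_hat * v1 + v2_hat * v2 + v3_hat * v3)
print(abs(lhs - rhs))  # should be zero up to rounding
```

Changing the phi_i to any other functions leaves the identity intact, which is exactly the freedom exploited in Section 3 to build the adjoint SFG.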
3 Sensitivity Computation
In network theory, a sensitivity is usually defined as the derivative of a certain output with respect to a given parameter. A general method to compute sensitivities of a dynamic continuous-time system represented by an SFG was
Figure 3: (a) A simple IIR-MLP neural network with two layers, two inputs, one output, and one neuron in the first layer, with two IIR filter synapses (one-zero, one-pole transfer function), and one neuron in the second layer, with one IIR filter synapse (two-zero, two-pole transfer function). (b) A section of its SFG G_N. (c) Its adjoint SFG Ĝ_N^(a).
derived by Lee (1974) for constant parameters. Similarly, an analogous result can be found for a discrete-time system; however, such a derivation can be used only for sensitivity computation in batch mode. Here we present a method to compute the sensitivity with respect to time-varying parameters, that is, the derivative of a system output with respect to past or present parameters (the weights w_i and the nonlinearity control parameters a_i). For an adaptive system, they will depend on time; therefore, the derivatives with respect to such parameters will also depend on time. This is a generalization of work by Lee (1974) and Wan and Beaufays (1996). The derivative of an output y_k(t) of the system with respect to the parameter w_i(t − τ), where t is the time of the original SFG and τ a positive integer (τ ≥ 0 and 0 ≤ τ ≤ t), by the first equation in equation 2.3 is

∂y_k(t)/∂w_i(t − τ) = [∂y_k(t)/∂v_i(t − τ)] [∂v_i(t − τ)/∂w_i(t − τ)] = [∂y_k(t)/∂v_i(t − τ)] x_i(t − τ),   (3.1)
where k = 1, ..., r and i spans the set of indexes of the weight branches. For the parameters of the nonlinear function, by equation 2.1, the following more general expression holds:

∂y_k(t)/∂a_i(t − τ) = [∂y_k(t)/∂v_i(t − τ)] [∂v_i(t − τ)/∂a_i(t − τ)]
                   = [∂y_k(t)/∂v_i(t − τ)] ∂g_i/∂a_i |_{x_i(t−τ), a_i(t−τ), t−τ}
                   = [∂y_k(t)/∂v_i(t − τ)] g'_i(t − τ),   (3.2)
where k = 1, ..., r and i spans the set of indexes of the nonlinear parametric branches, and a simplified notation for the derivative of g_i is implicitly defined. In both cases we need to compute the term ∂y_k(t)/∂v_i(t − τ), which is the derivative of the kth output variable at time t with respect to a signal in the system, that is, the final variable of the ith f-branch at time t − τ. As for the computation of the sensitivity in an electrical circuit, in this case an adjoint network can be introduced to perform this task. The main idea is to apply Lee's theorem (equation 2.4) to the SFG G_N of the original system and a particular reversed SFG, here called the adjoint SFG or Ĝ_N^(a). It can be easily seen (see the proof in the appendix) that expression 2.4 holds true also when small variations of the variables are considered:

Σ_{j=1}^{r} ŷ_j(t) * Δy_j(t) + Σ_{j=1}^{q} x̂_j(t) * Δx_j(t) = Σ_{j=1}^{m} û_j(t) * Δu_j(t) + Σ_{j=1}^{q} v̂_j(t) * Δv_j(t).
Since the functional relationships of the branches of Ĝ_N^(a) can be freely chosen, because equation 2.4 does not depend on them, these relationships are chosen in order to obtain an estimate of ∂y_k(t)/∂v_i(t − τ) from this equation. Thus we can now state the following:

Definition 4 (adjoint SFG). The adjoint SFG Ĝ_N^(a) is the particular reversed graph whose f-branch relationships are related to the f-branch equations of the original graph G_N by the correspondence reported in Table 1.

Table 1: Adjoint SFG Construction Rules.
Note: For batch adaptation, the adjoint branches must be defined remembering that t = T in this case, where T is the final instant of the epoch.
Table 1 shows that for each delay branch in the original SFG, the corresponding branch of the adjoint SFG is a delay branch (with zero initial condition). The nonlinearity without parameters f_j in the original SFG corresponds, in the adjoint SFG, to a gain equal to the derivative of f_j evaluated at the value of the initial variable of the branch at time t − τ, where t is the time of the original SFG and τ is that of the adjoint SFG. The weight at time t − τ in the original SFG corresponds to the weight at time τ in the adjoint SFG. A similar transformation should be performed for the general nonlinearity with control parameters. The outputs of the original SFG correspond to the inputs of the adjoint SFG. For an example of adjoint SFG construction for a simple MLP with infinite impulse response (IIR) filter synapses, see Figure 3. Moreover, the inputs of the adjoint SFG must be set to an impulse in correspondence with the output of G_N whose sensitivity has to be computed, and to a constant null signal in correspondence with all the other outputs. In other words, to compute the value of ∂y_k(t)/∂v_i(t − τ), the input of the adjoint SFG should be:
ŷ_j(τ) = 0 ∀τ,   if j ≠ k;
ŷ_j(τ) = 1 for τ = 0, and 0 for τ > 0,   if j = k;   j = 1, ..., r.   (3.3)
Let û_j, ŷ_j, x̂_j, and v̂_j be the variables associated with the branches in Ĝ_N^(a) corresponding to u_j, y_j, x_j, and v_j in G_N; it follows (see the proof in the appendix) that

∂y_k(t)/∂v_i(t − τ) = v̂_i(τ),   (3.4)

where k = 1, ..., r and i = 1, ..., q. Using this result in equation 3.1, we get

∂y_k(t)/∂w_i(t − τ) = v̂_i(τ) x_i(t − τ).   (3.5)

This result states that the derivative of an output variable of the SFG at time t with respect to the weight parameter w_i(t − τ) at time t − τ is the product of the initial variable of the ith branch in the original SFG at time t − τ and the initial variable of the corresponding branch in the adjoint graph at τ time units. This result can be easily generalized for the nonlinear parametric function as follows:

∂y_k(t)/∂a_i(t − τ) = v̂_i(τ) g'_i(t − τ).   (3.6)
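For a single static weight branch, equation 3.5 reduces to the ordinary chain rule, ∂y/∂w_i = v̂_i x_i, which is easy to verify numerically; the sketch below (our own, not from the article) compares it with a central finite-difference estimate:

```python
import math

# Static check of eqs. 3.1/3.5: y = f(w*x), so dy/dw = f'(w*x) * x.
# Here f = tanh; the adjoint value v_hat is just f'(s) for this
# one-branch graph (our own toy example).
f = math.tanh
fprime = lambda s: 1.0 - math.tanh(s) ** 2

w, x = 0.7, 1.3
s = w * x
analytic = fprime(s) * x           # v_hat * x_i, eq. 3.5 with tau = 0

eps = 1e-6
numeric = (f((w + eps) * x) - f((w - eps) * x)) / (2 * eps)
print(abs(analytic - numeric))     # should be very small
```

The dynamic case, where the adjoint recursion propagates v̂_i(τ) backward through the delay branches, is checked the same way in Section 4.4.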
4 SFG Approach to Learning in Nonlinear Dynamic Systems
In the previous section, we showed how the SFG of the original network G_N and the SFG of the adjoint network Ĝ_N^(a) can be used to compute the sensitivity of an output with respect to a past or present system parameter (equations 3.5 and 3.6). Now we show how to use this technique to adapt a dynamic network, minimizing a supervised cost function with respect to the network parameters. The idea is to consider the cost function itself as a network connected in cascade to the dynamical network to be adapted. Therefore, the entire system (named G_S in the following) has as inputs the inputs of the original network and the desired outputs, while the value of the cost function is its unique output. The gradient of this output (i.e., of the cost function) with respect to the system parameters (e.g., weights) now corresponds to the sensitivity of this output, which can be computed by using equations 3.5 and 3.6 again, applied now to the cascade system G_S.

4.1 Cost Functions and Parameter Updating Rules. Let us consider a discrete-time nonlinear dynamic system with inputs u_k, k = 1, ..., m, outputs y_k, k = 1, ..., r, and parameters w_i and a_i, which have to be adapted with respect to an output error. Using gradient-based learning and following the steepest-descent updating rule, it holds, for example, for the weight w_i:
Δw_i = −μ ∂J/∂w_i,   μ > 0,   (4.1)
where J is the cost function, Δw_i is the variation of the parameter w_i, and μ is the learning rate. The major problem is given by the calculation of the derivative ∂J/∂w_i. Since for on-line learning the parameters can change at each time instant,

∂J/∂w_i = Σ_{s=0}^{t} ∂J/∂w_i(s) = Σ_{τ=0}^{t} ∂J/∂w_i(t − τ).   (4.2)
As the length of the summation linearly increases with the current time step t, the summation in equation 4.2 must be truncated in order to implement the algorithm:

Δw_i(t) = −μ Σ_{s=t−h+1}^{t} ∂J/∂w_i(s) = −μ Σ_{τ=0}^{h−1} ∂J/∂w_i(t − τ),   (4.3)
where h is a fixed positive integer. In this way, not all the history of the system is considered but only the most recent part in the interval [t − h + 1, t]. It is easy to show that a real truncation is necessary only for circuits with feedback, such as RNNs, while for feedforward networks with delays (e.g., time-delay NNs, or TDNNs) a finite h can be chosen so that all the memory of the system is taken into account. An optimal selection of h requires an appropriate choice for each parameter to be adapted. For layered TDNNs, h should depend on the layer and should be increased moving from the last to the first layer, since more memory is involved. The updating rule (see equations 4.2 and 4.3) holds theoretically under the hypothesis of constant weights, but in practice it is only a good approximation. Equations 3.5 and 3.6 do not require that hypothesis and can be used in a more general context. Equations equivalent to 4.1, 4.2, and 4.3 can also be written for the parameters a_i. The most common choice for the cost function J is the MSE. The instantaneous squared error at time t is defined as

e²(t) = Σ_{k=1}^{r} e_k²(t)   with e_k(t) = d_k(t) − y_k(t),   (4.4)

where d_k(t), k = 1, ..., r, are the desired outputs. So the MSE over a time interval [t₀, t₁] is given by

E(t₀, t₁) = Σ_{t=t₀}^{t₁} e²(t).   (4.5)
In the case of batch training, the cost function can be chosen as the MSE over the entire learning epoch, E(0, T), where T is the final time of the epoch, whereas for on-line training only the most recent errors e²(t) must be considered, for example, using E(t − N_c + 1, t) with the constant N_c properly chosen (Nerrand et al., 1993; Narendra & Parthasarathy, 1991). Therefore it holds:

E(t − N_c + 1, t) = Σ_{s=t−N_c+1}^{t} e²(s)   on-line training,
E(0, T) = Σ_{s=0}^{T} e²(s)   batch training.
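The two cost choices can be computed from a stream of instantaneous errors with a short sketch (our own; the `costs` helper is hypothetical, not from the article):

```python
from collections import deque

# Sketch (ours) of the two cost choices above, computed from an error
# stream: on-line, E(t - Nc + 1, t) over a sliding window of the Nc most
# recent squared errors; batch, E(0, T) over the whole epoch.
def costs(errors, Nc):
    window = deque(maxlen=Nc)        # keeps only the last Nc terms
    online = []
    for e in errors:
        window.append(e * e)
        online.append(sum(window))   # E(t - Nc + 1, t)
    batch = sum(e * e for e in errors)  # E(0, T)
    return online, batch

online, batch = costs([1.0, -0.5, 0.25, 2.0], Nc=2)
```

For example, with N_c = 2 each on-line cost is the sum of the two most recent squared errors, while the batch cost sums all of them.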
Assuming J = E(t₀, t₁), it follows that

Δw_i(t) = −μ Σ_{τ=0}^{h−1} ∂E(t − N_c + 1, t)/∂w_i(t − τ)   for on-line training   (4.6a)

and

Δw_i = −μ Σ_{τ=0}^{T} ∂E(0, T)/∂w_i(T − τ)   for batch training.   (4.6b)
4.2 Cost Function Gradient Computation by SFGs. Let us consider the SFG G_S obtained by a cascade connection of the SFG G_N of the system to be adapted and the SFG that implements the cost function J (named G_J in the following) (see Figure 4 for an example). The SFG G_J has as inputs the outputs of the adaptable system y_k and the targets d_k, while the value of J is its unique output. Obviously G_J can contain the same kinds of operators as G_N: delays, nonlinearities, parameters to be adapted (e.g., for regularization purposes), and feedback. The class of cost functions allowed by this approach is therefore enlarged with respect to other approaches, since the cost expression can have memory and be recursive. Although some applications would require very complex cost functions, the SFG approach can easily handle them.

Figure 4: (a) SFG G_S: the system SFG G_N in cascade with the error-calculation SFG G_J, with J = E(t − 3, t). sqr(x) means x², and d_j, j = 1, ..., r, are the desired outputs. (b) Its adjoint Ĝ_S^(a), that is, Ĝ_N^(a) in cascade with Ĝ_J^(a).
Using the adjoint network Ĝ_S^(a) of the new SFG G_S, it is easy to calculate the sensitivities of the output with respect to the system parameters (see equations 3.5 and 3.6). Since the output y_k (k = 1) of G_S is the cost function J, combining equation 4.2 with 3.5 or 3.6 and truncating the past history to h steps (on-line learning), it holds:

∂J(t)/∂w_i = Σ_{τ=0}^{h−1} v̂_i(τ) x_i(t − τ)   on-line   (4.7a)
∂J/∂w_i = Σ_{τ=0}^{T} v̂_i(τ) x_i(T − τ)   batch   (4.7b)

∂J(t)/∂a_i = Σ_{τ=0}^{h−1} v̂_i(τ) g'_i(t − τ)   on-line   (4.8a)
∂J/∂a_i = Σ_{τ=0}^{T} v̂_i(τ) g'_i(T − τ)   batch.   (4.8b)
Equations 4.7 and 4.8 hold for any cost function J that can be described by an SFG. In the particular case when J is chosen to be a standard MSE cost function, we show that the explicit implementation of G_J can be avoided, since the involved derivatives can be analytically derived. For on-line learning, using the instantaneous squared error J = e²(t), the input-output relationship of G_J is given by equation 4.4; thus,

∂e²(t)/∂y_k(t − τ) = −2e_k(t) for τ = 0, and 0 for τ > 0,   k = 1, ..., r.   (4.9)

Using equations 3.4 and 4.9 and considering the output y_k of G_N as an internal signal v_i, we can explicitly compute the outputs of the adjoint network Ĝ_J^(a):

ŷ_k(τ) = ∂e²(t)/∂y_k(t − τ) = −2e_k(t) for τ = 0, and 0 for τ > 0,   k = 1, ..., r.   (4.10)

Such outputs can be used as the inputs of Ĝ_N^(a). Therefore, if the signal in equation 4.10 feeds the network Ĝ_N^(a) instead of equation 3.3, the explicit use of Ĝ_J^(a) can be avoided. If J = E(t − N_c + 1, t) is used, then equation 4.10 generalizes to the following (see Figure 4):

ŷ_k(τ) = ∂E(t − N_c + 1, t)/∂y_k(t − τ) = −2e_k(t − τ) for τ ≤ N_c − 1, and 0 for τ ≥ N_c,   k = 1, ..., r.   (4.11)
Figure 5: (a) SFG G_S, that is, the system SFG G_N in cascade with the error-calculation SFG G_J, with J = E(0, t); sqr(x) means x², and d_j, j = 1, ..., r, are the desired outputs. (b) Its adjoint Ĝ_S^(a), that is, Ĝ_N^(a) in cascade with Ĝ_J^(a). The feedback in (a) sums the error terms from the initial instant. The feedback in (b) injects into the SFG's left side a constant value equal to 1 for all τ.
When batch learning is involved, that is, when J = E(0, T), the following equation has to be used instead of 4.10 or 4.11:

ŷ_k(τ) = ∂E(0, T)/∂y_k(T − τ) = −2e_k(T − τ),   k = 1, ..., r,   τ = 0, 1, ..., T.   (4.12)
Note that the parameter variations computed by the summation of all gradient terms in [0, T] are the same as those obtained by BPTT (Wan & Beaufays, 1996) (see Figure 5). Thus, for standard MSE cost functions, the derivative calculation can be performed considering G_N and Ĝ_N^(a) only, but using equation 4.10, 4.11, or 4.12 as inputs for Ĝ_N^(a) instead of equation 3.3.
Srinivasan et al. (1994) derive a theorem that substantially states what is expressed in equation 4.7 when J = e²(t); however, that theorem has been proved only for a neural network with a particular structure and by a completely different and very specific approach, whereas the method presented here is fully general, including any NN as a special case. In the following, we call the resulting algorithm BC(h) (backward computation), where h is the truncation parameter, or BC for short and for the batch case.

4.3 Detailed Steps of the BC Procedure. To derive the BC algorithm for training an arbitrary SFG G_N using an MSE cost function, the adjoint SFG Ĝ_N^(a) must be drawn by reversing the graph and applying the transformations of Table 1.
On-Line Learning. Choosing J = E(t − N_c + 1, t), the following steps have to be performed for each time instant t:

1. The system SFG G_N is computed one time step forward, storing in memory the internal states for the last h time steps.
2. The adjoint SFG Ĝ_N^(a) is reset (setting null initial conditions for the delays).
3. Ĝ_N^(a) is computed for τ = 0, 1, ..., h − 1, with the input given by equation 4.11, computing the terms of summations 4.7a and 4.8a.
4. The parameters w_j and a_j are adapted by equation 4.1.

Batch Learning. Choosing J = Σ_{t=0}^{T} e²(t), the procedure is simpler. For each epoch:

1. The system SFG G_N is computed, storing its internal states from time t = 0 up to time T.
2. The adjoint SFG Ĝ_N^(a) is reset.
3. Ĝ_N^(a) is evaluated for τ = 0 up to T, with t = T (see the appendix), accumulating the terms needed by equations 4.7b and 4.8b.
4. The parameters w_j and a_j are adapted by equation 4.1.

However, not all the variables of the adjoint SFG have to be computed: only the initial variables of the branches containing adaptable parameters, since they must be used in equations 4.7 and 4.8. It must be stressed that the delay operator of the adjoint SFG delays the τ and not the t index. When the MSE is considered, the batch-mode BC learning method corresponds to batch BPTT and Wan and Beaufays' method (1996), while the
Figure 6: (a) SFG representing a recurrent neuron. (b) Its adjoint.
on-line mode BC procedure corresponds to truncated BPTT (Williams & Peng, 1990). If the desired cost function J is not a conventional MSE, then the cost function SFG G_J must be designed and connected in cascade with the network SFG, giving the overall SFG G_S, whose adjoint SFG Ĝ_S^(a) must be drawn by reversing the graph and applying the transformation of Table 1. In the previous steps of the BC procedures, G_S must now be considered instead of G_N and Ĝ_S^(a) instead of Ĝ_N^(a), while the input of the adjoint SFG Ĝ_S^(a) is given by equation 3.3 instead of 4.10, 4.11, or 4.12.

4.4 Example of Application of the BC Algorithm. As an example, a very simple SFG, a recurrent neuron with two adaptable parameters, is considered (see Figure 6a). The equations of the BC algorithm will be detailed to make the method clear and to show the relationship with truncated BPTT(h) (Williams & Peng, 1990) in on-line mode and with BPTT in batch mode. The equations of the forward phase of the system can be written by looking at Figure 6a. Let us assume the following order of the branches: for the two weights and the nonlinearity operator, the branch index is the subindex shown in the figure, and for the delay operator it is equal to three. Since the system is single-input, single-output (SISO), the u and y variables do not
need any index:

s(t) = x₂(t) = w₁(t)u(t) + w₄(t)y(t − 1)   (4.13)

y(t) = f₂(s(t)).   (4.14)
4.4.1 On-Line Case. For the simplest case J = e²(t) = (d(t) − y(t))², according to equation 4.10, the input of the corresponding adjoint SFG (see Figure 6b) must be:

ŷ(τ) = { −2e(t), τ = 0; 0, τ > 0 }.   (4.15)

From the adjoint SFG, it results in

v̂₁(τ) = v̂₄(τ) = f₂′(s(t − τ)) · ( ŷ(τ) + { ŵ₄(τ − 1)v̂₄(τ − 1), τ > 0; 0, τ = 0 } ),   (4.16)

where (see Table 1)

ŵ₄(τ − 1) = w₄(t − τ + 1).   (4.17)
According to equations 3.5 and 4.3, the weights can be updated using the following two equations:

Δw₁(t) = μ Σ_{τ=0}^{h−1} δ(t − τ) u(t − τ)   (4.18)

Δw₄(t) = μ Σ_{τ=0}^{h−1} δ(t − τ) y(t − τ − 1),   (4.19)

where

δ(t − τ) = −∂e²(t)/∂s(t − τ) = −∂e²(t)/∂v₁(t − τ) = −v̂₁(τ).   (4.20)
It can easily be seen that these equations do indeed correspond to truncated BPTT(h), since the quantity δ is simply the usual δ of truncated BPTT. Note that here a weight buffer has to be used as stated by the theory, while sometimes in the literature only the current weights are used to obtain a simpler, approximated implementation (Williams & Peng, 1990). It is worth noting that in spite of the simplicity of the system (a single recurrent neuron), the backward equations are not easy to derive by the chain rule; in fact, forward (recursive) equations are usually proposed for adaptation.
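To make the correspondence with truncated BPTT(h) concrete, the on-line recursion of equations 4.13 to 4.20 can be sketched in Python for the recurrent neuron of Figure 6a. The function name, the choice f₂ = tanh, and the history-buffer layout (index k holds the value at time t − k) are our own illustrative assumptions, not part of the paper:

```python
import math

def bc_online_step(u_hist, y_hist, s_hist, w4_hist, d_t, w1, w4, mu, h):
    """One on-line BC(h) update for the recurrent neuron of Figure 6a:
    s(t) = w1(t)u(t) + w4(t)y(t-1), y(t) = f2(s(t)) with f2 = tanh (our choice).
    Each *_hist list holds past values: index k corresponds to time t - k."""
    e_t = d_t - y_hist[0]                          # e(t) = d(t) - y(t)
    dw1 = dw4 = 0.0
    v_prev = 0.0                                   # v-hat_4(tau - 1)
    for tau in range(min(h, len(s_hist))):         # adjoint run, tau = 0..h-1
        y_hat = -2.0 * e_t if tau == 0 else 0.0    # eq. 4.15
        f_prime = 1.0 - math.tanh(s_hist[tau]) ** 2  # f2'(s(t - tau)) for tanh
        w4_buf = w4_hist[tau - 1] if tau > 0 else w4  # weight buffer, eq. 4.17
        v = f_prime * (y_hat + (w4_buf * v_prev if tau > 0 else 0.0))  # eq. 4.16
        delta = -v                                 # eq. 4.20: delta = -v-hat_1(tau)
        dw1 += delta * u_hist[tau]                 # term of eq. 4.18
        if tau + 1 < len(y_hist):
            dw4 += delta * y_hist[tau + 1]         # y(t - tau - 1), eq. 4.19
        v_prev = v
    return w1 + mu * dw1, w4 + mu * dw4            # updates of eqs. 4.18-4.19
```

With w₄ = 0 only the τ = 0 term survives and the update collapses to the familiar static delta rule, which is a quick sanity check on the recursion.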
4.4.2 Batch Case. The graph is computed for t = 0, . . . , T, saving the internal states, without any evaluation of the adjoint SFG. The weight time index can be neglected since the weights are now constant. In this case, the MSE over the entire epoch is taken, J = Σ_{t=0}^{T} e²(t); therefore, the input of the adjoint SFG (see Figure 6b) must be, according to equation 4.12:

ŷ(τ) = −2e(T − τ),   τ = 0, 1, . . . , T.   (4.21)
From the adjoint SFG, remembering that t = T, it results in

v̂₁(τ) = v̂₄(τ) = f₂′(s(T − τ)) · ( ŷ(τ) + { ŵ₄v̂₄(τ − 1), τ > 0; 0, τ = 0 } ),   (4.22)
where ŵ₄ = w₄. Therefore the weights can be updated according to the following two equations:

Δw₁ = μ Σ_{τ=0}^{T} δ(T − τ) u(T − τ)   (4.23)

Δw₄ = μ Σ_{τ=0}^{T} δ(T − τ) y(T − τ − 1),   (4.24)

where

δ(T − τ) = −∂J/∂s(T − τ) = −∂J/∂v₁(T − τ) = −v̂₁(τ).   (4.25)
It is easy to see that these equations correspond to BPTT, since the quantity δ is simply the usual δ of BPTT.

4.5 Complexity Analysis. The complexity of the proposed gradient computation is very low, since it increases linearly with the number of adaptable parameters. In particular, since the adjoint graph has the same topology as the original one, the complexity of its computation is about the same; in practice, it is lower, since not all the variables of the adjoint graph have to be computed, as stated in section 4.3. For on-line learning, the adjoint graph must be evaluated h times for each time step; therefore, the number of operations for the computation of the gradient terms with respect to all the parameters is roughly (in practice it is lower) h times the number of operations of the forward phase, plus the computation needed by equations 4.7 and 4.8 for each time step (whose complexity increases linearly with h and the number of parameters).
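In batch mode, the adjoint recursion of equations 4.21 to 4.25 yields the full gradient of J = Σ_t e²(t), which can be verified against finite differences. The following sketch is ours, with f₂ = tanh and a zero initial state as illustrative assumptions:

```python
import math

def forward(w1, w4, u, y0=0.0):
    """Forward phase of eqs. 4.13-4.14: s(t) = w1*u(t) + w4*y(t-1),
    y(t) = tanh(s(t)), run for t = 0..T while storing the internal states."""
    s, y = [], []
    y_prev = y0
    for u_t in u:
        s_t = w1 * u_t + w4 * y_prev
        y_t = math.tanh(s_t)
        s.append(s_t)
        y.append(y_t)
        y_prev = y_t
    return s, y

def bc_batch_gradients(w1, w4, u, d, y0=0.0):
    """Gradient of J = sum_t e^2(t) by the batch adjoint recursion
    (eqs. 4.21-4.25); returns (dJ/dw1, dJ/dw4)."""
    s, y = forward(w1, w4, u, y0)
    T = len(u) - 1
    g_w1 = g_w4 = 0.0
    v_prev = 0.0
    for tau in range(T + 1):                    # adjoint time tau = 0..T
        t = T - tau                             # original time index
        e_t = d[t] - y[t]
        y_hat = -2.0 * e_t                      # eq. 4.21
        f_prime = 1.0 - y[t] ** 2               # tanh'(s) = 1 - tanh(s)^2
        v = f_prime * (y_hat + (w4 * v_prev if tau > 0 else 0.0))  # eq. 4.22
        delta = -v                              # eq. 4.25
        g_w1 -= delta * u[t]                    # dJ/dw1 = -sum delta*u (eq. 4.23)
        g_w4 -= delta * (y[t - 1] if t > 0 else y0)  # eq. 4.24
        v_prev = v
    return g_w1, g_w4
```

The update of equations 4.23 and 4.24 is then Δw = −μ · dJ/dw, and the returned gradients agree with central finite differences of J to numerical precision.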
4.5.1 Fully Recurrent Neural Networks. For the simple case of a fully recurrent, single-layer, single-delay neural network composed of n neurons, the computational complexity is O(n²) operations per time step for batch BC or O(n²h) for on-line BC, compared with O(n⁴) for RTRL (Williams and Zipser, 1989). The memory requirement is O(nT) for batch BC or O(nh) for on-line BC, and O(n³) for RTRL. Therefore, as far as computational complexity is concerned, in batch mode BC is significantly simpler than RTRL, whereas in on-line mode the complexity and also the memory requirement ratios depend on n²/h. However, in many practical cases n² is large compared to h, and therefore RTRL will be more complex than on-line BC.

4.5.2 Locally Recurrent Neural Networks. For a complex architecture such as a locally recurrent layered network, a mathematical evaluation of complexity can be carried out by computing the number of multiplications and additions for one iteration of the on-line learning (i.e., for each input sample). Results for on-line BC are reported for the significant special case of a two-layer MLP with IIR temporal filter synapses (IIR-MLP) (Tsoi & Back, 1994; Campolucci et al., 1999) with bias and with moving average (MA) and autoregressive (AR) orders (L^(l) and I^(l), respectively) depending only on the layer index l. The number of additions and multiplications is, respectively:

N₂ + h [ N₁N₀(2I^(1) + L^(1)) + 2N₂N₁(I^(2) + L^(2)) − N₂(N₁ − 1) + N₁ ],

N₂ + h [ N₁N₀(2I^(1) + L^(1) + 1) + 2N₂N₁(I^(2) + L^(2)) + N₁(N₂ + 1) ],

where Nᵢ is the number of neurons of the ith layer (i = 0 corresponds to the input layer). These numbers must be added to the number of operations of the forward phase, which should always be performed before the backward phase, N₁N₀(L^(1) + I^(1)) + N₂N₁(L^(2) + I^(2)), which is the same for both additions and multiplications.

5 Conclusion
We have presented a signal-flow-graph approach that allows an easy on-line computation of the gradient terms needed in sensitivity analysis and learning of nonlinear dynamic adaptive systems represented by SFGs, even if their structure includes feedback (of any kind, even nested) or time delays. This method bypasses the complex chain-rule derivative expansion traditionally needed to derive the gradient equations. The gradient information obtained in this way can be useful for circuit optimization by output sensitivity minimization or for gradient-based training algorithms, such as conjugate gradient or other techniques. This method should allow the development of computer-aided design software by which
Figure 7: (a) The perturbation introduced in the original SFG. (b) The corresponding node in the adjoint SFG.
an operator could define an architecture of a system to be adapted or a circuit whose sensitivity must be computed, leaving to the software the hard task of finding and implementing the needed gradient calculation algorithm. We have easily developed this software in a very general version. This work is related to previous results in different fields and provides an elegant analogy with the sensitivity analysis of analog networks by the adjoint network technique obtained by applying Tellegen's theorem, which in that case relates the voltages and currents of the network (Director & Rohrer, 1969).

Appendix: Proof of the Adjoint SFG Method
Let us consider a nonlinear dynamical system described by an SFG with the notation defined in section 2. Starting at time 0 and letting the system run up to time t, we obtain the signals uj(t), yj(t), xj(t), and vj(t). Now let us repeat the same process but introducing at time t − τ a perturbation ε at a specific node of the graph (see Figure 7a). This is equivalent to considering a system with m + 1 inputs, one more than the original system, where the (m + 1)th input is u_{m+1}(t) = u′(t), with u′(t) defined such that

u′(t − ω) = { 0, ω ≠ τ; ε, ω = τ }.
Therefore, at time t, all the variables uj(t), yj(t), xj(t), and vj(t) will be changed to uj(t) + Δuj(t), yj(t) + Δyj(t), xj(t) + Δxj(t), and vj(t) + Δvj(t). Obviously it holds that Δu′(t) = u′(t), ∀t. Let ûj, ŷj, x̂j, v̂j, and û_{m+1}(t) = û′(t) be the corresponding variables associated with the branches of a generic reversed SFG Ĝ_N of the original graph
G_N; applying theorem 1 with ε = 0 (no perturbation), it follows that

Σ_{j=1}^{r} ŷj(t) ∗ yj(t) + Σ_{j=1}^{q} x̂j(t) ∗ xj(t) = Σ_{j=1}^{m+1} ûj(t) ∗ uj(t) + Σ_{j=1}^{q} v̂j(t) ∗ vj(t),   (A.1)
while when ε ≠ 0 (with perturbation), the same equation rewrites as

Σ_{j=1}^{r} ŷj(t) ∗ (yj(t) + Δyj(t)) + Σ_{j=1}^{q} x̂j(t) ∗ (xj(t) + Δxj(t))
  = Σ_{j=1}^{m+1} ûj(t) ∗ (uj(t) + Δuj(t)) + Σ_{j=1}^{q} v̂j(t) ∗ (vj(t) + Δvj(t)).   (A.2)
Since the convolution is a linear operator, subtracting equation A.1 from A.2 results in

Σ_{j=1}^{r} ŷj(t) ∗ Δyj(t) + Σ_{j=1}^{q} x̂j(t) ∗ Δxj(t) = Σ_{j=1}^{m+1} ûj(t) ∗ Δuj(t) + Σ_{j=1}^{q} v̂j(t) ∗ Δvj(t).   (A.3)
With regard to the input branches ŷj of the SFG Ĝ_N, it is possible to choose

ŷj(t) = 0 ∀t if j ≠ k,  and  ŷj(t) = { 1, t = 0; 0, t > 0 } if j = k,   j = 1, . . . , r,   (A.4)

in order to obtain

Σ_{j=1}^{r} ŷj(t) ∗ Δyj(t) = Δyk(t),   (A.5)
where yk is the output of which we need to compute the gradient. Since the inputs uj, j = 1, . . . , m, of G_N are obviously not dependent on the perturbation ε, it holds that

Δuj(t) = 0   ∀t,  j = 1, . . . , m;   (A.6)
therefore,

Σ_{j=1}^{m+1} ûj(t) ∗ Δuj(t) = û_{m+1}(t) ∗ Δu_{m+1}(t) = û′(t) ∗ Δu′(t)
  = Σ_{ω=0}^{t} û′(ω) Δu′(t − ω) = û′(τ) Δu′(t − τ).   (A.7)
Now we use the degrees of freedom given by the definition of the reversed graph to choose the f-branch operators in Ĝ_N such that

v̂j(t) ∗ Δvj(t) = x̂j(t) ∗ Δxj(t),   j = 1, . . . , q,   (A.8)

and equation A.3 can be greatly simplified. In this way, the adjoint graph Ĝ_N^(a) will be obtained. For delay branches:

vj(t) = q⁻¹ xj(t)  ⇒  Δvj(t) = Δxj(t − 1) = q⁻¹ Δxj(t).   (A.9)

Let us choose the relationship between the branch variables in Ĝ_N as

x̂j(τ) = q⁻¹ v̂j(τ),   (A.10)
where τ is the time index of the adjoint SFG and t the time index of the original SFG; the same notation will be used in the following. Since Δxj(t − ω) and Δvj(t − ω) are zero if ω > t (the system is causal), and the initial conditions of Ĝ_N are set to zero, that is,

x̂j(0) = 0,   (A.11)

it follows that

v̂j(t) ∗ Δvj(t) = Σ_{ω=0}^{t} v̂j(ω) Δvj(t − ω) = Σ_{ω=0}^{t} x̂j(ω + 1) Δxj(t − ω − 1)
  = Σ_{s=1}^{t+1} x̂j(s) Δxj(t − s) = Σ_{s=0}^{t+1} x̂j(s) Δxj(t − s)
  = Σ_{s=0}^{t} x̂j(s) Δxj(t − s) = x̂j(t) ∗ Δxj(t).   (A.12)
For static branches:

vj(t) = gj(xj(t), aj(t), t).   (A.13)
Let us choose the relationship between the branch variables x̂j and v̂j in Ĝ_N as

x̂j(τ) = v̂j(τ) (∂gj/∂xj)|_(xj(t−τ), aj(t−τ), t−τ) = v̂j(τ) gj′(t − τ),   (A.14)

where gj′(·) is implicitly defined as in equation 3.2. By differentiating,

Δvj(s) = { (∂gj/∂xj)|_(xj(s), aj(s), s) Δxj(s) = gj′(s) Δxj(s), s ≥ t − τ; 0, s < t − τ },   (A.15)

v̂j(t) ∗ Δvj(t) = Σ_{ω=0}^{t} v̂j(ω) Δvj(t − ω)
  = Σ_{ω=0}^{t} v̂j(ω) gj′(t − ω) Δxj(t − ω) = x̂j(t) ∗ Δxj(t).   (A.16)
Very interesting particular cases of equation A.14 for nonlinear adaptive filters or neural networks are:

x̂j(τ) = wj(t − τ) v̂j(τ)    for a weight branch,   (A.17)

x̂j(τ) = fj′(xj(t − τ)) v̂j(τ)    for a nonlinear branch.   (A.18)

Equations A.17 and A.18 correspond to the branches defined in equation 2.3, respectively. We have completely defined the reversed SFG Ĝ_N, which is now called the adjoint signal-flow-graph Ĝ_N^(a). Therefore the adjoint SFG Ĝ_N^(a) is defined as a reversed SFG Ĝ_N with the additional conditions given by equations A.10 and A.11 for delay branches, A.17 and A.18 (or A.14) for static branches, and A.4 for the ŷ-branches. These definitions are summarized in Table 1. Combining equations A.3, A.5, A.7, and A.8 results in
Δyk(t) = û′(τ) Δu′(t − τ).   (A.19)

Since û′(τ) obviously does not depend on ε, it holds that

∂yk(t)/∂u′(t − τ) = lim_{ε→0} Δyk(t)/Δu′(t − τ) = û′(τ).   (A.20)

From Figure 7a, we observe that u′ and the ith branch come into the same node; remembering that a node performs a summation, it holds that

∂yk(t)/∂u′(t − τ) = ∂yk(t)/∂vi(t − τ),   (A.21)
and from Figure 7b that û′(τ) = v̂i(τ); therefore

∂yk(t)/∂vi(t − τ) = v̂i(τ).   (A.22)
It is important to note that for batch learning, equation A.22 is evaluated only for t = T, and therefore the adjoint branches defined in Table 1 are considered with t = T, where T is the final instant of the epoch.

References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proc. Int. Conf. on Neural Networks (Vol. 2, pp. 609–618).
Back, A. D., & Tsoi, A. C. (1991). FIR and IIR synapses, a new neural network architecture for time series modelling. Neural Computation, 3, 375–385.
Beaufays, F., & Wan, E. (1994). Relating real-time backpropagation and backpropagation-through-time: An application of flow graph interreciprocity. Neural Computation, 6, 296–306.
Campolucci, P. (1998). A circuit theory approach to recurrent neural network architectures and learning methods. Doctoral dissertation in English, University of Bologna, Italy. PDF available online at http://nnsp.eealab.unian.it/campolucci P or requested from
[email protected].
Campolucci, P., Marchegiani, A., Uncini, A., & Piazza, F. (1997). Signal-flow-graph derivation of on-line gradient learning algorithms. In Proc. ICNN-97, IEEE Int. Conference on Neural Networks (Houston, TX).
Campolucci, P., Piazza, F., & Uncini, A. (1995). On-line learning algorithms for neural networks with IIR synapses. In Proc. IEEE International Conference on Neural Networks (Perth).
Campolucci, P., Uncini, A., & Piazza, F. (1998). Dynamical systems learning by a circuit theoretic approach. In Proc. ISCAS-98, IEEE Int. Symposium on Circuits and Systems.
Campolucci, P., Uncini, A., Piazza, F., & Rao, B. D. (1999). On-line learning algorithms for locally recurrent neural networks. IEEE Trans. on Neural Networks, 10, 253–271.
Director, S. W., & Rohrer, R. A. (1969). The generalized adjoint network and network sensitivities. IEEE Trans. on Circuit Theory, CT-16, 318–323.
Gherrity, M. (1989). A learning algorithm for analog, fully recurrent neural networks. In Proc. Int. Joint Conference on Neural Networks (Vol. 1, pp. 643–644).
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: IEEE Press–Macmillan.
Horne, B. G., & Giles, C. L. (1995). An experimental comparison of recurrent neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7. Cambridge, MA: MIT Press.
Lee, A. Y. (1974). Signal flow graphs—Computer-aided system analysis and sensitivity calculations. IEEE Transactions on Circuits and Systems, CAS-21, 209–216.
Martinelli, G., & Perfetti, R. (1991). Circuit theoretic approach to the backpropagation learning algorithm. In Proc. IEEE Int. Symposium on Circuits and Systems.
Mason, S. J. (1953). Feedback theory—Some properties of signal-flow graphs. Proc. Institute of Radio Engineers, 41, 1144–1156.
Mason, S. J. (1956). Feedback theory—Further properties of signal-flow graphs. Proc. Institute of Radio Engineers, 44, 920–926.
Narendra, K. S., & Parthasarathy, K. (1991). Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Trans. on Neural Networks, 2, 252–262.
Nerrand, O., Roussel-Ragot, P., Personnaz, L., Dreyfus, G., & Marcos, S. (1993). Neural networks and nonlinear adaptive filtering: Unifying concepts and new algorithms. Neural Computation, 5, 165–199.
Oppenheim, A. V., & Schafer, R. W. (1975). Digital signal processing. Englewood Cliffs, NJ: Prentice Hall.
Osowski, S. (1994). Signal flow graphs and neural networks. Biological Cybernetics, 70, 387–395.
Pearlmutter, B. A. (1995). Gradient calculations for dynamic recurrent neural networks: A survey. IEEE Trans. on Neural Networks, 6, 1212–1228.
Penfield, P., Spence, R., & Duinker, S. (1970). Tellegen's theorem and electrical networks. Cambridge, MA: MIT Press.
Srinivasan, B., Prasad, U. R., & Rao, N. J. (1994). Backpropagation through adjoints for the identification of nonlinear dynamic systems using recurrent neural models. IEEE Trans. on Neural Networks, 5, 213–228.
Tellegen, B. D. H. (1952). A general network theorem, with applications. Philips Res. Rep., 7, 259–269.
Tsoi, A. C., & Back, A. D. (1994). Locally recurrent globally feedforward networks: A critical review of architectures. IEEE Transactions on Neural Networks, 5, 229–239.
Uncini, A., Vecci, L., Campolucci, P., & Piazza, F. (1999). Complex-valued neural networks with adaptive spline activation function for digital radio links nonlinear equalization. IEEE Transactions on Signal Processing, 47, 505–514.
Wan, E. A., & Beaufays, F. (1996). Diagrammatic derivation of gradient algorithms for neural networks. Neural Computation, 8, 182–201.
Wan, E. A., & Beaufays, F. (1998).
Diagrammatic methods for deriving and relating temporal neural network algorithms. In M. Gori & C. L. Giles (Eds.), Adaptive processing of sequences and data structures. Berlin: Springer-Verlag.
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proc. of the IEEE, 78, 1550–1560.
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 490–501.
Williams, R. J., & Zipser, D. (1989). A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1, 270–280.
Williams, R. J., & Zipser, D. (1994). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Backpropagation: Theory, architectures and applications (pp. 433–486). Hillsdale, NJ: Erlbaum.

Received July 7, 1997; accepted July 22, 1999.
LETTER
Communicated by Malik Magdon-Ismail
Bootstrapping Neural Networks

Jürgen Franke
Department of Mathematics, University of Kaiserslautern, 67653 Kaiserslautern, Germany

Michael H. Neumann
Department of Economics, Humboldt University of Berlin, 13591 Berlin, Germany

Knowledge about the distribution of a statistical estimator is important for various purposes, such as the construction of confidence intervals for model parameters or the determination of critical values of tests. A widely used method to estimate this distribution is the so-called bootstrap, which is based on an imitation of the probabilistic structure of the data-generating process on the basis of the information provided by a given set of random observations. In this article we investigate this classical method in the context of artificial neural networks used for estimating a mapping from input to output space. We establish consistency results for bootstrap estimates of the distribution of parameter estimates.

1 Introduction
Neural networks provide a tool for learning an unknown mapping, say m, from input space to output space. In the presence of noise, they provide function estimators that are nonparametric in spirit, as discussed in detail by White (1990). A feedforward neural network with independent inputs and noisy outputs is a particular nonlinear regression model. The reliability of estimates for the weights is important for the ability of the trained network to generalize. Their asymptotic normal distribution has been given by White (1989a). If one is interested in quantifying the performance of statistical estimates, Efron's (1979) bootstrap is an alternative to asymptotic considerations, which often provides an adequate and in many cases better approximation to the actual distribution than standard asymptotics, as discussed, for example, for nonlinear nonparametric regression by Härdle and Bowman (1988) and for nonlinear nonparametric time series models by Franke, Kreiss, and Mammen (1997) and Neumann and Kreiss (1998). The bootstrap may be used for calculating confidence intervals for predictions and critical values for statistical tests, and it is also helpful for model selection and automatic choice of the amount of smoothing in semi- and nonparametric situations (Efron & Tibshirani, 1993; Hall, 1992; Shao & Tu, 1995). For

Neural Computation 12, 1929–1949 (2000)  © 2000 Massachusetts Institute of Technology
1930
Jürgen Franke and Michael H. Neumann
the particular case of a parametric nonlinear regression model, the validity and second-order efficiency of the bootstrap has been described and investigated in practice by Huet and Jolivet (1989), Huet, Jolivet, and Messean (1990), and Bunke, Droge, and Polzehl (1995). These results can be applied directly to feedforward neural networks in the correctly specified case, where the mapping m can be represented exactly by a network of the form considered. Typically, however, finite-dimensional neural networks used for learning a particular mapping are misspecified. In this article, we describe bootstrap procedures for feedforward neural networks that also cover this misspecified case. For simplicity, we restrict ourselves to networks with only one hidden layer and with one linear output unit, which, for real-valued mappings, already have the universal approximation property (Hornik, Stinchcombe, & White, 1989). However, our arguments can be generalized in a straightforward manner to multilayer, multioutput networks, where the output nodes do not have to be linear. We admit an arbitrary activation function for the neurons of the hidden layer, satisfying only certain smoothness conditions, such that our results cover multilayer perceptrons with sigmoid activation function as well as radial basis function networks with kernel-type activation function. Practitioners like Refenes, Zapranis, and Utans (1996) and Baxt and White (1995) have already used the standard bootstrap, for example, for estimating the sampling variability of neural networks. In section 2, we provide the theoretical basis for these applications. We make the difference from correctly specified nonlinear regression models transparent and discuss some pitfalls related to identifiability of network parameters. This residual-based bootstrap is compared with the asymptotic normal approximation in a short simulation study in section 3.
In section 4, we present a different “wild” bootstrap procedure, which is able to cope with situations where the noise in the data and in particular its variance depends on the input. 2 The Bootstrap Procedure
We consider a training set of independent and identically distributed random row vectors (X′t, Yt), t = 1, . . . , N, where Xt is of dimension d with marginal density p and Yt is real valued. Suppose we are interested in the relationship between Yt and Xt, and we want to estimate the conditional expectation of Yt given Xt = x (see White, 1989b):

m(x) = E{Yt | Xt = x}.   (2.1)

We assume that the residuals εt = Yt − m(Xt) are independent random variables with finite variance E{εt² | Xt = x} = σε²(x). We want to approximate m by a single hidden-layer feedforward network with H hidden units, H ≥ 1.
Bootstrapping Neural Networks
1931
We write its output given input x as

fH(x, θ) = v₀ + Σ_{h=1}^{H} vh ψ(x̃′wh),   (2.2)
where θ = (w′₁, . . . , w′_H, v₀, . . . , v_H)′ is the vector of network weights with w′h = (w₀h, . . . , w_ph), h = 1, . . . , H, ψ is the (fixed) hidden unit activation function, and x̃′ = (1, x′) is the input vector augmented by a bias component 1. The parameterization of the network function is not unique; certain simple symmetry operations applied to the weight vector obviously do not change the value of fH(x, θ). For a sigmoid activation function ψ centered around 0, that is, ψ(−x) = −ψ(x), these symmetry operations correspond to an exchange of hidden units and multiplying all weights of connections going into and out of a particular hidden unit by −1. To avoid this ambiguity, we consider only weight vectors θ lying in a fundamental domain in the sense of Rüger and Ossen (1997). For the case of sigmoid activation functions with ψ(−x) = −ψ(x), this means that we restrict our attention to parameter vectors θ with v₁ ≥ v₂ ≥ · · · ≥ v_H ≥ 0. To simplify the proofs, we consider only a compact subset Θ_H of such a fundamental domain. Now, we train the network to get the nonlinear least-squares estimate θ̂_N of the weight vector, that is, θ̂_N = arg min_{θ∈Θ_H} D̂_N(θ), where

D̂_N(θ) ≡ (1/N) Σ_{t=1}^{N} (Yt − fH(Xt, θ))².   (2.3)
In the correctly specified case, where m(x) = fH(x, θ₀) for some θ₀ ∈ Θ_H, it is well known from nonlinear regression that √N(θ̂_N − θ₀) is asymptotically, for N → ∞, normally distributed with mean 0 and covariance matrix σε²[E{∇fH(Xt, θ₀)∇fH(Xt, θ₀)′}]⁻¹. Here and in the following, ∇fH(x, θ) denotes the gradient of fH with respect to θ. The analogous result for the misspecified situation is given by White (1989a). Here, there is no true θ₀, but θ̂_N converges to the parameter of the best network function approximation for m(x), that is, to θ₀ = arg min_{θ∈Θ_H} D₀(θ), where

D₀(θ) ≡ E(Yt − fH(Xt, θ))² ≡ ∫ [(m(x) − fH(x, θ))² + σε²(x)] p(x) dx,   (2.4)

where we assume that the random vector Xt has a density p(x). If the size of the training set grows (N → ∞), then √N(θ̂_N − θ₀) is again asymptotically normally distributed with mean 0 and covariance matrix A(θ₀)⁻¹B(θ₀)
A(θ₀)⁻¹, where A(θ) = E{∇²d(θ)}, B(θ) = E{∇d(θ)∇d(θ)′}, d(θ) ≡ d(Xt, Yt, θ) = (Yt − fH(Xt, θ))². Here, ∇² denotes the Hessian with respect to θ. This result matches the above result in the correctly specified case, since then A(θ₀) reduces to 2E{∇fH(Xt, θ₀)∇fH(Xt, θ₀)′}, which implies that A(θ₀)⁻¹B(θ₀)A(θ₀)⁻¹ equals the covariance matrix σε²[E{∇fH(Xt, θ₀)∇fH(Xt, θ₀)′}]⁻¹. For these results to hold, some assumptions have to be satisfied:

A1. The activation function ψ is bounded and twice continuously differentiable with bounded derivatives, and m is bounded.

A2. D₀(θ) has a unique global minimum at θ₀ lying in the interior of Θ_H, and ∇²D₀(θ₀) = A(θ₀) is positive definite.
We think that all results stated below remain valid if the boundedness of m is relaxed to a growth condition such as |m(x)| ≤ C‖x‖^M for M < ∞, in conjunction with the condition P(‖Xt‖ > t) ≤ Cρᵗ for ρ < 1. We decided to state the results under the stronger condition, A1, in order to keep the proofs as simple as possible.

Besides the asymptotic normal distribution, the bootstrap provides an alternative approximation for the distribution of θ̂_N − θ₀. Following Huet and Jolivet (1989), we first consider the following approach, which is adequate for identically distributed noise εt. For initializing the bootstrap procedure, we need any uniformly consistent estimator m̂_N for m, i.e., an estimator for which sup_{x∈supp(p)} {|m̂_N(x) − m(x)|} → 0 in probability for N → ∞. m̂_N could be chosen, for example, as the type of connectionist sieve estimator considered by White (1990), where the complexity of the network is allowed to grow with N, as a spline smoother with roughness penalty converging slowly to 0, or as a kernel-type smoother with bandwidth decreasing with N. Then the noise variables εt can be approximated by

ε̂t = Yt − m̂_N(Xt),   t = 1, . . . , N.

We know from our assumptions that Eεt = 0. To avoid a systematic error in the bootstrap, we follow Freedman (1981) and center the ε̂t:

ε̃t = ε̂t − (1/N) Σ_{k=1}^{N} ε̂k,   t = 1, . . . , N.

Let F̃_N denote the sample distribution given by ε̃₁, . . . , ε̃_N. We draw independent bootstrap errors ε₁*, . . . , ε_N* from F̃_N, that is, for all t:

εt* = ε̃k with probability 1/N,   k = 1, . . . , N.
In the same manner, we draw bootstrap input vectors X₁*, . . . , X_N* randomly with replacement from X₁, . . . , X_N, that is, for all t:

Xt* = Xk with probability 1/N,   k = 1, . . . , N.

Finally, we form bootstrap outputs as

Yt* = m̂_N(Xt*) + εt*,   t = 1, . . . , N,

to get a bootstrap training set (X₁*, Y₁*), . . . , (X_N*, Y_N*). The basic idea of the bootstrap is the following. Because the εt and Xt are independent and identically distributed, their sample distributions approximate, for N large enough, the true distributions. Therefore, the εt* and Xt* behave similarly to the εt and Xt. As, by construction, m̂_N is close to m, the bootstrap outputs show random variations similar to those of the true outputs. Therefore, the behavior of estimates like the weights of a network after training should be similar for the original training set and the bootstrap training set. The random mechanism generating the bootstrap training set is, however, known to us, and we can repeat the above procedure as often as we like to get a whole family of independent bootstrap training sets (X₁*(i), Y₁*(i)), . . . , (X_N*(i), Y_N*(i)), i = 1, . . . , B. Using standard Monte Carlo techniques, we can mimic the behavior of any quantity of interest calculated from the training set. For illustration, let us assume that we are interested in the mean squared error

mse(x) = E(m(x) − fH(x, θ̂_N))²,

which we get if we train a network with H hidden units from the original random training set (Xt, Yt), t = 1, . . . , N, and use it to estimate the value m(x) of the function of interest for given input x. We get a bootstrap approximation for mse(x) as

mse*(x) = (1/B) Σ_{i=1}^{B} (m̂_N(x) − fH(x, θ̂*_{N,i}))²,

where θ̂*_{N,i} is the weight vector after training the network using the ith bootstrap training set (Xt*(i), Yt*(i)), t = 1, . . . , N. Before we illustrate the procedure with some examples in the next section, we first state the results guaranteeing that our heuristic arguments can be made rigorous and that the bootstrap really works for neural network function estimators. We first state a slight extension of White's (1989a) results on the asymptotic distribution of θ̂_N. In particular, we admit dependence of the noise variance on the input, which we discuss in section 4 in more detail.
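The resampling scheme just described can be sketched generically. In this sketch (ours, not the authors' code), `fit` and `predict` stand in both for the pilot estimator m̂_N and for the network trainer; the paper's choices are a uniformly consistent smoother for m̂_N and nonlinear least squares over Θ_H for the network:

```python
import random

def residual_bootstrap(X, Y, fit, predict, B=200, seed=0):
    """Residual-based bootstrap of section 2 (sketch).
    fit(X, Y) returns a model; predict(model, x) evaluates it."""
    rng = random.Random(seed)
    N = len(X)
    m_hat = fit(X, Y)                               # pilot estimator m-hat_N
    e_hat = [y - predict(m_hat, x) for x, y in zip(X, Y)]
    e_bar = sum(e_hat) / N
    e_tilde = [e - e_bar for e in e_hat]            # centered residuals (Freedman, 1981)
    models = []
    for _ in range(B):
        Xb = [rng.choice(X) for _ in range(N)]      # X*_t drawn with replacement
        Yb = [predict(m_hat, x) + rng.choice(e_tilde) for x in Xb]  # Y*_t
        models.append(fit(Xb, Yb))                  # retrain on bootstrap set
    return m_hat, models

def mse_star(x, m_hat, models, predict):
    """Bootstrap estimate mse*(x) = (1/B) sum_i (m-hat_N(x) - f(x, theta*_i))^2."""
    m0 = predict(m_hat, x)
    return sum((m0 - predict(m, x)) ** 2 for m in models) / len(models)
```

With a real network trainer plugged in for `fit`, `mse_star` reproduces the mse*(x) estimate above; any simpler estimator (e.g. a constant model) can be substituted to exercise the resampling logic on its own.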
A3. X₁, X₂, . . . are independent and identically distributed random vectors with density p(x). ε₁, ε₂, . . . are independent random variables and

E{εt | Xt = x} = 0,   E{εt² | Xt = x} = σε²(x) < ∞.

A4. We also impose some further technical assumptions to simplify our proofs. They could be relaxed considerably without changing the validity of our results. (i) σε²(x) is continuous and 0 < d ≤ σε²(x) ≤ D < ∞ for all x. (ii) There exist constants Cn such that E{|εt|ⁿ | Xt = x} ≤ Cn < ∞ is satisfied for all x, n. (iii) m is continuous.
If we assume in addition that P(‖Xt‖ > t) ≤ Cρᵗ for ρ < 1, then we could replace the boundedness of σε²(x) by a growth condition such as |σε²(x)| ≤ C‖x‖^M for M < ∞. In the definition of D₀(θ), the expectation is taken over Yt and Xt. However, there is a second natural candidate for an optimal parameter: that value of θ that provides the best approximation for a given realization of the input variables X₁, . . . , X_N. We consider

D_N(θ) = (1/N) Σ_{t=1}^{N} [(m(Xt) − fH(Xt, θ))² + σε²(Xt)],   (2.5)

and we define

θ_N = arg min_{θ∈Θ_H} D_N(θ).

(D_N(θ) and D̂_N(θ) are connected by the relation D_N(θ) = E{D̂_N(θ) | X₁, . . . , X_N}.) Instead of θ̂_N − θ₀, we consider its two components θ̂_N − θ_N and θ_N − θ₀ separately, which asymptotically are independent.
Theorem 1. Suppose that assumptions A1–A4 are satisfied. Let $m(x)$ denote the conditional expectation of $Y_t$ given $X_t = x$. Then, for $N \to \infty$,
$$\sqrt{N} \begin{pmatrix} \hat\theta_N - \theta_N \\ \theta_N - \theta_0 \end{pmatrix} \xrightarrow{d} \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}\right),$$
that is, $\sqrt{N}(\hat\theta_N - \theta_N)$ and $\sqrt{N}(\theta_N - \theta_0)$ are asymptotically independent normal random vectors with covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively, where
$$\Sigma_1 = A(\theta_0)^{-1} B_1(\theta_0) A(\theta_0)^{-1}, \qquad \Sigma_2 = A(\theta_0)^{-1} B_2(\theta_0) A(\theta_0)^{-1}$$
Bootstrapping Neural Networks
1935
with
$$B_1(\theta) = 4 \int \sigma_e^2(x)\, \nabla f_H(x, \theta)\, \nabla f_H(x, \theta)'\, p(x)\, dx,$$
$$B_2(\theta) = 4 \int \bigl(m(x) - f_H(x, \theta)\bigr)^2\, \nabla f_H(x, \theta)\, \nabla f_H(x, \theta)'\, p(x)\, dx,$$
and $A(\theta) = \nabla^2 D_0(\theta)$ as above. (See the appendix for the proof.)
As an immediate consequence, $\sqrt{N}(\hat\theta_N - \theta_0)$ is asymptotically normal with mean 0 and covariance matrix $\Sigma_1 + \Sigma_2$. In the correctly specified case, $\Sigma_2$ is equal to the zero matrix, as there is essentially no effect due to the randomness of the $X_t$'s; that is, the difference between $\theta_N$ and $\theta_0$ is asymptotically of smaller order than $N^{-1/2}$. In contrast, in the misspecified case, the randomness of the inputs causes a difference of order $N^{-1/2}$ between the two optimal parameters $\theta_N$ and $\theta_0$.
One way to prove that the bootstrap works asymptotically in a particular situation, where the limit distribution of the quantity of interest is known, is to show that the corresponding bootstrap quantity has the same asymptotic behavior. Let $\hat\theta_N^*$ be the weight vector after training the network from the bootstrap training set, that is, $\hat\theta_N^* = \arg\min_{\theta \in \Theta_H} \hat D_N^*(\theta)$, where
$$\hat D_N^*(\theta) \equiv \frac{1}{N} \sum_{t=1}^{N} \bigl(Y_t^* - f_H(X_t^*, \theta)\bigr)^2.$$
The bootstrap procedure of this section is based on the assumption that the distribution of $e_t$ does not depend on $X_t$. In this case, the analogon of $\theta_N$ is given by $\theta_N^* = \arg\min_{\theta \in \Theta_H} D_N^*(\theta)$, where
$$D_N^*(\theta) \equiv \frac{1}{N} \sum_{t=1}^{N} \Bigl[\bigl(\hat m_N(X_t^*) - f_H(X_t^*, \theta)\bigr)^2 + \hat\sigma_e^2\Bigr],$$
and $\hat\sigma_e^2 = \frac{1}{N} \sum_{t=1}^{N} \tilde e_t^2$ estimates $\sigma_e^2$. In contrast to $\theta_N$, $\theta_N^*$ can be calculated, as $\hat m_N$ and $\hat\sigma_e^2$ are known. To get the bootstrap version of $\theta_0$, we have to replace expectation with respect to the joint distribution of $X_t$ and $e_t$ by expectation with respect to the joint distribution of $X_t^*$ and $e_t^*$. Then we set $\theta_0^* = \arg\min_{\theta \in \Theta_H} D_0^*(\theta)$, where
$$D_0^*(\theta) \equiv E^*\bigl(Y_t^* - f_H(X_t^*, \theta)\bigr)^2 = \frac{1}{N} \sum_{t=1}^{N} \bigl(\hat m_N(X_t) - f_H(X_t, \theta)\bigr)^2 + \hat\sigma_e^2.$$
Note that $e_t^*$ and $X_t^*$ are independent and $E^* e_t^* = \frac{1}{N} \sum_t \tilde e_t = 0$. We have to impose an assumption on $\hat m_N$:
A5. $\hat m_N$ is uniformly consistent on the support of $p$, that is,
$$\sup_{x \in \mathrm{supp}(p)} \bigl\{|\hat m_N(x) - m(x)|\bigr\} = o_P(1).$$

This assumption is obviously realistic if $p$ is compactly supported on a sufficiently regular set (e.g., on a rectangle) and bounded away from 0 on this set. In the case of unbounded support, one could modify assumption A5 to a condition such as
$$\sup_{x \in \mathcal{X}_N} \bigl\{|\hat m_N(x) - m(x)|\bigr\} = o_P(1), \tag{2.6}$$
where $\mathcal{X}_N$ is an appropriate subset of $\mathbb{R}^d$ with the property $P(X_t \notin \mathcal{X}_N) = o(N^{-1})$. This is possible under the assumption that $P(\|X_t\| > N^\delta) = o(N^{-1})$ for some sufficiently small $\delta > 0$ (for details, see Franke, Kreiss, Mammen, & Neumann, 1998).
Then the bootstrap described above works by the following theorem, which shows that the conditional distribution of $\hat\theta_N^* - \theta_N^*$ and $\theta_N^* - \theta_0^*$, given the original data $(X_1, Y_1), \ldots, (X_N, Y_N)$ from which the bootstrap training sets are generated, coincides asymptotically with the distribution of $\hat\theta_N - \theta_N$ and $\theta_N - \theta_0$ as given in theorem 1.
Theorem 2. Suppose that $e_t$ is independent of $X_t$, $t = 1, \ldots, N$ (in particular, $\sigma_e^2(x) \equiv \sigma_e^2$), and assume assumptions A1–A5. Then,
$$\mathcal{L}\left(\sqrt{N} \begin{pmatrix} \hat\theta_N^* - \theta_N^* \\ \theta_N^* - \theta_0^* \end{pmatrix} \,\Big|\, (X_1, Y_1), \ldots, (X_N, Y_N)\right) \xrightarrow{d} \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}\right),$$
where this convergence holds in a uniform manner for $((X_1, Y_1), \ldots, (X_N, Y_N)) \in \mathcal{X}_N$ for a suitable sequence of sets $\mathcal{X}_N$ with $P(\mathcal{X}_N) \to 1$. (See the appendix for the proof.)

Remark. The bootstrap also works in the case of deterministic inputs $x_1, \ldots, x_N$ of the training set that are systematically selected by the experimenter. We only have to require that the inputs $x_1, \ldots, x_N$ behave, for increasing $N$, like a random sample up to a certain degree. In particular, we need
$$\frac{1}{N} \sum_{t=1}^{N} \bigl(m(x_t) - f_H(x_t, \theta)\bigr)^2 \xrightarrow[N \to \infty]{} \int \bigl(m(x) - f_H(x, \theta)\bigr)^2 p(x)\, dx$$
for some probability density $p(x)$. If the $x_t$ are, for example, equispaced over a fixed finite $d$-dimensional cube $C$, then the above condition is satisfied for a constant function $p(x) \equiv p_C$.
At the beginning of this section, we restricted the weight vector to a fundamental domain by requiring $v_1 \ge v_2 \ge \cdots \ge v_H \ge 0$ for sigmoid activation functions with $\psi(-x) = -\psi(x)$. This avoids the identifiability problems caused by the common symmetry properties of a feedforward neural network, independent of the function that we want to approximate. However, it may happen that for the optimal parameter vector $\theta$, certain weights of outgoing connections coincide, say, $v_h = v_{h+1}$. This situation may arise in practice if the function $m$ itself has certain symmetries related to the symmetries of $\psi$. In such a situation, the bootstrap breaks down if it is used for approximating the random fluctuations of the weights of connections going into the hidden units $h$ and $h+1$, provided one does not take additional precautions to guarantee identifiability. To illustrate the problem, let us consider $m(x) = \psi(1 + x) + \psi(x - 1)$ for the centered logistic function $\psi(u) = (1 + e^{-u})^{-1} - \frac{1}{2}$. $m$ is itself a network function for $H = 2$ hidden units with $w_{11} = w_{12} = 1$, $w_{01} = 1$, $w_{02} = -1$, $v_0 = 0$, $v_1 = v_2 = 1$, and $m$ is symmetric around 0. The estimate $\hat m_N(x)$ will have similar properties as in the case of identifiable parameters of the network. If we train the network repeatedly using independent bootstrap samples, we randomly get $\hat w_{01}^* \approx 1$, $\hat w_{02}^* \approx -1$ and $\hat w_{01}^* \approx -1$, $\hat w_{02}^* \approx 1$ with approximately equal probabilities. Therefore, the bootstrap estimate of the variance of those weights is too large, due to the nonidentifiability of the parameters of $m$ and the corresponding approximate nonidentifiability of the parameters of the network that provides the best approximation of $\hat m_N$. In practice, it is easy to detect such situations because they are characterized by almost identical bootstrap estimates of outgoing weights, say, $\hat v_h^*(i) \approx \hat v_{h+1}^*(i)$, $i = 1, \ldots, B$.

Remark. To be more precise, one can check whether the differences $\hat v_h^*(i) - \hat v_{h+1}^*(i)$, $i = 1, \ldots, B$, are small compared with the sample standard deviations of $\hat v_h^*(i)$ and $\hat v_{h+1}^*(i)$, $i = 1, \ldots, B$, and in such a case take additional precautions to make the parameterization of the network function by its weights unique.
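The check suggested in this remark can be sketched as follows. The threshold `tol` and the pooled scale are hypothetical choices made for this sketch, not taken from the text; the idea is only that the spread of the differences $\hat v_h^*(i) - \hat v_{h+1}^*(i)$ should be small relative to the spread of the weight estimates themselves.

```python
import numpy as np

def exchangeable_hidden_units(vh, vh1, tol=0.5):
    """Heuristic nonidentifiability check (hypothetical threshold `tol`):
    flag the pair of hidden units when the differences v*_h(i) - v*_{h+1}(i)
    vary little compared with the sample standard deviations of the two
    bootstrap weight sequences."""
    vh, vh1 = np.asarray(vh), np.asarray(vh1)
    diff_sd = np.std(vh - vh1, ddof=1)
    scale = 0.5 * (np.std(vh, ddof=1) + np.std(vh1, ddof=1))
    return diff_sd < tol * scale

rng = np.random.default_rng(1)
# Symmetric case: both outgoing weights fluctuate around the same value
# and move together across bootstrap replications -> should be flagged.
common = 1.0 + 0.2 * rng.standard_normal(100)
flagged = exchangeable_hidden_units(common + 0.001 * rng.standard_normal(100),
                                    common + 0.001 * rng.standard_normal(100))
# Distinct case: well-separated, independently varying weights -> not flagged.
ok = exchangeable_hidden_units(1.0 + 0.2 * rng.standard_normal(100),
                               -1.0 + 0.2 * rng.standard_normal(100))
```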
3 A Bootstrap Procedure for Input-Dependent Noise
The bootstrap procedure of section 2 can be applied only to situations where the noise $e_t$ does not depend on the input, that is, if it is additive in the sense of Murata, Yoshizawa, and Amari (1994). In their general model, $e_t = e_t(X_t)$ and, in particular, its variance $\sigma_e^2(X_t) = E\{e_t^2 \mid X_t\}$ depend on the input $X_t$, a common situation in many practical problems. Under that circumstance, one way to bootstrap function estimates is the "wild bootstrap" or "external bootstrap." Härdle (1990) and Härdle and Marron (1991) have discussed this in the context of nonparametric nonlinear regression, which we are considering here. For the connectionist regression estimate, it has the following form.
We generate independent and identically distributed random variables $g_1, \ldots, g_N$ with mean 0 and variance 1. Then, with $\hat e_t = Y_t - \hat m_N(X_t)$ as in section 2, we draw pairs $(X_t^*, \hat e_t^*)$ randomly from the set $\{(X_1, \hat e_1), \ldots, (X_N, \hat e_N)\}$, that is, for all $t = 1, \ldots, N$:
$$(X_t^*, \hat e_t^*) = (X_k, \hat e_k) \quad \text{with probability } \frac{1}{N}, \quad k = 1, \ldots, N.$$
We transform $\hat e_t^*$ randomly by multiplying it with $g_t$, which does not change the mean and the variance. Then we define bootstrap outputs as
$$Y_t^* = \hat m_N(X_t^*) + g_t \hat e_t^*, \quad t = 1, \ldots, N.$$
In contrast to the standard bootstrap, the bootstrap noise $e_t^* = g_t \cdot \hat e_t^*$ is generated in a manner depending on the bootstrap input $X_t^*$, to reflect the dependence of $e_t$ on the input in the original training set.
Let $\hat\theta_N^{WB}$ denote the wild bootstrap version of $\hat\theta_N$. Because now the distribution of the $e_t$ depends on $X_t$, the wild bootstrap analogon of $D_N(\theta)$ is given by
$$D_N^*(\theta) \equiv \frac{1}{N} \sum_{t=1}^{N} \Bigl[\bigl(\hat m_N(X_t^*) - f_H(X_t^*, \theta)\bigr)^2 + \hat e_t^2\Bigr],$$
and, correspondingly, the analogon of $D_0(\theta)$ is
$$D_0^*(\theta) \equiv E^*\bigl(Y_t^* - f_H(X_t^*, \theta)\bigr)^2 = \frac{1}{N} \sum_{t=1}^{N} \Bigl\{\bigl(\hat m_N(X_t) - f_H(X_t, \theta)\bigr)^2 + 2\bigl(\hat m_N(X_t) - f_H(X_t, \theta)\bigr) \hat e_t\, E g_t + \hat e_t^2\, E g_t^2\Bigr\} = \frac{1}{N} \sum_{t=1}^{N} \Bigl[\bigl(\hat m_N(X_t) - f_H(X_t, \theta)\bigr)^2 + \hat e_t^2\Bigr].$$
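The generation of one wild bootstrap training set can be sketched in a few lines; `m_hat`, the data-generating model, and all sizes are illustrative stand-ins, and standard normal multipliers are used as one admissible choice of $g_t$ with $E g_t = 0$, $E g_t^2 = 1$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative stand-ins: m_hat would be the preliminary (e.g., kernel)
# estimate, and the data come from a heteroscedastic model.
N = 300
X = rng.uniform(-1.0, 1.0, N)
m_hat = lambda x: 0.5 * x                  # hypothetical fitted m_hat_N
Y = 0.5 * X + (0.05 + 0.2 * np.abs(X)) * rng.standard_normal(N)
e_hat = Y - m_hat(X)                       # residuals e_hat_t

# Wild bootstrap: resample (X, e_hat) as PAIRS, so the input-dependence of
# the noise is preserved, then perturb each resampled residual by an
# independent g_t, which keeps its mean and variance.
idx = rng.integers(0, N, N)                # joint resampling of pairs
X_star = X[idx]
g = rng.standard_normal(N)                 # multipliers g_t
Y_star = m_hat(X_star) + g * e_hat[idx]
```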
They differ from the $D_N^*(\theta)$, $D_0^*(\theta)$ of section 2 only by terms not depending on $\theta$. Therefore, the wild bootstrap versions $\theta_N^{WB}$, $\theta_0^{WB}$ of $\theta_N$, $\theta_0$ coincide with the $\theta_N^*$, $\theta_0^*$ of section 2. Analogously to theorem 2, we obtain consistency of the bootstrap.

Theorem 3. Suppose that assumptions A1–A5 are fulfilled. Then,
$$\mathcal{L}\left(\sqrt{N} \begin{pmatrix} \hat\theta_N^{WB} - \theta_N^{WB} \\ \theta_N^{WB} - \theta_0^{WB} \end{pmatrix} \,\Big|\, (X_1, Y_1), \ldots, (X_N, Y_N)\right) \xrightarrow{d} \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix}\right),$$
where this convergence holds in a uniform manner for $((X_1, Y_1), \ldots, (X_N, Y_N)) \in \mathcal{X}_N$ for a suitable sequence of sets $\mathcal{X}_N$ with $P(\mathcal{X}_N) \to 1$. (See the appendix for the proof.)
Remark. Another popular resampling method for regression models, already discussed by Freedman (1981) for linear regression, is the pairwise bootstrap. It is not based on the sample residuals but generates the bootstrap training set directly from the original training set: $(X_1^{*\prime}, Y_1^{*\prime}), \ldots, (X_N^{*\prime}, Y_N^{*\prime})$ are independently drawn from $(X_1, Y_1), \ldots, (X_N, Y_N)$ and, for all $t$:
$$(X_t^{*\prime}, Y_t^{*\prime}) = (X_k, Y_k) \quad \text{with probability } \frac{1}{N}, \quad k = 1, \ldots, N.$$
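The pairwise resampling step is the simplest of the three schemes and can be sketched as follows (all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Pairwise bootstrap: draw whole (X, Y) pairs with replacement, each
# original pair having probability 1/N; no residuals are needed.
N = 100
X = rng.uniform(-1.0, 1.0, (N, 2))         # illustrative input sample
Y = X[:, 0] ** 2 + 0.1 * rng.standard_normal(N)

k = rng.integers(0, N, N)                  # indices drawn uniformly from 1..N
X_star, Y_star = X[k], Y[k]                # bootstrap pairs (X'_t*, Y'_t*)

# Ties: with N draws from N items, repeated pairs occur with high
# probability (on average only about N(1 - 1/e) distinct items survive).
n_distinct = np.unique(k).size
```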
For the pairwise bootstrap, theorem 2 holds, too (see Sarishvili, 1998). The proof of this result follows more or less the same lines as for the residual-based bootstrap, and therefore we do not discuss it in detail. Like the wild bootstrap, it may also be applied to input-dependent noise. However, in a small simulation study, the pairwise bootstrap performed worse than the wild bootstrap; that may be a feature of the particular examples studied there (see Sarishvili, 1998). By replacing $\hat e_k$ by its definition, the wild bootstrap may be considered a version of the pairwise bootstrap in which some additional noise is introduced to $Y_t^*$ to resolve the ties that, with large probability, are present in the original pairwise bootstrap resample. Here, $Y_t^*$ is modified in such a manner that the conditional expectation of the output given the input remains approximately unchanged.

4 Two Numerical Examples
In this section we present the results of a small simulation study to illustrate the performance of the common residual-based bootstrap and the wild bootstrap. In both cases, we consider a feedforward neural network with one hidden layer and complete connections. As the sigmoid activation function, we choose the centered logistic function,
$$\psi(x) = \frac{1}{1 + e^{-x}} - \frac{1}{2}.$$
The generation of the data and the training of the network have been done using GAUSS 3.1, where we used the BFGS method as a batch-mode algorithm for determining the network weights.
As a first example, we consider as the true function to be estimated $m(x_1, x_2) = (1 - x_1^2) + x_1/2$, which is shown in Figure 1. We approximate $m$ by a network with two hidden units, that is,
$$f(x, \theta) = v_0 + \sum_{h=1}^{2} v_h\, \psi(w_{0h} + w_{1h} x_1 + w_{2h} x_2).$$
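For concreteness, this two-hidden-unit network function can be written as a short Python helper; this is a sketch, with the parameter ordering $\theta = (w_{01}, w_{11}, w_{21}, w_{02}, w_{12}, w_{22}, v_0, v_1, v_2)'$ matching the nine-dimensional vector estimated in this example.

```python
import numpy as np

def psi(x):
    """Centered logistic activation: psi(x) = 1/(1 + e^{-x}) - 1/2."""
    return 1.0 / (1.0 + np.exp(-x)) - 0.5

def f(x, theta):
    """Network function with two hidden units,
    f(x, theta) = v0 + sum_h v_h psi(w_0h + w_1h x1 + w_2h x2),
    where theta = (w01, w11, w21, w02, w12, w22, v0, v1, v2)."""
    w = np.asarray(theta[:6]).reshape(2, 3)       # rows: (w_0h, w_1h, w_2h)
    v0, v1, v2 = theta[6:]
    x = np.atleast_2d(x)                          # shape (n_points, 2)
    hidden = psi(w[:, 0] + x @ w[:, 1:3].T)       # shape (n_points, 2)
    return v0 + hidden @ np.array([v1, v2])
```

Note that psi is odd, $\psi(-x) = -\psi(x)$, which is exactly the symmetry used in the identifiability discussion above.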
1940
Jurgen ¨ Franke and Michael H. Neumann
Figure 1: True regression function m(x).
Figure 2: Best approximation of m(x) by a network with (a) two hidden units and (b) approximation error.

Figure 2a shows the best approximation of $m$ by $f(x, \theta_0)$, where
$$\theta_0 = \arg\min_{\theta} \int_{-1}^{1} \int_{-1}^{1} |m(x) - f(x, \theta)|^2\, dx,$$
corresponding to inputs $X_t = (X_{t1}, X_{t2})$, which are uniformly distributed over the square $[-1, 1] \times [-1, 1]$. The difference between $m$ and $f(\cdot, \theta_0)$ is displayed in Figure 2b. We have to estimate the nine-dimensional parameter vector $\theta_0 = (w_{01}, w_{11}, w_{21}, w_{02}, w_{12}, w_{22}, v_0, v_1, v_2)'$. We use the training set $(X_1, Y_1), \ldots, (X_N, Y_N)$ with sample size $N = 100$, which obeys the regression model $Y_t = m(X_t) + e_t$, where $X_t \sim \mathrm{Unif}([-1, 1]^2)$ and $e_t \sim \mathcal{N}(0, (0.1)^2)$ are all independent. To approximate the true distribution of $\hat\theta_N - \theta_0$, we carried out 500 Monte Carlo runs.
To find a typical bootstrap distribution $\mathcal{L}(\hat\theta_N^* - \theta_0^* \mid (X_1, Y_1), \ldots, (X_N, Y_N))$, we selected that sample out of 500 randomly generated samples for which the difference between the loss $\|\hat\theta_N - \theta_0\|^2$ and the mean of these losses, taken over all 500 samples, was minimized. For this sample, the estimation error is neither particularly large nor particularly small, and therefore it appears to exhibit a typical amount of randomness. Based on this one typical sample, we constructed 500 bootstrap samples $((X_1^*(i), Y_1^*(i)), \ldots, (X_{100}^*(i), Y_{100}^*(i)))$, $i = 1, \ldots, 500$, according to the description in section 2. As a preliminary estimate $\hat m_N$ for $m$, we used a Nadaraya-Watson kernel estimator with bivariate gaussian kernel and bandwidths $h_1 = 0.2$ and $h_2 = 1.0$. Other preliminary estimates could be considered as well. For each of the bootstrap training sets, we calculated the bootstrap estimate $\hat\theta_{N,i}^*$,
Figure 3: Estimation error of input weight w21 for sample size 100: true probability density (solid line), its bootstrap approximation (dashed line), and its asymptotic normal approximation (dashed-dotted line).

which leads to an estimate of $\mathcal{L}(\hat\theta_N^* - \theta_0^* \mid (X_1, Y_1), \ldots, (X_N, Y_N))$. Finally, we also consider the asymptotic limit distribution of $\sqrt{N}(\hat\theta_N - \theta_0)$ as given by White (1989a).
Figure 3 shows estimates of the densities of $\mathcal{L}(\hat w_{21} - w_{21})$ (solid line) and of $\mathcal{L}(\hat w_{21}^* - w_{21}^* \mid (X_1, Y_1), \ldots, (X_N, Y_N))$ based on the most typical sample (dashed line), as well as the normal approximation $\mathcal{N}(0, (\Sigma_1 + \Sigma_2)_{33}/N)$ (dashed-dotted line). Generally, we observed a surprisingly good approximation of the true distributions by the bootstrap and the normal approximations. For the other network weights, we got similar results, which seems to indicate that the bootstrap of section 2 works reasonably well even for moderate sample sizes. Of course, larger sample sizes may be necessary for more irregular target functions or higher noise levels.
As the second example, we consider heteroscedastic residuals $e_t$. In this case, neither the standard asymptotics of White (1989a) nor the residual-based bootstrap of section 2 are applicable. To facilitate the graphical representation of the results, we consider a one-dimensional input $x$. As the true function to be estimated, we choose the bump function
$$m(x) = \frac{1}{2} x + \frac{2}{3} \varphi(8x), \quad -1 \le x \le 1,$$
where $\varphi$ denotes the density of the standard normal distribution. As a training set, we use independent and identically distributed $(X_1, Y_1), \ldots, (X_N, Y_N)$ with $N = 100$, where $X_1, \ldots, X_N$ are uniformly distributed over $[-1, 1]$
Figure 4: Data, true regression function m(x), and the initial kernel estimate of m(x).
and $Y_t = m(X_t) + e_t$ with independent zero-mean gaussian residuals $e_1, \ldots, e_N$ with standard deviation
$$\bigl(E\{e_t^2 \mid X_t\}\bigr)^{1/2} = s(X_t) = \left(0.01 + \frac{1}{5}\left(\frac{1}{2} + m(X_t)\right)\right)^{1/2};$$
that is, the variance of a residual is large if the function value to be estimated is large, which is a typical heteroscedastic situation. As the basis for the wild bootstrap, we consider a typical sample, selected in a similar manner as in the first example. Figure 4 shows these data, the true function $m$, and the Nadaraya-Watson kernel estimate $\hat m_N$ with bandwidth 0.07.
We approximate $m(x)$ by the network function
$$f(x, \theta) = v_0 + \sum_{h=1}^{3} v_h\, \psi(w_{0h} + w_{1h} x),$$
which provides quite a good fit for appropriately chosen weights. We train the corresponding network with three neurons in its hidden layer and consider the estimates $f(x_i, \hat\theta_N)$ at the points $x_i = -1 + i/50$, $i = 0, \ldots, 100$. We are interested in confidence intervals for $m(x_i)$, $i = 0, \ldots, 100$. Figure 5a shows the true 90% confidence intervals, joined to form a band, together with the true function $m$, based on a Monte Carlo simulation with 200 independent runs.
Based on the one typical sample, we approximated the distribution of $f(x_i, \hat\theta_N)$ by means of the wild bootstrap with standard normally distributed $g_t$. Using 500 bootstrap replications, we determined 90% confidence intervals as above. Figures 5b and 5c show these bootstrap confidence intervals together with the kernel estimate $\hat m_N$ and the true function $m$, respectively. For better comparability, Figure 5d shows the true confidence band (solid lines) and the bootstrap confidence bands (dashed lines) in one plot.

Figure 5: 90% confidence intervals for the regression function m(x). (a) True confidence bands (solid lines) together with the true function m (dashed lines). (b) Wild bootstrap confidence bands (solid lines) together with the initial estimate of m (dashed lines). (c) Wild bootstrap confidence bands (solid lines) together with the true function m (dashed lines). (d) True (solid lines) and bootstrap (dashed lines) confidence bands for m.
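One common way to turn bootstrap replications into pointwise bands is to take empirical quantiles at each grid point. The sketch below does this on synthetic stand-in values; in the example above, the array of bootstrap function values would instead hold the 500 wild-bootstrap refits $f(x_i, \hat\theta_N^{WB}(b))$, and the percentile construction itself is an assumption, since the text does not spell out which interval construction is used.

```python
import numpy as np

rng = np.random.default_rng(4)

# Grid x_i = -1 + i/50, i = 0..100, as in the example.
xs = -1.0 + np.arange(101) / 50.0
B = 500

# Synthetic stand-ins: `center` mimics f(x_i, theta_hat_N) and `boot_vals`
# mimics the B bootstrap refits evaluated on the grid.
center = 0.5 * xs
boot_vals = center + 0.1 * rng.standard_normal((B, xs.size))

# Pointwise 90% band: 5% and 95% empirical quantiles at each x_i.
lower = np.quantile(boot_vals, 0.05, axis=0)
upper = np.quantile(boot_vals, 0.95, axis=0)
```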
The wild bootstrap captures the heteroscedasticity of the data quite nicely, and it provides a good approximation to the true confidence intervals. There are only two problematic areas, as can be seen from Figure 5d: at the right end and around the peak. The former defect is easily explained: for ease of calculation, we have not corrected the initial kernel estimate $\hat m_N$ for the well-known boundary effects. Using boundary kernels would immediately improve the estimate $\hat m_N$ and the bootstrap confidence intervals around the boundary. The fact that the upper limits of the bootstrap confidence intervals around the peak are too large is due to pure chance. Looking at Figure 4, it can be seen that all the residuals (with one exception) around the peak happen to be positive and rather large, and they draw the peak of $\hat m_N$ and the corresponding upper bootstrap confidence limit upward. Apart from these explainable effects, the wild bootstrap provides a good method to quantify the reliability of neural network function estimates in the presence of heteroscedasticity. Moreover, the wild bootstrap does not assume any knowledge about the particular form of the dependence of the variability of $e_t$ on $X_t$, as, for example, a classical parametric asymptotic approach similar to White (1989a) but using nonlinear weighted least squares would have to.
We stress that the bootstrap error bars of Figure 5 are based on an asymptotic approximation for the distribution of the estimation error $\hat\theta_N - \theta_0$. Therefore, they are approximate confidence bands for the optimal network function $f_H(x, \theta_0)$, not for the true conditional expectation function $m(x)$. If, however, $H$ is large enough that $f_H(x, \theta_0)$ approximates $m(x)$, the bootstrap error bars may also be regarded as confidence bands for $m(x)$. In practice, the number $H$ of hidden neurons should be selected rather larger for the purpose of calculating bootstrap error bars than for the purpose of just getting an estimate of the function $m(x)$.
A precise theoretical formulation backing this guideline, which would have to include asymptotics for $H \to \infty$, is still lacking, but one could appeal to the analogous case of estimating $m(x)$ nonparametrically by kernel smoothing. There, for calculating confidence bands, it is advantageous to undersmooth (i.e., to overfit) compared with the optimal degree of smoothing for estimating the function (Hall, 1992).

Appendix: Proofs

A.1 Proof of Theorem 1. (i) Using Bernstein's inequality in conjunction with a simple truncation argument (which is possible by assumption A4.ii), it is easy to see that for all $\delta > 0$ and $\lambda < \infty$,
$$P\Bigl(|\hat D_N(\theta) - D_N(\theta)| + |D_N(\theta) - D_0(\theta)| > N^{\delta - \frac{1}{2}}\Bigr) = O(N^{-\lambda}) \tag{A.1}$$
uniformly in $\theta \in \Theta_H$, using assumptions A1, A3, and A4. Since $\hat D_N$, $D_N$, and $D_0$ are continuous in $\theta$, we obtain, by showing that the above result
holds simultaneously on a sequence of increasingly fine grids $\Theta_N \subseteq \Theta_H$,
$$\sup_{\theta \in \Theta_H} \bigl\{|\hat D_N(\theta) - D_N(\theta)| + |D_N(\theta) - D_0(\theta)|\bigr\} = o_p(1). \tag{A.2}$$
Since $\theta_0$ is the unique minimum point of $D_0(\theta)$, we obtain
$$|\hat\theta_N - \theta_N| + |\theta_N - \theta_0| = o_p(1). \tag{A.3}$$
Hence, by assumptions A1 and A2, with increasing probability $\hat\theta_N$ and $\theta_N$ are interior points of $\Theta_H$; that is, we have in particular that
$$\nabla \hat D_N(\hat\theta_N) = \nabla D_N(\theta_N) = 0,$$
with probability converging to 1 for $N \to \infty$.
(ii) Hence, with probability tending to 1,
$$0 = \nabla D_N(\theta_N) - \nabla D_N(\theta_0) + \nabla D_N(\theta_0) = \nabla^2 D_0(\theta_0)(\theta_N - \theta_0) + \frac{2}{N} \sum_{t=1}^{N} \nabla f_H(X_t, \theta_0)\bigl(f_H(X_t, \theta_0) - m(X_t)\bigr) + R_1, \tag{A.4}$$
where
$$R_1 = \nabla D_N(\theta_N) - \nabla D_N(\theta_0) - \nabla^2 D_N(\theta_0)(\theta_N - \theta_0) + \bigl[\nabla^2 D_N(\theta_0) - \nabla^2 D_0(\theta_0)\bigr](\theta_N - \theta_0) = o_p(\|\theta_N - \theta_0\|)$$
by assumptions A1 and A3. As
$$2 E\bigl\{\nabla f_H(X_t, \theta_0)\bigl(f_H(X_t, \theta_0) - m(X_t)\bigr)\bigr\} = \nabla D_0(\theta_0) = 0,$$
the middle term of equation A.4 is $O_P(N^{-1/2})$. Thus, we get
$$\nabla^2 D_0(\theta_0)(\theta_N - \theta_0) + o_p(\|\theta_N - \theta_0\|) = O_p(N^{-\frac{1}{2}}),$$
which implies, since $\nabla^2 D_0(\theta_0)$ is positive definite, that
$$\|\theta_N - \theta_0\| = O_p(N^{-\frac{1}{2}}).$$
Inserting this into equation A.4, we obtain
$$\theta_N - \theta_0 = -\bigl(\nabla^2 D_0(\theta_0)\bigr)^{-1} \frac{2}{N} \sum_{t=1}^{N} \nabla f_H(X_t, \theta_0)\bigl(f_H(X_t, \theta_0) - m(X_t)\bigr) + o_p(N^{-\frac{1}{2}}). \tag{A.5}$$
(iii) Analogously to the above calculations, we have with probability tending to 1 that
$$0 = \nabla \hat D_N(\hat\theta_N) = \nabla D_N(\hat\theta_N) - \frac{2}{N} \sum_{t=1}^{N} e_t \nabla f_H(X_t, \hat\theta_N) = \nabla^2 D_0(\theta_0)(\hat\theta_N - \theta_N) - \frac{2}{N} \sum_{t=1}^{N} \nabla f_H(X_t, \theta_0)\, e_t + R_2, \tag{A.6}$$
where, as $\nabla D_N(\theta_N) = 0$ with probability tending to 1,
$$R_2 = \nabla D_N(\hat\theta_N) - \nabla D_N(\theta_N) - \nabla^2 D_0(\theta_0)(\hat\theta_N - \theta_N) + \frac{2}{N} \sum_{t=1}^{N} e_t \bigl\{\nabla f_H(X_t, \theta_0) - \nabla f_H(X_t, \hat\theta_N)\bigr\} = o_p(\|\hat\theta_N - \theta_N\|) + O_p(\|\hat\theta_N - \theta_0\| N^{-\frac{1}{2}}) = o_p(\|\hat\theta_N - \theta_N\|).$$
That means
$$\nabla^2 D_0(\theta_0)(\hat\theta_N - \theta_N) + o_p(\|\hat\theta_N - \theta_N\|) = O_p(N^{-\frac{1}{2}}),$$
which implies
$$\|\hat\theta_N - \theta_N\| = O_p(N^{-\frac{1}{2}}),$$
and therefore,
$$\hat\theta_N - \theta_N = \bigl(\nabla^2 D_0(\theta_0)\bigr)^{-1} \frac{2}{N} \sum_{t=1}^{N} \nabla f_H(X_t, \theta_0)\, e_t + o_p(N^{-\frac{1}{2}}). \tag{A.7}$$
The assertion now follows from equations A.5 and A.7 by a multivariate central limit theorem for functions of independent and identically distributed random vectors. In particular, the asymptotic means of $\hat\theta_N - \theta_N$ and $\theta_N - \theta_0$ vanish, as $E\{e_t \mid X_t = x\} = 0$.

A.2 Proof of Theorem 2. This proof is very similar to that of theorem 1. Since $\hat D_N^*(\theta)$, $D_N^*(\theta)$, and $D_0^*(\theta)$ converge uniformly to $D_0(\theta)$, one can easily show that
$$|\hat\theta_N^* - \theta_0| + |\theta_N^* - \theta_0| + |\theta_0^* - \theta_0| = o_p(1). \tag{A.8}$$
Then we obtain, in complete analogy to the previous proof, that
$$\mathcal{L}\left(\sqrt{N} \begin{pmatrix} \hat\theta_N^* - \theta_N^* \\ \theta_N^* - \theta_0^* \end{pmatrix} \,\Big|\, (X_1, Y_1), \ldots, (X_N, Y_N)\right) \approx_d \mathcal{N}\left(0, \begin{pmatrix} \Sigma_1^* & 0 \\ 0 & \Sigma_2^* \end{pmatrix}\right),$$
where $\Sigma_1^*$ and $\Sigma_2^*$ are analogous to $\Sigma_1$ and $\Sigma_2$, respectively; that is,
$$\Sigma_1^* = A^*(\theta_0^*)^{-1} B_1^*(\theta_0^*) A^*(\theta_0^*)^{-1}, \qquad \Sigma_2^* = A^*(\theta_0^*)^{-1} B_2^*(\theta_0^*) A^*(\theta_0^*)^{-1},$$
with $A^*(\theta) = \nabla^2 D_0^*(\theta)$ and
$$B_1^*(\theta) = \frac{4}{N} \sum_{t=1}^{N} \nabla f_H(X_t, \theta)\, \nabla f_H(X_t, \theta)'\, \hat\sigma_e^2,$$
$$B_2^*(\theta) = \frac{4}{N} \sum_{t=1}^{N} \bigl(\hat m_N(X_t) - f_H(X_t, \theta)\bigr)^2\, \nabla f_H(X_t, \theta)\, \nabla f_H(X_t, \theta)'.$$
Because of equation A.8 and assumption A5, we obtain $\Sigma_i^* = \Sigma_i + o_P(1)$, which finishes the proof.

A.3 Proof of Theorem 3. This proof is completely analogous to that of theorem 2 and requires only simple modifications. For example, one has to apply a multivariate central limit theorem for independent but not necessarily identically distributed observations. The fact that $\mathcal{L}^*(e_t^*)$ is not a consistent approximation to $\mathcal{L}(e)$ does not matter, since the covariance of a term as in equation A.7 depends on a weighted sum of the individual variances of the $e_t$.

References

Baxt, W. G., & White, H. (1995). Bootstrapping confidence intervals for clinical input variable effects in a network trained to identify the presence of acute myocardial infarction. Neural Computation, 7, 624–638.
Bunke, O., Droge, B., & Polzehl, J. (1995). Model selection, transformations and variance estimation in nonlinear regression (Discussion Paper No. 52). Berlin: Humboldt University.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist., 7, 1–26.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Franke, J., Kreiss, J.-P., & Mammen, E. (1997). Bootstrap of kernel smoothing in nonlinear time series. Unpublished report, University of Kaiserslautern.
Franke, J., Kreiss, J.-P., Mammen, E., & Neumann, M. H. (1998). Properties of the nonparametric autoregressive bootstrap (Discussion Paper No. 54/98). Berlin: Humboldt University.
Freedman, D. A. (1981). Bootstrapping regression models. Ann. Statist., 9, 1218–1228.
Hall, P. (1992). The bootstrap and Edgeworth expansion. Berlin: Springer-Verlag.
Härdle, W. (1990). Applied nonparametric regression. Cambridge: Cambridge University Press.
Härdle, W., & Bowman, A. (1988). Bootstrapping in nonparametric regression: Local adaptive smoothing and confidence bands. J. Amer. Statist. Assoc., 83, 102–110.
Härdle, W., & Marron, J. S. (1991). Bootstrap simultaneous error bars for nonparametric regression. Ann. Statist., 13, 1465–1481.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Huet, S., & Jolivet, E. (1989). Exactitude au second ordre des intervalles de confiance bootstrap pour les paramètres d'un modèle de régression non linéaire. C.R. Acad. Sci. Paris Sér. I Math., 308, 429–432.
Huet, S., Jolivet, E., & Messean, A. (1990). Some simulation results about confidence intervals and bootstrap methods in nonlinear regression. Statistics, 21, 369–432.
Murata, N., Yoshizawa, S., & Amari, S. (1994). Network information criterion—Determining the number of hidden units for an artificial neural network model. IEEE Trans. Neural Networks, 5, 865–872.
Neumann, M. H., & Kreiss, J.-P. (1998). Regression-type inference in nonparametric autoregression. Ann. Statist., 26, 1570–1613.
Refenes, A.-P. N., Zapranis, A. D., & Utans, J. (1996). Neural model identification, variable selection and model adequacy. In A. Weigend et al. (Eds.), Neural networks in financial engineering. Singapore: World Scientific.
Rüger, S., & Ossen, A. (1997). The metric structure of weightspace. Unpublished manuscript. Berlin: Technical University of Berlin.
Sarishvili, A. (1998). Das paarweise Bootstrap für neuronale Netze. Diploma thesis, University of Kaiserslautern.
Shao, J., & Tu, D. (1995). The jackknife and bootstrap. Berlin: Springer-Verlag.
White, H. (1989a). Some asymptotic results for learning in single hidden-layer feedforward network models. J. Amer. Statist. Assoc., 84, 1008–1013.
White, H. (1989b). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425–464.
White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3, 535–550.

Received February 16, 1999; accepted September 18, 1999.
LETTER
Communicated by Shun-ichi Amari
Information Geometry of Mean-Field Approximation

Toshiyuki Tanaka
Department of Electronics and Information Engineering, Tokyo Metropolitan University, Tokyo 192-0397, Japan

I present a general theory of mean-field approximation based on information geometry and applicable not only to Boltzmann machines but also to wider classes of statistical models. Using perturbation expansion of the Kullback divergence (or Plefka expansion in statistical physics), a formulation of mean-field approximation of general orders is derived. It includes in a natural way the "naive" mean-field approximation and is consistent with the Thouless-Anderson-Palmer (TAP) approach and the linear response theorem in statistical physics.

1 Introduction
Estimating model parameters from data is a central issue in various inference and learning problems. One usually considers a parametric model, which is assumed to have generated a given data set, and then casts the problem into a statistical estimation problem. This problem can be regarded as a "backward" problem, because "causes," or the parameters of the model, should be estimated from their "effects," or data. The "backward" problem is often hard to solve directly, and in some cases the corresponding "forward" problem is used to help solve the "backward" problem via an error feedback scheme. Boltzmann machine learning by Ackley, Hinton, and Sejnowski (1985) is one such example.
The forward problem, evaluating statistical properties of data from model parameters, can be solved in principle by simulating the data-generating process. In the case of Boltzmann machines, it can be done by the Gibbs sampler (Geman & Geman, 1984). However, in practical situations, solving the forward problem may still be computationally expensive, as in the Gibbs sampler, so that some approximation schemes are needed in such cases. Mean-field approximation for Boltzmann machines is one such approximation.
Mean-field theory, on which mean-field approximation is based, has been developed in the field of statistical mechanics, and some refinements, such as the so-called TAP approach (Thouless, Anderson, & Palmer, 1977) and the linear response theorem (Parisi, 1988), have been provided there. In an information processing context, mean-field approximation has been applied to Boltzmann machines (Peterson & Anderson, 1987) as well as other models. Also, it is now a common understanding that its information-theoretic interpretation is given as a minimization of Kullback-Leibler information divergence (Tanaka, 1996; Saul & Jordan, 1996; Hofmann & Buhmann, 1997). However, most of these information-theoretic considerations are on the "naive" mean-field approximation. There are attempts to apply advanced mean-field theory in statistical physics to statistical models, such as Boltzmann machines (Galland, 1993; Kappen & Rodríguez, 1998a, 1998b), Potts spin systems (Hofmann & Buhmann, 1997), the expectation-maximization algorithm (Dunmur & Titterington, 1997), and so on. I believe this article, employing information geometry (Amari, 1985; Amari & Nagaoka, 1998), gives the first unified and fully consistent treatment applicable not only to Boltzmann machines but also to wider classes of statistical models.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 1951–1968 (2000).

2 Problem Formulation
To avoid mathematical subtleties, I present a formulation in the case of binary, or Ising, variables. But to keep the discussion as general as possible, I will not use the fact that the variables are indeed binary unless I explicitly state so. Thus, the following arguments can be extended to more general cases in a straightforward way.
Let us consider $N$ random variables $s_i \in \{-1, 1\}$, $i = 1, \ldots, N$, and let $s = (s_1, \ldots, s_N) \in S = \{-1, 1\}^N$. I consider the following set $M$ of probabilities, which I will call a model:
$$M = \left\{ p(s) = \exp\left( \sum_i \theta_i s_i + \sum_{\langle ij \rangle} \theta_{ij} s_i s_j - \psi \right) \right\}. \tag{2.1}$$
The notation $\langle ij \rangle$ means that the summation is taken over all distinct pairs. The model $M$ can be seen as the set of Boltzmann-Gibbs distributions realized by Boltzmann machines parameterized by connections $\{\theta_{ij}\}$ and biases $\{\theta_i\}$. An essential point is that $M$ constitutes an exponential family on the product set $S$, with the product measure $d\lambda(s) = \prod_{i=1}^{N} [\delta(s_i + 1) + \delta(s_i - 1)]\, ds_i$ being the dominating measure. Several other models, such as gaussian models, gaussian mixture models, and Markov models, can also be cast into the framework by considering an appropriate product set and a product measure in place of $S$ and the dominating measure $d\lambda$.
Now I introduce several information-geometrical concepts (Amari, 1985; Amari & Nagaoka, 1998). A set $(\theta_1, \ldots, \theta_N, \theta_{12}, \ldots, \theta_{N-1,N})$ of the parameters can be seen as a coordinate system of $M$, because specifying the set of the parameters determines a single element $p$ of $M$. $M$ can be regarded as a manifold by this coordinate system. The set of parameters constituting this coordinate system is called the canonical parameters of that exponential family. In information-geometrical literature, another coordinate system also plays an important role. It is the expectation parameters $(\eta_1, \ldots, \eta_N, \eta_{12}, \ldots, \eta_{N-1,N})$, which are defined by the expectations of the random variables $s_i$ and $s_i s_j$ with respect to $p$; that is, $\eta_i = \langle s_i \rangle_p$ and $\eta_{ij} = \langle s_i s_j \rangle_p$, where $\langle \cdot \rangle_p$ denotes expectation with respect to $p$. They can also be used to specify $p$ uniquely, hence serving as another coordinate system.
The problem treated here is stated as follows: Estimate the values of the expectation parameters of a target distribution $q$ by knowing its canonical parameters $h = (h_i) = (\theta_i(q))$ and $w = (w_{ij}) = (\theta_{ij}(q))$. How difficult this problem is depends on the model one is concerned with. For gaussian models, the problem is easy, because it can be solved by simple, straightforward algebraic calculations. For Boltzmann machines or Ising spin models, on the other hand, it is computationally hard to solve exactly, and some approximation schemes, such as mean-field approximation, are needed.

3 Solution by Mean-Field Approximation

3.1 Factorizable Submodel. If, for $p \in M$, $\theta_{ij}(p) = 0$ holds for all $\langle ij \rangle$,
p is said to be factorizable, because it can be expressed as the following factorized form: p (s) D
N Y i D1
exp(h i si ¡ yi ).
(3.1)
For such a $p$, $\langle s_i s_j \rangle_p = \langle s_i \rangle_p \langle s_j \rangle_p$ holds for any $i$ and $j$, $i \neq j$; that is, all elements of $s$ are statistically independent. Let $F_0$ denote the set containing all the factorizable distributions $p$. I will call the set $F_0$ the factorizable submodel of $M$. I assume that on $F_0$, the problem can be easily solved. This is in fact true for Boltzmann machines, because when all $\theta_{ij}$s are zero, one can compute all the expectation parameters of a target $q \in F_0$ by

$$\eta_i(q) = \tanh \theta_i(q) \quad (3.2)$$

and

$$\eta_{ij}(q) = \eta_i(q)\, \eta_j(q). \quad (3.3)$$
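As a quick sanity check (my own illustration, not part of the article; all names are arbitrary), equations 3.2 and 3.3 can be verified against brute-force enumeration for a small factorizable Boltzmann machine:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 4
theta = rng.normal(size=N)     # biases of a factorizable target (all theta_ij = 0)

# Exact expectation parameters by enumeration over the 2^N states.
states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
logits = states @ theta
p = np.exp(logits - logits.max())
p /= p.sum()
eta_exact = p @ states                  # eta_i = <s_i>

eta_closed = np.tanh(theta)             # equation 3.2
print(np.max(np.abs(eta_exact - eta_closed)))          # agreement up to rounding

# Equation 3.3: pairwise expectations factorize under independence.
eta_pair = (states[:, 0] * states[:, 1]) @ p
print(eta_pair, eta_closed[0] * eta_closed[1])
```

Enumeration is of course exponential in $N$; it serves here only as ground truth for tiny models.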
3.2 Dual Foliations of Model. For analyzing mean-field approximation, it is convenient to introduce a structure of orthogonal dual foliations of $M$. Let $F = \{F(w)\}$ be a foliation parameterized by $w = (w_{12}, \ldots, w_{N-1,N})$ such that

$$M = \bigcup_w F(w) \quad (3.4)$$

with a leaf $F(w)$ defined as

$$F(w) = \{ p(s) \mid \theta_{ij}(p) = w_{ij} \text{ for all } i, j \}; \quad (3.5)$$
1954
Toshiyuki Tanaka
that is, a set of distributions $p$ that have $\theta_{ij}(p) = w_{ij}$ as their common parameters. The factorizable submodel $F_0$ described above is identical to the leaf $F(0)$. Each leaf $F(w)$ is an $N$-dimensional submanifold of $M$ and also an exponential family by itself, with $(\theta_1, \ldots, \theta_N)$ and $(\eta_1, \ldots, \eta_N)$ the canonical and expectation parameters, respectively. On each leaf $F(w)$, a pair of dual potentials $\tilde{\psi}$ and $\tilde{\varphi}$ is defined by

$$\tilde{\psi}(p) \equiv \psi = \log \int_S \exp\Biggl( \sum_{i=1}^{N} \theta_i(p)\, s_i + \sum_{\langle ij \rangle} w_{ij} s_i s_j \Biggr) d\lambda(s) \quad (3.6)$$

and

$$\tilde{\varphi}(p) = \sum_{i=1}^{N} \theta_i(p)\, \eta_i(p) - \tilde{\psi}(p) \quad (3.7)$$

for $p \in F(w)$. The parameters of $p \in F(w)$ can be derived from these dual potentials by the following formulas:

$$\eta_i(p) = \partial_i \tilde{\psi}(p) \quad \text{and} \quad \theta_i(p) = \partial^i \tilde{\varphi}(p), \quad (3.8)$$

where $\partial_i \equiv \partial / \partial \theta_i$ and $\partial^i \equiv \partial / \partial \eta_i$. For any two distributions $p$ and $q$ on $F(w)$, the Kullback divergence of $q$ from $p$,

$$D(p \,\|\, q) = \int_S p(s) \log \frac{p(s)}{q(s)}\, d\lambda(s), \quad (3.9)$$

is expressed in terms of these dual potentials $\tilde{\psi}$ and $\tilde{\varphi}$ as

$$D(p \,\|\, q) = \tilde{\psi}(q) + \tilde{\varphi}(p) - \sum_{i=1}^{N} \theta_i(q)\, \eta_i(p). \quad (3.10)$$
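Equation 3.10 can be checked numerically (my own sketch, not from the article; names are arbitrary) for a tiny Boltzmann machine whose members $p$ and $q$ share the couplings $w$, and hence lie on the same leaf $F(w)$; with the dominating measure above, the integral over $S$ reduces to a sum over the $2^N$ states:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 3
states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
w = rng.normal(scale=0.5, size=len(pairs))   # shared couplings: p and q both lie on F(w)
quad = np.array([[s[i] * s[j] for (i, j) in pairs] for s in states]) @ w

def dist(theta):
    """Probabilities, log-partition (psi-tilde, eq. 3.6), and means eta of p ∝ exp(theta·s + Σ w_ij s_i s_j)."""
    logits = states @ theta + quad
    psi = np.log(np.sum(np.exp(logits)))
    prob = np.exp(logits - psi)
    return prob, psi, prob @ states

theta_p = rng.normal(size=N)
theta_q = rng.normal(size=N)
p, psi_p, eta_p = dist(theta_p)
q, psi_q, eta_q = dist(theta_q)

kl = np.sum(p * np.log(p / q))               # eq. 3.9, as a sum over states
phi_p = theta_p @ eta_p - psi_p              # eq. 3.7
rhs = psi_q + phi_p - theta_q @ eta_p        # eq. 3.10
print(abs(kl - rhs))                         # agreement up to rounding
```

The pairwise terms cancel between $\log p$ and $\log q$ precisely because the two distributions share $w$, which is why only the first-order canonical parameters appear in equation 3.10.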
In the "forward" problem treated here, we know all the canonical parameters $h = (h_i)$ and $w = (w_{ij})$ of the target $q$. Thus, we can specify the leaf $F(w)$ to which the target $q$ belongs. Then let us consider the following problem:

Problem 1. On the leaf $F(w)$, find a distribution $p$ that best approximates the target $q$.

This problem at first seems to be nonsense, because one can immediately find the trivial solution: $p = q$. However, solving the problem in terms of the variables $(\eta_i(p))$ is nontrivial, and it is the key to understanding the approach based on mean-field approximation, as will be shown. Another
point to emphasize is that $p$ should be taken on the leaf $F(w)$, not on the leaf $F(0)$. The latter choice would lead to the naive mean-field approximation, which is different from our framework. (The issue of how these two are related is discussed in section 4.2.) Let us measure the degree of the approximation in problem 1 by the Kullback divergence. The problem is then cast into the problem of minimizing $D(p \,\|\, q)$ with respect to $p \in F(w)$. This is equivalent to minimizing

$$G(p) \equiv \tilde{\varphi}(p) - \sum_{i=1}^{N} h_i \eta_i(p), \quad (3.11)$$
since the term $\tilde{\psi}(q)$ in equation 3.10 does not depend on $p$. Note that in statistical physics, $\tilde{\varphi}(p)$ is the Gibbs free energy of $p$, which is obtained by the Legendre transform (see equation 3.7) of the Helmholtz free energy $-\tilde{\psi}(p)$, and the term $\sum_{i=1}^{N} h_i \eta_i(p)$ is called the Zeeman energy. Consider solving the minimization problem by taking the expectation parameters $m = (m_1, \ldots, m_N) = (\eta_1(p), \ldots, \eta_N(p))$ as the independent variables on $F(w)$. By doing so, a solution will directly give an estimate for the expectation parameters of the target $q$, since the exact solution is given by $p = q$. In order to solve the minimization problem analytically, we must explicitly express $G(p)$ as a function of $m$. Because $\tilde{\varphi}(p)$ is in general a very complicated function of $m$, some approximation scheme is necessary for taking such an approach. In order to take the approach of mean-field approximation, I introduce another foliation, $A = \{A(m)\}$ of $M$, parameterized by $m$:

$$M = \bigcup_m A(m). \quad (3.12)$$

Let us use the shorthand notations $I, J, \ldots$ to represent distinct pairs of indices, such as $\langle ij \rangle$. The leaves $A(m)$ are defined as

$$A(m) = \{ p(s) \mid \eta_i(p) = m_i \text{ for all } i \}. \quad (3.13)$$
Each leaf $A(m)$ has a pair of dual coordinate systems: one is $(\theta_I)$ and the other is $(\eta_I)$, and it is therefore an $N(N-1)/2$-dimensional submanifold of $M$. It is not an exponential family in general, since the $\theta_i$s depend on the $\theta_I$s so as to keep $m$ constant. However, as before, there exists a pair of dual potentials $\bar{\psi}$ and $\bar{\varphi}$ on each leaf $A(m)$, defined by

$$\bar{\psi}(p) = \tilde{\psi}(p) - \sum_{i=1}^{N} \theta_i(p)\, m_i \quad (= -\tilde{\varphi}(p)) \quad (3.14)$$
Figure 1: Orthogonal dual foliations of M.
and

$$\bar{\varphi}(p) = \sum_I \theta_I(p)\, \eta_I(p) - \bar{\psi}(p), \quad (3.15)$$

and the parameters are derived from these potentials by

$$\eta_I(p) = \partial_I \bar{\psi}(p) \quad \text{and} \quad \theta_I(p) = \partial^I \bar{\varphi}(p), \quad (3.16)$$
where $\partial_I \equiv \partial / \partial \theta_I$ and $\partial^I \equiv \partial / \partial \eta_I$. These two foliations $F$ and $A$ of $M$ (see Figure 1) are said to form orthogonal dual foliations of $M$, because the leaves $F(w)$ and $A(m)$ are orthogonal at their intersecting point. The orthogonal dual foliation structure will be fully used in deriving the mean-field formalism in the next section.

3.3 Perturbation Expansion of Divergence. If all the $w_{ij}$s are zero, the relevant leaf is $F_0$, and the problem is easily solved by assumption. In the case of Boltzmann machines, the potential $\tilde{\varphi}(p)$ on $F_0$ is indeed expressed explicitly by $m$ as
$$\tilde{\varphi}(p) = \frac{1}{2} \sum_{i=1}^{N} \Bigl[ (1 + m_i) \log \frac{1 + m_i}{2} + (1 - m_i) \log \frac{1 - m_i}{2} \Bigr], \quad (3.17)$$

and the minimization of $G(p)$ with respect to $m$ gives the solution $m_i = \tanh h_i$, as expected. When the $w_{ij}$s are not equal to zero, it is in general hard to express $\tilde{\varphi}(p)$ explicitly as a function of $m$. Mean-field approximation tries to circumvent this difficulty by reducing the problem to one on the factorizable submodel $F_0$. When the "distance" between $F_0$ and the relevant leaf $F(w)$ is not negligible, the reduction may introduce an error. To compensate for the error, mean-field approximation utilizes the Taylor expansion of $\tilde{\varphi}(p) = -\bar{\psi}(p) \equiv -\bar{\psi}(w)$ on $A(m)$ with respect to $w$ at $w = 0$,

$$\bar{\psi}(w) = \bar{\psi}(0) + \sum_I (\partial_I \bar{\psi}(0))\, w_I + \frac{1}{2} \sum_{IJ} (\partial_I \partial_J \bar{\psi}(0))\, w_I w_J + \frac{1}{6} \sum_{IJK} (\partial_I \partial_J \partial_K \bar{\psi}(0))\, w_I w_J w_K + \cdots, \quad (3.18)$$
assuming that the expansion converges. Equation 3.18 gives a perturbation expansion of the divergence $D(p \,\|\, q)$ with respect to $w$, the perturbation parameters. One expects that each coefficient of the expansion should be explicitly obtainable as a function of $m$. Note that when considering the expansion, that is, when evaluating the derivatives appearing in the coefficients, one should temporarily assume that $m$ is fixed, so that the expansion is done on the leaf $A(m)$ specified by $m$. Once the expansion has been done, one then regards $m$ as a variable in the next stage and considers the minimization problem with respect to it, as will be described later. The following theorem gives a general result about these coefficients.

Theorem 1. The coefficients of the expansion in equation 3.18 are given by the cumulant tensors defined on $A(m)$.
Proof. Let us consider a distribution $p \in A(m)$ specified by $w$, and assume that the canonical parameters of $p$ are $(h, w)$. Since $\bar{\psi}(w) \equiv \psi(p) - \sum_i h_i m_i$, we have

$$\exp(\bar{\psi}(w)) = \exp\Biggl( \psi(p) - \sum_i h_i m_i \Biggr) = \sum_s \exp\Biggl( \sum_i h_i s_i + \sum_{\langle ij \rangle} w_{ij} s_i s_j - \sum_i h_i m_i \Biggr) = \sum_s \exp\Biggl( \sum_i h_i (s_i - m_i) + \sum_{\langle ij \rangle} w_{ij} s_i s_j \Biggr).$$

The moment generating function of the $s_i s_j$ is

$$\Bigl\langle \exp\Bigl( \textstyle\sum_{\langle ij \rangle} z_{ij} s_i s_j \Bigr) \Bigr\rangle_p = \sum_s \exp\Biggl( \sum_i h_i s_i + \sum_{\langle ij \rangle} (z_{ij} + w_{ij}) s_i s_j - \psi(p) \Biggr)$$
$$= \sum_s \exp\Biggl( \sum_i h_i (s_i - m_i) + \sum_{\langle ij \rangle} (z_{ij} + w_{ij}) s_i s_j - \bar{\psi}(w) \Biggr) = \exp\bigl( \bar{\psi}(z + w) - \bar{\psi}(w) \bigr).$$
This shows that $\bar{\psi}(z + w)$ is the cumulant generating function; that is, differentiating it and setting $z$ to zero yields the cumulants of the distribution $p \in A(m)$ specified by $w$.

The lowest-order coefficients of the expansion are given by the following theorem.

Theorem 2. Let $p_0 \in A(m) \cap F_0$ for the leaf $A(m)$ containing $p$. The first-, second-, and third-order coefficients of the expansion, equation 3.18, are given by the following formulas:
$$\partial_I \bar{\psi}(0) = \eta_I(p_0)$$
$$\partial_I \partial_J \bar{\psi}(0) = \langle (\partial_I \ell)(\partial_J \ell) \rangle_{p_0}$$
$$\partial_I \partial_J \partial_K \bar{\psi}(0) = \langle (\partial_I \ell)(\partial_J \ell)(\partial_K \ell) \rangle_{p_0}, \quad (3.19)$$
where $\ell \equiv \log p_0$. The proof is straightforward with the help of the information-geometrical formulation and is given briefly in the appendix.

Remark. The right-hand sides of these formulas are the same as what one would obtain if one regarded $A(m)$ as an exponential family, but this is merely a coincidence, since $A(m)$ is actually not an exponential family. In fact, the correspondence no longer holds in general for coefficients of fourth and higher order; for example,
$$\partial_I \partial_J \partial_K \partial_L \bar{\psi}(0) \neq \langle (\partial_I \ell)(\partial_J \ell)(\partial_K \ell)(\partial_L \ell) \rangle_{p_0} - \langle (\partial_I \ell)(\partial_J \ell) \rangle_{p_0} \langle (\partial_K \ell)(\partial_L \ell) \rangle_{p_0} - \langle (\partial_I \ell)(\partial_K \ell) \rangle_{p_0} \langle (\partial_J \ell)(\partial_L \ell) \rangle_{p_0} - \langle (\partial_I \ell)(\partial_L \ell) \rangle_{p_0} \langle (\partial_J \ell)(\partial_K \ell) \rangle_{p_0}.$$

This fact seems to cause difficulty in obtaining expressions for the coefficients of general orders (Parisi & Potters, 1995).

Let $m_i^{(n)}$ denote the expectation $\langle (s_i)^n \rangle$ with respect to the factorizable distribution $p_0 \in A(m)$. For models of binary, or Ising, variables, $m_i^{(n)}$ equals 1 for $n$ even and $m_i$ for $n$ odd, because $s_i = \pm 1$ and thus $(s_i)^{2n} = 1$. For general models, $m_i^{(n)}$ must still be expressible as a function of $m_i$, because under
the factorizable distribution, $s_i$ obeys the distribution $p(s_i) = \exp(\theta_i s_i - \psi_i)$ (with some dominating measure), which forms a one-parameter exponential family. Any element belonging to such a family can be uniquely specified by the expectation parameter $m_i$, and therefore $m_i^{(n)}$ should be determined solely by $m_i$. Using these notations, the explicit formulas for the coefficients of the Taylor expansion in equation 3.18 are given as follows (all expressed as functions of $m$). For the first order:

$$\partial_I \bar{\psi}(0) = m_i m_{i'}, \quad (I = \langle ii' \rangle). \quad (3.20)$$

For the second order:

$$(\partial_I)^2 \bar{\psi}(0) = \bigl( m_i^{(2)} - (m_i)^2 \bigr)\bigl( m_{i'}^{(2)} - (m_{i'})^2 \bigr), \quad (I = \langle ii' \rangle) \quad (3.21)$$

and

$$\partial_I \partial_J \bar{\psi}(0) = 0 \quad (3.22)$$

for $I \neq J$. For the third order:

$$(\partial_I)^3 \bar{\psi}(0) = \bigl( m_i^{(3)} - 3 m_i^{(2)} m_i + 2 (m_i)^3 \bigr)\bigl( m_{i'}^{(3)} - 3 m_{i'}^{(2)} m_{i'} + 2 (m_{i'})^3 \bigr), \quad (I = \langle ii' \rangle) \quad (3.23)$$

and, for $I = \langle ij \rangle$, $J = \langle jk \rangle$, and $K = \langle ik \rangle$ with three distinct indices $i$, $j$, and $k$,

$$\partial_I \partial_J \partial_K \bar{\psi}(0) = \bigl( m_i^{(2)} - (m_i)^2 \bigr)\bigl( m_j^{(2)} - (m_j)^2 \bigr)\bigl( m_k^{(2)} - (m_k)^2 \bigr). \quad (3.24)$$

For other combinations of $I$, $J$, and $K$,

$$\partial_I \partial_J \partial_K \bar{\psi}(0) = 0. \quad (3.25)$$
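For the binary case ($m_i^{(2)} = 1$), equations 3.20 and 3.21 can be verified numerically. The sketch below is my own construction (names and the moment-matching procedure are arbitrary): it evaluates $\bar{\psi}(w)$ for $N = 2$ by choosing $\theta$ so that the means stay at $m$, and then differentiates by finite differences in $w$:

```python
import itertools
import numpy as np

STATES = np.array(list(itertools.product([-1.0, 1.0], repeat=2)))  # the 4 spin configurations

def moments(theta, w):
    """Exact means, covariance, and log-partition psi of p ∝ exp(theta·s + w s1 s2)."""
    logits = STATES @ theta + w * STATES[:, 0] * STATES[:, 1]
    psi = np.log(np.sum(np.exp(logits)))
    p = np.exp(logits - psi)
    mean = p @ STATES
    cov = (STATES.T * p) @ STATES - np.outer(mean, mean)
    return mean, cov, psi

def psi_bar(w, m, iters=50):
    """psi-bar(w) on the leaf A(m): solve theta so that <s_i> = m_i, then Legendre-correct psi."""
    theta = np.arctanh(m)                    # exact at w = 0, a good start otherwise
    for _ in range(iters):
        mean, cov, _ = moments(theta, w)
        theta = theta + np.linalg.solve(cov, m - mean)   # Newton step on moment matching
    _, _, psi = moments(theta, w)
    return psi - theta @ m

m = np.array([0.3, -0.5])
h = 1e-4                                     # finite-difference step in w
d1 = (psi_bar(h, m) - psi_bar(-h, m)) / (2 * h)
d2 = (psi_bar(h, m) - 2 * psi_bar(0.0, m) + psi_bar(-h, m)) / h**2
print(d1, m[0] * m[1])                       # eq. 3.20
print(d2, (1 - m[0]**2) * (1 - m[1]**2))     # eq. 3.21 with m^(2) = 1
```

Note that the second derivative is not the variance of $s_1 s_2$ itself; the constraint of staying on $A(m)$ is what produces the factorized form $(1 - m_1^2)(1 - m_2^2)$, and the same term reappears below as the Onsager reaction.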
Truncating the Taylor series up to the $n$th-order term gives the $n$th-order approximation $\bar{\psi}_n(w)$ to $\bar{\psi}(w)$. For the case of Boltzmann machines, or Ising models, the potential $\psi = \tilde{\psi}(w)$ is the negative of the Helmholtz free energy, $\tilde{\varphi} = -\bar{\psi}$ is the Gibbs free energy (as already mentioned), and $\bar{\varphi}$ is the same as the Boltzmann H function, or the negative of the entropy. The first- and second-order approximations $-\bar{\psi}_1(w)$ and $-\bar{\psi}_2(w)$ give the Weiss and TAP free energies, respectively. In the spin-glass literature, the Taylor expansion presented here was exploited by Plefka (1982) and is therefore called the Plefka expansion. This expansion is used to derive the TAP free energy of the Sherrington-Kirkpatrick (SK) model (Sherrington & Kirkpatrick, 1975) as the second-order approximation and to study its properties (see also Yedidia & Georges, 1990; Georges & Yedidia, 1991). It is worth emphasizing that the Plefka expansion does not assume the limit $N \to \infty$ (the thermodynamic limit), nor does it assume that the $w_{ij}$s are random quantities to be averaged over. The Plefka expansion is therefore directly applicable, beyond statistical-physical models, to the case where $N$ is finite and the $w_{ij}$s are fixed, which is in fact the setting in which mean-field approximation is applied to practical information processing tasks. Since $N$ is finite, the terms of all orders in the expansion do not vanish in principle, as opposed to the case of the SK model, in which higher-order terms vanish in the limit $N \to \infty$ (Plefka, 1982). Also, since the $w_{ij}$s are not random quantities but are given and fixed for a particular task as the parameters of the target, in general there will be no systematic way to handle the nonvanishing higher-order terms. (For Hopfield models, for example, there exist nonvanishing higher-order terms even in the limit $N \to \infty$, but they can be systematically handled by making use of the statistical properties of the $w_{ij}$s. See Nakanishi & Takayama, 1997.)

3.4 Mean-Field Equations and Linear Response. What we have to solve is the problem of minimizing $G(p)$. Now let us consider instead an approximate minimization problem: to minimize
$$G_n(p) = \tilde{\varphi}_n(p) - \sum_{i=1}^{N} h_i m_i \quad (3.26)$$
using $\tilde{\varphi}_n(p) = -\bar{\psi}_n(p)$ as an approximation to $\tilde{\varphi}(p)$. Since the approximation $\tilde{\varphi}_n(p)$ is now explicitly expressed as a function of $m$, at this stage one regards $m$ as a variable and solves the minimization problem with respect to it. The approximate minimization problem is solved by considering the stationarity condition

$$\frac{\partial}{\partial m_i}\, G_n(p) = 0. \quad (3.27)$$
This condition gives the so-called mean-field equations, or self-consistent equations. For Boltzmann machines, the mean-field equation for $n = 1$ takes the following familiar form,

$$\tanh^{-1} m_i - h_i - \sum_{j \neq i} w_{ij} m_j = 0, \quad (3.28)$$

and for $n = 2$ it includes the so-called Onsager reaction term,

$$\tanh^{-1} m_i - h_i - \sum_{j \neq i} w_{ij} m_j + \sum_{j \neq i} (w_{ij})^2 (1 - m_j^2)\, m_i = 0. \quad (3.29)$$
Solving the mean-field equations with respect to $m$ gives an estimate for the expectation parameters $(\eta_1(p), \ldots, \eta_N(p))$ of the $p$ that is expected to best approximate the target $q$ on $F(w)$. For estimating $(\eta_I(p))$, one can use the so-called linear response theorem in statistical physics (Parisi, 1988), which states that the susceptibility matrix $(\chi_{ij})$ and the stability matrix $(A_{ij})$ are the inverse of each other. In information-geometrical terms, the susceptibility matrix is identical to the Fisher information matrix $(g_{ij})$, or the Riemannian metric tensor, defined on the leaf $F(w)$,

$$\chi_{ij} = \eta_{ij}(p) - \eta_i(p)\, \eta_j(p) = g_{ij} = \partial_i \partial_j \tilde{\psi}, \quad (3.30)$$

and the stability matrix is just the inverse of the metric tensor,

$$A_{ij} = g^{ij} = \partial^i \partial^j \tilde{\varphi}, \quad (3.31)$$

where $(g^{ij}) = (g_{ij})^{-1}$. In mean-field approximation, one uses an approximation $\tilde{\varphi}_n$ to $\tilde{\varphi}$, and thus what one can compute is not the exact stability matrix but an approximate stability matrix $(g_n^{ij})$, which is given by

$$g_n^{ij} = \partial^i \partial^j \tilde{\varphi}_n. \quad (3.32)$$
Because $\tilde{\varphi}_n$ is a function of $m$, one can indeed compute the derivatives analytically. One can then evaluate $g_n^{ij}$ by substituting a solution $m$ of the mean-field equation into the analytical result. Inverting it and equating the result to the susceptibility matrix $(g_{ij})$ gives an estimate of $(\eta_I(p))$ within the mean-field approximation. Kappen and Rodríguez (1998a, 1998b) have applied the linear response theorem to mean-field Boltzmann machine learning. Extension of their method to approximations of higher orders is straightforward within the framework presented in this article (Tanaka, 1998a; Leisink & Kappen, 1998).

4 Discussion

4.1 Relation to Mean-Field "Theory" in Statistical Physics. Although the theory of mean-field approximation, presented here as an approximation scheme for information processing tasks, uses many notions and ideas borrowed from mean-field theory in statistical physics, the objectives of the two contexts are sometimes different. In this section I examine the distinctions and relations between mean-field approximation in the information processing context and mean-field theory in statistical physics. Mean-field theory has been established through studies of the kind of models treated in this article, placing emphasis on their properties in the thermodynamic limit $N \to \infty$. From these studies, we know that in the
thermodynamic limit, such models can have phases with broken ergodicity, such as ferromagnetic and spin-glass phases (Fischer & Hertz, 1991). In the ergodic phases, what we observe should be, in a certain sense, the so-called Gibbs averages: the averages with respect to the Boltzmann-Gibbs distribution. In the nonergodic phases, on the other hand, the Boltzmann-Gibbs distribution splits into more than one ergodic component, and therefore observed quantities become different from the Gibbs averages. Although this is strictly true only in the thermodynamic limit, practically one can regard it as true even in finite-$N$ cases, in the sense that short-term observations on a single physical system will yield values different from the Gibbs averages. In such situations, the objectives of the information processing and statistical physics contexts differ. In the information processing context, one is primarily interested in evaluating the Gibbs averages as precisely as possible and avoiding harmful influences from broken ergodicity; the Gibbs averages are the essentially important quantities in statistical inference and learning. In the statistical physics context, the important problem is to give a precise description of the phenomena, including phase transitions and ergodicity breaking, and to predict the quantities to be observed from those models in nonergodic phases, rather than to evaluate the Gibbs averages. Let us now discuss the validity of the Plefka expansion for each of these two contexts. In statistical physics, in the thermodynamic limit, the Plefka expansion converges when the connections $w$ are sufficiently small, and thus one can determine $m$ as the unique solution of the mean-field equation derived from the Plefka expansion. The solution $m$ will actually give the Gibbs averages of the model, which reflects the fact that the problem under consideration (problem 1) does have a unique solution corresponding to the Gibbs averages.
However, when $w$ becomes larger, a phase transition occurs, and the mean-field equation comes to have multiple solutions. Although it is commonly expected in the statistical physics literature that the Plefka expansion remains valid in this case and that each solution of the mean-field equation corresponds to one of the metastable ergodic components, this issue has not yet been fully understood (Parisi & Potters, 1995). Let us turn to the discussion within the information processing context. One can say in a strict mathematical sense that the range of validity of the Plefka expansion should correspond to its convergence domain as a power series. In the thermodynamic limit, we know, for example, that the boundary of the convergence domain for the SK model coincides with the spin-glass phase transition point (the de Almeida-Thouless [AT] line [de Almeida & Thouless, 1978]). In this case, the range of validity of the Plefka expansion in the mathematical sense is the same as the range of the ergodic phase, where the solution of the mean-field equation gives the Gibbs average and truncation of the Plefka expansion gives a reasonable approximation. In the nonergodic phases, by contrast, we can no longer expect that any of the solutions gives the Gibbs average or that the truncation provides a good approximation.
Figure 2: "Naive" mean-field approximation.
In the finite-$N$ case, which is the one we are primarily interested in from the viewpoint of information processing, the model is ergodic irrespective of the magnitude of $w$, and no phase transition occurs in the strict sense. However, for fixed $m$, $\bar{\psi}(w)$ as a function of $(w_{ij})$ has singularities at complex values of $(w_{ij})$ (the so-called Lee-Yang singularity [Lee & Yang, 1952]; it appears at real values of $(w_{ij})$ in the $N \to \infty$ limit, which signals a phase transition point), and these singularities limit the convergence domain of the Plefka expansion. In the strict mathematical sense, one has to understand that the formulation in this article is valid only within this convergence domain. Also, it is only within the convergence domain that one can expect that using higher-order approximations $G_n(p)$ of $G(p)$ will improve the accuracy of the estimate $m$ of the Gibbs averages. Further studies are needed to understand fully the information-theoretic meaning of the Plefka expansion, the mean-field equation, and its solutions in the nonergodic phases, where the Plefka expansion does not converge. This issue is beyond the scope of this article.

4.2 Relation to "Naive" Theory. The common information-theoretic understanding of the "naive" mean-field approximation is that it can be regarded as a minimization of the Kullback divergence $D(p_0 \,\|\, q)$ of $p_0 \in F_0$ from the target $q$ (Tanaka, 1996; Saul & Jordan, 1996; Hofmann & Buhmann, 1997) (see Figure 2). It can be shown that this is consistent with the theory presented in this letter, and in fact it corresponds to the first-order approximation within the present framework. Let $p \in M$ be a distribution that shares the canonical parameters $(\theta_I)$ with $q$ and the expectation parameters $(\eta_i)$ with $p_0$. This assertion means that $q$ and $p$ are on the same leaf $F(w)$, and that $p$ and $p_0$ are on the same leaf $A(m)$. Because of the orthogonality of the two foliations $F$ and $A$, the following "Pythagorean law" (Amari, 1985; Amari & Nagaoka, 1998) holds (see Figure 3):

$$D(p_0 \,\|\, q) = D(p_0 \,\|\, p) + D(p \,\|\, q). \quad (4.1)$$
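The Pythagorean relation can also be checked numerically. In the sketch below (my own construction, not from the letter; names are arbitrary), $p_0$ is factorizable with means $m$, $p$ is the member of $F(w)$ with the same means (found by damped Newton moment matching), and $q$ is an arbitrary member of $F(w)$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
N = 3
states = np.array(list(itertools.product([-1.0, 1.0], repeat=N)))
pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
w = rng.normal(scale=0.4, size=len(pairs))
quad = np.array([[s[i] * s[j] for (i, j) in pairs] for s in states]) @ w

def boltzmann(theta, coupled=True):
    """Normalized distribution over states, with or without the pairwise terms."""
    logits = states @ theta + (quad if coupled else 0.0)
    logits = logits - logits.max()
    p = np.exp(logits)
    return p / p.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

q = boltzmann(rng.normal(size=N))            # target on the leaf F(w)

m = np.array([0.2, -0.4, 0.1])
p0 = boltzmann(np.arctanh(m), coupled=False)  # factorizable, means exactly m

theta = np.arctanh(m)                         # find p in F(w) with the same means
for _ in range(200):
    p = boltzmann(theta)
    mean = p @ states
    cov = (states.T * p) @ states - np.outer(mean, mean)
    theta = theta + 0.8 * np.linalg.solve(cov, m - mean)
p = boltzmann(theta)

print(kl(p0, q), kl(p0, p) + kl(p, q))       # the two sides of equation 4.1 agree
```

The identity holds exactly here: $\log p - \log q$ is linear in $s$ (the quadratic terms cancel, since $p$ and $q$ share $w$), and its expectations under $p_0$ and $p$ coincide because the two have the same means.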
Figure 3: Relation between the "naive" mean-field approximation and the present framework.
Intuitively speaking, $D(p_0 \,\|\, p)$ represents the squared distance between the two leaves, from $F_0$ to $F(w)$; when $|w|$ is small, it is a second-order quantity with respect to $w$. It should vanish in the first-order approximation, and thus $D(p_0 \,\|\, q) \approx D(p \,\|\, q)$ holds. Under this approximation, minimization of $D(p_0 \,\|\, q)$ with respect to $p_0 \in F_0$ is equivalent to minimization of $D(p \,\|\, q)$ with respect to $p \in F(w)$. Indeed, it can easily be checked by direct calculation that the first-order approximation of $D(p \,\|\, q)$ is exactly identical to $D(p_0 \,\|\, q)$. This fact establishes the correspondence between the common information-theoretic understanding of the naive mean-field approximation and the present framework. Moreover, it can be said that the term $D(p_0 \,\|\, p)$, which is ignored in the naive, first-order mean-field approximation, gives the higher-order correction to it, which yields the reaction terms in the mean-field equation. Thus, the distance between the two leaves $F_0$ and $F(w)$ determines the relevance of the correction by the reaction terms.

In relation to this, it is necessary to argue against a common "belief" about the practical validity of the naive mean-field approximation in the engineering literature, which claims that the naive mean-field approximation becomes a good approximation as the number $N$ of units becomes large (see, for example, Haykin, 1994). This belief is usually explained as follows: Consider a Boltzmann machine with $N$ units. When it works in the Gibbs sampler mode, unit $i$ receives the net input $u_i = \sum_{j \neq i} w_{ij} s_j + h_i$ as its "field." $u_i$ is a fluctuating quantity because the $s_j$s are random variables. However, when $N$ is large, one can expect that the fluctuation of $u_i$ will vanish, and thus $u_i$ can be safely replaced with the "mean field":

$$\langle u_i \rangle = \sum_{j \neq i} w_{ij} \langle s_j \rangle + h_i. \quad (4.2)$$
According to the framework presented in this article, the validity of the naive mean-field approximation is explained differently: it is valid whenever truncation of the Plefka expansion does not introduce a significant error, regardless of whether $N$ is large. For the SK model, for example, the naive mean-field approximation, or the Weiss mean-field theory, is known to be incorrect even in the limit $N \to \infty$. Conversely, the naive mean-field approximation gives good results even when $N$ is small, provided that the $w_{ij}$s are small (Tanaka, 1998a), which can be ascribed to the fact that in such cases, the higher-order terms dropped from the Plefka expansion by the truncation are all relatively small. These facts show that the belief about the practical validity of the naive mean-field approximation is theoretically not justified; it is nothing more than a belief.

4.3 Possible Extensions. I have presented a general theory of mean-field approximation, which originated in statistical physics but is now formulated entirely in terms of information geometry, so that it is applicable to models beyond Boltzmann machines. Any model constituting an exponential family and containing a factorizable submodel can benefit from the theory. A straightforward application will be the case of higher-order Boltzmann machines, where the model $M$ is defined by
$$M = \Biggl\{ p(s) = \exp\Biggl( \sum_i \theta_i s_i + \sum_{\langle ij \rangle} \theta_{ij} s_i s_j + \sum_{\langle ijk \rangle} \theta_{ijk} s_i s_j s_k + \cdots - \psi \Biggr) \Biggr\}, \quad (4.3)$$
and higher-order parameters such as $\theta_{ijk}$ enter the perturbation parameters. Extension of the linear response theorem to this case is quite easy (Tanaka, 1998b; Leisink & Kappen, 1998), but it requires considering not only the metric tensor but also higher-order geometrical quantities. Another category to which the theory is applicable includes models containing not factorizable submodels but ones that are "solvable" in some sense, for example, the model with partial factorizability (Saul & Jordan, 1996). Another important example is the Potts model, in which each set of Potts spin variables forms a factorial component. The TAP approach for the Potts model has been employed by Hofmann and Buhmann (1997). Further extension will be possible by following the framework presented in this article.

Appendix: Proof of Theorem 2
For the first-order coefficient, from equation 3.16 we have

$$\partial_I \bar{\psi} = \eta_I, \quad (A.1)$$

which is the first-order cumulant, regardless of $w$.
For the second-order coefficient, we have

$$\partial_I \partial_J \bar{\psi} = \bar{g}_{IJ}, \quad (A.2)$$
where $(\bar{g}_{IJ})$ is the Fisher information matrix on $A(m)$. This equality shows that the second-order coefficient is given by the second-order cumulant, regardless of $w$. The Fisher information matrix can be written as

$$\bar{g}_{IJ} = \langle (\partial_I \ell)(\partial_J \ell) \rangle_p, \quad (A.3)$$
where

$$\ell = \log p = \sum_{i=1}^{N} \theta_i s_i + \sum_{\langle ij \rangle} \theta_{ij} s_i s_j - \psi = \sum_{i=1}^{N} \theta_i (s_i - m_i) + \sum_{\langle ij \rangle} \theta_{ij} s_i s_j - \bar{\psi}. \quad (A.4)$$
The third-order coefficient is given, using this expression, by

$$\partial_I \partial_J \partial_K \bar{\psi} = \partial_I \bar{g}_{JK} = \langle (\partial_I \partial_J \ell)(\partial_K \ell) \rangle_p + \langle (\partial_J \ell)(\partial_I \partial_K \ell) \rangle_p + \langle (\partial_I \ell)(\partial_J \ell)(\partial_K \ell) \rangle_p. \quad (A.5)$$
I will show that the first two terms vanish when $w = 0$, so that the coefficient reduces to the third term, the skewness tensor. First, using the relationship derived from equations 3.8 and 3.14,

$$\theta_i = \partial^i \tilde{\varphi} = -\partial^i \bar{\psi}, \quad (A.6)$$
we know that, for $I = \langle ii' \rangle$, $\partial_I \ell$ is written as

$$\partial_I \ell = s_i s_{i'} - \eta_I - \sum_{k=1}^{N} (\partial^k (\partial_I \bar{\psi}))(s_k - m_k), \quad (A.7)$$

from which

$$\langle \partial_I \ell \rangle_{p_0} = 0 \quad (A.8)$$
results. For the second-order derivative of $\ell$,

$$\partial_I \partial_J \ell = -\partial_I \eta_J - \sum_{k=1}^{N} (\partial^k (\partial_I \partial_J \bar{\psi}))(s_k - m_k) \quad (A.9)$$

holds, from which we have

$$\langle (\partial_I \partial_J \ell)(\partial_K \ell) \rangle_p = -\partial_I \eta_J\, \langle \partial_K \ell \rangle_p - \sum_{k=1}^{N} (\partial^k (\partial_I \partial_J \bar{\psi}))\, \langle (s_k - m_k)\, \partial_K \ell \rangle_p, \quad (A.10)$$

and therefore

$$\langle (\partial_I \partial_J \ell)(\partial_K \ell) \rangle_{p_0} = -\sum_{k=1}^{N} (\partial^k (\partial_I \partial_J \bar{\psi}))\, \langle (s_k - m_k)\, \partial_K \ell \rangle_{p_0}. \quad (A.11)$$
When $w = 0$, $\partial_I \ell$ can be written as

$$\partial_I \ell = (s_i - m_i)(s_{i'} - m_{i'}) \quad (A.12)$$

since $\partial_I \bar{\psi} = \eta_I = m_i m_{i'}$, and thus $\langle (s_k - m_k)\, \partial_K \ell \rangle_{p_0}$ vanishes because all the $s_i$s are independent when $w = 0$. The terms $\langle (\partial_I \partial_J \ell)(\partial_K \ell) \rangle_p$ and $\langle (\partial_J \ell)(\partial_I \partial_K \ell) \rangle_p$ therefore vanish when $w = 0$. This completes the proof.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Amari, S. (1985). Differential-geometrical methods in statistics. Berlin: Springer-Verlag.
Amari, S., & Nagaoka, H. (1998). Introduction to information geometry (D. Harada, Trans.). New York: AMS and Oxford University Press.
de Almeida, J. R. L., & Thouless, D. J. (1978). Stability of the Sherrington-Kirkpatrick solution of a spin glass model. J. Phys. A: Math. Gen., 11(5), 983–990.
Dunmur, A. P., & Titterington, D. M. (1997). On a modification to the mean field EM algorithm in factorial learning. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 431–437). Cambridge, MA: MIT Press.
Fischer, K. H., & Hertz, J. A. (1991). Spin glasses. Cambridge: Cambridge University Press.
Galland, C. C. (1993). The limitations of deterministic Boltzmann machine learning. Network, 4(3), 355–379.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Machine Intell., PAMI-6(6), 721–741.
Georges, A., & Yedidia, J. S. (1991). How to expand around mean-field theory using high-temperature expansions. J. Phys. A: Math. Gen., 24(9), 2173–2192.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hofmann, T., & Buhmann, J. M. (1997). Pairwise data clustering by deterministic annealing. IEEE Trans. Patt. Anal. Machine Intell., 19(1), 1–14. Errata, ibid., 19(2), 197, 1997.
Kappen, H. J., & Rodríguez, F. B. (1998a). Efficient learning in Boltzmann machines using linear response theory. Neural Computation, 10(5), 1137–1156.
Kappen, H. J., & Rodríguez, F. B. (1998b). Boltzmann machine learning using mean field theory and linear response correction. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 280–286). Cambridge, MA: MIT Press.
Lee, T. D., & Yang, C. N. (1952). Statistical theory of equations of state and phase transitions. II. Lattice gas and Ising model. Phys. Rev., 87(3), 410–419.
Leisink, M. A. R., & Kappen, H. J. (1998). Learning higher order Boltzmann machines using linear response. In L. Niklasson, M. Boden, & T. Ziemke (Eds.), Proc. 8th Int. Conf. Artificial Neural Networks (pp. 511–516). Berlin: Springer-Verlag.
Nakanishi, K., & Takayama, H. (1997). Mean-field theory for a spin-glass model of neural networks: TAP free energy and the paramagnetic to spin-glass transition. J. Phys. A: Math. Gen., 30(23), 8085–8094.
Parisi, G. (1988). Statistical field theory. Reading, MA: Addison-Wesley.
Parisi, G., & Potters, M. (1995). Mean-field equations for spin models with orthogonal interaction matrices. J. Phys. A: Math. Gen., 28(18), 5267–5285.
Peterson, C., & Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1, 995–1019.
Plefka, T. (1982). Convergence condition of the TAP equation for the infinite-ranged Ising spin glass model. J. Phys. A: Math. Gen., 15(6), 1971–1978.
Saul, L. K., & Jordan, M. I. (1996). Exploiting tractable structures in intractable networks. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 486–492). Cambridge, MA: MIT Press.
Sherrington, D., & Kirkpatrick, S. (1975). Solvable model of a spin-glass. Phys. Rev. Lett., 35(26), 1792–1796.
Tanaka, T. (1996). Information geometry of mean field theory. IEICE Trans. Fundamentals, E79-A(5), 709–715.
Tanaka, T.
(1998a). Mean eld theory of Boltzmann machine learning. Phys. Rev. E, 58(2), 2302–2310. Tanaka, T. (1998b). Estimation of third-order correlations within mean eld approximation. In S. Usui & T. Omori (Eds.), Proc. Fifth Int. Conf. Neural Info. Process. ’98 Kitakyushu (Vol. 1, pp. 554–557). Thouless, D. J., Anderson, P. W., & Palmer, R. G. (1977). Solution of “Solvable model of a spin glass.” Phil. Mag., 35(3), 593–601. Yedida, J. S., & Georges, A. (1990). The fully frustrated Ising model in innite dimensions. J. Phys. A: Math. Gen., 23(11), 2165–2171. Received May 4, 1998; accepted June 7, 1999.
LETTER
Communicated by Eric Baum
Measuring the VC-Dimension Using Optimized Experimental Design

Xuhui Shao
Vladimir Cherkassky
ECE Department, University of Minnesota, Minneapolis, MN 55455, U.S.A.

William Li
Operations and Management Science Department, University of Minnesota, Minneapolis, MN 55455, U.S.A.

VC-dimension is the measure of model complexity (capacity) used in VC-theory. The knowledge of the VC-dimension of an estimator is necessary for rigorous complexity control using analytic VC generalization bounds. Unfortunately, it is not possible to obtain analytic estimates of the VC-dimension in most cases. Hence, a recent proposal is to measure the VC-dimension of an estimator experimentally, by fitting the theoretical formula to a set of experimental measurements of the frequency of errors on artificially generated data sets of varying sizes (Vapnik, Levin, & Le Cun, 1994). However, it may be difficult to obtain an accurate estimate of the VC-dimension due to the variability of random samples in the experimental procedure proposed by Vapnik et al. (1994). We address this problem by proposing an improved design procedure for specifying the measurement points (i.e., the sample size and the number of repeated experiments at a given sample size). Our approach leads to a nonuniform design structure, as opposed to the uniform design structure used in the original article (Vapnik et al., 1994). Our simulation results show that the proposed optimized design structure leads to more accurate estimation of the VC-dimension using the experimental procedure. The results also show that more accurate estimation of the VC-dimension leads to improved complexity control using analytic VC generalization bounds and, hence, better prediction accuracy.

1 Introduction and Motivation
Various applications and methodologies in engineering, statistics, and computer science are concerned with problems of predictive learning from samples. In such problems, the goal is to estimate certain properties of (unknown) statistical distributions from known finite samples (training data) in order to predict (unknown) future data accurately. The VC-theory (Vapnik, 1995) provides a comprehensive mathematical and conceptual framework for predictive learning. Specifically, according to VC-theory, improved generalization (prediction accuracy) is achieved by minimizing the analytic VC generalization bounds developed for finite-sample estimation. The generalization ability of an estimator depends on an optimized trade-off between its complexity (defined as the VC-dimension) and its ability to fit the training data. Hence, using VC-bounds for complexity control requires knowledge of the VC-dimension. Unfortunately, it is not possible to obtain an exact analytic estimate of the VC-dimension for most estimators of practical interest (Vapnik, 1995; Cherkassky & Mulier, 1998). The only practical solution is to estimate the VC-dimension experimentally, following the approach of Vapnik, Levin, and Le Cun (1994). Their approach is simple: for binary classification problems, the deviation of the expectation of the training error from 0.5, for training data with randomly chosen class labels (with probability 0.5), depends on the complexity (i.e., the VC-dimension) of an estimator. Their method is based on a theoretically derived formula for the maximum deviation between the frequency of errors produced by an estimator on two randomly labeled data sets, ξ(n), as a function of the size of each data set n and the VC-dimension h of the estimator. The main idea is that an accurate estimate of the VC-dimension gives the best fit between the formula and a set of experimental measurements of the frequency of errors on randomly labeled data sets of varying sizes. It has been shown that this approach achieves accurate estimates of the VC-dimension for linear estimators with squared loss (Vapnik et al., 1994). Implementation of this general approach for measuring the VC-dimension requires a particular choice of the experimental procedure for choosing the measurement points corresponding to sample size, and the original study (Vapnik et al., 1994) adopted a uniform design.

Neural Computation 12, 1969-1986 (2000). © 2000 Massachusetts Institute of Technology
However, this uniform design may lead to rather inaccurate estimates of the VC-dimension, and in this article we suggest a better (nonuniform) design strategy that yields more accurate estimates. The proposed strategy for optimized experimental design also results in improved complexity control. For example, in situations where the "correct" VC-dimension is unknown (e.g., penalized linear estimators), the proposed nonuniform design combined with the VC-bounds for complexity control yields more accurate predictive models than the original uniform design.

2 Measuring the VC-Dimension of an Estimator
This section briefly describes the original approach for measuring the VC-dimension (Vapnik et al., 1994). The procedure is defined for a binary classifier. However, the estimated VC-dimension can be used as well for other types of learning problems, such as regression (Cherkassky & Mulier, 1998). In binary classification, each input sample X = (x_1, x_2, ..., x_d) needs to be classified into one of two classes (y = 0 or 1). The prediction accuracy of a classifier is defined as the misclassification rate on the test set:

$$E_{error} = \frac{n_{misclassified}}{n_{total}}. \tag{2.1}$$
The VC-dimension h of a binary classifier is defined as the maximum number of vectors X_1, ..., X_h that can be separated into two classes in all 2^h possible ways by this classifier (Vapnik, 1995). The definition of VC-dimension does not provide a constructive procedure for measuring it. Vapnik et al. (1994) proposed a method to estimate the effective VC-dimension by observing the maximum deviation ξ(n) of error rates between two independently labeled data sets,

$$\xi(n) = \max_v \left| E_{error}(Z_1^n) - E_{error}(Z_2^n) \right|, \tag{2.2}$$

where Z_1^n = {X_{1,1}, y_{1,1}}, ..., {X_{1,n}, y_{1,n}} and Z_2^n = {X_{2,1}, y_{2,1}}, ..., {X_{2,n}, y_{2,n}} are two sets of classification samples of size n, and v is the set of parameters of the binary classifier. According to VC-theory, ξ(n) is bounded by

$$\xi(n) \le \Phi(n/h), \tag{2.3}$$
where

$$\Phi(t) = \begin{cases} 1 & \text{if } t < 0.5, \\[4pt] a \, \dfrac{\ln(2t) + 1}{t - k} \left( \sqrt{1 + \dfrac{b\,(t - k)}{\ln(2t) + 1}} + 1 \right) & \text{otherwise,} \end{cases} \tag{2.4}$$

with t = n/h; the constants a = 0.16 and b = 1.2 have been determined empirically (Vapnik et al., 1994), and k = 0.14928 is determined so that Φ(0.5) = 1. According to Vapnik et al. (1994), this bound is tight; therefore, we can consider, approximately,

$$\xi(n) \approx \Phi(n/h). \tag{2.5}$$
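As a concrete illustration, the bound of equation 2.4 is straightforward to implement; a minimal Python sketch (function and constant names are ours, not from the article):

```python
import math

# Constants from Vapnik et al. (1994): a and b are empirical; k makes phi(0.5) = 1.
A, B, K = 0.16, 1.2, 0.14928

def phi(t):
    """Theoretical bound Phi(t) on the maximum deviation xi(n), with t = n/h (eq. 2.4)."""
    if t < 0.5:
        return 1.0
    u = math.log(2.0 * t) + 1.0  # ln(2t) + 1 appears twice in the formula
    return A * u / (t - K) * (math.sqrt(1.0 + B * (t - K) / u) + 1.0)
```

By equation 2.5, ξ(n) ≈ phi(n / h): the bound equals 1 for small t and decays toward 0 as n/h grows, which is what makes h identifiable from measurements of ξ(n).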
Since the VC-dimension h is the only unknown variable in equation 2.5, we can estimate h from observations of the maximum deviation ξ(n) according to equation 2.2. ξ(n) can be estimated by simultaneously minimizing the error rate on the first labeled data set and maximizing the error rate on the second set. This leads to the following procedure, shown in Figure 1 (Cortes, 1995; Vapnik et al., 1994):

1. Generate a random set of size 2n: Z^{2n} = {x_1, y_1}, ..., {x_{2n}, y_{2n}}.

2. Split the set into two sets of equal size: Z_1 and Z_2.
Figure 1: Measuring the maximum discrepancy between two independent sets. (The diagram shows a training set of size 2n split into Set 1 and Set 2 of size n each; the labels of Set 2 are flipped, the learning machine is trained as a binary classifier on the merged set, the original labels are restored, and the error rates on the two sets are compared.)
3. Flip the class labels for the second set, Z_2.

4. Merge the two sets to train the binary classifier.

5. Separate the sets and flip the labels on the second set back again.

6. Measure the difference between the error rates on the two sets: ξ(n) = |E_{error}(Z_1^n) − E_{error}(Z_2^n)|.

This procedure, defined as an experiment in this article, gives a single estimate of ξ(n), from which we can obtain a single point estimate of h according to equation 2.5. In order to reduce the variability of estimates due to random samples in the experiment, the procedure is repeated for different data sets of different sample sizes n_1, n_2, ..., n_d, in the range 0.5 ≤ n_i/h ≤ 30. We call each n_i a design point, and d is the total number of design points. At each design point n_i, several (m_i) repeated experiments are performed. The original paper (Vapnik et al., 1994) specifies the experimental procedure for measuring the VC-dimension of a learning machine (as outlined above). However, practical application of this procedure requires specification of the experiment design, that is, the values n_i and m_i (i = 1, ..., d). The original paper (Vapnik et al., 1994) used m_1 = m_2 = ... = m_d = constant, so we call this design a uniform design. The mean values of the repeated experiments are then taken at each design point: ξ̄(n_1), ..., ξ̄(n_d). The effective VC-dimension h of the binary classifier can then be determined by finding the parameter h* that provides the best fit between Φ(n/h) and the ξ̄(n_i)'s:

$$h^* = \arg\min_h \sum_{i=1}^{d} \left[ \bar{\xi}(n_i) - \Phi(n_i/h) \right]^2. \tag{2.6}$$
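The single-experiment procedure (steps 1-6) and the least-squares fit of equation 2.6 can be sketched as follows. The classifier interface (`train_classifier`, `error_rate`, `make_inputs`) is a hypothetical placeholder for whatever learning machine is being measured, and `phi` is an implementation of the bound in equation 2.4:

```python
import random

def one_experiment(train_classifier, error_rate, make_inputs, n):
    """One 'experiment' (steps 1-6): returns a single estimate of xi(n)."""
    # Steps 1-2: generate 2n samples with random labels and split in half.
    X = make_inputs(2 * n)
    y = [random.randint(0, 1) for _ in range(2 * n)]
    X1, y1, X2, y2 = X[:n], y[:n], X[n:], y[n:]
    # Steps 3-4: flip the labels of the second half and train on the merged set.
    clf = train_classifier(X1 + X2, y1 + [1 - v for v in y2])
    # Steps 5-6: restore the labels and measure the error-rate difference.
    return abs(error_rate(clf, X1, y1) - error_rate(clf, X2, y2))

def estimate_h(measurements, h_grid, phi):
    """Least-squares fit of h over a candidate grid (equation 2.6).

    measurements: list of (n_i, xi_bar_i) pairs, one per design point."""
    return min(h_grid,
               key=lambda h: sum((xi - phi(n / h)) ** 2 for n, xi in measurements))
```

In practice the article minimizes over h by nonlinear least squares rather than a grid; the grid search above is only the simplest way to make equation 2.6 concrete.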
We point out that the original paper (Vapnik et al., 1994) does not provide theoretical or empirical justification for choosing the uniform design.

3 Optimized Experimental Design
Due to random variations of the quantity ξ(n) in the experimental procedure shown in Figure 1, it is difficult to obtain a perfect fit to the theoretical function. Thus, the results may not be accurate or reliable enough for practical purposes. In this article, we address this problem by proposing an improved method for measuring the VC-dimension, using the ideas of optimal experimental design (Li & Wu, 1997) from applied statistics. An optimized design is an experiment that minimizes (or maximizes) a given design criterion; its goal is usually to minimize the effect of experimental errors. In the VC-dimension estimation experiment, our goal is to obtain an optimized design structure, that is, the optimized design points (n_1, n_2, ..., n_d) and the optimized numbers of repeated experiments at each design point, (m_1, m_2, ..., m_d), so that the parameter h can be estimated as accurately as possible. We describe next a procedure for determining the optimized design structure, which contains the optimized numbers of repeated experiments at fixed design points, for measuring the VC-dimension.

3.1 Criterion for Optimized Design. The effective VC-dimension h of the learning machine is the main variable under study. In most cases the true VC-dimension is unknown, and we can only measure the accuracy of estimation of the theoretical function Φ(n/h), given by the mean squared error (MSE):

$$MSE(\text{fitting}) = E\left[ \left( \xi(n) - \Phi(n/h) \right)^2 \right]. \tag{3.1}$$
We shall use this MSE(fitting) as the criterion for optimized design (in the procedure described in section 3.2, the expectation in equation 3.1 is estimated by the sample mean). In the original paper (Vapnik et al., 1994), the mean values of repeated experiments at each design point are taken to fit the theoretical function, as described in equation 2.6. For a uniform design, taking the mean values before least-squares fitting will not change the estimation result, but the MSE of fitting will be different (it appears smaller after taking the mean). For a nonuniform design structure, however, mean-value fitting will lead to inconsistent estimation results. Therefore, we will always use all the empirical data (without averaging at design points) to fit Φ(n/h).

3.2 The Procedure for Optimized Design. An experimental design for estimating the VC-dimension requires the specification of two factors: the design points and the number of repeated experiments at each design point. The method for finding an optimized design structure is based on a heuristic exchange procedure, motivated by techniques in optimized experimental design (Li & Wu, 1997). In applied statistics, a key issue is how to select and construct designs for the most efficient use of resources. A vast number of algorithms have been developed in the literature for constructing optimal designs (for a review, see Li & Wu, 1997). Most algorithms operate under the assumption that a neighborhood can be constructed to identify adjacent solutions reachable from any current solution. An algorithm usually starts with a carefully (or randomly) chosen design and then improves the design by a series of exchanges. Such exchanges may include a swap of two design points or the use of a new design point to replace an old one. Here, we will improve the design by moving an experiment from one design point to another.
One major problem of these algorithms is that they usually cannot guarantee convergence to a global optimum. A few techniques have been developed as remedial procedures. One method is to start with several randomly chosen starting designs and then choose the best resulting design as the optimized design. Another common procedure is to use a "good" starting design that already has desirable properties or has been used in practice. The overall performance of algorithms with these remedial procedures has been quite satisfactory. In our work, we chose the second option, since the uniform design has been successfully used by Vapnik et al. (1994). We start our search with the original uniform design. Using MSE(fitting) as the criterion for design optimization, we identify "good" design points that contribute to a reduction in MSE and "bad" design points that
contribute to an increase in MSE. Our approach consists of pairwise exchanges between the worst point (which contributes to the biggest increase in MSE) and the best point (which contributes to the biggest reduction in MSE), under the constraint that the number of repeated experiments at each design point cannot exceed a preset upper limit. Once a design point reaches this limit (set to 25% of the total number of experiments), it is saturated and will not participate further in the exchange procedure. The reason for using the saturation limit is as follows: if all the sample points were moved to a single design point, the estimate of the VC-dimension (see equation 3.1) from that single design point would be highly inaccurate. The saturation limit is empirically chosen to prevent this special case. The proposed algorithm for optimized design is as follows:

0. Initialization: start with a uniform design.

1. Carry out the current design and compute the MSE of fitting.

2. Rank all design points n_1, n_2, ..., n_d according to their contributions to the MSE. The contribution of design point n_i is computed by the following formula:

$$\text{Contribution}(n_i) = \frac{MSE(\text{remove } n_i) - MSE}{k(n_i)}, \tag{3.2}$$

where MSE is the mean squared error of fitting defined by equation 3.1 given all the design points, and MSE(remove n_i) is the fitting MSE given all the design points except n_i. The contribution of design point n_i is normalized by the sample size k at n_i. Contribution(n_i) can be positive or negative: when the random variation at n_i is large, it is likely to be negative; when the random variation is small, it is likely to be positive.

3. Perform an exchange by removing an experiment from the worst design point and adding it to the best design point. If this exchange yields a bigger MSE, it is reversed, and the experiment is instead added to the second (or third, fourth, ...) best design point, the first of which leads to a smaller MSE.

4. Repeat steps 1-3 until either of the following stopping criteria is met: all design points give nonpositive contributions, or all design points that have positive contributions are saturated. The resulting design is called an optimized design.
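The exchange procedure above can be sketched in a few lines of Python. This is a simplified sketch under our own interface assumptions: `counts[i]` is the number of experiments m_i at design point i, `mse_of_design` is assumed to carry out the design and return the MSE of fitting from equation 3.1, and `cap` is the saturation limit (25% of the total number of experiments in the article). The stopping test here is simply "no exchange improved the MSE," which subsumes the two criteria of step 4:

```python
def optimize_design(counts, mse_of_design, cap):
    """Heuristic exchange procedure of section 3.2 (simplified sketch)."""
    improved = True
    while improved:
        improved = False
        base = mse_of_design(counts)

        # Contribution of an active design point (eq. 3.2): change in MSE
        # when the point is removed, normalized by its sample size.
        def contribution(i):
            trial = counts[:]
            trial[i] = 0
            return (mse_of_design(trial) - base) / counts[i]

        active = [i for i, k in enumerate(counts) if k > 0]
        worst = min(active, key=contribution)  # most negative contribution
        # Try the best, second-best, ... unsaturated points in turn.
        for best in sorted(active, key=contribution, reverse=True):
            if best == worst or counts[best] >= cap:
                continue
            counts[worst] -= 1
            counts[best] += 1
            if mse_of_design(counts) < base:
                improved = True
                break
            counts[worst] += 1  # exchange made things worse: reverse it
            counts[best] -= 1
    return counts
```

With a toy objective where each design point has a fixed noise level, the procedure drains experiments from the noisiest points into the cleanest ones until the cleanest point saturates, mirroring the behavior reported in Tables 1 and 2.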
As we show in section 4, the proposed optimized design approach yields a design structure that is quite different from the original uniform design (Vapnik et al., 1994), yet optimized in terms of fitting the theoretical function. The simulation results in the next section show that the proposed optimized design structure leads to more accurate estimation of the VC-dimension (when the true VC-dimension is known), better complexity control, and hence better prediction accuracy.
4 Measuring the VC-Dimension for Linear Estimators

A linear regression estimator approximates the target function by a linear combination of a set of fixed basis functions,

$$f(x, w) = \sum_{j=1}^{m} w_j g_j(x) + w_0, \tag{4.1}$$

where g_j(x) is a basis function, for example, a polynomial of degree j: g_j(x) = x^j. Given training samples {(x_i, y_i), i = 1, ..., n}, the approximations provided by the linear estimator for the training data can be written as

$$f(x, w) = Zw, \tag{4.2}$$

where w is the vector of linear combination coefficients and Z is the estimation matrix formed by the training inputs:

$$Z = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^m \\ 1 & x_2 & x_2^2 & \cdots & x_2^m \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^m \end{pmatrix}. \tag{4.3}$$

The linear least-squares solution for w is

$$w^* = (Z^T Z)^{-1} Z^T y. \tag{4.4}$$
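For illustration, a self-contained pure-Python sketch of the polynomial least-squares estimator of equations 4.1-4.4 (helper names are ours); `classify` applies the 0.5 threshold described in the text to turn the regression output into the binary classifier used in the measurement procedure:

```python
def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for small systems."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_poly(xs, ys, m):
    """w* = (Z^T Z)^{-1} Z^T y for the polynomial design matrix Z (eq. 4.3)."""
    Z = [[x ** j for j in range(m + 1)] for x in xs]
    ZtZ = [[sum(r[i] * r[j] for r in Z) for j in range(m + 1)]
           for i in range(m + 1)]
    Zty = [sum(r[i] * y for r, y in zip(Z, ys)) for i in range(m + 1)]
    return solve(ZtZ, Zty)

def classify(w, x):
    """Threshold the regression output at 0.5 to get a 0/1 class label."""
    return 1 if sum(c * x ** j for j, c in enumerate(w)) > 0.5 else 0
```

Solving the normal equations explicitly, as here, exhibits exactly the ill-conditioning discussed below when m approaches n; a numerically careful implementation would use a QR or SVD factorization instead.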
The VC-dimension of a linear estimator is simply the number of free parameters (or DOF, degrees of freedom). Since we can measure how well a given design estimates the true VC-dimension in the linear case, this serves as a control experiment. Note that the experimental procedure for measuring the VC-dimension is totally blind to the number of free parameters and even to the type of the learning machine. In order to use the experimental procedure of section 2 to estimate the VC-dimension of the linear estimator, the binary classification problem involved in the procedure is solved as a regression problem; the output of the regression is then thresholded at 0.5 to produce class labels 0 and 1. Experimental results comparing the original uniform design structure (Vapnik et al., 1994) and the proposed optimized structure for measuring the VC-dimension of linear estimators are shown in Table 1 and Figure 2. Figure 2 shows the result of measuring the VC-dimension of a linear classifier for 11-dimensional input vectors X = (x_1, x_2, ..., x_11). In this case, the true VC-dimension is 12, which is the number of free parameters. Figure 2 compares ten independent estimation trials using the original uniform
Table 1: Design Structures for Linear Estimators.

n/h:                 0.5  0.65  0.8  1.0  1.2  1.5  2.0  2.8  3.8  5.0  6.5  8.0  10.0  15.0  20.0  30.0
Uniform design, k:    20    20   20   20   20   20   20   20   20   20   20   20    20    20    20    20
Optimized design, k:  14     0    0    0    0    0    0    0    0   18   18   20    24    66    80    80
design (see Figure 2a) with those using the optimized design structure (see Figure 2b). The design structures are shown in Table 1, in which k is the number of repeated experiments at each sample size and n/h is the ratio of the sample size to the VC-dimension. The experimental results in Figure 2 demonstrate two (interacting) factors affecting accurate estimation of the VC-dimension:
Figure 2: Measuring the VC-dimension of the linear estimator using the (a) original uniform design and (b) proposed optimized design. (Each panel shows the fit of Φ(n/h) to the measurements, the estimated VC-dimension over ten trials, and the MSE of fitting.)
1. Reduction of the random variability of the estimates achieved by the optimized design (relative to the original uniform design). This is to be expected, since the reduction in the MSE of fitting is the criterion for design optimization. The top two plots in Figure 2 show a single trial for estimating the VC-dimension; they exhibit a great deal of random variation in fitting the theoretical function (see equation 2.4), especially when sample sizes are small. By avoiding these small sample sizes and performing more repeated experiments at larger sample sizes, the optimized design structure can greatly increase the accuracy of estimation. The gain in accuracy is significant: the MSE of fitting for the optimized design is reduced by a factor of 3 (see Figure 2), and on average the estimated VC-dimension is 2 DOF closer to the theoretical value. The variance of the estimate is also much smaller, which indicates more stable estimation. We performed another experiment, doubling the number of repeated experiments in the uniform design, to demonstrate that merely increasing the total sample size does not yield better results (original design: mean = 8.8887, variance = 0.1515; double sample size with original design: mean = 8.8415, variance = 0.0481; optimized design: mean = 10.8415, variance = 0.0289).

2. The effect of the numerical optimization algorithm (used for estimating the VC-dimension), causing a systematic error (underestimate) in the measured VC-dimension. A unique solution to the linear regression problem (see equation 4.4) exists when the columns of Z are linearly independent, which is true for most practical cases when the number of parameters is smaller than the number of samples (m ≤ n). However, when m is close to n, due to round-off errors in computing the Z matrix, the matrix Z^T Z can be ill-conditioned (rank-deficient). Thus, the least-squares solution may yield inaccurate (smaller) estimates of the true VC-dimension.
Note that the uniform design has many design points with a small number of samples, which results in rank-deficient solutions. In contrast, the optimized design places most design points at a larger number of samples, so its estimates of the VC-dimension tend to be more accurate (i.e., closer to the theoretical VC-dimension of 12). However, in both cases shown in Figure 2 (the original and proposed designs), we can see the systematic error (underestimate) of the VC-dimension due to numerical optimization. The advantages of the proposed nonuniform experimental design can be explained as follows. With small sample sizes, the theoretical formula for the upper bound Φ(n/h) suggests that as the sample size approaches the VC-dimension, the maximum deviation ξ(n) approaches 1.0. However, the deviation ξ(n) is upper bounded by 1.0 (the maximum deviation of the errors on two independent sets can never exceed 1.0). The random samples of ξ(n) will always fall below 1.0 but never above it, which effectively leads to underestimated VC-dimensions. This explains
why the VC-dimension estimated by the uniform design is smaller than that estimated by the optimized design, which tends to use larger sample sizes. Finally, we point out that the optimized design structure shown in Table 1 can be used for measuring the VC-dimension of any linear estimator.

5 Measuring the VC-Dimension for Penalized Estimators
This section describes experimental estimation of the VC-dimension for penalized linear estimators. The true VC-dimension of a penalized linear estimator is unknown. The practical reason for measuring the VC-dimension is to use it for model complexity control with the VC generalization bounds. We will use the measured VC-dimension for model selection in some simple regression tasks, and the performance of model selection as an indirect measure of how well the VC-dimension has been estimated. Let us consider penalized linear estimators with the following risk functional (squared loss with a ridge penalty on the norm of the coefficients):

$$R(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \|w\|^2. \tag{5.1}$$

For such estimators, the least-squares solution is similar to equation 4.4 except for the regularization term λI:

$$S_\lambda = Z \left( Z^T Z + \lambda I \right)^{-1} Z^T. \tag{5.2}$$
The regularization parameter λ controls the model complexity of the penalized estimator. Given only the training data set, the goal of model selection is to choose the optimal λ, which gives the best generalization ability (the best prediction accuracy for future data). In statistics, the model complexity of a penalized estimator (called the effective DOF) is estimated (defined) via the eigenvalues of its equivalent kernel representation (see equation 5.2) (Hastie & Tibshirani, 1990; Bishop, 1995). Since the equivalent basis function expansion can be constructed from the eigenfunction decomposition of S_λ, the sum of the eigenvalues of S_λ is used to approximate the effective DOF,

$$DOF = \operatorname{trace}(S_\lambda), \tag{5.3}$$

or, equivalently,

$$DOF = \sum_{i=1}^{n} \frac{e_i}{e_i + \lambda}, \tag{5.4}$$

where the e_i are the eigenvalues of Z^T Z.
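Equation 5.4 is straightforward to compute once the eigenvalues of Z^T Z are known; a minimal sketch (the function name is ours):

```python
def effective_dof(eigenvalues, lam):
    """Effective degrees of freedom: sum of e_i / (e_i + lambda) (eq. 5.4)."""
    return sum(e / (e + lam) for e in eigenvalues)
```

As λ approaches 0, the DOF approaches the number of nonzero eigenvalues (the unpenalized parameter count); as λ grows, the DOF shrinks toward 0. This monotone dependence is how λ controls the model complexity of the penalized estimator.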
Table 2: Design Structures for Penalized Linear Estimators.

n/h:                 0.5  0.65  0.8  1.0  1.2  1.5  2.0  2.8  3.8  5.0  6.5  8.0  10.0  15.0  20.0  30.0
Uniform design, k:    20    20   20   20   20   20   20   20   20   20   20   20    20    20    20    20
Optimized design, k:   0     0    0    0    0    0    0    0    0   18   20   30    34    58    80    80
For regression problems with a squared loss function, the practical VC generalization bound has the following form (Vapnik, 1995; Cherkassky & Mulier, 1998):

$$\text{Prediction risk} \le \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \left( 1 - \sqrt{\frac{h}{n} - \frac{h}{n} \ln \frac{h}{n} + \frac{\ln n}{2n}} \right)_+^{-1}, \tag{5.5}$$
where h is the VC-dimension of the estimator and n is the training sample size. The VC-method for model selection works as follows. According to the structural risk minimization principle (Vapnik, 1995), penalized estimators with decreasing λ's form a nested sequence of models with increasing model complexity. For each λ, we minimize the training error on a given data set using the least-squares solution (see equation 5.2). Then the VC-bound (see equation 5.5) is used to estimate the prediction risk. The optimal regularization parameter λ corresponds to the model with the smallest bound on the prediction risk (see equation 5.5). Knowledge of the VC-dimension is required under this approach. However, for penalized linear estimators, the VC-dimension is not known. In previous empirical studies on model selection with penalized estimators (Cherkassky, Shao, Mulier, & Vapnik, 1999; Cherkassky & Mulier, 1998), the VC-dimension has been approximated by the effective DOF. Let us consider three different complexity measures: the VC-dimension estimated using the original uniform design, the VC-dimension estimated using the proposed optimized design, and the effective DOF. We can use these three model complexity measures with Vapnik's bound (see equation 5.5) for model selection and compare their prediction accuracy. The optimized design structure for penalized linear estimators, shown in Table 2, is obtained by the same design procedure described in section 4, and it can be used for measuring the VC-dimension of any penalized linear estimator. Note that the optimized design structure for penalized linear estimators is almost identical to that for linear estimators (see Table 1). We conjecture that the optimized design structure proposed in this article may be used for other types of estimators as well. It uses more
Figure 3: Different measures of model complexity (effective DOF, VC-dim from the uniform design, and VC-dim from the optimized design) for the penalized linear estimator, plotted against log_2(λ).
design points with a larger number of samples, as compared to the uniform design. The penalized estimator in our experiments is a polynomial of degree 25 with constraints on the norm of its coefficients. Figure 3 shows the three complexity measures as a function of the regularization parameter λ. The three curves differ, especially when λ is small, which corresponds to high model complexity. The experimental setup for the empirical comparisons is as follows. For a given training sample and a given type of penalized linear estimator (i.e., a penalized polynomial of degree 25), the following model selection methods are used:

- Vapnik's measure (see equation 5.5) with the VC-dimension estimated by the uniform experimental design.

- Vapnik's measure (see equation 5.5) with the VC-dimension estimated via the proposed (nonuniform) experimental design.

- Vapnik's measure (see equation 5.5) with the effective DOF (see equation 5.4) used in place of the VC-dimension.
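The model selection recipe described in the text (compute the bound of equation 5.5 for each λ and keep the model with the smallest bound) can be sketched as follows; the `candidates` format is our own illustrative assumption:

```python
import math

def vc_bound(emp_risk, n, h):
    """Vapnik's bound on the prediction risk (equation 5.5): the empirical
    risk divided by (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))_+ with p = h/n."""
    p = h / n
    denom = 1.0 - math.sqrt(p - p * math.log(p) + math.log(n) / (2.0 * n))
    return float('inf') if denom <= 0.0 else emp_risk / denom

def select_model(candidates, n):
    """Pick the (lam, emp_risk, h) triple with the smallest bound.

    h is the measured VC-dimension, or the effective DOF, depending on
    which complexity measure is being compared."""
    return min(candidates, key=lambda c: vc_bound(c[1], n, c[2]))
```

Note how the penalization factor trades training error against complexity: a slightly larger empirical risk with a much smaller h can still win, which is exactly the behavior the structural risk minimization principle prescribes.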
Figure 4: Target functions used for regression. (A) Sine-squared signal. (B) Blocks signal.
The prediction accuracy of model selection is measured as the MSE, or L_2 distance, between the true target function and its estimate from the training data. In addition, two classical model selection criteria are used for comparison with the VC-method: Akaike's final prediction error (FPE) and generalized cross-validation (GCV), both using the effective DOF as the complexity measure. Two different target functions are shown in Figure 4: the relatively smooth (low-complexity) "sine-squared" signal and the relatively high-complexity "blocks" signal (Cherkassky & Mulier, 1998). The training set consists of 100 points, randomly sampled from either of the two signals with additive Gaussian noise. Each fitting (model estimation) experiment is repeated 300 times, and the prediction accuracies (MSE) of the different methods are compared using standard box plots (showing the 5th, 25th, 50th, 75th, and 95th percentiles). Comparison results are shown in Figures 5 and 6. The relative prediction performance of the model selection methods can be judged from the box plots of RISK (MSE). Box plots showing lower values of RISK correspond to better model selection approaches. In particular, better model selection approaches select models providing the lowest guaranteed
Figure 5: Model selection results for estimating the blocks signal with penalized polynomials (n = 100, SNR = 3.3). (A) Prediction accuracy (MSE). (B) Complexity (VC-dim / DOF). Key: vm = Vapnik's method (using VC-bounds); fpe = Akaike's final prediction error (using effective DOF); gcv = generalized cross-validation (using effective DOF).
prediction risk (i.e., the lowest risk at the 75th and 95th percentile marks) and also the smallest variation of the risk (i.e., narrow box plots). Notice also that the RISK axis is on a logarithmic scale: a visually small difference between box plots corresponds to a much bigger difference on a linear scale. For the penalized polynomial, the true VC-dimension is unknown, and the only way to check which complexity measure is better is to compare the results of model selection tasks. Figure 5 shows the prediction accuracy of the three model complexity measures. Here the relatively complex "blocks" signal is used to magnify the difference between the three complexity measures (they differ most when the complexity is high, as is evident from Figure 3). As we can see in Figure 5, using the VC-dimension obtained by the optimized design achieves better model selection performance (i.e., lower values of risk at the 95th and 75th percentiles) and hence better prediction accuracy than the less accurate VC-dimension obtained by the uniform design. For a smooth signal, like the "sine-squared" signal, the differences among the three complexity
1984
Xuhui Shao, Vladimir Cherkassky, and William Li
[Figure 6 plot: (A) prediction accuracy (MSE, log scale), n = 30, SNR = 2.5; (B) complexity (VC dim / DOF); both panels compare fpe DOF, gcv DOF, vm opt, vm uniform, and vm DOF.]
Figure 6: Model selection results for estimating the sine-squared function. (See Figure 5 for the key.)
measures are not very significant (less than one degree in the region of low complexity where the signal estimate is found by the VC-method; see Figure 3). Therefore, the differences in prediction performance of the three complexity measures are negligible, as shown in Figure 6. Figures 5 and 6 also show that the two classical model selection methods, FPE and GCV, generally provide estimation accuracy inferior to VC-bounds (Cherkassky & Mulier, 1998). Additional experiments (using different numbers of samples, noise levels, target functions, and approximating functions), not shown here due to space constraints, confirm that for penalized linear estimators, the proposed optimized design for measuring the VC-dimension results in better prediction accuracy.

6 Conclusion
We have proposed an improved method for the practical implementation of the experimental procedure (Vapnik et al., 1994) for measuring the VC-dimension of an estimator. Compared to the original uniform design, the
proposed experimental procedure has the flexibility to exchange sample points between fixed design points so as to minimize the proposed optimality criterion (see equation 3.1) and therefore results in better estimation accuracy. The proposed approach leads to a nonuniform experimental design for measuring the VC-dimension. Empirical comparisons for linear estimators (where the true VC-dimension is known) indicate that the proposed method achieves more accurate estimates of the VC-dimension than the original uniform design described in Vapnik et al. (1994). From a practical viewpoint, accurate estimation of the VC-dimension is seldom an ultimate goal, but it can be used for rigorous complexity control (model selection) using analytic VC-generalization bounds. Recent empirical evidence (Cherkassky & Mulier, 1998) indicates that VC-bounds enable more accurate complexity control than existing statistical approaches (such as Akaike's final prediction error, generalized cross-validation, and resampling methods). This suggests the practical importance of accurate estimation of the VC-dimension. The practical implications of an accurate measurement of the VC-dimension are illustrated for penalized linear estimators (for which analytic estimates of the VC-dimension do not exist). Using the proposed nonuniform experimental design for measuring the VC-dimension leads to more accurate model selection with VC generalization bounds. The advantages of the proposed approach (versus the original uniform design) are meaningful only for small-sample estimation problems, that is, when the true model complexity is high (see Figure 3). Recall that, according to VC-theory, small or finite-size problems correspond to settings where the (true) model complexity is of the same order as the number of training samples. In addition, our comparisons for penalized linear estimators indicate that VC-bounds yield more accurate predictive models than classical model selection criteria.
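For reference, the three analytic criteria compared in this article (FPE, GCV, and the VC-based bound) admit short closed forms. The sketch below is ours (including the function names): the FPE and GCV penalizations are the standard ones in terms of effective DOF d, and the VC-based guaranteed risk is written in the multiplicative penalization form used by Cherkassky and Mulier (1998), with h the measured VC-dimension; treat the exact constants as assumptions rather than a definitive implementation.

```python
import math

def fpe(emp_risk, d, n):
    # Akaike's final prediction error: penalize the training MSE by
    # (1 + d/n) / (1 - d/n), with d the effective degrees of freedom.
    return emp_risk * (1.0 + d / n) / (1.0 - d / n)

def gcv(emp_risk, d, n):
    # Generalized cross-validation: penalize by 1 / (1 - d/n)^2.
    return emp_risk / (1.0 - d / n) ** 2

def vc_risk_bound(emp_risk, h, n):
    # VC-based guaranteed risk: multiply the empirical risk by the factor
    # (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))^{-1} with p = h/n, taken as
    # infinite when the denominator is non-positive (model too complex).
    p = h / n
    denom = 1.0 - math.sqrt(p - p * math.log(p) + math.log(n) / (2.0 * n))
    return float("inf") if denom <= 0.0 else emp_risk / denom
```

For a fixed empirical risk, all three estimates grow with model complexity, but the VC bound penalizes high complexity far more sharply, which is consistent with the model selection differences reported in Figures 5 and 6.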
Finally, our experimental results suggest that the numerical properties of an optimization (parameter estimation) algorithm have a nontrivial effect on estimating model complexity (VC-dimension) as well as on model selection. Although this is well known for nonlinear estimators (such as neural networks, where the complexity can be controlled by the optimization procedure), we observed the same effect for linear and penalized linear estimators. This observation has broad conceptual and practical implications for model selection using VC-bounds. However, a detailed investigation of the effects of an optimization algorithm on model selection remains an open research area.
Acknowledgments
We thank V. Vapnik from AT&T Labs Research and two anonymous reviewers whose critical feedback helped to improve the quality of this article.
References

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Cherkassky, V., & Mulier, F. (1998). Learning from data: Concepts, theory and methods. New York: Wiley.
Cherkassky, V., Shao, X., Mulier, F., & Vapnik, V. (1999). Model complexity control for regression using VC generalization bounds. IEEE Trans. Neural Networks, 10, 1075–1089.
Cortes, C. (1995). Prediction of generalization ability in learning machines. Unpublished doctoral dissertation, University of Rochester.
Hastie, T., & Tibshirani, R. (1990). Generalized additive models. London: Chapman and Hall.
Li, W., & Wu, C. F. J. (1997). Columnwise-pairwise algorithms with applications to the construction of supersaturated designs. Technometrics, 39, 171–179.
Vapnik, V. (1995). The nature of statistical learning theory. Berlin: Springer-Verlag.
Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6, 851–876.

Received June 15, 1998; accepted June 23, 1999.
Neural Computation, Volume 12, Number 9
CONTENTS Article
DEA: An Architecture for Goal Planning and Classification
François Fleuret and Eric Brunet
1987
Note
On a Fast, Compact Approximation of the Exponential Function Gavin C. Cawley
2009
Letters
Bounds on Error Expectation for Support Vector Machines V. Vapnik and O. Chapelle
2013
A Computational Model of Lateralization and Asymmetries in Cortical Maps Svetlana Levitan and James A. Reggia
2037
Rate Limitations of Unitary Event Analysis A. Roy, P. N. Steinmetz, and E. Niebur
2063
Estimating Functions of Independent Component Analysis for Temporally Correlated Signals Shun-ichi Amari
2083
SMEM Algorithm for Mixture Models Naonori Ueda, Ryohei Nakano, Zoubin Ghahramani, and Geoffrey E. Hinton
2109
Stable Encoding of Finite-State Machines in Discrete-Time Recurrent Neural Nets with Sigmoid Units
Rafael C. Carrasco, Mikel L. Forcada, M. Ángeles Valdés-Muñoz, and Ramón P. Ñeco
2129
Approximate Maximum Entropy Joint Feature Inference Consistent with Arbitrary Lower-Order Probability Constraints: Application to Statistical Classification
David J. Miller and Lian Yan
2175
λ-Opt Neural Approaches to Quadratic Assignment Problems
Shin Ishii and Hirotaka Niitsuma
2209
ARTICLE
Communicated by Dana Ballard
DEA: An Architecture for Goal Planning and Classification

François Fleuret
INRIA, Domaine de Voluceau, Rocquencourt, 78150 Le Chesnay, France

Eric Brunet
Service de Psychiatrie Adulte, Hôpital Mignot, 78150 Le Chesnay, France

We introduce the differential efficiency algorithm, which partitions a perceptive space during unsupervised learning into categories and uses them to solve goal-planning and classification problems. This algorithm is inspired by a biological model of the cortex that proposes the cortical column as an elementary unit. We validate the generality of this approach by testing it on four problems with continuous time and no reinforcement signal until the goal is reached (constrained object moves, the Hanoi tower problem, animat control, and simple character recognition).

1 Introduction
Two essential cognitive functions have generated multiple attempts at computer models. The first one, classification, corresponds to the ability to isolate relevant information among a set of perceptions and to categorize it (such as objects or texture in images and words in sentences). The second one, goal planning, is the ability to generate sequences of actions to reach a goal (such as prey hunting, path finding, or strategy games) given a family of potential intermediate situations. Historically, those two problems have been studied separately. We specifically address this issue in this article. The classification field can be decomposed into two main families of algorithms. The parametric models are designed for specific problems, while the nonparametric ones, like nearest neighbors, neural nets, or decision trees (Ripley, 1994), can be used to solve problems that have no a priori model. The latter seem to be closer to the natural capabilities of animal brains. The planning problem has been studied from various points of view. Some of the models are deterministic and symbolic, like the SOAR architecture (Laird, Newell, & Rosenbloom, 1987). A few algorithms propose a connectionist approach, like neuronal models (Dehaene & Changeux, 1997) or the SLUG architecture (Mozer & Bachrach, 1991). Others are based on a probabilistic representation of the problem within Markovian decision process (MDP) theory (Howard, 1960; Puterman, 1994). They focus on the need to estimate the parameters of the model efficiently, given a decomposition of the task into a set of situations (Barto, Sutton, & Watkins, 1989;
Neural Computation 12, 1987–2008 (2000) © 2000 Massachusetts Institute of Technology
Sutton, 1990; Moore & Atkeson, 1993; Kaelbling, 1993). Some algorithms using MDP strategies repeatedly divide the situations to obtain a description with an appropriate resolution (Chapman & Kaelbling, 1991; McCallum, 1996b; Moore & Atkeson, 1995). We describe in this article the differential efficiency algorithm (DEA), used to construct a planning network. This work is inspired by the biological model of the cortex proposed by Burnod (1989). Formally, we show that despite the connectionist motivation, this model can be reduced to a standard MDP model applied to a set of situations built iteratively. We also show that on particular problems it is closely related to decision trees. It combines capabilities of both classification (binary trees) and goal-planning (MDP) algorithms. There are several differences between the tasks we present and those usually addressed by goal-planning algorithms. First, we consider both discrete- and continuous-time problems. Second, there is no reward until the goal is reached (a "goal of achievement" instead of a "goal of maintenance"). Third, the choice of action is deterministic; thus, we can estimate precisely whether the policy chosen by the algorithm enables it to reach the goal or not. The U-tree algorithm proposed by McCallum (1996b) already integrates perception and goal planning in the same structure and splits the perception states in order to obtain a realistic Markovian model of the world. We will show why the algorithm we propose is more efficient on the demanding tasks we try to solve. In section 2 we describe our framework. In section 3 we formalize the problem and the algorithm in the framework of stochastic process theory. In section 4 we test an implementation of DEA on four problems (constrained object moves, the Hanoi tower problem, animat control, and character recognition) and compare it to the U-tree algorithm.

2 Framework

2.1 Universe/Agent Decomposition.
Since we want to develop a system to solve tasks in various environments, we distinguish two objects. The first one, the universe, describes an environment and a task to solve. The second one, which we call an agent, describes the resolution system. These two objects possess an internal state and interact through a channel called an interface (this notion has already been proposed by Booker, Goldberg, & Holland, 1989). Using such a generic interface allows us to test DEA in various environments without changing it and to validate its relative generality by estimating its performance in the same way, whatever the task may be. Thus, we hope to limit, during the performance analysis, the bias due to a design of the agent specific to the problem.

2.2 Structure of a Cortical Model. Burnod (1989) raised the assumption that, given its neural connections, the anatomical unit called the cortical
column could possess functional properties underlying the mechanisms of planning and categorization. The local processing of such units could result, at a global level, in invariant recognition and sensorimotor sequence learning (Guigon, Grandguillaume, Otto, Boutkhil, & Burnod, 1994). The cortical net processing described in those works relies on a fundamental planning mechanism. The cortex is a net composed of cortical units linked to one another by interunit connections. Each of those units receives perceptive information and is able to trigger actions. It can recognize a set of perceptive states that corresponds to a situation of the animal in its environment. The unit is said to be in the active state if the animal is in the associated situation, and in the desired state if it is desirable for the animal to be in the associated situation. During learning, any connection between two units is reinforced if there exists an elementary action that, with high probability, makes the animal reach the situation associated with the second unit starting from the situation associated with the first one. The planning is performed with call trees, each one being the propagation of a signal from unit to unit. The process starts by forcing the unit associated with the goal to be in the desired state. Then the desired state spreads along strong interunit connections. This local rule of propagation is quite natural considering the meaning of the desired state and the semantics associated with the connections: if it is desirable to be in a given situation, it is desirable to be in a situation from which it can be reached. When the call tree reaches the neighborhood of an active unit, a chain of actions, corresponding to the successive connections along the call tree, is initiated. The process is represented in Figure 1. We will see in section 3.3.2 how the call tree mechanism is related to probability estimations.

2.3 DEA and the Planning Network.
The algorithm we propose here is inspired by the biologically plausible model just described. The formal similarity between the call tree mechanism as we have described it and certain probability estimations in an MDP model leads us to adopt this probabilistic framework. During a first step, the algorithm records the successive states of the universe while random actions are performed. The total number of actions required by such an exploration, even in a deterministic universe, can grow exponentially with the number of potential situations (a typical example of such a universe is the reset state space; Koenig & Simmons, 1996). This problem can be partially solved by restricting the random exploration to a biased one (Kaelbling, Littman, & Moore, 1996). The use of such "reflex" actions during the learning period is justified to some extent by the existence of such behavior in mammals and humans (Piaget, 1964).

[Figure 1 schema: units linked by interunit connections; the call tree propagates from the desired unit through inactive, non-desired units toward an active unit, and actions 1 to 3 are then executed in succession over time.]

Figure 1: Call tree schema. The desired states propagate until they reach the neighborhood of a unit in the active state. Then a succession of actions is performed.

When this random step is over, one can estimate the strengths of the interunit connections from the record of the successive states of the universe. Using those estimations, the learning process consists of progressively splitting the units, thus increasing the accuracy of the universe model associated with the net. One can compare this mechanism with the U-tree algorithm (McCallum, 1996b) or the Parti-Game algorithm (Moore & Atkeson, 1995).

3 Mathematical Formalization

3.1 Universe Definition. We define a universe as a black box that possesses an internal time-dependent state. One can control how this state evolves using actions and can partially observe its value through boolean projections called parameters. One of those parameters defines a subset of the state-space called the goal situation. Formally, given two finite sets A and P, the first defining a set of actions and the second a set of parameters, the following are the components of a universe:
A set U, the internal state-space
A random variable u_0 ∈ U, the initial state
An element g ∈ P, the goal situation
A function p: U → {0, 1}^P, the value of the parameters
A sequence of independent and identically distributed stochastic processes T_t: U × A → U, the dynamics of the internal state
Figure 2: Universe example. The cursor can move to the right and the left.
For each time step t, depending on the internal state at time t and the action performed, the process T_t determines the state at time t + 1. In all the universes that we consider in this article, the T_t are almost deterministic and involve randomness only to reinitialize the state each time the goal is reached. We will denote by S = {0, 1}^P the set of parameter-value configurations. The set I = A × S corresponds to the interface we described in section 2.1. It is important to notice that the cardinality of S can be smaller than that of U, and therefore the information transmitted by p can be incomplete. In the rest of the article, we will denote by u_t the process associated with the internal state-space and by s_t = p(u_t) the process associated with the parameter values. An example of a universe is presented in Figure 2. It is composed of a cursor that can move step by step along a horizontal segment, by one-hundredth of the segment length each time. The goal situation is reached when the cursor is in the center third of the segment, and the parameters indicate only which of the thirds the cursor is in. We have:

A = {G, F}; G and F represent the moves to the left and the right.
P = {t_1, t_2, t_3}; each parameter is associated with one-third of the segment.
g = t_2 is the goal situation.
U = [0, 1] is the internal state, representing the location of the cursor.
u_0 ∼ U(U) is the initial state of the universe.
p_{t_1} = 1_{[0, 1/3]},  p_{t_2} = 1_{]1/3, 2/3]},  p_{t_3} = 1_{]2/3, 1]}.

The T_t are deterministic and defined, for all t and all u ∈ U, by

T_t(u, G) = max{0, u − 1/100},  T_t(u, F) = min{1, u + 1/100}.
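The cursor universe just defined can be written out directly; a minimal sketch (class and method names are ours; G and F are the left and right moves, as in the text):

```python
import random

class CursorUniverse:
    """Toy universe of Figure 2: a cursor on [0, 1] moving in steps of 1/100."""
    STEP = 0.01

    def __init__(self):
        self.u = random.random()          # u_0 ~ Uniform([0, 1])

    def parameters(self):
        # p(u): one boolean per third of the segment; the middle one is t_2.
        return (self.u <= 1/3, 1/3 < self.u <= 2/3, self.u > 2/3)

    def at_goal(self):
        # Goal situation g = t_2: cursor in the center third.
        return self.parameters()[1]

    def step(self, action):
        # Deterministic dynamics T_t(u, a); 'G' moves left, 'F' moves right.
        if action == 'G':
            self.u = max(0.0, self.u - self.STEP)
        else:
            self.u = min(1.0, self.u + self.STEP)
```

Note that `parameters()` returns only three booleans while the state `u` is a real number: the interface information is incomplete, exactly as discussed above for S and U.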
3.2 Definition of an Agent. Given a universe, we define an agent as an abstract object that chooses at each time step an action to perform, depending on the actions done in the past and on the past and current values of the parameters. Those actions have to be chosen so as to reach the goal situation
starting from any situation. Formally, given an interface A × S and a parameter g representing the goal situation, an agent is a sequence of independent stochastic processes A_t: (S × A)^t × S → A. It is impossible to formalize exactly how efficiently we expect the agent to reach the goal situation. In most cases, a pure random strategy almost surely reaches the goal situation in a finite time. Thus, one has to take into account the time the agent needs to reach the goal. An exact formalization would involve its ability to survive, which does not concern us here. We will restrict ourselves to the following simple criterion: the agent should be able, after a given number of learning steps, to reach the goal situation, starting from any situation, within a fixed maximum duration. This duration will be different for each universe and will be of the order of the minimum time required to reach the goal, starting from the worst situation. An agent defines the evolution of the state of the universe. Indeed, denoting by a_t the process associated with the actions, we have, for all t,

a_t = A_t(s_0, a_0, . . . , s_{t−1}, a_{t−1}, s_t),  u_{t+1} = T_t(u_t, a_t),  with s_t = p(u_t).  (3.1)
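The baseline behavior discussed above, equation 3.1 driven by a purely random agent, can be sketched as follows (the callback-style interface and the names are ours):

```python
import random

def run_random_agent(step, at_goal, actions, max_steps=10_000):
    """Drive a universe with a uniform random policy until the goal
    situation is reached or max_steps elapse; return the hitting time
    (or None on failure), per the time-to-goal criterion in the text."""
    for t in range(max_steps):
        if at_goal():
            return t
        step(random.choice(actions))   # a_t drawn without regard to history
    return None
```

Such a random agent almost surely reaches the goal eventually, but its hitting time, not just eventual success, is what the criterion above measures.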
3.3 Structure and Functioning of the Planning Network.
3.3.1 Unit Activation Definition. Given a universe, a planning network with N units is a family of functions a_1, . . . , a_N: S → {0, 1}. Each describes the activation of one unit, and we suppose in the rest of the article that the first one is associated with the goal situation, so that for all x ∈ S, a_1(x) = x_g. We define a policy as a function π: {0, 1}^N → A, which associates with each configuration of the activations of all the units an action to perform. The a_i and a policy π define an agent, and, according to equation 3.1, the process u_t is well defined. We will denote by a_1(t), . . . , a_N(t) the processes associated with the activations of the units, defined by a_i(t) = a_i(p(u_t)) for all i. We assume that these processes are stationary. According to the description of the planning network in section 2.2, it seems quite natural to define the desired state d_i of a unit as the probability of reaching the goal situation given that this unit is activated. The higher this probability is, the more the organism has an interest in being in the situation associated with the unit. Formally, under the stationarity assumption and letting P_π denote the probabilities corresponding to the use of the policy π, we obtain:

d_i = max_π P_π(∃t < ∞, a_1(t) = 1 | a_i(0) = 1).
3.3.2 Estimation of the Desired Activations. We make the assumption in the rest of the article that the a_i are exclusive; we will show we can force this property when we build the net:

for all p,  Σ_i a_i(p) = 1.  (3.2)
Any policy π can then be reduced to a function π: {1, . . . , N} → A, which associates with each unit an action to perform. We can also define a process c(t), which is the index of the activated unit at time t. We define t_0 as the time of the first activation modification: t_0 = min{t > 0, c(t) ≠ c(0)}. If we make a Markovian assumption¹ on c_t, we obtain the following standard equilibrium equation (Bellman, 1957; Bertsekas & Tsitsiklis, 1989):

d_1 = 1,
d_i = max_a Σ_j P̂_{π(i)=a}(t_0 < ∞, a_j(t_0) = 1 | a_i(0) = 1) d_j,  for all i ≠ 1.  (3.3)
To estimate those probabilities, we make the assumption that for any action a ∈ A, there exists a maximum delay of effect, T_max(a), the maximum duration required for an action to provoke a change. We also favor the shortest path by introducing a decreasing factor γ close to 1 (this term can be considered as the probability of the organism surviving during the next time step; Kaelbling et al., 1996). So, equation 3.3 becomes:

d_1 = 1,
d_i = γ max_a Σ_j P̂_{π(i)=a}(t_0 < T_max(a), a_j(t_0) = 1 | a_i(0) = 1) d_j,  for all i ≠ 1.  (3.4)

Let us define

λ^a_{i,j} = γ P̂_{π(i)=a}(t_0 < T_max(a), a_j(t_0) = 1 | a_i(0) = 1).  (3.5)
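The λ^a_{i,j} of equation 3.5 are estimated as count ratios over the record of the reflex period, as the text explains below; a minimal sketch under our own data layout, omitting the continuous-time correction factor the text discusses later:

```python
from collections import defaultdict

def estimate_lambdas(record, t_max, gamma=0.99):
    """record: list of (unit_index, action, duration_to_next_change) triples
    from the reflex period. Returns lam[(i, a)][j], the estimated discounted
    probability that doing a in unit i reaches unit j within T_max(a)
    time steps (equation 3.5), as successes / attempts."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for k in range(len(record) - 1):
        i, a, dt = record[k]
        j = record[k + 1][0]              # unit activated after the change
        totals[(i, a)] += 1
        if dt < t_max(a):                 # change occurred within the delay of effect
            counts[(i, a)][j] += 1
    return {key: {j: gamma * c / totals[key] for j, c in cj.items()}
            for key, cj in counts.items()}
```

Transitions slower than T_max(a) count as attempts but not as successes, which is what drives the estimated probability below 1 for unreliable actions.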
Equation 3.4 gives the following desired state propagation (DP) algorithm:

d_1 = 1,
d_i = max_a Σ_j λ^a_{i,j} d_j,  for all i ≠ 1.  (DP)

And we obtain the following action selection (AS) rule:

π(i) = arg max_a Σ_j λ^a_{i,j} d_j,  for all i.  (AS)
¹ We do not make the assumption that c_t is Markovian, only that the process restricted to the instants of activation modification is Markovian. This weaker assumption is satisfactory because the actions do not have an instantaneous effect.
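Equations DP and AS amount to a fixed-point iteration over the desired activations, in the spirit of value iteration; the sketch below is ours (the zero-based goal index, the dictionary layout, and the function name are assumptions):

```python
def propagate_desired(lam, n_units, actions, iters=200):
    """Fixed-point iteration for equations DP and AS. lam[(i, a)] maps a
    successor unit j to the estimated lambda^a_{i,j}; unit 0 plays the role
    of the goal unit (unit 1 in the text), so d[0] is pinned to 1."""
    d = [0.0] * n_units
    d[0] = 1.0
    for _ in range(iters):
        for i in range(1, n_units):
            # DP rule: d_i = max_a sum_j lambda^a_{i,j} d_j
            d[i] = max(sum(p * d[j] for j, p in lam.get((i, a), {}).items())
                       for a in actions)
    # AS rule: for each unit, pick the action maximizing the same sum.
    policy = {i: max(actions,
                     key=lambda a: sum(p * d[j]
                                       for j, p in lam.get((i, a), {}).items()))
              for i in range(1, n_units)}
    return d, policy
```

The repeated sweeps play the role of the call tree propagation of section 2.2: desired activation flows backward from the goal along the strongest connections.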
[Figure 3 diagrams: left, the exact four-state process over A, B, C, and Goal; right, the approximation in which A and B are merged into a single state, with transitions of probability 0.5 to C and to Goal.]
Figure 3: The Markovian process with four states shown on the left is poorly approximated by the process shown on the right, which mixes two of its states.
To estimate the equilibrium of equation DP, one should consider this equation as a recursive definition of a vector sequence, the limit of which is the equilibrium. If we consider the d_i to represent the desired activations, this iterative algorithm can be seen as the local call tree rule described in section 2.2, the intercortical connections being represented by the λ^a_{i,j} coefficients. Likewise, the action selection rule is described by equation AS. In some sense, it would be natural to consider the term we maximize in equation AS as an action's desired activation: the most desired action becomes activated. During the reflex period, the algorithm records all the values of the universe parameters and actions. It can then count the number of times a given event has occurred and estimate the λ^a_{i,j} values as ratios. During the construction of the units as well, the algorithm uses only the record made during the reflex period; the agent does not perform active learning or test a policy. The estimation of the λ^a_{i,j} is not as simple as with discrete-time universes. In a continuous-time universe, actions take a few cycles to produce an effect. Consequently, the efficiency of actions activated for very short periods of time is underestimated (this is similar to truncated observations in the statistical analysis of clinical trials). Thus, the distribution of action durations induces a bias in the frequency of successful actions, and one has to use a correction factor to estimate the transition probabilities properly.

3.3.3 Weakness of the Markovian Model. Even if the Markovian model used to justify equation 3.3 is natural, it is weak. It models reality only if the a_i functions are appropriate. One can demonstrate the weakness of such a model with an example. Let us consider the Markovian process with four states A, B, C, and Goal, shown on the left in Figure 3.
This process goes from state A to state Goal with probability 1; if it is in one of the two other states, it oscillates between them forever. Let us now consider an approximation of this process that merges the states A and B into one unique state A ∪ B. The probability that this approximation will go from A ∪ B to C is equal to P(B | A ∪ B), and the
DEA: An Architecture
1995
probability of going to Goal is P(A | A ∪ B). If those two probabilities are equal to 0.5, then the initial process is modeled by the process shown on the right in Figure 3. The estimated probability of reaching the Goal state in this rough model is then 1 (the state would oscillate between C and A ∪ B until it reaches Goal, which happens almost surely in a finite time period). This estimate is far from the real probability, which equals P(Goal) + P(A) and can be arbitrarily small. Such pathological processes are not just theoretical problems. We encountered them in the universes we used for testing the algorithm, so we must deal with them in order to build the net. Also, there exist universes such that a realistic Markovian model has precisely the structure of the pathological one we have seen. For example, in a universe in which the agent has to seek a target by alternating between an action to go forward and an action to recenter the target, the oscillations between the states "target centered" and "target not centered" enable it to reach the goal almost surely. Thus, we cannot expect to detect the anomaly simply by examining the values of the transition probabilities.

3.4 Construction of the Net.
3.4.1 Class of the a_i Functions. For the planning unit nets we consider in the rest of this article, each a_i is just a function of a few of the universe parameters and equals 1 if and only if the vector of parameters has a given value (the units behave like the classifiers proposed in Booker et al., 1989). Therefore, one can associate with each a_i a vector v ∈ {], 0, 1}^P such that

(a_i(x) = 1) ⇔ (for all p ∈ P, (x_p = v_p) or (v_p = ])),

where ] plays the role of an unconstrained ("don't care") parameter.
3.4.2 Iterative Construction of the Planning Unit Net. The construction of the net is done by iteratively increasing the total number of units. It consists of splitting the situation associated with one of the units, replacing this unit by two new units whose values depend on one more parameter than the original did. Formally, we will denote by R_0, . . . , R_k the sequence of nets, and by a^n_1, . . . , a^n_{n+2} the activations of the units of R_n. Let us denote by j the index of the unit we are going to split to build R_{n+1} from R_n, and by p the parameter used for the split (j and p depend on n; we will see later how they are determined). Then we have, for all x ∈ S:

a^{n+1}_k(x) = a^n_k(x)  for k < j,
a^{n+1}_j(x) = a^n_j(x) x_p,
a^{n+1}_{j+1}(x) = a^n_j(x)(1 − x_p),
a^{n+1}_{k+1}(x) = a^n_k(x)  for k > j.
The unit j of R_n is replaced by two units in R_{n+1}; all the others remain the same. We will say that R_n is the father of R_{n+1} and that R_{n+1} is a son of R_n.

3.4.3 Split Selection. When the program is initialized, the net has only two units. The first one is associated with the goal (cf. section 3.3.1), the other with its complement. At each step of the construction, one needs to determine which unit to split and with respect to which parameter. We use a brute-force method consisting of examining all the possible splits, computing a value function for each of them, and keeping the one that maximizes the value. We try to maximize the probability for the net of reaching the goal situation, starting from any initial situation. A natural value function would be the estimate of the probability of reaching the goal situation using the policy selected by the net. Under the Markovian and stationarity hypotheses, and denoting by π the policy chosen by the net R, we obtain

Y(R) = P̂_π(∃t < ∞, a_1(t) = 1)
     = Σ_i P̂_π(∃t < ∞, a_1(t) = 1 | a_i(0) = 1) P̂(a_i(0) = 1)
     = Σ_i d_i P̂(a_i(0) = 1).
As we saw in section 3.3.3, such an expression, using only the Markovian model, could provoke dramatic errors by leading to the choice of a net that overestimates its probability of reaching the goal. To handle this, we propose a value function that compares the quality of the models of the sons and of their father.

3.4.4 Differential Efficiency. The Y function above estimates the probability of reaching the goal, starting from any initial situation and using the optimum policy chosen by the net. One can estimate this probability for any other policy by using a propagation rule similar to equation DP with a fixed π. Let R be a net and R′ one of its sons. By definition, any policy π of R associates with each unit an action to perform. Hence a policy of R is also a policy of R′: consider the policy that associates with every unit common to both R and R′ the same action as π does, and with the two new units of R′ the action that π associates with the unit they replace. Then, in view of the preceding discussion, we can estimate with R′ the probability of reaching the goal using the policy π chosen by R. Thus, if R′ is a son of R, if π is the optimum policy chosen by R, and π′ the one chosen by R′, we compute with R′ the following value function, called the differential efficiency (DE), to estimate its quality:

W(R′, R) = P̂_{π′}(∃t < ∞, a_1(t) = 1) − P̂_π(∃t < ∞, a_1(t) = 1).  (DE)
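Given the desired activations d_i and the empirical occupancies P̂(a_i(0) = 1), both value functions reduce to a few lines; a sketch under our own naming, with `reach_prob` implementing the Y estimate and `diff_efficiency` equation DE:

```python
def reach_prob(d, occupancy):
    """Estimated probability of reaching the goal from a random initial
    situation: sum_i d_i * P(a_i(0) = 1)  (the Y value function)."""
    return sum(di * pi for di, pi in zip(d, occupancy))

def diff_efficiency(d_own, d_father_policy, occupancy):
    """Equation DE: the son net's estimate under its own optimal policy,
    minus its estimate of the father's policy. Both desired-activation
    vectors are computed on the son's units."""
    return reach_prob(d_own, occupancy) - reach_prob(d_father_policy, occupancy)
```

Since both terms are evaluated on the same (son) net, the difference isolates what the split itself contributed, rather than rewarding a net that merely inflates its own estimate.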
This value function is large if R′ estimates a high probability of reaching the goal, but also if it estimates that its father has a low probability of doing so. In particular, if a net estimates that its policy is no better than its father's, it has a zero value. Most of the time, a large value of this function is related to an error in the Markovian model associated with R that is solved by R′. Thus, this split rule takes into account both the quality of the model and the efficiency in reaching the goal. The Kolmogorov-Smirnov test used in U-tree (McCallum, 1996a, 1996b) gives large values to splits that substantially change the expectation of reward, and it does not select splits that permit reaching the goal efficiently, as DEA does. In most of the tests we have made, U-tree requires more states than DEA does. Comparing the cost of these two algorithms, it appears that both consider all possible splits and compute a value that requires the estimation of the d_i in the candidate networks. For the U-tree algorithm, those d_i are used to estimate the distribution of the rewards (i.e., the distribution of the desired activation of the next activated unit, given the selected action), whereas DEA uses the d_i to compare the estimates of the probability of reaching the goal. The costs of those two operations are similar, and we can, like McCallum (1996b), apply a few techniques in order to reduce the number of considered splits. Finally, the cost of one step is the same for DEA and U-tree. The latter reduces the total cost by applying several splits at each step. We have not tried this strategy with DEA, and as seen in section 4, in some cases U-tree performs better if splits are applied sequentially.

4 Results

4.1 Implementation and Evaluation Protocol. The experiments were done using a program written in C++ with GNU tools on GNU/Linux computers. The kernel simulating the planning unit net is the same whatever universe it is tested on.
On a standard PC, for the examples presented here, the maximum computation time required to build the net is less than 15 minutes. The DEA is tested on four different universes, addressing various categories of problems. For a given universe, during an initial step, the algorithm records the values of the universe's parameters and actions while the agent acts randomly, following reflex rules. The duration of this period and the random choice of actions depend on the universe.² Then the algorithm builds the net by successively adding units, one at a time. The test protocol consists of estimating the performance of the net after each split. To compare the duration of the training period with the equivalent for discrete-time problems, we have to keep in mind the duration of continuous-time actions; a cycle in a continuous-time universe is not equivalent to one action trial.

This repeated evaluation period consists of measuring, in 1000 test cases, the capability of the agent to reach the goal situation, starting from a random initial situation. The failure rate is the proportion of attempts in which the agent did not reach the goal situation within a fixed time. This maximum time before failure will be indicated for each universe and is of the order of the minimum time required to reach the goal starting from the worst situation. This failure rate is a demanding measure: it quantifies the ability of the net to generate a complete chain of actions from any situation to the goal. Because the choice of action is deterministic, a net that generates all but one of the actions necessary to reach the goal fails. Thus, the net can have an arbitrarily high failure rate even if it determines almost all the necessary actions along the trajectory to the goal.

We compare DEA to the U-tree algorithm developed by McCallum (1996b). The DEA algorithm applies splits one after another, whereas U-tree simultaneously applies all splits considered useful. Thus, we also tested a modified version of U-tree we call U-tree*, which applies splits sequentially.

². The duration of the reflex period is empirically fixed so that the λ coefficients are estimated with enough accuracy.

1998
François Fleuret and Eric Brunet

Figure 4: Block game. The blocks can move in four directions. The upper-left and upper-right squares are obstacles.

4.2 Block Game. The block game is a 3 × 2 grid on which two blocks (X and Y, as shown in Figure 4) can move. At each time step, one of the two blocks can move a fraction of the size of a square in one of four directions (up, right, down, left). Therefore, there are eight actions. The moving blocks cannot penetrate each other, and the upper-left and upper-right squares of the grid constitute obstacles the blocks cannot penetrate either. The goal situation is reached when block X is in the lower-right square and block Y is in the lower-left square. There are eight parameters, indicating whether one of the two blocks covers, partially or totally, one of the four grid squares. Each block can be on two squares, and so can activate two parameters at the same time.
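The remark that a block can activate two occupancy parameters at once can be made concrete. The fragment below is a schematic of ours, not code from the article: a block at a fractional position straddles two adjacent squares.

```c
/* A block at continuous position p (in units of squares, 0 <= p <= n-1)
   covers square floor(p), and also square floor(p)+1 whenever p lies
   strictly between two square boundaries. */
static void covered_squares(double p, int cover[], int n)
{
    for (int s = 0; s < n; s++)
        cover[s] = 0;
    int lo = (int)p;
    cover[lo] = 1;
    if (p > (double)lo && lo + 1 < n)
        cover[lo + 1] = 1;  /* fractional position: two parameters active */
}
```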
Figure 5: Results for the block game: failure rate versus the number of units.
The learning period has a total duration of 250,000 time steps. At each step, the action performed has probability 0.9 of being the same as the one performed during the previous step. If it changes, a new action is chosen at random. Each block can move one-eighth of a square size at each time step, so the goal can always be reached in fewer than 48 time steps. The maximum time before failure has been fixed at 100 steps. Figure 5 shows the failure rate versus the number of units. A zero failure rate is obtained with a 15-unit net. At this point, the growth process stops because the W function is zero on all new nets. The total number of possible configurations of parameter values is 24. Hence, this experiment demonstrates the generalization capability of DEA, which is able to solve the task without building one unit for each configuration. U-tree (respectively, U-tree*) requires 21 (respectively, 23) units to solve the same task with no error.

4.3 Hanoi Towers. The Hanoi towers game is a famous goal-planning task. There are three disks of different diameters that can be threaded onto three vertical rods (see Figure 6). The player can move the upper disk of any rod to another rod but can never put a disk on a smaller one. The goal is to move the three disks to the rod on the right. There are six actions, corresponding to the rod from which the upper disk is removed and the rod on which it is placed. An action can have no effect if there is no disk on the first rod chosen, or if the move is illegal due to the diameter constraint. There are nine parameters, each one associated with a rod and a disk.
Figure 6: Hanoi towers problem.
Figure 7: Results for the Hanoi towers problem: failure rate versus the number of units.
The learning period lasts for 30,000 steps. At each step, an action is chosen randomly. All the actions last one cycle. Because the problem requires at most nine moves in the worst case, the maximum time before failure has been fixed at 20 time steps. A zero failure rate is reached with a 20-unit net (see Figure 7). The average time required to reach the goal decreases (almost linearly, from 5.3 to 4.6 cycles) as the number of units increases, even after the failure rate has reached zero. By making the model finer, DEA is able to find shorter paths to the goal. The error rates for both versions of U-tree show that this algorithm requires an almost exhaustive model to solve the problem with a zero error rate (it builds 26 units, and the universe has 27 states).

4.4 Beaver World. The beaver world consists of a square playing field containing three trees, two houses, and an animat (see Figure 8). The actions allow the animat to move and to grasp a tree or release it. The goal situation
Figure 8: Beaver world. The animat can turn, go forward, and grasp and release trees.
is reached when the animat is close to a house and a tree at the same time. We define "close to the animat" as closer than a fixed distance, corresponding to the circle delimiting the field of view around the animat in the figure. There are five actions: GF makes the animat move forward one-hundredth of the playing-field border length; TR and TL make the animat turn to the right and to the left by one-hundredth of a complete turn; Tk makes the animat grasp a tree if there is one close to it; and Pt makes the animat release the tree if it was holding one. The animat perceives the universe with two eyes (its field of view is composed of 5 × 2 cells, with limits that go from −60° to +60° relative to the direction of the animat) and by touching (it knows if there is a house or a tree close to it, and it knows if it is holding a tree). Each object (house or tree) has a color randomly selected from a set of 8 colors. There are 32 parameters: Gl represents the goal. Hl indicates that the animat is holding a tree. at indicates that the animat is touching a tree. a0,0 to a4,1 indicate that the animat has a tree in 1 of the 10 cells of its field of view. bt indicates that the animat is touching a house.
Figure 9: Results for the beaver world: failure rate versus the number of units.
b0,0 to b4,1 indicate that the animat has a house in 1 of the 10 cells of its field of view. c0 to c7 indicate which colors are visible in the field of view.

The training period lasts for 2.5 million steps. During this period, the action performed at each step depends on the values of the parameters. If the animat touches a tree (respectively, a house), it has a high probability of performing the "grasp tree" action (respectively, the "release tree" action). Otherwise, if there is an object (tree or house) in its field of view, it has a high probability of going forward; otherwise, it turns left or right with equal probability. This rough strategy enables the animat to explore the square and to perform the "grasp tree" and "release tree" actions in appropriate conditions and frequently enough to estimate the statistical parameters. To solve the problem, the animat must first go to a tree and grasp it, then go to a house and release the tree near the house. The maximum time required to reach the goal has been empirically measured and is under 200 time steps. The maximum time before failure has been fixed at 1000 time steps. The measured total number of parameter value configurations is 62,000, and DEA builds a net with 8 units, which solves the problem with a zero failure rate (see Figure 9). In order to control its trajectory to a tree or to a house, the animat goes forward when the target is in front of it and turns in order to put the target back in front as soon as it leaves it. The net built to perform this strategy has two substructures dedicated, respectively, to guidance to a tree and guidance to a house.
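The guidance behavior just described (go forward when the target is ahead, turn to re-center it otherwise) can be sketched as a tiny policy. The 5-column indexing of the field of view below is our own assumption, not taken from the article:

```c
enum action { GF, TR, TL };  /* go forward, turn right, turn left */

/* Toy guidance policy: the 5 x 2 field of view has columns 0..4,
   with column 2 straight ahead.  Go forward when the target is
   centered; otherwise turn so as to bring it back to the center. */
static enum action guide(int target_column)
{
    if (target_column == 2)
        return GF;
    return target_column > 2 ? TR : TL;
}
```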
Figure 10: The detectors are associated with four kinds of local geometric configurations: extremity, corner, curve, and T-point.
Both versions of U-tree perform very poorly on this task. Comparing the units built by DEA with those built by U-tree, it appears that whereas the former creates only the units required for the retina, both versions of the latter go on splitting units related to the retina. Since one can improve the model of the universe in terms of prediction (but not in terms of efficiency in reaching the goal) by creating one unit for each possible configuration of the retina's parameters, this behavior of U-tree is to be expected. The reason for such a high failure rate is that, when put in a random state, the beaver has a very high probability of being far from both trees and houses. Thus, a network whose policy is not complete (a complete policy makes the animat go to the tree, grasp it, and then go to a house and release the tree) has an error rate close to 100%.

4.5 Eye Universe. This last universe is an elementary character recognition problem and demonstrates to some extent the classification capabilities of DEA and its relation to active vision (Rimey & Brown, 1994; Swain & Stricker, 1993). In this universe, the net can move a retina or name the object it has in its field of view. Therefore, we unify goal planning and classification by adding "answer actions." The universe consists of a square retina that can move over a symbol to be recognized, which is chosen from a set of four possible characters. The retina is divided into four quarters, each holding four geometric feature detectors, which detect, respectively, extremities, corners, curves, and T-junctions (see Figure 10) and can be considered combinations of elementary border detectors, such as those that exist in the cortex (Hubel & Wiesel, 1962). Depending on the character shown to the retina and its location, some of the detectors are activated (see Figure 11).

There are eight different actions: four related to the displacement of the retina in each of the four possible directions (one-twentieth of the retina border length at each step) and four others related to the naming of the character. There are 16 parameters, each associated with a quarter of the retina and a detector. The goal is reached when the net correctly gives the name of the character. We used two different sets of four characters: {1, 2, 3, 4} and {A, B, C, D}.
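The ambiguity-reduction behavior of the net in this universe can be illustrated schematically. The detector signatures and the policy below are invented for illustration; they are not the learned net:

```c
enum action { MOVE, ANSWER_2, ANSWER_3 };

/* Toy view: how many retina quarters the character occupies and an
   invented detector signature per occupied quarter. */
struct view { int nquarters; int sig[2]; };

/* Both '2' and '3' give the same signature in a single quarter
   (extremities + corners + curves); spread over two quarters their
   signatures differ ('2' cannot activate two corner detectors). */
static struct view observe(int character, int spread)
{
    struct view v;
    if (!spread) { v.nquarters = 1; v.sig[0] = 7; }
    else {
        v.nquarters = 2;
        v.sig[0] = 7;
        v.sig[1] = (character == 2) ? 1 : 2;
    }
    return v;
}

/* Policy: move the retina while the view is ambiguous, then answer. */
static enum action policy(struct view v)
{
    if (v.nquarters < 2) return MOVE;
    return (v.sig[1] == 1) ? ANSWER_2 : ANSWER_3;
}

static enum action recognize(int character)
{
    int spread = 0;
    for (int t = 0; t < 10; t++) {
        enum action a = policy(observe(character, spread));
        if (a == MOVE) { spread = 1; continue; }
        return a;
    }
    return MOVE; /* not reached in this toy */
}
```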
Figure 11: Two examples of characters in the eye universe. Each quarter of the retina possesses four different feature detectors.
The characters are piecewise linear, with labels "extremity," "corner," "curve," and "T-point" assigned by hand at each vertex. We used the same sets of symbols during learning and testing; therefore, we do not prove any form of generalization capability. The learning period lasts 1 million time steps, during which the agent alternates between movements and answers. The movements are chosen in such a way that the character is roughly centered: if none of the detectors in one half of the retina is activated, the movements that bring the character into this part of the retina are chosen with a higher probability.

The functioning of the net after the learning period demonstrates the utility of an approach that allows the recognition system to reduce ambiguities by moving the retina over the character. For example, the symbols 2 and 3 activate the same detectors if they lie in one quarter of the retina, since both have extremities, corners, and curves, and hence cannot be separated. To solve this problem, the net moves them so that they lie on two quarters of the retina (the 2 symbol cannot activate two "corner" detectors, whereas the 3 can). The numbers of possible parameter value configurations are, respectively, 247 and 221 for the two sets of characters, and the nets required to solve the problems have 23 and 12 units. On the first set of characters, U-tree and DEA have similar failure rates, but DEA is more efficient on the second set (see Figure 12).

4.5.1 Comparison with Classification Trees. We can compare DEA to algorithms for binary classification trees by considering a universe like the eye universe, but without actions to modify the data. Let us call $Y$ the class of the object to be recognized. There is one action associated with each class, and the goal is reached if and only if the agent performs the right action. If $\alpha$ is a function of the parameters, put

$$I(\alpha) = \max_{a} \hat{P}(Y = a \mid \alpha(0) = 1)\, \hat{P}(\alpha(0) = 1).$$
Figure 12: Results for the eye universe for the sets of characters {1, 2, 3, 4} (top) and {A, B, C, D} (bottom): failure rate versus number of units.
It is easy to show that, denoting by $\alpha$ the activation of the unit to split and by $\alpha'$ and $\alpha''$ the activations of the two created units, we obtain

$$W(R', R) = I(\alpha') + I(\alpha'') - I(\alpha).$$

This corresponds to a splitting rule that tries to minimize the error rate at each step. This rule has been described and criticized in Breiman, Friedman, Olshen, and Stone (1984). Its main defect is that reducing the misclassification rate at each step does not produce a good global strategy.
2006
Fran¸cois Fleuret and Eric Brunet
For example, two splits that lead to the same error rate are equivalent for this splitting rule, even if one of the splits produces a unit that is pure with respect to $Y$, which is a better configuration, since such a unit is optimal and can be ignored afterward. The standard method used to solve this problem consists of using criteria such as the Shannon entropy, which yield better estimates of the "purity" of units because they depend on all the class probabilities, not only on the probability of the most probable class. We believe that a good criterion for the construction of the net should reduce to the Shannon entropy–based rule, or an equivalent one, in this particular universe.

5 Discussion
Starting from analogies with cortical structures, we have developed a planning network with a formalization based on MDP theory. We have also proposed a procedure to construct the net by adding units, equivalent to iteratively splitting the perception state. We have shown that this algorithm can solve both goal-planning and classification problems, demonstrating that those functions can be combined in one structure. The comparison with U-tree, which addresses the same tasks, underscores the interest of DEA, which attempts to maximize both the quality of the Markovian model and the probability of reaching the goal. With such a strategy, DEA is able to generate more compact models (with a smaller number of states) and needs a smaller number of transition-probability estimations to create a fully efficient Markovian model dedicated to one goal. In some critical cases (see the beaver example), increasing the quality of the Markovian model without taking into account the efficiency relative to the goal leads to an excessive and useless creation of states. This algorithm is a simple example of a large class of algorithms using the general idea of efficiency maximization. It could be improved by adding a rule for unit fusion, allowing the splitting process to backtrack and cancel a split when it appears useless.

Acknowledgments
We thank Yves Burnod for his advice and participation, and Donald Geman for his help during the preparation of this article.

References

Barto, A. G., Sutton, R. S., & Watkins, C. J. C. H. (1989). Learning and sequential decision making (Tech. Rep. COINS 89-95). Amherst, MA: University of Massachusetts.
Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Bertsekas, D. P., & Tsitsiklis, J. N. (1989). Parallel and distributed computation. Englewood Cliffs, NJ: Prentice Hall.
Booker, L. B., Goldberg, D. E., & Holland, J. H. (1989). Classifier systems and genetic algorithms. Artificial Intelligence, 40, 235–282.
Breiman, L., Friedman, J., Olshen, R., & Stone, C. (1984). Classification and regression trees. Belmont, CA: Wadsworth.
Burnod, Y. (1989). An adaptive neural network: The cerebral cortex. Paris: Masson (Collection Biologie Théorique).
Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the International Joint Conference on Artificial Intelligence (Sydney, Australia).
Dehaene, S., & Changeux, J.-P. (1997). A hierarchical neuronal network for planning behavior. Proc. Natl. Acad. Sci. USA, 94, 13293–13298.
Guigon, E., Grandguillaume, P., Otto, I., Boutkhil, L., & Burnod, Y. (1994). Neural network models of cortical functions based on the computational properties of the cerebral cortex. Journal of Physiology, Paris, 88, 291–308.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: MIT Press.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160, 106–154.
Kaelbling, L. P. (1993). Learning in embedded systems. Cambridge, MA: MIT Press.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Koenig, S., & Simmons, R. G. (1996). The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22, 227–250.
Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). SOAR: An architecture for general intelligence. Artificial Intelligence, 33, 1–64.
McCallum, A. K. (1996a). Learning to use selective attention and short-term memory in sequential tasks. In From animals to animats: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior. Cambridge, MA: MIT Press.
McCallum, A. K. (1996b). Reinforcement learning with selective perception and hidden state. Unpublished doctoral dissertation, University of Rochester, Rochester, NY.
Moore, A. W., & Atkeson, C. G. (1993). Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13, 103–130.
Moore, A. W., & Atkeson, C. G. (1995). The parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21.
Mozer, M. C., & Bachrach, J. (1991). SLUG: A connectionist architecture for inferring the structure of finite-state environments. Machine Learning, 7, 139–160.
Piaget, J. (1964). Six études de psychologie. Paris: Éditions Denoël.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
Rimey, R. R., & Brown, C. M. (1994). Control of selective perception using Bayes nets and decision theory. International Journal of Computer Vision, 12, 173–207.
Ripley, B. D. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society, Series B, 56(3), 409–456.
Sutton, R. S. (1990). Integrated architecture for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (pp. 216–224). San Mateo, CA: Morgan Kaufmann.
Swain, M. J., & Stricker, M. A. (1993). Promising directions in active vision. International Journal of Computer Vision, 11, 109–126.

Received September 18, 1998; accepted September 9, 1999.
NOTE
Communicated by Nicol Schraudolph
On a Fast, Compact Approximation of the Exponential Function

Gavin C. Cawley
University of East Anglia, Norwich, Norfolk NR4 7TJ, England

Recently Schraudolph (1999) described an ingenious, fast, and compact approximation of the exponential function through manipulation of the components of a standard IEEE-754 (IEEE, 1985) floating-point representation. This brief note communicates a recoding of this procedure that overcomes some of the limitations of the original macro at little or no additional computational expense.

1 Motivation
Schraudolph quite rightly describes exponentiation as the quintessential nonlinear function of neural computation, and he provides C code implementing a fast approximation of this function that yields impressive speed improvements both in benchmark tests and in a real-world application (JPEG quality transcoding; Lazzaro & Wawrzynek, 1999). The use of a static global variable is, however, problematic in a multithreaded environment. This note describes a reimplementation of the algorithm that eliminates the global variable without incurring significant additional computational expense.

2 The Algorithm
IEEE standard 754 stipulates that double-precision floating-point numbers be represented in the form

$$(-1)^s (1 + m)\, 2^{x - x_0},$$

where $s$ is the sign bit, $m$ is the 52-bit normalized mantissa (a binary fraction in the range $[0, 1)$), and $x$ is the 11-bit exponent with a bias $x_0 = 1023$. Schraudolph's fast approximation to the exponential function is based on the identity $e^y = 2^{y/\ln 2}$. Setting the exponent, $x$, equal to the integer part of $y/\ln 2 + x_0$ provides a simple approximation to $e^y$. This approach can be implemented using a C/C++ union construct, allowing a double-precision IEEE-754 floating-point number, $d$, to be stored in the same locations in memory as a structure comprising two 32-bit signed integers, $i$ and $j$, as shown in Figure 1. Setting

$$i = \frac{2^{20}}{\ln 2}\, y + 2^{20} x_0 - C, \qquad j = 0$$
Neural Computation 12, 2009–2012 (2000) © 2000 Massachusetts Institute of Technology
2010
Gavin C. Cawley
Figure 1: Bit representation of the union data structure used by the exponential function (Schraudolph, 1999).
results in the integer part of $d$ containing the integer part of $y/\ln 2 + x_0$, the factor of $2^{20}$ providing the necessary shift left by 20 places. The fractional part overflows into the most significant bits of the mantissa, providing a degree of linear interpolation between integral values. The parameter $C$ provides some control over the error properties of the approximation; a value of 60801 minimizes the root-mean-square (RMS) relative error.

3 The Implementation
The revised C++ implementation is shown in Figure 2. The EXP macro has been replaced by an inline function, exponential, and the static global variable _eco is replaced by a local variable of exponential. This approach brings two benefits. First, this implementation is more suitable for use in multithreaded programs because the static global variable has been eliminated. Second, the use of a function, instead of a macro, more cleanly encapsulates the algorithm and permits better use of type-checking rules. Note that this code can also be compiled by a standard ANSI C compiler if the inline keyword is omitted.

    #define EXP_A (1048576/M_LN2)
    #define EXP_C 60801

    inline double exponential(double y)
    {
        union
        {
            double d;
    #ifdef LITTLE_ENDIAN
            struct { int j, i; } n;
    #else
            struct { int i, j; } n;
    #endif
        } _eco;
        _eco.n.i = (int)(EXP_A*(y)) + (1072693248 - EXP_C);
        _eco.n.j = 0;
        return _eco.d;
    }

Figure 2: C++ code fragment implementing a fast approximation of the exponential function.

4 Benchmark Results

Table 1 displays benchmark results for the original and revised implementations of Schraudolph's approximation. The benchmark used is similar to Schraudolph's, except that, for accuracy, the sum of 10^9 exponentials of random arguments is computed. Note that the exponential function is slightly faster than the EXP macro, except in the case of the Sun workstation, where the EXP macro is marginally faster.

Table 1: Seconds Required for 10^9 Exponentiations.

    Manufacturer           Intel            Intel              Sun            DEC
    Processor model/speed  Pentium II/300   Pentium Xeon/450   Ultra 5/360    Alpha 500au/500
    LITTLE_ENDIAN          Yes              Yes                No             Yes
    Operating system       WinNT            Linux              Sun OS 5.7     Digital UNIX V4.0E
    Compiler               egcs-2.91.57     egcs-2.90.27       CC 5.0         DEC C V5.8-009
    Source                 C                C++                C++            C
    Optimization           -O3              -O1                -O4            -O2
    exp (libm.a)           918              431                677            130
    EXP macro              415              281                125            41
    exponential function   391              261                129            40

The ISO/ANSI C++ standard (ISO/IEC, 1998) states that the inline keyword suggests, but does not require, that the statements comprising the body of the function be expanded into the body of the calling function, eliminating the computational expense of the function call. However, most modern C and C++ compilers are capable of performing this optimization, especially for small leaf functions such as this one. One might still expect the revised code to be slower, because a local variable must be allocated and initialized each time the function is called, whereas the original macro is able to reuse a global variable that is statically allocated and initialized. The function, though, does not have the side effect of modifying a global variable. Because it is generally infeasible for the compiler to determine statically whether this value is ever used, the original macro must contain code to update the global variable, whereas the function can be optimized so that the temporary variable exists only within registers.

5 Summary
A recoding of the splendid approximation of the exponential function described by Schraudolph is presented. Without incurring significant additional computational expense, it provides the following additional benefits:

- The reimplementation is better suited to a multithreaded environment because the static global variable has been eliminated.
- Type-checking rules can be applied more strongly, and the algorithm is encapsulated more cleanly, for improved reliability.

Acknowledgments
I thank Mark Fisher, Danilo Mandic, and the anonymous reviewers for their helpful comments, and Shaun McCullagh for his assistance in collecting the benchmark results.

References

IEEE. (1985). Standard for binary floating-point arithmetic (IEEE/ANSI Std. 754-1985). New York: American National Standards Institute/Institute of Electrical and Electronics Engineers.
ISO/IEC. (1998). Programming languages—C++ (ISO/IEC Std. 14882:1998(E)). New York: American National Standards Institute.
Lazzaro, J., & Wawrzynek, J. (1999). JPEG quality transcoding using neural networks trained with a perceptual error measure. Neural Computation, 11(1), 267–298.
Schraudolph, N. N. (1999). A fast, compact approximation of the exponential function. Neural Computation, 11(4), 853–862.

Received August 11, 1999; accepted October 8, 1999.
LETTER
Communicated by Peter Bartlett
Bounds on Error Expectation for Support Vector Machines

V. Vapnik, O. Chapelle
AT&T Labs-Research; École Normale Supérieure de Lyon, Lyon, France

We introduce the concept of the span of support vectors (SVs) and show that the generalization ability of support vector machines (SVMs) depends on this new geometrical concept. We prove that the value of the span is always smaller (and can be much smaller) than the diameter of the smallest sphere containing the support vectors, which was used in previous bounds (Vapnik, 1998). We also demonstrate experimentally that the prediction of the test error given by the span is very accurate and has direct application in model selection (choice of the optimal parameters of the SVM).
1 Introduction
Recently, a new type of algorithm with a high level of performance, called support vector machines (SVMs), has been introduced (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995). Usually, the good generalization ability of SVMs is explained by the existence of a large margin. Bounds on the error rate for a hyperplane that separates the data with some margin were obtained by Bartlett and Shawe-Taylor (1999) and Shawe-Taylor, Bartlett, Williamson, and Anthony (1998). In Vapnik (1998), another type of bound was obtained, which demonstrated that, for the separable case, the expectation of the error probability for hyperplanes passing through the origin depends on the expectation of $R^2/\rho^2$, where $R$ is the maximal norm of the support vectors and $\rho$ is the margin. In this article we derive bounds on the expectation of error for SVMs from the leave-one-out estimator, which is an unbiased estimate of the probability of test error. These bounds (which are tighter than the one defined in Vapnik, 1998, and valid for hyperplanes not necessarily passing through the origin) depend on a new concept, the span of support vectors. The bounds obtained show that the generalization ability of SVMs depends on more complex geometrical constructions than the large margin. To introduce the concept of the span of support vectors, we have to describe the basics of SVMs.

Neural Computation 12, 2013–2036 (2000) © 2000 Massachusetts Institute of Technology
2014
V. Vapnik and O. Chapelle
2 SVM for Pattern Recognition
We call the hyperplane

$$w_0 \cdot x + b_0 = 0$$

optimal if it separates the training data

$$(x_1, y_1), \ldots, (x_\ell, y_\ell), \qquad x \in \mathbb{R}^m,\ y \in \{-1, 1\},$$

and if the margin between the hyperplane and the closest training vector is maximal. This means that the optimal hyperplane has to satisfy the inequalities

$$y_i (w \cdot x_i + b) \ge 1, \qquad i = 1, \ldots, \ell,$$
and has to minimize the functional $R(w) = w \cdot w$. This quadratic optimization problem can be solved in the dual space of Lagrange multipliers. One constructs the Lagrangian

$$L(w, b, \boldsymbol{\alpha}) = \frac{1}{2}\, w \cdot w - \sum_{i=1}^{\ell} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \tag{2.1}$$

and finds its saddle point: the point that minimizes this function with respect to $w$ and $b$ and maximizes it with respect to

$$\alpha_i \ge 0, \qquad i = 1, \ldots, \ell. \tag{2.2}$$
Minimization over $w$ defines the equation

$$w = \sum_{i=1}^{\ell} \alpha_i y_i x_i, \tag{2.3}$$

and minimization over $b$ defines the equation

$$\sum_{i=1}^{\ell} \alpha_i y_i = 0. \tag{2.4}$$
Substituting equation 2.3 back into the Lagrangian, equation 2.1, and taking into account equation 2.4, we obtain the functional

$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j\, x_i \cdot x_j, \tag{2.5}$$
which we have to maximize with respect to the parameters $\boldsymbol{\alpha}$, satisfying two constraints: the equality constraint, equation 2.4, and the positivity constraints, equation 2.2. The optimal solution $\boldsymbol{\alpha}^0 = (\alpha_1^0, \ldots, \alpha_\ell^0)$ specifies the coefficients of the optimal hyperplane

$$w_0 = \sum_{i=1}^{\ell} \alpha_i^0 y_i x_i.$$

Therefore the optimal hyperplane is

$$\sum_{i=1}^{\ell} \alpha_i^0 y_i\, x_i \cdot x + b_0 = 0, \tag{2.6}$$

where $b_0$ is chosen to maximize the margin. It is important to note that the optimal solution satisfies the Kuhn-Tucker conditions

$$\alpha_i^0 \left[ y_i (w_0 \cdot x_i + b_0) - 1 \right] = 0.$$

From these conditions, it follows that if the expansion of vector $w_0$ uses vector $x_i$ with nonzero weight $\alpha_i^0$, then the following equality must hold:

$$y_i (w_0 \cdot x_i + b_0) = 1. \tag{2.7}$$

Vectors $x_i$ that satisfy this equality are called support vectors. Note that the norm of vector $w_0$ defines the margin $\rho$ between the optimal separating hyperplane and the support vectors:

$$\rho = \frac{1}{\| w_0 \|}.$$

Therefore, taking into account equations 2.4 and 2.7, we obtain

$$\frac{1}{\rho^2} = w_0 \cdot w_0 = \sum_{i=1}^{\ell} y_i \alpha_i^0\, w_0 \cdot x_i = \sum_{i=1}^{\ell} y_i \alpha_i^0 (w_0 \cdot x_i + b_0) = \sum_{i=1}^{\ell} \alpha_i^0, \tag{2.8}$$
where r is the margin for the optimal separating hyperplane. In the nonseparable case, we introduce slack variables ji and minimize the functional R(w , b, ® ) D
` X 1 w¢w CC ji , 2 iD 1
subject to constraints yi (w 0 ¢ xi C b0 ) ¸ 1 ¡ ji ji ¸ 0.
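The separable-case dual problem, maximizing the functional 2.5 subject to constraints 2.2 and 2.4, is a small quadratic program and can be solved with off-the-shelf tools. A minimal sketch on a toy data set (the data and tolerances below are our own illustrative choices, not from the article):

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data set (illustrative choice).
X = np.array([[0., 3.], [1., 1.], [0., -3.], [-1., -1.]])
y = np.array([1., 1., -1., -1.])

# G_ij = y_i y_j x_i . x_j, the matrix of the quadratic term in 2.5.
G = (y[:, None] * y[None, :]) * (X @ X.T)

def neg_W(a):
    # Negative of the dual functional W(alpha), equation 2.5 (we minimize).
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_W, np.zeros(len(y)),
               bounds=[(0., None)] * len(y),              # constraint 2.2
               constraints=[{'type': 'eq',
                             'fun': lambda a: a @ y}])    # constraint 2.4
a0 = res.x
w0 = (a0 * y) @ X                                         # equation 2.3
sv = a0 > 1e-6                                            # support vectors
b0 = np.mean(y[sv] - X[sv] @ w0)                          # from equation 2.7
margins = y * (X @ w0 + b0)
```

On this data the solver recovers \(w_0 = (0.5, 0.5)\) and \(b_0 = 0\); every training point satisfies \(y_i(w_0 \cdot x_i + b_0) \ge 1\), with equality on the support vectors, as the Kuhn-Tucker conditions require.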
2016
V. Vapnik and O. Chapelle
When the constant \(C\) is sufficiently large and the data are separable, the solution of this optimization problem coincides with the one obtained in the separable case. To solve this quadratic optimization problem in the nonseparable case, we consider the Lagrangian

\[
L(w, b, \alpha, \nu) = \frac{1}{2}\, w \cdot w \;-\; \sum_{i=1}^{\ell} \alpha_i \bigl[ y_i (w \cdot x_i + b) - 1 + \xi_i \bigr] + C \sum_{i=1}^{\ell} \xi_i - \sum_{i=1}^{\ell} \nu_i \xi_i,
\]

which we minimize with respect to \(w\), \(b\), and \(\xi_i\) and maximize with respect to the Lagrange multipliers \(\alpha_i \ge 0\) and \(\nu_i \ge 0\). Minimization over \(w\) and \(b\) again leads to conditions 2.3 and 2.4, and minimization over \(\xi_i\) gives the new condition

\[
\alpha_i + \nu_i = C.
\tag{2.9}
\]

Taking into account that \(\nu_i \ge 0\), we obtain

\[
0 \le \alpha_i \le C.
\tag{2.10}
\]

Substituting equation 2.3 into the Lagrangian, we find that in order to construct the optimal hyperplane, one has to maximize the functional 2.5 subject to constraints 2.4 and 2.10. The box constraints, 2.10 (instead of the positivity constraints, 2.2), constitute the difference between the methods for constructing optimal hyperplanes in the nonseparable and the separable case. For the nonseparable case, the Kuhn-Tucker conditions

\[
\alpha_i^0 \bigl[ y_i (w_0 \cdot x_i + b_0) - 1 + \xi_i \bigr] = 0,
\qquad
\nu_i \xi_i = 0,
\tag{2.11}
\]

must be satisfied. Vectors \(x_i\) that correspond to nonzero \(\alpha_i^0\) are referred to as support vectors. For support vectors, the equalities

\[
y_i (w_0 \cdot x_i + b_0) = 1 - \xi_i
\]

hold. From conditions 2.11 and 2.9, it follows that if \(\xi_i > 0\), then \(\nu_i = 0\) and therefore \(\alpha_i = C\). We will distinguish two types of support vectors: support vectors for which \(0 < \alpha_i^0 < C\) and support vectors for which \(\alpha_i^0 = C\). To simplify notation, we sort the support vectors such that the first \(n^*\) support vectors belong to the first category (with \(0 < \alpha_i < C\)) and the next \(m = n - n^*\) support vectors belong to the second category (with \(\alpha_i = C\)). When constructing SVMs, one usually maps the input vectors \(x \in X\) into a high-dimensional (even infinite-dimensional) feature space \(\psi(x) \in F\), where one constructs the optimal separating hyperplane. Note that both
the optimal hyperplane, 2.6, and the target functional, 2.5, that has to be maximized to find the optimal hyperplane depend on the inner product between two vectors rather than on the input vectors explicitly. Therefore one can use a general representation of the inner product in order to calculate it. It is known that the inner product between two vectors \(\psi(x_1) \cdot \psi(x_2)\) has the following general representation,

\[
\psi(x_1) \cdot \psi(x_2) = K(x_1, x_2),
\]

where \(K(x_1, x_2)\) is a kernel function that satisfies the Mercer conditions (a symmetric positive definite function). The form of the kernel function \(K(x_1, x_2)\) depends on the type of mapping \(\psi(x)\) of the input vectors. In order to construct the optimal hyperplane in feature space, it is sufficient to use a kernel function instead of the inner product in expressions 2.5 and 2.6. In what follows, we consider bounds in the input space \(X\). However, all results remain true for any mapping \(\psi\). To obtain the corresponding results in a feature space, one uses the representation of the inner product in feature space, \(K(x, x_i)\), instead of the inner product \(x \cdot x_i\).

3 The Leave-One-Out Procedure
The bounds introduced in this article are derived from the leave-one-out cross-validation estimate, a procedure usually used to estimate the probability of test error of a learning algorithm. Suppose that using training data of size \(\ell\), one tries simultaneously to estimate a decision rule and evaluate its quality. Using the training data, one constructs a decision rule. Then one uses the same training data to evaluate the quality of the obtained rule by means of the leave-one-out procedure: one removes one element (say, \((x_p, y_p)\)) from the training data, constructs the decision rule on the basis of the remaining training data, and then tests the removed element. In this fashion, one tests all \(\ell\) elements of the training data (using \(\ell\) different decision rules). Let us denote the number of errors made by the leave-one-out procedure by \(\mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)\). Luntz and Brailovsky (1969) proved the following lemma:

Lemma 1. The leave-one-out procedure gives an almost unbiased estimate of the probability of test error:

\[
E p_{\mathrm{error}}^{\ell-1} = E \left( \frac{\mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)}{\ell} \right),
\]

where \(p_{\mathrm{error}}^{\ell-1}\) is the probability of test error for the machine trained on a sample of size \(\ell - 1\).

"Almost" in lemma 1 refers to the fact that the probability of test error is for samples of size \(\ell - 1\) instead of \(\ell\).
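The procedure itself is easy to state in code. A minimal sketch, with a deliberately simple decision rule (a one-dimensional nearest-centroid classifier of our own choosing, standing in for the SVM):

```python
def train(data):
    """Return a decision rule f(x) from labeled data [(x, y)] with y in {-1, +1}.
    Here: a toy nearest-centroid rule (an illustrative stand-in for an SVM)."""
    pos = [x for x, y in data if y > 0]
    neg = [x for x, y in data if y < 0]
    c_pos, c_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return lambda x: 1 if abs(x - c_pos) < abs(x - c_neg) else -1

def loo_error_count(data):
    """L(x_1, y_1, ..., x_l, y_l): number of errors of the leave-one-out procedure."""
    errors = 0
    for p in range(len(data)):
        x_p, y_p = data[p]
        f = train(data[:p] + data[p + 1:])   # rule built without element p
        if f(x_p) != y_p:
            errors += 1
    return errors

data = [(-2.0, -1), (-1.5, -1), (-0.5, -1), (-0.1, 1), (0.6, 1), (1.4, 1), (2.2, 1)]
estimate = loo_error_count(data) / len(data)   # leave-one-out estimate, as in lemma 1
```

For these seven points, only the boundary point \((-0.1, +1)\) is misclassified when it is held out, so the leave-one-out estimate is \(1/7\).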
Remark. For SVMs, one needs to conduct the leave-one-out procedure only for the support vectors: nonsupport vectors will be recognized correctly, since removing a point that is not a support vector does not change the decision function.

In section 5, we introduce upper bounds on the number of errors made by the leave-one-out procedure. For this purpose, we need to introduce a new concept, the span of support vectors.

4 Span of the Set of Support Vectors
Let us first consider the separable case. Suppose that \((x_1, y_1), \dots, (x_n, y_n)\) is the set of support vectors, and

\[
\alpha^0 = (\alpha_1^0, \dots, \alpha_n^0)
\]

is the vector of Lagrange multipliers for the optimal hyperplane. For any fixed support vector \(x_p\), we define the set \(\Lambda_p\) as a constrained linear combination of the points \(\{x_i\}_{i \ne p}\):

\[
\Lambda_p = \left\{ \sum_{i=1,\, i \ne p}^{n} \lambda_i x_i : \ \sum_{i=1,\, i \ne p}^{n} \lambda_i = 1, \ \text{and} \ \forall i \ne p, \ \alpha_i^0 + y_i y_p \alpha_p^0 \lambda_i \ge 0 \right\}.
\]

Note that \(\lambda_i\) can be less than 0. We also define the quantity \(S_p\), which we call the span of the support vector \(x_p\), as the distance between \(x_p\) and this set (see Figure 1):

\[
S_p^2 = d^2(x_p, \Lambda_p) = \min_{x \in \Lambda_p} (x_p - x)^2.
\tag{4.1}
\]
As shown in Figure 2, it can happen that \(x_p \in \Lambda_p\), which implies \(S_p = d(x_p, \Lambda_p) = 0\). Intuitively, the smaller \(S_p = d(x_p, \Lambda_p)\) is, the less likely the leave-one-out procedure is to make an error on the vector \(x_p\). Indeed, we will prove (see lemma 3) that if \(S_p < 1 / (D \alpha_p^0)\) (where \(D\) is the diameter of the smallest sphere containing the training points), then the leave-one-out procedure classifies \(x_p\) correctly. By setting \(\lambda_p = -1\), we can rewrite \(S_p\) as

\[
S_p^2 = \min \left\{ \left( \sum_{i=1}^{n} \lambda_i x_i \right)^2 : \ \lambda_p = -1, \ \sum_{i=1}^{n} \lambda_i = 0, \ \alpha_i^0 + y_i y_p \alpha_p^0 \lambda_i \ge 0 \right\}.
\tag{4.2}
\]
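Equation 4.2 is itself a small quadratic program and can be evaluated numerically. A sketch for a toy configuration in the spirit of Figure 1 (the points, labels, and multipliers below are assumed for illustration; they satisfy \(\alpha_1 = \alpha_2 = \alpha_3 / 2\) and equation 2.4):

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy configuration: three support vectors with
# alpha^0 = (1, 1, 2) and labels y = (+1, +1, -1), so sum_i alpha_i^0 y_i = 0.
X = np.array([[0., 0.], [2., 0.], [1., 1.]])
y = np.array([1., 1., -1.])
alpha0 = np.array([1., 1., 2.])

def span_sq(p):
    """S_p^2 from equation 4.2: minimize ||sum_i lambda_i x_i||^2 subject to
    lambda_p = -1, sum_i lambda_i = 0, alpha_i^0 + y_i y_p alpha_p^0 lambda_i >= 0."""
    n = len(X)
    lam_start = np.zeros(n)
    lam_start[p] = -1.0
    lam_start[(p + 1) % n] = 1.0          # a feasible starting point
    res = minimize(
        lambda lam: float(((lam @ X) ** 2).sum()),
        lam_start,
        constraints=[
            {'type': 'eq',   'fun': lambda lam: lam[p] + 1.0},
            {'type': 'eq',   'fun': lambda lam: lam.sum()},
            {'type': 'ineq', 'fun': lambda lam: alpha0 + y * y[p] * alpha0[p] * lam},
        ])
    return res.fun

s1_sq = span_sq(0)
```

Here `span_sq(0)` evaluates to 2: the closest point of \(\Lambda_1\) to \(x_1 = (0, 0)\) is \(x_3\), reached with \(\lambda = (-1, 0, 1)\).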
Figure 1: A two-dimensional example: three support vectors with \(\alpha_1 = \alpha_2 = \alpha_3 / 2\). The set \(\Lambda_1\) is the semi-open dashed line \(\Lambda_1 = \{ \lambda_2 x_2 + \lambda_3 x_3 : \ \lambda_2 + \lambda_3 = 1, \ \lambda_2 \ge -1, \ \lambda_3 \le 2 \}\).
The maximal value of \(S_p\) is called the \(S\)-span:

\[
S = \max \{ d(x_1, \Lambda_1), \dots, d(x_n, \Lambda_n) \} = \max_p S_p.
\]

We will prove (cf. lemma 2 below) that \(S_p \le D_{SV}\). Therefore,

\[
S \le D_{SV}.
\tag{4.3}
\]
Depending on \(\alpha^0 = (\alpha_1^0, \dots, \alpha_n^0)\), the value of the span \(S\) can be much less than the diameter \(D_{SV}\) of the support vectors. Indeed, in Figure 2, \(d(x_1, \Lambda_1) = 0\), and by symmetry, \(d(x_i, \Lambda_i) = 0\) for all \(i\). Therefore, in this example, \(S = 0\). Now we generalize the span concept to the nonseparable case, in which we distinguish between two categories of support vectors: the support vectors for which

\[
0 < \alpha_i < C, \qquad i = 1, \dots, n^*,
\]

and the support vectors for which

\[
\alpha_j = C, \qquad j = n^* + 1, \dots, n.
\]
Figure 2: In this example, \(x_1 \in \Lambda_1\) and therefore \(d(x_1, \Lambda_1) = 0\). The set \(\Lambda_1\) has been computed using \(\alpha_1 = \alpha_2 = \alpha_3 = \alpha_4\).
We define the span of the support vectors using the support vectors of the first category; that is, we consider the value \(S_p = d(x_p, \Lambda_p)\), where

\[
\Lambda_p = \left\{ \sum_{i=1,\, i \ne p}^{n^*} \lambda_i x_i : \ \sum_{i=1,\, i \ne p}^{n^*} \lambda_i = 1, \ \forall i \ne p, \ 0 \le \alpha_i^0 + y_i y_p \alpha_p^0 \lambda_i \le C \right\}.
\]

The definition of the span for the nonseparable case differs from the separable case in that we ignore the support vectors of the second category and add the upper bound \(C\) in the constraints on \(\lambda_i\). Therefore, in the nonseparable case, the value of the span of the support vectors depends on the value of \(C\). It is not obvious that the set \(\Lambda_p\) is not empty; this is proved in the following lemma.

Lemma 2. In both the separable and the nonseparable case, the set \(\Lambda_p\) is not empty. Moreover, \(S_p = d(x_p, \Lambda_p) \le D_{SV}\).
The proof is in the appendix.

Remark. From lemma 2, we conclude (as in the separable case) that

\[
S \le D_{SV},
\tag{4.4}
\]
where \(D_{SV}\) is the diameter of the smallest sphere containing the support vectors of the first category.

5 The Bounds

The generalization ability of SVMs can be explained by their capacity control. Indeed, the VC dimension of hyperplanes with margin \(\rho\) is less than \(D^2 / 4\rho^2\), where \(D\) is the diameter of the smallest sphere containing the training points (Vapnik, 1995). This is the theoretical idea motivating the maximization of the margin. This section presents new bounds on the generalization ability of SVMs. The major improvement lies in the fact that the bounds depend on the span of the support vectors, which gives tighter bounds than those depending on the diameter of the training points. Let us first introduce our fundamental result:

Lemma 3. If in the leave-one-out procedure a support vector \(x_p\) corresponding to \(0 < \alpha_p < C\) is recognized incorrectly, then the inequality

\[
\alpha_p^0 S_p \max \left( D, \frac{1}{\sqrt{C}} \right) \ge 1
\]

holds true.

See the proof in the appendix. Lemma 3 leads to the following theorem for the separable case:

Theorem 1. Suppose that an SVM separates training data of size \(\ell\) without error. Then the expectation of the probability of error \(p_{\mathrm{error}}^{\ell-1}\) for the SVM trained on training data of size \(\ell - 1\) has the bound

\[
E p_{\mathrm{error}}^{\ell-1} \le E \left( \frac{S D}{\ell \rho^2} \right),
\]

where the values of the span of the support vectors \(S\), the diameter of the smallest sphere containing the training points \(D\), and the margin \(\rho\) are taken for training sets of size \(\ell\).

Proof. Let us prove that the number of errors made by the leave-one-out procedure is bounded by \(S D / \rho^2\); taking the expectation and using lemma 1 will then prove the theorem. Consider a support vector \(x_p\) incorrectly classified by the leave-one-out procedure. Then lemma 3 gives \(\alpha_p^0 S_p D \ge 1\) (we consider here the separable
case and \(C = \infty\)), and therefore

\[
\alpha_p^0 \ge \frac{1}{S D}
\]

holds true. Now let us sum the left-hand and the right-hand sides of this inequality over all support vectors on which the leave-one-out procedure commits an error:

\[
\frac{\mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)}{S D} \le \sum\nolimits^{*} \alpha_i^0.
\]

Here \(*\) indicates that the sum is taken only over the support vectors on which the leave-one-out procedure makes an error. From this inequality we have

\[
\frac{\mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)}{\ell S D} \le \frac{1}{\ell} \sum_{i=1}^{n} \alpha_i^0.
\]

Therefore (using equation 2.8),

\[
\frac{\mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)}{\ell} \le \frac{S D \sum_{i=1}^{n} \alpha_i^0}{\ell} = \frac{S D}{\ell \rho^2}.
\]
Taking the expectation over both sides of the inequality and using the Luntz and Brailovsky lemma, we prove the theorem.

For the nonseparable case, the following theorem is true:

Theorem 2. The expectation of the probability of error \(p_{\mathrm{error}}^{\ell-1}\) for an SVM trained on training data of size \(\ell - 1\) has the bound

\[
E p_{\mathrm{error}}^{\ell-1} \le E \left( \frac{S \max \left( D, 1/\sqrt{C} \right) \sum_{i=1}^{n^*} \alpha_i^0 + m}{\ell} \right),
\]

where the sum is taken only over the \(\alpha_i\) corresponding to support vectors of the first category (for which \(0 < \alpha_i < C\)), and \(m\) is the number of support vectors of the second category (for which \(\alpha_i = C\)). The values of the span of the support vectors \(S\), the diameter of the smallest sphere containing the training points \(D\), and the Lagrange multipliers \(\alpha^0 = (\alpha_1^0, \dots, \alpha_n^0)\) are taken for training sets of size \(\ell\).

Proof. The proof of this theorem is similar to the proof of theorem 1. We count all support vectors of the second category (corresponding to \(\alpha_j = C\)) as errors. For the first category of support vectors, we estimate the
number \(\mathcal{L}^*(x_1, y_1, \dots, x_\ell, y_\ell)\) of errors made by the leave-one-out procedure using lemma 3, as in the proof of theorem 1. We obtain

\[
\frac{\mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)}{\ell}
\le \frac{\mathcal{L}^*(x_1, y_1, \dots, x_\ell, y_\ell) + m}{\ell}
\le \frac{S \max \left( D, 1/\sqrt{C} \right) \sum\nolimits^{*} \alpha_i + m}{\ell}.
\]

Taking the expectation over both sides of the inequality and using the Luntz and Brailovsky lemma, we prove the theorem.

Note that when \(m = 0\) (the separable case), equality 2.8 holds true. In this case (provided that \(C\) is large enough), the bounds obtained in the two theorems coincide. In theorems 1 and 2, it is possible, using inequality 4.4, to bound the value of the span \(S\) by \(D_{SV}\), the diameter of the smallest sphere containing the support vectors. But as the experiments show (see section 6), this would lead to looser bounds, because the span can be much less than the diameter.

5.1 Extension. In the proof of lemma 3, it appears that the diameter \(D\) of the training points can be replaced by the span of the support vectors after the leave-one-out procedure. But since the set of support vectors after the leave-one-out procedure is unknown, we bounded this unknown span by \(D\). Nevertheless, this remark motivated us to analyze the case where the set of support vectors remains the same during the leave-one-out procedure. In this situation, we are allowed to replace \(D\) by \(S\) in lemma 3; more precisely, the following theorem is true:
Theorem 3. If the sets of support vectors of the first and second categories remain the same during the leave-one-out procedure, then for any support vector \(x_p\), the following equality holds:

\[
y_p \bigl( f^0(x_p) - f^p(x_p) \bigr) = \alpha_p^0 S_p^2,
\]

where \(f^0\) and \(f^p\) are the decision functions given by the SVM trained, respectively, on the whole training set and on the training set with the point \(x_p\) removed.

The proof is in the appendix. The assumption that the set of support vectors does not change during the leave-one-out procedure can be violated. Nevertheless, the proportion of points that violate this assumption is usually small compared with the number of support vectors. In this case, theorem 3 provides a good approximation of the result of the leave-one-out procedure, as the experiments show (see Figure 4).
Note that theorem 3 is stronger than lemma 3 in three respects: the term \(S_p \max(D, 1/\sqrt{C})\) becomes \(S_p^2\), the inequality becomes an equality, and the result is valid for any support vector. The previous theorem enables us to compute the number of errors made by the leave-one-out procedure:

Corollary 1. Under the assumption of theorem 3, the test error prediction given by the leave-one-out procedure is

\[
t_\ell = \frac{1}{\ell}\, \mathcal{L}(x_1, y_1, \dots, x_\ell, y_\ell)
= \frac{1}{\ell}\, \mathrm{Card} \left\{ p : \ \alpha_p^0 S_p^2 \ge y_p f^0(x_p) \right\}.
\tag{5.1}
\]
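In code, the span rule 5.1 is a one-line count once the multipliers, spans, and decision values are available. A sketch (the numbers below are invented purely to exercise the formula):

```python
import numpy as np

def span_rule_estimate(alpha0, span2, y, f0, ell):
    """Equation 5.1: fraction of training points predicted to be misclassified
    by the leave-one-out procedure. The arrays run over the support vectors;
    ell is the training-set size."""
    flagged = alpha0 * span2 >= y * f0      # alpha_p^0 S_p^2 >= y_p f^0(x_p)
    return flagged.sum() / ell

# Invented values for three support vectors, for illustration only.
alpha0 = np.array([0.5, 1.0, 2.0])
span2  = np.array([0.4, 1.5, 0.2])          # S_p^2
y      = np.array([1., -1., 1.])
f0     = np.array([0.9, -1.0, 1.1])         # f^0(x_p)
t = span_rule_estimate(alpha0, span2, y, f0, ell=10)   # -> 0.1
```

Only the second support vector is flagged, so the predicted leave-one-out error rate is \(1/10\).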
6 Experiments
The previous bounds on the generalization ability of SVMs involved the diameter of the smallest sphere enclosing the training points (Vapnik, 1995). We have shown (cf. inequality 4.4) that the span \(S\) is always smaller than this diameter, but to appreciate the gain, we conducted some experiments. First, we compared the diameter of the smallest sphere enclosing the training points, the diameter of the sphere enclosing the support vectors, and the span of the support vectors, using the postal database. This data set consists of 7291 handwritten digits of size 16 × 16, with a test set of 2007 examples. Following Schölkopf, Shawe-Taylor, Smola, and Williamson (1999), we split the training set into 23 subsets of 317 training examples. Our task is to separate digits 0 to 4 from 5 to 9. The error bars in Figure 3 are standard deviations over the 23 trials. The diameters and the span in Figure 3 are plotted for different values of \(\sigma\), the width of the RBF kernel we used:

\[
K(x, y) = e^{- \frac{\| x - y \|^2}{2 \sigma^2}}.
\]

In this example, the span is up to six times smaller than the diameter. We would like to use the span to predict the test error accurately. This would enable us to perform efficient model selection, that is, to choose the optimal values of the parameters of the SVM (the width \(\sigma\) of the radial basis function (RBF) kernel or the constant \(C\), for instance). Note that the span \(S\) is defined as a maximum, \(S = \max_p S_p\); taking into account the individual values \(S_p\) should therefore provide a more accurate estimate of the generalization error than the span \(S\) alone. We therefore used the span rule (see equation 5.1) of corollary 1 to predict the test error. Our experiments have been carried out on two databases: a separable one, the postal database, described above, and a noisy one, the breast cancer database.¹ The latter has been split randomly 100 times into a training set containing 200 examples and a test set containing 77 examples.
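As a quick reference, the RBF kernel above in code (a direct transcription of the formula, not taken from the authors' software):

```python
import math

def rbf_kernel(x, y, sigma):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), the RBF kernel of section 6."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

k_same = rbf_kernel([1.0, 2.0], [1.0, 2.0], sigma=0.5)   # -> 1.0
k_far  = rbf_kernel([1.0, 0.0], [0.0, 0.0], sigma=1.0)   # -> exp(-0.5)
```

The kernel equals 1 for identical inputs and decays with the squared distance at a rate controlled by \(\sigma\).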
¹ Available online at http://horn.first.gmd.de/~raetsch/data/breast-cancer.
Figure 3: Comparison of \(D\), \(D_{SV}\), and \(S\) as functions of \(\log \sigma\) (curves: diameter of the training points, diameter of the support vectors, and span).
Figure 4a compares the test error and its prediction given by the span rule, equation 5.1, for different values of the width \(\sigma\) of the RBF kernel on the postal database. Figure 4b plots the same quantities for different values of \(C\) on the breast cancer database. The prediction is very accurate, and the curves are almost identical. The computation of the span rule, equation 5.1, involves computing the span \(S_p\), equation 4.2, for every support vector. Note, however, that we are interested in the inequality \(S_p^2 \le y_p f^0(x_p) / \alpha_p^0\) rather than in the exact value of the span \(S_p\). Therefore, if while minimizing \(S_p = d(x_p, \Lambda_p)\) we find a point \(x^* \in \Lambda_p\) such that \(d(x_p, x^*)^2 \le y_p f^0(x_p) / \alpha_p^0\), we can stop the minimization, because this point will be classified correctly by the leave-one-out procedure. Figure 5 compares the time required to train the SVM on the postal database, to compute the estimate of the leave-one-out procedure given by the span rule, equation 5.1, and to compute the leave-one-out procedure exactly. In order to have a fair comparison, we optimized the computation of the leave-one-out procedure in the following way: for every support vector \(x_p\), we take as the starting point of the optimization (see equation 2.5) involved in computing \(f^p\) (the decision function after the point \(x_p\) has been removed) the solution obtained for \(f^0\) on the whole training set. The reason is that \(f^0\) and \(f^p\) are usually "close."
Figure 4: Test error and its prediction using the span rule, equation 5.1. (A) On the postal database, as a function of \(\log \sigma\). (B) On the breast cancer database, as a function of \(\log C\).
Figure 5: Comparison of the time (in seconds) required for SVM training, for the computation of the span rule, and for the exact leave-one-out procedure on the postal database, as a function of \(\log \sigma\).
The results show that the time required to compute the span is not prohibitive and compares very favorably with that of the exact leave-one-out procedure.

7 Conclusion
We have shown that the generalization ability of support vector machines depends on a more complicated geometrical concept than the margin alone. A direct application of the concept of the span is the selection of the optimal parameters of the SVM, since the span enables an accurate prediction of the test error. Similarly to Weston (1999), the concept of the span can also lead to new learning algorithms involving the minimization of the number of errors made by the leave-one-out procedure.

Appendix: Proofs

A.1 Proof of Lemma 2. We prove this result for the nonseparable case. The result is also valid for the separable case, since it can be seen as a particular case of the nonseparable one with \(C\) large enough. Let us define \(\Lambda_p^+\) as the subset of \(\Lambda_p\) with the additional constraints \(\lambda_i \ge 0\):

\[
\Lambda_p^+ = \left\{ \sum_{i=1,\, i \ne p}^{n} \lambda_i x_i \in \Lambda_p : \ \lambda_i \ge 0, \ i \ne p \right\}.
\tag{A.1}
\]
We shall prove that \(\Lambda_p^+ \ne \emptyset\) by proving that a vector \(\lambda\) of the following form exists:

\[
\lambda_j = 0, \qquad j = n^* + 1, \dots, n,
\tag{A.2}
\]
\[
\lambda_i = \mu\, \frac{C - \alpha_i^0}{\alpha_p^0}, \qquad y_i = y_p, \ i \ne p, \ i = 1, \dots, n^*,
\tag{A.3}
\]
\[
\lambda_i = \mu\, \frac{\alpha_i^0}{\alpha_p^0}, \qquad y_i \ne y_p, \ i = 1, \dots, n^*,
\tag{A.4}
\]
\[
0 \le \mu \le 1,
\tag{A.5}
\]
\[
\sum_{i=1}^{n} \lambda_i = 1.
\tag{A.6}
\]

It is straightforward to check that if such a vector \(\lambda\) exists, then \(\sum_i \lambda_i x_i \in \Lambda_p^+\) and therefore \(\Lambda_p^+ \ne \emptyset\). Since \(\Lambda_p^+ \subset \Lambda_p\), we will then have \(\Lambda_p \ne \emptyset\). Taking into account equations A.3 and A.4, we can rewrite constraint A.6 as follows:

\[
1 = \frac{\mu}{\alpha_p^0} \left( \sum_{\substack{i=1,\, i \ne p \\ y_i = y_p}}^{n^*} (C - \alpha_i^0) \;+\; \sum_{\substack{i=1 \\ y_i \ne y_p}}^{n^*} \alpha_i^0 \right).
\tag{A.7}
\]
We need to show that the value of \(\mu\) given by equation A.7 satisfies constraint A.5. For this purpose, let us define \(\Delta\) as

\[
\Delta = \sum_{\substack{i \le n^* \\ y_i = y_p}} (C - \alpha_i^0) + \sum_{\substack{i \le n^* \\ y_i \ne y_p}} \alpha_i^0
\tag{A.8}
\]
\[
\phantom{\Delta} = -y_p \sum_{i=1}^{n^*} y_i \alpha_i^0 + \sum_{\substack{i \le n^* \\ y_i = y_p}} C.
\tag{A.9}
\]

Now, note that

\[
\sum_{i=1}^{n} y_i \alpha_i^0 = \sum_{i=1}^{n^*} y_i \alpha_i^0 + C \sum_{i=n^*+1}^{n} y_i = 0.
\tag{A.10}
\]

Combining equations A.9 and A.10, we get

\[
\Delta = C y_p \sum_{i=n^*+1}^{n} y_i + \sum_{\substack{i \le n^* \\ y_i = y_p}} C = C k,
\]

where \(k\) is an integer.
Since equation A.8 gives \(\Delta > 0\) and \(\Delta\) is an integer multiple of \(C\), we finally have

\[
\Delta \ge C.
\tag{A.11}
\]

Let us rewrite equation A.7 as

\[
1 = \frac{\mu}{\alpha_p^0} \bigl( \Delta - (C - \alpha_p^0) \bigr).
\]

We obtain

\[
\mu = \frac{\alpha_p^0}{\Delta - (C - \alpha_p^0)}.
\tag{A.12}
\]

Taking into account inequality A.11, we finally get \(0 \le \mu \le 1\). Thus, constraint A.5 is fulfilled and \(\Lambda_p^+\) is not empty. Now note that the set \(\Lambda_p^+\) is included in the convex hull of \(\{x_i\}_{i \ne p}\), and since \(\Lambda_p^+ \ne \emptyset\), we obtain

\[
d(x_p, \Lambda_p^+) \le D_{SV},
\]

where \(D_{SV}\) is the diameter of the smallest ball containing the support vectors of the first category. Since \(\Lambda_p^+ \subset \Lambda_p\), we finally get

\[
S_p = d(x_p, \Lambda_p) \le d(x_p, \Lambda_p^+) \le D_{SV}.
\]

A.2 Proof of Lemma 3. Let us first consider the separable case. Suppose that our training set \(\{x_1, \dots, x_\ell\}\) is ordered such that the support vectors are the first \(n\) training points. The nonzero Lagrange multipliers associated with these support vectors are

\[
\alpha_1^0, \dots, \alpha_n^0.
\tag{A.13}
\]
In other words, the vector \(\alpha^0 = (\alpha_1^0, \dots, \alpha_n^0, 0, \dots, 0)\) maximizes the functional

\[
W(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j,
\tag{A.14}
\]

subject to the constraints

\[
\alpha \ge 0,
\tag{A.15}
\]
\[
\sum_{i=1}^{\ell} \alpha_i y_i = 0.
\tag{A.16}
\]
Let us consider the result of the leave-one-out procedure on the support vector \(x_p\). This means that we maximized functional A.14 subject to constraints A.15 and A.16 and the additional constraint

\[
\alpha_p = 0,
\tag{A.17}
\]

and obtained the solution

\[
\alpha^p = (\alpha_1^p, \dots, \alpha_\ell^p).
\]

Using this solution, we construct the separating hyperplane

\[
w_p \cdot x + b_p = 0, \qquad \text{where} \quad w_p = \sum_{i=1}^{\ell} \alpha_i^p y_i x_i.
\]

We would like to prove that if this hyperplane classifies the vector \(x_p\) incorrectly,

\[
y_p (w_p \cdot x_p + b_p) < 0,
\tag{A.18}
\]

then

\[
\alpha_p^0 \ge \frac{1}{S_p D}.
\]

Since \(\alpha^p\) maximizes equation A.14 under constraints A.15 to A.17, the following inequality holds true:

\[
W(\alpha^p) \ge W(\alpha^0 - \delta),
\tag{A.19}
\]

where the vector \(\delta = (\delta_1, \dots, \delta_n)\) satisfies the following conditions:

\[
\delta_p = \alpha_p^0, \qquad \delta_i = 0 \ \text{for} \ i > n, \qquad \alpha^0 - \delta \ge 0, \qquad \sum_{i=1}^{n} \delta_i y_i = 0.
\tag{A.20}
\]

From inequality A.19 we obtain

\[
W(\alpha^0) - W(\alpha^p) \le W(\alpha^0) - W(\alpha^0 - \delta).
\tag{A.21}
\]
Since \(\alpha^0\) maximizes A.14 under the constraints A.15 and A.16, the following inequality holds true:

\[
W(\alpha^0) \ge W(\alpha^p + \gamma),
\tag{A.22}
\]

where \(\gamma = (\gamma_1, \dots, \gamma_\ell)\) is a vector satisfying the constraints

\[
\alpha^p + \gamma \ge 0, \qquad \sum_{i=1}^{\ell} \gamma_i y_i = 0, \qquad \alpha_i^p = 0 \implies \gamma_i = 0 \ \text{for} \ i \ne p.
\tag{A.23}
\]

From equations A.21 and A.22, we have

\[
W(\alpha^p + \gamma) - W(\alpha^p) \le W(\alpha^0) - W(\alpha^0 - \delta).
\tag{A.24}
\]
Let us calculate both the left-hand side, \(I_1\), and the right-hand side, \(I_2\), of inequality A.24:

\[
\begin{aligned}
I_1 &= W(\alpha^p + \gamma) - W(\alpha^p) \\
&= \sum_{i=1}^{\ell} (\alpha_i^p + \gamma_i) - \frac{1}{2} \sum_{i,j=1}^{\ell} (\alpha_i^p + \gamma_i)(\alpha_j^p + \gamma_j) y_i y_j \, x_i \cdot x_j
- \sum_{i=1}^{\ell} \alpha_i^p + \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i^p \alpha_j^p y_i y_j \, x_i \cdot x_j \\
&= \sum_{i=1}^{\ell} \gamma_i - \sum_{i,j} \gamma_i \alpha_j^p y_i y_j \, x_i \cdot x_j - \frac{1}{2} \sum_{i,j} y_i y_j \gamma_i \gamma_j \, x_i \cdot x_j \\
&= \sum_{i=1}^{\ell} \gamma_i (1 - y_i\, w_p \cdot x_i) - \frac{1}{2} \sum_{i,j} y_i y_j \gamma_i \gamma_j \, x_i \cdot x_j.
\end{aligned}
\]

Taking into account that

\[
\sum_{i=1}^{\ell} \gamma_i y_i = 0,
\]

we can rewrite this expression as

\[
I_1 = \sum_{i \ne p} \gamma_i \bigl[ 1 - y_i (w_p \cdot x_i + b_p) \bigr] + \gamma_p \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr] - \frac{1}{2} \sum_{i,j} y_i y_j \gamma_i \gamma_j \, x_i \cdot x_j.
\]
Since for \(i \ne p\) condition A.23 implies that either \(\gamma_i = 0\) or \(x_i\) is a support vector of the hyperplane \(w_p\), the following equality holds:

\[
\gamma_i \bigl[ y_i (w_p \cdot x_i + b_p) - 1 \bigr] = 0.
\]

We obtain

\[
I_1 = \gamma_p \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr] - \frac{1}{2} \sum_{i,j} y_i y_j \gamma_i \gamma_j \, x_i \cdot x_j.
\]

Now let us define the vector \(\gamma\) as follows:

\[
\gamma_p = \gamma_k = a, \qquad \gamma_i = 0 \ \text{for} \ i \notin \{k, p\},
\]

where \(a\) is some constant and \(k\) is such that \(y_p \ne y_k\) and \(\alpha_k^p > 0\). For this vector we obtain

\[
I_1 = a \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr] - \frac{a^2}{2} \| x_p - x_k \|^2
\ge a \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr] - \frac{a^2}{2} D^2.
\tag{A.25}
\]

Let us choose the value of \(a\) that maximizes this expression:

\[
a = \frac{1 - y_p (w_p \cdot x_p + b_p)}{D^2}.
\]

Putting this expression back into equation A.25, we obtain

\[
I_1 \ge \frac{1}{2 D^2} \bigl( 1 - y_p (w_p \cdot x_p + b_p) \bigr)^2.
\]

Since, according to our assumption, the leave-one-out procedure commits an error at the point \(x_p\) (that is, inequality A.18 is valid), we obtain

\[
I_1 \ge \frac{1}{2 D^2}.
\tag{A.26}
\]

Now we estimate the right-hand side of inequality A.24:

\[
I_2 = W(\alpha^0) - W(\alpha^0 - \delta).
\]

We choose \(\delta_i = -y_i y_p \alpha_p^0 \lambda_i\), where \(\lambda\) is the vector that defines the value of \(d(x_p, \Lambda_p)\) in equation 4.2.
We have

\[
\begin{aligned}
I_2 &= W(\alpha^0) - W(\alpha^0 - \delta) \\
&= \sum_{i=1}^{n} \alpha_i^0 - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i^0 \alpha_j^0 y_i y_j \, x_i \cdot x_j
- \sum_{i=1}^{n} (\alpha_i^0 + y_i y_p \alpha_p^0 \lambda_i) \\
&\quad + \frac{1}{2} \sum_{i,j=1}^{n} (\alpha_i^0 + y_i y_p \alpha_p^0 \lambda_i)(\alpha_j^0 + y_j y_p \alpha_p^0 \lambda_j) y_i y_j \, x_i \cdot x_j \\
&= -y_p \alpha_p^0 \sum_{i=1}^{n} y_i \lambda_i + y_p \alpha_p^0 \sum_{i,j=1}^{n} \alpha_i^0 \lambda_j y_i \, x_i \cdot x_j
+ \frac{1}{2} (\alpha_p^0)^2 \left( \sum_{i=1}^{n} \lambda_i x_i \right)^2.
\end{aligned}
\tag{A.27}
\]

Since

\[
\sum_{i=1}^{n} \lambda_i = 0
\]

and each \(x_i\) is a support vector, we have

\[
I_2 = y_p \alpha_p^0 \sum_{i=1}^{n} \lambda_i y_i \bigl[ y_i (w_0 \cdot x_i + b_0) - 1 \bigr]
+ \frac{(\alpha_p^0)^2}{2} \left( \sum_{i=1}^{n} \lambda_i x_i \right)^2
= \frac{(\alpha_p^0)^2}{2}\, S_p^2.
\tag{A.28}
\]
Combining equations A.24, A.26, and A.28, we obtain \(\alpha_p^0 S_p D \ge 1\).

Consider now the nonseparable case. The sketch of the proof is the same. There are only two differences. First, the vector \(\gamma\) needs to satisfy

\[
\alpha^p + \gamma \le C.
\]

A proof very similar to that of lemma 2 gives us the existence of \(\gamma\). The other difference lies in the choice of \(a\) in equation A.25. The value of \(a\) that maximizes equation A.25 is

\[
a^* = \frac{1 - y_p (w_p \cdot x_p + b_p)}{D^2}.
\]

But we need to fulfill the condition \(a \le C\). Thus, if \(a^* > C\), we replace \(a\) by \(C\) in equation A.25, and we get

\[
I_1 \ge C \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr] - \frac{C^2}{2} D^2
= C D^2 \left( a^* - \frac{C}{2} \right)
\ge C D^2\, \frac{a^*}{2}
= \frac{C}{2} \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr]
\ge \frac{C}{2},
\]

where the middle inequality uses \(a^* > C\). The last inequality comes from equation A.18. Finally, we have

\[
I_1 \ge \frac{1}{2} \min \left( C, \frac{1}{D^2} \right).
\]
By combining this last inequality with equations A.24 and A.28, we prove the lemma.

A.3 Proof of Theorem 3. The proof follows the proof of lemma 3. Under the assumption that the set of support vectors remains the same during the leave-one-out procedure, we can take

\[
\delta = \gamma = \alpha^0 - \alpha^p,
\]

since \(\alpha^0 - \alpha^p\) is a vector satisfying simultaneously the sets of constraints A.20 and A.23. Then inequality A.24 becomes an equality:

\[
I_1 = W(\alpha^0) - W(\alpha^p) = I_2.
\tag{A.29}
\]

From inequality A.21, it follows that

\[
I_2 \le I_2^* = W(\alpha^0) - W(\alpha^0 - \delta^*),
\tag{A.30}
\]

where \(\delta_i^* = -y_i y_p \alpha_p^0 \lambda_i\) and \(\lambda\) is given by the definition of the span \(S_p\) (cf. equation 4.2). The computation of \(I_2\) and \(I_2^*\) is similar to the one involved in the proof of lemma 3 (cf. equation A.28):

\[
I_2^* = \frac{(\alpha_p^0)^2}{2}\, S_p^2 - \alpha_p^0 \bigl[ y_p (w_0 \cdot x_p + b_0) - 1 \bigr],
\]
\[
I_2 = \frac{(\alpha_p^0)^2}{2} \left( \sum_i \lambda_i^* x_i \right)^2 - \alpha_p^0 \bigl[ y_p (w_0 \cdot x_p + b_0) - 1 \bigr],
\]

where

\[
\lambda_i^* = y_i y_p\, \frac{\alpha_i^p - \alpha_i^0}{\alpha_p^0}.
\]

From equation A.30 we get

\[
\left( \sum_i \lambda_i^* x_i \right)^2 \le S_p^2.
\]
Now note that \(\sum_{i \ne p} \lambda_i^* x_i \in \Lambda_p\), and by the definition of \(S_p\), \(\left( \sum_i \lambda_i^* x_i \right)^2 \ge S_p^2\). Finally, we have

\[
\left( \sum_i \lambda_i^* x_i \right)^2 = S_p^2.
\tag{A.31}
\]
The computation of \(I_1\) gives (cf. equation A.25)

\[
I_1 = \alpha_p^0 \bigl[ 1 - y_p (w_p \cdot x_p + b_p) \bigr] - \frac{(\alpha_p^0)^2}{2} \left( \sum_i \lambda_i^* x_i \right)^2.
\]

Putting the values of \(I_1\) and \(I_2\) back into equation A.29, we get

\[
(\alpha_p^0)^2 \left( \sum_i \lambda_i^* x_i \right)^2 = \alpha_p^0\, y_p \bigl[ f^0(x_p) - f^p(x_p) \bigr],
\]

and the theorem is proved by dividing by \(\alpha_p^0\) and taking into account A.31:

\[
\alpha_p^0 S_p^2 = y_p \bigl[ f^0(x_p) - f^p(x_p) \bigr].
\]

Acknowledgments
We thank Léon Bottou, Patrick Haffner, Yann LeCun, and Sayan Mukherjee for very helpful discussions.

References

Bartlett, P., & Shawe-Taylor, J. (1999). Generalization performance of support vector machines and other pattern classifiers. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods—Support vector learning (pp. 43–54). Cambridge, MA: MIT Press.

Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh, PA: ACM Press.

Luntz, A., & Brailovsky, V. (1969). On estimation of characters obtained in statistical procedure of recognition. Technicheskaya Kibernetica, 3 (in Russian).

Schölkopf, B., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (1999). Generalization bounds via eigenvalues of the Gram matrix (Tech. Rep. No. 99-035). NeuroCOLT.

Shawe-Taylor, J., Bartlett, P. L., Williamson, R. C., & Anthony, M. (1998). Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory.

Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.

Weston, J. (1999). Leave-one-out machines. In Advances in large margin classifiers. Cambridge, MA: MIT Press.

Received March 24, 1999; accepted September 24, 1999.
LETTER
Communicated by Geoffrey Goodhill
A Computational Model of Lateralization and Asymmetries in Cortical Maps

Svetlana Levitan
James A. Reggia
Departments of Computer Science and Neurology, Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742, U.S.A.

While recent experimental work has defined asymmetries and lateralization in left and right cortical maps, the mechanisms underlying these phenomena are currently not established. In order to explore some possible mechanisms in theory, we studied a neural model consisting of paired cerebral hemispheric regions interacting via a simulated corpus callosum. Starting with random synaptic strengths, unsupervised (Hebbian) synaptic modifications led to the emergence of a topographic map in one or both hemispheric regions. Because of uncertainties concerning the nature of hemispheric interactions, both excitatory and inhibitory callosal influences were examined independently. A sharp transition in model behavior was observed depending on callosal strength. For excitatory or weakly inhibitory callosal interactions, complete and symmetric mirror-image maps generally appeared in both hemispheric regions. In contrast, with stronger inhibitory callosal interactions, partial to complete map lateralization tended to occur, and the maps in each hemispheric region often became complementary. Lateralization occurred readily toward the side having a larger cortical region or higher excitability. Asymmetric synaptic plasticity, however, had only a transitory effect on lateralization. These results support the hypotheses that interhemispheric competition occurs, that multiple underlying asymmetries may lead to function lateralization, and that the effects of asymmetric synaptic plasticity may vary depending on whether supervised or unsupervised learning is involved. To our knowledge, this is the first computational model to demonstrate the emergence of topographic map lateralization and asymmetries.

1 Introduction
Topographic cortical maps are regions of the cerebral cortex that represent some aspect of the environment (visual space, body surface, etc.) in a topology-preserving fashion (Kaas, 1991; Udin & Fawcett, 1988). It has recently been demonstrated electrophysiologically that corresponding regions of topographic maps in primary sensory and motor cortex exhibit a rich range of patterns of individual lateralization and asymmetry, even

Neural Computation 12, 2037–2062 (2000) © 2000 Massachusetts Institute of Technology
under synchronous bilateral stimulation. For example, in one recent study with more than 100 animals, it was found that somatosensory, auditory, and visual map asymmetries are almost always present in an individual animal's primary sensory cortex in the context of coarse bilateral stimuli (Bianki, 1993). These map asymmetries were classified as qualitative (complete lateralization), quantitative (partial lateralization), topographic (no lateralization but asymmetric spatial distributions or mosaic patterns), and quantitative-plus-topographic (lateralization and asymmetric spatial distribution). The dependence of asymmetries on callosal connectivity was shown by the fact that callosal sectioning generally caused a partial or complete loss of map asymmetries (Bianki, 1993). In another series of experiments with six adult squirrel monkeys, detailed primary motor cortex forelimb maps were found to be both larger and more complex in the hemisphere opposite the preferred hand (Nudo, Jenkins, Merzenich, Prejean, & Grenda, 1992). The causes of these map asymmetries are not known. In this article we describe a study with a recurrently connected neural model consisting of corresponding left and right hemispheric regions interacting via a simulated corpus callosum. Although the model hemispheric regions are not intended as realistic representations of any specific neurobiological system, they do have a spatial organization motivated by neocortical architecture, their interconnections are roughly homotopic as occurs biologically, and they self-organize using unsupervised (Hebbian) learning. We used this model to examine in theory how map formation could be affected by underlying asymmetries in cortical region size, excitability, plasticity, and so forth and by variations in the assumed excitatory and inhibitory nature of interhemispheric callosal interactions.
The specific hemispheric region asymmetries we studied were motivated by experimental evidence that corresponding cortical regions can differ in their relative size (Geschwind & Levitsky, 1968), dendritic branching (Scheibel et al., 1985), neurotransmitter levels (Tucker & Williamson, 1984), and excitability to external stimuli (Macdonell et al., 1991). The variations in callosal influences were motivated by disagreements in the literature about whether interhemispheric interactions are predominantly excitatory or inhibitory (Berlucchi, 1983; Cook, 1986; Denenberg, 1983; Ferbert, Priori, & Rothwell, 1992; Kinsbourne, 1978; Meyer, Röricht, von Einsiedel, Kruggel, & Weindl, 1995). Biological callosal connections are presumably excitatory, as they arise primarily from pyramidal cells and synapse on contralateral pyramidal and stellate cells (Innocenti, 1986). However, the resultant transcallosal influences of one hemisphere on the other have often been argued to be inhibitory or competitive (Cook, 1986; Denenberg, 1983; Kinsbourne, 1978), based in part on the strong, prolonged inhibition that follows initial excitatory postsynaptic potentials (Toyama, Tokashiki, & Matsunami, 1969), and supported by recent transcranial magnetic stimulation studies indicating that activation of one cortical region can functionally inhibit the corresponding contralateral region (Ferbert et al., 1992; Meyer et al., 1995).
A Computational Model
Our initial hypotheses were that map lateralization and asymmetries would arise from all of the underlying hemispheric asymmetries that we examined (for example, cortical region size, excitability, and plasticity) and that map lateralization and asymmetries would gradually occur as callosal connections became progressively more inhibitory. These expectations derive from results with an earlier computational study of lateralization involving supervised learning that did not examine map formation (Reggia, Goodall, & Shkuro, 1998). The actual situation proved to be more interesting: only some factors caused persistent lateralization, and there is a sharp transition, demonstrable both analytically and by numerical simulations, between the callosal strengths that lead to map asymmetries and those that do not. These results show that lateralization can occur in the absence of an error signal and that differences in left and right activation levels may suffice to drive lateralization and asymmetries in the case of self-organizing maps.

2 Methods
We adapted the methods used in a previous model of a single-hemisphere self-organizing map (Sutton, Reggia, Armentrout, & D'Autrechy, 1994) to a two-hemisphere model.

2.1 The Model. The model consists of two cortical regions interconnected by a corpus callosum and receiving input connections from two-dimensional sensory surfaces, as shown in Figure 1. Each hemispheric region or cortical layer represents a small patch of cerebral cortex. The model cortices are two-dimensional, with individual elements representing cortical columns. These elements hexagonally tessellate the cortex, with each element having excitatory connections to its six nearest neighbors. Each cortical element connects by the corpus callosum to those elements lying within a certain radius $R_{cc}$ of the element homotopic to it in the opposite hemisphere. As is often done in simulations of this sort to avoid edge effects, the opposite edges of a hemispheric region are connected (forming a torus). Two versions of the model were studied. In one version, two sensory areas connect by divergent connections to opposite cortical regions (see Figure 1a), so independent input stimuli can reach the model cortical regions. This corresponds to the typical afferent connectivity in important biological cortical regions (e.g., most primary somatosensory and visual cortex). In the second version, a single shared sensory surface connects to both hemispheric regions (see Figure 1b). This architecture minimizes the likelihood that asymmetries in the input stimuli themselves (as opposed to intrinsic hemispheric asymmetries or callosal influences) would lead to lateralization and asymmetries in maps. This second version of the model can be related only to cortical regions receiving bilateral inputs, such as visual cortex for midline retinal areas (Stone, Leicester, & Sherman, 1973; Orban, 1984; Fendrich, Wessinger, & Gazzaniga, 1996), or to special experimental
Figure 1: The model of two interacting cortical regions. (a) Two sensory surfaces sending divergent inputs to the contralateral cortices. (b) The same sensory surface for both cortices. [Figure shows left and right cortex connected via the corpus callosum, with the sensory surface(s) below.]
situations where matched bilateral input stimuli are used (Bianki, 1993). In most of our simulations, the sensory surface(s) and both cortical sets have the same number of elements. Each sensory element sends divergent input to the corresponding cortical element in the contralateral cortex and to its neighbors within a certain radius ($R_L$ or $R_R$). We used activation and learning rules from previous studies of single-hemisphere map formation (Armentrout, Reggia, & Weinrich, 1994; Sutton et al., 1994), modifying them for two hemispheres connected by a corpus callosum. A real-valued activation level $a_i^L(t)$ is associated with the $i$th element of the left cortical region, $a_i^R(t)$ with the right homotopic element, and $a_i^S(t)$ with the corresponding element of the sensory surface. We give just the equations for the left hemisphere; those for the right are analogous. Activation $a_i^L(t)$ is governed by

$da_i^L/dt = in_i^{L+}(M - a_i^L) + (c_s + in_i^{L-})\,a_i^L,$  (2.1)
where $c_s < 0$ is a self-inhibition constant, $M > 0$ is the maximal activation level, and initial activation levels are all zero. With an inhibitory corpus callosum, the inputs are

$in_i^{L+} = \sum_j c_{ij}^{LL} a_j^L + \sum_k c_{ik}^{LS} a_k^S,$  (2.2)

$in_i^{L-} = \sum_m c_{im}^{LR} a_m^R.$  (2.3)
The sums in the excitatory input $in_i^{L+}$ range over the six immediate neighbors $j$ in the same cortical hemisphere and over the elements $k$ in the sensory surface that send input to $i$. The index $m$ in the inhibitory input $in_i^{L-}$ ranges over elements in the opposite hemisphere within radius $R_{cc}$ of the element homotopic to $i$. When the corpus callosum is excitatory, the callosal term $\sum_m c_{im}^{LR} a_m^R$ is instead added into $in_i^{L+}$, and self-inhibition is increased to prevent excessive hemispheric activation, using $in_i^{L-} = -2.6K$, where $K$ is the callosal strength. Note that, for the reasons given in section 1, $K$ represents the resultant transcallosal effect and not the synaptic weight on the callosal connections themselves, which are excitatory.

A "Mexican hat" pattern of lateral interactions occurs in biological cortex and is widely used in neural models of self-organizing maps (Kohonen, 1995). As in our earlier models, we achieve this by having each sensory element competitively distribute its output among the receiving cortical elements, and each cortical element among its nearest neighbors, using

$c_{ik}^{LS} = c_p^L \,\frac{w_{ik}^L (a_i^L + q)}{\sum_n w_{nk}^L (a_n^L + q)}, \qquad c_{ij}^{LL} = c_{lf}^L \,\frac{a_i^L + q}{\sum_n (a_n^L + q)}.$  (2.4)
Here $q$ is a small constant (0.00001), $c_p^L$ and $c_{lf}^L$ are, respectively, the input gain and lateral feedback constants for the left hemisphere, and the sums in the denominators are over all elements of one hemisphere connected to the given source of activation. Such competitive distribution of activation has repeatedly been shown in the past to produce Mexican hat activation patterns and topographic map formation similar to those produced by inhibitory connections (Reggia, D'Autrechy, Sutton, & Weinrich, 1992; Armentrout et al., 1994; Sutton et al., 1994).
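As an illustration of the competitive normalization in equation 2.4, here is a minimal Python sketch (function and variable names are ours, not taken from the model's implementation):

```python
def sensory_weights(w_col, a, c_p, q=1e-5):
    """Competitive distribution of one sensory element's output (eq. 2.4).

    w_col: incoming weights w_ik^L from sensory element k to each receiving
    cortical element i; a: current cortical activations a_i^L of those
    elements; c_p: input gain.  Returns the effective strengths c_ik^LS.
    """
    denom = sum(w * (ai + q) for w, ai in zip(w_col, a))
    return [c_p * w * (ai + q) / denom for w, ai in zip(w_col, a)]

def lateral_weights(a, c_lf, q=1e-5):
    """Competitive distribution of lateral feedback (second form of eq. 2.4).

    a: activations of the cortical elements connected to the given source.
    """
    denom = sum(ai + q for ai in a)
    return [c_lf * (ai + q) / denom for ai in a]
```

Because each source distributes a fixed total, the weights leaving any one source sum to $c_p^L$ or $c_{lf}^L$; this is exactly the column-sum property exploited later in equation 3.2.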
Table 1: Baseline Parameter Values.

Symbol              Meaning                                        Baseline Value
size                Number of elements in a cortical set           16 × 16
K^LR, K^RL          Callosal strengths from R to L and back        Varies
K                   Callosal strength when K^LR = K^RL = K         Varies
c_p^L, c_p^R        Input gain in L and R                          1.0
c_lf^L, c_lf^R      Lateral feedback in L and R                    0.6
c_s                 Self-inhibition constant                       −2.0
M                   Maximal activation constant                    3.0
R_cc                Radius of corpus callosum connections          5
R_L, R_R            Radii from S to L and R                        3
ε^L, ε^R            Learning rates for L and R                     0.01

Note: L and R denote left and right cortex; S denotes sensory surface.
Intercortical connections work similarly. Let the callosal strength to the left hemisphere from the right be $K^{LR}$ and to the right from the left be $K^{RL}$, writing just $K$ when these are the same, as is usually the case. For inhibitory callosal influences ($K < 0$), we use

$c_{im}^{LR} = K^{LR} \left( \sum_n (a_n^L + q) \right)^{-1}$

to normalize the total callosal signal. For an excitatory corpus callosum ($K > 0$), the influence of the opposite hemisphere is distributed competitively, with $K^{LR}$ playing the role of $c_p^L$ in equation 2.4:

$c_{im}^{LR} = K^{LR} \,\frac{a_i^L + q}{\sum_n (a_n^L + q)}.$  (2.5)
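A rough Python sketch of the two callosal weighting schemes around equation 2.5 (the names, and the uniform-normalization reading of the inhibitory case, are our own assumptions):

```python
def callosal_weights(a_L, K_LR, q=1e-5):
    """Callosal connection strengths c_im^LR onto left-hemisphere elements.

    a_L: activations a_i^L of the receiving elements (those within R_cc of
    the homotopic element); K_LR: callosal strength.  For K < 0 the signal
    is uniformly normalized by total activation; for K > 0 it is
    distributed competitively, as in equation 2.5.
    """
    total = sum(ai + q for ai in a_L)
    if K_LR < 0:
        return [K_LR / total for _ in a_L]
    return [K_LR * (ai + q) / total for ai in a_L]
```

In the excitatory case the weights sum to $K^{LR}$ regardless of activation levels, mirroring the column-sum property of equation 3.2.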
Self-organization is based on an unsupervised learning rule, $\Delta w_{ik}^L = \epsilon^L (a_k^S - w_{ik}^L)\,a_i^L$, where $\epsilon^L$ is the left hemisphere learning rate. Learning occurs on the connections from the sensory surfaces to the cortical regions. Weights on these connections are initialized with random values between 0.00001 and 1.0. Sensory stimuli in the form of random seven-element hexagonal patches of activation are applied to the two sensory surfaces, either independently or symmetrically, to drive the self-organization of the model.

2.2 Experimental Methods. Using the two-hemispheric-region models described above, we performed a series of simulations, systematically varying parameters and studying the effects on map lateralization and map symmetry as a function of callosal strength K. For a baseline parameter set, we used the values shown in Table 1. We altered one parameter at a time and studied the resulting lateralization for callosal strengths K ranging from −4
(strongly inhibitory) to +3 (strongly excitatory). For maximal comparability, the same seed for random weight generation was used in most runs. This seed was chosen as the one causing the least lateralization in the symmetric case in numerous simulations with different seeds. However, all major findings in this article were verified by additional simulations with several different seeds. We ran simulations using both versions of the model (see Figures 1a and 1b) and found the qualitative results in most cases to be similar, indicating that the independence and symmetry of the input stimuli had only a secondary effect on results. For this reason, and for brevity, we primarily describe the results obtained with the single-input-layer model of Figure 1b, noting a few differences with the dual-input-layer version (see Figure 1a) where appropriate. (For the effects of systematically varying input stimuli symmetry and independence with the dual-input version, see Levitan, Stoica, & Reggia, 1999.) In addition to visual inspection of the resulting maps, we measured the degree of cortical map organization, lateralization, and mirror symmetry using metrics described in Alvarez, Levitan, and Reggia (1998). Although objective in nature, these measures have been shown to correlate fairly well with corresponding subjective estimates. Briefly, the measures are:

Organization measure. Indicates the degree of topographic map formation in a single hemispheric region on a 0 (no map forms) to 1 (nearly perfect map) scale.

Lateralization measure. Ranges from −1, indicating complete map formation on the left and no map on the right, to 1, indicating the opposite; a nonzero value indicates that one map represents a larger total sensory area than the other (i.e., there is "quantitative asymmetry"; Bianki, 1993).

Mirror symmetry measure. Ranges from 1, for a pair of cortical maps covering the same regions in the sensory surface, to −1, for complementary maps covering different regions.
In other words, this measure indicates the extent to which the two maps are mirror images of one another, irrespective of their overall lateralization, with a negative value implying that there is "topographic asymmetry" (Bianki, 1993). These three measures are defined in appendix A and illustrated below.

3 Results
We first describe the general nature of the maps observed during simulations and then the results of simulations in which callosal strengths were systematically varied.

3.1 General Nature of Maps Observed. Figure 2 shows representative examples of the maps observed during simulations. Each part of Figure 2 presents a pair of images indicating the topographic maps for a pair of
Figure 2: Pairs of cortical maps a–i are equal sized and have one sensory region: (a, b, c) Unorganized, pretraining. (d) Organized and symmetric. (e, f) Complementary mosaics. (g, h, i) Asymmetric: left more organized than right. (j) The left hemispheric region has more elements and is better organized. (k, l) Similar maps produced by the model with two sensory regions when the training inputs were independent (k) or symmetric (l).
corresponding cortical regions like those illustrated in Figure 1. A standard, widely used approach to representing map organization is employed in Figure 2 (Kohonen, 1995). Specifically, each of the 24 pictures in Figure 2 plots
the centers of the receptive fields of a cortical region in the space of the sensory surface (i.e., it is not a picture of the cortical regions involved). For example, the vertices (nodes) in the left picture in Figure 2h represent the centers of the receptive fields of the left cortical region, plotted in the space of the sensory surface of Figure 1b. Each pair of receptive field centers (vertices) of nearest-neighbor cortical elements is connected by a line segment, so there are six line segments for each vertex. The entire grid in Figure 2h shows that a fairly organized map is present in the left cortical region (i.e., the sensory surface projects in a smooth fashion onto the two-dimensional cortex surface). In contrast, the right hemispheric region is fairly disorganized in this case (see the right part of Figure 2h). Also, in each picture in Figure 2, each vertex is encircled by a small ellipse, centered on the vertex, whose axes are proportional to the relative x and y diameters of the receptive field. These ellipses do not show the absolute size of the receptive fields (the latter being larger and overlapping); instead they are simplified, uniformly reduced-size representations that indicate only the relative sizes of receptive fields (Sutton et al., 1994). Along with regular receptive field locations, small receptive fields (e.g., Figure 2d) generally indicate highly organized map regions, while larger ones (e.g., Figure 2a) indicate poor map formation. The maps in Figure 2 involve equal-size cortical regions (both 16 × 16) except for Figure 2j, where the right cortical region is smaller (12 × 12). The smaller right region in Figure 2j is evident in that more vertices are plotted in the left picture. Inspection of these pairs of maps shows that in some cases maps are disorganized bilaterally, as in Figures 2a to 2c. In these cases, the mirror symmetry measure is taken to be undefined because there are no organized map regions whose overlap can be measured (Alvarez et al., 1998).
In other cases, the maps are well organized bilaterally (see Figure 2d) or on just one side (see Figures 2g and 2h). Finally, in some cases, maps may form mosaic patterns that divide up the sensory surface in complementary ways, either in roughly equal shares (see Figures 2e and 2l) or in quite unequal shares (see Figures 2f, 2i, and 2j). Table 2 gives the values of the quantitative measures described earlier for the maps shown in Figure 2. For example, maps in Figures 2a through 2d show increasing individual hemispheric region organization, while in each case the lateralization measure does not exceed 0.03, indicating that significant lateralization has not occurred. Significant lateralization is measured with many of the other map pairs shown and is of greatest magnitude for Figure 2g (lateralization = −0.72). The two maps in Figure 2d are mirror images of each other (mirror symmetry = 1.0), and those in Figures 2e, 2f, 2i, 2j, and 2l are complementary, representing largely disjoint portions of the sensory surface (mirror symmetry ≈ −1.0). Figure 2e illustrates that although the lateralization measure is roughly zero (both maps being organized overall to the same degree), there can still be "lateralization of functionality" in the sense that each hemispheric region is organized for a different portion of the sensory surface (mirror symmetry = −1.0).
Table 2: Values of Quantitative Measures for the Maps in Figure 2.

Map   Left Organization   Right Organization   Lateralization   Mirror Symmetry
a     0.01                0.01                 0.0              Undefined
b     0.25                0.28                 0.03             Undefined
c     0.59                0.61                 0.02             Undefined
d     0.98                0.99                 0.01             1.0
e     0.62                0.64                 0.02             −1.0
f     0.47                0.78                 0.31             −1.0
g     0.81                0.09                 −0.72            −0.81
h     0.96                0.32                 −0.64            −1.0
i     0.82                0.29                 −0.53            −1.0
j     0.83                0.37                 −0.46            −0.99
k     0.39                0.57                 0.18             −0.94
l     0.52                0.58                 0.06             −1.0
3.2 Symmetric Hemispheric Regions. We first undertook simulations using a symmetric version of the model, where the two hemispheric regions were identical except for the initial random weights. Complete topographic maps similar to those in Figure 2d form in both hemispheres when the corpus callosum influences are excitatory, absent, or weakly inhibitory. However, as callosal inhibition becomes stronger, there is a sharp transition at roughly K = −1.4 from two complete maps to two complementary mosaic-pattern maps similar to those in Figure 2e. Figure 3 shows how the organization, lateralization, and symmetry measures vary as the callosal strength K is systematically altered. Since the organization values on the left and right are essentially the same, lateralization is very close to zero for all values of K. As seen in Figure 3a, initial organization values are close to 0 for K < −1 but grow noticeably for positive K. This happens because inhibitory callosal connections "push" the receptive fields of corresponding cortical elements away from each other, and hence from their ideal locations in a perfectly organized map, while excitatory callosal connections "pull" the receptive fields of corresponding cortical elements closer to each other and to their ideal locations, thus making the pretraining maps look more organized. Figures 2a through 2c show pretraining maps for K = −2, K = 0, and K = 2 in the symmetric case.

3.3 Asymmetric Cortical Excitability. Asymmetric cortical excitability has been associated experimentally with functional lateralization (Macdonell et al., 1991) and regionally may be implied by asymmetries in various neurotransmitter levels (Tucker & Williamson, 1984). In our model, the learning rule is such that the higher the activation is in one hemisphere, the faster it should self-organize. Conversely, a hemisphere that gets very little
Figure 3: Symmetric hemispheric regions (except for random initial weights). (a) Organization of left and right maps as a function of callosal strength K. Black lines, after training; gray lines, before training; solid lines, right hemisphere; dashed lines, left hemisphere. (b) Activation in the two hemispheres. (c) Lateralization and mirror symmetry after training. Solid line, lateralization; dashed line, mirror symmetry.
or no activation cannot effectively learn a good topographic map. Several parameters of our model can cause different excitability in the two hemispheres. Figures 2h and 2i show sample maps resulting from a small difference in input gain or lateral feedback (in both cases, the left hemisphere is more active). Figure 4 summarizes the results of simulations when the two hemispheres had only a slightly different input gain favoring the left: $c_p^L = 1.05$ and $c_p^R = 1.0$. For approximately K > −1.2, symmetric, highly organized maps always formed without lateralization (similar to Figure 2d). For K < −1.2, however, strong lateralization to the left always occurred, usually with mosaic patterns like those in Figure 2i. Note that activation on the left is much higher than on the right, which is the main reason for the lateralization. Similar results were obtained with mildly asymmetric maximal activation constants $M^L$ and $M^R$ and with small asymmetries in the lateral feedback strengths $c_{lf}^L$ and $c_{lf}^R$. Small differences in the strength of corpus callosum inhibition also caused significant lateralization. With fixed $K^{LR} = -2$, for example, complete lateralization to the left occurred when $K^{RL} < K^{LR}$, and the opposite occurred for $K^{RL} > K^{LR}$. Results for independent, asymmetric training input stimuli with the dual-input version (see Figure 1a) showed somewhat less lateralization than the other cases.

3.4 Explaining the Sharp Transition in Model Behavior. In the simulations, a sharp transition occurs in model behavior at a specific callosal strength (roughly K = −1.4 in the symmetric case, K = −1.2 in the asymmetric excitability case). For K above this strength, symmetric, nonlateralized, and highly organized maps occur, while for K below it, lateralized and asymmetric mosaic-pattern maps occur. What causes this transition, or bifurcation, in model behavior? We show that a change in the dynamics of total hemispheric activations occurs near K = −1.4 and then explain how it leads to asymmetry in map formation.
First, consider the difference in total activation of the left hemispheric region L versus the right R in the model. The activation dynamics described by equations 2.1 through 2.5 are highly nonlinear and difficult to analyze. However, as in Reggia and Edwards (1990), a "linearized" version of these equations gives insight into the model's dynamics. Consider the equations

$da_i^L/dt = c_s a_i^L + in_i^{L+} + in_i^{L-},$  (3.1)

where $a_i^L$ is the activation of element $i$ in L, and $in_i^{L+}$ and $in_i^{L-}$ are the inputs to that element given by equations 2.2 to 2.5. By algebraic manipulation, for $j \in L$, $k \in S$, $m \in R$, we have

$\sum_{i \in L} c_{ij}^{LL} = c_{lf}^L, \qquad \sum_{i \in L} c_{ik}^{LS} = c_p^L, \qquad \sum_{i \in L} c_{im}^{LR} = K^{LR}.$  (3.2)

Analogous equations hold for the right hemispheric region R.
Figure 4: Results with asymmetric input gain ($c_p^L = 1.05$, $c_p^R = 1.0$). Same notation as in Figure 3. The left hemisphere clearly dominates for roughly K < −1.2; its posttraining organization is higher than the organization on the right.
Consider a fixed input pattern. Denote the total activation in the left hemisphere by $A_L$, that in the right by $A_R$, and that in the sensory surface by $A_S$. Adding together equations 3.1 for all elements $i$ in the left hemisphere and using equations 2.2, 2.3, and 3.2, we obtain

$dA_L/dt = c_s A_L + c_{lf}^L A_L + c_p^L A_S + K^{LR} A_R$  (3.3)

and a similar equation for the right hemisphere. In the symmetric case, where $c_{lf}^L = c_{lf}^R = c_{lf}$, $c_p^L = c_p^R = c_p$, and $K^{LR} = K^{RL} = K$, subtracting $dA_R/dt$ from $dA_L/dt$ gives

$d(A_L - A_R)/dt = (c_s + c_{lf} - K)(A_L - A_R).$  (3.4)
This indicates that when $K < c_s + c_{lf}$, an initial difference in the activation levels of the two hemispheres will grow exponentially with time, while for $K > c_s + c_{lf}$ this difference will decay exponentially. With our baseline parameters, this change is expected at $K = c_s + c_{lf} = -1.4$, precisely where the transition occurs in the symmetric-case simulations (see Figure 3). A more involved analysis in appendix B shows that similar changes in the asymptotic behavior of total hemispheric activations occur in a more general case for slightly larger K (e.g., when the two hemispheres have different excitation). How does this change in activation growth affect symmetry and lateralization? With random initial weights, an input pattern quickly causes slightly different initial levels of activation in the two hemispheres. With time, this initial difference will either disappear (for K > −1.4) or grow (for K < −1.4). In the former case, both hemispheres will have nearly equal activation levels by the time a learning step occurs, so they will "learn" the input at about the same speed (assuming equal learning rates), eventually developing complete and symmetric maps. In the latter case, by the time of learning, one hemisphere may be largely inactive while the other is highly active, so the input is learned more effectively by the more active hemisphere, resulting in local lateralization of map formation. Recalling that the hemispheric regions are homotopically connected, this competition occurs locally, between mirror-image sections of the cortex. Thus a small part of the map can become better organized on one side than the other, and this may spread by a "percolation process." After a sufficiently long training period, a mosaic pattern is therefore likely to occur for K < −1.4.
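The prediction of equation 3.4 can be checked numerically. The following sketch (our own illustration; baseline parameter values from Table 1, but the step size, horizon, and initial conditions are our choices) Euler-integrates the linearized totals of equation 3.3 for both hemispheres:

```python
def activation_difference(K, c_s=-2.0, c_lf=0.6, c_p=1.0,
                          A_S=1.0, d0=0.01, dt=0.001, steps=2000):
    """Integrate eq. 3.3 for both hemispheres (symmetric case) by forward
    Euler and return |A_L - A_R| after `steps` steps, starting from a
    small initial difference d0."""
    A_L, A_R = 1.0 + d0, 1.0
    for _ in range(steps):
        dA_L = c_s * A_L + c_lf * A_L + c_p * A_S + K * A_R
        dA_R = c_s * A_R + c_lf * A_R + c_p * A_S + K * A_L
        A_L, A_R = A_L + dt * dA_L, A_R + dt * dA_R
    return abs(A_L - A_R)
```

For K below $c_s + c_{lf} = -1.4$ the difference grows from d0; above it, the difference shrinks, matching the sharp transition seen in the simulations.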
For symmetric hemisphere parameters, each hemisphere would by chance be expected to be dominant on about the same number of input patterns as the other, so complementary maps, but almost no lateralization, are expected. With asymmetric cortical excitability, we would expect larger total activation more often in the more active hemisphere, so that hemisphere will develop a map covering a larger part of the sensory surface, which leads to lateralization.
Figure 5: Results for the two hemispheric regions with different numbers of elements (left larger), when the initial activations are not equal (a–c) and when they are equal due to adjustment of callosal connections (d–f). Same notation as Figure 3.
3.5 Asymmetric Hemispheric Sizes. Experimentally measured differences in hemispheric region sizes have been reported (e.g., Geschwind & Levitsky, 1968), although their nature and significance have been questioned (Loftus et al., 1993). To examine how asymmetric hemispheric region size influences map organization, we ran simulations with the two hemispheric regions having different numbers of elements: the left having 16 × 16 elements and the right 12 × 12, for a total of 256 versus 144 elements. Figures 5a to 5c show the results as callosal strength is varied. For excitatory, absent, or
mildly inhibitory callosal strengths (roughly K > −1.2), both hemispheres formed highly organized and symmetric maps. For strongly inhibitory callosal strengths (K < −1.2), marked map lateralization occurred, with a much better organized map in the larger left region and with complementary maps. Figure 2j shows a typical example of the maps found under these latter conditions. Figure 5b shows quite clearly that when inhibition is strong (roughly K < −1.2), the initial activation on the left was much higher than on the right. To remove this potentially biasing factor, the callosal inhibition strength was adjusted so that the inhibition from left to right was slightly weaker than the inhibition from right to left. A roughly 5% difference produced nearly equal initial activations on both sides for K < −1. The effect of this adjustment is demonstrated in Figures 5d through 5f. Lateralization is smaller but is still present for sufficiently strong inhibition (K < −1.5), and the maps formed under these conditions remain complementary.
3.6 Asymmetric Learning Rates. Another possible cause of lateralization is a difference in synaptic plasticity in the two hemispheres. The biological existence of asymmetric plasticity is suggested by asymmetric hemispheric neurotransmitter levels (Tucker & Williamson, 1984) and directly indicated by asymmetric synaptogenesis during development and early life (Bates, Thal, & Janowsky, 1992). We ran simulations in which all parameters of the two hemispheres were the same except the initial random weights and the learning rates (left 0.01, right 0.001). Figures 6 and 7 display simulation results for brief (16,000 inputs) and long-term (123,000 inputs) training, showing that initially the hemisphere with the higher learning rate organizes faster. However, after sufficiently long training, the difference in organization values becomes much smaller. For this type of simulation, much more transient lateralization occurs with excitatory and weakly inhibitory callosal connections than with strongly inhibitory ones. For example, Figure 8 shows that for K = −3, both hemispheres learn very slowly, while for K = 0, the one with the higher learning rate learns much faster than the other. These results differ from those with the asymmetries examined above in that the lateralization that occurs is more pronounced for K > −1.4. Given the asymmetry in learning rates, transient lateralization to the left would generally be expected for all values of K: the larger left learning rate would cause its map to organize more quickly, but the slower right side would eventually catch up. Lateralization does not occur with strongly inhibitory callosal connections here because the higher learning rate on the left is largely cancelled by simultaneously lower mean activation levels on the left that slow learning (see Figure 7).
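This cancellation follows directly from the unsupervised update rule of section 2.1, $\Delta w_{ik}^L = \epsilon^L (a_k^S - w_{ik}^L) a_i^L$: a tenfold larger learning rate is offset by tenfold lower postsynaptic activation. A minimal sketch (function and variable names are ours):

```python
def weight_change(w_ik, a_S_k, a_L_i, eps):
    """Unsupervised update dw_ik = eps * (a_k^S - w_ik) * a_i^L (sec. 2.1)."""
    return eps * (a_S_k - w_ik) * a_L_i

# A 10x larger learning rate is cancelled by 10x lower activation:
fast_but_quiet = weight_change(0.5, 1.0, a_L_i=0.1, eps=0.01)
slow_but_active = weight_change(0.5, 1.0, a_L_i=1.0, eps=0.001)
```

When $a_i^L = 0$ the weight does not change at all, which is why a hemisphere suppressed by strong callosal inhibition fails to organize regardless of its learning rate.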
Figure 6: Lateralization as a function of K with asymmetric learning rates (x-axis, corpus callosum strength K; curves, before training, after 16,000 inputs, and after 123,000 inputs). Lateralization is significant for K > −1 after brief training, but subsequently essentially vanishes.
4 Discussion
To our knowledge, this study is the first use of a neural model to demonstrate emergent lateralization and asymmetries in topographic maps. Although there have been previous neural models of hemispheric interactions and information transfer (Anninos, Argyrakis, & Skouras, 1984; Anninos & Cook, 1988; Cook & Beech, 1990; Ringo, Doty, Demeter, & Simard, 1994), they did not examine lateralization or map self-organization (summarized in Reggia et al., 1998). We recently studied lateralization in two different neural models consisting of paired cerebral hemispheric regions interacting via the corpus callosum. In one study, a model was trained to generate the correct sequence of phonemes for 50 monosyllabic words (Reggia et al., 1998); in the other, it was trained to classify a small set of letters presented in the left, central, or right visual fields (Shevtsova & Reggia, 1999). Both models used supervised learning. Lateralization occurred readily in these two models toward the larger, more excitable, or more plastic side and appeared most readily with strongly inhibitory callosal influences. A previous mixture-of-experts model using supervised learning has also demonstrated lateralization in processing spatial relations with asymmetric receptive field sizes (Jacobs &
Figure 7: Results for the two hemispheres with different learning rates after about 16,000 (a–c) and 123,000 (d–f) training inputs. (Panels show organization, activation, and mirror symmetry versus K.) Solid lines in (a, b, d, e) correspond to the right hemisphere with learning rate 0.001, the dashed lines to the left hemisphere with learning rate 0.01. After 16,000 inputs the left hemisphere is dominant for K > −1.4. After very long training, the two hemispheres achieve about the same organization levels. Pretraining organization and activation are as in Figure 3.
[Figure 8 panels: Organization plotted against training frame.]
Figure 8: Organization with different learning rates during training (learning curves). One frame corresponds to 4096 inputs. Solid black line, right hemisphere; dashed line, left hemisphere (with higher learning rate). The thick gray line shows the difference between organizations in the two hemispheres. For K = -3 the difference is small. For larger values of K, it first becomes large but vanishes after longer training.
Kosslyn, 1994). The map formation model we describe in this article differs substantially from these earlier models. Our simulations here examine self-organization of maps using solely unsupervised learning rather than supervised learning to acquire a skill (phoneme generation, letter classification). We have thus examined whether lateralization can arise in the absence of error signals. Although our current model is intended only as an abstract representation of interacting cortical regions, and not of any specific biological system, the results provide some insight into how underlying cortical asymmetries and callosal influences can interact to cause hemispheric map asymmetries. Our results can be summarized as follows. First, when callosal connections were absent (K = 0), in every case complete, mirror-symmetric maps ultimately formed in both hemispheric regions without significant lateralization. This finding is consistent with experimental data suggesting that lateralization and complementary mosaic maps arise due to hemispheric interactions (Bianki, 1993). Second, a sharp transition in model behavior was observed depending on callosal strength. For excitatory, absent, or weakly inhibitory callosal strengths, complete and symmetric mirror-image maps typically appeared in both hemispheric regions. In contrast, with stronger inhibitory callosal influences, partial to complete map lateralization tended to occur, and the maps in each hemispheric region often became complementary, resembling mosaic patterns observed experimentally (Bianki, 1993). These results, along with those of the earlier phoneme-sequencing model, provide support for the long-standing hypothesis that the corpus callosum plays a predominantly functionally inhibitory role (Kinsbourne, 1978; Cook, 1986). A third result of this work is the finding that lateralization occurred readily toward the side having a larger cortical region or higher excitability.
These factors produced lateralization more or less independently. This suggests that biological lateralization may be a multifactorial process, consistent with our earlier lateralization models noted above and consistent with arguments that current experimental data do not support the concept of a single underlying factor causing human behavioral lateralization (Hellige, 1993). These results are also consistent with the general concept that initial small quantitative differences in hemispheric regions can ultimately give rise to qualitative differences. They demonstrate that lateralization and map asymmetries can arise due to activation dynamics and unsupervised learning in the absence of error signals. Finally, we observed that asymmetric synaptic plasticity had only a transitory effect on lateralization of map formation. In contrast, asymmetric plasticity in our earlier phoneme-sequencing model led to marked, persistent functional lateralization, while in the letter classification model, mild but persistent lateralization occurred. The difference occurs because map formation here is based solely on unsupervised learning, whereas it was controlled by an error-correction process in the earlier models (e.g., in the phoneme-
sequencing model, once one hemisphere learned to control phoneme sequencing, the error dropped to zero and the other hemisphere stopped learning). The intriguing possibility suggested by these results is that for biological lateralization, the effects of asymmetric synaptic plasticity may vary in part depending on whether supervised or unsupervised learning is involved.

Appendix A: Quantitative Map Measures
The measures for map organization, lateralization, and mirror symmetry used here are from Alvarez et al. (1998) and are computed as follows. For each cortical element n, a map M is described by the offsets of its receptive field center (c_x(n), c_y(n)) from the ideal position in a perfectly organized map and by the receptive field radii (r_x(n), r_y(n)). The "differential square distance" between two immediate neighbors n and n' is computed as

|(n, n')|^2 = (c_x(n') - c_x(n))^2 + (c_y(n') - c_y(n))^2 + (r_x(n') - r_x(n))^2 + (r_y(n') - r_y(n))^2.

An organization measure ||M|| for map M is then computed as

||M|| = (1/N) Σ_{all M-nodes n} σ_{t,s}( (1/2) Σ_{all neighbors n' of n} |(n, n')|^2 ),

where σ_{t,s}(x) = (1 + e^{-2s(x-t)})^{-1}. Parameters t and s were chosen to approximate human estimates of organization values for various maps (Alvarez et al., 1998). A simple difference of the two organization values is a good lateralization measure: Lat(L, R) = ||R|| - ||L||. When ||R|| = 1 and ||L|| = 0, Lat(L, R) = +1; while for ||R|| = 0 and ||L|| = 1, Lat(L, R) = -1; and for ||R|| = ||L||, the lateralization is 0. The degree of mirror symmetry is estimated using a measure of the overlap of organized regions oreg(L) in the left map L and oreg(R') in the mirror image R' of the right map R. Areas of the intersection (oreg(L) ∩ oreg(R')), union, and symmetric difference (oreg(L) Δ oreg(R')) of the above regions are used to define

overlap(L, R) = [area(oreg(L) ∩ oreg(R')) - area(oreg(L) Δ oreg(R'))] / area(oreg(L) ∪ oreg(R')).

The intersection term measures the overlap of the organized regions, while the symmetric difference term measures the discrepancy between these regions. When no organized regions exist, map overlap is undefined.
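As an illustration only (not part of the original study), the organization and lateralization measures can be sketched in Python. The sigmoid parameters t and s below are hypothetical placeholders (the paper fits them to human judgments of map organization), with s taken negative so that small neighbor distances give organization values near 1:

```python
import numpy as np

# Hypothetical sigmoid parameters (assumptions, not from the paper); Alvarez
# et al. (1998) fit t and s to human judgments of organization.
T, S = 1.0, -2.0

def sigma(x, t=T, s=S):
    # sigma_{t,s}(x) = 1 / (1 + exp(-2*s*(x - t))), written via tanh for
    # numerical stability at large squared distances
    return 0.5 * (1.0 + np.tanh(s * (x - t)))

def organization(cx, cy, rx, ry):
    """||M|| = (1/N) * sum_n sigma( 0.5 * sum_{n' neighbor of n} |(n,n')|^2 ).

    cx, cy are receptive-field center offsets and rx, ry the radii, each a
    2D array over the cortical grid.
    """
    H, W = cx.shape
    total = 0.0
    for i in range(H):
        for j in range(W):
            d2 = 0.0
            # 4-connected immediate neighbors
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W:
                    d2 += ((cx[ni, nj] - cx[i, j]) ** 2
                           + (cy[ni, nj] - cy[i, j]) ** 2
                           + (rx[ni, nj] - rx[i, j]) ** 2
                           + (ry[ni, nj] - ry[i, j]) ** 2)
            total += sigma(0.5 * d2)
    return total / (H * W)

def lateralization(org_left, org_right):
    """Lat(L, R) = ||R|| - ||L||, ranging from -1 to +1."""
    return org_right - org_left
```

With these placeholder parameters, a map whose offsets and radii all match their ideal values scores close to 1, while a heavily scrambled map scores close to 0, so the difference of the two hemispheric scores behaves as the lateralization measure described above.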
Appendix B: Model Dynamics
Consider the general linearized case (see equation 3.1) where the hemispheric regions may be asymmetric. Denote c_s + c_Llf = c_L, c_s + c_Rlf = c_R, c_pL A_S = B_L, and c_pR A_S = B_R, giving

dA_L/dt = c_L A_L + K_LR A_R + B_L,
dA_R/dt = K_RL A_L + c_R A_R + B_R.   (B.1)

This is a system of two linear differential equations, with matrix

( c_L  K_LR )
( K_RL  c_R ),

whose trace is c_L + c_R (negative in our model), and whose determinant is c_L c_R - K_RL K_LR; it thus has a unique fixed point unless K_RL K_LR = c_L c_R. The fixed point, which is asymptotically stable when K_RL K_LR < c_L c_R and asymptotically unstable when K_RL K_LR > c_L c_R, is

A*_L = (-c_R B_L + K_LR B_R) / (c_L c_R - K_LR K_RL),
A*_R = (-c_L B_R + K_RL B_L) / (c_L c_R - K_LR K_RL).

In the symmetric case with baseline parameter values c_L = c_R = -1.4, B_L = B_R = 7.0, K_LR = K_RL = K, we have A*_L = A*_R = 7 / (1.4 - K), so for K = 1.4 activation grows without bound in the linearized model. However, the additional self-inhibition that we use for K > 0 in simulations prevents this problem. As stated earlier, for positive K, instead of c_L = c_R = -1.4 we use c_L = c_R = -1.4 - 2.6K, which makes the behavior of the fixed point smooth for positive K (although it is not differentiable at K = 0): A*_L = A*_R = 7 / (1.4 + 1.6K). This fixed point remains asymptotically stable for all K > 0, facilitating symmetric map formation in both hemispheres. Thus, in the symmetric case, the fixed point remains bounded and is asymptotically stable for all K > -1.4 and asymptotically unstable for K < -1.4. In asymmetric cases, the coordinates of the fixed point behave more interestingly. For different input gain constants (c_pL = 1.05 makes B_L = 7.35, but c_L = c_R), the difference A*_L - A*_R = (B_L - B_R) / (K - c_L) remains quite small for K > -1 but gets very large as K approaches c_L = -1.4. Different lateral feedback has a similar effect, except that the bifurcation point changes slightly: when c_L = -1.3, the fixed point becomes unstable for a larger K (K ≈ -1.349). The difference of the total activations at the fixed point is given by A*_L - A*_R = B_L (c_L - c_R) / (c_L c_R - K^2), which is small for K > -1 and also goes up sharply near the bifurcation point. Figure 9 shows the dependence of the fixed point on K for various cases. Here the additional self-inhibition for positive K is taken into account. Finally, asymmetric excitability actually causes two transitions in the model's dynamics. While the coordinates of the asymptotically stable fixed point remain close to each other (for K ≥ -1), we have a situation similar to the symmetric case for K > -1.4, so complete symmetric maps form.
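The fixed point and its stability condition can be checked numerically; the following transcription of the formulas above is an illustrative sketch (the function name is ours):

```python
def fixed_point(cL, cR, K_LR, K_RL, BL, BR):
    """Fixed point of system (B.1):
        dA_L/dt = cL*A_L + K_LR*A_R + BL
        dA_R/dt = K_RL*A_L + cR*A_R + BR
    Returns (A_L*, A_R*, stable). With the negative trace cL + cR of the
    model, asymptotic stability requires K_RL*K_LR < cL*cR.
    """
    det = cL * cR - K_LR * K_RL
    if det == 0:
        raise ValueError("no unique fixed point: K_RL*K_LR == cL*cR")
    AL = (-cR * BL + K_LR * BR) / det
    AR = (-cL * BR + K_RL * BL) / det
    stable = (K_RL * K_LR < cL * cR) and (cL + cR < 0)
    return AL, AR, stable
```

At the baseline symmetric values (c_L = c_R = -1.4, B = 7.0) this reproduces A* = 7/(1.4 - K), e.g. A* = 5.0 at K = 0, and flags the fixed point as unstable once K < -1.4.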
[Figure 9 panels (a, b): fixed-point activations A_L and A_R plotted against K.]
Figure 9: Fixed point of system (B.1) in (a) the symmetric case and (b) with asymmetric input sensitivities or lateral feedback coefficients. Dashed line, total activation for the left hemisphere; solid line for the right. In the symmetric case, the values are equal; in the asymmetric case, they are close except near K = -1.4.
But when the asymptotically stable fixed point coordinates have different orders of magnitude (i.e., K between -1.2 and -1.4), then the activation levels in the two hemispheres will differ significantly (with the left hemisphere always much more active), and most learning will occur in the left hemisphere. Thus, the posttraining organization level on the left stays close to 1, while on the right it drops sharply. This accounts for the "bumps" in the
posttraining activation levels in Figure 4b and the sharp changes in lateralization and mirror symmetry in Figures 4c and 4d. Finally, at K < -1.4, the fixed point becomes asymptotically unstable, and any initial difference in the activations of the two hemispheres will grow exponentially (as in the symmetric case for K < -1.4). However, due to the asymmetry in excitability, the left hemisphere is more likely to have higher initial activation. This facilitates complementary map formation favoring the more active hemisphere.

Acknowledgments
This work was supported by NINDS award NS35460. We thank Ioana Stoica for help with implementation of some simulations.

References

Alvarez, S., Levitan, S., & Reggia, J. (1998). Metrics for cortical map organization and lateralization. Bull. Math. Biol., 60, 27-47.
Anninos, P. A., Argyrakis, P., & Skouras, A. (1984). A computer model for learning processes and the role of the cerebral commissures. Biol. Cybern., 50, 329-336.
Anninos, P., & Cook, N. (1988). Neural net simulation of the corpus callosum. Intl. J. Neurosci., 38, 381-391.
Armentrout, S., Reggia, J., & Weinrich, M. (1994). A neural model of cortical map reorganization following a focal lesion. Artificial Intelligence in Medicine, 6, 383-400.
Bates, E., Thal, D., & Janowsky, J. (1992). Early language development and its neural correlates. In S. Segalowitz & I. Rapin (Eds.), Handbook of neuropsychology, 7 (pp. 69-110). Amsterdam: Elsevier.
Berlucchi, G. (1983). Two hemispheres but one brain. Behav. Brain Sci., 6, 171-173.
Bianki, V. (1993). The mechanism of brain lateralization. Philadelphia: Gordon & Breach.
Cook, N. (1986). The brain code. London: Methuen.
Cook, N., & Beech, A. (1990). The cerebral hemispheres and bilateral neural nets. Int. J. Neurosci., 52, 201-210.
Denenberg, V. (1983). Micro and macro theories of the brain. Behav. Brain Sci., 6, 174-178.
Fendrich, R., Wessinger, C. M., & Gazzaniga, M. S. (1996). Nasotemporal overlap at the retinal vertical meridian: Investigations with a callosotomy patient. Neuropsychologia, 34, 637-646.
Ferbert, A., Priori, A., & Rothwell, J. C. (1992). Interhemispheric inhibition of the human motor cortex. J. Physiol., 453, 525-546.
Geschwind, N., & Levitsky, W. (1968). Left-right asymmetries in temporal speech region. Science, 161, 186-187.
Hellige, J. (1993). Hemispheric asymmetry. Cambridge, MA: Harvard University Press.
Innocenti, G. (1986). General organization of callosal connections in the cerebral cortex. In E. Jones & A. Peters (Eds.), Cerebral cortex, 5 (pp. 291-353). New York: Plenum.
Jacobs, R., & Kosslyn, S. (1994). Encoding shape and spatial relations. Cognitive Science, 19, 361-386.
Kaas, J. (1991). Plasticity of sensory and motor maps in adult mammals. Ann. Rev. Neurosci., 14, 137-167.
Kinsbourne, M. (Ed.). (1978). Asymmetrical function of the brain. New York: Cambridge University Press.
Kohonen, T. (1995). Self-organizing maps. Berlin: Springer-Verlag.
Levitan, S., Stoica, I., & Reggia, J. A. (1999). A model of lateralization and asymmetries in cortical maps. In Proceedings of IJCNN99. Washington, DC.
Loftus, W., Tramo, M., Thomas, C., Green, R., Nordgren, R., & Gazzaniga, M. (1993). Three-dimensional quantitative analysis of hemispheric asymmetry in the human superior temporal region. Cerebral Cortex, 3, 348-355.
Macdonell, R., Shapiro, B., Chiappa, H., Helmers, S., Day, B., & Shahani, B. (1991). Hemispheric threshold differences for motor evoked potentials produced by magnetic stimulation. Neurol., 41, 1441-1444.
Meyer, B.-U., Röricht, S., von Einsiedel, H., Kruggel, F., & Weindl, A. (1995). Inhibitory and excitatory interhemispheric transfers between motor cortical areas in normal humans and patients with abnormalities of corpus callosum. Brain, 118, 429-440.
Nudo, R., Jenkins, W., Merzenich, M., Prejean, T., & Grenda, R. (1992). Neurophysiological correlates of hand preference in primary motor cortex of adult monkeys. J. Neurosci., 12, 2918-2947.
Orban, G. (1984). Neuronal operations in the visual cortex. Berlin: Springer-Verlag.
Reggia, J., & Edwards, M. (1990). Phase transitions in connectionist models having rapidly varying connection strengths. Neural Computation, 2, 523-535.
Reggia, J., D'Autrechy, C., Sutton, G., & Weinrich, M. (1992). A competitive distribution theory of neocortical dynamics. Neural Computation, 4, 287-317.
Reggia, J., Goodall, S., & Shkuro, Y. (1998). Computational studies of lateralization of phoneme sequence generation. Neural Computation, 10, 1277-1297.
Ringo, J., Doty, R., Demeter, S., & Simard, P. (1994). Time is of the essence: A conjecture that hemispheric specialization arises from interhemispheric conduction delay. Cerebral Cortex, 4, 331-343.
Scheibel, A., Fried, I., Paul, L., Forsythe, A., Tomiyasu, U., Wechsler, A., Kao, A., & Slotnick, J. (1985). Differentiating characteristics of the human speech cortex. In D. Benson & E. Zaidel (Eds.), The dual brain (pp. 65-74). New York: Guilford.
Shevtsova, N., & Reggia, J. (1999). A neural network model of lateralization during letter identification. Journal of Cognitive Neuroscience, 11, 167-181.
Stone, J., Leicester, J., & Sherman, S. (1973). Naso-temporal division of the monkey's retina. J. Comp. Neur., 150, 333-348.
Sutton, G., Reggia, J., Armentrout, S., & D'Autrechy, C. (1994). Cortical map reorganization as a competitive process. Neural Computation, 6, 1-13.
Toyama, K., Tokashiki, S., & Matsunami, K. (1969). Synaptic action of commissural impulses upon association efferent cells in cat visual cortex. Brain Res., 14, 518-520.
Tucker, D., & Williamson, P. (1984). Asymmetric neural control systems in human self-regulation. Psychol. Review, 91, 185-215.
Udin, S., & Fawcett, J. (1988). Formation of topographic maps. Annual Review of Neuroscience, 11, 289-327.
Van Kleek, M., & Kosslyn, S. (1991). Use of computer models to study cerebral lateralization. In F. Kitterle (Ed.), Cerebral laterality: Theory and research (pp. 155-174). Hillsdale, NJ: Erlbaum.

Received June 18, 1999; accepted September 23, 1999.
LETTER
Communicated by Ad Aertsen
Rate Limitations of Unitary Event Analysis

A. Roy
P. N. Steinmetz
E. Niebur
Zanvyl Krieger Mind/Brain Institute, Johns Hopkins University, Baltimore, MD 21218, U.S.A.

Unitary event analysis is a new method for detecting episodes of synchronized neural activity (Riehle, Grün, Diesmann, & Aertsen, 1997). It detects time intervals that contain coincident firing at higher rates than would be expected if the neurons fired as independent inhomogeneous Poisson processes; all coincidences in such intervals are called unitary events (UEs). Changes in the frequency of UEs that are correlated with behavioral states may indicate synchronization of neural firing that mediates or represents the behavioral state. We show that UE analysis is subject to severe limitations due to the underlying discrete statistics of the number of coincident events. These limitations are particularly stringent for low (0-10 spikes/s) firing rates. Under these conditions, the frequency of UEs is a random variable with a large variation relative to its mean. The relative variation decreases with increasing firing rate, and we compute the lowest firing rate at which the 95% confidence interval around the mean frequency of UEs excludes zero. This random variation in UE frequency makes interpretation of changes in UEs problematic for neurons with low firing rates. As a typical example, when analyzing 150 trials of an experiment using an averaging window 100 ms wide and a 5 ms coincidence window, firing rates should be greater than 7 spikes per second.

1 Introduction
Despite intensive research efforts over half a century, fundamental questions regarding the nature of neuronal representations remain unanswered. One long-standing debate (Hubel, 1959; Werner & Mountcastle, 1963; Smith & Smith, 1965; Griffith & Horn, 1966; Bullock, 1970; Softky & Koch, 1993; Shadlen & Newsome, 1995; Softky, 1995) is whether this representation involves only the mean rate of action potentials, averaged over times on the order of a second, or whether finer temporal structures play a role. Of particular interest are those time structures involving multiple neurons. For instance, synchronous or oscillatory firing of neurons in the cortex has been proposed as a mechanism for representing sensory information (Singer,

Neural Computation 12, 2063-2082 (2000) © 2000 Massachusetts Institute of Technology
2064
A. Roy, P. N. Steinmetz, and E. Niebur
1993), for labeling different attributes of a stimulus (deCharms & Merzenich, 1996), and as a possible representation of internal behavioral states (Gerstein & Clark, 1964; Abeles, 1982b, 1991; Dayhoff & Gerstein, 1983a, 1983b; Niebur, Koch, & Rosin, 1993; Niebur & Koch, 1994; Steinmetz et al., 2000). As techniques for recording from multiple neurons become more commonplace, methods for analyzing data from simultaneous recordings become increasingly important. Methods now considered part of the standard repertoire are the cross-correlogram (see Eggermont, 1990, for a review), the joint peristimulus time histogram (Aertsen, Gerstein, Habib, & Palm, 1989), and the gravity method (Gerstein, Bedenbaugh, & Aertsen, 1989). Any addition to the available methods is welcome. There is a particular need for methods that do not require extensive averaging over trials, since the brain has to solve problems as they appear, without repetitions and multiple samples of a stimulus. If the brain is able to work without averaging, then it should be possible to detect the "code" that the brain uses by applying algorithmic methods that do not require averaging either. It was one of the goals leading to the development of unitary event analysis to provide such a method (Grün, 1996). The basic task of unitary event analysis is the detection of periods of synchronous firing of multiple neurons (Grün, Aertsen, Abeles, & Gerstein, 1993; Grün, 1996; Riehle et al., 1997). The method determines intervals of time during an experiment when the number of coincident firings of two or more simultaneously recorded neurons significantly exceeds the number that would be expected if the neurons fired independently. Action potentials occurring simultaneously during these intervals are labeled unitary events (UEs). Changes in the frequency of UEs correlated with the presumed perceptual state of the animal subject were recently observed in monkey primary motor cortex by Riehle et al. (1997).
In these experiments, UEs occurred just prior to the possible appearance of a visual movement cue. More important, the frequency of UEs was significantly higher when the monkey later performed the instructed movement correctly than when the animal failed. The purpose of this article is to analyze this new method and point out limitations on its use. We do so using theoretical analysis and numerical simulation and by application of UE analysis to data recorded in primate somatosensory cortex. The central result is that any application of UE analysis requires careful examination of the firing rates of the analyzed sequences of action potentials ("spike trains"). While earlier work (Grün, 1996) highlighted limitations of the method for spike trains with high firing rates, we demonstrate here that use of the method for low-rate spike trains (0-10 spikes/s) can also produce artifactual changes in the occurrence of UEs that may appear correlated with behavioral states (as is the case in the data analyzed in this article). This report illustrates this artifact, examines why it occurs, and determines the lower bounds on the firing rates when UE analysis can be used. (Parts of this work have been presented in abstract form in Roy, Steinmetz, & Niebur, 1998.)
2 Unitary Events Method Defined
Unitary events are coincident action potentials that occur at a greater frequency than would be expected if the spike trains were independent (Grün, 1996; Riehle et al., 1997). The basic idea is to count the observed number of coincidences (simultaneous firing of two or more neurons during a small analysis interval of width b, typically 5 ms) in an interval of width Tw (typically 100 ms). This count is compared with the number of coincidences that would occur if each neuron fired independently with the average rate over the interval Tw. To perform this analysis, each spike train is represented as a binary (0, 1) sequence in time. Each of its segments of length Tw is subdivided into N_b = Tw/b exclusive bins of width b. Each bin has either the value 0, representing no spikes, or 1, representing one or more spikes occurring in the bin. This sequence will be denoted s(j), j = 1, ..., N_b. In UE analysis, it is assumed that spike trains obey Poisson statistics of constant rate within any interval Tw (i.e., that a spike train is a homogeneous Poisson process within this interval). Rates are allowed to be different in different segments, but the firing is always assumed to be governed by Poisson statistics. Therefore, neural firing is assumed to be an inhomogeneous Poisson process over intervals longer than Tw.¹ Under these assumptions, the probability of a neuron's firing at least once in a bin is

P(s(j) = 1) ≡ 1 - P(no spikes) = 1 - e^{-n/N_b},   (2.1)

where n is the total number of events (ones) observed in Tw. Multiple repetitions ("trials") of an experiment are treated by assuming stationarity across the N_t trials but not necessarily within a trial beyond a time Tw. This allows the corresponding intervals from each trial to be concatenated into a single process of length N_t × Tw, which is stationary by assumption. In this case, the number of bins per segment is N_b = N_t Tw / b.

¹ Note a conceptual difficulty here that is due to the overlap between neighboring segments. Strictly speaking, the rate in any segment determines the rate in all segments, because of this overlap in conjunction with the requirement of constant rate within any segment. Violation of this assumption may be tolerable as long as the firing rates vary slowly compared to Tw. In practice, observed firing rates do not always change slowly; for an example, see the raster plots in Figure 1 around the times of stimulus onset and offset, or Figures 2 to 4 in Riehle et al. (1997). Grün (1996) developed more sophisticated methods for the choice of Tw, which may alleviate the practical consequences but do not solve the principal problems addressed later in this article. We will ignore this technical difficulty of UE analysis.
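The binning step described in the preceding paragraph can be sketched as follows (the function name and argument layout are ours; times are in seconds):

```python
import numpy as np

def binarize(spike_times, t0, Tw, b):
    """Represent the segment [t0, t0 + Tw) of one spike train as the binary
    sequence s(j), j = 1..N_b, with N_b = Tw/b: bin j is 1 if it contains
    one or more spikes, else 0."""
    n_bins = int(round(Tw / b))
    s = np.zeros(n_bins, dtype=int)
    for t in spike_times:
        if t0 <= t < t0 + Tw:
            s[int((t - t0) / b)] = 1
    return s
```

Note that two spikes falling in the same 5 ms bin contribute a single 1, which is why equation 2.1 uses the probability of at least one spike per bin rather than the raw spike count.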
The spiking activity from N simultaneously recorded neurons is described by a joint process composed of N parallel binary processes. The process is represented by a time sequence of N-tuples, V(j), j = 1, ..., N_b. Each component of the N-tuple is either zero (no action potential) or unity (one or more action potentials). Let V_k be a particular N-tuple and let V_k^i denote its ith component. Assuming that the neurons fire independently, the joint probability of observing the N-tuple V_k is then given by the product of the marginal probabilities for each of the N processes:

P(V_k) = ∏_{i=1}^{N} { e^{-n_i/N_b}  if V_k^i = 0;  1 - e^{-n_i/N_b}  if V_k^i = 1 },   (2.2)

where n_i is the number of bins with unity value in the concatenated process for neuron i. Given the assumed stationarity, the number o_k of occurrences of a particular N-tuple V_k in N_b bins has a binomial distribution. Its probability is given by

P(o_k | P(V_k), N_b) = C(N_b, o_k) P(V_k)^{o_k} [1 - P(V_k)]^{N_b - o_k},   (2.3)

where C(N_b, o_k) is the binomial coefficient.
For each number of occurrences o_k of N-tuple V_k, we can compute the significance level for a critical region that contains just this outcome. The significance level will be the total probability of finding the observed number, o_k, of occurrences of V_k or an even larger and less likely number of occurrences. We define this level as the joint p-value of the outcome:²

joint p-value(o_k | P(V_k), N_b) = Σ_{r=o_k}^{N_b} P(r | P(V_k), N_b)
                                 = Σ_{r=o_k}^{N_b} C(N_b, r) P(V_k)^r [1 - P(V_k)]^{N_b - r},   (2.4)

where P(V_k) is given by equation 2.2. In the cases considered in this article, only two simultaneous spike trains are analyzed. Therefore only four firing patterns of V_k are possible, and only one of them, (1, 1), representing coincident firing, is of primary interest.

² In the original UE analysis of Grün (1996), the joint p-value is based on the Poisson approximation to the binomial distribution (see her chapter 3). Although this approximation is frequently justified, it fails for the precise considerations introduced in section 3.3, and we therefore use the exact form in equation 2.4.
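For the two-neuron case, equations 2.1, 2.2, and 2.4 combine into a short routine. This is an illustrative sketch (the function name is ours); the binomial tail is accumulated through the pmf recurrence rather than through raw binomial coefficients, so that large N_b does not overflow:

```python
from math import exp

def joint_p_value(n1, n2, o_coinc, n_bins):
    """Exact joint p-value (eq. 2.4) of observing at least o_coinc coincident
    (1, 1) bins out of n_bins, for two neurons with total spike counts n1
    and n2 in the concatenated segment."""
    p1 = 1.0 - exp(-n1 / n_bins)          # eq. 2.1 for neuron 1
    p2 = 1.0 - exp(-n2 / n_bins)          # eq. 2.1 for neuron 2
    p_k = p1 * p2                         # eq. 2.2 for V_k = (1, 1)
    # Upper binomial tail of eq. 2.4, using the recurrence
    # pmf(r+1) = pmf(r) * (n_bins - r)/(r + 1) * p_k/(1 - p_k).
    term = (1.0 - p_k) ** n_bins          # pmf(0)
    tail = term if o_coinc <= 0 else 0.0
    for r in range(n_bins):
        term *= (n_bins - r) / (r + 1) * p_k / (1.0 - p_k)
        if r + 1 >= o_coinc:
            tail += term
    return min(tail, 1.0)
```

With a 100 ms window, 5 ms bins, and 150 trials, N_b = 3000; at 5 spikes/s each neuron contributes 75 spikes to the concatenated segment, so only about 1.8 chance coincidences are expected, and a handful of observed coincidences already falls below the 0.05 target.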
In the method of UE analysis, if the joint p-value is less than the target p-value, then all coincidences inside the averaging window are labeled unitary events.³ The target p-value (or α level) is the arbitrary significance level chosen to test the null hypothesis that the two simultaneously recorded spike trains are independent. In this article, we consistently use a target p-value of 0.05.

3 Analysis of Unitary Events Method

3.1 Application to Cortical Spike Trains. Unitary events were detected in simultaneously recorded pairs of spike trains originating from individual neurons in somatosensory cortical area II (SII) of macaque monkey. During each trial, a video screen displayed either a blank screen, indicating that the monkey was required to perform a tactile task, or a rectangular box consisting of a white outline and black center, indicating that the monkey was required to perform a visual task. The tactile task consisted of discriminating the orientation of a 2 mm wide nonresilient bar pushed twice perpendicularly against the distal finger pad of a restrained finger; the orientation of the bar could be the same or different between the two presentations in the same trial. The monkey had to decide whether the orientations were identical or different. The visual task consisted of distinguishing whether the left or right box presented on the video screen was dimmed. In both cases, the monkey indicated its response by pressing a foot switch. Details of the experimental procedure were as described by Hsiao, O'Shaughnessy, and Johnson (1993), except that the tactile and visual stimulation were as described above. Figures 1A and 1B show recorded action potentials of two neurons in SII as well as UEs during 212 trials of this experiment while the monkey

³ A possible alternative, suggested by Grün (1996), is to label as unitary events only some of the coincidences in the interval.
Their number would be given by the number of coincidences exceeding the count expected to occur by chance given the observed firing rates. This is not the case in the standard UE analysis approach as defined by Grün (1996) and Riehle et al. (1997), in which all coincidences are labeled UEs independent of the firing rates. Since we cannot identify which particular coincidences are in excess of the expected number, we can only state that x% of the total number of coincidences are in excess of what is expected. Alternatively, the coincidences that qualify as UEs could be selected randomly at the appropriate frequency. Another approach would be a weighting procedure for the UEs (i.e., the weight of a coincidence increases monotonically, for instance linearly, each time it is detected as significant in one of the multiple overlapping blocks). After completion of the fixed number of overlapping tests for each UE, the ones with the least weight are removed until the theoretical expected number is reached.
[Figure 1 panels A and B: raster plots, trial number (0-200) versus time t (0-4 s).]
Figure 1: (A, B) Raster plots of action potentials (dots) and UEs (diamonds) in two cells as a function of time within each trial when the monkey performed the tactile task, requiring attention to the task. Each row corresponds to one trial. The stimulus bar is in contact with the skin from about t = 1.7 s to 2.2 s and from t = 3.3 s to 3.8 s; the "on" and "off" responses can be clearly seen for the cell shown in A. See the text for details.
worked on the tactile task. Note that the UEs are identical in parts A and B of Figure 1, since they are defined by the coincident firing of the two neurons whose action potentials are shown in these two figures. Note also that the assumption of stationarity across (but not within) trials appears justified. Figures 1A and 2A contrast the data that were obtained when the monkey worked on the tactile and visual tasks, respectively, in the same neuron. The intervals between 1-1.5, 2.5-3.0, and 3.8-4.3 seconds show a change in the frequency of UEs when the intervals are compared between Figures 1A and 2A, thus demonstrating that the frequency of UEs is correlated with the behavioral state of the primate. The stimulus bar contacts the skin during the time intervals between t = 1.7 s and 2.2 s and between t = 3.3 s and 3.8 s. Note that the UE frequency increases before (and not during) the stimulation. The underlying joint p-values are shown in Figure 2B, which plots (for better resolution) log10[(1 - P)/P] for the tactile task. A simple explanation assuming that UEs are suppressed by the application of the stimulus bars does not hold, since UE frequency decreases at the end of the trial, when the stimulus bar is again removed from the skin (after t = 3.8 s). Note, however, that the firing frequency of neuron 2 (raster plots in Figure 1B) is reduced toward the end of the trial (after about 3.2 s). Although UE analysis is designed to compensate for rate effects, we will see that changes in the firing rate may lead to changes in UE frequency. Figure 2A shows that no such increase in UE frequency is observed when the animal is performing the visual task in the same intervals with identical sensory stimulation (remember that the same tactile stimulus is being delivered to the monkey even when he is doing the visual task). Increased UE frequency therefore could be interpreted as signaling a buildup of an anticipatory state in the animal rather than a response to sensory stimulation.
Of the 30 pairs of neurons examined, 9 (30%) showed changes in UE frequency correlated with the behavioral state of the monkey. Recordings from an additional 7 cell pairs contained many UEs, but UE occurrence did not vary with behavior. The remaining 16 pairs contained an insignicant number of UEs. These results appear comparable to those reported by Riehle et al. (1997) in motor cortex, where the frequency of UEs increases sharply in those intervals in which the monkey might anticipate the occurrence of a visual cue for movement. We will see that this interpretation must be made cautiously and may not be warranted, particularly not with low ring rates. 3.2 Discrete Distribution of the Number of Coincidences and Possible Signicance Levels. In order to elucidate the mechanisms underlying the
changes of UE frequency, we reevaluated the statistical test used to identify UEs. As reviewed in section 2, the method of UE analysis detects coincident spikes occurring in time intervals where the null hypothesis of independent neural firing is rejected. This interpretation of changes in UE frequency will
2070
A. Roy, P. N. Steinmetz, and E. Niebur
[Figure 2 appears here: (A) raster plot, trial # vs. t [secs]; (B) log[(1-P)/P] vs. t [secs].]
Figure 2: (A) Similar raster plot as in Figure 1A when the monkey performed the visual task, which does not require tactile attention. (B) Joint p-values (P in the graph) plotted as log10[(1-P)/P] for overlapping 100 ms time segments in the tactile task. The horizontal line at ordinate value log10 19 ≈ 1.3 shows the target significance level (in this case, 0.05). At times (i.e., values on the abscissa) without entries in the figure, bins have no recorded spikes in one or both neurons.
Rate Limitations of Unitary Event Analysis
2071
be valid, however, only if the significance level of the test for independent firing is equal for the time intervals being compared. If the significance level of the test varies, then changes in UE frequency are due not only to changing frequencies of coincidences but also to changing significance levels. It is a statistically ill-defined procedure to compare UE frequencies between intervals with different significance levels. The significance level of the test of independent firing is equal to the probability of a time interval containing a number of coincidences that lies in the critical region of the test. The structure of this critical region is determined by the probability distribution of the number of coincidences for independent processes. As reviewed in section 2, this distribution is binomial, with its shape depending on both the number of time bins examined and the rates of neural firing. Two examples with 2000 bins and two different rates are shown in Figures 3A and 3B. Critical regions for the maximum likelihood test used in UE analysis consist of right-hand tails of this distribution, since the maximum likelihood rule requires that if a specific number of coincidences is included in the critical region, then all less likely coincidences must also be included (Lindgren, 1976). It is therefore important to consider the effective significance level, which we define as the total probability of the largest critical region that does not exceed the target significance level p. It is obtained by summing the right tail of the probability mass function leftward from large numbers until adding another term would exceed the target p value. Examples are shown in Figure 3A for a firing rate of 5 spikes per second and in Figure 3B for 20 spikes per second. The figures show the critical regions closest to, but not exceeding, a significance level of p = 0.05.
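The effective significance level just defined can be illustrated with a short computation. The following Python sketch is not from the article; it assumes, for illustration only, that the per-bin coincidence probability of two independent Poisson neurons firing at rate r with bin width b is approximately (r*b)**2, and it finds the largest right tail of the binomial probability mass function that does not exceed the target level:

```python
def effective_significance(n_bins, p_coinc, target=0.05):
    """Largest right-tail probability P(X >= k) of X ~ Binomial(n_bins, p_coinc)
    that does not exceed the target significance level."""
    # Accumulate the CDF from k = 0 upward using the pmf recurrence
    # pmf(k) = pmf(k-1) * (n-k+1)/k * p/(1-p), which avoids huge binomial coefficients.
    pmf = (1.0 - p_coinc) ** n_bins  # P(X = 0)
    cdf = pmf
    k = 0
    while 1.0 - cdf > target and k < n_bins:
        k += 1
        pmf *= (n_bins - k + 1) / k * p_coinc / (1.0 - p_coinc)
        cdf += pmf
    return 1.0 - cdf  # first right tail that fits under the target

b = 0.005  # 5 ms bins (illustrative parameter choice)
for rate in (5.0, 20.0):  # the two cases shown in Figure 3
    p_c = (rate * b) ** 2  # assumed per-bin coincidence probability
    print(rate, effective_significance(2000, p_c))
```

Because consecutive admissible tail probabilities differ by one pmf term, the gap between them shrinks as the binomial mean grows, which is why a level near the target can be chosen more accurately at higher rates.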
The crucial observation is that the effective significance levels are discrete, since the number of coincidences in a window of given size Tw is an integer and therefore a discrete variable. At low firing rates, such as 5 spikes per second, this discreteness creates effective significance levels that are broadly spaced and usually approximate the target level poorly. The effect is less important for high firing rates, such as 20 spikes per second (for the chosen parameters), where a significance level near the target significance level can be chosen with reasonable accuracy (see below for a more quantitative statement). As the mean of the binomial distribution increases (both with increasing bin size and with increasing average firing rate), the effective significance level approaches p, but for many physiologically realistic spike rates, the difference between the effective significance level and p itself is a substantial fraction of p. The dependence of the effective significance level on neuronal firing rate is shown in Figure 4, which plots the effective significance level for a pair of independent neurons using a target p = 0.05 for 3000 time bins of length 5 ms each (this corresponds to a typical experiment with NT = 150 trials and Tw = 100 ms). The most problematic point when performing UE analysis is the presence of large jumps in significance level as a function of rate. These
[Figure 3 appears here: two panels plotting probability of occurrence vs. no. of coincidences, with the critical region marked in each.]
Figure 3: (A) Probability distribution of the number of coincidences for a firing rate of 5 spikes per second. The effective significance is shown for a significance level of p = 0.05. (B) As in (A) but for 20 spikes per second.
occur when a particular outcome, such as three coincidences per window, is no longer included in the critical region as the rate increases. It can be seen that a small change in firing rate (e.g., 0.1 spike/s) can lead to a very large change in the size of the critical region—the effective significance level. The
[Figure 4 appears here: effective significance level vs. firing rate (spikes/s).]
Figure 4: Effective significance levels for the maximum likelihood test of independent firing. Nb = 3000, b = 5 ms.
largest jumps appear at small firing rates. For instance, in the case illustrated in Figure 4, a small increase in firing rate at about 2 spikes per second (first branch shown in the figure) changes the effective significance level by a whole order of magnitude, from p = 0.05 to about p = 0.005. It should be noted that the firing rate of the model can change by an arbitrarily small amount, although the smallest change in the observed rate in a simulation will correspond to the addition or removal of a single spike over NT trials in the averaging window of width Tw. We have thus determined that the discrete nature of coincident events is the cause of the observed large changes in significance level. Small changes of neural firing rates correlated with behavioral status may therefore lead to large changes in significance level. If this is the case, we cannot conclude that different frequencies of UEs are due to differences in the correlation state of the neurons. In terms of UE analysis, which is based on constant significance levels, the apparent changes in UE frequency might represent artifacts of the method. Although the jumps in effective significance level for a single test are large, we show in the next section that they do not appear clearly in UE analysis because the method includes testing the coincidences multiple times. The effects of this repeated testing will be explored next.
3.3 Rate Dependence of Unitary Events. As described in section 2, the presence of UEs is determined by the number of coincidences within a window of length b in an interval Tw. If this number is in the critical region of a test for independent firing, then all coincidences within the interval are considered UEs. After each test, the interval slides by the width of the coincidence window, b, and the test is repeated. Therefore, each coincidence is subjected to Tw/b dependent tests, which means 20 tests for typical window sizes Tw = 100 ms, b = 5 ms. Furthermore, the average rate estimate used to construct a single test is a random variable. Since the sliding window moves one bin at a time, the tests of the null hypothesis are dependent. The randomness of the rate estimate and the interdependence of these tests make it difficult to determine analytically the overall significance level of the test for coincident firing employed in UE analysis. We therefore compute the significance level by Monte Carlo simulation. The significance level is equal to the probability that coincident spikes are labeled as unitary events even though the null hypothesis of independent firing is true. The average rate estimate for a single test was simulated as (Tw/b)^(-1) * sum_{a=1}^{Tw/b} Y_a, where the Y_a are independent Poisson-distributed random variables drawn from a fixed firing rate within the time window of width Tw. The Monte Carlo computations consisted of 10,000 simulated trials of 3Nb − 1 bins each for two independent Poisson processes at each value of the firing rate. Unitary events were detected using the methods described in section 2. The frequency of UE occurrence was defined as the number of computed UEs in the central block of Nb columns of bins divided by b Nb NT (this avoids edge effects). The mean and variance of the UE frequency were then computed over all iterations of the simulation and are shown in Figure 5.
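The core of this Monte Carlo procedure can be sketched in a few lines. The code below is a simplified, hypothetical reconstruction, not the authors' implementation: it collapses the trial structure into a single pair of Bernoulli (discretized Poisson) spike trains, estimates rates from each sliding window, and uses a plain right-tail binomial test in place of the full joint-surprise machinery. It only illustrates how coincidences falling in any significant window get labeled as UEs:

```python
import random

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p), via a stable pmf recurrence."""
    if k <= 0:
        return 1.0
    pmf = (1.0 - p) ** n  # P(X = 0)
    cdf = pmf
    for j in range(1, k):
        pmf *= (n - j + 1) / j * p / (1.0 - p)
        cdf += pmf
    return 1.0 - cdf  # = 1 - P(X <= k-1)

def ue_frequency(rate, n_bins=3000, b=0.005, win=20, alpha=0.05, seed=0):
    """Frequency (per second) of coincidences labeled as UEs for two
    independent spike trains, under the sliding-window test."""
    rng = random.Random(seed)
    p_spike = rate * b
    s1 = [rng.random() < p_spike for _ in range(n_bins)]
    s2 = [rng.random() < p_spike for _ in range(n_bins)]
    coinc = [a and c for a, c in zip(s1, s2)]
    significant = [False] * n_bins
    for start in range(n_bins - win + 1):  # window slides one bin at a time
        seg = slice(start, start + win)
        p_hat = (sum(s1[seg]) / win) * (sum(s2[seg]) / win)  # estimated coincidence prob
        if p_hat <= 0.0 or p_hat >= 1.0:
            continue
        if binom_tail(win, p_hat, sum(coinc[seg])) <= alpha:
            for j in range(start, start + win):
                significant[j] = True  # all coincidences in this window become UEs
    n_ue = sum(1 for j in range(n_bins) if significant[j] and coinc[j])
    return n_ue / (n_bins * b)

print(ue_frequency(9.0))  # false-positive UE frequency despite independence
```

Even in this stripped-down form, the UE frequency under independence is nonzero and rate dependent, which is the qualitative behavior discussed next.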
This graph shows that the average frequency of UE occurrence is a monotonically increasing function of rate. Intuitively, this can be understood from the fact that the total number of accidental coincidences increases with the product of the firing rates of the two neurons and that therefore the number of accidental coincidences labeled as UEs also increases. The dependence of UE frequency on firing rates has previously been noted by Grün (1996) at higher firing rates, and it can lead to marked increases in UE frequency. For example, in Figure 5, it can be seen that a twofold rate change from 4.5 spikes per second to 9 spikes per second causes more than a fourfold increase in UE frequency. In addition, it is important to notice that 20 dependent tests are performed for every coincidence, since each bin of width b = 5 ms is a member of 20 windows of width Tw = 100 ms, each of which has, in general, a different firing rate. If any of these tests detects a significantly elevated number of coincidences, all coincidences in the interval are labeled UEs. This causes the transformation of the "sawtooth" function in Figure 4 into the smooth increase in average number of UEs shown in Figure 5. The large jumps in
[Figure 5 appears here: UE frequency (Hz) and its standard deviation vs. firing rate (spikes/s).]
Figure 5: UE frequency as a function of firing rate. Nb = 3000, b = 5 ms.
significance levels in Figure 4 are reflected in the large standard deviation of UE frequency relative to its mean shown in Figure 5. Since the size of the jumps decreases with increasing firing rates, the same follows for the standard deviation of UE frequency relative to its mean. Thus, although the target significance level used in performing UE analysis remains constant, the effective significance level of the test for unitary events is dependent on firing rate. In section 3.4 we will develop criteria for deciding when the frequency of UEs is a reliable quantity. Before doing this, however, we applied this analysis to the data recorded in somatosensory cortex. Figure 6 demonstrates the effect of varying firing rates on UE frequency. We computed the time-varying firing rates for each of the two neurons whose activity is plotted in Figures 1A and 1B using a 100 ms averaging window and generated two (nonhomogeneous) Poisson processes using these rate profiles. The time-varying rate profiles were computed consistent with the assumed null hypothesis for UE analysis (which is in this case homogeneous or stationary rates over 100 ms). UE analysis was then applied to the simulated processes. Figure 6A shows the UEs obtained. The pattern of UEs is remarkably similar to that observed in Figure 1A. Similarly, the logarithmic plot of the significance of the events shown in Figure 6B captures the relevant features of the corresponding plot in Figure 2B.
[Figure 6 appears here: (A) raster plot, trial # vs. t [secs]; (B) log[(1-P)/P] vs. t [secs].]
Figure 6: (A) Raster plots of action potentials (dots) and UEs (diamonds) in two cells simulated as realizations of two independent inhomogeneous Poisson processes with a rate profile similar to the cell from Figure 1A. See the text for details. (B) Joint p-values (P in the graph) plotted as log10[(1-P)/P] for overlapping 100 ms time segments.
[Figure 7 appears here: minimum rate (spikes/s) vs. # of trials.]
Figure 7: Minimum rates at which mean UE frequency = 2σ.
A uniformly distributed (or random) pattern of UEs was obtained when we applied UE analysis to similarly generated inhomogeneous spike train simulations of each of the two neurons whose UE analyses are shown in Figure 2A. The random nature of the distribution of UEs matches the obvious lack of pattern in Figure 2A (data not shown). Therefore, based on these simulations, it can be argued that a large portion of the observed pattern of UEs in those figures is due to rate effects and not to rate-independent coincidences. 3.4 Firing Rates with Reliable Significance Levels. Figure 5 shows that the frequency of UE occurrence for low neural firing rates is a random variable with a large standard deviation relative to its mean under the null hypothesis of independent firing. This random variation of the mean UE frequency makes it difficult to interpret changes in UE frequency. If this variation is high relative to the mean significance level, then changes in UE occurrence may be due to changes unrelated to changes in the correlational state of the neurons. For instance, changes in the firing rate that are orders of magnitude smaller than the standard deviation of the average firing rate (fractions of a Hz; see Figure 4) can change the effective significance level by a factor of 10. As a lower bound on rates where the variation is acceptable, we show in Figure 7 the minimum rate where a
95% confidence interval excludes zero or, equivalently, where the mean is equal to twice the standard deviation. Other intervals may be appropriate depending on the degree of confidence desired in a specific investigation. For example, when analyzing simulated data for a typical experiment comprising 3000 time bins (100 ms averaging and 5 ms analysis windows for 150 trials), a 95% confidence interval (2σ) for the significance level includes zero for firing rates below 7.3 spikes per second. As the number of trials increases, lower rates are sufficient to produce a reliable significance level. 4 Discussion
It is well established that the average rate of neuronal firing is used to represent information in parts of the nervous system (Kandel, Schwartz, & Jessel, 1991). It is likely that the nervous system also makes use of some of the temporal variation of the firing rate, for example, of onset and offset responses of some neural populations. It is a matter of active discussion (and has been for several decades; Hubel, 1959; Werner & Mountcastle, 1963; Smith & Smith, 1965; Griffith & Horn, 1966; Bullock, 1970) whether more subtle forms of temporal structure are employed to encode and process neuronal information. One of the attractive features of temporal codes is that they do not rely on averaging over either large populations or long time intervals. Instead, it is conceivable that very specific patterns are generated in one part of the nervous system and instantaneously decoded and processed in other parts. As long as it is highly unlikely that such patterns are generated by chance, little or no averaging is required to recognize the patterns with a high degree of confidence. Codes that operate without extensive temporal averaging have clear advantages: the brain of a behaving animal needs to solve problems as they appear, without waiting for repetitions and multiple samples of a stimulus. If this is the case, it should be possible to detect such temporal structures by algorithmic methods that do not require extensive averaging over either time or population. This is not the case for many of the methods used to study the temporal structure of neuronal responses. For instance, the autocorrelation function and its Fourier transform, the power spectrum, are global (nonlocal) measures in the temporal domain. The same is true for their generalizations to multiple spike trains, that is, the cross-correlation function and its Fourier transform. Therefore, such methods require data on the firing behavior over a long time period and are less suitable for making quick decisions.
A recently proposed local method that does not rely on averaging over long time periods is unitary event analysis (Grün et al., 1993; Grün, 1996; Riehle et al., 1997). The underlying assumption is the existence of UEs in brain activity, represented by "brief unusual constellations of activity"
among neuronal assemblies (Grün, 1996). It is assumed that the degree of unusuality is given by the degree of interneuronal correlation and measured by a quantity closely related to the surprise measure used previously in neurophysiology (Légendy, 1975; Légendy & Salcman, 1985; Palm, Aertsen, & Gerstein, 1988), whose expectation value is, in turn, closely related to Shannon's (1948) information and the entropy of statistical physics (Boltzmann, 1887). Since the frequency of accidental correlations increases with the firing rates of participating neurons, applying appropriate corrections requires knowledge of mean firing rates, which are computed by averaging over trials. Grün (1996) has pointed out that UE analysis is of limited value if the firing rates of the analyzed neurons are high. The reason is, essentially, that the number of accidental coincidences becomes too large, which makes the detection of possible nonaccidental coincidences more difficult. In this article, we report limitations of this method that are particularly stringent in the case of low neural firing rates. This is of particular relevance since the firing rates in the more central cortical areas are usually low and similar to those found to be problematic in our analysis (a few impulses per second or less; see, e.g., Abeles, Vaadia, & Bergman, 1990). Since it is usually assumed that temporal codes such as those assumed in UE analysis are of more relevance in central than in peripheral cortical areas, the limitations reported here are pertinent. It is also important to note that UE analysis does not provide a measure of significance of the absolute number of excess coincidences. It is therefore unlike the significance of spatiotemporal patterns computed by Abeles (1982a) and Abeles, Bergman, Margalit, and Vaadia (1993).
Rather, UE analysis attaches a label to brief periods of increased coincident firing and examines changes in the pattern of the label that are correlated with different behavioral or response states. In fact, the overlapping nature of the test epochs used in UE analysis makes it difficult to determine the significance of the number of excess coincidences as a whole. At the root of the problem are the underlying discrete statistics and the resulting changes in effective significance levels of the test for coincident firing. We show that for low but realistic firing rates (e.g., 1 Hz), the effective significance level for coincidences labeled as UEs varies considerably. For instance, the significance level of coincidences can be as low as 0.005 instead of the target level 0.05 (see Figure 4)—a whole order of magnitude below the target level—and the average significance level of UEs at low rates is about half that of the target level. These changes cause a large error in estimation of the mean number of UEs expected under the null hypothesis. Thus, it is difficult to determine whether changes in observed UE frequency are due to random variations or whether they signify changes in actual neural synchrony. In conclusion, we have pointed out limitations in the applicability of UE analysis in the case of low firing rates. These limitations arise due to
the discrete character of the underlying point process model. If experimental conditions permit the collection of sufficiently large amounts of data, these limitations can be overcome. We provide practical guidelines for the parameter ranges within which the analysis is applicable.
Acknowledgments
We thank Paul Fitzgerald, Steve Hsiao, and Ken Johnson for providing the data presented in Figures 1 and 2. We also thank both anonymous referees for valuable suggestions. This work was supported by the Lucile P. Markey Foundation, the National Science Foundation, and the Alfred P. Sloan Foundation through a Sloan Fellowship to E. N.
References
Abeles, M. (1982a). Local cortical circuits. New York: Springer-Verlag.
Abeles, M. (1982b). Role of the cortical neuron: Integrator or coincidence detector? Israel J. Med. Sci., 18, 83–92.
Abeles, M. (1991). Corticonics—Neural circuits of the cerebral cortex. New York: Cambridge University Press.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. J. Neurophysiology, 70(4), 1629–1638.
Abeles, M., Vaadia, E., & Bergman, H. (1990). Firing patterns of single units in the prefrontal cortex and neural network models. Network: Computation in Neural Systems, 1, 13–25.
Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of effective connectivity. J. Neurophysiol., 61(5), 900–917.
Boltzmann, L. (1887). Über die mechanischen Analogieen des Zweiten Hauptsatzes der Thermodynamik. J. Reine Angew. Mathematik, 100, 201–212.
Bullock, T. H. (1970). The reliability of neurons. J. Gen. Physiol., 55, 565–584.
Dayhoff, J., & Gerstein, G. (1983a). Favored patterns in spike trains. I. Detection. J. Neurophysiol., 49, 1334–1348.
Dayhoff, J., & Gerstein, G. (1983b). Favored patterns in spike trains. II. Application. J. Neurophysiol., 49, 1349–1363.
Decharms, R. C., & Merzenich, M. M. (1996). Primary cortical representation of sounds by the coordination of action potential timing. Nature, 381, 610–613.
Eggermont, J. J. (1990). The correlative brain: Theory and experiment in neural interaction. New York: Springer-Verlag.
Gerstein, G., Bedenbaugh, P., & Aertsen, A. (1989). Neuronal assemblies. IEEE Transactions on Biomedical Engineering, 36(1), 4–14.
Gerstein, G., & Clark, W. (1964). Simultaneous studies of firing patterns in several neurons. Science, 143, 1325–1327.
Griffith, J. S., & Horn, G. (1966). An analysis of spontaneous impulse activity of units in the striate cortex of unrestrained cats. J. Physiology (London), 186, 516–534.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity. Frankfurt am Main: Verlag Harri Deutsch.
Grün, S., Aertsen, A., Abeles, M., & Gerstein, G. (1993). Unitary events in multi-neuron activity: Exploring the language of cortical cell assemblies. In N. Elsner & M. Heisenberg (Eds.), Gene-brain-behavior (p. 470). New York: Georg Thieme Verlag.
Hsiao, S. S., O'Shaughnessy, D. M., & Johnson, K. O. (1993). Effects of selective attention on spatial form processing in monkey primary and secondary somatosensory cortex. J. Neurophysiology, 70(1), 444–447.
Hubel, D. H. (1959). Single unit activity in striate cortex of unrestrained cat. J. Physiology (London), 147, 226–238.
Kandel, E., Schwartz, J., & Jessel, T. M. (1991). Principles of neural science (3rd ed.). New York: Elsevier.
Légendy, C. (1975). Three principles of brain function and structure. International Journal of Neuroscience, 6, 237–254.
Légendy, C., & Salcman, M. (1985). Bursts and recurrences of bursts in the spike trains of spontaneously active striate cortex neurons. J. Neurophysiol., 53, 926–939.
Lindgren, B. W. (1976). Statistical theory (3rd ed.). New York: Macmillan.
Niebur, E., & Koch, C. (1994). A model for the neuronal implementation of selective visual attention based on temporal correlation among neurons. Journal of Computational Neuroscience, 1(1), 141–158.
Niebur, E., Koch, C., & Rosin, C. (1993). An oscillation-based model for the neural basis of attention. Vision Research, 33, 2789–2802.
Palm, G., Aertsen, A., & Gerstein, G. (1988). On the significance of correlations among neuronal spike trains. Biol. Cybern., 59, 1–11.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Roy, A., Steinmetz, P., & Niebur, E. (1998). Rate dependence of unitary event analysis. Soc. Neurosci. Abstr., 28, 1513.
Shadlen, M. N., & Newsome, W. T. (1995). Is there a signal in the noise? Current Opinion in Neurobiology, 5, 248–250.
Shannon, C. E. (1948). A mathematical theory of communication. Bell Syst. Tech. J., 27, 379–423, 623–656.
Singer, W. (1993). Synchronization of cortical activity and its putative role in information processing and learning. Annu. Rev. Physiol., 55, 349–374.
Smith, D. R., & Smith, G. K. (1965). A statistical analysis of the continuous activity of single cortical neurones in the cat anaesthetized isolated forebrain. Biophys. J., 5, 47–74.
Softky, W. (1995). Simple codes versus efficient codes. Current Opinion in Neurobiology, 5, 239–247.
Softky, W., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13(1), 334–350.
Steinmetz, P. N., Roy, A., Fitzgerald, P. J., Hsiao, S. S., Johnson, K. O., & Niebur, E. (2000). Attention modulates synchronized neuronal firing in primate somatosensory cortex. Nature, 404, 187–190.
Werner, G., & Mountcastle, V. B. (1963). The variability of central neural activity in a sensory system and its implications for the central reflection of sensory events. J. Neurophysiol., 26, 958–977.
Received December 4, 1998; accepted September 17, 1999.
LETTER
Communicated by Jean-François Cardoso
Estimating Functions of Independent Component Analysis for Temporally Correlated Signals
Shun-ichi Amari, RIKEN Brain Science Institute, Japan
This article studies a general theory of estimating functions of independent component analysis when the independent source signals are temporally correlated. Estimating functions are used for deriving both batch and on-line learning algorithms, and they are applicable to blind cases where spatial and temporal probability structures of the sources are unknown. Most algorithms proposed so far can be analyzed in the framework of estimating functions. An admissible class of estimating functions is derived, and related efficient on-line learning algorithms are introduced. We analyze dynamical stability and statistical efficiency of these algorithms. Different from the independently and identically distributed case, the algorithms work even when only the second-order moments are used. The method of simultaneous diagonalization of cross-covariance matrices is also studied from the point of view of estimating functions. 1 Introduction
Neural Computation 12, 2083–2107 (2000); © 2000 Massachusetts Institute of Technology
Independent component analysis, or blind source separation, opens a new paradigm of signal processing applicable to many fields of science and engineering. It decomposes observed correlated signals into a sum of independent components. A number of batch estimation and on-line learning algorithms have been proposed for blind source separation (Jutten & Hérault, 1991; Comon, 1994; Bell & Sejnowski, 1995; Cardoso & Laheld, 1996; Amari, Cichocki, & Yang, 1996; Oja, 1997, and many others; see also Amari & Cichocki, 1998, for a review). They work well even when the source signals have temporal correlations, which is the case in many practical applications. However, one can make positive use of these correlations to obtain more efficient and reliable algorithms that work well even when the probability structures of the temporal correlations are unknown. We apply the estimating function theory for semiparametric statistical models (Amari & Kawanabe, 1997; Amari & Cardoso, 1997) to this problem, because it can be used in the blind case where temporal correlations are unknown. We propose a general type of algorithm applicable to such cases. The proposed algorithms work well even when we use only the second moments of signals, provided the temporal structures of the sources are different. This is completely different from the independently and identically distributed (i.i.d.) (that is, white) case, where we need higher-order cumulants. This fact has been known (Tong, Liu, Soon, & Huang, 1991; Molgedey & Schuster, 1994), and batch algorithms of simultaneous diagonalization of cross-covariance matrices have been proposed (Tong et al., 1991; Molgedey & Schuster, 1994; Ikeda & Murata, 1998; Murata, Ikeda, & Ziehe, 1998; Belouchrani, Abed-Meraim, Cardoso, & Moulines, 1997; Ziehe & Müller, 1998). The maximum likelihood method has also been studied (Pham & Garat, 1997; Pearlmutter & Parra, 1997; Attias & Schreiner, 1998; Davies, 1999). All of these can be analyzed from the point of view of estimating functions. A class of estimating functions is said to be admissible when, given an arbitrary estimator, one can find an estimator derived from the class that has smaller or at least equal asymptotic estimation error. Information geometry of estimating functions (Amari & Kawanabe, 1997; Amari & Cardoso, 1997) is applied to derive a class of admissible estimating functions. The class includes the score function of the maximum likelihood estimator (MLE). It will also be demonstrated that the method of simultaneous diagonalization of cross-correlation matrices is derived from an estimating function but does not belong to the admissible class. We finally present the analysis of dynamical stability and statistical efficiency of learning algorithms. Given an estimating function, we show that there exists an infinite number of equivalent estimating functions that give the same estimator. They form an equivalence class. However, when we construct on-line learning algorithms from these estimating functions, their dynamical behaviors are different. They have different dynamical stability and different convergence speed.
We define a standardized estimating function (Amari, 1999a) within an equivalence class of estimating functions, and prove that the learning algorithm derived from the standardized one has the best asymptotic performance and is equivalent to the related batch estimator. We give the standardized estimating function explicitly by generalizing Amari (1999a).
2 Temporally Correlated Sources
2.1 Source Model. Let us consider n statistically independent information sources that generate time series {s_i(1), s_i(2), ..., s_i(t), ...}, i = 1, ..., n, at discrete times t = 1, 2, .... Different sources are stochastically independent, so that any two signals s_i(t) and s_j(t'), i ≠ j, are independent at any times t and t'. However, signals from the same source may have temporal correlations; that is, s_i(t) and s_i(t') are correlated. We further assume that s_i(t) has zero mean, E[s_i(t)] = 0, where E denotes expectation.
Estimating Functions
2085
Let us consider a regular stationary stochastic model described by a linear model,

    s_i(t) = \sum_{k=1}^{p_i} a_{ik} s_i(t-k) + e_i(t),    (2.1)

where p_i may be a finite number or infinite and e_i(t) is a zero-mean i.i.d. (that is, white) time series called innovations. This article treats only such source models. The innovations may be gaussian or nongaussian. We assume that they satisfy

    E[e_i(t)] = 0,    E[e_i(t) e_j(t')] = 0    (i \neq j or t \neq t').    (2.2)
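As a concrete numerical illustration of the source model in equations 2.1 and 2.2 (the coefficients, sample sizes, and seeds below are arbitrary choices for illustration, not values from the article), one can drive a finite-order AR filter with zero-mean white gaussian innovations and check that each source is temporally correlated while two independent sources stay uncorrelated:

```python
import random

def ar_source(coeffs, n, seed=0):
    """Generate s(t) = sum_k a_k s(t-k) + e(t) with white gaussian innovations."""
    rng = random.Random(seed)
    p = len(coeffs)
    s = [0.0] * n
    for t in range(n):
        drive = sum(coeffs[k] * s[t - 1 - k] for k in range(p) if t - 1 - k >= 0)
        s[t] = drive + rng.gauss(0.0, 1.0)  # innovation e(t)
    return s

def correlation(x, y):
    """Sample correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / (vx * vy) ** 0.5

n = 20000
s1 = ar_source([0.8], n, seed=1)        # AR(1): strong lag-1 autocorrelation
s2 = ar_source([0.5, -0.3], n, seed=2)  # AR(2): different temporal structure
print(correlation(s1[1:], s1[:-1]))     # lag-1 autocorrelation, close to 0.8
print(correlation(s1, s2))              # cross-correlation, close to 0
```

The distinct temporal structures of s1 and s2 are exactly what the second-moment methods discussed later exploit.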
When p_i is finite, this is an autoregressive model of degree p_i. By introducing the time shift operator z^{-1} such that z^{-1} s_i(t) = s_i(t-1), we can rewrite equation 2.1 as

    A_i(z^{-1}) s_i(t) = e_i(t),    (2.3)

where

    A_i(z^{-1}) = 1 - \sum_{k=1}^{p_i} a_{ik} z^{-k}.    (2.4)
By using the inverse of the polynomial A_i, the source signal is written as

    s_i(t) = A_i^{-1}(z^{-1}) e_i(t),    (2.5)

where A_i^{-1}(z^{-1}) is a formal infinite power series of z^{-1},

    A_i^{-1}(z^{-1}) = \sum_{i=0}^{\infty} b_i z^{-i},    (2.6)

such that

    A_i^{-1}(z^{-1}) A_i(z^{-1}) = 1.    (2.7)
Function Ai¡1 (z¡1 ) represents the impulse response function of the ith source, by which fsi (t )g is generated from white signals fei (t)g. Let ri (ei ) be
2086
Shun-ichi Amari
the probability density function of e_i(t). Then the conditional probability density function of s_i(t), conditioned on the past signals, can be written as

    p\{s_i(t) | s_i(t-1), s_i(t-2), \ldots\} = r_i\left( s_i(t) - \sum_k a_{ik} s_i(t - k) \right) = r_i\left\{ A_i(z^{-1}) s_i(t) \right\}.    (2.8)

Therefore, for vector source signals s(t) = [s_1(t), \ldots, s_n(t)]^T at time t, where T denotes transposition, the conditional probability density is

    p\{s(t) | s(t-1), s(t-2), \ldots\} = \prod_{i=1}^{n} r_i\left\{ A_i(z^{-1}) s_i(t) \right\}.    (2.9)

We introduce the following notations:

    \varepsilon = [e_1, \ldots, e_n]^T,    (2.10)
    A(z^{-1}) = \mathrm{diag}\left[ A_i(z^{-1}) \right],    (2.11)
    r(\varepsilon) = \prod_{i=1}^{n} r_i(e_i),    (2.12)

and use the abbreviations

    s_t = s(t),  x_t = x(t)    (2.13)

when there is no confusion. We also denote the past signals by

    s_{t,past} = \{s_{t-1}, s_{t-2}, \ldots\}.    (2.14)

Then equation 2.9 is rewritten as

    p\{s_t | s_{t,past}\} = r\left\{ A(z^{-1}) s_t \right\}.    (2.15)

The joint probability density function of \{s_1, s_2, \ldots, s_t\} is

    p(s_1, s_2, \ldots, s_t) = \prod_{i=1}^{t} p\{s_i | s_{i,past}\} = \prod_{i=1}^{t} r\left\{ A(z^{-1}) s_i \right\},    (2.16)
where s_t (t \le 0) are put equal to 0. In practice s_t (t < 0) are not equal to 0, so that equation 2.16 is an approximation that holds asymptotically, that is, for large t. The source models are specified by n functions r_i(e_i) and n inverse impulse response functions A_i(z^{-1}). Blind source separation should extract the independent signals from their (instantaneous linear) mixtures without knowing the exact forms of r_i(e_i) and A_i(z^{-1}). In other words, these are treated as unknown parameters.

2.2 Mixing Model. We consider the case when the observed signals are instantaneous mixtures of the source signals. (See Amari, Douglas, & Cichocki, 1998, and Zhang, Amari, & Cichocki, 1999, for the geometry of convolutive mixtures; and Parra, 1998, for the maximum likelihood approach to the convolutive case.) Let x(t) = [x_1(t), \ldots, x_n(t)]^T be the set of linearly mixed signals,

    x_i(t) = \sum_{j=1}^{n} H_{ij} s_j(t),    (2.17)

or, in matrix-vector notation,

    x(t) = H s(t).    (2.18)

Here, H is an n \times n unknown nonsingular mixing matrix. We assume that the number of observations is exactly the same as the number of source signals. We also assume that H does not include temporal operations, that is, it does not include z^{-1} and does not depend on t. All of these assumptions can be relaxed (see, e.g., Lewicki & Sejnowski, 1998; Cichocki, Thawonmas, & Amari, 1997; Amari, 1999b; and many others), but we stick with them in this article.

2.3 Unmixing Model. Let W be an n \times n nonsingular matrix that transforms the observed x(t) to

    y(t) = W x(t).    (2.19)

This provides an estimate of the original source signals s(t). If we have W = H^{-1}, y(t) is exactly equal to s(t). So our problem is to estimate H^{-1} from observations x(1), \ldots, x(t). However, as is well known, the problem is ill posed, and H or its inverse W is not identifiable. The following facts are known (Comon, 1994):

1. The scales of the source signals are not identifiable.
2. The order in which the n extracted independent signals are arranged is not identifiable, so that there is an indeterminacy of permutations of sources.
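As an illustration of this setup, the AR source model of equation 2.1 and the instantaneous mixing of equation 2.18 can be simulated directly. The following sketch is not from the paper; the AR coefficients, mixing matrix, and lag are arbitrary illustrative choices. It checks that the lagged cross-correlations of the sources are near zero while those of the mixtures are not:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 20000, 2

# AR(1) sources (equation 2.1): s_i(t) = a_i s_i(t-1) + e_i(t),
# with white zero-mean innovations e_i(t) and distinct spectra.
a = np.array([0.9, 0.5])                  # illustrative AR coefficients
s = np.zeros((n, T))
e = rng.standard_normal((n, T))
for t in range(1, T):
    s[:, t] = a * s[:, t - 1] + e[:, t]
s /= s.std(axis=1, keepdims=True)         # fix scales (cf. equation 2.20)

# Instantaneous mixing x(t) = H s(t) (equation 2.18).
H = np.array([[1.0, 0.6], [0.4, 1.0]])    # illustrative nonsingular H
x = H @ s

# Lagged cross-correlations: near zero for sources, not for mixtures.
lag = 1
C_s = s[:, lag:] @ s[:, :-lag].T / (T - lag)
C_x = x[:, lag:] @ x[:, :-lag].T / (T - lag)
print(abs(C_s[0, 1]), abs(C_x[0, 1]))
```

The off-diagonal lagged correlation of the mixtures is what the second-order methods discussed later exploit for separation.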
However, we do not need to assume that no more than one source is gaussian, provided all the A_i(z^{-1}) are different (Tong et al., 1991). We assume, for convenience, that the scale of each s_i satisfies some constraint, for example,

    E[h_i\{s_i(t)\}] = 0,    (2.20)

where h_i is a norming function. A simple choice is h_i(s_i) = s_i^2 - 1. This determines the scales of the recovered signals uniquely. Matrix W is called a separating solution when y(t) = W x(t) becomes a rescaled permutation of the original s(t).

3 Statistical Framework

3.1 Semiparametric Statistical Model. Given t observations \{x_1, \ldots, x_t\}, their joint probability density function is easily derived from equation 2.16 and s_t = W x_t, where W = H^{-1}. It is written as
    p\{x_1, \ldots, x_t; W, A, r\} = |\det W|^t \prod_{i=1}^{t} r\left\{ A(z^{-1}) W x_i \right\},    (3.1)
which is specified by the unmixing parameter W = H^{-1} and the parameters A, r of the source models. The statistical model, equation 3.1, includes two kinds of parameters. One is W, which is to be estimated from the observed data. This is called the (matrix-valued) parameter of interest. The others are A and r, called nuisance parameters, which are unknown but which we have no particular interest in estimating. Here, the nuisance parameters include n functions r_1, \ldots, r_n, which are of infinite dimensions. A statistical model is said to be semiparametric when it includes a finite-dimensional parameter of interest together with infinite-dimensional nuisance parameters.

3.2 Likelihood and Score Functions. For the moment, we assume that r and A are known. We are then able to use the maximum likelihood (ML) method to estimate W. The log-likelihood is derived from equation 3.1 as

    l_t(x_1, \ldots, x_t; W, A, r) = \log p\{x_1, \ldots, x_t; W, A, r\}
        = t \log |\det W| + \sum_{i=1}^{t} \log r\left\{ A(z^{-1}) W x_i \right\}
        = t \log |\det W| + \sum_{i=1}^{t} \log r\left\{ A(z^{-1}) y_i \right\},    (3.2)

where we put

    y_i = W x_i.    (3.3)
Note that y_i depends on W. The MLE is the one that maximizes the above likelihood for the given t observations x_1, \ldots, x_t. We put

    l(y_t; W, r, A) = \log r\left\{ A(z^{-1}) y_t \right\}.    (3.4)

Note that l depends not only on y_t but also on the past y_{t-1}, y_{t-2}, \ldots, because of the operator A(z^{-1}). Note also that l is a function of W only through the y_t's. We then have

    l_t = t \log |\det W| + \sum_{i=1}^{t} l(y_i, W).    (3.5)
The score function is the derivative of the log-likelihood l_t with respect to W. To calculate this, we put

    dl = l(y_t, W + dW) - l(W).    (3.6)

In order to calculate the differential dl, we use the method in Amari, Chen, and Cichocki (see elsewhere in this issue). Let f(w) be any function of w. Then,

    df = \left( \frac{\partial f}{\partial w} \right)^T dw.

In our case, equation 3.4 gives l = \log r(w), where w = A y_t. Hence, by putting

    \varphi_i = -\frac{d \log r_i(y_i)}{dy_i},  \quad \varphi = (\varphi_1, \ldots, \varphi_n)^T,    (3.7)

we have

    dl = -\varphi(A y_t)^T d(A y_t).

Now, we note that d(A y_t) = A\, dy_t and

    dy_t = dW\, x_t = dX\, y_t,    (3.8)

where we put

    dX = dW\, W^{-1}.    (3.9)
Hence, we have

    dl = -\varphi(A y_t)^T A\, dX\, y_t.    (3.10)

It should be remarked that dX is not integrable; that is, there is no function X(W) such that its differential gives dX. However, dX and dW are linearly related, so that dX can be regarded as a nonholonomic basis of the tangent space (see Amari, Chen, and Cichocki elsewhere in this issue). From

    d \log |\det W| = \log |\det(W + dW)| - \log |\det W| = \log |\det(I + dX)| = \mathrm{tr}(dX),    (3.11)

we finally have the score function in terms of dX,

    \frac{\partial l_t}{\partial X} = \frac{\partial l_t}{\partial W} W^T = \sum_{i=1}^{t} \left[ I - \left\{ \varphi(A y_i) \circ A \right\} y_i^T \right],    (3.12)

where \varphi \circ A is a column vector whose components are \varphi_j A_j(z^{-1}). Note that when dl is written in component form as dl = \sum c_{ij}\, dX_{ij}, the derivative \partial l / \partial X is a matrix whose elements are c_{ij}. Hence, \varphi(A y) \circ A\, y^T is represented in component form as \varphi_i\{A_i(z^{-1}) y_i\}\, A_i(z^{-1}) y_j. By putting

    \frac{dl}{dX} = F(y, W; r, A) = I - \left\{ \varphi(A y) \circ A \right\} y^T,    (3.13)

the ML equation is given by

    \sum_{i=1}^{t} F(y_i, W; r, A) = 0.    (3.14)

This is the same as that derived by Attias and Schreiner (1998). The MLE is obtained by solving the above equation, provided we know r and A.

4 Estimating Function and Its Efficiency

4.1 Estimating Function. In blind source separation, we cannot use the ML method because we do not know the exact values of the nuisance parameters r and A. The method of estimating functions (Amari & Kawanabe, 1997) can be applied to such cases. Let us consider a matrix-valued function F(y_t, W), which may include time-shift operators as in equation 3.13,
that is, it may depend on past values of y_t. However, F does not include the parameters r and A. When F satisfies

    E_{W,r,A}\left[ F(y_t, W_0) \right] = 0,    when W = W_0,
    E_{W,r,A}\left[ F(y_t, W_0) \right] \ne 0,    otherwise,    (4.1)

where E_{W,r,A} implies expectation with respect to the source model with distribution 2.16 having any r and A, F is called an estimating function. We may relax the condition slightly such that E_{W,r,A}[F(W_0)] \ne 0 when W_0 is in a neighborhood of W. In this case, F is called a local estimating function. When an estimating function F is given, the solution of

    \sum_{i=1}^{t} F(y_i, W) = 0    (4.2)

gives an estimator \hat{W}, because the left-hand side of equation 4.2, divided by t, converges to the expectation in equation 4.1. Equation 4.2 is called the estimating equation. It is known that this gives a \sqrt{t}-consistent estimator, that is, an estimator such that the expectation of the square of the estimation error decreases in order 1/t for large t, whatever the true r and A are. We can also consider an on-line learning equation for \Delta W_t = W_{t+1} - W_t,

    \Delta W_t = \eta_t F(y_t, W_t)\, W_t,    (4.3)

which does not include the true r and A. Hence, it can be used without knowing r and A. The separating solution W is an equilibrium of the averaged version of the stochastic equation 4.3, although its stability is not guaranteed. Almost all of the proposed independent component analysis algorithms, of batch type and of on-line learning type, can be described in terms of estimating functions, although this fact has not been widely recognized. Therefore, it is important to give a general theory of estimating functions in ICA. One problem is how to find an estimating function, provided it exists. The second question is how to find an estimating function having good estimating performance. The third question is how to use an estimating function to obtain a good learning or adaptive algorithm. We study these problems. The general theory given by Amari and Kawanabe (1997) elucidated the set of all the estimating functions and their statistical efficiency in the framework of information geometry (Amari, 1985; Amari & Nagaoka, 1999). Amari and Cardoso (1997) applied the theory to blind source separation, establishing the concept of a class of efficient or admissible estimating functions. (See also Amari, 1999a.) We apply these results to the problem.
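The on-line rule of equation 4.3 can be sketched for the simplest temporally white case. Here F(y) = I - φ(y)yᵀ with the cubic score φ(y) = y³ is one common choice suited to sub-gaussian sources; the sources, mixing matrix, and learning rate below are illustrative assumptions, not the paper's standardized estimating function:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 50000, 2

# Sub-gaussian i.i.d. sources and an illustrative mixing matrix.
s = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, T))   # unit variance
H = np.array([[1.0, 0.5], [0.3, 1.0]])
x = H @ s

phi = lambda y: y ** 3        # illustrative score function (sub-gaussian case)
W = np.eye(n)
eta = 0.005                   # fixed learning rate eta_t

# Equation 4.3: W_{t+1} = W_t + eta * F(y_t, W_t) W_t,
# with the estimating function F(y) = I - phi(y) y^T.
for t in range(T):
    y = W @ x[:, t]
    F = np.eye(n) - np.outer(phi(y), y)
    W = W + eta * F @ W

# W H should approach a rescaled permutation (cf. section 2.3).
P = np.abs(W @ H)
P /= P.max(axis=1, keepdims=True)
```

Note that the rule never touches the true r or A: only the observed y_t enters, which is exactly the point of using an estimating function.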
4.2 Asymptotic Variance of Estimator. Let \hat{W}_t be a consistent estimator of W based on t observations. Let

    \Delta W_t = \hat{W}_t - W    (4.4)

be the error of estimation, and let

    \Delta X_t = \Delta W_t\, W^{-1}    (4.5)

be the error in terms of the nonholonomic basis dX. One may consider \Delta X_t as the relative error (Cardoso & Laheld, 1996), since the error of a recovered signal is given by

    \Delta y_t = y_t - s_t = (W + \Delta W_t) x_t - s_t = \Delta W_t\, x_t = \Delta X_t\, s_t.    (4.6)

This gives a more direct interpretation of \Delta X_t, connected with the errors in y_t. The asymptotic covariance of the estimator \hat{W}_t is given by

    V = \lim_{t \to \infty} t\, \mathrm{Cov}[\Delta X_t, \Delta X_t],    (4.7)

where \mathrm{Cov}[Y, Z] denotes the covariance of random variables Y and Z. Here, \Delta X is a matrix, so that V is a quantity having four indices. In component form, this is written as

    V_{ij,kl} = \lim_{t \to \infty} t\, \mathrm{Cov}[\Delta X_{ij}, \Delta X_{kl}].    (4.8)

It is convenient to use extended indices (A, B, etc.), which represent pairs of indices such as A = (i, j), B = (k, l). In this notation, we have

    V_{AB} = \lim_{t \to \infty} t\, \mathrm{Cov}[\Delta X_A, \Delta X_B].    (4.9)
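The extended-index convention can be made concrete: flattening the pair A = (i, j) to a single index turns four-index quantities like the covariance in equation 4.8 into ordinary n² × n² matrices, so inverses and products become plain matrix operations. A minimal sketch; the particular flattening A = i·n + j and the placeholder matrix values are arbitrary choices of this illustration:

```python
import numpy as np

n = 3
# A four-index quantity V[i, j, k, l], flattened to an extended matrix
# V[A, B] with A = (i, j) mapped to A = i * n + j (any fixed pairing works).
V4 = np.arange(n ** 4, dtype=float).reshape(n, n, n, n)
V2 = V4.reshape(n * n, n * n)

i, j, k, l = 1, 2, 0, 1
A, B = i * n + j, k * n + l
print(V2[A, B] == V4[i, j, k, l])   # the two views agree

# Extended matrices such as K = (K_AB) then admit plain matrix
# inverses and products, as used in the asymptotic-variance formula.
K = np.eye(n * n) + 0.1 * np.ones((n * n, n * n))   # placeholder values
G = np.eye(n * n)                                    # placeholder values
V = np.linalg.inv(K) @ G @ np.linalg.inv(K)
```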
In order to calculate the asymptotic covariance of the estimator derived from the estimating function F, we define

    K = E\left[ \frac{\partial F(y, W)}{\partial X} \right].    (4.10)

This is again a quantity having four indices. But we can summarize it in an extended matrix by using double indices,

    K_{AB} = E\left[ \frac{\partial F_A}{\partial X_B} \right].    (4.11)
We also define G = (G_{AB}) by

    G_{AB} = E[F_A F_B].    (4.12)

The following theorem is a direct generalization of the result of Pham and Garat (1997) (see also Amari, 1999a). The difference is that F(y_t, W) includes the previous y_{t-1}, \ldots, but the proof is quite similar.

Theorem 1.

    V = K^{-1} G K^{-1},    (4.13)

where K^{-1} is the inverse of the extended matrix K = (K_{AB}).

4.3 Admissible Class of Estimating Functions. A class E of estimating functions is said to be admissible when, for any estimating function F, there exists an estimating function in the class E whose asymptotic variance is equal to or smaller than that of F in the sense of positive definiteness. Amari and Kawanabe (1997) elucidated the structure of the set of all the estimating functions, giving the smallest admissible set of estimating functions. We apply the general theory to the problem. The general theory is stated in terms of fiber bundles in information geometry, but the model has a very simple structure, so that the results can be stated without differential-geometrical concepts. The most important fact is that the semiparametric statistical model is m-information-curvature free (see Amari & Kawanabe, 1997, for details; see also Amari & Cardoso, 1997). Because of this property, the score functions are estimating functions. Moreover, they span the admissible class of estimating functions. This implies that even when we misspecify the nuisance parameters r and A, the score function derived therefrom is still an estimating function, giving a consistent estimator. It should be noted that a semiparametric statistical model is not m-curvature free in general. In such a case, misspecification of the nuisance parameters gives a serious bias, so that the derived estimator is inconsistent. Given any fixed independent distribution q and filter matrix B(z^{-1}), let us define a function F(y; W) by F(y; W) = F(y; W, q, B), where the fixed q and B are suppressed in the notation of F. This is an estimating function, because it satisfies

    E_{W,r,A}\left[ F(y; W, q, B) \right] = 0    (4.14)

for any sources having true independent distributions r and filters A(z^{-1}). Here, it should be noted that F(y; W, q, B) is the score function when the true nuisance parameters happen to be r = q and A = B(z^{-1}).
We have its explicit form,

    F(y; W, q, B) = \frac{dl(y; W, q, B)}{dX} = I - \left\{ \varphi_q\left( B(z^{-1}) y \right) \circ B(z^{-1}) \right\} y^T,    (4.15)

where

    q(y) = \prod_{i=1}^{n} q_i(y_i),    (4.16)
    \varphi_q = (\varphi_{q_1}, \ldots, \varphi_{q_n})^T,    (4.17)
    \varphi_{q_i} = -\frac{d \log q_i(y_i)}{dy_i}.    (4.18)

It was further proved in Amari and Cardoso (1997) that the diagonal terms of F can be arbitrary, because they are used only to assign the scales of the recovered signals, which are indefinite. When the identifiability conditions are not satisfied, those F do not necessarily satisfy equation 4.1, although they satisfy equation 4.14. Here, we state the identifiability conditions clearly (see Tong et al., 1991; Comon, 1994):

1. All the independent sources have different spectra, that is, all the A_i(z^{-1}) are different, or
2. When some sources have the same spectra, the distributions r_i of these sources are nongaussian, except for at most one source.

Summarizing these, we obtain the following theorem from the general theory (Amari & Kawanabe, 1997). The "smallest" part is proved from the fact that F(y; W, q, B) is the optimal estimating function for the sources specified by q and B.

Theorem 2. When the identifiability condition is satisfied, the smallest admissible class of estimating functions is spanned by the nondiagonal elements of F(y; W, q, B), where q, B, and the diagonal elements F_{ii} are arbitrary.

The estimating equation is

    \sum_{i=1}^{t} F(y_i; W, q, B) = 0.    (4.19)
There are many estimating functions other than the above type. For example, simultaneous diagonalization of the covariance matrices

    E\left[ x(t)\, x^T(t - \tau) \right]    (4.20)
for various \tau, or their combinations, gives an estimator (Tong et al., 1991; Molgedey & Schuster, 1994; Ikeda & Murata, 1999; Belouchrani et al., 1997). This type of estimator looks quite different from those derived from estimating functions. To our surprise, they are given by the estimating functions shown in the next section under certain conditions. Although this method is interesting from the computational viewpoint, it is not admissible in the above sense. Therefore, we can find better estimating functions. Another example of a nonadmissible estimating function (that is, one that does not belong to the admissible class) is

    \tilde{F}(y) = \varphi\left( B(z^{-1}) y \right) \left( C(z^{-1}) y \right)^T,    (4.21)

where B and C are arbitrary filters that may be equal to each other, and where the diagonal terms are arbitrary. It is easy to see that these satisfy equation 4.1.

5 Simultaneous Diagonalization of Cross-Correlation Matrices
The method of simultaneous diagonalization of cross-correlation matrices is quite different from the other methods, which use nonlinear functions, higher-order cumulants, and cost functions to be minimized. This section shows that this method can also be analyzed in the framework of estimating functions when the processes are gaussian. Let R_S(\tau), R_X(\tau), and R_Y(\tau) be the cross-correlation matrices of s(t), x(t), and y(t), respectively, defined by

    R_S(\tau) = E\left[ s(t)\, s(t - \tau)^T \right]    (5.1)

and so on. Since different sources are independent, R_S(\tau) is a diagonal matrix for any time delay \tau. These matrices are connected by the relations

    R_X(\tau) = H R_S(\tau) H^T,    (5.2)
    R_Y(\tau) = W R_X(\tau) W^T,    (5.3)

so that the true W is the one that diagonalizes R_X(\tau) for all \tau simultaneously. This leads us to the following batch algorithm for estimating W:

1. From the observed signals \{x_1, \ldots, x_t\}, calculate the empirical cross-correlation

    \hat{R}_X(\tau) = \frac{1}{t} \sum_i x_i x_{i-\tau}^T.    (5.4)
2. Prewhiten the signals by \bar{x} = \Lambda O x such that the correlation matrix with 0 delay, R_{\bar{X}}(0), of \bar{x} becomes equal to the identity matrix, where O is the orthogonal matrix composed of the eigenvectors of \hat{R}_X(0) and \Lambda is a diagonal matrix composed of the inverse square roots of its eigenvalues. We then have the resultant cross-correlation matrices \hat{R}_{\bar{X}}(\tau) of \bar{x}, where \hat{R}_{\bar{X}}(0) is the identity matrix.

3. Let W = U \Lambda O be the singular value decomposition of W. We have fixed \Lambda and O by using \hat{R}_X(0). The remaining task is to diagonalize \hat{R}_{\bar{X}}(\tau), \tau = 1, 2, \ldots, by finding a suitable orthogonal matrix U such that the \hat{R}_Y(\tau) are diagonal, where y = U \bar{x}. A typical algorithm is to find the U that minimizes a weighted sum of the squares of the off-diagonal elements of \hat{R}_Y(\tau),

    \hat{L}(U) = \sum_{i \ne j} \sum_{\tau \ne 0} c(\tau) \left\{ \hat{R}_{Y\,i,j}(\tau) \right\}^2    (5.5)

for suitable nonnegative weights c(\tau).

We show how the above method is related to estimating functions. We have, for i \ne j, \tau > 0,

    C_{ij}(\tau) = E\left[ \left\{ y_i(t)\, y_j(t - \tau) \right\}^2 \right]
        = E\left[ \{y_i(t)\}^2 \right] E\left[ \{y_j(t - \tau)\}^2 \right] + 2 \left\{ E\left[ y_i(t)\, y_j(t - \tau) \right] \right\}^2 + \mathrm{Cum},    (5.6)

where Cum denotes the fourth-order cumulant of y_i(t), y_i(t), y_j(t - \tau), y_j(t - \tau). When the source signals are gaussian, this cumulant vanishes, and hence

    C_{ij}(\tau) = \mathrm{const} + 2 \left\{ R_{ij}(\tau) \right\}^2,    (5.7)

because E[\bar{x}_i(t)^2] = E[y_i(t)^2] = 1 holds when U is an orthogonal matrix. Therefore, the cost function 5.5 to be minimized is written as

    C(U) = \frac{1}{2} \sum_{i \ne j} \sum_{\tau} c_\tau\, C_{ij}(\tau).    (5.8)
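Steps 1 through 3 and the cost of equation 5.5 can be sketched end to end for two gaussian AR(1) sources. This is an illustrative stand-in for the method: it replaces the gradient minimization by a crude grid search over the single rotation angle that parameterizes a 2 × 2 orthogonal U, and all numerical choices (AR coefficients, H, lags, weights c(τ) = 1) are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 20000, 2

# Gaussian AR(1) sources with different spectra, mixed by an arbitrary H.
a = np.array([0.9, 0.3])
s = np.zeros((n, T))
for t in range(1, T):
    s[:, t] = a * s[:, t - 1] + rng.standard_normal(n)
H = np.array([[1.0, 0.7], [0.2, 1.0]])
x = H @ s

def corr(z, tau):
    """Empirical cross-correlation matrix R_Z(tau) (equation 5.4)."""
    if tau == 0:
        return z @ z.T / z.shape[1]
    return z[:, tau:] @ z[:, :-tau].T / (z.shape[1] - tau)

# Step 2: prewhiten so that the zero-delay correlation becomes I.
w, O = np.linalg.eigh(corr(x, 0))
xbar = np.diag(w ** -0.5) @ O.T @ x

def rot(theta):
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# Step 3: rotate to diagonalize the lagged correlations by minimizing
# the off-diagonal cost of equation 5.5 with c(tau) = 1.
def cost(theta, taus=(1, 2, 3)):
    y = rot(theta) @ xbar
    return sum(corr(y, tau)[0, 1] ** 2 + corr(y, tau)[1, 0] ** 2
               for tau in taus)

thetas = np.linspace(0.0, np.pi, 361)
U = rot(thetas[np.argmin([cost(th) for th in thetas])])
W = U @ np.diag(w ** -0.5) @ O.T          # overall unmixing matrix

P = np.abs(W @ H)                          # should be ~ scaled permutation
P /= P.max(axis=1, keepdims=True)
```

Only second-order statistics are used here, which is the point of the gaussian analysis above: separation succeeds because the two sources have different spectra.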
Given a set of observations y(1), \ldots, y(t), where y(t) = U \bar{x}(t), C_{ij}(\tau) is replaced by its empirical estimate

    \hat{C}_{ij}(\tau) = \frac{1}{t} \sum_{k=1}^{t} \left\{ y_i(k)\, y_j(k - \tau) \right\}^2,    (5.9)
where y_j(k), k \le 0, are put equal to 0. The estimator \hat{U} minimizes

    \hat{C}(U) = \frac{1}{2} \sum_{i \ne j} \sum_{\tau} c_\tau\, \hat{C}_{ij}(\tau).    (5.10)

In order to calculate the derivative of \hat{C}_{ij}(\tau), we use

    d\left[ y_i^2(t)\, y_j^2(t - \tau) \right] = 2\left[ y_i(t)\, dy_i(t)\, y_j^2(t - \tau) + y_i^2(t)\, y_j(t - \tau)\, dy_j(t - \tau) \right]
        = 2 \sum_k \left[ y_i(t)\, dX_{ik}\, y_k(t)\, y_j^2(t - \tau) + y_i^2(t)\, y_j(t - \tau)\, dX_{jk}\, y_k(t - \tau) \right],    (5.11)

where

    dX = dU\, U^{-1}    (5.12)

is an antisymmetric matrix, because U is an orthogonal matrix. We note that

    d\left[ \frac{t}{2} \sum_{i \ne j} \hat{C}_{ij}(\tau) \right] = \sum_{i \ne j} \sum_k \sum_{l=1}^{t} \left[ y_i(l)\, dX_{ik}\, y_k(l)\, y_j^2(l - \tau) + y_i^2(l)\, y_j(l - \tau)\, dX_{jk}\, y_k(l - \tau) \right]
        = \sum_{l=1}^{t} \sum_k \sum_{i \ne j} dX_{ik} \left[ y_i(l)\, y_k(l)\, y_j^2(l - \tau) + y_j^2(l + \tau)\, y_i(l)\, y_k(l) \right],    (5.13)

where y(l), l > t, is put equal to 0. We take up the coefficient of dX_{ik} in the first term of the right-hand side of equation 5.13,

    \sum_{j \ne i} y_i(l)\, y_k(l)\, y_j^2(l - \tau).    (5.14)

Because dX_{ik} = -dX_{ki} holds, the coefficients of dX_{ik} are summarized into

    y_i(l)\, y_k(l) \left\{ \sum_{j \ne i} y_j^2(l - \tau) - \sum_{j \ne k} y_j^2(l - \tau) \right\} = y_i(l)\, y_k(l) \left\{ y_k^2(l - \tau) - y_i^2(l - \tau) \right\}.    (5.15)
A similar relation holds for the second terms. Therefore, the derivatives are given by

    \frac{\partial}{\partial X_{ik}} \left[ \frac{t}{2} \sum_{m \ne n} \hat{C}_{mn}(\tau) \right] = \sum_l y_i(l)\, y_k(l) \left[ y_k^2(l - \tau) - y_i^2(l - \tau) + y_k^2(l + \tau) - y_i^2(l + \tau) \right].    (5.16)

Summarizing these results, we have the following theorem:

Theorem 3. When the source signals are gaussian, the method of simultaneous diagonalization is equivalent to the estimating function method with the estimating function

    F_{ik}(y, U) = y_i y_k \sum_{\tau} c(\tau) \left( z^{-\tau} + z^{\tau} \right) \left( y_k^2 - y_i^2 \right).    (5.17)

Note that the estimating function 5.17 does not belong to the admissible class. Hence, one can find a better estimating function in the class E.

6 Standardized Estimating Function and Error Analysis

6.1 Equivalence Class of Estimating Functions. Two estimating functions may give the same estimator; if this is the case, they are said to be equivalent. We study the equivalence of estimating functions in this subsection. Let R(W) = (R_{AB}) be a nonsingular matrix operator that may depend on W. It operates on F = (F_B) such that
    RF = \left( \sum_B R_{AB} F_B \right).    (6.1)
When F is an estimating function, \tilde{F} = RF is also an estimating function, because

    E_{W,r,A}\left[ R(W)\, F(y, W) \right] = 0    (6.2)

holds for any W, r, A. The two estimating equations

    \sum_i F(y_i, W) = 0,    (6.3)
    \sum_i \tilde{F}(y_i, W) = 0    (6.4)

give the same solution, so they are equivalent.
Among a class of equivalent estimating functions, the one F^* that satisfies

    K^* = E\left[ \frac{\partial F^*}{\partial X} \right] = identity operator    (6.5)

is called the standardized estimating function (Amari, 1999a). Given an estimating function F, its standardized form is given by

    F^* = K^{-1} F,    (6.6)

because

    K^* = E\left[ \frac{\partial F^*}{\partial X} \right] = \frac{\partial K^{-1}}{\partial X} E[F] + K^{-1} E\left[ \frac{\partial F}{\partial X} \right] = K^{-1} K = identity.    (6.7)

From equations 4.12 and 4.13, we see that the asymptotic covariance V of an estimating function is given, in terms of the standardized estimating function, by

    V = E[F^* \circ F^*],    (6.8)

where \circ is the tensor product.

6.2 Derivation of Standardized Estimating Function. We now calculate
    K = E\left[ \frac{\partial F(y; W, q, B)}{\partial X} \right]    (6.9)

at the true solution W = H^{-1}. Rewriting F = dl/dX, or

    dl = -\mathrm{tr}(dX) + \varphi(\tilde{y})^T B(z^{-1})\, dX\, y,    (6.10)

in component form, where we put

    \tilde{y} = B(z^{-1}) y,    (6.11)
we calculate the second-order differential in component form as follows:

    d^2 l = d\left\{ \sum_{i,j,k} \varphi_i(\tilde{y}_i)\, b_{ik}\, y_j(t - k)\, dX_{ij} \right\}
        = \sum_{i,j,k} \left\{ \dot{\varphi}_i(\tilde{y}_i)\, d\tilde{y}_i\, b_{ik}\, y_j(t - k) + \varphi_i(\tilde{y}_i)\, b_{ik}\, dy_j(t - k) \right\} dX_{ij}
        = \sum_{i,j,m,l,k} \dot{\varphi}_i(\tilde{y}_i)\, b_{il}\, y_m(t - l)\, b_{ik}\, y_j(t - k)\, dX_{im}\, dX_{ij}
          + \sum_{i,j,m,k} \varphi_i(\tilde{y}_i)\, b_{ik}\, y_m(t - k)\, dX_{jm}\, dX_{ij},    (6.12)

where \dot{\varphi}_i(y) = \frac{d}{dy}\{\varphi_i(y)\}. At the true solution, where

    E\left[ \dot{\varphi}(\tilde{y}_i)\, y_j\, y_m \right] = E\left[ \varphi(\tilde{y}_i)\, y_j \right] = 0  unless i = j = m,    (6.13)

we have

    d^2 l = \sum_i E\left[ \dot{\varphi}_i(\tilde{y}_i)\, \tilde{y}_i^2 \right] (dX_{ii})^2
        + \sum_{i,j} E\left[ \varphi(\tilde{y}_i)\, \tilde{y}_i \right] dX_{ij}\, dX_{ji}
        + \sum_{i \ne j} E\left[ \dot{\varphi}_i(\tilde{y}_i) \left\{ \sum_{k=0} b_{ik}\, y_j(t - k) \right\}^2 \right] (dX_{ij})^2.    (6.14)

Hence, the quadratic form d^2 l in terms of the dX_{ij} splits into the diagonal terms

    \sum_i (\tilde{m}_i + 1)(dX_{ii})^2    (6.15)

and 2 \times 2 minor matrices consisting of dX_{ij} and dX_{ji} (i \ne j),

    \sum_{i \ne j} \left\{ \tilde{\kappa}_i \tilde{\sigma}_{ij}^2\, (dX_{ij})^2 + dX_{ij}\, dX_{ji} \right\},    (6.16)

where we put

    \tilde{m}_i = E\left[ \dot{\varphi}_i(\tilde{y}_i)\, \tilde{y}_i^2 \right],    (6.17)
    \tilde{\kappa}_i = E\left[ \dot{\varphi}_i(\tilde{y}_i) \right],    (6.18)
    \tilde{\sigma}_{ij}^2 = E\left[ \left\{ \sum_{l=0} b_{il}\, y_j(t - l) \right\}^2 \right].    (6.19)
When we put

    F_{ii}(y) = 1 - y_i^2,    (6.20)

the recovered signals satisfy

    E\left[ y_i^2 \right] = 1.    (6.21)

In this case, the diagonal term is

    2 \sum_i dX_{ii}^2,    (6.22)

and the 2 \times 2 diagonal terms are

    \tilde{\kappa}_i \tilde{\sigma}_{ij}^2\, (dX_{ij})^2 + \tilde{h}_i\, dX_{ij}\, dX_{ji},    (6.23)

where

    \tilde{h}_i = E\left[ \varphi(\tilde{y}_i)\, \tilde{y}_i \right].    (6.24)

The inverse of K has the same block structure as K. Its diagonal parts K_{AA} for A = (i, i) are

    k_{ii,ii} = \frac{1}{1 + \tilde{m}_i},    (6.25)

and its 2 \times 2 diagonal parts K_{AA'} for A = (i, j) and A' = (j, i), i \ne j, are

    K_{AA'} = c_{ij} \begin{pmatrix} \tilde{\kappa}_j \tilde{\sigma}_{ji}^2 & -1 \\ -1 & \tilde{\kappa}_i \tilde{\sigma}_{ij}^2 \end{pmatrix},    (6.26)

where

    c_{ij} = \frac{1}{\tilde{\kappa}_i \tilde{\kappa}_j \tilde{\sigma}_{ij}^2 \tilde{\sigma}_{ji}^2 - 1}.    (6.27)

We have similar expressions for F_{ii} = 1 - y_i^2. We thus have the standardized estimating function.

Theorem 4. The standardized estimating function is given by

    F^*_{ij} = c_{ij} \left[ -\tilde{\kappa}_j \tilde{\sigma}_{ji}^2\, \varphi_i(\tilde{y}_i)\, \tilde{y}_j^{(i)} + \varphi_j(\tilde{y}_j)\, \tilde{y}_i^{(j)} \right],    (6.28)
    F^*_{ii} = \frac{1}{\tilde{m}_i + 1} \left\{ 1 - \varphi_i(\tilde{y}_i)\, \tilde{y}_i \right\},    (6.29)

where

    \tilde{y}_j^{(i)} = B_i(z^{-1})\, y_j.    (6.30)
6.3 Asymptotic Errors. The asymptotic estimation error is easily obtained from

    G^*_{AB} = E\left[ F^*_A F^*_B \right].    (6.31)

The results in this section are derived in a way similar to that used in Amari (1999a).

Theorem 5.

    G^*_{ik,jk} = c_{ik} c_{jk}\, \tilde{\sigma}_{ik}^2 \tilde{\sigma}_{jk}^2 \tilde{\sigma}_{kk}^2\, \tilde{\kappa}_j^2\, \tilde{\lambda}_i \tilde{\lambda}_j,    (6.32)
    G^*_{ii,ij} = \frac{1}{\tilde{m}_i + 1}\, c_{ij}\, \tilde{\kappa}_j\, \tilde{\sigma}_{ii}^2 \tilde{\sigma}_{ij}^2\, \tilde{\lambda}_j\, E\left[ \tilde{y}_i^2\, \varphi_i(\tilde{y}_i) \right],    (6.33)

where

    \tilde{\lambda}_i = E\left[ \varphi_i(\tilde{y}_i) \right].    (6.34)

The error covariances of the recovered signals are given from

    E\left[ \Delta y_i\, \Delta y_j \right] = \sum_k E\left[ \Delta X_{ik}\, \Delta X_{jk} \right] \sigma_k^2.    (6.35)

It is remarkable that the "superefficiency"

    E\left[ \Delta y_i\, \Delta y_j \right] = O\left( \frac{1}{t^2} \right),  (i \ne j)    (6.36)

holds when the condition

    \tilde{\lambda}_i = E\left[ \varphi_i(\tilde{y}_i) \right] = 0    (6.37)

holds. The proof is similar to the noncorrelated case (Amari, 1999a).

7 Stability and Efficiency of On-Line Learning

7.1 On-Line Learning. Given an estimating function F, we can construct an on-line learning algorithm for \Delta W_t = W_{t+1} - W_t,

    \Delta W_t = \eta_t F(y_t, W_t)\, W_t,    (7.1)

where \eta_t is a learning rate depending on t, and W_t is multiplied from the right to implement natural gradient learning. Let R(W)F be an arbitrary estimating function equivalent to F. Batch estimation gives the same result for F and R(W)F. However,

    \Delta W_t = \eta_t R(W_t)\, F(y_t, W_t)\, W_t    (7.2)

gives a different dynamical behavior from that of F. They have not only different asymptotic errors but also different stability at the true solution.
Therefore, we should carefully choose an adequate estimating function from the equivalence class.

7.2 Stability of Learning Algorithm. The estimating function

    F_{ij}(y) = I - \varphi_i(\tilde{y}_i)\, \tilde{y}_j^{(i)}    (7.3)

is a generalization of the most popular estimating functions of the type F = I - \varphi(y) y^T. The stability of the learning equation was studied in Amari, Chen, and Cichocki (1997) in the latter case. A similar analysis is valid in the present general case, and we give only the results, since the proof is very similar.

Theorem 6. The separating solution is asymptotically stable when and only when

    1. \tilde{m}_i + 1 > 0,    (7.4)
    2. \tilde{\kappa}_i > 0,    (7.5)
    3. \tilde{\kappa}_i \tilde{\kappa}_j \tilde{\sigma}_{ij}^2 \tilde{\sigma}_{ji}^2 > 1.    (7.6)

When the observed signals x are prewhitened such that

    E\left[ x_i(t)\, x_j(t) \right] = \delta_{ij},    (7.7)

the remaining degrees of freedom for W are within orthogonal transformations. In this case, we have the additional restrictions

    dX_{ij} = -dX_{ji}.    (7.8)

This is derived by differentiating the orthogonality condition,

    W W^T = I,    (7.9)

so that

    dW\, W^T + W\, dW^T = dX + dX^T = 0.    (7.10)

The learning algorithm reduces to

    \Delta W_t = -\eta_t \left[ \varphi(\tilde{y}) \circ \left( B(z^{-1}) y \right)^T \right]_{anti} W    (7.11)

in this case, where [A]_{anti} denotes the antisymmetric part of matrix A, that is,

    [A]_{anti} = \frac{1}{2}\left( A - A^T \right).    (7.12)
The second differential, 6.14, reduces to

    d^2 l = \sum_{i \ne j} \left\{ -1 + \tilde{\kappa}_i \tilde{\sigma}_{ij}^2 + \tilde{\kappa}_j \tilde{\sigma}_{ji}^2 \right\} (dX_{ij})^2.    (7.13)

Hence, the stability condition is

    \tilde{\kappa}_i \tilde{\sigma}_{ij}^2 + \tilde{\kappa}_j \tilde{\sigma}_{ji}^2 > 2.    (7.14)

When the source signals are temporally correlated, one can use only the second-order moments of the x_i(t) for separating the source signals, provided the sources have different spectra. In this case, one may put

    \varphi_i(y_i) = y_i,    (7.15)
    \dot{\varphi}_i(y_i) = 1.    (7.16)

Here,

    \tilde{m}_i + 1 > 0,    (7.17)
    \tilde{\kappa}_i = 1    (7.18)

automatically hold. Therefore, the stability condition reduces to

    \tilde{\sigma}_{ij}^2 \tilde{\sigma}_{ji}^2 > 1    (7.19)

in the general case, and

    \tilde{\sigma}_{ij}^2 + \tilde{\sigma}_{ji}^2 > 2    (7.20)

in the prewhitened case.

7.3 Stable and Efficient On-Line Learning. On-line learning uses one observed signal y_t = W_t x_t to update the current W_t and then discards it. Hence, x_t is used once, when it is observed, and it is impossible to use old data again. On-line learning can be implemented simply and is robust under changing environments, but its efficiency is in general worse than that of batch estimation using the same estimating function. However, the following theorem (Amari, 1998) shows that on-line learning with the standardized estimating function is asymptotically as efficient as the batch procedure using the same estimating function.
Theorem 7. The asymptotic error covariance matrix of learning by the standardized estimating function is the same as the batch estimation error when \eta_t = 1/t.
Moreover, we can prove the following stability theorem:

Theorem 8. The true solution is always stable under on-line learning with the standardized estimating function, provided the identifiability condition is satisfied.

In order to implement the standardized estimating function, we need to know the statistics \tilde{\kappa}_i, \tilde{\sigma}_{ij}^2, and so on, which depend on the source probability distribution. Hence, we need to use an adaptive method of estimating them.

8 Conclusions
We have proposed a new type of batch estimator and on-line learning algorithm for the case when the source signals have unknown temporal correlations. Moreover, we have elucidated the relationship among the various estimators proposed so far from the point of view of the information geometry of estimating functions. Asymptotic error analysis, as well as the stability and efficiency of the learning algorithms, was given. Practical implementations and applications of the theoretical study are subjects for future study.

References

Amari, S. (1985). Differential-geometrical methods of statistics. Berlin: Springer-Verlag.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 251–276.
Amari, S. (1999a). Superefficiency in blind source separation. IEEE Trans. Signal Processing, 47, 936–944.
Amari, S. (1999b). Natural gradient for over- and under-complete ICA. Neural Computation, 11, 1875–1883.
Amari, S., & Cardoso, J.-F. (1997). Blind source separation—Semiparametric statistical approach. IEEE Trans. on Signal Processing, 45, 2692–2700.
Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of adaptive blind source separation. Neural Networks, 10, 1345–1351.
Amari, S., & Cichocki, A. (1998). Adaptive blind signal processing—neural network approach. Proceedings of IEEE, 86, 2026–2048.
Amari, S., Cichocki, A., & Yang, H. H. (1996). A new learning algorithm for blind signal separation. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 752–763). Cambridge, MA: MIT Press.
Amari, S., Douglas, S. C., & Cichocki, A. (1998). Multichannel blind deconvolution and source separation using the natural gradient. Manuscript submitted for publication.
Amari, S., & Kawanabe, M. (1997). Information geometry of estimating functions in semi-parametric statistical models. Bernoulli, 3, 29–54.
Amari, S., & Nagaoka, H. (1999). Introduction to information geometry. Oxford: Oxford University Press.
Attias, H., & Schreiner, C. E. (1998). Blind source separation and deconvolution: The dynamic component analysis algorithm. Neural Computation, 10, 1373–1424.
Bell, A. J., & Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Computation, 7, 1129–1159.
Belouchrani, A., Abed-Meraim, K., Cardoso, J.-F., & Moulines, E. (1997). A blind source separation technique using second-order statistics. IEEE Trans. on Signal Processing, 45, 434–444.
Cardoso, J.-F., & Laheld, B. (1996). Equivariant adaptive source separation. IEEE Trans. Signal Processing, SP-43, 3017–3029.
Cichocki, A., Thawonmas, R., & Amari, S. (1997). Sequential blind signal extraction in order specified by stochastic properties. Electronics Letters, 33, 64–65.
Comon, P. (1994). Independent component analysis: A new concept? Signal Processing, 36, 287–314.
Davies, M. (1999). Blind signal separation through generative AR modelling. Manuscript submitted for publication.
Ikeda, S., & Murata, N. (1998). A method of blind separation based on temporal structure of signals. In Proceedings of the Fifth International Conference on Neural Information Processing (pp. 737–742).
Jutten, C., & Hérault, J. (1991). Blind separation of sources, part I: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1–20.
Lewicki, M. S., & Sejnowski, T. (1998). Learning nonlinear overcomplete representations for efficient coding. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 815–821). Cambridge, MA: MIT Press.
Molgedey, L., & Schuster, H. G. (1994). Separation of a mixture of independent signals using time delayed correlations. Phys. Rev. Lett., 72, 3634–3637.
Murata, N., Ikeda, S., & Ziehe, A. (1999). An approach to blind source separation based on temporal structure of speech signals (BSIS Tech. Rep. 98-2).
Oja, E. (1997). The nonlinear PCA learning rule in independent component analysis. Neurocomputing, 17, 25–45.
Parra, L. (1998). Temporal models in blind source separation. In L. Giles & L. Gori (Eds.), Adaptive processing of sequences and data structures (pp. 229–247). Berlin: Springer-Verlag.
Perlmutter, B. A., & Parra, L. C. (1997). Maximum likelihood blind source separation: A context-sensitive generalization of ICA. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Pham, D. T., & Garat, P. (1997). Blind separation of mixture of independent sources through a quasi-maximum likelihood approach. IEEE Trans. Signal Processing, 45, 1457–1482.
Tong, L., Liu, R.-W., Soon, V. C., & Huang, J. F. (1991). Indeterminacy and identifiability of blind identification. IEEE Trans. Circuits and Systems, 38, 499–509.
Estimating Functions
2107
Zhang, L.-Q., Amari, S., & Cichocki, A. (1999). Semiparametric model and superefciency in blind deconvolution. Manuscript submitted for publication. Ziehe, A., & Muller, ¨ K-R. (1998), TDSEP—an efcient algorithm of blind separation using time structure, In L. Niklasson et al (Eds.), Perspectives in neural computing (Proceedings of ICANN’98) (pp. 675–680). Berlin: Springer-Verlag. Received January 11, 1999; accepted September 16, 1999.
LETTER
Communicated by Christopher Bishop
SMEM Algorithm for Mixture Models

Naonori Ueda
Ryohei Nakano*
NTT Communication Science Laboratories, Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan

Zoubin Ghahramani
Geoffrey E. Hinton
Gatsby Computational Neuroscience Unit, University College London, London WC1N 3AR, U.K.

We present a split-and-merge expectation-maximization (SMEM) algorithm to overcome the local maxima problem in parameter estimation of finite mixture models. In the case of mixture models, local maxima often involve having too many components of a mixture model in one part of the space and too few in another, widely separated part of the space. To escape from such configurations, we repeatedly perform simultaneous split-and-merge operations using a new criterion for efficiently selecting the split-and-merge candidates. We apply the proposed algorithm to the training of gaussian mixtures and mixtures of factor analyzers using synthetic and real data and show the effectiveness of using the split-and-merge operations to improve the likelihood of both the training data and of held-out test data. We also show the practical usefulness of the proposed algorithm by applying it to image compression and pattern recognition problems.

1 Introduction
Mixture density models, in particular normal mixtures, have been used extensively in the field of statistical pattern recognition (MacLachlan & Basford, 1987). Recently, more sophisticated mixture density models, such as mixtures of latent variable models (e.g., probabilistic PCA and factor analysis), have been proposed to approximate the underlying data manifold (Hinton, Dayan, & Revow, 1997; Tipping & Bishop, 1997; Ghahramani & Hinton, 1997). The parameters of these mixture models can be estimated using the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) based on the maximum likelihood framework. A common and

* Present address: Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan

Neural Computation 12, 2109-2128 (2000)  © 2000 Massachusetts Institute of Technology
serious problem associated with these EM algorithms, however, is the local maxima problem. Although this problem has been pointed out by many researchers, the best way to solve it in practice is still an open question. Two of the authors have proposed the deterministic annealing EM (DAEM) algorithm (Ueda & Nakano, 1998), in which a modified posterior probability parameterized by temperature is derived to avoid local maxima. However, when mixture density models are involved, local maxima arise when there are too many components of a mixture model in one part of the space and too few in another. The DAEM algorithm and other algorithms are not very effective at avoiding such local maxima because they are not able to move a component from an overpopulated region to an underpopulated region without passing through positions that give a lower likelihood. We therefore introduce a discrete move that simultaneously merges two components in an overpopulated region and splits a component in an underpopulated region. The idea of performing split-and-merge operations has been successfully applied to clustering (Ball & Hall, 1967) and vector quantization (Ueda & Nakano, 1994). Recently, split-and-merge operations have also been proposed for Bayesian normal mixture analysis (Richardson & Green, 1997); since that approach uses split-and-merge operations within a Markov chain Monte Carlo method, however, it is computationally much more costly than our algorithm. In addition, we introduce new split-and-merge criteria to select the split-and-merge candidates efficiently. Although the proposed method, unlike the DAEM algorithm, is limited to mixture models, we have experimentally confirmed that our split-and-merge EM (SMEM) algorithm obtains better solutions than the DAEM algorithm. We have already given a basic idea of the SMEM algorithm (Ueda, Nakano, Ghahramani, & Hinton, 1999). This article describes the algorithm in detail and shows real applications, including image compression and pattern recognition.
2 EM Algorithm
Before describing our SMEM algorithm, we briefly review the EM algorithm. Suppose that a set $Z$ consists of observed data $X$ and unobserved data $Y$. $Z = (X, Y)$ and $X$ are called complete data and incomplete data, respectively. Assume that the joint probability density of $Z$ is parametrically given as $p(X, Y; \Theta)$, where $\Theta$ denotes the parameters of the density to be estimated. The maximum likelihood estimate of $\Theta$ is a value of $\Theta$ that maximizes the incomplete-data log-likelihood function:

$$L(\Theta; X) \stackrel{\text{def}}{=} \log p(X; \Theta) = \log \int p(X, Y; \Theta)\, dY. \tag{2.1}$$

The characteristic of the EM algorithm is to maximize the incomplete-data
log-likelihood function by iteratively maximizing the expectation of the complete-data log-likelihood function:

$$L_c(\Theta; Z) \stackrel{\text{def}}{=} \log p(X, Y; \Theta). \tag{2.2}$$

Suppose that $\Theta^{(t)}$ denotes the estimate of $\Theta$ obtained after the $t$th iteration of the algorithm. Then, at the $(t+1)$th iteration, the E-step computes the expected complete-data log-likelihood function, denoted by $Q(\Theta | \Theta^{(t)})$ and defined by

$$Q(\Theta | \Theta^{(t)}) \stackrel{\text{def}}{=} E\{L_c(\Theta; Z) \mid X; \Theta^{(t)}\}, \tag{2.3}$$
and the M-step finds the $\Theta$ maximizing $Q(\Theta | \Theta^{(t)})$. The convergence of the EM steps is theoretically guaranteed (Dempster, Laird, & Rubin, 1977).

3 Split-and-Merge EM Algorithm

3.1 Split-and-Merge Operations. We restrict ourselves here to mixture density models. The probability density function (pdf) of a mixture of $M$ density models is given by

$$p(x; \Theta) = \sum_{m=1}^{M} \alpha_m\, p_m(x; \theta_m), \tag{3.1}$$

where $\alpha_m$ is the mixing proportion of the $m$th model¹ and satisfies $\alpha_m \ge 0$ and $\sum_{m=1}^{M} \alpha_m = 1$. The $p_m(x; \theta_m)$ is a $d$-dimensional density model corresponding to the $m$th model. Clearly, $\Theta = \{(\alpha_m, \theta_m),\ m = 1, \ldots, M\}$ is the unknown parameter set. In the case of mixture models, the model index $m \in \{1, \ldots, M\}$ is unknown for an observed data point $x_n$, and therefore $m$ corresponds to the unobserved data mentioned in section 2. Noting that the pdf of the complete data is $p(x, m; \Theta) = \alpha_m p_m(x; \theta_m)$ in this case, we have

$$Q(\Theta | \Theta^{(t)}) = \sum_{n=1}^{N} \sum_{m=1}^{M} \{\log \alpha_m p_m(x_n; \theta_m)\}\, P(m | x_n; \Theta^{(t)}). \tag{3.2}$$

Here, $P(m | x_n; \Theta^{(t)})$ is a posterior probability and is computed as

$$P(m | x_n; \Theta^{(t)}) = \frac{\alpha_m^{(t)} p_m(x_n; \theta_m^{(t)})}{\sum_{l=1}^{M} \alpha_l^{(t)} p_l(x_n; \theta_l^{(t)})}. \tag{3.3}$$
$N$ is the number of observed data points.

¹ Strictly speaking, we should call it the $m$th model component or model component $\omega_m$, but to simplify the notation and wording, we call it model $m$ or the $m$th model hereafter.
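Equation 3.3 can be evaluated for all data points at once in the log domain for numerical stability; the following is a minimal numpy sketch with gaussian components (the gaussian choice and all function names are ours, for illustration only):

```python
import numpy as np

def log_gauss(X, mu, cov):
    """Log-density of a full-covariance gaussian at each row of X."""
    d = X.shape[1]
    diff = X - mu
    chol = np.linalg.cholesky(cov)
    sol = np.linalg.solve(chol, diff.T)          # chol @ sol = diff.T
    maha = np.sum(sol ** 2, axis=0)              # Mahalanobis distances
    logdet = 2.0 * np.sum(np.log(np.diag(chol)))
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def posteriors(X, alphas, mus, covs):
    """Equation 3.3: P(m | x_n) for all n, m, computed in the log domain."""
    logp = np.stack([np.log(a) + log_gauss(X, mu, cov)
                     for a, mu, cov in zip(alphas, mus, covs)], axis=1)
    logp -= logp.max(axis=1, keepdims=True)      # stabilize before exponentiating
    P = np.exp(logp)
    return P / P.sum(axis=1, keepdims=True)      # rows sum to one
```

The returned $N \times M$ matrix is reused below by the split-and-merge criteria.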
Looking at equation 3.2 carefully, one can see that the $Q$ function can be represented in the form of a direct sum,

$$Q(\Theta | \Theta^{(t)}) = \sum_{m=1}^{M} q_m(\theta_m | \Theta^{(t)}), \tag{3.4}$$

where

$$q_m(\theta_m | \Theta^{(t)}) = \sum_{n=1}^{N} P(m | x_n; \Theta^{(t)}) \log \alpha_m p_m(x_n; \theta_m) \tag{3.5}$$

depends only on $\alpha_m$ and $\theta_m$. Let $\Theta^*$ denote the parameter values estimated by the usual EM algorithm. Then, after the EM algorithm has converged, the $Q$ function can be rewritten as

$$Q^* = q_i^* + q_j^* + q_k^* + \sum_{m,\ m \ne i,j,k} q_m^*. \tag{3.6}$$
We then try to increase the first three terms of the right-hand side of equation 3.6 by merging models $i$ and $j$ to produce a model $i'$ and splitting model $k$ into two models $j'$ and $k'$.

3.1.1 Initialization. To reestimate the parameters of these new models, we have to initialize the parameters corresponding to the new models using $\Theta^*$. Intuitively natural initializations are given below. The initial parameter values for the merged model $i'$ are set as linear combinations of the original ones before the merge:

$$\alpha_{i'} = \alpha_i^* + \alpha_j^* \quad \text{and} \quad \theta_{i'} = \frac{\alpha_i^* \theta_i^* + \alpha_j^* \theta_j^*}{\alpha_i^* + \alpha_j^*}. \tag{3.7}$$

Noting that the mixing proportion is estimated as the average of the posteriors over the data, $\alpha_l^* = (1/N) \sum_{n=1}^{N} P(l | x_n; \Theta^*)$, one can see that the initialization of $\theta_{i'}$ is the linear combination of $\theta_i^*$ and $\theta_j^*$, weighted by the posteriors. On the other hand, for models $j'$ and $k'$, we set

$$\alpha_{j'} = \alpha_{k'} = \frac{\alpha_k^*}{2}, \quad \theta_{j'} = \theta_k^* + \epsilon \quad \text{and} \quad \theta_{k'} = \theta_k^* + \epsilon', \tag{3.8}$$

where $\epsilon$ and $\epsilon'$ are small, random perturbation vectors or matrices (i.e., $\|\epsilon\| \ll \|\theta_k^*\|$). In the case of mixtures of gaussians, the covariance matrices $\Sigma_{j'}$ and $\Sigma_{k'}$ should be positive definite. In this case, we can initialize them as

$$\Sigma_{j'} = \Sigma_{k'} = \det(\Sigma_k^*)^{1/d}\, I_d \tag{3.9}$$
Figure 1: An example of initialization in a two-dimensional gaussian case. (a) A gaussian just before split (left) and initialized gaussians just after split (right). (b) Two gaussians just before merge (left) and an initialized gaussian just after merge (right).
instead of equation 3.8. Here, $\det(\Sigma)$ denotes the determinant of matrix $\Sigma$, and $I_d$ is the $d$-dimensional identity matrix. Figure 1 shows a simple example of the initialization steps for a two-dimensional gaussian case using equations 3.7 through 3.9.

3.1.2 Partial EM Steps. The parameter reestimation for $m' = i'$, $j'$, and $k'$ can be done by using EM steps, but instead of equation 3.3, we use the following modified posterior probability:

$$P(m' | x; \Theta^{(t)}) = \frac{\alpha_{m'}^{(t)} p_{m'}(x; \theta_{m'}^{(t)})}{\sum_{l=i',j',k'} \alpha_l^{(t)} p_l(x; \theta_l^{(t)})} \times \sum_{m=i,j,k} P(m | x; \Theta^*), \quad \text{for } m' = i', j', k'. \tag{3.10}$$
Using equation 3.10, the sum of the posterior probabilities for models $i'$, $j'$, and $k'$ becomes equal to the sum of the posterior probabilities for models $i$, $j$, and $k$ just before the split-and-merge. That is,

$$\sum_{m'=i',j',k'} P(m' | x; \Theta^{(t)}) = \sum_{m=i,j,k} P(m | x; \Theta^*) \tag{3.11}$$
always holds during the reestimation process. By this, we can reestimate the parameters for models $i'$, $j'$, and $k'$ consistently without affecting the other models. We call these EM steps partial EM steps. These partial EM steps make the total algorithm efficient.

3.2 SMEM Algorithm. After the partial EM steps, the usual EM steps, called the full EM steps, are performed as a postprocessing operation. After these steps, if $Q$ is improved, then we accept the new estimate and repeat the above after setting the new parameters to $\Theta^*$. Otherwise we reject it, go back to $\Theta^*$, and try another candidate. We summarize these as the following SMEM algorithm:

1. Perform the usual EM updates from some initial parameter value $\Theta$ until convergence. Let $\Theta^*$ and $Q^*$ denote the estimated parameters and the corresponding $Q$ function value after the EM algorithm has converged.

2. Sort the split-and-merge candidates by computing the split-and-merge criteria (described in the next section) based on $\Theta^*$. Let $\{i, j, k\}_c$ denote the $c$th candidate.

3. For $c = 1, \ldots, C_{\max}$, perform the following: after making the initial parameter settings based on $\Theta^*$, perform the partial EM steps for $\{i, j, k\}_c$ and then perform the full EM steps until convergence. Let $\Theta^{**}$ be the obtained parameters and $Q^{**}$ be the corresponding $Q$ function value after the full EM has converged. If $Q^{**} > Q^*$, then set $Q^* \leftarrow Q^{**}$, $\Theta^* \leftarrow \Theta^{**}$, and go to step 2.

4. Halt with $\Theta^*$ as the final parameters.

Note that when a split-and-merge candidate that improves the $Q$ function value is found in step 3, the remaining candidates are ignored. There is therefore no guarantee that the chosen split-and-merge candidates give the largest possible improvement in $Q$. This is not a major problem, however, because the split-and-merge operations are performed repeatedly. If there were no heuristics for ordering potential split-and-merge operations, we would have to consider all $C_{\max} = M(M-1)(M-2)/2$ of them, but experimentally we have confirmed that $C_{\max} \simeq 5$ may be enough because the split-and-merge criteria do work well. The SMEM algorithm monotonically increases the $Q$ function value, and if the $Q$ function value does not increase for all $c = 1, \ldots, C_{\max}$, then the algorithm stops. Since the full EM steps, equivalent to the original EM steps, are performed after the convergence of the partial EM steps, it is clear that the SMEM algorithm maintains the global convergence properties of the EM algorithm.

In the SMEM algorithm, the split-and-merge operations are simultaneously performed so that the total number of mixture components is unchanged. In general, the $Q$ function value increases as the number of parameters (or the number of mixture components) increases. Thus, in order to check whether $Q$ is improved by the rearrangement of model components at step 3, it is necessary to keep the number of model components unchanged. Intuitively, a simultaneous split-and-merge can be viewed as a way of tunneling through low-likelihood barriers, thereby eliminating many poor local optima. In this respect, it has some similarities with simulated annealing, but the moves that are considered are long range and very specific to the particular problems that arise when fitting mixture models.

3.3 Split-and-Merge Criteria. Each of the split-and-merge candidates can be evaluated by its $Q$ function value after step 3 of the SMEM algorithm mentioned in section 3.2. However, since there are so many candidates, some reasonable criteria for ordering the split-and-merge candidates should be used to accelerate the SMEM algorithm.
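The overall loop of section 3.2 can be summarized in compact Python. The four callables stand for routines described in the text (full EM, the partial EM steps, the initialization of equations 3.7 through 3.9, and candidate ranking by the criteria below); they are placeholders of our own, not part of the paper:

```python
def smem(theta0, full_em, partial_em, init_candidate, rank_candidates, c_max=5):
    """Split-and-merge EM loop (steps 1-4 of section 3.2).

    full_em(theta)              -> (theta*, Q*)  : EM to convergence
    init_candidate(theta, cand) -> theta         : equations 3.7-3.9
    partial_em(theta, cand)     -> theta         : partial EM for {i, j, k}_c
    rank_candidates(theta)      -> candidates sorted by the criteria of section 3.3
    """
    theta, q = full_em(theta0)                       # step 1
    improved = True
    while improved:
        improved = False
        for cand in rank_candidates(theta)[:c_max]:  # steps 2-3, C_max cutoff
            t = init_candidate(theta, cand)          # initialize i', j', k'
            t = partial_em(t, cand)                  # partial EM steps
            t, q_new = full_em(t)                    # full EM steps
            if q_new > q:                            # accept; restart from step 2
                theta, q = t, q_new
                improved = True
                break
    return theta, q                                  # step 4
```

Because an accepted candidate always strictly increases $Q$, and the loop stops once no candidate improves it, the skeleton inherits the monotonicity argument given in the text.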
3.3.1 Merge Criterion. In general, when there are many data points, each of which has almost equal posterior probability (given by equation 3.3) for two components, it is plausible that these two components should be merged. To evaluate this numerically, we define the following merge criterion:²

$$J_{\mathrm{merge}}(i, j; \Theta^*) = \mathbf{P}_i(\Theta^*)^T \mathbf{P}_j(\Theta^*), \tag{3.12}$$

where $\mathbf{P}_i(\Theta^*) = (P(i | x_1; \Theta^*), \ldots, P(i | x_N; \Theta^*))^T \in \mathbb{R}^N$ is the $N$-dimensional vector of posterior probabilities for the $i$th model, $T$ denotes the transpose operation, and $\|\cdot\|$ (below) denotes the Euclidean vector norm. Clearly, two components $\omega_i$ and $\omega_j$ with large $J_{\mathrm{merge}}(i, j; \Theta^*)$ are good candidates for a merge.

² This merge criterion favors merges involving larger components. To avoid this bias, the normalized criterion $J_{\mathrm{merge}}(i, j; \Theta^*) = \mathbf{P}_i(\Theta^*)^T \mathbf{P}_j(\Theta^*) / (\|\mathbf{P}_i(\Theta^*)\| \|\mathbf{P}_j(\Theta^*)\|)$ can be used. In our experiments, however, we used equation 3.12 for simplicity, and the merge criterion did work well, as shown later.

3.3.2 Split Criterion. As a split criterion ($J_{\mathrm{split}}$), we define the local Kullback divergence

$$J_{\mathrm{split}}(k; \Theta^*) = \int f_k(x; \Theta^*) \log \frac{f_k(x; \Theta^*)}{p_k(x; \theta_k^*)}\, dx, \tag{3.13}$$

which is the distance between two distributions: the local data density $f_k(x)$ around the $k$th model and the density of the $k$th model specified by the current parameter estimate $\Theta^*$. The local data density is defined as

$$f_k(x; \Theta^*) = \frac{\sum_{n=1}^{N} \delta(x - x_n)\, P(k | x_n; \Theta^*)}{\sum_{n=1}^{N} P(k | x_n; \Theta^*)}. \tag{3.14}$$

This is a modified empirical distribution, weighted by the posterior probability so that the data around the $k$th model are emphasized. Note that when the weights are equal, that is, $P(k | x; \Theta^*) = 1/M$, equation 3.14 reduces to the usual empirical distribution:

$$p_k(x; \Theta^*) = \frac{1}{N} \sum_{n=1}^{N} \delta(x - x_n). \tag{3.15}$$
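Both criteria can be computed directly from the $N \times M$ posterior matrix. The sketch below is ours: equation 3.12 is exact, while for equation 3.13 we use a simple discretization that evaluates the posterior-weighted empirical weights and the model density only at the data points (the paper does not prescribe a particular discretization):

```python
import numpy as np

def merge_criterion(P, i, j):
    """Equation 3.12: inner product of the posterior vectors of components i, j."""
    return float(P[:, i] @ P[:, j])

def split_criterion(P, logp_k, k, eps=1e-12):
    """Discretized equation 3.13: divergence between the posterior-weighted
    empirical density f_k (equation 3.14) and the model density p_k, with both
    evaluated on the observed data points. logp_k[n] = log p_k(x_n; theta_k)."""
    w = P[:, k] / (P[:, k].sum() + eps)   # f_k mass placed on each x_n
    logp = np.asarray(logp_k, dtype=float)
    return float(np.sum(w * (np.log(w + eps) - logp)))
```

Ranking merges by `merge_criterion` and splits by `split_criterion` gives the candidate ordering used in section 3.3.3.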
Since the model with the largest $J_{\mathrm{split}}(k; \Theta^*)$ can be thought to have the worst estimate of the local density, we should try to split it. The split criterion defined by equation 3.13 can be viewed as a likelihood ratio test; that is, $f_k(x; \Theta^*) / p_k(x; \theta_k^*)$ can be interpreted as the likelihood ratio test statistic. In this sense, our split criterion is similar to the cluster validity test formula proposed by Wolfe (1970).

3.3.3 Sorting Candidates. Using $J_{\mathrm{merge}}$ and $J_{\mathrm{split}}$, we sort the split-and-merge candidates as follows. First, the merge candidates are sorted based on $J_{\mathrm{merge}}$. Then, for each sorted merge candidate $\{i, j\}_c$, the split candidates, excluding $\{i, j\}_c$, are sorted as $\{k\}_c$. By combining these results and renumbering them, we obtain $\{i, j, k\}_c$, $c = 1, \ldots, M(M-1)(M-2)/2$.

4 Application to Density Estimation by Mixture of Gaussians

4.1 Synthetic Data. We apply the proposed algorithm to a density estimation problem using a mixture of gaussians. In this case, $p_m(x; \theta_m)$ on the right-hand side of equation 3.1 becomes

$$p_m(x; \theta_m) = (2\pi)^{-d/2} \det(\Sigma_m)^{-1/2} \exp\left\{-\frac{1}{2}(x - \mu_m)^T \Sigma_m^{-1} (x - \mu_m)\right\}. \tag{4.1}$$
Clearly, $\theta_m$ corresponds to the mean vector $\mu_m$ and covariance matrix $\Sigma_m$. In the case of density estimation by a mixture of gaussians, as is well known, the global maxima of the likelihood correspond to singular solutions with zero covariances. Therefore, the SMEM algorithm may converge to these singular solutions. To prevent this, we used a Bayesian regularization method (Ormoneit & Tresp, 1996). That is, the update equation for the covariance matrix is given by

$$\Sigma_m^{(t+1)} = \frac{\sum_{x \in X} (x - \mu_m^{(t+1)})(x - \mu_m^{(t+1)})^T P(m | x; \Theta^{(t)}) + \lambda I_d}{\sum_{x \in X} P(m | x; \Theta^{(t)}) + 1}, \quad m = 1, \ldots, M. \tag{4.2}$$
Here $I_d$ is the $d$-dimensional unit matrix, and $\lambda$ is a regularization constant determined by some validation data. In the experiment we set $\lambda = 0.1$.

First, we used the two-dimensional synthetic data in Figure 2 to demonstrate visually the usefulness of the split-and-merge operations. The initial mean vectors and covariance matrices were, as shown in Figure 2b, set to near the mean of all the data and to unit matrices, respectively. The usual EM algorithm converged to the local-maximum solution shown in Figure 2c, whereas the SMEM algorithm converged to the superior solution shown in Figure 2f, very close to the true one. The split of the first gaussian shown in Figure 2d appeared to be redundant, but as shown in Figure 2e the components were successfully merged, and the original two gaussians were improved. This indicates that the split-and-merge operations not only appropriately assign the number of gaussians in a local data space, but can also improve the gaussian parameters themselves.

4.2 Real Data. Next, we tested the proposed algorithm using 20-dimensional real data (facial images processed into feature vectors) where the local maxima made the optimization difficult (see Ueda & Nakano, 1998, for the details). The data size was 103 for training and 103 for test. We ran three algorithms (EM, DAEM, and SMEM) for 10 different initializations using the k-means clustering algorithm. We set M = 5 and used a diagonal covariance matrix for each gaussian. Table 1 shows the summary statistics (mean, standard deviation (std), maximum, and minimum) of the log-likelihood values per sample obtained by the EM, DAEM, and SMEM algorithms over the 10 initializations. As shown in Table 1, even the worst solution found by the SMEM algorithm was better than the best solutions found by the other algorithms on both the training and test data. Moreover, the log-likelihoods achieved by the SMEM algorithm have lower variance than those achieved by the EM algorithm.
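The regularized covariance update of equation 4.2, used in the experiments above, is a one-liner per component; a sketch with illustrative names of our own:

```python
import numpy as np

def update_covariance(X, mu_m, post_m, lam=0.1):
    """Equation 4.2: posterior-weighted covariance with Bayesian regularization,
    lambda * I_d added to the scatter and 1 added to the effective count."""
    d = X.shape[1]
    diff = X - mu_m                               # (N, d) residuals
    scatter = (diff * post_m[:, None]).T @ diff   # sum_n P(m|x_n) diff diff^T
    return (scatter + lam * np.eye(d)) / (post_m.sum() + 1.0)
```

The added $\lambda I_d$ keeps the estimate positive definite even when a component claims very few points, which is what prevents collapse onto the singular zero-covariance solutions.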
Figure 2: Results by the EM and SMEM algorithms for a two-dimensional gaussian mixture density estimation problem. (a) Contours of the true gaussians, (b) initial density, (c) result by the EM algorithm, (d)-(e) examples of split-and-merge operations by the SMEM algorithm, and (f) the final result.

Table 1: Log-Likelihood / Sample Size.

                    Initial value      EM       DAEM      SMEM
  Training  mean       -159.1       -148.2    -147.9    -145.1
            std           1.77         0.24      0.04      0.08
            max        -157.3       -147.7    -147.8    -145.0
            min        -163.2       -148.6    -147.9    -145.2
  Test      mean       -168.2       -159.7    -159.8    -155.9
            std           2.80         1.00      0.37      0.09
            max        -165.5       -158.0    -159.6    -155.9
            min        -174.2       -160.8    -159.8    -156.0
Figure 3 shows the log-likelihood trajectories accepted in step 3 of the SMEM algorithm during the estimation process. The dotted lines in Figure 3 denote the starting points of step 2. Note that it is due to the initialization in step 3 that the log-likelihood decreases just after a split-and-merge. Comparing the convergence points of step 3, marked by the "◦" symbol in Figure 3, one can see that the successive split-and-merge operations improved the log-likelihood for both the training and test data,
Figure 3: Trajectories of log-likelihood. The upper (lower) result corresponds to the training (test) data.
as we expected. Table 2 compares the number of EM iterations executed by the three algorithms. Note that for the SMEM algorithm, the number includes not only the partial and full EM steps for accepted operations, but also the EM steps for rejected ones. From Table 2, the SMEM algorithm was about 8.7 times slower than the original EM algorithm. The average rank of the accepted split-and-merge candidates was 1.8 (std = 0.9), which indicates that the proposed split-and-merge criteria worked very well.

Table 2: The Number of EM Steps.

            EM     DAEM    SMEM
  mean      47      147     409
  std       16       39      84
  max       65      189     616
  min       37      103     265

5 Application to Dimensionality Reduction Using Mixture of Factor Analyzers

5.1 Factor Analyzers. A single factor analyzer (FA) (Anderson, 1984) assumes that an observed $p$-dimensional variable $x$ is generated as a linear transformation of some lower $q$-dimensional latent variable $z \sim \mathcal{N}(0, I)$
plus additive gaussian noise $v \sim \mathcal{N}(0, \Psi)$, where $\Psi$ is a diagonal matrix. That is, the generative model can be written as

$$x = Wz + v + \mu. \tag{5.1}$$

Here, $W \in \mathbb{R}^{p \times q}$ is a transformation matrix called the factor loading matrix, and $\mu$ is a mean vector. Then, from a simple calculation, the pdf of the observed data under an FA model is

$$p(x; \Theta) = \mathcal{N}(\mu, WW^T + \Psi). \tag{5.2}$$

Noting that $W$ can be represented by linearly independent column vectors as $W = [w_1, \ldots, w_q]$, where $w_i \in \mathbb{R}^{p \times 1}$, we can rewrite equation 5.1 as

$$x = \sum_{i=1}^{q} z_i w_i + v + \mu, \tag{5.3}$$
where $z_i$ is the $i$th component of $z$. Clearly, equation 5.3 shows that $w_1, \ldots, w_q$ form a basis of the latent space. Hence, FA can be interpreted as a dimensionality reduction model extracting a linear manifold (affine subspace) $\mathcal{M} = L^{(q)} + \mu$ underlying the given observed data space, where $L^{(q)}$ denotes the linear subspace of $\mathbb{R}^p$ spanned by the basis $w_1, \ldots, w_q$.

5.2 Mixture of Factor Analyzers. A mixture of factor analyzers (MFA), proposed by Ghahramani and Hinton (1997), is an extension of the single FA. That is, an MFA is defined as a combination of $M$ FAs and can be thought of as a reduced-dimensional mixture of gaussians. The MFA model extracts $q$-dimensional locally linear manifolds $\mathcal{M}_m = L_m^{(q)} + \mu_m$ for $m = 1, \ldots, M$ underlying the given high-dimensional data. More intuitively, the MFA model can perform clustering and dimensionality reduction simultaneously. Since the MFA model extracts a globally nonlinear manifold, it is more flexible than the single FA model. The pdf of the observed data under a mixture of $M$ FAs is given by

$$p(x; \Theta) = \sum_{m=1}^{M} \alpha_m\, \mathcal{N}(\mu_m, W_m W_m^T + \Psi_m). \tag{5.4}$$
(See Ghahramani & Hinton, 1997, for details.) Note that equation 5.4 is a natural extension of equation 5.2. The unknown parameters $\Theta = \{\alpha_m, \mu_m, W_m, \Psi_m \mid m = 1, \ldots, M\}$ of the MFA model can be estimated by the EM algorithm. In this case, the complete-data log-likelihood becomes

$$L_c = \log \prod_{n=1}^{N} \prod_{m=1}^{M} p(x_n, z_n, m; \Theta_m)^{u_m} = \sum_{n=1}^{N} \sum_{m=1}^{M} u_m \log p(x_n, z_n, m; \Theta_m). \tag{5.5}$$
Figure 4: Results by the EM and SMEM algorithms for a one-dimensional manifold extraction problem. (a) True manifold and generated data, (b) initial estimate, (c) result by the EM algorithm, (d)–(e) examples of split and merge operations, and (f) the nal result.
Here, $u_m$ is a mixture indicator variable: if $x_n$ is generated by the $m$th model, $u_m = 1$; otherwise, $u_m = 0$. Since $L_c$ can be represented in the form of a direct sum, the $Q$ function is also decomposable, and therefore the SMEM algorithm is straightforwardly applicable to the parameter estimation of the MFA model.

5.3 Demonstration. Figure 4 shows the results of extracting a one-dimensional manifold from three-dimensional data (a noisy shrinking spiral) using the EM and SMEM algorithms. The data points in Figure 4 were generated by

$$(X_1, X_2, X_3) = ((13 - 0.5t)\cos t,\ -(13 - 0.5t)\sin t,\ t) + \text{additive noise}, \tag{5.6}$$
where $t \in [0, 4\pi]$. In this case, each factor loading matrix $W_m$ becomes a three-dimensional column vector corresponding to each thick line in Figure 4. The center position and direction of each thick line are $\mu_m$ and $W_m$, respectively, and the length of each thick line is $2\|W_m\|$. Although the EM algorithm converged to a poor local maximum, as shown in Figure 4c, the SMEM algorithm successfully extracted the data manifold, as shown in Figure 4f. Table 3 compares the average log-likelihoods per data point over 10 different initializations. The log-likelihood values were drastically improved on both the training and test data by the SMEM algorithm.
Table 3: Log-Likelihood / Sample Size.

             EM              SMEM
  Training   -7.68 (0.151)   -7.26 (0.017)
  Test       -7.75 (0.171)   -7.33 (0.032)
5.4 Practical Applications.
5.4.1 Image Compression. An MFA model can be used for block transform image coding. In this method, as in the usual block transform coding approaches such as the Karhunen-Loève transformation or principal component analysis (PCA), an image is subdivided into nonoverlapping blocks of $b \times b$ pixels. Typically, $b = 8$ or $b = 16$ is employed. Each block is regarded as a $d$ ($= b \times b$)-dimensional vector $x$. Let $X$ be the set of obtained vectors $x$. Then, using $X$, an image is transformed by the following steps in the compression algorithm:

1. Set the desired dimensionality $q$ and the number of mixture components $M$.

2. Estimate $\mu_m$ and $W_m$ for $m = 1, \ldots, M$ by fitting an MFA model to $X$.

3. For each $x \in X$, compute

$$\hat{x}^{(m)} = W_m (W_m^T W_m)^{-1} W_m^T (x - \mu_m) + \mu_m, \quad \text{for } m = 1, \ldots, M. \tag{5.7}$$

Then assign the $\hat{x}^{(m^*)}$ that minimizes the squared error $\|\hat{x}^{(m)} - x\|^2$ as the reconstructed vector.

4. Transform each $\hat{x}^{(m^*)}$ into a block image (reconstructed block image).
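Step 3 of the compression algorithm projects each block onto every local subspace and keeps the best reconstruction; a sketch of equation 5.7 with illustrative names of our own:

```python
import numpy as np

def reconstruct_block(x, mus, Ws):
    """Equation 5.7: orthogonally project x - mu_m onto span(W_m), add mu_m back,
    and return the candidate with the smallest squared reconstruction error."""
    best, best_err = None, np.inf
    for mu, W in zip(mus, Ws):
        P = W @ np.linalg.solve(W.T @ W, W.T)    # projection matrix, equation 5.9
        xhat = P @ (x - mu) + mu                 # equation 5.8
        err = float(np.sum((xhat - x) ** 2))
        if err < best_err:
            best, best_err = xhat, err
    return best, best_err
```

Solving $W_m^T W_m$ instead of inverting it explicitly is the standard numerically safer route; otherwise the code follows equations 5.8 and 5.9 directly.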
The derivation of equation 5.7 is as follows. The least-squares reconstructed vector $\hat{x}$ is, as shown in Figure 5, obtained by orthogonally projecting $x - \mu_m$ onto $L_m^{(q)}$ and adding $\mu_m$. That is,

$$\hat{x} = P_m (x - \mu_m) + \mu_m, \tag{5.8}$$

where $P_m$ is the projection matrix of $L_m^{(q)}$ and is computed from

$$P_m = W_m (W_m^T W_m)^{-1} W_m^T. \tag{5.9}$$
Substituting equation 5.9 into equation 5.8, we obtain equation 5.7. Figure 6 shows an example of an image compressed by MFA with $q = 4$ and $M = 10$. Figure 6a shows the sample image used for the training data $X$. We
Figure 5: Illustration of image reconstruction.
should use test images independent of the training image, but for simplicity we used the same image for training and test. For comparison, we also tried the usual PCA-based image compression method (see Figure 6b). In the case of image compression by PCA, $W$ is an orthogonal matrix composed of the $q$ column vectors (eigenvectors) corresponding to the $q$ largest eigenvalues of the sample autocorrelation matrix:

$$S = \frac{1}{|X|} \sum_{x \in X} x x^T. \tag{5.10}$$

Here, $|X|$ denotes the number of elements of $X$. Since $W^T W = I_q$ (i.e., the set of column vectors $\{w_1, \ldots, w_q\}$ is orthonormal), the reconstructed vector by PCA is given by

$$\hat{x} = W W^T x. \tag{5.11}$$
The major difference between the PCA and MFA approaches is that the former finds a single global subspace from $X$, while the latter simultaneously clusters the data and finds an optimum subspace for each cluster. Therefore, it is natural that the results of the MFA approach (Figures 6c and 6d) are much richer than that of the PCA approach (Figure 6b). Note that a mixture of PCAs (MPCA) (Tipping & Bishop, 1997) is a much better model than PCA, but we have not compared it here because our purpose was to compare the SMEM algorithm with the EM algorithm for MFA-based image compression.
Figure 6: An example of image reconstruction. (a) Original image, (b) Result by PCA, (c) Result by MFA model trained by the usual EM algorithm, and (d) Result by MFA model trained by the SMEM algorithm.
Comparing Figure 6c to Figure 6d, one can see that the quality of the reconstructed image obtained with the SMEM algorithm is better than that obtained with the EM algorithm. The mean squared errors per block of these reconstructed images were $15.8 \times 10^3$, $10.1 \times 10^3$, and $7.3 \times 10^3$ for Figures 6b, 6c, and 6d, respectively.

5.4.2 Application to Pattern Recognition. The MFA model is also applicable to pattern recognition tasks (Hinton et al., 1997), since once an MFA model is fitted to each class, we can compute the posterior probability of each class for each data point. More specifically, the pdf of class $\omega_i$ under the MFA model is
given by

$$p_i(x; \Theta_i) = \sum_{m=1}^{M} P_{im}\, \mathcal{N}(\mu_{im}, W_{im} W_{im}^T + \Psi_{im}), \tag{5.12}$$

where $\Theta_i$ is the set of MFA model parameters for class $\omega_i$; that is, $\Theta_i = \{P_{im}, \mu_{im}, W_{im}, \Psi_{im} \mid m = 1, \ldots, M\}$, and $P_{im}$ is the mixing proportion of the $m$th model for class $\omega_i$. Using equation 5.12, the posterior probability of class $\omega_i$ given $x$ is computed from

$$P(\omega_i | x) = \frac{P_i\, p_i(x; \Theta_i)}{\sum_{j=1}^{C} P_j\, p_j(x; \Theta_j)}. \tag{5.13}$$

Here, $C$ is the number of classes, and $P_i$, the prior of class $\omega_i$, is estimated from

$$P_i = \frac{N_i}{N}, \tag{5.14}$$

where $N_i$ is the number of training samples of class $\omega_i$ and $N$ is the total sample size over all classes. Then, the optimum class $i^*$ for $x$ based on the Bayes decision rule is written as

$$i^* = \arg\max_i P(\omega_i | x) = \arg\max_i N_i \sum_{m=1}^{M} P_{im}\, \mathcal{N}(\mu_{im}, W_{im} W_{im}^T + \Psi_{im}). \tag{5.15}$$
We compared the MFA model with another classification method based on clustering and dimensionality reduction, the multiple subspace (MSS) method proposed by Sugiyama and Ariki (1998). To define the MSS method, we first describe the simpler subspace (SS) method for classification, known as CLAFIC (Oja, 1983). In the CLAFIC method, a linear subspace is extracted from the training data of each class, and then the distance between an input vector $x$ to be classified and its projection onto the linear subspace is computed for each class. The input vector is then classified as the class $\omega_{i^*}$ with the minimum distance (see Oja, 1983, for details). In the SS method, a single subspace is extracted for each class. In the MSS method, on the other hand, the data are first divided into several clusters using the k-means algorithm, and then the CLAFIC method is applied to each cluster. The MSS method can be expected to improve the classification performance of the CLAFIC method, but because it lacks a probability density model, it does not perform Bayes classification.
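The decision rule of equation 5.15 reduces to comparing class-weighted MFA likelihoods; a minimal sketch of our own (the dense gaussian evaluation and all names are illustrative):

```python
import numpy as np

def classify(x, class_params, class_counts):
    """Equation 5.15: argmax_i N_i * sum_m P_im N(x; mu_im, W_im W_im^T + Psi_im).

    class_params[i] is a list of (P_im, mu_im, W_im, Psi_im) tuples for class i;
    class_counts[i] is the training sample count N_i (the unnormalized prior)."""
    def gauss(x, mu, cov):
        d = len(x)
        diff = x - mu
        return (np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))
                / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))

    scores = []
    for params, n_i in zip(class_params, class_counts):
        lik = sum(p_im * gauss(x, mu, W @ W.T + Psi)
                  for p_im, mu, W, Psi in params)
        scores.append(n_i * lik)                 # N_i absorbs the prior P_i = N_i / N
    return int(np.argmax(scores))
```

Since the denominator of equation 5.13 is common to all classes, it is dropped, which is why only $N_i$ times the class likelihood needs to be compared.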
Figure 7: Recognition rates for handwritten digit data. (a) Results by MSS methods (M = 1 corresponds to the SS method). (b) Results by an MFA-based method trained by the EM algorithm. (c) Results by an MFA-based method trained by the SMEM algorithm.
We tried a digit recognition task (10 digits (classes)) using three methods: the MSS method, the MFA-based method with the EM algorithm, and the MFA-based method with the SMEM algorithm. The data were created using (degenerate) Glucksman's features (16-dimensional data) by NTT labs (Ishii, 1989). The data size was 200 per class for training and 200 per class for testing. The recognition accuracy values for the training and test data obtained by these methods are given in Figure 7, where q denotes the dimensionality of the latent space and M is the number of components in the mixture. Note that the SS (CLAFIC) method corresponds to M = 1 in the MSS method shown in Figure 7a. Clearly, the MFA-based method with the SMEM algorithm consistently outperformed both the MSS method and the MFA-based method with the EM algorithm. The recognition accuracy of the 3-nearest-neighbor (3NN) classifier was 88.3%. It is interesting that the MFA approach with the SMEM algorithm could outperform the nearest-neighbor approach when q = 3 and M = 5 (91.9%). This suggests that the intrinsic dimensionality of the data might be about three. 6 Conclusion
We have shown how simultaneous split-and-merge operations can be used to move components of a mixture model from regions in a space in which there are too many components to regions in which there are too few. Such moves cannot be accomplished by methods that continuously move components through intermediate locations because the likelihoods are lower at these locations. A simultaneous split-and-merge can be viewed as a way of tunneling through low-likelihood barriers, thereby eliminating many nonglobal optima.
Note that the SMEM algorithm is applicable to a wide variety of mixture models, as long as decomposition 7 holds. To make the split-and-merge method more efficient, we have introduced criteria for deciding which splits and merges to consider and have shown that these criteria work well for low-dimensional synthetic data sets and for higher-dimensional real data sets. Our SMEM algorithm consistently outperforms the standard EM algorithm and therefore can be very useful in practice. In the SMEM algorithm, the split-and-merge operations are used to improve the parameter estimates within the maximum likelihood framework. However, by introducing probability measures over models, we could also use the split-and-merge operations to determine the appropriate number of components within the Bayesian framework. This extension is now in progress.
References

Anderson, T. W. (1984). An introduction to multivariate statistical analysis (2nd ed.). New York: Wiley.

Ball, G. H., & Hall, D. J. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153–155.

Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38.

Ghahramani, Z., & Hinton, G. E. (1997). The EM algorithm for mixtures of factor analyzers (Tech. Rep. No. CRG-TR-96-1). Toronto: University of Toronto. Available online at: http://www.gatsby.ucl.ac.uk/~zoubin/papers/tr-96-1.ps.gz.

Hinton, G. E., Dayan, P., & Revow, M. (1997). Modeling the manifolds of images of handwritten digits. IEEE Trans. PAMI, 8(1), 65–74.

Ishii, K. (1989). Design of a recognition dictionary using artificially distorted characters. Systems and Computers in Japan, 21(9), 669–677.

McLachlan, G., & Basford, K. (1987). Mixture models: Inference and applications to clustering. New York: Marcel Dekker.

Oja, E. (1983). Subspace methods of pattern recognition. Letchworth: Research Studies Press Ltd.

Ormoneit, D., & Tresp, V. (1996). Improved gaussian mixture density estimates using Bayesian penalty terms and network averaging. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Neural information processing systems, 8 (pp. 542–548). Cambridge, MA: MIT Press.

Richardson, S., & Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59(4), 731–792.

Sugiyama, Y., & Ariki, Y. (1998). Automatic classification of TV sports news videos by multiple subspace method. Trans. Institute of Electronics, Information and Communication Engineers, 9, 2112–2119 (in Japanese).

Tipping, M. E., & Bishop, C. M. (1997). Mixtures of probabilistic principal component analysers (Tech. Rep. No. NCRG-97-3). Birmingham, UK: Aston University.
Ueda, N., & Nakano, R. (1994). A new competitive learning approach based on an equidistortion principle for designing optimal vector quantizers. Neural Networks, 7(8), 1211–1227.

Ueda, N., & Nakano, R. (1998). Deterministic annealing EM algorithm. Neural Networks, 11(2), 271–282.

Ueda, N., Nakano, R., Ghahramani, Z., & Hinton, G. E. (1999). SMEM algorithm for mixture models. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Neural information processing systems, 11. Cambridge, MA: MIT Press.

Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329–350.

Received March 22, 1999; accepted September 20, 1999.
LETTER
Communicated by C. Lee Giles
Stable Encoding of Finite-State Machines in Discrete-Time Recurrent Neural Nets with Sigmoid Units

Rafael C. Carrasco, Mikel L. Forcada, M. Ángeles Valdés-Muñoz
Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant, Spain

Ramón P. Ñeco
Departament de Ciències Experimentals i Tecnologia, Universitat Miguel Hernàndez, E-03202 Elx, Spain

There has been a lot of interest in the use of discrete-time recurrent neural nets (DTRNN) to learn finite-state tasks, with interesting results regarding the induction of simple finite-state machines from input-output strings. Parallel work has studied the computational power of DTRNN in connection with finite-state computation. This article describes a simple strategy to devise stable encodings of finite-state machines in computationally capable discrete-time recurrent neural architectures with sigmoid units and gives a detailed presentation of how this strategy may be applied to encode a general class of finite-state machines in a variety of commonly used first- and second-order recurrent neural networks. Unlike previous work that either imposed some restrictions on state values or used a detailed analysis based on fixed-point attractors, our approach applies to any positive, bounded, strictly growing, continuous activation function and uses simple bounding criteria based on a study of the conditions under which a proposed encoding scheme guarantees that the DTRNN is actually behaving as a finite-state machine.

1 Introduction
The relationship between discrete-time recurrent neural networks (DTRNN) and finite-state machines (FSM) has been explored in a number of different ways by many researchers in the past 10 years, although this relationship has earlier roots (McCulloch & Pitts, 1943; Kleene, 1956; Minsky, 1967). All of these early papers refer to neural networks made up of threshold units (with steplike activation functions). More recently, Alon, Dewdney, and Ott (1991), Indyk (1995), and Horne and Hush (1996) have studied in more detail bounds on the number of threshold units necessary to implement an FSM of a given number of states. Also, in a recent work, Kremer (1995) has

Neural Computation 12, 2129–2174 (2000) © 2000 Massachusetts Institute of Technology
2130

R. C. Carrasco, M. L. Forcada, M. Á. Valdés-Muñoz, & R. P. Ñeco
shown that Elman’s (1990) simple recurrent nets (a class of DTRNN) using threshold units can actually represent any deterministic nite automaton (a class of FSM). The interest in the relationship between FSM and DTRNN is partly motivated by the fact that one can view DTRNN as state machines: a new state for the hidden units is computed from the previous state, and the currently available input in each cycle, and possibly an output, is computed in each cycle too. One can say that DTRNN using activation functions other than thresholds are state machines that are not restricted to be nite. The architectural similarities between FSM and DTRNN will be discussed in detail in the next section of this article. Under this intuitive assumption that being nonnite state machines themselves, DTRNN can emulate FSMs, a number of researchers set out to test whether DTRNN with real-valued, continuous sigmoid activation functions could learn FSM behavior from samples (Cleeremans, Servan-Schreiber, & McClelland, 1989; Pollack, 1991; Giles et al., 1992; Watrous & Kuhn, 1992; Maskara & Noetzel, 1992; Sanfeliu & Alqu´ezar, 1994; Manolios & Fanelli, ˜ 1994; Forcada & Carrasco, 1995; Tino L & Sajda, 1995; Neco & Forcada, 1996; Gori, Maggini, Martinelli, & Soda, 1998); and even compared the performance of different architectures (Miller & Giles, 1993; Horne & Giles, 1995). The use of sigmoid functions is motivated by the need to have error functions that are differentiable with respect to the weights so that gradient-descent algorithms may be used for learning. The results of these works show that DTRNN indeed may learn nitestate-like behavior from samples of that behavior, but some problems persist. 
First, after learning, finite-state-like behavior is observed for short input strings (with lengths in the range of those used for training); however, for longer input strings, the clusters observed for the values of the hidden state vector (usually interpreted as the states of the FSM learned) start to blur and finally merge, leading to incorrect state representations and, accordingly, incorrect output. This behavior is often referred to as instability and hampers the ability of trained DTRNN to generalize the learned behavior (see Tiňo, Horne, Giles, & Collingwood, 1998, for a theoretical analysis of generalization loss). As a partial solution, some authors force learning to occur in such a way that state values form clusters (Das & Das, 1991; Zeng, Goodman, & Smythe, 1993, 1994; Das & Mozer, 1998). Second, when the FSM to be learned has long-term dependencies, that is, outputs that depend on inputs that were presented very early during the processing of a string, gradient-descent algorithms suffer from the problem of vanishing gradients, which makes it difficult to relate late contributions to the error to small changes in the state of neurons in early stages of string processing (Bengio, Simard, & Frasconi, 1994). Once the DTRNN has been trained, researchers use specialized algorithms to extract FSM from the dynamics of the DTRNN; some use a straightforward equipartition of neural state-space followed by a branch-and-bound
Stable Encoding of Finite-State Machines
2131
algorithm (Giles et al., 1992) or a clustering algorithm (Cleeremans et al., 1989; Manolios & Fanelli, 1994; Gori et al., 1998). Very often the finite-state automaton extracted behaves correctly and then does so, obviously, for strings of any length. However, automaton extraction algorithms have been criticized (Kolen & Pollack, 1995; Kolen, 1994) in the sense that the extracted FSM may not reflect the computation actually being performed by the DTRNN. More recently, Casey (1996) has shown that DTRNN can indeed "organize their state space to mimic the states in the . . . state machine that can perform the computation" and be trained or programmed to behave as FSM. Arai and Nakano (1996) stated that when, in a DTRNN, the gain of the sigmoid function reaches a certain finite value, the DTRNN behaves as a stable FSM¹ and used this result to formulate a training method (Arai & Nakano, 1996, 1997); however, they fail to provide a rigorous proof and resort to an intuitive explanation. Also recently, Blair and Pollack (1997) showed that an increasing-precision dynamical analysis may identify, in the limit, those DTRNN that have actually learned to behave like FSM. Finally, some researchers have set out to define ways to program a sigmoid-based DTRNN so that it behaves like a given FSM, that is, sets of rules for choosing the weights and initial states of the DTRNN based on the transition function and the output function of the corresponding FSM. Omlin and Giles (1996a, 1996b) have proposed an algorithm for encoding deterministic finite-state automata (a class of FSM) in second-order recurrent neural networks such as the ones used by Giles et al. (1992). The encoding is based on a study of the fixed points of the logistic sigmoid function. There are many similarities between their work and the one presented here: the use of worst-case studies; the definition of low, high, and forbidden state values; a similar scheme for choosing weights; and so on.
One of the main differences is the level on which the conditions for stable encoding are expressed. We do not explicitly require our states to be structured around fixed points of the activation function. Finally, the weight schemes, although very similar, are not completely identical. Alquézar and Sanfeliu (1995) have shown that deterministic finite-state automata may be encoded in Elman (1990) nets provided that sigmoids taking and returning rational numbers are used (general real-valued sigmoids are not allowed).² Their construction will be generalized to real-valued sigmoids and a larger class of FSM in this article. Their proof relies on converting the deterministic finite-state automaton (DFA) into a new DFA in which all states are split in as many states as there are symbols in the input alphabet of the DFA (as Minsky, 1967, did).
¹ An infinite value of the gain would correspond to the use of a step function; the result in this case is well known (Minsky, 1967). ² Kremer's (1995) work used Elman nets with threshold functions, not sigmoids.
Kremer (1996) has recently shown that a single-layer first-order recurrent neural network with real sigmoid squashing functions returning values in [0, 1] can represent the state transition function of any finite-state automaton, provided that the states are split as in the previous work. In contrast to Alquézar and Sanfeliu's (1995) and Omlin and Giles's (1996a, 1996b) work and to the approach used in this article, Kremer's construction uses different weights for each state neuron. Frasconi, Gori, and Maggini (1996) have shown how finite-state automata may be encoded in recurrent networks using radial basis functions instead of sigmoids. Recently, Šíma (1997) has shown that the behavior of any DTRNN using threshold activation functions may be stably emulated by another DTRNN using activation functions in a very general class that includes the sigmoid functions considered in this article. Šíma provides a constructive proof that contains a prescription to compute the new values of weights. In a more recent paper, Šíma and Wiedermann (1998) show that any regular expression of length l may be recognized by a DTRNN having Θ(l) threshold units.³ Combining both results, one obtains a prescription to encode any regular expression into a sigmoid DTRNN so that it recognizes the corresponding language. As will be discussed in more detail in section 8, Šíma's (1997) result may also be combined with existing results on the simulation of FSM (Alon et al., 1991; Indyk, 1995; Horne & Hush, 1996) in threshold DTRNN to extend them to analog DTRNN. This article gives a detailed description of how to directly encode FSMs (including regular language recognizers) into a variety of commonly used first- and second-order sigmoid recurrent neural networks so that the weights for stable simulation are relatively small.
Small weights are of interest if the method is used to inject partial a priori knowledge into the DTRNN before training it through gradient descent, where activation function saturation would be a serious problem. This article aims at expanding the current results on stable encoding of FSM in DTRNN to a larger family of sigmoids, a larger variety of DTRNN, and a wider class of FSM architectures by establishing a simplified procedure to prove the stability of a devised encoding scheme. Some of the results presented have already been used by Kremer, Ñeco, and Forcada (1998) to constrain the training of DTRNN so that they assume an FSM-like behavior. Section 2 defines the classes of FSM and DTRNN that will be studied and establishes architectural parallelisms between them; section 3 describes the conditions under which a DTRNN behaves as an FSM and defines a general encoding scheme based on these conditions together with a set of sufficient conditions for the validity of the encoding; sections 4, 5, and 6 describe encoding schemes for Mealy FSM, Moore FSM, and DFA; section 7 shows experimental results that support the proposed encodings and illustrate the sufficiency
³ We thank an anonymous referee for calling our attention to this work.
of the conditions defining each particular encoding; finally, concluding remarks and a table summarizing the encoding schemes (Table 6) may be found in section 8.

2 Definitions

2.1 Mealy and Moore Machines and Deterministic Finite Automata.
Mealy machines (Hopcroft & Ullman, 1979) are finite-state machines that act as transducers or translators, taking a string on an input alphabet and producing a string of equal length on an output alphabet. Formally, a Mealy machine is a six-tuple

$$M = (Q, \Sigma, \Gamma, \delta, \lambda, q_I) \qquad (2.1)$$

where

$Q = \{q_1, q_2, \ldots, q_{|Q|}\}$ is a finite set of states.

$\Sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_{|\Sigma|}\}$ is a finite input alphabet.

$\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$ is a finite output alphabet.

$\delta: Q \times \Sigma \to Q$ is the next-state function, such that a machine in state $q_j$, after reading symbol $\sigma_k$, moves to state $\delta(q_j, \sigma_k) \in Q$.

$\lambda: Q \times \Sigma \to \Gamma$ is the output function, such that a machine in state $q_j$, after reading symbol $\sigma_k$, writes symbol $\lambda(q_j, \sigma_k) \in \Gamma$.

$q_I \in Q$ is the initial state in which the machine is found before the first symbol of the input string is processed.

A Moore machine (Hopcroft & Ullman, 1979) may be defined by a similar six-tuple, with the only difference that symbols are output after the transition to a new state is completed instead of during the transition itself, and the output symbol depends only on the state just reached, that is, $\lambda: Q \to \Gamma$. The classes of translations that may be performed by Mealy machines and Moore machines are identical. Indeed, given a Mealy machine, it is straightforward to construct the equivalent Moore machine, and vice versa (Hopcroft & Ullman, 1979).

Deterministic finite automata (DFA) (Hopcroft & Ullman, 1979) may be seen as a special case of Moore machines. A DFA is a five-tuple

$$M = (Q, \Sigma, \delta, q_I, F) \qquad (2.2)$$

where $Q$, $\Sigma$, $\delta$, and $q_I$ have the same meaning as in Mealy and Moore machines and $F \subseteq Q$ is the set of accepting states. If the state reached by the DFA after reading a complete string in $\Sigma^*$ (the set of all finite-length strings over $\Sigma$, including the empty string $\lambda$) is in $F$, then the string is accepted; if not, the string is not accepted (or it is rejected). This would be equivalent
to having a Moore machine whose output alphabet has only two symbols, Y (yes) and N (no), and looking only at the last output symbol (not at the whole output string) to decide whether the string is accepted or rejected.

2.2 Discrete-Time Recurrent Neural Networks. Discrete-time recurrent neural networks (DTRNN) (see Haykin, 1998; Hertz, Krogh, & Palmer, 1991; Hush & Horne, 1993; Tsoi & Back, 1997) may be viewed as neural state machines (NSM) and defined in a way that is parallel to the above definitions of Mealy and Moore machines.⁴ A neural state machine $N$ is a six-tuple
$$N = (X, U, Y, f, h, \mathbf{x}_0) \qquad (2.3)$$
in which:

$X = [S_0, S_1]^{n_X}$ is the state-space of the NSM, with $S_0$ and $S_1$ the values defining the range of outputs of the activation functions and $n_X$ the number of state units.

$U = \mathbb{R}^{n_U}$ defines the set of possible input vectors, with $\mathbb{R}$ the set of real numbers and $n_U$ the number of input lines.

$Y = [S_0, S_1]^{n_Y}$ is the set of outputs of the NSM, with $n_Y$ the number of output units (it is assumed that the activation function of output and state units is the same).

$f: X \times U \to X$ is the next-state function, a feedforward neural network that computes a new state $\mathbf{x}[t]$ from the previous state $\mathbf{x}[t-1]$ and the input just read $\mathbf{u}[t]$.

$h$ is the output function, which in the case of a Mealy NSM is $h: X \times U \to Y$, that is, a feedforward neural network that computes a new output $\mathbf{y}[t]$ from the previous state $\mathbf{x}[t-1]$ and the input just read $\mathbf{u}[t]$, and in the case of a Moore NSM is $h: X \to Y$, a feedforward neural network that computes a new output $\mathbf{y}[t]$ from the newly reached state $\mathbf{x}[t]$.

$\mathbf{x}_0$ is the initial state of the NSM, that is, the value that will be used for $\mathbf{x}[0]$.

Most classical DTRNN architectures may be directly defined using the NSM scheme; the following section shows some examples (in all of them, weights and biases are assumed to be real numbers). In all of the following, it will be assumed that $f$ and $h$ are defined for all possible values of their arguments. This is consistent with their implementation as feedforward neural networks.

⁴ This parallelism is inspired by the relationship established by Pollack (1991) between DFA and a class of second-order DTRNN, under the name of dynamical recognizers.
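As a concrete illustration of the finite-state machine definitions in section 2.1, a Mealy machine can be sketched as plain transition tables. The two-state parity transducer below is a toy example chosen for this note, not a machine from the article.

```python
# Minimal sketch of a Mealy machine M = (Q, Sigma, Gamma, delta, lambda, qI):
# the example outputs, after each input bit, the parity ('E' or 'O') of the
# number of 1s read so far.

delta = {  # next-state function delta: Q x Sigma -> Q
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
lam = {  # output function lambda: Q x Sigma -> Gamma
    ("even", "0"): "E", ("even", "1"): "O",
    ("odd", "0"): "O",  ("odd", "1"): "E",
}

def run_mealy(string, q_initial="even"):
    """Translate an input string into an output string of equal length."""
    q, out = q_initial, []
    for symbol in string:
        out.append(lam[(q, symbol)])  # output is written during the transition
        q = delta[(q, symbol)]
    return "".join(out)

print(run_mealy("1101"))  # prints "OEEO"
```

A Moore machine would differ only in that `lam` would be keyed by the state just reached rather than by the (state, symbol) pair.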
2.2.1 Neural Mealy Machines. A commonly used second-order recurrent neural network (Giles et al., 1992; Watrous & Kuhn, 1992; Pollack, 1991; Forcada & Carrasco, 1995; Zeng et al., 1993) may be formulated as a Mealy NSM described by a next-state function whose $i$th coordinate ($i = 1, \ldots, n_X$) is

$$f_i(\mathbf{x}[t-1], \mathbf{u}[t]) = g\left( \sum_{j=1}^{n_X} \sum_{k=1}^{n_U} W_{ijk}^{xxu} x_j[t-1]\, u_k[t] + W_i^x \right), \qquad (2.4)$$

where $g: \mathbb{R} \to [S_0, S_1]$ is a sigmoid function, and an output function whose $i$th coordinate ($i = 1, \ldots, n_Y$) is

$$h_i(\mathbf{x}[t-1], \mathbf{u}[t]) = g\left( \sum_{j=1}^{n_X} \sum_{k=1}^{n_U} W_{ijk}^{yxu} x_j[t-1]\, u_k[t] + W_i^y \right). \qquad (2.5)$$
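A direct transcription of the next-state function of equation 2.4 is sketched below with the logistic sigmoid. The tiny random weights and the network sizes are placeholders chosen for the illustration; they do not implement any encoding from the article.

```python
import math
import random

def logistic(x):
    """The logistic sigmoid of equation 2.11, with S0 = 0 and S1 = 1."""
    return 1.0 / (1.0 + math.exp(-x))

def second_order_step(x_prev, u, W_xxu, b_x):
    """One next-state step of eq. 2.4: x_i[t] = g(sum_jk W^xxu_ijk x_j[t-1] u_k[t] + W^x_i)."""
    n_x, n_u = len(x_prev), len(u)
    return [
        logistic(
            sum(W_xxu[i][j][k] * x_prev[j] * u[k]
                for j in range(n_x) for k in range(n_u))
            + b_x[i]
        )
        for i in range(n_x)
    ]

# Toy sizes: n_X = 2 state units, n_U = 2 input lines (one-hot input symbols).
random.seed(0)
n_x, n_u = 2, 2
W = [[[random.uniform(-1, 1) for _ in range(n_u)] for _ in range(n_x)]
     for _ in range(n_x)]
bias = [0.0] * n_x

x = [0.0, 1.0]                       # initial state x_0
for u in ([1.0, 0.0], [0.0, 1.0]):   # present two one-hot input vectors
    x = second_order_step(x, u, W, bias)
    print(x)                         # every coordinate stays inside ]0, 1[
```

The output function of equation 2.5 has exactly the same shape with weights $W^{yxu}$ and biases $W^y$, so the same routine can be reused for it.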
Throughout the article, a homogeneous notation will be used for weights. Superscripts indicate the computation in which the weight is involved: the $xxu$ in $W_{ijk}^{xxu}$ indicates that the weight is used to compute a state ($x$) from a state and an input ($xu$); the $y$ in $W_i^y$ (a bias) indicates that it is used to compute an output. Subscripts designate, as usual, the particular units involved and run parallel to superscripts.

Another Mealy NSM is Robinson and Fallside's (1991) recurrent error propagation network, a first-order DTRNN that has a next-state function whose $i$th coordinate is given by

$$f_i(\mathbf{x}[t-1], \mathbf{u}[t]) = g\left( \sum_{j=1}^{n_X} W_{ij}^{xx} x_j[t-1] + \sum_{j=1}^{n_U} W_{ij}^{xu} u_j[t] + W_i^x \right), \qquad (2.6)$$
and an output function $h(\mathbf{x}[t-1], \mathbf{u}[t])$ whose $i$th component ($i = 1, \ldots, n_Y$) is given by

$$h_i(\mathbf{x}[t-1], \mathbf{u}[t]) = g\left( \sum_{j=1}^{n_X} W_{ij}^{yx} x_j[t-1] + \sum_{j=1}^{n_U} W_{ij}^{yu} u_j[t] + W_i^y \right). \qquad (2.7)$$
It is easy to show (using a suitable odd-parity counterexample in the way described by Goudreau, Giles, Chakradhar, & Chen, 1994) that Robinson and Fallside's (1991) networks cannot represent the output function of all Mealy machines unless a two-layer scheme like the following is used:

$$h_i(\mathbf{x}[t-1], \mathbf{u}[t]) = g\left( \sum_{j=1}^{n_Z} W_{ij}^{yz} z_j[t] + W_i^y \right) \qquad (2.8)$$
with

$$z_i[t] = g\left( \sum_{j=1}^{n_X} W_{ij}^{zx} x_j[t-1] + \sum_{j=1}^{n_U} W_{ij}^{zu} u_j[t] + W_i^z \right) \qquad (2.9)$$
($i = 1, \ldots, n_Z$) and $n_Z$ the number of units in the layer before the actual output layer (the hidden layer of the output function). The new architecture will be named an augmented Robinson-Fallside network in this article.

2.2.2 Neural Moore Machines. Elman's (1990) simple recurrent net, a widely used Moore NSM, is described by a next-state function identical to the next-state function of Robinson and Fallside's (1991) net, equation 2.6, and an output function $h(\mathbf{x}[t])$ whose $i$th component ($i = 1, \ldots, n_Y$) is given by

$$h_i(\mathbf{x}[t]) = g\left( \sum_{j=1}^{n_X} W_{ij}^{yx} x_j[t] + W_i^y \right). \qquad (2.10)$$
The second-order counterpart of Elman's (1990) simple recurrent net has been used by Carrasco, Forcada, and Santamaría (1996) and Blair and Pollack (1997). In that case, the $i$th coordinate of the next-state function is identical to equation 2.4, and the output function is identical to equation 2.10.

2.2.3 Sigmoid Functions. When DTRNN are used to learn FSM behavior from samples (Cleeremans et al., 1989; Pollack, 1991; Giles et al., 1992; Watrous & Kuhn, 1992; Maskara & Noetzel, 1992; Sanfeliu & Alquézar, 1994; Manolios & Fanelli, 1994; Forcada & Carrasco, 1995; Ñeco & Forcada, 1996; Gori et al., 1998), a common choice is to use continuous, real-valued activation functions for neurons in the DTRNN; this allows the construction of gradient-descent-based learning algorithms. Most researchers have used a bounded sigmoid function called the logistic function:

$$g_L(x) = \frac{1}{1 + \exp(-x)}. \qquad (2.11)$$
This function is strictly growing, defined for any real value, takes positive values, is continuous, and is bounded by 0 and 1; as a consequence, it has an inverse. We will consider a slightly more general class of sigmoid functions in which any element $g$ has the same four properties as the logistic:

Strictly growing: $g'(x) > 0 \;\; \forall x \in \mathbb{R}$.

Continuous: $\lim_{x \to c} g(x) = g(c) \;\; \forall c \in \mathbb{R}$.

Positive: $g(x) \geq 0 \;\; \forall x \in \mathbb{R}$.
Bounded (between $S_0$ and $S_1$, with $S_1 > S_0$):

$$g: \mathbb{R} \to [S_0, S_1]. \qquad (2.12)$$
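These four properties can be illustrated numerically for the logistic function. This is only a sampled sanity check on a finite grid (sampling cannot prove the properties), included to make the definitions concrete.

```python
import math

def g_logistic(x):
    """Logistic sigmoid g_L of equation 2.11: S0 = 0, S1 = 1."""
    return 1.0 / (1.0 + math.exp(-x))

xs = [i / 10.0 for i in range(-100, 101)]  # sample points in [-10, 10]
ys = [g_logistic(x) for x in xs]

# Strictly growing: values increase at every sampled pair.
assert all(a < b for a, b in zip(ys, ys[1:]))
# Positive and bounded by S0 = 0 and S1 = 1 (strictly, for finite arguments).
assert all(0.0 < y < 1.0 for y in ys)
# The inverse exists: the logit log(y / (1 - y)) recovers x.
x = 0.3
assert abs(math.log(g_logistic(x) / (1 - g_logistic(x))) - x) < 1e-9

print("logistic satisfies the sampled properties")
```

Any other function in the class defined above (for example, a shifted and scaled hyperbolic tangent) would pass the same checks with its own $S_0$ and $S_1$.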
2.2.4 High and Low Signals. Throughout this article, the output range of all units, $[S_0, S_1]$, will be partitioned in three subintervals (as in Omlin & Giles, 1996a): high, $[\epsilon_1, S_1]$; low, $[S_0, \epsilon_0]$; and forbidden, $]\epsilon_0, \epsilon_1[$, with $S_0 < \epsilon_0 < \epsilon_1 < S_1$. The values of $\epsilon_0$ and $\epsilon_1$ will be determined later. Note that the forbidden interval may in principle be arbitrarily small; the only thing required by the above definition is that high and low values form disjoint sets and therefore cannot be confused. Defining valid high and low signals and forbidden intervals is customary when designing devices with digital (discrete) behavior from analog equipment such as semiconductor-based integrated circuits, so that assemblies of these circuits still show the correct digital behavior (see, e.g., Floyd, 1996). Indeed, the goal of this article is similar: we want to ensure finite-state behavior in networks built from analog activation functions (real sigmoids).

3 Encoding

3.1 Conditions for a DTRNN to Behave Like an FSM. In a recent paper, Casey (1996) (see also Casey, 1998) has shown that a DTRNN performing a robust DFA-like computation (a special case of FSM-like computation) must organize its state-space in mutually disjoint, closed sets with nonempty interiors corresponding to the states of the DFA. These states can be taken to be all of the points such that if the DTRNN is initialized with any of them, the DTRNN will produce the same output as the DFA initialized in the corresponding state. Casey (1996) uses robust emulation to mean correct emulation even under the injection of bounded additive noise into states; a similar result is described by Maass and Orponen (1998), who also prove, as does Šíma (1997), that any discrete neural net may be simulated by an analog net even in the presence of bounded noise.
When noise is not bounded (as, for example, zero-mean gaussian noise), Maass and Sontag (1999) have shown that DTRNN are not even capable of recognizing all regular languages, but only a subset of them. Our article deals with the noiseless emulation of FSM by DTRNN, although the results presented here could easily be generalized to the case of bounded noise. Casey's (1996) formulation, which is closely connected to that of Pollack's (1991) dynamical recognizer, has inspired the following definition. A DTRNN $N = (X, U, Y, f, h, \mathbf{x}_0)$ behaves like an FSM $M = (Q, \Sigma, \Gamma, \delta, \lambda, q_I)$ when two sets of conditions hold, one relating to representation and interpretation (R1-R4) and the other to the dynamics of the DTRNN (D1 and D2).
R1. Representation of state. Each state $q_i \in Q$ is assigned a nonempty region $X_i \subseteq X$ such that the DTRNN $N$ is said to be in state $q_i$ at time $t$ when $\mathbf{x}[t] \in X_i$. Accordingly, these regions must be disjoint: $X_i \cap X_j = \emptyset$ if $q_i \neq q_j$. There may be regions of $X$ that are not assigned to any state. That is, there exists a mapping⁵ $F: Q \to 2^X$ from states in $Q$ into subsets of $X$ and a backward mapping $J: X \to Q$, which assigns a state in $Q$ to each valid state vector in $X$.
R2. Representation of input symbols. Each possible input symbol $\sigma_k \in \Sigma$ is assigned a different vector $\mathbf{u}_k \in U$ (it would also be possible to assign a nonempty region $U_k \subseteq U$ to each symbol). We may say that there exists a mapping $G: \Sigma \to U$ from symbols in $\Sigma$ into points in $U$.
R3. Interpretation of output. Each possible output symbol $\gamma_m \in \Gamma$ is assigned a nonempty region $Y_m \subseteq Y$ such that the DTRNN $N$ is said to output symbol $\gamma_m$ at time $t$ when $\mathbf{y}[t] \in Y_m$. Analogously, these regions must be disjoint: $Y_m \cap Y_n = \emptyset$ if $\gamma_m \neq \gamma_n$. There may be regions of $Y$ that are not assigned to any output symbol. This may be expressed by saying that there exists a mapping $H: \Gamma \to 2^Y$ from symbols in $\Gamma$ into subsets of $Y$, and a backward mapping $K: Y \to \Gamma$, which assigns a symbol in $\Gamma$ to each valid output vector in $Y$.
R4. Correctness of the initial state. The initial state of the DTRNN belongs to the region assigned to the initial state $q_I$, that is, $\mathbf{x}_0 \in X_I$. Equivalently, $J(\mathbf{x}_0) = q_I$.

D1. Correctness of the next-state function. For any state $q_j$ and symbol $\sigma_k$ of $M$, the transitions performed by the DTRNN $N$ from any point in the region of state-space $X_j$ assigned to state $q_j$ when symbol $\sigma_k$ is presented to the network must lead to points that belong to the region $X_i$ assigned to $q_i = \delta(q_j, \sigma_k)$; formally, this may be expressed as

$$f_k(X_j) \subseteq X_i \quad \forall q_j \in Q,\, \sigma_k \in \Sigma : \delta(q_j, \sigma_k) = q_i, \qquad (3.1)$$

where the shorthand notation $f_k(A) = \{f(\mathbf{x}, \mathbf{u}_k) : \mathbf{x} \in A\}$ has been used. Using the mappings just defined, this is equivalent to saying that $\delta(q_j, \sigma_k) = q_i \Leftrightarrow J(f(F(q_j), G(\sigma_k))) = q_i$.
D2. Correctness of output. In the case of a Mealy NSM, for any state $q_j$ and symbol $\sigma_k$ of $M$, the output produced by the DTRNN $N$ from any point in the region of state-space $X_j$ assigned to state $q_j$ when symbol $\sigma_k$ is presented to the network belongs to the region $Y_m$ assigned to $\gamma_m = \lambda(q_j, \sigma_k)$; formally, this may be expressed as

$$h_k(X_j) \subseteq Y_m \quad \forall q_j \in Q,\, \sigma_k \in \Sigma : \lambda(q_j, \sigma_k) = \gamma_m, \qquad (3.2)$$

⁵ We thank one of the anonymous referees for having suggested the use of explicit mappings between the FSM and its representation in the DTRNN.
where the shorthand notation

$$h_k(A) = \{h(\mathbf{x}, \mathbf{u}_k) : \mathbf{x} \in A\} \qquad (3.3)$$

has been used. This is equivalent to saying that $\lambda(q_j, \sigma_k) = \gamma_m \Leftrightarrow K(h(F(q_j), G(\sigma_k))) = \gamma_m$. In the case of a Moore NSM, for any state $q_i$, the output produced by the DTRNN $N$ from any point in the region assigned to $q_i$ must belong to the region $Y_m$ assigned to $\gamma_m = \lambda(q_i)$; the condition may be expressed as

$$h(X_i) \subseteq Y_m \quad \forall q_i \in Q : \lambda(q_i) = \gamma_m, \qquad (3.4)$$

with $h(A) = \{h(\mathbf{x}) : \mathbf{x} \in A\}$. Or, equivalently, $\lambda(q_i) = \gamma_m \Leftrightarrow K(h(F(q_i))) = \gamma_m$.

The regions $X_i \subseteq X$, $i = 1, \ldots, |Q|$, and $Y_m \subseteq Y$, $m = 1, \ldots, |\Gamma|$, may have an uncountable number of points because they are subsets of $\mathbb{R}^n$ with nonempty interiors for some $n$. However, for a finite input alphabet $\Sigma$, only a countable number of points in the state ($X$) and output ($Y$) spaces are actually visited by the net for the set of all possible input strings over $\Sigma$, denoted $\Sigma^*$, which is also denumerable.

DFA represent a special case. As noted in section 2.1, deterministic finite automata may be seen as Moore or Mealy machines having an output alphabet $\Gamma = \{Y, N\}$ whose output is examined only after the last symbol of the input string is presented, and it is such that, for a Moore machine,

$$\lambda(q_i) = \begin{cases} Y & \text{if } q_i \in F \\ N & \text{otherwise,} \end{cases} \qquad (3.5)$$

and for a Mealy machine,

$$\lambda(q_i, \sigma_k) = \begin{cases} Y & \text{if } \delta(q_i, \sigma_k) \in F \\ N & \text{otherwise.} \end{cases} \qquad (3.6)$$
A region of output space Y would be assigned to each of these two symbols, Y_Y and Y_N, such that Y_Y ∩ Y_N = ∅, and the output of the NSM would be examined only after the whole string has been processed.

3.2 A General Encoding Scheme. To encode an FSM in a DTRNN, it suffices to find a way of assigning an initial state x_0, the regions X_i ⊆ X and Y_m ⊆ Y, the input vectors u_k, and the functions f and h so that all the conditions given in the previous section are fulfilled. We propose a simple scheme to encode FSM in DTRNN that is general enough to be applied to a variety of FSM and DTRNN architectures (but is admittedly neither the only possible scheme nor the best one could design). This scheme is based on the customary "corner" or "exclusive" scheme, in which each
FSM state q_i is encoded by DTRNN state vectors having coordinate i close to the highest possible output value of the sigmoid (S_1) and all other coordinates close to the lowest possible output value (S_0); each output symbol c_m is encoded in a similar way, and input vectors u_k are unit vectors in R^{|Σ|} having all coordinates equal to 0 except for the kth, which is equal to 1. As will be shown, it is not difficult to find architectures and weight schemes for the DTRNN to behave this way. Requiring DTRNN state values to be exactly S_0 and S_1 would require some weights to be infinite, and the result would be similar to not using sigmoids at all but steplike functions instead. If sigmoids and finite weights are used, neuron outputs are never exactly S_0 or S_1, and this in turn "contaminates" further outputs through feedback; this accumulated "contamination" may in principle grow until outputs get so far from S_0 and S_1 that the requirement of disjoint regions for each state is violated (a behavior that some authors call instability), but this is not necessarily always the case. The encoding scheme proposed here establishes a way to assign regions of DTRNN state-space to FSM states (and thus a way to establish forbidden regions) and to fix weights to values such that DTRNN transitions departing from points of region X_j on symbol s_k always end in X_i whenever δ(q_j, s_k) = q_i, that is, a scheme in which state "contamination" is bounded to tolerable levels and outputs are also guaranteed to be correct (see the description of low, high, and forbidden intervals in section 2.2.4). This is what some authors (e.g., Omlin & Giles, 1996b) would call a stable encoding. The "corner" scheme with tolerances is naturally related to the definition of low, high, and forbidden values for DTRNN states.
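As a concrete illustration of the corner scheme with tolerances, the following sketch decodes a state vector into an FSM state, or reports failure when the vector is not a valid corner representation (some coordinate falls in the forbidden interval). The function name and the sample tolerance values are ours, not the article's; the regions are formalized later in equation 3.7:

```python
def decode_corner(x, eps0, eps1):
    """Return the index i of the corner region X_i(eps0, eps1) containing x,
    or None if x is not a valid state representation (a coordinate lies in
    the forbidden interval (eps0, eps1), or more than one coordinate is high)."""
    highs = [i for i, v in enumerate(x) if v >= eps1]
    lows_ok = all(v <= eps0 for i, v in enumerate(x) if i not in highs)
    return highs[0] if len(highs) == 1 and lows_ok else None

# With a logistic sigmoid (S0 = 0, S1 = 1) and illustrative tolerances:
print(decode_corner([0.97, 0.03, 0.02], 0.0753, 0.9659))  # region X_0
print(decode_corner([0.97, 0.30, 0.02], 0.0753, 0.9659))  # 0.30 is forbidden -> None
```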
The values of ε₀ and ε₁ will be chosen in each encoding to be as far as possible from S_0 and S_1, respectively, to allow the smallest possible weight values. Moving away from saturating the sigmoids, by using small weights and tolerance intervals around the limiting values S_0 and S_1, makes special sense in a gradient-descent learning setting, because saturated sigmoids lead to vanishing derivatives. All of our sigmoid constructions use sets of weights that are either zero or simple multiples of a single positive parameter H (H, −H, H/2, 3H/2), and are such that in the limit H → ∞ they would behave exactly like classical step-function constructions using weights that are zero or simple rational numbers (1, −1, 1/2, 3/2, etc.), taking state vectors to the exact corners of state-space. (It could also be said that our sigmoid constructions are obtained by using these simple rational weights and setting H, the gain of all sigmoids, to a small finite value that still guarantees FSM-like behavior of the DTRNN; this is related to the result by Arai and Nakano (1996) mentioned in section 1.) The main assumption of all of our encoding schemes is that there exist sets of values of ε₀, ε₁, and H such that the corresponding
DTRNNs may be interpreted as behaving exactly like FSM. What follows is a description of the details common to all of the encoding schemes:

R1. Representation of state. The number of state units, n_X, is taken to be equal to the number of states of the FSM, |Q|. (As will be shown, some DTRNN architectures are computationally incapable of representing all FSM, but in the cases studied the original FSM M may be converted into a new FSM M′ with a larger number of states that may itself be encoded in the DTRNN.) Each state q_i ∈ Q is assigned a region X_i(ε₀, ε₁) ⊆ X such that the DTRNN N is said to be in state q_i at time t when x[t] ∈ X_i(ε₀, ε₁). These regions are defined by

\[ X_i(\varepsilon_0, \varepsilon_1) = \{ x \in X : x_i \in [\varepsilon_1, S_1] \wedge x_j \in [S_0, \varepsilon_0]\ \forall j \neq i \}, \tag{3.7} \]

that is, all the coordinates of vectors in X_i are low except for coordinate i, which is high, and the region stands at a corner of the hypercube X = [S_0, S_1]^{n_X}. These regions are disjoint because ε₀ < ε₁: X_i(ε₀, ε₁) ∩ X_j(ε₀, ε₁) = ∅ if q_i ≠ q_j. There are points of X that are not assigned to any region.

R2. Representation of input symbols. The number of inputs to the network, n_U, is chosen to be equal to the number of symbols of the input alphabet, |Σ|. Each possible input symbol s_k ∈ Σ is assigned a vector u_k ∈ U whose jth coordinate is u_{kj} = δ_{jk}, that is, 1 if j = k and 0 otherwise. This is usually known as an exclusive, one-hot, or unary encoding of inputs.

R3. Interpretation of output. The number of output units, n_Y, is chosen to be equal to the number of symbols of the output alphabet, |Γ|. Each possible output symbol c_m ∈ Γ is assigned a region Y_m(ε₀, ε₁) ⊆ Y such that the DTRNN N is said to output symbol c_m at time t when y[t] ∈ Y_m(ε₀, ε₁). The regions are defined as

\[ Y_m(\varepsilon_0, \varepsilon_1) = \{ y \in Y : y_m \in [\varepsilon_1, S_1] \wedge y_n \in [S_0, \varepsilon_0]\ \forall n \neq m \}, \tag{3.8} \]

that is, all the coordinates of output vectors in Y_m are low except for coordinate m, which is high, and the region stands at a corner of the hypercube Y = [S_0, S_1]^{n_Y}. The use of identical values of ε₀ and ε₁ for both state and output regions is not required, but it is convenient because it simplifies the ensuing theoretical treatment. These regions are obviously disjoint, Y_m(ε₀, ε₁) ∩ Y_n(ε₀, ε₁) = ∅ if c_m ≠ c_n, provided that ε₀ < ε₁. There are points of Y that are not assigned to any of the output regions Y_m.

As discussed in section 2.1.3 and at the end of section 3.1, deterministic finite automata may be seen as Mealy or Moore machines having a two-symbol output alphabet Γ = {Y, N} whose output is examined only after the whole input string has been processed. In the general
encoding scheme used in this section, two output neurons would have to be used to represent this computation. However, in the DFA case it is convenient (and parallel to the work of Omlin & Giles, 1996a, 1996b, and Alquézar & Sanfeliu, 1995) to use a more distributed encoding in which a single output neuron is used, since the state of the other neuron would be its complement. In this case, Y_Y = [ε₁, S_1] and Y_N = [S_0, ε₀]. Distributed output schemes would in principle be adequate for output alphabets of any size; our general encoding scheme uses an exclusive interpretation of output because this is the common choice when DTRNN are used to learn transductions of any kind, and also because it makes it easier to derive the conditions for the DTRNN to behave like the corresponding FSM.

R4. Correctness of the initial state. The initial state of the DTRNN N is x_0 = \{ S_0 + \delta_{iI}(S_1 - S_0) \}_{i=1}^{n_X}, which is in X_I(ε₀, ε₁) regardless of the choice of ε₀ and ε₁ (any other state in X_I(ε₀, ε₁) would have been equally valid but less convenient).
D1. Correctness of the next-state function. The next-state function for all the DTRNN architectures discussed here will be chosen to have a single adjustable parameter H > 0. For any state q_j and symbol s_k of M, the transitions performed by the DTRNN N from any point of the region X_j(ε₀, ε₁) assigned to state q_j when symbol s_k is presented to the network must lead to points that belong to the region X_i(ε₀, ε₁) assigned to q_i = δ(q_j, s_k). Formally, this may be expressed as

\[ f_k^{(H)}(X_j(\varepsilon_0, \varepsilon_1)) \subseteq X_i(\varepsilon_0, \varepsilon_1) \quad \forall q_j, s_k : \delta(q_j, s_k) = q_i, \tag{3.9} \]
where the superscript (H) expresses the parametric dependence on H. This equation is derived from the general condition, equation 3.1, and the definition of the regions assigned to each FSM state, equation 3.7.

D2. Correctness of output. The output function for all the DTRNN architectures discussed here will be chosen to have a single adjustable parameter H > 0. In the case of a Mealy NSM, for any state q_j and symbol s_k of M, the output produced by the DTRNN N from any point of the region X_j(ε₀, ε₁) assigned to state q_j when s_k is presented to the network must belong to the region Y_m(ε₀, ε₁) assigned to c_m = λ(q_j, s_k). Formally, this may be expressed as

\[ h_k^{(H)}(X_j(\varepsilon_0, \varepsilon_1)) \subseteq Y_m(\varepsilon_0, \varepsilon_1) \quad \forall q_j, s_k : \lambda(q_j, s_k) = c_m, \tag{3.10} \]

where the superscript (H) expresses the parametric dependence on H. This equation is derived from the general condition, equation 3.2, and the definitions of the regions assigned to each FSM state, equation 3.7, and to each output symbol, equation 3.8. It would have been possible to use different values of ε₀, ε₁, and H for output and state; this would allow
for the choice of smaller weight values, but would duplicate the number of parameters. In the case of a Moore NSM, for any state q_j, the output produced by the DTRNN N from any point of the region X_j(ε₀, ε₁) assigned to state q_j must belong to the region Y_m(ε₀, ε₁) assigned to c_m = λ(q_j). The condition may be expressed as

\[ h^{(H)}(X_j(\varepsilon_0, \varepsilon_1)) \subseteq Y_m(\varepsilon_0, \varepsilon_1) \quad \forall q_j : \lambda(q_j) = c_m. \tag{3.11} \]
This equation is derived from condition 3.4 and the definitions of the regions assigned to each FSM state (see equation 3.7) and to each output symbol (see equation 3.8).

Conditions R1 through R4 may be applied to all of the architectures and will be assumed implicit in all of the encodings proposed; therefore, we will focus on the dynamical conditions D1 and D2 to derive stability conditions for each encoding. If values of ε₀, ε₁, and H exist such that S_0 < ε₀ < ε₁ < S_1 and the conditions derived for a particular DTRNN architecture from equation 3.9 and either equation 3.10 (for a Mealy machine) or equation 3.11 (for a Moore machine) are met, then the DTRNN is guaranteed to behave like an FSM. The properties of the class of sigmoid functions studied in this article, given in section 2.2.3, in particular their monotonic growth, make it easy to obtain conditions on H, ε₀, and ε₁ in closed form.

These conditions are in general sufficient but not necessary. One of the reasons has been given in section 3.1: the set of all possible values of x and y reachable by the DTRNN after reading any string over the alphabet Σ is a denumerable set, whereas the above conditions work on nondenumerable regions of nondenumerable state and output spaces. For example, the conditions require even the worst (furthest from the corner) points of all regions X_i(ε₀, ε₁) to map onto the corresponding valid regions of state-space, while it may be possible that those worst points are never reached from the initial state x_0 with any string over Σ. As a result, smaller values of H, and even weight schemes in which not all weights are different (as in Kremer, 1996), may still guarantee FSM-like behavior.

4 Encoding of Mealy Machines in DTRNN

4.1 Mealy Machines in Second-Order DTRNN. To encode Mealy machines in second-order DTRNN such as the ones described by equations 2.4 and 2.5, we will use n_X = |Q|, n_U = |Σ|, and n_Y = |Γ|.
We propose two possible constructions, corresponding to two different versions of the next-state function f_k^{(H)} and the output function h_k^{(H)}. The first construction uses two sparse weight matrices, {W^{xxu}_{ijk}} and {W^{yxu}_{ijk}}, and a bias on each state and output unit; weights can take only two values, H or 0, and all biases equal −H/2. This construction is similar but not identical to the one
proposed by Omlin and Giles (1996a, 1996b). The second construction uses two dense weight matrices, {W^{xxu}_{ijk}} and {W^{yxu}_{ijk}}, but no biases; weights can take only two values, H and −H.

4.1.1 Using a Sparse Weight Matrix. In this construction, all state and output units are biased toward low values by a bias of −H/2; for a state unit to assume a high value, it has to be excited through a connection of weight H from a unit in a high state. The DTRNN is said to be in state q_i and to output symbol c_m at time t when its state vector x[t] is in X_i(ε₀, ε₁) and its output vector y[t] is in Y_m(ε₀, ε₁), that is, when all state units are low except unit i and all output units are low except unit m. If the previous state x[t − 1] was such that the DTRNN might be said to be in state q_j, then x[t − 1] ∈ X_j(ε₀, ε₁) (only state unit j was high; the rest were low). If the symbol presented at time t is s_k, all the components of the input vector u[t] are 0 except for component k. Thus, if δ(q_j, s_k) = q_i and λ(q_j, s_k) = c_m, we want the weights W^{xxu}_{ijk} and W^{yxu}_{mjk} to be high enough to surmount the negative bias −H/2 and excite state unit i and output unit m, and all other weights W^{xxu}_{i′jk} (i′ ≠ i) and W^{yxu}_{m′jk} (m′ ≠ m) to be zero, so that the negative bias on all other state and output units brings them to a low state. The resulting choice of weights is

\[ W^{xxu}_{ijk} = \begin{cases} H & \text{if } \delta(q_j, s_k) = q_i \\ 0 & \text{otherwise,} \end{cases} \tag{4.1} \]

\[ W^{yxu}_{ijk} = \begin{cases} H & \text{if } \lambda(q_j, s_k) = c_i \\ 0 & \text{otherwise,} \end{cases} \tag{4.2} \]

and

\[ W^{x}_{i} = -\frac{H}{2}, \tag{4.3} \]

\[ W^{y}_{i} = -\frac{H}{2}. \tag{4.4} \]
(Footnote 8: they set some weights to −H. Footnote 9: this encoding has also been used by some of us to encode an extension of Mealy machines (Ñeco et al., 1999) and to encode nondeterministic finite automata directly, without first converting them to DFA (Carrasco et al., 1999).)

Now we want to find conditions on H, ε₀, and ε₁ such that the next-state function and the output function of the DTRNN constructed in this way are correct, that is, such that they satisfy conditions 3.9 and 3.10 (these are conditions D1 and D2 in section 3.2). With this encoding, the dynamics of
the second-order recurrent neural network, represented by equations 2.4 and 2.5, may be studied in four cases: high and low states and high and low outputs.

Condition D1: Correctness of the next state. For the particular next-state function defined by equations 4.1 and 4.3, we will study the transition δ(q_j, s_k) = q_i; that is, x[t − 1] is any point of X_j(ε₀, ε₁) and u[t] = u_k, and we look for conditions on ε₀, ε₁, and H so that any resulting state x[t] is in X_i(ε₀, ε₁). The new state of neuron i is given by

\[ x_i[t] = g\!\left( \sum_{l \in C_{ik}} H x_l[t-1] - \frac{H}{2} \right), \tag{4.5} \]

where

\[ C_{ik} = \{ l : \delta(q_l, s_k) = q_i \}, \tag{4.6} \]

and, obviously, j ∈ C_{ik}. The equation must fulfill x_i[t] ≥ ε₁ for x[t] ∈ X_i(ε₀, ε₁) to hold. In the worst case, x_j[t − 1] contributes the lowest possible high signal (x_j[t − 1] = ε₁), and the rest of the x_l[t − 1] contribute the lowest possible low signal for all valid sigmoid functions, that is, S_0 = 0; therefore,

\[ g\!\left( H\!\left( \varepsilon_1 - \frac{1}{2} \right) \right) \geq \varepsilon_1 \tag{4.7} \]
guarantees a high state value for x_i[t]. This worst case may indeed occur, namely when C_{ik} = {j}; this will be important in explaining some of the experimental results shown in section 7.

On the other hand, for all other state units i′ ≠ i, the signal has to be low, x_{i′}[t] ≤ ε₀. But some of the low signals at time t − 1 may be relatively high (far from S_0 and close to ε₀) and raise the value of x_{i′}[t], since there may exist states q_l such that δ(q_l, s_k) = q_{i′}, which prescribe nonzero weight values W^{xxu}_{i′lk} = H; these are the states in C_{i′k} (see equation 4.6). It is convenient to define

\[ \chi_x = \max_{j,k} |C_{jk}|, \tag{4.8} \]

that is, the size of the biggest such set, which may be understood as the maximum fan-in of the states in the transition diagram of the Mealy machine. In the worst case, χ_x low signals take the highest possible low value (ε₀) and contribute, each through a weight H, to weaken (increase) the desired low signal for x_{i′}[t]. In that worst case, the equation defining the state of x_{i′}[t] (according to equations 2.4, 4.1, and 4.3),

\[ x_{i'}[t] = g\!\left( \sum_{l \in C_{i'k}} H x_l[t-1] - \frac{H}{2} \right), \tag{4.9} \]
reduces to

\[ g\!\left( H\!\left( \chi_x \varepsilon_0 - \frac{1}{2} \right) \right) \leq \varepsilon_0, \tag{4.10} \]

an inequality that may be used to guarantee that all x_{i′}[t], i′ ≠ i, are low.
Condition D2: Correctness of the output. For the particular output function defined by equations 4.2 and 4.4, we will study the process λ(q_j, s_k) = c_m; that is, x[t − 1] is any point of X_j(ε₀, ε₁) and u[t] = u_k, and we look for conditions on ε₀, ε₁, and H so that any resulting output y[t] is in Y_m(ε₀, ε₁). Proceeding analogously to the next-state study above, we obtain two conditions that guarantee a high value of y_m[t] and a low value of y_{m′}[t], m′ ≠ m: one identical to equation 4.7 and the other given by

\[ g\!\left( H\!\left( \chi_y \varepsilon_0 - \frac{1}{2} \right) \right) \leq \varepsilon_0, \tag{4.11} \]

with

\[ \chi_y = \max_{m,k} |D_{mk}|, \tag{4.12} \]

the maximum fan-in of the output function, and

\[ D_{mk} = \{ l : \lambda(q_l, s_k) = c_m \}. \tag{4.13} \]
Therefore, if one can find ε₀ and ε₁ such that S_0 < ε₀ < ε₁ < S_1 and such that conditions 4.7, 4.10, and 4.11 are met, then the second-order DTRNN with the dynamics defined by equations 2.4 and 2.5 and constructed according to equations 4.1 through 4.4 behaves like the corresponding Mealy machine. Conditions 4.10 and 4.11 may be ensured by the more stringent condition

\[ g\!\left( H\!\left( \chi \varepsilon_0 - \frac{1}{2} \right) \right) \leq \varepsilon_0, \tag{4.14} \]
with χ = max(χ_x, χ_y) (note that always χ ≤ n_X). The conditions derived here are sufficient but not necessary (because of the use of worst cases that may not occur in a given machine, and because of the merging of conditions into more stringent ones), and thus smaller values of H, larger values of ε₀, and smaller values of ε₁ may still be adequate for the second-order DTRNN to behave like the corresponding Mealy machine.

Due to the form of equations 4.7 and 4.14, not only can sets of values of H, ε₀, and ε₁ be found for any χ > 1 (for any Mealy machine), but it is also the case that a minimum positive value of H can be found for each fan-in
χ. Clearly, reducing H increases ε₀ and reduces ε₁. One can directly search for the minimum possible H and tabulate the results for each possible value of χ. Note that the conditions on H, ε₀, and ε₁ are independent of the size of the input alphabet Σ; this will indeed be the case in all the constructions presented in this article. These values are easily obtained by realizing that the most stringent condition is 4.14. One can turn it into an equality, solve for H, and set ∂H/∂ε₀ = 0. The resulting equation reads

\[ \frac{g^{-1}(\varepsilon_0)}{\varepsilon_0 - \frac{1}{2\chi}} = \frac{1}{g'(g^{-1}(\varepsilon_0))}, \tag{4.15} \]

with 0 < ε₀ < 1/2 and g′(x) the derivative of the sigmoid function. After solving for ε₀ (for example, using an iterative formula), the resulting value may be used in equation 4.14 to obtain the minimum value of H and in equation 4.7 to obtain the corresponding value of ε₁. In particular, if g is the logistic function g_L, then S_0 = 0 and S_1 = 1, and the resulting values of ε₀, ε₁, and minimum H for some values of χ are shown in Table 1, where the superscripts + and − on some numbers denote values infinitesimally larger or smaller, respectively, than the one given. Taking into account that 1/g_L′(g_L^{-1}(ε₀)) = ε₀^{-1} + (1 − ε₀)^{-1}, a possible iteration scheme is

\[ \varepsilon_0[t+1] = \left( \frac{g_L^{-1}(\varepsilon_0[t])}{\varepsilon_0[t] - \frac{1}{2\chi}} - \frac{1}{1 - \varepsilon_0[t]} \right)^{-1}. \tag{4.16} \]

A few iterations starting from an intermediate value in (0, 1/(2χ)), such as ε₀[0] = 1/(4χ), are enough for any χ > 1. As expected, the minimum H is a growing function of χ, shown in Figure 1,

\[ H = g(\chi), \tag{4.17} \]

which grows more slowly than log(χ) (the ratio g(χ)/log(χ) is a decreasing function of χ). On the other hand, ε₀ is a decreasing function of χ, shown in Figure 2,

\[ \varepsilon_0 = \epsilon(\chi), \tag{4.18} \]

which vanishes slightly faster than 1/χ. The forbidden interval (ε₀, ε₁) is rather wide, except in the case χ = 1, where it is infinitesimally small. Section 7 shows experimental results that illustrate the theoretical results obtained here.

Equation 4.7 corresponds to a worst-case situation that may indeed occur when the only transition into an FSM state q_i on reading a symbol s_k comes from a single state q_j. In particular, if these transitions form loops
Table 1: Minimum Values of Weights and Limiting Values for Encoding a Mealy Machine on a Second-Order DTRNN with Biases Using the Logistic Sigmoid Function.

  χ      H          ε₀            ε₁
  1      4⁺         0.5⁻          0.5⁺
  2      7.17799    0.0753324     0.965920
  3      8.36131    0.0415964     0.982628
  4      9.14211    0.0281377     0.988652
  7      10.5829    0.0136861     0.994704
  10     11.4664    0.00879853    0.996648

Note: The value of ε₀ is obtained through iteration of equation 4.16; the corresponding values of H and ε₁ are obtained by directly solving equations 4.14 and 4.7, respectively.
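The rows of Table 1 can be reproduced directly from equations 4.16, 4.14, and 4.7. A short sketch, assuming the logistic sigmoid (so g⁻¹ is the logit function); the function names are ours:

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def logit(p):                      # inverse of the logistic sigmoid
    return math.log(p / (1.0 - p))

def minimum_H(chi, iters=100):
    """Minimum H and the corresponding eps0, eps1 for fan-in chi > 1."""
    e0 = 1.0 / (4.0 * chi)         # starting point suggested in the text
    for _ in range(iters):         # iterate eq. 4.16 to the optimal eps0
        e0 = 1.0 / (logit(e0) / (e0 - 1.0 / (2.0 * chi)) - 1.0 / (1.0 - e0))
    H = logit(e0) / (chi * e0 - 0.5)   # eq. 4.14 taken as an equality
    e1 = 0.99                          # eq. 4.7 at equality: eps1 is a fixed
    for _ in range(200):               # point of g(H(x - 1/2))
        e1 = logistic(H * (e1 - 0.5))
    return H, e0, e1

print(minimum_H(2))  # compare with the chi = 2 row of Table 1
```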
Figure 1: The function g.
such as δ(q_i, s_k) = q_i, the DTRNN iterates the function g(Hx − H/2), which has a fixed-point attractor exactly at x* = ε₁ when H is set to the theoretical value. For those FSM, if H is set to a value slightly lower than the theoretical one, the attractor moves to a value x* < ε₁ and enters the forbidden interval, and after a few iterations the DTRNN develops invalid representations of state. However, for other FSM that do not contain such loops, a lower value of H may still be compatible with the validity of all possible representations of state. This has to be taken into account when interpreting some of the experimental results in section 7.
Figure 2: The function ε.
4.1.2 Using a Dense Weight Matrix. An alternative encoding, already presented in preliminary form by two of us (Kremer et al., 1998), uses no biases, as in the original work by Giles et al. (1992). In this construction, weights of H are used in the same way as in the previous construction, but weights of −H are used instead of 0. The effect of the negative weights is to lower the value of a state or output unit when it has to be low, with an effect similar to that of a negative bias. The values of the weights are thus given by

\[ W^{xxu}_{ijk} = \begin{cases} H & \text{if } \delta(q_j, s_k) = q_i \\ -H & \text{otherwise} \end{cases} \tag{4.19} \]

and

\[ W^{yxu}_{ijk} = \begin{cases} H & \text{if } \lambda(q_j, s_k) = c_i \\ -H & \text{otherwise.} \end{cases} \tag{4.20} \]

The conditions for this construction to guarantee a correct next state and a correct output are derived in the same way as in the previous section, that is, from conditions D1 and D2 of section 3.2.

Condition D1: Correctness of the next state. When the Mealy machine reads symbol s_k, state x_i[t] is computed as

\[ x_i[t] = g\!\left( \sum_{l \in C_{ik}} H x_l[t-1] + \sum_{l \notin C_{ik}} (-H)\, x_l[t-1] \right). \tag{4.21} \]
If the 2ODTRNN may be said to be in state q_j at time t − 1, then x[t − 1] is in X_j(ε₀, ε₁). If the input presented to it, u[t], is u_k, representing s_k, and δ(q_j, s_k) = q_i, then the lowest value of x_i[t] (which should be larger than ε₁ for x[t] to be in X_i(ε₀, ε₁), that is, a valid neural representation of state q_i of the Mealy machine) is obtained when C_{ik} = {j} (that is, δ(q_l, s_k) ≠ q_i for all l ≠ j, a worst case that may never occur in a given automaton), x_j[t − 1] = ε₁, and all other x_l[t − 1], l ∉ C_{ik}, are equal to ε₀. In this worst case, the following inequality results:

\[ g( H \varepsilon_1 - (|Q| - 1) H \varepsilon_0 ) \geq \varepsilon_1. \tag{4.22} \]

Now let us consider the values of all x_{i′}[t], i′ ≠ i, at time t, which have to be low (smaller than ε₀) for x[t] to be in X_i(ε₀, ε₁), that is, a valid neural representation of q_i. The worst situation occurs when δ(q_l, s_k) = q_{i′} for all q_l except q_j. In this case,

\[ x_{i'}[t] = g\!\left( -H x_j[t-1] + \sum_{l \neq j} H x_l[t-1] \right). \tag{4.23} \]

The worst possible set of values of x[t − 1] that is still in X_j(ε₀, ε₁), that is, a valid neural representation of q_j (the set of values leading to the highest value of x_{i′}[t]), is obtained when x_j[t − 1] takes its lowest allowed value (ε₁) and all other x_l[t − 1], l ≠ j, take their highest allowed value (ε₀). The resulting inequality is

\[ g( -H \varepsilon_1 + (|Q| - 1) H \varepsilon_0 ) \leq \varepsilon_0. \tag{4.24} \]
Condition D2: Correctness of output. A parallel study of the output dynamics of this construction yields two conditions that are exactly equivalent to equations 4.22 and 4.24.

Values of ε₀, ε₁, and H such that S_0 < ε₀ < ε₁ < S_1 and that fulfill conditions 4.22 and 4.24 guarantee that the 2ODTRNN constructed according to equations 4.19 and 4.20 behaves exactly like the corresponding Mealy machine. To find the minimum value of H satisfying these inequalities for a given |Q|, one may use a direct search in (ε₀, ε₁) space. If g is the logistic function g_L, the solution is found to have symmetric low and high intervals,

(It is easy to see why the solution for the logistic function satisfies ε₀ + ε₁ = 1: suppose we had ε₀ > 1 − ε₁ and an H satisfying both inequalities 4.22 and 4.24, and define ε₀′ = 1 − ε₁; clearly ε₀′ < ε₀, and therefore equation 4.22 is also satisfied by the pair (ε₀′, ε₁). Applying the identity g_L(−x) = 1 − g_L(x) to inequality 4.22 yields g_L(−Hε₁ + (|Q| − 1)Hε₀′) ≤ 1 − ε₁ = ε₀′, which shows that the pair (ε₀′, ε₁) also satisfies equation 4.24.)
that is, ε₀ + ε₁ = 1.

Table 2: Values of Weights and Limiting Values for Encoding a DFA on a Second-Order DTRNN Without Biases.

  |Q|    H          ε₀
  2      2⁺         0.5⁻
  3      3.11297    0.121951
  4      3.58899    0.0753324
  5      3.92227    0.0538956
  6      4.18066    0.0415964
  7      4.39206    0.0336391
  10     4.86330    0.0210033
  30     6.22427    0.00538437

Note: ε₁ = 1 − ε₀.

In this case, equations 4.22 and 4.24 both reduce to
\[ g_L( -H + |Q| H \varepsilon_0 ) \leq \varepsilon_0, \tag{4.25} \]

which may be solved similarly to equation 4.14 by realizing that it may be rewritten as

\[ g_L\!\left( 2H\!\left( \frac{|Q|}{2}\,\varepsilon_0 - \frac{1}{2} \right) \right) \leq \varepsilon_0. \tag{4.26} \]
Solutions are, accordingly, H = (1/2)g(|Q|/2) and ε₀ = ε(|Q|/2); that is, the values of H obtained for this construction are smaller than (less than half of) those obtained for the construction of the previous section, and the forbidden interval is narrower. Some values are shown in Table 2. Experimental results supporting the validity of this construction are given in section 7. (It is interesting to note that for the case |Q| = 2, if instead of defining a forbidden interval, which here ends up being of zero measure, one simply divides the range (0, 1) of the logistic function into two halves representing low and high signals, it may be proved that any weight H > 0 still guarantees correct behavior; this will be apparent in the results shown in section 7.)

4.2 Mealy Machines in First-Order DTRNN. First-order, single-layer Mealy neural state machines, also known as Robinson and Fallside's (1991) recurrent error propagation networks, defined by equations 2.6 and 2.7, cannot perform all possible Mealy machine computations if an exclusive interpretation of output (one neuron per output symbol) is used. This section shows how to encode any Mealy machine in a modified architecture whose next-state function is computed by a single-layer feedforward
network of the form given in equation 2.6 and whose output function is computed by a two-layer feedforward neural network of the form given in equations 2.8 and 2.9. The encoding will use n_U = |Σ|, n_Y = |Γ|, n_X = |Q||Σ|, and n_Z = |Γ||Σ|; the reasons for the last two choices will become clear in the following. The construction defines a new Mealy machine M′ = (Q′, Σ, Γ′, δ′, λ′, q′_I) (the split-state, split-output Mealy machine) as follows:

The new set of states is Q′ = Q × Σ; that is, new states are pairs (q_i, s_k).

The new set of outputs is Γ′ = Γ × Σ; that is, new outputs are pairs (c_m, s_k).

The new next-state function δ′ : Q′ × Σ → Q′ is defined by

\[ \delta'((q_i, s_k), s_l) = (\delta(q_i, s_l), s_l), \tag{4.27} \]

that is, the second component of each state in Q′ records the symbol that was read on reaching it. The new output function λ′ : Q′ × Σ → Γ′ is defined by

\[ \lambda'((q_i, s_k), s_l) = (\lambda(q_i, s_l), s_l). \tag{4.28} \]

The new initial state q′_I is any of the states (q_I, s_k), s_k ∈ Σ.
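Equations 4.27 and 4.28 are easy to realize in code. A minimal sketch, using a hypothetical two-state parity transducer as M (all names are illustrative, not from the article), which also checks that M′ reproduces M's transduction when each output (c_m, s_k) is read as c_m:

```python
# Hypothetical parity transducer M, used only to illustrate the construction.
Q, Sigma, Gamma = ['q0', 'q1'], ['a', 'b'], ['even', 'odd']
delta = {('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
         ('q1', 'a'): 'q0', ('q1', 'b'): 'q1'}
lam = {k: ('even' if v == 'q0' else 'odd') for k, v in delta.items()}

# Split-state, split-output machine M' (eqs. 4.27 and 4.28).
Qp = [(q, s) for q in Q for s in Sigma]                       # Q' = Q x Sigma
deltap = {(qp, sl): (delta[(qp[0], sl)], sl) for qp in Qp for sl in Sigma}
lamp = {(qp, sl): (lam[(qp[0], sl)], sl) for qp in Qp for sl in Sigma}

# M' performs the same transduction as M once (c_m, s_k) is read as c_m.
state, statep = 'q0', ('q0', 'a')        # any (q_I, s_k) is a valid q'_I
for sym in 'abba':
    assert lamp[(statep, sym)][0] == lam[(state, sym)]
    state, statep = delta[(state, sym)], deltap[(statep, sym)]
assert statep[0] == state
```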
It is not at all difficult to show that the split-state, split-output Mealy machine M′ performs the same finite-state computation (transduction) as M, if all the symbols in each of the sets {(c_m, s_k) : s_k ∈ Σ} output by M′ are interpreted as c_m. Note that after splitting, some of the states in Q′ and some of the outputs in Γ′ may be useless (states may be inaccessible from the initial state, and outputs may never be produced) and could be eliminated before proceeding with the construction. The proposed scheme encodes the split-state, split-output Mealy machine M′, hence the need for n_X = |Q′| = |Q||Σ| state units and n_Z = |Γ′| = |Γ||Σ| hidden units in the output function. Each state unit (respectively, each hidden unit in the output function) is made to correspond to (and is numbered as) a state in Q′ (respectively, an output in Γ′). The values of the weights and biases (generalized from Alquézar & Sanfeliu's, 1995, construction of DFA on Elman, 1990, nets) for the next-state function are given by:
\[ W^{xx}_{ij} = \begin{cases} H & \text{if } \delta'(q'_j, s_k) = q'_i \text{ for some } s_k \in \Sigma \\ 0 & \text{otherwise,} \end{cases} \tag{4.29} \]

\[ W^{xu}_{ik} = \begin{cases} H & \text{if } \delta'(q'_j, s_k) = q'_i \text{ for some } q'_j \in Q' \\ 0 & \text{otherwise,} \end{cases} \tag{4.30} \]

\[ W^{x}_{i} = -\frac{3H}{2}, \quad i = 1, \ldots, n_X. \tag{4.31} \]
The construction works as follows: a state neuron i will be excited to a high value only if both a state neuron j such that δ′(q′_j, s_l) = q′_i for some s_l is high and an input k such that δ′(q′_l, s_k) = q′_i for some q′_l is also high, because a bias of −3H/2 has to be surmounted (this bias is necessary because we want the output of a unit to be high only when at least two inputs, one from a state unit and one from an input signal, contribute a value of about +H). The split-state, split-output Mealy machines are such that this is possible only if δ′(q′_j, s_k) = q′_i; otherwise, the bias will bring state unit i to a low value. The output function is defined by the following weight scheme:

\[ W^{zx}_{ij} = \begin{cases} H & \text{if } \lambda'(q'_j, s_k) = c'_i \text{ for some } s_k \in \Sigma \\ 0 & \text{otherwise,} \end{cases} \tag{4.32} \]

\[ W^{zu}_{ik} = \begin{cases} H & \text{if } \lambda'(q'_j, s_k) = c'_i \text{ for some } q'_j \in Q' \\ 0 & \text{otherwise,} \end{cases} \tag{4.33} \]

\[ W^{z}_{i} = -\frac{3H}{2}, \quad i = 1, \ldots, n_Z. \tag{4.34} \]
The construction works similarly to that of the next-state function: a hidden neuron i in the output section of the net will be excited to a high value only if both a state neuron j such that λ′(q′_j, s_l) = c′_i for some s_l is high and an input k such that λ′(q′_l, s_k) = c′_i for some q′_l is also high, because a bias of −3H/2 has to be surmounted. The split-state, split-output Mealy machines are such that this is possible only if λ′(q′_j, s_k) = c′_i; otherwise, the bias will bring hidden unit i to a low value. Finally, the output layer collects the activations of the hidden output units to produce an output that may be interpreted in an exclusive fashion. To achieve this, each output unit m is biased with −H/2 so as to remain in a low state unless it receives, through a weight of H, a high signal from one of the |Σ| hidden units corresponding to symbol c_m. The construction is as follows:

\[ W^{yz}_{mj} = \begin{cases} H & \text{if } c'_j = (c_m, s_k) \text{ for some } s_k \in \Sigma \\ 0 & \text{otherwise,} \end{cases} \tag{4.35} \]

\[ W^{y}_{m} = -\frac{H}{2}, \quad m = 1, \ldots, n_Y. \tag{4.36} \]
13 This next-state function construction differs from the one proposed by Kremer (1996) in that the same parameter H is used for all weights. Kremer used different values for each state unit, and the output interval of the logistic function, [0, 1], was divided into low, forbidden, and high intervals in a symmetric fashion (\varepsilon_0 + \varepsilon_1 = 1).
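As a sanity check of the prescriptions in equations 4.29 through 4.31, the following sketch builds the next-state weights and bias for a hypothetical two-state Mealy machine and verifies the surmount-the-bias argument above on one update. The machine, the value H = 10, and all identifier names are illustrative assumptions, not material from the article:

```python
import math

# Hypothetical toy machine (an illustrative assumption, not from the article):
# Q = {0, 1}, Sigma = {'a', 'b'}; 'a' flips the state, 'b' keeps it.
Q = [0, 1]
Sigma = ['a', 'b']
delta = {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}

# Split states q' = (q, s): state q as reached via symbol s.
Qs = [(q, s) for q in Q for s in Sigma]                  # Q' = Q x Sigma
dsplit = {(qp, t): (delta[(qp[0], t)], t)                # delta'((q, s), t)
          for qp in Qs for t in Sigma}

H = 10.0                                                 # weight parameter

def g(x):                                                # logistic sigmoid g_L
    return 1.0 / (1.0 + math.exp(-x))

nX = len(Qs)
# Eq. 4.29: W_xx[i][j] = H if delta'(q'_j, s_k) = q'_i for some s_k, else 0.
W_xx = [[H if any(dsplit[(Qs[j], t)] == Qs[i] for t in Sigma) else 0.0
         for j in range(nX)] for i in range(nX)]
# Eq. 4.30: W_xu[i][k] = H if delta'(q'_j, s_k) = q'_i for some q'_j, else 0.
W_xu = [[H if any(dsplit[(qp, Sigma[k])] == Qs[i] for qp in Qs) else 0.0
         for k in range(len(Sigma))] for i in range(nX)]
bias = -1.5 * H                                          # Eq. 4.31: -3H/2

def step(x, sym):
    k = Sigma.index(sym)
    return [g(sum(W_xx[i][j] * x[j] for j in range(nX)) + W_xu[i][k] + bias)
            for i in range(nX)]

# One-hot start in split state (0, 'b'); read 'a'.
x = [1.0 if qp == (0, 'b') else 0.0 for qp in Qs]
x = step(x, 'a')
print({qp: round(v, 3) for qp, v in zip(Qs, x)})
```

Only the unit whose state and input lines are both active receives H + H - 3H/2 = H/2 > 0 and saturates high; every other unit receives at most H - 3H/2 = -H/2 and is driven low, exactly as argued above.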
2154
R. C. Carrasco, M. L. Forcada, M. Á. Valdés-Muñoz, & R. P. Ñeco
In contrast to Alquézar and Sanfeliu's (1995) construction, the next-state weights and biases are given in terms of a single parameter (H) to simplify the construction, but it would not be difficult to use different parameters and define conditions to determine their values; the conditions for stability are derived from conditions D1 and D2 in section 3.2.

Condition D1: Correctness of the next state. Again, we want to find conditions to ensure that whenever the DTRNN is in any state x[t-1] in the region X_j(\varepsilon_0, \varepsilon_1) corresponding to state q'_j and reads as input u[t] the vector u_k corresponding to symbol s_k, it always produces a next state in region X_i(\varepsilon_0, \varepsilon_1) if \delta'(q'_j, s_k) = q'_i, according to equation 3.9. Taking into account the weight and bias prescriptions (equations 4.29 through 4.31) and the special characteristics of the split-state transition function, equation 2.6 computes x_i[t] as follows:

x_i[t] = g\left( \sum_{l \in C_{ik}} H x_l[t-1] - \frac{H}{2} \right), \qquad (4.37)

identical to equation 4.5, where

C_{ik} = \{ l : \delta'(q'_l, s_k) = q'_i \} \qquad (4.38)
is the set of the indices of the states that may contribute to the activation of unit i (note that, due to the splitting of states, if state q'_i is reachable through transitions involving symbol s_k, it cannot be reached through transitions involving other symbols). Obviously, q'_j \in C_{ik}. For x[t] to be in X_i(\varepsilon_0, \varepsilon_1), equation 4.37 must fulfill x_i[t] \ge \varepsilon_1. The worst case is, as in section 4.1.1, when x_j[t-1] contributes the lowest possible high signal, \varepsilon_1, and all other signals are the lowest possible low signal, that is, S_0 = 0 (for any sigmoid function in the class considered). In that case, we recover equation 4.7. The conditions that ensure that all other x_{i'}[t], i' \ne i, are low, that is, x_{i'}[t] \le \varepsilon_0, are identical to those discussed in section 4.1.1, and lead to a worst-case condition identical to condition 4.10. Therefore, if conditions 4.7 and 4.10 and S_0 < \varepsilon_0 < \varepsilon_1 < S_1 are met, it is guaranteed that the augmented Robinson and Fallside network will always exhibit the next-state behavior of the corresponding Mealy machine.

Condition D2: Correctness of the output. To ensure that the output is correct, we have to ensure, first, that the units in the hidden layer represent the split output of the split-state, split-output Mealy machine M' correctly and, second, that the output layer collects these representations correctly into an output that may be interpreted in an exclusive fashion. Let us call Z = [S_0, S_1]^{n_Z} the set of possible values of the hidden output layer. A region Z_i(\varepsilon_0, \varepsilon_1) \subseteq Z will be assigned to each symbol \gamma'_i in \Gamma' in the
same way as regions Y_i(\varepsilon_0, \varepsilon_1) \subseteq Y = [S_0, S_1]^{n_Y} are assigned to each symbol \gamma_i in \Gamma (see equation 3.8). First, we want to find conditions to ensure that whenever the DTRNN is in any state x[t-1] in the region X_j(\varepsilon_0, \varepsilon_1) corresponding to state q'_j and reads as input u[t] the vector u_k corresponding to symbol s_k, it always produces a hidden output z[t] in region Z_i(\varepsilon_0, \varepsilon_1) if \lambda'(q'_j, s_k) = \gamma'_i, according to equation 4.9. Taking into account the weight and bias prescriptions (equations 4.32 through 4.34), equation 2.9 computes z_i[t] as follows:

z_i[t] = g\left( \sum_{l \in D_{ik}} H x_l[t-1] - \frac{H}{2} \right), \qquad (4.39)
very similar to equation 4.37, where

D_{ik} = \{ l : \lambda'(q'_l, s_k) = \gamma'_i \}. \qquad (4.40)
Obviously, q'_j \in D_{ik}. For z[t] to be in Z_i(\varepsilon_0, \varepsilon_1), equation 4.39 must fulfill z_i[t] \ge \varepsilon_1. The worst case is, as in section 4.1.1, when x_j[t-1] contributes the lowest possible high signal, \varepsilon_1, and all other signals are the lowest possible low signal, that is, S_0 = 0 (for any sigmoid function in the class considered). In that case, we recover equation 4.7. The conditions that ensure that all other z_{i'}[t], i' \ne i, are low, that is, z_{i'}[t] \le \varepsilon_0, are identical to those discussed in section 4.1.1, and lead to a worst-case condition identical to condition 4.11 with

\chi_y = \max_{m,k} |D_{mk}|. \qquad (4.41)
That is, if conditions 4.7 and 4.11 and S_0 < \varepsilon_0 < \varepsilon_1 < S_1 are met, then the DTRNN is guaranteed to produce a correct representation of the output in the hidden output layer, provided that the state representations are correct. Second, we have to ensure that the output layer collects the hidden output representations in a way that produces a correct output. We have to find conditions ensuring that, for any \gamma'_j \in \Gamma', any z[t] \in Z_j(\varepsilon_0, \varepsilon_1) is mapped into an output y[t] \in Y_i if \gamma'_j = (\gamma_i, s_k) for some s_k \in \Sigma. Taking into account the weight and bias prescriptions (see equations 4.35 and 4.36), equation 2.8 computes the mth component of the output vector according to

y_m[t] = g\left( \sum_{l : \gamma'_l = (\gamma_m, s_k)} H z_l[t] - \frac{H}{2} \right). \qquad (4.42)
If m = i, the worst case for this equation (the lowest possible value of y_i[t]) is when all z_l[t] are 0 (S_0 = 0 is the worst case for all sigmoids) except for z_j[t],
which is exactly \varepsilon_1. Since we want y_i[t] to be high, that is, y_i[t] \ge \varepsilon_1, we obtain the condition

g\left( H \left( \varepsilon_1 - \frac{1}{2} \right) \right) \ge \varepsilon_1, \qquad (4.43)
which is identical to condition 4.7. If m \ne i, then we want y_m[t] to be low. The worst case for equation 4.42 (the highest value of y_m[t]) occurs when all z_l[t] have the highest valid value, \varepsilon_0. Since we want y_m[t] to be low, that is, y_m[t] \le \varepsilon_0, we obtain the condition
g\left( H \left( n_U \varepsilon_0 - \frac{1}{2} \right) \right) \le \varepsilon_0. \qquad (4.44)
As in section 4.1.1, the single condition 4.14 ensures that all three conditions (4.10, 4.11, and 4.44) are simultaneously fulfilled, but in this case

\chi = \max(\chi_x, \chi_y, n_U). \qquad (4.45)
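For the logistic sigmoid, conditions 4.43 and 4.44 can be explored numerically. The sketch below is only an illustration (the grid, the precision, and the use of a generic fan-in chi in place of n_U are assumptions, not the article's procedure for conditions 4.7 and 4.14): it scans for the smallest H for which some pair \varepsilon_0 < \varepsilon_1 satisfies both conditions.

```python
import math

def g(x):                                  # logistic sigmoid g_L
    return 1.0 / (1.0 + math.exp(-x))

def min_H(chi, step=0.01):
    """Smallest H (to the given precision) for which some pair
    eps0 < eps1 on a 0.01 grid satisfies conditions 4.43 and 4.44
    (with a generic fan-in chi standing in for n_U)."""
    H = step
    while H < 50.0:
        # Condition 4.43: a high signal at eps1 must map back above eps1.
        high_ok = any(g(H * (e / 100 - 0.5)) >= e / 100
                      for e in range(51, 100))
        # Condition 4.44: chi low signals at eps0 must map back below eps0.
        low_ok = any(g(H * (chi * e / 100 - 0.5)) <= e / 100
                     for e in range(1, 50))
        # The two conditions involve eps1 and eps0 separately, and the grids
        # keep eps0 < 0.5 < eps1, so they can be checked independently.
        if high_ok and low_ok:
            return round(H, 2)
        H += step
    return None

print(min_H(2), min_H(3))  # the larger fan-in requires a larger H
```

As in the worst-case analysis above, a larger fan-in forces low signals to fight a larger summed input, so the minimum H grows with chi.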
Therefore, if H, \varepsilon_0, and \varepsilon_1 are chosen according to conditions 4.7, 4.14, and 4.45, and S_0 < \varepsilon_0 < \varepsilon_1 < S_1, then the augmented Robinson and Fallside network behaves like the corresponding Mealy machine. In the case of the logistic function, g(x) = g_L(x), the results obtained for H are identical to those discussed in section 4.1.1.

5 Encoding of Moore Machines in DTRNN

5.1 Moore Machines in First-Order DTRNN. We will show that Moore machines can be encoded in Elman's (1990) simple recurrent nets, a class of first-order neural Moore machines whose dynamics is defined by equations 2.6 and 2.10. Elman nets have been proved to be computationally equivalent to any DFA if a step function is used (Kremer, 1995). The need for a specialized output layer was explained earlier by Goudreau et al. (1994), who showed that there are DFA whose computation cannot be represented on a first-order DTRNN without the additional layer. Alquézar and Sanfeliu (1995) adapted Minsky's (1967) construction and showed that an Elman network with sigmoid functions taking and returning rational numbers could stably represent a DFA. Here the construction is extended to real-valued sigmoid functions and Moore machines. The split-state Moore machine M' equivalent to a Moore machine M is defined analogously to the split-state, split-output Mealy machine of section 4.2; the only difference is the output function, \lambda' : Q' \to \Gamma, which is defined as follows:

\lambda'((q_i, s_k)) = \lambda(q_i). \qquad (5.1)
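The split-state construction summarized by equation 5.1 is easy to state as code. The sketch below, on a hypothetical toy Moore machine (the machine and all names are illustrative assumptions), builds M' and checks that it reproduces the behavior of M:

```python
from itertools import product

def split_moore(Q, Sigma, delta, lam, q_init):
    """Q' = Q x Sigma, delta'((q, s), t) = (delta(q, t), t),
    lambda'((q, s)) = lambda(q), as in section 4.2 and eq. 5.1."""
    Qs = [(q, s) for q in Q for s in Sigma]
    dsplit = {((q, s), t): (delta[(q, t)], t) for (q, s) in Qs for t in Sigma}
    lsplit = {(q, s): lam[q] for (q, s) in Qs}
    # All split copies of the initial state behave identically; pick one.
    return Qs, dsplit, lsplit, (q_init, Sigma[0])

def run(trans, out, q, w):
    for s in w:
        q = trans[(q, s)]
    return out[q]

# Toy Moore machine (illustrative): output is the parity of 'a's read so far.
Q, Sigma = [0, 1], ['a', 'b']
delta = {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}
lam = {0: 'even', 1: 'odd'}

Qs, dsplit, lsplit, q0s = split_moore(Q, Sigma, delta, lam, 0)

# M and M' agree on every input string (checked here up to length 5).
for n in range(6):
    for w in product(Sigma, repeat=n):
        assert run(delta, lam, 0, w) == run(dsplit, lsplit, q0s, w)
print(len(Qs), run(dsplit, lsplit, q0s, ('a', 'b', 'a', 'a')))  # 4 odd
```

Note that |Q'| = |Q||Sigma|, matching the count of state units used in the encoding below.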
Accordingly, the encoding uses n_U = |\Sigma| input lines as usual, n_Y = |\Gamma| output lines, and n_X = |\Sigma||Q| state units. Condition D1 in section 3 leads to the same conditions as in section 4.2, because the next-state function is constructed identically: equations 4.7 and 4.10 and S_0 < \varepsilon_0 < \varepsilon_1 < S_1. Therefore it suffices to consider condition D2 (correctness of the output). The output function is defined by the following weight scheme:

W^{yx}_{ij} = \begin{cases} H & \text{if } \lambda(q'_j) = \gamma_i \\ 0 & \text{otherwise,} \end{cases} \qquad (5.2)

W^y_i = -H/2, \quad i = 1, \ldots, n_Y, \qquad (5.3)
so that an output unit m is biased to be in a low state unless one of the state units i such that \lambda(q'_i) = \gamma_m is high enough to surmount that bias. We want to find conditions to ensure that, whenever the DTRNN has reached any state x[t] in the region X_i(\varepsilon_0, \varepsilon_1) corresponding to state q'_i, it will always produce an output vector y[t] in region Y_m(\varepsilon_0, \varepsilon_1) if \lambda'((q_i, s_k)) = \gamma_m, according to equation 3.11. Taking into account the weight and bias prescriptions (see equations 5.2 and 5.3), equation 2.10 computes the jth component of the output vector according to

y_j[t] = g\left( \sum_{l : \lambda(q'_l) = \gamma_j} H x_l[t] - \frac{H}{2} \right). \qquad (5.4)
If j = m, the worst case for this equation (the lowest possible value of y_m[t]) is when all x_l[t] are 0 (S_0 = 0 is the worst case for all sigmoids) except for x_i[t], which is exactly \varepsilon_1. Since we want y_m[t] to be high, that is, y_m[t] \ge \varepsilon_1, we obtain a condition identical to condition 4.7. If j \ne m, then we want y_j[t] to be low. The worst case for equation 5.4 (the highest value of y_j[t]) occurs when all x_l[t] have the highest valid value, \varepsilon_0. Since we want y_j[t] to be low, that is, y_j[t] \le \varepsilon_0, we recover condition 4.11, with the maximum output fan-in \chi_y defined as follows:

\chi_y = \max_m |D_m|, \qquad (5.5)

where

D_m = \{ l : \lambda(q'_l) = \gamma_m \}. \qquad (5.6)
As in section 4.1.1, a single condition, 4.14, ensures that both conditions 4.10 and 4.11 are simultaneously fulfilled. In the case of the logistic function, g(x) = g_L(x), the results obtained for H are identical to those discussed in section 4.1.1.
5.2 Moore Machines in Second-Order DTRNN. A Moore machine may be straightforwardly encoded in a second-order Moore NSM (Blair & Pollack, 1997; Carrasco et al., 1996) with a next-state function defined by equation 2.4 and an output function defined by equation 2.10, using n_U = |\Sigma|, n_X = |Q|, and n_Y = |\Gamma|.

If the construction with biases defined in section 4.1.1 is used, then the conditions to be fulfilled are S_0 < \varepsilon_0 < \varepsilon_1 < S_1, equation 4.7, and equation 4.14, with \chi = \max(\chi_x, \chi_y), \chi_x defined by equation 4.8, and \chi_y defined by equation 5.5 (but dropping the primes in equation 5.6). The values of H for the logistic sigmoid function g_L(x) are identical to those obtained in section 4.1.1.

If the construction without biases defined in section 4.1.2 is used, then the conditions to be fulfilled are S_0 < \varepsilon_0 < \varepsilon_1 < S_1, equation 4.22, equation 4.24, and equation 4.11, with \chi_y defined as in equation 5.5. Note that conditions 4.22 and 4.24 define H as a function of the number of states |Q|, whereas condition 4.11 defines H as a function of the maximum output fan-in \chi_y. To look for the minimum value of H ensuring both conditions for a given FSM, one may look for the minimum H in both sets of equations and then take the larger one. Note that even if \chi_y \le |Q|, the values of H defined by conditions 4.22 and 4.24 are much lower, as discussed at the end of section 4.1.2, and therefore \chi_y is likely to define H in many cases.

6 Encoding of Deterministic Finite Automata in DTRNN
DFA will be encoded in DTRNN as a special case of Mealy and Moore machines, with a single output neuron, as described in section 3, even though DFA may actually be seen as outputting two different symbols, Y (input string accepted) and N (input string not accepted). Any of the constructions described for Mealy and Moore machines is valid, but here the output neuron devoted to the rejection symbol N is eliminated. When using first-order DTRNN, Elman nets will be used as Moore NSM, and the DFA will be encoded as in section 5.1. When using second-order DTRNN, the most convenient constructions are those described in sections 4.1.1 and 4.1.2, where the DFA would be implemented as a Mealy machine. This section will detail only the weight and bias prescriptions for the output, because the next-state functions are identical to those described in the sections mentioned.

6.1 Encoding DFA in Elman Nets. Encoding a DFA in an Elman net is accomplished in the same way as for a Moore machine, as described in section 5.1. The construction is therefore very similar to that used by Alquézar and Sanfeliu (1995): a new split DFA M' = (Q', \Sigma, \delta', q'_I, F') is built from M = (Q, \Sigma, \delta, q_I, F) (see section 2.1.3) by splitting states (Q', \delta', and q'_I are defined as in section 4.2, and F' = \{(q_i, s_k) : q_i \in F\}); then the next-
state function is encoded according to equations 4.29 through 4.31, and the output function is constructed as follows:

W^{yx}_{1j} = \begin{cases} H & \text{if } q'_j \in F' \\ 0 & \text{otherwise,} \end{cases} \qquad (6.1)

and

W^y_1 = -\frac{H}{2}. \qquad (6.2)
The definition of \chi_y used here is \chi_y = |F'|, instead of equation 5.5. The smallest values of H guaranteeing this construction with the logistic sigmoid g_L(x) are therefore identical to those obtained in section 4.1.1.

6.2 Encoding DFA in Second-Order DTRNN. Encoding a DFA in a second-order net with biases is accomplished in the same way as for a Mealy machine, as described in section 4.1.1; the construction is therefore very similar to that used by Omlin and Giles (1996a, 1996b), but not identical. The next-state function is encoded according to equations 4.1 and 4.3, and the output function is constructed as follows:

W^{yxu}_{1jk} = \begin{cases} H & \text{if } \delta(q_j, s_k) \in F \\ 0 & \text{otherwise,} \end{cases} \qquad (6.3)

W^y_1 = -\frac{H}{2}. \qquad (6.4)
The definition of \chi_y used here is \chi_y = |F|, instead of equation 4.12. The smallest values of H guaranteeing this construction with the logistic sigmoid g_L(x) are therefore identical to those obtained in section 4.1.1.

Encoding a DFA in a second-order net without biases is accomplished in the same way as for a Mealy machine, as described in section 4.1.2. The next-state function is encoded according to equation 4.19, and the output function is constructed as follows:

W^{yxu}_{1jk} = \begin{cases} H & \text{if } \delta(q_j, s_k) \in F \\ -H & \text{otherwise.} \end{cases} \qquad (6.5)

The smallest values of H guaranteeing this construction with the logistic sigmoid g_L(x) are those given in Table 2.

7 Experimental Results
In this section we show the results of experiments performed to estimate the minimum value of H needed to ensure correct encodings of randomly generated Mealy and Moore machines. For each machine, we looked for this value experimentally by searching for H(L), the minimum value of H needed to ensure the correct behavior of the network for strings of length L.
The experimental results shown in this section are not intended to prove the validity of the constructions described in the previous sections; rather, they aim at illustrating it, and also at showing that the upper bounds given by the theoretical treatment are in many cases very high, due to the imposition of necessary conditions based on worst-case situations that are not expected to occur for most FSM and DTRNN. Experimental results will be shown for the following constructions:

Mealy machines in second-order Mealy NSM, with and without bias (described in sections 4.1.1 and 4.1.2)

Mealy machines in augmented Robinson and Fallside networks (first-order Mealy NSM, described in section 4.2)

Moore machines in Elman nets (described in section 5.1)

Moore machines in second-order Moore NSM with biases (described in section 5.2)

We have decided not to show any results for deterministic finite automata because they may easily be treated as special cases of Mealy and Moore machines, as shown in section 6. The results of three kinds of experiments will be shown:

Experiments to estimate a value of H for small FSM by running all input strings up to a given length and depicting the asymptotic behavior of H (section 7.1)

Experiments to estimate a value of H for small FSM from random sets of very long strings

Experiments to show the correct behavior of H for large FSM

All of the experiments will be performed on binary input and output alphabets, in view of the result that the values of H do not depend on the size of the alphabet.

7.1 Asymptotical Values of H for Small FSM. This section describes the results of experiments performed to estimate the minimum value of H needed to ensure a correct encoding of a finite-state machine in each of the encodings studied in this article.
This value can only be estimated, and there is just one obvious way of performing the experiments: directly searching (to a given precision, 0.01 in our calculations) for the minimum value of H needed to ensure correct behavior of the network for strings of increasing length (this value is H(L)). The network is taken to be behaving correctly for a given length L when the DTRNN produces correct output for all strings of length L. In our experiments in this section, L = 1, \ldots, 20. We use two criteria to define a correct output of the network: a weak criterion, using a forbidden interval of zero measure (\varepsilon_0 = \varepsilon_1 = 0.5), and a strong criterion, using the theoretical forbidden interval computed for each
network (S_0 < \varepsilon_0 < \varepsilon_1 < S_1). As will be shown, the value of H(L) increases with L, is always smaller than or equal to the theoretical upper bound H_{th} estimated for the corresponding construction for all the lengths tested (up to length 20), and seems to stabilize around a value

H_{exp} = \max_{L > 0} H(L) \qquad (7.1)

(the overall shape of H(L) vaguely resembles a noisy hyperbolic tangent in most cases).

For each of the encoding schemes studied, 10 FSM (5 Mealy and 5 Moore machines) with \Sigma = \{0, 1\} and \Gamma = \{0, 1\}, with a given number of states |Q| or a given fan-in \chi (depending on the encoding), were randomly generated, and the values of H(L) for L = 1, 2, \ldots, 20 were computed to a precision of 0.01 for each one of the FSM. The number of states (resp., fan-in) of the generated automata was |Q| (resp., \chi) = 2, 3, 4, 7, 10. We have found the same asymptotical behavior for all the size parameter values used in the experiments.

We have found that the values of H obtained in our simulations are in general very insensitive to the precision with which the calculations of the states of each unit are performed (as tested using a parameter to control the number of significant digits returned by the sigmoid g_L(x)). The only problematic case occurred when the minimum H(L) appeared to be a positive infinitesimal, that is, almost zero, for any length (this happens for two-state Mealy machines in the dense-weight version of the 2ODTRNN encoding, section 4.1.2, even though the theory prescribes an upper bound of 2); in this last case, the results were strongly dependent on precision. We found indeed that H(L) was a decreasing function of precision and that it could be made as small as desired by increasing the number of significant figures used in the calculations.

In Figures 3 to 7 we show the results obtained for the FSM of size 10 (number of states |Q| or fan-in \chi, depending on the construction). In these figures we represent normalized values of H(L) (i.e., H(L)/H_{th}, where H(L) is the experimental value of H obtained for each string length and H_{th} is the theoretical value of H obtained for each encoding). We show the results using error bars with the minimum, average, and maximum values of the normalized experimental H obtained for each encoding and string length L, L = 1, 2, \ldots, 20. The results for the other FSM, of sizes 2, 3, 4, and 7, are very similar; Table 3 summarizes the results for all these FSM, normalized as H(20)/H_{th}.

It is important to comment on the fact that, when the forbidden interval is enforced, some constructions, all of them based on condition 4.7 (first-order constructions and second-order constructions with biases), show experimental values of H that are equal to the theoretical value. As discussed at the end of section 4.1.1, this happens for some kinds of FSM with loops and is due to the fact that, for the theoretical values of H and \varepsilon_1, a fixed-point
Figure 3: Minimum and maximum values of H(L)/H_{th} for five randomly generated Mealy machines with |Q| = 10 encoded in second-order Mealy NSM without bias. Upper lines: theoretical \varepsilon_0 (see equation 4.25). Lower lines: \varepsilon_0 = 0.5.
Figure 4: Minimum and maximum values of H(L)/H_{th} for five randomly generated Mealy machines with \chi = 10 encoded in second-order Mealy NSM with bias. Upper lines: theoretical \varepsilon_0 and \varepsilon_1 (see equations 4.7 and 4.14). Lower lines: \varepsilon_0 = \varepsilon_1 = 0.5.
attractor exists at the very edge of the forbidden interval, and any value of H smaller than the theoretical value moves this attractor into the forbidden interval.

7.2 Experiments with Long Strings. In this section we show experiments to test the stability of the encodings described when long strings are
Figure 5: Minimum and maximum values of H(L)/H_{th} for five randomly generated Mealy machines with \chi = 10 encoded in first-order Mealy NSM. Upper lines: theoretical \varepsilon_0 and \varepsilon_1 (see equations 4.7 and 4.14). Lower lines: \varepsilon_0 = \varepsilon_1 = 0.5.
Figure 6: Minimum and maximum values of H_{exp} for five randomly generated Moore machines with \chi = 10 encoded in second-order Moore NSM with bias. Upper lines: theoretical \varepsilon_0 and \varepsilon_1 (see equations 4.7 and 4.14). Lower lines: \varepsilon_0 = \varepsilon_1 = 0.5.
Table 3: Minimum, Maximum, and Average H(20)/H_{th} Values for 25 Randomly Generated Mealy and Moore Machines of Sizes 2, 3, 4, 7, and 10 and All Strings Up to Length 20. (The rows are the first-order, second-order with bias, and second-order without bias encodings of Mealy machines, and the first-order and second-order with bias encodings of Moore machines; the columns give the minimum, average, and maximum H(20)/H_{th} under \varepsilon_0 = \varepsilon_1 = 0.5 and under the theoretical \varepsilon_0 and \varepsilon_1.)

Note: The left and right parts of the table differ in what "correctly classified" means: correct side of [0, 1] with no forbidden interval (\varepsilon_0 = \varepsilon_1 = 0.5) on the left; inside the predicted high and low intervals for stable behavior on the right.
Figure 7: Minimum and maximum values of H_{exp} for five randomly generated Moore machines with \chi = 10 encoded in Elman nets (first-order NSM). Upper lines: theoretical \varepsilon_0 and \varepsilon_1 (see equations 4.7 and 4.14). Lower lines: \varepsilon_0 = \varepsilon_1 = 0.5.
presented to the network. We have used the same FSM as in the previous section and looked for the minimum H needed to correctly process 200 randomly generated strings of length 2000 (the same set of strings for all the experiments). We call this value H_{2000}. The values of H obtained in these experiments are the same as those obtained in the previous section or slightly larger. The maximum deviation from the experimental H obtained for L = 20 is 0.02 for all the encodings. This illustrates that the value of H(L) seems to stabilize around an asymptotical value that is very close to the value obtained for L = 20.

In Table 4 we summarize these results: normalized minimum, average, and maximum experimental values of H_{2000} for all the FSM and DTRNN encodings described, using the same FSM as in section 7.1. As in that section, we used two criteria to define a correct output of the network: a weak criterion (using a forbidden interval of zero measure) and a strong criterion (using a forbidden interval with the theoretical \varepsilon_0 and \varepsilon_1 obtained in the previous sections). As in the experiments in section 7.1, we can see that the values of H_{2000}/H_{th} obtained with the weak criterion are lower than the values of H_{2000}/H_{th} obtained with the strong criterion. For the same encodings as in section 7.1 (first-order and second-order with bias encodings), we found that the maximum value of H_{2000} is exactly the theoretical upper limit H_{th} obtained for the encoding, and the average H_{2000}/H_{th} is very close to 1.0. Most of the minimum values of H_{2000}/H_{th} are very close to zero because the second-order constructions without biases for FSM of size 2 need a very low value of H (see note 12).
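The search for H(L) and H_{2000} can be sketched in a few lines. Everything below is illustrative: a hypothetical two-state DFA stands in for the randomly generated machines, the strong criterion is approximated with arbitrary thresholds \varepsilon_1 = 0.8 and \varepsilon_0 = 0.2 rather than the theoretical interval, and the encoding is the assumed 4.1.1-style second-order construction with output equations 6.3 and 6.4:

```python
import math
from itertools import product

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy DFA (hypothetical, illustrative): accepts an even number of 'a's.
Q, Sigma, F, q0 = [0, 1], ['a', 'b'], {0}, 0
delta = {(0, 'a'): 1, (1, 'a'): 0, (0, 'b'): 0, (1, 'b'): 1}
EPS0, EPS1 = 0.2, 0.8     # arbitrary stand-ins for the theoretical interval

def correct_up_to(H, L):
    """Strong-style criterion: accepted strings must give y >= EPS1 and
    rejected ones y <= EPS0, for every string of length <= L."""
    n = len(Q)
    for length in range(L + 1):
        for w in product(Sigma, repeat=length):
            x = [1.0 if q == q0 else 0.0 for q in Q]    # one-hot start
            q, y = q0, (1.0 if q0 in F else 0.0)
            for s in w:
                y = g(sum((H if delta[(Q[j], s)] in F else 0.0) * x[j]
                          for j in range(n)) - H / 2)   # eqs. 6.3, 6.4
                x = [g(sum((H if delta[(Q[j], s)] == Q[i] else 0.0) * x[j]
                           for j in range(n)) - H / 2) for i in range(n)]
                q = delta[(q, s)]
            if not ((y >= EPS1) if q in F else (y <= EPS0)):
                return False
    return True

def H_of_L(L, step=0.01):
    """Minimum H (to the given precision) with correct behavior up to L."""
    H = step
    while not correct_up_to(H, L):
        H += step
    return round(H, 2)

hs = [H_of_L(L) for L in (1, 2, 4, 8)]
print(hs)   # grows with string length, then flattens out
```

On this toy machine the search reproduces the qualitative behavior reported above: H(L) increases with L and stabilizes at a value well below a loose cap, as the saturated activations settle onto a fixed point.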
Table 4: Minimum, Maximum, and Average Values of H_{2000}/H_{th} for 5 Randomly Generated Mealy and Moore Machines for 200 Strings of Length 2000. (The layout is the same as in Table 3: rows are the first-order, second-order with bias, and second-order without bias encodings of Mealy machines, and the first-order and second-order with bias encodings of Moore machines; the columns give the minimum, average, and maximum H_{2000}/H_{th} under \varepsilon_0 = \varepsilon_1 = 0.5 and under the theoretical \varepsilon_0 and \varepsilon_1.)
7.3 Experiments with Large FSM. We have performed experiments to test the stability of the encodings with large FSM. We randomly generated 10 automata (5 Mealy and 5 Moore machines) with |Q| = 200 and, after eliminating the useless states, obtained automata of sizes in the range 81–165. Using the same long strings as in section 7.2, we obtained the minimum H_{2000} needed to obtain the correct outputs, using the weak and the strong criteria. The results obtained are shown in Table 5. In this table we show the size parameters for each construction (minimum, average, and maximum) and the normalized H_{2000} with the weak and strong criteria (minimum, average, and maximum). In all the experiments we can see that the experimental values of H_{2000} obtained are always lower than the theoretical H computed in previous sections for each architecture (i.e., H_{2000}/H_{th} \le 1). The main difference from the results reported in the previous section is that automata having very low fan-in, and therefore leading to very small values of H_{th}, are now very unlikely to appear. The results reported in this section illustrate again that the upper bounds given by the theory are sufficient to ensure stable behavior in the network even when the network size and string length are large.

8 Conclusion
We have extended the results of Omlin and Giles (1996a, 1996b) on the stable encoding of FSM in DTRNN to a larger family of sigmoids (in particular, to any positive, strictly growing, continuous, bounded activation function), a larger variety of DTRNN architectures (including various first-order and second-order networks), and a wider class of FSM (Mealy and Moore machines), by establishing a simplified procedure to prove the stability of encoding schemes based on the classical one-hot representation of states in the hidden units of DTRNN. A summary of the encodings described in the paper is presented in Table 6. Our formulation is based on ideas put forward by Casey (1996) and earlier by Pollack (1991). To justify the constructions, we have found it very natural to speak about DTRNN in terms of Mealy and Moore neural state machines. These constructions are based on bounding criteria, in contrast to the one by Omlin and Giles (1996a), which relies on a detailed analysis of fixed-point attractors (although, as one of the referees pointed out, the only way for the activations of a neural unit to stay near one of the two saturation values is through the existence of a near-saturation attractive fixed point (Li, 1992)). We have performed extensive numerical experiments for the standard logistic function that illustrate the validity of the results and show that for some constructions, the weights prescribed by the theory are larger than necessary, due to the fact that the worst cases used to derive them seldom occur.

We plan to expand this work to study distributed (as opposed to one-hot) encodings of state. Some of the results presented here have already
Table 5: Minimum, Average, and Maximum Values of H_{2000}/H_{th} Obtained for Five Randomly Generated Mealy and Moore Machines with |Q| up to 200 and Eliminating Useless States. (Rows are the first-order, second-order with bias, and second-order without bias encodings of Mealy machines, and the first-order and second-order with bias encodings of Moore machines; the columns give the size parameter of each construction and the minimum, average, and maximum H_{2000}/H_{th} under \varepsilon_0 = \varepsilon_1 = 0.5 and under the theoretical \varepsilon_0 and \varepsilon_1.)

Note: We used 200 randomly generated strings of length 2000.
Table 6: Summary of the Encodings Proposed in the Article.

FSM Class | DTRNN Architecture | Number of Hidden Units   | Low Weight | High Weight | Bias           | H Determined by | Section
Mealy     | Second order       | |Q|                      | 0          | H           | -H/2           | fan-in          | 4.1.1
Mealy     | Second order       | |Q|                      | -H         | H           | 0              | |Q|             | 4.1.2
Mealy     | Robinson-Fallside  | (|Q| + |\Gamma|)|\Sigma| | 0          | H           | -H/2 or -3H/2  | fan-in          | 4.2
Moore     | Elman              | |Q||\Sigma|              | 0          | H           | -H/2 or -3H/2  | fan-in          | 5.1
Moore     | Second order       | |Q|                      | 0          | H           | -H/2           | fan-in          | 5.2
Moore     | Second order       | |Q|                      | -H or 0    | H           | -H/2 or 0      | fan-in and |Q|  | 5.2
been used by some of us (Kremer et al., 1998) to constrain the training of DTRNN so that they learn to behave as FSM from samples, to encode nondeterministic finite automata into second-order DTRNN (Carrasco et al., 1999), and to encode an extension of Mealy machines into second-order DTRNN (Ñeco, Forcada, Carrasco, & Valdés-Muñoz, 1999).

The results we have presented are related to those obtained by Šíma (1997), who proves that the behavior of any DTRNN using threshold activation functions may be stably emulated by another DTRNN using activation functions in a very general class that includes the sigmoid functions considered in this article. As one of the referees has pointed out, this result may be combined with existing results concerning the encoding of finite-state machines in threshold DTRNN:

A result by Šíma and Wiedermann (1998) shows that any regular expression of length l may be recognized by a threshold DTRNN having \Theta(l) units. The result is that a regular expression may be very efficiently encoded in analog DTRNN having activation functions more general than ours. The main difference between this result and ours is that we are interested in encoding finite-state machines directly.

Alon et al. (1991), Indyk (1995), and Horne and Hush (1996) show that finite-state machines with n states may be encoded in threshold DTRNN having a number of units that grows sublinearly with n. Combining this with Šíma's (1997) result, we obtain that finite-state machines may be sublinearly encoded in analog DTRNN. However, Alon et al.'s (1991) and Horne and Hush's (1996) results, though more efficient, are not as directly applicable to a given FSM as ours.

All of the constructions presented in this article may also be straightforwardly applied to threshold DTRNN using H = 1. Šíma's (1997) result may then convert these threshold DTRNN into sigmoid DTRNN.
The resulting weights are basically twice as large as the ones obtained L õ ma’s construction is valid in this article. This is due to the fact that S´ for general threshold DTRNN, whereas our result takes advantage of the fact that states and inputs are encoded in simple one-hot fashion. We provide a detailed description on how to construct stable encodings of a variety of FSM into a variety of commonly used rst- and second-order DTRNN and a way to choose weights that are relatively small, which may be of interest when the construction will be followed by gradient-descent learning. Weights are still rather large for gradient-descent learning, but as section 7 shows, smaller values of H may still guarantee correct behavior. Acknowledgments
This work has been supported by the Spanish Comisión Interministerial de Ciencia y Tecnología through grant TIC97-0941. We also thank C. Lee
Stable Encoding of Finite-State Machines
2171
Giles, Stefan C. Kremer, Christian W. Omlin, José Oncina, and all of the anonymous referees for their critical comments and ideas.

References

Alon, N., Dewdney, A. K., & Ott, T. J. (1991). Efficient simulation of finite automata by neural nets. Journal of the Association for Computing Machinery, 38, 495–514.

Alquézar, R., & Sanfeliu, A. (1995). An algebraic framework to represent finite state automata in single-layer recurrent neural networks. Neural Computation, 7, 931–949.

Arai, K.-I., & Nakano, R. (1996). Annealing RNN learning of finite state automata. In Proceedings of ICANN'96 (pp. 519–524).

Arai, K.-I., & Nakano, R. (1997). Adaptive β-scheduling learning method of finite state automata by recurrent neural networks. In Progress in connectionist-based information systems: Proceedings of the 1997 International Conference on Neural Information Processing and Intelligent Information Systems (Vol. 1, pp. 351–354). Singapore: Springer-Verlag.

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Blair, A., & Pollack, J. B. (1997). Analysis of dynamical recognizers. Neural Computation, 9(5), 1127–1142.

Carrasco, R. C., Forcada, M. L., & Santamaría, L. (1996). Inferring stochastic regular grammars with recurrent neural networks. In L. Miclet & C. de la Higuera (Eds.), Grammatical inference: Learning syntax from sentences (pp. 274–281). Berlin: Springer-Verlag.

Carrasco, R. C., Oncina, J., & Forcada, M. L. (1999). Efficient encodings of finite automata in discrete-time recurrent neural networks. In Proceedings of ICANN'99, International Conference on Artificial Neural Networks (Vol. 2, pp. 673–677).

Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6), 1135–1178.

Casey, M. (1998). Correction to proof that recurrent neural networks can robustly recognize only regular languages. Neural Computation, 10(5), 1067–1069.

Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite state automata and simple recurrent networks. Neural Computation, 1(3), 372–381.

Das, S., & Das, R. (1991). Induction of discrete state-machine by stabilizing a continuous recurrent network using clustering. Computer Science and Informatics, 21(2), 35–40.

Das, S., & Mozer, M. (1998). Dynamic on-line clustering and state extraction: An approach to symbolic learning. Neural Networks, 11(1), 53–64.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Floyd, T. L. (1996). Digital fundamentals (6th ed.). Englewood Cliffs, NJ: Prentice-Hall.
2172
R. C. Carrasco, M. L. Forcada, M. Á. Valdés-Muñoz, & R. P. Ñeco
Forcada, M. L., & Carrasco, R. C. (1995). Learning the initial state of a second-order recurrent neural network during regular-language inference. Neural Computation, 7(5), 923–930.

Frasconi, P., Gori, M., Maggini, M., & Soda, G. (1996). Representation of finite-state automata in recurrent radial basis function networks. Machine Learning, 23, 5–32.

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., & Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4(3), 393–405.

Gori, M., Maggini, M., Martinelli, E., & Soda, G. (1998). Inductive inference from noisy examples using the hybrid finite state filter. IEEE Transactions on Neural Networks, 9(3), 571–575.

Goudreau, M. W., Giles, C. L., Chakradhar, S. T., & Chen, D. (1994). First-order vs. second-order single layer recurrent neural networks. IEEE Transactions on Neural Networks, 5(3), 511–513.

Haykin, S. (1998). Neural networks—A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice-Hall.

Hertz, J., Krogh, A., & Palmer, R. G. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.

Hopcroft, J. E., & Ullman, J. D. (1979). Introduction to automata theory, languages, and computation. Reading, MA: Addison-Wesley.

Horne, B. G., & Giles, C. L. (1995). An experimental comparison of recurrent neural networks. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 697–704). Cambridge, MA: MIT Press.

Horne, B. G., & Hush, D. R. (1996). Bounds on the complexity of recurrent neural network implementations of finite state machines. Neural Networks, 9(2), 243–252.

Hush, D., & Horne, B. (1993). Progress in supervised neural networks. IEEE Signal Processing Magazine, 10(1), 8–39.

Indyk, P. (1995). Optimal simulation of automata by neural nets. In Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science (pp. 337–348). Berlin: Springer-Verlag.

Kleene, S. C. (1956). Representation of events in nerve nets and finite automata. In C. E. Shannon & J. McCarthy (Eds.), Automata studies. Princeton, NJ: Princeton University Press.

Kolen, J. F. (1994). Fool's gold: Extracting finite state machines from recurrent network dynamics. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 501–508). San Mateo, CA: Morgan Kaufmann.

Kolen, J. F., & Pollack, J. B. (1995). The observer's paradox: Apparent computational complexity in physical systems. Journal of Experimental and Theoretical Artificial Intelligence, 7, 253–277.

Kremer, S. C. (1995). On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(4), 1000–1004.

Kremer, S. (1996). A theory of grammatical induction in the connectionist paradigm. Unpublished Ph.D. dissertation, University of Alberta, Edmonton, Alberta.
Kremer, S., Ñeco, R. P., & Forcada, M. L. (1998). Constrained second-order recurrent networks for finite-state automata induction. In L. Niklasson, M. Bodén, & T. Ziemke (Eds.), Proceedings of the 8th International Conference on Artificial Neural Networks ICANN'98 (Vol. 2, pp. 529–534). London: Springer.

Li, L. K. (1992). Fixed point analysis for discrete-time recurrent neural networks. In Proceedings of IJCNN (Vol. 4, pp. 134–139).

Maass, W., & Orponen, P. (1998). On the effect of analog noise in discrete-time analog computations. Neural Computation, 10(5), 1071–1095.

Maass, W., & Sontag, E. D. (1999). Analog neural nets with gaussian or other common noise distribution cannot recognize arbitrary regular languages. Neural Computation, 11(3), 771–782.

Manolios, P., & Fanelli, R. (1994). First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6), 1154–1172.

Maskara, A., & Noetzel, A. (1992). Forcing simple recurrent neural networks to encode context. In Proceedings of the 1992 Long Island Conference on Artificial Intelligence and Computer Graphics.

McCulloch, W. S., & Pitts, W. H. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115–133.

Miller, C. B., & Giles, C. L. (1993). Experimental comparison of the effect of order in recurrent neural networks. International Journal of Pattern Recognition and Artificial Intelligence, 7(4), 849–872.

Minsky, M. L. (1967). Computation: Finite and infinite machines. Englewood Cliffs, NJ: Prentice-Hall.

Ñeco, R. P., & Forcada, M. L. (1996). Beyond Mealy machines: Learning translators with recurrent neural networks. In Proceedings of the World Conference on Neural Networks '96 (pp. 408–411). San Diego.

Ñeco, R. P., Forcada, M. L., Carrasco, R. C., & Valdés-Muñoz, Á. (1999). Encoding of sequential translators in discrete-time recurrent neural nets. In Proceedings of the European Symposium on Artificial Neural Networks ESANN'99 (pp. 375–380).

Omlin, C. W., & Giles, C. L. (1996a). Constructing deterministic finite-state automata in recurrent neural networks. Journal of the ACM, 43(6), 937–972.

Omlin, C. W., & Giles, C. L. (1996b). Stable encoding of large finite-state automata in recurrent neural networks with sigmoid discriminants. Neural Computation, 8, 675–696.

Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.

Robinson, T., & Fallside, F. (1991). A recurrent error propagation network speech recognition system. Computer Speech and Language, 5, 259–274.

Sanfeliu, A., & Alquézar, R. (1994). Active grammatical inference: A new learning methodology. In D. Dori & A. Bruckstein (Eds.), Shape and structure in pattern recognition. Singapore: World Scientific.

Šíma, J. (1997). Analog stable simulation of discrete neural networks. Neural Network World, 7, 679–686.

Šíma, J., & Wiedermann, J. (1998). Theory of neuromata. Journal of the ACM, 45(1), 155–178.
Tiňo, P., Horne, B. G., Giles, C. L., & Collingwood, P. C. (1998). Finite state machines and recurrent neural networks—automata and dynamical systems approaches. In J. E. Dayhoff & O. Omidvar (Eds.), Neural networks and pattern recognition (pp. 171–220). San Diego, CA: Academic.

Tiňo, P., & Sajda, J. (1995). Learning and extracting initial Mealy automata with a modular neural network model. Neural Computation, 7(4).

Tsoi, A. C., & Back, A. (1997). Discrete time recurrent neural network architectures: A unifying review. Neurocomputing, 15, 183–223.

Watrous, R. L., & Kuhn, G. M. (1992). Induction of finite-state languages using second-order recurrent networks. Neural Computation, 4(3), 406–414.

Zeng, Z., Goodman, R. M., & Smyth, P. (1994). Discrete recurrent neural networks for grammatical inference. IEEE Transactions on Neural Networks, 5(2), 320–330.

Zeng, Z., Goodman, R. M., & Smyth, P. (1993). Learning finite state machines with self-clustering recurrent networks. Neural Computation, 5(6), 976–990.

Received November 16, 1998; accepted September 23, 1999.
LETTER
Communicated by Thomas Dietterich
Approximate Maximum Entropy Joint Feature Inference Consistent with Arbitrary Lower-Order Probability Constraints: Application to Statistical Classification

David J. Miller
Lian Yan
Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802, U.S.A.

We propose a new learning method for discrete space statistical classifiers. Similar to Chow and Liu (1968) and Cheeseman (1983), we cast classification/inference within the more general framework of estimating the joint probability mass function (p.m.f.) for the (feature vector, class label) pair. Cheeseman's proposal to build the maximum entropy (ME) joint p.m.f. consistent with general lower-order probability constraints is in principle powerful, allowing general dependencies between features. However, enormous learning complexity has severely limited the use of this approach. Alternative models such as Bayesian networks (BNs) require explicit determination of conditional independencies. These may be difficult to assess given limited data. Here we propose an approximate ME method, which, like previous methods, incorporates general constraints while retaining quite tractable learning. The new method restricts joint p.m.f. support during learning to a small subset of the full feature space. Classification gains are realized over dependence trees, tree-augmented naive Bayes networks, BNs trained by the Kutato algorithm, and multilayer perceptrons. Extensions to more general inference problems are indicated. We also propose a novel exact inference method when there are several missing features.

1 Introduction
In this work, we tackle the difficult problem of learning the joint probability mass function (p.m.f.) for a discrete feature space, consistent with general lower-order probabilistic constraints involving collections of features (e.g., single features, pairs, triples). Given the joint feature p.m.f., one can address a variety of inference tasks: computing a posteriori probabilities (1) for the class feature given knowledge of the remaining feature values; (2) for the class feature given that some features are missing; and (3) for any (e.g., user-specified) features given values for the other features (or given values for a feature subset). The first two capabilities form the basis for pseudo-Bayes classification. Task 3 is a generalized classification or inference capability

Neural Computation 12, 2175–2207 (2000) © 2000 Massachusetts Institute of Technology
2176
David J. Miller and Lian Yan
allowing any of the features (or a subset of features) to be inferred at a given time. Thus, for this case, the model must be learned consistent with its potential application to any of the (combinatoric) inference tasks a model user may select. This general inference capability has a variety of applications, including diagnosis with multiple possible diseases, information retrieval, and imputation of missing fields in databases. In this work, we mainly focus on the standard classification capabilities (1 and 2). However, the framework we suggest is also amenable, by natural extension, to address capability 3, and it is from the viewpoint of this last, general inference objective that the method we propose can be best understood and related to some existing techniques.

In particular, general inference capability is one of the main attractive features of Bayesian networks (BNs) (Pearl, 1988; Herskovits & Cooper, 1991; Heckerman, Geiger, & Chickering, 1995; Buntine, 1996; Neal, 1992; Frey, 1998). These structures represent the joint p.m.f. for a feature space explicitly as a product of conditional probabilities over the constituent scalar features. BNs have generated a great deal of recent interest in machine learning, based on their versatility for doing statistical inference and their convenient and informative representation using graphs. However, although BNs have attractive features and have achieved mainstream use, they also have limitations relating to their model structure and difficulties with learning this structure. BNs require explicit expression of conditional independence relations between features. While this is informative to a model user, the conditional independence structure is not so easily learned from training data. First, strict determination of conditional independence is not easily assessed (and thus may not be the most appropriate approach) given limited data.
Second, optimizing over the (combinatoric) set of possible BN structures is typically accomplished by sequential, greedy methods (Herskovits & Cooper, 1991; Heckerman et al., 1995), which may be suboptimal. Also, associated with sequential learning of the model is the issue of where to stop, to avoid overfitting. While priors over structures have been introduced for Bayesian learning of the model (Cooper & Herskovits, 1992; Heckerman et al., 1995), some questions have been expressed concerning the choice of prior and the effectiveness of this scheme (Ripley, 1996). Finally, a BN difficulty for inference problems such as classification is associated with the Markov blanket (see, e.g., Haykin, 1999) for nodes and features in the network. This blanket, determined by the conditional independence structure, defines the set of features that may possibly be used (contingent on which features are present or missing) when computing a posteriori probabilities

Footnote 1: Originally these models were not learned from an existing database, but were formed by eliciting conditional independencies and probabilities from domain experts. However, this effort is time-consuming and costly. Moreover, in many domains, databases have been collected (or are currently being collected). Thus, there is now great impetus to learn the models from training data.
Approximate Maximum Entropy Inference
2177
for a given feature. Friedman and Goldszmidt (1996) noted that the (typically small) Markov blanket may limit the classification performance of BNs. They thus proposed to improve performance (using an approach similar to Chow and Liu, 1968) by conditioning all features on the feature to be classified.

Our approach avoids some BN problems by using an alternative model representation, based on a set of parameters rather than a set of conditional probabilities. First, our representation does not require making any explicit conditional independence assessments or choices. In fact, our model allows (even weak) dependencies among all features. This also means that, unlike BNs, our model will typically make use of all known features when forming inferences. Moreover, our (parameterized) model is amenable to an effective joint optimization technique, which we propose here. There is a potential overfitting issue for our model just as for BNs. While we do not propose a sophisticated technique for model selection, we do suggest a heuristic principle that, though very simple, is well supported by our experimental results. One disadvantage of our model is that it does not make conditional independencies explicit and thus is less informative to the user about which features are important for making inferences. In this sense, our model is similar to (black box) neural networks. However, we are willing to sacrifice informativeness if this helps to yield more accurate inferences.

Our work builds on earlier work of Cheeseman (1983) and Ku and Kullback (1969). Cheeseman proposed finding the maximum entropy (ME) joint feature p.m.f. consistent with (arbitrary) lower-order probability constraints. The ME principle has been championed as a general tool for statistical inference by Jaynes (1982). It can be interpreted and justified from several standpoints:

1. Information theory (Shore & Johnson, 1980).

2. Statistics. The p.m.f.
that maximizes entropy subject to expected value constraints comes from a special parametric family (a Gibbs p.m.f.) based on a set of Lagrange multipliers. As noted, for example, in Berger, Della Pietra, and Della Pietra (1996), the p.m.f. that solves the ME problem is the one (within this parametric family) that maximizes the likelihood of the training sample.

3. Probability theory. The ME p.m.f. can be understood from the standpoint of achieving observed expected values in an exponentially greater number of ways than any other distribution (Jaynes, 1982).

The ME principle has been applied in a number of settings (Burg, 1975; Jaynes, 1982). Within pattern recognition, it has been applied to unsupervised learning (Rose, Gurewitz, & Fox, 1992; Buhmann & Kühnel, 1994), supervised learning (Miller, Rao, Rose, & Gersho, 1996), natural language processing (Berger et al., 1996), image modeling (Zhu, Wu, & Mumford, 1997), and statistical learning theory (Bilbro & Van den Bout, 1992). Successful application of this principle rests with the amount and accuracy of domain knowledge encoded into the model by the probabilistic constraints. Cheeseman's suggestion to incorporate arbitrary lower-order constraints is in principle powerful, allowing the joint p.m.f. to express general dependencies between features. Unfortunately, there is a substantially challenging learning problem for estimating the ME model parameters.

Ku and Kullback (1969) proposed a "simple" iterative algorithm for ME joint p.m.f. estimation that successively satisfies one probability constraint at a time. Satisfying one constraint will likely cause violation of others, thus requiring repeated cycling through the constraints. Ku and Kullback's algorithm learns a model of size exponential in the feature dimensionality. However, much more telling of its enormous complexity is the fact that to satisfy a single constraint at each step, one must marginalize the joint p.m.f. over all features not included in this constraint. For a feature dimensionality of size N with J discrete values per feature and pairwise feature constraints, this requires summing J^(N−2) terms. This operation, even if computed off-line, is completely infeasible for "large" N (perhaps 5 or 10, depending on J), which certainly will be encountered in practice. Ku and Kullback presented results only for N = 4 and J = 2. Moreover, their method learns the joint p.m.f. values directly. Thus, if the joint p.m.f. is computed off-line (the only even remotely feasible way), one must store all the values for later use. This entails huge storage.

Footnote 2: In this way, all features are used for doing inference on the class feature.
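Ku and Kullback's constraint-at-a-time scheme can be sketched as follows (a toy illustration with hypothetical function names and a single pairwise target, not their exact algorithm): the full joint table is rescaled so that one pairwise marginal matches its target, and computing that marginal requires summing over the other N − 2 features, which is exactly the J^(N−2)-term cost discussed above.

```python
import itertools

def satisfy_pairwise(P, target, m, n, N, J):
    """Rescale the full joint table so its (m, n) marginal matches `target`.
    Each marginal entry aggregates J**(N-2) joint entries, which is the
    per-constraint cost that makes the method infeasible for large N."""
    marg = {}
    for f in itertools.product(range(J), repeat=N):
        key = (f[m], f[n])
        marg[key] = marg.get(key, 0.0) + P[f]
    for f in itertools.product(range(J), repeat=N):
        key = (f[m], f[n])
        if marg[key] > 0:
            P[f] *= target[key] / marg[key]
    return P

# Tiny example: N = 4 binary features, uniform start, one pairwise constraint.
N, J = 4, 2
P = {f: 1.0 / J**N for f in itertools.product(range(J), repeat=N)}
target = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # P[F_0, F_1]
P = satisfy_pairwise(P, target, 0, 1, N, J)
# The (0, 1) marginal now matches the target, but other constraints (if any)
# may have been violated, forcing repeated cycling through all of them.
marg01 = sum(p for f, p in P.items() if (f[0], f[1]) == (0, 0))
print(round(marg01, 6))  # -> 0.4
```

Even in this toy, the table has J^N entries; storing and repeatedly marginalizing it is what the authors' support-restricted approach is designed to avoid.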
Cheeseman (1983) recognized the potential significance of the ME approach for machine learning and showed how to use the model for inference. However, no practical learning method was proposed, and experimental results were not presented. Gevarter (1986) suggested a heuristic method that constructs an ME distribution by adding constraints incrementally. At each stage, the algorithm uses a significance test to identify the constraint least satisfied by the current model. The joint p.m.f. is then modified to incorporate this constraint. Gevarter demonstrated this technique, but only for a three-dimensional feature space. For high dimensions, this approach is also intractable, requiring marginalizations similar to Ku and Kullback's method. Another technique for learning ME joint p.m.f.s is iterative proportional fitting (Edwards & Havranek, 1985). Although this method also requires huge complexity, there is a version by Jirousek and Preucil (1995) with improved efficiency when the set of constraints (and hence the joint p.m.f.) has a special structure, as represented by an associated graph. Specifically, efficient computation can be achieved if the graph can be decomposed into (much) lower-dimensional cliques. However, here we work from the principle of encoding as much information as possible when forming the joint p.m.f.; we will encode constraints on all pairwise feature p.m.f.s. The
associated graph is in this case fully connected, with a single clique that includes all the features. Thus, Jirousek and Preucil's (clique-based decomposition) method cannot be applied efficiently.

As a practical alternative to learning the general ME joint p.m.f., Herskovits and Cooper (1991) proposed an algorithm called Kutato that applies the ME principle in a heuristic way for the construction of Bayesian networks. These authors recognized earlier ME work but noted that the learning complexity for the general ME approach is one of the primary reasons for seeking other models (BNs). Pearl (1988) notes objections to maximum entropy on other grounds related to causal interpretations of probabilistic reasoning, but also cites complexity as a main barrier to using the ME approach.

In this work, we reconsider the ME learning problem with special emphasis on the design of statistical classifiers. We develop an approximate ME technique that, unlike previous methods, has practical learning complexity. The key aspect of this approach, to be explained shortly, is the restriction (during learning) of joint p.m.f. support to a small subset of the full feature space. We learn parameterized a posteriori probabilities jointly with this support-restricted p.m.f. to maximize joint entropy while satisfying given constraints. Our method can encode any set of probabilistic constraint information that may be directly available. However, often in practice (and in all our experiments here), all that is available is a training set. In this case, we must select the constraints we will encode and estimate constraint probabilities from training set frequency counts. In choosing constraints, we are guided by the fact that the accuracy of frequency count estimates decreases with increasing order of the probabilities. This has led us to restrict the probabilities we encode to those of second order (i.e., pairwise feature probabilities).

Building ME models consistent with these constraints, we obtain classification results better than or competitive with known techniques for problems with high-dimensional feature space from the UC Irvine machine learning repository. We thus suggest a practical technique with good performance for (approximately) realizing ME inference. Moreover, we emphasize that while the work here focuses only on building classifiers (for inference on a single, fixed feature), our method can be extended to yield an algorithm, also of tractable complexity, for learning more general (task 3) inference models (Yan & Miller, 1999a).

In section 2, we discuss the ME learning problem and develop our proposed technique. We also introduce a novel, exact way to perform inference for the case of a few missing features. In section 3, we test our ME method for discrete space classification in comparison with dependence trees (Chow & Liu, 1968), tree-augmented naive Bayes (TAN) (Friedman & Goldszmidt, 1996), the Kutato algorithm for learning more general BNs, and with multilayer perceptrons. Both the complete and missing feature cases are considered. We conclude by identifying some extensions building on the work here.
2 Maximum Entropy Estimation for Discrete Space Classification

2.1 The Maximum Entropy Joint p.m.f. We first consider the general problem of obtaining the joint p.m.f. for a feature vector F = (F_1, F_2, ..., F_N), with F_i ∈ A_i, A_i the discrete, finite set {1, 2, 3, ..., |A_i|}, and with the full discrete feature space, comprising all possible N-tuples, denoted G ≡ A_1 × A_2 × ··· × A_N. The problem posed by Ku and Kullback (1969) and Cheeseman (1983) is that of choosing the joint p.m.f. P[F] to maximize joint entropy while satisfying equality constraints on specified lower-order probabilities. To illustrate the difficulties, it will suffice to consider second-order (pairwise) probability constraints. Moreover, pairwise constraints are the ones we will typically impose in practice. Thus, suppose we are given knowledge of all pairwise p.m.f.s: {P[F_m, F_n], ∀m, ∀n > m}. The ME joint p.m.f. consistent with these N(N−1)/2 pairwise p.m.f.s has the Gibbs form:

P[\mathbf{F} = \mathbf{f}] = \frac{\exp\left( \sum_{m=1}^{N-1} \sum_{n>m} \gamma_{(F_m = f_m, F_n = f_n)} \right)}{\sum_{\mathbf{f}' \in \mathcal{G}} \exp\left( \sum_{m=1}^{N-1} \sum_{n>m} \gamma_{(F_m = f'_m, F_n = f'_n)} \right)}, \qquad \mathbf{f} \equiv (f_1, \ldots, f_N) \in \mathcal{G}. \tag{2.1}
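Equation 2.1 can be evaluated by brute force only for tiny feature spaces. The sketch below (with arbitrary, randomly chosen multipliers, purely for illustration) makes the cost explicit: the partition function sums over every point of G, i.e., over J^N terms.

```python
import itertools, math, random

N, J = 3, 2                       # 3 features, each with |A_i| = 2 values
pairs = [(m, n) for m in range(N) for n in range(m + 1, N)]
random.seed(0)
# One Lagrange multiplier gamma per individual pairwise probability
# (the values here are arbitrary, for illustration only).
gamma = {(m, n, a, b): random.uniform(-1, 1)
         for (m, n) in pairs for a in range(J) for b in range(J)}

def energy(f):
    """Sum of pairwise multipliers in the exponent of equation 2.1."""
    return sum(gamma[(m, n, f[m], f[n])] for (m, n) in pairs)

space = list(itertools.product(range(J), repeat=N))   # |G| = J**N tuples
Z = sum(math.exp(energy(f)) for f in space)           # partition function: J**N terms
P = {f: math.exp(energy(f)) / Z for f in space}
print(round(sum(P.values()), 6))  # normalized joint p.m.f. -> 1.0
```

With N = 20 binary features, `space` would already hold over a million tuples; the exponential growth of this enumeration is the intractability the text describes.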
Here, γ_(F_m = f_m, F_n = f_n) is a Lagrange multiplier corresponding to an equality constraint on the individual pairwise probability P[F_m = f_m, F_n = f_n]. Note that the joint p.m.f. is specified by the set of Lagrange multipliers Γ = {γ_(F_m = f_m, F_n = f_n) : ∀m, ∀n > m, f_m ∈ A_m, f_n ∈ A_n}, with one multiplier for each individual pairwise probability. While the form of the joint p.m.f. is known, direct computation of equation 2.1 is generally intractable, since calculation of the partition function in the denominator requires computing and summing ∏_{i=1}^{N} |A_i| terms. This may not be problematic during model use, since one is generally not interested in P[F = f] directly, but rather in a posteriori probabilities of single features or feature groups. Such quantities, which depend on the joint p.m.f.'s Lagrange multipliers but not directly on P[F], can often be practically computed. However, this assumes the model parameters Γ are known. The real obstacle lies in the learning problem for acquiring the γ's. We have already described serious difficulties with previous ME methods. Since the form of the joint p.m.f. is known, one is tempted to try a direct optimization approach:

1. Write down the ME constrained optimization problem.

Footnote 3: We will use the uppercase F_i to denote a random variable, with the lowercase f_i denoting a particular realization.
2. Use the Lagrangian method to convert the constrained problem into an unconstrained one. The Lagrangian cost will have the form D − TH, with H the joint entropy, D a cost function encapsulating the equality constraints on lower-order probabilities, and T a Lagrange multiplier.

3. Express H and D explicitly as functions of the joint p.m.f. with the optimal form of equation 2.1.

4. Minimize D − TH over the γ parameters. Some search for the appropriate value of T, satisfying the constraints with maximum entropy, may be required.

There are two major difficulties with this approach. First, the optimization requires explicitly calculating P[F = f], intractable as already discussed. Second, the constraint cost D will require marginalizations over the joint p.m.f. similar to those of Ku and Kullback's method. These operations are also intractable. While the method just outlined is clearly infeasible, the approximate method next proposed is somewhat inspired by the principle of this direct approach. In the sequel, we focus on the ME learning problem given that inference on a single feature (classification) is the model's ultimate use objective. We will later indicate extensions of our approach suitable for other inference objectives.

2.2 Formulation for Statistical Classification. Here we develop an ME approach meeting two fundamental requirements: (1) in learning the joint p.m.f., we must produce a model useful for classification; (2) the learning must have practical complexity. To fulfill these desiderata, there are several key steps in our formulation.

Joint PMF representation. We previously discussed two different p.m.f. representations. One is the Gibbs form (see equation 2.1), based on Γ. A second form, used by Ku and Kullback, is an exhaustive table of joint p.m.f. values. Neither of these forms appears to be useful, due to the huge associated learning and storage complexities. Here we propose a third representation, which is a hybrid of the previous ones.
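The objective named in steps 2 through 4 can be written down explicitly for a toy full table (a sketch only; the cross-entropy form of D anticipates the Lagrangian formulation described later in section 2.2, and the uniform targets below are illustrative assumptions):

```python
import itertools, math

def pairwise_marginal(P, m, n):
    """Marginalize a joint p.m.f. table onto the feature pair (m, n)."""
    marg = {}
    for f, p in P.items():
        marg[(f[m], f[n])] = marg.get((f[m], f[n]), 0.0) + p
    return marg

def lagrangian(P, targets, T):
    """L = D - T*H: D sums the cross entropies (KL divergences) of the target
    pairwise p.m.f.s against the model's pairwise marginals; H is joint entropy."""
    D = 0.0
    for (m, n), tgt in targets.items():
        marg = pairwise_marginal(P, m, n)
        D += sum(t * math.log(t / marg[v]) for v, t in tgt.items() if t > 0)
    H = -sum(p * math.log(p) for p in P.values() if p > 0)
    return D - T * H

# Toy check: a uniform joint over 3 binary features satisfies uniform
# pairwise targets exactly, so D = 0 and L = -T * H.
P = {f: 1 / 8 for f in itertools.product(range(2), repeat=3)}
targets = {(m, n): {v: 0.25 for v in itertools.product(range(2), repeat=2)}
           for m in range(3) for n in range(m + 1, 3)}
print(round(lagrangian(P, targets, T=1.0), 6))  # -> -2.079442 (= -ln 8)
```

The sketch also exposes the difficulty: `pairwise_marginal` touches every entry of the full table, which is exactly the intractable marginalization the text identifies.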
It is based on both a reduced set of Lagrange multipliers and a set of p.m.f. table values. This parameterization is motivated by the fact that only a subset of Γ is needed to specify the a posteriori probabilities used for classification. Thus, these are the only Lagrange multipliers that our ME approach needs to learn.

Support approximation. While the new parameterization does explicitly represent the desired classification parameters, this joint p.m.f. model is unfortunately still not amenable to practical learning. Thus we require an approximation. Here again we will use the fact that the full joint p.m.f. is not needed during classification; only a posteriori probabilities are required. Therefore, we can choose to represent the joint p.m.f. in an approximate
fashion if this helps to achieve practical learning of the a posteriori model. What we suggest is restricting the support of the joint feature p.m.f. to a small subset of the full feature space. This support restriction is the key to achieving efficient ME learning. A reasonable support choice is the empirical support supplied by the available training data set. Lagrangian formulation. Given our approximate parameterization, the objective is to maximize entropy of the joint p.m.f. subject to equality constraints on all pairwise feature p.m.f.s. This constrained problem appears to be complicated because it requires simultaneous satisfaction of numerous individual constraints. However, we avoid potential difficulties by converting the constrained problem into an unconstrained one. In this scheme, each pairwise p.m.f. constraint is converted into a cost to be minimized; an equality constraint on a p.m.f. is conveniently expressed using cross entropy. We then form an overall constraint cost, D, as a sum of all the individual cross entropies. D measures the degree of constraint satisfaction, and when D is minimized (i.e., zeroed), all the constraints are fully satisfied. Our learning procedure thus minimizes the Lagrangian L = D − TH, requiring the variation of only a single parameter, T, to control all the constraints. As T is lowered, we grudgingly trade decreases in entropy for decreases in D. The ultimate goal is the solution with highest entropy such that D = 0. We next consider in more detail each of the three ME formulation steps.

2.2.1 Joint PMF Representation. Consider the feature vector F̃ = (F, C), with F ≡ (F_1, F_2, ..., F_N) now augmented by a class feature/label C ∈ {1, 2, ..., K}, K the number of classes. The ME joint p.m.f. for F̃ consistent with constraints on all pairwise feature p.m.f.s can still be written in the intractable Gibbs form (see equation 2.1). However, classification does not require computing P[F̃] during use. It only requires the a posteriori class probabilities associated with P[F̃]. These probabilities have the quite tractable Gibbs form

P[C = c | F = f] = \frac{e^{\sum_{i=1}^{N} \gamma_{(F_i = f_i,\, C = c)}}}{\sum_{c'=1}^{K} e^{\sum_{i=1}^{N} \gamma_{(F_i = f_i,\, C = c')}}}, \quad c = 1, \ldots, K.   (2.2)
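A small numerical sketch of the classification rule in equation 2.2; the feature sizes and multiplier values below are hypothetical, and any multiplier not listed defaults to zero (as at initialization):

```python
import math

N, K = 3, 2                      # toy sizes: three features, two classes

# Hypothetical class-dependent multipliers gamma[(i, f_i, c)]; anything
# not set defaults to 0.
gamma = {(0, 1, 0): 0.7, (1, 2, 0): -0.3, (0, 1, 1): 0.2, (2, 0, 1): 0.5}

def posterior(f):
    """P[C=c | F=f] in the Gibbs/softmax form of equation 2.2."""
    scores = [sum(gamma.get((i, fi, c), 0.0) for i, fi in enumerate(f))
              for c in range(K)]
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

p = posterior((1, 2, 0))
assert abs(sum(p) - 1.0) < 1e-12  # a valid p.m.f. over the K classes
```

Note that only the class-involving multipliers Γ_c enter this computation, which is the point the text makes next.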
Note that these probabilities do not depend on all the Lagrange multipliers; they depend on only the subset of Γ that represents constraints involving the class label, that is, Γ_c ≡ {γ_{(F_i = f_i, C = c)}, i = 1, ..., N, f_i = 1, ..., |A_i|, c = 1, ..., K}. Thus, given that we are interested only in inference on the class label, we only truly require the optimization method to learn this subset of the Lagrange multipliers. While this suggests the learning problem is now simpler than optimizing all of Γ, we must be careful here; the
Approximate Maximum Entropy Inference
2183
parameters Γ_c and hence the resulting a posteriori probabilities (see equation 2.2) will be the ME solution only if they are learned in a way consistent with the ME joint p.m.f. objective. What we require is a tractable learning approach that maximizes entropy of the joint p.m.f. subject to all available probabilistic constraints, while producing at least, but perhaps no more than, the reduced set of parameters Γ_c. What this really suggests, then, is that we should try to find an alternative joint p.m.f. representation that depends on Γ only through the reduced set Γ_c. Such a representation is in fact easily achieved via the Bayes rule decomposition:

P[F = f, C = c] = P[C = c | F = f] \, P[F = f], \quad \forall f, \forall c.   (2.3)

In equation 2.3, we will express P[C = c | F = f] in its Gibbs form (see equation 2.2) based on Γ_c. However, crucially, instead of specifying P[F] using Γ, we will instead specify it by an exhaustive table of p.m.f. values {P[f], f ∈ G ≡ A_1 × A_2 × ⋯ × A_N}. Note that with this parameterization, equation 2.3 is now a hybrid of the previous joint p.m.f. forms, based on both Lagrange multipliers and a table of values. Clearly, this representation introduces no suboptimality, as there are choices for the model {Γ_c, {P[f], f ∈ G}} that will achieve the ME joint p.m.f. for the augmented feature vector F̃. However, it does not appear that this provides any practical advantage during learning. In fact, equation 2.3 vastly expands the number of parameters that need to be learned, replacing Γ − Γ_c with the huge, exhaustive p.m.f. table {P[f], f ∈ G}. While the model {Γ_c, {P[f]}} is not conducive to efficient learning, a related approximate representation, developed next, will indeed allow tractable, effective (approximate) learning of the desired parameters Γ_c.

2.2.2 Support Approximation. The main obstacle to learning is the huge table {P[f], f ∈ G}. Conceivably, practical learning could be achieved if the size of this table was reduced. A key observation is, again, that the joint p.m.f. P[F] plays no role during model use or classification. Thus, the probabilities {P[f]} are accurately described as auxiliary variables, needed only to support the learning process (and then afterward discardable). This suggests we can choose to represent P[F] in an approximate fashion if this helps to achieve tractable learning. The approximation may to some extent affect the accuracy of the learned model Γ_c, but will in no way affect our ability to implement a classification or inference rule based on equation 2.2. Thus, instead of representing P[F] on the full feature space G, suppose that we restrict the nonzero support of P[F] to a small subset of G. We assume P[f] is nonzero only for f ∈ G_s, with the support set G_s a small subset of
the N-tuples from G. As we next introduce through an example, a natural choice for the support set G_s is posited by the available training data set itself.

Example. Consider the two-class Hepatitis problem, from the UC-Irvine Machine Learning repository. The feature space F is 19-dimensional, with 2 to 10 values for each feature. The data set consists of 155 19-tuples, each with an associated class label. Suppose we take 100 of these to form the training set. For this problem, the parameter set Γ_c consists of 164 Lagrange multipliers, each associated with a constraint on a particular pairwise probability, P[F_i = f_i, C = c]. Unfortunately, the number of joint p.m.f. cells/19-tuples f ∈ A_1 × A_2 × ⋯ × A_19 is enormous: over 40 billion. We cannot learn a joint p.m.f. table of this size. However, an obvious choice for a reduced support set is the empirical support expressed by the training data set itself. We suggest taking the nonaugmented feature vectors in the training set as the 19-tuples/p.m.f. cells comprising G_s. Since the training set points have occurred, they should be given nonzero probability; that is, the support set should at least include the training set. Moreover, recall that in practice the pairwise constraint probabilities will be estimated from frequency counts on the training set. Thus, by also choosing the training set as our joint p.m.f. support set, we at least ensure that the joint p.m.f. support will be sufficient to satisfy the constraints that we encode. With this support restriction, the number of cells with nonzero probability is only |G_s| = 100. This is a minute fraction of the table size required to specify P[F] on the full space G. Clearly, restricting support to the training set can dramatically reduce the number of p.m.f. values that need to be learned for high-dimensional feature spaces. We will see that this approximation is the key to achieving (quite) manageable ME learning.
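The two empirical ingredients this choice relies on, the support set G_s and the pairwise constraint probabilities, reduce to simple frequency counts; the tiny data set below is hypothetical:

```python
from collections import Counter

# Hypothetical training set: (feature_tuple, class_label) pairs.
train = [((0, 1, 2), 0), ((1, 0, 2), 1), ((0, 1, 1), 0), ((0, 1, 2), 1)]

# Support set G_s: the distinct nonaugmented feature vectors observed.
G_s = sorted(set(f for f, _ in train))

# Pairwise constraint probabilities P[F_k=i, F_l=j] and P[F_k=i, C=c]
# estimated by frequency counts on the training set.
M = len(train)
pair_ff = Counter()   # key: (k, i, l, j)
pair_fc = Counter()   # key: (k, i, c)
for f, c in train:
    for k, i in enumerate(f):
        pair_fc[(k, i, c)] += 1.0 / M
        for l in range(k + 1, len(f)):
            pair_ff[(k, i, l, f[l])] += 1.0 / M
```

Because the counted tuples all lie in G_s, the support is automatically sufficient to satisfy the encoded constraints, as the text notes.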
However, it remains to justify the accuracy of this approximation. Clearly our support restriction is completely untenable if the objective is to learn the augmented joint p.m.f. P[F̃] accurately. However, here we are interested only in accurately learning the classification parameters Γ_c. With respect to this goal, our compelling, albeit heuristic, justification for our support choice is based on our experimental results (section 3). In particular, we have found that while the classification performance does weakly depend on the size of G_s, the choice of small p.m.f. support, based on the training set or even a subsampling of it, leads to performance better than or competitive with known techniques for both real and synthetic data sets. Moreover, if the training set is very small, we can increase the size of G_s (by randomly sampling additional N-tuples from G) at the expense of increased learning complexity. However, in practice we have observed only small performance improvement as the support size is increased. We next develop the ME learning algorithm that capitalizes on our support restriction.
2.2.3 Lagrangian Formulation. Given the support restriction to G_s, we pose the ME learning problem as choosing the model parameters {{γ_{(F_m = f_m, C = c)}}, {P[f], f ∈ G_s}} to maximize joint entropy while satisfying the given pairwise constraints.⁴ Following the direct approach description of section 2.1, we form the quantities H and D. For notational convenience, let us represent the N-tuples comprising G_s by an enumerated set: G_s ≡ {f_m, m = 1, ..., |G_s|}, with f_m ≡ (f_1^{(m)}, f_2^{(m)}, ..., f_N^{(m)}). Then, based on the support restriction, the joint entropy, H, can be written⁵ as

H = -\sum_{m=1}^{|G_s|} P[f_m] \log P[f_m] - \sum_{m=1}^{|G_s|} P[f_m] \sum_{c=1}^{K} P[C = c | f_m] \log P[C = c | f_m].   (2.4)
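Equation 2.4 translates directly to code under the support restriction; the probability tables below are hypothetical placeholders:

```python
import math

# Hypothetical support-restricted model: P[f_m] over |G_s| = 3 cells and
# class posteriors P[C=c | f_m] for K = 2 classes.
P_f = [0.5, 0.3, 0.2]
P_c_given_f = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]

def plogp(p):
    return p * math.log(p) if p > 0 else 0.0  # convention: 0 log 0 = 0

# Joint entropy of equation 2.4: the entropy of P[f] plus the
# P[f]-weighted entropies of the class posteriors.
H = -sum(plogp(p) for p in P_f) \
    - sum(p * sum(plogp(q) for q in qs) for p, qs in zip(P_f, P_c_given_f))
```

The cost of evaluating H is linear in |G_s|, which is what makes the support restriction pay off.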
Now let us compose the constraint cost, D. For satisfying constraints on lower-order probabilities, there are several possible criteria that could be used, including squared distance and cross entropy. The latter criterion will be applied here. Suppose we have obtained estimates for all pairwise p.m.f.s {{P[F_k, F_l]}, {P[F_k, C]}} and wish to constrain our joint p.m.f. to agree with these probabilities. Given the support restriction, the model's pairwise p.m.f. P_M[F_k, F_l] can be calculated as

P_M[F_k = i, F_l = j] = \sum_{f_m \in G_s :\, f_k^{(m)} = i,\, f_l^{(m)} = j} P[f_m], \quad i = 1, \ldots, |A_k|,\ j = 1, \ldots, |A_l|.   (2.5)
Note that equation 2.5 is simply a marginalization of our support-restricted joint p.m.f. Further, we emphasize that in our case, unlike previous ME methods, this marginalization only requires operations linear in the support size |G_s|. Now, based on the properties of cross entropy, equality constraints on all probabilities involving the feature pair (F_k, F_l) will be satisfied if the
⁴ An equivalent but slightly different perspective on our problem is obtained by explicitly expressing the support restriction as additional constraints on entropy maximization. In particular, whereas the original ME problem posed in section 2.1 is to maximize entropy subject to constraints on all pairwise probabilities, we effectively address a related problem that adds many additional constraints, specifying that the joint p.m.f. be zero except on the support set G_s.
⁵ We assume p log p = 0 when p = 0, as in Cover & Thomas (1991).
cross entropy/Kullback distance,

D_c(P[F_k, F_l] \,\|\, P_M[F_k, F_l]) \equiv \sum_{i=1}^{|A_k|} \sum_{j=1}^{|A_l|} P[F_k = i, F_l = j] \log \frac{P[F_k = i, F_l = j]}{P_M[F_k = i, F_l = j]},   (2.6)
is minimized (i.e., zeroed). For pairwise constraints involving the class label, P[F_k, C], we can similarly compute model probabilities

P_M[F_k = i, C = c] = \sum_{f_m \in G_s :\, f_k^{(m)} = i} P[c | f_m] \, P[f_m], \quad i = 1, \ldots, |A_k|,\ c = 1, \ldots, K,   (2.7)
and form associated cross-entropy costs:

D_c(P[F_k, C] \,\|\, P_M[F_k, C]) \equiv \sum_{i=1}^{|A_k|} \sum_{c=1}^{K} P[F_k = i, C = c] \log \frac{P[F_k = i, C = c]}{P_M[F_k = i, C = c]}.   (2.8)
Since we require satisfying all pairwise constraints, our overall constraint cost D is formed as a sum of all the individual pairwise costs. Thus, D has the form:

D = \sum_{k=1}^{N-1} \sum_{l > k}^{N} D_c(P[F_k, F_l] \,\|\, P_M[F_k, F_l]) + \sum_{k=1}^{N} D_c(P[F_k, C] \,\|\, P_M[F_k, C]).   (2.9)
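Equations 2.5 through 2.9 amount to a few passes over the support set; a minimal sketch with hypothetical tables (two features plus the class label, so D has one feature-pair term and two feature-class terms):

```python
import math

# Hypothetical support set (N = 2 features) with model probabilities
# P[f_m] and class posteriors P[c | f_m] for K = 2 classes.
G_s = [(0, 0), (0, 1), (1, 1)]
P_f = [0.5, 0.3, 0.2]
P_c_given_f = [[0.8, 0.2], [0.3, 0.7], [0.5, 0.5]]

def model_pair_ff(k, l):
    """Equation 2.5: marginalize the support-restricted p.m.f., O(|G_s|)."""
    out = {}
    for fm, p in zip(G_s, P_f):
        key = (fm[k], fm[l])
        out[key] = out.get(key, 0.0) + p
    return out

def model_pair_fc(k):
    """Equation 2.7: model marginals involving the class label."""
    out = {}
    for fm, p, qs in zip(G_s, P_f, P_c_given_f):
        for c, q in enumerate(qs):
            key = (fm[k], c)
            out[key] = out.get(key, 0.0) + q * p
    return out

def cross_entropy(target, model):
    """Equations 2.6/2.8: cross entropy of target w.r.t. model marginals."""
    return sum(p * math.log(p / model[key])
               for key, p in target.items() if p > 0)

# Equation 2.9 on this toy problem. The feature-pair target table is
# hypothetical; the feature-class targets are taken equal to the model's
# own marginals, so those terms contribute zero (satisfied constraints
# cost nothing).
target_ff = {(0, 0): 0.4, (0, 1): 0.4, (1, 1): 0.2}
D = cross_entropy(target_ff, model_pair_ff(0, 1))
for k in range(2):
    D += cross_entropy(model_pair_fc(k), model_pair_fc(k))
```

D is nonnegative and vanishes exactly when every encoded pairwise constraint is met, which is what makes it a usable single constraint cost.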
Given the quantities D and H, we can form the Lagrangian cost function

L = D - TH,   (2.10)

with T a Lagrange multiplier, and seek to minimize this cost over {Γ_c, {P[f], f ∈ G_s}}. Our ultimate goal is to minimize L such that we find the solution with maximum H under the constraint D = 0. This will require some search for an appropriate value of T. The approach we suggest is a continuation method that has the algorithmic form of a deterministic annealing: starting from high T, we minimize
Figure 1: Entropy versus inverse temperature (log scale) for the Nursery data set.
L for a sequence of decreasing T values, tracking the solution from one T value to the next and stopping when D drops below a suitable threshold or stops changing. The minimization at each T is performed by gradient descent. This continuation method effectively conducts a search, starting from high H and high D. As T is lowered, all the constraints are gradually satisfied (D generally decreases) and the entropy also decreases, in a grudging fashion. Thus, we are able to control satisfaction of all the constraints by variation of a single parameter, T. An example is shown in Figures 1 and 2. Here, our algorithm was applied to learn a model for the Nursery data set from the UC Irvine repository. Note the reduction in D and H as T is lowered. When D is first zeroed (or, more practically, when it first drops below a preset threshold), the algorithm is terminated, with the Lagrange multipliers Γ_c retrieved. While we do not emphasize a physical analogy, the label "annealing" is clarified by the interpretation of L as the Helmholtz free energy, with T as "temperature" and D as "energy" to go along with the entropy, H.

Figure 2: Cross-entropy versus inverse temperature (log scale) for the Nursery data set.

A pseudocode description of our algorithm is shown in Table 1. The gradient descent procedure for minimizing L at each T is based on the partial derivatives specified in the appendix. Note further that to ensure the gradient-based optimization does not violate the constraint that {P[f_m], m = 1, ..., |G_s|} be a p.m.f., we do not optimize over the P[f_m] values directly. Instead, we optimize over the equivalent set of real-valued variables {λ_m, m = 1, ..., |G_s|}. These determine the p.m.f. via⁶

P[f_m] = \frac{e^{\lambda_m}}{\sum_{m'=1}^{|G_s|} e^{\lambda_{m'}}}, \quad m = 1, \ldots, |G_s|.   (2.11)
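A quick check of this reparameterization (the claim of footnote 6): any strictly positive p.m.f. is reachable, for example by setting λ_m = log P[f_m]; the target values below are hypothetical:

```python
import math

def pmf_from_lambda(lam):
    """Equation 2.11: softmax over the unconstrained variables lambda_m."""
    z = sum(math.exp(l) for l in lam)
    return [math.exp(l) / z for l in lam]

target = [0.5, 0.3, 0.2]              # any strictly positive p.m.f.
lam = [math.log(p) for p in target]   # one choice of lambdas achieving it
recovered = pmf_from_lambda(lam)
# recovered matches target; gradient steps on lambda_m can never leave
# the probability simplex.
```

This is why the gradient descent of Table 1 can update the λ_m freely, with no projection step.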
The key to the computational tractability of our method is the joint p.m.f. representation (see equation 2.3) and the support restriction on P[F]. These two steps together allow us to avoid the intractable normalization required for equation 2.1. The support restriction further allows us to marginalize the joint p.m.f. in an efficient manner. Clearly, the larger the support set G_s, the better the overall approximation of P[F̃]. In practice, this size must be limited in order to achieve tractable learning.⁷ However, although G_s will always be too small to learn P[F̃] accurately,⁸ the choice of small p.m.f. support, based on the training set or even a subsampling of it, is generally sufficient to learn the model parameters Γ_c accurately. This will be justified by our results.

Table 1: Pseudocode for the approximate maximum entropy learning algorithm.

Given: a training set S = {(f_m, c_m), m = 1, ..., M}, a subset size |G_s|, an annealing parameter g, a threshold ε, an initial temperature T_max, and a final temperature T_min.ᵃ
Calculate the lower-order constraint probabilities by frequency counts on S.ᵇ
If |G_s| < M, sample (randomly) from S to obtain a subset G_s. If |G_s| ≥ M, choose the first M subset points from S and randomly sample the remaining points from G − {f_m, m = 1, ..., M}.
Initialization: {γ_{(F_m = f_m, F_n = f_n)} ← 0},ᶜ {λ_m ← 0}, T ← T_max; calculate {P[f_m]} by equation 2.11 and {P[c | f_m]} by equation 2.2, and then compute the cross-entropy D_new via equation 2.9.
do {
  D_old ← D_new. Compute the cost L_new via equation 2.10.
  do {
    L_old ← L_new.
    Update {γ_{(F_m = f_m, F_n = f_n)}} and {λ_m} according to the derivatives in the appendix:
      {γ_{(F_m = f_m, F_n = f_n)} ← γ_{(F_m = f_m, F_n = f_n)} − r ∂L/∂γ_{(F_m = f_m, F_n = f_n)}},
      {λ_m ← λ_m − r ∂L/∂λ_m},
    with r a well-chosen learning rate.ᵈ
    Calculate {P[f_m]} by equation 2.11 and {P[c | f_m]} by equation 2.2.
    Compute the cost L_new via equation 2.10.
  } while L_old − L_new > ε.
  Compute the entropy H via equation 2.4 and the cross-entropy D_new via equation 2.9.
  T ← gT.
} while D_old − D_new > ε and T > T_min.
Output {γ_{(F_m = f_m, F_n = f_n)}}.

ᵃ In our experiments we used g = 0.9, ε = 0.000001, and T_max = 1. The algorithm does not appear to be too sensitive to changes in these values. Also, we chose T_min = 0.000001.
ᵇ Alternatively, these probabilities may be supplied directly.
ᶜ The curly braces indicate that the operation is applied to all variables in the set.
ᵈ We choose r via a bisection line search.

⁶ The parameterization in equation 2.11 in no way constrains the set of joint p.m.f.s {P[f_m], f_m ∈ G_s} that can be formed. For any set of nonzero joint p.m.f. values, there is a set of values {λ_m} that will achieve this p.m.f. via equation 2.11.
⁷ Interestingly, even practically feasible choices of |G_s| may correspond to a significant expansion in the number of parameters to be optimized (relative to just optimizing all of Γ). Yet while optimizing all of Γ via the direct approach is infeasible, our method, which may optimize a much larger set of parameters, is quite practical.
⁸ Unless G_s = G, P[F̃] will not be accurately learned, since in general the true joint p.m.f. will have full support. Another way to interpret this is that the approximation restricts us to finding solutions consistent with the probabilistic constraints, but with lower entropy than the true maximum entropy solution.
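The procedure of Table 1 can be sketched end to end on a toy problem. Everything below is illustrative: the data are invented, the schedule constants are arbitrary, and crude finite-difference gradients stand in for the analytic derivatives given in the appendix:

```python
import math
from collections import Counter

# Toy problem (all data hypothetical): N = 2 binary features, K = 2 classes.
train = [((0, 0), 0), ((0, 1), 0), ((0, 0), 1),
         ((1, 1), 1), ((1, 0), 1), ((1, 1), 0)]
M, N, K = len(train), 2, 2
G_s = sorted(set(f for f, _ in train))          # empirical support set

# Target pairwise probabilities from frequency counts on the training set.
t_ff = {k: v / M for k, v in Counter((f[0], f[1]) for f, _ in train).items()}
t_fc = [{k: v / M for k, v in Counter((f[j], c) for f, c in train).items()}
        for j in range(N)]

gkeys = [(j, v, c) for j in range(N) for v in (0, 1) for c in range(K)]
theta = [0.0] * (len(gkeys) + len(G_s))         # gammas, then lambdas

def cost(th, T):
    g = dict(zip(gkeys, th))                    # Lagrange multipliers
    lam = th[len(gkeys):]
    z = sum(math.exp(l) for l in lam)
    P_f = [math.exp(l) / z for l in lam]                        # eq. 2.11
    P_cf = []
    for f in G_s:                                               # eq. 2.2
        s = [math.exp(g[(0, f[0], c)] + g[(1, f[1], c)]) for c in range(K)]
        P_cf.append([x / sum(s) for x in s])
    H = -sum(p * math.log(p) for p in P_f)                      # eq. 2.4
    H -= sum(p * q * math.log(q) for p, qs in zip(P_f, P_cf) for q in qs)
    m_ff, m_fc = Counter(), [Counter() for _ in range(N)]       # eqs. 2.5, 2.7
    for f, p, qs in zip(G_s, P_f, P_cf):
        m_ff[(f[0], f[1])] += p
        for j in range(N):
            for c, q in enumerate(qs):
                m_fc[j][(f[j], c)] += q * p
    D = sum(p * math.log(p / m_ff[k]) for k, p in t_ff.items())  # eq. 2.6
    D += sum(p * math.log(p / m_fc[j][k])                        # eq. 2.8
             for j in range(N) for k, p in t_fc[j].items())
    return D - T * H, D                                          # eq. 2.10

# Annealing loop of Table 1: inner gradient descent on L at each T,
# then T <- g T, here with g = 0.7 and fixed step counts for brevity.
T = 1.0
for _ in range(30):
    for _ in range(60):
        L0, _ = cost(theta, T)
        grad = []
        for i in range(len(theta)):
            theta[i] += 1e-5
            grad.append((cost(theta, T)[0] - L0) / 1e-5)
            theta[i] -= 1e-5
        theta = [t - 0.2 * gi for t, gi in zip(theta, grad)]
    T *= 0.7
D_final = cost(theta, T)[1]   # constraint cost driven toward zero
```

On this toy problem, D falls well below its value at the zero initialization, mirroring the behavior shown in Figures 1 and 2.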
2.3 Algorithm Summary and Discussion. Several issues, both practical and theoretical, deserve further discussion: the selection of constraints, computational complexity, and learning and use with missing features.
2.3.1 Choice of Constraints. According to Jaynes (1982), if given a set of probabilistic constraints, one should apply all of them when maximizing entropy. In this way, one makes use of all available information. However, in practice we are not typically given probabilities directly but rather just a training set. Thus, we must select the constraints we wish to apply and estimate the associated constraint probabilities from training set frequency counts. Unfortunately, the accuracy of frequency count estimates decreases with increasing order of the probabilities. This fact must be considered when choosing constraints. There are several strategies that could be used to search for appropriate constraints. For example, statistical criteria such as confidence intervals could possibly be applied. One could also in principle devise a sequential model building approach, similar to Gevarter (1986), but with tractable complexity. In such a scheme, one would sequentially incorporate those dependencies least satisfied (e.g., in the Kullback sense) by the current model, while also taking into account the finite sample on which the measured dependencies are based. This methodology could be a very flexible approach for building the joint p.m.f. One could add constraints on individual probabilities (rather than on p.m.f.s), of varying order (e.g., feature pairs, triples). While we do recognize potential advantages of such a method, in this work we have taken a very simple, heuristic approach (yet one well supported by our results) to model selection. We have been guided by the fact of decreasing frequency count accuracy with increasing order. We also note that memory and use complexities increase with increasing order.
Accordingly, we suggest the following strategy: encode constraints on all p.m.f.s at the highest order consistent with their accurate measurement from training data and consistent with available computational resources.⁹ For the data sets considered here, this has led us to encode constraints on all pairwise p.m.f.s (we do not impose any constraints on third- or higher-order probabilities). While this is a crude rule of thumb for model selection, our experience with the UC Irvine data sets (see section 3) suggests that encoding higher-order constraints (even selectively) may not lead to improved performance and, in fact, may produce overfitting and poor generalization. 2.3.2 Computational Complexity. The most important statement to make here is that, unlike the methods of Ku and Kullback (1969) and Cheeseman (1983), ours is an ME technique for practically tackling inference on
⁹ Only constraints at this (maximum) order need be imposed, since this automatically imposes all constraints of lower order. For example, imposing constraints on all pairwise probabilities ensures satisfaction of all single-feature constraints.
high-dimensional feature spaces. In particular, we can (easily) learn a classifier that achieves performance competitive with alternative techniques for the Hepatitis problem, involving a 20-dimensional feature space, with 2 to 10 values per feature. For this problem, the learning time (on a Sparc10 machine) is only about 50 minutes. Moreover, we emphasize that the practicality of our method does not truly rely on the fact that we are interested here in inference only on a single feature (classification). We have recently developed an extension of our ME learning approach that builds models to do inference on multiple features (Yan & Miller, 1999a). The key aspect that makes our method (and its general inference extension) practical is our approximation of the joint feature p.m.f. on a small subset of the full feature space. This allows us to avoid an intractable partition function calculation and to implement probability marginalizations efficiently for constraint satisfaction. There are also several ways we can control complexity. First, we can vary the size of the p.m.f. support. We have found we can significantly reduce this size with little effect on performance, but with substantial reduction in learning time. (This will be discussed further in the results section.) A second way to control complexity is by altering the "annealing" temperature schedule: both the starting temperature and the rate at which the temperature is reduced. In practice, we have found that starting at fairly low temperatures (T = 1) and using a (fast) exponential decay rate for temperature reduction has little effect on performance. Some further clarification is due on this point. In particular, algorithmically our method is a deterministic annealing procedure, involving an effective cost minimization and the tracking of the solution through a sequence of decreasing temperatures.
For deterministic annealing applied in some settings, such as clustering (Rose et al., 1992) and supervised learning (Miller et al., 1996), the ultimate solution quality does depend on the chosen temperature schedule. Moreover, in these domains there are phase transitions in the annealing process by which the effective model size grows. Generally there is a significant amount of optimization performed at temperatures where the system undergoes a phase transition, which can slow the annealing process. In our experiments, distinctive phase transitions have not been observed. This at least partially explains the insensitivity to the temperature schedule and the practicality of our method. 2.3.3 Missing Features. The problem of handling missing features during training or use has been investigated by a number of researchers in both a neural network context (Ahmad & Tresp, 1993; Ghahramani & Jordan, 1994) and elsewhere (Cooper & Herskovits, 1992). We distinguish two basic scenarios: when many training patterns are incomplete (e.g., the UC Irvine Hepatitis data) and when all (or nearly all) training patterns are complete. We suggest handling the first case, for our model and alternative models, in a fashion similar to Little (1993) and Cheeseman et al. (1988). In particular,
when there are many incomplete patterns, it is reasonable that the model should account for this; that is, missing features can be viewed as additional information and used to help learn the model. If the features are not missing completely at random (Little, 1993), then information will be gleaned that may help performance. Alternatively, if features are missing completely at random, the model may still account for this with no significant deleterious effect, especially relative to the alternative of throwing away (many) incomplete training vectors. Thus, for each feature F_i, we suggest augmenting its discrete range A_i to include a value associated with "missing." We then learn in the conventional way. For our ME approach, this will require learning additional Lagrange multipliers for the new constraints involving the value "missing." When missing features occur during model use, we need to access and use the appropriate Lagrange multipliers. In the second scenario, missing features cannot be incorporated into the model and accounted for during training. Thus, we must apply techniques for handling missing features solely during classification and use. We will suggest two methods suited for a posteriori probability models: a simple heuristic and a novel exact procedure that is practically applicable when only a few features are missing. For clarity, suppose a single feature F_1 is missing. Then it is optimal to use the a posteriori probabilities

P[C = c | f_2, f_3, \ldots, f_N] = \sum_{i=1}^{|A_1|} P[C = c | F_1 = i, f_2, \ldots, f_N] \, P[F_1 = i | f_2, f_3, \ldots, f_N].   (2.12)
However, the p.m.f. P[F_1 = i | f_2, f_3, ..., f_N] is generally unknown. Thus, instead we can simply approximate equation 2.12 by using P[F_1] in place of P[F_1 | f_2, f_3, ..., f_N].¹⁰ This heuristic is naturally extended for multiple missing features F_1, F_2, ..., F_m (though with increased computation), based on the p.m.f. P[F_1, F_2, ..., F_m] and a new summation for each additional missing feature. While the method just described is quite heuristic, we next derive a novel, exact method that can be implemented for our ME model or any other a posteriori model. This technique demands extra training and extra complexity during use but can be practically implemented when only a few features are missing. The computational approach next described was first proposed in Miller & Yan (1999), where it was used to implement a combining rule for ensemble classification. Here we identify a second application. For clarity,

¹⁰ For multilayer perceptrons, with which we compare in section 3, we will assume normalization in the output layer. Thus, our missing feature description will suffice not only for our model but also for the MLP since the normalized MLP outputs are validly interpreted as a posteriori probabilities.
we start with the case of a single missing feature F_1. The desired probabilities can be expanded as

P[C = c | f_2, \ldots, f_N]
= \sum_{i=1}^{|A_1|} P[C = c, F_1 = i | f_2, \ldots, f_N]
= \sum_{i=1}^{|A_1|} P[C = c | F_1 = i, f_2, \ldots, f_N] \, P[F_1 = i | f_2, \ldots, f_N]
= \sum_{c'=1}^{K} P[c' | f_2, \ldots, f_N] \sum_{i=1}^{|A_1|} P[C = c | F_1 = i, f_2, \ldots, f_N] \, P[F_1 = i | c', f_2, \ldots, f_N],
\quad c = 1, \ldots, K.   (2.13)
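As the text derives next, these K equations collect into the matrix equation Aq = q, which the power method solves. A minimal sketch with hypothetical conditional probabilities (K = 2 classes, |A_1| = 2 values for the missing feature):

```python
# Hypothetical conditionals at fixed observed f_2, ..., f_N:
# p_c_given_if[i][c] = P[C = c | F_1 = i, f_2, ..., f_N]
# p_i_given_cf[c][i] = P[F_1 = i | C = c, f_2, ..., f_N]
p_c_given_if = [[0.9, 0.1], [0.3, 0.7]]
p_i_given_cf = [[0.7, 0.3], [0.1, 0.9]]
K, A1 = 2, 2

# a_{cc'} = sum_i P[c | F_1=i, ...] P[F_1=i | c', ...]: a stochastic
# matrix (columns sum to one), so A q = q has a unique probability-vector
# solution.
A = [[sum(p_c_given_if[i][c] * p_i_given_cf[cp][i] for i in range(A1))
      for cp in range(K)] for c in range(K)]

# Power method: iterate q <- A q (renormalizing) to the fixed point.
q = [1.0 / K] * K
for _ in range(200):
    q = [sum(A[c][cp] * q[cp] for cp in range(K)) for c in range(K)]
    s = sum(q)
    q = [x / s for x in q]
# q now approximates the desired P[C = c | f_2, ..., f_N]
```

Any other principal-eigenvector routine would serve equally well; the power method is shown because the text proposes it.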
The resulting K equations are best understood after collecting them into matrix form. First, we collect the desired probabilities into a column vector q = (P[C = 1 | f_2, ..., f_N], ..., P[C = K | f_2, ..., f_N])ᵀ and define the K×K matrix A with element a_{cc'} = \sum_{i=1}^{|A_1|} P[c | F_1 = i, f_2, ..., f_N] P[F_1 = i | c', f_2, ..., f_N]. The equations 2.13 are then seen to be equivalent to the matrix equation Aq = q. A probabilistic solution, if it exists, is a unit-length eigenvector with eigenvalue unity. To show a solution does exist, we first recognize that A is a stochastic matrix (Berman & Plemmons, 1979); that is, a_{kl} ≥ 0 ∀k, l and \sum_k a_{kl} = 1, ∀l. Thus, it has the form of a transition matrix associated with a Markov chain. Further, the Markov chain is ergodic, since in our case each entry a_{kl} can be assumed to be strictly positive. Then, from the theory of Markov chains, A has a principal eigenvalue of unity, with a unique unit-length eigenvector that has all entries nonnegative (Berman & Plemmons, 1979). Thus, a unique probability vector solution does exist. Moreover, the power method (or any other method for finding principal eigenvectors) can be used to solve for the probability vector. Note that solving equation 2.13 requires knowledge of P[F_1 = i | c, f_2, ..., f_N]. This can be obtained by learning an a posteriori model for F_1 in addition to learning one for C.¹¹ More generally, an a posteriori model will be needed for every feature that may be missing during use. Since learning a separate model for each feature is expensive for large N, this approach may be best

¹¹ The general problem of learning models that perform inference on multiple features is indicated later as an extension of our ME method.
suited to the situation where most features are faithfully obtained, with a small subset known to be missing sometimes. Interestingly, the procedure just described can be extended to handle multiple missing features without requiring any additional learning (beyond obtaining a posteriori probabilities for each feature that may be missing, as needed in the single missing feature case). In particular, for two missing features F_1 and F_2, we can derive the equations:

P[c | f_3, \ldots, f_N] = \sum_{c'} P[c' | f_3, \ldots, f_N] \Bigl( \sum_{i,j} P[c | F_1 = i, F_2 = j, f_3, \ldots, f_N] \, P[F_1 = i | c', F_2 = j, f_3, \ldots, f_N] \, P[F_2 = j | c', f_3, \ldots, f_N] \Bigr), \quad c = 1, \ldots, K.   (2.14)

In this case, P[c | f_1, ..., f_N] and P[f_1 | c', f_2, ..., f_N] are obtained by learning inference models for C and for F_1. Moreover, if we also learn an a posteriori model for F_2, then we can use the technique just described for a single missing feature (F_1 in this case) to obtain P[F_2 | c', f_3, ..., f_N], conditioned on each value c' = 1, ..., K. Then, given P[F_2 | c', f_3, ..., f_N] and the other probabilities, we can compute a corresponding matrix A (based on the term in parentheses in equation 2.14) and again use the power method to solve the resulting eigenvector equation represented by equation 2.14. Thus, for two missing features, our method requires two levels of missing feature inference, both implemented as the solution of eigenvector equations. Corresponding methods can be derived for higher orders of missing features, but this method is most practical for low orders.

3 Results
Here we describe experiments comparing the proposed ME method with dependence trees (DTs) (Chow & Liu, 1968), tree-augmented naive Bayes models (TAN) (Friedman & Goldszmidt, 1996), Kutato (Herskovits & Cooper, 1991), and with multilayer perceptrons (MLPs). Before presenting results, it is useful to describe DTs, TAN, and Kutato, and identify differences between these methods and our ME method.

3.1 Dependence Trees and Bayesian Networks. Chow and Liu's method assumes that given the class label, each feature F_i is conditionally independent of all but at most one of the remaining features. Thus, the class-conditional joint p.m.f. (a single DT) can be written in the form

P[F = f | C = c] = \prod_{i=1}^{N} P[F_i = f_i | C = c, F_{j(i,c)} = f_{j(i,c)}],   (3.1)
with the index mapping j (i, c) denoting the (class-dependent) feature on ( ) which Fi is assumed to depend.12 Given c, there are N N¡1 third-order prob2 abilities to draw on in constructing each class-dependent joint p.m.f.13 However, the DT model only allows inclusion of N ¡1 of these (constrained so as not to form a cycle). Chow and Liu’s efcient optimization method chooses the best set in the sense of a mutual information measure. Given the resulting class-conditional p.m.f.s, one can directly implement a pseudo-Bayes classication rule. Extensions of Chow and Liu’s method have also been proposed, increasing the conditioning order (Kang, Kim, & Kim, 1997) and building mixtures of DTs (Meila & Jordan, 1998). Moreover, TANs are close relatives of DTs that are also designed to minimize a mutual information measure. The Kutato algorithm was described in Buntine (1996) as a key breakthrough in learning the BN structure. This method applies the ME principle in a greedy, sequential manner to construct BNs. Specically, at each step the algorithm augments the current BN model, adding one new feature dependency. The added dependency is the one that results in the largest decrease in joint entropy. When a new dependency is included, the associated conditional probabilities are estimated via Dirichlet smoothing of frequency counts. The learning algorithm is ultimately terminated (to limit overtting) by using an entropy-based threshold rule. We can identify several points of comparison between DTs, TANs, more general BNs, and our ME model. First, DTs, TANs, and BNs can all be interpreted as ME solutions, but ones consistent with limited probabilistic constraints that must express conditional independence. Second, DTs and TANs encode a small subset of third-order constraints and BNs encode a small subset of even higher order constraints. By contrast, we encode only ( second-order, pairwise p.m.f. constraints, but all N N2C 1) such. 
Thus, we compare our model that encodes only second-order information (but all such) with models that encode (some) higher-order information. Further, unlike these other models, ours does not require strict determination of conditional independencies; our ME model typically expresses (even weak) dependencies between all features. Finally, our learning method is computationally complex in comparison with both Chow and Liu's method and Kutato. This higher complexity is required to learn our model effectively.

12. One feature is assumed independent of all others given c.
13. Since we refer to P[fi, fj] as a second-order probability, it is consistent to refer to P[fi | c, fj(i,c)] as a third-order probability.
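Chow and Liu's tree-selection step, referenced above, amounts to a maximum-weight spanning tree over pairwise mutual information. The following is a minimal illustrative sketch, not the authors' code; the function names and the Kruskal-style union-find are our own choices:

```python
import numpy as np
from itertools import combinations

def mutual_information(counts):
    """Mutual information of a 2-D contingency table of raw co-occurrence counts."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def chow_liu_edges(data, n_values):
    """Select the N-1 dependence-tree edges maximizing total pairwise
    mutual information (maximum spanning tree, Kruskal with union-find)."""
    n = data.shape[1]
    scores = []
    for i, j in combinations(range(n), 2):
        table = np.zeros((n_values[i], n_values[j]))
        for row in data:
            table[row[i], row[j]] += 1
        scores.append((mutual_information(table), i, j))
    scores.sort(reverse=True)            # highest-MI pairs first
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    edges = []
    for mi, i, j in scores:
        ri, rj = find(i), find(j)
        if ri != rj:                      # accept only if no cycle is formed
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

In the class-conditional setting of the text, this selection would be run once per class c on the data for that class.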
2196
David J. Miller and Lian Yan
3.2 Results Summary. Experiments were performed on seven data sets from the UC Irvine repository. The (training/test) split for each set was as follows: Car (1000/728), Balance-Scale (400/225), LED (500/500), Nursery (6480/6480), Congressional Voting Records (300/135), Credit Approval (345/345),14 and Hepatitis (100/55). For MLPs, at the input layer we used a binary code of |Ai| bits to represent the |Ai| possible values for each feature Fi. At the output layer, we used K outputs, one for each of the K classes. The single-hidden-layer MLPs were trained to minimize the squared norm between the K-dimensional MLP output vector and class-specific target binary codewords. For both ME and MLP training, we used gradient descent with a bisection line search. The bisection search was used to optimize the step size at each step during MLP training. For the ME learning, any increase in the cost initiated a bisection search to reselect the step size. For MLPs, we randomly generated all the initial weights with values in the range [-10^-3, 10^-3]. For ME, the initialization is specified in our pseudocode. DTs and TAN were implemented based on the descriptions given in the respective papers. For DTs, conditional probabilities were estimated based on frequency counts. For TAN, we implemented one version with the conditional probabilities smoothed as described in Friedman and Goldszmidt (1996) and another version based solely on frequency counts. For MLPs, we did not conduct an extensive search to find the best model size. However, we did design four different models, varying the number of hidden units, and then selected the one with the best generalization performance for each data set. For both our method and Kutato, we also need to discuss generalization and how it is affected by the choice of stopping rule.
In all our experimental comparisons, we used a fixed rule to terminate ME learning, stopping either when the reduction in cross entropy is diminishing or when a fixed, small temperature Tmin ≈ 0 is reached (as specified in our pseudocode). We chose Tmin ≈ 0 since, in general, the constraints may not be satisfied until T ≈ 0. However, in principle, the optimal stopping temperature for a given problem may be different from zero temperature. Indeed, if the constraint order is chosen too large (e.g., if we try to encode third-order constraints), the best generalization may be achieved by only weakly satisfying the constraints; that is, by stopping at nonzero (even high) temperatures. But our method is based on encoding (only) second-order constraints. We have done experiments on several data sets to evaluate how generalization performance varies with temperature for second-order constraints. For each data set, we found that generalization continues to improve as T (and hence D) is lowered. An example is shown in Figure 3. This heuristically justifies our fixed stopping rule based on Tmin ≈ 0. To achieve good generalization for Kutato, we carefully chose a fixed stopping rule,
14. For Credit, each continuous feature was quantized to five levels.
Approximate Maximum Entropy Inference
2197
Figure 3: Error rate on the test set versus inverse temperature (log scale) for the Nursery data set.
with learning terminated when the change in joint entropy drops below a threshold of 0.000001.15 This threshold typically produces BN solutions with good generalization performance.16 We next discuss experiments evaluating performance of the five methods, the effect of varying the amount of joint p.m.f. support, the choice of constraints, and the computational learning time.

15. For decreasing entropy we found that generalization performance continues to improve for our method but at some point worsens for Kutato. The explanation is that the BN model size increases, whereas our model size stays fixed, for decreasing H.
16. To characterize this, for each data set we determined the model in the Kutato learning sequence with best generalization (this model is, of course, not determinable in practice for new test data). We then compared the performance of this "best" model with that obtained based on our fixed stopping rule for Kutato. Over the seven data sets, the average performance loss relative to the "best" model was 9%. As a preview of ME performance, we note that our ME model, obtained using our fixed stopping rule, was on average 12% better than this (unattainable) "best" model of Kutato.

3.2.1 Performance Comparison. Test set misclassification performance is shown in Table 2. "Complete" indicates all features were present in both the training and test sets. For Credit Approval and Hepatitis, "Missing" indicates that some features were missing during both training and use. This was handled for all the methods by augmenting the range for each feature to include the value "missing" (during both training and use), as discussed in section 2. For the other data sets, "Missing" indicates that a single feature (randomly selected, uniformly, from the feature set) was missing from each test vector. This was handled for MLPs and for our method by the approximation of the exact inference expression, equation 2.12. For DTs, TAN, and BNs, exact inference can be performed (since these models learn a joint p.m.f.). As seen in the table, the ME method achieves substantial performance gains compared with Chow and Liu's method, TAN, and Kutato, for both the complete and missing feature cases.17 For Nursery in particular, the performance of these other methods is quite poor. For the "Missing" case, ME achieved performance gains on Nursery, Car, and Balance despite the fact that only the approximation of equation 2.12 was used for making inferences. Further, we note that DTs generally perform very poorly when there are missing values during use. The results in Table 2 indicate that TAN performance with smoothing is generally better than that of DTs and comparable to Kutato. Without smoothing, TAN performance is closer to that of DTs, which is consistent with statements made in Friedman, Goldszmidt, and Lee (1998). In comparison with MLPs, the ME method is better on four data sets and worse on three others. Overall, however, the ME approach achieves better results; the ME advantage is substantial for the Car and Credit data sets, and ME is better for the "Missing" case on all but two data sets. We did not evaluate other methods such as decision trees. However, Friedman and Goldszmidt (1996) found that TAN is competitive with C4.5 decision trees (Quinlan, 1993), which is at least quite suggestive that ME should also perform favorably compared with this model.

3.2.2 Changing the p.m.f.
Support Size. Several experiments were conducted to quantify the performance gain achieved by increasing Gs to include both the training set and additional randomly chosen N-tuples from G. For Nursery, going from |Gs| = 1000 to |Gs| = 6480 reduced the test set error rate from 0.2409 to 0.2361. For Hepatitis, going from |Gs| = 100 to |Gs| = 500 reduced the rate from 0.2545 to 0.2364. For Balance, going from |Gs| = 400 to |Gs| = 500 reduced the rate from 0.240 to 0.2311. For this data set, it is actually possible to optimize over the full space G, since |G| = 625. In doing so, however, we still obtained an error rate of 0.2311. From these results, we conclude that we can limit learning complexity by the choice of a small joint p.m.f. support, with modest effect on the classification performance.
17. Note, however, that only some of the comparisons give statistically significant results in the sense of disjoint confidence intervals.
Table 2: Test Set Misclassification Rates with 95% Confidence Intervals for Several Methods on Data Sets from the UC Irvine Repository.

            Maxent                                    Chow and Liu
Data Sets   Complete            Missing               Complete            Missing
Car         0.25 (0.22, 0.28)   0.30 (0.27, 0.34)     0.37 (0.33, 0.40)   0.78 (0.75, 0.81)
LED         0.28 (0.24, 0.32)   0.37 (0.33, 0.42)     0.30 (0.26, 0.34)   0.39 (0.35, 0.43)
Balance     0.24 (0.19, 0.30)   0.35 (0.29, 0.41)     0.42 (0.38, 0.46)   0.93 (0.91, 0.95)
Nursery     0.24 (0.23, 0.25)   0.29 (0.28, 0.31)     0.62 (0.61, 0.64)   0.82 (0.82, 0.83)
Congress    0.06 (0.03, 0.11)   0.06 (0.03, 0.11)     0.16 (0.10, 0.23)   0.22 (0.15, 0.29)
Credit      ...                 0.12 (0.09, 0.16)     ...                 0.28 (0.23, 0.33)
Hepatitis   ...                 0.26 (0.16, 0.38)     ...                 0.38 (0.27, 0.51)

            MLP                                       Kutato
Data Sets   Complete            Missing               Complete            Missing
Car         0.37 (0.34, 0.41)   0.38 (0.35, 0.42)     0.35 (0.31, 0.38)   0.36 (0.33, 0.40)
LED         0.29 (0.25, 0.33)   0.39 (0.35, 0.43)     0.28 (0.24, 0.32)   0.34 (0.30, 0.38)
Balance     0.21 (0.17, 0.27)   0.32 (0.26, 0.38)     0.34 (0.28, 0.40)   0.40 (0.33, 0.46)
Nursery     0.23 (0.22, 0.24)   0.30 (0.28, 0.31)     0.53 (0.51, 0.54)   0.53 (0.52, 0.54)
Congress    0.06 (0.03, 0.11)   0.09 (0.05, 0.15)     0.07 (0.04, 0.12)   0.07 (0.04, 0.13)
Credit      ...                 0.21 (0.17, 0.26)     ...                 0.13 (0.09, 0.16)
Hepatitis   ...                 0.24 (0.14, 0.36)     ...                 0.31 (0.20, 0.44)

            TAN (Smoothing)                           TAN (No Smoothing)
Data Sets   Complete            Missing               Complete            Missing
Car         0.34 (0.30, 0.37)   0.35 (0.32, 0.39)     0.37 (0.34, 0.41)   0.37 (0.33, 0.40)
LED         0.27 (0.23, 0.31)   0.36 (0.32, 0.40)     0.29 (0.25, 0.33)   0.36 (0.32, 0.40)
Balance     0.33 (0.28, 0.40)   0.39 (0.33, 0.45)     0.29 (0.23, 0.35)   0.37 (0.31, 0.44)
Nursery     0.52 (0.51, 0.53)   0.50 (0.49, 0.51)     0.65 (0.63, 0.66)   0.59 (0.58, 0.60)
Congress    0.07 (0.04, 0.13)   0.06 (0.03, 0.11)     0.13 (0.09, 0.20)   0.10 (0.06, 0.17)
Credit      ...                 0.17 (0.14, 0.21)     ...                 0.29 (0.25, 0.34)
Hepatitis   ...                 0.31 (0.20, 0.44)     ...                 0.49 (0.44, 0.54)
3.2.3 Choice of Constraints. We have performed several experiments on synthetic data sets to evaluate the ability of our method to learn the structure of the joint p.m.f. accurately. In these experiments, we first randomly generated the Lagrange multipliers fully specifying an underlying (true) joint p.m.f. We then randomly generated a (training) data set according to this model. Finally, we applied our ME method, encoding all second-order p.m.f. constraints, to build a classifier. To evaluate learning and model accuracy, we computed the probability that the class decision produced by our model disagrees with the Bayes-optimal decision. This measure can be written $\sum_{f \in G} P_t[f]\,\bigl(1 - \delta(\arg\max_k P_m[k|f],\, \arg\max_k P_t[k|f])\bigr)$, with m denoting model and t denoting true probabilities and with $\delta(\cdot,\cdot)$ the Kronecker delta function.
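When the true and model posteriors are available over the support, this disagreement measure is a direct weighted sum; a minimal sketch (array names are ours, not the paper's):

```python
import numpy as np

def bayes_disagreement(p_true_f, p_true_c_given_f, p_model_c_given_f):
    """Probability that the model's class decision differs from the
    Bayes-optimal decision over the support G:
    sum_f P_t[f] * (1 - delta(argmax_k P_m[k|f], argmax_k P_t[k|f])).
    p_true_f: shape (|G|,); the two posterior arrays: shape (|G|, K)."""
    model_dec = np.argmax(p_model_c_given_f, axis=1)
    bayes_dec = np.argmax(p_true_c_given_f, axis=1)
    return float(np.sum(p_true_f * (model_dec != bayes_dec)))
```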
Figure 4: Probability of disagreement with the Bayes decision versus |Gs| for a synthetic data set with 8000 samples (seven dimensions, four values for each dimension). The true joint p.m.f. was based on pairwise constraints.
Two synthetic data sets were generated. For the first set, the true joint p.m.f. was based solely on pairwise constraints. Thus, our ME model is "matched" to the true joint p.m.f. structure, with the experiment evaluating the accuracy of the support approximation and of the learning. In Figure 4, we see that our model fairly closely approximates the Bayes classifier as |Gs| increases. The performance is also much better than that of Kutato (disagreement rate of 0.4150). For the second data set, the true joint p.m.f. was based on third-order constraints. Here, the experiment evaluates the loss in accuracy caused by ME model mismatch. Figure 5 indicates that in this case there is significant performance loss. However, the performance is still better than that of Kutato (disagreement rate of 0.5315). While the experiment just described indicates the possibility of model mismatch, the (serious) issue of overfitting when building a higher-order ME model based on a (small) data set was not considered. Thus, we also performed several experiments on real data sets to evaluate the inclusion of higher- (third-) order probability constraints. For the LED data set, inclusion of all third-order constraints when building the ME model reduced the error rate, but only by an insignificant amount, from 0.276 to 0.270. For Balance, the error rate increased substantially, going from 0.24 for second-order constraints to 0.4178. Note that this third-order error rate is quite close to that
Figure 5: Probability of disagreement with the Bayes decision versus |Gs| for a synthetic data set with 8000 samples (seven dimensions, four values for each dimension). The true joint p.m.f. was based on third-order constraints.
of Chow and Liu's method, which also includes (unsmoothed) third-order constraints. This indicates that, at least for Balance, third-order probability estimates (based solely on raw frequency counts) are not reliable. We thus also tried smoothing as an alternative to standard frequency counts for estimating the probabilities. Since the smoothing procedure suggested by Friedman and Goldszmidt (1996) applies only to conditional probabilities, we instead tried Dirichlet smoothing, the approach used by Kutato. For the Balance set, Dirichlet smoothing applied to all third-order constraints actually led to worse performance, with the error rate going from 0.4178 to 0.5289. For Credit and Hepatitis, with second-order constraints, the error rates were identical to those produced based on standard frequency counts. In general, Dirichlet smoothing was not found to improve performance. Finally, we have also done initial experiments on a heuristic approach for selectively including higher-order constraints, using Kutato to seek them out. For each third- or higher-order conditional p.m.f. used by Kutato, we encoded the corresponding joint p.m.f. constraint, in addition to all second-order constraints, when building the ME model. However, this approach was not found to yield significant performance gains. In fact, in some cases including higher-order constraints led to worse performance.
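Dirichlet smoothing of frequency counts, as referenced above, amounts to adding pseudocounts from a symmetric prior before normalizing. A minimal sketch (the choice of a symmetric prior with concentration alpha is our illustrative assumption; the paper does not specify Kutato's prior parameters):

```python
import numpy as np

def dirichlet_smooth(counts, alpha=1.0):
    """Estimate a p.m.f. from raw frequency counts with a symmetric
    Dirichlet prior: posterior mean (n_k + alpha) / (N + K * alpha).
    alpha = 1 gives Laplace smoothing; alpha -> 0 recovers raw frequencies."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)
```

Smoothing keeps every probability strictly positive, which matters when sparse high-order counts would otherwise yield zeros.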
To conclude this section, we note that the above results generally provide experimental vindication for our simple model selection paradigm of encoding all second-order constraints.

3.2.4 Computational Learning Time. In section 2, we emphasized the computational practicality of the ME learning method. Here we support that discussion by reporting learning times based on runs on a Sparc10 machine.18 For |Gs| = 1000, the ME CPU time for Nursery (9 dimensions, two to five values per feature) was about 5 hours. For |Gs| = 345, the CPU time for Credit (16 dimensions, 2–15 values per feature) was about 4 hours. In contrast, a single MLP training run took 91 hours for Nursery and 18 hours for Credit.19 For Kutato, the running times were 3 minutes and 1 minute for Nursery and Credit, respectively. While this is substantially lower complexity than that required for our model, some significant learning effort was required for Kutato on some other data sets. For example, for the 23-dimensional Mushroom data from UC Irvine, Kutato required 75 CPU minutes. The DT and TAN running times were negligible: less than 2 seconds. The learning time for our model does grow in a significant way (roughly quadratically) with |Gs|. For example, for |Gs| = 100, the CPU time on Hepatitis was about 50 minutes, while for |Gs| = 200 it grew to about 9.5 hours. However, since we have already shown empirically that the performance is not strongly dependent on |Gs| given that it is above a certain size, we can generally choose a small support set and retain quite manageable complexity with little performance loss. While our method does require substantially greater learning effort than some other approaches (e.g., Kutato), the differences in learning times may or may not be practically meaningful, depending on the training scenario. Moreover, our classification results indicate that the learning effort is rewarded.

4 Conclusions
While the ME principle is equivalent to the more commonly used method of maximum likelihood, the particular description given by ME offers several practical advantages for model construction and learning that we have indicated in this work. In particular, while direct application of maximum

18. We did not optimize the efficiency of the code we implemented for our algorithm or any of the other methods.
19. The significant learning time of the MLPs relative to our model is attributable to several factors. First, MLP learning time directly depends on the size of the training set (6480 for Nursery), while our learning time depends on |Gs| (which we chose to be less than the Nursery training set size). Second, we did careful learning for MLPs to try to maximize their performance, using bisection search at every step. Finally, we chose a particular MLP architecture. Other MLP structures may have lower training complexity.
likelihood requires one to choose the form of the data distribution, either in a somewhat arbitrary way, by invoking an additional (heuristic) principle, or based on prior knowledge, the ME model structure is naturally determined by the constraints that are encoded, without appeal to additional principles. Moreover, by judiciously choosing the ME constraints, one can encode a substantial amount of information without making unnecessary assumptions (conditional independence) and with little or no overfitting. Despite these advantages, the use of ME given general probabilistic constraints has received little attention beyond Cheeseman's paper in 1983, owing to the intractability of learning the model. In this work, we reconsidered the ME problem, proposing an approximate, practical learning method that achieves good classification performance compared with several alternative approaches. While this work has focused on statistical classification, our method can be naturally extended to address other important inference problems. We briefly note the following extensions, some of which we have recently developed and some of which we are currently investigating:

General statistical inference on discrete feature spaces. We have recently developed this extension, which allows one to infer any feature value (e.g., user specified) given values for the remaining features (Yan & Miller, 1999a). While BNs can also be used for this purpose, we have found that our method achieves substantial performance gains over several BN approaches. Like the learning method developed here, our general inference extension has tractable complexity. Again, the key is the restriction of joint p.m.f. support during learning. This method has application potential for medical and fault diagnosis, information retrieval, and other domains.

Inference on mixed continuous and discrete feature spaces.
The ME framework naturally allows constraints on continuous as well as discrete features (Yan & Miller, 1999b). For a mixed feature space, we can, for example, encode moment constraints on individual continuous features; expected-product constraints on pairs of continuous features; and, for a mixed pair of features, conditional moment constraints on the continuous feature given each value of the discrete feature. In encoding these constraints, we work directly with continuous features. Doing so, we avoid a somewhat artificial mapping of continuous features to discrete space (via an information-lossy feature quantization) or of discrete features to continuous space. In preliminary results, this extension of our approach achieves performance gains over methods that do require such mappings (Yan & Miller, 1999b).

Regression. Extension for inference on continuous features is a direction for future study.
Sequential model building. This potential extension would allow one to encode constraints on individual probabilities, of varying order. Zhu et al. (1997) may provide insight into a principled framework for tackling this problem. An effective technique for smoothing constraint probabilities could also be explored. Appendix
The following partial derivatives form the basis of our gradient descent procedure:

\[
\frac{\partial L}{\partial \lambda_m}
= -\sum_{\substack{n=1 \\ n \neq m}}^{|G_s|} \frac{\partial L}{\partial P[f_n]}\, P[f_m]\, P[f_n]
+ \frac{\partial L}{\partial P[f_m]}\, P[f_m]\bigl(1 - P[f_m]\bigr),
\qquad \text{(A.1)}
\]

\[
\frac{\partial L}{\partial \gamma(F_i = l, C = c)}
= \sum_{f_n \in G_s :\, f_i^{(n)} = l}
\Biggl( \frac{\partial L}{\partial P[c \mid f_n]}\, P[c \mid f_n]\bigl(1 - P[c \mid f_n]\bigr)
- \sum_{\substack{c' = 1 \\ c' \neq c}}^{K} \frac{\partial L}{\partial P[c' \mid f_n]}\, P[c' \mid f_n]\, P[c \mid f_n] \Biggr),
\qquad \text{(A.2)}
\]

where

\[
\frac{\partial L}{\partial P[f_m]}
= T \Biggl( \log P[f_m] + 1 + \sum_{c=1}^{K} P[c \mid f_m] \log P[c \mid f_m] \Biggr)
- \sum_{j=1}^{N-1} \sum_{k=j+1}^{N}
\frac{P[f_j^{(m)}, f_k^{(m)}]}{\sum_{f_n \in G_s :\, f_j^{(n)} = f_j^{(m)},\, f_k^{(n)} = f_k^{(m)}} P[f_n]}
- \sum_{c=1}^{K} \sum_{j=1}^{N}
\frac{P[f_j^{(m)}, c]}{\sum_{f_n \in G_s :\, f_j^{(n)} = f_j^{(m)}} P[c \mid f_n]\, P[f_n]}\, P[c \mid f_m],
\qquad \text{(A.3)}
\]

\[
\frac{\partial L}{\partial P[c \mid f_m]}
= T \bigl( \log P[c \mid f_m] + 1 \bigr) P[f_m]
- P[f_m] \sum_{j=1}^{N}
\frac{P[f_j^{(m)}, c]}{\sum_{f_n \in G_s :\, f_j^{(n)} = f_j^{(m)}} P[c \mid f_n]\, P[f_n]}.
\qquad \text{(A.4)}
\]
Acknowledgments
We thank the anonymous referees for suggestions that helped us to improve this article substantially. This work was supported in part by National Science Foundation Career Award IIS-9624870.
References

Ahmad, S., & Tresp, V. (1993). Classification with missing and uncertain inputs. In Proc. of IEEE Intl. Conf. on Neural Networks (pp. 1949–1954).
Berger, A. L., Della Pietra, S. A., & Della Pietra, V. J. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39–68.
Berman, A., & Plemmons, R. J. (1979). Nonnegative matrices in the mathematical sciences. New York: Academic Press.
Bilbro, G. L., & Van den Bout, D. E. (1992). Maximum entropy and learning theory. Neural Computation, 4, 839–853.
Buhmann, J., & Kuhnel, H. (1994). Vector quantization with complexity costs. IEEE Trans. on Info. Theory, 39, 1133–1145.
Buntine, W. (1996). A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowl. and Data Eng., 8, 195–210.
Burg, J. P. (1975). Maximum entropy spectral analysis. Unpublished doctoral dissertation, Stanford University, Stanford, CA.
Cheeseman, P. (1983). A method of computing generalized Bayesian probability values for expert systems. In Proc. of the Eighth Intl. Joint Conf. on AI (Vol. 1, pp. 198–202).
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988). AutoClass: A Bayesian classification system. In Proc. of the Fifth Intl. Conf. on Machine Learning.
Chow, C., & Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14, 462–467.
Cooper, G. F., & Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9, 309–347.
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.
Edwards, D., & Havranek, T. (1985). A fast procedure for model search in multidimensional contingency tables. Biometrika, 72, 339–351.
Frey, B. J. (1998). Graphical models for machine learning and digital communication. Cambridge, MA: MIT Press.
Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian networks. In Proc. of the National Conf. on AI (Vol. 2, pp. 1277–1284).
Friedman, N., Goldszmidt, M., & Lee, T. J. (1998). Bayesian network classification with continuous attributes: Getting the best of both discretization and parametric fitting. In Proc. of the Intl. Conf. on Machine Learning (pp. 179–187).
Gevarter, W. B. (1986). Automatic probabilistic knowledge acquisition from data (Tech. Memo No. 88224). Mountain View, CA: NASA Ames Research Center.
Ghahramani, Z., & Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 120–127). San Mateo, CA: Morgan Kaufmann.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Englewood Cliffs, NJ: Prentice Hall.
Heckerman, D., Geiger, D., & Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20, 197–243.
Herskovits, E., & Cooper, G. (1991). Kutato: An entropy-driven system for construction of probabilistic expert systems from databases. In P. P. Bonissone, M. Henrion, L. N. Kanal, & J. F. Lemmer (Eds.), Uncertainty in AI 6 (pp. 117–125). Amsterdam: Elsevier.
Jaynes, E. T. (1982). Papers on probability, statistics and statistical physics. Dordrecht: Reidel.
Jirousek, R., & Preucil, S. (1995). On the effective implementation of the iterative proportional fitting procedure. Comp. Stat. and Data Anal., 19, 177–189.
Kang, H., Kim, K., & Kim, J. H. (1997). Optimal approximation of discrete probability distribution with kth-order dependency and its application to combining multiple classifiers. Patt. Rec. Letters, 18, 515–523.
Ku, H. H., & Kullback, S. (1969). Approximating discrete probability distributions. IEEE Trans. on Info. Theory, 15(4), 444–447.
Little, R. J. A. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the Amer. Stat. Assoc., 88, 125–134.
Meila, M., & Jordan, M. I. (1998). Estimating dependency structure as a hidden variable. In M. Kearns, M. Jordan, & S. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 584–590). Cambridge, MA: MIT Press.
Miller, D. J., Rao, A. V., Rose, K., & Gersho, A. (1996). A global optimization technique for statistical classifier design. IEEE Trans. on Sig. Proc., 44, 3108–3122.
Miller, D. J., & Yan, L. (1999). Critic-driven ensemble classification. IEEE Trans. on Sig. Proc., 47, 2833–2844.
Neal, R. M. (1992). Connectionist learning of belief networks. Art. Intell., 56, 71–113.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Mateo, CA: Morgan Kaufmann.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Rose, K., Gurewitz, E., & Fox, G. C. (1992). Vector quantization by deterministic annealing. IEEE Trans. on Info. Theory, 38, 1249–1257.
Shore, J. E., & Johnson, R. W. (1980). Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. on Info. Theory, 26, 26–37.
Yan, L., & Miller, D. J. (1999a). General statistical inference by an approximate application of the maximum entropy principle. In Proc. of IEEE Workshop on Neural Networks for Sig. Proc. (pp. 112–121). Yan, L., & Miller, D. J. (1999b). General statistical inference for discrete and mixed spaces by an approximate application of the maximum entropy principle. Manuscript submitted for publication. Zhu, S. C., Wu, Y. N., & Mumford, D. (1997). Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8), 1627–1660. Received December 14, 1998; accepted September 9, 1999.
LETTER
Communicated by Eric Mjolsness
l-Opt Neural Approaches to Quadratic Assignment Problems

Shin Ishii
Nara Institute of Science and Technology, Ikoma-shi, Nara, 630-0101 Japan, and ATR Human Information Processing Research Laboratories, Soraku-gun, Kyoto, 619-0288 Japan

Hirotaka Niitsuma
Nara Institute of Science and Technology, Ikoma-shi, Nara, 630-0101 Japan

In this article, we propose new analog neural approaches to combinatorial optimization problems, in particular, quadratic assignment problems (QAPs). Our proposed methods are based on an analog version of the l-opt heuristics, which simultaneously changes assignments for l elements in a permutation. Since we can take a relatively large l value, our new methods can achieve a middle-range search over possible solutions, and this helps the system neglect shallow local minima and escape from local minima. In experiments, we have applied our methods to relatively large-scale (N = 80–150) QAPs. Results have shown that our new methods are comparable to the present champion algorithms; for two benchmark problems, they obtain better solutions than the previous champion algorithms.

1 Introduction
We propose new analog neural approaches to combinatorial optimization problems. In particular, we deal with quadratic assignment problems (QAPs) (Burkard, Karisch, & Rendl, 1997), which are known to be very difficult combinatorial optimization problems. Our proposed approaches can also be applied to various combinatorial optimization problems where each solution is represented as a permutation. Examples are traveling salesman problems (TSPs). In the case of a TSP, a solution is represented by aligning the city indices along the salesman's route, which is a permutation of the set of city indices. A solution to a QAP can also be represented by a permutation. In conventional neural approaches, however, an Ising (binary) spin system (Hopfield & Tank, 1985; Bilbro et al., 1989) or a Potts spin system (Peterson & Söderberg, 1989; Ishii & Sato, 1997) has been used to represent a solution. The constraints for the permutation are then implemented as "soft" constraints: penalty terms for violations are added to the objective function. This implementation often produces infeasible solutions, one of the

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2209–2225 (2000).
reasons that these approaches do not perform well when the problem size grows. In some studies, on the other hand, all of the permutation constraints are treated as "hard" constraints; they are implemented so as to be automatically satisfied. Yuille and Kosowski (1994) proposed a gradient descent method for obtaining a solution permutation for linear assignment problems. Rangarajan, Gold, and Mjolsness (1996) proposed a discrete-time algorithm called "soft-assign" for obtaining a solution permutation for QAPs. We (Ishii & Sato, 1995, 1996) also proposed a similar algorithm called the doubly constrained network (DCN).1 Rangarajan et al. (1997) proved that their soft-assign, when combined with a deterministic annealing procedure, converges. In these studies, the space of possible solutions is almost equivalent to that of the permutations. The obtained solutions are therefore feasible in general, and they are better (Ishii & Sato, 1996, 1998; Gold & Rangarajan, 1996) than the solutions that would be produced by the Ising or Potts spin approach. Even in DCN (≈ soft-assign) combined with a deterministic annealing procedure, several problems exist. First, since the algorithm is deterministic, the system cannot escape from a local minimum by itself. In addition, when starting at a high temperature, there is no variety in the solutions even if the initial conditions are variously prepared (Ishii & Sato, 1996).2 Second, in order to obtain a valid permutation, careful deterministic annealing is needed, especially at low temperatures. Such features prevent these approaches from achieving further improvements in the solution. In our new approach, we apply a replacement to a permutation as a basic operation. In addition, we introduce nonequilibrium dynamics to DCN. These modifications overcome the problems. The nonequilibrium dynamics allows the system to escape from local minima, and this results in variety in the solutions.
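The "hard" doubly stochastic constraint underlying soft-assign/DCN is typically enforced by alternately normalizing the rows and columns of a positive matrix (Sinkhorn balancing). The following is a simplified sketch of that normalization step, not the DCN algorithm itself; the function names and iteration count are our own:

```python
import numpy as np

def sinkhorn(M, n_iters=200):
    """Alternating row/column normalization of a positive matrix,
    converging toward a doubly stochastic matrix (a soft assignment)."""
    M = np.asarray(M, dtype=float).copy()
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)   # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

def soft_assignment(scores, T):
    """Exponentiate assignment scores at temperature T, then balance.
    As T decreases, the result approaches a hard permutation matrix."""
    return sinkhorn(np.exp(scores / T))
```

Because every iterate stays in (or near) the doubly stochastic polytope, the search space remains essentially the set of permutations, which is the feasibility property emphasized in the text.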
The basic replacement operation ensures that the system retrieves only valid permutations. When the basic operation is an exchange of two elements in a permutation, the operation is called a 2-opt search. When l (> 2) elements in a permutation are changed at once, the operation is called an l-opt search (Lin & Kernighan, 1973; Martin, Otto, & Felten, 1992). The 2-opt search is a very simple heuristic algorithm, but it easily falls into a local minimum. A neural network approach based on the 2-opt heuristics was proposed by
1 The DCN (≈ soft-assign) algorithm was first proposed by Gold, Rangarajan, and Mjolsness (1994). Our articles (Ishii & Sato, 1995, 1996) are later, although they describe a convergence proof for a subprocedure, a bifurcation discussion, and TSP (N ≤ 200) simulation results. Therefore, it is appropriate to call this algorithm soft-assign. Since this study is based on our DCN formulation (Ishii & Sato, 1996), however, the algorithm is sometimes called DCN in this article.
2 Starting at a low temperature produces solution variety, depending on the initial conditions. However, this scheme does not produce good solutions. In this article, the deterministic annealing procedure is assumed to start at a sufficiently high temperature.
λ-Opt Neural Approaches
2211
Hasegawa, Ikeguchi, and Aihara (1998). Tabu lists (Taillard, 1995) and chaos (Hasegawa et al., 1998) have been proposed as means of escaping from local minima. On the other hand, λ-opt (λ > 2) heuristics become computationally heavy as λ becomes large. Our proposed methods are based on an analog version of the λ-opt heuristics. Since we employ analog neural approaches based on the DCN formulation, it is possible to employ a fairly large λ (e.g., about N / 2, where N is the problem size). A fairly large λ value enables the system to search for good permutations over a middle-range region and prevents the system from falling into shallow local minima around the present system state. In addition, our methods naturally introduce nonequilibrium dynamics, which enables the system to search for various solutions one after another in the space of possible permutations. In the experiments, we apply our new methods to relatively large-scale (N = 80–150) QAPs taken from QAPLIB (Burkard et al., 1997), which is a standard set of QAP benchmark problems. Our methods are comparable to the present champion algorithms, which are based on tabu search (Taillard, 1991; Battiti & Tecchiolli, 1994), genetic algorithms (Fleurent & Ferland, 1994), and simulated annealing (Amin, 1999). Moreover, for two benchmark problems, our methods outperform the champion algorithms. Consequently, our methods can be considered among the stronger algorithms for QAPs.

2 λ-DCN

2.1 Quadratic Assignment Problem. QAPs are known to be very difficult combinatorial optimization problems (Burkard et al., 1997). Naturally, they belong to the class of NP-hard problems. A typical instance of a QAP is a facility location problem, in which a set of facilities is assigned to an equal number of locations at minimum cost. For each pair of facilities, an amount of flow is given, and for each pair of locations, a distance is given.
The cost is defined as the sum, over all pairs, of the product of the flow between two facilities and the distance between the locations to which those facilities are assigned. Many combinatorial optimization problems, such as TSPs, maximum clique problems, and graph isomorphism problems, are special (and comparatively easy) examples of QAPs. Let D and C denote N-by-N distance and flow matrices, respectively. A QAP of problem size N is then defined by
    min_{π∈P} Σ_{a=1}^{N} Σ_{b=1}^{N} D_{a,b} C_{π(a),π(b)},                 (2.1)
where P is the set of all permutations of {1, 2, ..., N}, and π(a) gives the element assigned to location a in a permutation π ∈ P.
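To make objective 2.1 concrete, here is a small Python sketch of our own (not from the paper) that evaluates the QAP cost of a permutation, together with a greedy 2-opt local search of the kind the paper later uses as a baseline. Indices are 0-based, and qap_cost recomputes the full double sum (O(N²) per evaluation); efficient QAP codes use incremental delta evaluation instead.

```python
import itertools

def qap_cost(D, C, perm):
    """Objective 2.1: sum over a, b of D[a][b] * C[perm[a]][perm[b]],
    where perm[a] is the element assigned to location a (0-based)."""
    N = len(perm)
    return sum(D[a][b] * C[perm[a]][perm[b]]
               for a in range(N) for b in range(N))

def two_opt(D, C, perm):
    """Greedy 2-opt search: repeatedly swap two assignments while the
    objective improves; stops at a 2-opt local minimum."""
    perm = list(perm)
    best = qap_cost(D, C, perm)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(range(len(perm)), 2):
            perm[a], perm[b] = perm[b], perm[a]      # try the exchange
            cost = qap_cost(D, C, perm)
            if cost < best:
                best, improved = cost, True          # keep the improving swap
            else:
                perm[a], perm[b] = perm[b], perm[a]  # undo
    return perm, best
```

For λ > 2, the analogous λ-opt neighborhood grows combinatorially, which is the motivation for the analog relaxation developed below.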
2212
Shin Ishii and Hirotaka Niitsuma
2.2 λ-DCN Equations. To deal with problem 2.1, we define an N-by-N assignment matrix: S_{a,n} = 1 if π(a) = n, and S_{a,n} = 0 otherwise. The constraints Σ_{n=1}^{N} S_{a,n} = 1 (a = 1, ..., N) and Σ_{a=1}^{N} S_{a,n} = 1 (n = 1, ..., N) must be satisfied for the assignment matrix S to represent a valid permutation π(a). Using the assignment matrix S, we can restate problem 2.1 as the minimization of an objective energy function,
    E_obj(S) ≡ (1/2) Σ_{a,b,n,m=1}^{N} D_{a,b} C_{n,m} S_{a,n} S_{b,m},       (2.2)
which is subject to the assignment matrix constraints. If the flow matrix has the form C_{n,m} = (δ_{n,(m+1)} + δ_{n,(m−1)}) / 2, where δ_{i,j} is Kronecker's delta, objective energy function 2.2 is equivalent to that of an N-city TSP. Namely, a TSP is a special QAP. We define a modified energy function,

    E(S) ≡ E_obj(S) + (g/2) Σ_{a,n=1}^{N} S_{a,n} (1 − S_{a,n}),              (2.3)
where g is a positive constant. Although the second term is always zero, and hence meaningless, for a binary S, studies have shown (Hopfield & Tank, 1985; Peterson & Söderberg, 1989; Ishii & Sato, 1996; Rangarajan et al., 1996) that this quadratic term helps stabilize the analog algorithm shown later. Next, we define an N-by-N permutation matrix by

    Y_{a,b} = { 1  if element π(b) changes its location from b to a,
                0  otherwise.                                                 (2.4)
Although an assignment matrix S can be regarded as a permutation matrix applied to the identity assignment, π(n) = n (n = 1, ..., N), we distinguish assignment matrices from permutation matrices for convenience of description. The permutation matrix Y is also subject to constraints:

    Σ_{a=1}^{N} Y_{a,b} = 1   (b = 1, ..., N)                                 (2.5a)

    Σ_{b=1}^{N} Y_{a,b} = 1   (a = 1, ..., N).                                (2.5b)
In addition, we consider the constraint

    Σ_{a=1}^{N} Y_{a,a} = N − λ,                                              (2.6)
where 0 ≤ λ ≤ N is a constant integer. Constraint 2.6 restricts to λ the number of elements whose location is changed by the permutation Y. In the following, a permutation matrix Y subject to constraints 2.5 and 2.6 is called a λ-permutation matrix. When a permutation matrix Y is applied to an assignment matrix S, the new assignment matrix S′ is given by
    S′ = Y S.                                                                 (2.7)
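As a concrete illustration (our own sketch, not from the paper), the following NumPy snippet builds an assignment matrix S from a permutation, constructs a λ-permutation matrix Y as a single cycle over a chosen set of locations, and applies equation 2.7. The helper names are hypothetical.

```python
import numpy as np

def assignment_matrix(perm):
    """S[a, n] = 1 iff element n is assigned to location a (perm[a] == n)."""
    N = len(perm)
    S = np.zeros((N, N), dtype=int)
    S[np.arange(N), perm] = 1
    return S

def cyclic_lambda_permutation(N, moved):
    """Y for one cycle over the locations in `moved` (len(moved) >= 2):
    the content of location moved[i] is relocated to moved[i+1].  Y satisfies
    the doubly stochastic constraints 2.5 and trace(Y) = N - lambda
    (constraint 2.6), with lambda = len(moved)."""
    Y = np.eye(N, dtype=int)
    lam = len(moved)
    for i in range(lam):
        src, dst = moved[i], moved[(i + 1) % lam]
        Y[src, src] = 0      # location src no longer keeps its content
        Y[dst, src] = 1      # row dst of S' becomes row src of S
    return Y

# Apply a 3-permutation (lambda = 3) to a 6-element assignment (eq. 2.7).
S = assignment_matrix([2, 0, 1, 5, 3, 4])
Y = cyclic_lambda_permutation(6, [1, 3, 4])
S_new = Y @ S
```

The product Y S changes exactly λ rows of S, which is the matrix form of relocating λ elements.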
Next, we consider a set of λ-permutation matrices, Λ = {(Y^k, P^k) | k = 1, ...}, where P^k is the probability of the permutation Y^k. The ensemble average of the new assignment matrices produced by this set of permutation matrices is given by

    ⟨S′⟩_Λ = ⟨Y⟩_Λ S,                                                         (2.8)
where ⟨·⟩_Λ denotes the mean with respect to the permutation set: ⟨f(k)⟩_Λ ≡ Σ_k f(k) P^k. We use the notations V ≡ ⟨S′⟩_Λ and X ≡ ⟨Y⟩_Λ. Let us now consider a minimization of the energy function E(S′) with respect to the new assignments. To make this discrete problem continuous, we introduce a free energy function:

    F ≡ ⟨E(S′)⟩_Λ − TH                                                        (2.9a)

    H ≡ −Σ_{S′} P(S′) log P(S′) ≈ −Σ_{a,n=1}^{N} V_{a,n} log V_{a,n},         (2.9b)
where P(·) denotes the probability of the corresponding configuration, T is the temperature, and H is the entropy of the possible configurations. For the time being, we deal with a more general problem, in which the energy function has the quadratic form

    E(S′) = (1/2) Σ_{a,n,b,m=1}^{N} W_{a,n;b,m} S′_{a,n} S′_{b,m} + Σ_{a,n=1}^{N} I_{a,n} S′_{a,n},   (2.10)
where the weight matrix is symmetric, that is, W_{a,n;b,m} = W_{b,m;a,n} (a, n, b, m = 1, ..., N). A QAP whose energy function is given by 2.3 is one such problem. For the energy function 2.10, the following approximation can be made:

    ⟨E(S′)⟩_Λ ≈ (1/2) Σ_{a,n,b,m=1}^{N} W_{a,n;b,m} V_{a,n} V_{b,m} + Σ_{a,n=1}^{N} I_{a,n} V_{a,n}

              = (1/2) Σ_{a,b,c,d=1}^{N} ( Σ_{n,m=1}^{N} W_{a,n;b,m} S_{c,n} S_{d,m} ) X_{a,c} X_{b,d}
                + Σ_{a,b=1}^{N} ( Σ_{n=1}^{N} I_{a,n} S_{b,n} ) X_{a,b}

              = (1/2) Σ_{a,b,c,d=1}^{N} W̃_{a,b;c,d}(S) X_{a,b} X_{c,d} + Σ_{a,b=1}^{N} Ĩ_{a,b}(S) X_{a,b}

              ≡ Ẽ(X; S),                                                      (2.11)
where W̃_{a,b;c,d}(S) ≡ Σ_{n,m} W_{a,n;c,m} S_{b,n} S_{d,m} and Ĩ_{a,b}(S) ≡ Σ_n I_{a,n} S_{b,n}. The new weight matrix is also symmetric: W̃_{a,b;c,d} = W̃_{c,d;a,b}. In equations 2.9b and 2.11, we use an approximation similar to mean-field theory (Bilbro et al., 1989; Peterson & Söderberg, 1989; Yuille & Kosowski, 1994), which is not rigorously applicable to the configuration space of the assignment matrices. The entropy is also approximated, as

    H ≈ −Σ_{a,n=1}^{N} V_{a,n} log V_{a,n} = −Σ_{a,n=1}^{N} X_{a,π⁻¹(n)} log X_{a,π⁻¹(n)}
      = −Σ_{a,b=1}^{N} X_{a,b} log X_{a,b} ≡ H̃(X),                           (2.12)
where π is the permutation represented by S. Similarly, constraints 2.5 and 2.6 are also expressed in terms of the average permutation matrix X. Accordingly, the new continuous problem is defined by

    minimize   F̃(X; S) ≡ Ẽ(X; S) − T H̃(X)                                   (2.13a)

    subject to  Σ_{a=1}^{N} X_{a,b} = 1   (b = 1, ..., N)                     (2.13b)

                Σ_{b=1}^{N} X_{a,b} = 1   (a = 1, ..., N)                     (2.13c)

                Σ_{a=1}^{N} X_{a,a} = N − λ.                                  (2.13d)
The new problem is a minimization of the free energy function 2.13a, which is defined in terms of the average permutation rather than the average assignment. In addition, it is important that the problem is not a global one; it is a local problem that depends on the previous assignment S. In problem 2.13, the entropy 2.12, together with constraints 2.13b and 2.13c, prevents the average permutation X from leaving its domain [0, 1]^{N²}. In this sense, it functions as a barrier function (Luenberger, 1989; Yuille & Kosowski, 1994). The constant λ need not be an integer in this continuous problem.
With the Lagrange method, a solution to problem 2.13 can be given by the following simultaneous equations:

    X_{a,b} = U_{a,b} / (α_a β_b γ^{δ_{a,b}})                                 (2.14a)

    U_{a,b} ≡ exp( −(1/T) ∂Ẽ(X; S)/∂X_{a,b} )                                (2.14b)

    α_a = Σ_b U_{a,b} / (β_b γ^{δ_{a,b}})                                     (2.14c)

    β_b = Σ_a U_{a,b} / (α_a γ^{δ_{a,b}})                                     (2.14d)

    γ = (1/(N − λ)) Σ_a U_{a,a} / (α_a β_a),                                  (2.14e)

where α_a (a = 1, ..., N), β_b (b = 1, ..., N), and γ are Lagrange multipliers. For the energy function 2.11, ∂Ẽ/∂X_{a,b} = Σ_{c,d} W̃_{a,b;c,d} X_{c,d} + Ĩ_{a,b}. The equations defined by 2.14 are called the λ-DCN equations.

2.3 Basic Algorithm. The following basic algorithm works well in practice for solving the λ-DCN equations 2.14.
1. Set t = 0. Set X(0) to be

       X_{a,b}(0) = { (1 − λ/N)(1 + ε_{a,b})          if a = b,
                      λ(1 + ε_{a,b}) / (N(N − 1))     if a ≠ b,               (2.15)

   where ε_{a,b} denotes small noise (e.g., a uniform random value in [−0.1, 0.1]).

2. For all a, b, calculate

       U_{a,b} = exp( −(1/T) ∂Ẽ(X(t); S)/∂X_{a,b}(t) ).                      (2.16)

3. Set β_b^{old} = Σ_a U_{a,b} / (α_a(t) γ(t)^{δ_{a,b}}) for all b, and set γ^{old} = (1/(N − λ)) Σ_a U_{a,a} / (α_a(t) β_a(t)).

4. The following substeps are iterated:

   a. For all a, calculate

          α_a^{new} = Σ_{b=1}^{N} U_{a,b} / (β_b^{old} (γ^{old})^{δ_{a,b}}).  (2.17)

   b. For all b, calculate

          β_b^{new} = Σ_{a=1}^{N} U_{a,b} / (α_a^{new} (γ^{old})^{δ_{a,b}}).  (2.18)

   c. Calculate

          γ^{new} = (1/(N − λ)) Σ_{a=1}^{N} U_{a,a} / (α_a^{new} β_a^{new}).  (2.19)

   If each of α_a, β_b, and γ converges, set α_a(t+1), β_b(t+1), and γ(t+1) to those values. Then rescale α(t+1) so that Σ_a α_a(t+1) = 1.

5. For all a, b, calculate

       X_{a,b}(t+1) = U_{a,b} / (α_a(t+1) β_b(t+1) γ(t+1)^{δ_{a,b}}).         (2.20)

6. If X(t) converges or t equals t_max, exit the algorithm. Otherwise, add 1 to t and go to step 2.

This algorithm does not necessarily converge. The average permutation matrix X often exhibits a quasi-periodic or periodic oscillation. Therefore, the number of iterations is limited to t_max, which is set to 100 in the experiments in section 4. When constraint 2.13d (and hence step 4c) is removed, the above algorithm is almost equivalent to soft-assign (≈ DCN). In that case, there are convergence proofs for step 4 (Sinkhorn, 1964)3 and for the whole basic algorithm (Rangarajan et al., 1997). With step 4c, however, the convergence of step 4 has not yet been proved. Nevertheless, our computer simulations imply that step 4, including step 4c, always converges. The effect of step 4c is experimentally investigated in section 4.

2.4 λ-DCN Algorithm. When the temperature T is small, the average permutation matrix X obtained by the above basic algorithm tends to be close to a vertex of the domain hypercube. The matrix is then regarded as signifying one of the valid λ-permutation matrices. Moreover, we often obtain several valid permutation matrices during a single run of the basic algorithm. Among these, we choose the best permutation matrix: the one that minimizes the objective energy function (equation 2.2 for a QAP). This process retrieves a good new assignment matrix, obtained by applying one of the λ-permutation matrices to the previous assignment matrix S. However, the average permutation matrix X sometimes fails to approach

3
Our previous paper (Ishii & Sato, 1996) also provided another convergence proof based on the Lyapunov method.
Figure 1: λ-DCN procedure. It retrieves a new assignment S2 from the previous assignment S1, where S2 is the assignment obtained from S1 by a λ-permutation Y1. The procedure then retrieves another assignment S3, and the process continues.
any vertex of the hypercube. A typical case is that X falls into a two-cycle oscillation, and neither state signifies a valid permutation. In such a case, the result of the basic algorithm is regarded as a failure. When a new assignment is obtained, the assignment is improved by the 2-opt heuristics. This additional procedure needs negligible computation time. Our proposed procedure, called λ-DCN, repeats the process. Figure 1 depicts this procedure. After a single run, a new assignment S2, which is transformed from the previous assignment S1 by a λ-permutation Y1, is retrieved. The new assignment S2 is expected to be a good one among the assignments that are reachable from S1 by a λ-permutation. In the subsequent run, another assignment S3 is retrieved from S2 by calculating a good λ-permutation Y2. When the basic algorithm fails to obtain Y2, the starting assignment in the next run is again S2. Owing to the variation of the initial condition (step 1) in the basic algorithm, the next application of the basic algorithm will obtain a valid λ-permutation. When the parameter λ equals 2, this procedure corresponds to an analog version of the 2-opt heuristics, a local search algorithm over nearest neighbors. By taking a relatively large λ value, however, our λ-DCN algorithm can achieve a middle-range search over the possible assignments. This helps the system ignore shallow local minima around the previous assignment and enables a search for good but far-located local minima of the energy function. Furthermore, this procedure never terminates; nonequilibrium dynamics is introduced. The system is not trapped by any local minimum. In addition, the nonequilibrium dynamics naturally introduces variety into the solutions. Therefore, as the number of iterations (applications of the basic algorithm) becomes large, improvements in the solution can be expected. We discuss this further in section 4.
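The normalization loop at the heart of the basic algorithm (steps 3 through 5) is a Sinkhorn-style iteration extended with the trace multiplier γ. The sketch below is our own illustration (variable names are ours, and, as in the paper, convergence of the γ-extended loop is only empirical): it iterates equations 2.17 to 2.19 for a given positive matrix U and then forms X by equation 2.20, so that at a fixed point X satisfies constraints 2.13b through 2.13d.

```python
import numpy as np

def lambda_normalize(U, lam, iters=5000, tol=1e-12):
    """Fixed-point iteration for alpha, beta, gamma (eqs. 2.17-2.19),
    then X[a,b] = U[a,b] / (alpha[a] * beta[b] * gamma**delta(a,b)) (eq. 2.20).
    At a fixed point, rows and columns of X sum to 1 and trace(X) = N - lam."""
    N = U.shape[0]
    delta = np.eye(N)                       # delta_{a,b}
    alpha, beta, gamma = np.ones(N), np.ones(N), 1.0
    for _ in range(iters):
        g = gamma ** delta                  # gamma on the diagonal, 1 elsewhere
        alpha_new = (U / (beta[None, :] * g)).sum(axis=1)              # eq. 2.17
        beta_new = (U / (alpha_new[:, None] * g)).sum(axis=0)          # eq. 2.18
        gamma_new = (np.diag(U) / (alpha_new * beta_new)).sum() / (N - lam)  # eq. 2.19
        done = (np.abs(alpha_new - alpha).max() < tol
                and np.abs(beta_new - beta).max() < tol
                and abs(gamma_new - gamma) < tol)
        alpha, beta, gamma = alpha_new, beta_new, gamma_new
        if done:
            break
    X = U / (alpha[:, None] * beta[None, :] * gamma ** delta)
    return X, gamma
```

The paper's rescaling of α only fixes the scale ambiguity among α, β, and γ (X is unchanged by it) and is omitted here.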
3 λ-Interior DCN
Since the λ-DCN algorithm retrieves assignments using λ-permutations, it often misses good local minima located in the immediate neighborhood of the current assignment. To deal with this problem, we propose another method here, inspired by the interior point method (Karmarkar, 1984). In this method, the equality constraint 2.6 is replaced by the following inequality constraint,

    Σ_{a=1}^{N} Y_{a,a} ≥ N − λ,                                              (3.1)
which restricts the number of elements relocated by the permutation Y to be less than or equal to λ. To implement this inequality constraint, the following barrier function is introduced into the free energy function 2.13a:

    K̃(X) = T ( Σ_{a=1}^{N} X_{a,a} − M ) log( Σ_{a=1}^{N} X_{a,a} − M )
           + T (η − 1) ( Σ_{a=1}^{N} X_{a,a} − M ),                           (3.2)
where M ≡ N − λ and η (= 0 ~ 1) is a constant parameter. In this case, constraint 2.13d is removed. Using barrier function 3.2, equations 2.14b and 2.14e in the λ-DCN equations are replaced by
    U_{a,b} = exp( −(1/T) ∂Ẽ(X; S)/∂X_{a,b} − η δ_{a,b} )                    (3.3b)

    γ = ( √( M² + 4 Σ_a U_{a,a}/(α_a β_a) ) − M ) / 2.                        (3.3e)
This derivation is described in the appendix. Then equation 2.16 in the basic algorithm of section 2.3 is replaced by

    U_{a,b} = exp( −(1/T) ∂Ẽ(X(t); S)/∂X_{a,b}(t) − η δ_{a,b} );             (3.4)
and equation 2.19 is replaced by

    γ^{new} = ( √( M² + 4 Σ_a U_{a,a}/(α_a^{new} β_a^{new}) ) − M ) / 2.      (3.5)
Table 1: Solutions.

    Name     N    Champion   DCA   2-Opt  λ-DCN          λ-Inter DCN
    Tai80a   80   13557864   1.2   3.1    -0.060 (2568)  -0.040 (2481)
    Tai100a  100  21125314   0.50  3.2    -0.089 (469)   -0.082 (5656)
    Wil100   100  273038     0.27  0.73    0.46 (2109)    0.070 (2563)
    Tho150   150  8133484    0.33  1.5     0.24 (7139)    0.28 (6956)

Notes: "Name" is the problem's name, "N" is the problem size, and "Champion" is the previous best (champion) solution value. "DCA" is the solution obtained by DCN annealing, "2-Opt" is the best solution obtained by many 2-opt runs from various initial conditions, "λ-DCN" is the solution obtained by our λ-DCN method, and "λ-Inter DCN" is the solution obtained by our λ-interior DCN method. Each solution is given as the deviation (in percentage) from the previous champion solution. The numbers in parentheses are the numbers of applications of the basic algorithm needed to reach the best solution.
The other parts are the same as in the basic algorithm. The modified basic algorithm searches for a good λ′-permutation, where 0 ≤ λ′ ≤ λ, starting from the previous assignment S. Therefore, the new procedure simultaneously achieves a local search and a middle-range search in the vicinity of the previous assignment. The modified procedure is called the λ-interior DCN algorithm.

4 Experiments
Our proposed methods can be applied to any quadratic problem whose solutions are represented by permutations. In the case of a QAP whose energy function is given by equation 2.3, we set W_{a,n;b,m} = D_{a,b} C_{n,m} − g δ_{a,b} δ_{n,m} (a, n, b, m = 1, ..., N) and I_{a,n} = g/2 (a, n = 1, ..., N), where g is a small positive constant. Although λ is experimentally set at a good value, our empirical prescription is to set it at about N/2. The benchmark problems used in the experiments are Tai80a (N = 80), Tai100a (N = 100), Wil100 (N = 100), and Tho150 (N = 150), all of which were taken from QAPLIB (Burkard et al., 1997). The distance and flow matrices for these problems are symmetric and dense. Tho150 is the largest problem with dense matrices in QAPLIB. For Tai80a and Tai100a, the champion algorithms (Taillard, 1991; Battiti & Tecchiolli, 1994) are based on tabu search. For Wil100, the champion algorithm (Fleurent & Ferland, 1994) combines a genetic algorithm with tabu search. For Tho150, the champion algorithm (Amin, 1999) is called simulated jumping. The QAPLIB web site maintains the champion data for these problems. Table 1 shows the experimental results. The value in each column is the deviation, in percentage, from the present champion solution. For comparison, the table also contains solutions obtained by our previous approach (Ishii & Sato, 1998), DCN annealing, and the best solutions that the 2-opt heuristics could obtain over many runs. Note that DCN annealing is almost
equivalent to the algorithm proposed by Rangarajan et al. (1996), and its solutions are much better than those produced by the Potts spin network (i.e., soft-max) combined with a deterministic annealing procedure (Peterson & Söderberg, 1989). With DCN annealing, λ-DCN, and λ-interior DCN, the 2-opt heuristics improved the solutions. The table also shows the number of applications of the basic algorithm until the best solutions were retrieved for λ-DCN and λ-interior DCN. In Table 1, the results of our new methods are comparable to the champion solutions. For Tai80a and Tai100a, both of the proposed methods achieved better solutions than the previous champion algorithms. Our methods seem faster than, or at least as fast as, the previous champion algorithms, although a rigorous comparison is difficult due to the lack of reported computation times in the literature and the different computer environments. For Tai100a, λ-DCN took 45 minutes to obtain its best solution on a Silicon Graphics Origin 2000 computer. This computation time was for 469 applications of the basic algorithm. Ten thousand applications of the basic algorithm required about 12 hours and 20 hours in λ-DCN and λ-interior DCN, respectively.4 For the same problem, DCN annealing took 22 minutes for its best solution shown in Table 1. The soft-max combined with deterministic annealing took 32 minutes for its best solution, which was 1.03% worse than the champion solution.5 Our new methods are slower than the DCN and soft-max approaches. Since DCN annealing is almost deterministic, there is no variety in the solutions even if various initial conditions are prepared (Ishii & Sato, 1996). On the other hand, our new methods naturally introduce nonequilibrium dynamics that searches for possible solutions by solving local problems one after another. The advantage of our methods is not only an automatic multiple-restart scheme.
Figure 2 compares solutions obtained by the λ-DCN algorithm (solid lines) and the best solutions produced by multiple restarts of the λ-DCN basic algorithm from random initial assignments (dashed lines).6 In the multiple restarts, the number of initial assignments is set equal to the number of assignments obtained by λ-DCN. Figure 2 implies that the multiple restarts cannot achieve significant improvement even when

4 For Tai100a, the champion algorithm (Battiti & Tecchiolli, 1994) took about 10 hours on a Silicon Graphics Iris computer for a solution that was 0.17% worse than the present best solution. To obtain the present best solution, the algorithm would need more computation time. The same literature mentions that the champion algorithm for Tai80a (Taillard, 1991) was a little slower. The champion algorithm for Wil100 (Fleurent & Ferland, 1994) seems to have taken about three days to obtain a solution for Tai100a using a SUN SPARC 10 computer.
5 In these experiments, DCN annealing and soft-max annealing applied the subprocedure, which is similar to the λ-DCN basic algorithm, 137 and 162 times, respectively.
6 A more detailed description of the procedure: (1) randomly prepare an assignment, (2) improve the assignment by the 2-opt heuristics, and (3) apply the λ-DCN basic algorithm only once. Repeat this procedure.
[Figure 2 here: best solution (%), 0–6, plotted against the number of iterations, 0–500.]
Figure 2: Comparison between λ-DCN and multiple restarts of the basic algorithm. The abscissa and ordinate denote the number of applications of the basic algorithm and the best solution up to that number, respectively. Solid lines are the λ-DCN algorithm; dashed lines are the best solutions obtained by multiple restarts of the basic algorithm from randomly prepared initial assignments. For each number of applications, the number of initial assignments used by the multiple restarts is set equal to the number of valid assignments obtained by λ-DCN; that is, the number of candidate assignments is the same for the two methods. Six experiments were conducted for each method (one line per experiment).
the number of applications increases, while our λ-DCN can. The improvements due to sequential applications of the basic algorithm are very important. Table 2 shows the average and the standard deviation of solutions obtained by λ-DCN for various numbers of applications of the basic algorithm. Finally, let us briefly mention the effect of the constraint, equation 2.13d or 3.2, on the diagonal sum of the average permutation matrix X. In the experiments for λ-DCN and λ-interior DCN, step 4 in the basic algorithm always converged, although the convergence was very slow in some cases.7 Therefore, the basic algorithm is practically valid. If the convergence criterion for step 4 in λ-DCN or λ-interior DCN is set as tight as that in DCN, the number of iterations needed in that step becomes larger than that in DCN
7 In about 0.1% of cases in both methods, the convergence needed more than 1,000 iterations.
Table 2: Solutions Against Number of Iterations.

    Number of iterations   0            100          200          300          400          500
    Solution (%)           5.12 ± 0.26  0.47 ± 0.09  0.28 ± 0.04  0.24 ± 0.08  0.19 ± 0.08  0.11 ± 0.11

Notes: This table shows the average and the standard deviation of the Tai100a solutions obtained by the λ-DCN algorithm for various numbers of applications of the basic algorithm. Ten random initial assignments were prepared. Solution statistics are the difference (in percentage) from the previous champion solution. When the number of iterations is 0, the solutions are the initial random assignments improved by the 2-opt heuristics.
due to the effect of step 4c. However, λ-DCN and λ-interior DCN do not need a very tight convergence criterion for step 4, because their dynamics, which are based on permutations, do not require very accurate calculation. On the other hand, the solutions obtained by DCN annealing are highly dependent on the convergence criterion. In the experiments for Tai100a, the average (standard deviation) of the number of iterations for step 4 was 13.1 (± 30.1), 3.4 (± 15.6), and 23.2 (± 35.1) in DCN annealing, λ-DCN, and λ-interior DCN, respectively. In the same experiments, the basic algorithm converged in 76% and 48% of runs in λ-DCN and λ-interior DCN, respectively.

5 Conclusion
When λ is relatively large, it is computationally difficult to search for a good assignment over all of the assignments reachable by λ-permutations from the current assignment. We proposed two neural methods that partially achieve this mechanism by considering a free energy function that depends on the current assignment. These methods are λ-DCN and λ-interior DCN. The new methods also improve on the previous DCN (≈ soft-assign) annealing in two ways. First, since nonequilibrium dynamics is introduced, the system is not trapped by any local minimum, and the system can retrieve various candidates. Moreover, the solution is expected to improve under sequential applications of the λ-permutations. Second, since we take a permutation as the basic operation, the retrieved solutions always represent valid permutations. This is one of the reasons that our methods work well. In the experiments, we applied our methods to relatively large-scale (N = 80–150) QAP benchmark problems. The new methods were found to be comparable to the present champion algorithms; for two benchmark problems, they were better than the champion algorithms. We therefore conclude that our new methods are among the stronger algorithms for QAPs.
Appendix
Here, we derive the λ-interior DCN equations defined by equations 2.14a, 3.3b, 2.14c, 2.14d, and 3.3e. To minimize F̃(X; S) = Ẽ(X; S) − T H̃(X) + K̃(X) subject to constraints 2.13b and 2.13c, we define a Lagrange function:

    L = F̃ + Σ_{a=1}^{N} A_a ( Σ_{b=1}^{N} X_{a,b} − 1 ) + Σ_{b=1}^{N} B_b ( Σ_{a=1}^{N} X_{a,b} − 1 ),   (A.1)

where A_a (a = 1, ..., N) and B_b (b = 1, ..., N) are Lagrange multipliers. A stationary condition of the Lagrange function is given by

    ∂L/∂X_{a,b} = ∂Ẽ/∂X_{a,b} + T [ log X_{a,b} + 1 + δ_{a,b} ( log( Σ_a X_{a,a} − M ) + η ) ]
                  + A_a + B_b = 0,                                            (A.2)

with constraints 2.13b and 2.13c. Equation A.2 is solved as

    X_{a,b} = exp( −(1/T) ∂Ẽ/∂X_{a,b} ) / (α_a β_b)          if a ≠ b        (A.3a)

    X_{a,a} ( Σ_c X_{c,c} − M ) = exp( −(1/T) ∂Ẽ/∂X_{a,a} − η ) / (α_a β_a),  (A.3b)

where α_a ≡ exp(A_a/T + 1) and β_b ≡ exp(B_b/T) are the converted Lagrange multipliers. Equation A.3a is equivalent to 2.14a and 3.3b for a ≠ b. The summation of equation A.3b over a gives a quadratic equation for Σ_a X_{a,a}. A solution to the quadratic equation is given by

    Σ_a X_{a,a} = ( M + √( M² + 4 Σ_a U_{a,a}/(α_a β_a) ) ) / 2.              (A.4)

By defining γ ≡ Σ_a X_{a,a} − M, equation 3.3e is obtained. Equation 3.3b for a = b is determined from equation A.3b and the definition of γ.

Acknowledgments
We thank the referees and the editor for their insightful comments, which improved the quality of this article. We are grateful to Masa-aki Sato for valuable discussions.

References

Amin, S. (1999). Simulated jumping. Annals of Operations Research, 86, 23–38.
Battiti, R., & Tecchiolli, G. (1994). The reactive tabu search. ORSA Journal on Computing, 6, 126–140.
Bilbro, G., Mann, R., Miller, T. K., Snyder, W. E., Van den Bout, D. E., & White, M. (1989). Optimization by mean field annealing. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 91–98). San Mateo, CA: Morgan Kaufmann.
Burkard, R. E., Karisch, S. E., & Rendl, F. (1997). QAPLIB—A quadratic assignment problem library. Journal of Global Optimization, 10, 391–403. Available online at: http://www.imm.dtu.dk/~sk/qaplib/.
Fleurent, C., & Ferland, J. A. (1994). Genetic hybrids for the quadratic assignment problem. In P. Pardalos & H. Wolkowicz (Eds.), Quadratic assignment and related problems (Vol. 16, pp. 173–187). Providence, RI: American Mathematical Society.
Gold, S., & Rangarajan, A. (1996). Softassign versus softmax: Benchmarks in combinatorial optimization. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 627–632). Cambridge, MA: MIT Press.
Gold, S., Rangarajan, A., & Mjolsness, E. (1994). Clustering with a domain-specific distance measure. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 96–103). San Mateo, CA: Morgan Kaufmann.
Hasegawa, M., Ikeguchi, T., & Aihara, K. (1998). Harnessing of chaotic dynamics for solving combinatorial optimization problems. In S. Usui & T. Omori (Eds.), Proceedings of the Fifth International Conference on Neural Information Processing, 2 (pp. 749–752). Burke, VA: IOS Press.
Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Ishii, S., & Sato, M. (1995). Doubly constrained network for combinatorial optimization problems. In Proceedings of the 1995 International Symposium on Nonlinear Theory and Its Applications. Tokyo: IEICE.
Ishii, S., & Sato, M. (1996). Doubly constrained network for combinatorial optimization (Tech. Rep. No. TR-H-193). Kyoto: ATR.
Ishii, S., & Sato, M. (1997). Chaotic Potts spin model for combinatorial optimization problems. Neural Networks, 10, 941–963.
Ishii, S., & Sato, M. (1998). Constrained neural approaches to quadratic assignment problems. Neural Networks, 11, 1073–1082.
Karmarkar, N. (1984). A new polynomial-time algorithm for linear programming. Combinatorica, 4, 373–395.
Lin, S., & Kernighan, B. W. (1973). An effective heuristic algorithm for the traveling-salesman problem. Operations Research, 21, 498–516.
Luenberger, D. G. (1989). Linear and nonlinear programming (2nd ed.). Reading, MA: Addison-Wesley.
Martin, O., Otto, S. W., & Felten, E. W. (1992). Large-step Markov chains for the traveling salesman problem. Operations Research Letters, 11, 219–224.
Peterson, C., & Söderberg, B. (1989). A new method for mapping optimization problems onto neural networks. International Journal of Neural Systems, 1, 3–22.
Rangarajan, A., Gold, S., & Mjolsness, E. (1996). A novel optimizing network architecture with applications. Neural Computation, 8, 1041–1060.
Rangarajan, A., Yuille, A., Gold, S., & Mjolsness, E. (1997). A convergence proof for the softassign quadratic assignment algorithm. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 620–626). Cambridge, MA: MIT Press.
Sinkhorn, R. (1964). A relationship between arbitrary positive matrices and doubly stochastic matrices. Annals of Mathematical Statistics, 35, 876–879.
Taillard, E. D. (1991). Robust taboo search for the quadratic assignment problem. Parallel Computing, 17, 443–455.
Taillard, E. D. (1995). Comparison of iterative searches for the quadratic assignment problem. Location Science, 3, 87–105.
Yuille, A. L., & Kosowski, J. J. (1994). Statistical physics algorithms that converge. Neural Computation, 6, 341–356.

Received November 16, 1998; accepted September 23, 1999.
Neural Computation, Volume 12, Number 10
CONTENTS Article
Spike-Driven Synaptic Plasticity: Theory, Simulation, VLSI Implementation Stefano Fusi, Mario Annunziato, Davide Badoni, Andrea Salamon, and Daniel J. Amit
2227
Notes
Modeling Alternation to Synchrony with Inhibitory Coupling: A Neuromorphic VLSI Approach Gennady S. Cymbalyuk, Girish N. Patel, Ronald L. Calabrese, Stephen P. DeWeerth, and Avis H. Cohen
2259
Representation of Concept Lattices by Bidirectional Associative Memories Radim Bělohlávek
2279
Letters
A Silicon Implementation of the Fly’s Optomotor Control System Reid R. Harrison and Christof Koch
2291
Efficient Event-Driven Simulation of Large Networks of Spiking Neurons and Dynamical Synapses Maurizio Mattia and Paolo Del Giudice
2305
Clustering Irregular Shapes Using High-Order Neurons H. Lipson and H. T. Siegelmann
2331
Learning Chaotic Attractors by Neural Networks Rembrandt Bakker, Jaap C. Schouten, C. Lee Giles, Floris Takens, and Cor M. van den Bleek
2355
Generalized Discriminant Analysis Using a Kernel Approach G. Baudat and F. Anouar
2385
Generalization and Selection of Examples in Feedforward Neural Networks Leonardo Franco and Sergio A. Cannas
2405
Stationary and Integrated Autoregressive Neural Network Processes Adrian Trapletti, Friedrich Leisch, and Kurt Hornik
2427
Learning to Forget: Continual Prediction with LSTM Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins
2451
ARTICLE
Communicated by Misha Tsodyks
Spike-Driven Synaptic Plasticity: Theory, Simulation, VLSI Implementation
Stefano Fusi INFN Sezione RM1, University of Rome “La Sapienza,” 00185, Roma, Italy
Mario Annunziato Physics Department, University of Pisa, 56100 Pisa, Italy
Davide Badoni INFN, Sezione RM2, University of Rome “Tor Vergata,” 00133, Roma, Italy
Andrea Salamon Physics Department, University of Rome “La Sapienza,” 00185, Roma, Italy
Daniel J. Amit INFN Sezione RM1, University of Rome “La Sapienza,” 00185, Roma, Italy, and Racah Institute of Physics, Hebrew University, Jerusalem
We present a model for spike-driven dynamics of a plastic synapse, suited for aVLSI implementation. The synaptic device behaves as a capacitor on short timescales and preserves the memory of two stable states (efficacies) on long timescales. The transitions (LTP/LTD) are stochastic because both the number and the distribution of neural spikes in any finite (stimulation) interval fluctuate, even at fixed pre- and postsynaptic spike rates. The dynamics of the single synapse is studied analytically by extending the solution to a classic problem in queuing theory (Takács process). The model of the synapse is implemented in aVLSI and consists of only 18 transistors. It is also directly simulated. The simulations indicate that LTP/LTD probabilities versus rates are robust to fluctuations of the electronic parameters in a wide range of rates. The solutions for these probabilities are in very good agreement with both the simulations and measurements. Moreover, the probabilities are readily manipulable by variations of the chip’s parameters, even in ranges where they are very small. The tests of the electronic device cover the range from spontaneous activity (3–4 Hz) to stimulus-driven rates (50 Hz). Low transition probabilities can be maintained in all ranges, even though the intrinsic time constants of the device are short (≈ 100 ms). 
Synaptic transitions are triggered by elevated presynaptic rates: for low presynaptic rates, there are essentially no transitions. The synaptic device can preserve its memory for years in the absence of stimulation. Stochasticity of learning is a result of the variability of interspike intervals; noise is a feature of the distributed dynamics of the network. The fact
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2227–2258 (2000)
2228
Fusi, Annunziato, Badoni, Salamon, and Amit
that the synapse is binary on long timescales solves the stability problem of synaptic efficacies in the absence of stimulation. Yet stochastic learning theory ensures that it does not affect the collective behavior of the network, if the transition probabilities are low and LTP is balanced against LTD.

1 Introduction
Material learning devices have the capability of changing their internal states in order to acquire (learn) and store (memorize) information about the statistics of the incoming flux of stimuli. In the case of neural devices, this information is stored in the structure of couplings between neurons. Hence, developing a learning neural network device means designing a plastic synapse whose synaptic efficacy can be rapidly modified to acquire information and, at the same time, can be preserved for long periods to keep this information in memory. These two requirements are particularly important for on-line unsupervised learning, in which the network must handle an arbitrary stream of stimuli, with no a priori knowledge about their structure and relevance. In the absence of an external supervisor, the synaptic efficacy must be updated (even just a little) by each stimulus that is presented, or there will be no trace of any of them. Most discussion of learning dynamics to date has focused on the structure of a developing analog synaptic variable as a function of an incoming set of stimuli (Grossberg, 1969; Sejnowski, 1976; Bienenstock, Cooper, & Munro, 1982). Those must be extended before a material implementation can be contemplated. Developing an electronic plastic synapse that can be embedded in a large learning neural network implies analog VLSI (aVLSI) technology. Since the number of synapses can be as large as the square of the number of neurons, the design of the plastic synapse should follow two principles: economy in the number of transistors, to reduce chip area, and low power consumption. The neuromorphic approach, introduced by Mead (1989), has been adopted for the construction of a wide range of neural systems that emulate biological counterparts. Since these devices exploit the inherent physics of analog transistors to produce computation, they use less power and silicon area than their equivalent digital devices (Sarpeshkar, 1998). 
However, this approach has been applied mainly to specialized systems (e.g., sensory devices), with no learning. The lack of a simple synaptic device that could couple the neuromorphic analog neurons was probably one of the main limiting factors. There are two common approaches to implementing synaptic devices. First, the synaptic weights are stored on a digital device (typically a RAM) off-chip: the analog neurons read the synaptic value via digital-to-analog converters (DAC). Such memory can have arbitrary analog depth and can be preserved for indefinitely long periods. It is not, however, an acceptable solution for implementation of large-scale networks of analog neurons
Spike-Driven Synaptic Plasticity
2229
because most of the space would be occupied by the DAC and ADC converters. In the second approach, the memory of the synaptic efficacies is stored as the voltage across a capacitor (Elias, Northmore, & Westerman, 1997; Häfliger, Mahowald, & Watts, 1996): the device implementing the synapse can preserve a continuous set of values, but the memory is volatile and is lost on timescales of RC, where C is the capacitance and R is the leakage resistance. A promising solution in this direction is emerging in the form of synaptic devices using floating-gate technology (see, e.g., Diorio, Hasler, Minch, & Mead, 1998). These devices can preserve the synaptic efficacy with good analog depth and for times of the order of years (the problems related to the updating procedure, such as high voltages, are probably going to be solved). Here we chose an alternative solution, which is technically quite simple and is well within the scope of familiar technology. It is based on the theoretical observation that the performance of the network is not necessarily degraded if the analog depth of the synapse is limited, even to the extreme. Most of the effort to achieve large analog depth has been motivated by the drive to implement learning rules based on two hypotheses: that closely spaced, stable synaptic efficacies are essential for the proper functioning of the network and that on short timescales it is possible to acquire information about a stimulus, by small changes in the synaptic efficacy induced by neural activities elicited by that stimulus (Hopfield, 1982). As a matter of fact, another line of study always had it that associative memory could operate with binary synapses. In fact, if stimuli are coded sparsely (i.e., the fraction of neurons each stimulus activates is low), such networks also have optimal performance (Willshaw, 1969; Tsodyks, 1990; Amit & Fusi, 1994). 
Some empirical evidence that efficacies may indeed be binary is beginning to accumulate (Petersen, Malenka, Nicoll, & Hopfield, 1998). It is this option that was chosen in the development of the analog device, since the binary dynamic memory element has been the natural choice in the entire information industry for the last half-century. Networks with this kind of synapse operate as palimpsests (Nadal, Toulouse, Changeux, & Dehaene, 1986; Parisi, 1986; Amit & Fusi, 1994); that is, old stimuli are forgotten to make room for more recent ones. There is a sliding window of retrievable memory; only part of the past is preserved. The width of the window, the memory span, is related to the learning and the forgetting rates: fast rates imply short windows. It also depends on the fraction of synapses that change following each presentation: fractions close to 1 imply fast learning and forgetting, but the memory span is reduced. Note, by the way, that deep analog synapses do not share the palimpsest property, and when their memory is overloaded, they forget everything (Amit, Gutfreund, & Sompolinsky, 1987; Amit, 1989). Every stimulus makes a subdivision of the entire pool of synapses in the network; only active neurons are involved in changing synapses. Moreover, not all the synapses that connect neurons activated by a given stimulus
must be changed in a given presentation, or at all, in order to acquire information about the stimulus to be learnt. The learning process requires a mechanism that selects which synaptic efficacies are to be changed following each stimulation. In the absence of an external supervisor, a local, unbiased mechanism could be stochastic learning: at parity of conditions (equal pair of activities of pre- and postsynaptic neurons), the transitions between stable states occur with some probability; only a randomly chosen fraction of those synapses that might have been changed by a given stimulus is actually changed upon presentation of that stimulus. If stimuli are sparsely coded, transition probabilities are small enough, and long-term potentiation (LTP) is balanced against long-term depression (LTD), then optimal storage capacity is recovered, even for two-state synapses (Amit & Fusi, 1994). Moreover, small transition probabilities lead to slow learning and hence to the generation of internal representations that are averages over many nearby stimuli, that is, class representatives. Fast learning generates representatives faithful to the last stimulus of a class. Slow learning also corresponds to observations in inferotemporal cortex (Miyashita, 1993). On the other hand, the particular model chosen (two states for the synapse and linear interspike behavior of synapse and neuron) significantly simplifies the hardware implementation. It also makes possible an analytical treatment of the synaptic dynamics. On the network level, sparse coding and low transition probabilities make possible a detailed analysis of the learning process (Brunel, Carusi, & Fusi, 1998). What we present here is a study of the properties of the stochastic dynamics in a prototypical case. We focus on the dependence of the synaptic transition probabilities on the activities of the two neurons the synapse connects. It is not possible to describe such dynamics in a general case. 
We consider this study in analogy with the studies of the single integrate-and-fire (IF) neuron, as carried out, for example, by Ricciardi (1977) and Tuckwell (1988). This study has clarified the spike emission statistics of a leaky IF neuron with an afferent current of gaussian distribution. Clearly, the spiking dynamics of the neuron does not have to follow the RC circuit analogy (see, e.g., Gerstein & Mandelbrot, 1964), nor is the current in general gaussian. In the same sense we will choose a specific dynamical process for the synapse—the one also implemented in aVLSI—as well as specific processes for the pre- and postsynaptic activities. The synaptic dynamics is triggered by presynaptic spikes, which arrive in a Poisson process characterized by its rate; the postsynaptic neuron is an IF neuron with a linear leak and a “floor” (Fusi & Mattia, 1999), driven by a current that is a sum of a large number of Poisson processes. The synapse is described by an analog variable that makes jumps up or down whenever, upon presentation of a stimulus, a spike appears on the presynaptic neuron. The direction of the jump is determined by the level of depolarization of the postsynaptic neuron. The dynamics of the analog synaptic variable is similar to the BCM rule (Bienenstock et al., 1982), except that our synapse maintains its memory on long timescales in the absence
of stimulation. Between spikes, the synaptic variable drifts linearly up, toward a ceiling (down, toward a floor) depending on whether the variable is above (below) a synaptic threshold. The two values that delimit the synaptic variable are the stable synaptic efficacies. Since stimulation takes place on a finite (short) interval, stochasticity is generated by both the pre- and postsynaptic neuron. Transitions (LTP, LTD) take place on presynaptic bursts, when the jumps accumulate to overcome the refresh drifts. The synapse can preserve a continuous set of values for periods of the order of its intrinsic time constants. But on long timescales, only two values are preserved: the synaptic efficacy fluctuates in a band near one of the two stable values until a strong stimulation produces a burst of spikes and drives it out of the band and into the neighborhood of the other stable value. Stochastic learning solves the theoretical problems related to performance of the network, but small transition probabilities require bulky devices (Badoni, Bertazzoni, Buglioni, Salina, Amit, & Fusi, 1995). The dynamics of our synapse makes use of the stochasticity of the spike emission process of pre- and postsynaptic neurons. Hence, on the network level, the new synapse exploits the collective dynamical behavior of the interconnected neurons (Amit & Brunel, 1997; van Vreeswijk & Sompolinsky, 1997; Fusi & Mattia, 1999). In section 2 we present the model of the synapse and describe the details of the dynamics. In section 3 the transition probabilities are estimated analytically by extending the theory of the Takács queuing process and are studied as a function of the pre- and postsynaptic neuron rates. 
In section 4 we present the aVLSI implementation of the modeled synapse, and in section 5 we test the hardware device and show that stimulations induce stochastic transitions between the two stable states and that the transition probabilities that can be obtained satisfy the requirements of the stochastic learning theory. In section 6 we discuss extensions of the approach to other synaptic dynamics, the role of the synapse in a network, and the correspondence of the implemented synapse with neurobiological observation.

2 Synaptic Dynamics
The synaptic dynamics is described in terms of the analog internal variable X(t), which in turn determines the synaptic efficacy J(t), defined as the change in postsynaptic depolarization, or afferent current, induced by a single presynaptic spike. The synaptic efficacy has two values: it is potentiated (J(t) = J_1) when the internal variable X is above the threshold θ_X and depressed (J(t) = J_0) if X < θ_X. X is restricted to the interval [0, 1] and obeys

dX(t)/dt = R(t) + H(t);   (2.1)
at the boundaries (0 and 1), the right-hand side vanishes if it attempts to drive X outside the allowed interval. Hence X = 0 and X = 1 are reflecting barriers for the dynamics of X. R(t) is a refresh term, responsible for memory preservation. H(t) is the stimulus-driven Hebbian learning term, which is different from 0 only upon arrival of a presynaptic spike. Between spikes, the synaptic dynamics, dominated by R(t), is

R(t) = −α H(θ_X − X(t)) + β H(X(t) − θ_X),   (2.2)
which is a constant drift down at rate α, or up at rate β, depending on whether X is below or above the synaptic threshold θ_X (H is the Heaviside function). The Hebbian term depends on variables related to the activities of the two neurons connected by the synapse: (1) the spikes from the presynaptic neuron and (2) the instantaneous depolarization V_post of the postsynaptic neuron. Presynaptic spikes trigger the Hebbian dynamics. Each spike induces a jump in X whose value depends on the instantaneous depolarization of the postsynaptic neuron. The depolarization is indirectly related to the activity of the postsynaptic neuron (see the appendix) and is available at the synaptic contact. The capacitance of the neuron is exploited to compute this quantity, which is smooth and provides a finite support for coincidence between the presynaptic spikes and the postsynaptic activity. (See section 6 for possible extensions.) In the general case, the Hebbian term can be written as

H(t) = Σ_k F(V_post(t_k^pre)) δ(t − t_k^pre).   (2.3)
The sum is over all presynaptic spikes; spike number k arrives at the synapse at t_k^pre. The value of V_post upon arrival of the presynaptic spike determines the jump, F(V_post(t_k^pre)), in the synaptic variable. The sign of F determines whether the synapse is potentiated or depressed at the event. This form of the Hebbian term, for the analog synaptic variable, is similar to the BCM rule (Bienenstock et al., 1982). In the case studied and implemented here, we have chosen a particular form for F. If the presynaptic spike arrives when V_post > θ_V, then the synaptic internal state is pushed up by a, toward the potentiated state (X → X + a). If it arrives while V_post < θ_V, then X → X − b (b > 0). This can be summarized formally as

F(V_post) = a H(V_post(t) − θ_V) − b H(θ_V − V_post(t)).   (2.4)
The dynamics of equation 2.1 is illustrated in Figure 1 by simulating a synapse connecting two spiking neurons that linearly integrate a gaussian current.

[Figure 1 appears here: three stacked time traces over 0–150 ms: presynaptic spikes (top), synaptic internal variable X(t) with threshold θ_X (center), and postsynaptic depolarization V(t) with threshold θ_V (bottom).]

Figure 1: Synaptic dynamics. Time evolution of the simulated internal variable X(t) describing the synaptic state (center), presynaptic spikes (top), and depolarization V(t) of the postsynaptic neuron (bottom). In the intervals between successive presynaptic pulses, X(t) is driven toward one of the two stable states 0, 1, depending on whether X(t) is below or above the threshold θ_X. At the times of presynaptic spikes, indicated by dotted vertical lines, X(t) is pushed up (X → X + a) if the depolarization V(t) (circles) is above the threshold θ_V, and down (X → X − b) if V(t) < θ_V.

The choice of the linear integrate-and-fire (LIF) neuron, with a reflecting barrier at zero depolarization, is also a natural candidate for aVLSI implementation (Mead, 1989). Moreover, networks composed of such neurons have qualitatively the same collective behavior as networks of conventional RC IF neurons, which exhibit a wide variety of characteristics observed in cortical recordings (Fusi & Mattia, 1999).

2.1 Stable Efficacies and Stochastic Transitions. In the absence of stimulation, the pre- and postsynaptic neurons have low (spontaneous) rates, and typically the Hebbian term is zero. Hence the dynamics is dominated by the refresh term R(t), which stabilizes the synaptic internal variable at one of the two extreme values, 0 and 1. In the interval (0, 1), we have: if X(t) > θ_X (X(t) < θ_X), then a positive current β (negative −α) drives X(t) up (down) until it stops at the reflecting barrier 1 (0) (see Figure 1). 1 and 0 are stable states, unless a large fluctuation drives the synaptic efficacy across (below from above, or above from below) the threshold θ_X.
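To make the dynamics concrete, the sketch below (Python) simulates one stimulation trial of equations 2.1–2.4. All numerical values are illustrative placeholders rather than the measured hardware parameters of Table 1, and instead of simulating the postsynaptic depolarization explicitly, each presynaptic spike is taken to find V_post > θ_V with an independent probability Qa (a simplification; the full model uses the depolarization trace itself).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (order of magnitude only; measured values in Table 1).
theta_X = 0.374                 # synaptic threshold
a, b = 0.15, 0.12               # up / down jump amplitudes (equation 2.4)
alpha, beta = 0.0067, 0.0184    # refresh drifts, 1/ms (equation 2.2)

def run_trial(X0, nu_pre, Qa, T=500.0):
    """One stimulation of length T (ms): Poisson presynaptic spikes at rate
    nu_pre (Hz); at each spike, V_post > theta_V with probability Qa."""
    X, t, trace = X0, 0.0, [(0.0, X0)]
    while True:
        dt = min(rng.exponential(1000.0 / nu_pre), T - t)  # to next spike
        # Refresh drift (equation 2.2): away from theta_X, clipped at 0 and 1.
        X = max(0.0, X - alpha * dt) if X < theta_X else min(1.0, X + beta * dt)
        t += dt
        trace.append((t, X))
        if T - t <= 1e-9:
            return X, trace
        # Hebbian jump (equations 2.3-2.4), with reflecting barriers at 0, 1.
        X = min(1.0, max(0.0, X + (a if rng.random() < Qa else -b)))
        trace.append((t, X))

X_end, trace = run_trial(X0=0.0, nu_pre=50.0, Qa=0.7)
print("final X =", round(X_end, 3), "-> LTP" if X_end > theta_X else "-> no transition")
```

With a high Qa (both neurons active), up-jumps dominate and an occasional burst of closely spaced presynaptic spikes can carry X across θ_X; rerunning with a different seed can change the outcome, which is the stochasticity illustrated in Figures 2 and 3.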
[Figure 2 appears here: two columns of three stacked traces (presynaptic spikes, synaptic internal variable X(t), postsynaptic depolarization V(t)) over 0–200 ms; left column labeled LTP TRANSITION, right column labeled NO TRANSITION.]

Figure 2: Stochastic LTP. Pre- and postsynaptic neurons have the same mean rate, and the synapse starts from the same initial value (X(0) = 0). (Left) LTP is caused by a burst of presynaptic spikes that drives X(t) above the synaptic threshold. (Right) At the end of stimulation, X returns to the initial value. At parity of activity, the final state is different in the two cases. See Figure 1 for a definition of the three windows.
To illustrate the stochastic nature of the learning mechanism, we assume that the presynaptic spike train is Poissonian, while the afferent current to the postsynaptic neuron is gaussian and uncorrelated with the presynaptic process. The synaptic dynamics depends on the time statistics of the spike trains of the pre- and postsynaptic neurons. During stimulation, synapses experience two types of transitions between their two stable states:

Long-term potentiation. Both pre- and postsynaptic neurons emit at a high rate. The frequent presynaptic spikes trigger many jumps in X(t), mostly up (X → X + a), since in order to emit at a high rate, the postsynaptic neuron spends much of the time near the emission threshold (θ = 1), and hence above the threshold θ_V.

Long-term homosynaptic depression. The presynaptic neuron emits at a high rate, while the postsynaptic neuron emits at a low (spontaneous) rate. The many jumps triggered are mostly down (X → X − b), because the postsynaptic neuron is mostly at low depolarization levels.
During stimulation, the synapses move up and down. Following the removal of the stimulus, the synaptic efficacy may return to its initial state, or it may make a transition to another state. Figure 2 shows two cases at parity of spike rates of pre- and postsynaptic neurons. In one case (left) a fluctuation drives the synaptic efficacy above threshold; when the stimulus is removed,
[Figure 3 appears here: two columns of three stacked traces (presynaptic spikes, synaptic internal variable X(t), postsynaptic depolarization V(t)) over 0–200 ms; left column labeled LTD TRANSITION, right column labeled NO TRANSITION.]

Figure 3: Stochastic LTD. As in the case of LTP, pre- and postsynaptic neurons have the same mean rate, and the synapse starts from X(0) = 1. (Left) A presynaptic burst provokes LTD. (Right) At the end of stimulation, X returns to the initial value. Conventions as in Figure 1.
X is attracted to the high state: LTP has occurred. In the second case (right), when the stimulus is removed, X is below threshold and is attracted by the refresh to the initial state. There is no transition. The stochasticity of LTD transitions is exemplified in Figure 3. At low presynaptic rates, there are essentially no transitions, irrespective of the postsynaptic rate. The empirical transition probability is defined as the fraction of cases, out of a large number of repetitions of the same stimulation conditions, in which at the end of the stimulation, the synapse made a transition to a state different from its original state.

3 Estimating Expected Transition Probabilities

3.1 General Considerations. Given the synaptic parameters and the activity of the two neurons connected by the synapse during the stimulation, the transition probabilities contain most of the information relevant for the learning process. These probabilities depend on the following factors:
The duration of the stimulation. Transition probabilities increase with presentation time T_p. At small transition probabilities, the dependence is linear.

The structure and the intensity of stimulation. The presynaptic neuron acts as a trigger. At low firing rates, the probability of having a transition is negligible. The activity of the postsynaptic neuron is related to the statistics of the depolarization and, hence, the probability of having either a potentiating or a depressing jump.
2236
Fusi, Annunziato, Badoni, Salamon, and Amit
Temporal patterns in the spike trains emitted by the two neurons. Synchronization can affect the transition probability. We will consider such cases in the determination of the parameters of the VLSI synapse (see section 5).

The parameters governing the synaptic dynamics (see equations 2.1–2.3): the two refresh currents (α, β), the jumps induced by the presynaptic spikes (a, b), and the threshold θ_X.

We study the dependence of LTP and LTD transition probabilities on the mean emission rates of the pre- and postsynaptic neurons, (ν_pre, ν_post). The activities of the two neurons are read in different ways. For the presynaptic neuron, we take the probability per unit time that a spike is emitted, which is the mean rate. For the postsynaptic neuron, we use the probability that the depolarization is below (or above) the threshold when the presynaptic spike arrives. In order to determine this probability, we have to specify the framework in which the synaptic device operates. The appendix describes the scenario used. We point out in section 6 that this is not a restrictive feature of the framework.

3.2 Synaptic Dynamics as a Generalized Takács Process. We assume that presynaptic spikes are emitted in a Poisson process with rate ν_pre. The value (size and direction) of successive jumps is determined by the independently distributed random variables V_post(t_k^pre) and V_post(t_{k+1}^pre) of equation 2.3. The independence of these variables is a good approximation when ν_pre is not too high, because the mean interval between t_k^pre and t_{k+1}^pre is longer than the typical correlation time of the Wiener process driving the depolarization of the postsynaptic neuron. Yet the transition probabilities may be quite sensitive to these correlations, even if the correlation time of the Wiener process is relatively short. 
In fact, transitions are usually caused by bursts of presynaptic spikes, for which the time intervals between successive spikes are short, irrespective of the mean rate. Here, for simplicity, we ignore these correlations and assume that the random variables V_post(t_k^pre) are independent. This position would be analogous to the neglect of possible temporal correlations in the afferent current to an IF neuron (Ricciardi, 1977). In this case, X(t) is a Markov process, similar to the virtual waiting time in Takács's single-server queue (e.g., see Cox & Miller, 1965). The process is somewhat generalized, to take into account that:
There are two possible drifts, one up and one down, depending on where X(t) is relative to its threshold.

There are two reflecting barriers.

Jumps can be down as well as up.
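Before turning to the analytical treatment, this generalized jump-drift process can also be studied by direct Monte Carlo: repeat the stimulation many times and count how often the synapse ends on the far side of θ_X. A minimal sketch (Python), under the independence assumption above and with illustrative parameters (not the measured values of Table 1):

```python
import random

random.seed(1)

# Illustrative parameters.
a, b = 0.15, 0.12                 # up / down jump amplitudes
alpha, beta = 0.0067, 0.0184      # refresh drifts, 1/ms
theta_X, Tp = 0.374, 500.0        # synaptic threshold, stimulation time (ms)

def transition_prob(X0, nu_pre, Qa, trials=2000):
    """Empirical transition probability: fraction of stimulations after which
    X sits on the other side of theta_X (and is then attracted by the
    refresh drift to the other stable state)."""
    flips = 0
    for _ in range(trials):
        X, t = X0, 0.0
        while Tp - t > 1e-9:
            dt = min(random.expovariate(nu_pre / 1000.0), Tp - t)
            # Refresh drift away from theta_X, clipped at the barriers.
            X = max(0.0, X - alpha * dt) if X < theta_X else min(1.0, X + beta * dt)
            t += dt
            if Tp - t > 1e-9:     # the spike fell inside the window: jump
                X = min(1.0, max(0.0, X + (a if random.random() < Qa else -b)))
        flips += (X > theta_X) != (X0 > theta_X)
    return flips / trials

# LTP: both neurons active (high Qa); LTD: postsynaptic at low rate (low Qa).
P_LTP = transition_prob(X0=0.0, nu_pre=50.0, Qa=0.7)
P_LTD = transition_prob(X0=1.0, nu_pre=50.0, Qa=0.1)
```

Such an estimator is what the hardware tests of section 5 do physically; the Takács equations below give the same quantities without sampling.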
Those generalizations can be made, and we proceed to describe the resulting equations for the distribution. The probability density function of X(t) consists of two discrete probabilities, P_0(t) that X(t) = 0 and P_1(t) that X(t) = 1, and a density p(X, t) for X(t) ∈ (0, 1). The density p(X, t) evolves according to two equations, one for each subinterval ]0, θ_X[ and ]θ_X, 1[. p(θ_X, t) = 0 for any t, since θ_X is an absorbing barrier (see, e.g., Cox & Miller, 1965):

∂p(X, t)/∂t = α ∂p(X, t)/∂X + ν[−p(X, t) + A(X, t)]   if X ∈ ]0, θ_X[

∂p(X, t)/∂t = −β ∂p(X, t)/∂X + ν[−p(X, t) + B(X, t)]   if X ∈ ]θ_X, 1[.
The first term on the right-hand side of each equation is the drift due to the respective refresh currents, −α and β. The terms proportional to ν are due to presynaptic spikes: νp(X, t) is the fraction of processes leaving X at time t, per unit time, νA(X, t) is the probability of arriving at X in the lower subinterval, and νB(X, t) is the same probability in the upper subinterval. The expressions for A and B are, respectively,

A = Q_a [P_0(t) δ(X − a) + H(X − a) p(X − a, t)] + Q_b H(1 − b − X) p(X + b, t)

B = Q_a H(X − a) p(X − a, t) + Q_b [P_1(t) δ(X − (1 − b)) + H(1 − b − X) p(X + b, t)].
In A, the term in square brackets accounts for the +a jumps, starting from X = 0 (first term) or from X − a (second term). The last term accounts for the −b jumps (we assume that b < 1 − θ_X). Q_b (= 1 − Q_a) is the probability that the depolarization of the postsynaptic neuron is below the threshold θ_V; it is related to the firing rate of the postsynaptic neuron (see the appendix). The interpretation of B above θ_X is analogous. The two discrete probabilities P_0 and P_1 are governed by:

dP_0(t)/dt = −ν Q_a P_0(t) + α p(0, t) + ν Q_b ∫_0^b p(x, t) dx

dP_1(t)/dt = −ν Q_b P_1(t) + β p(1, t) + ν Q_a ∫_{1−a}^{1} p(x, t) dx.

They describe processes that leave the stable states or arrive at them driven by spikes (the first and third terms, respectively, on the right-hand side of each equation), and processes that arrive at each of the two stable states driven by the refresh drift (the middle term in each equation). These equations are completed by the normalization condition, which restricts the process
to the interval [0, 1]:

P_0(t) + ∫_0^1 p(x, t) dx + P_1(t) = 1.
3.3 LTP and LTD Transition Probabilities. From the solution of the Takács equations, one derives the transition probabilities per presentation for LTP and LTD. To obtain the LTP probability, P_LTP, one assumes that the synapse starts at X(0) = 0, which corresponds to the initial condition p(X, 0) = 0, P_1(0) = 0, and P_0(0) = 1. If the stimulation time is T_p, we have:

P_LTP = ∫_{θ_X}^{1} p(x, T_p) dx + P_1(T_p).
Similarly, for the depression probability per presentation, P_LTD, the initial condition is the high value—p(X, 0) = 0, P_0(0) = 0, P_1(0) = 1—and

P_LTD = ∫_0^{θ_X} p(x, T_p) dx + P_0(T_p).
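One possible numerical scheme for these equations (a sketch; the authors' actual discretization is not specified here) bins the interior density, advects it with the refresh drifts, redistributes the jumping mass at each step together with the two point masses P_0 and P_1, and reads off P_LTP and P_LTD from the expressions above. All parameters are illustrative.

```python
import numpy as np

# Illustrative parameters (not the measured values of Table 1).
a, b = 0.15, 0.12                  # up / down jump amplitudes
alpha, beta = 0.0067, 0.0184       # refresh drifts, 1/ms
theta_X = 0.374                    # synaptic threshold
nu, Qa = 0.05, 0.7                 # presynaptic rate (1/ms) and P(up-jump)
Qb = 1.0 - Qa
Tp, dt = 500.0, 0.1                # stimulation time and time step (ms)

N = 400                            # bins for the interior density on (0, 1)
xc = (np.arange(N) + 0.5) / N      # bin centres

def shift(w, pos):
    """Re-deposit masses w at positions pos, interpolating linearly between
    bin centres; mass pushed past a barrier is returned separately."""
    to0, to1 = w[pos <= 0.0].sum(), w[pos >= 1.0].sum()
    keep = (pos > 0.0) & (pos < 1.0)
    wk, j = w[keep], pos[keep] * N - 0.5
    j0 = np.floor(j).astype(int)               # ranges over -1 .. N-1
    lo, hi = wk * (1.0 - (j - j0)), wk * (j - j0)
    out = np.zeros(N)
    in0, in1 = j0 >= 0, j0 + 1 <= N - 1
    np.add.at(out, j0[in0], lo[in0])
    np.add.at(out, j0[in1] + 1, hi[in1])
    # Mass interpolated past the outermost bin centres joins P0 / P1.
    return out, to0 + lo[~in0].sum(), to1 + hi[~in1].sum()

def transition_probability(start_low):
    m = np.zeros(N)                            # interior probability mass
    P0, P1 = (1.0, 0.0) if start_low else (0.0, 1.0)
    for _ in range(int(Tp / dt)):
        # Refresh drift (equation 2.2): away from theta_X on both sides.
        m, d0, d1 = shift(m, np.where(xc < theta_X, xc - alpha * dt,
                                      xc + beta * dt))
        P0, P1 = P0 + d0, P1 + d1
        # A presynaptic spike occurs with probability q = nu*dt per step;
        # the affected mass jumps up by a (prob. Qa) or down by b (prob. Qb).
        q = nu * dt
        up, _, u1 = shift(q * Qa * m, xc + a)
        dn, l0, _ = shift(q * Qb * m, xc - b)
        inj, _, _ = shift(np.array([q * Qa * P0, q * Qb * P1]),
                          np.array([a, 1.0 - b]))   # jumps out of 0 and 1
        m = (1.0 - q) * m + up + dn + inj
        P0, P1 = (1.0 - q * Qa) * P0 + l0, (1.0 - q * Qb) * P1 + u1
    return m[xc > theta_X].sum() + P1 if start_low else m[xc < theta_X].sum() + P0

P_LTP = transition_probability(start_low=True)
P_LTD = transition_probability(start_low=False)
```

The scheme conserves total probability up to floating-point error, mirroring the normalization condition above.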
3.4 LTP/LTD Probabilities: Theory.
3.4.1 The Determination of Q_a. In order to compute P_LTP and P_LTD as a function of the mean rates of the two neurons connected by the synapse, we solve numerically the Takács equation for each pair of frequencies ν_pre, ν_post. To do this, one must calculate Q_a and Q_b, which depend on the distribution of the depolarization of the (postsynaptic) neuron, p(v). In general, this distribution is not uniquely determined by the neuron's emission rate. For an IF neuron, for example, the rate is a function of two variables: the mean and the variance of the afferent current. For the linear IF neuron, given the mean and variance, one can compute explicitly p(v) (Fusi & Mattia, 1999). Hence, we use a stimulation protocol that allows a one-to-one correspondence between the output rate and the distribution of the depolarization. This is achieved by assuming that the afferent current to the neuron is a sum of a large number of independent Poisson spike processes (see the appendix). Figure 4 exhibits the characteristics of the resulting distribution. On the left is p(v), for 0 < v ≤ θ, for the relevant ν_post (between 1 and 60 Hz). For low spike rates, p(v) is concave, and the probability for a depolarization near the spike emission threshold θ (= 1), as well as for an upward jump (above θ_V (= 0.7)), is low. As ν_post increases, the distribution becomes convex and increasingly uniform. Given p(v) at some value of ν_post, one computes Q_a = ∫_{θ_V}^{θ} p(v) dv, which determines whether the synaptic jump is down or up. Figure 4 (right) shows Q_a as a function of ν_post. In conclusion, with the parameters a, b, α, β, θ_X, and Q_a, Q_b, and fixing the stimulation time T_p, the probabilities for LTP and LTD can be computed for any pair ν_pre, ν_post.
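The correspondence between ν_post and Q_a can also be checked by brute force: simulate the linear IF neuron with a floor (after Fusi & Mattia, 1999) under a gaussian current and measure the fraction of time the depolarization spends above θ_V; for a presynaptic Poisson train independent of the postsynaptic current, that fraction equals Q_a. A sketch in which the leak and current values are guesses, not the appendix's protocol:

```python
import numpy as np

rng = np.random.default_rng(7)

theta, theta_V = 1.0, 0.7     # emission and up-jump thresholds
lam = 0.01                    # constant (linear) leak, 1/ms -- illustrative
dt, T = 0.1, 20000.0          # time step and simulated duration (ms)

def Qa_and_rate(mu, sigma):
    """Simulate dV = (mu - lam) dt + sigma dW with a reflecting floor at 0
    and spike-and-reset at theta; return (time fraction V > theta_V, rate in Hz)."""
    n = int(T / dt)
    noise = rng.normal(0.0, sigma * np.sqrt(dt), n)   # gaussian current
    V, above, spikes = 0.0, 0, 0
    for k in range(n):
        V += (mu - lam) * dt + noise[k]
        if V < 0.0:
            V = 0.0               # the "floor" (reflecting barrier)
        elif V >= theta:
            V = 0.0               # spike emission and reset
            spikes += 1
        above += V > theta_V
    return above / n, 1000.0 * spikes / T

Qa, rate = Qa_and_rate(mu=0.055, sigma=0.08)
```

Sweeping mu (and hence the output rate) traces out a curve of Q_a versus ν_post of the kind shown in Figure 4 (right).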
[Figure 4 appears here: (left) surface plot of p(v) for 0 < v ≤ 1 versus ν between 1 and 60 Hz; (right) Q_a versus ν between 0 and 60 Hz.]

Figure 4: Distributions of the postsynaptic depolarization, p(v) (left), and Q_a (probability for an upward jump) (right) versus ν (= ν_post), in the stimulation procedure described in the appendix. For low spike rates, p(v) is concave, and the probability for a depolarization near the spike emission threshold θ (= 1) is low. As ν_post increases, the distribution becomes convex and increasingly uniform. The white line on the surface is p(v) at the upward-jump threshold θ_V (= 0.7). Q_a = ∫_{θ_V}^{θ} p(v) dv.
3.4.2 Parameter Choices. The sets of parameters of the synaptic dynamics are the same as in the hardware tests (see section 5). T_p = 500 ms, as in a typical neurophysiological experiment. The rate of spontaneous activity has been set at 4 Hz and the typical stimulation rate at 50 Hz. We have chosen two sets of synaptic parameters to check that the behavior can be manipulated (see Table 1). For each set we looked for a regime with the following

Table 1: Two Sets of Parameters of the Synaptic Dynamics.

              a                  b                  α                     β
Set 1  DQ  0.150 ± 0.009    0.117 ± 0.012    0.00670 ± 0.00054    0.0184 ± 0.0018
       PQ  498 ± 30 mV      395 ± 25 mV      22.1 ± 1.7 mV/ms     62.4 ± 6.0 mV/ms
Set 2  DQ  0.170 ± 0.007    0.138 ± 0.020    0.00654 ± 0.00035    0.0185 ± 0.002
       PQ  562 ± 25 mV      457 ± 40 mV      21.6 ± 1.0 mV/ms     61.3 ± 6.5 mV/ms

Notes: The power supply voltage (the distance between the two stable synaptic efficacies) is 3.3 V. It sets the scale for the dimensionless units (the DQ quantities) used in the extended Takács equations and determined by the measuring procedure. PQ are the corresponding physical quantities. The synaptic threshold θ_X is 0.374 (1236 mV) in both cases. The static regulations of the two refresh currents α and β are the same for the two sets (except for electronic noise).
Fusi, Annunziato, Badoni, Salamon, and Amit
characteristics:

Slow learning. LTP at stimulation (ν_pre = ν_post = 50 Hz) is 0.03 for set 1 and 0.1 for set 2. LTD for neurons with anticorrelated activity (ν_pre = 50 Hz, ν_post = 4 Hz) is at 10^−3 and 10^−2, respectively. This means that approximately 30 (10) repetitions of the same stimulus are required to learn it (Brunel et al., 1998). Smaller probabilities can be achieved for the same rates, but exhaustive hardware tests become too long.

LTP/LTD balancing. We have aimed at a region in which the total numbers of LTP and LTD transitions in a presumed network are balanced. For random stimuli of coding level f, balancing implies P_LTD ≈ f P_LTP. In our case, f = 0.03 for set 1 and 0.1 for set 2.
Stability of memory. In spontaneous activity the transition probabilities should be negligible.

In Figure 5 we plot the surfaces P_LTP(ν_pre, ν_post) and P_LTD(ν_pre, ν_post) for parameter set 1. The (ν_pre, ν_post) space divides into three regions:

1. ν_pre and ν_post both high: P_LTP is much higher than P_LTD. The stronger the stimulation, the higher the transition probabilities. P_LTP drops sharply when one of the two rates goes below 10 Hz and becomes negligible (about 5 × 10^−8) around the spontaneous rates.

2. ν_pre high, ν_post around spontaneous levels: LTD dominates, even though the stimulus-driven P_LTD at (ν_pre = 50 Hz, ν_post = 4 Hz) is a factor of 30 smaller than the stimulus-driven P_LTP at (ν_pre = ν_post = 50 Hz). This guarantees the balance between potentiation and depression. P_LTD decays very rapidly in the direction of lower ν_pre and more slowly in the direction of increasing ν_post.

3. ν_pre low: both LTP and LTD transition probabilities are small. This is the region in which the synapse remains unchanged on long timescales. A network with such synapses should perform well as an associative memory (Amit & Fusi, 1994; Brunel et al., 1998).

Figure 6 is another representation of the phenomenology discussed above: the color map is the surface of the asymptotic probability P_asy of having a potentiated state following a large number of presentations of the same

Figure 6 (facing page): Asymptotic "learning rule" P_asy (see equation 3.1) versus (ν_pre, ν_post). If P_LTP and P_LTD are less than 10^−3 times their value under stimulation, P_asy is set to 0.5; there is no clear tendency. In the upper right corner (both neurons activated by the stimulus), LTP is dominant; in the upper left corner, LTD dominates (homosynaptic LTD). The bright wedge (green) corresponds to equal probabilities. In the lower region, transition probabilities are negligible. The parameters are the same as in Figure 5.
Figure 5: (Left) LTP transition probabilities (log scale). (Right) LTD transition probabilities, versus pre- and postsynaptic neuron rates (parameter set 1, Table 1).
pattern of activities, given by:

P_asy(ν_pre, ν_post) = P_LTP(ν_pre, ν_post) / [P_LTP(ν_pre, ν_post) + P_LTD(ν_pre, ν_post)].   (3.1)
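Equation 3.1 is easy to evaluate directly. The sketch below also converts a spontaneous per-presentation flip probability into a mean memory lifetime, under my assumption (consistent with the figures quoted in the text) that each 500 ms presentation window is an independent flip opportunity.

```python
def p_asy(p_ltp, p_ltd):
    """Asymptotic probability of ending in the potentiated state (eq. 3.1)."""
    return p_ltp / (p_ltp + p_ltd)

# Under stimulation, set 1 (values quoted in section 3.4.2): LTP dominates
print(p_asy(0.03, 1e-3))            # close to 1: the synapse tends to potentiate

# Memory stability: mean waiting time before a spontaneous transition,
# assuming one flip opportunity per presentation window of Tp = 0.5 s
Tp = 0.5                             # s
p_spont = 5e-8                       # spontaneous LTP probability, set 1
lifetime_days = Tp / p_spont / 86400.0
print(lifetime_days)                 # on the order of 100 days
```

The set-2 value p_spont = 5e-7 gives a lifetime roughly ten times shorter, matching the text's estimates.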
It is near 1 if the synapse tends to be potentiated, near 0 if it tends to be depressed, and around 0.5 if the two transition probabilities are about equal. When both transition probabilities are 10^−3 times their value in the presence of stimulation (P_LTP < 10^−4, P_LTD < 10^−6), we set the surface height to 0.5, to indicate that there is no clear tendency.

Figure 6 illustrates the way the synapse encodes activity information, or the "learning rule." For each pair of rates (ν_pre, ν_post), P_asy determines whether the synapse will be potentiated or depressed. If the stimulus activates the two neurons connected by the synapse, the synaptic efficacy tends to be potentiated. If the presynaptic neuron is firing at high rates and the postsynaptic neuron has low activity (below 5–20 Hz, depending on the presynaptic neuron's activity), the synapse tends to be depressed.

Finally, we note that the LTP probability in spontaneous activity, which sets the upper bound on the stability of memory, is 5 × 10^−8 (5 × 10^−7 for parameter set 2). In other words, the mean time one has to wait until the synapse makes a spontaneous transition and loses its memory is approximately 100 days for parameter set 1 and 10 days for parameter set 2. This time depends on the typical activation rates, the spontaneous rates, and the time constants of the synaptic dynamics. For parameter set 2, it is relatively short because the probabilities are high. Still, it is 10^7 times longer than the longest time constant of the synaptic dynamics. For lower spontaneous rates, one can achieve mean times of years: with a spontaneous rate of 2 Hz, for example, it is 4 years for parameter set 1 and 1.2 years for set 2.

4 Hardware Implementation
The mechanism has been implemented in aVLSI (Badoni, Salamon, Salina, & Fusi, 2000) using standard analog CMOS 1.2 μm technology. The synapse occupies about 90 μm × 70 μm. The schematic design of the synapse is depicted in Figure 7. The pre- and postsynaptic neurons are simulated in this study. In the network implementation, they are a modified version of Mead's neuron (Mead, 1989; Fusi, Del Giudice, & Amit, in press; Badoni et al., 2000). There are four functional blocks.

4.1 Capacitor C. The capacitor C acts as an analog memory element: the amount of charge on the capacitor is directly related to the internal synaptic state X(t):

X(t) = (U(t) − U_min) / (U_max − U_min),
Figure 7: Schematic of the synaptic device. The voltage across the capacitor C is the internal variable X(t). The refresh block implements R(t) and provides the refresh currents. The activity-dependent Hebbian block implements H(t). The dendrite block produces the synaptic transmission when a presynaptic spike arrives (the devices corresponding to the pre- and postsynaptic neurons are not shown; they are simulated). See the text for a description of each block.
where U(t) is the voltage across the synaptic capacitor; it can vary in the band delimited by U_min ∼ 0 and U_max ∼ V_dd (the power supply voltage).

4.2 Hebbian Block. The Hebbian block implements H(t), which is activated only on the arrival of a presynaptic spike. The synaptic internal state jumps up or down, depending on the postsynaptic depolarization. The input signal PostState determines the direction of the jump and is the outcome of the comparison between two voltage levels: the depolarization of the postsynaptic neuron V(t) and the threshold θ_V. If the depolarization is above the threshold, then PostState is ∼ 0; otherwise it is near the power supply voltage V_dd. In the absence of presynaptic spikes, the two MOSFETs controlled by NotPreSpk and PreSpk are open (not conducting), and no current flows in the circuit. During the emission of a presynaptic spike, both of these MOSFETs are closed, and the PostState signal controls which branch is activated. If PostState = 0 (the postsynaptic depolarization is above the threshold θ_V), the three topmost MOSFETs of the upper branch are closed, and current flows from V_dd to the capacitor. The charging rate is determined by BiasP, the current flowing through the upper MOSFET, which is set by V(BiasP) (this MOSFET is the slave of a current mirror). Analogously,
when PostState = 1, the lower branch is activated, and current flows from the capacitor to the ground. Hence, the jumps a and b (see equation 2.3) are given by

a = BiasP Δt / (C (U_max − U_min)),    b = BiasN Δt / (C (U_max − U_min)),
where Δt is the duration of the presynaptic spike.

4.3 Refresh Block. The refresh block implements R(t), which charges the capacitor if the voltage U(t) is above the threshold U_θ (corresponding to θ_X) and discharges it otherwise. It is the term that tends to damp any small fluctuation that drives U(t) away from one of the two stable states. If the voltage across the capacitor is below the threshold (U(t) < U_θ), the voltage output of the comparator is ∼ V_dd; the pMOS controlled by V(Refrp) is not conducting, while the nMOS controlled by V(Refrn) is closed. The result is a current flow from the capacitor to the ground. If U(t) > U_θ, the comparator produces its minimal output voltage, the lower pMOS is closed, and current flows from V_dd, through the upper pMOS, to the capacitor. Since the nMOS is always closed, part of the current coming from V_dd flows to the ground. The net effect is that the capacitor is charged at a rate proportional to the difference between the current determined by V(Refrp) (Refrp) and the current flowing through the lower MOS (Refrn) (see Figure 7). The two refresh currents α and β of equation 2.2 are, respectively:

α = Refrn / (C (U_max − U_min)),    β = (Refrp − Refrn) / (C (U_max − U_min)).
See Badoni et al. (2000).

4.4 Dendrite Block. The dendrite block implements the synaptic efficacy. The postsynaptic potential evoked by a presynaptic spike (EPSP) depends on the synaptic internal state X(t). In our implementation, if X(t) is below the threshold θ_X, the EPSP is J_0; otherwise, it is J_1. This mechanism is implemented as follows. Current is injected only upon the arrival of a presynaptic spike (the topmost MOSFET acts as a switch that is closed while the presynaptic neuron is emitting a spike). The output of the refresh-block comparator determines the amount of current injected into the postsynaptic neuron. The lower MOSFET in the left branch acts as a switch whose state depends on the outcome of the comparison, performed by the refresh term, between X(t) and θ_X. If X(t) > θ_X, the switch is closed, and the current injected is the sum of the two currents I_1 and I_2. Otherwise, only the MOSFET controlled by V(I2) is conducting, and the total current is I_2. Hence the EPSPs J_0 and J_1 can be regulated through the control
currents I_1 and I_2:

J_0 = I_2 Δt / C_n,    J_1 = (I_1 + I_2) Δt / C_n,   (4.1)

where C_n is the capacitance of the postsynaptic neuron and Δt is the spike duration.

5 Testing the Electronic Synapse

5.1 The Testing Setup. We present measured LTP and LTD probabilities for the electronic synapse, for two sets of parameters of the synaptic dynamics, as a function of the spike rates of the pre- and postsynaptic neurons. The main objectives are to test whether the behavior of the electronic synapse is well described by the model of section 2 and to check whether the desired low LTP and LTD transition probabilities, and the balance between them, can be achieved and manipulated with reasonable control currents. We use the same setup for measuring the synaptic parameters of the chip as for the measurements of the transition probabilities. The only difference is the choice of special stimulation protocols for the parameter determination, much as in neurobiology (Markram, Lübke, Frotscher, & Sakmann, 1997).

5.2 Measuring the Synaptic Parameters. In the parameter measurement protocol, the presynaptic neuron emits a regular train of spikes, while the postsynaptic depolarization is kept fixed (for LTP, it is constantly above the threshold θ_V). If the interspike interval of the regular train is T, then the arrival of a presynaptic spike (at t = 0) generates a triangular wave in X(t):
X(t) = X(0) + a − αt.

We assume that a < θ_X. If the rate is low (T > a/α), the synaptic internal state is reset to 0 by the refresh current α before the arrival of the next spike. If T < a/α, the effects of successive spikes accumulate and eventually drive the synaptic potential across the synaptic threshold θ_X, to the potentiated state. The transition occurs when

na − (n − 1)Tα ≥ θ_X,   (5.1)

where n is the minimal number of presynaptic spikes needed to make the synaptic potential cross the synaptic threshold θ_X. If one progressively increases the period T, at some point the minimal number of pulses n will change, and one extra spike will be needed in order to cross the threshold. At this point the relation 5.1 becomes an equality:

na − (n − 1)Tα = θ_X.   (5.2)
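Solving relation 5.1 for n gives the staircase n(T) directly; a sketch with the set-1 values of Table 1:

```python
import math

# Set-1 parameters from Table 1 (dimensionless)
a, alpha, theta_X = 0.150, 0.00670, 0.374

def n_min(T):
    """Minimal number of spikes of period T (in ms) needed to cross theta_X,
    from relation 5.1: n*a - (n-1)*T*alpha >= theta_X."""
    assert T < a / alpha, "rate too low: the refresh decay wins"
    return math.ceil((theta_X - T * alpha) / (a - T * alpha))

print(n_min(0.0))                              # high-rate limit: 3 spikes
print([n_min(T) for T in (11.0, 15.0, 18.0)])  # the staircase grows with T
```

The jump discontinuities of this staircase are exactly the special points (n_k, T_k) used below.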
By increasing or decreasing the period T, we can collect a set of such special points. For each special period T_k, we have a corresponding n_k: for T < T_k, n_k presynaptic pulses are needed to cross the threshold, while for T > T_k the minimal number of spikes is n_k + 1. In fact, two of these special points, say (n_i, T_i) and (n_j, T_j), determine a and α:

a = θ_X [(n_j − 1)T_j − (n_i − 1)T_i] / [n_i(n_j − 1)T_j − n_j(n_i − 1)T_i],
α = θ_X (n_j − n_i) / [n_i(n_j − 1)T_j − n_j(n_i − 1)T_i].   (5.3)
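As a consistency check on equations 5.3, one can generate two exact special points from a known (a, α) and verify that the formulas recover them; a sketch with set-1-like values:

```python
def jump_and_decay(ni, Ti, nj, Tj, theta_X=0.374):
    """Recover (a, alpha) from two special points (n_i, T_i), (n_j, T_j),
    by solving n*a - (n-1)*T*alpha = theta_X at both points (eq. 5.3)."""
    den = ni * (nj - 1) * Tj - nj * (ni - 1) * Ti
    a = theta_X * ((nj - 1) * Tj - (ni - 1) * Ti) / den
    alpha = theta_X * (nj - ni) / den
    return a, alpha

# Round trip: choose (a, alpha), generate two exact special points,
# and check that the formulas recover the parameters.
a0, alpha0 = 0.150, 0.0067
ni, nj = 4, 6
Ti = (ni * a0 - 0.374) / ((ni - 1) * alpha0)   # period at which n_i is exact
Tj = (nj * a0 - 0.374) / ((nj - 1) * alpha0)
print(jump_and_decay(ni, Ti, nj, Tj))          # recovers a0 and alpha0
```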
An analogous procedure provides an estimate of b and β. The synapse is initially set to X = 0. A high-frequency train of pulses is generated for the presynaptic neuron: T = T_min = 11 ms for a and α, and T_min = 2 ms for b and β. The difference is due to the fact that, at equal frequency, LTD requires more spikes to cross the threshold (for example, for set 1, θ_X = 2.49a, while 1 − θ_X = 5.35b). As soon as the synaptic voltage X(t) crosses the threshold θ_X, the train is stopped, and the number of spikes emitted by the presynaptic neuron since the beginning of the trial is stored in n(T). Then T is increased (typically ΔT = 0.1–0.2 ms) and the procedure is repeated, to obtain n(T) for a given range of periods. The discontinuities of the curve are the special points (n_i, T_i). a_ij and α_ij are determined for each pair (i, j) of special points, using equations 5.3. The final estimate is given by the average of these values. The standard deviation provides an estimate of the fluctuations due to electronic noise. The measured parameters and the corresponding standard deviations are reported in Table 1, in both dimensionless and physical values. The parameters entering the extended Takács equations are dimensionless. The corresponding physical quantities are obtained using the actual voltage of the threshold θ_X, with the distance between the two stable synaptic efficacies equal to the power supply voltage. The parameters have been adjusted to satisfy the requirements listed in section 3. Note that the errors are larger for the LTD parameters. This is because more jumps are required to reach the threshold, and the electronic noise accumulates as the square root of the number of jumps.

5.3 Measuring the Transition Probabilities. Poissonian spike trains are generated for the presynaptic neuron by the host computer.
The distribution of the depolarization of the LIF neuron is used to generate the probabilities Q_a (Q_b) that the postsynaptic neuron is above (below) the threshold θ_V, and hence the up (down) jump probabilities (see the appendix). For each presynaptic spike, an independent binary number j is generated: j = 1 with probability Q_a, and j = 0 with probability Q_b (= 1 − Q_a). This number determines the direction of the induced jump.
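The trial dynamics just described (Poisson presynaptic spikes, jump direction drawn with probability Q_a, linear refresh between spikes) can be sketched as follows; this is my reconstruction of the section-2 equations, with the dimensionless set-1 parameters of Table 1 and Q_a = 0.5 as an assumed placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensionless set-1 parameters (Table 1); Q_a = 0.5 is an assumed placeholder
a, b = 0.150, 0.117                # up/down jumps on a presynaptic spike
alpha, beta = 0.00670, 0.0184      # refresh rates, per ms
theta_X = 0.374                    # synaptic threshold
Qa = 0.5                           # prob. that the depolarization is above theta_V

def run_trial(rate_hz, Tp_ms=500.0, dt=0.1, X=0.0):
    """One stimulation trial: Poisson presynaptic spikes drive jumps in X(t);
    between spikes the refresh term pushes X toward 0 or 1 across theta_X."""
    for _ in range(int(Tp_ms / dt)):
        if rng.random() < rate_hz * dt * 1e-3:       # presynaptic spike this step?
            X += a if rng.random() < Qa else -b      # Hebbian jump, direction ~ Q_a
        X += (beta if X > theta_X else -alpha) * dt  # bistable refresh
        X = min(1.0, max(0.0, X))                    # reflecting barriers at 0 and 1
    return X

print(run_trial(50.0))   # final value of the internal variable, in [0, 1]
```

Repeating run_trial many times and counting threshold crossings is exactly how the transition probabilities below are estimated.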
The synapse is set to one of its two stable values, and then the two neurons are stimulated (fixing their emission rates) for T_p = 500 ms. Following the stimulation, the host PC reads the resulting stable value of the synapse and establishes whether a transition has occurred. This procedure is repeated from 10^4 to 2 × 10^5 times (depending on the transition probability), and the transition probability is estimated by the relative frequency of occurrence of a transition. The testing time is reduced by discarding trials in which the internal variable X(t) is likely to remain, throughout the trial, too far from the threshold to induce a transition. In order to identify these trials, we simulate the dynamics described by the ordinary differential equations of section 2 and take only the trials in which the simulated internal variable enters a band around the threshold (X(t) > θ_X − 1.5a for LTP, and X(t) < θ_X + 1.5b for LTD). In the other trials, the electronic synapse is likely to remain unchanged, even if the parameters have not been estimated accurately or the simulated dynamics does not reproduce exactly the real dynamics. Hence they are safely counted as no-transition trials. The upper and lower bounds of the confidence interval of the estimated probabilities are given by Meyer (1965):

P_{low/high} = [Pn + k²/2 ∓ k (P(1 − P)n + k²/4)^{1/2}] / (n + k²),   (5.4)

where n is the total number of repetitions, P is the estimated probability, and k is the width of the confidence interval in sigmas. In our case, k = 3.

5.4 Results. The experimental transition probabilities are compared with those predicted by the theory of section 3 and with those measured in a computer simulation of the protocol used for the hardware test, for the two sets of parameters of Table 1. If the model of section 2 is adequate for describing the behavior of the electronic synapse, the computer simulation values should be in agreement with the measured experimental values.
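The error bars used in these comparisons come from equation 5.4, which translates directly into code (this is a k-sigma score interval; the trial counts in the example are illustrative):

```python
import math

def confidence_bounds(P, n, k=3.0):
    """Lower/upper bounds of equation 5.4 (a k-sigma score interval)."""
    half = k * math.sqrt(P * (1.0 - P) * n + k * k / 4.0)
    return ((P * n + k * k / 2.0 - half) / (n + k * k),
            (P * n + k * k / 2.0 + half) / (n + k * k))

# e.g. 30 transitions observed in 10^4 repetitions (illustrative counts)
low, high = confidence_bounds(30 / 1e4, 1e4)
print(low, high)
```

Unlike the naive P ± k√(P(1−P)/n) interval, this form stays positive even when very few transitions are observed, which matters for probabilities as small as those measured here.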
This question becomes more pressing given the uncertainties in the measured parameters, as expressed in Table 1. In what follows we use the central estimates of the parameters. On the other hand, if the theoretical approach, as expressed in the extended Takács equations, is a valid one, the numerical integration of the equations should give a good asymptotic estimate for both the experimental values and the simulated ones (for a large number of presentations). Two sets of parameters have been chosen to show that, beyond the quality of the model and of its solutions, the performance of the synaptic system can be manipulated rather effectively, despite the strict requirements imposed: small, balanced transition probabilities for sparsely coded stimuli. Note that the time constants of the refresh currents are the same for the two sets: 1/α ≈ 150 ms and 1/β ≈ 50 ms.
Figure 8: LTP probabilities versus presynaptic (= postsynaptic) rate, in a three-way comparison, for the two sets of parameters of Table 1. Lower curves and points: set 1; upper: set 2. (Left) Numerical solution of the extended Takács equations (solid curves) and measurements in a simulation (diamonds and circles) (10^5 repetitions for the upper curves and 2 × 10^5 for the lower ones). (Right) Theoretical prediction (solid curves, as on the left) and transition probabilities for the hardware device (from 10^4 to 4 × 10^4 presentations) (diamonds and circles). Error bars: confidence intervals (see equation 5.4).
The results are reported in Figures 8 and 9 for LTP and LTD, respectively. Each figure is a three-way comparison for the two sets of parameters (see Table 1). Upper curves and points are for set 2; the lower ones are for set 1. The graphs on the left compare theory (extended Takács equations) and simulations; the ones on the right compare the same theory and probability measurements on the electronic device. To reduce the two-dimensional rate space to one, the probabilities for LTP are given for a typical stimulation situation in which both neurons have the same rate. The higher the rate, the stronger the stimulation. For LTD, instead, the postsynaptic rate is set at a spontaneous level, 4 Hz, and the presynaptic rate is varied.

Figure 9: LTD probabilities versus presynaptic rate (the postsynaptic rate is 4 Hz) in a three-way comparison theory-simulation-experiment, as in Figure 8.

The theoretical predictions are almost identical to the simulation results, which implies that the numerical solution of the extended Takács equations provides a reliable estimate of the modeled dynamics, even for rather small probabilities (a range of five orders of magnitude). The theoretical predictions are also in good agreement with the measured experimental values. This is an indication that the device works as expected and is not sensitive to the uncertainties in the parameters.

Finally, Figures 10 and 11 present a comparison of the measured (chip) and theoretical (extended Takács) LTP and LTD transition probabilities, respectively, over a two-dimensional (pre- and postsynaptic) rate space for parameter set 2. Each point of the surfaces represents the transition probability for a given pair of pre- and postsynaptic rates. It is coded by both color and elevation. The average percentage discrepancy between the measured and the predicted transition probabilities is about 11% for LTP and about 14% for LTD (the discrepancy between the measured value P_m and the prediction P_p is defined as 2|P_m − P_p| / (P_m + P_p)). The relative discrepancies are larger for LTD because the uncertainties in the parameters are larger and because these probabilities are at least 10 times smaller.

6 Discussion

6.1 Balance Sheet. We have carried out an extensive test (about 1.5 × 10^6 presentations in all) in which most of the plane of the pre- and postsynaptic activities is explored. The agreement demonstrates that the dynamics of the synapse is well described by the model of section 2, and that the simple synapse is a reliable, manipulable device, integrable on a large scale.

6.2 The Scope of the Synaptic Dynamics. This particular realization of a plastic synaptic device resolves the tension between the need to learn from every stimulus (fast) and the need to preserve memory on long timescales: to learn and to forget slowly. The solution is conceptually simple and is implementable in aVLSI. The result satisfies the expectations quite well. The synapse has two dynamical regimes: an analog, volatile regime, in which
Figure 10: LTP transition probability surfaces (log scale). (Left) Experiment. (Right) Theory. The coding is by both elevation and color. The number of repetitions for each point is 10^5. The average relative discrepancy is 11%.
Figure 11: LTD transition probability surfaces. (Left) Experiment. (Right) Theory. As in Figure 10. Number of repetitions: 2 × 10^4 to 4 × 10^4.
it is modified little and the modifications decay in a relatively short time; and a discrete, binary efficacy, which is modified in a stochastic way and can preserve memory for very long times.

The synapse and the analysis presented in this study are a particular, though very rich, case of a wide range of situations that can, and should, be selected by confrontation with experiment and/or with computational
desiderata. The particular choices made were of two types: convenience of electronic implementation and amenability to detailed analysis. The first suggested the dynamics of the synapse; the second, the dynamics of the neurons connected by the synapse. The synapse may have a different refresh mechanism (such as an exponential, rather than linear, decay). It may also use a postsynaptic variable other than the postsynaptic depolarization. For the mechanism to work, it must be related to a variable that smooths the very brief instants in which postsynaptic spikes are emitted. An alternative to the depolarization could be some quantity that measures the time interval between spikes emitted by the pre- and the postsynaptic neurons. It would require some extra internal degree of freedom, and hence some extra components in the chip, but the framework would still hold. The extended Takács description covers a wide range of cases and can go unmodified. Simplifying assumptions were also made for the stimulation mechanism, mainly that the neurons are LIF neurons and that the jumps, which depend on the postsynaptic depolarization, are independently distributed. Both can be relaxed, at the price of losing the analytic tool of the Markov process; simulation and experiment can still be performed. We reemphasize that the context in which the study has been conceived is similar to that of the IF neuron (Ricciardi, 1977; Fusi & Mattia, 1999), in which a motivated, simplifying choice is made for the neural dynamics and a simple, though generic, choice is made for the afferent current. Such an approach allows for a detailed study, for the development of intuitions, and for the generation of new questions. The simplified models can then be embedded in networks in which the assumptions may not be precisely satisfied and yet function satisfactorily. For the synapse, such a paradigmatic model has not been presented before.
6.3 Requisites at the Level of a Network. The synapse presented here is the result of an effort to extend the study of the properties of recurrent neural networks to include on-line learning (see also Badoni et al., 1995; Del Giudice, Fusi, Badoni, Dante, & Amit, 1998; Amit, Del Giudice, & Fusi, 1999). Sixty of these synapses have been embedded in a pilot aVLSI network of 21 neurons (14 excitatory and 7 inhibitory; only excitatory-excitatory connections are plastic) (Badoni et al., 2000; Fusi et al., in press), as well as in large-scale simulation experiments (Mattia & Del Giudice, in press). The general framework developed for learning networks has convinced us that synapses should have discrete and bounded values for their efficacies. This is suggested by considerations of implementability (see, e.g., Badoni et al., 2000), as well as by considerations of performance as an associative memory. The latter are related to the fact that it should be a rather general requirement that an open learning memory have the palimpsest property rather than the "overload catastrophe." But for a network with discrete synapses to
perform well and have a large enough memory, learning should be stochastic and slow. This puts a heavy load on the synapses, because implementing low transition probabilities for individual synapses requires cumbersome devices.

6.4 The Present Synapse in the Context of a Learning Network. The scheme has the very attractive (and novel) feature that it transfers the load of generating low-probability synaptic transitions to the collective dynamics of the network. In other words, not only the memory but also the functional noise is distributed. The source of stochasticity is in the spike emission processes of the neurons, and small transition probabilities can be easily achieved because the LTP and LTD transitions are based on the (approximate) coincidence of events that are relatively rare (fluctuations in the presynaptic spike train and in the postsynaptic depolarization). These rare events are the result of the collective dynamics of the network in which the pre- and postsynaptic neurons are embedded. Indeed, the time constants of the single neurons could not account for the long mean interspike intervals observed in cortical recordings of spontaneous activity, for example. The statistics of recorded spike trains are good enough for the dynamics of our synapse (Softky & Koch, 1992). Networks composed of IF neurons show similar features (Amit & Brunel, 1997; van Vreeswijk & Sompolinsky, 1997; Fusi & Mattia, 1999). The next step will be to check whether the spikes generated by the simulated networks can drive the synaptic dynamics as expected. Preliminary results (Mattia & Del Giudice, in press) indicate that networks of LIF neurons, when stimulated, produce stochastic transitions. In those simulations, it is already clear that the learning process can be described as a random walk between the two stable states of the synapses, as in Brunel et al. (1998). The behavior of the network can also be modified by setting the parameters appropriately.
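A toy version of that random walk (my sketch, with illustrative flip probabilities rather than the chip's measured ones) shows the palimpsest property: an old pattern fades gradually under ongoing stochastic learning instead of being catastrophically erased.

```python
import numpy as np

rng = np.random.default_rng(1)

n_syn = 20_000
q = 0.02     # illustrative probability that a new stimulus rewrites a synapse

# Store one random pattern, then keep learning unrelated random stimuli
stored = rng.random(n_syn) < 0.5
syn = stored.copy()
overlap = []
for t in range(100):
    touched = rng.random(n_syn) < q                 # synapses hit by a new stimulus
    syn[touched] = rng.random(touched.sum()) < 0.5  # rewritten at random
    overlap.append(float((syn == stored).mean()))

# The excess over the 0.5 chance level decays like (1 - q)^t: gradual,
# palimpsest-style forgetting rather than an abrupt overload catastrophe.
print(overlap[0], overlap[-1])
```

Smaller q stretches the sliding memory window at the cost of slower learning, which is the trade-off discussed next.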
By varying the transition probabilities, it is possible to control the learning rate (in terms of repetitions of the same stimulus) and, consequently, the size of the sliding window of the memory span. This means that it is possible to have fast single-shot learning, with a correspondingly short memory, as well as slow learning, using the same device. In fact, one simple handle is the rates provoked by the stimuli.

6.5 The aVLSI Synapse and Biology. The synaptic device has been tested on biological timescales; that is, the rates are similar to those observed in associative cortex, and the typical presentation time is 500 ms. The two parameters that determine the timescale of the synaptic dynamics are the two refresh currents, α and β. Varying these currents, one can obtain similar behavior over a wide range of timescales, presumably up to 10^3 times faster than the biological scale (requiring a much faster data acquisition and stimulation system). It is tempting to speculate that one of the reasons for the
brain to operate at low speeds is also the lack of fast enough stimulation and acquisition systems. Given that the ideas underlying the hardware implementation are rather general, it is natural to ask about their biological relevance. The question is not about a mapping onto a series of biochemical reactions that ultimately trigger the expression of synaptic change. The biological plausibility is related to observable consequences of spike activity, that is, to the experimental protocols that induce activity-dependent LTP and LTD.

6.5.1 Discreteness of Synapses. It is known that the discreteness of the stable synaptic states is compatible with experimental data (see, e.g., Bliss & Collingridge, 1993). Recently Petersen et al. (1998) provided experimental evidence that long-term potentiation is all-or-none, meaning that for each synapse, only two efficacies can be preserved on long timescales. Models of networks with this type of synapse have been studied in detail (Amit & Fusi, 1994; Brunel et al., 1998).

6.5.2 Synaptic Dynamics. It appears that at low presynaptic rates, the synaptic changes are temporary, and after stimulation the synaptic efficacy returns to its initial value (Markram et al., 1997). Sudden transitions to the potentiated or depressed state are observed when the rate is increased. This would be a nice corroboration of our scenario, in which a change in the internal variable that does not make X cross the threshold is damped. Recent experiments (Markram et al., 1997; Zhang, Tao, Holt, Harris, & Poo, 1998; Bi & Poo, 1999) indicate that induction of LTP and LTD requires "precise timing" of presynaptic spikes and postsynaptic action potentials: LTP requires that presynaptic spikes precede postsynaptic spikes by no more than about 10 ms, and LTD that presynaptic spikes follow the postsynaptic spike within a short time window.
Our synapse is not fully compatible with these results, since it is the postsynaptic depolarization that determines whether the synapse is potentiated or depressed, rather than "precise timing." What is consistent is that whenever a presynaptic spike precedes a postsynaptic action potential, the depolarization is likely to be near the emission threshold, and therefore above the threshold θ_V, and LTP is likely to be induced. When a presynaptic spike occurs just after the emission of a postsynaptic action potential, the neuron is likely to be hyperpolarized, and the depolarization will tend to be below θ_V; this produces LTD. Where the analogy fails is when the presynaptic spike precedes the postsynaptic one by more than 10 ms: our synapse may still potentiate or depress, whereas the biological one seems not to. There may be different reasons for the remaining discrepancy. The depolarization of a simple IF neuron may not map directly onto the depolarization of a complex biological neuron. There may instead be some internal variable that accounts for other internal states and is sensitive to particular time intervals before and after the postsynaptic spike.
2254
Fusi, Annunziato, Badoni, Salamon, and Amit
Figure 12: Alternative forms for the dependence of the analog synaptic jump (provoked by a presynaptic spike) F upon the postsynaptic depolarization, V_post. Full curve (step function): implemented in the present synapse; constant positive jumps for high depolarization, negative jumps for low V_post. Dashed curve: an alternative, with a transition region of very small jumps (unable to produce any permanent modification); it produces behavior similar to the "precise timing" plasticity observed in the experiments.
Alternatively, a simple modification of the synaptic dynamics would allow reproducing the experimental results of Markram et al. (1997). For example, it is sufficient to extend the function F, expressing the value of the synaptic jump (see equation 2.4) upon the arrival of a presynaptic spike. The form used in our chip is depicted by the step function (full curve) in Figure 12: it takes a negative value (−b) below some θ_V and a positive value (+a) above. The second example in the figure depicts a case of soft transition: a jump up for V_post near θ, a jump down for V_post very low; in a region of intermediate V_post, the size of the jump is zero or very small. If the two regions in which the synaptic jumps are significant are concentrated near enough to the depolarization boundaries, the experimental results can be reproduced quantitatively. In other words, given that the experiment is carried out in a deterministic situation (Markram et al., 1997), there will be a sharp cutoff for LTP and LTD in the time difference between the arrival of the pre- and postsynaptic spikes. But a synaptic mechanism for which both LTP and LTD have a sharp cutoff in the time difference between the arrival of pre- and postsynaptic spikes ("precise timing") is rather insensitive to input spike rates (Abbott & Song, 1998; Senn, Tsodyks, & Markram, in press), and it is not easy to encode information expressed in mean emission rates. Such a synapse would operate in a regime based on time coincidence; ours, although conceived to operate in an asynchronous regime (described in section 5), is nevertheless rather sensitive to spike synchronization.
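The two candidate forms of F in Figure 12 can be written down directly. A minimal sketch follows; the jump sizes a and b, the threshold θ_V, and the soft-transition boundaries are hypothetical illustration values, not taken from the chip:

```python
# Sketch of the two forms of the synaptic jump function F(V_post) in
# Figure 12. The step form is the one implemented in the chip; the
# soft-transition form is the proposed alternative. All numerical values
# here (a, b, theta_V, lo, hi) are hypothetical, chosen for illustration.
a, b = 0.4, 0.4          # sizes of the up/down jumps (arbitrary units)
theta_V = 0.5            # depolarization threshold separating LTD from LTP

def F_step(v_post):
    """Step form: constant positive jump above theta_V, negative below."""
    return a if v_post > theta_V else -b

def F_soft(v_post, lo=0.2, hi=0.8):
    """Soft form: jumps only near the depolarization boundaries; in the
    intermediate region the jump is (almost) zero, so no permanent
    modification can be produced there."""
    if v_post >= hi:         # near the emission threshold: potentiate
        return a
    if v_post <= lo:         # very low depolarization: depress
        return -b
    return 0.0               # intermediate region: no jump
```

With the soft form, intermediate depolarizations leave the synapse untouched, which is what restricts plasticity to the near-coincident pre/post spike pairs described above.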
Appendix: Depolarization and Firing Rate
The probability Q_b (= 1 − Q_a) that appears in the extended Takács equations of section 3 is the probability that the depolarization of the postsynaptic neuron is below the threshold θ_V. It is indirectly related to the firing rate of the postsynaptic neuron, and the relation depends on the neural dynamics and the structure of the input current.
A.1 Neural Dynamics. We chose a linear IF neuron, which integrates an afferent current and has a reflecting barrier at 0 (Fusi & Mattia, 1999). When the depolarization V crosses a threshold θ, a spike is emitted, and the depolarization is reset to 0 and forced to stay there during an absolute refractory period τ_arp. A constant negative current μ_n produces a linear decay in the absence of the afferent current I(t). In the interval between the reflecting barrier and the emission threshold θ, the dynamics of V is

dV/dt = I(t) − μ_n.

Given the probability distribution function p_ν(V) of the depolarization V, the probability Q_b is obtained as

Q_b = ∫_{−∞}^{θ_V} p_ν(v) dv.
For neurons with a reflecting barrier at 0, the lower bound in the above integral is 0. The probability distribution p_ν(V) can be computed analytically in terms of the mean μ and the standard deviation σ, per unit time, of the afferent current I. It is given by

p_ν(V) = (ν(μ, σ)/μ) [1 − exp(−2(μ/σ²)(θ − V))],

where ν(μ, σ) is the emission rate of the neuron for these μ and σ, given below. Note that the linear decay current μ_n has been absorbed in μ. The emission rate of the neuron can also be expressed in terms of μ and σ:

ν(μ, σ) = [ τ_arp + (σ²/2μ²)(2μθ/σ² − 1 + exp(−2μθ/σ²)) ]^{−1}.    (A.1)
Thus, given the neural parameters (τ_arp = 2 ms), if μ and σ² are linear functions of the stimulating spike rate, we have a unique relation between the rate of the postsynaptic neuron and the distribution of the depolarization.
A.2 Afferent Current. The current is produced by a large number of independently spiking excitatory and inhibitory neurons: c_E = 240 afferents
from excitatory neurons and c_I = 60 afferents from inhibitory neurons. The rates of the inhibitory neurons and of 120 of the excitatory ones are fixed; they help take care of the level of spontaneous activity. c_E^ext = 120 excitatory afferents carry stimulation activity. The mean synaptic efficacies are chosen so that the neurons have spontaneous activity ν_E0 = 4 Hz when all 240 afferent excitatory neurons have that same rate and the 60 inhibitory neurons emit at ν_I0 = 8 Hz: J_E→E = 0.05θ (equal for all excitatory synapses), J_I→E = 0.33θ. The afferent current to each neuron is well approximated by a gaussian, and its mean and variance per unit time can be estimated by using mean-field theory (as in Amit & Brunel, 1997):

μ = −110.4 − μ_n + 6ν_ext,    (A.2)
σ² = 68.34 + 0.37ν_ext,    (A.3)
where the coefficients in the expression for σ² include a variability in the connectivity of 25%. The linear decay current μ_n is 27θ Hz.

Acknowledgments
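The appendix relations can be chained numerically: from the stimulation rate ν_ext, equations A.2 and A.3 give μ and σ², equation A.1 gives the rate ν, and integrating p_ν(V) gives Q_b. A minimal sketch follows; depolarization is measured in units of θ, the value θ_V = 0.8 is a hypothetical choice (the actual threshold is set elsewhere in the paper), and the sign conventions in A.1 are our reconstruction of the scanned formula:

```python
import math

# Numerical sketch of appendix relations (A.1)-(A.3). Depolarization is in
# units of the emission threshold (theta = 1); theta_V = 0.8 is hypothetical.
theta = 1.0
theta_V = 0.8            # hypothetical LTP/LTD threshold (units of theta)
tau_arp = 2e-3           # absolute refractory period, 2 ms
mu_n = 27.0              # linear decay current, 27*theta per second

def mu(nu_ext):          # equation A.2: mean afferent current per unit time
    return -110.4 - mu_n + 6.0 * nu_ext

def sigma2(nu_ext):      # equation A.3: variance per unit time
    return 68.34 + 0.37 * nu_ext

def rate(m, s2):         # equation A.1: emission rate of the linear IF neuron
    x = 2.0 * m * theta / s2
    return 1.0 / (tau_arp + (s2 / (2.0 * m * m)) * (x - 1.0 + math.exp(-x)))

def p_nu(v, m, s2, nu):  # depolarization density (reflecting barrier at 0)
    return (nu / m) * (1.0 - math.exp(-2.0 * (m / s2) * (theta - v)))

def Qb(nu_ext, n=10000):
    """Trapezoidal integration of p_nu over [0, theta_V]."""
    m, s2 = mu(nu_ext), sigma2(nu_ext)
    nu = rate(m, s2)
    h = theta_V / n
    total = 0.5 * (p_nu(0.0, m, s2, nu) + p_nu(theta_V, m, s2, nu))
    for i in range(1, n):
        total += p_nu(i * h, m, s2, nu)
    return total * h

print(rate(mu(50.0), sigma2(50.0)), Qb(4.0))
```

As a consistency check on the reconstruction, the density integrated over the full interval [0, θ] comes out close to 1 − ν·τ_arp, the fraction of time the neuron spends outside the refractory state, and Q_b decreases as the stimulation rate (and hence the mean depolarization) grows.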
We thank V. Dante for the first version of the DAC board for the static regulations; M. Mattia for the parameters of the linear IF neuron network; and P. Del Giudice for many useful discussions. This work has been supported in part by a grant from the Israel Fund for Basic Research.

References

Abbott, L. F., & Song, S. (1998). Temporally asymmetric Hebbian learning, spike timing and neuronal response variability. In M. S. Kearns, S. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Amit, D. J. (1989). Modeling brain function. Cambridge: Cambridge University Press.
Amit, D. J., & Brunel, N. (1997). Model of global spontaneous activity and local structured activity during delay periods in the cerebral cortex. Cerebral Cortex, 7, 237.
Amit, D. J., Del Giudice, P., & Fusi, S. (1999). Apprendimento dinamico della memoria di lavoro: una realizzazione in elettronica analogica [Dynamic learning of working memory: An analog electronic realization] (in Italian). In Frontiere della vita 3, 599. Rome: Istituto della Enciclopedia Italiana.
Amit, D. J., & Fusi, S. (1994). Dynamic learning in neural networks with material synapses. Neural Computation, 6, 957.
Amit, D. J., Gutfreund, H., & Sompolinsky, H. (1987). Statistical mechanics of neural networks near saturation. Annals of Physics, 173, 30.
Badoni, D., Bertazzoni, S., Buglioni, S., Salina, G., Amit, D. J., & Fusi, S. (1995). Electronic implementation of a stochastic learning attractor neural network. Network: Computation in Neural Systems, 6, 125.
Badoni, D., Salamon, A., Salina, A., & Fusi, S. (2000). aVLSI implementation of a network with spiking neurons and stochastic learning. In preparation.
Bi, G.-q., & Poo, M.-m. (1999). Activity-induced synaptic modifications in hippocampal culture: Dependence on spike timing, synaptic strength and cell type. J. Neurosci., 18, 10464–10472.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2, 32.
Bliss, T. V. P., & Collingridge, G. L. (1993). A synaptic model of memory: Long-term potentiation in the hippocampus. Nature, 361, 31.
Brunel, N., Carusi, F., & Fusi, S. (1998). Slow stochastic Hebbian learning of classes of stimuli in a recurrent neural network. Network: Computation in Neural Systems, 9, 123.
Cox, D. R., & Miller, H. D. (1965). The theory of stochastic processes. London: Methuen.
Del Giudice, P., Fusi, S., Badoni, D., Dante, V., & Amit, D. J. (1998). Learning attractors in an asynchronous, stochastic electronic neural network. Network: Computation in Neural Systems, 9, 183.
Diorio, C., Hasler, P., Minch, B. A., & Mead, C. (1998). Floating-gate MOS synapse transistors. In T. S. Lande (Ed.), Neuromorphic systems engineering: Neural networks in silicon. Norwell, MA: Kluwer.
Elias, J., Northmore, D. P. M., & Westerman, W. (1997). An analog memory circuit for spiking silicon neurons. Neural Computation, 9, 419.
Fusi, S., Del Giudice, P., & Amit, D. J. (in press). Neurophysiology of a VLSI spiking neural network. In Proceedings of the IJCNN 2000. Available online at: http://jupiter.roma1.infn.it.
Fusi, S., & Mattia, M. (1999). Collective behavior of networks with linear (VLSI) integrate-and-fire neurons. Neural Computation, 11, 633.
Gerstein, G. L., & Mandelbrot, B. (1964). Random walk models for the spike activity of a single neuron. Biophysical Journal, 4, 41.
Grossberg, S. (1969). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Stat. Phys., 1, 319.
Häfliger, P., Mahowald, M., & Watts, L. (1996). A spike based learning neuron in analog VLSI. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (p. 692). Cambridge, MA: MIT Press.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554.
Markram, H., Lübke, J., Frotscher, M., & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275, 213.
Mattia, M., & Del Giudice, P. (in press). Asynchronous simulation of large networks of spiking neurons and dynamical synapses. Neural Computation.
Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.
Meyer, P. L. (1965). Introductory probability and statistical applications. Reading, MA: Addison-Wesley.
Miyashita, Y. (1993). Inferior temporal cortex: Where visual perception meets memory. Ann. Rev. Neurosci., 16, 245.
Nadal, J. P., Toulouse, G., Changeux, J. P., & Dehaene, S. (1986). Networks of formal neurons and memory palimpsests. Europhys. Lett., 1, 535.
Parisi, G. (1986). A memory which forgets. J. Phys. A: Math. Gen., 19, L617.
Petersen, C. C. H., Malenka, R. C., Nicoll, R. A., & Hopfield, J. J. (1998). All-or-none potentiation at CA3-CA1 synapses. Proc. Natl. Acad. Sci., 95, 4732.
Ricciardi, L. M. (1977). Diffusion processes and related topics in biology. Berlin: Springer-Verlag.
Sarpeshkar, R. (1998). Analog versus digital: Extrapolating from electronics to neurobiology. Neural Computation, 10, 1601.
Sejnowski, T. J. (1976). Storing covariance with nonlinearly interacting neurons. J. Math. Biology, 4, 303.
Senn, W., Tsodyks, M., & Markram, H. (in press). An algorithm for modifying neurotransmitter release probability based on pre- and postsynaptic spike timing. Neural Computation.
Softky, W. R., & Koch, C. (1992). Cortical cells should spike regularly but do not. Neural Computation, 4, 643.
Tsodyks, M. (1990). Associative memory in neural networks with binary synapses. Modern Physics Letters, B4, 713.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 2). Cambridge: Cambridge University Press.
van Vreeswijk, C. A., & Sompolinsky, H. (1997). Chaos in neural networks with balanced excitatory and inhibitory activity. Science, 274, 1724.
Willshaw, D. (1969). Non-holographic associative memory. Nature, 222, 960.
Zhang, L. I., Tao, H. W., Holt, C. E., Harris, W. A., & Poo, M. (1998). A critical window for cooperation and competition among developing retinotectal synapses. Nature, 395, 37.

Received April 28, 1999; accepted November 8, 1999.
NOTE
Communicated by Timothy Horiuchi
Modeling Alternation to Synchrony with Inhibitory Coupling: A Neuromorphic VLSI Approach Gennady S. Cymbalyuk Institute of Mathematical Problems in Biology, Russian Academy of Sciences, Pushchino, Moscow region, Russia 142292, and Department of Biology, Emory University, Atlanta, Georgia 30322, U.S.A. Girish N. Patel School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0250, U.S.A. Ronald L. Calabrese Department of Biology, Emory University, Atlanta, Georgia 30322, U.S.A. Stephen P. DeWeerth School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332-0250, U.S.A. Avis H. Cohen Department of Biology, University of Maryland, College Park, Maryland 20742, U.S.A. We developed an analog very large-scale integrated system of two mutually inhibitory silicon neurons that display several different stable oscillations. For example, oscillations can be synchronous with weak inhibitory coupling and alternating with relatively strong inhibitory coupling. All oscillations observed experimentally were predicted by bifurcation analysis of a corresponding mathematical model. The synchronous oscillations do not require special synaptic properties and are apparently robust enough to survive the variability and constraints inherent in this physical system. In biological experiments with oscillatory neuronal networks, blockade of inhibitory synaptic coupling can sometimes lead to synchronous oscillations. An example of this phenomenon is the transition from alternating to synchronous bursting in the swimming central pattern generator of lamprey when synaptic inhibition is blocked by strychnine. Our results suggest a simple explanation for the observed oscillatory transitions in the lamprey central pattern generator network: that inhibitory connectivity alone is sufficient to produce the observed transition.
Neural Computation 12, 2259–2278 (2000) © 2000 Massachusetts Institute of Technology
2260
Cymbalyuk, Patel, Calabrese, DeWeerth, & Cohen
1 Introduction
Experimental studies, especially in the realm of central pattern generators (CPGs), have emphasized the importance of reciprocal inhibitory connections in producing alternating oscillations, whereas excitatory connections or electrical connections are thought to mediate synchronous oscillations. The potential synchronizing role of weak inhibitory connections is often disregarded. However, when the strength of inhibitory connections is reduced by experimental treatments in some preparations, a transition from alternation to synchronization has been observed. A particularly relevant example of this type of transition is provided by experiments with the lamprey spinal cord (Cohen & Harris-Warrick, 1984; Alford, Sigvardt, & Williams, 1990). The segmental CPGs of the lamprey swim system are composed of two reciprocally inhibitory units that represent oscillatory neural networks located on contralateral sides of the spinal cord. The oscillations in the units are based on intrinsic membrane and circuit properties. Under normal conditions, the two units oscillate in antiphase. However, in the presence of strychnine, which blocks glycinergic inhibitory coupling, they oscillate synchronously (Cohen & Harris-Warrick, 1984; Alford et al., 1990). Theoretical studies of oscillatory neural networks have shown that inhibitory connections can lead to synchronization (Lytton & Sejnowski, 1991; Wang & Rinzel, 1993; Destexhe, Contreras, Sejnowski, & Steriade, 1994; Golomb, Wang, & Rinzel, 1994; Borisyuk, Borisyuk, Khibnik, & Roose, 1995; Bush & Sejnowski, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Chow, White, Ritt, & Kopell, 1998; Rubin & Terman, 2000). In particular, mutually inhibitory connections in a two-cell system can produce stable synchronous oscillations in some regimes of parameter space.
In the limit of weak coupling, asymptotic methods are useful in finding phase-locked periodic solutions and determining their stability (Malkin, 1956; Blechman, 1981; Kuramoto, 1984; Ermentrout & Kopell, 1991; Cymbalyuk, Nikolaev, & Borisyuk, 1994; Hoppensteadt & Izhikevich, 1997). It has been emphasized that in order to synchronize oscillations, some parameters, such as the rate of rise or decay of synaptic coupling, should be sufficiently slow with respect to the spike waveform (Wang & Rinzel, 1992; van Vreeswijk, Abbott, & Ermentrout, 1994; Gerstner, van Hemmen, & Cowan, 1996; Terman, Kopell, & Bose, 1998). Also, insight about the synchronization properties of a system of oscillatory neurons can be gained by characterizing the intrinsic membrane properties of the component neurons by establishing their phase-response curve type (Hansel, Mato, & Meunier, 1995; Ermentrout, 1996; Hoppensteadt & Izhikevich, 1997; Rinzel & Ermentrout, 1998; Izhikevich, 1999). However, some behaviors observed in mathematical models may not be observed in real systems due to constraints imposed by the physical world. For example, a parameter regime corresponding to synchronous oscillations in a given model may be so narrow that noise in the system
or mismatch among the oscillators prevents a reliable observation of this particular behavior. For this article, we conducted experiments with silicon neurons to test the robustness of different oscillatory regimes observed in a corresponding mathematical model. We have implemented neural oscillators using neuromorphic very large-scale integrated (VLSI) circuits that are fabricated in standard silicon microelectronic processes (Mead, 1989; Patel, Cymbalyuk, Calabrese, & DeWeerth, 1999). In addition, we have derived a mathematical model describing the silicon neuron's dynamic properties (Patel et al., 1999). Using a system of two mutually inhibitory connected silicon neurons, we demonstrate stable synchronous oscillations as well as other behaviors predicted by the mathematical model. Thus, the dynamic behaviors revealed in the experiments are sufficiently robust to persist under the physical constraints inherent in silicon neuron technology. These experimental observations help to build intuitions as to how such behaviors may be observed in experiments with real neurons. We show that a physical system of two silicon neurons can display stable synchronous oscillations with weak inhibitory coupling, thus realizing the predictions developed by exploring a corresponding mathematical model. These synchronous oscillations do not require synaptic connections possessing significant delay or slow rise and decay properties, and they are sufficiently robust to survive the physical constraints. We suggest a simple explanation for the observed oscillatory transitions in the lamprey CPG network: that inhibitory connectivity alone is sufficient to produce the observed transition. 2 Silicon Neuron
We have implemented a silicon neuron (Patel & DeWeerth, 1997; Patel et al., 1999) inspired by a mathematical model of an excitable cell (Morris & Lecar, 1981). The mathematical model describes the dynamics of a neuron possessing an inward Ca²⁺ current with instantaneous activation dynamics, an outward K⁺ current with slow activation dynamics, and a leak current. This model appears to be at a reasonable level of reduction of the neuronal dynamics, reconciling relative simplicity of analysis with biophysical plausibility (Rinzel & Ermentrout, 1998; Skinner, Turrigiano, & Marder, 1993). The design of our silicon neuron is based on subthreshold operation of MOS (metal-oxide-semiconductor) transistors and the current-mode techniques described by Mahowald and Douglas (1991). The basic building block is a voltage-controlled conductor implemented with a single MOS transistor. A pair of complementary MOS transistors is shown in Figures 1A and 1B. The I-V characteristic of the voltage-controlled conductor, implemented with an n-type MOS transistor (see Figure 1A), is described by

I_D = I_0 exp((κV_G − V_S)/U_T) (1 − exp(−V_DS/U_T)),    (2.1)
Figure 1: Building blocks and circuit schematic of the silicon neuron. (A) n-type and (B) p-type MOS transistors are used as conductors. The bulk node of a transistor, which is usually not displayed, is the fourth terminal of MOS transistors. The bulk nodes of nMOS transistors are typically connected to the lowest potential in the circuit (Gnd), and the bulk nodes of pMOS transistors are connected to the highest potential in the circuit (V_dd). (C) Circuit schematic of the silicon neuron. Transistors M_1, M_2, and M_3 emulate voltage-gated calcium, potassium, and synaptic currents, respectively.
where I_D is the drain-to-source current, I_0 and κ are physical constants that depend on the fabrication process, U_T is the thermal voltage (kT/q), V_G and V_S are the gate and source voltages with respect to the bulk node, and V_DS is the drain-to-source voltage (Mead, 1989). The drain-to-source current of the complementary pMOS transistor (see Figure 1B) is also described by equation 2.1; however, the signs of all voltages are reversed. For both nMOS and pMOS transistors, the gate current is zero. For V_DS ≫ U_T, the transistor operates in its saturation region, where the drain-to-source current is given by

I_SAT ≈ I_0 exp((κV_G − V_S)/U_T).    (2.2)

For V_DS ≪ U_T and a fixed value of V_GS (gate-to-source voltage), the transistor operates in its ohmic region, where the I-V characteristic is approximately linear:

I_ohmic ≈ I_SAT V_DS / U_T.    (2.3)
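The two limits can be checked against the full expression directly. A minimal sketch follows; the pre-exponential constant I_0 = 1e-15 A and the bias voltages are hypothetical illustration values, while κ and U_T follow the text:

```python
import math

# Sketch of the subthreshold MOS I-V relation, equation 2.1, and its two
# limits (equations 2.2 and 2.3). I_0 and the bias voltages below are
# hypothetical values chosen only for illustration.
I0 = 1e-15               # pre-exponential constant (A), hypothetical
kappa = 0.65             # slope factor (measured value quoted in section 3)
UT = 0.025               # thermal voltage kT/q at room temperature (V)

def ID(VG, VS, VDS):
    """Full subthreshold drain current, equation 2.1 (nMOS)."""
    return I0 * math.exp((kappa * VG - VS) / UT) * (1.0 - math.exp(-VDS / UT))

def I_sat(VG, VS):
    """Saturation limit, equation 2.2 (valid for VDS >> UT)."""
    return I0 * math.exp((kappa * VG - VS) / UT)

def I_ohmic(VG, VS, VDS):
    """Ohmic limit, equation 2.3 (valid for VDS << UT)."""
    return I_sat(VG, VS) * VDS / UT

VG, VS = 0.7, 0.0
# Deep in saturation (VDS = 12*UT) the full expression matches eq. 2.2:
print(ID(VG, VS, 0.3) / I_sat(VG, VS))             # close to 1
# Deep in the ohmic region (VDS = 0.04*UT) it matches eq. 2.3:
print(ID(VG, VS, 0.001) / I_ohmic(VG, VS, 0.001))  # close to 1
```

The crossover between the two regions happens over a few U_T of drain-to-source voltage, which is why the differential-pair and OTA currents used below are smooth sigmoidal functions of their input voltages.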
The schematic diagram of the silicon neuron is shown in Figure 1C. The voltage across the capacitor C_1 is the state variable representing the membrane potential, V, and the voltage across the capacitor C_2 is the slow state variable corresponding to activation of the K⁺ current, W. Current conservation at node V yields

C_1 dV/dt = a_P i_H − a_N i_L + a_P I_ext,    (2.4)

where a_P i_H and a_N i_L correspond to the Ca²⁺ and K⁺ currents, respectively, and I_ext is an externally applied current. Currents i_H and i_L are the output currents of the differential pair circuits enclosed in the dashed boxes B_1 and B_2, respectively. a_P and a_N describe the ohmic effects of transistors M_1 and M_2, respectively. In our implementation, there is no membrane leak current. Current conservation at node W yields

C_2 dW/dt = b_P b_N i_x,    (2.5)

where i_x is the output current of the operational transconductance amplifier (OTA) and b_P and b_N represent the ohmic effects of pull-up and pull-down transistors inside the OTA. The output currents of a differential pair circuit and an OTA circuit, derived by using equation 2.2, are described by a Fermi function and a hyperbolic tangent function, respectively (Mead, 1989). Substituting these functions into equations 2.4 and 2.5 yields

C_1 dV/dt = f(V, W) = I_ext a_P + I_BH [exp(κ(V − V_H)/U_T) / (1 + exp(κ(V − V_H)/U_T))] a_P
            − I_BL [exp(κ(W − V_L)/U_T) / (1 + exp(κ(W − V_L)/U_T))] a_N,    (2.6)

C_2 dW/dt = I_τ tanh(κ(V − W)/(2U_T)) b_P b_N,    (2.7)
where

a_P = 1 − exp((V − V_High)/U_T),    (2.8)
a_N = 1 − exp((V_Low − V)/U_T),    (2.9)
b_P = 1 − exp((W − V_dd)/U_T),    (2.10)
b_N = 1 − exp(−W/U_T).    (2.11)
In the above formulation, the bulk nodes of nMOS and pMOS transistors are connected to V_Low and V_High, respectively. As shown in Figure 1C, an inhibitory synapse is implemented with a single nMOS transistor, M_3, whose gate voltage is determined by the drain-to-source current of M_8. The synaptic current i_Syn responds to changes in the presynaptic potential instantaneously. For a neuron with a single inhibitory synapse, equation 2.6 is modified as follows:

C_1 dV/dt = f(V, W) − i_Syn,    (2.12)

where f(V, W) is given in equation 2.6 and

i_Syn = a_N I_BSyn [exp(κ(V_pre − V_thresh)/U_T) / (1 + exp(κ(V_pre − V_thresh)/U_T))].    (2.13)
In equation 2.13, V_pre corresponds to the membrane potential of the presynaptic neuron, a_N is given in equation 2.9, and I_BSyn represents the maximal synaptic conductance.

3 Experimental Characterization of the Model
We focused our experiments on the quantification of observed behaviors in a two-cell silicon neuron network connected by mutually inhibitory synapses, along with the characterization of the corresponding mathematical model. The observed behaviors were quantified by sampling the voltage waveforms (approximately 20 periods' worth) and postprocessing the data. The oscillatory behaviors were studied while I_BSyn was varied from approximately 5 pA to 1 μA. The parameters of the silicon neuron were set so that each neuron, in the absence of synaptic coupling, behaved as an oscillator. Most of the parameters in the mathematical model of the silicon neuron were measured experimentally. Three others were inaccessible for measurements. These were estimated based on the circuit design and adjusted to fit
the experimental data. Directly measured parameters were V_High, V_Low, V_H, and V_dd. Current sources I_BH and I_BL were measured in voltage-clamp experiments via the membrane potential node. The value κ = 0.65 was determined by measuring the slope of the steady-state activation curve of the inward current. At room temperature, U_T is approximately 0.025 volts. The values of C_1 and C_2 were determined by matching the period of oscillations observed in the silicon model with that observed in the mathematical model. Because W is an inaccessible node, we estimated the value of I_τ by assuming that the current sources I_BH and I_τ match to within an order of magnitude. The exact value of I_τ was chosen to match the bifurcation diagram resulting from the experimental data to that computed from the mathematical model.¹ Through a series of voltage-clamp experiments, we characterized the synaptic coupling in our model neuron. By turning off all current sources except I_BSyn and clamping V to a constant voltage, the current at node V was measured. Figure 2A shows the measured steady-state synaptic currents when V is clamped at V = 5.0 volts and V_pre is swept from 0 to 5 volts. Figure 2B shows the measured steady-state synaptic currents when V_pre = 5.0 volts and V is swept from 0 to 5 volts. The data shown in Figure 2A validate the sigmoid characteristic described in equation 2.13, and the data shown in Figure 2B correspond to the function describing the factor a_N (see equation 2.9). It is interesting that when the system of two silicon neurons had well-matched frequencies and the strength of coupling was reduced to a value less than the smallest measurable I_BSyn (I_BSyn < 3 pA), the neurons oscillated synchronously. This evidence supports the prediction derived from our mathematical model that this system demonstrates synchronous oscillations even as the strength of the synaptic connection tends to zero.
Nevertheless, it does not exclude the possibility that a parasitic electrotonic component of synaptic coupling might be responsible for this synchronizing effect. To investigate whether a parasitic electrotonic component of synaptic coupling exists, the experiments described for Figures 2A and 2B were performed with all current sources (including I_BSyn) turned off; the results are shown in Figures 2C and 2D, respectively. Figure 2D reveals leak currents that increase with an increase in the clamp voltage of the postsynaptic cell. These leak currents are due to a parasitic (reverse-biased) diode present on the drain terminal of all MOS transistors. In an experiment corresponding to Figure 2A, we set the clamp voltage to V = 0.3 volts to minimize the effects of the leak currents described above. The result, shown
¹ The adjusted parameters used in the mathematical model are I_BL = 48.0 nA, I_BH = 6.437 nA, I_τ = 2.81 nA, I_ext = 15.0 nA, V_High = V_dd = 5.0 volts, V_Low = 0.0 volts, V_H = V_L = V_thresh = 2.0 volts, and C_1 = C_2 = 35 pF.
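As a sanity check, the single-neuron equations 2.6-2.11 can be integrated directly with these fitted parameter values. A forward-Euler sketch follows (the step size and initial conditions are our own choices, not the authors' code); with these parameters the uncoupled neuron behaves as an oscillator, as stated in the text:

```python
import math

# Forward-Euler integration of the single silicon-neuron model,
# equations 2.6-2.11, using the fitted parameter values from the footnote.
# Step size and initial conditions are our own (hypothetical) choices.
UT, kappa = 0.025, 0.65                  # thermal voltage (V), slope factor
C1 = C2 = 35e-12                         # capacitances (F)
IBH, IBL = 6.437e-9, 48.0e-9             # bias currents (A)
Itau, Iext = 2.81e-9, 15.0e-9            # OTA bias and external current (A)
VH = VL = 2.0                            # half-activation voltages (V)
VHigh = Vdd = 5.0
VLow = 0.0

def safexp(x):
    return math.exp(min(x, 30.0))        # guard against overflow

def fermi(x):
    return 1.0 / (1.0 + safexp(-x)) if x > 0 else safexp(x) / (1.0 + safexp(x))

def derivs(V, W):
    aP = 1.0 - safexp((V - VHigh) / UT)  # ohmic factors, equations 2.8-2.11
    aN = 1.0 - safexp((VLow - V) / UT)
    bP = 1.0 - safexp((W - Vdd) / UT)
    bN = 1.0 - safexp(-W / UT)
    dV = (Iext * aP
          + IBH * fermi(kappa * (V - VH) / UT) * aP
          - IBL * fermi(kappa * (W - VL) / UT) * aN) / C1            # eq. 2.6
    dW = Itau * math.tanh(kappa * (V - W) / (2 * UT)) * bP * bN / C2  # eq. 2.7
    return dV, dW

dt, T = 1e-6, 0.3                        # 1 us steps, 300 ms of model time
V, W = 0.5, 0.5
trace = []
for i in range(int(T / dt)):
    dV, dW = derivs(V, W)
    V += dt * dV
    W += dt * dW
    if i % 100 == 0:
        trace.append(V)

late = trace[len(trace) // 2:]           # discard the initial transient
print(max(late) - min(late))             # nonzero peak-to-peak => oscillation
```

The ohmic factors a_P and a_N confine V to the supply rails, so the membrane potential traces out a relaxation-type limit cycle; adding the synaptic term of equations 2.12 and 2.13 to a second copy of these equations gives the two-cell network studied below.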
Figure 2: Experimental characterization of the synaptic connection. (A) Synaptic current with the membrane potential clamped at V = 2.5 volts, the presynaptic potential swept from 0.0 to 5.0 volts, and the synaptic weight set at I_BSyn = 1.45 nA. (B) Synaptic current with the presynaptic potential clamped at V_pre = 5.0 volts, the membrane clamp voltage swept from 0.0 to 5.0 volts, and the synaptic weight set at I_BSyn = 1.45 nA. In (C) and (D), the synaptic currents are presented under the same conditions as in (A) and (B), respectively; however, I_BSyn has been turned off.
in Figure 2C, demonstrates that a very weak electrotonic coupling does exist (a resistance of approximately 100 TΩ). To investigate whether parasitic capacitive coupling exists, we measured the step response from V_pre to V with all current sources turned off. Although small, there is a measurable amount of capacitive coupling between V and V_pre: for a 5-volt step in V_pre, the measured change in V is 26 mV, which corresponds to a coupling capacitance of approximately 50 fF.
4 Observed Behaviors as I_BSyn Is Varied
The mathematical model presented above and the silicon neurons show a number of similar stable oscillatory behaviors. In the mathematical model, we call a limit cycle stable if it is asymptotically stable. We evaluated stability by calculating characteristic multipliers using the bifurcation analysis software LocBif (Khibnik, Kuznetsov, Levitin, & Nikolaev, 1993). In the silicon neuron experiments, our criterion for the stability of oscillations was that they persist in the presence of inherent noise and small voltage perturbations. As we varied the strength of the inhibitory connections, I_BSyn, we observed the following corresponding stable oscillatory behaviors in both the silicon neuron experiments and the mathematical model (experimental behavior–model behavior):

synchronous oscillations (see Figure 3B)–synchronous limit cycle, SC
phase-shifted oscillations (see Figures 3A, 4A, and 4B)–a pair of phase-shifted nonsymmetric limit cycles, NSC1/NSC2
alternating oscillations (see Figure 3C)–antiphase limit cycle, AC
drifting oscillations (see Figure 4C)–stable torus, T

Here we do not present certain behaviors that are stable in the mathematical model but have not been observed experimentally. It is possible that these behaviors were not observed because the basin of attraction with the necessary initial conditions is too narrow. The model demonstrates different asymmetric limit cycles, where, for example, one model neuron oscillates with higher amplitude and half the frequency of the other (see Figures 5A and 5C). As another example, one model neuron may demonstrate small-amplitude spindle-like oscillations, while the other oscillates with a waveform similar to that of an uncoupled oscillator (see Figures 5B and 5D). Due to the symmetry in the system, nonsymmetric oscillations always appear in pairs.
For instance, in the case of Figures 5A and 5C, depending on initial conditions, one or the other neuron oscillates with the larger amplitude and the smaller frequency. It is interesting to note that Jung, Kiemel, and Cohen (1996) observed a similar pair of limit cycles in a model of the lamprey segmental oscillator. Comparison of Figures 6 and 8 to Figures 7 and 9 shows good qualitative correspondence between the dynamic behaviors and the transitions occurring in the two systems as I_BSyn is varied. In the model, the synchronous limit cycle is stable for I_BSyn ∈ [0, 0.140] nA. It loses stability at point a (see Figure 7), at the critical value I_BSyn = 0.140 nA, and undergoes a Neimark-Sacker torus bifurcation. In the mathematical model, a stable torus is observed at I_BSyn smaller than the critical value at point a. Therefore, we infer that this is a subcritical torus bifurcation and that a stable torus appears, together with an unstable torus, at a fold torus bifurcation. Thus, for some values of I_BSyn, a stable torus coexists with the stable synchronous limit cycle and an unstable torus. At point a the unstable torus disappears, and the synchronous oscillations lose stability. Similarly, in experiments with silicon neurons, synchronous oscillations were observed for inhibitory connections ranging from zero up to moderate strengths of coupling, I_BSyn = 2.70 nA (see point a, Figure 6). Corresponding to the mathematical model, we observed drifting oscillations

Figure 3: Experimentally obtained oscillatory modes of the two silicon neurons connected by mutually inhibitory synapses. (A, D) Phase-shifted oscillations and (B, E) synchronous oscillations were both stable when the connections were of moderate strength (I_BSyn = 1.4 nA). By proper choice of initial conditions, one could select the oscillatory mode observed. The alternating oscillations (C, F) were stable when the connections were strong (I_BSyn = 1464.9 nA). Here and in Figure 4, solid and dashed lines differentiate the membrane potentials of the two silicon neurons, V_1 and V_2. (A–C) Data collected during 3 seconds.
Modeling Alternation to Synchrony
2269
Figure 4: Experimentally obtained oscillatory modes of the two silicon neurons connected via inhibitory connections of moderate strength, IBSyn D 3.2 nA. (A, B) Phase-shifted oscillations coexisted with (C) drifting oscillations.
coexisting with synchronous oscillations at IBSyn close to the critical value. An example of the drifting oscillations for stronger coupling is presented at Figure 4C along with the pair of phase-shifted oscillations (see Figures 4A and 4B). In the model for the moderate strength of synaptic coupling, the pair of stable phase-shifted limit cycles NSC1 / NSC2 is observed, so that depending on the initial conditions, one or the other oscillator leads. For the parameters provided, we observed phase differences w 2 [0.33, 0.67]. The NSC1 / NSC2 appear at the fold bifurcation for limit cycles together with a pair of unstable phase-shifted limit cycles, IBSyn D 0.0564 nA (see point b, Figure 7) and disappear at the pitchfork limit cycle bifurcation, IBSyn D 6.511 nA (see point c, Figure 7), where NSC1 and NSC2 merge and AC becomes stable. Similar
2270
Cymbalyuk, Patel, Calabrese, DeWeerth, & Cohen
Figure 5: Examples of stable, nonsymmetric limit cycles obtained from the mathematical model of the two symmetrically inhibitory coupled identical silicon neurons. These limit cycles were not observed in experiments. (A, C) One of the two oscillators demonstrates 2.5 times larger amplitude and twice as large period, representing 1:2 synchronization mode: V1 D 3.52888 volts, W1 D 1.87397 volts, V2 D 1.72983 volts, W2 D 1.79920 volts, IBSyn D 1.9947 nA. (B, D) One of the two oscillators shows high-amplitude oscillations, and the other shows small-amplitude “spindle” oscillations: V1 D 3.15062 volts, W1 D 1.79645 volts, V2 D 1.75655 volts, W2 D 1.89334 volts, IBSyn D 11.9122 nA.
behaviors were observed experimentally. For sufciently strong synaptic connections, we observed only alternating oscillations. As we made the strength of the connections weaker, at IBSyn ¼ 6.34 nA (see point c in Figure 6), two phase-shifted oscillations branched from the alternating oscillations, which became unstable for the weaker coupling. As IBSyn decreased to IBSyn ¼ 1.51 nA, the phase difference between oscillations went from w D 0.50 (for alternating oscillation) to w ¼ 0.22 (w ¼ 0.76 for the other branch). The model and experimental data have qualitatively similar dependencies of the amplitude (compare Figure 8A and Figure 9A) and the period (compare Figure 8B and Figure 9B) on the strength of coupling. Synchronous oscillations in both cases have amplitude and period close to the corresponding values in the uncoupled system. Phase-shifted oscillations lay on the transition from synchronous oscillations to alternating oscillations. Our silicon neurons also demonstrated that stronger inhibitory synaptic con-
Modeling Alternation to Synchrony
2271
Figure 6: Experimentally obtained bifurcation diagram for the two mutually inhibitory silicon neurons. The phase difference, w , of different phase-locked oscillations is plotted against strength of the synaptic connections, IBSyn . Synchronous oscillations are stable for weak coupling and give rise to stable drifting oscillations at the moderate strength of coupling a (IBSyn D 2.70 nA). A pair of phase-shifted oscillations appeas at point b (IBSyn ¼ 1.40 ¥ 1.62 nA), is stable for moderate strengths of coupling, and gives stability to alternating oscillations, merging at point c (IBSyn ¼ 5.62 ¥ 7.06 nA). Alternating oscillations remain stable for the strong inhibitory connections. Here and in Figure 7, circles, triangles, and X’s represent synchronous oscillations, phase-shifted oscillations, and alternating oscillations, respectively.
nections produced slower oscillations (see Figures 8B and 9B), reproducing qualitatively the frequency dependence observed in strychnine experiments in the lamprey (Cohen & Harris-Warrick, 1984; Grillner & Wallen, 1980). Under the condition of identical intrinsic periods of our silicon neurons, we observed synchronous oscillations experimentally in the absence
2272
Cymbalyuk, Patel, Calabrese, DeWeerth, & Cohen
Figure 7: The bifurcation diagram of phase-locked oscillations obtained for the mathematical model of the two symmetrically inhibitory coupled identical oscillators as strength of connection, IBSyn , is varied. The synchronous limit cycle, stable with weak coupling, became unstable at a subcritical Neimark-Sacker torus bifurcation at moderate coupling strength at point a (IBSyn D 0.140 nA). The pair of stable nonsymmetrical phase-shifted limit cycles NSC1 / NSC2 appears along with the pair of unstable nonsymmetrical phase-shifted limit cycles UNSC 1 / UNSC 2 at fold bifurcation for limit cycles at point b, IBSyn D 0.0564 nA. Antiphase limit cycle is stable for the strong inhibitory connection and loses stability on the pitchfork limit cycle bifurcation, giving stability to the pair of phase-shifted limit cycles NSC1 / NSC2 at point c (IBSyn D 6.511 nA).
of synaptic coupling. To eliminate the possibility that synchronization appears primarily due to parasitic electrotonic coupling, we investigated the system with approximately 2% disparity in the intrinsic oscillation periods such that phase locking was disrupted and drifting oscillations ensued. By adding weak inhibitory coupling, we were then able to return the circuit to
Modeling Alternation to Synchrony
2273
Figure 8: Experimentally observed amplitude and period of phase-locked oscillations of the two mutually inhibitory silicon neurons. The (A) amplitude, A, and the (B) period, T, of the oscillations are shown against strength of synaptic connection, IBSyn .
stable synchronous oscillations. Thus, inhibitory coupling can synchronize mismatched oscillators that parasitic coupling, even if it exists, is too weak to synchronize. Note that our conclusion—synchronous oscillations are maintained by weak inhibitory coupling—is also supported by the mathematical model, which exhibited synchronous oscillations even for negligibly small coupling. 5 Discussion
Figure 9: Amplitude and period of phase-locked oscillations obtained for the mathematical model of the two symmetrically inhibitory coupled identical oscillators as the strength of connection, IBSyn, is varied. Circles, triangles, and X's represent the synchronous limit cycle, phase-shifted limit cycles, and the antiphase limit cycle, respectively.

We have described a pair of coupled silicon neurons implemented using VLSI technology and compared experimental results from these neurons to simulation data from a corresponding mathematical model. In particular, we have shown that the silicon circuits can replicate many of the dynamical regimes seen in the mathematical model. Thus, silicon neurons can be used to demonstrate that these mathematically predicted behaviors can exist stably in real-world systems. The described results point to a role for inhibitory connections in the generation of rhythmic behavior in biological CPGs. In the spinal CPG network of the lamprey, for example, segmental oscillators normally produce alternating oscillations. However, when bathed in strychnine, synchronous oscillations replace the alternating oscillations. It is known that
strong contralateral inhibitory (glycinergic) connections, which can be blocked by strychnine, exist in the CPG network. It has been proposed that the transition from alternating to synchronous oscillations can be accounted for by weak contralateral excitatory connections (Cohen & Harris-Warrick, 1984; Alford et al., 1989). The strychnine-dependent transition may be due to changes in the relative strength of excitatory and inhibitory connections; that is, the left and right oscillators are connected by relatively strong reciprocal inhibitory connections in parallel with comparatively weaker reciprocal excitatory coupling (Cohen & Harris-Warrick, 1984; Alford et al., 1989). While it is known that a few neurons with crossed excitatory connections do exist, the results shown here demonstrate that there is an alternative explanation for the phase transition. We suggest that the simplest explanation is that inhibitory connections alone can produce the observed transition. In addition, electrotonic coupling between the left and right oscillatory neurons may exist and produce synchronous oscillations when the inhibitory connections are weak or disrupted by strychnine. Moreover, it has been shown in a mathematical model that electrotonic coupling alone can, under some conditions, reproduce the transition from synchronous to alternating oscillations in a circuit of two coupled Hindmarsh-Rose neurons (Cymbalyuk et al., 1994). There are three basic hypotheses that can account for the influence of strychnine on the dynamical behavior of the spinal CPG:

1. The inhibitory connections produce stable alternating oscillations when they are strong and synchronous oscillations when they are sufficiently weak (in particular, when they are under the influence of strychnine).

2. The inhibitory connections are accompanied by electrical connections, which can produce synchronous oscillations when the inhibitory connections are weak or blocked.

3. If inhibitory connections are strong and excitatory connections are weak, then as the inhibitory connections are blocked, the excitatory connections become relatively stronger, facilitating stable synchronous oscillations.

These hypotheses are complementary in the sense that all may be valid simultaneously. The first two hypotheses have not been proposed previously. They appear, however, to be the most biologically plausible. It is also interesting to note that the quasi-periodic behavior observed in our silicon neuron experiments and mathematical model might conform to experiments with the lamprey. Lesher et al. (1998) have shown that lamprey spinal central pattern generators might employ a chaos control mechanism based on a skeleton of unstable oscillations. According to this mechanism, the CPGs exhibit unstable oscillations rather than stable oscillations. It is synaptic connectivity or control signals that appear to provide stability to
the desired regime. In future work, systems of silicon neurons could give some insight into the potential for this control mechanism. We are developing a variety of silicon neurons and associated systems of neurons that implement biologically inspired motor behaviors such as axial locomotion (DeWeerth, Patel, Simoni, Schimmel, & Calabrese, 1997; Patel, Holleman, & DeWeerth, 1998). These efforts are directed at the creation of dynamical sensorimotor systems that exhibit tightly coupled sensory feedback and motor learning, providing platforms for both testing biological hypotheses and addressing engineering applications. A major challenge in this development lies in the derivation of mathematical models that facilitate qualitative and quantitative understanding of the complex behaviors of these systems. The modeling of individual silicon neurons and circuits, as exemplified by this article, provides the foundation for this effort.

Acknowledgments
We thank Roman Borisyuk and Tim Kiemel for their helpful comments. We are grateful to Ariel Hart, Mark Masino, and Anne-Elise Tobin for careful reading of the manuscript for this article. This work is a result of a collaboration established in the 1997 Workshop on Neuromorphic Engineering. We thank the National Science Foundation and the Gatsby Foundation for sponsoring this workshop. S. DeWeerth and G. Patel are funded by NSF grant IBN-9511721, A. Cohen is funded by NIH grant MH 44809, and G. S. Cymbalyuk is supported by Russian Foundation of Fundamental Research grant 99-04-49112. R. L. Calabrese and G. S. Cymbalyuk are supported by NIH grants NS24072 and NS34975.

References

Alford, S., Sigvardt, K. A., & Williams, T. L. (1990). GABAergic control of rhythmic activity in the presence of strychnine in the lamprey spinal cord. Brain Research, 506, 303–306.

Blechman, I. I. (1981). Synchronization in nature and engineering. Moscow: Nauka. (In Russian).

Borisyuk, G. N., Borisyuk, R. M., Khibnik, A. I., & Roose, D. (1995). Dynamics and bifurcations of two coupled neural oscillators with different connection types. Bulletin of Mathematical Biology, 57, 809–840.

Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Computational Neuroscience, 3, 91–110.

Chow, C. C., White, J. A., Ritt, J., & Kopell, N. (1998). Frequency control in synchronized networks of inhibitory neurons. J. Computational Neuroscience, 5, 407–420.

Cohen, A., & Harris-Warrick, R. M. (1984). Strychnine eliminates alternating motor output during fictive locomotion in the lamprey. Brain Research, 293,
164–167.

Cymbalyuk, G. S., Nikolaev, E. V., & Borisyuk, R. M. (1994). In-phase and antiphase self-oscillations in a model of two electrically coupled pacemakers. Biological Cybernetics, 71, 153–160.

Destexhe, A., Contreras, D., Sejnowski, T. J., & Steriade, M. (1994). A model of spindle rhythmicity in the isolated thalamic reticular nucleus. J. Neurophysiology, 72, 803–818.

DeWeerth, S. P., Patel, G. N., Simoni, M. F., Schimmel, D. E., & Calabrese, R. L. (1997). A VLSI architecture for modeling intersegmental coordination. In R. Brown & A. Ishii (Eds.), Proceedings of the 17th Conference on Advanced Research in VLSI (pp. 182–200). New York: IEEE Computer Society.

Ermentrout, G. B. (1996). Type I membranes, phase resetting curves, and synchrony. Neural Computation, 8, 979–1001.

Ermentrout, G. B., & Kopell, N. (1991). Multiple pulse interactions and averaging in systems of coupled neural oscillators. J. Mathematical Biology, 29, 195–217.

Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking? Neural Computation, 8, 1653–1676.

Golomb, D., Wang, X.-J., & Rinzel, J. (1994). Synchronization properties of spindle oscillations in a thalamic reticular nucleus model. J. Neurophysiology, 72, 1109–1126.

Grillner, S., & Wallen, P. (1980). Does the central pattern generation for locomotion in lamprey depend on glycine inhibition? Acta Physiol Scand, 110, 103–105.

Hansel, D., Mato, G., & Meunier, C. (1995). Synchrony in excitatory neural networks. Neural Computation, 7, 307–337.

Hoppensteadt, F. C., & Izhikevich, E. M. (1997). Weakly connected neural networks. New York: Springer-Verlag.

Izhikevich, E. M. (1999). Weakly pulse-coupled oscillators, FM interactions, synchronization, and oscillatory associative memory. IEEE Transactions on Neural Networks, 10, 508–526.

Jung, R., Kiemel, T., & Cohen, A. H. (1996). Dynamic behavior of a neural network model of locomotor control in the lamprey. J. Neurophysiology, 75, 1074–1086.

Khibnik, A. I., Kuznetsov, Yu. A., Levitin, V. V., & Nikolaev, E. V. (1993). Continuation techniques and interactive software for bifurcation analysis of ODEs and iterated maps. Physica D, 62, 360–367.

Kuramoto, Y. (1984). Chemical oscillations, waves and turbulence. New York: Springer-Verlag.

Lesher, S., Spano, M. L., Mellen, N. M., Guan, L., Dykstra, S., & Cohen, A. H. (1998). Evidence for unstable periodic orbits in intact swimming lampreys, isolated spinal cords, and intermediate preparations. Ann NY Acad Sci., 16, 486–491.

Lytton, W., & Sejnowski, T. (1991). Simulations of cortical pyramidal neurons synchronized by inhibitory neurons. J. Neurophysiology, 66, 1059–1097.

Mahowald, M. A., & Douglas, R. (1991). A silicon neuron. Nature, 354, 515–518.

Malkin, I. G. (1956). Some problems of the theory of nonlinear oscillations. Moscow: Gostehizdat.
Mead, C. A. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.

Morris, C., & Lecar, H. (1981). Voltage oscillations in the barnacle giant muscle fiber. Biophysical Journal, 35, 193–213.

Patel, G. N., Cymbalyuk, G. S., Calabrese, R. L., & DeWeerth, S. P. (1999). Bifurcation analysis of a silicon neuron. In Proceedings of the NIPS '99.

Patel, G. N., & DeWeerth, S. P. (1997). An analogue VLSI Morris-Lecar neuron. Electronics Letters, IEE, 33, 997–998.

Patel, G. N., Holleman, J. H., & DeWeerth, S. P. (1998). Analog VLSI model of intersegmental coordination with nearest-neighbor coupling. In M. Jordan, M. Kearns, & S. Solla (Eds.), Advances in neural information processing, 10 (pp. 719–725). Cambridge, MA: MIT Press.

Rinzel, J., & Ermentrout, B. (1998). Analysis of neural excitability and oscillations. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (2nd ed., pp. 251–291). Cambridge, MA: MIT Press.

Rubin, J. E., & Terman, D. (2000). Geometric analysis of population rhythms in synaptically coupled neuronal networks. Neural Computation, 12, 597–645.

Skinner, F. K., Turrigiano, G. G., & Marder, E. (1993). Frequency and burst duration in oscillating neurons and two-cell networks. Biological Cybernetics, 69, 375–383.

Terman, D., Kopell, N., & Bose, A. (1998). Dynamics of two mutually coupled slow inhibitory neurons. Physica D, 117, 241–275.

Van Vreeswijk, C., Abbott, L. F., & Ermentrout, G. B. (1994). When inhibition not excitation synchronizes neural firing. J. Computational Neuroscience, 1, 313–321.

Wang, X. J., & Rinzel, J. (1992). Alternating and synchronous rhythms in reciprocally inhibitory model neurons. Neural Computation, 4, 84–97.

Wang, X. J., & Rinzel, J. (1993). Spindle rhythmicity in the reticularis thalami nucleus: Synchronization among mutually inhibitory neurons. Neuroscience, 53, 899–904.

White, J., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Computational Neuroscience, 5, 5–16.

Received June 12, 1998; accepted September 26, 1999.
NOTE
Communicated by Eric Baum
Representation of Concept Lattices by Bidirectional Associative Memories

Radim Bělohlávek
Institute for Research and Application of Fuzzy Modeling, University of Ostrava, CZ-701 03 Ostrava, Czech Republic, and Department of Computer Science, Technical University of Ostrava, CZ-708 33 Ostrava, Czech Republic

This article presents a concept interpretation of patterns for bidirectional associative memory (BAM) and a representation of hierarchical structures of concepts (concept lattices) by BAMs. The constructive representation theorem provides a storing rule for a training set that allows a concept interpretation. Examples demonstrating the theorems are presented.

1 Introduction and Preliminaries
Two levels have to be taken into account when modeling intelligent systems: the microlevel and the macrolevel. This is because there are two corresponding levels evidenced by biological systems, which are supposed to be intelligent: the level of the brain and the level of mental phenomena. The macrolevel (mental level) is supposed to be implemented in the microlevel (brain level). The inspiration from the microlevel gave rise to the paradigm of artificial neural networks: systems based on principles similar to those of their biological counterparts. On the other hand, there are many models inspired by the macrolevel. A challenging goal is the development of architectures that exhibit both levels. On the macrolevel, a clear interpretation of the system is possible, while on the microlevel, an analysis up to an appropriate degree of exactness can be performed. This article deals with a macrolevel interpretation of the bidirectional associative memory (BAM) (Kosko, 1987, 1988). The interpretation is in terms of concepts: BAM patterns are interpreted as representing concepts in the sense of Wille (1982) and Ganter and Wille (1999). Sections 2 and 3 survey BAMs and the fundamentals of concept lattices, respectively. In section 4, a concept interpretation of BAM patterns is proposed and discussed, and a storing rule and representation theorem are presented. Section 5 contains illustrative examples.

2 Bidirectional Associative Memories
Neural Computation 12, 2279–2290 (2000) © 2000 Massachusetts Institute of Technology

Associative memories represent a class of neural networks that aim at modeling the association phenomenon (Arbib, 1995; Bělohlávek, 1998). Based on
the early models of Amari (1972) and Hopfield (1984), Kosko (1987, 1988) proposed a bidirectional associative neural network called the bidirectional associative memory. A BAM consists of two layers of neurons. The first and second layers contain k and l neurons, respectively, whose states (signals) are denoted by x_i (i = 1, ..., k) and y_j (j = 1, ..., l). The states x_i and y_j take the values −1, 1 in bipolar encoding and 0, 1 in binary encoding (to which we restrict ourselves). Each (ith) neuron of the first layer is connected to each (jth) neuron of the second layer by a connection with the real weight w_ij. A real threshold h_i^x (h_j^y) is assigned to the ith neuron of the first layer (jth neuron of the second layer). The dynamics is bidirectional: given a pair ⟨X, Y⟩ = ⟨⟨x_1, ..., x_k⟩, ⟨y_1, ..., y_l⟩⟩ ∈ {0,1}^k × {0,1}^l of patterns of signals, the signal is fed to the second layer to obtain a new pair ⟨X, Y′⟩, then again to the first layer to obtain ⟨X′, Y′⟩, and so on. The dynamics is given by the formulas
$$
y_j' = \begin{cases}
1 & \text{for } \sum_{i=1}^{k} w_{ij} x_i > h_j^y \\
y_j & \text{for } \sum_{i=1}^{k} w_{ij} x_i = h_j^y \\
0 & \text{for } \sum_{i=1}^{k} w_{ij} x_i < h_j^y
\end{cases}
\qquad
x_i' = \begin{cases}
1 & \text{for } \sum_{j=1}^{l} w_{ij} y_j' > h_i^x \\
x_i & \text{for } \sum_{j=1}^{l} w_{ij} y_j' = h_i^x \\
0 & \text{for } \sum_{j=1}^{l} w_{ij} y_j' < h_i^x
\end{cases}
\tag{2.1}
$$
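The bidirectional update can be sketched directly in code (a minimal sketch; the weight matrix, thresholds, and starting pattern used in the usage note below are made-up illustration values, not taken from the article):

```python
# Sketch of the BAM dynamics of equation 2.1 (binary encoding): alternate
# forward (first -> second layer) and backward passes until the pair of
# patterns stops changing, i.e., a stable point is reached.
import numpy as np

def forward(W, hy, x, y):
    """Forward pass: update second-layer states from x; ties keep y."""
    s = W.T @ x                       # s_j = sum_i w_ij x_i
    return np.where(s > hy, 1, np.where(s < hy, 0, y))

def backward(W, hx, x, y):
    """Backward pass: update first-layer states from y; ties keep x."""
    s = W @ y                         # s_i = sum_j w_ij y_j
    return np.where(s > hx, 1, np.where(s < hx, 0, x))

def run(W, hx, hy, x, y, max_steps=100):
    """Iterate the bidirectional dynamics until a stable point is reached."""
    for _ in range(max_steps):
        y_new = forward(W, hy, x, y)
        x_new = backward(W, hx, x, y_new)
        if np.array_equal(x_new, x) and np.array_equal(y_new, y):
            return x, y
        x, y = x_new, y_new
    raise RuntimeError("did not converge")
```

For instance, with W = [[1, −3], [−3, 1]] and all thresholds −1/2, starting from X = ⟨1, 0⟩, Y = ⟨0, 0⟩, the net settles to the stable point ⟨⟨1, 0⟩, ⟨1, 0⟩⟩, illustrating the finite-step stabilization Kosko proved.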
The pair of patterns ⟨X, Y⟩ is called a stable point if the states of the neurons, when set to ⟨X, Y⟩, do not change under the defined dynamics. The set of all stable points of a BAM will be denoted by Stab(W, H). Using an appropriate energy function, Kosko (1988) proved that such a network is stable for any weights w_ij and any thresholds h_i^x, h_j^y.¹ Stability means that given any initial pattern ⟨X, Y⟩ of signals, the net eventually stops after a finite number of steps (feeding the signal from layer to layer). The aim of learning in the context of associative memories is to set the parameters of the net so that a prescribed training set of patterns is related in some way to the set of all stable points. Usually all training patterns have to become stable points. Kosko proposes a kind of Hebbian learning, by which the weights w_ij are determined from the training set T = {⟨X^p, Y^p⟩ | p ∈ P} by

$$
w_{ij} = \sum_{p \in P} \mathrm{bip}\!\left(x_i^p\right) \cdot \mathrm{bip}\!\left(y_j^p\right),
\tag{2.2}
$$

where bip maps 1 to 1 and 0 to −1; that is, it changes the binary encoding to a bipolar one. Thresholds are set to 0. Another rule has been proposed by Wang, Cruz, and Mulligan (1990), where a further theoretical analysis of BAM dynamics is provided.
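The storing rule of equation 2.2 amounts to summing outer products of bipolar-encoded training patterns. A minimal sketch (the two-pair training set below is made up for illustration):

```python
# Sketch of Kosko's Hebbian storing rule (equation 2.2): bipolar-encode the
# binary training patterns and sum their outer products; thresholds are 0.
import numpy as np

def bip(v):
    """Map binary {0,1} states to bipolar {-1,1} states."""
    return 2 * np.asarray(v) - 1

def store(patterns):
    """w_ij = sum over p of bip(x_i^p) * bip(y_j^p), as in equation 2.2."""
    return sum(np.outer(bip(x), bip(y)) for x, y in patterns)

def is_stable(W, x, y, hx=0.0, hy=0.0):
    """Check that (x, y) does not change under the dynamics of equation 2.1."""
    sy = W.T @ x
    y2 = np.where(sy > hy, 1, np.where(sy < hy, 0, y))
    sx = W @ y2
    x2 = np.where(sx > hx, 1, np.where(sx < hx, 0, x))
    return np.array_equal(x2, x) and np.array_equal(y2, y)
```

With T = {⟨⟨1,0,1⟩, ⟨1,0⟩⟩, ⟨⟨0,1,0⟩, ⟨0,1⟩⟩}, both training pairs become stable points, as the rule intends for well-separated patterns.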
¹ In fact, Kosko proved the stability theorem for h_i^x = h_j^y = 0. It is a matter of routine to extend the proof to arbitrary h_i^x and h_j^y.
3 Concepts and Concept Lattices
The notion of concept, central in human thinking, appears as well in the context of neural networks. Networks are seen as if extracting characteristic features of input data that are considered to represent concepts. Attributes other than the aggregation function (selection of characteristic features) are usually ignored. In a programmatic paper, Wille (1982) formulated the theory of concept lattices, which serves as the foundation of formal concept analysis (Ganter & Wille, 1999). It is based on a traditional understanding of concepts (the Port Royal school), by which a concept is determined by its extent and its intent. The extent of a concept (e.g., DOG) is the collection of all objects covered by the concept (the collection of all dogs), while the intent is the collection of all attributes (e.g., "to bark," "to be a mammal") covered by the concept. The starting point of the formalization is that of a context: a triple ⟨G, M, I⟩, where I is a binary relation between G and M, that is, I ⊆ G × M. Elements of G are interpreted as objects, elements of M as attributes, and the fact ⟨g, m⟩ ∈ I as "the object g has the attribute m." According to philosophical tradition, a (formal) concept in a given context is any pair ⟨A, B⟩ of an extent A ⊆ G and an intent B ⊆ M such that A = B↓ := {g ∈ G | for all m ∈ B it holds that ⟨g, m⟩ ∈ I} and B = A↑ := {m ∈ M | for all g ∈ A it holds that ⟨g, m⟩ ∈ I}. In other words, ⟨A, B⟩ is a concept if A = B↓ and B = A↑; that is, A is the set of all objects g ∈ G that have all the attributes m of B, and, conversely, B is the set of all attributes m ∈ M that are shared by all the objects g of A. The crucial relation between concepts is that of hierarchical ordering. The hierarchy of concepts plays a fundamental role in conceptual reasoning. Denote by B(G, M, I) the set of all concepts in the context ⟨G, M, I⟩,
$$
\mathcal{B}(G, M, I) = \{\langle A, B\rangle \mid A = B^{\downarrow},\; B = A^{\uparrow}\},
$$
and for ⟨A1, B1⟩, ⟨A2, B2⟩ ∈ B(G, M, I) put ⟨A1, B1⟩ ≤ ⟨A2, B2⟩ iff A1 ⊆ A2 (which is equivalent to B1 ⊇ B2). The relation ≤ naturally models the relation "to be a subconcept," and ⟨A1, B1⟩ ≤ ⟨A2, B2⟩ means that each object covered by ⟨A1, B1⟩ is also covered by ⟨A2, B2⟩ (as an example, consider the concepts DOG and MAMMAL). The fundamental structure of the set of all concepts given by a context is described by the following proposition (Wille, 1982), which is part of the so-called main theorem of conceptual data analysis.

Proposition. Let ⟨G, M, I⟩ be a context. Then the set B(G, M, I) under the relation ≤ introduced above is a complete lattice, where
$$
\bigwedge \{\langle A_j, B_j\rangle \mid j \in J\} = \left\langle \bigcap \{A_j \mid j \in J\},\; \left(\bigcap \{A_j \mid j \in J\}\right)^{\uparrow} \right\rangle,
$$
$$
\bigvee \{\langle A_j, B_j\rangle \mid j \in J\} = \left\langle \left(\bigcap \{B_j \mid j \in J\}\right)^{\downarrow},\; \bigcap \{B_j \mid j \in J\} \right\rangle.
$$
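The definitions above can be exercised by brute-force enumeration of B(G, M, I) for a small context (a sketch; the three-object, two-attribute context below is a made-up example, not from the article):

```python
# Brute-force enumeration of the concept lattice B(G, M, I) directly from
# the definitions: a pair (A, B) is a concept iff B = A^up and A = B^down.
from itertools import combinations

G = ["g1", "g2", "g3"]                          # objects (made-up context)
M = ["m1", "m2"]                                # attributes
I = {("g1", "m1"), ("g2", "m1"), ("g2", "m2")}  # incidence relation

def up(A):
    """A^up: attributes shared by all objects in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B):
    """B^down: objects having all attributes in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def concepts():
    """All formal concepts (A, B) of the context (G, M, I)."""
    found = set()
    for r in range(len(G) + 1):
        for A in combinations(G, r):
            A = frozenset(A)
            B = up(A)
            if down(B) == A:            # (A, B) is a formal concept
                found.add((A, B))
    return found
```

For this context the enumeration yields three concepts, ⟨{g2}, {m1, m2}⟩ ≤ ⟨{g1, g2}, {m1}⟩ ≤ ⟨G, ∅⟩, a small chain illustrating the extent/intent anti-correspondence.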
Recall that a complete lattice (Birkhoff, 1967) is a partially ordered set in which each subset has a least upper bound (supremum) and a greatest lower bound (infimum). The lattice B(G, M, I) is called the concept lattice given by the context ⟨G, M, I⟩. The complete lattice structure is a very natural one for conceptual structures. Informally, it states that for each set of concepts there is a direct generalization (supremum) as well as a direct specialization (infimum). Note that the visualizable hierarchical structure of the revealed concepts is the basic tool of conceptual data analysis. Further results can be found in Ganter and Wille (1999).

4 Representation and Storing of Concept Lattices
Our aim now is to provide a conceptual interpretation of BAM and to show that BAMs can represent lattices of concepts. To this end, we accept the following convention. For a set Z = {z1, ..., zn} and a subset A ⊆ Z, we denote by s_Z(A) = ⟨a1, ..., an⟩ ∈ {0,1}^n the characteristic vector of A; that is, a_i = 1 if z_i ∈ A and a_i = 0 if z_i ∉ A. Let us have a context ⟨G, M, I⟩ with both G and M finite, G = {g1, ..., gk}, M = {m1, ..., ml}. Using the grandmother cell idea (Arbib, 1995), we can consider a BAM with k neurons in the first layer and l neurons in the second layer, with the following interpretation: the states of the two layers represent subsets of G and M, respectively. The set A ⊆ G is represented by the vector s_G(A) ∈ {0,1}^k of the states of the first layer, and similarly B ⊆ M is represented by s_M(B) ∈ {0,1}^l. The ith neuron of the first layer therefore represents the object g_i, and the jth neuron of the second layer represents the attribute m_j. The pairs of subsets of G and M are thus in a one-to-one correspondence with the pairs of states of the first and second layers.

Example 1. In general, the concept interpretation of the patterns of states is not possible. We can easily find BAM stable points that cannot be interpreted as concepts. As an example, consider k = 1, l = 2, w11 = 1, and w12 = −2, with all thresholds set to 0. Then Stab(W, H) = {⟨0, ⟨0, 0⟩⟩, ⟨0, ⟨0, 1⟩⟩, ⟨0, ⟨1, 1⟩⟩, ⟨1, ⟨1, 0⟩⟩}. For ⟨s_G(A1), s_M(B1)⟩ = ⟨0, ⟨0, 0⟩⟩ and ⟨s_G(A2), s_M(B2)⟩ = ⟨1, ⟨1, 0⟩⟩, we have A1 ⊂ A2 but B1 ⊉ B2, which contradicts the rule that the more common objects there are, the fewer the common properties, which is valid for formal concepts. On the other hand, take, for example, k = l = 2, w11 = w22 = 1, w12 = w21 = −3, with all thresholds set to −1/2. Then Stab(W, H) = {⟨⟨0, 0⟩, ⟨1, 1⟩⟩, ⟨⟨0, 1⟩, ⟨0, 1⟩⟩, ⟨⟨1, 0⟩, ⟨1, 0⟩⟩, ⟨⟨1, 1⟩, ⟨0, 0⟩⟩}. As can be easily verified, Stab(W, H) corresponds to B(G, M, I) for I = {⟨g1, m1⟩, ⟨g2, m2⟩} by Stab(W, H) = {⟨s_G(A), s_M(B)⟩ | ⟨A, B⟩ ∈ B(G, M, I)}.
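The positive case of example 1 can be checked exhaustively (a sketch assuming the intended weights are w11 = w22 = 1 and w12 = w21 = −3, with all thresholds −1/2):

```python
# Exhaustive check of the 2x2 BAM of example 1, assuming the weights
# w11 = w22 = 1 and w12 = w21 = -3, with all thresholds set to -1/2.
import itertools
import numpy as np

W = np.array([[1., -3.], [-3., 1.]])
THETA = -0.5

def step(s, prev):
    """One layer update per equation 2.1, with all thresholds THETA."""
    return np.where(s > THETA, 1, np.where(s < THETA, 0, prev))

def stable_points():
    """All pairs <X, Y> in {0,1}^2 x {0,1}^2 unchanged by the dynamics."""
    pts = set()
    for x in itertools.product([0, 1], repeat=2):
        for y in itertools.product([0, 1], repeat=2):
            xa, ya = np.array(x), np.array(y)
            y2 = step(W.T @ xa, ya)        # forward pass
            x2 = step(W @ y2, xa)          # backward pass
            if np.array_equal(x2, xa) and np.array_equal(y2, ya):
                pts.add((x, y))
    return pts
```

The four stable points found are exactly the characteristic vectors of the four concepts of I = {⟨g1, m1⟩, ⟨g2, m2⟩}.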
The crucial question therefore is: Is there, for each concept lattice B(G, M, I), a BAM such that the set of all concepts of B(G, M, I) is (modulo the above
correspondence) precisely the set of all stable points of this BAM? A positive answer is given by the following theorem.

Theorem 1. Let B(G, M, I) be a concept lattice given by the context ⟨G, M, I⟩ with G and M finite. Then there is a BAM given by weights W and thresholds H such that Stab(W, H) = {⟨s_G(A), s_M(B)⟩ | ⟨A, B⟩ ∈ B(G, M, I)}; that is, there is a one-to-one correspondence between stable points and formal concepts.
Proof. Let G = {g1, ..., gk}, M = {m1, ..., ml}. Define a BAM by the matrix W given by

$$
w_{ij} = \begin{cases}
1 & \text{if } \langle g_i, m_j\rangle \in I \\
-q & \text{if } \langle g_i, m_j\rangle \notin I
\end{cases}
\tag{4.1}
$$

for i = 1, ..., k, j = 1, ..., l, where q = max{k, l} + 1. Set all the thresholds to −1/2. Denote by x(t) and y(t), t = 0, 1, 2, ..., the vectors of the states of the first and second layers, respectively, at time t.

Let ⟨A, B⟩ ∈ B(G, M, I). We have to show that ⟨s_G(A), s_M(B)⟩ is a stable point of the BAM. To this end, initialize the network with ⟨s_G(A), s_M(B)⟩, that is, x(0) = ⟨x1(0), ..., xk(0)⟩ = s_G(A), y(0) = ⟨y1(0), ..., yl(0)⟩ = s_M(B). We show ⟨x(1), y(1)⟩ = ⟨x(0), y(0)⟩; feeding the signal forward, the states do not change. Clearly x(1) = x(0) (if the signal is fed forward, the first layer does not change its state). Consider now any y_j. We distinguish two cases. First, let y_j(0) = 1. Since ⟨A, B⟩ is a concept, we have A↑ = B, and so ⟨g_i, m_j⟩ ∈ I (i.e., w_ij = 1 by equation 4.1) for each i such that g_i ∈ A (i.e., such that x_i(0) = 1). We therefore have

$$
\sum_{i=1}^{k} w_{ij} x_i(0) = |A| \ge 0 > -\tfrac{1}{2} = h_j^y.
$$

By the activation dynamics of the BAM (see equation 2.1), we have y_j(1) = 1. For y_j(0) = 1, the state therefore does not change. Second, let y_j(0) = 0. Since ⟨A, B⟩ is a concept, there is some i such that g_i ∈ A (i.e., x_i(0) = 1) but ⟨g_i, m_j⟩ ∉ I (i.e., w_ij = −q). Denote by K the set of all such i. Denote furthermore by K* the set of i such that g_i ∈ A (i.e., x_i(0) = 1) and ⟨g_i, m_j⟩ ∈ I (i.e., w_ij = 1). We have

$$
\sum_{i=1}^{k} w_{ij} x_i(0) = \sum_{i \in K} w_{ij} x_i(0) + \sum_{i \in K^*} w_{ij} x_i(0) = -|K| q + |K^*| < -\tfrac{1}{2},
$$

since q > k ≥ |K*|, |K| ≥ 1, and −|K|q + |K*| is an integer. By the BAM dynamics, we have y_j(1) = 0; the state does not change. We have proved y(1) = y(0). We should now show that x(2) = x(1) for the backward phase.
Radim Bělohlávek
The proof is completely symmetric and therefore will be omitted. We have thus proved that ⟨s_G(A), s_M(B)⟩ is stable.

Conversely, let ⟨s_G(A), s_M(B)⟩ = ⟨x, y⟩ ∈ Stab(W, H). We have to show that ⟨A, B⟩ is a concept of B(G, M, I), that is, A↑ = B and B↓ = A. Again, due to symmetry, we show only A↑ = B. We reason as follows: m_j ∈ A↑ iff for each i such that g_i ∈ A (i.e., x_i = 1) we have ⟨g_i, m_j⟩ ∈ I (i.e., w_ij = 1). The last assertion holds iff Σ_{i=1..k} w_ij x_i > −1/2. (Indeed, if for each i such that x_i = 1 it holds that w_ij = 1, then Σ_{i=1..k} w_ij x_i ≥ 0 > −1/2, so the direction ⇒ is clear. Conversely, let Σ_{i=1..k} w_ij x_i > −1/2. If there were some i such that g_i ∈ A (x_i = 1) and ⟨g_i, m_j⟩ ∉ I (w_ij = −q), then from q > k we would conclude Σ_{i=1..k} w_ij x_i < −1/2, a contradiction.) By the BAM dynamics (equation 2.1), Σ_{i=1..k} w_ij x_i > −1/2 iff y_j = 1. To sum up, m_j ∈ A↑ iff y_j = 1; hence A↑ = B. The theorem is proved.

Corollary 1. For each finite lattice L = ⟨L, ⪯⟩, there is a BAM given by weights W and thresholds H such that under the relation ≤ defined on Stab(W, H) by

    ⟨x¹, y¹⟩ ≤ ⟨x², y²⟩  iff  x¹_i ≤ x²_i (for all i)  iff  y¹_j ≥ y²_j (for all j),        (4.2)

⟨Stab(W, H), ≤⟩ and ⟨L, ⪯⟩ are isomorphic lattices.

Proof. By the first theorem of Wille (1982), the lattice L is isomorphic to the concept lattice B(L, L, ⪯) (i.e., one puts G = M = L and I = ⪯). Denote the isomorphism by Q. By theorem 1, there is a BAM given by W, H and a one-to-one correspondence ψ between B(L, L, ⪯) and Stab(W, H). It is obvious that if one introduces the relation ≤ on Stab(W, H) by equation 4.2, ψ becomes an isomorphism between ⟨B(L, L, ⪯), ≤⟩ and ⟨Stab(W, H), ≤⟩. Therefore, the composite mapping Q ∘ ψ is an isomorphism between L = ⟨L, ⪯⟩ and ⟨Stab(W, H), ≤⟩.
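The correspondence established by theorem 1 is easy to check computationally. The Python sketch below (a tiny hypothetical context and illustrative names, not from the article) builds W by equation 4.1 with all thresholds −1/2, enumerates every state of the small BAM, and confirms that the stable points are exactly the formal concepts.

```python
from itertools import product

# Toy formal context (hypothetical, for illustration only).
G = ["g1", "g2", "g3"]                            # objects
M = ["m1", "m2"]                                  # attributes
I = {("g1", "m1"), ("g2", "m1"), ("g2", "m2")}    # incidence relation
k, l = len(G), len(M)
q = max(k, l) + 1                                 # q = max{k, l} + 1 as in the proof

# Equation 4.1: w_ij = 1 if <g_i, m_j> in I, else -q; every threshold is -1/2.
W = [[1 if (g, m) in I else -q for m in M] for g in G]

def is_stable(x, y):
    """Stable iff a forward pass and then a backward pass leave the state unchanged."""
    y2 = tuple(int(sum(W[i][j] * x[i] for i in range(k)) > -0.5) for j in range(l))
    x2 = tuple(int(sum(W[i][j] * y2[j] for j in range(l)) > -0.5) for i in range(k))
    return (x2, y2) == (x, y)

# Derivation operators A-up and B-down of formal concept analysis.
def up(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

concepts = {(A, up(A))
            for bits in product((0, 1), repeat=k)
            for A in [frozenset(g for g, b in zip(G, bits) if b)]
            if down(up(A)) == A}                  # A is an extent

stable = {(frozenset(g for g, b in zip(G, x) if b),
           frozenset(m for m, b in zip(M, y) if b))
          for x in product((0, 1), repeat=k)
          for y in product((0, 1), repeat=l)
          if is_stable(x, y)}

print(stable == concepts)  # True, as theorem 1 predicts
```

On this context the three stable points are ⟨{g2}, {m1, m2}⟩, ⟨{g1, g2}, {m1}⟩, and ⟨G, ∅⟩, one per formal concept.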
The proof of theorem 1 also gives the rule (equation 4.1) for storing the set of all patterns (concepts) ⟨A, B⟩ of B(G, M, I) from the relation I. However, one might be concerned with a situation where the training information is not in the form of a binary relation but in the form of a training set T = {⟨A_p, B_p⟩ | A_p ⊆ G, B_p ⊆ M, p ∈ P} of patterns. In this case, the fundamental question is whether T can be interpreted as a (consistent) structure of concepts. Call T conceptually consistent if there is a concept lattice B(G, M, I) such that T ⊆ B(G, M, I), that is, each ⟨A, B⟩ ∈ T is a concept of a fixed concept lattice. In the following we concentrate on the problem of finding necessary and sufficient conditions for a
training set T to be conceptually consistent and on the problem of storing a conceptually consistent training set.

Lemma 1. Let B(G, M, I) be a concept lattice and put T = B(G, M, I). Define the relation I_T ⊆ G × M by

    ⟨g, m⟩ ∈ I_T  iff  there exists ⟨A, B⟩ ∈ T such that g ∈ A and m ∈ B.        (4.3)

Then I = I_T.

Proof. If ⟨g, m⟩ ∈ I_T, then by equation 4.3 there is some ⟨A, B⟩ ∈ T such that g ∈ A, m ∈ B. Since T = B(G, M, I), ⟨A, B⟩ is a formal concept, and therefore A↑ = B. By the definition of A↑, we conclude that ⟨g, m⟩ ∈ I; that is, I_T ⊆ I. Conversely, if ⟨g, m⟩ ∈ I, then for ⟨A, B⟩ = ⟨{g}↑↓, {g}↑⟩ ∈ B(G, M, I) we have g ∈ A and m ∈ B, that is, ⟨g, m⟩ ∈ I_T, which proves I ⊆ I_T.
Lemma 2. For a conceptually consistent T, let I_T be defined by equation 4.3. Then T ⊆ B(G, M, I_T).

Proof. If T is conceptually consistent, then T ⊆ B(G, M, I) for some I ⊆ G × M. By the definition (equation 4.3) of I_T and lemma 1, I_T ⊆ I. Denote the operators corresponding to I and I_T by ↑, ↓ and ↑T, ↓T, respectively. Take ⟨A, B⟩ ∈ T. Since I_T ⊆ I, we have B = A↑ = {m ∈ M | ∀g ∈ A: ⟨g, m⟩ ∈ I} ⊇ {m ∈ M | ∀g ∈ A: ⟨g, m⟩ ∈ I_T} = A↑T. On the other hand, since for every g ∈ A and m ∈ B we have ⟨g, m⟩ ∈ I_T (by equation 4.3), it holds that A↑T ⊇ B; that is, A↑T = B. One would similarly obtain B↓T = A. To sum up, A↑T = B and B↓T = A, that is, ⟨A, B⟩ ∈ B(G, M, I_T). We have proved T ⊆ B(G, M, I_T).
We have the following criterion for a training set to be conceptually consistent.

Theorem 2. A training set T = {⟨A_p, B_p⟩ | p ∈ P} is conceptually consistent iff for each p ∈ P it holds that

    A_p = {g ∈ G | ∀m ∈ B_p ∃p′ ∈ P: g ∈ A_p′, m ∈ B_p′},
    B_p = {m ∈ M | ∀g ∈ A_p ∃p′ ∈ P: g ∈ A_p′, m ∈ B_p′}.

Proof. If T is conceptually consistent, then, by lemma 2, T ⊆ B(G, M, I_T). Therefore, each ⟨A_p, B_p⟩ ∈ T is a formal concept, and thus A_p↑T = B_p. By the definition of ↑T and equation 4.3, this means that B_p = {m ∈ M | ∀g ∈ A_p: ⟨g, m⟩ ∈ I_T} = {m ∈ M | ∀g ∈ A_p ∃p′ ∈ P: g ∈ A_p′, m ∈ B_p′}, which is the second equality to be proved. The first equality can be obtained symmetrically. Conversely, if for each p ∈ P the above equalities hold, then
T ⊆ B(G, M, I_T), since the equalities just express the facts A_p = B_p↓T and B_p = A_p↑T, respectively.

By the previous results, we have for a training set T = {⟨A_p, B_p⟩ | A_p ⊆ G, B_p ⊆ M, p ∈ P} with |G| = k, |M| = l the following storing algorithm. For i = 1, ..., k, j = 1, ..., l, set the weights by

    w_ij = 1                   if ∃p ∈ P: g_i ∈ A_p, m_j ∈ B_p,
    w_ij = −(max{k, l} + 1)    otherwise,

and the thresholds by h_i^x = −1/2, h_j^y = −1/2.
Call T storable (by our algorithm) if, for the weights and thresholds set according to the algorithm, it holds that s(T) ⊆ Stab(W, H), where s(T) = {⟨s_G(A), s_M(B)⟩ | ⟨A, B⟩ ∈ T}. The scope and limits of the algorithm are described by the following assertion.

Corollary 2. A training set T is storable iff it is conceptually consistent.

Proof. If T is conceptually consistent, then T is storable by lemma 2. Conversely, if T is storable, then, by definition, s(T) ⊆ Stab(W, H). The assertion follows from the fact that if W and H are set from T by our algorithm, then Stab(W, H) = s(B(G, M, I_T)) by theorem 1.
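As a sketch of how the consistency criterion and the storing algorithm fit together, the following Python fragment (with a tiny hypothetical training set and illustrative names, not from the article) checks the theorem 2 equalities via the relation I_T of equation 4.3 and then sets the weights and thresholds as prescribed above.

```python
# Hypothetical training set over objects G and attributes M; each pattern
# is an (extent, intent) pair.
G = ["g1", "g2", "g3"]
M = ["m1", "m2"]
T = [({"g2"}, {"m1", "m2"}), ({"g1", "g2"}, {"m1"})]

def relation_from(T):
    """Equation 4.3: <g, m> is in I_T iff some pattern of T contains both."""
    return {(g, m) for A, B in T for g in A for m in B}

def conceptually_consistent(T):
    """Theorem 2: each pattern must equal its closure under the operators of I_T."""
    IT = relation_from(T)
    for A, B in T:
        if B != {m for m in M if all((g, m) in IT for g in A)}:
            return False
        if A != {g for g in G if all((g, m) in IT for m in B)}:
            return False
    return True

def store(T):
    """Storing algorithm: weights 1 or -(max{k, l} + 1) from I_T, thresholds -1/2."""
    IT = relation_from(T)
    q = max(len(G), len(M)) + 1
    W = [[1 if (g, m) in IT else -q for m in M] for g in G]
    return W, -0.5

print(conceptually_consistent(T))                       # True
print(conceptually_consistent(T + [({"g1"}, {"m1"})]))  # False: {g1} is not closed
```

The second call fails because, under I_T, the attribute m1 belongs to both g1 and g2, so the extent closing {m1} is {g1, g2}, not {g1}; by corollary 2 that enlarged training set is not storable.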
5 Examples
Example 2. Let a context be given by the set G of nine planets (Mercury, ..., Pluto), the set M of seven attributes ("size small," ..., "does not have a moon"), and the relation I between them depicted in Table 1 (see Wille, 1982). The BAM determined from this context by theorem 1 consists of nine and seven neurons in the first and the second layer, respectively. The set of all stable points (i.e., concepts) is depicted in Table 2. The approximate linguistic description of the stable points is as follows: 1, the empty concept; 2, "small planet without moon near to sun"; 3, "small planet with moon(s) near to sun"; 4, "small planet with moon(s) far from sun"; 5, "large planet with moon(s) far from sun"; 6, "medium planet with moon(s) far from sun"; 7, "small planet near to sun"; 8, "small planet with moon(s)"; 9, "planet far from sun"; 10, "small planet"; 11, "planet with moon"; 12, "planet." The conceptual hierarchy of the stable points (concept lattice) is depicted in Figure 1.
Table 1: Planets and Their Attributes.

                   Size                  Distance from Sun   Has Moon
                   Small  Medium  Large  Near   Far          Yes   No
                   (ss)   (sm)    (sl)   (dn)   (df)         (my)  (mn)
    Mercury (Me)    ×                     ×                         ×
    Venus (V)       ×                     ×                         ×
    Earth (E)       ×                     ×                   ×
    Mars (Ma)       ×                     ×                   ×
    Jupiter (J)                    ×             ×            ×
    Saturn (S)                     ×             ×            ×
    Uranus (U)             ×                     ×            ×
    Neptune (N)            ×                     ×            ×
    Pluto (P)       ×                            ×            ×
Table 2: Stable Points of the Planet Example.

    Number   Extent                       Intent
             Me V  E  Ma J  S  U  N  P    ss sm sl dn df my mn
    1        0  0  0  0  0  0  0  0  0    1  1  1  1  1  1  1
    2        1  1  0  0  0  0  0  0  0    1  0  0  1  0  0  1
    3        0  0  1  1  0  0  0  0  0    1  0  0  1  0  1  0
    4        0  0  0  0  0  0  0  0  1    1  0  0  0  1  1  0
    5        0  0  0  0  1  1  0  0  0    0  0  1  0  1  1  0
    6        0  0  0  0  0  0  1  1  0    0  1  0  0  1  1  0
    7        1  1  1  1  0  0  0  0  0    1  0  0  1  0  0  0
    8        0  0  1  1  0  0  0  0  1    1  0  0  0  0  1  0
    9        0  0  0  0  1  1  1  1  1    0  0  0  0  1  1  0
    10       1  1  1  1  0  0  0  0  1    1  0  0  0  0  0  0
    11       0  0  1  1  1  1  1  1  1    0  0  0  0  0  1  0
    12       1  1  1  1  1  1  1  1  1    0  0  0  0  0  0  0
Figure 1: Hierarchical structure (lattice) of stable points of the planet example.
Figure 2: Stable points.
Example 3. Consider a BAM with 9 neurons in the first layer and 25 neurons in the second layer. We represent each state ⟨X, Y⟩ of the net by a pair of bitmap pictures: a 3 × 3 bitmap for X and a 5 × 5 bitmap for Y. If the neuron state is 0 (1), then the corresponding pixel is white (black). Consider the training set T consisting of six pairs ⟨A, B⟩ such that ⟨s_G(A), s_M(B)⟩ are represented by the pairs 5, 6, 9, 19, 22, and 23 of Figure 2. By theorem 2, it is easy to verify that T is conceptually consistent. The storing algorithm yields a BAM with 24 stable points, depicted in Figure 2. The conceptual hierarchy of the stable points is visualized in Figure 3. The patterns of T have been stored and completed into a complete conceptual structure. Note that concepts 6 and 19 of the training set are complementary. The stored structure of
Figure 3: Hierarchical structure (lattice) of stable points.
concepts contains additional pairs of complementary concepts ("opposite concepts," in conceptual terms), for example, 5 and 20 (5 is in T) and 7 and 18 (neither of which is in T). Note that, viewed from the lattice point of view, the pairs of complementary concepts form complementary elements in the lattice of concepts.
Acknowledgments
This work was supported by grant no. 201/99/P060 of the Grant Agency of the Czech Republic and project VS96037 of the Ministry of Education of the Czech Republic. I thank an anonymous referee for comments concerning the presentation of the article.

References

Amari, S. (1972). Learning patterns and pattern sequences by self-organizing nets of thresholding elements. IEEE Trans. on Computers, 21(11), 461–482.
Arbib, M. (Ed.). (1995). Handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Bělohlávek, R. (1998). Network processing indeterminacy. Unpublished doctoral dissertation, Technical University of Ostrava.
Birkhoff, G. (1967). Lattice theory (3rd ed.). Providence, RI: American Mathematical Society.
Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin: Springer-Verlag.
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. Natl. Acad. Sci. U.S.A., 81, 3088–3092.
Kosko, B. (1987). Adaptive bidirectional associative memory. Applied Optics, 26(23), 4947–4960.
Kosko, B. (1988). Bidirectional associative memory. IEEE Trans. Systems, Man, and Cybernetics, 18(1), 49–60.
Wang, Y.-F., Cruz, J. B., Jr., & Mulligan, J. H., Jr. (1990). Two coding strategies for bidirectional associative memory. IEEE Trans. Neural Networks, 1(1), 81–92.
Wille, R. (1982). Restructuring lattice theory: An approach based on hierarchies of concepts. In I. Rival (Ed.), Ordered sets (pp. 445–470). Dordrecht: Reidel.

Received March 3, 1999; accepted October 17, 1999.
LETTER
Communicated by M. V. Srinivasan and Stephen DeWeerth
A Silicon Implementation of the Fly's Optomotor Control System

Reid R. Harrison
Christof Koch
Computation and Neural Systems Program 139-74, California Institute of Technology, Pasadena, CA 91125, U.S.A.

Flies are capable of stabilizing their body during free flight by using visual motion information to estimate self-rotation. We have built a hardware model of this optomotor control system in a standard CMOS VLSI process. The result is a small, low-power chip that receives input directly from the real world through on-board photoreceptors and generates motor commands in real time. The chip was tested under closed-loop conditions typically used for insect studies. The silicon system exhibited stable control sufficiently analogous to the biological system to allow for quantitative comparisons.

1 Introduction
Flies use visual motion information to estimate self-rotation and generate a compensatory torque response to maintain stability during flight. This well-studied behavior is known as the optomotor response. It is interesting from an engineering standpoint because it extracts relevant information from a dynamic, unstructured environment on the basis of passive sensors and uses this information to generate appropriate motor commands during flight. This system is implemented in biological hardware that is many orders of magnitude smaller and more power efficient than charge-coupled device (CCD) imagers coupled to a conventional digital microprocessor.

Much of the computation underlying the optomotor control system is performed by the horizontal system (HS) cells of the fly visual system (Geiger & Nässel, 1981, 1982; Egelhaaf, Hausen, Reichardt, & Wehrhahn, 1988; Hausen & Wehrhahn, 1990; Egelhaaf & Borst, 1993). The three HS cells (HSN, HSE, and HSS) belong to a group of 50 to 60 giant tangential neurons having large dendritic arborizations in the lobula plate region of the optic lobe (Hausen, 1981, 1982, 1984; Hengstenberg, 1982; Krapp & Hengstenberg, 1996). HS cells are nonspiking neurons that are depolarized by full-field visual motion from the front to the back of the eye and hyperpolarized by back-to-front motion. Each HS cell integrates signals from an ipsilateral retinotopic array of elementary motion detectors (EMDs), units in the medulla that estimate local motion in small areas of the visual field.

Neural Computation 12, 2291–2304 (2000) © 2000 Massachusetts Institute of Technology
The HS cells synapse onto descending, spiking neurons, which relay information to the motor centers of the thoracic ganglion. We built a single-chip integrated imager and analog computer that mimics the fly optomotor control system and produces compensatory motor signals in real time. Based on earlier work by Mead and colleagues, we use standard CMOS field-effect transistors operating in the subthreshold regime, where the source-drain current through the transistor is exponentially related to its gate voltage (Mead, 1989). Similar to neural systems, all our computations are performed in parallel by analog and distributed circuits operating in continuous time. There is no software; the algorithm is entirely specified by the circuit architecture.

Several researchers have built hardware models of insect visual systems with analog VLSI (Andreou & Strohbehn, 1990; Delbrück, 1993; Sarpeshkar, Bair, & Koch, 1993; Etienne-Cummings & Van der Spiegel, 1996; Moini et al., 1997), discrete analog hardware (Franceschini, Pichon, & Blanes, 1992), and traditional CPUs with CCD cameras (Srinivasan, Chahl, & Zhang, 1997; Lewis, 1998), many of which have been incorporated into mobile robots. The comparisons made to biology have been qualitative in nature. We present here a more rigorous approach to neuromorphic engineering, in which hardware models are evaluated by directly replicating experiments originally performed on their biological counterparts.

2 Description of Hardware Model
Our motion detector architecture is a delay-and-correlate scheme similar to that first proposed by Hassenstein and Reichardt (1956) to explain beetle behavior, in which a temporally bandpassed signal from one photoreceptor is multiplied with the delayed bandpassed signal from an adjacent photoreceptor. The result is subtracted from the mirror-symmetric operation to remove the direction-insensitive component. The outputs of many such EMDs sensitive to motion at different locations are added and low-pass filtered to give the final output signal that the insect uses to stabilize its flight. There is good evidence that correlation-based EMDs form the basis of the optomotor system in the fly (Reichardt & Egelhaaf, 1988; Borst & Egelhaaf, 1989).

Our chip consists of an array of photoreceptors and EMDs whose outputs are summed (see Figure 1). A lens mounted over the chip focuses the image of the outside world onto the silicon surface. Each elementary motion detector uses photodiodes as light sensors. We use a four-transistor adaptive photoreceptor circuit developed by Delbrück (Delbrück & Mead, 1996) that produces a continuous-time output voltage proportional to the logarithm of light intensity (see Figure 2a). This circuit has a temporal low-pass characteristic with a cutoff frequency that can be set with a bias voltage. The photoreceptor is connected to a temporal derivative circuit (Mead, 1989) (see Figure 2b), which has a high-pass behavior.

Figure 1: Model of the fly optomotor system implemented in silicon. Photoreceptor outputs are temporally bandpass filtered and fed into Hassenstein-Reichardt elementary motion detectors (EMDs) consisting of first-order low-pass filters (τ = 40 ms) followed by multipliers. The HS cell is modeled simply as a spatial summation of opponent EMDs. The HS cell response is passed through an off-chip low-pass filter (τ = 680 ms), mimicking the behavior of the thoracic motor centers, to generate the torque response. The silicon implementation includes 13 EMDs with integrated photoreceptors on a single 2.2 mm × 2.2 mm chip fabricated in a standard 2.0 μm CMOS process.

Transient firing, characteristic of a temporal high-pass response, has been observed in fly laminar cells that receive input from retinal photoreceptors (Weckström, Juusola, & Laughlin, 1992). Together, the low-pass filtering of the photoreceptor and the high-pass filtering of the temporal derivative circuit form a bandpass filter, which improves performance by eliminating DC illumination (which contains no motion information) and attenuating high-frequency noise such as the 120 Hz flicker of AC incandescent lighting. These bandpass filters were set to attenuate frequencies below 2.8 Hz and above 10 Hz.

[Figure 2 (facing page) shows the EMD subcircuit schematics: (a) the photoreceptor; (b) the temporal derivative circuit, whose outputs I_BPF+ and I_BPF− are the positive and negative components of C dV/dt; (c) the current-mode low-pass filter, obeying τ dI_LPF/dt + I_LPF = I_BPF with τ = C U_T / I_τ; (d) the multiplier, computing I_out = I_BPF I_LPF / I_ref.]

The temporal derivative circuit relies on a high-gain differential amplifier in a negative feedback configuration to keep the voltage on the capacitor equal to the input voltage. As the capacitor charges and discharges to maintain this equality, the currents through the two source-follower transistors (labeled "sf" in Figure 2b) may be measured. The outputs of the temporal derivative circuit are these two unidirectional currents, which are proportional to the positive and negative components of the temporal derivative of the input voltage. This resembles the ON and OFF channels found in many
biological visual systems. One study suggests that ON and OFF channels are present in the fly (Franceschini, Riehle, & Nestour, 1989), but the evidence is mixed (Egelhaaf & Borst, 1992).

We use the phase lag inherent in a first-order low-pass filter as a time delay. The currents from the temporal derivative circuit are passed to current-mode first-order low-pass filter circuits (see Figure 2c) (Himmelbauer, Furth, Pouliquen, & Andreou, 1996). These are log-domain filters that take advantage of the exponential behavior of field-effect transistors (FETs) in the subthreshold region of operation. Note that two filters are needed for each EMD: one for the ON channel and one for the OFF channel, which are processed in parallel. The time constant of the filters is controlled with a bias current that can be set externally. This time constant can be changed to tune the EMD to a specific optimal temporal frequency. For all the experiments described in this article, we fixed this time constant to 40 ms. This gave our chip a maximum temporal frequency sensitivity of 4 Hz, similar to motion-sensitive neurons in flies (O'Carroll, Bidwell, Laughlin, & Warrant, 1996).

To correlate the delayed and nondelayed signals for motion computation, we use a current-mode multiplier circuit (see Figure 2d). This circuit takes advantage of the exponential behavior of subthreshold FETs to perform the computation. Two diode-connected FETs convert the input currents into log-encoded voltages. The weighted sum of these voltages is computed with the capacitive divider on the floating gate of the output transistor, and this transistor exponentiates the summed voltages into the output current, completing the multiplication. Any trapped charge remaining on the floating gates from fabrication is eliminated by exposing the chip to ultraviolet light, which imparts sufficient energy to the trapped electrons to allow passage through the surrounding insulator.
This circuit represents one of a family of floating-gate MOS translinear circuits developed by Minch that are capable of computing arbitrary power laws with current-mode signals (Minch, Diorio, Hasler, & Mead, 1996). After multiplication is performed in both the ON and OFF channels, these two signals are summed.
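The 40 ms time constant and the quoted 4 Hz peak are consistent with the standard result for an idealized correlation detector that uses a first-order low-pass filter as its delay: the mean response to a drifting grating is maximal where the filter's phase lag reaches 45 degrees, that is, at f = 1/(2πτ). A quick check of the numbers:

```python
import math

tau = 0.040                       # low-pass filter time constant set on the chip (40 ms)
f_opt = 1.0 / (2.0 * math.pi * tau)
print(round(f_opt, 2))            # 3.98, i.e., the ~4 Hz peak sensitivity quoted above
```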
Figure 2: Facing page. EMD subcircuits. (a) Photoreceptor. This circuit produces an output voltage proportional to the logarithm of light intensity. (b) Temporal derivative circuit. In combination with the low-pass filter inherent to the photoreceptor, this forms a temporal bandpass filter with a current-mode output. (c) Low-pass filter. The time constant of this first-order filter is determined by the bias current I_τ (which is set by a voltage supplied from off-chip) and the capacitance C. (d) Multiplier. The devices shown are floating-gate nFET transistors with capacitive inputs. The two inputs couple to the floating gate, forming a capacitive divider. The input transistors are diode connected, which converts the input currents into log-encoded voltages. The capacitive divider on the output transistor computes a weighted sum of these voltages. The output transistor produces a current proportional to the exponential of this sum.
One entire EMD (left and right channels) consists of 31 transistors and 25 capacitors with 8.0 pF of total capacitance. Each EMD takes 0.044 mm² of silicon area in a 2.0 μm CMOS process, including the integrated photoreceptors. By operating most of the transistors in the subthreshold regime, we achieve extremely low power dissipation (approximately 7.5 μW per elementary motion detector).

We built a simple model of the HS cell by constructing a one-dimensional array of 13 complete EMDs and linearly summing their outputs. This is easily achieved due to the current-mode nature of the EMD output signals: we simply tie all the wires together. Dendritic integration is likely to be nonlinear. When stimulus size is increased, motion-sensitive neurons in the fly exhibit gain control, where the response increases less than linearly with size (Borst, Egelhaaf, & Haag, 1995; Single, Haag, & Borst, 1997). Our model does not include this nonlinear size dependence, but all our experiments use patterns of fixed size.

Figure 3 shows the response of the EMD array to sinusoidal gratings with varying temporal frequencies. Data from a fly wide-field motion neuron are shown for comparison. Both motion detectors show temporal frequency tuning that is characteristic of the delay-and-correlate motion detector (Adelson & Bergen, 1985). The greater width of the fly tuning curve is probably due to adaptation of the EMD low-pass filter time constant (de Ruyter van Steveninck, Zaagman, & Mastebroek, 1986; Borst & Egelhaaf, 1987; Clifford, Ibbotson, & Langley, 1997). Also, bandpass filtering of the photoreceptor signals attenuates low- and high-frequency stimuli in our VLSI model. We have measured the response of our EMD circuit to drifting sinusoids while varying spatial frequency and direction, and have shown that the circuit exhibits tuning similar to that observed in insect motion-sensitive neurons (data not shown; see Harrison & Koch, 1998).
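The delay-and-correlate scheme and its temporal frequency tuning can be sketched in a few lines of Python. This is a discrete-time idealization (not the chip's circuits): two photoreceptor signals sampled a quarter spatial wavelength apart, a first-order low-pass filter serving as the delay, multiplication, and opponent subtraction.

```python
import math

def reichardt_response(f_t, tau=0.040, dt=1e-3, t_end=2.0):
    """Mean opponent output of one delay-and-correlate EMD for a sinusoid
    drifting in the preferred direction; the second input lags the first
    by 90 degrees because the samples sit a quarter wavelength apart."""
    lp1 = lp2 = 0.0
    a = dt / tau                      # first-order low-pass filter coefficient
    acc, n = 0.0, int(t_end / dt)
    for step in range(n):
        t = step * dt
        s1 = math.sin(2 * math.pi * f_t * t)
        s2 = math.sin(2 * math.pi * f_t * t - math.pi / 2)
        lp1 += a * (s1 - lp1)         # the filter's phase lag acts as the delay
        lp2 += a * (s2 - lp2)
        acc += lp1 * s2 - lp2 * s1    # correlate, minus the mirror-symmetric term
    return acc / n

freqs = [0.5, 1, 2, 4, 8, 16]         # Hz; each divides the 2 s window evenly
resp = {f: reichardt_response(f) for f in freqs}
print(max(resp, key=resp.get))        # 4: the peak sits near 1/(2*pi*tau)
```

The mean output follows ωτ/(1 + (ωτ)²), so with τ = 40 ms the tuning curve peaks near 4 Hz and falls off on both sides, the nonmonotonic shape visible in Figure 3.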
3 Measuring the Optomotor Response

3.1 Experiments Previously Performed on Flies. Warzecha and Egelhaaf (1996) recently characterized the optomotor behavior of the fly under closed-loop conditions. A female sheep fly (Lucilia cuprina, Calliphoridae) was rigidly attached to a meter that measured the yaw torque produced while the fly attempted to turn in response to visual stimuli (see Figure 4a), reducing the fly's behavior to a single degree of freedom. Vertical bars were presented to a large region of the fly's visual field and could be drifted clockwise or counterclockwise. In closed-loop experiments, the fly's yaw torque was measured in real time and scaled by a constant gain term to yield angular velocity. This simulates the observed dominance of air friction in determining the instantaneous angular velocity in flies (Reichardt & Poggio, 1976). The fly's simulated angular velocity was subtracted from the angular velocity imposed by the experimenter. The resulting signal was used to control the drift rate of the visual stimulus. This simulated free-flight conditions and allowed evaluation of the optomotor system performance.

The imposed motion schedule consisted of 3.75 s of zero imposed motion, then 7.5 s of clockwise rotation at 44 degrees s⁻¹. Figure 5a shows the torque
Figure 3: Normalized temporal frequency response of silicon (circles) and fly (triangles) motion detectors. Both systems exhibit temporal frequency tuning characteristic of correlation-based motion detection schemes. (Temporal frequency is proportional to velocity for constant spatial frequency.) The greater width of the fly tuning curve is probably due to adaptation of the EMD low-pass filter time constant (Clifford et al., 1997). The stimulus used for the silicon EMD array was a sinusoidal grating with a spatial frequency of 0.03 cycles deg⁻¹, resulting in two to three pattern wavelengths across the photodetector array. Error bars denote the standard deviation of the chip response during an 800 ms recording interval (2 kHz sampling rate), indicating some residual pattern dependence. Fly data are normalized mean spike rate (spontaneous activity subtracted) taken from an unspecified lobula plate wide-field motion neuron in the blowfly (Calliphora erythrocephala). Fly data redrawn from O'Carroll et al. (1996).
data data and the resulting stimulus position for an individual trial. Figure 5b shows the data averaged over 139 trials in a total of five animals. (See Warzecha & Egelhaaf, 1996, for details on the experimental protocol.) The fly is able to stabilize its flight and cancel out most of the imposed motion. Simulation results suggest that the nonmonotonic temporal frequency response of Reichardt motion detectors results in greater stability for the optomotor control system (Warzecha & Egelhaaf, 1996).

[Figure 4 (facing page) shows block diagrams of the closed-loop setups for the fly (a) and for the chip (b); see the caption below.]

The individual trials show an oscillatory component to the torque response around 2 Hz. This oscillation is not linked to the stimulus, since it is not present in the average torque trace. Oscillations are not observed under open-loop conditions, suggesting that they arise from optomotor feedback (Geiger & Poggio, 1981; Warzecha & Egelhaaf, 1996). Notice that despite the large amplitude of the torque oscillations, the position trace is not dominated by this effect. This fluctuation amplitude, in terms of number of photoreceptors, is close to the amplitude observed in human microsaccades (Warzecha & Egelhaaf, 1996). Poggio and colleagues observed similar oscillations in closed-loop experiments and proposed that they arose from the 60–75 ms synaptic delay inherent in the fly visual system (Geiger & Poggio, 1981; Poggio & Reichardt, 1981).

3.2 Duplicating Experiments with the Silicon System. We were able to replicate these experiments with our silicon analog of the optomotor system (see Figure 4b). To provide visual stimulation, we used an LED display with a 200 Hz refresh rate, which is currently being used to test flies in closed-loop experiments. The stimulation time schedule was identical to that of the fly experiments, but an angular velocity of 50 degrees s⁻¹ was used. Our chip had a much smaller field of view (10 degrees) than the fly's, so we set the stimulus distance such that the EMD array saw approximately one wavelength of the pattern. The output signal from the silicon model of the HS cell was passed through an off-chip first-order low-pass filter with a time constant of 680 ms, modeling the behavior of the thoracic motor centers (Egelhaaf, 1987; Wolf & Heisenberg, 1990; Warzecha & Egelhaaf, 1996). The filtered output of the chip was treated exactly like the signal from the torque meter in the fly experiments, and closed-loop experiments were run in real time. Figure 5c shows torque and position data from the chip for an individual trial, and Figure 5d shows the response averaged over 100 trials. The silicon system shows the same ability to largely cancel the imposed motion.
The fly showed an average drift of 9.4% of the open-loop drift velocity, with position fluctuations of 7.8 degrees (standard deviation) about this drift. The chip showed an average drift of 22% of the open-loop drift velocity, with position fluctuations of 6.2 degrees (S.D.) about this drift. Also, we observe the same 2 Hz oscillations in the individual trials. Since we did not build any explicit delay into our system, this demonstrates that the phase lags and nonlinearities in this simple model are sufficient to produce oscillations, even in the absence of additional synaptic delays. In future experiments, we hope to explore how these parameters determine the oscillation frequency and amplitude.
Figure 4: Facing page. Experimental methodology. (a) Setup used by Warzecha and Egelhaaf (1996) to measure the closed-loop torque response of the sheep fly Lucilia. The torque meter output is scaled to produce a measure of what the fly's self-motion would be if it were free to rotate. This self-motion is subtracted from the imposed motion to determine the pattern motion, creating the illusion of free flight in a room with distant walls. (Only rotation, not translation, is simulated.) (b) Setup used to replicate the closed-loop experiments with the silicon model. The output voltage from the circuit is used in place of the torque meter output voltage. The rest of the system is identical to a.
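The closed-loop arrangement of Figure 4 can be caricatured in software. In the sketch below (my simplification, not the chip's equations: the motion detector is linearized to output = s × pattern velocity, and the sensor slope s and feedback gain are hypothetical), the low-pass filtered sensor output is fed back to cancel the imposed rotation; the steady-state drift is imposed/(1 + loop gain), the same qualitative behavior as the residual drifts of 9.4% (fly) and 22% (chip) reported above.

```python
dt = 1e-3
tau = 0.680                      # off-chip low-pass filter time constant (s)
s, gain = 1.0, 10.0              # hypothetical sensor slope and feedback gain
imposed = 50.0                   # imposed rotation (deg/s), as in the chip trials

torque = 0.0
position = 0.0
for _ in range(int(10.0 / dt)):  # 10 s of simulated closed loop
    pattern_vel = imposed - gain * torque              # what the detector sees
    torque += (dt / tau) * (s * pattern_vel - torque)  # first-order motor lag
    position += pattern_vel * dt                       # residual drift integrates

residual = imposed - gain * torque                     # steady-state pattern drift
print(round(residual, 1))        # 4.5 deg/s = imposed / (1 + gain * s)
```

With loop gain s × gain = 10, the loop cancels about 91% of the imposed motion; raising the loop gain cancels more, at the price of less damping once the phase lags and delays discussed in the text are included.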
[Figure 5 appears here: panels a and b show fly torque (N·m × 10⁻⁸) and angular position (deg) versus time for an individual trial and for the average; panels c and d show the corresponding chip output (arbitrary units) and position traces; 1 s scale bars. See the caption below.]
4 Conclusion

We have demonstrated a small, power-efficient silicon system built in a standard CMOS process that replicates behavior observed in a biological sensorimotor control system. Our hardware model senses the image and generates motor commands in real time, allowing direct, quantitative comparisons to biological systems to be made using the same experimental apparatus.

There are still important differences between our hardware model and the biological system. For example, the temporal frequency sensitivity of the silicon system is much narrower than the sensitivity of lobula plate neurons (see Figure 3). Future instantiations might use spatial instead of temporal bandpass filtering to remove the DC component from the input image while maintaining temporal sensitivity. Also, time-constant adaptation in the motion detectors could be implemented and explored. It would be interesting to observe visually guided behaviors in a hardware model with adaptation enabled or disabled. This experiment might be difficult or impossible to achieve with a real animal and might lend insight into the benefits of adaptation in sensorimotor systems.

We believe this hardware modeling approach will prove increasingly valuable in the future, as biological models of the neural circuitry underlying more complex and sophisticated behaviors arise. To simulate a sensorimotor system in software, one must construct two models: a model of the biological system and a model of the world. The physical environment is an essential element in a sensorimotor feedback loop, so this world model must increase in detail as we study more advanced behaviors. Since animals interact with their three-dimensional environment in very dynamic ways, it may not be long before software simulations of sensorimotor systems require more computational resources to model the world than to model the neural circuitry of interest.
By using a physically instantiated hardware model with integrated sensors, we can replicate experiments using existing stimuli developed for studying

Figure 5: Facing page. Comparing the fly's optomotor behavior to our silicon system. (a) Torque (top panel) and angular position (bottom panel) versus time for an individual closed-loop trial with a fly. The dark horizontal bar indicates experimenter-imposed rotation. Thin lines on the position trace indicate position in the open-loop case. Most of the imposed rotation is cancelled out by the optomotor control system. Since the position is proportional to the integral of the torque (see text for details), large torque oscillations do not cause large position oscillations. (b) Averaged torque response and angular position trace for multiple trials (N = 139, 5 flies). The fly showed an average drift of 9.4% of the open-loop drift velocity, with position fluctuations of 7.8 degrees (standard deviation) about this drift. (c) Chip output signal (analogous to torque) and position versus time for the silicon system in an individual trial. (d) Averaged torque response and angular position trace for multiple trials (N = 100, 1 chip). The chip showed an average drift of 22% of the open-loop drift velocity, with position fluctuations of 6.2 degrees (S.D.) about this drift. Data in a and b redrawn from Warzecha and Egelhaaf (1996).
Reid R. Harrison and Christof Koch
animals. This approach also opens up the possibility of endowing our systems with real motor capabilities, an obvious extension of the work presented here. We could conceivably build mobile robots and test them in complex environments that would be extremely difficult to model in software. Neural models may also be implemented in software and run on digital computers using traditional CCD cameras, but if sensorimotor feedback is of interest, the software must run in real time. If mobile robotic systems are used, then the size and power advantages of the analog VLSI approach presented here would be especially beneficial.

We believe that neuromorphic engineering represents a new tool for understanding complex biological systems and as a testbed for evaluating how theoretical models perform in low-accuracy hardware embedded in a noisy world. Many of the neural systems being studied involve real-time sensory processing and motor control that greatly exceed the capabilities of modern digital computers. As we implement biological models in compact, power-efficient ways, this approach may have significant engineering uses as well.

Acknowledgments

We thank M. Egelhaaf, A.-K. Warzecha, J. Haag, and B. Kimmerle for invaluable advice and assistance and for the use of their resources; and B. Minch, H. Krapp, R. Deutschmann, and E. Slimko for useful discussions. This work was supported by the NSF Center for Neuromorphic Systems Engineering Research Center, grants from ONR, and an NDSEG fellowship (R.H.).

References

Adelson, E. H., & Bergen, J. R. (1985). Spatiotemporal energy models for the perception of motion. J Opt Soc Am A, 2, 284–299.

Andreou, A., & Strohbehn, K. (1990). Analog VLSI implementation of Hassenstein-Reichardt-Poggio models for vision computation. In Proc 1990 IEEE Symp Syst, Man, Cybern (pp. 707–710).

Borst, A., & Egelhaaf, M. (1987). Temporal modulation of luminance adapts time constant of fly movement detectors. Biol Cybern, 56, 209–215.

Borst, A., & Egelhaaf, M. (1989). Principles of visual motion detection. TINS, 12, 297–305.

Borst, A., Egelhaaf, M., & Haag, J. (1995). Mechanisms of dendritic integration underlying gain control in fly motion-sensitive interneurons. J Comput Neurosci, 2, 5–18.

Clifford, C. W. G., Ibbotson, M. R., & Langley, K. (1997). An adaptive Reichardt detector model of motion adaptation in insects and mammals. Visual Neuroscience, 14, 741–749.

Delbrück, T. (1993). Silicon retina with correlation-based, velocity-tuned pixels. IEEE Trans Neural Networks, 4, 529–541.

Delbrück, T., & Mead, C. (1996). Analog VLSI phototransduction by continuous-time, adaptive, logarithmic photoreceptor circuits (CNS Memo No. 30). Pasadena: California Institute of Technology.
de Ruyter van Steveninck, R. R., Zaagman, W. H., & Mastebroek, H. A. K. (1986). Adaptation of transient responses of a movement-sensitive neuron in the visual system of the blowfly Calliphora erythrocephala. Biol Cybern, 54, 223–236.

Egelhaaf, M. (1987). Dynamic properties of two control systems underlying visually guided turning in house-flies. J Comp Physiol A, 161, 777–783.

Egelhaaf, M., & Borst, A. (1992). Are there separate ON and OFF channels in fly motion vision? Visual Neuroscience, 8, 151–164.

Egelhaaf, M., & Borst, A. (1993). A look into the cockpit of the fly: Visual orientation, algorithms, and identified neurons. J Neurosci, 13, 4563–4574.

Egelhaaf, M., Hausen, K., Reichardt, W., & Wehrhahn, C. (1988). Visual course control in flies relies on neuronal computation of object and background motion. TINS, 11, 351–358.

Etienne-Cummings, R., & Van der Spiegel, J. (1996). Neuromorphic vision sensors. Sensors and Actuators A, 56, 19–29.

Franceschini, N., Pichon, J. M., & Blanes, C. (1992). From insect vision to robot vision. Phil Trans R Soc Lond B, 337, 283–294.

Franceschini, N., Riehle, A., & le Nestour, A. (1989). Directionally selective motion detection by insect neurons. In Stavenga & Hardie (Eds.), Facets of vision. Berlin: Springer-Verlag.

Geiger, G., & Nässel, D. R. (1981). Visual orientation behaviour of flies after selective laser beam ablation of interneurones. Nature, 293, 398–399.

Geiger, G., & Nässel, D. R. (1982). Visual processing of moving single objects and wide-field patterns in flies: Behavioral analysis after laser-surgical removal of interneurons. Biol Cybern, 44, 141–149.

Geiger, G., & Poggio, T. (1981). Asymptotic oscillations in the tracking behaviour of the fly Musca domestica. Biol Cybern, 41, 197–201.

Harrison, R. R., & Koch, C. (1998). An analog VLSI model of the fly elementary motion detector. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 880–886). Cambridge, MA: MIT Press.
Hassenstein, B., & Reichardt, W. (1956). Systemtheoretische Analyse der Zeit-, Reihenfolgen- und Vorzeichenauswertung bei der Bewegungsperzeption des Rüsselkäfers Chlorophanus. Z Naturforsch, 11b, 513–524.

Hausen, K. (1981). Motion sensitive interneurons in the optomotor system of the fly: I. The horizontal cells: Structure and signals. Biol Cybern, 45, 143–156.

Hausen, K. (1982). Motion sensitive interneurons in the optomotor system of the fly: II. The horizontal cells: Receptive field organization and response characteristics. Biol Cybern, 46, 67–79.

Hausen, K. (1984). The lobula-complex of the fly: Structure, function, and significance in behaviour. In M. A. Ali (Ed.), Photoreception and vision in invertebrates (pp. 523–559). New York: Plenum.

Hausen, K., & Wehrhahn, C. (1990). Neural circuits mediating visual flight control in flies: II. Separation of two control systems by microsurgical brain lesions. J Neurosci, 10, 351–360.

Hengstenberg, R. (1982). Common visual response properties of giant vertical cells in the lobula plate of the blowfly Calliphora. J Comp Physiol A, 149, 179–193.

Himmelbauer, W., Furth, P. M., Pouliquen, P. O., & Andreou, A. G. (1996). Log-domain filters in subthreshold MOS (Tech. Rep.). Baltimore: Johns Hopkins University.

Krapp, H. G., & Hengstenberg, R. (1996). Estimation of self-motion by optic flow processing in single visual interneurons. Nature, 384, 463–466.

Lewis, M. A. (1998). Visual navigation in a robot using zig-zag behavior. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 822–828). Cambridge, MA: MIT Press.

Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.

Minch, B. A., Diorio, C., Hasler, P., & Mead, C. (1996). Translinear circuits using subthreshold floating-gate MOS transistors. Analog Int Circuits and Signal Processing, 9, 167–179.

Moini, A., Bouzerdoum, A., Eshraghian, K., Yakovleff, A., Nguyen, X. T., Blanksby, A., Beare, R., Abbott, D., & Bogner, R. E. (1997). An insect vision-based motion detection chip. IEEE J Solid-State Circuits, 32, 279–284.

O'Carroll, D. C., Bidwell, N. J., Laughlin, S. B., & Warrant, E. J. (1996). Insect motion detectors matched to visual ecology. Nature, 382, 63–66.

Poggio, T., & Reichardt, W. (1981). Visual fixation and tracking by flies: Mathematical properties of simple control systems. Biol Cybern, 40, 101–112.

Reichardt, W., & Egelhaaf, M. (1988). Properties of individual movement detectors as derived from behavioural experiments on the visual system of the fly. Biol Cybern, 58, 287–294.

Reichardt, W., & Poggio, T. (1976). Visual control of orientation behaviour in the fly. Quarterly Reviews of Biophysics, 9, 311–375.

Sarpeshkar, R., Bair, W., & Koch, C. (1993). Visual motion computation in analog VLSI using pulses. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems, 5 (pp. 781–788). San Mateo, CA: Morgan Kaufmann.

Single, S., Haag, J., & Borst, A. (1997). Dendritic computation of direction selectivity and gain control in visual interneurons.
J Neurosci, 17, 6023–6030.

Srinivasan, M. V., Chahl, J. S., & Zhang, S. W. (1997). Robot navigation by visual dead-reckoning: Inspiration from insects. Intl J Patt Recog and Art Int, 11, 35–47.

Warzecha, A.-K., & Egelhaaf, M. (1996). Intrinsic properties of biological motion detectors prevent the optomotor control system from getting unstable. Phil Trans R Soc Lond B, 351, 1579–1591.

Weckström, M., Juusola, M., & Laughlin, S. B. (1992). Presynaptic enhancement of signal transients in photoreceptor terminals in the compound eye. Proc R Soc Lond B, 250, 83–89.

Wolf, R., & Heisenberg, M. (1990). Visual control of straight flight in Drosophila melanogaster. J Comp Physiol A, 167, 269–283.

Received December 17, 1998; accepted January 26, 2000.
LETTER
Communicated by Lloyd Watts
Efficient Event-Driven Simulation of Large Networks of Spiking Neurons and Dynamical Synapses

Maurizio Mattia
Paolo Del Giudice
Physics Laboratory, Istituto Superiore di Sanità, 00161 Roma, Italy

A simulation procedure is described for making feasible large-scale simulations of recurrent neural networks of spiking neurons and plastic synapses. The procedure is applicable if the dynamic variables of both neurons and synapses evolve deterministically between any two successive spikes. Spikes introduce jumps in these variables, and since spike trains are typically noisy, spikes introduce stochasticity into both dynamics. Since all events in the simulation are guided by the arrival of spikes, at neurons or synapses, we name this procedure event-driven. The procedure is described in detail, and its logic and performance are compared with conventional (synchronous) simulations. The main impact of the new approach is a drastic reduction of the computational load incurred upon introduction of dynamic synaptic efficacies, which vary organically as a function of the activities of the pre- and postsynaptic neurons. In fact, the computational load per neuron in the presence of the synaptic dynamics grows linearly with the number of neurons and is only about 6% more than the load with fixed synapses. Even the latter is handled quite efficiently by the algorithm. We illustrate the operation of the algorithm in a specific case with integrate-and-fire neurons and specific spike-driven synaptic dynamics. Both dynamical elements have been found to be naturally implementable in VLSI. This network is simulated to show the effects on the synaptic structure of the presentation of stimuli, as well as the stability of the generated matrix to the neural activity it induces.

1 Introduction
Computer simulations of model neural networks with feedback play an increasingly prominent role in the effort to model brain functions. Besides complementing theory when it cannot fully describe the richness of the phenomenology embedded in the mathematical model, simulations aspire to be predictive and realistic enough to serve as a partial guide in planning the experiments and helping in the interpretation of real data. Although there is wide agreement on these statements, network simulations lack a crucial ingredient: synaptic dynamics. Learning has been

Neural Computation 12, 2305–2329 (2000) © 2000 Massachusetts Institute of Technology
incorporated in the modeling of (recurrent) feedback neural networks thus far as a static prescription for synaptic efficacies, according to some chosen "learning rule." In this way, a long and important series of results has been obtained, which have made these models candidates for an account of the retrieval and maintenance of "learned" internal representations of a stimulus in tasks implying a working memory (see, e.g., Amit, 1995; Amit & Brunel, 1997a, 1997b). But learning is guided by neural activities, which in turn are governed by the acquired synapses. Hence, for these models to make a leap toward a realistic description of learning, as it results from the dynamical interaction with the environment, a mechanism has to be incorporated in the model to describe the evolution of synaptic variables. Such an evolution will be driven by neural activities, which are driven in turn by afferent stimuli.

With dynamical synapses, an incoming stimulus will not only make learned information pop up from the memory store and be put in an active state, but will also contribute to expanding the store. Moreover, if synaptic dynamics is a function of neural activities, then the question arises of the stability of any acquired synaptic stock and any activity distribution. This is true both for underlying levels of spontaneous activity and for computational expressions by selective activity distributions. In other words, any neural activity distribution may produce synaptic changes and consequently start drifting away from what was considered spontaneous rates or a computationally relevant distribution. The question of stability, for any computational paradigm, must be considered central, in the sense that both the selectivity of the computational outcome and its separability from noise (spontaneous activity) are put into question by synaptic dynamics. Thus, simulating the joint dynamics is doubly necessary.

There are serious problems in carrying out such a program.
Semirealistic networks must contain a large number of cells to reproduce realistic spike dynamics. A typical cortical module contains on the order of 10^5 cells. Even with fixed synapses and simple integrate-and-fire (IF) neurons, a standard workstation will run a typical simulation of a network of this size with a very high ratio between the CPU time and the network's "biological" time. One might hope that a suitably rescaled network of O(10^4) neurons would be feasible and still capture the most relevant aspects of collective behavior. But the number of synapses is a factor of 10^3–10^4 greater than the number of neurons. If synaptic dynamics is to be an organic part of the simulation, then the number of dynamical variables threatens to increase by a corresponding factor, making the simulation impracticable. To this one should add the expectation that synaptic dynamics is intrinsically slower than neural dynamics, which implies that in order to observe the effects of synaptic dynamics on the neural activities, longer simulations (in neural time) are required, just when everything is about to slow down because of the number of dynamical variables. It is therefore important to devise and explore simulation strategies that could make feasible large-scale simulations with double, coupled dynamics of neurons and synapses.

We have found some attempts in this direction. An event-driven approach, similar in concept to the approach presented here, was first proposed by Watts (1994). It is rather general and flexible but not very suitable for large networks with extensive randomness, because a specific "event" is generated and handled for every modification of every element in the network. For example, a spike that affects cN neurons will entail the management of cN events, with the associated computational load of the corresponding time ordering. These features of the simulation are consistent with its intended purpose of simulating small networks to be implemented in silicon, and envisage the possibility of simulating complex neural elements.

GENESIS (Bower & Beeman, 1998), a widely used, general-purpose simulation software, can serve as a basis for a wide class of simulation strategies. It has been used mostly in simulations of relatively small networks of complex, multicompartment model neurons. In computational terms, it is essentially a synchronous simulation, with a wide range of choices for the numerical integration method; a recently proposed evolution of the original package (PGENESIS, parallel GENESIS) implements a distributed version of clocked, synchronous simulations over several processors. It may be considered partially asynchronous as far as the management of the communication between modules is concerned.[1]

SpikeNET (Delorme, Gautrais, van Rullen, & Thorpe, 1999) is a simulator with the stated purpose of simulating large networks of IF neurons.
In it, the single-neuron dynamics is fully synchronous, with a surprisingly high value for the time step (dt = 1 ms for neurons spiking at 1 Hz); the network operates in a feedforward fashion and is constrained to be sparsely connected: each layer propagates to the next one a list of neurons that spiked in the last dt (each spike affecting about 50 neurons in a network of 400,000 neurons); finally, there is no ongoing synaptic dynamics (learning is effected in a supervised, noniterative fashion).

Here we describe a way of exploiting specific features of the dynamics of IF neurons and a class of dynamical synapses driven by spikes in order to achieve a gentle scaling of the computational complexity of the full simulation with the network size. In section 2 we summarize the general idea. In section 3 we set the context of the simulation, listing the main features of the networks we consider. In section 4 we provide a detailed description of the algorithm. In section 5 we define a specific context for a simulation whose results are sketched in section 6. In section 7 we describe the performance of the algorithm. In the appendix we illustrate possible extensions and improvements of the algorithm, some of which are the subject of work in
[1] See http://www.psc.edu/Packages/PGENESIS/.
progress. A summary and outlook are provided in section 8. A preliminary account of this work is given in Mattia, Del Giudice, and Amit (1998).

2 An Event-Driven Approach
The usual approach to numerical simulations of networks of IF neurons is based on the finite-difference integration of the associated differential equations. Whatever method is used (such as Euler or Runge-Kutta), a time step Δt is chosen that serves as a clock, providing the timing for the synchronous update of the dynamical variables. Δt also sets a cutoff on the temporal resolution of the simulation and, therefore, on the ability to capture short-time dynamical transients (Hansel, Mato, Meunier, & Neltner, 1998). Because of the role played by Δt, we call these simulations synchronous.

To get a quantitative estimate of the scaling of the computational complexity of a synchronous simulation when synaptic dynamics is introduced, suppose there are N neurons in the network and that (1) the connectivity is c (a spike emitted by a neuron is communicated on average to cN neurons) and (2) neurons emit on average ν spikes per second. With a time step of Δt seconds, one has on average cN/Δt synaptic updates per second per neuron. On the other hand, Δt must be limited to prevent conflicts between arriving spikes: the average number of spikes per second received by a neuron is νcN, so that the average interval between two successive spikes hitting a neuron is 1/(νcN); but in order not to miss dynamical effects possibly related to single spikes, the probability of having more than one spike received by a neuron in Δt should be negligible. This implies the bound Δt ≪ 1/(νcN). Therefore, the number of synaptic updates per second per neuron is ≫ νc²N², which leads to an N² scaling of the computational complexity per neuron.

In the context of simulations with fixed synapses (and with several other simplifications), a few suggestions have been put forward (Hansel et al., 1998; Tsodyks, Mit'kov, & Sompolinsky, 1993), pointing to the development of more efficient synchronous algorithms.
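To make this estimate concrete, the bound can be evaluated numerically. The following sketch is ours (the parameter values are illustrative, not taken from the article); it compares the synchronous update count with the event-driven cost of νcN updates per second per neuron derived later in the text:

```python
def synchronous_updates_per_neuron(N, c, nu, safety=10.0):
    """Synaptic updates per second per neuron in a synchronous scheme.
    The time step is chosen a factor `safety` below the single-spike
    bound dt << 1/(nu*c*N), so the result is safety * nu * c**2 * N**2."""
    dt = 1.0 / (safety * nu * c * N)
    return c * N / dt

def event_driven_updates_per_neuron(N, c, nu):
    """Event-driven cost: each of the nu spikes/s touches cN synapses."""
    return nu * c * N

# Illustrative values: 10^4 neurons, 10% connectivity, 5 spikes/s.
N, c, nu = 10_000, 0.1, 5.0
sync = synchronous_updates_per_neuron(N, c, nu)
event = event_driven_updates_per_neuron(N, c, nu)
print(f"synchronous : {sync:.3g} updates/s per neuron")
print(f"event-driven: {event:.3g} updates/s per neuron")
```

With these values the synchronous scheme performs roughly 10^4 times more synaptic updates per neuron, which is the N² versus N scaling discussed above.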
Margins for improvement, already at the purely neural level, are provided by the following observation. In realistic conditions, the neurons emit at low rates, with a high variability in the time intervals between successive spikes. Because of this, a synchronous algorithm will spend most of the time effecting the analog update of the depolarization of the IF neuron, whose evolution is deterministic between successive afferent spikes and could be exactly interpolated.

Assume also that the synaptic dynamics is spike driven and is deterministic between spikes. This is a rather generic assumption, leaving open the particular type of dynamics. Specific choices will be discussed below. In that case, the events (spikes) whose occurrence determines changes in the synaptic state will be much sparser in time, because a synapse sees only two neurons, while a neuron contacts cN synapses. The finely grained deterministic update, at the root of the N² scaling, could be eliminated if most of the computation is concentrated in the asynchronous determination of the effects of
the single spikes on neural as well as on synaptic variables, letting the deterministic interspike evolution of both be exactly interpolated at the same instants and become computationally negligible. Such an event-driven approach is proposed in this article. Its computational complexity per neuron is linear in N. In fact, each spike emitted by a generic neuron affects on average cN synapses, so an average number of νcN synapses are affected per second by that neuron, and this gives a linear dependence on N. In this approach, there is no intrinsic temporal cutoff, and the simulation is an exact solution of the evolution equations of the system, under fairly general conditions to be specified below. The simulation therefore reproduces transients on arbitrarily short timescales. This strategy is very effective, but its implementation is complicated by the various sources of inhomogeneity and randomness in the network, as we explain.

3 General Features of the Simulated Network
The simulated network to be considered has the following typical ingredients:

Neurons are of the IF type. The membrane depolarization undergoes a deterministic evolution between two subsequent incoming (excitatory or inhibitory) spikes; spikes are instantaneous events (zero length in time) generated when the neuron's depolarization surpasses a threshold. The interpolating dynamics can be generic.

In addition to recurrent spikes, all neurons in the network receive spikes from outside. Low-frequency, nonselective external spike trains implement the background activity in the area surrounding the local module; higher-rate, selective external spike trains code for afferent stimuli. The resulting external current is assumed to be random in both cases. Typically the stochastic current will be a Poissonian train of spikes, since external neurons emit consecutive spikes at random, independent instants with exponentially distributed interspike intervals (ISI), and spike emission by different external neurons is uncorrelated. The current from outside is then a superposition of uncorrelated stochastic Poisson processes. At the end of section 4.2 we describe how its generation is implemented in the simulation.

In the absence of a paradigm for a model of spike-driven synaptic dynamics, such as the IF model for the neuron, the synaptic dynamics will be assumed to be governed by the (instantaneous) presynaptic spikes and the depolarization of the postsynaptic neuron, following a recently proposed model (Annunziato, Badoni, Fusi, & Salamon, 1998; Annunziato & Fusi, 1998; Fusi, Annunziato, Badoni, Salamon, & Amit, 2000). The presynaptic spikes serve as triggers for synaptic changes,
and the postsynaptic potential determines whether the change is potentiating or depressing. Long-term memory is maintained by two discrete values of the synaptic efficacy. In this model, synaptic changes have a Hebbian character and are stochastic because of the random nature of the spikes driving them. There is some empirical support (Petersen, Malenka, Nicoll, & Hopfield, 1998) and strong logical motivation (Amit & Fusi, 1994) for considering synapses with a discrete, small set of stable values for their efficacies. Theory suggests (Amit & Fusi, 1994) that stochastic dynamics provides an effective mechanism for implementing learning with discrete synapses.

Both architectural and dynamical sources of randomness are present in the network. The first, static (or "quenched") noise is associated with the random pattern of connections among the neurons in the network and the random distribution of delays in the transmission of recurrent spikes. The latter is due to external spikes hitting the neurons in the network at random instants of time. Moreover, the recurrent spikes traveling in such a network exhibit a high degree of randomness in their times of emission. The network draws from this distributed reservoir of randomness in order to implement stochastic learning.

The recurrent network is assumed to include several interacting populations of neurons (excitatory and inhibitory neurons in the simplest case).

Both the neural and the synaptic dynamics merge stochastic and deterministic components, the first being associated with the random nature of the sequences of spikes (which act in both cases as trigger events for the dynamics) and the second expressing the evolution of the synaptic or neural state variables in the time interval between two successive afferent spikes. The algorithm to be described exploits the fact that the state variables of neurons and synapses evolve most of the time deterministically; they deviate only upon the stochastic arrival of spikes.
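For the usual leaky IF neuron, for example, the deterministic interspike evolution is an exponential decay that can be applied in closed form whenever a spike arrives. A minimal sketch of this event-driven update (class and parameter names are ours, not the article's):

```python
import math

class LeakyIFNeuron:
    """Leaky integrate-and-fire neuron updated only on spike arrival.
    Between afferent spikes the depolarization decays deterministically,
    V(t + dt) = V(t) * exp(-dt / tau), so no clocked update is needed."""

    def __init__(self, tau=0.020, threshold=1.0, v_reset=0.0):
        self.tau = tau              # membrane time constant (s)
        self.threshold = threshold  # firing threshold
        self.v = v_reset
        self.v_reset = v_reset
        self.last_update = 0.0      # time of the last afferent spike

    def receive_spike(self, t, efficacy):
        # 1. exact deterministic evolution since the last event
        self.v *= math.exp(-(t - self.last_update) / self.tau)
        self.last_update = t
        # 2. add the discrete synaptic contribution
        self.v += efficacy
        # 3. compare with the threshold; emit a spike if it is crossed
        if self.v >= self.threshold:
            self.v = self.v_reset
            return True             # spike emitted at time t
        return False
```

Nothing is computed between spikes; the decay over the whole interspike interval collapses into a single multiplication, which is what makes the event-driven scheme exact.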
The procedure will be described both for a monotonic decay of the state variables between two successive spikes (for neurons, this is the case for the usual leaky IF dynamics and also for the "linear" IF neuron; see Fusi & Mattia, 1999) and for a generic deterministic evolution between two spikes.

4 The Algorithm
The core of the algorithm is the management of the temporal hierarchy of the spikes that are generated in the network and of those that arrive from outside. We first describe the coding of the elementary events and the data structures found convenient in reducing the computational load due to the
necessary ordering of events in time. Then we go into detail on the working of the algorithm.

4.1 Main Elements and Data Structures. Each recurrent, instantaneous spike (event) emitted by an IF neuron of the simulated network is represented by the pair (i, t), where i is the emitting neuron and t is the emission time of the spike. We assume a discrete set of D ordered delays d0 < d1 < · · · < dD−1 for spike transmission. Synapses are organized in matrix-structured layers, each layer corresponding to one value of delay. The generic column i of the synaptic matrix in layer l represents all those synapses on the axon of neuron i that transmit a spike from neuron i to target neurons with delay dl. Target neurons are identified by the row index in each layer of the synaptic matrix. Each element in the column contains the synaptic (dynamic) state variables: an analog (internal) variable that evolves continuously, determining in turn the evolution of a discrete variable that represents long-term memory and acts as the synaptic efficacy, mediating the effects of presynaptic spikes. The union of all layers represents the complete synaptic matrix.

The synaptic structure is compressed, taking advantage of the sparse nature of the synaptic connections. Many vacancies would be left in a straightforward representation of the synaptic matrix; the layers are even sparser. In the compressed version of the synaptic matrix, in which only nonzero elements are stored, the generic element in one column of a given layer contains the address of the next neuron on the same portion of the axon (in terms of the difference of their indices). This coding does not imply additional complexity in scanning the axon.

The events (i, t) representing recurrent spikes are directed to the synaptic layers for neural and synaptic updates through a set of queues, one for each layer (i.e., for each delay).
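A possible rendering of the compressed column coding described above; the field and function names are ours, since the article does not prescribe a concrete data layout:

```python
from dataclasses import dataclass

@dataclass
class Synapse:
    offset: int        # index difference to the previous target on this axon
    efficacy: int      # discrete long-term efficacy (e.g., 0 or 1)
    analog: float      # internal analog synaptic variable
    last_update: float # time of the last presynaptic spike seen here

def build_column(targets_sorted, efficacies):
    """Compress one column (one presynaptic neuron, one delay layer):
    only existing synapses are stored, and each target index is coded
    as the difference from the previous target's index."""
    column, prev = [], 0
    for tgt, eff in zip(targets_sorted, efficacies):
        column.append(Synapse(offset=tgt - prev, efficacy=eff,
                              analog=0.0, last_update=0.0))
        prev = tgt
    return column

def scan_column(column):
    """Recover absolute target indices while walking the axon;
    the cumulative sum adds no complexity to the scan."""
    idx = 0
    for syn in column:
        idx += syn.offset
        yield idx, syn
```

A column with targets 3, 17, and 42 is stored as offsets 3, 14, and 25, so a full row index never needs to be kept per synapse.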
Each element in the lth queue is the pair (i, t + dl) (dl is the value of the lth delay), where (i, t) is the spike generated by neuron i at time t. This event will trigger neural updates at time t + dl, using efficacies in layer l, and synaptic updates in the same layer. Each queue is of the first-in-first-out (FIFO) type: events are ordered in time, with the oldest element at the top of the list.

External spikes from outside the network are modeled by stochastic trains of instantaneous events impinging on the cells in the network. Each external spike is coded by a pair (j, t), where j is the target neuron in the network and t is the time when the external spike will be received. The external stochastic processes viewed by different neurons are assumed to be uncorrelated and such that the linear superposition of several external streams of spikes retains the same statistical distribution as the single train; this is the case for trains of spikes with Poissonian or gaussian statistics. These choices allow significant simplification in the management of the external spikes. One can generate a single train of spikes, for example, with a Poisson distribution, store successive external spikes in a single register serving as a buffer, and choose at random the target neuron for the current spike in the register, such that each neuron in the network receives an external Poissonian spike train of prescribed frequency (see the end of section 4.2 for details). The major advantage brought about by this simplification will become clear in the next section.

4.2 Life Cycle of a Spike. The algorithm is explained with reference to Figure 1. The simulation is started by igniting activity in the network. This is done either by prefilling the queues of events or by starting with empty queues and having the network driven by incoming external spikes. The time unit is provided by one of the parameters with dimensions of time that enter the dynamics: the absolute refractory period, the frequency of external (Poisson) events, or the decay constant of the neuron's depolarization.

Suppose a spike is emitted by neuron k at time T (A in Figure 1). The event (k, T) enters a first queue Q0, associated with the first synaptic layer (with minimal delay d0), and its time label is updated to T + d0, which is when the spike will affect its target neurons (B). The event waits to be handled until its temporal label becomes the oldest in the queue (i.e., it is in the first position in the queue) (B). When all previously emitted spikes have affected all their postsynaptic targets with delay d0, the event reaches the first position in Q0, and at this moment it becomes the first in line to act among spikes that will act with delay d0. If, upon sorting with the top elements of the other queues (F), it turns out to be the oldest event around, it starts affecting the synapses in column k of the first layer (C) and updating the state of each postsynaptic neuron, identified by the row index of the synapse. Target neurons for the spike are sequentially addressed, and their depolarization is updated:
1. The time at which the last spike reached the target neuron is known. Using the time difference to the arrival of the present spike, the value of the neuron's depolarization is given by its deterministic evolution.

2. The postsynaptic contribution of the spike to the depolarization is the corresponding synaptic efficacy. It is added as a discrete value.

3. The new value of the depolarization is compared with the threshold for emission of a spike.

4. If spike emission occurs, the new event enters the queue Q0 with time label T + d0 + d0.

5. The synaptic efficacies are updated by the combination of their deterministic evolution in the interval since the arrival of the last spike and the spike-driven contribution.
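The life cycle just described (FIFO layer queues, selection of the oldest among the queue heads, and the five update steps) can be sketched in a few lines. This is a much-simplified illustration of ours, not the authors' implementation: synapses sit in a flat dictionary rather than in the layered matrix of Figure 1, and all parameter values are made up.

```python
from collections import deque

# Toy values (ours): theta = 1, linear decay B, D = 3 delay layers.
THETA = 1.0
B = 4.0                                  # decay of the depolarization (theta/s)
DELAYS = [0.001, 0.002, 0.003]           # d_0 < d_1 < d_2, one layer each

queues = [deque() for _ in DELAYS]       # FIFO event queues, one per layer
V = {}                                   # neuron -> depolarization
last_update = {}                         # neuron -> time of last afferent spike
J = {(0, 1, 0): 0.6}                     # (pre, post, layer) -> efficacy (toy synapse)

def emit(neuron, t):
    """A spike emitted at time t enters the first queue, labeled t + d_0."""
    queues[0].append((neuron, t + DELAYS[0]))

def process_next():
    """Handle the globally oldest event among the heads of the D queues."""
    live = [l for l in range(len(queues)) if queues[l]]
    if not live:
        return
    l = min(live, key=lambda q: queues[q][0][1])   # sort only the D queue heads
    pre, t = queues[l].popleft()
    # The paper scans column `pre` of layer l; a flat dict is used for brevity.
    for (p, post, layer), eff in J.items():
        if p != pre or layer != l:
            continue
        # 1. deterministic (linear) decay since the last afferent spike
        v = max(0.0, V.get(post, 0.0) - B * (t - last_update.get(post, t)))
        # 2. add the discrete synaptic efficacy
        v += eff
        last_update[post] = t
        # 3.-4. threshold check; an emitted spike re-enters the first queue
        if v >= THETA:
            v = 0.0
            queues[0].append((post, t + DELAYS[0]))
        V[post] = v
    # the event then moves on to the next delay layer, relabeled accordingly
    if l + 1 < len(queues):
        queues[l + 1].append((pre, t - DELAYS[l] + DELAYS[l + 1]))
```

A spike popped from layer l with label t was emitted at t − dl, so its label in the next layer becomes t − dl + dl+1, which keeps each queue ordered without explicit sorting.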
Efficient Event-Driven Simulation
Figure 1: Schematic illustration of the algorithm. (From the left) A portion of the network: three neurons, two of which emit spikes at different times; the main data structures used in the simulation: the queues of events associated with each layer, represented as a stack of arrays; then the synaptic matrix, a superposition of D grids, one for each synaptic delay (the layers), each element containing the synaptic variables, including the discrete efficacy. The arrows illustrate the main successive steps taken in the management of an event, and the letters mark the main phases in the lifetime of a spike. See the text for details.
Here, for simplicity, we have taken the time at which a spike updates the depolarization of a neuron to be the same as that at which it updates the synaptic variable (see section 5 for details of the chosen synaptic dynamics). This limitation can be relaxed (see the appendix). Once all the postsynaptic updates associated with delay d0 are completed, the event is appended, with time label T + d1 (d1 > d0), to the bottom of the next queue Q1, attached to the layer with delay d1, and it is handled as above (D). Queues are filled in such a way that the time labels are automatically ordered inside each queue (the first element is guaranteed to be the oldest of its queue), without need of further sorting. However, in order to choose the oldest among all events, it is necessary to sort the first events of all the queues to be processed. With a small number of allowed discrete values for the delays, this is a negligible additional computational load. We have assumed that the neuron's depolarization undergoes a monotonic decay between successive incoming spikes. This ensures that a spike can be generated only at the time another spike arrives (see the appendix for possible extensions). The life cycle of an external spike is simpler, because it carries no delay and has a unique target, and it is very much simplified by the statistical assumptions we have made. In the simulation, a single random generator produces trains of external spikes with assigned frequencies. If the last external spike is received by the network at time tn, the next one will arrive at time tn+1 = tn + Δt, where Δt is randomly chosen from the exponential density function

p(Δt) = νext cext N² e^(−νext cext N² Δt),    (4.1)
where we assume that external neurons emit spikes at a mean rate νext and that the average number of external connections received by a neuron in the network is cext N. The generation of the new event is completed by choosing at random from the recurrent network, using a uniform distribution, the target neuron for the external spike. In this way, each neuron sees incoming spikes with Poissonian statistics of mean rate νext cext N. Random selection of the receiver ensures that the trains of external spikes afferent on different neurons are statistically independent. Each new external event is stored in a register (E), and its time label participates in the sorting process together with the top elements of the queues. As soon as the event in the register has been processed, it is replaced by a new one. The corresponding computational load per second of simulation (in "biological time") is the generation of an average of νext cext N external spikes per neuron, which is of O(N). The choice of independent Poisson trains of external spikes, which allows us to use a single random generator and a single register for all external spikes, is crucial in keeping low the load related to spike sorting (see section 4.3).
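The scheme of equation 4.1 (a single exponential clock for the whole network, at total rate νext cext N², feeding a uniformly random target) can be sketched as follows; the parameter values and names are toy choices of ours:

```python
import random

# Toy values (ours): N neurons, each receiving c_ext * N external
# connections that fire at rate nu_ext each.
N = 1000
NU_EXT = 8.0          # rate of one external neuron (Hz)
C_EXT = 0.1

rate_per_neuron = NU_EXT * C_EXT * N    # Poisson rate seen by one neuron
network_rate = rate_per_neuron * N      # total rate: nu_ext * c_ext * N^2

def next_external_event(t_now, rng=random):
    """Next external spike for the whole network: exponential interval
    drawn at rate nu_ext * c_ext * N^2 (equation 4.1), uniform target."""
    dt = rng.expovariate(network_rate)
    return rng.randrange(N), t_now + dt
```

Because the target is drawn uniformly, each neuron individually sees a Poisson train of rate νext cext N, as stated in the text.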
For a network composed of several populations of neurons (e.g., a population of inhibitory neurons and several populations of excitatory neurons, activated by different external stimuli), one spike generator with appropriate frequency and one register are required for each population. Sorting then involves only D + P elements, where P is the number of populations in the simulated network (different populations may in general receive external currents with different statistical properties), with an additional complexity of O(log2(D + P)), which is negligible in the usual situation D + P ≪ N (see section 7 and Figure 6).

4.3 Remarks on Specific Features of the Algorithm.
4.3.1 Layered Structure. Coding the synaptic matrix in a single layer would imply, for each spike to be processed, scanning the whole axon as many times as there are delays in order to choose the appropriate neurons for update. Also, the spike just processed for one value of the delay would have to be reinserted into the global pool of events to be processed (the single queue of length, say, m), implying an additional log2(m) computational effort (insertion of one element in an ordered list), to be multiplied by the number of delays. To estimate the typical value of m, we note the following. Each spike emitted in the network will travel along its axon for a time dD−1 (provided at least one target synapse exists for that neuron with maximal delay dD−1). That spike will therefore be "active," waiting for all its effects to be computed by the simulation, for a time window dD−1 after its emission. In other words, at time t, only spikes emitted in the time span (t − dD−1, t) will coexist on the same axon. For reasonably low frequencies and a maximal delay of a few milliseconds, the average number of spikes traveling simultaneously on the same axon (νdD−1) in the time window (t − dD−1, t) is very small. But when we consider the whole network, all neurons contribute a factor νdD−1 to the average number of active events in the supposed single queue at time t, so that m = NνdD−1. For example, in a regime of spontaneous activity and in the absence of external spikes, a large network of N = 10^5 neurons emitting ν = 2 spikes per second, with maximal delay dD−1 = 3 ms, has an average of NνdD−1 = 600 active events at any given time t. So a "brute force" (single-layer) event-driven approach would have to manage a single queue with an additional computational load of O(D log2(dD−1 ν N)).

4.3.2 Generation and Handling of External Spikes.
Having a separate generator of external spikes for each neuron in the network would imply a relatively small increase in memory allocation but a huge increase in the computational load, due to the additional sorting required: instead of sorting, for each spike, the D top elements of the queues plus the elements in the registers for the external spikes, D + N elements would have to be sorted.
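The arithmetic behind the single-queue estimate of section 4.3.1 can be checked directly; the numbers are the ones quoted in the text, and the cost comparison on the last line is our own illustration of the layered alternative.

```python
import math

# Numbers quoted in section 4.3.1: N = 1e5 neurons firing at nu = 2 Hz,
# maximal delay d_{D-1} = 3 ms; D = 4 delays as in Table 1.
N = 100_000
nu = 2.0
d_max = 0.003
D = 4

m = N * nu * d_max              # average length of a single global queue
cost_single = D * math.log2(m)  # brute force: D reinsertions, log2(m) each
cost_layered = math.log2(D)     # layered scheme: sort only the D queue heads
```

With these values, m is 600 active events and a single global queue costs roughly D · log2(600) ≈ 37 comparisons per spike, against about 2 for sorting the four queue heads.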
4.3.3 Simultaneous Event-Driven Management of "Fast" and "Slow" Dynamics. One might think of a synchronous algorithm based on the simultaneous use of two very different Δt for the neural and the synaptic evolution. But we are typically interested in slow synaptic dynamics, in which the (stochastic) changes in the synaptic state occur with small probabilities. This implies that the synaptic dynamics is governed by rare, fast bursts of spikes, and a big Δt would fail to reproduce correctly the most important regime. A big Δt for the synapses would also conflict with the resolution with which the time of emission of the spikes is determined, constrained in turn by the neural Δt. The input signal to the synapse would therefore be affected by a quantization error, which could in principle propagate back to the neural dynamics and distort it (for example, in regimes of fast global oscillations).

5 A Specific Context: Linear IF Neuron and Spike-Driven Synapses
We adopt a linear IF neuron as an example. Such a neuron is suitable for VLSI implementation, and networks of such neurons retain most of the collective features of the leaky IF neuron (Fusi & Mattia, 1999). The choice of the neuronal model is not critical for the algorithm. The recurrent network includes NE excitatory neurons (possibly including subpopulations activated by different external stimuli) and NI inhibitory neurons. The neuronal depolarization integrates linearly the afferent stream of spikes, with a constant decay b between two subsequent incoming spikes. Each spike from neuron j contributes a jump in the depolarization Vi of the receiving neuron i, equal to the synaptic efficacy Jij. If Vi crosses a threshold θ, a spike is emitted, and Vi is reset to H ∈ (0, θ), where it is forced to remain for a time τarp, the absolute refractory period. In what follows, H = 0. A reflecting barrier at 0 forces Vi to stay at 0 whenever it is driven toward negative values, by either the constant decay or inhibitory spikes. This dynamics is illustrated in Figure 2. The synapses connecting pairs of excitatory neurons are plastic; all the others are fixed. The dynamics of the plastic synapses is defined as in Annunziato et al. (1998), where an analytic treatment of the model is described, together with results from a VLSI implementation. (For other models of spike-driven synaptic dynamics, see Häfliger, Mahowald, & Watts, 1997; Senn, Tsodyks, & Markram, 1997.) It is a specific implementation of a stochastic, Hebbian learning dynamics. The synaptic efficacy J takes one of two values, J0 (depressed) and J1 (potentiated), and the learning dynamics evolves as a sequence of random transitions between J0 and J1, driven by the pre- and postsynaptic activities.
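The single-neuron rule just described (linear decay b, discrete jumps Jij, threshold θ, reset to H, reflecting barrier at 0, refractory period τarp) reduces to one event-driven update per afferent spike. This is a sketch of ours with toy parameter values, not the authors' code:

```python
# Toy parameter values (ours); theta = 1 serves as the depolarization unit.
THETA = 1.0      # spike emission threshold
H = 0.0          # reset value of the depolarization
B = 4.0          # constant linear decay (theta per second)
T_ARP = 0.002    # absolute refractory period (s)

def receive_spike(v, t_last, t_spike, j_eff, t_last_emission=-1e9):
    """Event-driven update of the depolarization v (last updated at t_last)
    when a spike of efficacy j_eff arrives at t_spike.
    Returns (new depolarization, whether a spike is emitted)."""
    if t_spike - t_last_emission < T_ARP:
        return v, False                           # clamped while refractory
    v = max(0.0, v - B * (t_spike - t_last))      # linear decay, barrier at 0
    v = max(0.0, v + j_eff)                       # jump; inhibition cannot push v < 0
    if v >= THETA:
        return H, True                            # emit and reset to H
    return v, False
```

Because the interspike evolution is a deterministic decay, the depolarization never needs to be touched between afferent spikes, which is what makes the event-driven bookkeeping exact.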
Figure 2: Illustration of the dynamics of the neural depolarization V(t) (see the text). The evolution of V(t) is governed by the equation dVi(t)/dt = −b + Ii(t), where Ii(t) = Σ_{j=1,...,N; j≠i} Jij Σ_k δ(t − tj^(k) − dij) is the afferent current to neuron i, resulting from the sequences of spikes emitted by the presynaptic neurons j at times tj^(k) and transmitted with delays dij, whose postsynaptic contribution to Vi(t) is Jij.

An internal synaptic (dimensionless) variable ("synaptic potential," VJ) describes the short-time analog response of the synapse to pre- and postsynaptic events and determines the transitions between the states J0 and J1, as follows (see Figure 3). VJ is confined to vary in the interval [0, 1]. Upon arrival of a presynaptic spike, VJ undergoes a positive (negative) jump of size a (b) if the postsynaptic depolarization is found to be above (below) a threshold θV (not the spike emission threshold; θ > θV). The accessible interval for VJ is split in two parts by a synaptic threshold θJ: the synaptic efficacy J = J0 if VJ < θJ, and J = J1 if VJ > θJ. Whenever a jump brings VJ above (below) θJ, J undergoes a transition: J0 → J1 (long-term potentiation, LTP) or J1 → J0 (long-term depression, LTD). In the time between two successive presynaptic spikes, VJ is linearly driven toward 0 (1) if the last spike left it below (above) θJ. The extreme values 0 and 1 act as reflecting barriers for VJ.

6 Synaptic Structuring: Expectations and Results
The synaptic dynamics we have described leads to the following qualitative expectations. First, since any change in the synaptic state is triggered by a presynaptic spike, synaptic modifications will be rare for low-frequency presynaptic neurons. Second, the synaptic potential VJ is attracted toward the barriers at 0 and 1, so in order to provoke a change in the synaptic efficacy J, the rate of trigger events has to be high enough to overcome the restoring force (that is, the time between two successive presynaptic spikes must be shorter than the time needed for the linear decay to compensate for the jumps in VJ). Third, given the statistics of presynaptic spikes, the probability distribution of the postsynaptic depolarization determines the relative frequency of up and down jumps in VJ, and therefore of potentiations (LTP) and depressions (LTD). A postsynaptic neuron emitting at a high rate will frequently have its depolarization near the threshold for spike emission θ, hence above θV, while a low-rate neuron spends most of its time near the reflecting barrier at 0, and V is likely to be below θV. In the case of interest, in which the transition probabilities for J are low, a synaptic transition can occur only when a fluctuation produces a burst of presynaptic spikes coherently pushing VJ up or down. The expectations for the learning scenario due to the above mechanism are therefore that the synaptic structure should be unaffected by low presynaptic activity, regardless of the postsynaptic neural state, while high presynaptic activity will provoke potentiation (LTP) or depression (LTD) for high or low spike rates, respectively, in the postsynaptic neuron.

Figure 3: Illustration of the synaptic dynamics as driven by the activities of the pre- and postsynaptic neurons. (Top) Depolarization and spikes of the presynaptic neuron. (Bottom) Depolarization and spikes of the postsynaptic neuron. (Middle) Induced time course of the two synaptic variables: the synaptic efficacy J (thick curve) and the synaptic potential VJ (thin curve). In the absence of presynaptic spikes, if VJ(t) < θJ it drifts down; if VJ(t) > θJ it drifts up. Presynaptic spikes trigger abrupt changes in VJ: upon arrival of a presynaptic spike, VJ undergoes a positive jump of size a if the postsynaptic depolarization is above the threshold θV, and a negative jump of size b otherwise. As long as VJ(t) < θJ, the synaptic efficacy J = J0; when VJ(t) > θJ, J = J1. These values represent the long-term memory. The second presynaptic spike makes VJ cross the threshold θJ (dashed line), and this provokes an instantaneous LTP transition of the synaptic efficacy, J = J0 → J = J1 (solid curve). The fifth presynaptic spike provokes a depression, LTD.

As an example of the time evolution of a recurrent network of linear IF neurons subjected to the above synaptic dynamics, we take a network of 1500 neurons (1200 excitatory, 300 inhibitory) connected by synapses with 10% connectivity (∼144,000 plastic synapses). We choose this relatively "small" network to allow a faithful graphical representation of the synaptic matrix. Table 1 summarizes the values of the main parameters of the simulation. In the initial state, a randomly chosen 10% of the excitatory synapses are potentiated; the neural parameters and the synaptic efficacies are chosen to produce a mean spontaneous excitatory activity of about 8 Hz. Box B1 in Figure 4 is a snapshot of the network in its initial state. The big square sprayed with black and white dots represents the excitatory-excitatory part of the synaptic matrix; white (black) dots stand for potentiated (depressed) synapses, and their coordinates inside the square identify the pre- and postsynaptic neurons. The average rate of each of the 1200 excitatory neurons labeling each axis of the synaptic matrix is shown in the (identical) horizontal and vertical rectangular boxes in B1–B4 (rates are averaged over the 500 ms preceding the acquisition of the synapses). The learning protocol is as follows (see boxes B2–B4 in Figure 4). There are two stimuli (S1, S2), directed to neurons 1–120 and 121–240, respectively. After 1 second of spontaneous activity, stimulus S1 is presented for 1 second (B2); the rates of neurons 1–120 can be seen to be much higher than spontaneous activity. Then, after 1 second with no stimuli, S2 is presented for 1 second (B3); the rates of neurons 121–240 are high. The network is then left without stimuli for 4 seconds (B4). The synaptic matrix is preserved.
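The spike-driven rule of section 5 that produces this structuring can be sketched per presynaptic spike. The thresholds, jumps, and efficacies below follow Table 1; the refresh rates and the function itself are our simplified reading of the model, not the authors' code:

```python
# theta_J, theta_V, a, b, Jp, Jd from Table 1; refresh rates illustrative.
THETA_J = 0.5            # threshold splitting the V_J interval
THETA_V = 0.8            # threshold on the postsynaptic depolarization
A, B = 0.37, 0.22        # up / down jumps of V_J on a presynaptic spike
ALPHA = 20.0             # refresh rate toward 0 when V_J < theta_J (1/s)
BETA = 20.0              # refresh rate toward 1 when V_J > theta_J (1/s)
J_D, J_P = 0.02, 0.065   # depressed / potentiated efficacies

def presynaptic_spike(v_j, dt, v_post):
    """Update V_J when a presynaptic spike arrives, dt seconds after the
    previous one, with postsynaptic depolarization v_post.
    Returns (new V_J, current efficacy J)."""
    if v_j < THETA_J:                       # linear refresh since last spike
        v_j = max(0.0, v_j - ALPHA * dt)    # driven toward the barrier at 0
    else:
        v_j = min(1.0, v_j + BETA * dt)     # driven toward the barrier at 1
    v_j += A if v_post > THETA_V else -B    # spike-driven jump
    v_j = min(1.0, max(0.0, v_j))           # reflecting barriers at 0 and 1
    return v_j, (J_P if v_j > THETA_J else J_D)
```

Crossing θJ from below yields LTP (J goes from Jd to Jp); crossing from above yields LTD, as in Figure 3.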
The synaptic dynamics in the simulation is in qualitative agreement with these expectations (see Annunziato et al., 1998). The figures show that potentiation occurs for high pre- and postsynaptic frequencies (during stimulation); depression occurs for high pre- and low postsynaptic frequencies; and for low presynaptic frequencies, synapses remain essentially unaffected. In the structured network, in the absence of stimuli (B4), neurons connected by potentiated synapses emit spikes at spontaneous rates that are, on average, about twice as high as the rates prior to the presentation of the stimuli (the scale in Figure 4 makes this difficult to notice). It is noteworthy that despite this significant change in rates, the structured synaptic matrix remains stable. The phenomenology exposed by the figure provides a first step toward answering some of the questions raised in section 1: the neurons split into two populations with different spontaneous rates as a result of the synaptic structuring induced by the incoming stimuli, yet the new, higher rates turn out to be low enough that they do not affect the synaptic matrix.
Table 1: Parameters of the Simulated Network.

Parameter          Value
NE                 1200
NI                 300
cE                 0.2
cI                 0.1
bE                 1 (θ/s)
bI                 4 (θ/s)
Jext               0.025 (θ)
JIE                0.027 (θ)
JEI                0.090 (θ)
JII                0.070 (θ)
ΔJ                 0.25
D                  4
d0                 1 (ms)
dD−1               3 (ms)
α                  20 (1/s)
β                  20 (1/s)
a                  0.37
b                  0.22
Jp                 0.065 (θ)
Jd                 0.02 (θ)
θJ                 0.5
θV                 0.8 (θ)
νext               8 (Hz)
νext^stim          48 (Hz)
x                  0.5
cext = (1 − x)cE   0.1

Notes: NE, NI = numbers of excitatory and inhibitory neurons. cE, cI = excitatory and inhibitory connectivities; the neural threshold θ for the emission of a spike is set to one and serves as the unit for the depolarization. bE, bI = decay parameters for excitatory and inhibitory linear neurons. Jext = efficacy of the synapses from external neurons. Jdc, d, c ∈ {E, I} = nonplastic synaptic efficacies involving inhibitory neurons (the E–E synapses, which are plastic, are treated separately; see below). ΔJ = relative variance of the efficacies of the fixed synapses, all in units of θ. α (β) = refresh term driving the synaptic dynamic variable VJ toward the lower (upper) long-term value; a (b) = up (down) jump of VJ upon a presynaptic spike (both dimensionless, because VJ is). Jp (Jd) = long-term potentiated (depressed) synaptic efficacy for the plastic synapses between excitatory neurons. θJ = dimensionless threshold for VJ. θV = threshold for the postsynaptic depolarization. νext = frequency of the external neurons; νext^stim = frequency of the external neurons when they code for stimuli. Of the cE NE excitatory connections received by a neuron, x cE NE are recurrent (x = 0.5 in our case) and cext NE = (1 − x) cE NE come from external neurons. D = number of synaptic delays, ranging from d0 to dD−1. See also the text.
Figure 4: Illustration of the stochastic learning dynamics. See the text for details.
7 Performance
In section 2 we argued that a major advantage of an (asynchronous) event-driven approach is that the computational complexity of the simulation remains linear in N (of order νcN) when the synaptic dynamics is introduced. We now provide empirical evidence for this scaling, and also a quantitative measure of the memory occupation. Figure 5 (left) shows the CPU time per neuron, in milliseconds, needed to complete a simulation of 1 second of neural time, for both static and dynamic synapses, versus the network size N. For different values of N, the synaptic efficacies are changed, keeping c = 0.1 and the other parameters fixed, in order to maintain a fixed value for the (stable) spontaneous frequencies of the excitatory (ν = 2 Hz) and inhibitory (ν = 4 Hz) populations. The test has been carried out on a Pentium II 300 MHz PC with 256 Mb RAM. The expected linear scaling of the CPU time per neuron with N is remarkably well reproduced in the test, also in the case with synaptic dynamics. It is also apparent from the figure that CPU time consumption does not differ much between static and dynamic synapses. The magnitude of the difference depends on the parameters of the network; essentially, it depends on the ratio between the number of dynamic synapses (those between excitatory neurons) and the number of intrinsically static synapses (those involving an inhibitory neuron or an external neuron). With reference to Table 1, the following factors govern the above difference: NE/NI, x, and νE/νI. For the parameters chosen for the test, the introduction of the synaptic dynamics accounts for about 6% of the CPU time, while about 30% goes to the generation of external events. The management of the queues associated with the synaptic delays accounts for another 6% in the case of D = 4 delays. Most of the remaining time is spent updating the neural depolarizations and scanning the (compressed) axon in order to determine, after each update, the next target neuron for the current spike. In Figure 5 (right) we plot the memory occupation per neuron, in Kbytes, versus N. As expected, the memory occupation is dominated by the storage of the synapses, and the constant slope of the curve indicates that the storage per neuron scales linearly with N.

Figure 5: (Left) Execution time per neuron (in milliseconds) versus N. Solid line: synaptic dynamics on; dashed line: synaptic dynamics off. (Right) Memory allocation (in Kbytes) per neuron versus N. Each synapse is stored in 8 bytes (the analog, float value of VJ, the two-valued efficacy J, and the relative address of the next synapse on the same portion of the axon, that is, with the same delay).
Figure 6: Execution time per neuron (in milliseconds) versus N, for two values of D (the number of delays). *: D = 16; o: D = 1. The relative difference stabilizes around 5% for large N.
In both plots, each point corresponds to 10 repetitions of the simulation for the corresponding value of N; the error bars are too small to be discernible. The runs contributing to each point differ in the initial neural and synaptic conditions, in the realization of the random connectivity, and in the trains of external spikes. Figure 6 shows how the execution time per neuron depends on the number D of delays. For large N, the relative time difference between D = 1 and D = 16 is about 5%, showing that the algorithm is not very sensitive to the number of delays (layers).

8 Discussion
We have described a tool for fully event-driven simulations of large networks of spiking neurons and spike-driven, dynamical synapses. We pointed out the reasons that it is efficient and flexible enough for pursuing, in semirealistic conditions, questions related to synaptic dynamics and its interaction with neural spike emission. The algorithm relies on rather general assumptions on the neural and synaptic dynamics:

1. That neurons emit spikes as a result of the spikes they receive and a deterministic evolution of the depolarization between afferent spikes; there are no constraints on the deterministic evolution of the depolarization following the arrival of an afferent spike.

2. That presynaptic spikes trigger changes in the synapses, with a deterministic evolution of the analog synaptic variable between two successive presynaptic spikes.²

3. That spikes emitted by different external neurons are statistically independent.

4. That the synaptic delays can be assumed to form a discrete (not too large) set.

We have illustrated how the simulation handles synaptic structuring in a network of linear IF neurons according to a specific, spike-driven synaptic dynamics. This particular rule is new (Annunziato et al., 1998; Annunziato & Fusi, 1998), and the results shown are meant only as a preliminary insight into its working in the context of a network of IF neurons. With the tool proposed here for the simulation of large networks with coupled neural and synaptic dynamics, ideas and hypotheses concerning the collective effects of synaptic dynamics can be developed and tested. Some were raised in section 1; others we mention next.

The stability of neural activities can be probed when the synaptic dynamics is turned on. We have checked that a network with prelearned, static synapses, clamped to a configuration compatible with the expected structure due to the LTP and LTD dynamics, is able to sustain stimulus-selective, stable collective states while maintaining a low-frequency, stable spontaneous activity. The question is whether, starting from an unstructured synaptic state, the specific synaptic dynamics used here is able to drive the noisy, fluctuating, finite-size network to the above expected synaptic structure. If not, what modification of the synaptic dynamics will do?
If the answer to the previous question is positive, one must investigate whether the (stochastic) changes in the synaptic matrix have a negligible probability in the absence of stimuli, whether those induced by the stimuli are such that the neural activities are kept stable, and whether synaptic changes also have a negligible probability when the network is reverberating in an attractor, for these constraints ensure that the network does not forget recently learned stimuli.

² If the synaptic matrices are not compressed, it is straightforward to let the synaptic evolution be driven also by postsynaptic spikes, as in Markram, Lübke, Frotscher, and Sakmann (1997) and Häfliger et al. (1997).
Simulating the coupled neural and synaptic dynamics makes it possible to study the conditions under which the system exhibits experimentally observed effects, such as the generation of context correlations (Miyashita, 1988; Yakovlev, Fusi, Berman, & Zohary, 1998), which reflect themselves in the structure of the neural selective delay activities. The simulation tool is flexible enough to allow the study of modified or alternative synaptic dynamics, allowing, for example, the test of several recently proposed synaptic models (Häfliger et al., 1997; Senn et al., 1997) and of their implications for the collective behavior of a large network. These issues are under study, together with extensions and improvements of the algorithm, some of which have already been mentioned. A particularly interesting extension is related to the fact that the event-driven algorithm is well suited for distributed computing. Each computing node would contain a subpopulation of neurons and their complete dendritic tree (the corresponding set of synapses). This choice would eliminate the need to propagate synaptic information across different computing nodes, reducing the communication load; the messages to be passed among the processing machines would be essentially spikes, or what we have called events. The extension to multiple, concurrent computing nodes requires some care. One must solve problems of the causal propagation of events: the "real" CPU time each node spends per unit of "biological" time may not be equal in all nodes, because of the different emission frequencies of the neurons simulated in each node. Because of this, a "fast" node might receive old events from a "slow" one, with time labels that are "obsolete." This is a typical synchronization problem in concurrent computing, and its solution goes beyond the scope of this article.

Appendix: Extension to Arbitrary Interspike Neural Dynamics
The simulation method described in this article deals with cases in which, in the absence of afferent spikes, the neuron's depolarization is monotonically decreasing, so that a neuron can emit a spike only upon the arrival of an excitatory spike. The technique can be extended to cases in which the neural state variables have an arbitrary deterministic evolution between consecutive afferent spikes. In such cases, when a neuron receives a spike, the time ts at which that neuron will in turn emit a spike (if at all) can be determined on the basis of the known evolution of the depolarization. But before ts, this same neuron might receive other spikes, and each one of those will in general redefine ts. In other words, if there are events to be processed in the queues at times earlier than ts, the emission of the spike predicted to occur at ts is "uncertain," and ts must be updated each time an older spike is processed. This alters the natural ordering in time of the spikes emitted in the network, which is needed to operate our algorithm. One must therefore devise a way of handling these "uncertain" events (to reproduce the correct causal dependencies) without significantly increasing the computational complexity of the simulation.

To show how this can be achieved, it is useful to consider the situation depicted in Figure 7 (it corresponds to a simplified description of the effects of dynamical conductances on V(t)). Two successive events are scheduled to reach the same neuron i at times t1 and t2 (> t1). If the evolution of the neural state following t1 is not a monotonic decay, it is possible for the target neuron i to reach the threshold and emit a spike at a time ts > t1. If ts < t2, the spike at ts becomes a "real" event and undergoes the usual processing. If ts > t2, as in Figure 7, an excitatory spike received at t = t2 will shift the next emission to t′s < ts, while an inhibitory spike will delay it to t′s > ts or suppress it altogether (the case shown in the figure). The foreseen event (i, ts) is therefore termed "uncertain."

Figure 7: Time evolution of the depolarization V(t) according to dV(t)/dt = −b + I(t), when the afferent current I(t) obeys the dynamic law τ dI(t)/dt = −I(t) + Σj Jj Σk δ(t − tj^(k) − dj), as in Amit and Tsodyks (1992). The solid curve shows the time evolution of V(t) when an excitatory spike is received at t1, followed by an inhibitory one at t2 > t1. In this condition, the neuron does not emit in the time interval shown. The dashed line shows the time course of V(t) in the absence of the second, inhibitory spike at t2; in this case, the neuron emits at t = ts > t1. The value of ts can be deterministically computed from the above equations, possibly also using efficient methods such as those described in Srinivasan and Chiel (1993).

To take the uncertain events into account, the algorithm is modified as follows. One new queue is created, in which the uncertain events are placed; all events generated in the network are now born "uncertain."
Each new uncertain event is placed at the correct position in the new queue, according to its temporal label. If the length of the queue at that moment is n, this implies an additional computational load of O(log2(n)) (insertion of an element in an ordered list; see below for an estimate of n). If an event (i, ts) is already present in the queue, each spike received by neuron i will redefine ts. As a result, the event (i, ts) may change its position in the queue, implying yet another O(log2(n)) load (in the worst case of an arbitrary shift of the position in the queue), or the event may be suppressed and deleted from the queue. These operations preserve the order of the queue, so that the initially ordered queue of uncertain events is kept ordered throughout the simulation, without any need for global sorting. An event (i, ts) becomes "real," and enters the first queue of events to be processed, when there are no more spikes to be processed in the network at times earlier than ts: no remaining event can affect ts. The additional computational complexity is then mainly due to the insertion and repositioning of uncertain events in a sorted array. The average length n of the queue of uncertain events can be estimated as the product of the mean time T a spike stays there (waiting to become real or to disappear) and the number of events Nν generated per unit time: n ≈ NνT. To get an estimate of T, we note that, on average, the interval (t1, t2) is shorter than the interval between two successive spikes emitted by neuron i. From the definition of the uncertain event (generated at t1 and scheduled to happen at ts), there cannot be a "real" spike emitted by the same neuron i between t1 and ts, so if we call ts−1 the emission time of the previous spike, T = ⟨ts − t1⟩ < ⟨ts − ts−1⟩ = 1/ν, since ts−1 < t1. Therefore, T < 1/ν.
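The ordered insertion and repositioning just described can be sketched with a sorted list and binary search. This is a simplified illustration of ours; the appendix specifies only an ordered list, not a concrete data structure:

```python
import bisect

uncertain = []   # sorted list of (t_s, neuron): predicted, still-uncertain emissions

def schedule(neuron, t_s):
    """Insert the predicted emission (i, t_s) at its ordered position:
    O(log2 n) comparisons (plus the element shift of a Python list)."""
    bisect.insort(uncertain, (t_s, neuron))

def reschedule(neuron, old_t, new_t=None):
    """An afferent spike redefines t_s: remove the old prediction and,
    unless the emission is suppressed (new_t is None), reinsert it."""
    del uncertain[bisect.bisect_left(uncertain, (old_t, neuron))]
    if new_t is not None:
        bisect.insort(uncertain, (new_t, neuron))
```

An event at the front of the list becomes "real" once no queued spike older than its t_s remains; a heap or balanced tree would avoid the linear element shift that a plain list incurs on insertion.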
In the worst case, a neuron generates an uncertain event each time it receives a spike, so per unit time the average number of insertions per neuron is cνN. Rearrangements involve only neurons with a predicted uncertain event: in the worst case there are n such events, and each spike in the network affects cn of the neurons associated with them. On average, ν spikes are emitted per unit time by each neuron, so the number of rearrangements is cnν = cν²NT. Allowing an arbitrary time evolution for the postspike depolarization will therefore cost an additional complexity per neuron of order νcN(1 + νT) log₂(NνT): cνN log₂(n) = cνN log₂(NνT) due to the insertion of new events in the ordered queue, plus cν²NT log₂(NνT) due to rearrangements in the queue. This is still a modest price to pay compared to the order-N² complexity of the synchronous simulation.
2328
Maurizio Mattia and Paolo Del Giudice
Acknowledgments
Most of this work has been discussed in depth with D. J. Amit. We heartily thank him for his many stimulating contributions. We acknowledge very useful comments from S. Fusi, D. Del Pinto, and F. Carusi. This study has been supported in part by the RM12 initiative of INFN.
Received February 22, 1999; accepted September 28, 1999.
LETTER
Communicated by Shimon Edelman
Clustering Irregular Shapes Using High-Order Neurons

H. Lipson
Mechanical Engineering Department, Technion–Israel Institute of Technology, Haifa 32000, Israel

H. T. Siegelmann
Industrial Engineering and Management Department, Technion–Israel Institute of Technology, Haifa 32000, Israel

This article introduces a method for clustering irregularly shaped data arrangements using high-order neurons. Complex analytical shapes are modeled by replacing the classic synaptic weight of the neuron by high-order tensors in homogeneous coordinates. In the first- and second-order cases, this neuron corresponds to a classic neuron and to an ellipsoidal-metric neuron, respectively. We show how high-order shapes can be formulated to follow the maximum-correlation activation principle and permit simple local Hebbian learning. We also demonstrate decomposition of spatial arrangements of data clusters, including very close and partially overlapping clusters, which are difficult to distinguish using classic neurons. Superior results are obtained for the Iris data.

1 Introduction
In classical self-organizing networks, each neuron j is assigned a synaptic weight denoted by the column vector w^(j). The winning neuron j(x) in response to an input x is the one showing the highest correlation with the input, that is, the neuron j for which w^(j)T x is largest:

$$ j(\mathbf{x}) = \arg\max_j \left\| \mathbf{w}^{(j)T} \mathbf{x} \right\|, \qquad (1.1) $$
where the operator ‖·‖ denotes the Euclidean norm of a vector. The use of the term w^(j)T x as the matching criterion corresponds to selecting the neuron exhibiting the maximum correlation with the input. Note that when the synaptic weights w^(j) are normalized to a constant Euclidean length, the above criterion becomes the minimum Euclidean distance matching criterion,

$$ j(\mathbf{x}) = \arg\min_j \left\| \mathbf{x} - \mathbf{w}^{(j)} \right\|, \qquad j = 1, 2, \ldots, N. \qquad (1.2) $$
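As a concrete illustration, the two criteria can be written in NumPy as follows (our sketch, not code from the article; `W` holds one weight vector per row):

```python
import numpy as np

def winner_correlation(W, x):
    """Winning neuron by maximum correlation, eq. 1.1: argmax_j |w_j . x|.

    W is an (N, d) array whose rows are the synaptic weights w^(j).
    """
    return int(np.argmax(np.abs(W @ x)))

def winner_distance(W, x):
    """Winning neuron by minimum Euclidean distance, eq. 1.2."""
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))
```

When the rows of `W` have equal Euclidean length, both functions select the same neuron, which is the equivalence the text describes.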
Neural Computation 12, 2331–2353 (2000) © 2000 Massachusetts Institute of Technology

However, the use of a minimum-distance matching criterion incorporates several difficulties. The minimum-distance criterion implies that the features of the input domain are spherical: matching deviations are considered equally in all directions, and distances between features must be larger than the distances between points within a feature. These aspects preclude the ability to detect higher-order, complex, or ill-posed feature configurations and topologies, since these are based on higher geometrical properties such as directionality and curvature. This constitutes a major difficulty, especially when the input is of high dimensionality, where such configurations are difficult to visualize and detect. Consider, for example, the simple arrangements of two data clusters residing in a two-dimensional domain, shown in Figure 1. The input data points are marked as small circles, and two classical neurons (a and b) are located at the centers of the clusters. In Figure 1a, data cluster b exhibits a nonisotropic distribution; therefore, the nearest-neighbor or maximum-correlation criterion does not yield the desired classification, because deviations cannot be considered equally in all directions. For example, point p is farther away from neuron b than is point q, yet point p definitely belongs to cluster b while point q does not. Moreover, point p is closer to neuron a but clearly belongs to cluster b. Similarly, two close and elongated clusters will result in the neural locations shown in Figure 1b. More complex clusters, as shown in Figures 1c and 1d, may require still more complex metrics for separation.

Figure 1: (a) Two data clusters with nonisotropic distributions. (b) Close clusters. (c) Curved clusters, not linearly separable. (d) Overlapping clusters.

There have been several attempts to overcome the above difficulties using second-order metrics: ellipses and second-order curves. Generally, the use of an inverse covariance matrix in the clustering metric enables capturing linear directionality properties of the cluster and has been used in a variety of clustering algorithms. Gustafson and Kessel (1979) use the covariance matrix to capture ellipsoidal properties of clusters. Davé (1989) used fuzzy clustering with a non-Euclidean metric to detect lines in images. This concept was later expanded, and Krishnapuram, Frigui, and Nasraoui (1995) use general second-order shells such as ellipsoidal shells and surfaces. (For an overview and comparison of these methods, see Frigui & Krishnapuram, 1996.) Abe and Thawonmas (1997) discuss a fuzzy classifier (ML) with ellipsoidal units. Incorporation of Mahalanobis (elliptical) metrics in neural networks was addressed by Kavuri and Venkatasubramanian (1993) for fault analysis applications and by Mao and Jain (1996) as a general competitive network with embedded principal component analysis units. Kohonen (1997) also discusses the use of adaptive tensorial weights to capture significant variances in the components of input signals, thereby introducing a weighted elliptic Euclidean distance in the matching law. We have also used ellipsoidal units in previous work (Lipson, Hod, & Siegelmann, 1998).
At the other end of the spectrum, nonparametric clustering methods such as agglomerative techniques (Blatt, Wiseman, & Domany, 1996) avoid the need to provide a parametric shape for clusters, thereby permitting clusters to take arbitrary forms. However, agglomerative clustering techniques cannot handle overlapping clusters such as those shown in Figure 1d. Overlapping clusters are common in real-life data and represent inherent uncertainty or ambiguity in the natural classification of the data. Areas of overlap are therefore of interest and should be explicitly detected. This work presents an attempt to generalize the spherical-ellipsoidal scheme to more general metrics, thereby giving rise to a continuum of possible cluster shapes between the classic spherical-ellipsoidal units and the fully nonparametric approach. To visualize a cluster metric, we draw a hypersurface at an arbitrary constant threshold of the metric. The shape of the resulting surface is characteristic of the shape of the local dominance zone of the neuron and will be referred to as the shape of the neuron. Some examples of neuron distance metrics yielding practically arbitrary shapes at various orders of complexity are shown in Figure 2; the hypersurfaces shown are the optimal (tight) boundaries of the respective clusters.
Figure 2: Some examples of single neuron shapes computed for the given clusters (after one epoch). (a–f) Orders two to seven, respectively.
In general, the shape restriction of classic neurons is relaxed by replacing the weight vector or covariance matrix of a classic neuron with a general high-order tensor, which can capture multilinear correlations among the signals associated with the neuron. This also permits capturing shapes with holes and/or detached areas, as in Figure 3. We also use the concept of homogeneous coordinates to combine correlations of different orders into a single tensor. Although carrying the same name, our notion of high-order neurons differs from the one used, for example, by Giles, Griffen, and Maxwell (1988), Pollack (1991), Goudreau, Giles, Chakradhar, and Chen (1994), and Balestrino and Verona (1994). There, high-order neurons are those that receive multiplied combinations of inputs rather than individual inputs. Those neurons were also not used for clustering or in the framework of competitive learning. It appears that high-order neurons were generally abandoned, mainly because of the difficulty of training such neurons and their inherent instability: small variations in weight may cause a large variation in the output. Some also questioned the usefulness of high-order statistics for improving predictions (Myung, Levy, & William, 1992). In this article we demonstrate a different notion of high-order neurons where
Figure 3: A shape neuron can have a nonsimple topology: (a) a single fifth-order neuron with a noncompact shape, including two "holes"; (b) a single fourth-order neuron with three detached active zones.
the neurons are tensors, and their order pertains to their internal tensorial order rather than to the degree of the inputs. Our high-order neurons exhibit good stability and excellent training performance with simple Hebbian learning. Furthermore, we believe that capturing high-order information within a single neuron facilitates explicit manipulation and extraction of this information. General high-order neurons have not previously been used for clustering or competitive learning in this manner. Section 2 derives the tensorial formulation of a single neuron, starting with a simple second-order (elliptical) metric, moving to homogeneous coordinates, and then increasing the order for a general shape. Section 3 discusses the learning capacity of such neurons and their functionality within a layer. Section 4 summarizes the algorithm, and section 5 demonstrates an implementation of the proposed neurons on synthetic and real data. The article concludes with some remarks on possible future extensions.

2 A "General Shape" Neuron
We adopt the following notation:

x, w: column vectors
W, D: matrices
x^(j), D^(j): the jth vector/matrix, corresponding to the jth neuron/class
x_i, D_ij: element of a vector/matrix
x_H, D_H: vector/matrix in homogeneous coordinates
m: order of the high-order neuron
d: dimensionality of the input
N: size of the layer (the number of neurons)
The modeling constraints imposed by the maximum-correlation matching criterion stem from the fact that the neuron's synaptic weight w^(j) has the same dimensionality as the input x, that is, the same dimensionality as a single point in the input domain, while in fact the neuron is modeling a cluster of points, which may have higher-order attributes such as directionality and curvature. We shall therefore refer to the classic neuron as a first-order (zero-degree) neuron, due to its correspondence to a point in multidimensional input space. To circumvent this restriction, we augment the neuron with the capacity to map additional geometric and topological information by increasing its order. For example, as the first order is a point neuron, the second-order case will correspond to orientation and size components, effectively attaching a local oriented coordinate system with nonuniform scaling to the neuron center and using it to define a new distance metric. Thus, each high-order neuron will represent not only the mean value of the data points in the cluster it is associated with, but also the principal directions, curvatures, and other features of the cluster, as well as the variance of the data points along these directions. Intuitively, we can say that rather than defining a sphere, the high-order distance metric now defines a multidimensional oriented shape with an adaptive shape and topology (see Figures 2 and 3). In the following pages, we briefly review the ellipsoidal metric and establish the basic concepts of encapsulation and base functions to be used in the higher-order cases, where tensorial notation becomes more difficult to follow. For a second-order neuron, define the matrix D^(j) ∈ R^(d×d) to encapsulate the direction and size properties of a neuron, for each dimension of the input, where d is the number of dimensions.
Each row i of D^(j) is a unit row vector v_i^T corresponding to a principal direction of the cluster, divided by the standard deviation of the cluster in that direction (the square root of the variance s_i). The Euclidean metric can now be replaced by the neuron-dependent metric

$$ \mathrm{dist}(\mathbf{x}, j) = \left\| D^{(j)} \left( \mathbf{x} - \mathbf{w}^{(j)} \right) \right\|, $$

where

$$ D^{(j)} = \begin{bmatrix} \frac{1}{\sqrt{s_1}} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sqrt{s_2}} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{\sqrt{s_d}} \end{bmatrix} \begin{bmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \\ \vdots \\ \mathbf{v}_d^T \end{bmatrix}, \qquad (2.1) $$
such that deviations are normalized with respect to the variance of the data in the direction of the deviation. The new matching criterion based on this metric becomes

$$ i(\mathbf{x}) = \arg\min_j \left\| \mathrm{dist}(\mathbf{x}, j) \right\| = \arg\min_j \left\| D^{(j)} \left( \mathbf{x} - \mathbf{w}^{(j)} \right) \right\|, \qquad j = 1, 2, \ldots, N. \qquad (2.2) $$
The neuron is now represented by the pair (D^(j), w^(j)), where D^(j) represents rotation and scaling and w^(j) represents translation. The values of D^(j) can be obtained from an eigenstructure analysis of the correlation matrix $R^{(j)} = \sum \mathbf{x}^{(j)} \mathbf{x}^{(j)T} \in \mathbb{R}^{d \times d}$ of the data associated with neuron j. Two well-known properties are derived from the eigenstructure of principal component analysis: (1) the eigenvectors of a correlation matrix R pertaining to the zero-mean data vector x define the unit vectors v_i representing the principal directions along which the variances attain their extremal values, and (2) the associated eigenvalues define the extremal values of the variances s_i. Hence,

$$ D^{(j)} = \lambda^{(j)-\frac{1}{2}} V^{(j)}, \qquad (2.3) $$

where $\lambda^{(j)} = \mathrm{diag}(R^{(j)})$ produces the eigenvalues of $R^{(j)}$ in diagonal form and $V^{(j)} = \mathrm{eig}(R^{(j)})$ produces the unit eigenvectors of $R^{(j)}$ in matrix form. Also, by definition of the eigenvalue analysis of a general matrix A,

$$ A V = \lambda V \quad \text{and hence} \quad A = V^T \lambda V. \qquad (2.4) $$
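As a concrete illustration of equations 2.1 through 2.4, the matrix D^(j) can be built from an eigendecomposition of the correlation matrix. This is our sketch, not the authors' code; dividing the correlation matrix by the number of samples is a convenience that only rescales the metric.

```python
import numpy as np

def second_order_metric(X):
    """Build D = diag(1/sqrt(s_i)) V from the correlation matrix (eq. 2.3).

    X is an (n, d) array of zero-mean samples from one cluster. The rows
    of the returned matrix product are the principal directions scaled by
    the inverse standard deviations along them.
    """
    R = X.T @ X / len(X)                  # correlation matrix (rescaled)
    s, V = np.linalg.eigh(R)              # variances s_i, eigenvectors as columns
    return np.diag(1.0 / np.sqrt(s)) @ V.T  # rows of V.T are v_i^T

def dist(D, w, x):
    """Neuron-dependent metric dist(x, j) = ||D (x - w)|| (eq. 2.1)."""
    return float(np.linalg.norm(D @ (x - w)))
```

For data spread with variance 2 along the x-axis and 0.125 along the y-axis, a point two units along the long axis sits at a normalized distance of sqrt(2), exactly the Mahalanobis distance.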
Now, since the term ‖a‖² equals aᵀa, we can expand the term $\|D^{(j)}(\mathbf{x} - \mathbf{w}^{(j)})\|$ of equation 2.2 as follows:

$$ \left\| D^{(j)} \left( \mathbf{x} - \mathbf{w}^{(j)} \right) \right\|^2 = \left( \mathbf{x} - \mathbf{w}^{(j)} \right)^T D^{(j)T} D^{(j)} \left( \mathbf{x} - \mathbf{w}^{(j)} \right). \qquad (2.5) $$
Substituting equations 1.2 and 2.3, we see that

$$ D^{(j)T} D^{(j)} = \left[ \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d \right] \begin{bmatrix} \frac{1}{s_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{s_2} & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \frac{1}{s_d} \end{bmatrix} \begin{bmatrix} \mathbf{v}_1^T \\ \mathbf{v}_2^T \\ \vdots \\ \mathbf{v}_d^T \end{bmatrix} = R^{(j)-1}, \qquad (2.6) $$
meaning that the scaling and rotation matrix is, in fact, the inverse of R^(j). Hence,

$$ i(\mathbf{x}) = \arg\min_j \left[ \left( \mathbf{x} - \mathbf{w}^{(j)} \right)^T R^{(j)-1} \left( \mathbf{x} - \mathbf{w}^{(j)} \right) \right], \qquad j = 1, 2, \ldots, N, \qquad (2.7) $$

which corresponds to the term used by Gustafson and Kessel (1979) and in the maximum likelihood gaussian classifier (Duda & Hart, 1973). Equation 2.7 involves two explicit terms, R^(j) and w^(j), representing the orientation and size information, respectively. Implementations involving ellipsoidal metrics therefore need to track these two quantities separately, in a two-stage step. Separate tracking, however, cannot easily be extended to higher orders, which require simultaneous tracking of additional higher-order qualities such as curvatures. We therefore use an equivalent representation that encapsulates these terms into one and permits a more systematic extension to higher orders. We combine these two coefficients into one expanded covariance, denoted $R_H^{(j)}$, in homogeneous coordinates (Faux & Pratt, 1981):

$$ R_H^{(j)} = \sum \mathbf{x}_H^{(j)} \mathbf{x}_H^{(j)T}, \qquad (2.8) $$
where $\mathbf{x}_H \in \mathbb{R}^{(d+1) \times 1}$ denotes homogeneous coordinates, defined by $\mathbf{x}_H = \begin{bmatrix} \mathbf{x} \\ 1 \end{bmatrix}$. In this notation, the expanded covariance also contains coordinate permutations involving 1 as one of the multiplicands, and so different (lower) orders are introduced. Consequently, $R_H^{(j)} \in \mathbb{R}^{(d+1) \times (d+1)}$. In analogy to the correlation matrix memory and autoassociative memory (Haykin, 1994), the extended matrix $R_H^{(j)}$, in its general form, can be viewed as a homogeneous autoassociative tensor. Now the new matching criterion becomes simply

$$ i(\mathbf{x}) = \arg\min_j \left[ \mathbf{x}_H^T R_H^{(j)-1} \mathbf{x}_H \right] = \arg\min_j \left\| R_H^{(j)-\frac{1}{2}} \mathbf{x}_H \right\| = \arg\min_j \left\| R_H^{(j)-1} \mathbf{x}_H \right\|, \qquad j = 1, 2, \ldots, N. \qquad (2.9) $$
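Equations 2.7 through 2.9 can be illustrated as follows. The function names are ours, and plain matrix inverses stand in for a numerically careful implementation.

```python
import numpy as np

def match_mahalanobis(x, means, covs):
    """Eq. 2.7: argmin_j (x - w_j)^T R_j^{-1} (x - w_j).

    means: list of cluster centers w^(j); covs: list of correlation
    matrices R^(j). Illustrative sketch only.
    """
    return int(np.argmin([(x - w) @ np.linalg.inv(R) @ (x - w)
                          for w, R in zip(means, covs)]))

def homogeneous_cov(X):
    """Eq. 2.8: expanded covariance R_H = sum x_H x_H^T, with x_H = [x; 1]."""
    XH = np.hstack([X, np.ones((len(X), 1))])
    return XH.T @ XH

def match_homogeneous(x, RHs):
    """Eq. 2.9: argmin_j ||R_H^(j)^{-1} x_H||; no separate mean is needed,
    since the translation is absorbed into the homogeneous covariance."""
    xH = np.append(x, 1.0)
    return int(np.argmin([np.linalg.norm(np.linalg.inv(RH) @ xH)
                          for RH in RHs]))
```

In the Mahalanobis example below, the test point is Euclidean-closer to the first center, yet the elongated second cluster wins, which is exactly the Figure 1a behavior the metric is meant to capture.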
This representation retains the notion of maximum correlation, and for convenience we now denote $R_H^{(j)-1}$ as the synaptic tensor. In moving to homogeneous coordinates, we gained the following: the eigenvectors of the original covariance matrix R^(j) represented directions, whereas the eigenvectors of the homogeneous covariance matrix $R_H^{(j)}$ correspond to both the principal directions and the offset (both second-order and first-order) properties of the cluster accumulated in $R_H^{(j)}$. In fact, the elements of each eigenvector correspond to the coefficients of the line equation ax + by + ⋯ + c = 0. This property permits direct extension to higher-order tensors in the next section, where direct eigenstructure analysis is not well defined. The transition to homogeneous coordinates also dispenses with the problem that the eigenstructure properties cited above pertain only to zero-mean data, whereas $R_H^{(j)}$ is defined as the covariance matrix of non-zero-mean data. By using homogeneous coordinates, we effectively treat the translational element of the cluster as yet another (additional) dimension. In this higher space, the data can be considered zero mean.

2.1 Higher Orders. The ellipsoidal neuron can capture only linear directionality, not curved directionality such as that appearing, for instance, in a fork or in a curved or sharp-cornered shape. In order to capture more complex spatial configurations, higher-order geometries must be introduced. The classic neuron possesses a synaptic weight vector w^(j), which corresponds to a point in the input domain. The synaptic weight w^(j) can be seen as the first-order average of its signals, that is, $\mathbf{w}^{(j)} = \sum \mathbf{x}_H$ (the last element $x_{H,d+1} = 1$ of the homogeneous coordinates has no effect in this case). Ellipsoidal units hold information regarding the linear correlations among the coordinates of the data points represented by the neuron by using $R_H^{(j)} = \sum \mathbf{x}_H \mathbf{x}_H^T$. Each element of $R_H$ is thus a proportionality constant relating two specific dimensions of the cluster. The second-order neuron can therefore be regarded as a second-order approximation of the corresponding data distribution. Higher-order relationships can be introduced by considering correlations among more than just two variables.
We may consequently introduce a neuron capable of modeling a d-dimensional data cluster to an mth-order approximation. Higher-order correlations are obtained by multiplying the generating vector $\mathbf{x}_H$ by itself, in successive outer products, more than just once. The process yields a covariance tensor rather than just a covariance matrix (which is a second-order tensor). The resulting homogeneous covariance tensor is a 2(m − 1)-order tensor of rank d + 1, denoted by $Z_H^{(j)} \in \mathbb{R}^{(d+1) \times \cdots \times (d+1)}$, obtained by 2(m − 1) outer products of $\mathbf{x}_H$ with itself:

$$ Z_H^{(j)} = \mathbf{x}_H^{2(m-1)}. \qquad (2.10) $$
The exponent consists of the factor (m − 1), which is the degree of the approximation, and a factor of 2, since we are performing autocorrelation, so the function is multiplied by itself.

Figure 4: Metrics of various orders: (a) a regular neuron (spherical metric), (b) a "square" neuron, and (c) a neuron with "two holes."

In analogy to the reasoning that leads to equation 2.9, each eigenvector (now an eigentensor of order m − 1) corresponds to the coefficients of a principal curve, and multiplying it by an input point produces an approximation of the distance of that input point from the curve. Consequently, the inverse of the tensor $Z_H^{(j)}$ can then be used to compute the high-order correlation of the signal with the nonlinear shape neuron, by simple tensor multiplication:
$$ i(\mathbf{x}) = \arg\min_j \left\| Z_H^{(j)-1} \otimes \mathbf{x}_H^{m-1} \right\|, \qquad j = 1, 2, \ldots, N, \qquad (2.11) $$
where ⊗ denotes tensor multiplication. Note that the covariance tensor can be inverted only if its order is an even number, as is satisfied by equation 2.10. Note also that the amplitude operator ‖·‖ is carried out by computing the root of the sum of the squares of the elements of the argument. The computed metric is now not necessarily spherical and may take various other forms, as plotted in Figure 4. In practice, however, high-order tensor inversion is not directly required. To make this analysis simpler, we use Kronecker notation for tensor products (see Graham, 1981). The Kronecker tensor product flattens out the elements of X ⊗ Y into a large matrix formed by taking all possible products between the elements of X and those of Y. For example, if X is a 2×3 matrix, then

$$ X \otimes Y = \begin{bmatrix} Y \cdot X_{1,1} & Y \cdot X_{1,2} & Y \cdot X_{1,3} \\ Y \cdot X_{2,1} & Y \cdot X_{2,2} & Y \cdot X_{2,3} \end{bmatrix}, \qquad (2.12) $$

where each block is a matrix of the size of Y. In this notation, the internal structure of higher-order tensors is easier to perceive, and their correspondence to linear regression of principal polynomial curves is revealed. Consider, for example, a fourth-order covariance tensor of the vector x = {x, y}.
The fourth-order tensor corresponds to the simplest nonlinear neuron according to equation 2.10 and takes the form of the 2×2×2×2 tensor

$$ \mathbf{x} \otimes \mathbf{x} \otimes \mathbf{x} \otimes \mathbf{x} = R \otimes R = \begin{bmatrix} x^2 & xy \\ xy & y^2 \end{bmatrix} \otimes \begin{bmatrix} x^2 & xy \\ xy & y^2 \end{bmatrix} = \begin{bmatrix} x^4 & x^3 y & x^3 y & x^2 y^2 \\ x^3 y & x^2 y^2 & x^2 y^2 & x y^3 \\ x^3 y & x^2 y^2 & x^2 y^2 & x y^3 \\ x^2 y^2 & x y^3 & x y^3 & y^4 \end{bmatrix}. \qquad (2.13) $$
The homogeneous version of this tensor also includes all lower-order permutations of the coordinates of $\mathbf{x}_H = \{x, y, 1\}$, namely, the 3×3×3×3 tensor

$$ Z_H^{(4)} = \left\langle \mathbf{x}_H, \mathbf{x}_H, \mathbf{x}_H, \mathbf{x}_H \right\rangle = R_H \otimes R_H = \begin{bmatrix} x^2 & xy & x \\ xy & y^2 & y \\ x & y & 1 \end{bmatrix} \otimes \begin{bmatrix} x^2 & xy & x \\ xy & y^2 & y \\ x & y & 1 \end{bmatrix} $$

$$ = \begin{bmatrix}
x^4 & x^3 y & x^3 & x^3 y & x^2 y^2 & x^2 y & x^3 & x^2 y & x^2 \\
x^3 y & x^2 y^2 & x^2 y & x^2 y^2 & x y^3 & x y^2 & x^2 y & x y^2 & x y \\
x^3 & x^2 y & x^2 & x^2 y & x y^2 & x y & x^2 & x y & x \\
x^3 y & x^2 y^2 & x^2 y & x^2 y^2 & x y^3 & x y^2 & x^2 y & x y^2 & x y \\
x^2 y^2 & x y^3 & x y^2 & x y^3 & y^4 & y^3 & x y^2 & y^3 & y^2 \\
x^2 y & x y^2 & x y & x y^2 & y^3 & y^2 & x y & y^2 & y \\
x^3 & x^2 y & x^2 & x^2 y & x y^2 & x y & x^2 & x y & x \\
x^2 y & x y^2 & x y & x y^2 & y^3 & y^2 & x y & y^2 & y \\
x^2 & x y & x & x y & y^2 & y & x & y & 1
\end{bmatrix}. \qquad (2.14) $$
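The construction of equations 2.10 through 2.14 can be sketched with NumPy's Kronecker product, which keeps the 2(m − 1)-order tensor flattened as a square matrix. The helper below is our illustration, not the authors' code.

```python
import numpy as np

def high_order_tensor(x, m):
    """Flattened homogeneous covariance tensor Z_H = x_H^{2(m-1)} (eq. 2.10).

    For m = 2 this reduces to the homogeneous covariance R_H = x_H x_H^T;
    for m = 3 it is the 9 x 9 flattening of the 3x3x3x3 tensor of eq. 2.14.
    """
    xH = np.append(x, 1.0)
    v = xH.copy()
    for _ in range(m - 2):          # v = x_H^{m-1} as a flat vector
        v = np.kron(v, xH)
    return np.outer(v, v)           # the remaining factor of 2 in the exponent
```

By the identity (a bᵀ) ⊗ (c dᵀ) = (a ⊗ c)(b ⊗ d)ᵀ, the m = 3 result equals the Kronecker product R_H ⊗ R_H of equation 2.14.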
Remark. It is immediately apparent that the matrix in equation 2.14 corresponds to the matrix to be solved for finding a least-squares fit of a conic section equation ax² + by² + cxy + dx + ey + f = 0 to the data points. Moreover, the set of eigenvectors of this matrix corresponds to the coefficients of the set of mutually orthogonal best-fit conic section curves that are the principal curves of the data. This notion adheres to Gnanadesikan's (1977) method for finding principal curves. Substitution of a data point into the equation of a principal curve yields an approximation of the distance of the point from that curve, and the sum of squared distances amounts to equation 2.11. Note that each time we increase the order, we are seeking a set of principal curves of one degree higher. This implies that the least-squares matrix needs to be two degrees higher (because it is minimizing the squared error), thus yielding the coefficient 2 in the exponent of equation 2.10.
3 Learning and Functionality
In order to show that higher-order shapes are a direct extension of classic neurons, we show that they too are subject to simple Hebbian learning. Following an interpretation of Hebb's postulate of learning, synaptic modification (i.e., learning) occurs when there is a correlation between presynaptic and postsynaptic activities. We have already shown in equation 2.9 that the presynaptic activity $\mathbf{x}_H$ and the postsynaptic activity of neuron j coincide when the synapse is strong, that is, when $\| R_H^{(j)-1} \mathbf{x}_H \|$ is minimum. We now proceed to show that, in accordance with Hebb's postulate of learning, it is sufficient for self-organization of the neurons to increase synapse strength when there is a coincidence of presynaptic and postsynaptic signals. We need to show how self-organization is obtained merely by increasing $R_H^{(j)}$, where j is the winning neuron. As each new data point arrives at a specific neuron j in the net, the synaptic weight of that neuron is adapted by the increment corresponding to Hebbian learning,

$$ Z_H^{(j)}(k+1) = Z_H^{(j)}(k) + \eta(k) \, \mathbf{x}_H^{2(m-1)}, \qquad (3.1) $$
where k is the iteration counter and η(k) is the iteration-dependent learning-rate coefficient. It should be noted that with this update, $Z_H^{(j)}$ of equation 2.10 becomes a weighted sum of the input signals $\mathbf{x}_H^{2(m-1)}$ (unlike $R_H$ in equation 2.8, which is a uniform sum). The eigenstructure analysis of a weighted sum still provides the principal components under the assumption that the weights are uniformly distributed over the cluster signals. This assumption holds true if the process generating the signals of the cluster is stable over time, a basic assumption of all neural network algorithms (Bishop, 1997). In practice, this assumption is easily acceptable, as will be demonstrated in the following sections. In principle, continuous updating of the covariance tensor may create instability in a competitive environment, as a winner neuron becomes increasingly dominant. To force competition, the covariance tensor can be normalized using any of a number of factors dependent on the application, such as the number of signals assigned to the neuron so far or the distribution of data among the neurons (forcing uniformity). However, a more natural factor for normalization is the size and complexity of the neuron. A second-order neuron is geometrically equivalent to a d-dimensional ellipsoid. A neuron may therefore be normalized by a factor V proportional to the ellipsoid volume:

$$ V \propto \sqrt{s_1 \cdot s_2 \cdots s_d} = \prod_{i=1}^{d} \sqrt{s_i} \propto \frac{1}{\sqrt{\det R^{(j)}}}. \qquad (3.2) $$
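The increment of equation 3.1 can be sketched as follows; `hebbian_update` and the flattened Kronecker representation of the tensor are our illustrative choices, not the authors' code.

```python
import numpy as np

def hebbian_update(Z, x, m, eta):
    """Eq. 3.1: Z_H(k+1) = Z_H(k) + eta(k) * x_H^{2(m-1)}, applied to the winner.

    Z is the winner's covariance tensor, kept flattened as a square matrix
    via Kronecker powers (illustrative sketch).
    """
    xH = np.append(x, 1.0)        # homogeneous coordinates [x; 1]
    v = xH.copy()
    for _ in range(m - 2):        # v = x_H^{m-1} as a flat vector
        v = np.kron(v, xH)
    return Z + eta * np.outer(v, v)   # the outer product supplies the factor 2
```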
During the competition phase, the factors $\| R^{(j)-1} \mathbf{x} \| / V$ are compared, rather than just $\| R^{(j)-1} \mathbf{x} \|$, thus promoting smaller neurons and forcing neurons to become equally "fat." The final matching criterion for the homogeneous representation is therefore given by

$$ j(\mathbf{x}) = \arg\min_j \left\| \hat{R}_H^{(j)-1} \mathbf{x}_H \right\|, \qquad j = 1, 2, \ldots, N, \qquad (3.3) $$
where

$$ \hat{R}_H^{(j)} = \frac{R_H^{(j)}}{\sqrt{\det R_H^{(j)}}}. \qquad (3.4) $$
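The determinant normalization of equations 3.3 and 3.4 can be illustrated as follows (our sketch, restricted to the second-order case with a plain matrix inverse):

```python
import numpy as np

def normalized_match(x, RHs):
    """Eqs. 3.3-3.4: winner of argmin_j ||R_hat^{-1} x_H||,
    with R_hat = R_H / sqrt(det R_H) promoting smaller neurons."""
    xH = np.append(x, 1.0)
    scores = []
    for RH in RHs:
        R_hat = RH / np.sqrt(np.linalg.det(RH))   # eq. 3.4
        scores.append(np.linalg.norm(np.linalg.inv(R_hat) @ xH))
    return int(np.argmin(scores))
```

Without the normalization, a neuron with a uniformly larger covariance would win every point simply because its inverse shrinks all scores; the determinant factor removes this bias.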
Covariance tensors of higher-order neurons can also be normalized by their determinant. However, unlike second-order neurons, higher-order shapes are not necessarily convex, and hence their volume is not necessarily proportional to their determinant. The determinant of higher-order tensors therefore relates to more complex characteristics, which include the deviation of the data points from the principal curves and the curvature of the boundary. Nevertheless, our experiments showed that this is a simple and effective means of normalization that promotes small and simple shapes over complex or large, concave fittings to data. The relationship between geometric shape and the tensor determinant should be investigated further for more elaborate use of the determinant as a normalization criterion.

Finally, in order to determine the likelihood of a point's being associated with a particular neuron, we need to assume a distribution function. Assuming a Gaussian distribution, the likelihood p^(j)(x_H) of data point x_H being associated with cluster j decreases with the distance from the neuron and is given by

p^(j)(x_H) = (1 / (2π)^{d/2}) e^{−(1/2) ‖Ẑ^(j) x_H‖²}.    (3.5)
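To make the matching concrete, here is a rough sketch (ours, not the authors' code) of equations 3.2 through 3.5 for second-order neurons in homogeneous coordinates. The function names and toy tensors are illustrative assumptions, and NumPy is assumed:

```python
import numpy as np

def normalize(R):
    """Equation 3.4: R_hat = R / sqrt(det R), penalizing large ellipsoid volume."""
    return R / np.sqrt(np.linalg.det(R))

def match(x_h, tensors):
    """Equation 3.3: winner j = argmin_j || R_hat_j^{-1} x_h ||."""
    dists = [np.linalg.norm(np.linalg.solve(normalize(R), x_h)) for R in tensors]
    return int(np.argmin(dists))

def likelihood(x_h, Z_hat):
    """Equation 3.5: Gaussian likelihood, decreasing with || Z_hat x_h ||."""
    d = len(x_h)
    return np.exp(-0.5 * np.linalg.norm(Z_hat @ x_h) ** 2) / (2 * np.pi) ** (d / 2)

# Toy example: two neurons over 2-D data in homogeneous coordinates [x1, x2, 1].
R_small = np.eye(3)                      # small, round neuron
R_large = np.diag([25.0, 25.0, 1.0])     # large neuron, penalized by its volume
x_h = np.array([2.0, 0.0, 1.0])
print(match(x_h, [R_small, R_large]))    # the volume-normalized small neuron wins
```

Note how the division by √(det R) in `normalize` is what keeps the larger neuron from winning every competition.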
4 Algorithm Summary
To conclude the previous sections, we provide a summary of the high-order competitive learning algorithm for both unsupervised and supervised learning. These are the algorithms that were used in the test cases described later in section 5.

Unsupervised Competitive Learning.
1. Select layer size (number of neurons N) and neuron order (m) for a given problem.
2344
H. Lipson and H. T. Siegelmann
2. Initialize all neurons with Z_H = Σ x_H^{2(m−1)}, where the sum is taken over all input data or a representative portion, or use a unit tensor or random numbers if no data are available. To compute this value, write down x_H^{m−1} as a vector with all mth degree permutations of {x₁, x₂, . . . , x_d, 1}, and Z_H as a matrix summing the outer products of these vectors. Store Z_H and Z_H^{−1}/f for each neuron, where f is a normalization factor (e.g., equation 3.4).

3. Compute the winner for input x. First compute the vector x_H^{m−1} from x as in section 2, and then multiply it by the stored matrix Z_H^{−1}/f. The winner j is the one for which the modulus of the product vector is smallest.

4. Output winner j.

5. Update the winner by adding x_H^{2(m−1)} to Z_H^(j), weighted by the learning coefficient η(k), where k is the iteration counter. Store Z_H and Z_H^{−1}/f for the updated neuron.

6. Go to step 3.

Supervised Learning.
1. Set layer size (number of neurons) to the number of classes, and select neuron order (m) for a given problem.

2. Train. For each neuron, compute Z_H^(j) = Σ x_H^{2(m−1)}, where the sum is taken over all training points to be associated with that neuron. To compute this value, write down x_H^{m−1} as a vector with all mth degree permutations of {x₁, x₂, . . . , x_d, 1}, and Z_H as a matrix summing the outer products of these vectors. Store Z_H^{−1}/f for each neuron.

3. Test. Use test points to evaluate the quality of training.

4. Use. For each new input x, first compute the vector x_H^{m−1} from x, as in section 2, and then multiply it by the stored matrix Z_H^{−1}/f. The winner j is the one for which the modulus of the product vector is smallest.

5 Implementation with Synthesized and Real Data
The proposed neuron can be used as an element in any network by replacing lower-dimensionality neurons and will enhance the modeling capability of each neuron. Depending on the application of the net, this may lead to improvement or degradation of the overall performance of the layer, due to the added degrees of freedom. This section provides some examples of an implementation at different orders, in both supervised and unsupervised learning. First, we discuss two examples of second-order (ellipsoidal) neurons to show that our formulation is compatible with previous work on ellipsoidal neural units (e.g., Mao & Jain, 1996). Figure 5 illustrates how
Figure 5: (a–c) Stages in self-classification of overlapping clusters, after learning 200, 400, and 650 data points, respectively (one epoch).
a competitive network consisting of three second-order neurons gradually evolves to model a two-dimensional domain exhibiting three main activation areas with significant overlap. The second-order neurons are visualized as ellipses with their boundaries drawn at 2σ of the distribution. The next test uses a two-dimensional distribution of three arbitrary clusters, consisting of a total of 75 data points, shown in Figure 6a. The network of three second-order neurons self-organized to map this data set and created the decision boundaries shown in Figure 6b. The dark curves are the decision boundaries, and the light lines show the local distance metric within each cell. Note how the decision boundary in the overlap area accounts for the different distribution variances.
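The self-organization shown in Figure 5 follows the unsupervised loop of section 4. A minimal sketch of that loop for second-order (m = 2) neurons, with initialization, fixed learning rate, and cluster centers that are our own illustrative choices (the paper uses a decreasing η(k)), might look like this in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_competitive(data, n_neurons=3, epochs=5, eta=0.3):
    """Online competitive learning for second-order (ellipsoidal) neurons.

    Each neuron keeps a covariance-like tensor Z over homogeneous inputs
    [x1, ..., xd, 1]; the winner minimizes the determinant-normalized
    distance ||(Z / sqrt(det Z))^{-1} x_h|| (eqs. 3.2-3.4).
    """
    d = data.shape[1] + 1
    Z = [np.eye(d) for _ in range(n_neurons)]       # unit-tensor initialization
    for _ in range(epochs):
        for x in data:
            xh = np.append(x, 1.0)                  # homogeneous coordinates
            dists = [np.linalg.norm(
                         np.linalg.solve(Zj / np.sqrt(np.linalg.det(Zj)), xh))
                     for Zj in Z]
            j = int(np.argmin(dists))
            # update the winner with the outer product of the input (step 5)
            Z[j] = Z[j] + eta * np.outer(xh, xh)
    return Z

# three noisy 2-D clusters
data = np.concatenate([rng.normal(c, 0.1, size=(30, 2))
                       for c in ([0, 0], [3, 0], [0, 3])])
tensors = train_competitive(data)
print(len(tensors))
```

Each returned tensor stays symmetric and positive definite, so its determinant normalization and inversion remain well defined throughout training.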
Figure 6: (a) Original data set. (b) Self-organized map of the data set. Dark curves are the decision boundaries, and light lines show the local distance metric within each cell. Note how the decision boundary in the overlap area accounts for the different distribution variances.
Table 1: Comparison of Self-Organization Results for Iris Data.

Method                                                      Epochs   Number Misclassified or Unclassified
Superparamagnetic (Blatt et al., 1996)                        —        25
Learning vector quantization / generalized
  learning vector quantization (Pal, Bezdek, & Tsao, 1993)   200       17
K-means (Mao & Jain, 1996)                                    —        16
Hyperellipsoidal clustering (Mao & Jain, 1996)                —         5
Second-order unsupervised                                     20        4
Third-order unsupervised                                      30        3
The performance of the different orders of shape neurons has also been assessed on the Iris data (Anderson, 1939). Anderson's time-honored data set has become a popular benchmark problem for clustering algorithms. The data set consists of four quantities measured on each of 150 flowers chosen from three species of iris: Iris setosa, Iris versicolor, and Iris virginica. The data set constitutes 150 points in four-dimensional space. Some results of other methods are compared in Table 1. It should be noted that the results presented in the table are "best results"; third-order neurons are not always stable, because the layer sometimes converges into different classifications.¹

The neurons have also been tried in a supervised setup. In supervised mode, we use the same training setup of equations 2.8 and 2.10 but train each neuron individually on the signals associated with it, using
Z_H^(j) = Σ x_H^{2(m−1)},    x_H ∈ Ψ^(j),
where in homogeneous coordinates x_H = [x; 1] and Ψ^(j) is the jth class. In order to evaluate the performance of this technique, we trained each neuron on a random selection of 80% of the data points of its class and then tested the classification of the whole data set (including the remaining 20% for cross validation). The test determined the most active neuron

¹ Performance of third-order neurons was evaluated with a layer consisting of three neurons randomly initialized with a normal distribution within the input domain with the mean and variance of the input, and with a learning factor of 0.3 decreasing asymptotically toward zero. Of 100 experiments carried out, 95% locked correctly onto the clusters, with an average of 7.6 (5%) misclassifications after 30 epochs. Statistics on the performance of the cited works have not been reported. Download MATLAB demos from http://www.cs.brandeis.edu/~lipson/papers/geom.htm.
Table 2: Supervised Classification Results for Iris Data.

                                                            Number Misclassified
Method                                             Epochs   Average     Best
3-hidden-layer neural network (Abe et al., 1997)    1000      —           1
Fuzzy hyperbox (Abe et al., 1997)                   1000      —           2
Fuzzy ellipsoids (Abe et al., 1997)                    1     2.2          1
Second-order supervised                                1     3.08         1
Third-order supervised                                 1     2.27         1
Fourth-order supervised                                1     1.60         0
Fifth-order supervised                                 1     1.07         0
Sixth-order supervised                                 1     1.20         0
Seventh-order supervised                               1     1.30         0

Notes: Results after one training epoch, with 20% cross validation. Averaged results are for 250 experiments.
for each input and compared it with the manual classification. This test was repeated 250 times, with the average and best results shown in Table 2, along with results reported for other methods. It appears that on average, supervised learning for this task reaches an optimum with fifth-order neurons. Figure 7 shows classification results using third-order neurons for an arbitrary distribution containing close and overlapping clusters. Finally, Figure 8 shows an application of third- through fifth-order neurons to detection and modeling of chromosomes in a human cell. The corresponding separation maps and convergence error rates are shown in Figures 9 and 10, respectively.
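As an illustration of the supervised one-pass training and the 80%/20% evaluation protocol described above, the following sketch (our own construction on toy two-class data, not the authors' MATLAB code) trains one second-order neuron per class and scores the whole set:

```python
import numpy as np

def train_supervised(X, y, n_classes):
    """One-pass training: Z^(j) = sum of outer products of class-j homogeneous points."""
    d = X.shape[1] + 1
    Z = [np.zeros((d, d)) for _ in range(n_classes)]
    for x, label in zip(X, y):
        xh = np.append(x, 1.0)           # homogeneous coordinates [x; 1]
        Z[label] += np.outer(xh, xh)
    return Z

def classify(x, Z):
    """Winner = class whose volume-normalized tensor minimizes ||Z_hat^{-1} x_h||."""
    xh = np.append(x, 1.0)
    dists = [np.linalg.norm(np.linalg.solve(Zj / np.sqrt(np.linalg.det(Zj)), xh))
             for Zj in Z]
    return int(np.argmin(dists))

rng = np.random.default_rng(1)
# two toy 2-D classes; train on a random 80% of each class, test on the whole set
X = np.concatenate([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
train_idx = np.concatenate([rng.choice(np.where(y == c)[0], 40, replace=False)
                            for c in (0, 1)])
Z = train_supervised(X[train_idx], y[train_idx], 2)
accuracy = np.mean([classify(x, Z) == c for x, c in zip(X, y)])
print(accuracy)
```

On these well-separated toy clusters the single-epoch training suffices, mirroring the one-epoch results reported in Table 2.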
Figure 7: Self-classification of synthetic point clusters using third-order neurons.
Figure 8: Chromosome separation. (a) Designated area. (b) Four third-order neurons self-organized to mark the chromosomes in the designated area. Data from Schrock et al. (1996).

6 Conclusions and Further Research
We have introduced high-order shape neurons and demonstrated their practical use for modeling the structure of spatial distributions and for geometric feature recognition. Although high-order neurons do not directly correspond to neurobiological details, we believe that they can provide powerful data modeling capabilities. In particular, they exhibit useful properties for correctly handling close and partially overlapping clusters. Furthermore, we have shown that ellipsoidal and tensorial clustering methods, as well as classic competitive neurons, are special cases of the general-shape neuron. We showed how lower-order information can be encapsulated using tensor representation in homogeneous coordinates, enabling systematic continuation to higher-order metrics. This formulation also allows for direct extraction of the encapsulated high-order information in the form of principal curves, represented by the eigentensors of each neuron.

The use of shapes as neurons raises some further practical questions, which have not been addressed in this article, require further research, and are listed below. Although most of these issues pertain to neural networks in general, the approach required here may be different:

Number of neurons. In the examples provided, we have explicitly used the number of neurons appropriate to the number of clusters. The question of network size is applicable to most neural architectures. However, our experiments showed that overspecification tends to cause clusters to split into subsections, while underspecification may cause clusters to unite.

Figure 9: Self-organization for chromosome separation. (a–d) Second- through fifth-order, respectively.

Initial weights. It is well known that sensitivity to initial weights is a general problem of neural networks (see, for example, Kolen & Pollack, 1990, and Pal et al., 1993). However, when third-order neurons were evaluated on the Iris data with random initial weights, they performed with an average of 5% misclassifications. Thus, the capacity of neurons to be spread over volumes may provide new possibilities for addressing the initialization problem.

Computational cost. The need to invert a high-order tensor introduces a significant computational cost, especially when the input dimensionality is large. There are some factors that counterbalance this computational cost. First, the covariance tensor needs to be inverted only when a neuron is updated, not when it is activated or evaluated. When the neuron is updated, the inverted tensor is computed and then stored and used whenever the neuron needs to compete or evaluate an input signal. This means that the whole layer will need only one inversion for each training point and no inversions for nontraining operation. Second, because of the complexity of each neuron, sometimes fewer of them are required. Finally, relatively complex shapes can be attained with low orders (say, third order).

Figure 10: Average error rate in self-organization for chromosome separation of Figure 9, respectively.

Selection of neuron order. As in many other networks, there is much domain knowledge about the problem that comes into the solution through the choice of parameters such as the order of the neurons. However, we also estimate that due to the rather analytic representation of each neuron, and since high-order information is contained within each neuron independently, this information can be directly extracted (using the eigentensors). Thus, it is possible to decompose analytically a cross-shaped fourth-order cluster into two overlapping elliptic clusters, and vice versa.

Stability versus flexibility and enforcement of particular shapes. Further research is required to find methods for biasing neurons toward particular shapes in an attempt to seek particular shapes and help neuron stability. For example, how would one encourage detection of triangles and not rectangles?

Nonpolynomial base functions. We intend to investigate the properties of shape layers based on nonpolynomial base functions. Noting that the high-order homogeneous tensors give rise to various degrees of polynomials, it is possible to derive such tensors with other base functions, such as wavelets, trigonometric functions, and other well-established kernel functions. Such an approach may enable modeling arbitrary geometric shapes.

Acknowledgments
We thank Eitan Domany for his helpful comments and insight. This work was supported in part by the U.S.-Israel Binational Science Foundation (BSF), the Israeli Ministry of Arts and Sciences, and the Fund for Promotion of Research at the Technion. H. L. acknowledges the generous support of the Charles Clore Foundation and the Fischbach Postdoctoral Fellowship. We thank Applied Spectral Imaging for supplying the data for the experiments described in Figures 8 through 10.

Download further information, MATLAB demos, and implementation details of this work from http://www.cs.brandeis.edu/~lipson/papers/geom.htm.

References

Abe, S., & Thawonmas, R. (1997). A fuzzy classifier with ellipsoidal regions. IEEE Transactions on Fuzzy Systems, 5, 358–368.

Anderson, E. (1939). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society, 59, 2–5.

Balestrino, A., & Verona, B. F. (1994). New adaptive polynomial neural network. Mathematics and Computers in Simulation, 37, 189–194.

Bishop, C. M. (1997). Neural networks for pattern recognition. Oxford: Clarendon Press.

Blatt, M., Wiseman, S., & Domany, E. (1996). Superparamagnetic clustering of data. Physical Review Letters, 76(18), 3251–3254.
Davé, R. N. (1989). Use of the adaptive fuzzy clustering algorithm to detect lines in digital images. In Proc. SPIE Conf. Intelligent Robots and Computer Vision, 1192, 600–611.

Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. New York: Wiley.

Faux, I. D., & Pratt, M. J. (1981). Computational geometry for design and manufacture. New York: Wiley.

Frigui, H., & Krishnapuram, R. (1996). A comparison of fuzzy shell-clustering methods for the detection of ellipses. IEEE Transactions on Fuzzy Systems, 4, 193–199.

Giles, C. L., Griffin, R. D., & Maxwell, T. (1988). Encoding geometric invariances in higher-order neural networks. In D. Z. Anderson (Ed.), Neural information processing systems. American Institute of Physics Conference Proceedings.

Gnanadesikan, R. (1977). Methods for statistical data analysis of multivariate observations. New York: Wiley.

Goudreau, M. W., Giles, C. L., Chakradhar, S. T., & Chen, D. (1994). First-order versus second-order single-layer recurrent neural networks. IEEE Transactions on Neural Networks, 5, 511–513.

Graham, A. (1981). Kronecker products and matrix calculus: With applications. New York: Wiley.

Gustafson, E. E., & Kessel, W. C. (1979). Fuzzy clustering with fuzzy covariance matrix. In Proc. IEEE CDC (pp. 761–766). San Diego, CA.

Haykin, S. (1994). Neural networks: A comprehensive foundation. Englewood Cliffs, NJ: Prentice Hall.

Kavuri, S. N., & Venkatasubramanian, V. (1993). Using fuzzy clustering with ellipsoidal units in neural networks for robust fault classification. Computers Chem. Eng., 17, 765–784.

Kohonen, T. (1997). Self-organizing maps. Berlin: Springer-Verlag.

Kolen, J. F., & Pollack, J. B. (1990). Back propagation is sensitive to initial conditions. Complex Systems, 4, 269–280.

Krishnapuram, R., Frigui, H., & Nasraoui, O. (1995). Fuzzy and possibilistic shell clustering algorithms and their application to boundary detection and surface approximation, Parts I and II.
IEEE Transactions on Fuzzy Systems, 3, 29–60.

Lipson, H., Hod, Y., & Siegelmann, H. T. (1998). High-order clustering metrics for competitive learning neural networks. In Proceedings of the Israel-Korea Bi-National Conference on New Themes in Computer Aided Geometric Modeling. Tel-Aviv, Israel.

Mao, J., & Jain, A. (1996). A self-organizing network for hyperellipsoidal clustering (HEC). IEEE Transactions on Neural Networks, 7, 16–29.

Myung, I., Levy, J., & William, B. (1992). Are higher order statistics better for maximum entropy prediction in neural networks? In Proceedings of the Int. Joint Conference on Neural Networks IJCNN'91. Seattle, WA.

Pal, N., Bezdek, J. C., & Tsao, E. C.-K. (1993). Generalized clustering networks and Kohonen's self-organizing scheme. IEEE Transactions on Neural Networks, 4, 549–557.
Pollack, J. B. (1991). The induction of dynamical recognizers. Machine Learning, 7, 227–252.

Schrock, E., du Manoir, S., Veldman, T., Schoell, B., Wienberg, J., Ferguson-Smith, M. A., Ning, Y., Ledbetter, D. H., Bar-Am, I., Soenksen, D., Garini, Y., & Ried, T. (1996). Multicolor spectral karyotyping of human chromosomes. Science, 273, 494–497.

Received August 24, 1998; accepted September 3, 1999.
LETTER
Communicated by José Principe
Learning Chaotic Attractors by Neural Networks

Rembrandt Bakker
DelftChemTech, Delft University of Technology, 2628 BL Delft, The Netherlands

Jaap C. Schouten
Chemical Reactor Engineering, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands

C. Lee Giles
NEC Research Institute, Princeton, NJ 08540, U.S.A.

Floris Takens
Department of Mathematics, University of Groningen, 9700 AV Groningen, The Netherlands

Cor M. van den Bleek
DelftChemTech, Delft University of Technology, 2628 BL Delft, The Netherlands

An algorithm is introduced that trains a neural network to identify chaotic dynamics from a single measured time series. During training, the algorithm learns to short-term predict the time series. At the same time, a criterion developed by Diks, van Zwet, Takens, and de Goede (1996) is monitored that tests the hypothesis that the reconstructed attractors of model-generated and measured data are the same. Training is stopped when the prediction error is low and the model passes this test. Two other features of the algorithm are (1) the way the state of the system, consisting of delays from the time series, has its dimension reduced by weighted principal component analysis data reduction, and (2) the user-adjustable prediction horizon obtained by "error propagation" (partially propagating prediction errors to the next time step). The algorithm is first applied to data from an experimental driven chaotic pendulum, of which two of the three state variables are known. This is a comprehensive example that shows how well the Diks test can distinguish between slightly different attractors. Second, the algorithm is applied to the same problem, but now one of the two known state variables is ignored. Finally, we present a model for the laser data from the Santa Fe time-series competition (set A).
It is the first model for these data that is not only useful for short-term predictions but also generates time series with similar chaotic characteristics as the measured data.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2355–2383 (2000)
2356 R. Bakker, J. C. Schouten, C. L. Giles, F. Takens, and C. M. van den Bleek

1 Introduction
A time series measured from a deterministic chaotic system has the appealing characteristic that its evolution is fully determined and yet its predictability is limited due to exponential growth of errors in model or measurements. A variety of data-driven analysis methods for this type of time series was collected in 1991 during the Santa Fe time-series competition (Weigend & Gershenfeld, 1994). The methods focused on either characterization or prediction of the time series. No attention was given to a third, and much more appealing, objective: given the data and the assumption that it was produced by a deterministic chaotic system, find a set of model equations that will produce a time series with identical chaotic characteristics, having the same chaotic attractor. The model could be based on first principles if the system is well understood, but here we assume knowledge of just the time series and use a neural network-based, black-box model. The choice for neural networks is inspired by the inherent stability of their iterated predictions, as opposed to, for example, polynomial models (Aguirre & Billings, 1994) and local linear models (Farmer & Sidorowich, 1987). Lapedes and Farber (1987) were among the first who tried the neural network approach. In concise neural network jargon, we formulate our goal: train a network to learn the chaotic attractor. A number of authors have addressed this issue (Aguirre & Billings, 1994; Principe, Rathie, & Kuo, 1992; Kuo & Principe, 1994; Deco & Schürmann, 1994; Rico-Martínez, Krischer, Kevrekidis, Kube, & Hudson, 1992; Krischer et al., 1993; Albano, Passamente, Hediger, & Farrell, 1992). The common approach consists of two steps:

1. Identify a model that makes accurate short-term predictions.

2. Generate a long time series with the model by iterated prediction, and compare the nonlinear-dynamic characteristics of the generated time series with the original, measured time series.

Principe et al.
(1992) found that in many cases this approach fails; the model can make good short-term predictions but has not learned the chaotic attractor. The method would be greatly improved if we could directly minimize the difference between the reconstructed attractors of the model-generated and measured data rather than minimizing prediction errors. However, we cannot reconstruct the attractor without first having a prediction model. Therefore, research is focused on how to optimize both steps. We greatly reduce the chance of failure by integrating step 2 into step 1, the model identification. Rather than evaluating the model attractor after training, we monitor the attractor during training, and introduce a test developed by Diks, van Zwet, Takens, and de Goede (1996) as a stopping criterion. It tests the null hypothesis that the reconstructed attractors of model-generated and measured data are the same. The criterion directly measures
Learning Chaotic Attractors
2357
the distance between two attractors, and a good estimate of its variance is available. We combined the stopping criterion with two special features that we found very useful: (1) an efficient state representation by weighted principal component analysis (PCA) and (2) a parameter estimation scheme based on a mixture of the output-error and equation-error methods, previously introduced as the compromise method (Werbos, McAvoy, & Su, 1992). While Werbos et al. promoted the method to be used where equation error fails, we here use it to make the prediction horizon user adjustable. The method partially propagates errors to the next time step, controlled by a user-specified error propagation parameter.

In this article we present three successful applications of the algorithm. First, a neural network is trained on data from an experimental driven and damped pendulum. This system is known to have three state variables, of which one is measured and a second, the phase of the sinusoidal driving force, is known beforehand. This experiment is used as a comprehensive visualization of how well the Diks test can distinguish between slightly different attractors and how its performance depends on the number of data points. Second, a model is trained on the same pendulum data, but this time the information about the phase of the driving force is completely ignored. Instead, embedding with delays and PCA is used. This experiment is a practical verification of the Takens theorem (Takens, 1981), as explained in section 3. Finally, the error propagation feature of the algorithm becomes important in the modeling of the laser data (set A) from the 1991 Santa Fe time-series prediction competition.
The resulting neural network model opens new possibilities for analyzing these data because it can generate time series up to any desired length, and the Jacobian of the model can be computed analytically at any time, which makes it possible to compute the Lyapunov spectrum and find periodic solutions of the model.

Section 2 gives a brief description of the two data sets that are used to explain concepts of the algorithm throughout the article. In section 3, the input-output structure of the model is defined: the state is extracted from the single measured time series using the method of delays followed by weighted PCA, and predictions are made by a combined linear and neural network model. The error propagation training is outlined in section 4, and the Diks test, used to detect whether the attractor of the model has converged to the measured one, in section 5. Section 6 contains the results of the two different pendulum models, and section 7 covers the laser data model. Section 8 concludes with remarks on attractor learning and future directions.
Two data sets are used to illustrate features of the algorithm: data from an experimental driven pendulum and far-infrared laser data from the 1991 Santa Fe time-series competition (set A). The pendulum data are described
Figure 1: Plots of the first 2000 samples of (a) driven pendulum and (b) Santa Fe laser time series.
in section 2.1. For the laser data we refer to the extensive description of Hübner, Weiss, Abraham, and Tang (1994). The first 2048 points of each set are plotted in Figures 1a and 1b.

2.1 Pendulum Data. The pendulum we use is a type EM-50 pendulum produced by Daedalon Corporation (see Blackburn, Vik, & Binruo, 1989, for details and Figure 2 for a schematic drawing). Ideally, the pendulum would
Figure 2: Schematic drawing of the experimental pendulum and its driving force. The pendulum arm can rotate around its axis; the angle θ is measured. During the measurements, the frequency of the driving torque was 0.85 Hz.
obey the following equations of motion, in dimensionless form:

d/dt (θ, ω, φ)ᵀ = (ω, −γω − sin θ + a sin φ, ω_D)ᵀ,    (2.1)

where θ is the angle of the pendulum, ω its angular velocity, γ is a damping constant, and (a, ω_D, φ) are the amplitude, frequency, and phase, respectively, of the harmonic driving torque. As observed by De Korte, Schouten, and van den Bleek (1995), the real pendulum deviates from the ideal behavior of equation 2.1 because of its four electromagnetic driving coils that repulse the pendulum at certain positions (top, bottom, left, right). However, from a comparison of Poincaré plots of the ideal and real pendulum by Bakker, de Korte, Schouten, Takens, and van den Bleek (1996), it appears that the real pendulum can be described by the same three state variables (θ, ω, φ) as equations 2.1. In our setup the angle θ of the pendulum is measured at an accuracy of about 0.1 degree. It is not a well-behaved variable, for it is an angle defined only in the interval [0, 2π], and it can jump from 2π to 0 in a single time step. Therefore, we construct an estimate of the angular velocity ω by differencing the angle time series and removing the discontinuities. We call this estimate Ω.

3 Model Structure
The experimental pendulum and FIR laser are examples of autonomous, deterministic, and stationary systems. In discrete time, the evolution equation for such systems is

x⃗_{n+1} = F(x⃗_n),    (3.1)
where F is a nonlinear function, and x⃗_n is the state of the system at time t_n = t_0 + nτ, where τ is the sampling time. In most real applications, only a single variable is measured, even if the evolution of the system depends on several variables. As Grassberger, Schreiber, and Schaffrath (1991) pointed out, it is possible to replace the vector of "true" state variables by a vector that contains delays of the single measured variable. The state vector becomes a delay vector x⃗_n = (y_{n−m+1}, . . . , y_n), where y is the single measured variable and m is the number of delays, called the embedding dimension. Takens (1981) showed that this method of delays will lead to evolutions of the type of equation 3.1 if we choose m ≥ 2D + 1, where D is the dimension of the system's attractor. We can now use this result in two ways:

1. Choose an embedding a priori by fixing the delay time τ and the number of delays. This results in a nonlinear autoregressive (NAR) model structure, in our context referred to as a time-delay neural network.
2. Do not fix the embedding, but create a (neural network) model structure with recurrent connections that the network can use to learn an appropriate state representation during training.

Although the second option employs a more flexible model representation (see, for example, the FIR network model used by Wan, 1994, or the partially recurrent network of Bulsari & Saxén, 1995), we prefer to fix the state beforehand since we do not want to mix the learning of two very different things: an appropriate state representation and the approximation of F in equation 3.1. Bengio, Simard, and Frasconi (1994) showed that this mix makes it very difficult to learn long-term dependencies, while Lin, Horne, Tino, and Giles (1996) showed that this can be solved by using the NAR(X) model structure. Finally, Siegelmann, Horne, and Giles (1997) showed that the NAR(X) model structure may be not as compact but is computationally as strong as a fully connected recurrent neural network.

3.1 Choice of Embedding. The choice of the delay time τ and the embedding m is highly empirical. First we fix the time window, with length T = τ(m − 1) + 1. For a proper choice of T, we consider that it is too short if the variation within the delay vector is dominated by noise, and it is longer than necessary if it exceeds the maximum prediction horizon of the system, that is, if the first and last part of the delay vector have no correlation at all. Next we choose the delay time τ. Figure 3 depicts two possible choices of τ for a fixed window length. It is important to realize that τ equals the one-step-ahead prediction horizon of our model. The results of Rico-Martínez et al. (1992) show that if τ is chosen large (such that successive delays show little linear correlation), the long-term solutions of the model may be apparently different from the real system due to the very discrete nature of the model.
A second argument to choose τ small is that the further we predict ahead into the future, the more nonlinear the mapping F in equation 3.1 will be. A small delay time will require the lowest functional complexity. Choosing τ (very) small introduces three potential problems:
1. Since we fixed the time window, the delay vector becomes very long, and subsequent delays will be highly correlated. This we solve by reducing its dimension with PCA (see section 3.2).

2. The one-step prediction will be highly correlated with the most recent delays, and a linear model will already make good one-step predictions (see section 3.4). We adopt the error-propagation learning algorithm (see section 4) to ensure learning of nonlinear correlations.

3. The time needed to compute predictions over a certain time interval is inversely proportional to the delay time. For real-time applications, too small a delay time will make the computation of predictions prohibitively time-consuming.
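For illustration, the delay-vector construction x⃗_n = (y_{n−m+1}, . . . , y_n) described above, with the pendulum's eventual choice τ = 1 and m = 25, can be sketched as follows; the helper name `delay_embed` and the sine stand-in series are our assumptions, not part of the paper:

```python
import numpy as np

def delay_embed(y, m, tau=1):
    """Build delay vectors x_n = (y_{n-(m-1)tau}, ..., y_n), one per row."""
    n = len(y) - (m - 1) * tau
    return np.stack([y[i * tau : i * tau + n] for i in range(m)], axis=1)

y = np.sin(0.3 * np.arange(200))      # stand-in for a measured scalar series
X = delay_embed(y, m=25, tau=1)       # one 25-dimensional delay vector per row
print(X.shape)
```

With 200 samples, m = 25, and τ = 1, this yields 176 delay vectors of length 25; choosing a larger τ with the same window T would thin out the delays as in Figure 3a.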
Learning Chaotic Attractors
2361
Figure 3: Two examples of an embedding with delays, applied to the pendulum time series of Figure 1a. Compared to (a), in (b) the number of delays within the fixed time window is large, there is much (linear) correlation between the delays, and the one-step prediction horizon is small.
The above considerations guided us, for the pendulum, to the embedding of Figure 3b, where τ = 1 and T = m = 25.

3.2 Principal Component Embedding. The consequence of a small delay time in a fixed time window is a large embedding dimension, m ≈ T/τ, causing the model to have an excess number of inputs that are highly correlated. Broomhead and King (1986) proposed a very convenient way to reduce the dimension of the delay vector. They called it singular system analysis, but it is better known as PCA. PCA transforms the set of delay vectors to a new set of linearly uncorrelated variables of reduced dimension (see Jackson, 1991, for details). The principal component vectors $\vec z_n$ are computed from the delay vectors $\vec x_n$:

$$\vec z_n = V^T \vec x_n, \tag{3.1}$$
where V is obtained from the singular value decomposition $X = U S V^T$ of the matrix X that contains the original delay vectors $\vec x_n$. An essential ingredient of PCA is to examine how much energy is lost for each element of $\vec z$ that is discarded. For the pendulum, it is found that the first 8 components
2362 R. Bakker, J. C. Schouten, C. L. Giles, F. Takens, and C. M. van den Bleek
Figure 4: Accuracy (RMSE, root mean square error) of the reconstruction of the original delay vectors (25 delays) from the principal component vectors (8 principal components) for the pendulum time series: (a) nonweighted PCA; (c) weighted PCA; (d) weighted PCA and use of update formula equation 4.1. The standard profile we use for weighted PCA is shown in (b).
(out of 25) explain 99.9% of the variance of the original data. To reconstruct the original data vectors, PCA takes a weighted sum of the eigenvectors of X (the columns of V). Note that King, Jones, and Broomhead (1987) found that for large embedding dimensions, principal components are essentially Fourier coefficients and the eigenvectors are sines and cosines. In particular for the case of time-series prediction, where recent delays are more important than older delays, it is important to know how well each individual component of the delay vector $\vec x_n$ can be reconstructed from $\vec z_n$. Figure 4a shows the average reconstruction error (RMSE, root mean squared error) for each individual delay, for the pendulum example where we use 8 principal components. It appears that the error varies with the delay and, in particular, that the most recent and the oldest measurement are less accurately represented than the others. This is no surprise, since the first and last delay have correlated neighbors on one side only. Ideally, one would like to put more weight on recent measurements than on older ones. Grassberger et al. (1991) remarked that this seems not easy when using PCA. Fortunately, the contrary is true, since we can do a weighted PCA and set the desired
accuracy of each individual delay: each delay is multiplied by a weight constant. The larger this constant, the more the delay contributes to the total variance of the data, and the more accurately it will need to be represented. For the pendulum we tried the weight profile of Figure 4b. It puts a double weight on the most recent value, to compensate for the neighbor effect. The reconstruction result is shown in Figure 4c. As expected, the reconstruction of recent delays has become more accurate.

3.3 Updating Principal Components. In equation 3.1, the new state of the system is predicted from the previous one. But the method of delays has introduced information from the past into the state of the system. Since it is not sound to predict the past, we first predict only the new measured variable,
$$y_{n+1} = F'(\vec x_n), \tag{3.2}$$
and then update the old state $\vec x_n$ with the prediction $y_{n+1}$ to give $\vec x_{n+1}$. For delay vectors $\vec x_n = (y_{n-m+1}, \ldots, y_n)$, this update simply involves a shift of the time window: the oldest value of y is removed and the new, predicted value comes in. In our case, where the state consists of principal components $\vec z_n$ rather than $\vec x_n$, the update requires three steps: (1) reconstruct the delay vector from the principal components, (2) shift the time window, and (3) construct the principal components from the updated delay vector. Fortunately, these three operations can be done in a single matrix computation that we call the update formula (see Bakker, de Korte, Schouten, Takens, & van den Bleek, 1997, for a derivation):

$$\vec z_{n+1} = A \vec z_n + \vec b\, y_{n+1} \tag{3.3}$$

with

$$A_{ij} = \sum_{k=2}^{m} V^{-1}_{i,k} V_{k-1,j}; \qquad b_i = V^{-1}_{i,1}. \tag{3.4}$$
Since V is orthogonal, the inverse of V simply equals its transpose, $V^{-1} = V^T$. The use of the update formula, equation 3.3, introduces some extra loss of accuracy, for it computes the new state not from the original but from reconstructed variables. This has a consequence, usually small, for the average reconstruction error; compare for the pendulum example Figure 4c to Figure 4d. A final important issue when using principal components as neural network inputs is that the principal components are usually scaled according to the value of their eigenvalue. But the small-random-weights initialization for the neural networks that we will use assumes inputs that are scaled approximately to zero mean and unit variance. This is achieved by replacing V by its scaled version $W = VS$ and its inverse by $W^{-1} = ((X^T X)^{-1} W)^T$. Note that this does not affect the reconstruction errors of the PCA compression.
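The update formula can be checked numerically. The sketch below is our own illustration, not the authors' code; we order the delay vector newest-first so that the index pattern of equation 3.4 comes out directly, and we keep all m components so the reconstruct-shift-project identity is exact (a truncated basis makes it approximate).

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5
# rows of X are delay vectors, ordered newest-first: x_n = (y_n, ..., y_{n-m+1})
X = rng.normal(size=(200, m))
_, _, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V^T
V = Vt.T                                           # z_n = V^T x_n
Vinv = V.T                                         # V is orthogonal

# shift matrix: moves every delay one slot down, freeing slot 1 for y_{n+1}
shift = np.zeros((m, m))
shift[1:, :-1] = np.eye(m - 1)
e1 = np.zeros(m)
e1[0] = 1.0

# update formula (eq. 3.3): z_{n+1} = A z_n + b y_{n+1},
# i.e. reconstruct (V), shift (shift, e1), project back (V^T) in one step
A = Vinv @ shift @ V      # matches A_ij = sum_{k=2}^m Vinv_{i,k} V_{k-1,j}
b = Vinv @ e1             # matches b_i = Vinv_{i,1}

# compare with the explicit three-step update for one state
x_n = X[0]
y_next = 0.37
z_n = Vinv @ x_n
x_next = np.concatenate(([y_next], x_n[:-1]))      # shift the time window
assert np.allclose(A @ z_n + b * y_next, Vinv @ x_next)
```

The single-matrix form avoids reconstructing and re-projecting the full delay vector at every prediction step.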
Figure 5: Structure of the prediction model. The state is updated with a new value of the measured variable (update unit), then fed into the linear model and an MLP neural network to predict the next value of the measured variable.
3.4 Prediction Model. After defining the state of our experimental system, we need to choose a representation of the nonlinear function F in equation 2.1. Because we prefer to use a short delay time τ, the one-step-ahead prediction will be highly correlated with the most recent measurement, and a linear model may already make accurate one-step-ahead predictions. A chaotic system is, however, inherently nonlinear, and therefore we add a standard multilayer perceptron (MLP) to learn the nonlinearities. There are two reasons for using both a linear and an MLP model, even though the MLP itself can represent a linear model: (1) the linear model parameters can be estimated quickly beforehand with linear least squares, and (2) it is instructive to see at an early stage of training whether the neural network is learning more than linear correlations only. Figure 5 shows a schematic representation of the prediction model, with a neural network unit, a linear model unit, and a principal component update unit.

4 Training Algorithm
After setting up the structure of the model, we need to find a set of parameters for the model such that it will have an attractor as close as possible to that of the real system. In neural network terms: train the model to learn the attractor. Recall that we would like to minimize directly the discrepancy between the model and system attractors, but stick to the practical training objective: minimize short-term prediction errors. The question arises whether this is sufficient: will the prediction model also have approximately the
same attractor as the real system? Principe et al. (1992) showed in an experimental study that this is not always the case, and some years later Kuo and Principe (1994) and also Deco and Schürmann (1994) proposed, as a remedy, to minimize multistep rather than just one-step-ahead prediction errors. (See Haykin & Li, 1995, and Haykin & Principe, 1998, for an overview of these methods.) This approach still minimizes prediction errors (and can still fail), but it leaves more room for optimization, as it makes the prediction horizon user adjustable. In this study we introduce, in section 4.1, the compromise method (Werbos et al., 1992) as a computationally very efficient alternative for extending the prediction horizon. In section 5, we assess how best to detect differences between chaotic attractors. We will stop training when the prediction error is small and no differences between attractors can be detected.

4.1 Error Propagation. To train a prediction model, we minimize the squared difference between the predicted and measured future values. The prediction is computed from the principal component state $\vec z_n$ by
$$\hat y_{n+1} = F'(\vec z_n). \tag{4.1}$$
In section 3.1 we recommended choosing the delay time τ small. As a consequence, the one-step-ahead prediction will be highly correlated with the most recent measurements and may eventually be dominated by noise. This will result in bad multistep-ahead predictions. As a solution, Su, McAvoy, and Werbos (1992) proposed to train the network with an output error method, where the state is updated with previously predicted outputs instead of measured outputs. They call this a pure robust method and applied it successfully to nonautonomous, nonchaotic systems. For autonomous systems, applying an output error method would imply that one trains the model to predict the entire time series from only a single initial condition. But chaotic systems have only limited predictability (Bockmann, 1991), and this approach must fail unless the prediction horizon is limited to just a few steps. This is the multistep-ahead learning approach of Kuo and Principe (1994), Deco and Schürmann (1994), and Jaeger and Kantz (1996): a number of initial states are selected from the measured time series, and the prediction errors over a small prediction sequence are minimized. Kuo and Principe (1994) related the length of the sequences to the largest Lyapunov exponent of the system, and Deco and Schürmann (1994) also put less weight on longer-term predictions to account for the decreased predictability. Jaeger and Kantz (1996) address the so-called error-in-variables problem: a standard least-squares fit applied to equation 4.1 introduces a bias in the estimated parameters because it does not account for the noise in the independent variables $\vec z_n$. We investigate a second possibility to make the training of models for chaotic systems more robust. It recognizes that the decreasing predictability
of chaotic systems can alternatively be explained as a loss of information about the state of the system as time evolves. From this point of view, the output error method will fail to predict the entire time series from only an initial state because this single piece of information will be lost after a short prediction time, unless additional information is supplied while the prediction proceeds. Let us compare the mathematical formulation of the three methods: one-step-ahead, multistep-ahead, and output error with addition of information. All cases use the same prediction equation, 4.1. They differ in how they use the update formula, equation 3.3. The one-step-ahead approach updates each time with a new, measured value,

$$\vec z^{\,r}_{n+1} = A \vec z^{\,r}_n + \vec b\, y_{n+1}. \tag{4.2}$$
The state is labeled $\vec z^{\,r}$ because it will be used as a reference state later, in equation 4.9. Trajectory learning, or output error learning with limited prediction horizon, uses the predicted value:

$$\vec z_{n+1} = A \vec z_n + \vec b\, \hat y_{n+1}. \tag{4.3}$$
For the third method, the addition of information must somehow find its source in the measured time series. We propose two ways to update the state with a mixture of the predicted and measured values. The first one is

$$\vec z_{n+1} = A \vec z_n + \vec b\,[(1-\eta)\, y_{n+1} + \eta\, \hat y_{n+1}], \tag{4.4}$$
or, equivalently,

$$\vec z_{n+1} = A \vec z_n + \vec b\,[y_{n+1} + \eta\, e_n], \tag{4.5}$$
where the prediction error is defined as $e_n \equiv \hat y_{n+1} - y_{n+1}$. Equation 4.5 and the illustrations in Figure 6 make clear why we call η the error propagation parameter: previous prediction errors are propagated to the next time step, to a degree set by η, which we must choose between zero and one. Alternatively, we can adopt the view of Werbos et al. (1992), who presented the method as a compromise between the output-error (pure-robust) and equation-error methods. Prediction errors are neither completely propagated to the next time step (output-error method) nor completely suppressed (equation-error method); they are partially propagated to the next time step. For η = 0, error propagation is switched off, and we get equation 4.2. For η = 1, we have full error propagation, and we get equation 4.3, which will work only if the prediction time is small. In practice, we use η as a user-adjustable parameter intended to optimize model performance by varying the effective prediction horizon. For example, for the laser data model in section 7, we did five neural network training sessions, each with a different value of η,
Figure 6: Three ways to make predictions: (a) one-step-ahead, (b) multistep-ahead, and (c) compromise (partially keeping the previous error in the state of the model).
ranging from 0 to 0.9. In the appendix, we analyze error propagation training applied to a simple problem: learning the chaotic tent map. There it is found that measurement errors will be suppressed if a good value for η is chosen, but that too high a value will result in a severe bias in the estimated parameter. Multistep-ahead learning has a similar disadvantage: the length of the prediction sequence should not be chosen too high. The error propagation equation, 4.5, has a problem when applied to the linear model, which we often add before even starting to train the MLP (see section 3.4). If we replace the prediction model, 4.4, by a linear model,

$$y_{n+1} = C \vec x_n, \tag{4.6}$$
and substitute this in equation 4.4, we get

$$\vec z_{n+1} = A \vec z_n + \vec b\,[(1-\eta)\, y_{n+1} + \eta\, C \vec z_n], \tag{4.7}$$

or

$$\vec z_{n+1} = (A + \eta \vec b C)\, \vec z_n + \vec b\,(1-\eta)\, y_{n+1}. \tag{4.8}$$
This is a linear evolution of $\vec z$ with y as an exogenous input. It is BIBO (bounded input, bounded output) stable if the largest eigenvalue of the matrix $A + \eta \vec b C$ is smaller than 1 in absolute value. This eigenvalue depends nonlinearly on η, and we found many cases in practice where the evolution is stable for η equal to zero or one, but not for a range of in-between values, thus making error propagation training impossible for those values of η. A good remedy is to change the method into the following improved error propagation formula:

$$\vec z_{n+1} = A\,[\vec z^{\,r}_n + \eta(\vec z_n - \vec z^{\,r}_n)] + \vec b\,[y_{n+1} + \eta\, e_n], \tag{4.9}$$
where the reference state $\vec z^{\,r}_n$ is the state obtained without error propagation, as defined in equation 4.2. Now stability is assessed by taking the eigenvalues of $\eta(A + \vec b C)$, which are η times smaller than the eigenvalues of $A + \vec b C$, and therefore error propagation cannot disturb the stability of the system in the case of a linear model. We prefer equation 4.9 over equation 4.5, even if no linear model is added to the MLP.

4.2 Optimization and Training. For network training we use backpropagation through time (BPTT) and conjugate gradient minimization. BPTT is needed because a nonzero error propagation parameter introduces recurrency into the network. If a linear model is used in parallel with the neural network, we always optimize the linear model parameters first. This makes convergence of the MLP initially slow (the easy part has already been done) and normal later on. The size of the MLPs is determined by trial and error. Instead of using early stopping to prevent overfitting, we use the Diks test monitoring curves (see section 5) to decide when to stop: select exactly the number of iterations that yields an acceptable Diks test value and a low prediction error on independent test data. The monitoring curves are also used to find a suitable value for the error propagation parameter; we pick the value that yields a curve with the least irregular behavior (see the examples in sections 6 and 7). Fortunately, the extra work to optimize η replaces the search for an appropriate delay time: we choose that as small as the sampling time of the measurements. In a preliminary version of the algorithm (Bakker et al., 1997), the parameter η was taken as an additional output of the prediction model, to make it state dependent. The idea was that the training procedure would automatically choose η large in regions of the embedding space where the predictability is high, and small where predictability is low.
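The bookkeeping of equations 4.2 through 4.9 can be sketched as a forward pass (our own minimal illustration, not the authors' code; `predict` stands in for the trained linear-plus-MLP map F′, and the BPTT gradient computation is omitted):

```python
import numpy as np

def forward_pass(y, A, b, predict, eta):
    """Run the state update over a measured series y with error propagation.

    eta = 0 reproduces eq. 4.2 (update with measured values only);
    eta = 1 gives full output-error updates; in between, the improved
    form of eq. 4.9 propagates only the eta-weighted deviation from the
    reference state, so a linear model stays stable for all eta in [0, 1].
    Returns the one-step prediction errors e_n.
    """
    z_ref = np.zeros(len(b))   # reference state, eq. 4.2
    z = np.zeros(len(b))       # error-propagation state, eq. 4.9
    errors = []
    for y_next in y[1:]:
        e = predict(z) - y_next                               # e_n
        errors.append(e)
        z = A @ (z_ref + eta * (z - z_ref)) + b * (y_next + eta * e)
        z_ref = A @ z_ref + b * y_next
    return np.array(errors)

# toy setup: 3-dimensional state with a trivial linear predictor
# (illustrative values only)
A = 0.5 * np.eye(3)
b = np.array([1.0, 0.0, 0.0])
predict = lambda z: z[0]
y = np.sin(np.linspace(0.0, 6.0, 50))
errs0 = forward_pass(y, A, b, predict, eta=0.0)   # teacher forcing
errs1 = forward_pass(y, A, b, predict, eta=0.7)   # partial error propagation
```

In training, these errors would be squared, summed, and differentiated through the recurrent loop by BPTT.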
We now think that this gives the minimization algorithm too many degrees of freedom, and we prefer simply to optimize η by hand.

5 Diks Test Monitoring
Because error propagation training does not guarantee learning of the system attractor, we need to monitor a comparison between the model and system attractors. Several authors have used dynamic invariants such as Lyapunov exponents, entropies, and correlation dimension to compare attractors (Aguirre & Billings, 1994; Principe et al., 1992; Bakker et al., 1997; Haykin & Puthusserypady, 1997). These invariants are certainly valuable, but they do not uniquely identify the chaotic system, and their confidence bounds are hard to estimate. Diks et al. (1996) have developed a method that is unique: if the two attractors are the same, the test value will be zero for a sufficient number of samples. The method can be seen as an objective substitute for visual comparison of attractors plotted in the embedding
space. The test works for low- and high-dimensional systems, and, unlike the previously mentioned invariants, its variance can be well estimated from the measured data. The following null hypothesis is tested: the two sets of delay vectors (model generated, measured) are drawn from the same multidimensional probability distribution. The two distributions, $\rho_1(\vec r)$ and $\rho_2(\vec r)$, are first smoothed using gaussian kernels, resulting in $\rho'_1(\vec r)$ and $\rho'_2(\vec r)$, and then the distance between the smoothed distributions is defined to be the volume integral over the squared difference between the two distributions:

$$Q = \int d\vec r\, \left(\rho'_1(\vec r) - \rho'_2(\vec r)\right)^2. \tag{5.1}$$
The smoothing is needed to make Q applicable to the fractal probability distributions encountered in the strange attractors of chaotic systems. Diks et al. (1996) provide an unbiased estimate for Q and for its variance, $V_c$, where the subscript c means "under the condition that the null hypothesis holds." In that case, the ratio $S = \hat Q/\sqrt{V_c}$ is a random variable with zero mean and unit variance. Two issues need special attention: the test assumes independence among all vectors, and the gaussian kernels use a bandwidth parameter d that determines the length scale of the smoothing. In our case, the successive states of the system will be highly dependent due to the short prediction time. The simplest solution is to choose a large time interval between the delay vectors, at least several embedding times, and ignore what happens in between. Diks et al. (1996) propose a method that makes more efficient use of the available data. It divides the time series into segments of length l and uses an averaging procedure such that the assumption of independent vectors can be replaced by the assumption of independent segments. The choice of bandwidth d determines the sensitivity of the test. Diks et al. (1996) suggest using a small part of the available data to find out which value of d gives the highest test value, and using this value for the final test with the complete sets of data. In our practical implementation, we compress the delay vectors to principal component vectors, for computational efficiency, knowing that

$$\|\vec x_1 - \vec x_2\|^2 \approx \|U(\vec z_1 - \vec z_2)\|^2 = \|\vec z_1 - \vec z_2\|^2, \tag{5.2}$$
since U is orthogonal. We vary the sampling interval randomly between one and two embedding times and use l = 4. We generate 11 independent sets with the model, each with a length of two times the measured time-series length. The first set is used to find the optimum value for d, and the other 10 sets to get an estimate of S and to verify experimentally its standard deviation, which is supposed to be unity. Diks et al. (1996) suggest rejecting the null hypothesis with more than 95% confidence for S > 3. Our estimate of S is the average of 10 evaluations. Each evaluation uses new and independent model-generated data, but the same measured data (only the
Figure 7: Structure of pendulum model I, which uses all available information about the pendulum: the measured angle θ, the known phase of the driving force φ, the estimated angular velocity Ω, and nonlinear terms (sines and cosines) that appear in the equations of motion, equation 2.1.
sampling is done again). To account for the 10 half-independent evaluations, we lower the 95% confidence threshold to $S = 3/\sqrt{\tfrac{1}{2}\cdot 10} \approx 1.4$. Note that the computational complexity of the Diks test is the same as that of computing correlation entropies or dimensions: roughly $O(N^2)$, where N is the number of delay vectors in the larger of the two sets to be compared.

6 Modeling the Experimental Pendulum
In this section, we present two models for the chaotic, driven, and damped pendulum that was introduced in section 2.1. The first model, pendulum model I, uses all possible relevant inputs, that is, the measured angle, the known phase of the driving force, and some nonlinear terms suggested by the equations of motion of the pendulum. This model does not use error propagation or principal component analysis; it is used just to demonstrate the performance of the Diks test. The second model, pendulum model II, uses a minimum of different inputs: just a single measured series of the estimated angular velocity Ω (defined in section 2.1). This is a practical verification of the Takens theorem (see section 3), since we disregard two of the three "true" state variables and use delays instead. We will reduce the dimension of the delay vector by PCA. There is no need to use error propagation, since the data have a very low noise level.

6.1 Pendulum Model I. This model is the same as that of Bakker et al. (1996), except for the external input, which is not present here. The input-output structure is shown in Figure 7. The network has 24 nodes in the first and 8 in the second hidden layer. The network is trained by backpropagation
Figure 8: Development of the Poincaré plot of pendulum model I during training, to be compared with the plot of measured data (a). The actual development is not as smooth as suggested by (b-f), as can be seen from the Diks test monitoring curves in Figure 9.
with conjugate gradient minimization on a time series of 16,000 samples and reaches an error level for the one-step-ahead prediction of the angle θ as low as 0.27 degree on independent data. The attractor of this system is the plot of the evolution of the state vectors (Ω, θ, φ)ₙ in a three-dimensional space. Two-dimensional projections, called Poincaré sections, are shown in Figure 8: each time a new cycle of the sinusoidal driving force starts, when the phase φ is zero, a point (Ω, θ)ₙ is plotted. Figure 8a shows measured data, while Figures 8b through 8f show how the attractor of the neural network model evolves as training proceeds. Visual inspection reveals that the Poincaré section after 8000 conjugate gradient iterations is still slightly different from the measured attractor, and it is interesting to see whether we would draw the same conclusion if we used the Diks test. To enable a fair comparison with the pendulum model II results in the next section, we compute the Diks test not directly from the states (Ω, θ, φ)ₙ (that would probably be most powerful), but instead reconstruct the attractor in a delay space of the variable Ω, the only variable that will be available in pendulum model II, and use these delay vectors to compute the Diks test. To see how the sensitivity of the test depends on the length of the time series, we compute it three times,
Figure 9: Diks test monitoring curve for pendulum model I. The Diks test converges to zero (no difference between model and measured attractor can be found), but the spikes indicate the extreme sensitivity of the model attractor to the model parameters.
using 1, 2, and 16 data sets of 8000 points each. These numbers should be compared to the length of the time series used to create the Poincaré sections: 1.5 million points, of which every 32nd point (when the phase φ is 0) is plotted. The Diks test results are printed in the top-right corner of each of the graphs in Figure 8. The test based on only one set of 8000 points (Diks 8K) cannot see the difference between Figures 8a and 8c, whereas the test based on all 16 sets (Diks 128K) can still distinguish between Figures 8a and 8e. It may seem from the Poincaré sections in Figures 8b through 8f that the model attractor gradually converges toward the measured attractor as training proceeds. However, from the Diks test monitoring curve in Figure 9, we can conclude that this is not at all the case. The spikes in the curve indicate that even after the model attractor gets very close to the measured attractor, a complete mismatch may occur from one iteration to the next. We will see similar behavior for pendulum model II (see section 6.2) and the laser data model (see section 7), and we address the issue in the discussion (see section 8).

6.2 Pendulum Model II. This model uses information from just a single variable, the estimated angular velocity Ω. From the equations of motion, equation 2.1, it follows that the pendulum has three state variables, and the Takens theorem tells us that in the case of noise-free data an embedding of 7 (= 2 · 3 + 1) delays will be sufficient. In our practical application, we take the embedding of Figure 3b with 25 delays, which are reduced to 8 principal components by weighted PCA. The network has 24 nodes in the first and 8 in the second hidden layer, and a linear model is added. First, the network is trained for 4000 iterations on a training set of 8000 points; then a second set of 8000 points is added, and training proceeds for another 2000 iterations,
reaching an error level of 0.74 degree, two and a half times larger than for model I. The Diks test monitoring curve (not shown) has the same irregular behavior as for pendulum model I. In between the spikes, it eventually varies between one and two. Compare this to model I after 2000 iterations, as shown in Figure 8e, which has a Diks test value of 0.6: model II does not learn the attractor as well as model I. That is the price we pay for omitting a priori information about the phase of the driving force and the nonlinear terms from the equations of motion. We also computed Lyapunov exponents, λ. For model I, we recall the estimated largest exponent of 2.5 bits per driving cycle, as found in Bakker et al. (1996). For model II, we computed a largest exponent of 2.4 bits per cycle, using the method of von Bremen et al. (1997), which uses QR factorizations of tangent maps along model-generated trajectories. The method actually computes the entire Lyapunov spectrum, and from that we can estimate the information dimension $D_1$ of the system, using the Kaplan-Yorke conjecture (Kaplan & Yorke, 1979): take that (real) value of i where the interpolated curve of $\sum_i \lambda_i$ versus i crosses zero. This happens at $D_1 = 3.5$, with an estimated standard deviation of 0.1. For comparison, the ideal chaotic pendulum, as defined by equation 2.1, has three exponents, and as mentioned in Bakker et al. (1996), the largest is positive, the second zero, and the sum of all three is negative. This tells us that the dimension must lie between 2 and 3. This means that pendulum model II has a higher dimension than the real system. We think this is because of the high-dimensional embedding space (eight principal components) that is used. Apparently the model has not learned to make the five extra, spurious Lyapunov exponents more negative than the smallest exponent of the real system. Eckmann, Kamphorst, Ruelle, and Ciliberto (1986) report the same problem when they model chaotic data with local linear models.
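The Kaplan-Yorke estimate used here amounts to a small computation once the spectrum is known; a sketch (the helper name `kaplan_yorke` is ours, and the spectrum in the example is illustrative, not one from this study):

```python
import numpy as np

def kaplan_yorke(lyap):
    """Kaplan-Yorke (Lyapunov) dimension from a Lyapunov spectrum.

    Sorts the exponents in descending order and returns the (real) index
    where the linearly interpolated curve of cumulative sums of lambda_i
    versus i crosses zero:
        D1 = j + (lambda_1 + ... + lambda_j) / |lambda_{j+1}|.
    """
    lam = np.sort(np.asarray(lyap, dtype=float))[::-1]
    csum = np.cumsum(lam)
    if csum[-1] >= 0:        # the cumulative sum never goes negative
        return float(len(lam))
    if csum[0] < 0:          # even the largest exponent is negative
        return 0.0
    j = int(np.max(np.nonzero(csum >= 0)[0])) + 1
    return j + csum[j - 1] / abs(lam[j])

# illustrative three-exponent spectrum with a positive, a zero, and a
# strongly negative exponent: D1 = 2 + 0.9 / 14.6
d1 = kaplan_yorke([0.9, 0.0, -14.6])
```

The zero crossing of the interpolated cumulative sum is exactly the point where adding a further (negative) exponent would make the partial sum negative.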
We also computed an estimate for the correlation dimension and entropy of the time series with the computer package RRCHAOS (Schouten & van den Bleek, 1993), which uses the maximum likelihood estimation procedures of Schouten, Takens, and van den Bleek (1994a, 1994b). For the measured data, we found a correlation dimension of 3.1 ± 0.1, and for the model-generated data 2.9 ± 0.1: there was no significant difference between the two numbers, but both are very close to the theoretical maximum of three. For the correlation entropy, we found 2.4 ± 0.1 bits per cycle for the measured data and 2.2 ± 0.2 bits per cycle for the model-generated data. Again there was no significant difference, and, moreover, the entropy is almost the same as the estimated largest Lyapunov exponent. For pendulum model II we relied on the Diks test to compare attractors. However, even though we do not have the phase of the driving force available as we did in model I, we can still make plots of cross sections of the attractor by projecting them into a two-dimensional space. This is what we do in Figure 10: every time the first state variable upwardly crosses zero, we plot the remaining seven state variables in a two-dimensional projection.
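The cross-section construction just described, collecting the remaining state components whenever the first state variable crosses zero upward, can be sketched as follows (our own minimal illustration, not code from the paper):

```python
import numpy as np

def poincare_section(states):
    """Collect the states at upward zero crossings of the first component.

    states : (N, d) array of (principal component) state vectors.
    Returns the remaining d - 1 components at each crossing, i.e. the
    kind of points plotted in a projection like Figure 10.
    """
    z1 = states[:, 0]
    up = (z1[:-1] < 0) & (z1[1:] >= 0)   # upward crossing between samples
    return states[1:][up][:, 1:]

# toy trajectory: a circle in the (z1, z2) plane crosses z1 = 0 upward
# once per revolution, at which point z2 is close to 1
t = np.linspace(0.0, 4.0 * np.pi, 400, endpoint=False)
states = np.column_stack([np.sin(t), np.cos(t)])
pts = poincare_section(states)
```

For a real model trajectory, `states` would hold the eight principal components, and the returned points would be projected onto two of them for plotting.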
Figure 10: Poincaré section based on (a) measured and (b) model-generated data. A series of truncated states $(z_1, z_2, z_3)^T$ is reconstructed from the time series; a point $(z_2, z_3)^T$ is plotted every time $z_1$ crosses zero (upward).
7 Laser Data Model
The second application of the training algorithm is the well-known laser data set (set A) from the Santa Fe time series competition (Weigend & Gershenfeld, 1994). The model is similar to that reported in Bakker, Schouten, Giles, and van den Bleek (1998), but instead of using the first 3000 points for training, we use only 1000 points, exactly the amount of data that was available before the results of the time-series competition were published. We took an embedding of 40 and used PCA to reduce the 40 delays to 16 principal components. First, a linear model was fitted to make one-step-ahead predictions. This already explained 80% of the output data variance. The model was extended with an MLP, having 16 inputs (the 16 principal components), two hidden layers with 32 and 24 sigmoidal nodes, and a single output (the time-series value to be predicted). Five separate training sessions were carried out, using 0%, 60%, 70%, 80%, and 90% error propagation (the percentage corresponds to η · 100). Each session lasted 4000 conjugate gradient iterations, and the Diks test was evaluated every 20 iterations. Figure 11 shows the monitoring results. We draw three conclusions from it:

1. While the prediction error on the training set (not shown) monotonically decreases during training, the Diks test shows very irregular behavior. We saw the same irregular behavior for the two pendulum models.

2. The case of 70% error propagation converges best to a model that has learned the attractor and does not "forget the attractor" when training proceeds. The error propagation helps to suppress the noise that is present in the time series.
Figure 11: Diks test monitoring curves for the laser data model with (a) 0%, (b) 60%, (c) 70%, (d) 80%, and (e) 90% error propagation. On the basis of these curves, the model trained with 70% error propagation, after 1300 iterations, is selected for further analysis.
3. The common stopping criterion, "stop when the error on the test set is at its minimum," may yield a model with an unacceptable Diks test result. Instead, we have to monitor the Diks test carefully and select exactly that model that has both a low prediction error and a good attractor: a low Diks test value that does not change from one iteration to the next.

Experience has shown that the Diks test result of each training session is sensitive to the initial weights of the MLP; the curves are not well reproducible. Trial and error is required. For example, in Bakker et al. (1998), where we used more training data to model the laser, we selected 90% error propagation (EP) as the best choice by looking at the Diks test monitoring curves. For further analysis, we select the model trained with 70% EP after 1300 iterations, having a Diks test value, averaged over 100 iterations, as low as −0.2. Figure 12b shows a time series generated by this model along with measured data. For the first 1000 steps, the model runs in 90% EP mode and follows the measured time series. Then the mode is switched to free run (closed-loop prediction), or 100% EP, and the model generates a time series on its own. In Figure 12d the closed-loop prediction is extended over a much longer time horizon.

Figure 12: Comparison of time series (c) measured from the laser and (d) closed-loop (free run) predicted by the model. Plots (a) and (b) show the first 2000 points of (c) and (d), respectively.

Visual inspection of the time series and the Poincaré plots of Figures 13a and 13b confirms the results of the Diks test: the measured and model-generated series look the same, and it is very surprising that the neural network learned the
Figure 13: Poincar´e plots of (a) measured and (b) model-generated laser data. A series of truncated states (z1 , z2 , z3 , z4 )T is reconstructed from the time series; a point (z2 , z3 , z4 )T is plotted every time z1 crosses zero (upward).
Learning Chaotic Attractors
2377
chaotic dynamics from 1000 points that contain only three "rise-and-collapse" cycles. We computed the Lyapunov spectrum of the model, using series of tangent maps along 10 different model-generated trajectories of length 6000. The three largest exponents are 0.92, 0.09, and −1.39 bits per 40 samples. The largest exponent is larger than the rough estimate of 0.7 by Hübner et al. (1994). The second exponent must be zero according to Haken (1983). To verify this, we estimate, from 10 evaluations, a standard deviation of the second exponent of 0.1, which is of the same order as its magnitude. From the Kaplan-Yorke conjecture, we estimate an information dimension D1 of 2.7, not significantly different from the value of 2.6 that we found in another model for the same data (Bakker et al., 1998).

Finally, we compare the short-term prediction error of the model to that of the winner of the Santa Fe time-series competition (see Wan, 1994). Over the first 50 points, our normalized mean squared error (NMSE) is 0.0068 (winner: 0.0061); over the first 100 it is 0.2159 (winner: 0.0273). Given that the long-term behavior of our model closely resembles the measured behavior, we want to stress that the larger short-term prediction error does not imply that the model is "worse" than Wan's model. The system is known to be very unpredictable at the signal collapses, having a number of possible continuations. The model may follow one continuation, while a very slight deviation may cause the "true" continuation to be another.

8 Summary and Discussion
We have presented an algorithm to train a neural network to learn chaotic dynamics, based on a single measured time series. The task of the algorithm is to create a global nonlinear model that has the same behavior as the system that produced the measured time series. The common approach consists of two steps: (1) train a model to short-term predict the time series, and (2) analyze the dynamics of the long-term (closed-loop) predictions of this model and decide to accept or reject it. We demonstrate that this approach is likely to fail, and our objective is to reduce the chance of failure greatly. We recognize the need for (1) an efficient state representation, by using weighted PCA; (2) an extended prediction horizon, by adopting the compromise method of Werbos, McAvoy, and Su (1992); and (3) a stopping criterion that selects exactly that model that makes accurate short-term predictions and matches the measured attractor. For this criterion we propose to use the Diks test (Diks et al., 1996), which compares two attractors by looking at the distribution of delay vectors in the reconstructed embedding space. We use an experimental chaotic pendulum to demonstrate features of the algorithm and then benchmark it by modeling the well-known laser data from the Santa Fe time-series competition. We succeeded in creating a model that produces a time series with the same typical irregular rise-and-collapse behavior as the measured data, learned
from a 1000-sample data set that contains only three rise-and-collapse sequences. A major difficulty with training a global nonlinear model to reproduce chaotic dynamics is that the long-term generated (closed-loop) prediction of the model is extremely sensitive to the model parameters. We show that the model attractor does not gradually converge toward the measured attractor when training proceeds. Even after the model attractor has come close to the measured attractor, a complete mismatch may occur from one training iteration to the other, as can be seen from the spikes in the Diks test monitoring curves of all three models (see Figures 9 and 11). We think of two possible causes. First, the sensitivity of the attractor to the model parameters is a phenomenon inherent in chaotic systems (see, for example, the bifurcation plots in Aguirre & Billings, 1994, Fig. 5). Also, the presence of noise in the dynamics of the experimental system may have a significant influence on the appearance of its attractor. A second explanation is that the global nonlinear model does not know that the data it is trained on actually lie on an attractor. Imagine, for example, that we have a two-dimensional state space and the training data set is moving around in this space following a circular orbit. This orbit may be the attractor of the underlying system, but it might as well be an unstable periodic orbit of a completely different underlying system. For the short-term prediction error, it does not matter if the neural network model converges to the first or the second system, but in the latter case, its attractor will be wrong! An interesting development is the work of Principe, Wang, and Motter (1998), who use self-organizing maps to make a partitioning of the input space. It is a first step toward a model that makes accurate short-term predictions and learns the outlines of the (chaotic) attractor.
During the review process of this manuscript, such an approach was developed by Bakker et al. (2000).

Appendix: On the Optimum Amount of Error Propagation: Tent Map Example
When using the compromise method, we need to choose a value for the error propagation parameter such that the influence of noise on the resulting model is minimized. We analyze the case of a simple chaotic system, the tent map, with noise added to both the dynamics and the available measurements. The evolution of the tent map is

$$x_t = a\,|x_{t-1} + u_t| + c, \tag{A.1}$$

with a = −2 and c = 1/2, and where u_t is the added dynamical noise, assumed to be a purely random process with zero mean and variance σ_u². It is convenient for the analysis to convert equation A.1 to the following notation:

$$x_t = a_t\,(x_{t-1} + u_t) + c, \qquad a_t = \begin{cases} a, & (x_{t-1} + u_t) \ge 0 \\ -a, & (x_{t-1} + u_t) < 0 \end{cases} \tag{A.2}$$
We have a set of noisy measurements available,

$$y_t = x_t + m_t, \tag{A.3}$$

where m_t is a purely random process with zero mean and variance σ_m². The model we will try to fit is the following:

$$z_t = b\,|z_{t-1}| + c, \tag{A.4}$$

where z is the (one-dimensional) state of the model and b is the only parameter to be fitted. In the alternate notation,

$$z_t = b_t\,z_{t-1} + c, \qquad b_t = \begin{cases} b, & z_{t-1} \ge 0 \\ -b, & z_{t-1} < 0 \end{cases} \tag{A.5}$$
The prediction error is defined as

$$e_t = z_t - y_t, \tag{A.6}$$

and during training we minimize SSE, the sum of squares of this error. If, during training, the model is run in error propagation (EP) mode, each new prediction partially depends on previous prediction errors according to

$$z_t = b_t\,(y_{t-1} + g\,e_{t-1}) + c, \tag{A.7}$$

where g is the error propagation parameter. We will now derive how the expected SSE depends on g, σ_u², and σ_m². First eliminate z from the prediction error by substituting equation A.7 in A.6,

$$e_t = b_t\,(y_{t-1} + g\,e_{t-1}) + c - y_t, \tag{A.8}$$

and use equation A.3 to eliminate y_t and A.2 to eliminate x_t:

$$e_t = \{(b_t - a_t)\,x_{t-1} - a_t u_t + b_t m_{t-1} - m_t\} + b_t g\,e_{t-1}. \tag{A.9}$$
Introduce ε_t = (b_t − a_t)x_{t−1} − a_t u_t + b_t(1 − g)m_{t−1}, and assume e_0 = 0. Then the variance σ_e² of e_t, defined as the expectation E[e_t²], can be expanded:

$$\sigma_e^2 = E\left[-m_t + \varepsilon_t + (b_t g)\,\varepsilon_{t-1} + \cdots + (bg)^{t-1}\varepsilon_1\right]^2. \tag{A.10}$$
For the tent map, it can be verified that there is no autocorrelation in x for typical orbits, which have a unit density function on the interval [−0.5, 0.5] (see Ott, 1993), under the assumption that u is small enough not to change this property. Because m_t and u_t are purely random processes, there is no autocorrelation in m or u, and no correlation between m and u, or between x and m. Equation A.2 suggests a correlation between x and u, but this cancels out because the sign of the multiplier a_t jumps between −1 and 1, independent of u, under the previous assumption that u is small compared to x. Therefore, the expansion A.10 becomes a sum of the squares of all its individual elements without cross products:

$$\sigma_e^2 = E[m_t^2] + E\left[\varepsilon_t^2 + (b_t g)^2\varepsilon_{t-1}^2 + \cdots + (bg)^{2(t-1)}\varepsilon_1^2\right]. \tag{A.11}$$
From the definition of a_t in equation A.2, it follows that a_t² = a². A similar argument holds for b_t, and if we assume that at each time t, both the arguments x_{t−1} + u_t in equation A.2 and y_{t−1} + g e_{t−1} in A.7 have the same sign (are in the same leg of the tent map), we can also write

$$(b_t - a_t)^2 = (b - a)^2. \tag{A.12}$$
Since (b_t g)² = (bg)², the following solution applies for the series A.11:

$$\sigma_e^2 = \sigma_m^2 + \sigma_\varepsilon^2\,\frac{1 - (bg)^{2t}}{1 - (bg)^2}, \qquad |bg| \ne 1, \tag{A.13}$$

where σ_ε² is the variance of ε_t. Written in full,

$$\sigma_e^2 = \sigma_m^2 + \left((b - a)^2\sigma_x^2 + \sigma_u^2 + b^2(1 - g)^2\sigma_m^2\right)\frac{1 - (bg)^{2t}}{1 - (bg)^2}, \qquad |bg| \ne 1, \tag{A.14}$$

and for large t,

$$\sigma_e^2 = \sigma_m^2 + \left((b - a)^2\sigma_x^2 + \sigma_u^2 + b^2(1 - g)^2\sigma_m^2\right)\frac{1}{1 - (gb)^2}, \qquad |gb| \ne 1. \tag{A.15}$$
Ideally we would like the difference between the model and real-system parameters, (b − a)²σ_x², to dominate the cost function that is minimized. We can draw three important conclusions from equation A.15:

1. The contribution of measurement noise m_t is reduced when gb approaches 1. This justifies the use of error propagation for time series contaminated by measurement noise.

2. The contribution of dynamical noise u_t is not lowered by error propagation.
3. During training, the minimization routine will keep b well below 1/g, because otherwise the multiplier 1/(1 − (gb)²) would become large and so would the cost function. Therefore, b will be biased toward zero if |ga| is close to unity.

The above analysis is almost the same for the trajectory learning algorithm (see section 4.1). In that case, g = 1 and equation A.14 becomes

$$\sigma_e^2 = \sigma_m^2 + \left((b - a)^2\sigma_x^2 + \sigma_u^2\right)\frac{1 - b^{2t}}{1 - b^2}, \qquad |b| \ne 1. \tag{A.16}$$
Again, measurement noise can be suppressed by choosing t large. This leads to a bias toward zero of b, because for a given t > 0 the multiplier (1 − b^{2t})/(1 − b²) is reduced by lowering |b| (assuming |b| > 1, a necessary condition to have a chaotic system).

Acknowledgments
This work is supported by the Netherlands Foundation for Chemical Research (SON) with financial aid from the Netherlands Organization for Scientific Research (NWO). We thank Cees Diks for his help with the implementation of the test for attractor comparison. We also thank the two anonymous referees who helped to improve the article greatly.

References

Aguirre, L. A., & Billings, S. A. (1994). Validating identified nonlinear models with chaotic dynamics. Int. J. Bifurcation and Chaos, 4, 109–125.
Albano, A. M., Passamante, A., Hediger, T., & Farrell, M. E. (1992). Using neural nets to look for chaos. Physica D, 58, 1–9.
Bakker, R., de Korte, R. J., Schouten, J. C., Takens, F., & van den Bleek, C. M. (1997). Neural networks for prediction and control of chaotic fluidized bed hydrodynamics: A first step. Fractals, 5, 523–530.
Bakker, R., Schouten, J. C., Takens, F., & van den Bleek, C. M. (1996). Neural network model to control an experimental chaotic pendulum. Physical Review E, 54A, 3545–3552.
Bakker, R., Schouten, J. C., Coppens, M.-O., Takens, F., Giles, C. L., & van den Bleek, C. M. (2000). Robust learning of chaotic attractors. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 879–885). Cambridge, MA: MIT Press.
Bakker, R., Schouten, J. C., Giles, C. L., & van den Bleek, C. M. (1998). Neural learning of chaotic dynamics—The error propagation algorithm. In Proceedings of the 1998 IEEE World Congress on Computational Intelligence (pp. 2483–2488).
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Networks, 5, 157–166.
Blackburn, J. A., Vik, S., & Binruo, W. (1989). Driven pendulum for studying chaos. Rev. Sci. Instrum., 60, 422–426.
Bockman, S. F. (1991). Parameter estimation for chaotic systems. In Proceedings of the 1991 American Control Conference (pp. 1770–1775).
Broomhead, D. S., & King, G. P. (1986). Extracting qualitative dynamics from experimental data. Physica D, 20, 217–236.
Bulsari, A. B., & Saxén, H. (1995). A recurrent network for modeling noisy temporal sequences. Neurocomputing, 7, 29–40.
Deco, G., & Schürmann, B. (1994). Neural learning of chaotic system behavior. IEICE Trans. Fundamentals, E77A, 1840–1845.
DeKorte, R. J., Schouten, J. C., & van den Bleek, C. M. (1995). Experimental control of a chaotic pendulum with unknown dynamics using delay coordinates. Physical Review E, 52A, 3358–3365.
Diks, C., van Zwet, W. R., Takens, F., & de Goede, J. (1996). Detecting differences between delay vector distributions. Physical Review E, 53, 2169–2176.
Eckmann, J.-P., Kamphorst, S. O., Ruelle, D., & Ciliberto, S. (1986). Liapunov exponents from time series. Physical Review A, 34, 4971–4979.
Farmer, J. D., & Sidorowich, J. J. (1987). Predicting chaotic time series. Phys. Rev. Letters, 59, 62–65.
Grassberger, P., Schreiber, T., & Schaffrath, C. (1991). Nonlinear time sequence analysis. Int. J. Bifurcation Chaos, 1, 521–547.
Haken, H. (1983). At least one Lyapunov exponent vanishes if the trajectory of an attractor does not contain a fixed point. Physics Letters, 94A, 71.
Haykin, S., & Li, B. (1995). Detection of signals in chaos. Proc. IEEE, 83, 95–122.
Haykin, S., & Principe, J. (1998). Making sense of a complex world. IEEE Signal Processing Magazine, pp. 66–81.
Haykin, S., & Puthusserypady, S. (1997). Chaotic dynamics of sea clutter. Chaos, 7, 777–802.
Hübner, U., Weiss, C.-O., Abraham, N. B., & Tang, D. (1994). Lorenz-like chaos in NH3-FIR lasers (data set A).
In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Jackson, J. E. (1991). A user's guide to principal components. New York: Wiley.
Jaeger, L., & Kantz, H. (1996). Unbiased reconstruction of the dynamics underlying a noisy chaotic time series. Chaos, 6, 440–450.
Kaplan, J., & Yorke, J. (1979). Chaotic behavior of multidimensional difference equations. Lecture Notes in Mathematics, 730.
King, G. P., Jones, R., & Broomhead, D. S. (1987). Nuclear Physics B (Proc. Suppl.), 2, 379.
Krischer, K., Rico-Martinez, R., Kevrekidis, I. G., Rotermund, H. H., Ertl, G., & Hudson, J. L. (1993). Model identification of a spatiotemporally varying catalytic reaction. AIChE Journal, 39, 89–98.
Kuo, J. M., & Principe, J. C. (1994). Reconstructed dynamics and chaotic signal modeling. In Proc. IEEE Int'l Conf. Neural Networks, 5, 3131–3136.
Lapedes, A., & Farber, R. (1987). Nonlinear signal processing using neural networks: Prediction and system modelling (Tech. Rep. No. LA-UR-87-2662). Los Alamos, NM: Los Alamos National Laboratory.
Lin, T., Horne, B. G., Tino, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Trans. on Neural Networks, 7, 1329.
Ott, E. (1993). Chaos in dynamical systems. New York: Cambridge University Press.
Principe, J. C., Rathie, A., & Kuo, J. M. (1992). Prediction of chaotic time series with neural networks and the issue of dynamic modeling. Int. J. Bifurcation and Chaos, 2, 989–996.
Principe, J. C., Wang, L., & Motter, M. A. (1998). Local dynamic modeling with self-organizing maps and applications to nonlinear system identification and control. Proc. of the IEEE, 6, 2240–2257.
Rico-Martínez, R., Krischer, K., Kevrekidis, I. G., Kube, M. C., & Hudson, J. L. (1992). Discrete- vs. continuous-time nonlinear signal processing of Cu electrodissolution data. Chem. Eng. Comm., 118, 25–48.
Schouten, J. C., & van den Bleek, C. M. (1993). RRCHAOS: A menu-driven software package for chaotic time series analysis. Delft, The Netherlands: Reactor Research Foundation.
Schouten, J. C., Takens, F., & van den Bleek, C. M. (1994a). Maximum-likelihood estimation of the entropy of an attractor. Physical Review E, 49, 126–129.
Schouten, J. C., Takens, F., & van den Bleek, C. M. (1994b). Estimation of the dimension of a noisy attractor. Physical Review E, 50, 1851–1861.
Siegelmann, H. T., Horne, B. C., & Giles, C. L. (1997). Computational capabilities of recurrent NARX neural networks. IEEE Trans. on Systems, Man and Cybernetics—Part B: Cybernetics, 27, 208.
Su, H.-T., McAvoy, T., & Werbos, P. (1992). Long-term predictions of chemical processes using recurrent neural networks: A parallel training approach. Ind. Eng. Chem. Res., 31, 1338–1352.
Takens, F. (1981). Detecting strange attractors in turbulence. Lecture Notes in Mathematics, 898, 365–381.
von Bremen, H. F., Udwadia, F. E., & Proskurowski, W. (1997). An efficient QR based method for the computation of Lyapunov exponents. Physica D, 101, 1–16.
Wan, E. A. (1994).
Time series prediction by using a connectionist network with internal delay lines. In A. S. Weigend & N. A. Gershenfeld (Eds.), Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Weigend, A. S., & Gershenfeld, N. A. (1994). Time series prediction: Forecasting the future and understanding the past. Reading, MA: Addison-Wesley.
Werbos, P. J., McAvoy, T., & Su, T. (1992). Neural networks, system identification, and control in the chemical process industries. In D. A. White & D. A. Sofge (Eds.), Handbook of intelligent control. New York: Van Nostrand Reinhold.

Received December 7, 1998; accepted October 13, 1999.
LETTER
Communicated by Peter Bartlett
Generalized Discriminant Analysis Using a Kernel Approach

G. Baudat
Mars Electronics International, CH-1211 Geneva, Switzerland

F. Anouar
INRA-SNES, Institut National de Recherche en Agronomie, 49071 Beaucouzé, France

We present a new method that we call generalized discriminant analysis (GDA) to deal with nonlinear discriminant analysis using a kernel function operator. The underlying theory is close to that of support vector machines (SVM) insofar as the GDA method provides a mapping of the input vectors into a high-dimensional feature space. In the transformed space, linear properties make it easy to extend and generalize the classical linear discriminant analysis (LDA) to nonlinear discriminant analysis. The formulation is expressed as an eigenvalue problem resolution. Using a different kernel, one can cover a wide class of nonlinearities. For both simulated data and alternate kernels, we give classification results, as well as the shape of the decision function. The results are confirmed using real data to perform seed classification.

1 Introduction
Linear discriminant analysis (LDA) is a traditional statistical method that has proved successful on classification problems (Fukunaga, 1990). The procedure is based on an eigenvalue resolution and gives an exact solution that maximizes the inertia. But this method fails for nonlinear problems. In this article, we generalize LDA to nonlinear problems and develop a generalized discriminant analysis (GDA) by mapping the input space into a high-dimensional feature space with linear properties. In the new space, one can solve the problem in a classical way, such as with the LDA method. The main idea is to map the input space into a convenient feature space in which variables are nonlinearly related to the input space. This fact has been used in some algorithms, such as unsupervised learning algorithms (Kohonen, 1994; Anouar, Badran, & Thiria, 1998) and in support vector machines (SVM) (Vapnik, 1995; Burges, 1998; Schölkopf, 1997). In our approach, the mapping is close to the mapping used for the support vector method, a universal tool to solve pattern recognition problems. In the feature space, the SVM method selects a subset of the training data and defines a decision function that is a linear expansion on a basis whose elements are nonlinear functions parameterized by the support vectors. SVM was extended to different domains such as regression estimation (Drucker, Burges, Kaufman, Smola, & Vapnik, 1997; Vapnik, Golowich, & Smola, 1997).

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2385–2404 (2000)

2386
G. Baudat and F. Anouar

The basic ideas behind SVM have been explored by Schölkopf, Smola, and Müller (1996, 1998) to extend principal component analysis (PCA) to nonlinear kernel PCA for extracting structure from high-dimensional data sets. The authors also propose nonlinear variants of other algorithms, such as independent component analysis (ICA) or kernel-k-means. They mention that it would be desirable to develop a nonlinear form of discriminant analysis based on the kernel method. A related approach using an explicit map into a higher-dimensional space instead of the kernel method was proposed by Hastie, Tibshirani, and Buja (1993). The foundations for the kernel developments described here can be connected to kernel PCA. Drawing on these works, we show how to express the GDA method as a linear algebraic formula in the transformed space using kernel operators. In the next section, we introduce the notation used for this purpose. Then we review the standard LDA method. The formulation of the GDA using dot products and a matrix form is explained in the third section. Then we develop the eigenvalue resolution. The last section is devoted to experiments on simulated and real data.

2 Notations
Let x be a vector of the input set X, with M elements. X_l designates the subsets of X; thus X = ∪_{l=1}^{N} X_l. N is the number of classes. x^t represents the transpose of the vector x. The cardinality of the subset X_l is denoted by n_l; thus Σ_{l=1}^{N} n_l = M. C is the covariance matrix:

$$C = \frac{1}{M}\sum_{j=1}^{M} x_j x_j^t. \tag{2.1}$$
Suppose that the space X is mapped into a Hilbert space F through a nonlinear mapping function φ:

$$\phi : X \to F, \qquad x \mapsto \phi(x). \tag{2.2}$$

The covariance matrix in the feature space F is

$$V = \frac{1}{M}\sum_{j=1}^{M} \phi(x_j)\phi^t(x_j). \tag{2.3}$$
We assume that the observations are centered in F. Nevertheless, the way to center the data in the feature space can be derived from Schölkopf et al.
Generalized Discriminant Analysis
2387
(1998). By B we denote the covariance matrix of the class centers. B represents the interclass inertia in the space F:

$$B = \frac{1}{M}\sum_{l=1}^{N} n_l\, \bar{\phi}_l \bar{\phi}_l^t, \tag{2.4}$$

where φ̄_l is the mean value of the class l,

$$\bar{\phi}_l = \frac{1}{n_l}\sum_{k=1}^{n_l} \phi(x_{lk}), \tag{2.5}$$

and x_lk is the element k of the class l. In the same manner, the covariance matrix (see equation 2.3) of the F elements can be rewritten using the class indexes:

$$V = \frac{1}{M}\sum_{l=1}^{N}\sum_{k=1}^{n_l} \phi(x_{lk})\phi^t(x_{lk}). \tag{2.6}$$
V represents the total inertia of the data in F. In order to simplify, when there is no ambiguity in the index of x_lj, the class index l is omitted. To generalize LDA to the nonlinear case, we formulate it in a way that uses dot products exclusively. Therefore, we consider an expression of the dot product on the Hilbert space F (Aizerman, Braverman, & Rozonoér, 1964; Boser, Guyon, & Vapnik, 1992) given by the following kernel function:

$$k(x_i, x_j) = k_{ij} = \phi^t(x_i)\phi(x_j).$$

For given classes p and q, we express this kernel function by

$$(k_{ij})_{pq} = \phi^t(x_{pi})\phi(x_{qj}). \tag{2.7}$$

Let K be an (M × M) matrix defined on the class elements by

$$K = (K_{pq})_{p=1,\dots,N;\; q=1,\dots,N}, \qquad K_{pq} = (k_{ij})_{i=1,\dots,n_p;\; j=1,\dots,n_q}, \tag{2.8}$$

where K_pq is an (n_p × n_q) matrix composed of dot products in the feature space F. K is an (M × M) symmetric matrix such that K_pq^t = K_qp. We also introduce the matrix

$$W = (W_l)_{l=1,\dots,N}, \tag{2.9}$$

where W_l is an (n_l × n_l) matrix with all terms equal to 1/n_l; W is an (M × M) block diagonal matrix. In the next section, we formulate the GDA method in the feature space F using the definition of the covariance matrix V (see equation 2.6), the class covariance matrix B (see equation 2.4), and the matrices K (see equation 2.8) and W (see equation 2.9).
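In matrix terms, K is the kernel matrix with the samples grouped class by class, and W is block diagonal with each block W_l filled with 1/n_l. A small sketch of their construction follows; the code and names are our own illustration (any Mercer kernel could replace the gaussian one used here):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / sigma)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def build_K_W(X, labels, kernel):
    """K: (M x M) kernel matrix with samples sorted by class (equation 2.8).
    W: (M x M) block diagonal matrix, block l filled with 1/n_l (equation 2.9)."""
    order = np.argsort(labels, kind="stable")   # group samples class by class
    Xs, ls = X[order], labels[order]
    K = kernel(Xs, Xs)
    M = len(ls)
    W = np.zeros((M, M))
    start = 0
    for n in np.bincount(ls):
        W[start:start + n, start:start + n] = 1.0 / n
        start += n
    return K, W, Xs

rng = np.random.default_rng(3)
X = rng.standard_normal((10, 2))
labels = np.array([0] * 4 + [1] * 6)
K, W, _ = build_K_W(X, labels, gaussian_kernel)
```

Note that W built this way is idempotent (W² = W): each block is the projector onto the within-class mean direction, which is exactly the role it plays in the interclass inertia B.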
3 GDA Formulation in Feature Space
LDA is a standard tool for classification based on a transformation of the input space into a new one. The data are described as a linear combination of the new coordinate values, called principal components, which represent the discriminant axes. For the common LDA (Fukunaga, 1990), the classical criterion for class separability is defined by the quotient between the interclass inertia and the intraclass inertia. This criterion is larger when the interclass inertia is larger and the intraclass inertia is smaller. It has been shown that this maximization is equivalent to an eigenvalue resolution (Fukunaga, 1990) (see appendix A). Assuming that the classes have a multivariate gaussian distribution, each observation can be assigned to the class having the maximum posterior probability using the Mahalanobis distance. Using kernel functions, we generalize LDA to the case where, in the transformed space, the principal components are nonlinearly related to the input variables. The kernel operator K allows the construction of a nonlinear decision function in the input space that is equivalent to a linear decision function in the feature space F. As for LDA, the purpose of the GDA method is to maximize the interclass inertia and minimize the intraclass inertia. This maximization is equivalent to the following eigenvalue resolution: we have to find eigenvalues λ and eigenvectors ν that are solutions of the equation

$$\lambda V \nu = B \nu. \tag{3.1}$$
The largest eigenvalue of equation 3.1 gives the maximum of the following quotient of inertia (see appendix A):

$$\lambda = \frac{\nu^t B \nu}{\nu^t V \nu}. \tag{3.2}$$

Because the eigenvectors are linear combinations of F elements, there exist coefficients α_pq (p = 1, …, N; q = 1, …, n_p) such that

$$\nu = \sum_{p=1}^{N}\sum_{q=1}^{n_p} \alpha_{pq}\,\phi(x_{pq}). \tag{3.3}$$

All solutions ν with nonzero eigenvalue lie in the span of the φ(x_ij). The coefficient vector α = (α_pq)_{p=1,…,N; q=1,…,n_p} can be written in a condensed way as α = (α_p)_{p=1,…,N}, where α_p = (α_pq)_{q=1,…,n_p} is the coefficient block of the vector ν in the class p. We show in appendix B that equation 3.2 is equivalent to the following quotient:

$$\lambda = \frac{\alpha^t K W K \alpha}{\alpha^t K K \alpha}. \tag{3.4}$$
This equation, developed in appendix B, is obtained by multiplying equation 3.1 by φ^t(x_ij), which makes it easy to rewrite in a matrix form. Equation 3.1 has the same eigenvectors as (Schölkopf et al., 1998)

$$\lambda\,\phi^t(x_{ij})\,V \nu = \phi^t(x_{ij})\,B \nu. \tag{3.5}$$

We then express equation 3.5 by using the powerful idea of the dot product (Aizerman et al., 1964; Boser et al., 1992) between the mapped patterns, defined by the matrix K in F, without having to carry out the map φ. We rewrite the two terms of the equality 3.5 in a matrix form using the matrices K and W, which gives equation 3.4 (see appendix B). The next section resolves the eigenvector system (see equation 3.4), which requires an algebraic decomposition of the matrix K.

4 Eigenvalue Resolution
Let us use the eigenvector decomposition of the matrix K, K = U Γ U^t, where Γ is the diagonal matrix of nonzero eigenvalues and U the matrix of normalized eigenvectors associated with Γ. Thus Γ⁻¹ exists. U is an orthonormal matrix, that is, U^t U = I, where I is the identity matrix. Substituting K in equation 3.4, we get

$$\lambda = \frac{(\Gamma U^t \alpha)^t\, U^t W U\, (\Gamma U^t \alpha)}{(\Gamma U^t \alpha)^t\, U^t U\, (\Gamma U^t \alpha)}.$$

Let us proceed to a change of variable using β such that

$$\beta = \Gamma U^t \alpha. \tag{4.1}$$

Substituting in the latter formula, we get

$$\lambda = \frac{\beta^t U^t W U \beta}{\beta^t U^t U \beta}. \tag{4.2}$$

Therefore we obtain λ U^t U β = U^t W U β. As U is orthonormal, the latter equation can be simplified and gives

$$\lambda \beta = U^t W U \beta, \tag{4.3}$$
for which the solutions β are to be found by maximizing λ. For a given β, there exists at least one α satisfying equation 4.1, in the form α = U Γ⁻¹ β; α is not unique. Thus the first step of the system resolution consists of finding β according to equation 4.3, which corresponds to a classical eigenvector system resolution. Once the β are calculated, we compute the α. Note that one can achieve this resolution by using another decomposition of K or another diagonalization method. We refer to the QR decomposition of K (Wilkinson & Reinsch, 1971), which allows working in a subspace that simplifies the resolution. The coefficients α are normalized by requiring that the corresponding vectors ν be normalized in F: ν^t ν = 1. Using equation 3.3,

$$\nu^t \nu = \sum_{p=1}^{N}\sum_{q=1}^{n_p}\sum_{l=1}^{N}\sum_{h=1}^{n_l} \alpha_{pq}\alpha_{lh}\,\phi^t(x_{pq})\phi(x_{lh}) = \sum_{p=1}^{N}\sum_{l=1}^{N} \alpha_p^t K_{pl}\,\alpha_l = \alpha^t K \alpha = 1. \tag{4.4}$$

The coefficients α are divided by √(α^t K α) in order to get normalized vectors ν. Knowing the normalized vectors ν, we then compute projections of a test point z by

$$\nu^t \phi(z) = \sum_{p=1}^{N}\sum_{q=1}^{n_p} \alpha_{pq}\, k(x_{pq}, z). \tag{4.5}$$
The GDA procedure is summarized in the following steps:

1. Compute the matrices K (see equations 2.7 and 2.8) and W (see equation 2.9).
2. Decompose K using its eigenvector decomposition.
3. Compute the eigenvectors β and eigenvalues of the system 4.3.
4. Compute the eigenvectors ν using the α (see equation 3.3) and normalize them (see equation 4.4).
5. Compute projections of test points onto the eigenvectors ν (see equation 4.5).

5 Experiments
In this section, two sets of simulated data are studied, and then the results are confirmed on Fisher's iris data (Fisher, 1936) and the seed classification.
The types of simulated data are chosen in order to emphasize the influence of the kernel function. We have used a polynomial kernel of degree d and a gaussian kernel to solve the corresponding classification problems. Other kernel forms can be used, provided that they fulfill Mercer's theorem (Vapnik, 1995; Schölkopf et al., 1998). Polynomial and gaussian kernels are among the classical kernels used in the literature:

Polynomial kernel using a dot product: k(x, y) = (x · y)^d, where d is the polynomial degree.

Gaussian kernel: k(x, y) = exp(−‖x − y‖²/σ), where the parameter σ has to be chosen.
These kernel functions are used to compute the K matrix elements: k_ij = k(x_i, x_j).

5.1 Simulated Data.
5.1.1 Example 1: Separable Data. Two 2D classes are generated and studied in the feature space obtained with different types of kernel functions. This example aims to illustrate the behavior of the GDA algorithm according to the choice of the kernel function. For the first class (class1), a set of 200 points (x, y) is generated as follows: X is a normal variable with a mean equal to 0 and a standard deviation equal to √2: X ~ N(0, √2). Y is generated according to the following variable: Y_i = X_i · X_i + N(0, 0.1). The second class (class2) corresponds to 200 points (x, y), where X is a variable such that X ~ N(0, 0.001) and Y is a variable such that Y ~ N(2, 0.001).

Note that the variables X and Y are independent here. Twenty examples per class are randomly chosen to compute the decision function. Both the data and the decision function are shown in Figure 1. We construct a decision function corresponding to a polynomial of degree 2. Suppose that the input vector x = (x_1, x_2, …, x_t) has t components, where t is termed the dimensionality of the input space. The feature space F has t(t + 1)/2 coordinates of the form φ(x) = (x_1², …, x_t², x_1x_2, …, x_i x_j, …) (Poggio, 1975; Vapnik, 1995). The separating hyperplane in the F space is a second-degree polynomial in the input space. The decision function is computed on the learning set by finding a threshold such that the projection curves (see Figure 1b) are well separated. Here, the chosen threshold corresponds to the value 0.7. The polynomial separation is represented for the whole data in Figure 1a.
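The two simulated classes of this example can be generated as follows. The sketch is our own; we read N(μ, s) as a gaussian with mean μ and standard deviation s, as in the text:

```python
import numpy as np

rng = np.random.default_rng(5)

# Class 1: X ~ N(0, sqrt(2)), Y_i = X_i * X_i + N(0, 0.1)  (a noisy parabola).
x1 = rng.normal(0.0, np.sqrt(2.0), 200)
y1 = x1 * x1 + rng.normal(0.0, 0.1, 200)
class1 = np.column_stack([x1, y1])

# Class 2: X ~ N(0, 0.001), Y ~ N(2, 0.001)  (a tight blob at (0, 2)).
class2 = np.column_stack([rng.normal(0.0, 0.001, 200),
                          rng.normal(2.0, 0.001, 200)])

# Twenty examples per class are drawn at random as the learning set.
train1 = class1[rng.choice(200, 20, replace=False)]
train2 = class2[rng.choice(200, 20, replace=False)]
```

Because class2 is an almost point-like blob sitting inside the parabola traced by class1, no linear discriminant can separate them, which is what motivates the polynomial and gaussian kernels used next.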
2392
G. Baudat and F. Anouar
Figure 1: (a) The decision function for the two classes using the first discriminant axis. In the input space, the decision function is computed using a polynomial kernel with d = 2. (b) Projections of all examples on the first axis, with an eigenvalue λ equal to 0.765. The dotted line separates the learning examples from the others. The x-axis represents the sample number, and the y-axis is the value of the projection.
Notice that the nonlinear GDA produces a decision function reflecting the structure of the data. As for the LDA method, the maximal number of principal components with nonzero eigenvalues is equal to the number of classes minus one (Fukunaga, 1990). For this example, the first axis is sufficient to separate the two classes of the learning set.
Generalized Discriminant Analysis
2393
It can be seen from Figure 1b that the two classes can clearly be separated using one axis, except for two examples, where the curves overlap. The two misclassified examples do not belong to the learning set. The vertical dotted line indicates the 20 examples of the learning set of each class. We can observe that almost all of the examples of class2 are projected on one point. In the following, we give the results using a gaussian kernel. As previously, the decision function is computed on the learning set and represented for the data in Figure 2a. In this case, all examples are well separated. When projecting the examples on the first axis, we obtain the curves given in Figure 2b, which are well-separated lines. The positive line corresponds to class2 and the negative one to class1. The line of the threshold zero separates all the learning examples as well as all the testing examples. The corresponding separation in the input space is an ellipsoid (see Figure 2a).

5.1.2 Example 2: Nonseparable Data. We consider two overlapping clusters in two dimensions, each cluster containing 200 samples. For the first cluster, samples are uniformly located on a circle of radius 3. Normal noise with a variance of 0.05 is added to the X and Y coordinates. For the second cluster, the X and Y coordinates follow a normal distribution with a mean vector (0, 0)ᵗ and a covariance matrix

    [ 5.0  4.9 ]
    [ 4.9  5.0 ].
This example illustrates the behavior of the algorithm on nonseparable data, and the classification results will be compared to SVM results. Therefore, 200 samples of each cluster are used for the learning step and 200 for the testing step. The GDA is performed using a gaussian kernel with a sigma equal to 0.5. The decision function and the data are represented in Figure 3. To evaluate the classification performance, we use the Mahalanobis distance to assign samples to the classes. The percentage of correct classification for the learning set is 98%, and for the testing set it is equal to 93.5%. The free Matlab SVM classifier (Gunn, 1997) has been used to classify these data with the same kernel and the same value of sigma (sigma = 0.5 and C = 1). The percentage of correct classification for the learning set is 99%, and for the testing set it is 83%. By tuning the parameter C of the SVM classifier with a gaussian kernel, the best results obtained (with sigma = 1 and C = 1) are 99% on the learning set and 88% on the testing set.
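The two clusters and the Mahalanobis-distance assignment rule can be sketched as follows. This is our own simplified illustration applied in the raw input space; in the article, the distance is computed on the projected samples:

```python
import numpy as np

rng = np.random.default_rng(1)

# cluster 1: points on a circle of radius 3 with small normal noise
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
c1 = np.column_stack([3.0 * np.cos(theta), 3.0 * np.sin(theta)])
c1 += rng.normal(0.0, np.sqrt(0.05), c1.shape)

# cluster 2: correlated gaussian blob centered at the origin
cov = np.array([[5.0, 4.9], [4.9, 5.0]])
c2 = rng.multivariate_normal([0.0, 0.0], cov, 200)

def mahalanobis_assign(x, classes):
    """Assign x to the class with the smallest Mahalanobis distance."""
    best, best_d = None, np.inf
    for label, pts in classes.items():
        mu = pts.mean(axis=0)
        inv = np.linalg.inv(np.cov(pts.T))
        diff = x - mu
        d = float(diff @ inv @ diff)
        if d < best_d:
            best, best_d = label, d
    return best
```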
Figure 2: (a) The decision function on the data using a gaussian kernel with σ = 1. (b) The projection of all examples on the first discriminant axis with an eigenvalue λ equal to 0.999. The x-axis represents the sample number, and the y-axis is the value of the projection.
5.2 Fisher's Iris Data. The iris flower data were originally published by Fisher (1936) and have served as a standard example in discriminant analysis and cluster analysis. Four parameters—sepal length, sepal width, petal length, and petal
Figure 3: (a) 200 samples of the first cluster are represented by crosses and 200 samples of the second cluster by circles. The decision function is computed on the learning set using a gaussian kernel with sigma = 0.5. (b) Projections of all samples on the first axis with an eigenvalue equal to 0.875. The dotted vertical line separates the learning samples from the testing samples.
width—were measured in millimeters on 50 iris specimens from each of three species: Iris setosa, I. versicolor, and I. virginica. The set of data thus contains 150 examples with four dimensions and three classes.
Figure 4: The projection of iris data on the first two axes using LDA. LDA is derived from GDA and associated with a polynomial kernel of degree one (d = 1).
One class is linearly separable from the other two; the latter are not linearly separable from each other. For the following tests, all iris examples are centered. Figure 4 shows the projection of the examples on the first two discriminant axes using the LDA method, a particular case of GDA when the kernel is a polynomial of degree one.

5.2.1 Relation to Kernel PCA. Kernel PCA, proposed by Schölkopf et al. (1998), is designed to capture the structure of the data. The method reduces the sample dimension in a nonlinear way for the best representation in lower dimensions, keeping the maximum of inertia. However, the best axis for representation is not necessarily the best axis for discrimination. After kernel PCA, the classification process is performed in the principal component space. The authors propose different classification methods to achieve this task. Kernel PCA is a useful tool for feature extraction in unsupervised, nonlinear problems. In the same manner, GDA can be used for feature extraction and classification in supervised, nonlinear problems. Using GDA, one can find a reduced number of discriminant coordinates that are optimal for separating the groups. With two such coordinates, one can visualize a classification map that partitions the reduced space into regions.
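Kernel PCA as described above reduces to an eigendecomposition of the centered kernel matrix; a minimal numpy sketch (our own implementation, not the authors' code):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Project the n samples behind the kernel matrix K onto the
    leading n_components nonlinear principal axes."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centering in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)        # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, idx] / np.sqrt(np.maximum(eigvals[idx], 1e-12))
    return Kc @ alphas

# toy usage: gaussian kernel matrix of 5 one-dimensional points
X = np.linspace(-1.0, 1.0, 5)[:, None]
K = np.exp(-((X - X.T) ** 2) / 0.7)
Z = kernel_pca(K, n_components=2)   # one 2D projection per sample
```

Because K is doubly centered before the eigendecomposition, the projections of the sample set sum to zero along each axis.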
Figure 5: (a) The projection of the examples on the first two axes using nonlinear kernel PCA with a gaussian kernel and σ = 0.7. (b) The projection of the examples on the first two axes using the GDA method with a gaussian kernel and σ = 0.7.
Kernel PCA and GDA can produce very different representations, which are highly dependent on the structure of the data. Figure 5 shows the results of applying both kernel PCA and GDA to the iris classification problem using the same gaussian kernel with σ = 0.7. The projection on the first two axes seems to be insufficient for kernel PCA to separate the classes; more than two axes would certainly improve the separation. With two axes, the GDA algorithm produces a better separation of these data because of its use of the interclass inertia.
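The discriminant axes themselves come from the eigenproblem λKKα = KWKα derived in appendix B; a rough, lightly regularized numpy sketch (our own, with W the block matrix of 1/n_l entries per class):

```python
import numpy as np

def gda_axes(K, labels, n_axes=1, eps=1e-9):
    """Leading solutions of lambda * K K a = K W K a (appendix B)."""
    M = K.shape[0]
    W = np.zeros((M, M))
    for c in set(labels):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        W[np.ix_(idx, idx)] = 1.0 / len(idx)
    KK = K @ K + eps * np.eye(M)          # regularize before inverting
    vals, vecs = np.linalg.eig(np.linalg.inv(KK) @ (K @ W @ K))
    order = np.argsort(np.real(vals))[::-1][:n_axes]
    return np.real(vals[order]), np.real(vecs[:, order])

# two well-separated 1D classes under a gaussian kernel
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
K = np.exp(-((X[:, None] - X[None, :]) ** 2) / 1.0)
vals, alphas = gda_axes(K, labels=["a", "a", "a", "b", "b", "b"])
# the eigenvalue is an inertia ratio, so it lies between 0 and 1
```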
Table 1: Comparison of GDA and Other Classification Methods for the Discrimination of Three Seed Species: Medicago sativa L. (lucerne), Melilotus sp., and Medicago lupulina L.

                                                  Percentage of Correct Classification
    Method                                            Learning    Testing
    k-nearest neighbors                               81.7        81.1
    Linear discriminant analysis (LDA)                72.8        67.3
    Probabilistic neural network (PNN)                100         85.6
    Generalized discriminant analysis (GDA),
      gaussian kernel (sigma = 0.5)                   100         85.1
    Nonlinear support vector machines,
      gaussian kernel (sigma = 0.7, C = 1000)         99          85.2
As can be seen from Figure 5b, the three classes are well separated: each is nearly projected on one point, which is its center of gravity. Note that the first two eigenvalues are equal to 0.999 and 0.985. In addition, we assigned the test examples to the nearest class according to the Mahalanobis distance and using the prior probability of each class. We applied the assignment procedure with the leave-one-out test method and measured the percentage of correct classification. For GDA, the result is equal to 95.33% correct classification. This percentage can be compared to those of a radial basis function (RBF) network (96.7%) and a multilayer perceptron (MLP) network (96.7%) (Gabrijel & Dobnikar, 1997).

5.3 Seed Classification. Seed samples were supplied by the National Seed Testing Station of France. The aim is to develop seed classification methods in order to help analysts with the successful operation of national seed certification. Seed characteristics are extracted using a vision system and image analysis software. Three seed species are studied: Medicago sativa L. (lucerne), Melilotus sp., and Medicago lupulina L. These species present the same appearance and are difficult for analysts to identify (see Figure 6). In all, 224 learning seeds and 150 testing seeds were placed in random positions and orientations in the field of the camera. Each seed was represented by five variables extracted from the seed images and describing their morphology and texture. Different classification methods were compared in terms of the percentage of correct classification. The results are summarized in Table 1. k-nearest neighbors classification was performed with 15 neighbors and gives better results than the LDA method. The SVM classifier was tested with different kernels and for several values of the upper-bound parameter C that relaxes the constraints. In this case the best results are 99% on the learning set and 85.2% on the testing set, for C = 1000 and sigma = 0.7. For the same
Figure 6: Examples of images of seeds: (a) Medicago sativa L. (lucerne). (b) Medicago lupulina L.
sigma as GDA (0.5), the SVM results are 99% for the learning set and 82.5% for the testing set. The classification results obtained by the GDA method with a gaussian kernel, the probabilistic neural network (PNN) (Specht, 1990; Musavi, Kalantri, Ahmed, & Chan, 1993), and the SVM classifier are nearly the same. However, the advantage of the GDA method is that it rests on an exact calculation rather than on an approximate optimization, as for the PNN classifier or the SVM, for which the user must choose and adjust some
parameters. Moreover, the SVM classifier was initially developed for two classes, and adapting it to multiple classes is time-consuming.

6 Discussion and Future Work
The dimensionality of the feature space is huge and depends on the size of the input space. A function that successfully separates the learning data may not generalize well. One has to find a compromise between learning and generalization performance. It was shown for SVMs that the test error depends only on the expectation of the number of support vectors and the number of learning examples (Vapnik, 1995), not on the dimensionality of the feature space. Currently we are seeking to establish the relationship between the GDA resolution and the SVM resolution. We can improve generalization, accuracy, and speed by drawing on the extensive studies of the SVM technique (Burges & Schölkopf, 1997; Schölkopf et al., 1998). Nevertheless, the fascinating aspect of the kernel approach is that we can construct an optimal separating hyperplane in the feature space without considering this space in an explicit form; we have to calculate only the dot products. But the choice of the kernel type remains an open problem. However, the possibility of using any desired kernel (Burges, 1998) allows generalizing classical algorithms. For instance, there are similarities between GDA with a gaussian kernel and PNNs. Like the GDA method, a PNN can be viewed as a mapping operator built on a set of input-output observations, but the procedure to define decision boundaries differs between the two algorithms. In GDA, the eigenvalue resolution is used to find a set of vectors that define a hyperplane separation and give a global minimum according to the inertia criterion. In a PNN the separation is found by trial-and-error measurement on the learning set; the PNN and, more generally, neural networks find only a local minimum. The only weakness of the method, as for the SVM, is the need to store and process matrices as large as the number of examples. The largest database to which we have applied GDA is a real database with 1500 examples.
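The "dot products only" remark can be made concrete for the degree-2 polynomial kernel: (x · y)² equals an ordinary dot product after an explicit quadratic feature map. A toy check of our own; the √2 scaling on the cross term is what makes the identity exact:

```python
import math

def phi(x):
    """Explicit degree-2 feature map for 2D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def k_poly2(x, y):
    """Degree-2 polynomial kernel evaluated in the input space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
lhs = k_poly2(x, y)                               # kernel in input space
rhs = sum(a * b for a, b in zip(phi(x), phi(y)))  # dot product in feature space
# both equal 1.0: the feature space never has to be built explicitly
```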
Independently of the memory storage consumption, the complexity of the algorithm is cubic in the number of examples.

7 Conclusion
We have developed a generalization of discriminant analysis to nonlinear discrimination. We described the algebraic formulation and the eigenvalue resolution. The motivation for exploring the algebraic approach was to develop an exact solution rather than an approximate optimization. The GDA method gives an exact solution, even if some points require further investigation, such as the choice of the kernel function. In terms of classification performance, for the small databases studied here, the GDA method competes with support vector machines and the probabilistic neural network classifier.
Appendix A
Given two symmetric matrices A and B of the same size, with B assumed invertible, it is shown (Saporta, 1990) that the quotient (vᵗAv)/(vᵗBv) is maximal for v an eigenvector of B⁻¹A associated with the largest eigenvalue λ. Maximizing the quotient requires that the derivative with respect to v vanish:

    [(vᵗBv)(2Av) − (vᵗAv)(2Bv)] / (vᵗBv)² = 0,

which implies

    B⁻¹Av = [(vᵗAv)/(vᵗBv)] v.

Then v is an eigenvector of B⁻¹A associated with the eigenvalue (vᵗAv)/(vᵗBv). The maximum is reached for the largest eigenvalue.
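The claim is easy to verify numerically: the top eigenvector of B⁻¹A attains the maximum of the quotient. A sketch with randomly generated matrices of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

M = rng.normal(size=(4, 4))
A = (M + M.T) / 2.0                     # random symmetric A
N = rng.normal(size=(4, 4))
B = N @ N.T + 4.0 * np.eye(4)           # symmetric positive definite B

def quotient(v):
    return float(v @ A @ v) / float(v @ B @ v)

vals, vecs = np.linalg.eig(np.linalg.inv(B) @ A)
v_star = np.real(vecs[:, np.argmax(np.real(vals))])

# no random direction beats the top eigenvector of B^{-1} A,
# and the quotient at the maximizer equals the largest eigenvalue
best_random = max(quotient(rng.normal(size=4)) for _ in range(1000))
```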
Appendix B
In this appendix we rewrite equation 3.5 in matrix form in order to obtain equation 3.4. We develop each term of equation 3.5 according to the matrices K and W. Using equations 2.6 and 3.2, the left term of equation 3.5 gives:

    Vν = (1/M) Σ_{l=1..N} Σ_{k=1..n_l} φ(x_lk) φᵗ(x_lk) Σ_{p=1..N} Σ_{q=1..n_p} a_pq φ(x_pq)
       = (1/M) Σ_{p=1..N} Σ_{q=1..n_p} a_pq Σ_{l=1..N} Σ_{k=1..n_l} φ(x_lk) [φᵗ(x_lk) φ(x_pq)],

    λ φᵗ(x_ij) Vν = (λ/M) Σ_{p} Σ_{q} a_pq φᵗ(x_ij) Σ_{l} Σ_{k} φ(x_lk) [φᵗ(x_lk) φ(x_pq)]
                  = (λ/M) Σ_{p} Σ_{q} a_pq Σ_{l} Σ_{k} [φᵗ(x_ij) φ(x_lk)] [φᵗ(x_lk) φ(x_pq)].

Writing this formula for every class i and for each of its elements j, we obtain:

    λ (φᵗ(x_11), ..., φᵗ(x_1n_1), ..., φᵗ(x_ij), ..., φᵗ(x_N1), ..., φᵗ(x_Nn_N)) Vν = (λ/M) K K a.    (B.1)

According to equations 2.4, 2.5, and 3.3, the right term of equation 3.5 gives:

    Bν = (1/M) Σ_{l=1..N} (1/n_l) [Σ_{k=1..n_l} φ(x_lk)] [Σ_{k=1..n_l} φ(x_lk)]ᵗ Σ_{p} Σ_{q} a_pq φ(x_pq)
       = (1/M) Σ_{p} Σ_{q} a_pq Σ_{l} (1/n_l) [Σ_{k} φ(x_lk)] [Σ_{k} φᵗ(x_lk) φ(x_pq)],

    φᵗ(x_ij) Bν = (1/M) Σ_{p} Σ_{q} a_pq Σ_{l} (1/n_l) [Σ_{k} φᵗ(x_ij) φ(x_lk)] [Σ_{k} φᵗ(x_lk) φ(x_pq)].

For every class i and for each of its elements j, we obtain:

    (φᵗ(x_11), ..., φᵗ(x_1n_1), ..., φᵗ(x_ij), ..., φᵗ(x_N1), ..., φᵗ(x_Nn_N)) Bν = (1/M) K W K a.    (B.2)

Combining equations B.1 and B.2, we obtain λ K K a = K W K a, which is multiplied by aᵗ to obtain equation 3.4.

Acknowledgments
We are grateful to Scott Barnes (engineer at MEI, USA), Philippe Jard (applied research manager at MEI, USA), and Ian Howgrave-Graham (R&D manager at Landis & Gyr, Switzerland) for their comments on the manuscript for this article. We also thank Rodrigo Fernandez (research associate at the University of Paris Nord) for comparing results using his own SVM classifier software.

References

Aizerman, M. A., Braverman, E. M., & Rozonoér, L. I. (1964). Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25, 821–837.
Anouar, F., Badran, F., & Thiria, S. (1998). Probabilistic self-organizing map and radial basis function networks. Neurocomputing, 20, 83–96.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In D. Haussler (Ed.), Fifth Annual ACM Workshop on COLT (pp. 144–152). Pittsburgh, PA: ACM Press.
Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Support vector web page available online at: http://svm.first.gmd.de.
Burges, C. J. C., & Schölkopf, B. (1997). Improving the accuracy and speed of support vector machines. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9. Cambridge, MA: MIT Press.
Drucker, H., Burges, C. J. C., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9. Cambridge, MA: MIT Press.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188.
Fukunaga, K. (1990). Introduction to statistical pattern recognition (2nd ed.). Orlando, FL: Academic Press.
Gabrijel, I., & Dobnikar, A. (1997). Adaptive RBF neural network. In Proceedings of SOCO '97 Conference (pp. 164–170).
Gunn, S. R. (1997). Support vector machines for classification and regression (Tech. rep.). Image Speech and Intelligent Systems Research Group, University of Southampton. Available online at: http://www.isis.ecs.soton.ac.uk/resource/svminfo/.
Hastie, T., Tibshirani, R., & Buja, A. (1993). Flexible discriminant analysis by optimal scoring (Res. Rep.). AT&T Bell Labs.
Kohonen, T. (1994). Self-organizing maps. New York: Springer-Verlag.
Musavi, M. T., Kalantri, K., Ahmed, W., & Chan, K. H. (1993). A minimum error neural network (MNN). Neural Networks, 6, 397–407.
Poggio, T. (1975). On optimal nonlinear associative recall. Biological Cybernetics, 19, 201–209.
Saporta, G. (1990). Probabilités, analyse des données et statistique. Paris: Éditions Technip.
Schölkopf, B. (1997). Support vector learning. Munich: R. Oldenbourg Verlag.
Schölkopf, B., Smola, A., & Müller, K. R. (1996). Nonlinear component analysis as a kernel eigenvalue problem (Tech. Rep. No. 44). MPI für biologische Kybernetik.
Schölkopf, B., Smola, A., & Müller, K. R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10, 1299–1319.
Specht, D. F. (1990). Probabilistic neural networks. Neural Networks, 3(1), 109–118.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Vapnik, V., Golowich, S. E., & Smola, A. (1997). Support vector method for function approximation, regression estimation, and signal processing. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9. Cambridge, MA: MIT Press.
Wilkinson, J. H., & Reinsch, C. (1971). Handbook for automatic computation, Vol. 2: Linear algebra. New York: Springer-Verlag.

Recent References Since the Article Was Submitted
Jaakkola, T. S., & Haussler, D. (1999). Exploiting generative models in discriminative classifiers. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Mika, S., Rätsch, G., Weston, J., Schölkopf, B., & Müller, K. R. (1999). Fisher discriminant analysis with kernels. In Proc. IEEE Neural Networks for Signal Processing Workshop, NNSP.

Received December 4, 1998; accepted October 6, 1999.
LETTER
Communicated by Wolfgang Kinzel
Generalization and Selection of Examples in Feedforward Neural Networks

Leonardo Franco
Sergio A. Cannas
Facultad de Matemática, Astronomía y Física, Universidad Nacional de Córdoba, Ciudad Universitaria, (5000), Córdoba, Argentina

In this work, we study how the selection of examples affects the learning procedure in a boolean neural network and its relationship with the complexity of the function under study and its architecture. We analyze the generalization capacity for different target functions with particular architectures through an analytical calculation of the minimum number of examples needed to obtain full generalization (i.e., zero generalization error). The analysis of the training sets associated with such a parameter leads us to propose a general architecture-independent criterion for the selection of training examples. The criterion was checked through numerical simulations for various particular target functions with particular architectures, as well as for random target functions in a nonoverlapping receptive field perceptron. In all cases, the selective sampling criterion led to an improvement in the generalization capacity compared with pure random sampling. We also show that for the parity problem, one of the most used problems for testing learning algorithms, only the use of the whole set of examples ensures global learning in a depth-two architecture. We show that this difficulty can be overcome by considering a tree-structured network of depth 2 log₂(N) − 1.

1 Introduction
Feedforward neural networks have been used extensively to solve many kinds of problems, being applied in a wide range of areas covering subjects such as prediction of temporal series, structure prediction of proteins, and speech recognition (see, e.g., Haykin, 1994, and Hertz, Krogh, & Palmer, 1991). One of the fundamental properties making these networks useful is their capacity to learn from examples. Through synaptic modification algorithms, the network is capable of obtaining a new structure of internal connections that is appropriate for solving a given task. The general underlying theory of the whole learning process is poorly understood. There are few general results, especially concerning generalization. One particular point of interest is the selection of a concise subset of examples from the whole training set as a way of improving generalization ability. This problem has also been referred to as "active learning" or "query-based learning" by many authors. In a broad sense, these terms refer to any form of learning in which the learning algorithm has some control over the inputs used for the training. Several approaches have been developed to derive criteria for selecting examples in different sorts of networks, ranging from simple nets such as perceptrons (Kinzel & Ruján, 1990) and linear classifiers (Jung & Opper, 1996) to multilayer feedforward networks with continuous outputs. Among them (see Plutowski & White, 1993, for more references) we refer to the work of Cohn (Cohn, Atlas, & Ladner, 1994; Cohn, 1996), which uses partially trained networks to determine regions of uncertainty in the environment from which the examples are selected. Plutowski and White (1993) assume that a large amount of data has been collected and work on principles for selecting a subset of those data for efficient training. Another kind of approach is that of Baum (1991), who proposed a query-based algorithm for iteratively looking for classification boundaries in the input space. An important question related to the example selection problem is, What is the absolute minimum number of examples that ensures full generalization in the learning procedure? By full generalization, we mean zero generalization error in the case of boolean networks or an error less than some small, positive value for the case of continuous outputs. This minimum number is the size of some particular subset of examples that contains all the information about the task or target function to be learned, and it gives a lower bound for the number of examples needed for full generalization in any selective sampling procedure (given a target function and a fixed architecture). The counterpart is the minimum average number of examples needed for generalization in a pure random sampling of the training set (Baum & Haussler, 1989).

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2405–2426 (2000)
In this case the Vapnik-Chervonenkis (VC) dimension (see Haykin, 1994) appears to be a fundamental parameter for determining the generalization capacity of a fixed architecture. Although the minimum number of examples needed for full generalization (MNEFG) is related to the intrinsic complexity of the target function, it also depends on the architecture of the student network (incorrectly designed nets might not be able to use all the information contained in the set and therefore need more examples to achieve full generalization). In this sense the MNEFG is a more specific concept than the VC dimension, since the former is related to the combined complexity of the architecture and the specific target function, while the latter refers to arbitrary target functions. In fact, the MNEFG is an upper bound for the VC dimension. In sections 2, 3, and 4, we analyze the learning of three different boolean problems by particular architectures using linear threshold units and obtain analytically a lower bound for the MNEFG. The idea of the method is the following. Suppose a boolean network has N inputs, a single output, and a given target function. Then the perfect learning of every one of the 2^N possible examples implies a set of constraints on the synaptic weights. In a
general case, only a subset of the 2^N constraints will be independent. Since every constraint is associated with one particular example, the size of this subset gives us the minimum number of examples that ensures the perfect learning of all 2^N examples. The analysis was carried out first for a small number of inputs and then extended to arbitrary sizes N. Most of the analysis in this article was carried out with specific problems and architectures. Models for which a set of exact solutions and properties are known serve as standards for testing hypotheses. The work presented here illustrates how insights about general properties of learning can be obtained from the study of such standards. The three problems we first considered were addition of two numbers, bit shifting, and parity. For every problem, we chose an architecture whose computability was already known. Moreover, for each of these structures, a set of exact solutions was available (see Cannas, 1995, for the first case; Franco & Cannas, 1998, for the second; and Rumelhart & McClelland, 1986, for the last). The three structures have N inputs, a single output (in order to make the results comparable in the three cases, we considered only the most significant bit of the output in the first two cases), and one hidden layer. Since all the problems are nonlinearly separable, the structures are optimal (i.e., minimal) in depth in all the cases. We found that the MNEFG scales polynomially with N in the first two cases and exponentially in the last one. That is, for the parity function, the MNEFG scales with the total number of examples for a depth-2 network. The main difference between the architectures is that in the parity case the network is fully connected with N hidden units, while in the other two cases we chose architectures with fewer hidden units and local receptive fields.
These ad hoc modifications simplify the analytical treatment without a great loss of generality, since they are based on prior knowledge of the global symmetries of the corresponding problems (Cannas, 1995; Franco & Cannas, 1998). Although they reduce the computability of the network (in the sense of the number of different target functions that the architecture is capable of implementing) compared with the fully connected counterpart, these architectures are expected to be general enough to compute a large class of functions, at least those sharing the same types of global symmetries. In particular, for the addition case, the reduction becomes negligible for large values of the number of inputs N, because the number of hidden units is kept constant. Moreover, the presence of local receptive fields and a small number of hidden units can always be thought of as a particular solution of a fully connected network with N hidden units and a set of synaptic weights equal to zero. Hence, the MNEFG values we obtained constitute lower bounds for the three problems considered here when implemented in a standard architecture—a fully connected depth-2 network with the same number of inputs and hidden units. Since the three problems are expected to have different complexity, the MNEFG appears to be an interesting parameter for classifying the complexity of a target function. Moreover, an analysis of the
subsets of examples (or "exemplars," in Plutowski and White's terminology) that determine the MNEFG gave us a first insight into how to construct a criterion for selecting examples. The analysis is detailed in section 5, where the criterion is tested in the nonoverlapping receptive field perceptron (Hancock, Golea, & Marchand, 1994). An analysis of the MNEFG for a particular target function in this general architecture allowed us to refine our ideas and propose a general architecture-independent criterion for selecting examples in boolean networks. This criterion was checked for random functions using numerical simulations in a small network, showing a considerable improvement in the generalization capacity compared with pure random sampling, especially concerning the probability of full generalization. We also present some numerical simulations for the addition and bit-shifting problems in small networks. The criterion works well in all these cases but fails in the parity one. To show that the difficulty here comes from the architecture, which is not complex enough to accomplish the desired task (the depth-2 network can learn, but it cannot generalize), we study a network of depth 2 log₂ N − 1 with a tree structure, which is also capable of computing the parity function (Rumelhart & McClelland, 1986). We show that full generalization is possible with this architecture, for which the MNEFG scales polynomially with N. Moreover, we show through numerical simulations that the generalization capacity is also improved in this case with our selection criterion.

2 The Sum Problem
We study the generalization properties of a network constructed to compute the addition of two binary operands, each of N bits, which gives a result of the same length N. The architecture (Cannas, 1995) is optimal in depth; it has only one hidden layer. It is composed of 2N binary neurons in the input layer, corresponding to the two N-bit numbers to add, N hidden neurons, and N binary neurons in the output. To simplify the analysis and allow comparisons with any function that has at least one output neuron, we study only one output bit—the one having the most significant value. The generalization to the case of N output bits is straightforward, since the learning of the most significant output bit automatically fixes the synapses shared with the less significant output bits to their correct values. The rest of the synapses can be fixed by a set of independent exemplars selected by the same procedure used for the most significant bit. The resulting network has two hidden neurons: one is fully connected to the input layer, and the other is connected to every input neuron except those corresponding to the most significant bits. Full connection also exists between the hidden layer and the output neuron. Finally, the two most significant input bits are also directly connected to the output. An example of the network for N = 3 is shown in Figure 1a, where only the connections related
Figure 1: Network structure to compute the most significant output bit of the sum function of two three-bit numbers. (a) General structure for arbitrary synapses without constraints. (b) The synapses corresponding to input bits with the same significant value have been symmetrized in such a way that the six input bits can be replaced by three input bits (denoted by A, B, and C) taking the values {0, 1, 2}.
to the most significant bit are shown. We see that the architecture is almost fully connected; only a set of backward connections (from left to right) has been set to zero. This simplification is based on prior knowledge
of the global left-right asymmetry of the addition, produced by the carry operation from the least (right) to the most (left) signicant bits (Cannas, 1995). Hence, this architecture is expected to be general enough to compute a large class of functions, at least those sharing the left-right asymmetry. Further simplication of the problem can be obtained by considering shared synapses. That means we impose the constraint that synapses connecting one hidden neuron with two input bits with the same signicant value are always equal. With this constraint, every pair of input neurons corresponding to the same signicant bit can be replaced by a single ternary neuron, which takes the values f0, 1, 2g. Hence, our nal architecture contains an input layer of N ternary neurons. We start by adding two numbers of 3-bit length and then generalize the results to the addition of two Nbits numbers. A description of the nal network architecture for N D 3 is depicted in Figure 1b. The output bit S receives direct inputs from the two hidden neurons D, E D 0, 1 through synapses d1 , e1 and from the input neuron A, through synapse a0 , computing the following function: S D h fa0 A C d1h [a1 A C b1 B C c1 C ¡ Td ] C e1h[b2 B C c2 C ¡ Te ] ¡ Ts g,
(2.1)
where A, B, C = 0, 1, 2; h(x) is the Heaviside step function; and Ta (a = d, e, s) are the threshold parameters. The values of the synapse a0 and those of the thresholds Td and Te are restricted to be greater than zero to reduce the possible internal representations to only one. Requiring that the network compute the full set of 27 addition examples leads to the following necessary and sufficient conditions:

2a1 ≥ Td > a1  (2.2)
a1 + 2b1 ≥ Td > 2b1 + 2c1  (2.3)
a1 + b1 + 2c1 ≥ Td > a1 + b1 + c1  (2.4)
2b2 ≥ Te > 2c2  (2.5)
b2 + 2c2 ≥ Te > b2 + c2  (2.6)
e1 ≥ Ts > 0  (2.7)
a0 ≥ Ts > 2a0 + d1  (2.8)
2a0 + d1 + e1 ≥ Ts > a0 + d1 + e1.  (2.9)
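As a sanity check on these conditions, the following sketch (ours, not from the article; the specific weight values are hand-picked to satisfy conditions 2.2 through 2.9) plugs one admissible weight set into equation 2.1 and verifies that it reproduces the most significant bit of the sum on all 27 inputs.

```python
def heaviside(x):
    """h(x) = 1 if x >= 0, else 0 (the convention used in equation 2.1)."""
    return 1 if x >= 0 else 0

# Hand-picked weights satisfying conditions 2.2-2.9 (our illustration):
a0, d1, e1 = 1.0, -2.0, 1.0          # output side; note d1 < -a0
a1, b1, c1, Td = 4.0, 1.5, 1.0, 7.0  # hidden neuron D
b2, c2, Te = 2.0, 1.0, 3.5           # hidden neuron E
Ts = 1.0                             # output threshold

def network(A, B, C):
    """Equation 2.1: output bit S of the symmetrized addition network."""
    D = heaviside(a1 * A + b1 * B + c1 * C - Td)
    E = heaviside(b2 * B + c2 * C - Te)
    return heaviside(a0 * A + d1 * D + e1 * E - Ts)

def target(A, B, C):
    """Most significant bit of (4A + 2B + C) mod 8: the ternary inputs
    are the bitwise sums of the two 3-bit operands, and the result is
    truncated to the same length N = 3."""
    return ((4 * A + 2 * B + C) % 8) >> 2

assert all(network(A, B, C) == target(A, B, C)
           for A in range(3) for B in range(3) for C in range(3))
print("all 27 addition examples verified")
```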
We will show that if the network computes a particular set of 12 examples, then the above conditions are satisfied. Therefore, the correct learning of such examples ensures full generalization with fewer than half of the total number of examples.
We will denote the examples by writing between square brackets the three input values and the correct output, separated by a colon. Then: From the example [000:0], we obtain the right part of equation 2.7. From the examples [100:1] and [200:0] we obtain: d1 < −a0
(2.10)
and the fulfillment of equations 2.2 and 2.8. From the example [020:1] we obtain: d1 h[2b1 − Td] + e1 h[2b2 − Te] > Ts,
which together with Ts > 0 and equation 2.10 implies the left part of equation 2.7 and the left part of equation 2.5. From the example [120:0], we obtain the left part of equation 2.3, together with the right part of equation 2.9. From the examples [111:1] and [112:0], we obtain equation 2.4 and also that c1 > 0. From the examples [011:0], [012:1], and [002:0], we obtain equation 2.6 together with the right part of equation 2.5. From the example [222:1], we obtain the left part of equation 2.9. From the example [022:1], we obtain the right part of equation 2.3 and consequently the fulfillment of the complete set of equations 2.2 through 2.9. For the most general case of an input of N bits, we can see that for each bit by which we increase the input size, it is enough to add four examples to obtain full generalization. For example, for the case of N = 4 we take the 12 examples corresponding to the 3-bit case but converted to the new problem by adding a zero in the new right-most input place. For example, the input pattern [112:0] in the 3-bit problem now becomes the input pattern [1120:0]. It is also necessary to add four new examples, two pairs of them having different outputs when we modify the new right-most input bit. For the case of N = 4 these two pairs are the examples {[0111:0], [0112:1]; [1111:1], [1112:0]}. The same procedure is repeated as we increase the number of input bits. Thus, the size of the minimal set of examples needed for generalization is equal to 4N. Finally, for N output neurons, this number becomes 2N(N + 1).

3 Bit Shifting
The bit-shifting function is a basic operation in computer circuits and has also been used in modeling vision devices (see Franco & Cannas, 1998). The network structure has only one hidden layer, and since the problem is linearly
Figure 2: Network structure to compute the left-most output bit of a bit-shifting operation, where I1, I2, I3 are the input bits and S1, S0 are the bits indicating the number of places to shift.
inseparable, the structure is optimal in depth. The input layer has the N input bits to shift plus log2(N + 1) indicating bits giving the number of places to shift. There are N binary neurons in the hidden layer, each one connected to one input neuron and to all the indicating neurons. The problem has an output of N bits, but to simplify the study, we analyze just one output bit. We start the analysis with the particular case of three input bits with their corresponding two indicating bits, and later generalize the results to the case of N input bits. For the case of three input bits, the structure of the network is shown in Figure 2, where the output neuron S computes the following function:

S = h[Ja h(a1 I1 + a2 S1 + a3 S0 − Ta) + Jb h(b1 I2 + b2 S1 + b3 S0 − Tb) + Jc h(c1 I3 + c2 S1 + c3 S0 − Tc) − T],  (3.1)

where I1, I2, I3 = 0, 1 are the input bits; S0, S1 = 0, 1 are the two indicating bits; Ja, Jb, Jc are the synapses between the three hidden neurons A, B, C and the output S; and ai, bi, ci are the synapses between the respective hidden neurons (A, B, C) and the input and indicating bits. We impose the constraint that the thresholds of the hidden neurons Ta, Tb, Tc are always positive to reduce the number of possible internal representations to only one. In order to simplify the generalization to the N input bits case, we also impose that the synapses Ja, Jb, and Jc have to be greater than the threshold of the output neuron T. These constraints produce no substantial changes to the results.
Requiring that this network compute exactly the full set of 32 examples leads to the following necessary and sufficient conditions:

T > 0  (3.2)
a1 ≥ Ta > a1 + a3  (3.3)
a1 + a2 < Ta  (3.4)
b1 < Tb  (3.5)
b1 + b3 ≥ Tb > b3  (3.6)
b1 + b2 + b3 < Tb  (3.7)
c1 < Tc  (3.8)
c1 + c2 ≥ Tc > c2  (3.9)
c1 + c2 + c3 < Tc.  (3.10)
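As in the preceding section, these conditions can be checked numerically. The sketch below is our own illustration: the weight values are hand-picked to satisfy conditions 3.2 through 3.10, and the target (left-most bit after a left shift with zeros shifted in) is our reading of the examples given in the text. It verifies equation 3.1 on all 32 input patterns.

```python
from itertools import product

def heaviside(x):
    """h(x) = 1 if x >= 0, else 0."""
    return 1 if x >= 0 else 0

# Hand-picked weights satisfying conditions 3.2-3.10 (our illustration).
a1, a2, a3, Ta = 2.0, -1.0, -1.0, 1.5  # hidden neuron A (input I1)
b1, b2, b3, Tb = 1.0, -1.0, 1.0, 1.5   # hidden neuron B (input I2)
c1, c2, c3, Tc = 1.0, 1.0, -1.0, 1.5   # hidden neuron C (input I3)
Ja, Jb, Jc, T = 1.0, 1.0, 1.0, 0.5     # output layer (Ja, Jb, Jc > T > 0)

def network(I1, I2, I3, S1, S0):
    """Equation 3.1."""
    A = heaviside(a1 * I1 + a2 * S1 + a3 * S0 - Ta)
    B = heaviside(b1 * I2 + b2 * S1 + b3 * S0 - Tb)
    C = heaviside(c1 * I3 + c2 * S1 + c3 * S0 - Tc)
    return heaviside(Ja * A + Jb * B + Jc * C - T)

def target(I1, I2, I3, S1, S0):
    """Left-most bit after shifting (I1, I2, I3) left by k = 2*S1 + S0
    places, with zeros shifted in (our assumed reading of the task)."""
    k = 2 * S1 + S0
    bits = [I1, I2, I3]
    return bits[k] if k < 3 else 0

assert all(network(*p) == target(*p) for p in product((0, 1), repeat=5))
print("all 32 bit-shifting examples verified")
```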
We denote the examples by writing between square brackets the three input values plus the two indicating bits and the correct output, separated by a colon. As in the previous section, it is easy to verify that the correct learning of the 10 examples {[000–00:0], [100–00:1], [100–01:0], [100–10:0], [010–01:1], [010–11:0], [010–00:0], [001–00:0], [001–10:1], [001–11:0]} ensures the correct computation of the full set of examples. For instance, from example [000–00:0] we derive T > 0 (see equation 3.2); from example [100–00:1] we obtain the left side of equation 3.3; and so on. Extending this result to the case of N bits involves all the examples that have only one input bit ON with all their possible combinations of indicating bits, plus the patterns with all the input bits OFF and their combinations with the indicating bits. This procedure gives that for N input bits, (N + 1)^2 examples are enough to obtain full generalization.

4 The Parity Problem
The parity function is one of the most used problems for testing learning algorithms, because of its simple definition and the great complexity given by the fact that the most similar patterns (those differing by a single bit) have different outputs (Rumelhart & McClelland, 1986; Tessauro & Janssens, 1988). The parity function has only one output neuron, which is ON when an odd number of the N input bits are ON and OFF when this number is even. The simplest architecture known to compute this function using linear threshold units (l.t.u.) and having no direct input-output connections consists of a network with one hidden layer of N units, fully connected to the N input neurons. As in the previous sections, we first analyze a particular case with six input neurons (see Figure 3) and then generalize the result to N input bits.
Figure 3: Network structure with one hidden layer of six fully connected neurons used to compute the 6-bit parity function.
The standard functioning of this network (see Minsky & Papert, 1969; Rumelhart & McClelland, 1986; and Hertz, Krogh, & Palmer, 1991) is based on the behavior of the six hidden neurons: the number of hidden neurons that are ON has to be equal to the number of bits ON in the input. These hidden neurons are connected to the output neuron P through synapses set alternately to positive and negative values, so that the neuron P is ON when an odd number of hidden neurons are ON and is inactive when this number is even. The functioning of the hidden neurons is achieved by the following conditions:

1. The hidden neuron A has to be ON when one or more input bits are ON; otherwise, A has to be OFF.
2. The hidden neuron B has to be ON when two or more input bits are ON; otherwise, B has to be OFF.
3. The hidden neuron C has to be ON when three or more input bits are ON; otherwise, C has to be OFF.
4. The hidden neuron D has to be ON when four or more input bits are ON; otherwise, D has to be OFF.
5. The hidden neuron E has to be ON when five or more input bits are ON; otherwise, E has to be OFF.
6. The hidden neuron F has to be ON when all six input bits are ON; otherwise, F has to be OFF.
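This counting construction is easy to verify directly. The sketch below is ours; the unit weights and half-integer thresholds are the textbook choice for this construction (see, e.g., Rumelhart & McClelland, 1986), not values taken from the article. Hidden unit k fires when k or more inputs are ON, output synapses alternate ±1, and parity is checked on all 64 patterns.

```python
from itertools import product

def heaviside(x):
    return 1 if x >= 0 else 0

N = 6

def parity_net(inputs):
    """Hidden unit k (k = 1..N) is ON iff at least k input bits are ON:
    all its input weights are 1 and its threshold is k - 0.5."""
    hidden = [heaviside(sum(inputs) - (k - 0.5)) for k in range(1, N + 1)]
    # Output synapses alternate +1, -1, +1, ..., so the net input to P
    # equals (number of ON inputs) mod 2; output threshold is 0.5.
    net = sum(((-1) ** k) * h for k, h in enumerate(hidden))
    return heaviside(net - 0.5)

assert all(parity_net(p) == sum(p) % 2 for p in product((0, 1), repeat=N))
print("6-bit parity verified on all 64 patterns")
```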
Examples in Feedforward Networks
2415
Denoting by P = 0, 1 the output value, we have that P computes the following function:

P = h{ a h(Σ_{i=1}^{6} ai Ii − Ta) + b h(Σ_{i=1}^{6} bi Ii − Tb) + c h(Σ_{i=1}^{6} ci Ii − Tc)
       + d h(Σ_{i=1}^{6} di Ii − Td) + e h(Σ_{i=1}^{6} ei Ii − Te) + f h(Σ_{i=1}^{6} fi Ii − Tf) − T },  (4.1)
where Ii = 0, 1 (i = 1, . . . , 6) denote the states of the input neurons and a, b, c, d, e, f are the synapses connecting the hidden neurons to the output. From conditions 1 through 6 we obtain the following set of inequalities for the synaptic values:

From condition 1 we obtain: ai ≥ Ta ∀ i.  (4.2)
From condition 2 we obtain: bi + bj < Tb ∀ i, j.  (4.3)
From condition 3 we obtain: ci + cj + ck ≥ Tc ∀ i, j, k.  (4.4)
From condition 4 we obtain: di + dj + dk + dl < Td ∀ i, j, k, l.  (4.5)
From condition 5 we obtain: ei + ej + ek + el + em ≥ Te ∀ i, j, k, l, m.  (4.6)
From condition 6 we obtain: fi + fj + fk + fl + fm + fn < Tf ∀ i, j, k, l, m, n.  (4.7)

Fixing the values of the thresholds Ta, Tb, Tc, Td, Te, Tf, equations 4.2 through 4.7 lead to a set of 2^6 − 1 independent inequalities for the synaptic parameters {ai}, {bi}, . . . , {fi}. On the other hand, the total number of examples is 2^6, and since the learning of any single example ensures the fulfillment of only one of the 2^6 − 1 inequalities, only imposing the correct learning of the full set of examples except the simplest one, [000000:0], which determines
Table 1: Some Features of the Networks Used to Compute the Addition of Two Numbers, Bit Shifting, and the N-Bit Parity Functions.

                            Addition   Bit Shifting          Parity
Number of synapses          2N + 4     N(2 + log2(N + 1))    N^2 + N
Total number of examples    3^N        2^N (N + 1)           2^N
MNEFG                       4N         (N + 1)^2             O(2^N)
the sign of the output threshold, ensures the fulfillment of all the inequalities (see equations 4.2–4.7). Hence, we cannot guarantee generalization with any particular subset of examples. Finally, the generalization of this result to a fully connected network with N inputs and N hidden neurons is straightforward, because in this case we have 2^N examples and 2^N − 1 inequalities of the type of equations 4.2 through 4.7. The previous analysis was based on one particular internal representation. This network admits other internal representations, like the one used in Franco and Cannas (1998) for the construction of a network for the addition problem. Along the same lines as before, it can be shown that in this case the minimum number of examples, although less than 2^N − 1, is still of order O(2^N). These results suggest that in this simple architecture for the parity problem (only one hidden layer, fully connected), the MNEFG could always be exponential in the number of inputs.

5 Examples
In the previous sections we analyzed three problems that are expected to have different degrees of complexity, the parity being the "hardest" one. The results are summarized in Table 1. The MNEFG we obtained in the previous sections can be considered lower bounds for the corresponding problems in a fixed standard architecture: a fully connected depth 2 network with the same number of inputs and hidden units. In this sense, the results are comparable for the three problems, suggesting the MNEFG as a possible parameter for classifying complexity. A careful analysis of the exemplars we found for the sum and bit-shifting problems shows that almost all of them can be grouped into pairs whose inputs differ in a single bit and whose outputs are different. In other words, such pairs are the closest ones (considering the Hamming distance between the input patterns) located on either side of a classification boundary. This appears to be a general and simple criterion for the selection of examples.
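The boundary-pair criterion is straightforward to implement for any boolean target. The sketch below (our illustration, not code from the article) enumerates, for a given truth table, all pairs of inputs at Hamming distance 1 with different outputs; these are the candidate training examples under the criterion.

```python
from itertools import product

def boundary_pairs(target, n_bits):
    """All pairs of input patterns differing in exactly one bit and
    having different outputs under `target` (a function of a bit tuple)."""
    pairs = []
    for x in product((0, 1), repeat=n_bits):
        for i in range(n_bits):
            if x[i] == 0:                     # flip each 0 bit once, so
                y = x[:i] + (1,) + x[i + 1:]  # each pair is counted once
                if target(x) != target(y):
                    pairs.append((x, y))
    return pairs

# Example: 4-bit parity, where *every* Hamming-distance-1 pair crosses
# the classification boundary.
parity = lambda x: sum(x) % 2
print(len(boundary_pairs(parity, 4)))  # 32 pairs: 16 patterns x 4 flips / 2
```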
Figure 4: Comparison between the generalization error from selective sampling of examples (lower curves) and purely random sampling (upper curves) for (a) a bit-shifting network with three input bits and (b) a network constructed to compute the addition of two numbers of four-bit length. The purely random selection was done without repeating examples.
In Figure 4 we show numerical simulations of the generalization error as a function of the fraction of examples in the training set, using random epochs of examples selected with the above-described criterion. The results are compared with the corresponding ones for a random selection over the full set of examples, also obtained through numerical simulations. In Figure 4a the comparison is done for a bit-shifting network with three input bits (see the architecture of Figure 2), and in Figure 4b the results for a network computing the addition of two numbers of four-bit length (see the architecture of Figure 1) are shown. All the simulations in this work were done using simulated annealing as the learning algorithm, and the results were averaged over different random epochs and random initial distributions of the synaptic weights. Typical sample sizes run from 25 to 100.
Figure 5: Nonoverlapping receptive field perceptron (NORFP) with N = 6 and two hidden neurons.
In order to check how the above criterion works in a more general architecture, we consider another well-studied depth 2 architecture: the nonoverlapping receptive field perceptron (NORFP), whose structure consists essentially of many simple perceptrons connected to a single output neuron. One can think of these architectures as networks of "decoupled" perceptrons, which in terms of architectural complexity lie between single perceptrons and fully connected feedforward nets. In this sense, this type of architecture has been a natural candidate for extending the studies of learning and generalization properties started with perceptrons with no hidden units (Copelli & Caticha, 1995). An example with two hidden units and N = 6 inputs is depicted in Figure 5. Some properties and computational capabilities of this kind of structure can be found in Hancock, Golea, and Marchand (1994) and Priel, Blatt, Grossman, Domany, and Kanter (1994). First, from the many functions that these architectures can compute (Priel et al., 1994), we selected a target that is simple to analyze and whose computability is ensured. Consider the case of even N and divide the input into two groups of N/2 bits; then the output of the network should be ON only if there are two or more input neurons activated in each group at the same time, and OFF otherwise. We will call this function F2. It is easy to see that this function is solvable by a NORFP with two hidden units, like the one shown in Figure 5. The problem of determining whether two or more input bits are ON in a set of N/2 inputs is clearly linearly separable; hence, the problem can be solved by two simple perceptrons coupled to a third one that performs a boolean AND function. From the definition of the target function, it is simple to see that the number of examples that can be grouped into pairs with only one input bit different and different outputs scales as O(N^2 2^(N/2)) for large N. Hence, it is not expected that a random sampling of examples based on this criterion
Figure 6: Comparison between the generalization error from selective sampling of examples and purely random sampling in the learning of the F2 function (see the text for a definition) by a NORFP with N = 6 (see Figure 5).
can improve the generalization capacity of the net compared with purely random sampling for this function. We verified this assumption through numerical simulations. However, not all of these examples are needed to ensure full generalization. Following the same method as in the previous sections, it can be seen that the set that determines the MNEFG consists of all the following pairs of examples: one example with two input bits ON in one group and only one in the other group, and the example having the same input bits ON in the first group plus another bit ON in the group that had only one. For example, in the case of Figure 5, one possible pair of examples is [110–100:0] and [110–110:1]. A simple counting shows that the MNEFG in this case scales as O(N^2) for large N. In Figure 6 we plot the generalization error obtained through numerical simulations as a function of the fraction of examples in the training set constructed with the last criterion for the network of Figure 5; the same curve is also shown for purely random sampling of examples. This last result led us to propose the following general criterion for selecting examples in boolean networks: select randomly only pairs of examples differing in one input bit and with different outputs, but select with higher probability those having fewer active input bits. Related criteria have been implemented in other methods of active learning (Baum & Haussler, 1989; Kinzel & Ruján, 1990).
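The claim that F2 is solvable by a NORFP with two hidden units can be checked directly. The sketch below is ours (the half-integer thresholds are hand-picked, not values from the article): it implements the two "at least two ON" perceptrons over disjoint halves of the input and the AND output unit, and verifies F2 on all 64 patterns for N = 6.

```python
from itertools import product

def heaviside(x):
    return 1 if x >= 0 else 0

N = 6

def norfp_f2(bits):
    """NORFP of Figure 5: each hidden perceptron sees one half of the
    input and fires when at least two of its N/2 bits are ON; the
    output unit computes the boolean AND of the two hidden units."""
    left, right = bits[:N // 2], bits[N // 2:]
    h1 = heaviside(sum(left) - 1.5)   # >= 2 bits ON in the first group
    h2 = heaviside(sum(right) - 1.5)  # >= 2 bits ON in the second group
    return heaviside(h1 + h2 - 1.5)   # AND

def f2(bits):
    return int(sum(bits[:N // 2]) >= 2 and sum(bits[N // 2:]) >= 2)

assert all(norfp_f2(p) == f2(p) for p in product((0, 1), repeat=N))
print("F2 verified on all 64 patterns")
```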
We checked this last criterion on the NORFP of Figure 5 for random target functions. In order to ensure that only computable functions were considered, we constructed a priori a set of test functions by a random generation of synaptic weights. The results were averaged over the different functions and different random initial conditions. Sample sizes ran from 100 to 1000. The examples were selected with a probability proportional to 1/na^b, where na is the number of active bits of the input and b > 0 is some arbitrary exponent. The results are shown in Figure 7, compared with the corresponding ones for purely random sampling of examples. It is worth stressing that in boolean networks, full generalization can always be achieved with a large enough set of examples. Although we see an improvement in the mean generalization error with our criterion (see Figure 7a), the behavior of the probability of full generalization (see Figure 7b) is much more impressive, showing a finite probability (about 10%) even with just 10% of the examples and reaching a value higher than 80% for half of the examples. A similar test of the criterion was carried out for the addition and bit-shifting problems with the architectures of Figures 1 and 2, respectively. The results are qualitatively similar to those of the NORFP, especially concerning the probability of full generalization. We have seen that our criterion is closely related to the behavior of the MNEFG. In the case of the parity function with the standard architecture, we have shown that the MNEFG scales with the total number of possible examples, and therefore the selection criterion does not work. This example illustrates how the generalization capacity of a network is a result of the interplay between the complexity of the target function and the architecture. The previous structure is the simplest depth 2 one that can solve the parity function, and it has a very poor generalization capacity.
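A minimal sketch of this selection rule follows. It is our own illustration: the target function, the exponent value, and the handling of the all-zero input (treating na = 0 as na = 1 to avoid division by zero) are our choices, not specified in the article. Boundary pairs are drawn with probability proportional to 1/na^b, where na is taken as the number of active bits of the pair's smaller member.

```python
import random
from itertools import product

def weighted_boundary_sample(target, n_bits, n_pairs, b=2.0, seed=0):
    """Draw `n_pairs` Hamming-distance-1 pairs with different outputs,
    weighting each pair by 1 / na**b, where na is the number of active
    bits of the pair's smaller member (na = 0 is treated as na = 1)."""
    pairs, weights = [], []
    for x in product((0, 1), repeat=n_bits):
        for i in range(n_bits):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if target(x) != target(y):
                    pairs.append((x, y))
                    weights.append(1.0 / max(sum(x), 1) ** b)
    rng = random.Random(seed)
    return rng.choices(pairs, weights=weights, k=n_pairs)

# Example: sample 5 training pairs for the 4-bit parity function.
parity = lambda x: sum(x) % 2
for x, y in weighted_boundary_sample(parity, 4, 5):
    print(x, "->", parity(x), "  ", y, "->", parity(y))
```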
More complex nontrivial architectures can be constructed by increasing the depth. A good example is the one proposed by Rumelhart and McClelland (1986), based on the following simple principle. Suppose that N = 2^m with m > 0 an integer (the generalization to other cases is straightforward). Then group the input units into 2^(m−1) pairs, and compute the parity of every pair by a standard depth 2 network (with two hidden units). This procedure can be repeated recursively, giving a tree-structured network of depth 2 log2(N) − 1, which solves the parity of the N inputs. An example for N = 4 is shown in Figure 8. As in the examples of the sum and bit shifting, this network incorporates prior knowledge of a global symmetry of the target function: let us say, the "self-similarity" property of the parity. We checked our criterion numerically on the small network of Figure 8. The results for the generalization error, compared with those corresponding to purely random sampling, are shown in Figure 9. Following the same procedure as in the previous sections, it can be shown that the MNEFG in this case scales as O(N^2). This is consistent with the observed result that even for purely random sampling, this architecture always achieves full generalization with a large enough training set.
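This recursive construction reduces the parity of N bits to a tree of two-bit parity (XOR) subnetworks, each realized by a standard depth 2 threshold network. A sketch (ours; the XOR weights are the usual OR/AND textbook choice, not values from the article):

```python
from itertools import product

def heaviside(x):
    return 1 if x >= 0 else 0

def xor_block(x, y):
    """Two-bit parity by a standard depth 2 threshold network:
    OR and AND hidden units, output fires iff OR is ON but AND is OFF."""
    h_or = heaviside(x + y - 0.5)
    h_and = heaviside(x + y - 1.5)
    return heaviside(h_or - h_and - 0.5)

def tree_parity(bits):
    """Recursively pair the inputs (N = 2^m) and XOR each pair,
    as in the Rumelhart-McClelland tree construction."""
    while len(bits) > 1:
        bits = [xor_block(bits[i], bits[i + 1])
                for i in range(0, len(bits), 2)]
    return bits[0]

assert all(tree_parity(list(p)) == sum(p) % 2
           for p in product((0, 1), repeat=4))
print("tree network verified: 4-bit parity")
```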
Figure 7: Selective versus purely random sampling of examples in the learning of random target functions by a NORFP with N = 6 (see Figure 5). (a) Average generalization error versus fraction of the total number of examples in the training set. (b) Probability of full generalization (zero generalization error).
Finally, we estimated the average CPU times used by the learning procedure on a Pentium II (300 MHz) personal computer for three particular target functions: (1) bit shifting (N = 5) with the architecture of Figure 2; (2) the F2 function (N = 6) on a NORFP (see Figure 5); and (3) parity with the architecture of Figure 8. The results are summarized in Table 2.
Figure 8: Tree-structured network that solves the parity function for N = 4.
Figure 9: Comparison between the generalization error from selective sampling of examples and purely random sampling in the learning of the parity function by a tree-structured network with N = 4 (see Figure 8).
Table 2: Average CPU Time Used by the Different Sampling Procedures.

                          Bit Shifting            F2                      Parity
                          Selected  Random        Selected  Random        Selected  Random
Fraction of examples      0.28      0.28 / 0.81   0.4       0.4 / 1       0.43      0.43 / 0.68
Generalization error      0.0       0.26 / 0.02   0.0       0.15 / 0.02   0.0       0.4 / 0.0
CPU time (sec)            7.0       0.2 / 1.1     7.7       9.0 / 93.3    160       8.8 / 332

The two values in each Random entry correspond, respectively, to the run using the same fraction of examples as the selective case and to the run reaching the minimum generalization error (see the text).
In all cases, we first calculated the average CPU time needed to obtain zero generalization error with the selective sampling procedure, for a fraction of training examples corresponding to the threshold value at which the learning curve first reaches zero (see Figures 4a, 6, and 9). In order to compare with the performance obtained by purely random sampling, we calculated two different CPU times for the latter. First, we calculated for each target function the average time needed to converge when using the same fraction of training examples as in the selective case (which gives a larger generalization error for random sampling than for selective sampling). Second, we calculated the average time needed to obtain the minimum generalization error for each function with random sampling (which involves a larger training set than in the previous situation). We see that in the worst case (bit shifting), the average time needed to reach the minimum error with the selection procedure was about seven times the corresponding one for the random procedure. However, it is worth noting that in the first case, full generalization is always achieved (i.e., with probability one) with only a fraction of the examples, while in the second case, even with all the examples, the probability of learning is less than one. Hence, a greater generalization capacity is obtained with selective sampling in this case at the cost of computational time. In the other two cases, the time performance of selective sampling compares well with the random one.

6 Conclusions and Discussion
We analyzed the generalization capacity of several boolean neural networks for different target functions. Our analysis was based on the analytic estimation of the MNEFG and its scaling properties with the number of inputs N. A close examination of the subsets of examples that determine the MNEFG in the different cases led us to propose a simple selection criterion: first choose pairs of examples with the closest inputs in terms of Hamming distance and with different outputs; examples are then chosen randomly within the data set formed in such a way, but with higher probability for those examples with fewer active input bits. This criterion is architecture independent.
However, a set of examples constructed with such a criterion is not always enough to ensure full generalization, as we have shown, for instance, for the parity. Here is where the architecture comes into play. Our results suggest that the criterion works when the MNEFG scales polynomially with N, and this seems to happen only with local receptive fields, that is, when the network is not fully connected. This effect is possibly related to the existence, in a fully connected network, of multiple internal representations for a given target function (Boers, Kuiper, Happel, & Sprinkhuizen-Kuyper, 1993). In this case, the number of constraints to be imposed on the optimization procedure underlying the learning would have to be very large in order to ensure a proper solution. Therefore, the number of examples needed to ensure global learning is expected to grow exponentially. It is known that networks with too many degrees of freedom (i.e., too many weights) fail to generalize, presenting several undesirable effects, such as overtraining and interference (see Boers et al., 1993). The introduction of local receptive fields drastically reduces the number of possible internal representations, allowing a relatively small number of examples to set the synapses to the proper values associated with a single internal representation. This interpretation is consistent with our results on the NORFP, where the probability of full generalization for random target functions is much higher with selective sampling than with purely random sampling, even with a very small fraction of examples. The use of local receptive fields presents the problem of computability; that is, given a target function, we are not sure a priori whether a not fully connected network like the NORFP can solve it. This is a general problem that involves not only connectivity but also the minimum number of hidden layers and the minimum number of neurons in each layer.
However, specially designed networks can be constructed for particular target functions with just the prior knowledge of some global symmetries of the functions, as in the cases of addition, bit shifting, and parity described here (see also Haykin, 1994). The alternative could be a combination of the selection procedure described here with some pruning technique, which would allow reducing dynamically the connectivity of an initially fully connected network. We are working along these lines. Finally, we showed how the scaling properties of the MNEFG for large N can be obtained by generalizing the analysis of the small-network equations. The MNEFG appears as an interesting complexity parameter, consistent with the intuitive notion that the more complex the problem is, the harder it is to learn. Our results illustrate how the interplay between function and architecture complexities can be studied through this parameter. In the case of depth 2 networks, we have shown that for relatively simple functions like addition and bit shifting (though with local receptive fields), the MNEFG scales polynomially, while for the more complex one (parity), it scales exponentially. It is worth noting that this is the simplest depth 2 architecture that solves the parity, and hence the MNEFG reflects the intrinsic
complexity of this function. This is an interesting result, since the parity with this architecture is one of the most used structures for testing learning algorithms. We also showed how this difficulty can be overcome by increasing the depth of the network. Once again, by using local receptive fields, the MNEFG scales polynomially with N. This is consistent with the numerical results, which show that not only with selective sampling, but even with purely random sampling, full generalization is obtained with a fraction of the examples.

Acknowledgments
This work was partially supported by the following agencies: CONICET (Argentina), CONICOR (Córdoba, Argentina), and Secretaría de Ciencia y Técnica de la Universidad Nacional de Córdoba (Argentina).

References

Baum, E. B. (1991). Neural net algorithms that learn in polynomial time from examples and queries. IEEE Transactions on Neural Networks, 2(1), 5.
Baum, E. B., & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1), 151.
Boers, E. J. W., Kuiper, H., Happel, B. L. M., & Sprinkhuizen-Kuyper, I. G. (1993). Designing modular artificial neural networks. In H. A. Wijshoff (Ed.), Proceedings of Computing Science in the Netherlands (CSN'93) (p. 87). Amsterdam: SION, Stichting Mathematisch Centrum.
Cannas, S. A. (1995). Arithmetic perceptrons. Neural Computation, 7(1), 173.
Cohn, D. (1996). Neural network exploration using optimal experiment design. Neural Networks, 9(6), 1071.
Cohn, D., Atlas, L., & Ladner, R. (1994). Improving generalization with active learning. Machine Learning, 15(2), 201.
Copelli, M., & Caticha, N. (1995). On-line learning in the committee machine. J. Phys. A: Math. Gen., 28, 1615.
Franco, L., & Cannas, S. A. (1998). Solving arithmetic problems using feedforward neural networks. Neurocomputing, 18, 61–79.
Hancock, T. R., Golea, M., & Marchand, M. (1994). Learning nonoverlapping perceptron networks from examples and membership queries. Machine Learning, 16, 161.
Haykin, S. (1994). Neural networks: A comprehensive foundation. New York: Macmillan.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Reading, MA: Addison-Wesley and Santa Fe Institute.
Jung, G., & Opper, M. (1996). Selection of examples for a linear classifier. J. Phys. A: Math. Gen., 29(7), 1367.
Kinzel, W., & Ruján, P. (1990). Improving a network generalization ability by selecting examples. Europhysics Letters, 13(5), 473.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Plutowski, M., & White, H. (1993). Selecting concise training sets from clean data. IEEE Transactions on Neural Networks, 4(2), 305.
Priel, A., Blatt, M., Grossman, T., Domany, E., & Kanter, I. (1994). Computational capabilities of restricted two-layered perceptrons. Physical Review E, 50, 577.
Rumelhart, D. E., & McClelland, J. L. (1986). Parallel distributed processing (Vol. 1). Cambridge, MA: MIT Press.
Tessauro, G., & Janssens, R. (1988). Scaling relationships in back-propagation learning: Dependence on predicate order (Tech. Rep. No. CCSR-88-1). Urbana-Champaign, IL: Center for Complex System Research.

Received September 10, 1998; accepted September 26, 1999.
LETTER
Communicated by Halbert White
Stationary and Integrated Autoregressive Neural Network Processes

Adrian Trapletti
Department of Operations Research, Vienna University of Economics and Business Administration, A-1090 Vienna, Austria

Friedrich Leisch
Kurt Hornik
Department of Statistics and Probability Theory, Vienna University of Technology, A-1040 Vienna, Austria

We consider autoregressive neural network (AR-NN) processes driven by additive noise and demonstrate that the characteristic roots of the shortcuts (the standard conditions from linear time-series analysis) determine the stochastic behavior of the overall AR-NN process. If all the characteristic roots are outside the unit circle, then the process is ergodic and stationary. If at least one characteristic root lies inside the unit circle, then the process is transient. AR-NN processes with characteristic roots lying on the unit circle exhibit either ergodic, random walk, or transient behavior. We also analyze the class of integrated AR-NN (ARI-NN) processes and show that a standardized ARI-NN process "converges" to a Wiener process. Finally, least-squares estimation (training) of the stationary models and testing for nonstationarity are discussed. The estimators are shown to be consistent, and expressions for the limiting distributions are given.

1 Introduction
Over the past decade, neural network (NN) models have been extensively used for a wide range of applications in the domain of time-series analysis and forecasting. Although the NN time-series application literature has grown in a spectacular fashion, relatively few theoretical advancements have been made. The aim of this article is to provide a rigorous analysis of the stochastic properties of what is probably the most popular class of NNs for time-series analysis and forecasting: autoregressive neural network (AR-NN) processes driven by additive noise. The scalar AR-NN(p) process can be seen as a natural generalization of the linear autoregressive AR(p) process and is defined by stochastic difference equations of the form

y_t = h(x_{t−1}, θ) + ε_t.    (1.1)
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2427–2450 (2000)
h(x_{t−1}, θ) denotes a multilayer perceptron (MLP) with input vector x_{t−1} = (y_{t−1}, ..., y_{t−p})′ and weight (parameter) vector θ. ε_t is a sequence of independent and identically distributed (i.i.d.) random variables and represents the noise process. If Eε_t = 0, then h(x_{t−1}, θ) equals the conditional expectation E[y_t | x_{t−1}]. The output of an MLP with q hidden units, shortcut connections γ, and activation function G(·) is given by

h(x_{t−1}, θ) = ν + γ′x_{t−1} + Σ_{i=1}^q β_i G(ω_i′ x_{t−1} + μ_i).    (1.2)
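For concreteness, the conditional mean of equations 1.1 and 1.2 can be sketched in a few lines of code (a minimal illustration; the helper name and the weight values are ours, not the article's):

```python
import numpy as np

def arnn_mean(x, nu, gamma, beta, omega, mu):
    """Conditional mean h(x, theta) of an AR-NN(p): intercept nu,
    shortcut weights gamma, and q logistic hidden units with output
    weights beta, input weights omega (q x p), and biases mu."""
    G = lambda u: 1.0 / (1.0 + np.exp(-u))   # logistic activation
    return nu + gamma @ x + beta @ G(omega @ x + mu)

# AR-NN(2) with q = 1 hidden unit and illustrative weights
x = np.array([0.3, -0.1])                    # (y_{t-1}, y_{t-2})
h = arnn_mean(x, nu=0.0, gamma=np.array([0.5, -0.2]),
              beta=np.array([1.0]),
              omega=np.array([[1.0, 0.5]]), mu=np.array([0.0]))
print(h)
```

Note that the shortcut term γ′x is unbounded while the hidden-unit sum is bounded, which is exactly the split the stationarity analysis below exploits.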
The weight vectors are γ (p × 1), β ≡ (β_1, ..., β_q)′ (q × 1), ω ≡ (ω_1′, ..., ω_q′)′ (pq × 1), and μ ≡ (μ_1, ..., μ_q)′ (q × 1); together with the intercept ν (scalar), they are collected in the r × 1 network weight vector θ = (γ′, β′, ω′, μ′, ν)′ with r = 1 + p + q(2 + p). The activation function G(·) is assumed to be bounded, for example, the logistic function G(x) = (1 + exp(−x))^{−1}. However, for most of the results in this article, the exact structure of the MLP is not important, provided that the nonlinear part of the network is bounded. It is well known that MLPs can provide arbitrarily accurate approximations to arbitrary functions in a variety of normed function spaces if the dimension of the weight space is sufficiently large (see, e.g., Hornik, Stinchcombe, & White, 1989). However, we focus here on bounded MLP complexity; the network weight space Θ is defined as a subset of the finite-dimensional space R^r. Nevertheless, this treatment of MLPs as parametric models still allows us to approximate an arbitrary function to some degree. For example, White (1989) approximated the Hénon map with five hidden units.
Of particular importance for statistics in time-series analysis is the question of stationarity and ergodicity of a process. Since for stationary ergodic processes a single trajectory displays the whole probability law of the process, averaging over time gives the same result as averaging over the population. Thus, the sample mean converges to the population mean, provided this exists. Whereas for linear AR processes conditions for stationarity and ergodicity are well known (see, e.g., Brockwell & Davis, 1993), this question has not gained much attention in the neural networks literature. Leisch, Trapletti, and Hornik (1999) provide first results for the ergodicity and stationarity of AR-NN processes. They obtain sufficient conditions for ergodicity and stationarity, but it is not clear to what extent these conditions are necessary.
Furthermore, they introduce the class of integrated AR-NN (ARI-NN) processes. This is a particularly interesting class of nonstationary AR-NN processes, since they can be made stationary by simple differencing. In a related article in the field of econometrics, Corradi, Swanson, and White (1998) analyze ergodicity, stationarity, and cointegration in the context of (first-order) Markov processes that can be represented as the sum
of a linear plus a bounded nonlinear component. They also provide tests for stationarity and cointegration.
In this article we add to these results in several ways using Markov chain and time-series analysis theory. In section 2 the results of Leisch et al. (1999) are extended. We show that their conditions for ergodicity and stationarity are sufficient but not necessary. Moreover, we provide a new proof of the geometric ergodicity of the AR-NN process, which can also be used to establish conditions for the existence of moments. We also give useful results concerning the probabilistic structure of AR-NN processes. We show that the "memory" of an AR-NN process vanishes exponentially fast, which implies the existence of a spectral density. Furthermore, we obtain conditions under which the shape of the stationary distribution is symmetric.
Nonstationary time series play an important role in a variety of application fields such as economics and the physical sciences. In section 3 we focus on nonstationary AR-NN processes that exhibit either transient or random walk behavior. We analyze the class of ARI-NN processes as defined in Leisch et al. (1999) and show that ARI-NN processes "converge" to a Wiener process when appropriately standardized.
In section 4 we apply the results from the previous sections to the problem of training (estimation) and testing in the context of AR-NN processes. In particular, least-squares estimators of the weights of the stationary AR-NN model are shown (under mild regularity conditions) to be consistent and asymptotically normal. We introduce the hypothesis test for a unit root of Phillips (1987), Perron (1988), and Phillips and Perron (1988) (PP) as a tool to discriminate between stationary AR-NN and ARI-NN models. This test is shown to be applicable for AR-NN processes, and the limiting distribution follows from functional central limit theory. The PP test is also a complement to Corradi et al.
(1998), where a stationary null is tested against a unit root alternative.
All proofs are deferred to the appendix.

2 Ergodicity and Stationarity

2.1 Some Markov Chain Theory. To start, we write the AR-NN(p) process as a Markov chain on R^p and outline some relevant Markov chain theory (for details, see Chan & Tong, 1985; Feigin & Tweedie, 1985; Tong, 1990; and Meyn & Tweedie, 1993). Defining η_t = (ε_t, 0, ..., 0)′, equation 1.1 becomes

x_t = H(x_{t−1}) + η_t,    (2.1)

where H(x_{t−1}) = (h(x_{t−1}, θ_0), y_{t−1}, ..., y_{t−p+1})′. Let

P^n(x, A) = P(x_{t+n} ∈ A | x_t = x)

denote the probability that {x_t} moves from x to the set A ∈ ℬ in n steps.
Here, θ_0 is a fixed element of Θ that indexes the process generating the data. For the sake of convenience and clarity, we suppress the dependence of H, x_t, y_t, P^n, and P on θ_0. Then {x_t} with P(x, A) ≡ P^1(x, A) forms a Markov chain with state-space (R^p, ℬ, λ), where ℬ is the Borel σ-algebra on R^p and λ denotes the Lebesgue measure on (R^p, ℬ) (see also Chan & Tong, 1985, pp. 666, 667).
We are interested in conditions on the weights under which {x_t} is a strictly stationary process. In Markov chain theory, this problem is closely related to the asymptotic behavior of {x_t} as t becomes large. The Markov chain {x_t} is geometrically ergodic if there exist a probability measure π on (R^p, ℬ) and a constant ρ > 1 such that

lim_{n→∞} ρ^n ‖P^n(x, ·) − π(·)‖ = 0    (2.2)
for each x ∈ R^p, where ‖·‖ denotes the total variation norm. If equation 2.2 holds for ρ = 1, then {x_t} is called ergodic. It is easy to see that π in equation 2.2 satisfies the invariance equation,

π(A) = ∫ P(x, A) π(dx),   ∀A ∈ ℬ.    (2.3)
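As a quick numerical sanity check of the invariance equation, consider the simplest special case, a linear gaussian AR(1) (an illustrative choice of ours, not the general AR-NN): there π is gaussian with variance σ²/(1 − γ₁²), and one transition step of the chain leaves it unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sig = 0.5, 1.0
var_pi = sig**2 / (1.0 - phi**2)    # stationary variance of the AR(1)

x0 = rng.normal(0.0, np.sqrt(var_pi), size=200_000)   # draw x_0 ~ pi
x1 = phi * x0 + rng.normal(0.0, sig, size=x0.size)    # one step of the chain

# invariance: x_1 is again distributed according to pi (same variance)
print(np.var(x1), var_pi)
```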
Thus, π is called the stationary measure. Suppose that {x_t} is ergodic. Then equation 2.2 implies that the distribution of x_t converges to π. Hence, {x_t} is asymptotically stationary. If {x_t} is started either with initial distribution π, that is, x_0 ∼ π, or in the infinite past, then {x_t} is strictly stationary. This solution to equation 2.3 will be called the stationary or steady-state solution of equation 2.1.
To establish the ergodicity of AR-NN processes, we need the concepts of irreducibility and aperiodicity. The Markov chain {x_t} is called irreducible if

Σ_{n=1}^∞ P^n(x, A) > 0,   ∀x ∈ R^p,

whenever λ(A) > 0. This basically means that all parts of the state-space can be reached by the Markov chain irrespective of the starting point. An irreducible Markov chain is aperiodic if there exists an A ∈ ℬ with λ(A) > 0 such that for all C ∈ ℬ, C ⊆ A with λ(C) > 0, there exists a positive integer n such that

P^n(x, C) > 0 and P^{n+1}(x, C) > 0,   x ∈ R^p

(see also Tong, 1990, proposition A1.2). Hence, for an aperiodic chain, it is impossible that the chain returns to given sets only at specific time points.
To formalize explosive behavior of a Markov chain, we need the concept of transience. A state v of a Markov chain on a countable space (e.g., with {0, 1} or {0, 1, 2, ...} generating the state-space instead of R^p) is called transient if the expected number of visits to v is finite. The Markov chain is called transient if every state is transient. Transience of a Markov chain {x_t} with R^p generating the state-space means that P_x{x_t → ∞} = 1 for each x ∈ R^p (for a definition of transience on a general state-space see, e.g., Meyn & Tweedie, 1993, chap. 8). This means that the trajectory of {x_t} visits each compact set only finitely often for each initial state x.

2.2 Ergodic and Stationary AR-NN Processes. We follow Leisch et al. (1999) and extend their results to the case of characteristic roots lying on the unit circle. For most general time-series models, irreducibility and aperiodicity cannot be assumed automatically. However, for an AR-NN process, these conditions can be checked easily. Essentially, it is enough if the distribution of the noise process has an absolutely continuous component with respect to Lebesgue measure and if the support of the probability density function (p.d.f.) is sufficiently large. In this case every non-null p-dimensional hypercube is reached in p (and also in p + 1) steps with positive probability (and hence so is every non-null Borel set A).
Lemma 1. Let {x_t} be the Markov chain of the AR-NN process (see equation 2.1) with continuous activation function G(·). Let ε_t be i.i.d. with a distribution that is absolutely continuous with respect to Lebesgue measure λ. If the p.d.f. f(·) of ε_t is positive everywhere in R, then {x_t} is an irreducible and aperiodic Markov chain.
Obviously, gaussian white noise (mean-zero i.i.d. normal ε_t) fulfills the conditions of lemma 1.
We can now state conditions under which the stochastic difference equations 1.1 and 1.2 or, equivalently, 2.1 have stationary solutions.

Theorem 1. Suppose the Markov chain {x_t} of the AR-NN process (see equation 2.1) satisfies the conditions of lemma 1, the activation function G(·) is bounded, and E|ε_t| < ∞. Let

γ(z) := 1 − Σ_{i=1}^p γ_i z^i,   z ∈ ℂ,    (2.4)

denote the characteristic polynomial associated with the shortcut connections. Then the condition

γ(z) ≠ 0   ∀z, |z| ≤ 1    (2.5)

is sufficient but not necessary for the ergodicity of the Markov chain {x_t}. Furthermore, if equation 2.5 holds, then {x_t} is geometrically ergodic and the associated AR-NN process {y_t} is asymptotically stationary.
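Condition 2.5 is easy to check numerically from the shortcut weights alone; a sketch (the helper name is ours):

```python
import numpy as np

def roots_outside_unit_circle(gamma):
    """Check condition 2.5: gamma(z) = 1 - sum_{i=1}^p gamma_i z^i
    has no root with |z| <= 1.  gamma = (gamma_1, ..., gamma_p)."""
    # np.roots expects coefficients from the highest degree down:
    # -gamma_p z^p - ... - gamma_1 z + 1
    coeffs = np.concatenate((-np.asarray(gamma, float)[::-1], [1.0]))
    return bool(np.all(np.abs(np.roots(coeffs)) > 1.0))

print(roots_outside_unit_circle([0.5]))   # root z = 2: stationary case
print(roots_outside_unit_circle([1.5]))   # root z = 2/3: condition fails
```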
This result is not surprising, since boundedness is a strong form of stability. It shows that essentially the linear part of the AR-NN process determines whether the overall process is stationary. Thus, the usual condition for AR processes, that all roots of the characteristic polynomial lie outside the unit circle, can be used. Moreover, an AR-NN without shortcut connections always leads to a stationary solution.
To illustrate that condition 2.5 is not necessary, suppose a root of the characteristic polynomial (see equation 2.4) (a characteristic root) lies on the unit circle and all other roots are outside the unit circle. Based on linear time-series theory, random walk behavior with or without drift would be expected, depending on the nonlinear part and the intercept of the AR-NN process. In contrast to a purely linear process, for an AR-NN process it is also possible that the drift generated by the nonlinear part is toward the "center" of the state-space. We consider the following special case of an AR-NN(1) process as an example:

y_t = y_{t−1} + G(y_{t−1}) + ε_t,    (2.6)

where G(·) satisfies the conditions of theorem 1 and is asymptotically constant, that is,

lim_{y→∞} G(y) = a and lim_{y→−∞} G(y) = b for some a, b with |a|, |b| < ∞.

Proposition 1. Let ε_t satisfy the conditions of lemma 1, Eε_t = 0, and E|ε_t|² < ∞. If a < 0 and b > 0, then the solution {y_t} of equation 2.6 is ergodic and asymptotically stationary.
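Proposition 1 can be illustrated by simulation; G(y) = −tanh(y/2) is one bounded choice (ours, for illustration) with a = −1 < 0 and b = 1 > 0:

```python
import numpy as np

rng = np.random.default_rng(1)
G = lambda y: -np.tanh(y / 2.0)   # lim_{y->+inf} G = -1, lim_{y->-inf} G = +1

n = 20_000
y = np.zeros(n + 1)
for t in range(1, n + 1):         # y_t = y_{t-1} + G(y_{t-1}) + eps_t
    y[t] = y[t - 1] + G(y[t - 1]) + rng.normal()

# despite the unit root in the linear part, the state-dependent drift
# from G pushes the path back toward the center of the state-space
print(y.min(), y.max())
```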
Since the drift toward the stationary solution is at most linear, we can state ergodicity but not geometric ergodicity.
For simplicity, we use in the remainder of this article the term stationary AR-NN process for strictly stationary or steady-state solutions of equations 1.1 and 1.2 for which theorem 1 holds.
Geometric ergodicity in theorem 1 implies that the "memory" of the AR-NN process vanishes exponentially fast. A convenient way of describing the memory of a process is through the concept of mixing processes. Let {y_t} be a stochastic process on a probability space (Ω, F, P), and let F_a^b = σ(y_t ; a ≤ t ≤ b) be the σ-algebra generated by the random variables y_a, y_{a+1}, ..., y_b. Define

α(m) := sup |P(E ∩ F) − P(E)P(F)|,    (2.7)

where the supremum is taken over all E ∈ F_{−∞}^n, F ∈ F_{n+m}^∞, and n. This quantity measures the dependence between events separated by at least m time periods. We call the process {y_t} strong or α-mixing if α(m) → 0 as m → ∞.
The following result establishes strong mixing for geometrically ergodic AR-NN processes.

Corollary 1. Let {y_t} be a stationary AR-NN process satisfying the conditions of theorem 1. Then {y_t} is strong mixing with mixing coefficients

α(m) ≤ k ρ^m for some k < ∞ and ρ ∈ (0, 1).

This result is, as we will see later, of particular importance for the asymptotic theory of inference and testing, since mixing processes are sufficiently well behaved to allow laws of large numbers and central limit theorems to be established. If the memory of the stationary AR-NN process vanishes exponentially fast, then the autocovariance function also approaches zero at an exponential rate.

Corollary 2. Let {y_t} be as in corollary 1 and define c(τ) := Cov(y_t, y_{t+τ}). If E|y_t|⁴ < ∞, then |c(τ)| ≤ k̃ ρ̃^τ for some k̃ < ∞ and ρ̃ ∈ (0, 1).
Sometimes the "frequency domain" analysis of time series provides an illuminating alternative way of viewing a stationary process. An important tool is the spectral density, the Fourier transform of the covariance function of the process. Although not every stationary process has a spectral density, an absolutely summable autocovariance function as in corollary 2 implies the existence of a spectral density.

Corollary 3. Every AR-NN process {y_t} that satisfies the conditions of corollary 2 has a spectral density

f(λ) := (1/2π) Σ_{τ=−∞}^∞ e^{−iτλ} c(τ) ≥ 0,   ∀λ ∈ [−π, π],

and the autocovariance function has the spectral representation

c(τ) = ∫_{−π}^{π} e^{−iλτ} f(λ) dλ.
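The exponential decay of c(τ) in corollary 2 is easy to see empirically; here we simulate a stationary AR-NN(1) (illustrative weights ours) and estimate its autocovariances:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
y = np.zeros(n)
for t in range(1, n):             # shortcut 0.5 plus one bounded tanh unit
    y[t] = 0.5 * y[t - 1] + 0.4 * np.tanh(y[t - 1]) + rng.normal()

y = y - y.mean()
c = np.array([np.mean(y[: n - tau] * y[tau:]) for tau in range(25)])
print(c[:6] / c[0])               # autocorrelations fall off geometrically
```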
2.3 The Stationary Distribution π. We now turn to the question of the existence of moments of {y_t} in its stationary regime. Obviously, the bounded part of the AR-NN process has no adverse influence on the existence of
moments. We get the following simple result:

Theorem 2. Let {y_t} be a stationary AR-NN process as in theorem 1. Then

E|y_t|^k = ∫ |x|^k π(dx) < ∞ ⟺ E|ε_t|^k < ∞.

Hence, for example, for gaussian white noise, the AR-NN process is k-integrable for all k < ∞. While we have used the concept of stationarity in the strict sense, it is often of interest to consider second-order stationarity:

Corollary 4. Let {y_t} be as in theorem 1. Then {y_t} is weakly stationary if and only if E|ε_t|² < ∞.
Unfortunately, it is not possible to describe the stationary distribution π in general. Although an implicit solution for π is always defined by equation 2.3, in most cases this equation cannot be solved in closed form. There exist different numerical techniques for solving this problem (see Tong, 1990, for an overview). However, a partial result about the shape of π is obtained in the following. Since this result depends on the particular structure of the AR-NN, we state it only for the logistic activation function G(x) = (1 + exp(−x))^{−1}. First, we impose a minimality condition on the nonlinear part of the AR-NN process (as in Sussmann, 1992, and Hwang & Ding, 1997).

Assumption 1. Each hidden unit makes a unique nontrivial contribution to the overall AR-NN process: β_{0i} ≠ 0, ω_{0i} ≠ 0 for all i = 1, ..., q, and (ω_{0i}′, μ_{0i}) ≠ ±(ω_{0j}′, μ_{0j}) for all i ≠ j.

For the logistic activation function, assumption 1 ensures that it is not possible to state equation 1.2 with fewer hidden units, that is, with a new q̃ < q. (See Hwang & Ding, 1997, concerning the logistic function and Sussmann, 1992, concerning the essentially equivalent tanh squasher.)

Proposition 2. Suppose assumption 1 holds and {y_t} is a stationary AR-NN process with activation function G(x) = (1 + exp(−x))^{−1}. Let ε_t have a p.d.f. f(·) that is symmetric about the origin. Then all finite stationary joint p.d.f.s of y_{t1}, y_{t2}, ..., y_{tk} are symmetric about the origin for all t_1, t_2, ..., t_k and for k ≥ 1 if and only if μ_i = 0 for all i = 1, ..., q and 2ν + Σ_{i=1}^q β_i = 0.

Since a symmetric joint p.d.f. implies a symmetric marginal p.d.f. for y_t, the expectation of y_t is zero, that is, Ey_t = 0.
Corollary 5. Let the assumptions of proposition 2 hold and k be an odd integer. Then Ey_t^k = 0.

3 Nonstationary and Integrated Processes

3.1 Transience. By analogy with linear time-series theory, it seems plausible that the AR-NN process exhibits transient behavior if at least one characteristic root lies inside the unit circle. Transience of the AR-NN process is given in the following theorem:

Theorem 3. Let {x_t} be the irreducible and aperiodic Markov chain of an AR-NN process (see equation 2.1) with γ(·) as in theorem 1. Then {x_t} is transient if E exp(|ε_t|) < ∞, γ(·) has distinct roots, and γ(z) = 0 for at least one z with |z| < 1.
We believe that the distinctness assumption on the roots is not necessary and that the moment condition could be relaxed.

3.2 ARI-NN Processes. In linear time-series modeling, it is a rather common practice to detrend a time series by taking differences if the series exhibits nonstationary features. An example is the class of autoregressive integrated moving average (ARIMA) processes (see, e.g., Brockwell & Davis, 1993), where the difference operator must be applied one or more times before a stationary representation is appropriate. This approach has proved useful in many applications.
We start our analysis with an introduction of the class of ARI-NN processes as given in Leisch et al. (1999). Let d be a nonnegative integer. As in linear time-series analysis, we define the d-th order difference operator,

Δ^d z_t := Δ(Δ^{d−1} z_t), d ≥ 2,   Δ¹ z_t ≡ Δ z_t := (1 − B) z_t = z_t − z_{t−1},
where B is the backward shift operator, B z_t := z_{t−1}. Polynomials in B and Δ are manipulated in the same way as polynomial functions of real variables, for example,

γ(B) z_t = (1 − Σ_{i=1}^p γ_i B^i) z_t = z_t − Σ_{i=1}^p γ_i z_{t−i},

where γ(z) is the characteristic polynomial (see equation 2.4). Then z_t is said to be an ARI-NN(p, d) process if y_t = Δ^d z_t is a stationary AR-NN(p) process.
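The difference operator Δ^d corresponds directly to `numpy.diff`; for example:

```python
import numpy as np

z = np.array([1.0, 3.0, 6.0, 10.0, 15.0])   # toy series
d1 = np.diff(z, n=1)    # Delta z_t  = z_t - z_{t-1}
d2 = np.diff(z, n=2)    # Delta^2 z_t = Delta(Delta z_t)
print(d1)               # [2. 3. 4. 5.]
print(d2)               # [1. 1. 1.]
```

A series z_t is thus ARI-NN(p, d) exactly when `np.diff(z, n=d)` behaves like a stationary AR-NN(p) series.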
Hence, an ARI-NN process reduces to a stationary AR-NN process after differencing finitely many times. Moreover, an ARI-NN(p, d) process is stationary if and only if d = 0, in which case it reduces to a stationary AR-NN(p) process. A stochastic process {z_t} is called integrated of order d, briefly z_t ∼ I(d), if Δ^d z_t is stationary and Δ^{d−1} z_t is not stationary. Thus, an ARI-NN(p, d) process is I(d). Note that we consider only stationary AR-NN processes for which condition 2.5 holds. (See also the remark after proposition 1.)
We now analyze the long-term behavior of an ARI-NN(p, 1) process {z_t}_{t=0}^∞. Following Phillips (1987), we write the ARI-NN(p, 1) process as the partial sum

z_t = Σ_{i=1}^t y_i,   t = 1, 2, ...,    (3.1)

where {y_t} is a stationary AR-NN(p) process with Ey_t = 0. Without loss of generality, we have assumed that z_0 = 0. From the sequence {z_t} we construct the random variables

Z_t(r) = (1/(σ√t)) z_{⌊tr⌋},   (j − 1)/t ≤ r ≤ j/t (j = 1, ..., t),

where ⌊·⌋ denotes the integer part of its argument. We further assume that σ² = lim_{t→∞} E[t^{−1} z_t²] exists and σ² > 0.

Theorem 4.
If E|y_t|^δ < ∞ for some δ > 2, then

Z_t(·) ⇒ W(·),   t → ∞,

where W(·) is the Wiener process on the unit interval and ⇒ denotes weak convergence.

This type of result is often referred to as a functional central limit theorem. Here, it basically states that a standardized zero-mean ARI-NN(p, 1) process "converges" to a Wiener process. If the conditions of the theorem hold, then σ² = Ey_1² + 2 Σ_{k=2}^∞ E[y_1 y_k] < ∞.

3.3 Discussion. Using the backward-shift operator polynomials, we rewrite equations 1.1 and 1.2 as
γ(B) y_t = Σ_{i=1}^q β_i G(ω_i(B) y_{t−1} + μ_i) + ε_t,    (3.2)

where γ(z) is the characteristic polynomial (see equation 2.4) and ω_i(z) = Σ_{j=1}^p ω_{i,j} z^{j−1}. The stochastic behavior of the AR-NN process is essentially determined by the characteristic roots of the shortcuts:

1. If γ(z) ≠ 0 for |z| ≤ 1, then {y_t} is stationary.

2. If γ(z) = 0 for some |z| < 1, then {y_t} is essentially transient.

3. If γ(z) ≡ γ*(z)(z_1* − z) ··· (z_k* − z) for some k ≤ p, with roots |z_1*| = ··· = |z_k*| = 1 and a polynomial γ*(z) ≠ 0 for |z| ≤ 1, then {y_t} exhibits either ergodic, random walk, or transient behavior, depending on the sum on the right-hand side of equation 3.2.

Consequently, representation 3 has to be used to parameterize AR-NN processes that exhibit random walk behavior. However, representation 3 is too general. First, it does not ensure random walk behavior. Second, if {y_t} exhibits random walk behavior, then the variance of the variables y_{t−1}, ..., y_{t−p} grows with t. Because the variables enter the nonlinear part in levels (not differenced), the bounded activation functions eventually become saturated as t becomes large and, hence, the nonlinearity in E[y_t | y_{t−1}, ..., y_{t−p}] vanishes. These drawbacks can be removed by allowing only differenced lagged variables to enter the nonlinear part. For simplicity, we consider the case γ(z) ≡ γ*(z)(1 − z) with γ*(z) ≠ 0 for |z| ≤ 1. Then equation 3.2 becomes

γ(B) y_t = γ*(B)(1 − B) y_t = γ*(B) Δy_t
         = Σ_{i=1}^q β_i G(ω_i(B)(1 − B) y_{t−1} + μ_i) + ε_t
         = Σ_{i=1}^q β_i G(ω_i(B) Δy_{t−1} + μ_i) + ε_t,    (3.3)
which is exactly the representation of an ARI-NN(p, 1) process (replace y_t in equation 3.1 with Δy_t and z_t with y_t). Because the differenced variables are stationary, the nonlinearity never vanishes, and {y_t} "converges" to a Wiener process. Thus, ARI-NN processes are potentially well suited to analyzing time series that exhibit both random walk behavior and nonlinear effects.

4 Training and Testing

4.1 Consistency. The general results from the previous sections are applied to the training of AR-NN models in the context of AR-NN processes for fixed model orders p and q (see equations 1.1 and 1.2). We follow White (1994), although other results from the literature could be used. The following assumption specifies the data generation process:

Assumption 2. The data generation process for the sequence of scalar real-valued observations {y_t}_{t=1}^n is a stationary AR-NN(p) process (see equation 2.1) with continuous activation function G(·) and r × 1 network weight vector θ_0 ∈ Θ satisfying assumption 1. The network weight space Θ is a compact subset of R^r for some r ∈ N.
The goodness of fit of an AR-NN model as a function of θ for a given time series {y_t}_{t=1}^n can be measured by Q_n(θ) = n^{−1} Σ_{t=1}^n (y_t − h(x_{t−1}, θ))². However, the main interest is in the overall model performance, the performance for future observations, Q̄_n(θ) = E Q_n(θ), where the expectation is with respect to the stationary distribution π. Unfortunately, choosing θ to solve inf_{θ∈Θ} Q̄_n(θ) is not possible, since the stationary distribution π is unknown. Nevertheless, an approximate solution θ̂_n can be found by solving the problem

Q_n(θ̂_n) = inf_{θ∈Θ} Q_n(θ).    (4.1)

This so-called nonlinear least-squares estimator θ̂_n provides a good approximation, since for large n, Q_n(θ) approximates Q̄_n(θ) well. Solving equation 4.1 is straightforwardly implemented by a batch learning procedure, though the solution can also be approximated by an on-line learning procedure as described in White (1996).
A fundamental problem for statistical inference with MLPs is the unidentifiability of the network weights (see, e.g., White, 1996; Hwang & Ding, 1997). The next assumptions deal with this problem. First, we rule out equivalent minima of equation 4.1 achieved by permuting the hidden units and changing the sign of the network weights on each side of a hidden unit (see also Kůrková & Kainen, 1994, proposition 3.9).

Assumption 3. Θ is such that the network weights from the hidden layer to the output unit are nonnegative and sorted, that is, 0 ≤ β_1 ≤ ··· ≤ β_q. Equality is resolved by arranging the input-to-hidden-layer weights according to their size.

Without loss of generality, we can assume that the weight vectors lie in the cone specified by assumption 3. Otherwise, simple sign-change and permutation transformations will bring them into this form. Furthermore, if assumption 1 holds and two or more of the β_i are equal, then it is always possible to sort the remaining weights from the input layer to the hidden layer. For certain activation functions G(·), assumptions 1 and 3 ensure that equation 4.1 has a unique global minimum. These activation functions are
specified in the following assumptions. Related assumptions are condition A and condition B in Hwang and Ding (1997).

Assumption 4. G(·) is a bounded, nonconstant, and asymptotically constant function that is shifted odd (a function g: R → R is called shifted odd if g(x) = f(x) + t for some constant t and odd function f(·)).

Assumption 5. The family of functions {1} ∪ {G(g_i(x))}_{i=1,...,k} is linearly independent for any positive integer k, where {g_i(x)}_{i=1,...,k} is a family of nonconstant linear affine functions on R, no two of which are sign equivalent. (Two functions g_i(x) and g_j(x) are called sign equivalent iff |g_i(x)| = |g_j(x)| for all x ∈ R.)

Fortunately, the popular logistic and the essentially equivalent tanh squashers satisfy assumptions 4 and 5. (For a specific discussion of identifiability concepts, see Sussmann, 1992; Kůrková & Kainen, 1994; Hwang & Ding, 1997.)
We can now state the strong consistency of the nonlinear least-squares estimator θ̂_n:

Theorem 5. Suppose assumptions 2–5 hold. If E|ε_t|^{2+δ_1} < ∞ for some δ_1 > 0, then

θ̂_n →^{a.s.} θ_0,   n → ∞,

where →^{a.s.} denotes almost sure (a.s.) convergence.

4.2 Asymptotic Normality. We add the following two conditions to state
the asymptotic normality of θ̂_n:

Assumption 6. The network weight vector θ_0 is interior to Θ, and G(·) is twice continuously differentiable.

Assumption 7. Let {g_i(x)}_{i=1,...,k} be a family of functions as given in assumption 5. The family of functions {1} ∪ {x} ∪ {G(g_i(x))}_{i=1,...,k} ∪ {G′(g_i(x))}_{i=1,...,k} ∪ {xG′(g_i(x))}_{i=1,...,k} is linearly independent for any positive integer k.

This assumption on the activation function G(·) is clearly stronger than assumption 5 and, together with assumption 1, essentially ensures that the information matrix is regular at θ_0. The logistic function and the tanh squasher also satisfy assumption 7 (see also Hwang & Ding, 1997, condition B and lemma 2.7).
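As a concrete sketch of solving the batch least-squares problem 4.1 (the simulated weights, the choice of scipy's BFGS as optimizer, and the function names are all ours; one run only suggests, and does not prove, the consistency of theorem 5):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
G = lambda u: 1.0 / (1.0 + np.exp(-u))       # logistic activation

# simulate an AR-NN(1) with q = 1 hidden unit (illustrative true weights)
n = 2_000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + G(y[t - 1]) + rng.normal()

def Qn(theta):                               # the objective of equation 4.1
    nu, gamma, beta, omega, mu = theta
    yhat = nu + gamma * y[:-1] + beta * G(omega * y[:-1] + mu)
    return np.mean((y[1:] - yhat) ** 2)

theta_init = np.array([0.0, 0.0, 0.1, 0.1, 0.0])
res = minimize(Qn, theta_init, method="BFGS")
print(res.fun, Qn(theta_init))               # fitted MSE vs. initial MSE
```

The fitted mean-squared error should approach the noise variance; in practice the sort-and-sign constraints of assumption 3 would be imposed to pin down a unique minimizer.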
Asymptotic normality is given in the following theorem:

Theorem 6. Suppose assumptions 2–4, 6, and 7 hold. If E|ε_t|^{4+δ_2} < ∞ for some δ_2 > 0, then

((1/(2σ²)) ∇²Q̄_n(θ_0))^{1/2} √n (θ̂_n − θ_0) →^d N(0, I),   n → ∞,

where ∇²Q̄_n(θ_0) = E[∇²Q_n(θ_0)] is the expectation of the Hessian matrix of Q_n(θ) evaluated at θ_0, σ² is the variance of ε_t, and →^d denotes convergence in distribution.

This result forms the basis for hypothesis tests, for example, testing whether certain network weights are zero. The Hessian ∇²Q̄_n(θ_0) and σ² can be consistently estimated by ∇²Q_n(θ̂_n) and σ̂² = n^{−1} Σ_{t=1}^n ε̂_t², where {ε̂_t} is the sequence of observed residuals. However, the Hessian can be very badly behaved in practice, whether estimated or not. Then alternative estimators may be appropriate. (A discussion of the various issues involved is beyond our scope here. See White, 1994, for a detailed discussion.)

4.3 Unit Root Test. The methods presented in sections 4.1 and 4.2 are for stationary time series. However, many time series encountered in practice are nonstationary. A popular way to solve the problem of training in a nonstationary environment is to look for a transformation of the data that results in a stationary series. This frequently can be achieved by simple differencing. Once the data have been suitably transformed, the results from sections 4.1 and 4.2 can be used. We introduce the PP unit root test as a formal test concerning the need for differencing in AR-NN modeling. To establish the corresponding asymptotic results, we follow Phillips (1987). Let us write a zero-mean ARI-NN(p, 1) process {z_t}_{t=1}^∞ as
z_t = α z_{t−1} + y_t,   t = 1, 2, ...,    (4.2)
α = 1,    (4.3)

where by definition {y_t} is a stationary zero-mean AR-NN(p) process. Furthermore, suppose z_0 is equal to zero. Now, a formal test of the null hypothesis that {z_t} is an ARI-NN(p, 1) process against the alternatives of {z_t} being a stationary AR-NN(p+1) or a transient AR-NN(p+1) process can be expressed as a hypothesis on the weight α:

H_0: α = 1,
H_1: α ≠ 1.
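The PP procedure for these hypotheses can be sketched directly from the transformed statistic of equation 4.4 below (the function name, lag truncation, and simulated data are our choices for illustration):

```python
import numpy as np

def pp_z(z, l):
    """PP statistic Z_n of equation 4.4 for H0: alpha = 1 in
    z_t = alpha z_{t-1} + y_t (z contains z_0, ..., z_n)."""
    z = np.asarray(z, float)
    n = len(z) - 1
    zt, zlag = z[1:], z[:-1]
    a_hat = np.sum(zt * zlag) / np.sum(zlag**2)   # least-squares estimator
    u = zt - zlag                                 # differences z_t - z_{t-1}
    s2_y = np.mean(u**2)
    s2_nl = s2_y + (2.0 / n) * sum(np.sum(u[tau:] * u[:-tau])
                                   for tau in range(1, l + 1))
    return n * (a_hat - 1.0) - 0.5 * (s2_nl - s2_y) / (np.sum(zlag**2) / n**2)

rng = np.random.default_rng(3)
z_rw = np.cumsum(rng.normal(size=2_001))      # a true random walk (H0 holds)
Z_rw = pp_z(z_rw, l=8)
print(Z_rw)   # compare with the critical values tabulated in Fuller (1976)
```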
If the null hypothesis is accepted, the difference operator has to be applied before a stationary AR-NN(p) model is appropriate. Such a test is also called a unit root test, because the characteristic polynomial of z_t has a unit root under the null. Given a series of n + 1 observations {z_t}_{t=0}^n, the basic idea of the PP test is to estimate α by the least-squares estimator,

α̂ = Σ_{t=1}^n z_t z_{t−1} / Σ_{t=1}^n z_{t−1}²,

from regressing z_t on z_{t−1}, and to consider the conventional t-statistic of α̂. Unfortunately, the limiting distribution of this statistic is nonnormal and depends on nuisance parameters. However, the nuisance parameters may be consistently estimated, and a transformation of the test statistic exists that eliminates the nuisance dependence asymptotically. The transformed test statistic is defined as

Z_n = n(α̂ − 1) − (1/2)(s_{nl}² − s_y²) (n^{−2} Σ_{t=1}^n z_{t−1}²)^{−1},    (4.4)

where s_{nl}² = n^{−1} Σ_{t=1}^n (z_t − z_{t−1})² + 2n^{−1} Σ_{τ=1}^l Σ_{t=τ+1}^n (z_t − z_{t−1})(z_{t−τ} − z_{t−τ−1}) and s_y² = n^{−1} Σ_{t=1}^n (z_t − z_{t−1})². The limiting distribution of equation 4.4 is given in the following theorem:

Theorem 7. Suppose {z_t} is generated by equation 4.2, where {y_t} follows a stationary AR-NN process. Let E|ε_t|^{2δ} < ∞ for some δ > 2. If l → ∞ as n → ∞ such that l = o(n^{1/4}), then

Z_n ⇒ ((W(1)² − 1)/2) / ∫_0^1 W(r)² dr    (4.5)

under H_0, that is, if {z_t} comes from an ARI-NN(p, 1) process.

The distribution in equation 4.5 may be found tabulated in econometric textbooks. The best-known reference is probably Fuller (1976, p. 371). Alternative test statistics, for example, for a nonzero drift, can be found in Perron (1988) and Phillips and Perron (1988). However, the asymptotic critical values must be used cautiously. It is by now a well-documented fact that the PP tests do not always have the correct size, even in the context of linear
2442
A. Trapletti, F. Leisch, and K. Hornik
time series. A discussion of the nite-sample properties of the PP tests and several modied test statistics can be found in Perron and Ng (1996). Furthermore, a O is a consistent estimator for the unit root. Let fzt g be generated by equations 4.2 and 4.3. If E | et |d < 1 for some d > 2, then aO is a (weakly) consistent estimator, Proposition 3.
p
aO ! 1,
n ! 1,
p
where $\stackrel{p}{\to}$ denotes convergence in probability.

Hence, for practical applications, theorem 7 might provide the basis for deciding whether a given time series has to be differenced. If the empirical autocovariance (or, equivalently, autocorrelation) function of the considered time series does not vanish as fast as corollary 2 suggests, we can use the formal test of theorem 7. Leisch et al. (1999) discuss the implications of modeling a random walk without taking differences. A complementary procedure to the PP tests would be to consider a stationary null against a unit root alternative, as in Corradi et al. (1998). This procedure may be appropriate if the time series under consideration seems to be stationary before differencing.

5 Conclusions
In this article we have studied several classical concepts of linear time-series analysis in the context of AR-NN processes driven by additive noise. We have derived conditions on the network weights for the ergodicity and stationarity of AR-NN processes. Our results show that an AR-NN process that has all its characteristic roots outside the unit circle is geometrically ergodic and asymptotically stationary. Furthermore, if the process is started in either its stationary regime or the infinite past, then the process is strictly stationary. If essentially one characteristic root lies inside the unit circle, then the process is proved to be explosive. Concerning AR-NN processes with characteristic roots lying on the unit circle, the long-term behavior is determined by the "state-dependent intercept," that is, the nonlinear part and the intercept of the process: driftless processes exhibit random walk behavior; a drift toward $+\infty$ or $-\infty$ results in a transient process; a state-dependent drift toward the "center" of the state space gives an ergodic and asymptotically stationary solution.

We have obtained results on the memory of the AR-NN process. Geometrically ergodic AR-NN processes are strong mixing. Furthermore, their mixing rate and their autocovariance (autocorrelation) function approach zero exponentially fast. Hence, all geometrically ergodic AR-NN processes have a spectral density.
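As a numerical companion to these results and to the unit-root discussion in section 4, the following sketch simulates a stationary AR-NN(1) with a single logistic hidden unit, integrates it to an ARI-NN(1, 1), and evaluates the statistic $Z_n$ of equation 4.4 on both series. All parameter values, the truncation lag, and the function names are illustrative choices of ours, not from the article.

```python
import numpy as np

def G(u):                      # logistic squashing function
    return 1.0 / (1.0 + np.exp(-u))

def simulate_ar_nn1(n, phi=0.5, beta=0.5, w=1.0, seed=0):
    """Simulate y_t = phi*y_{t-1} + beta*(G(w*y_{t-1}) - 0.5) + e_t.

    The nonlinear part is bounded, so stationarity is governed by the
    linear characteristic polynomial 1 - phi*z having its root outside
    the unit circle.
    """
    rng = np.random.default_rng(seed)
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        y[t] = phi * y[t - 1] + beta * (G(w * y[t - 1]) - 0.5) \
               + rng.standard_normal()
    return y

def pp_z_statistic(z, l):
    """Transformed PP statistic Z_n of equation 4.4 (unweighted s_nl^2)."""
    n = len(z) - 1
    zlag, d = z[:-1], np.diff(z)           # z_{t-1} and z_t - z_{t-1}
    a_hat = np.sum(z[1:] * zlag) / np.sum(zlag ** 2)
    s2_y = np.mean(d ** 2)
    s2_nl = s2_y + (2.0 / n) * sum(
        np.sum(d[tau:] * d[:-tau]) for tau in range(1, l + 1))
    z_n = n * (a_hat - 1) - 0.5 * (s2_nl - s2_y) / (np.sum(zlag ** 2) / n ** 2)
    return z_n, a_hat

phi = 0.5
assert abs(np.roots([-phi, 1.0])[0]) > 1       # root of 1 - phi*z outside unit circle

y = simulate_ar_nn1(1000)                      # stationary AR-NN(1)
z = np.concatenate(([0.0], np.cumsum(y[1:])))  # ARI-NN(1, 1): z_t = z_{t-1} + y_t
zn_int, a_int = pp_z_statistic(z, l=5)
zn_stat, a_stat = pp_z_statistic(y, l=5)
```

On the integrated series, $\hat{a}$ is close to 1 (proposition 3) and $Z_n$ typically stays in the acceptance region of distribution 4.5; on the stationary series, $Z_n$ is strongly negative, so the null of a unit root is rejected.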
Autoregressive Neural Network Processes
Concerning the stationary regime of an AR-NN process, we have stated conditions for the existence of moments. In particular, the existence of the second moment of the noise process is a necessary and sufficient condition for the weak stationarity of the AR-NN process. Moreover, we have presented conditions under which the shape of the stationary distribution is symmetric about the origin.

We have also analyzed the ability of AR-NN processes to represent nonstationary behavior. We have shown that the class of ARI-NN processes is well suited to model time series that exhibit random walk behavior and nonlinear effects. These processes "converge" to a Wiener process when appropriately standardized, and the nonlinearity "in mean" does not vanish asymptotically.

Finally, we have discussed training and testing of AR-NN models in the context of AR-NN processes. We have shown the consistency and asymptotic normality (under mild regularity conditions) of the nonlinear least-squares estimator for stationary AR-NN models. These results form the basis for a number of statistical tests. In the context of ARI-NN processes, we suggest using the unit root test of Phillips (1987), Perron (1988), and Phillips and Perron (1988) concerning the need for differencing. Our results show that this test can be used to discriminate between stationary AR-NN and ARI-NN processes.

The research reported in this article is currently being extended in various ways. Since all the statistical results are asymptotic, we have to be careful when applying them to actual data. Thus, it is of interest to study the small-sample performance of the suggested estimators and tests by simulation. Another interesting topic currently under investigation is to extend the results of this article toward the more general and possibly multivariate autoregressive moving-average recurrent neural network process.
Some of the concepts can be transformed straightforwardly, and various new concepts, such as cointegration in a system of nonlinear and nonstationary processes, can be formulated (Granger & Hallmann, 1991; Granger, 1995; Granger & Swanson, 1996; Corradi et al., 1998).

Appendix: Mathematical Proofs
Proof of Lemma 1. This follows immediately from Chan and Tong (1985, pp. 668–669) and the definition of an AR-NN(p) process or Markov chain, respectively. In the following, "AR-NN" is used for AR-NN processes, models, and Markov chains.
Proof of Theorem 1. Sufficiency: This can also be obtained by the decomposition techniques of Chan and Tong (1985, p. 673) or theorem 4.2 of Tong (1990). However, we verify geometric ergodicity by applying a so-called drift criterion for Markov chains. This has the advantage that it can be extended to prove other results as well.
We first note that every $\lambda$-non-null compact set is small (Chan & Tong, 1985, pp. 668–669) and petite (Meyn & Tweedie, 1993, p. 121). We now verify the drift criterion 15.3 of theorem 15.0.1 in Meyn and Tweedie (1993) to obtain the desired result. The proof is based on a similar proof by Tjøstheim (1990) for recurrence of vector threshold models. Define the matrix
$$\mathbf{Y} := \begin{bmatrix}
y_1 & y_2 & \cdots & y_{p-1} & y_p \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & & 0 & 0 \\
\vdots & & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}.$$
Then we can write equation 2.1 as

$$x_t = \mathbf{Y} x_{t-1} + F(x_{t-1}) + g_t, \tag{A.1}$$

where $F(\cdot)$ is the nonlinear part and the intercept. Now, there exists a transformation $S$ such that $U = S\mathbf{Y}S^{-1}$ has the eigenvalues of $\mathbf{Y}$ along its diagonal and arbitrarily small off-diagonal elements (see, e.g., Bellmann, 1970, p. 205). Let the test function be $V(x) = \|Sx\|$ and the test set $C = \{x \in \mathbb{R}^p : V(x) \le c\}$ for some $c < \infty$; $\|\cdot\|$ denotes the Euclidean norm for a vector and the spectral norm for a matrix. Then we have

$$E[V(x_t) \mid x_{t-1} = x] \le \|S\mathbf{Y}x\| + \|SF(x)\| + E\|Sg_t\| \le (\|\Lambda\| + \|D\|)V(x) + \|SF(x)\| + E\|Sg_t\|,$$

where $\Lambda = \mathrm{diag}(U)$ and $D = U - \Lambda$. By assumption $\|\Lambda\| < 1$, and $S$ can be chosen such that $\|\Lambda\| + \|D\| < 1 - \epsilon$ for some $\epsilon > 0$. Since the second and third terms are bounded, we can choose $c$ such that $E[V(x_t) \mid x_{t-1} = x] \le (1-\epsilon)V(x) + d\,\mathbb{1}_C(x)$ for some $d < \infty$ and for all $x$. This is also valid for the test function $V(x) + 1$, and therefore we get the desired result.

Necessity: Proposition 1 presents a counterexample.

Proof of Proposition 1. To establish ergodicity, the drift criterion 9.1, 9.2 of theorem 9.1 in Tweedie (1976) is verified for the test function $V(y) = |y|$ and the test set $C = \{y : |y| \le c\}$:
$$E[V(y_t) \mid y_{t-1} = y] = E[(y + G(y) + e_t)\,\mathbb{1}_1 - (y + G(y) + e_t)\,\mathbb{1}_2]$$
$$= (y + G(y))(P_1 - P_2) + E[e_t \mathbb{1}_1] - E[e_t \mathbb{1}_2] = (y + G(y)) - 2(y + G(y))P_2 - 2E[e_t \mathbb{1}_2],$$
where $\mathbb{1}_1 = \mathbb{1}\{e_t \ge -y - G(y)\}$, $\mathbb{1}_2 = \mathbb{1}\{e_t < -y - G(y)\}$, $P_1 = P(e_t \ge -y - G(y))$, and $P_2 = P(e_t < -y - G(y))$. The last equality follows from $P_1 + P_2 = 1$ and $E[e_t\mathbb{1}_1] + E[e_t\mathbb{1}_2] = 0$. Now, for $y$ positive and large enough,

$$E[V(y_t) \mid y_{t-1} = y] \le y - \epsilon = V(y) - \epsilon$$

for some $\epsilon > 0$, since $\lim_{y\to\infty}(y + G(y))P_2 = 0$, $\lim_{y\to\infty} E[e_t\mathbb{1}_2] = 0$, and $\lim_{y\to\infty} G(y) = a$. The case $y$ negative is proved analogously. Since $E[V(y_t) \mid y_{t-1} = y]$ is bounded for $y \in C$, ergodicity and asymptotic stationarity follow.

Proof of Corollary 1. Since the drift criterion of theorem 15.0.1 (Meyn & Tweedie, 1993) is satisfied, the AR-NN is also V-uniformly ergodic (Meyn & Tweedie, 1993, theorem 16.0.1). This implies V-geometric mixing (theorem 16.1.5 and the discussion on p. 388 of Meyn & Tweedie, 1993), that is,

$$\bigl|E[g(x_n)h(x_{n+m})] - E[g(x_n)]\,E[h(x_{n+m})]\bigr| \le k\rho^m,$$

for all $g^2(\cdot), h^2(\cdot) \le V(\cdot)$ and $n, m \in \mathbb{Z}$, which is equivalent to the desired result (see also the proof of theorem A, pp. 883–884, Athreya & Pantula, 1986).

Proof of Corollary 2. Omitted (see, e.g., Billingsley, 1995, lemma 3, p. 365).

Proof of Corollary 3. Omitted (see, e.g., Brockwell & Davis, 1993, corollary 4.3.2).

Proof of Theorem 2. Sufficiency: Rewriting equation A.1 and taking expectations of the norm gives
$$E\|g_t\| \le E\|x_t\| + \|\mathbf{Y}\|\,E\|x_{t-1}\| + E\|F(x_{t-1})\| < \infty.$$
Necessity: The use of the test function $V(x) = \|Sx\|^k + 1$ in the proof of theorem 1 establishes the desired result (Meyn & Tweedie, 1993, theorems 14.3.7 and 15.0.1).

Proof of Corollary 4. Omitted.
Proof of Proposition 2. From Tong (1990), theorem 4.6, a symmetric joint p.d.f. of $\{y_t\}$ is equivalent to $h(-x, \theta) = -h(x, \theta)$ ($\pi$-a.s.). Therefore,

$$h(x, \theta) + h(-x, \theta) = 2\nu + \sum_i b_i\,[G(w_i'x + m_i) + G(-w_i'x + m_i)] = 0.$$
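For completeness, the next step relies on two elementary identities of the logistic function $G(u) = 1/(1+e^{-u})$, recorded here:

```latex
G(u) + G(-u) = \frac{1}{1+e^{-u}} + \frac{1}{1+e^{u}} = 1,
\qquad\text{and hence}\qquad
G(-w_j'x + m_j) = 1 - G(w_j'x - m_j).
```

In particular, each hidden unit with $m_i = 0$ contributes the constant $b_i$ to the sum.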
Now assume $m_j \neq 0$ for some $j$ and $m_i = 0$ for all $i \neq j$. Then

$$2\nu + \sum_{i\neq j} b_i\,[G(w_i'x) + G(-w_i'x)] + b_j\,[G(w_j'x + m_j) + G(-w_j'x + m_j)]$$
$$= 2\nu + \sum_i b_i + b_j G(w_j'x + m_j) - b_j G(w_j'x - m_j) = 0,$$

which implies that $b_j = 0$, contradicting assumption 1. This follows because $G(w_j'x + m_j)$, $G(w_j'x - m_j)$, and the constant 1 are linearly independent functions for the logistic $G(\cdot)$ (lemma 2.7, Hwang & Ding, 1997) and the support of $\pi$ is $\mathbb{R}^p$. Therefore, $m_i = 0$ for all $i$ and $2\nu + \sum_i b_i = 0$.

Proof of Corollary 5. Omitted.
Proof of Theorem 3. We verify the drift criterion 11.4 of theorem 11.3 in Tweedie (1976). The proof is based on a similar proof by Tjøstheim (1990) for transience of vector threshold models. Representation A.1 is used. Since $y(\cdot)$ has distinct roots, there exists an orthonormal basis of eigenvectors $\{\chi_i\}_{i=1}^p$ of $\mathbf{Y}$, such that an arbitrary $x \in \mathbb{R}^p$ has the representation $x = \sum_i \vartheta_i(x)\chi_i$. Furthermore, there exists a transformation $S$ such that $U = S\mathbf{Y}S^{-1}$ is diagonal and has the eigenvalues of $\mathbf{Y}$ along its diagonal. Let $j$ be the coordinate number associated with an eigenvalue $\lambda_j$, $|\lambda_j| > 1$. We define the test function and the test set as $V(x) = \exp(-|\vartheta_j(Sx)|)$ and $C = \{x : |\vartheta_j(Sx)| \le c\}$. Obviously, condition 11.4b of Tweedie (1976) holds. Since $|\vartheta_i(x)| \le \|x\|$ and the coordinate functions $\vartheta_i(x)$ are linear,

$$E[V(x_t) \mid x_{t-1} = x] \le \exp(-|\vartheta_j(S\mathbf{Y}x)|)\exp(|\vartheta_j(SF(x))|)\,E\exp(|\vartheta_j(Sg_t)|)$$
$$\le \exp(-|\lambda_j||\vartheta_j(Sx)|)\exp(\|SF(x)\|)\,E\exp(\|Sg_t\|) \le \exp(-|\vartheta_j(Sx)|) = V(x),$$

for $x$ large enough. In addition, we have used $\vartheta_j(S\mathbf{Y}x) = \vartheta_j(USx) = \vartheta_j(U\sum_i \vartheta_i(Sx)\chi_i) = \lambda_j\vartheta_j(Sx)$, $E\exp(|e_t|) < \infty$, and the boundedness of $F(\cdot)$. The desired result follows directly, since both $C$ and its complement are Lebesgue non-null Borel sets.

Proof of Theorem 4. This result follows directly from lemma 2.2 of Phillips (1987) and the properties of AR-NNs.
Proof of Theorem 5. We use theorem 3.5 from White (1994) and check its assumptions.
Assumptions 2.1 and 2.3 are trivial. Assumption 3.1: Let $h = h(x_{t-1}, \theta)$. Then assumption 3.1a follows because White's $E[\log f_t(X_t, \theta)] \equiv -E(y_t - h)^2$ in our setup and is finite for each $\theta \in \Theta$ and each $t$ by our moment condition on $e_t$ and theorem 2. Since $G(\cdot)$ is continuous, dominated convergence yields assumption 3.1b. To prove assumption 3.1c, we use the uniform law of large numbers (ULLN) of lemma A.2b from Pötscher and Prucha (1986). Since we deal with strictly stationary and ergodic AR-NNs for which $E|e_t|^{2+d_1} < \infty$ for some $d_1 > 0$ (and hence $\sup_\theta E(y_t - h)^2 < \infty$ by compactness of $\Theta$), the assumptions of the ULLN hold.

Assumption 3.2: In our setup, identifiable uniqueness of $\theta^*$ is implied by the uniqueness of $\theta_0$. Because the support of the stationary measure $\pi$ is $\mathbb{R}^p$, our assumptions 1, 4, and 5 and the continuity of $G(\cdot)$ ensure that it is not possible to state equation 1.2 with fewer hidden units. Second, it is ensured that $\theta$ is identifiable up to the family of simple sign-change and permutation transformations (see also the arguments in the proof of theorem 2.3a and corollary 2.4 of Hwang & Ding, 1997). Finally, assumption 3 uniquely identifies one of these equivalent minima as the final solution. Hence, $\theta_0$ is unique. (See also Kůrková and Kainen, 1994, proposition 3.9.)

Proof of Theorem 6. We use theorem 6.4 from White (1994) and check its assumptions. Assumptions 2.1, 2.3, and 3.1 follow from the proof of theorem 5. Assumptions 3.2' and 3.6 are trivial. Assumptions 3.7a, 3.8a, and 3.8b: The expected gradient and the expected Hessian of $Q_n(\theta)$ are given by
$$E[\nabla Q_n(\theta)] = -2E[\nabla h\,(y_t - h)]$$

and

$$E[\nabla^2 Q_n(\theta)] = 2E[\nabla h \nabla' h - \nabla^2 h\,(y_t - h)],$$

respectively. Assumptions 3.7a and 3.8a follow from our moment condition on $e_t$, because each element of $\nabla h$ is at most linear in $y_{t-i}$ and each element of $\nabla^2 h$ contains at most terms of order $y_{t-i}y_{t-j}$ (for some $i, j$). Finally, assumption 6 and dominated convergence give assumption 3.8b. Assumption 3.8c: According to the proof of theorem 5, the application of the ULLN from Pötscher and Prucha (1986) yields the result. Assumption 3.9: Let $h_0 = h(x_{t-1}, \theta_0)$. White's $A_n^* \equiv E[\nabla^2 Q_n(\theta_0)] = 2E[\nabla h_0 \nabla' h_0]$ is $O(1)$ in our setup. Essentially, our assumptions 1, 4, and 7 imply the nonsingularity of $E[\nabla h_0 \nabla' h_0]$ (extending the proof of theorem 2.3b,
Hwang & Ding, 1997, to an NN that contains shortcuts gives the desired result). Assumption 6.1: We show that the sequence $f'\nabla\log f_t(X_t, \theta_n^*) \equiv 2f'\nabla h_0\, e_t$ obeys the central limit theorem (CLT) from White and Domowitz (1984, theorem 2.4), for some $r \times 1$ vector $f$, $f'f = 1$. Since $\{e_t\}$ is i.i.d. and because of our moment condition on $e_t$, their assumptions A(i) and A(iii) hold. Furthermore, A(ii) holds with $V = 4\sigma^2 f' E[\nabla h_0 \nabla' h_0] f$. Now, the CLT applies because AR-NNs are strong mixing and a wide class of transformations of mixing processes are themselves mixing (White & Domowitz, 1984, lemma 2.1). By using the Cramér–Wold device, $\nabla \log f_t(X_t, \theta_n^*)$ also obeys the CLT, with covariance matrix $B_n^* = 4\sigma^2 E[\nabla h_0 \nabla' h_0] = 2\sigma^2 A_n^*$, which is $O(1)$ and nonsingular.
Proof of Theorem 7. It follows from theorem 5.1 of Phillips (1987) and the properties of AR-NNs.

Proof of Proposition 3. Omitted (Phillips, 1987, theorem 3.1d).
Acknowledgments
We thank two anonymous referees for their helpful comments and suggestions, especially concerning the discussion of ARI-NN processes and for pointing us to the work of Corradi et al. (1998). F. L. and K. H. were supported by the Austrian Science Foundation (FWF) under grant SFB 010 ("Adaptive Information Systems and Modelling in Economics and Management Science").

References

Athreya, K. B., & Pantula, S. G. (1986). Mixing properties of Harris chains and autoregressive processes. Journal of Applied Probability, 23, 880–892.
Bellmann, R. (1970). Introduction to matrix analysis. New York: McGraw-Hill.
Billingsley, P. (1995). Probability and measure (3rd ed.). New York: Wiley.
Brockwell, P. J., & Davis, R. A. (1993). Time series: Theory and methods (2nd ed.). New York: Springer-Verlag.
Chan, K. S., & Tong, H. (1985). On the use of the deterministic Lyapunov function for the ergodicity of stochastic difference equations. Advances in Applied Probability, 17, 666–678.
Corradi, V., Swanson, N. R., & White, H. (1998). Testing for stationarity-ergodicity and for comovements between nonlinear discrete time Markov processes (Working Paper 4-96-6). Department of Economics, Pennsylvania State University. Available online at: http://econ.la.psu.edu/swanson/papers.htm.
Feigin, P. D., & Tweedie, R. L. (1985). Random coefficient autoregressive processes: A Markov chain analysis of stationarity and finiteness of moments. Journal of Time Series Analysis, 6, 1–14.
Fuller, W. A. (1976). Introduction to statistical time series. New York: Wiley.
Granger, C. W. J. (1995). Modelling non-linear relationships between extended-memory variables. Econometrica, 63, 265–279.
Granger, C. W. J., & Hallmann, J. (1991). Long memory series with attractors. Oxford Bulletin of Economics and Statistics, 53, 11–26.
Granger, C. W. J., & Swanson, N. (1996). Future developments in the study of cointegrated variables. Oxford Bulletin of Economics and Statistics, 58, 537–553.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Hwang, J. T. G., & Ding, A. A. (1997). Prediction intervals for artificial neural networks. Journal of the American Statistical Association, 92, 748–757.
Kůrková, V., & Kainen, P. C. (1994). Functionally equivalent feedforward neural networks. Neural Computation, 6, 543–558.
Leisch, F., Trapletti, A., & Hornik, K. (1999). Stationarity and stability of autoregressive neural network processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Meyn, S. P., & Tweedie, R. L. (1993). Markov chains and stochastic stability. London: Springer-Verlag.
Perron, P. (1988). Trends and random walks in macroeconomic time series. Journal of Economic Dynamics and Control, 12, 297–332.
Perron, P., & Ng, S. (1996). Useful modifications to some unit root tests with dependent errors and their local asymptotic properties. Review of Economic Studies, 63, 435–463.
Phillips, P. C. B. (1987). Time series regression with a unit root. Econometrica, 55, 277–301.
Phillips, P. C. B., & Perron, P. (1988). Testing for a unit root in time series regression. Biometrika, 75, 335–346.
Pötscher, B. M., & Prucha, I. R. (1986). A class of partially adaptive one-step m-estimators for the non-linear regression model with dependent observations. Journal of Econometrics, 32, 219–251.
Sussmann, H. J. (1992). Uniqueness of the weights for minimal feedforward nets with a given input-output map. Neural Networks, 5, 589–593.
Tjøstheim, D. (1990). Non-linear time series and Markov chains. Advances in Applied Probability, 22, 587–611.
Tong, H. (1990). Non-linear time series: A dynamical system approach. Oxford: Oxford University Press.
Tweedie, R. L. (1976). Criteria for classifying general Markov chains. Advances in Applied Probability, 8, 737–771.
White, H. (1989). Some asymptotic results for learning in single hidden-layer feedforward network models. Journal of the American Statistical Association, 84, 1003–1013.
White, H. (1994). Estimation, inference and specification analysis. Cambridge: Cambridge University Press.
White, H. (1996). Parametric statistical estimation with artificial neural networks. In P. Smolensky, M. C. Mozer, & D. E. Rumelhart (Eds.), Mathematical perspectives on neural networks. Mahwah, NJ: Erlbaum.
White, H., & Domowitz, I. (1984). Non-linear regression with dependent observations. Econometrica, 52, 143–161.

Received December 7, 1998; accepted October 5, 1999.
LETTER
Communicated by Kenji Doya
Learning to Forget: Continual Prediction with LSTM

Felix A. Gers, Jürgen Schmidhuber, Fred Cummins
IDSIA, 6900 Lugano, Switzerland

Long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997) can solve numerous tasks not solvable by previous learning algorithms for recurrent neural networks (RNNs). We identify a weakness of LSTM networks processing continual input streams that are not a priori segmented into subsequences with explicitly marked ends at which the network's internal state could be reset. Without resets, the state may grow indefinitely and eventually cause the network to break down. Our remedy is a novel, adaptive "forget gate" that enables an LSTM cell to learn to reset itself at appropriate times, thus releasing internal resources. We review illustrative benchmark problems on which standard LSTM outperforms other RNN algorithms. All algorithms (including LSTM) fail to solve continual versions of these problems. LSTM with forget gates, however, easily solves them, and in an elegant way.
1 Introduction
Recurrent neural networks (RNNs) constitute a very powerful class of computational models, capable of instantiating almost arbitrary dynamics. The extent to which this potential can be exploited is, however, limited by the effectiveness of the training procedure applied. Gradient-based methods, such as backpropagation through time (Williams & Zipser, 1992; Werbos, 1988), real-time recurrent learning (Robinson & Fallside, 1987; Williams & Zipser, 1992), and their combination (Schmidhuber, 1992), share an important limitation: the temporal evolution of the path integral over all error signals "flowing back in time" depends exponentially on the magnitude of the weights (Hochreiter, 1991). This implies that the backpropagated error (Pearlmutter, 1995) quickly either vanishes or blows up (Hochreiter & Schmidhuber, 1997; Bengio, Simard, & Frasconi, 1994). Hence standard RNNs fail to learn in the presence of time lags greater than 5–10 discrete time steps between relevant input events and target signals. The vanishing error problem casts doubt on whether standard RNNs can indeed exhibit significant practical advantages over time-window-based feedforward networks.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2451–2471 (2000)
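The exponential dependence just described can be made concrete with a toy computation (ours, not from the article): in a one-unit sigmoid RNN, the error flowing back through $T$ steps is scaled by $\prod_t w\,f'(\mathrm{net}_t)$. Holding the net input at 0, the sigmoid's most sensitive point, isolates the effect of the weight magnitude.

```python
import math

def sigmoid_prime(net):
    s = 1.0 / (1.0 + math.exp(-net))
    return s * (1.0 - s)

def backflow_scale(w, nets):
    """Product of per-step factors |w * f'(net_t)| that multiplies an
    error signal flowing back through len(nets) time steps."""
    scale = 1.0
    for net in nets:
        scale *= abs(w * sigmoid_prime(net))
    return scale

T = 30
vanish = backflow_scale(w=2.0, nets=[0.0] * T)    # (2/4)^30: vanishes
explode = backflow_scale(w=8.0, nets=[0.0] * T)   # (8/4)^30: blows up
```

With a logistic unit, $|w\,f'(\mathrm{net})| \le |w|/4$, so for $|w| < 4$ the backflow necessarily vanishes exponentially; larger weights in the non-saturated regime make it blow up instead.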
A recent model, long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), is not affected by this problem. LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error flow through constant error carousels (CECs) within special units called cells. Multiplicative gate units learn to open and close access to the cells. LSTM's learning algorithm is local in space and time; its computational complexity per time step and weight is O(1). It solves complex long-time-lag tasks that have never been solved by previous RNN algorithms. (See Hochreiter & Schmidhuber, 1997, for a comparison of LSTM to alternative approaches.)

In this article, however, we show that even LSTM fails to learn to process correctly certain very long or continual time series that are not a priori segmented into appropriate training subsequences with clearly defined beginnings and ends. The problem is that a continual input stream eventually may cause the internal values of the cells to grow without bound, even if the repetitive nature of the problem suggests they should be reset occasionally. This article presents a remedy. Although we present a specific solution to the problem of forgetting in LSTM networks, we recognize that any training procedure for RNNs that is powerful enough to span long time lags must also address the issue of forgetting in short-term memory (unit activations). We know of no other current training method for RNNs that is sufficiently powerful to have encountered this problem.

Section 2 briefly summarizes LSTM and explains its weakness in processing continual input streams. Section 3 introduces a remedy called forget gates. Forget gates learn to reset memory cell contents once they are no longer needed. Forgetting may occur rhythmically or in an input-dependent fashion. Section 4 derives a gradient-based learning algorithm for the LSTM extension with forget gates.
Section 5 describes experiments: we transform well-known benchmark problems into more complex, continual tasks, report the performance of various RNN algorithms, and analyze and compare the networks found by standard LSTM and extended LSTM.

2 Standard LSTM
The basic unit in the hidden layer of an LSTM network is the memory block, which contains one or more memory cells and a pair of adaptive, multiplicative gating units that gate input and output to all cells in the block. Each memory cell has at its core a recurrently self-connected linear unit called the constant error carousel (CEC), whose activation we call the cell state. The CECs solve the vanishing error problem: in the absence of new input or error signals to the cell, the CEC's local error backflow remains constant, neither growing nor decaying. The CEC is protected from both forward-flowing activation and backward-flowing error by the input and
output gates, respectively. When gates are closed (activation around zero), irrelevant inputs and noise do not enter the cell, and the cell state does not perturb the remainder of the network.

Figure 1: The standard LSTM cell has a linear unit with a recurrent self-connection with weight 1.0 (CEC). Input and output gates regulate read and write access to the cell, whose state is denoted $s_c$. The function $g$ squashes the cell's input; $h$ squashes the cell's output. See the text for details.

Figure 1 shows a memory block with a single cell. The cell state, $s_c$, is updated based on its current state and three sources of input: $\mathrm{net}_c$ is input to the cell itself, while $\mathrm{net}_{in}$ and $\mathrm{net}_{out}$ are inputs to the input and output gates. We consider discrete time steps $t = 1, 2, \ldots$. A single step involves the update of all units (forward pass) and the computation of error signals for all weights (backward pass). Input gate activation $y^{in}$ and output gate activation $y^{out}$ are computed as follows:

$$\mathrm{net}_{out_j}(t) = \sum_m w_{out_j m}\, y^m(t-1), \qquad y^{out_j}(t) = f_{out_j}(\mathrm{net}_{out_j}(t)), \tag{2.1}$$

$$\mathrm{net}_{in_j}(t) = \sum_m w_{in_j m}\, y^m(t-1), \qquad y^{in_j}(t) = f_{in_j}(\mathrm{net}_{in_j}(t)). \tag{2.2}$$
Throughout this article, $j$ indexes memory blocks; $v$ indexes memory cells in block $j$, such that $c_j^v$ denotes the $v$th cell of the $j$th memory block; $w_{lm}$ is the weight on the connection from unit $m$ to unit $l$. Index $m$ ranges over all source units, as specified by the network topology. For gates, $f$ is a logistic sigmoid with range [0, 1]. Net input to the cell itself is squashed by $g$, a centered logistic sigmoid function with range $[-2, 2]$.
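In code, equations 2.1 and 2.2 are simply two dot products followed by logistic squashing over the previous step's source-unit activations. The sizes and weights below are made up for illustration.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))

# Illustrative sizes: one memory block fed by 5 source units.
rng = np.random.default_rng(1)
w_out = rng.normal(size=5)     # weights w_{out_j m}
w_in = rng.normal(size=5)      # weights w_{in_j m}
y_prev = rng.normal(size=5)    # source activations y^m(t-1)

net_out = w_out @ y_prev       # gate net input
y_out = logistic(net_out)      # output gate activation, in (0, 1)
net_in = w_in @ y_prev
y_in = logistic(net_in)        # input gate activation, in (0, 1)
```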
The internal state of memory cell $s_c(t)$ is calculated by adding the squashed, gated input to the state at the previous time step $s_c(t-1)$ ($t > 0$):

$$\mathrm{net}_{c_j^v}(t) = \sum_m w_{c_j^v m}\, y^m(t-1), \qquad s_{c_j^v}(t) = s_{c_j^v}(t-1) + y^{in_j}(t)\, g(\mathrm{net}_{c_j^v}(t)), \tag{2.3}$$

with $s_{c_j^v}(0) = 0$. The cell output $y^c$ is calculated by squashing the internal state $s_c$ via the output squashing function $h$ and then multiplying (gating) it by the output gate activation $y^{out}$:

$$y^{c_j^v}(t) = y^{out_j}(t)\, h(s_{c_j^v}(t)). \tag{2.4}$$
$h$ is a centered sigmoid with range $[-1, 1]$. Finally, assuming a layered network topology with a standard input layer, a hidden layer consisting of memory blocks, and a standard output layer, the equations for the output units $k$ are

$$\mathrm{net}_k(t) = \sum_m w_{km}\, y^m(t-1), \qquad y^k(t) = f_k(\mathrm{net}_k(t)), \tag{2.5}$$
where $m$ ranges over all units feeding the output units (typically all cells in the hidden layer and the input units, but not the memory block gates). As squashing function $f_k$, we again use the logistic sigmoid with range [0, 1]. All equations except equation 2.3 remain valid for extended LSTM with forget gates. Hochreiter and Schmidhuber (1997) provide details of standard LSTM's backward pass. Essentially, as in truncated backpropagation through time (BPTT), errors arriving at the net inputs of memory blocks and their gates do not get propagated back further in time, although they do serve to change the incoming weights. An error signal arriving at a memory cell output is scaled by the output gate and the output nonlinearity $h$; it then enters the memory cell's linear CEC, where it can flow back indefinitely without ever being changed (this is why LSTM can bridge arbitrary time lags between input events and target signals). Only when the error escapes from the memory cell through an opening input gate and the additional input nonlinearity $g$ does it get scaled once more; it then serves to change incoming weights before being truncated. (Details of extended LSTM's backward pass are discussed in section 3.2.)

2.1 Limits of Standard LSTM. LSTM allows information to be stored across arbitrary time lags and error signals to be carried far back in time. This potential strength, however, can contribute to a weakness in some situations. The cell states $s_c$ often tend to grow linearly during the presentation of a time series (the nonlinear aspects of sequence processing are left to the squashing functions and the highly nonlinear gates). If we present a continuous input
stream, the cell states may grow in unbounded fashion, causing saturation of the output squashing function $h$. This happens even if the nature of the problem suggests that the cell states should be reset occasionally, such as at the beginnings of new input sequences (whose starts, however, are not explicitly indicated by a teacher). Saturation will make $h$'s derivative vanish, thus blocking incoming errors, and make the cell output equal to the output gate activation; that is, the entire memory cell will degenerate into an ordinary BPTT unit, so the cell will cease functioning as a memory. The problem did not arise in the experiments reported by Hochreiter and Schmidhuber (1997) because cell states were explicitly reset to zero before the start of each new sequence. How can we solve this problem without losing LSTM's advantages over time-delay neural networks (TDNN) (Waibel, 1989) or NARX networks (nonlinear autoregressive models with exogenous inputs) (Lin, Horne, Tiňo, & Giles, 1996), which depend on a priori knowledge of typical time lag sizes? The standard technique of weight decay, which helps to contain the level of overall activity within the network, was found to generate solutions that were particularly prone to unbounded state growth. Variants of focused backpropagation (Mozer, 1989) also do not work well. These let the internal state decay via a self-connection whose weight is smaller than 1, but there is no principled way of designing appropriate decay constants. A potential gain for some tasks is paid for by a loss of the ability to deal with arbitrary, unknown causal delays between inputs and targets. In fact, state decay does not significantly improve experimental performance (see "state decay" in Table 2). Of course, we might try to "teacher force" (Jordan, 1986; Doya & Yoshizawa, 1989) the internal states $s_c$ by resetting them once a new training sequence starts.
But this requires an external teacher who knows how to segment the input stream into training subsequences. We are precisely interested, however, in those situations where there is no a priori knowledge of this kind.
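The drift problem itself is easy to reproduce numerically. The sketch below (our illustration, not from the article) holds the input gate open and feeds a constant net input, so the update of equation 2.3 accumulates linearly and the output squashing function $h$ saturates.

```python
import math

def g(u):                        # cell input squashing, range [-2, 2]
    return 4.0 / (1.0 + math.exp(-u)) - 2.0

def h(u):                        # cell output squashing, range [-1, 1]
    return 2.0 / (1.0 + math.exp(-u)) - 1.0

def h_prime(u):                  # h'(u) = (1 + h(u)) * (1 - h(u)) / 2
    return 0.5 * (1.0 + h(u)) * (1.0 - h(u))

s = 0.0                          # cell state, s(0) = 0
y_in = 1.0                       # input gate held wide open
for t in range(1000):            # continual stream, never reset
    s = s + y_in * g(0.7)        # equation 2.3 with constant net input
drift = s                        # grows linearly: ~1000 * g(0.7)
blocked = h_prime(s)             # h saturates, its derivative vanishes
```

After 1000 steps the state exceeds 600 and $h'(s)$ underflows to zero: incoming errors are blocked, which is exactly the degeneration described above.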
3 Solution: Forget Gates
Our solution to the problem above is to use adaptive forget gates, which learn to reset memory blocks once their contents are out of date and hence useless. By resets, we do not mean only immediate resets to zero but also gradual resets corresponding to slowly fading cell states. More specifically, we replace standard LSTM's constant CEC weight 1.0 by the multiplicative forget gate activation $y^{\varphi}$. (See Figure 1.)

3.1 Forward Pass of Extended LSTM with Forget Gates. The forget gate activation $y^{\varphi}$ is calculated like the activations of the other gates (see
equations 2.1 and 2.2):

$$\mathrm{net}_{\varphi_j}(t) = \sum_m w_{\varphi_j m}\, y^m(t-1); \qquad y^{\varphi_j}(t) = f_{\varphi_j}(\mathrm{net}_{\varphi_j}(t)). \tag{3.1}$$
Here $\mathrm{net}_{\varphi_j}$ is the input from the network to the forget gate. We use the logistic sigmoid with range [0, 1] as squashing function $f_{\varphi_j}$. Its output becomes the weight of the self-recurrent connection of the internal state $s_c$ in equation 2.3. The revised update equation for $s_c$ in the extended LSTM algorithm is (for $t > 0$):

$$s_{c_j^v}(t) = y^{\varphi_j}(t)\, s_{c_j^v}(t-1) + y^{in_j}(t)\, g(\mathrm{net}_{c_j^v}(t)), \tag{3.2}$$

with $s_{c_j^v}(0) = 0$. Extended LSTM's full forward pass is obtained by adding equation 3.1 to those in section 2 and replacing equation 2.3 by equation 3.2. Bias weights for LSTM gates are initialized with negative values for input and output gates (see Hochreiter & Schmidhuber, 1997, for details) and positive values for forget gates. This implies (compare equations 3.1 and 3.2) that at the beginning of the training phase, the forget gate activation will be almost 1.0, and the entire cell will behave like a standard LSTM cell; it will not explicitly forget anything until it has learned to forget.

3.2 Backward Pass of Extended LSTM with Forget Gates. LSTM's backward pass (see Hochreiter & Schmidhuber, 1997, for details) is an efficient fusion of slightly modified, truncated BPTT (e.g., Williams & Peng, 1990) and a customized version of real-time recurrent learning (RTRL) (e.g., Robinson & Fallside, 1987). Output units use BPTT; output gates use slightly modified, truncated BPTT. Weights to cells, input gates, and the novel forget gates, however, use a truncated version of RTRL. Truncation means that all errors are cut off once they leak out of a memory cell or gate, although they do serve to change the incoming weights. The effect is that the CECs are the only part of the system through which errors can flow back forever. This makes LSTM's updates efficient without significantly affecting learning power: error flow outside cells tends to decay exponentially anyway
(Hochreiter, 1991). In the equations that follow, D will indicate where we use error truncation; unless otherwise indicated, we assume for simplicity only a single cell per block. We start with the usual squared error objective function based on targets tk , E (t ) D
1X e k ( t )2 I 2 k
ek (t) : D tk (t) ¡ yk (t),
(3.3)
where $e_k$ denotes the externally injected error. We minimize $E$ via gradient descent by adding weight changes $\Delta w_{lm}$ to the weights $w_{lm}$ (from unit m to
Learning to Forget
2457
unit $l$) using learning rate $\alpha$:

$$\Delta w_{lm}(t) = -\alpha\,\frac{\partial E(t)}{\partial w_{lm}}
= -\alpha \sum_k \frac{\partial E(t)}{\partial y^k(t)}\,\frac{\partial y^k(t)}{\partial w_{lm}}
= \alpha \sum_k e_k(t)\,\frac{\partial y^k(t)}{\partial w_{lm}}$$
$$\stackrel{tr}{=} \alpha \sum_k e_k(t)\,\frac{\partial y^k(t)}{\partial y^l(t)}\,\frac{\partial y^l(t)}{\partial \mathrm{net}_l(t)}\,\underbrace{\frac{\partial \mathrm{net}_l(t)}{\partial w_{lm}}}_{=\,y^m(t-1)}
= \alpha \Biggl(\underbrace{\frac{\partial y^l(t)}{\partial \mathrm{net}_l(t)}\sum_k \frac{\partial y^k(t)}{\partial y^l(t)}\,e_k(t)}_{=:\,\delta_l(t)}\Biggr)\, y^m(t-1). \tag{3.4}$$
For an output unit ($l = k$), the sum in equation 3.4 collapses to the single term with $k = l$. By differentiating equation 2.5, we obtain the usual backpropagation weight changes for the output units:

$$\frac{\partial y^k(t)}{\partial \mathrm{net}_k(t)} = f'_k(\mathrm{net}_k(t)) \;\Longrightarrow\; \delta_k(t) = f'_k(\mathrm{net}_k(t))\, e_k(t). \tag{3.5}$$
To compute the weight changes $\Delta w_{\mathrm{out}_j m}$ for the output gates, we set $l = \mathrm{out}_j$ in equation 3.4. The resulting terms can be determined by differentiating equations 2.1, 2.4, and 2.5:

$$\frac{\partial y^{\mathrm{out}_j}(t)}{\partial \mathrm{net}_{\mathrm{out}_j}(t)} = f'_{\mathrm{out}_j}\bigl(\mathrm{net}_{\mathrm{out}_j}(t)\bigr), \qquad \frac{\partial y^k(t)}{\partial y^{\mathrm{out}_j}(t)}\, e_k(t) = h\bigl(s_{c_j^v}(t)\bigr)\, w_{k c_j^v}\, \delta_k(t).$$

Inserting both terms in equation 3.4 gives $\delta^v_{\mathrm{out}_j}$, the contribution of the block's $v$th cell to $\delta_{\mathrm{out}_j}$. Because every cell in a memory block contributes to the weight change of the output gate, we have to sum over all cells $v$ in block $j$ to obtain the total $\delta_{\mathrm{out}_j}$ of the $j$th memory block (with $S_j$ cells):

$$\delta_{\mathrm{out}_j}(t) = f'_{\mathrm{out}_j}\bigl(\mathrm{net}_{\mathrm{out}_j}(t)\bigr) \Biggl( \sum_{v=1}^{S_j} h\bigl(s_{c_j^v}(t)\bigr) \sum_k w_{k c_j^v}\, \delta_k(t) \Biggr). \tag{3.6}$$
Equations 3.4 through 3.6 define the weight changes for output units and output gates of memory blocks. Their derivation was almost standard BPTT, with error signals truncated once they leave memory blocks (including their gates). This truncation does not affect LSTM's long time lag capabilities, but it is crucial for all equations of the backward pass and should be kept in mind.
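As a concrete illustration, equations 3.5 and 3.6 can be turned into code. The sketch below is our own, not the authors' implementation; we assume logistic output units and gates (so `sigmoid_prime` is $f'$) and $h = \tanh$.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(net):
    s = sigmoid(net)
    return s * (1.0 - s)

def output_unit_deltas(nets, errors):
    # eq. 3.5: delta_k(t) = f'_k(net_k(t)) * e_k(t)
    return [sigmoid_prime(n) * e for n, e in zip(nets, errors)]

def output_gate_delta(net_out, cell_states, w_k_cv, deltas, h=math.tanh):
    # eq. 3.6: sum over the block's cells v of h(s_cv) * sum_k w_{k,cv} delta_k,
    # scaled by the derivative of the gate's squashing function.
    inner = sum(h(s) * sum(w * d for w, d in zip(ws, deltas))
                for s, ws in zip(cell_states, w_k_cv))
    return sigmoid_prime(net_out) * inner
```

Here `w_k_cv` holds, for each cell of the block, its outgoing weights to the output units.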
For weights to the cell, the input gate, and the forget gate, we adopt an RTRL-oriented perspective by first stating the influence of a cell's internal state $s_{c_j^v}$ on the error and then analyzing how each weight to the cell or to the block's gates contributes to $s_{c_j^v}$. So we split the gradient in a way different from the one used in equation 3.4:

$$\Delta w_{lm}(t) = -\alpha\,\frac{\partial E(t)}{\partial w_{lm}} \stackrel{tr}{=} -\alpha\, \underbrace{\frac{\partial E(t)}{\partial s_{c_j^v}(t)}}_{=:\,-e_{s_{c_j^v}}(t)}\,\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \alpha\, e_{s_{c_j^v}}(t)\,\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}}. \tag{3.7}$$
These terms are the internal state error $e_{s_{c_j^v}}$ and a partial $\partial s_{c_j^v}/\partial w_{lm}$ of $s_{c_j^v}$ with respect to the weights $w_{lm}$ feeding the cell ($l = c_j^v$), the block's input gate ($l = \mathrm{in}_j$), or the block's forget gate ($l = \varphi_j$), as all these weights contribute to the calculation of $s_{c_j^v}(t)$. We treat the internal state error $e_{s_{c_j^v}}$ analogously to equation 3.4 and obtain:

$$e_{s_{c_j^v}}(t) := -\frac{\partial E(t)}{\partial s_{c_j^v}(t)} \stackrel{tr}{=} -\sum_k \frac{\partial E(t)}{\partial y^k(t)}\,\frac{\partial y^k(t)}{\partial y^{c_j^v}(t)}\,\frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)} = \frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)}\sum_k \underbrace{e_k(t)\,\frac{\partial y^k(t)}{\partial y^{c_j^v}(t)}}_{=\,w_{k c_j^v}\,\delta_k(t)}.$$

Differentiating the forward pass equation 2.4, $\frac{\partial y^{c_j^v}(t)}{\partial s_{c_j^v}(t)} = y^{\mathrm{out}_j}(t)\, h'\bigl(s_{c_j^v}(t)\bigr)$, we obtain:

$$e_{s_{c_j^v}}(t) = y^{\mathrm{out}_j}(t)\, h'\bigl(s_{c_j^v}(t)\bigr) \Bigl(\sum_k w_{k c_j^v}\,\delta_k(t)\Bigr). \tag{3.8}$$
This internal state error needs to be calculated for each memory cell. To calculate the partial $\partial s_{c_j^v}/\partial w_{lm}$ in equation 3.7, we differentiate equation 3.2 and obtain a sum of four terms:

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} = \underbrace{\frac{\partial s_{c_j^v}(t-1)}{\partial w_{lm}}\, y^{\varphi_j}(t)}_{\neq\, 0 \text{ for all } l \in \{\varphi,\, \mathrm{in},\, c_j^v\}} + \underbrace{y^{\mathrm{in}_j}(t)\,\frac{\partial g\bigl(\mathrm{net}_{c_j^v}(t)\bigr)}{\partial w_{lm}}}_{\neq\, 0 \text{ for } l\, =\, c_j^v \text{ (cell)}} + \underbrace{g\bigl(\mathrm{net}_{c_j^v}(t)\bigr)\,\frac{\partial y^{\mathrm{in}_j}(t)}{\partial w_{lm}}}_{\neq\, 0 \text{ for } l\, =\, \mathrm{in} \text{ (input gate)}} + \underbrace{s_{c_j^v}(t-1)\,\frac{\partial y^{\varphi_j}(t)}{\partial w_{lm}}}_{\neq\, 0 \text{ for } l\, =\, \varphi \text{ (forget gate)}}. \tag{3.9}$$
Differentiating the forward pass equations 2.3, 2.2, and 3.1 for $g$, $y^{\mathrm{in}}$, and $y^{\varphi}$, we can substitute the unresolved partials and split the expression on the right-hand side of equation 3.9 into three separate equations for the cell ($l = c_j^v$), the input gate ($l = \mathrm{in}$), and the forget gate ($l = \varphi$):

$$\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{c_j^v m}}\, y^{\varphi_j}(t) + g'\bigl(\mathrm{net}_{c_j^v}(t)\bigr)\, y^{\mathrm{in}_j}(t)\, y^m(t-1), \tag{3.10}$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{\mathrm{in}_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\mathrm{in}_j m}}\, y^{\varphi_j}(t) + g\bigl(\mathrm{net}_{c_j^v}(t)\bigr)\, f'_{\mathrm{in}_j}\bigl(\mathrm{net}_{\mathrm{in}_j}(t)\bigr)\, y^m(t-1), \tag{3.11}$$
$$\frac{\partial s_{c_j^v}(t)}{\partial w_{\varphi_j m}} = \frac{\partial s_{c_j^v}(t-1)}{\partial w_{\varphi_j m}}\, y^{\varphi_j}(t) + s_{c_j^v}(t-1)\, f'_{\varphi_j}\bigl(\mathrm{net}_{\varphi_j}(t)\bigr)\, y^m(t-1). \tag{3.12}$$
Furthermore, the initial state of the network does not depend on the weights, so we have

$$\frac{\partial s_{c_j^v}(t=0)}{\partial w_{lm}} = 0 \quad \text{for } l \in \{\varphi,\, \mathrm{in},\, c_j^v\}. \tag{3.13}$$
Note that the recursions in equations 3.10 through 3.12 depend on the actual activation of the block's forget gate. When the activation goes to zero, not only the cell's state but also the partials are reset (forgetting includes forgiving previous mistakes). Every cell needs to keep a copy of each of these three partials and update them at every time step. We can insert the partials in equation 3.7 and calculate the corresponding weight updates, with the internal state error $e_{s_{c_j^v}}(t)$ given by equation 3.8. The difference between updates of weights to a cell itself ($l = c_j^v$) and updates of weights to the gates is that changes $\Delta w_{c_j^v m}$ to weights to the cell depend only on the partials of this cell's own state:

$$\Delta w_{c_j^v m}(t) = \alpha\, e_{s_{c_j^v}}(t)\,\frac{\partial s_{c_j^v}(t)}{\partial w_{c_j^v m}}. \tag{3.14}$$
To update the weights of the input gate and the forget gate, however, we have to sum over the contributions of all cells in the block:

$$\Delta w_{lm}(t) = \alpha \sum_{v=1}^{S_j} e_{s_{c_j^v}}(t)\,\frac{\partial s_{c_j^v}(t)}{\partial w_{lm}} \quad \text{for } l \in \{\varphi,\, \mathrm{in}\}. \tag{3.15}$$
The equations necessary to implement the backward pass are 3.4 through 3.6, 3.8, and 3.10 through 3.15. The appendix summarizes the complete LSTM algorithm (forward and backward pass) in pseudocode.

3.3 Complexity. To calculate the computational complexity of extended LSTM, we take into account that weights to input gates and forget gates cause more expensive updates than others, because each such weight directly affects all the cells in its memory block. We evaluate a rather typical topology used in the experiments (see Figure 3). All memory blocks have the same size; gates have no outgoing connections; output units and gates have a bias connection (from a unit whose activation is always 1.0); other connections to output units stem from memory blocks only; the hidden layer is fully connected. Let $B$, $S$, $I$, $K$ denote the numbers of memory blocks, memory cells in each block, input units, and output units, respectively. We find the update complexity per time step to be:

$$W_c = \underbrace{\bigl(B \cdot (S + 2S + 1)\bigr) \cdot \bigl(B \cdot (S + 2)\bigr) + B \cdot (2S + 1)}_{\text{recurrent connections and bias}} + \underbrace{(B \cdot S + 1) \cdot K}_{\text{to output}} + \underbrace{\bigl(B \cdot (S + 2S + 1)\bigr) \cdot I}_{\text{from input}}. \tag{3.16}$$
Hence LSTM's update complexity per time step and weight is of order $O(1)$, essentially the same as for a fully connected BPTT recurrent network. Storage complexity per weight is also $O(1)$, as the last time step's partials from equations 3.10 through 3.12 are all that need to be stored for the backward pass. So the storage complexity does not depend on the length of the input sequence. Hence, extended LSTM is local in space and time, according to Schmidhuber's definition (1989), just like standard LSTM.

4 Experiments

4.1 Continual Embedded Reber Grammar Problem. To generate an infinite input stream, we extend the well-known embedded Reber grammar (ERG) benchmark problem (Smith & Zipser, 1989; Cleeremans, Servan-Schreiber, & McClelland, 1989; Fahlman, 1991; Hochreiter & Schmidhuber, 1997). Consider Figure 2.
4.1.1 ERG. The traditional method starts at the left-most node of the ERG graph and sequentially generates finite symbol strings (beginning with the empty string) by following edges and appending the associated symbols
Figure 2: Transition diagrams for standard (left) and embedded (right) Reber grammars. The dashed line indicates the continual variant.
to the current string until the right-most node is reached. Edges are chosen randomly when there is a choice (probability 0.5 each). Input and target symbols are represented by seven-dimensional binary vectors, each component standing for one of the seven possible symbols. Hence the network has seven input units and seven output units. The task is to read strings, one symbol at a time, and to predict continually the next possible symbol(s). Input vectors have exactly one nonzero component. Target vectors may have two, because sometimes there is a choice of two possible symbols at the next step. A prediction is considered correct if the mean squared error at each of the seven output units is below 0.49 (error signals occur at every time step). To predict correctly the symbol before the last (T or P) in an ERG string, the network has to remember the second symbol (also T or P) without confusing it with identical symbols encountered later. The minimal time lag is 7 (at the limit of what standard recurrent networks can manage); time lags have no upper bound, though. The expected length of a string generated by an ERG is 11.5 symbols. The length of the longest string in a set of N nonidentical strings is proportional to log N (Gers, Schmidhuber, & Cummins, 1999). For the training and test sets used in our experiments, the expected value of the longest string is greater than 50. Table 1 summarizes the performance of previous RNNs on the standard ERG problem (testing involved a test set of 256 ERG test strings). Only LSTM always learns to solve the task. Even when we ignore the unsuccessful trials of the other approaches, LSTM learns much faster.
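The generation procedure just described can be sketched in code. The transition table below is one common formulation of the Reber grammar consistent with Figure 2; it is our own reconstruction, not taken from the authors.

```python
import random

# Reber graph: state -> equiprobable (symbol, next state) choices; state 5 exits.
REBER = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', 5)],
    4: [('P', 3), ('V', 5)],
}

def reber_string(rng):
    """Random walk from the left-most node, appending edge symbols."""
    state, out = 0, ['B']
    while state != 5:
        symbol, state = rng.choice(REBER[state])  # probability 0.5 per edge
        out.append(symbol)
    out.append('E')
    return out

def erg_string(rng):
    """Embedded Reber string: the second symbol (T or P) must be predicted
    again just before the final E, across an unbounded time lag."""
    wrapper = rng.choice(['T', 'P'])
    return ['B', wrapper] + reber_string(rng) + [wrapper, 'E']
```

Concatenating `erg_string` outputs without intermediate resets yields the continual CERG streams described in section 4.1.2.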
Input streams are stopped as soon as the network makes an incorrect prediction or the $10^5$th successive symbol has occurred. Learning and testing alternate: after each training stream, we freeze the weights and
Table 1: Standard Embedded Reber Grammar (ERG): Percentage of Successful Trials and Number of Sequence Presentations until Success.

Algorithm       Hidden Units              Number of Weights   Learning Rate   % Success         Success After
RTRL            3                         ~170                0.05            "some fraction"   173,000
RTRL            12                        ~494                0.1             "some fraction"   25,000
ELM             15                        ~435                                0                 >200,000
RCC             7–9                       ~119–198                            50                182,000
Standard LSTM   3 memory blocks, size 2   276                 0.5             100               8,440

Sources: RTRL results from Smith and Zipser (1989); ELM results from Cleeremans et al. (1989); RCC results from Fahlman (1991); LSTM results from Hochreiter and Schmidhuber (1997).
Note: Weight numbers in the first four rows are estimates.
feed 10 test streams. Our performance measure is the average test stream size; 100,000 corresponds to a so-called perfect solution ($10^6$ successive correct predictions).

4.2 Network Topology and Parameters. The seven input units are fully connected to a hidden layer consisting of four memory blocks with 2 cells each (8 cells and 12 gates in total). The cell outputs are fully connected to the cell inputs, to all gates, and to the seven output units. The output units have additional "shortcut" connections from the input units (see Figure 3). All gates and output units are biased. Bias weights to input and output gates are initialized blockwise: −0.5 for the first block, −1.0 for the second, −1.5 for the third, and so forth. In this manner, cell states are initially close to zero; as training progresses, the biases become progressively less negative, allowing the serial activation of cells as active participants in the network computation. Forget gates are initialized with symmetric positive values: +0.5 for the first block, +1.0 for the second block, and so on. Precise bias initialization is not critical, though; other values work just as well. All other weights, including the output bias, are initialized randomly in the range [−0.2, 0.2]. There are 424 adjustable weights, which is comparable to the number used by LSTM in solving the ERG (see Table 1). Weight changes are made after each input symbol presentation. At the beginning of each training stream, the learning rate $\alpha$ is initialized with 0.5. It either remains fixed or decays by a factor of 0.99 per time step (LSTM with $\alpha$-decay). Learning rate decay is well studied in statistical approximation theory and is also common in neural networks (e.g., Darken, 1995). We report results for exponential $\alpha$-decay (as specified above) but also tested several other variants (linear, $1/T$, $1/\sqrt{T}$) and found them all to work as well without extensive optimization of parameters.
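The blockwise bias initialization and the learning rate schedule can be sketched as follows (our own illustration; as noted above, the exact values are not critical).

```python
def init_gate_biases(n_blocks):
    # Input and output gate biases start progressively more negative, and
    # forget gate biases progressively more positive, so later blocks join
    # the computation (and start forgetting) later in training.
    in_out_bias = [-0.5 * (j + 1) for j in range(n_blocks)]
    forget_bias = [+0.5 * (j + 1) for j in range(n_blocks)]
    return in_out_bias, forget_bias

def learning_rate(step, alpha0=0.5, decay=0.99):
    # Exponential alpha-decay per time step within a training stream.
    return alpha0 * decay ** step
```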
Figure 3: Three-layer LSTM topology with recurrence limited to the hidden layer, which consists of four extended LSTM memory blocks (only two are shown) with two cells each. Only a limited subset of connections is shown.
4.3 CERG Results. Training was stopped after at most 30,000 training streams, each of which was ended when the first prediction error or the 100,000th successive input symbol occurred. Table 2 compares extended LSTM (with and without learning rate decay) to standard LSTM and an LSTM variant with decay of the internal cell state $s_c$ (with a self-recurrent weight $< 1$). Our results for standard LSTM with network activation resets (by an external teacher) at sequence ends are slightly better than those based on a different topology (Hochreiter & Schmidhuber, 1997). External resets (the noncontinual case) allow LSTM to find excellent solutions in 74% of the trials, according to our stringent testing criterion. Standard LSTM fails, however, in the continual case. Internal state decay does not help much either (we tried various self-recurrent weight values and report only the best result). Extended LSTM with forget gates, however, can solve the continual problem. A continually decreasing learning rate led to even better results but had no effect on the other algorithms. Different topologies may provide better results too; we did not attempt to optimize the topology. Can the network learn to recognize appropriate times for opening and closing its gates without using the information conveyed by the marker symbols B and E? To test this, we replaced all CERG subnets of the type $\xrightarrow{T \vee P} \xrightarrow{E} \xrightarrow{B} \xrightarrow{T \vee P}$ by $\xrightarrow{T \vee P} \xrightarrow{T \vee P}$. This makes the task more difficult, as the net now needs to keep track of sequences of numerous potentially confusing T and P symbols. But LSTM with forget gates (same topology)
Table 2: Continual Embedded Reber Grammar (CERG).

Algorithm                                        % Solutions^a   % Good Solutions^b   % Rest^c
Standard LSTM with external reset                74 (7,441)      0 ⟨−⟩                26 ⟨31⟩
Standard LSTM                                    0 (−)           1 ⟨1,166⟩            99 ⟨37⟩
LSTM with state decay (0.9)                      0 (−)           0 ⟨−⟩                100 ⟨56⟩
LSTM with forget gates                           18 (18,889)     29 ⟨39,171⟩          53 ⟨145⟩
LSTM with forget gates and sequential α decay    62 (14,087)     6 ⟨68,464⟩           32 ⟨30⟩

^a Percentage of "perfect" solutions (correct prediction of 10 streams of 100,000 symbols each). The number of training streams presented until a solution was reached is shown in parentheses.
^b Percentage of solutions with an average stream length > 1,000. The mean length of error-free prediction is given in angle brackets.
^c Percentage of "bad" solutions with average stream length ≤ 1,000. The mean length of error-free prediction is given in angle brackets.
Notes: The results are averages over 100 independently trained networks. Other algorithms like BPTT are not included in the comparison, because they tend to fail even on the easier, noncontinual ERG.
was still able to find perfect solutions, although less frequently (sequential α decay was not applied).
4.4 Analysis of the CERG Results. How does extended LSTM solve the task on which standard LSTM fails? Section 2.1 already mentioned LSTM's problem of uncontrolled growth of the internal states. Figure 4 shows the evolution of the internal states $s_c$ during the presentation of a test stream. The internal states tend to grow linearly. At the starts of successive ERG strings, the network is in an increasingly active state. At some point (here,
Figure 4: Evolution of standard LSTM's internal states $s_c$ during presentation of a test stream, stopped at the first prediction failure. Starts of new ERG strings are indicated by vertical lines labeled with the symbol (P or T) to be stored until the next string start.
Figure 5: (Top) Internal states $s_c$ of the two cells of the self-resetting third memory block in an extended LSTM network during a test stream presentation. The figure shows 170 successive symbols taken from the longer sequence presented to a network that learned the CERG. Starts of new ERG strings are indicated by vertical lines labeled with the symbol (P or T) to be stored until the next string start. (Bottom) Simultaneous forget gate activations of the same memory block.
after 13 successive strings), the high level of state activation leads to saturation of the cell outputs, and performance breaks down. Extended LSTM, however, learns to use the forget gates for resetting its state when necessary. Figure 5 (top) shows a typical internal state evolution after learning. We see that the third memory block resets its cells in synchrony with the starts of ERG strings. The internal states oscillate around zero; they never drift out of bounds as with standard LSTM (see Figure 4). It also becomes clear how the relevant information gets stored: the second cell of the third block stays negative while the symbol P has to be stored, whereas a T is represented by a positive value. The third block's forget gate activations are plotted in Figure 5 (bottom). Most of the time, they are equal to 1.0, thus letting the memory cells retain their internal values. At the end of an ERG string, the forget gate's activation goes to zero, thus resetting the cell states to zero. Analyzing the behavior of the other memory blocks, we find that only the third is directly responsible for bridging ERG's longest time lag. Figure 6 plots values analogous to those in Figure 5 for the first memory block and its first cell. The first block's cell and forget gate show short-term behavior only (necessary for predicting the numerous short time lag events of the Reber
Figure 6: (Top) Extended LSTM's self-resetting states for the first cell in the first block. (Bottom) Forget gate activations of the first memory block.
grammar). The same is true for all other blocks except the third. Common to all memory blocks is that they learned to reset themselves in an appropriate fashion.

4.5 Continual Noisy Temporal Order Problem. Extended LSTM solves the CERG problem; standard LSTM does not. But can standard LSTM solve problems that extended LSTM cannot? We tested extended LSTM on one of the most difficult nonlinear long time lag tasks ever solved by an RNN: noisy temporal order (NTO), task 6b from Hochreiter and Schmidhuber (1997).
4.5.1 NTO. The goal is to classify sequences of locally represented symbols. Each sequence starts with an E, ends with a B (the "trigger symbol"), and otherwise consists of randomly chosen symbols from the set {a, b, c, d}, except for three elements at positions t1, t2, and t3 that are either X or Y. The sequence length is randomly chosen between 100 and 110, t1 is randomly chosen between 10 and 20, t2 is randomly chosen between 33 and 43, and t3 is randomly chosen between 66 and 76. There are eight sequence classes, Q, R, S, U, V, A, B, C, which depend on the temporal order of the Xs and Ys. The rules are: X, X, X → Q; X, X, Y → R; X, Y, X → S; X, Y, Y → U; Y, X, X → V; Y, X, Y → A; Y, Y, X → B; Y, Y, Y → C. Target signals occur only at the end of a sequence. The problem's minimal time lag size is 80. Forgetting is harmful only because all relevant information has to
Table 3: Continual Noisy Temporal Order (CNTO).

Algorithm                                        % Perfect Solutions^a   % Partial Solutions^b
Standard LSTM                                    0 (−)                   100 ⟨4.6⟩
LSTM with forget gates                           24 (18,077)             76 ⟨12.2⟩
LSTM with forget gates and sequential α decay    37 (22,654)             63 ⟨11.8⟩

^a Percentage of perfect solutions (correct classification of 1,000 successive NTO sequences in 10 test streams). In parentheses is the number of training streams presented.
^b Percentage of solutions with average stream size (value in angle brackets) ≤ 100.
Notes: All results are averages over 100 independently trained networks. Other algorithms (BPTT, RTRL, etc.) are not included in the comparison, because they fail even on the easier, noncontinual NTO.
be kept until the end of a sequence, after which the network is reset anyway. We use the network topology described in section 4.2 with eight input and eight output units. Using a large bias (5.0) for the forget gates, extended LSTM solved the task as quickly as standard LSTM (recall that a high forget gate bias makes extended LSTM degenerate into standard LSTM). Using a moderate bias like the one used for the CERG (1.0), extended LSTM took about three times longer on average but did solve the problem. The slower learning speed results from the net's having to learn to remember everything and not to forget. Generally we have not yet encountered a problem that standard LSTM solves while extended LSTM does not.

4.5.2 CNTO. Now we take the next obvious step and transform the NTO into a continual problem that does require forgetting, just as in section 4.1, by generating continual input streams consisting of concatenated NTO sequences. Processing such streams without intermediate resets, the network is required to learn to classify NTO sequences in an on-line fashion. Each input stream is stopped once the network makes an incorrect classification or 100 successive NTO sequences have been classified correctly. Learning and testing alternate; the performance measure is the average size of 10 test streams, measured by the number of their NTO sequences (each containing between 100 and 110 input symbols). Training is stopped after at most $10^5$ training streams.
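The NTO generation and classification rules of section 4.5.1 can be sketched as follows (our own illustration; positions are 0-indexed here).

```python
import random

# The temporal order of the three X/Y markers determines the class.
NTO_CLASS = {('X', 'X', 'X'): 'Q', ('X', 'X', 'Y'): 'R',
             ('X', 'Y', 'X'): 'S', ('X', 'Y', 'Y'): 'U',
             ('Y', 'X', 'X'): 'V', ('Y', 'X', 'Y'): 'A',
             ('Y', 'Y', 'X'): 'B', ('Y', 'Y', 'Y'): 'C'}

def nto_sequence(rng):
    """Return one NTO sequence and its target class."""
    length = rng.randint(100, 110)
    # Start with E, end with the trigger symbol B, noise symbols in between.
    seq = ['E'] + [rng.choice('abcd') for _ in range(length - 2)] + ['B']
    positions = (rng.randint(10, 20), rng.randint(33, 43), rng.randint(66, 76))
    markers = tuple(rng.choice('XY') for _ in range(3))
    for pos, mark in zip(positions, markers):
        seq[pos] = mark
    return seq, NTO_CLASS[markers]
```

Concatenating such sequences without resets yields the CNTO streams.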
5 Conclusion
Continual input streams generally require occasional resets of the stream-processing network. Partial resets are also desirable for tasks with hierarchical decomposition. For instance, recurring subtasks should be solved by the same network module, which should be reset once the subtask is solved. Since typical real-world input streams are not a priori decomposed into training subsequences, and typical sequential tasks are not a priori decomposed into appropriate subproblems, RNNs should be able to learn appropriate decompositions. Our novel forget gates naturally permit LSTM to learn local self-resets of memory contents that have become irrelevant. Extended LSTM holds promise for any sequential processing task in which we suspect that a hierarchical decomposition may exist but do not know in advance what this decomposition is. The model has been successfully applied to the task of discriminating languages from very limited prosodic information (Cummins, Gers, & Schmidhuber, 1999), where there is no clear linguistic theory of hierarchical structure. Memory blocks equipped with forget gates may also be capable of developing into internal oscillators or timers, allowing the recognition and generation of hierarchical rhythmic patterns.

Acknowledgments
This work was supported by SNF grant 2100-49’144.96, “Long Short-Term Memory,” to J. S. Thanks to Nici Schraudolph for providing his fast exponentiation code (Schraudolph, 1999) employed to accelerate the computation of exponentials.
Appendix: Summary of Extended LSTM in Pseudocode

[Pseudocode listing not reproduced here.]
References

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Cleeremans, A., Servan-Schreiber, D., & McClelland, J. L. (1989). Finite-state automata and simple recurrent networks. Neural Computation, 1, 372–381.
Cummins, F., Gers, F., & Schmidhuber, J. (1999). Language identification from prosody without explicit features. In Proceedings of EUROSPEECH'99 (Vol. 1, pp. 371–374).
Darken, C. (1995). Stochastic approximation and neural network learning. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks (pp. 941–944). Cambridge, MA: MIT Press.
Doya, K., & Yoshizawa, S. (1989). Adaptive neural oscillator using continuous-time backpropagation learning. Neural Networks, 2(5), 375–385.
Fahlman, S. E. (1991). The recurrent cascade-correlation learning algorithm. In R. P. Lippmann, J. E. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 190–196). San Mateo, CA: Morgan Kaufmann.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to forget: Continual prediction with LSTM (Tech. Rep. No. IDSIA-01-99). Lugano, Switzerland: IDSIA.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technische Universität München. Available online at www7.informatik.tu-muenchen.de/~hochreit.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Jordan, M. I. (1986). Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Annual Cognitive Science Society Conference. Hillsdale, NJ: Erlbaum.
Lin, T., Horne, B. G., Tiňo, P., & Giles, C. L. (1996). Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6), 1329–1338.
Mozer, M. C. (1989). A focused backpropagation algorithm for temporal pattern processing. Complex Systems, 3, 349–381.
Pearlmutter, B. A. (1995). Gradient calculation for dynamic recurrent neural networks: A survey. IEEE Transactions on Neural Networks, 6(5), 1212–1228.
Robinson, A. J., & Fallside, F. (1987). The utility driven dynamic error propagation network (Tech. Rep. No. CUED/F-INFENG/TR.1). Cambridge: Cambridge University Engineering Department.
Schmidhuber, J. (1989). The neural bucket brigade: A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4), 403–412.
Schmidhuber, J. (1992). A fixed size storage O(n³) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2), 243–248.
Schraudolph, N. (1999). A fast, compact approximation of the exponential function. Neural Computation, 11(4), 853–862.
Smith, A. W., & Zipser, D. (1989). Learning sequential structures with the real-time recurrent learning algorithm. International Journal of Neural Systems, 1(2), 125–131.
Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1(1), 39–46.
Werbos, P. J. (1988). Generalisation of backpropagation with application to a recurrent gas market model. Neural Networks, 1, 339–356.
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4), 490–501.
Williams, R. J., & Zipser, D. (1992). Gradient-based learning algorithms for recurrent networks and their computational complexity. In Y. Chauvin & D. E. Rumelhart (Eds.), Back-propagation: Theory, architectures and applications. Hillsdale, NJ: Erlbaum.

Received March 3, 1999; accepted November 7, 1999.
Neural Computation, Volume 12, Number 11

CONTENTS

Article
Dynamical Mechanism for Sharp Orientation Tuning in an Integrate-and-Fire Model of a Cortical Hypercolumn
P. C. Bressloff, N. W. Bressloff, and J. D. Cowan   2473

Notes
Statistical Signs of Common Inhibitory Feedback with Delay
Quentin Pauluis   2513
On the Computational Power of Winner-Take-All
Wolfgang Maass   2519
Reclassification as Supervised Clustering
A. Sierra and F. Corbacho   2537

Letters
A Model of Invariant Object Recognition in the Visual System: Learning Rules, Activation Functions, Lateral Inhibition, and Information-Based Performance Measures
Edmund T. Rolls and T. Milward   2547
An Analysis of Orientation and Ocular Dominance Patterns in the Visual Cortex of Cats and Ferrets
T. Müller, M. Stetter, M. Hübener, F. Sengpiel, T. Bonhoeffer, I. Gödecke, B. Chapman, S. Löwel, and K. Obermayer   2573
Improvements to the Sensitivity of Gravitational Clustering for Multiple Neuron Recordings
Stuart N. Baker and George L. Gerstein   2597
Neural Coding: Higher-Order Temporal Patterns in the Neurostatistics of Cell Assemblies
Laura Martignon, Gustavo Deco, Kathryn Laskey, Mathew Diamond, Winrich Freiwald, and Eilon Vaadia   2621
Gaussian Processes for Classification: Mean-Field Algorithms
Manfred Opper and Ole Winther   2655
The Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural Networks
Dirk Husmeier   2685
A Bayesian Committee Machine
Volker Tresp   2719
ARTICLE
Communicated by Carl van Vreeswijk
Dynamical Mechanism for Sharp Orientation Tuning in an Integrate-and-Fire Model of a Cortical Hypercolumn

P. C. Bressloff
Nonlinear and Complex Systems Group, Department of Mathematical Sciences, Loughborough University, Loughborough, Leicestershire LE11 3TU, U.K.
N. W. Bressloff
Computational and Engineering Design Centre, Department of Aeronautics and Astronautics, University of Southampton, Southampton, SO17 1BJ, U.K.
J. D. Cowan
Department of Mathematics, University of Chicago, Chicago, IL 60637, U.S.A.

Orientation tuning in a ring of pulse-coupled integrate-and-fire (IF) neurons is analyzed in terms of spontaneous pattern formation. It is shown how the ring bifurcates from a synchronous state to a non-phase-locked state whose spike trains are characterized by clustered but irregular fluctuations of the interspike intervals (ISIs). The separation of these clusters in phase space results in a localized peak of activity as measured by the time-averaged firing rate of the neurons. This generates a sharp orientation tuning curve that can lock to a slowly rotating, weakly tuned external stimulus. Under certain conditions, the peak can slowly rotate even to a fixed external stimulus. The ring also exhibits hysteresis due to the subcritical nature of the bifurcation to sharp orientation tuning. Such behavior is shown to be consistent with a corresponding analog version of the IF model in the limit of slow synaptic interactions. For fast synapses, the deterministic fluctuations of the ISIs associated with the tuning curve can support a coefficient of variation of order unity.

1 Introduction
Recent studies of the formation of localized spatial patterns in one- and two-dimensional neural networks have been used to investigate a variety of neuronal processes, including orientation selectivity in primary visual cortex (Ben-Yishai, Bar-Or, Lev, & Sompolinsky, 1995; Ben-Yishai, Hansel, & Sompolinsky, 1997; Hansel & Sompolinsky, 1997; Mundel, Dimitrov, & Cowan, 1997), the coding of arm movements in motor cortex (Lukashin & Georgopolous, 1994a, 1994b; Georgopolous, 1995), and the control of saccadic eye movements (Zhang, 1996). The networks considered in these studies are based on a simplified rate or analog model of a neuron, in which

Neural Computation 12, 2473–2511 (2000) © 2000 Massachusetts Institute of Technology
the state of each neuron is characterized by a single continuous variable that determines its short-term average output activity (Cowan, 1968; Wilson & Cowan, 1972). All of these models involve the same basic dynamical mechanism for the formation of localized patterns: spontaneous symmetry breaking from a uniform resting state (Cowan, 1982). Localized structures consisting of a single peak of high activity occur when the maximum (spatial) Fourier component of the combination of excitatory and inhibitory interactions between neurons has a wavelength comparable to the size of the network. Typically such networks have periodic boundary conditions so that unraveling the network results in a spatially periodic pattern of activity, as studied previously by Ermentrout and Cowan (1979a, 1979b) (see also the review by Ermentrout, 1998). In contrast to mean firing-rate models, there has been relatively little analytical work on pattern formation in more realistic spiking models. A number of numerical studies have shown that localized activity profiles can occur in networks of Hodgkin-Huxley neurons (Lukashin & Georgopolous, 1994a, 1994b; Hansel & Sompolinsky, 1996). Moreover, both local (Somers, Nelson, & Sur, 1995) and global (Usher, Stemmler, Koch, & Olami, 1994) patterns of activity have been found in integrate-and-fire (IF) networks. Recently a dynamical theory of global pattern formation in IF networks has been developed in terms of the nonlinear map of the neuronal firing times (Bressloff & Coombes, 1998b, 2000). A linear stability analysis of this map shows how, in the case of short-range excitation and long-range inhibition, a network that is synchronized in the weak coupling regime can destabilize as the strength of coupling is increased, leading to a state characterized by clustered but irregular fluctuations of the interspike intervals (ISIs).

The separation of these clusters in phase space results in a spatially periodic pattern of mean (time-averaged) firing rate across the network, which is modulated by deterministic fluctuations in the instantaneous firing rates. In this article we apply the theory of Bressloff and Coombes (1998b, 2000) to the analysis of localized pattern formation in an IF version of the model of sharp orientation tuning developed by Hansel and Sompolinsky (1997) and Ben-Yishai et al. (1995, 1997). We first study a corresponding analog model in which the outputs of the neurons are taken to be mean firing rates. Since the resulting firing-rate function is nonlinear rather than semilinear, it is not possible to construct exact solutions for the orientation tuning curves along the lines of Hansel and Sompolinsky (1997). Instead, we investigate the existence and stability of such activity profiles using bifurcation theory. We show how a uniform resting state destabilizes to a stable localized pattern as the strength of neuronal recurrent interactions is increased. This localized state consists of a single narrow peak of activity whose center can lock to a slowly rotating, weakly tuned external stimulus. Interestingly, we find that the bifurcation from the resting state is subcritical: the system jumps to a localized activity profile on destabilization of the resting state, and
hysteresis occurs in the sense that sharp orientation tuning can coexist with a stable resting state. We then turn to the full IF model and show how an analysis of the nonlinear firing time map can serve as a basis for understanding orientation tuning in networks of spiking neurons. We derive explicit criteria for the stability of phase-locked solutions by considering the propagation of perturbations of the firing times throughout the network. Our analysis identifies regions in parameter space where instabilities in the firing times cause (subcritical) bifurcations to non-phase-locked states that support localized patterns of mean firing rates across the network similar to the sharp orientation tuning curves found in the corresponding analog model. However, as found in the case of global pattern formation (Bressloff & Coombes, 1998b, 2000), the tuning curves are modulated by deterministic fluctuations of the ISIs on closed quasiperiodic orbits, which grow with the speed of synapses. For sufficiently fast synapses, the resulting coefficient of variation (CV) can be of order unity and the ISIs appear to exhibit chaotic motion due to breakup of the quasiperiodic orbits.

2 The Model
We consider an IF version of the neural network model for orientation tuning in a cortical hypercolumn developed by Hansel and Sompolinsky (1997) and Ben-Yishai et al. (1997). This is a simplified version of the model studied numerically by Somers et al. (1995). The network consists of two subpopulations of neurons, one excitatory and the other inhibitory, which code for the orientation of a visual stimulus appearing in a common visual field (see Figure 1). The index $L = E, I$ will be used to distinguish the two populations. Each neuron is parameterized by an angle $\phi$, $0 \le \phi < \pi$, which represents its orientation preference. (The angle is restricted to be from 0 to $\pi$ since a bar that is oriented at an angle 0 is indistinguishable from one that has an orientation $\pi$.) Let $U_L(\phi, t)$ denote the membrane potential at time $t$ of a neuron of type $L$ and orientation preference $\phi$. The neurons are modeled as IF oscillators evolving according to the set of equations

$$\tau_0 \frac{\partial U_L(\phi, t)}{\partial t} = h_0 - U_L(\phi, t) + X_L(\phi, t), \qquad L = E, I, \tag{2.1}$$

where $h_0$ is a constant external input or bias, $\tau_0$ is a membrane time constant, and $X_L(\phi, t)$ denotes the total synaptic input to a neuron $\phi$ of type $L$. Periodic boundary conditions are assumed so that $U_L(0, t) = U_L(\pi, t)$. Equation 2.1 is supplemented by the condition that whenever a neuron reaches a threshold $\kappa$, it fires a spike, and its membrane potential is immediately reset to zero. In other words,

$$U_L(\phi, t^+) = 0 \quad \text{whenever} \quad U_L(\phi, t) = \kappa, \qquad L = E, I. \tag{2.2}$$
Figure 1: Network architecture of a ring model of a cortical hypercolumn, showing the excitatory (E) and inhibitory (I) populations coupled through the weights $W_{EE}$, $W_{IE}$, $W_{EI}$, and $W_{II}$ (see Hansel & Sompolinsky, 1997).
For concreteness, we set the threshold $\kappa = 1$ and take $h_0 = 0.9 < \kappa$, so that in the absence of any synaptic inputs, all neurons are quiescent. We also fix the fundamental unit of time to be of the order 5–10 msec by setting $\tau_0 = 1$. (All results concerning firing rates or interspike intervals presented in this article are in units of $\tau_0$.) The total synaptic input $X_L(\phi, t)$ is taken to be of the form
$$X_L(\phi, t) = \sum_{M=E,I} \int_0^{\pi} \frac{d\phi'}{\pi}\, W_{LM}(\phi - \phi')\, Y_M(\phi', t) + h_L(\phi, t), \tag{2.3}$$
where $W_{LM}(\phi - \phi')$ denotes the interaction between a presynaptic neuron $\phi'$ of type $M$ and a postsynaptic neuron $\phi$ of type $L$, and $Y_M(\phi', t)$ is the effective input at time $t$ induced by the incoming spike train from the presynaptic neuron (also known as the synaptic drive; Pinto, Brumberg, Simons, & Ermentrout, 1996). The term $h_L(\phi, t)$ represents the inputs from the lateral geniculate nucleus (LGN). The weight functions $W_{LM}$ are taken to be even
and $\pi$-periodic in $\phi$ so that they have the Fourier expansions

$$W_{LE}(\phi) = W_0^{LE} + 2 \sum_{k=1}^{\infty} W_k^{LE} \cos(2k\phi) \ge 0, \tag{2.4}$$

$$W_{LI}(\phi) = -W_0^{LI} - 2 \sum_{k=1}^{\infty} W_k^{LI} \cos(2k\phi) \le 0. \tag{2.5}$$
Moreover, the interactions are assumed to depend on the degree of similarity of the presynaptic and postsynaptic orientation preferences, and to be of maximum strength when they have the same preference. In order to make a direct comparison with the results of Hansel and Sompolinsky (1997), almost all our numerical results will include only the first two harmonics in equations 2.4 and 2.5. Higher harmonics ($W_n^{LM}$, $n \ge 2$) generate similar-looking tuning curves. Finally, we take

$$Y_L(\phi, t) = \int_0^{\infty} d\tau\, \rho(\tau)\, f_L(\phi, t - \tau), \tag{2.6}$$
where $\rho(\tau)$ represents some delay distribution and $f_L(\phi, t)$ is the output spike train of a neuron $\phi$ of type $L$. Neglecting the pulse shape of an individual action potential, we represent the output spike train as a sequence of impulses,

$$f_L(\phi, t) = \sum_{k \in \mathbb{Z}} \delta(t - T_L^k(\phi)), \tag{2.7}$$
where $T_L^k(\phi)$, integer $k$, denotes the $k$th firing time (threshold-crossing time) of the given neuron. The delay distribution $\rho(\tau)$ can incorporate a number of possible sources of delay in neural systems: (1) discrete delays arising from finite axonal transmission times, (2) synaptic processing delays associated with the conversion of an incoming spike to a postsynaptic potential, or (3) dendritic processing delays in which the effects of a postsynaptic potential generated at a synapse located on the dendritic tree at some distance from the soma are mediated by diffusion along the tree. For concreteness, we shall restrict ourselves to synaptic delays and take $\rho(\tau)$ to be an alpha function (Jack, Noble, & Tsien, 1975; Destexhe, Mainen, & Sejnowsky, 1994),

$$\rho(\tau) = \alpha^2 \tau\, e^{-\alpha \tau} H(\tau), \tag{2.8}$$
where $\alpha$ is the inverse rise time of a postsynaptic potential, and $H(\tau) = 1$ if $\tau \ge 0$ and is zero otherwise. We expect axonal delays to be small within a given hypercolumn. (For a review of the dynamical effects of dendritic structure, see Bressloff & Coombes, 1997.)
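Two elementary properties of the alpha function are worth checking numerically (a sketch of ours, not from the paper): the kernel 2.8 integrates to one, so it redistributes each spike's effect in time without changing its total weight, and it peaks at the rise time $\tau = 1/\alpha$.

```python
import numpy as np

def alpha_kernel(tau, a):
    """rho(tau) = a^2 * tau * exp(-a*tau) for tau >= 0, zero otherwise (eq. 2.8)."""
    tau = np.asarray(tau, dtype=float)
    return np.where(tau >= 0.0, a**2 * tau * np.exp(-a * np.maximum(tau, 0.0)), 0.0)

a = 2.0
tau = np.linspace(0.0, 20.0, 200001)
rho = alpha_kernel(tau, a)
dtau = tau[1] - tau[0]

total = float(np.sum(rho) * dtau)    # Riemann sum: should be ~1 (unit normalization)
peak = float(tau[np.argmax(rho)])    # should sit at the rise time 1/a
```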
It is important to see how the above spiking version of the ring model is related to the rate models considered by Hansel and Sompolinsky (1997) and Ben-Yishai et al. (1997). Suppose the synaptic interactions are sufficiently slow that the output $f_L(\phi, t)$ of a neuron can be characterized reasonably well by a mean (time-averaged) firing rate (see, for example, Amit & Tsodyks, 1991; Bressloff & Coombes, 2000). Let us consider the case in which $\rho(\tau)$ is given by the alpha function 2.8 with a synaptic rise time $\alpha^{-1}$ significantly longer than all other timescales in the system. The total synaptic input to neuron $\phi$ of type $L$ will then be described by a slowly time-varying function $X_L(\phi, t)$, such that the actual firing rate will quickly relax to approximately the steady-state value, as determined by equations 2.1 and 2.2. This implies that

$$f_L(\phi, t) = f(X_L(\phi, t)), \tag{2.9}$$
with the steady-state firing-rate function $f$ given by

$$f(X) = \left\{ \ln \left[ \frac{h_0 + X}{h_0 + X - 1} \right] \right\}^{-1} H(h_0 + X - 1). \tag{2.10}$$
(For simplicity, we shall ignore the effects of the refractory period, which is reasonable when the system is operating well below its maximal firing rate.) Equation 2.9 relates the dynamics of the firing rate directly to the stimulus dynamics $X_L(\phi, t)$ through the steady-state response function. In effect, the use of a slowly varying distribution $\rho(\tau)$ allows a consistent definition of the firing rate so that a dynamical network model can be based on the steady-state properties of an isolated neuron. Substitution of equations 2.6 and 2.9 into 2.3 yields the extended Wilson-Cowan equations (Wilson & Cowan, 1973):

$$X_L(\phi, t) = \sum_{M=E,I} \int_0^{\pi} \frac{d\phi'}{\pi}\, W_{LM}(\phi - \phi') \int_0^{\infty} d\tau\, \rho(\tau)\, f(X_M(\phi', t - \tau)) + h_L(\phi, t). \tag{2.11}$$
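The steady-state rate function 2.10 is easy to evaluate directly. The sketch below (our illustration, not the authors' code) verifies its threshold behavior: the rate is identically zero until $h_0 + X$ crosses the threshold $\kappa = 1$ and grows monotonically thereafter.

```python
import numpy as np

def f_rate(X, h0=0.9, kappa=1.0):
    """Steady-state IF firing rate (eq. 2.10): f(X) = 1/ln[(h0+X)/(h0+X-kappa)]
    when h0 + X > kappa, and zero otherwise."""
    X = np.asarray(X, dtype=float)
    d = h0 + X
    out = np.zeros_like(d)
    sup = d > kappa
    out[sup] = 1.0 / np.log(d[sup] / (d[sup] - kappa))
    return out

X = np.linspace(0.0, 2.0, 201)
rates = f_rate(X)
```

For $X = 0.3$ the rate is $1/\ln 6$, the reciprocal of the interspike interval found by integrating equation 2.1 directly.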
An alternative version of equation 2.11 may be obtained for the alpha function delay distribution, 2.8, by rewriting equation 2.6 as the differential equation

$$\frac{1}{\alpha^2} \frac{\partial^2 Y_L}{\partial t^2} + \frac{2}{\alpha} \frac{\partial Y_L}{\partial t} + Y_L = f(X_L), \tag{2.12}$$
with $X_L$ given by equation 2.3 and $Y_L(\phi, t), \partial Y_L(\phi, t)/\partial t \to 0$ as $t \to -\infty$. Similarly, taking $\rho(\tau) = \alpha e^{-\alpha \tau}$ generates the particular version of the Wilson-Cowan equations studied by Hansel and Sompolinsky (1997),

$$\alpha^{-1} \frac{\partial Y_L}{\partial t} + Y_L = f(X_L). \tag{2.13}$$
There are a number of differences, however, between the interpretation of $Y_L$ in equation 2.13 and the corresponding variable considered by Hansel and Sompolinsky (1997). In the former case, $Y_L(\phi, t)$ is the synaptic drive of a single neuron with orientation preference $\phi$, and $f$ is a time-averaged firing rate, whereas in the latter case, $Y_L$ represents the activity of a population of neurons forming an orientation column $\phi$, and $f$ is some gain function. It is possible to introduce the notion of an orientation column in our model by partitioning the domain $0 \le \phi < \pi$ into $N$ segments of length $\pi/N$ such that

$$m_L(\phi_k, t) = N \int_{k\pi/N}^{(k+1)\pi/N} \frac{d\phi}{\pi}\, f_L(\phi, t), \qquad k = 0, 1, \ldots, N - 1, \tag{2.14}$$
represents the population-averaged firing rate within the $k$th orientation column, $\phi_k = k\pi/N$. (Alternatively, we could reinterpret the IF model as a caricature of a synchronized column of spiking neurons.) Hansel and Sompolinsky (1997) and Ben-Yishai et al. (1995, 1997) have carried out a detailed investigation of orientation tuning in the mean firing-rate version of the ring model defined by equations 2.3 and 2.13 with $f$ a semilinear gain function. They consider external inputs of the form

$$h_L(\phi, t) = C\, \mathcal{C}_L \left[ 1 - \chi + \chi \cos(2(\phi - \phi_0(t))) \right] \tag{2.15}$$

for $0 \le \chi \le 0.5$. This represents a tuned input from the LGN due to a visual stimulus of orientation $\phi_0$. The parameter $C$ denotes the contrast of the stimulus, $\mathcal{C}_L$ is the transfer function from the LGN to the cortex, and $\chi$ determines the angular anisotropy. In the case of a semilinear gain function, $f(x) = 0$ if $x \le 0$, $f(x) = x$ if $0 \le x \le 1$, and $f(x) = 1$ for $x \ge 1$, it is possible to derive self-consistency equations for the activity profile of the network. Solving these equations shows that in certain parameter regimes, local cortical feedback can generate sharp orientation tuning curves in which only a fraction of neurons are active. The associated activity profile consists of a single peak centered about $\phi_0$ (Hansel & Sompolinsky, 1997). This activity peak can also lock to a rotating stimulus $\phi_0 = \Omega t$ provided that $\Omega$ is not too large; if the inhibitory feedback is sufficiently strong, then it is possible for spontaneous wave propagation to occur even in the absence of a rotating stimulus (Ben-Yishai et al., 1997). The idea that local cortical interactions play a central role in generating sharp orientation tuning curves is still controversial. The classical model of Hubel and Wiesel (1962) proposes a very different mechanism, in which the orientation preference of a cell arises primarily from the geometrical alignment of the receptive fields of the LGN neurons projecting to it. A number of recent experiments show a significant correlation between the alignment of receptive fields of LGN neurons and the orientation preference of simple cells functionally connected to them (Chapman, Zahs, &
Stryker, 1991; Reid & Alonso, 1995). In addition, Ferster, Chung, and Wheat (1997) have shown that cooling a patch of cortex, and therefore presumably abolishing cortical feedback, does not totally abolish the orientation tuning exhibited by excitatory postsynaptic potentials (EPSPs) generated by LGN input. However, there is also growing experimental evidence suggesting the importance of cortical feedback. For example, the blockage of extracellular inhibition in cortex leads to considerably less sharp orientation tuning (Sillito, Kemp, Milson, & Beradi, 1980; Ferster & Koch, 1987; Nelson, Toth, Seth, & Mur, 1994). Moreover, intracellular measurements indicate that direct inputs from the LGN to neurons in layer 4 of the primary visual cortex provide only a fraction of the total excitatory inputs relevant to orientation selectivity (Pei, Vidyasagar, Volgushev, & Creutzfeldt, 1994; Douglas, Koch, Mahowald, Martin, & Suarez, 1995; see also Somers et al., 1995). In addition, there is evidence that orientation tuning takes about 50 to 60 msec to reach its peak, and that the dynamics of tuning has a rather complex time course (Ringach, Hawken, & Shapley, 1997), suggesting some cortical involvement. The dynamical mechanism for sharp orientation tuning identified by Hansel and Sompolinsky (1997) can be interpreted as a localized form of spontaneous pattern formation (at least for strongly modulated cortical interactions). For general spatially distributed systems, pattern formation concerns the loss of stability of a spatially uniform state through a bifurcation to a spatially periodic state (Murray, 1990). The latter may be stationary or oscillatory (time periodic). Pattern formation in neural networks was first studied in detail by Ermentrout and Cowan (1979a, 1979b). They showed how competition between short-range excitation and long-range inhibition in a two-dimensional Wilson-Cowan network can induce periodic striped and hexagonal patterns of activity.

These spontaneous patterns provided a possible explanation for the generation of visual hallucinations. (See also Cowan, 1982.) At first sight, the analysis of orientation tuning by Hansel and Sompolinsky (1997) and Ben-Yishai et al. (1997) appears to involve a different dynamical mechanism from this, since only a single peak of activity is formed over the domain $0 \le \phi \le \pi$ rather than a spatially repeating pattern. This apparent difference vanishes, however, once it is realized that spatially periodic patterns would be generated by "unraveling" the ring. Thus, the one-dimensional stationary and propagating activity profiles that Wilson and Cowan (1973) found correspond, respectively, to stationary and time-periodic patterns on the $\pi$-periodic ring. In the following sections, we study orientation tuning in both the analog and IF models from the viewpoint of spontaneous pattern formation.

3 Orientation Tuning in Analog Model
We first consider orientation tuning in the analog or rate model described by equation 2.11 with the firing-rate function given by equation 2.10. This may be viewed as the $\alpha \to 0$ limit of the IF model. We shall initially restrict
ourselves to the case of time-independent external inputs $h_L(\phi)$. Given a $\phi$-independent input $h_L(\phi) = C_L$, the homogeneous equation 2.11 has at least one spatially uniform solution $X_L(\phi, t) = \overline{X}_L$, where

$$\overline{X}_L = \sum_{M=E,I} W_0^{LM} f(\overline{X}_M) + C_L. \tag{3.1}$$
The local stability of the homogeneous state is found by linearizing equation 2.11 about $\overline{X}_L$. Setting $x_L(\phi, t) = X_L(\phi, t) - \overline{X}_L$ and expanding to first order in $x_L$ gives

$$x_L(\phi, t) = \sum_{M=E,I} \gamma_M \int_0^{\pi} \frac{d\phi'}{\pi}\, W_{LM}(\phi - \phi') \int_0^{\infty} d\tau\, \rho(\tau)\, x_M(\phi', t - \tau), \tag{3.2}$$
where $\gamma_M = f'(\overline{X}_M)$ and $f' \equiv df/dX$. Equation 3.2 has solutions of the form

$$x_L(\phi, t) = Z_L\, e^{\nu t} e^{\pm 2in\phi}. \tag{3.3}$$
For each $n$, the eigenvalues $\nu$ and corresponding eigenvectors $\mathbf{Z} = (Z_E, Z_I)^{\mathrm{tr}}$ satisfy the equation

$$\mathbf{Z} = \tilde{\rho}(\nu)\, \widehat{\mathbf{W}}_n \mathbf{Z}, \tag{3.4}$$

where

$$\tilde{\rho}(\nu) \equiv \int_0^{\infty} e^{-\nu\tau} \rho(\tau)\, d\tau = \frac{\alpha^2}{(\alpha + \nu)^2} \tag{3.5}$$

for the alpha function 2.8, and

$$\widehat{\mathbf{W}}_n = \begin{pmatrix} \gamma_E W_n^{EE} & -\gamma_I W_n^{EI} \\ \gamma_E W_n^{IE} & -\gamma_I W_n^{II} \end{pmatrix}. \tag{3.6}$$
The state $\overline{X}_L$ will be stable if all eigenvalues $\nu$ have a negative real part. It follows that the stability of the homogeneous state depends on the particular choice of weight parameters $W_n^{LM}$ and the external inputs $C_L$, which determine the factors $\gamma_L$. On the other hand, stability is independent of the inverse rise time $\alpha$. (This $\alpha$ independence will no longer hold for the IF model; see section 4.) In order to simplify matters further, let us consider a symmetric two-population model in which

$$C_L = C, \qquad W_n^{EE} = W_n^{IE} \equiv W_n^E, \qquad W_n^{EI} = W_n^{II} \equiv W_n^I. \tag{3.7}$$
Equation 3.1 then has a solution $\overline{X}_L = \overline{X}$, where $\overline{X}$ is the unique solution of the equation $\overline{X} = W_0 f(\overline{X}) + C$. We shall assume that $C$ is superthreshold, $h_0 + C > 1$; otherwise $\overline{X} = C$ such that $f(\overline{X}) = 0$. Moreover, $\gamma_L = \gamma \equiv f'(\overline{X})$ for $L = E, I$, so that $\widehat{\mathbf{W}}_n = \gamma \mathbf{W}_n$. Equations 3.4 through 3.6 then reduce to the eigenvalue equation

$$\left( \frac{\nu}{\alpha} + 1 \right)^2 = \gamma \lambda_n^{\pm} \tag{3.8}$$
for integer $n$, where $\lambda_n^{\pm}$ are the eigenvalues of the weight matrix $\mathbf{W}_n$ with corresponding eigenvectors $\mathbf{Z}_n^{\pm}$:

$$\lambda_n^+ = W_n \equiv W_n^E - W_n^I, \qquad \lambda_n^- = 0, \qquad \mathbf{Z}_n^+ = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \mathbf{Z}_n^- = \begin{pmatrix} 1/W_n^E \\ 1/W_n^I \end{pmatrix}. \tag{3.9}$$
It follows from equation 3.8 that $\overline{X}$ is stable with respect to excitation of the $(-)$ modes $\mathbf{Z}_0^-$, $\mathbf{Z}_n^- \cos(2n\phi)$, $\mathbf{Z}_n^- \sin(2n\phi)$. If $\gamma W_n < 1$ for all $n$, then $\overline{X}$ is also stable with respect to excitation of the corresponding $(+)$ modes. This implies that the effect of the superthreshold LGN input $h_L(\phi) = C$ is to switch the network from an inactive state to a stable homogeneous active state. In this parameter regime, sharp orientation tuning curves can be generated only if there is a sufficient degree of angular anisotropy in the afferent inputs to the cortex from the LGN; that is, $\chi$ in equation 2.15 must be sufficiently large (Hansel & Sompolinsky, 1997). This is the classical model of orientation tuning (Hubel & Wiesel, 1962). An alternative mechanism for sharp orientation tuning can be obtained by having strongly modulated cortical interactions such that $\gamma W_n < 1$ for all $n \ne 1$ and $\gamma W_1 > 1$. The homogeneous state then destabilizes due to excitation of the first harmonic modes $\mathbf{Z}_1^+ \cos(2\phi)$, $\mathbf{Z}_1^+ \sin(2\phi)$. Since these modes have a single maximum in the interval $(0, \pi)$, we expect the network to support an activity profile consisting of a solitary peak centered about some angle $\phi_0$, which is the same for both the excitatory and inhibitory populations. For $\phi$-independent external inputs ($\chi = 0$), the location of this center is arbitrary, which reflects the underlying translational invariance of the network. Following Hansel and Sompolinsky (1997), we call such an activity profile a marginal state. The presence of a small angular anisotropy in the inputs ($0 < \chi \ll 1$ in equation 2.15) breaks the translational invariance of the system and locks the location of the center to the orientation corresponding to the peak of the stimulus (Hansel & Sompolinsky, 1997; Ben-Yishai et al., 1997). In contrast to afferent mechanisms of sharp orientation tuning, $\chi$ can be arbitrarily small.
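These stability conditions are easy to illustrate numerically. The sketch below (our own illustration under assumed sample parameter values, not code from the paper) locates the uniform state $\overline{X}$ of equation 3.1 under the symmetry 3.7 by bisection and then evaluates the growth rate from equation 3.8, $\nu = \alpha(\sqrt{\gamma \lambda_n^+} - 1)$, which is negative precisely when $\gamma W_n < 1$.

```python
import numpy as np

h0, kappa = 0.9, 1.0

def f(X):
    """Steady-state rate function of eq. 2.10."""
    d = h0 + X
    return 0.0 if d <= kappa else 1.0 / np.log(d / (d - kappa))

def uniform_state(W0, C, lo=kappa - h0, hi=10.0, iters=200):
    """Bisect F(X) = W0 f(X) + C - X = 0 (eq. 3.1 with the symmetry 3.7);
    F is strictly decreasing for net-inhibitory W0 < 0, so the root is unique."""
    F = lambda X: W0 * f(X) + C - X
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if F(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

W0, C, a = -0.5, 1.0, 2.0            # assumed sample parameters
Xbar = uniform_state(W0, C)
eps = 1e-6
gamma = (f(Xbar + eps) - f(Xbar - eps)) / (2 * eps)   # gamma = f'(Xbar)

def nu(Wn):
    """Leading growth rate of a (+) mode with lambda_n^+ = W_n, from eq. 3.8."""
    return a * (np.sqrt(complex(gamma * Wn)) - 1.0)
```

A mode with $\gamma W_n = 0.5$ decays, while one with $\gamma W_n = 2$ grows, in line with the discussion above.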
From the symmetries imposed by equations 3.7, we can restrict ourselves to the class of solutions $X_L(\phi, t) = X(\phi, t)$, $L = E, I$, such that equation 2.11 reduces to the effective one-population model

$$X(\phi, t) = \int_0^{\pi} \frac{d\phi'}{\pi}\, W(\phi - \phi') \int_0^{\infty} d\tau\, \rho(\tau)\, f(X(\phi', t - \tau)) + C, \tag{3.10}$$
Figure 2: Phase boundary in the $(W_0, W_1)$-plane (solid curve) between a stable homogeneous state (H) and a stable marginal state (M) of the analog model for $C = 1$. Also shown are data points from a direct numerical simulation of a symmetric two-population network consisting of $N = 100$ neurons per population and sinusoidal-like weight kernel $W(\phi) = W_0 + 2W_1 \cos(2\phi)$. (a) Critical coupling for destabilization of the homogeneous state (solid diamonds). (b) Lower critical coupling for persistence of a stable marginal state due to hysteresis (crosses).
with $W(\phi) = W_0 + 2 \sum_{n=1}^{\infty} W_n \cos(2n\phi)$ and $W_n$ fixed such that $\gamma W_n < 1$ for all $n \ne 1$. Treating $W_1$ as a bifurcation parameter, the critical coupling at which the marginal state becomes excited is given by $W_1 = W_{1c}$, where $W_{1c} = 1/f'(\overline{X})$ with $\overline{X}$ dependent on $W_0$ and $C$. This generates a phase boundary curve in the $(W_0, W_1)$-plane that separates a stable homogeneous state and a marginal state. (See the solid curve in Figure 2.) When the phase boundary is crossed from below, the network jumps to a stable marginal state consisting of a sharp orientation tuning curve whose width (equal to $\pi/2$) is invariant across the parameter domain over which it exists. This reflects the fact that the underlying pattern-forming mechanism selects out the first harmonic modes. On the other hand, the height and shape of the tuning curve will depend on the particular choice of weight coefficients $W_n$ as well as other parameters, such as the contrast $C$. Note that sufficiently far from the bifurcation point, additional harmonic modes will be excited, which may modify the width of the tuning curve, leading to a dependence on the weights $W_n$. Indeed, in the case of a semilinear firing-rate function, the width of the sharp tuning curve is found to be dependent on the weights $W_n$ (Hansel & Sompolinsky, 1997), reflecting the absence of an additional nonlinearity selecting out the fundamental harmonic. Examples of typical tuning curves are illustrated in Figure 3a, where we show results of a direct numerical simulation of a symmetric two-population analog network. The steady-state firing rate $f(\phi)$ of the marginal state is plotted as a function of orientation preference $\phi$ in the case of a cosine weight kernel $W(\phi) = W_0 + 2W_1 \cos(2\phi)$. It can be seen that the height of the activity profile increases with the weight $W_1$. Note, however, that if $W_1$ becomes too large, then the marginal state is itself destroyed by amplitude instabilities. The height also increases with contrast $C$. Consistent with the results of Hansel and Sompolinsky (1997) and Ben-Yishai et al. (1997), the center of the activity peak can lock to a slowly rotating, weakly tuned stimulus, as illustrated in Figure 3b. Also shown in Figure 3a is the tuning curve for a Mexican hat weight kernel $W(\phi) = A_1 e^{-\phi^2/2\sigma_1^2} - A_2 e^{-\phi^2/2\sigma_2^2}$ over the range $-\pi/2 \le \phi \le \pi/2$. The constants $\sigma_i, A_i$ are chosen so that the corresponding Fourier coefficients satisfy $\gamma W_1 > 1$ and $\gamma W_n < 1$ for all $n \ne 1$. The shape of the tuning curve is very similar to that obtained using the simple cosine. To reproduce more closely biological tuning curves, which have a sharper peak and broader shoulders than obtained here, it is necessary to introduce some noise into the system (Dimitrov & Cowan, 1998). This smooths the firing-rate function, 2.10, to yield a sigmoid-like nonlinearity. Interestingly, the marginal state is found to exhibit hysteresis in the sense that sharp orientation tuning persists for a range of values of $W_1 < W_{1c}$. That is, over a certain parameter regime, a stable homogeneous state and a stable marginal state coexist (see data points in Figure 2). This suggests that the bifurcation is subcritical, which is indeed found to be the case, as we now explain.
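A direct simulation of this kind is straightforward to reproduce. The sketch below is our own Euler integration of the effective one-population model 3.10, with the first-order synapse of equation 2.13 standing in for the full alpha-function kernel (an assumption for simplicity); the parameters $W_0 = -0.1$, $W_1 = 1.5$, $C = 0.25$ follow Figure 3a. Starting from a weakly perturbed uniform state, the first harmonic grows and saturates into a single sharp peak of activity with roughly half the ring silent.

```python
import numpy as np

h0, kappa = 0.9, 1.0
N = 100
phi = np.pi * np.arange(N) / N

def f(X):
    """Steady-state rate function of eq. 2.10, applied elementwise."""
    d = h0 + X
    out = np.zeros_like(d)
    sup = d > kappa
    out[sup] = 1.0 / np.log(d[sup] / (d[sup] - kappa))
    return out

# Cosine weight kernel and parameters as in Figure 3a (W1 = 1.5 curve).
W0, W1, C, a = -0.1, 1.5, 0.25, 1.0
W = W0 + 2.0 * W1 * np.cos(2.0 * (phi[:, None] - phi[None, :]))

# Synaptic drive dynamics: a^{-1} dY/dt = -Y + f(X), X = conv(W, Y) + C (eqs. 2.13, 3.10).
Y = np.full(N, 0.4) + 1e-3 * np.cos(2 * phi)   # near-uniform start + weak first harmonic
dt = 0.02
for _ in range(15000):                          # integrate to t = 300 (units of tau0)
    X = W @ Y / N + C
    Y += dt * a * (f(X) - Y)

rates = f(W @ Y / N + C)                        # steady-state tuning curve f(phi)
```

The resulting profile is sharply tuned: a large peak coexists with a silent flank, as in the marginal states of Figure 3a.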
In standard treatments of pattern formation, bifurcation theory is used to derive nonlinear ordinary differential equations for the amplitude of the excited modes from which existence and stability can be determined, at least close to the bifurcation point (Cross & Hohenberg, 1993; Ermentrout, 1998). Suppose that the firing-rate function $f$ is expanded about the (nonzero) fixed point $\overline{X}$,

$$f(X) - f(\overline{X}) = \gamma (X - \overline{X}) + g_2 (X - \overline{X})^2 + g_3 (X - \overline{X})^3 + \cdots, \tag{3.11}$$

where $\gamma = f'(\overline{X})$, $g_2 = f''(\overline{X})/2$, $g_3 = f'''(\overline{X})/6$. Introduce the small parameter $\epsilon$ according to $W_1 - W_{1c} = \epsilon^2$ and substitute into equation 3.10 the perturbation expansion

$$X(\phi, t) - \overline{X} = \left[ Z(t) e^{2i\phi} + Z^*(t) e^{-2i\phi} \right] + \mathcal{O}(\epsilon^2), \tag{3.12}$$

where $Z(t), Z^*(t)$ denote a complex conjugate pair of $\mathcal{O}(\epsilon)$ amplitudes.
Figure 3: Sharp orientation tuning curves for a symmetric two-population analog network ($N = 100$) with cosine weight kernel $W(\phi) = W_0 + 2W_1 \cos(2\phi)$. (a) The firing rate $f(\phi)$ is plotted as a function of orientation preference $\phi$ for various $W_1$ ($W_1 = 1.2, 1.5, 1.8$) with $W_0 = -0.1$, $C = 0.25$, and $\alpha = 0.5$. Also shown (solid curve) is the corresponding tuning curve for a Mexican hat weight kernel with $\sigma_1 = 20$ degrees, $\sigma_2 = 60$ degrees, and $A_i = A/\sqrt{2\pi \sigma_i^2}$. (b) Locking of the tuning curve to a slowly rotating, weakly tuned external stimulus. Here $W_0 = -0.1$, $W_1 = 1.5$, $C = 0.25$, $\alpha = 1.0$. The degree of angular anisotropy in the input is $\chi = 0.1$, and the angular frequency of rotation is $\Omega = 0.0017$ (which corresponds to approximately 10 degrees per second).
A standard perturbation calculation (see appendix A and Ermentrout, 1998) shows that $Z(t)$ evolves according to an equation of the form

$$\frac{dZ}{dt} = Z \left( W_1 - W_{1c} + A |Z|^2 \right), \tag{3.13}$$

where

$$A = \frac{3g_3}{\gamma} + \frac{2g_2^2}{\gamma^2} \left[ \frac{2W_0}{1 - \gamma W_0} + \frac{W_2}{1 - \gamma W_2} \right]. \tag{3.14}$$
Since $W_1$ and $A$ are all real, the phase of $Z$ is arbitrary (reflecting a marginal state), whereas the amplitude is given by $|Z| = \sqrt{|W_1 - W_{1c}|/A}$. It is clear that a stable marginal state will bifurcate from the homogeneous state if and only if $A < 0$. A quick numerical calculation shows that $A > 0$ for all $W_0$, $W_2$, and $C$. This implies that the bifurcation is subcritical; that is, the resting state bifurcates to an unstable broadly tuned orientation curve whose amplitude is $\mathcal{O}(\epsilon)$. Since the bifurcating tuning curve is unstable, it is not observed in practice; rather, one finds that the system jumps to a stable, large-amplitude, sharp tuning curve that coexists with the resting state and leads to hysteresis effects, as illustrated schematically in Figure 4a. Note that in the case of sigmoidal firing-rate functions $f(X)$, the bifurcation may be either sub- or supercritical, depending on the relative strengths of recurrent excitation and lateral inhibition (Ermentrout & Cowan, 1979b). If the latter is sufficiently strong, then supercritical bifurcations can occur in which the resting state changes smoothly into a sharply tuned state. However, in the case studied here, the nonlinear firing-rate function $f(X)$ of equation 2.10 is such that the bifurcation is almost always subcritical. The underlying translational invariance of the network means that the center of the tuning curve remains undetermined for a $\phi$-independent LGN input (marginal stability). The center can be selected by including a small degree of angular anisotropy along the lines of equation 2.15. Suppose, in particular, that we replace $C$ on the right-hand side of equation 3.10 by an LGN input of the form $h(\phi, t) = C[1 - \chi + \chi \cos(2[\phi - \omega t])]$ for some slow frequency $\omega$. If we take $\chi = \epsilon^3$, then the amplitude equation 3.13 becomes

$$\frac{dZ}{dt} = Z \left( W_1 - W_{1c} + A |Z|^2 \right) + C e^{-2i\omega t}. \tag{3.15}$$

Writing $Z = r e^{-2i(\theta + \omega t)}$ (with the phase $\theta$ defined in a rotating frame), we obtain the pair of equations

$$\dot{r} = r \left( W_1 - W_{1c} + A r^2 \right) + C \cos 2\theta, \tag{3.16}$$

$$\dot{\theta} = -\omega - \frac{C}{2r} \sin(2\theta). \tag{3.17}$$
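The locking behavior of the phase equation can be demonstrated by integrating equation 3.17 at fixed amplitude $r$ (a sketch with assumed values for $C$, $r$, and $\omega$, not parameters from the paper): a locked phase $\theta^* = -\tfrac{1}{2}\arcsin(2r\omega/C)$ exists provided $|2r\omega/C| \le 1$, while for faster rotation the phase drifts without bound.

```python
import numpy as np

Cf, r = 0.2, 0.5          # forcing strength and (fixed) pattern amplitude, both assumed

def integrate_phase(omega, theta0=0.3, dt=1e-3, steps=200000):
    """Euler-integrate eq. 3.17 at fixed r: theta' = -omega - (Cf/2r) sin(2 theta)."""
    theta = theta0
    for _ in range(steps):
        theta += dt * (-omega - Cf / (2 * r) * np.sin(2 * theta))
    return theta

omega_slow = 0.1                      # |2 r omega / Cf| = 0.5 <= 1: phase locks
theta_locked = integrate_phase(omega_slow)
theta_star = -0.5 * np.arcsin(2 * r * omega_slow / Cf)

omega_fast = 0.3                      # |2 r omega / Cf| = 1.5 > 1: phase drifts
theta_drift = integrate_phase(omega_fast)
```

In the slow case the integrated phase settles onto the predicted fixed point, so the pattern is entrained to the rotating stimulus with a constant lag.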
Figure 4: (a) Schematic diagram illustrating hysteresis effects in the analog model. As the coupling parameter $W_1$ is slowly increased from zero, a critical point $c$ is reached where the homogeneous resting state (H) becomes unstable, and there is a subcritical bifurcation to an unstable, broadly tuned marginal state ($M^-$). Beyond the bifurcation point, the system jumps to a stable, sharply tuned marginal state ($M^+$), which coexists with the other solutions. If the coupling $W_1$ is now slowly reduced, the marginal state $M^+$ persists beyond the original point $c$, so that there is a range of values $W_1(d) < W_1 < W_1(c)$ over which a stable resting state coexists with a stable marginal state (bistability). At point $d$, the states $M^{\pm}$ annihilate each other, and the system jumps back to the resting state. (b) Illustration of an imperfect bifurcation in the presence of a small degree of anisotropy in the LGN input. Thick solid (dashed) lines represent stable (unstable) states.
Thus, provided that $\omega$ is sufficiently small, equation 3.17 will have a stable fixed-point solution in which the phase $\theta$ of the pattern is entrained to the signal. Another interesting consequence of the anisotropy is that it generates an imperfect bifurcation for the (real) amplitude $r$ due to the presence of the $r$-independent term $d = C \cos(2\theta)$ on the right-hand side of equation 3.16. This is illustrated in Figure 4b.

4 Orientation Tuning in IF Model
Our analysis of the analog model in section 3 ignored any details concerning neural spike trains by taking the output of a neuron to be an average ring rate. We now return to the more realistic IF model of equations 2.1 through 2.3 in which the ring times of individual spikes are specied. We wish to identify the analogous mechanism for orientation tuning in the model with spike coding. 4.1 Existence and Stability of Synchronous State. The rst step is to specify what is meant by a homogeneous activity state. We dene a phaselocked solution to be one in which every oscillator resets or res with the same collective period T that must be determined self-consistently. The state of each oscillator can then be characterized by a constant phase gL (w ) with 0 · gL (w ) < 1 and 0 · w < p . The corresponding ring times satisfy TLk (w ) D (k¡gL (w ))T, integer k. Integrating equation 2.1 between two successive ring events and incorporating the reset condition 2.2 leads to the phase equations
$$1 = (1 - e^{-T})(h_0 + C_L) + \sum_{M=E,I}\int_0^{\pi}\frac{d\phi'}{\pi}\,W_{LM}(\phi - \phi')\,K_T\bigl(g_M(\phi') - g_L(\phi)\bigr), \qquad L = E, I, \tag{4.1}$$

where $W_{LM}(\phi)$ is given by equations 2.4 and 2.5, and

$$K_T(g) = e^{-T}\int_0^T dt\, e^{t}\sum_{k\in\mathbb{Z}} r\bigl(t + (k + g)T\bigr). \tag{4.2}$$

Note that for any phase-locked solution of equation 4.1, the distribution of network output activity across the network, as specified by the ISIs

$$\Delta_L^k(\phi) = T_L^{k+1}(\phi) - T_L^k(\phi), \tag{4.3}$$

is homogeneous, since $\Delta_L^k(\phi) = T$ for all $k \in \mathbb{Z}$, $0 \le \phi < \pi$, $L = E, I$. In other words, each phase-locked state plays an analogous role to the homogeneous state $X_L$ of the mean firing-rate model. To simplify our analysis, we shall concentrate on instabilities of the synchronous state $g_L(\phi) = g$, where $g$ is an arbitrary constant. In order to ensure such a solution exists, we impose
Integrate-and-Fire Model of Orientation Tuning
Figure 5: Variation of the collective period $T$ of the synchronous state as a function of $W_0$ for different contrasts $C$ and inverse rise times $\alpha$.
the conditions 3.7, so that equation 4.1 reduces to a single self-consistency equation for the collective period $T$ of the synchronous solution,

$$1 = (1 - e^{-T})C_0 + W_0 K_T(0), \tag{4.4}$$

where we have set $C_0 = h_0 + C$ and $W_0 = W_0^E - W_0^I$. Solutions of equation 4.4 are plotted in Figure 5. The linear stability of the synchronous state can be determined by considering perturbations of the firing times (van Vreeswijk, 1996; Gerstner, van Hemmen, & Cowan, 1996; Bressloff & Coombes, 1998a, 1998b, 2000):

$$T_L^k(\phi) = (k - g)T + \delta_L^k(\phi). \tag{4.5}$$
Integrating equation 2.1 between two successive firing events yields a nonlinear mapping for the firing times that can be expanded as a power series in the perturbations $\delta_L^k(\phi)$. This generates a linear difference equation of the form

$$A_T\bigl[\delta_L^{k+1}(\phi) - \delta_L^k(\phi)\bigr] = \sum_{l=0}^{\infty} G_T(l)\int_0^{\pi}\frac{d\phi'}{\pi}\sum_{M=E,I} W_{LM}(\phi - \phi')\bigl[\delta_M^{k-l}(\phi') - \delta_L^k(\phi)\bigr], \tag{4.6}$$

with

$$G_T(l) = e^{-T}\int_0^T dt\, e^{t}\, r'(t + lT) \tag{4.7}$$
and

$$A_T = C_0 - 1 + W_0\sum_{k\in\mathbb{Z}} r(kT). \tag{4.8}$$

Equation 4.6 has solutions of the form

$$\delta_L^k(\phi) = e^{k\nu} e^{\pm 2in\phi}\,\delta_L, \qquad L = E, I, \tag{4.9}$$

where $\nu \in \mathbb{C}$ and $0 \le \operatorname{Im}\nu < 2\pi$. The eigenvalues $\nu$ and associated eigenvectors $Z = (\delta_E, \delta_I)^{\mathrm{tr}}$ for each integer $n$ satisfy the equation

$$A_T[e^{\nu} - 1]Z = \bigl[\tilde G_T(\nu)\mathbf{W}_n - \tilde G_T(0)\mathbf{D}\bigr]Z, \tag{4.10}$$
where

$$\tilde G_T(\nu) = \sum_{k=0}^{\infty} G_T(k)\, e^{-k\nu} \tag{4.11}$$

and

$$\mathbf{W}_n = \begin{pmatrix} W_n^{EE} & -W_n^{EI} \\ W_n^{IE} & -W_n^{II} \end{pmatrix}, \qquad \mathbf{D} = \begin{pmatrix} W_0^{EE} - W_0^{EI} & 0 \\ 0 & W_0^{IE} - W_0^{II} \end{pmatrix}. \tag{4.12}$$

In order to simplify the subsequent analysis, we impose the same symmetry conditions 3.7 as used in the study of the analog model in section 3. We can then diagonalize equation 4.10 to obtain the result

$$A_T[e^{\nu} - 1] = \tilde G_T(\nu)\lambda_n^{\pm} - \tilde G_T(0)W_0, \tag{4.13}$$
where $\lambda_n^{\pm}$ are the eigenvalues of the weight matrix $\mathbf{W}_n$ (see equation 3.9). Note that one solution of equation 4.13 is $\nu = 0$ and $n = 0$, associated with excitation of the uniform $(+)$ mode. This reflects invariance of the system with respect to uniform phase shifts in the firing times. The synchronous state will be stable if all other solutions of equation 4.13 satisfy $\operatorname{Re}\nu < 0$. Following Bressloff and Coombes (2000), we investigate stability by first looking at the weak coupling regime in which $\mathbf{W}_n = O(\varepsilon)$ for some $\varepsilon \ll 1$. Performing a perturbation expansion in $\varepsilon$, it can be established that the stability of the synchronous state is determined by the nonzero eigenvalues in a small neighborhood of the origin, and these satisfy the equation (to first order in $\varepsilon$) $A_T\nu = (\lambda_n^{\pm} - W_0)\tilde G_T(0)$. For the alpha function 2.8, it can be established that $A_T > 0$, whereas $\tilde G_T(0) \equiv K_T'(0)/T < 0$. Therefore, the synchronous state is stable in the weak coupling regime provided that $W_n > W_0$ for all $n \neq 0$ and $W_0 < 0$. Assuming that these two conditions are satisfied, we now investigate what happens as the strength of coupling $W_1$ is increased for fixed $W_n$, $n \neq 1$. First, it is easy to establish that the
synchronous state cannot destabilize due to a real eigenvalue $\nu$ crossing the origin. In particular, it is stable with respect to excitations of the $(-)$ modes. Thus, any instability will involve one or more complex conjugate pairs of eigenvalues $\nu = \pm iv$ crossing the imaginary axis, signaling (for $v \neq \pi$) the onset of a discrete Hopf bifurcation (or Neimark-Sacker bifurcation) of the firing times due to excitation of a $(+)$ mode $\delta_L^k(\phi) = e^{\pm iv_0 k} e^{\pm 2i\phi}$, $L = E, I$. In the special case $v = \pi$, this reduces to a subharmonic or period-doubling bifurcation, since $e^{\pm i\pi} = -1$.

4.2 Desynchronization Leading to Sharp Orientation Tuning. In order to investigate the occurrence of a Hopf (or period-doubling) bifurcation in the firing times, substitute $\nu = iv$ into equation 4.13 for $n = 1$ with $\lambda_1^+ = W_1$, and equate real and imaginary parts to obtain the pair of equations

$$A_T[\cos(v) - 1] = W_1\tilde C(v) - W_0\tilde C(0), \qquad A_T\sin(v) = -W_1\tilde S(v), \tag{4.14}$$

where

$$\tilde C(v) = \operatorname{Re}\tilde G_T(iv), \qquad \tilde S(v) = -\operatorname{Im}\tilde G_T(iv). \tag{4.15}$$
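The period and phase-boundary calculations in equations 4.4 and 4.14 lend themselves to direct numerical evaluation. The following is a minimal sketch, not code from the paper: the function names and the parameter values ($C_0 = 1.5$, $W_0 = -2$, $\alpha = 2$) are illustrative assumptions. It solves equation 4.4 for $T$ by bisection, builds $\tilde G_T(\nu)$ from equations 4.7 and 4.11 by truncating the sums, and evaluates the quantities entering equations 4.14 and 4.15.

```python
import numpy as np

def r(t, a):
    # Alpha-function synaptic kernel (eq. 2.8): r(t) = a^2 t e^{-a t} for t > 0.
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, a * a * t * np.exp(-a * t), 0.0)

def rprime(t, a):
    # Its derivative, r'(t) = a^2 e^{-a t} (1 - a t) for t > 0.
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, a * a * np.exp(-a * t) * (1.0 - a * t), 0.0)

def trap(y, t):
    # Trapezoidal quadrature on a uniform grid.
    return float(np.sum(0.5 * (y[1:] + y[:-1])) * (t[1] - t[0]))

def K_T0(T, a, kmax=200, n=600):
    # K_T(0) = e^{-T} int_0^T e^t sum_{k>=0} r(t + kT) dt  (eq. 4.2 with g = 0).
    t = np.linspace(0.0, T, n)
    s = sum(r(t + k * T, a) for k in range(kmax))
    return np.exp(-T) * trap(np.exp(t) * s, t)

def solve_T(C0, W0, a, lo=0.05, hi=30.0):
    # Bisection on the self-consistency condition (eq. 4.4).
    f = lambda T: (1.0 - np.exp(-T)) * C0 + W0 * K_T0(T, a) - 1.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if f(lo) * f(mid) <= 0.0 else (mid, hi)
    return 0.5 * (lo + hi)

def G_coeffs(T, a, lmax=200, n=600):
    # G_T(l) = e^{-T} int_0^T e^t r'(t + lT) dt  (eq. 4.7), for l = 0..lmax-1.
    t = np.linspace(0.0, T, n)
    return np.array([np.exp(-T) * trap(np.exp(t) * rprime(t + l * T, a), t)
                     for l in range(lmax)])

def G_tilde(G, nu):
    # Truncated series for tilde G_T(nu) = sum_{l>=0} G_T(l) e^{-l nu} (eq. 4.11).
    return np.sum(G * np.exp(-np.arange(len(G)) * nu))

# Illustrative parameter values (assumptions, not taken from the paper).
C0, W0, a = 1.5, -2.0, 2.0
T = solve_T(C0, W0, a)                                   # collective period, eq. 4.4
A_T = C0 - 1.0 + W0 * float(sum(r(k * T, a) for k in range(1, 200)))  # eq. 4.8
G = G_coeffs(T, a)
Czero = float(G_tilde(G, 0.0))                           # tilde G_T(0) < 0 (section 4.1)

# Scan eq. 4.14: its second equation fixes W1(v) = -A_T sin(v) / tilde S(v);
# a boundary point is a v where the first equation's residual also vanishes.
cands = []
prev = None
for v in np.linspace(0.05, np.pi, 200):
    g = G_tilde(G, 1j * v)
    Cv, Sv = float(g.real), -float(g.imag)               # eq. 4.15
    if abs(Sv) < 1e-12:
        continue
    W1 = -A_T * np.sin(v) / Sv
    resid = A_T * (np.cos(v) - 1.0) - (W1 * Cv - W0 * Czero)
    if prev is not None and prev[1] * resid < 0:
        cands.append(0.5 * (prev[0] + v))                # bracketed root in v
    prev = (v, resid)
```

The critical coupling $W_1^c$ is then the smallest positive $W_1(v)$ over such candidate roots; a bisection in $v$ within each bracket would refine the estimate.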
Explicit expressions for $\tilde C(v)$ and $\tilde S(v)$ in the case of the alpha function delay distribution 2.8 are given in appendix B. Suppose that $\alpha$, $W_0$, and $C_0$ are fixed, with $T = T(\alpha, W_0, C_0)$ the unique self-consistent solution of equation 4.4 (see Figure 5). The smallest value of the coupling parameter, $W_1 = W_1^c$, for which a nonzero solution, $v = v_0 \neq 0$, of the simultaneous equations 4.14 exists is then sought. This generates a phase boundary curve in the $(W_0, W_1)$ plane, as illustrated in Figure 6 for various $C$ and $\alpha$. Phase boundaries in the $(\alpha, W_1)$ plane and the $(C, W_1)$ plane are shown in Figures 7 and 8, respectively. The cusps of the boundary curves in these figures correspond to points where two separate solution branches of equation 4.14 cross; only the lower branch is shown, since this determines the critical coupling for destabilization of the synchronous state. The region below a given boundary curve includes the origin. Weak coupling theory shows that in a neighborhood of the origin, the synchronous state is stable. Therefore, the boundary curve is a locus of bifurcation points separating a stable homogeneous state from a marginal non-phase-locked state. It should also be noted that for each boundary curve in Figures 6 and 7, the left-hand branch signals the onset of a (subcritical) Hopf bifurcation ($v_0 \neq \pi$), whereas the right-hand branch signals the onset of a (subcritical) period-doubling bifurcation ($v_0 = \pi$); the opposite holds true in Figure 8. However, there is no qualitative difference in the observed behavior induced by these two types of instability. Direct numerical simulations confirm that when a phase boundary curve is crossed from below, the IF network jumps to a stable marginal state consisting of a sharp orientation tuning curve as determined by the spatial
Figure 6: Phase boundary in the $(W_0, W_1)$ plane between a stable synchronous state and a marginal state of a symmetric two-population IF network. Boundary curves are shown for $C = 1$ and various values of the inverse rise time $\alpha$. In each case, the synchronous state is stable below a given boundary curve.
distribution of the mean (time-averaged) firing rates $a(\phi)$. The latter are defined according to $a(\phi) = \bar\Delta(\phi)^{-1}$, where

$$\bar\Delta(\phi) = \lim_{M\to\infty}\frac{1}{2M+1}\sum_{m=-M}^{M}\Delta^m(\phi), \tag{4.16}$$
with $\Delta^m(\phi)$ given by equation 4.3. (We do not distinguish between excitatory and inhibitory neurons here, since they behave identically in a symmetric two-population network.) The resulting long-term average behavior of the IF network for $W_1 > W_1^c$ is illustrated in Figure 9a. For the given choice of parameter values, the activity profile is in good quantitative agreement with the corresponding profile obtained in the analog version of the network (see Figure 3). Moreover, as in the case of the analog network, (1) the marginal state exhibits hysteresis (suggesting that the bifurcation is subcritical), and (2) the center of the sharp orientation tuning curve is able to lock to a slowly rotating, weakly tuned external stimulus (see Figure 9b). Comparison of the phase diagrams of the IF model (Figures 6-8) and the corresponding analog model (Figure 2) shows a major difference in the predicted value of the critical coupling $W_1^c$ in the two models. The critical coupling $W_1^c$ in the IF model converges to the corresponding analog result (which is $\alpha$-independent) in the limit $\alpha \to 0$. This is particularly clear in Figure 8. However, for nonzero $\alpha$, we expect good quantitative
Figure 7: Phase boundary in the $(\alpha, W_1)$ plane between a stable homogeneous state and a marginal state of a symmetric two-population IF network. Boundary curves are shown for $C = 0.25$ and various values of $W_0$.
agreement between the two models only when (1) there is an approximate balance between the mean excitatory and inhibitory coupling ($W_0 \approx 0$) and (2) synaptic interactions are sufficiently slow ($\alpha \ll 1$). Both conditions hold in Figure 9. Outside this parameter regime, the higher value of $W_1^c$ for the IF model leads to relatively high levels of mean firing rates in the activity profile of the marginal state, since large values of $W_1$ imply strong short-range excitation. The condition $W_0 \approx 0$ can be understood in terms of the self-consistency condition for the collective period $T$ of the synchronous state given by equation 4.4. As $W_0$ becomes more negative, the period $T$ increases (see Figure 5), so that $\alpha T > 1$ and the reduction to the analog model is no longer a good approximation. (In Bressloff & Coombes, 2000, the period $T$ is kept fixed by varying the external input.)

4.3 Quasiperiodicity and Interspike Interval Variability. We now explore in more detail the nature of the spatiotemporal dynamics of the ISIs occurring in the marginal state of the IF model. For concreteness, we consider a symmetric two-population network with $W_n = 0$ for all $n \ge 2$. We shall proceed by fixing the parameters $W_0$, $W_1$, and $C$ such that a sharp orientation tuning curve exists and considering what happens as $\alpha$ is increased. In Figure 10 we plot the ISI pairs $(\Delta^{n-1}(\phi), \Delta^n(\phi))$, integer $n$, for all the excitatory neurons in the ring. It can be seen from Figure 10a that for relatively slow synapses ($\alpha = 2$), the temporal fluctuations in the ISIs are negligible. There exists a set of spatially separated points reflecting the
Figure 8: Phase boundary in the $(C, W_1)$ plane between a stable synchronous state and a marginal state of a symmetric two-population IF network. Boundary curves are shown for $W_0 = -0.1$ and various values of the inverse rise time $\alpha$. The solid curve corresponds to the analog model, whereas the dashed curves correspond to the IF model.
$\phi$-dependent variations in the mean firing rates $a(\phi)$, as shown in Figure 9. However, as $\alpha$ is increased, the system bifurcates to a state with periodic or quasiperiodic fluctuations of the ISIs on spatially separated invariant circles (see Figure 10b). Although the average behavior is still consistent with that found in the corresponding analog model, the additional fine structure associated with these orbits is not resolved by the analog model. In order to characterize the size of the ISI fluctuations, we define a deterministic coefficient of variation $C_V(\phi)$ for a neuron $\phi$ according to

$$C_V(\phi) = \frac{\sqrt{\overline{\bigl(\Delta(\phi) - \bar\Delta(\phi)\bigr)^2}}}{\bar\Delta(\phi)}, \tag{4.17}$$

with averages defined by equation 4.16. The $C_V(\phi)$ is plotted as a function of $\phi$ for various values of $\alpha$ in Figure 11a. This shows that the relative size of the deterministic fluctuations in the mean firing rate is an increasing function of $\alpha$. For slow synapses ($\alpha \to 0$), the $C_V$ is very small, indicating an excellent match between the IF and analog models. However, the fluctuations become much more significant when the synapses are fast. This is further illustrated in Figure 11b, where we plot the variance of the ISIs against the mean. The variance increases monotonically with the mean, which shows why groups of neurons close
Figure 9: Sharp orientation tuning curves for a symmetric two-population IF network ($N = 100$) and cosine weight kernel. (a) The mean firing rate $a(\phi)$ is plotted as a function of orientation preference $\phi$ for various $W_1$. Same parameter values as Figure 3a. (b) Locking to a slowly rotating, weakly tuned external stimulus. Same parameters as Figure 3b.
to the edge of the activity profile in Figure 11a (so that the mean firing rate is relatively small) have a larger $C_V$. An interpretation of this behavior is that neurons with a low firing rate are close to threshold (i.e., there is an approximate balance between excitation and inhibition), which means that these neurons are more sensitive to fluctuations. Another observation from Figure 11b is that the fluctuations decrease with the size of the network; this appears to be a finite-size effect, since there is only a weak dependence on $N$ for $N > 200$.
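The ISI statistics used here can be computed directly from a simulated spike train. Below is a minimal sketch (the function names are ours, not the paper's) of the mean, variance, and $C_V$ statistics of equations 4.16 and 4.17, together with the discrete power spectrum of equation 4.18 used in the chaos diagnostics later in this section.

```python
import numpy as np

def isi_stats(firing_times):
    # ISIs Delta^k = T^{k+1} - T^k (eq. 4.3); returns the mean ISI, the ISI
    # variance, and the deterministic coefficient of variation (eq. 4.17),
    # C_V = sqrt(<(Delta - <Delta>)^2>) / <Delta>.
    d = np.diff(np.asarray(firing_times, dtype=float))
    mean = d.mean()
    return mean, d.var(), d.std() / mean

def isi_spectrum(isis):
    # Power spectrum |h(p)|^2 with h(p) = M^{-1/2} sum_{k=1}^M e^{2 pi i k p} Delta^k
    # and p = m/M, m = 1, ..., M (eq. 4.18).
    d = np.asarray(isis, dtype=float)
    M = len(d)
    k = np.arange(1, M + 1)
    p = k / M
    h = np.array([np.sum(np.exp(2j * np.pi * k * pm) * d) for pm in p]) / np.sqrt(M)
    return p, np.abs(h) ** 2

# A perfectly periodic train (the synchronous state) has C_V = 0 and all
# spectral power concentrated at p = 1; quasiperiodic or chaotic ISI
# sequences spread power across p.
mean, var, cv = isi_stats(np.arange(0.0, 10.0, 0.5))   # ISIs all equal to 0.5
p, power = isi_spectrum(np.ones(8))
```

The mean firing rate of equation 4.16 is recovered as `1.0 / mean`.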
Figure 10: Separation of the ISI orbits for a symmetric two-population IF network ($N = 100$) with $W_0 = -0.5$, $W_1 = 2.0$, $C = 0.25$. (a) $\alpha = 2$. (b) $\alpha = 5$. The (projected) attractors of the ISI pairs with coordinates $(\Delta^{n-1}(\phi), \Delta^n(\phi))$ are shown for all excitatory neurons.
In Figure 12 we plot $(\Delta^{n-1}(\phi), \Delta^n(\phi))$ for two selected neurons: one with a low $C_V$/fast firing rate and the other with a high $C_V$/low firing rate. Results are shown for $\alpha = 8$ and $\alpha = 20$. Corresponding ISI histograms are presented in Figure 13. These figures establish that for fast synapses, the ISIs display highly irregular orbits, with the invariant circles of Figure 10b no longer present. An interesting question that we hope to pursue elsewhere concerns whether the underlying attractor for the ISIs supports chaotic dynamics, for it is well known that the breakup of invariant circles is one possible route to chaos, as has previously been demonstrated in a num-
Figure 11: (a) Plot of the coefficient of variation $C_V(\phi)$ as a function of $\phi$ for various values of the inverse rise time $\alpha$. (b) Plot of variance against the mean of the ISIs for various values of inverse rise time $\alpha$ and network size $N$. All other parameter values are as in Figure 10.
ber of classical fluid dynamics experiments (Bergé, Dubois, & Vidal, 1986). As partial evidence for chaotic ISI dynamics, we plot the power spectrum $|h(p)|^2$ of a neuron with high $C_V$ in Figure 14. We consider a sequence of ISIs over $M + 1$ firing events $\{\Delta^k, k = 1, \ldots, M\}$ and define

$$h(p) = \frac{1}{\sqrt{M}}\sum_{k=1}^{M} e^{2\pi i k p}\,\Delta^k, \tag{4.18}$$

where $p = m/M$ for $m = 1, \ldots, M$. Although the numerical data are rather
Figure 12: Same as Figure 10 except that (a) $\alpha = 8$ and (b) $\alpha = 20$. The attractor is shown for two excitatory neurons: one with a low $C_V$ and the other with a high $C_V$.
noisy, it does appear that there is a major difference between the quasiperiodic regime shown in Figure 10 and the high-$C_V$ regime of Figure 12. The latter possesses a broadband spectrum, which is indicative of (but not conclusive evidence for) chaos. There is currently considerable interest in possible mechanisms for the generation of high $C_V$s in networks of cortical neurons. This follows the recent observation by Softky and Koch (1993) of high variability of the ISIs in data from cat V1 and MT neurons. Such variability is inconsistent with the standard notion that a neuron integrates over a large number of small
Figure 13: Histograms of 1000 firing events for selected excitatory neurons in a symmetric two-population IF network: (a) $\alpha = 8$, low $C_V$. (b) $\alpha = 8$, high $C_V$. (c) $\alpha = 20$, low $C_V$. (d) $\alpha = 20$, high $C_V$. All other parameter values are as in Figure 10.
(excitatory) inputs, since this would lead to a regular firing pattern (as a consequence of the law of large numbers). Two alternative approaches to resolving this dilemma have been proposed. The first treats cortical neurons as coincidence detectors that fire in response to a relatively small number of excitatory inputs arriving simultaneously at the (sub)millisecond level; a suggested mechanism for the amplification of a neuron's response is active dendrites (Softky & Koch, 1993). The second approach retains the picture of a neuron as a synaptic integrator by incorporating a large number of inhibitory inputs to balance the effects of excitation, so that a neuron operates close to threshold (Usher et al., 1994; Tsodyks & Sejnowski, 1995; van Vreeswijk & Sompolinsky, 1996, 1998). These latter studies incorporate a degree of randomness into a network, either in the form of quenched disorder in the coupling or through synaptic noise (see also Lin, Pawelzik, Ernst, & Sejnowski, 1998). We have shown that even an ordered network evolving deterministically can support large fluctuations in the ISIs, provided that the synapses are sufficiently fast. For the activity profile shown in Figure 11a,
Figure 14: (a) Spectrum of a neuron in the quasiperiodic regime ($\alpha = 5$). (b) Spectrum of a high-$C_V$ neuron with $\alpha = 20$. All other parameter values are as in Figure 10.
only a subset of neurons have a $C_V$ close to unity. In order to achieve a consistently high $C_V$ across the whole network, it is likely that one would need to take into account various sources of disorder present in real biological networks.

5 Intrinsic Traveling Wave Profiles
So far we have restricted our analysis of orientation tuning to the case of a symmetric two-population model in which the excitatory and inhibitory populations behave identically. Sharp orientation tuning can still occur if the symmetry conditions 3.8 are no longer imposed, provided that the first harmonic eigenmodes are excited when the homogeneous resting state is
destabilized ($n = 1$ in equation 3.3 or 4.9). Now, however, the heights of the excitatory and inhibitory activity profiles will generally differ. More significant, it is possible for traveling wave profiles to occur even in the absence of a rotating external stimulus (Ben-Yishai et al., 1997). In order to illustrate this phenomenon, suppose that equation 3.7 is replaced by the conditions

$$C_L = C, \qquad \mathbf{W}_0 = \begin{pmatrix} W_0^E & -W_0^I \\ W_0^E & -W_0^I \end{pmatrix}, \qquad \mathbf{W}_1 = \begin{pmatrix} W & -K \\ K & -W \end{pmatrix}, \tag{5.1}$$
with $K > W$ and $W_n^{LM} = 0$ for all $n \ge 2$. First consider the analog model analyzed in section 3. Under the conditions 5.1, the fixed-point equation 3.1 has solutions $X_L = \bar X$, $L = E, I$, where $\bar X = W_0 f(\bar X) + C$ with $W_0 = W_0^E - W_0^I$. Linearization about this fixed point again yields the eigenvalue equation 3.8, with the eigenvalues of the weight matrices $\mathbf{W}_n$ of equation 5.1 now given by

$$\lambda_0^+ = W_0, \qquad \lambda_0^- = 0, \qquad \lambda_1^{\pm} = \pm iW_1, \quad W_1 \equiv \sqrt{K^2 - W^2}. \tag{5.2}$$

Equations 3.8 and 5.2 imply that the fixed point $\bar X$ is stable with respect to excitation of the $n = 0$ modes provided that $\gamma W_0 < 1$. On the other hand, substituting for $\lambda_1^{\pm}$ using equation 5.2 shows that there exist solutions for $\nu$ of the form

$$\nu = \alpha\left[-1 + \sqrt{\frac{\gamma W_1}{2}}\,(1 \pm i)\right]. \tag{5.3}$$
Hence, at a critical coupling $W_1 = W_1^c \equiv 2/\gamma$, the fixed point undergoes a Hopf bifurcation due to the occurrence of a pair of pure imaginary eigenvalues $\nu = \pm i\alpha$. The excited modes take the form $Z_{\pm} e^{i(2\phi \pm \alpha t)}$, where $Z_{\pm}$ are the eigenvectors associated with $\lambda_1^{\pm}$, and correspond to traveling wave activity profiles with rotation frequency $\alpha$. Let us now turn to the case of the IF model analyzed in section 4. Linear stability analysis of the synchronous state generates the eigenvalue equation 4.13 with $\lambda_n^{\pm}$ given by equation 5.2. Following similar arguments to section 3, it can be established that the synchronous state is stable in the weak coupling regime provided that $W_0 < 0$. It is also stable with respect to excitations of the $n = 0$ modes for arbitrary coupling strength. Therefore, we look for strong coupling instabilities of the synchronous state by substituting $\nu = iv$ in equation 4.13 for $n = 1$. Equating real and imaginary parts then leads to the pair of equations

$$A_T[\cos(v) - 1] = W_1\tilde S(v) - W_0\tilde C(0), \qquad A_T\sin(v) = W_1\tilde C(v). \tag{5.4}$$
Figure 15: Phase boundary in the $(W_0, W_1)$ plane between a stable synchronous state and a traveling marginal state of an asymmetric two-population IF network. Boundary curves are shown for $C = 0.25$ and various values of the inverse rise time $\alpha$. For each $\alpha$ there are two solution branches, the lower one of which determines the point of instability of the synchronous state. The boundary curve for the analog model is also shown (thick line).
Examples of boundary curves are shown in Figure 15 for $W_1$ as a function of $W_0$ with $\alpha$ fixed. For each $\alpha$, there are two solution branches, the lower one of which determines the point at which the synchronous state destabilizes. At first sight it would appear that there is a discrepancy between the IF and analog models, for the lower boundary curves of the IF model do not coincide with the phase boundary of the analog model, even in the limit $\alpha \to 0$ and $W_0 \to 0$. (Contrast Figure 15 with Figure 8, for example.) However, numerical simulations reveal that as the lower boundary curve of the IF model is crossed from below, the synchronous state destabilizes to a state consisting of two distinct synchronized populations: one excitatory and the other inhibitory. The latter state itself destabilizes for values of $W_1$ beyond the upper boundary curve, leading to the formation of an intrinsically rotating orientation tuning curve. In Figure 16 we plot the spike train of one of the IF neurons associated with such a state. The neuron clearly exhibits approximately periodic bursts at an angular frequency $\Omega \approx \pi/6\tau_0$, where the time constant $\tau_0$ is the fundamental unit of time. This implies that orientation selectivity can shift over a few tens of milliseconds, an effect that has recently been observed by Ringach et al. (1997).
Figure 16: Oscillating spike train of an excitatory neuron in a traveling marginal state of an asymmetric two-population IF network, plotted over $100 \le t \le 120$ (time in units of $\tau_0$). Here $W_0 = -1.0$, $W_1 = 4.2$, $W_0^{EE} = 42.5$, $W_1^{EE} = 20.0 = J$, $N = 50$, $C = 1.0$, and $\alpha = 2.0$.
6 Discussion
Most analytical treatments of network dynamics in computational neuroscience are based on analog models in which the output of a neuron is taken to be a mean firing rate (interpreted in terms of either population or time averaging). Techniques such as linear stability analysis and bifurcation theory are used to investigate strong coupling instabilities in these networks, which induce transitions to states with complex spatiotemporal activity patterns (see Ermentrout, 1998, for a review). Often a numerical comparison is made with the behavior of a corresponding network of spiking neurons based on the IF model or, perhaps, a more detailed biophysical model such as Hodgkin-Huxley. Recently Bressloff and Coombes (2000) developed analytical techniques for studying strong coupling instabilities in IF networks, allowing a more direct comparison with analog networks to be made. We have applied this dynamical theory of strongly coupled spiking neurons to a simple computational model of sharp orientation tuning in a cortical hypercolumn. A number of specific results were obtained that raise interesting issues for further consideration. Just as in the analog model, bifurcation theory can be used to study orientation tuning in the IF model. The main results are similar, in that the coefficients of the interaction function $W(\phi) = W_0 + 2W_1\cos(2\phi)$ play a key role in determining orientation tuning. If the cortical interaction is weak (both $W_0$ and $W_1$ small), then, as may be expected, only biases in the geniculocortical map can give rise to orientation tuning. However, if $W_1$ is sufficiently large, then only a weak geniculocortical bias is needed to produce sharp orientation tuning at a given orientation.
Such a tuning is produced via a subcritical bifurcation, which implies that the switching on and off of cortical cells should exhibit hysteresis (Wilson & Cowan, 1972, 1973). Thus, both tuned and untuned states could coexist under certain circumstances, in both the analog and the IF model. It is of interest that contrast- and orientation-dependent hysteresis has been observed in simple cells by Bonds (1991). Strong coupling instabilities in the IF model involve a discrete Hopf or period-doubling bifurcation in the firing times. This typically leads to non-phase-locked behavior characterized by clustered but irregular variations of the ISIs. These clusters are spatially separated in phase space, which results in a localized activity pattern for the mean firing rates across the network that is modulated by fluctuations in the instantaneous firing rate. The latter can generate $C_V$s of order unity. Moreover, for fast synapses, the underlying attractor for the ISIs as illustrated in Figure 12 appears to support chaotic dynamics. Evidence for chaos in disordered (rather than ordered) balanced networks has been presented elsewhere (van Vreeswijk & Sompolinsky, 1998), although caution has to be taken in ascribing chaotic behavior to potentially high-dimensional dynamical systems. Another interesting question is to what extent the modulations of mean firing rate observed in the responses of some cat cortical neurons to visual stimuli by Gray and Singer (1989) can be explained by discrete bifurcations of the firing times in a network of spiking neurons. In cases where the excitatory and inhibitory populations comprising the ring are not identical, traveling wave profiles can occur in response to a fixed stimulus. This implies that peak orientation selectivity can shift over a few tens of milliseconds. Such an effect has been observed by Ringach et al. (1997) and again occurs in both the analog and IF models.
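The role of the interaction coefficients summarized above can be illustrated with a minimal analog ring simulation. The sketch below is our own construction rather than the paper's model: the sigmoidal rate function, the parameter values, and the weak bias term seeding the first harmonic are all illustrative assumptions, chosen so that the uniform state sits where the gain is $\gamma = 1$. The homogeneous state then survives for $\gamma W_1 < 1$ and gives way to a sharply tuned profile for $\gamma W_1 > 1$.

```python
import numpy as np

def simulate_ring(W0, W1, C=1.5, N=100, eps=0.01, dt=0.05, steps=4000):
    # First-order analog ring model with cosine coupling,
    #   dX/dt = -X + W * f(X) + C + eps * cos(2 phi),
    # where W(phi) = W0 + 2 W1 cos(2 phi) and the eps term is a weak
    # orientation bias that seeds the first harmonic (2 phi) mode.
    phi = np.pi * np.arange(N) / N                  # orientations in [0, pi)
    c2, s2 = np.cos(2 * phi), np.sin(2 * phi)
    f = lambda x: 1.0 / (1.0 + np.exp(-4.0 * (x - 1.0)))   # gain f'(1) = 1
    X = np.ones(N)                                  # start at the uniform state
    for _ in range(steps):
        rate = f(X)
        conv = (W0 * rate.mean()
                + 2 * W1 * (c2 * np.mean(c2 * rate) + s2 * np.mean(s2 * rate)))
        X = X + dt * (-X + conv + C + eps * c2)
    return X

# With W0 = -1 and C = 1.5 the uniform fixed point is X = 1 (since f(1) = 0.5).
flat = simulate_ring(W0=-1.0, W1=0.5)    # below threshold: nearly uniform
tuned = simulate_ring(W0=-1.0, W1=2.0)   # above threshold: sharp activity bump
```

Here `np.ptp(tuned)` is large while `np.ptp(flat)` stays of the order of the bias, mirroring the coexisting sharply tuned marginal state and the stable homogeneous state on either side of the phase boundary.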
Certain care has to be taken in identifying parameter regimes over which particular forms of network operation occur, since phase diagrams constructed for analog networks differ considerably from those of IF networks (see Figure 8). For example, good quantitative agreement between the localized activity profiles of the IF and analog models holds only when there is an approximate balance in the mean excitatory and inhibitory coupling and the synaptic interactions are sufficiently slow.
Appendix A
Let $X(\phi, t) = \bar X$ be a homogeneous fixed point of the one-population model described by equation 3.9, which we rewrite in the more convenient form:

$$\frac{1}{\alpha^2}\frac{\partial^2 X(\phi, t)}{\partial t^2} + \frac{2}{\alpha}\frac{\partial X(\phi, t)}{\partial t} + X(\phi, t) = \int_0^{\pi}\frac{d\phi'}{\pi}\,W(\phi - \phi')\,f\bigl(X(\phi', t)\bigr) + C. \tag{A.1}$$

Suppose that the weight distribution $W(\phi)$ is given by

$$W(\phi) = W_0 + 2\sum_{k=1}^{\infty} W_k\cos(2k\phi). \tag{A.2}$$

Expand the nonlinear firing-rate function about the fixed point $\bar X$ as in equation 3.11. Define the linear operator

$$\hat L X = \frac{1}{\alpha^2}\frac{\partial^2 X}{\partial t^2} + \frac{2}{\alpha}\frac{\partial X}{\partial t} + X - \gamma\, W * X \tag{A.3}$$

in terms of the convolution $W * X(\phi) = \int_0^{\pi} W(\phi - \phi')X(\phi')\,d\phi'/\pi$. The operator $\hat L$ has a discrete spectrum, with eigenvalues $\nu$ satisfying (cf. equation 3.7)

$$\left(1 + \frac{\nu}{\alpha}\right)^2 = \gamma W_k \tag{A.4}$$

and corresponding eigenfunctions

$$X(\phi, t) = z_k e^{2ik\phi} + z_k^* e^{-2ik\phi}. \tag{A.5}$$
Suppose that $W_n$ for some $n \neq 0$ is chosen as a bifurcation parameter and $\gamma W_k < 1$ for all $k \neq n$. Then $W_n = W_{n,c} \equiv \gamma^{-1}$ is a bifurcation point for destabilization of the homogeneous state $\bar X$ due to excitation of the $n$th modes. Set $W_n - W_{n,c} = \varepsilon^2$, $W_c(\phi) = W(\phi)|_{W_n = W_{n,c}}$, and perform the following perturbation expansion:

$$X = \bar X + \varepsilon X_1 + \varepsilon^2 X_2 + \cdots \tag{A.6}$$

Introduce a slow timescale $\tau = \varepsilon^2 t$ and collect terms with equal powers of
$\varepsilon$. This leads to a hierarchy of equations of the form

$$O(1):\quad \bar X = W_0 f(\bar X) + C \tag{A.7}$$

$$O(\varepsilon):\quad \hat L_c X_1 = 0 \tag{A.8}$$

$$O(\varepsilon^2):\quad \hat L_c X_2 = g_2\, W_c * X_1^2 \tag{A.9}$$

$$O(\varepsilon^3):\quad \hat L_c X_3 = g_3\, W_c * X_1^3 + 2g_2\, W_c * X_1 X_2 - \frac{2}{\alpha}\frac{\partial X_1}{\partial \tau} - \gamma\, f_n * X_1, \tag{A.10}$$

where $\hat L_c X = X - \gamma W_c * X$ and $f_n(\phi) = \cos(2n\phi)$. The $O(1)$ equation determines the fixed point $\bar X$. The $O(\varepsilon)$ equation has solutions of the form

$$X_1(\phi, \tau) = z(\tau)e^{2in\phi} + z^*(\tau)e^{-2in\phi}. \tag{A.11}$$

We shall determine a dynamical equation for the complex amplitude $z(\tau)$ by deriving so-called solvability conditions for the higher-order equations. We proceed by taking the inner product of equations A.9 and A.10 with the linear eigenmode A.11. We define the inner product of two periodic functions $U$, $V$ according to $\langle U|V\rangle = \int_0^{\pi} U^*(\phi)V(\phi)\,d\phi/\pi$. The $O(\varepsilon^2)$ solvability condition $\langle X_1|\hat L_c X_2\rangle = 0$ is automatically satisfied, since

$$\langle X_1|W_c * X_1^2\rangle = 0. \tag{A.12}$$

The $O(\varepsilon^3)$ solvability condition $\langle X_1|\hat L_c X_3\rangle = 0$ gives

$$\left\langle X_1\,\middle|\,\frac{2}{\alpha}\frac{\partial X_1}{\partial \tau} - \gamma f_n * X_1\right\rangle = g_3\langle X_1|W_c * X_1^3\rangle + 2g_2\langle X_1|W_c * X_1 X_2\rangle. \tag{A.13}$$
First, we have

$$\left\langle e^{2in\phi}\,\middle|\,\frac{2}{\alpha}\frac{\partial X_1}{\partial \tau} - \gamma f_n * X_1\right\rangle = \frac{2}{\alpha}\frac{dz}{d\tau} - \gamma z \tag{A.14}$$

and

$$\langle e^{2in\phi}|W_c * X_1^3\rangle = W_{n,c}\int_0^{\pi}\frac{d\phi}{\pi}\bigl(ze^{2in\phi} + z^*e^{-2in\phi}\bigr)^3 e^{-2in\phi} = 3W_{n,c}\,z|z|^2. \tag{A.15}$$
The next step is to determine $X_2$. From equation A.9, we have

$$X_2(\phi) - \gamma\int_0^{\pi}\frac{d\phi'}{\pi}\,W_c(\phi - \phi')X_2(\phi') = g_2\int_0^{\pi}\frac{d\phi'}{\pi}\,W_c(\phi - \phi')X_1(\phi')^2 = g_2\bigl[z^2 W_{2n}e^{4in\phi} + z^{*2}W_{2n}e^{-4in\phi} + 2|z|^2 W_0\bigr]. \tag{A.16}$$

Let

$$X_2(\phi) = V_+ e^{4in\phi} + V_- e^{-4in\phi} + V_0 + \kappa X_1(\phi). \tag{A.17}$$

The constant $\kappa$ remains undetermined at this order of perturbation but does not appear in the amplitude equation for $z(\tau)$. Substituting equation A.17 into A.16 yields

$$V_+ = \frac{g_2 z^2 W_{2n}}{1 - \gamma W_{2n}}, \qquad V_- = \frac{g_2 z^{*2} W_{2n}}{1 - \gamma W_{2n}}, \qquad V_0 = \frac{2g_2|z|^2 W_0}{1 - \gamma W_0}. \tag{A.18}$$
Using equation A.18, we find that

$$\langle e^{2in\phi}|W_c * X_1 X_2\rangle = W_{n,c}\int_0^{\pi}\frac{d\phi}{\pi}\bigl(ze^{2in\phi} + z^*e^{-2in\phi}\bigr)\bigl(V_+e^{4in\phi} + V_-e^{-4in\phi} + V_0 + \kappa X_1(\phi)\bigr)e^{-2in\phi} = W_{n,c}\bigl[z^*V_+ + zV_0\bigr] = g_2 W_{n,c}\,z|z|^2\left[\frac{W_{2n}}{1 - \gamma W_{2n}} + \frac{2W_0}{1 - \gamma W_0}\right]. \tag{A.19}$$

Combining equations A.14, A.15, and A.19, we obtain the Stuart-Landau equation,

$$\frac{dz}{d\tau} = z\bigl(1 + A|z|^2\bigr), \tag{A.20}$$

where we have absorbed a factor $\gamma\alpha/2$ into $\tau$ and

$$A = \frac{3g_3}{\gamma^2} + \frac{2g_2^2}{\gamma^2}\left[\frac{W_{2n}}{1 - \gamma W_{2n}} + \frac{2W_0}{1 - \gamma W_0}\right]. \tag{A.21}$$

After rescaling, $\varepsilon z = Z$, $\tau = \varepsilon^2 t$, this becomes

$$\frac{dZ}{dt} = Z\bigl(W_n - W_{n,c} + A|Z|^2\bigr). \tag{A.22}$$
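The character of the bifurcation described by equation A.22 can be checked by direct integration. The sketch below is an illustration of ours, not part of the paper; the values of $\mu = W_n - W_{n,c}$ and $A$ are arbitrary. It shows the subcritical scenario $A > 0$ that underlies the hysteresis of Figure 4: below threshold, small perturbations of the rest state decay, while sufficiently large ones escape.

```python
def amplitude_flow(Z0, mu, A, dt=1e-3, steps=5000):
    # Forward-Euler integration of the rescaled amplitude equation (A.22):
    #   dZ/dt = Z (mu + A |Z|^2),  with mu = W_n - W_{n,c}.
    Z = complex(Z0)
    for _ in range(steps):
        Z = Z + dt * Z * (mu + A * abs(Z) ** 2)
        if abs(Z) > 1e6:        # amplitude has clearly escaped to large values
            break
    return abs(Z)

# Subcritical case A > 0, below threshold (mu < 0): the unstable branch
# |Z| = sqrt(-mu/A) separates decaying from escaping trajectories.
small = amplitude_flow(0.01, mu=-0.5, A=1.0)   # inside the basin: decays
large = amplitude_flow(2.0, mu=-0.5, A=1.0)    # outside the basin: runs away
```

The coexistence of the decaying and escaping branches is the amplitude-equation counterpart of the bistability between the resting and marginal states.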
Appendix B
In this appendix, we write explicit expressions for various quantities evaluated in the particular case of the alpha function 2.8. First, the phase interaction function defined by equation 4.2 becomes

$$K_T(\phi) = \frac{\alpha^2(1 - e^{-T})}{(1 - \alpha)(1 - e^{-\alpha T})}\Bigl[\bigl(a_1(T) + T\phi\bigr)e^{-\alpha T\phi} + a_2(T)\,e^{-T\phi}\Bigr], \tag{B.1}$$

where

$$a_1(T) = \frac{Te^{-\alpha T}}{1 - e^{-\alpha T}} - \frac{1}{1 - \alpha}, \qquad a_2(T) = \frac{1}{1 - \alpha}\,\frac{1 - e^{-\alpha T}}{1 - e^{-T}}. \tag{B.2}$$
Second, equations 2.8 and 4.7 give

$$\tilde G_T(\nu) = \frac{F_0(T)}{1 - e^{-\alpha T - \nu}} - \frac{F_1(T)\,e^{-\alpha T - \nu}}{\bigl(1 - e^{-\alpha T - \nu}\bigr)^2}, \tag{B.3}$$

where

$$F_0(T) = \frac{\alpha^2 e^{-T}}{(1 - \alpha)^2}\Bigl[\bigl(1 + \alpha^2 T - \alpha T\bigr)e^{(1-\alpha)T} - 1\Bigr], \tag{B.4}$$

$$F_1(T) = \frac{T\alpha^3 e^{-T}}{1 - \alpha}\Bigl[e^{(1-\alpha)T} - 1\Bigr]. \tag{B.5}$$
Finally, setting \(\nu = iv\) in equation B.3 and taking real and imaginary parts leads to the result
\[
\tilde{G}_C(v) = a(v) C(v) - b(v) \Big( \cos(v)\big[C(v)^2 - S(v)^2\big] - 2 \sin(v)\, C(v) S(v) \Big), \tag{B.6}
\]
\[
\tilde{G}_S(v) = a(v) S(v) - b(v) \Big( \sin(v)\big[C(v)^2 - S(v)^2\big] + 2 \cos(v)\, C(v) S(v) \Big), \tag{B.7}
\]
where
\[
a(v) = \frac{F_0(T)}{C(v)^2 + S(v)^2}, \qquad
b(v) = \frac{F_1(T)\, e^{-aT}}{\big[C(v)^2 + S(v)^2\big]^2},
\]
and
\[
C(v) = 1 - e^{-aT} \cos(v), \qquad S(v) = e^{-aT} \sin(v). \tag{B.8}
\]
Acknowledgments
This work was supported by grants from the Leverhulme Trust (P.C.B.) and the McDonnell Foundation (J.D.C.).
References
Amit, D. J., & Tsodyks, M. V. (1991). Quantitative study of attractor neural networks retrieving at low spike rates I: Substrate-spikes, rates and neuronal gain. Network, 2, 259–274.
Ben-Yishai, R., Bar-Or Lev, R., & Sompolinsky, H. (1995). Theory of orientation tuning in visual cortex. Proc. Natl. Acad. Sci. (USA), 92, 3844–3848.
Ben-Yishai, R., Hansel, D., & Sompolinsky, H. (1997). Traveling waves and the processing of weakly tuned inputs in a cortical network module. J. Comput. Neurosci., 4, 57–77.
Bergé, P., Dubois, M., & Vidal, C. (1986). Order within chaos. New York: Wiley.
Bonds, A. B. (1991). Temporal dynamics of contrast gain control in single cells of the cat striate cortex. Visual Neuroscience, 6, 239–255.
Bressloff, P. C., & Coombes, S. (1997). Physics of the extended neuron. Int. J. Mod. Phys., 11, 2343–2392.
Bressloff, P. C., & Coombes, S. (1998a). Desynchronization, mode-locking and bursting in strongly coupled IF oscillators. Phys. Rev. Lett., 81, 2168–2171.
Bressloff, P. C., & Coombes, S. (1998b). Spike train dynamics underlying pattern formation in integrate-and-fire oscillator networks. Phys. Rev. Lett., 81, 2384–2387.
Bressloff, P. C., & Coombes, S. (2000). Dynamics of strongly coupled spiking neurons. Neural Comput., 12, 91–129.
Chapman, B., Zahs, K. R., & Stryker, M. P. (1991). Relation of cortical cell orientation selectivity to alignment of receptive fields of the geniculocortical afferents that arborize within a single orientation column in ferret visual cortex. J. Neurosci., 11, 1347–1358.
Cowan, J. D. (1968). Statistical mechanics of nervous nets. In E. R. Caianiello (Ed.), Neural networks (pp. 181–188). Berlin: Springer-Verlag.
Cowan, J. D. (1982). Spontaneous symmetry breaking in large scale nervous activity. Int. J. Quantum Chem., 22, 1059–1082.
Cross, M. C., & Hohenberg, P. C. (1993). Pattern formation outside of equilibrium. Rev. Mod. Phys., 65, 851–1112.
Destexhe, A., Mainen, Z. F., & Sejnowski, T. J. (1994). Synthesis of models for excitable membranes, synaptic transmission and neuromodulation using a common kinetic formalism. J. Comp. Neurosci., 1, 195–231.
Dimitrov, A., & Cowan, J. D. (1998). Spatial decorrelations in orientation-selective cells. Neural Comput., 10, 1779–1796.
Douglas, R. J., Koch, C., Mahowald, M., Martin, K., & Suarez, H. (1995). Recurrent excitation in neocortical circuits. Science, 269, 981–985.
Ermentrout, G. B. (1998). Neural networks as spatial pattern forming systems. Rep. Prog. Phys., 61, 353–430.
Ermentrout, G. B., & Cowan, J. D. (1979a). A mathematical theory of visual hallucination patterns. Biol. Cybern., 34, 137–150.
Ermentrout, G. B., & Cowan, J. D. (1979b). Temporal oscillations in neuronal nets. J. Math. Biol., 7, 265–280.
Ferster, D., Chung, S., & Wheat, H. (1997). Orientation selectivity of thalamic input to simple cells of cat visual cortex. Nature, 380, 249–281.
Ferster, D., & Koch, C. (1987). Neuronal connections underlying orientation selectivity in cat visual cortex. Trends in Neurosci., 10, 487–492.
Georgopoulos, A. P. (1995). Current issues in directional motor control. Trends in Neurosci., 18, 506–510.
Gerstner, W., van Hemmen, J. L., & Cowan, J. D. (1996). What matters in neuronal locking. Neural Comput., 8, 1653–1676.
Gray, C. M., & Singer, W. (1989). Stimulus-specific neuronal oscillations in orientation columns of cat visual cortex. Proc. Nat. Acad. Sci., 86, 1698–1702.
Hansel, D., & Sompolinsky, H. (1996). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Comput. Neurosci., 3, 7–34.
Hansel, D., & Sompolinsky, H. (1997). Modeling feature selectivity in local cortical circuits. In C. Koch & I. Segev (Eds.), Methods of neuronal modeling (2nd ed., pp. 499–567). Cambridge, MA: MIT Press.
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. Lond., 160, 106–154.
Jack, J. J. B., Noble, D., & Tsien, R. W. (1975). Electric current flow in excitable cells. Oxford: Clarendon Press.
Lin, J. K., Pawelzik, K., Ernst, U., & Sejnowski, T. J. (1998). Irregular synchronous activity in stochastically coupled networks of integrate-and-fire neurons. Network: Comput. Neural Syst., 9, 333–344.
Lukashin, A. V., & Georgopoulos, A. P. (1994a). A neural network for coding trajectories by time series of neuronal population vectors. Neural Comput., 4, 57–77.
Lukashin, A. V., & Georgopoulos, A. P. (1994b). Directional operations in the motor cortex modeled by a neural network of spiking neurons. Biol. Cybern., 71, 79–85.
Mundel, T., Dimitrov, A., & Cowan, J. D. (1997). Visual cortex circuitry and orientation tuning. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 887–893). Cambridge, MA: MIT Press.
Murray, J. D. (1990). Mathematical biology. Berlin: Springer-Verlag.
Nelson, S., Toth, L., Sheth, B., & Sur, M. (1994). Orientation selectivity of cortical neurons during extra-cellular blockade of inhibition. Science, 265, 774–777.
Pei, X., Vidyasagar, T. R., Volgushev, M., & Creutzfeldt, O. D. (1994). Receptive field analysis and orientation selectivity of postsynaptic potentials of simple cells in cat visual cortex. J. Neurosci., 14, 7130–7140.
Pinto, D. J., Brumberg, J. C., Simons, D. J., & Ermentrout, G. B. (1996). A quantitative population model of whisker barrels: Re-examining the Wilson-Cowan equations. J. Comput. Neurosci., 6, 19–28.
Reid, R. C., & Alonso, J. M. (1995). Specificity of monosynaptic connections from thalamus to visual cortex. Nature, 378, 281–284.
Ringach, D., Hawken, M. J., & Shapley, R. (1997). Dynamics of orientation tuning in macaque primary visual cortex. Nature, 387, 281–284.
Sillito, A. M., Kemp, J. A., Milson, J. A., & Beradi, N. (1980). A re-evaluation of the mechanisms underlying simple cell orientation selectivity. Brain Res., 194, 517–520.
Softky, W. R., & Koch, C. (1993). The highly irregular firing of cortical cells is inconsistent with temporal integration of random EPSPs. J. Neurosci., 13, 334–350.
Somers, D. C., Nelson, S. B., & Sur, M. (1995). An emergent model of orientation selectivity in cat visual cortical simple cells. J. Neurosci., 15, 5448–5465.
Tsodyks, M. V., & Sejnowski, T. J. (1995). Rapid switching in balanced cortical network models. Network: Comput. Neural Syst., 6, 111–124.
Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local field potentials. Neural Comput., 5, 795–835.
van Vreeswijk, C. (1996). Partial synchronization in populations of pulse-coupled oscillators. Phys. Rev. E, 54, 5522–5537.
van Vreeswijk, C., & Sompolinsky, H. (1996). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274, 1724–1726.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wilson, H. R., & Cowan, J. D. (1972). Excitatory and inhibitory interactions in localized populations of model neurons. Biophys. J., 12, 1–24.
Wilson, H. R., & Cowan, J. D. (1973). A mathematical theory of the functional dynamics of cortical and thalamic nervous tissue. Kybernetik, 13, 55–80.
Zhang, K. (1996). Representation of spatial orientation by the intrinsic dynamics of the head-direction cell ensemble: A theory. J. Neurosci., 16, 2122–2126.
Received April 15, 1999; accepted February 2, 2000.
NOTE
Communicated by George Gerstein
Statistical Signs of Common Inhibitory Feedback with Delay
Quentin Pauluis
Laboratory of Neurophysiology, School of Medicine, Université Catholique de Louvain, Brussels, Belgium
Cross-correlation histograms (CCHs) sometimes exhibit an isolated central peak flanked by two troughs. What can cause this pattern? The absence of CCH satellite peaks makes an oscillatory common input doubtful. It is shown here, using a simple counting model, that a common inhibitory feedback with delay can account for this pattern.
1 Introduction
It has been known for some time that an isolated central peak in the cross-correlation histogram (CCH) can be caused by common input (Perkel, Gerstein, & Moore, 1967; Moore, Segundo, Perkel, & Leviathan, 1970). In the last decade, much attention has been paid to oscillations in the CCH, which can be explained by a common oscillatory input either external to the network or emerging from the network dynamics (Perkel et al., 1967; Moore et al., 1970; Singer & Gray, 1995; Ritz & Sejnowski, 1997; Pauluis, Baker, & Olivier, 1999). We recently found in our recordings in the superior colliculus a simpler pattern made up of a central peak straddled by two troughs (TPT pattern; Pauluis, Baker, & Olivier, 2000). Although a Gabor function could be fitted to this TPT pattern (König, 1994) and an oscillation frequency extracted, this seems inappropriate since there were rarely significant satellite peaks. Using a simple model, this study shows that the most parsimonious hypothesis accounting for the TPT pattern is a delayed common inhibitory feedback.
2 Method
Two cells were simulated after the counting model proposed by Shadlen and Newsome (1998). Inputs were modeled by independent random trains of events, assuming an exponential distribution of interevent intervals t (Poisson process):
\[
f(t) = \frac{1}{m}\, e^{-t/m}, \tag{2.1}
\]
where m is the mean interevent interval, corresponding to a mean frequency of 50 Hz, and t was implemented in 1 ms time steps.
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2513–2518 (2000)
The cell counted these random excitatory and inhibitory inputs, which respectively incremented or decremented the current count by one unit. Consequently, each input has a duration of one time step (1 ms "current" pulse). If the count reached a threshold of 15, the cell emitted a spike, and the count was reset to zero. The count was constrained not to be negative and decayed exponentially with a time constant of 20 ms. The discrete nature of the input effectively produced excitatory postsynaptic potentials of zero rise time, permitting an immediate response to synchronized input. Two cells were simulated, reproducing the conditions of Figure 8E in Shadlen and Newsome (1998). In this simulation, each cell received 300 excitatory and 300 inhibitory input trains.
The CCH significance was calculated as in our previous studies (see Pauluis & Baker, 2000; Pauluis et al., 2000), using a method based on the estimation of the instantaneous firing rate. The instantaneous firing rate was estimated here by convolving the spike trains with a gaussian kernel (300 ms S.D.; Silverman, 1986); the cross-correlation of the two firing-rate estimates yielded a prediction for the CCH offset. A peak was designated when a 5 ms window exhibited a significant excess of counts (p < 0.001), assuming a Poisson distribution for the window count.
3 Simulations
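The counting model described in section 2 can be sketched in a few lines. The sketch below assumes standard-library Python only, approximates each Poisson train by independent 1 ms Bernoulli bins, and uses the stated parameters (300 excitatory and 300 inhibitory 50 Hz trains, threshold 15, 20 ms decay); it is an illustration, not the authors' code.

```python
# Sketch of the Shadlen-Newsome-style counting cell: +1 per excitatory event,
# -1 per inhibitory event, exponential decay of the count (time constant 20 ms),
# clipping at zero, and spike-plus-reset when the count reaches threshold.
import math
import random

def poisson_train(rate_hz, n_steps, rng, dt_ms=1.0):
    """Bernoulli approximation of a Poisson event train in 1 ms bins."""
    p = rate_hz * dt_ms / 1000.0
    return [1 if rng.random() < p else 0 for _ in range(n_steps)]

def counting_cell(exc_trains, inh_trains, threshold=15, tau_ms=20.0, dt_ms=1.0):
    """Return spike times (in ms) of the counting cell driven by the input trains."""
    decay = math.exp(-dt_ms / tau_ms)
    count, spikes = 0.0, []
    for t in range(len(exc_trains[0])):
        count = count * decay \
                + sum(tr[t] for tr in exc_trains) \
                - sum(tr[t] for tr in inh_trains)
        count = max(count, 0.0)          # count constrained not to be negative
        if count >= threshold:
            spikes.append(t)
            count = 0.0                  # reset after each spike
    return spikes

rng = random.Random(0)
exc = [poisson_train(50.0, 2000, rng) for _ in range(300)]
inh = [poisson_train(50.0, 2000, rng) for _ in range(300)]
spikes = counting_cell(exc, inh)
print(f"{len(spikes)} spikes in 2 s of simulated time")
```

With balanced excitation and inhibition, spikes are driven by fluctuations of the count rather than its mean, which is the regime the simulations below exploit.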
In the first simulation (see Figure 1A), 75 of 300 of each of the excitatory and inhibitory inputs were common to the two cells, the remainder being independent. Such an arrangement produced a narrow cross-correlation peak. In real cortical networks, the strongest inhibition is normally conveyed by a few somatic inputs (Rall & Agmon-Snir, 1998). Accordingly, for the simulation in Figure 1B, only 30 inhibitory inputs were simulated. These were all shared between the two cells. Their duration was increased to 5 ms in order to account for the longer inhibitory postsynaptic potential time course (5 ms "current" pulse; Thomson, Deuchars, & West, 1993; Thomson, West, & Deuchars, 1995; Buhl et al., 1997; Tamás, Buhl, & Somogyi, 1997), and their strength was increased to cancel an excitatory count of 2.2 each time step they were active, reflecting stronger somatic inputs. No common excitatory drive was present in this simulation. As expected from Perkel et al. (1967), the CCH exhibited a central peak (which was broader than in Figure 1A, due to the longer duration inhibition), but no trough was present.
The next simulation tested the effect of a delayed inhibitory feedback. The notion of local inhibitory feedback has to be understood at the network level. The assumptions are that (1) the neuron we are looking at is embedded in a network; (2) the inhibitory feedback is due to the local inhibitory cells; (3) the global network activity is proportional to the network input; and (4) the network input and output can be approximated by the cell input. This could be the case if the input is low or in a network with only a few excitatory
Figure 1: Balanced excitation-inhibition in simple counting neurons. (A) Common excitatory and inhibitory inputs produced a sharp central peak. Rasters of the first 1 s of the cell discharge are shown at the top. Twenty seconds of activity in total were simulated. The raw CCH is represented by a thin line, and its moving-window average (5 ms) is shown by the bold line. The light gray area represents the expected offset ± 3 S.D. (p < 0.001; S.D. = √count), and dark gray areas mark the significant count excess. A significant peak or trough is designated when a 5 ms window is significant, and is represented by a star. (B) The common source of excitatory inputs was removed. The common inhibitory inputs were adjusted to resemble the properties of somatic inhibition (see the text). The output firing rate was close to 90 Hz, and the central peak in the CCH was 9 ms wide. (C) Delayed common inhibitory feedback from one excitatory pathway. The firing rate of both cells was reduced, to 37 and 38 Hz. Although the peak became smaller, there was now a clear trough at 10 ms lag. (D) Delayed common inhibitory feedback from both excitatory pathways. The central peak was flanked by symmetrical troughs (TPT pattern; see text for details).
connections to excitatory neurons (to prevent the occurrence of a burst of activity) and a few inhibitory connections to inhibitory cells (to prevent the occurrence of oscillations) (van Vreeswijk & Sompolinsky, 1998; Pauluis et al., 1999). The number of excitatory inputs to cell 1 was scaled down by a factor of 10, rounded to the nearest integer, and delayed by 10 ms to produce the 30 common inhibitory inputs:
\[
I(t + 10) \approx E_1(t)/10. \tag{3.1}
\]
Such a delay may seem long, but since it must take into account dendritic integration, axonal transmission, and synaptic delay, it is perhaps not unreasonable. Indeed, some studies reported very slow conduction velocities in inhibitory neurons (60 μm/ms; Murakoshi, Guo, & Ichinose, 1993; Salin & Prince, 1996). As in the simulation presented in Figure 1B, these inputs were 5 ms in duration and 2.2 in strength. As shown in Figure 1C, such a model produced a trough on one side of the central peak. The generation of the inhibitory input time course as a scaled-down version of a higher-rate process meant that the coefficient of variation of the inhibitory firing was lower than previously, producing the reduced output firing rate compared to previous simulations. The final simulation tested the effect of constructing the common inhibitory drive by scaling down and averaging both excitatory sources, according to the following equation:
(3.2)
This simulation generated a CCH with a central peak surrounded by symmetrical troughs (see Figure 1D), reproducing exactly the TPT pattern. It is interesting to note that the situation can be temporally reversed. It has been suggested that inhibitory neurons could prime the network activity (Pauluis et al., 1999). Even when the inhibitory inputs preceded the excitatory ones by 10 ms, the central peak in the CCH was still flanked by two troughs (data not shown). The peak exhibited a slightly smaller amplitude (58 spikes bin⁻¹), and the troughs were restricted to ±12 ms. Whichever input led, the trough time lag corresponded fairly well with the absolute value of the delay, and the trough width with the input duration (data not shown).
4 Discussion
Figure 1D suggests that network oscillations are not required to produce collateral troughs in a CCH and that a delayed common inhibitory feedback, probably mediated by local interneurons, could explain TPT patterns completely. Indeed, common inhibitory feedback produced both the central peak and the two troughs.
We assumed in these simulations that the inhibitory feedback could be linearly related to the excitatory inputs. In large neural networks, this is the case if we consider that excitatory and inhibitory neurons process the same input in the same way and that the discharge of the cells in the study represents only a small contribution to the network activity.
Oscillation could be added to the model if the inhibitory feedback becomes oscillatory, as is the case when inhibitory interneurons are interconnected (Whittington, Traub, & Jefferys, 1995; Traub, Whittington, Colling, Buzsáki, & Jefferys, 1996; Wang & Buzsáki, 1996; White, Chow, Ritt, Soto-Treviño, & Kopell, 1998; Pauluis et al., 1999). Although the low variance of the inhibitory inputs could be the result of the mathematical simplification we made, it could also be remedied by these oscillations, because oscillations would ensure some large inhibitory input fluctuations. From that point of view, looking at the incidence of oscillatory patterns in a local neural network would be a means to quantify the transient efficacy of inhibitory-to-inhibitory connections and the recruitment of the inhibitory interneurons.
The simulations show that delayed common inhibitory feedback is a halfway point toward network oscillations and could explain the TPT pattern in CCHs. The results should encourage experimenters to test the significance of satellite troughs in order to differentiate the TPT pattern from an isolated central peak or a genuine oscillatory CCH.
5 Acknowledgments
This short note is the last part of my thesis (available online at http://www.md.ucl.ac.be/nefy/qp). I am grateful to the members of the jury who reviewed this manuscript and to S. N. Baker for invaluable help during these four years.
References
Buhl, E. H., Tamás, G., Szilágyi, T., Stricker, C., Paulsen, O., & Somogyi, P. (1997). Effect, number and location of synapses made by single pyramidal cells onto aspiny interneurons of cat visual cortex. J. Physiol., 500, 689–713.
König, P. (1994). A method for the quantification of synchrony and oscillatory properties of neuronal activity. J. Neurosci. Meth., 54, 31–37.
Moore, G. P., Segundo, J. P., Perkel, D. H., & Leviathan, H. (1970). Statistical signs of synaptic interaction in neurons. Biophys. J., 10, 876–900.
Murakoshi, T., Guo, J. Z., & Ichinose, T. (1993). Electrophysiological identification of horizontal synaptic connections in rat visual cortex in vitro. Neurosci. Lett., 163, 211–214.
Pauluis, Q., & Baker, S. N. (2000). An accurate measure of the instantaneous discharge probability with application to unitary joint-event analysis. Neural Computation, 12, 687–709.
Pauluis, Q., Baker, S. N., & Olivier, E. (1999). Emergent oscillations in a realistic network: The role of inhibition and the effect of the spatiotemporal distribution of the input. J. Comput. Neurosci., 6, 27–48.
Pauluis, Q., Baker, S. N., & Olivier, E. (2000). Precise spike synchronization in the superior colliculus of the awake cat during moving stimulus presentation. Submitted.
Perkel, D. H., Gerstein, G. L., & Moore, G. P. (1967). Neuronal spike trains and stochastic point processes. II. Simultaneous spike trains. Biophys. J., 7, 419–440.
Rall, W., & Agmon-Snir, H. (1998). Cable theory for dendritic neurons. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling: From ions to networks (p. 86). Cambridge, MA: MIT Press.
Ritz, R., & Sejnowski, T. (1997). Synchronous oscillatory activity in sensory systems: New vistas on mechanisms. Curr. Op. Neurobiol., 7, 536–546.
Salin, P. A., & Prince, D. A. (1996). Electrophysiological mapping of GABAA receptor-mediated inhibition in adult rat somatosensory cortex. J. Neurophysiol., 75, 1589–1600.
Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. J. Neurosci., 18, 3870–3896.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. London: Chapman & Hall.
Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci., 18, 555–586.
Tamás, G., Buhl, E. H., & Somogyi, P. (1997). Fast IPSPs elicited via multiple synaptic release sites by different types of GABAergic neurone in cat visual cortex. J. Physiol., 500, 715–738.
Thomson, A. M., Deuchars, J., & West, D. C. (1993). Single axon excitatory postsynaptic potentials in neocortical interneurons exhibit pronounced paired pulse facilitation. Neuroscience, 54, 347–360.
Thomson, A. M., West, D. C., & Deuchars, J. (1995). Properties of single axon excitatory postsynaptic potentials elicited in spiny interneurons by action potentials in pyramidal neurons in slices of rat neocortex. Neuroscience, 69, 727–738.
Traub, R. D., Whittington, M. A., Colling, S. B., Buzsáki, G., & Jefferys, J. G. R. (1996). Analysis of gamma rhythms in the rat hippocampus in vitro and in vivo. J. Physiol., 493, 471–484.
van Vreeswijk, C., & Sompolinsky, H. (1998). Chaotic balanced state in a model of cortical circuits. Neural Comput., 10, 1321–1371.
Wang, X. J., & Buzsáki, G. (1996). Gamma oscillation by synaptic inhibition in a hippocampal interneuron network model. J. Neurosci., 15, 6402–6413.
White, J. A., Chow, C. C., Ritt, J., Soto-Treviño, C., & Kopell, N. (1998). Synchronization and oscillatory dynamics in heterogeneous, mutually inhibited neurons. J. Comput. Neurosci., 5, 5–16.
Whittington, M. A., Traub, R. D., & Jefferys, J. G. R. (1995). Synchronized oscillations in interneuron networks driven by metabotropic glutamate receptor activation. Nature, 373, 612–615.
Received July 13, 1999; accepted February 23, 2000.
NOTE
Communicated by John Lazzaro
On the Computational Power of Winner-Take-All
Wolfgang Maass
Institute for Theoretical Computer Science, Technische Universität Graz, A-8010 Graz, Austria
This article initiates a rigorous theoretical analysis of the computational power of circuits that employ modules for computing winner-take-all. Computational models that involve competitive stages have so far been neglected in computational complexity theory, although they are widely used in computational brain models, artificial neural networks, and analog VLSI. Our theoretical analysis shows that winner-take-all is a surprisingly powerful computational module in comparison with threshold gates (also referred to as McCulloch-Pitts neurons) and sigmoidal gates. We prove an optimal quadratic lower bound for computing winner-take-all in any feedforward circuit consisting of threshold gates. In addition, we show that arbitrary continuous functions can be approximated by circuits employing a single soft winner-take-all gate as their only nonlinear operation. Our theoretical analysis also provides answers to two basic questions raised by neurophysiologists in view of the well-known asymmetry between excitatory and inhibitory connections in cortical circuits: how much computational power of neural networks is lost if only positive weights are employed in weighted sums, and how much adaptive capability is lost if only the positive weights are subject to plasticity.
1 Introduction
Computational models that involve competitive stages are widely used in computational brain models, artificial neural networks, and analog VLSI (see Arbib, 1995). The simplest competitive computational module is a hard winner-take-all gate that computes a function WTA_n : R^n → {0, 1}^n whose output ⟨b_1, ..., b_n⟩ = WTA_n(x_1, ..., x_n) satisfies
\[
b_i = \begin{cases} 1, & \text{if } x_i > x_j \text{ for all } j \neq i \\ 0, & \text{if } x_j > x_i \text{ for some } j \neq i. \end{cases}
\]
Thus, in the case of pairwise different inputs x_1, ..., x_n a single output bit b_i has value 1, which marks the position of the largest input x_i.¹
¹ Different conventions are considered in the literature in case there is no unique "winner" x_i, but we need not specify any convention in this article since our lower-bound result for WTA_n holds for all of these versions.
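As a sketch, the hard gate WTA_n and the k-winner-take-all variant discussed later in the article can be written out directly (inputs assumed pairwise distinct; ties are convention-dependent, as footnote 1 notes):

```python
# Direct implementations of hard winner-take-all and k-winner-take-all.
def wta(xs):
    """b_i = 1 iff x_i > x_j for all j != i."""
    return [1 if all(x > y for j, y in enumerate(xs) if j != i) else 0
            for i, x in enumerate(xs)]

def k_wta(xs, k):
    """b_i = 1 iff x_i is among the k largest inputs."""
    kth_largest = sorted(xs, reverse=True)[k - 1]
    return [1 if x >= kth_largest else 0 for x in xs]

print(wta([0.2, 0.9, 0.5]))       # [0, 1, 0]
print(k_wta([0.2, 0.9, 0.5], 2))  # [0, 1, 1]
```

Note that both gates are global nonlinearities: each output bit depends on comparisons against every other input, which is what makes them expensive to realize with local threshold gates (section 2).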
© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2519–2535 (2000)
In this article we also investigate the computational power of two common variations of winner-take-all: k-winner-take-all, where the ith output b_i has value 1 if and only if x_i is among the k largest inputs, and soft winner-take-all, where the ith output is an analog variable r_i whose value reflects the rank of x_i among the input variables.
Winner-take-all is ubiquitous as a computational module in computational brain models, especially in models involving computational mechanisms for attention (Niebur & Koch, 1998). Biologically plausible models for computing winner-take-all in biological neural systems exist on the basis of both the assumption that the analog inputs x_i are encoded through firing rates—where the most frequently firing neuron exerts the strongest inhibition on its competitors and thereby stops them from firing after a while—and the assumption that the analog inputs x_i are encoded through the temporal delays of single spikes—where the earliest-firing neuron (which encodes the largest x_i) inhibits its competitors before they can fire (Thorpe, 1990).
We would like to point to another link between the results of this article and computational neuroscience. There exists a notable difference between the computational role of weights of different signs in artificial neural network models on the one hand, and anatomical and physiological data regarding the interplay of excitatory and inhibitory neural inputs in biological neural systems on the other hand (see Abeles, 1991, and Shepherd, 1998). Virtually every artificial neural network model is based on an assumed symmetry between positive and negative weights. Typically, weights of either sign occur on an equal footing as coefficients in weighted sums that represent the input to an artificial neuron. In contrast, there exists a strong asymmetry regarding positive (excitatory) and negative (inhibitory) inputs to biological neurons.
At most 15% of the neurons in the cortex are inhibitory, and these are the only neurons that can exert a negative (inhibitory) influence on the activity of other neurons. Furthermore, the location and structure of their synaptic connections to other neurons differ drastically from those formed by excitatory neurons. Inhibitory neurons are usually connected only to neurons in their immediate vicinity, and it is not clear to what extent their synapses are changed through learning. Furthermore, the location of their synapses on the target neurons (rarely on spines, often close to the soma, frequently with multiple synapses on the target neuron) suggests that their computational function is not symmetric to that of excitatory synapses. Rather, such data support the conjecture that their impact on other neurons may be more of a local regulatory nature—that inhibitory neurons do not function as computational units per se like the (excitatory) pyramidal neurons, whose synapses are subject to fine-tuning by various learning mechanisms. These observations from anatomy and neurophysiology have given rise to the question of whether a quite different style of neural circuit design may be feasible, one that achieves sufficient computational power without requiring symmetry between excitatory and
inhibitory interaction among neurons. The circuits constructed in sections 3 and 4 of this article provide a positive answer to this question. It is shown there that neural circuits that use inhibition exclusively for lateral inhibition in the context of winner-take-all have the same computational power as multilayer perceptrons that employ weighted sums with positive and negative weights in the usual manner. Furthermore, one can, if one wants, keep the inhibitory synapses in these circuits fixed and modify just the excitatory synapses in order to program the circuits so that they adopt a desired input-output behavior.
It has long been known that winner-take-all can be implemented via inhibitory neurons in biologically realistic circuit models (Elias & Grossberg, 1975; Amari & Arbib, 1977; Coultrip, Granger, & Lynch, 1992; Yuille & Grzywacz, 1989). The only novel contribution of this article is the result that in combination with neurons that compute weighted sums (with positive weights only), such winner-take-all modules have universal computational power for both digital and analog computation.
A large number of efficient implementations of winner-take-all in analog VLSI have been proposed, starting with Lazzaro, Ryckebusch, Mahowald, and Mead (1989). The circuit of Lazzaro et al. computes an approximate version of WTA_n with just 2n transistors and wires of total length O(n), with lateral inhibition implemented by adding currents on a single wire of length O(n). Its computation time scales with the size of its largest input. Numerous other efficient implementations of winner-take-all in analog VLSI have subsequently been produced (e.g., Andreou et al., 1991; Choi & Sheu, 1993; Fang, Cohen, & Kincaid, 1996).
Among them are circuits based on silicon spiking neurons (DeYong, Findley, & Fields, 1992; Meador & Hylander, 1994; Indiveri, in press) and circuits that emulate attention in artificial sensory processing (DeWeerth & Morris, 1994; Horiuchi, Morris, Koch, & DeWeerth, 1997; Brajovic & Kanade, 1998; Indiveri, in press). In spite of these numerous hardware implementations of winner-take-all and numerous practical applications, we are not aware of any theoretical results on the general computational power of these modules. It is one goal of this article to place these novel circuits in the context of other circuit models that are commonly studied in computational complexity theory and to evaluate their relative strength.
There exists an important structural difference between the circuit models commonly studied in computational complexity theory and those one typically encounters in hardware or wetware. Almost all circuit models in computational complexity theory are feedforward circuits—their architecture constitutes a directed graph without cycles (Wegener, 1987; Savage, 1998). In contrast, typical physical realizations of circuits contain, besides feedforward connections, also lateral connections (connections among gates on the same layer) and frequently also recurrent connections (connections from a higher layer backward to some lower layer of the circuit). This gives rise to the question whether this discrepancy is relevant—for example,
whether there are practically important computational tasks that can be implemented substantially more efficiently on a circuit with lateral connections than on a strictly feedforward circuit.

In this article, we address this issue in the following manner: we view modules that compute winner-take-all (whose implementation usually involves lateral connections) as "black boxes," modeling only their input-output behavior. We refer to these modules as winner-take-all gates in the following. We study possible computational uses of such winner-take-all gates in circuits where these gates (and other gates) are wired together in a feedforward fashion.² We charge one unit of time for the operation of each gate. We will show that in this framework the new modules, which may internally involve lateral connections, do in fact add substantial computational power to a feedforward circuit.

The arguably most powerful gates that have previously been studied in computational complexity theory are threshold gates (also referred to as McCulloch-Pitts neurons or perceptrons; see Minsky & Papert, 1969; Siu, Roychowdhury, & Kailath, 1995) and sigmoidal gates (which may be viewed as soft versions of threshold gates). A threshold gate with weights $a_1, \ldots, a_n \in \mathbb{R}$ and threshold $\Theta \in \mathbb{R}$ computes the function $G \colon \mathbb{R}^n \to \{0, 1\}$ defined by

$$G(x_1, \ldots, x_n) = 1 \Leftrightarrow \sum_{i=1}^{n} a_i x_i \ge \Theta.$$

Note that AND and OR of $n$ bits, as well as NOT, are special cases of threshold gates. A threshold circuit (also referred to as a multilayer perceptron) is a feedforward circuit consisting of threshold gates. The depth of a threshold circuit $C$ is the maximal length of a directed path from an input node to an output gate. The circuit is called layered if all such paths have the same length. Note that a circuit with $k$ hidden layers has depth $k + 1$ in this terminology. Except for theorem 1, we will discuss in this article only circuits with a single output; the other results can be extended to networks with several outputs by duplicating the network so that each output variable is formally computed by a separate network.

We will prove in section 2 that $\mathrm{WTA}_n$ is a rather expensive computational operation from the point of view of threshold circuits, since any such circuit needs quadratically in $n$ many gates to compute $\mathrm{WTA}_n$. We will show in section 3 that the full computational power of two layers of threshold gates can be achieved by a single $k$-winner-take-all gate applied to positive weighted sums of the input variables. Furthermore, we will show in section 4 that by replacing the single $k$-winner-take-all gate with a single soft winner-take-all gate, these extremely simple circuits become universal approximators for arbitrary continuous functions.

² We examine computations in such a circuit only for a single batch input, not for streams of varying inputs. Hence it does not matter for this analysis whether one implements a winner-take-all gate by a circuit that needs to be reinitialized before the next input or by a circuit that immediately responds to changes in its input.
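As a small illustration (my own sketch, not part of the article), the threshold gates for AND, OR, and NOT mentioned above can be written directly:

```python
def threshold_gate(weights, theta):
    """G(x) = 1 iff sum_i weights[i] * x[i] >= theta."""
    return lambda x: 1 if sum(w * xi for w, xi in zip(weights, x)) >= theta else 0

AND3 = threshold_gate([1, 1, 1], 3)   # AND of 3 bits: all weights 1, threshold n
OR3 = threshold_gate([1, 1, 1], 1)    # OR of 3 bits: all weights 1, threshold 1
NOT = threshold_gate([-1], 0)         # NOT: weight -1, threshold 0
```
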
On the Computational Power of Winner-Take-All
2 An Optimal Quadratic Lower Bound for Hard Winner-Take-All
We show in this section that any feedforward circuit consisting of threshold gates needs quadratically in $n$ many gates to compute $\mathrm{WTA}_n$. This result also implies a lower bound for any circuit of threshold gates involving lateral and/or recurrent connections that computes $\mathrm{WTA}_n$ (provided one assumes that each gate receives at time $t$ only outputs from other gates that were computed before time $t$): one can simulate any circuit with $s$ gates whose computation takes $t$ discrete time steps by a feedforward circuit with $s \cdot t$ gates. Hence, for example, if there exists some circuit consisting of $O(n)$ threshold gates with lateral and recurrent connections that computes $\mathrm{WTA}_n$ in $t$ discrete time steps, then $\mathrm{WTA}_n$ can be computed by a feedforward circuit consisting of $O(t \cdot n)$ threshold gates. Therefore theorem 1 implies that $\mathrm{WTA}_n$ cannot be computed in sublinear time by any linear-size circuit consisting of threshold gates with arbitrary (i.e., feedforward, lateral, and recurrent) connections. If linear-size implementations of (approximations of) $\mathrm{WTA}_n$ can be built in analog VLSI whose computation time grows sublinearly in $n$, then this negative result provides theoretical evidence for the superior computational capabilities of analog circuits with lateral connections.

For $n = 2$ it is obvious that $\mathrm{WTA}_n$ can be computed by a threshold circuit of size 2 and that this size is optimal. For $n \ge 3$ the most straightforward design of a (feedforward) threshold circuit $C$ that computes $\mathrm{WTA}_n$ uses $\binom{n}{2} + n$ threshold gates: for each pair $\langle i, j \rangle$ with $1 \le i < j \le n$, one employs a threshold gate $G_{ij}$ that outputs 1 if and only if $x_j \ge x_i$, and the $i$th output $b_i$ of $\mathrm{WTA}_n$ for a circuit input $\mathbf{x}$ is computed by a threshold gate $G_i$ with

$$G_i = 1 \Leftrightarrow \sum_{j < i} G_{ji}(\mathbf{x}) + \sum_{j > i} -G_{ij}(\mathbf{x}) \ge i - 1.$$
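This explicit construction can be checked directly in code. The sketch below (function names are my own) takes $\mathrm{WTA}_n$ to output 1 exactly at the position of the maximal input, which is unique for the pairwise different inputs considered here:

```python
from itertools import combinations

def wta(x):
    """Reference WTA_n for pairwise different inputs: 1 at the maximum."""
    m = max(x)
    return [1 if xi == m else 0 for xi in x]

def wta_circuit(x):
    """Feedforward threshold circuit with C(n,2) + n gates computing WTA_n."""
    n = len(x)
    # First layer: for each pair i < j, gate G_ij = 1 iff x_j >= x_i.
    G = {(i, j): 1 if x[j] >= x[i] else 0 for i, j in combinations(range(n), 2)}
    # Output layer: with 0-based indices, b_i = 1 iff
    # sum_{j<i} G_ji - sum_{j>i} G_ij >= i (i.e., i - 1 in the 1-based formula).
    return [1 if sum(G[(j, i)] for j in range(i))
                 - sum(G[(i, j)] for j in range(i + 1, n)) >= i else 0
            for i in range(n)]
```

The output gate for position $i$ fires exactly when $x_i$ beats all smaller-indexed inputs and loses to no larger-indexed one, which forces both sums to their extremes.
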
This circuit design appears to be suboptimal, since most of its threshold gates (the $\binom{n}{2}$ gates $G_{ij}$) do not make use of their capability to evaluate weighted sums of many variables with arbitrary weights from $\mathbb{R}$. However, the following result shows that no feedforward threshold circuit (not even with an arbitrary number of layers, and threshold gates of arbitrary fan-in with arbitrary real weights) can compute $\mathrm{WTA}_n$ with fewer than $\binom{n}{2} + n$ gates.

Theorem 1. Assume that $n \ge 3$ and $\mathrm{WTA}_n$ is computed by some arbitrary feedforward circuit $C$ consisting of threshold gates with arbitrary weights. Then $C$ consists of at least $\binom{n}{2} + n$ threshold gates.
Proof. Let $C$ be any threshold circuit that computes $\mathrm{WTA}_n$ for all $\mathbf{x} = \langle x_1, \ldots, x_n \rangle \in \mathbb{R}^n$ with pairwise different $x_1, \ldots, x_n$. We say that a threshold gate in the circuit $C$ contains an input variable $x_k$ if there exists a direct "wire" (i.e., an edge in the directed graph underlying $C$) from the $k$th input node to this gate, and the $k$th input variable $x_k$ occurs in the weighted sum of
this threshold gate with a weight $\ne 0$. This threshold gate may also receive outputs from other threshold gates as part of its input, since we do not assume that $C$ is a layered circuit.

Fix any $i, j \in \{1, \ldots, n\}$ with $i \ne j$. We will show that there exists a gate $G_{ij}$ in $C$ that contains the input variables $x_i$ and $x_j$, but no other input variables. The proof proceeds in four steps:
Step 1. Choose $h \in (0.6, 0.9)$ and $\rho \in (0, 0.1)$ such that no gate $G$ in $C$ that contains the input variable $x_j$, but no other input variable, changes its output value when $x_j$ varies over $(h - \rho, h + \rho)$, no matter which fixed binary values $\mathbf{a}$ have been assigned to the other inputs of gate $G$. For any fixed $\mathbf{a}$ there exists at most a single value $t_{\mathbf{a}}$ for $x_j$ so that $G$ changes its output at $x_j = t_{\mathbf{a}}$ when $x_j$ varies from $-\infty$ to $+\infty$. It suffices to choose $h$ and $\rho$ so that none of these finitely many values $t_{\mathbf{a}}$ (for arbitrary binary $\mathbf{a}$ and arbitrary gates $G$) falls into the interval $(h - \rho, h + \rho)$.
Step 2. Choose a closed ball $B \subseteq (0, 0.5)^{n-2}$ with a center $\mathbf{c} \in \mathbb{R}^{n-2}$ and a radius $\delta > 0$ so that no gate $G$ in $C$ changes its output when we set $x_i = x_j = h$ and let the vector of the other $n - 2$ input variables vary over $B$ (this is required to hold for any fixed binary values of the inputs that $G$ may receive from other threshold gates). We exploit here that for fixed $x_i = x_j = h$, the gates in $C$ (with inputs from other gates replaced by all possible binary values) partition $(0, 0.5)^{n-2}$ into finitely many sets $S$, each of which can be written as an intersection of half-spaces, so that no gate in $C$ changes its output when we fix $x_i = x_j = h$ and let the other $n - 2$ input variables of the circuit vary over $S$ (while keeping inputs to $C$ from other gates artificially fixed).
Step 3. Choose $\gamma \in (0, \rho)$ so that in every gate $G$ in $C$ that contains, besides $x_j$, some other input variable $x_k$ with $k \notin \{i, j\}$, a change of $x_j$ by an amount $\gamma$ causes a smaller change of the weighted sum at this threshold gate $G$ than a change by an amount $\delta$ of any of the input variables $x_k$ with $k \notin \{i, j\}$ that it contains. This property can be satisfied because we can choose for $\gamma$ an arbitrarily small positive number.

Step 4. Set $x_i := h$, $\langle x_k \rangle_{k \notin \{i, j\}} := \mathbf{c}$, and let $x_j$ vary over $[h - \frac{\gamma}{2}, h + \frac{\gamma}{2}]$. By assumption, the output of $C$ changes, since the output of $\mathrm{WTA}_n(x_1, \ldots, x_n)$ changes ($x_i$ is the winner for $x_j < h$, $x_j$ is the winner for $x_j > h$). Let $G_{ij}$ be some gate in $C$ that changes its output, whereas no gate that lies on a path from an input variable to $G_{ij}$ changes its output. Hence $G_{ij}$ contains the input variable $x_j$. The choice of $h$ and $\rho > \gamma$ in step 1 implies that $G_{ij}$ contains, besides $x_j$, some other input variable. Assume for a contradiction that $G_{ij}$ contains an input variable $x_k$ with $k \notin \{i, j\}$. By the choice of $\gamma$ in step 3,
this implies that the output of $G_{ij}$ changes when we set $x_i = x_j = h$ and move $x_k$ by an amount up to $\delta$ from its value in $\mathbf{c}$, while keeping all other input variables and all inputs that $G_{ij}$ receives from other threshold gates fixed (even if some preceding threshold gates in $C$ would change their output in response to this change in the input variable $x_k$). This yields a contradiction to the definition of the ball $B$ in step 2. Thus we have shown that $k \in \{i, j\}$ for any input variable $x_k$ that $G_{ij}$ contains. Therefore $G_{ij}$ contains exactly the input variables $x_i$ and $x_j$.

So far we have shown that for any $i, j \in \{1, \ldots, n\}$ with $i \ne j$, there exists a gate $G_{ij}$ in $C$ that contains the two input variables $x_i, x_j$, and no other input variables. It remains to be shown that apart from these $\binom{n}{2}$ gates $G_{ij}$, the circuit $C$ contains $n$ other gates $G_1, \ldots, G_n$ that compute the $n$ output bits $b_1, \ldots, b_n$ of $\mathrm{WTA}_n$. It is impossible that $G_k = G_l$ for some $k \ne l$, since $b_k \ne b_l$ for some arguments of $\mathrm{WTA}_n$. Assume for a contradiction that $G_k = G_{ij}$ for some $k, i, j \in \{1, \ldots, n\}$ with $i \ne j$. We have $k \ne i$ or $k \ne j$. Assume without loss of generality that $k \ne i$. Let $l$ be any number in $\{1, \ldots, n\} - \{k, i\}$. Assign to $x_1, \ldots, x_n$ some pairwise different values $a_1, \ldots, a_n$ so that $a_l > a_{l'}$ for all $l' \ne l$. Since $l \ne k$ and $i \ne k$, we have that for any $x_i \in \mathbb{R}$, the $k$th output variable $b_k$ of $\mathrm{WTA}_n(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$ has value 0 ($a_k$ cannot be the maximal argument for any value of $x_i$, since $a_l > a_k$). On the other hand, since $G_{ij}$ contains the variable $x_i$, there exist values for $x_i$ (move $x_i \to \infty$ or $x_i \to -\infty$) so that $G_{ij}$ outputs 1, no matter which binary values are assigned to the inputs that $G_{ij}$ receives in $C$ from other threshold gates (while we keep $x_j$ fixed at value $a_j$). This implies that $G_{ij}$ does not output $b_k$ in circuit $C$, that is, $G_k \ne G_{ij}$.

Therefore, the circuit $C$ contains, in addition to the gates $G_{ij}$, $n$ other gates that provide the circuit outputs $b_1, \ldots, b_n$.
3 Simulating Two Layers of Threshold Gates with a Single k-Winner-Take-All Gate as the Only Nonlinearity
A popular variation of winner-take-all (which has also been implemented in analog VLSI; Urahama & Nagao, 1995) is k-winner-take-all. The output of k-winner-take-all indicates for each input variable $x_i$ whether $x_i$ is among the $k$ largest inputs. Formally, we define for any $k \in \{1, \ldots, n\}$ the function $k\text{-}\mathrm{WTA}_n \colon \mathbb{R}^n \to \{0, 1\}^n$, where $k\text{-}\mathrm{WTA}_n(x_1, \ldots, x_n) = \langle b_1, \ldots, b_n \rangle$ has the property that

$$b_i = 1 \Leftrightarrow (x_j > x_i \text{ holds for at most } k - 1 \text{ indices } j).$$
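This definition translates directly into code (a short sketch; the function name is mine):

```python
def k_wta(x, k):
    """k-WTA_n: b_i = 1 iff at most k-1 inputs are strictly larger than x_i."""
    return [1 if sum(xj > xi for xj in x) <= k - 1 else 0 for xi in x]

# The 2 largest of (0.3, 0.9, 0.1, 0.7) sit at positions 1 and 3:
# k_wta((0.3, 0.9, 0.1, 0.7), 2) -> [0, 1, 0, 1]
```
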
We will show in this section that a single gate that computes $k\text{-}\mathrm{WTA}_n$ can absorb all nonlinear computational operations of any two-layer threshold circuit with one output gate and any number of hidden gates.
Theorem 2. Any two-layer feedforward circuit $C$ (with $m$ analog or binary input variables and one binary output variable) consisting of threshold gates can be simulated by a circuit consisting of a single $k$-winner-take-all gate $k\text{-}\mathrm{WTA}_n$, applied to $n$ weighted sums of the input variables with positive weights, except for some set $S \subseteq \mathbb{R}^m$ of inputs that has measure 0. In particular, any boolean function $f \colon \{0, 1\}^m \to \{0, 1\}$ can be computed by a single $k$-winner-take-all gate applied to weighted sums of the input bits. If $C$ has polynomial size and integer weights whose size is bounded by a polynomial in $m$, then $n$ can be bounded by a polynomial in $m$, and all weights in the simulating circuit are natural numbers whose size is bounded by a polynomial in $m$.
Remark 1. The exception set of measure 0 in this result is a union of up to $n$ hyperplanes in $\mathbb{R}^m$. This exception set is apparently of no practical significance, since for any given finite set $\tilde{D}$ (not just for $\tilde{D} \subseteq \{0, 1\}^m$ but also, for example, for $\tilde{D} := \{\langle z_1, \ldots, z_m \rangle \in \mathbb{R}^m \colon \text{each } z_i \text{ is a rational number with bitlength} \le 1000\}$), one can move these hyperplanes (by adjusting the constant terms in their definitions) so that no point in $\tilde{D}$ belongs to the exception set. Hence the $k$-WTA circuit can simulate the given threshold circuit $C$ on any finite set $\tilde{D}$, with an architecture that is independent of $\tilde{D}$. On the other hand, the proof of the lower-bound result from theorem 1 requires that the circuit $C$ computes WTA everywhere, not just on a finite set.
Remark 2. One can easily show that the exception set $S$ of measure 0 in theorem 2 is necessary: the set of inputs $z \in \mathbb{R}^m$ for which a $k\text{-}\mathrm{WTA}_n$ gate applied to weighted sums outputs 1 is always a closed set, whereas the set of inputs for which a circuit $C$ of depth 2 consisting of threshold gates outputs 1 can be an open set. Hence, in general, a circuit $C$ of the latter type cannot be simulated for all inputs $z \in \mathbb{R}^m$ by a $k\text{-}\mathrm{WTA}_n$ gate applied to weighted sums.
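To make the simulation concrete, here is a small numerical sketch of the construction used in the proof of theorem 2 below. The helper names and the particular random circuit are my own, and the hidden layer is assumed to be already normalized so that $C(z) = 1 \Leftrightarrow \sum_j G_j(z) \ge k$ (the elimination of negative output weights is described in the proof):

```python
import random

def k_wta(x, k):
    """k-WTA_n: b_i = 1 iff at most k-1 inputs are strictly larger than x_i."""
    return [1 if sum(xj > xi for xj in x) <= k - 1 else 0 for xi in x]

random.seed(1)
m, n, k = 3, 4, 2  # circuit inputs, hidden gates, output-gate threshold
# Hidden threshold gates G_j(z) = 1 iff w^j . z >= theta_j.
W = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
TH = [random.uniform(-1, 1) for _ in range(n)]

def C(z):
    hidden = [1 if sum(w * zi for w, zi in zip(W[j], z)) >= TH[j] else 0
              for j in range(n)]
    return 1 if sum(hidden) >= k else 0

def C_simulated(z):
    """Single (n-k+1)-WTA_{n+1} gate applied to positive weighted sums."""
    pos = [sum(abs(w) * zi for w, zi in zip(W[j], z) if w > 0) for j in range(n)]
    neg = [sum(abs(w) * zi for w, zi in zip(W[j], z) if w < 0) for j in range(n)]
    # S_j collects the negative-weight part of gate j, its threshold, and the
    # positive-weight parts of all other gates; S_{n+1} collects all positive parts,
    # so S_{n+1} >= S_j exactly when G_j(z) = 1.
    S = [neg[j] + TH[j] + sum(pos[l] for l in range(n) if l != j) for j in range(n)]
    S.append(sum(pos))
    return k_wta(S, n - k + 1)[n]  # output bit n+1 equals C(z)

for _ in range(200):
    z = [random.uniform(-1, 1) for _ in range(m)]
    assert C(z) == C_simulated(z)
```

All coefficients in the sums $S_1, \ldots, S_{n+1}$ are the absolute values $|w_i^j|$, hence positive, as the theorem requires.
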
Corollary 1. Any layered threshold circuit $C$ of arbitrary even depth $d$ can be simulated, for all inputs except for a set of measure 0, by a feedforward circuit consisting of $\ell$ $k$-WTA gates on $\frac{d}{2}$ layers (each applied to positive weighted sums of circuit inputs on the first layer, and of outputs from preceding $k$-WTA gates on the subsequent layers), where $\ell$ is the number of gates on even-numbered layers in $C$. Thus, one can view the simulating circuit as a $d$-layer circuit with alternating layers of linear and $k$-WTA gates. Alternatively, one can simulate layers 3 to $d$ of circuit $C$ by a single $k$-WTA gate applied to a (possibly very large) number of linear gates. This yields a simulation of $C$ (except for some input set $S$ of measure 0) by a four-layer circuit with alternating layers of linear gates and $k$-WTA gates. The number of $k$-WTA gates in this circuit can be bounded by $\ell_2 + 1$, where $\ell_2$ is the number of threshold gates on layer 2 of $C$.
Proof of Theorem 2. Since the outputs of the gates on the hidden layer of $C$ are from $\{0, 1\}$, we can assume without loss of generality that the weights $a_1, \ldots, a_n$ of the output gate $G$ of $C$ are from $\{-1, 1\}$. (See, e.g., Siu et al., 1995, for details. One first observes that it suffices to use integer weights for threshold gates with binary inputs; one can then normalize these weights to values in $\{-1, 1\}$ by duplicating gates on the hidden layer of $C$.) Thus, for any $z \in \mathbb{R}^m$, we have $C(z) = 1 \Leftrightarrow \sum_{j=1}^{n} a_j \hat{G}_j(z) \ge \Theta$, where $\hat{G}_1, \ldots, \hat{G}_n$ are the threshold gates on the hidden layer of $C$, $a_1, \ldots, a_n$ are from $\{-1, 1\}$, and $\Theta$ is the threshold of the output gate $G$.

In order to eliminate the negative weights in $G$, we replace each gate $\hat{G}_j$ for which $a_j = -1$ by another threshold gate $G_j$ so that $G_j(z) = 1 - \hat{G}_j(z)$ for all $z \in \mathbb{R}^m$ except on some hyperplane. We utilize here that $\neg(\sum_{i=1}^{m} w_i z_i \ge \Theta) \Leftrightarrow \sum_{i=1}^{m} (-w_i) z_i > -\Theta$ for arbitrary $w_i, z_i, \Theta \in \mathbb{R}$. We set $G_j := \hat{G}_j$ for all $j \in \{1, \ldots, n\}$ with $a_j = 1$. Then we have for all $z \in \mathbb{R}^m$, except for $z$ from some exception set $S$ consisting of up to $n$ hyperplanes,

$$\sum_{j=1}^{n} a_j \hat{G}_j(z) = \sum_{j=1}^{n} G_j(z) - |\{j \in \{1, \ldots, n\} \colon a_j = -1\}|.$$

Hence $C(z) = 1 \Leftrightarrow \sum_{j=1}^{n} G_j(z) \ge k$ for all $z \in \mathbb{R}^m - S$, for some suitable $k \in \mathbb{N}$.

Let $w_1^j, \ldots, w_m^j \in \mathbb{R}$ be the weights and $\Theta^j \in \mathbb{R}$ be the threshold of gate $G_j$, $j = 1, \ldots, n$. Thus $G_j(z) = 1 \Leftrightarrow \sum_{i \colon w_i^j > 0} |w_i^j| z_i - \sum_{i \colon w_i^j < 0} |w_i^j| z_i \ge \Theta^j$. Hence with

$$S_j := \sum_{i \colon w_i^j < 0} |w_i^j| z_i + \Theta^j + \sum_{\ell \ne j} \sum_{i \colon w_i^\ell > 0} |w_i^\ell| z_i \quad \text{for } j = 1, \ldots, n$$

and

$$S_{n+1} := \sum_{j=1}^{n} \sum_{i \colon w_i^j > 0} |w_i^j| z_i,$$

we have for every $j \in \{1, \ldots, n\}$ and every $z \in \mathbb{R}^m$:

$$S_{n+1} \ge S_j \Leftrightarrow \sum_{i \colon w_i^j > 0} |w_i^j| z_i - \sum_{i \colon w_i^j < 0} |w_i^j| z_i \ge \Theta^j \Leftrightarrow G_j(z) = 1.$$

This implies that the $(n+1)$st output $b_{n+1}$ of the gate $(n - k + 1)\text{-}\mathrm{WTA}_{n+1}$ applied to $S_1, \ldots, S_{n+1}$ satisfies

$$b_{n+1} = 1 \Leftrightarrow |\{j \in \{1, \ldots, n+1\} \colon S_j > S_{n+1}\}| \le n - k \Leftrightarrow |\{j \in \{1, \ldots, n+1\} \colon S_{n+1} \ge S_j\}| \ge k + 1 \Leftrightarrow |\{j \in \{1, \ldots, n\} \colon S_{n+1} \ge S_j\}| \ge k \Leftrightarrow \sum_{j=1}^{n} G_j(z) \ge k \Leftrightarrow C(z) = 1.$$

Note that all the coefficients in the sums $S_1, \ldots, S_{n+1}$ are positive.

Proof of Corollary 1. Apply the preceding construction separately to the
first two layers of $C$, then to the next two layers of $C$, and so on. Note that the later layers of $C$ receive just boolean inputs from the preceding layer. Hence one can also simulate the computation of layers 3 to $d$ by a two-layer threshold circuit, which requires just a single $k$-WTA gate for its simulation in our approach from theorem 2.

4 Soft Winner-Take-All Applied to Positive Weighted Sums Is a Universal Approximator
We consider in this section a soft version, soft-WTA, of a winner-take-all gate. Its output $\langle r_1, \ldots, r_n \rangle$ for real-valued inputs $\langle x_1, \ldots, x_n \rangle$ consists of $n$ analog numbers $r_i$, whose value reflects the relative position of $x_i$ within the ordering of $x_1, \ldots, x_n$ according to their size. Soft versions of winner-take-all are also quite plausible as a computational function of cortical circuits with lateral inhibition. Efficient implementations in analog VLSI are still elusive, but in Indiveri (in press) an analog VLSI circuit for a version of soft winner-take-all has been presented in which the time at which the $i$th output unit is activated reflects the rank of $x_i$ among the inputs.

We show in this section that single gates from a fairly large class of soft winner-take-all gates can serve as the only nonlinearity in universal approximators for arbitrary continuous functions. The only other computational operations needed are weighted sums with positive weights.

We start with a basic version of a soft winner-take-all gate soft-$\mathrm{WTA}_{n,k}$, for natural numbers $k, n$ with $k \le \frac{n}{2}$, which computes the following function $\langle x_1, \ldots, x_n \rangle \mapsto \langle r_1, \ldots, r_n \rangle$ from $\mathbb{R}^n$ into $[0, 1]^n$:

$$r_i := \pi\left(\frac{|\{j \in \{1, \ldots, n\} \colon x_i \ge x_j\}| - \frac{n}{2}}{k}\right),$$

where $\pi \colon \mathbb{R} \to [0, 1]$ is the piecewise linear function defined by

$$\pi(x) = \begin{cases} 1, & \text{if } x > 1 \\ x, & \text{if } 0 \le x \le 1 \\ 0, & \text{if } x < 0. \end{cases}$$
If one implements a soft winner-take-all gate via lateral inhibition, one can expect that its $i$th output $r_i$ is lowered (down from 1) by every competitor $x_j$ that wins the competition with $x_i$ (i.e., $x_j > x_i$). Furthermore, one can expect that the $i$th output $r_i$ is completely annihilated (i.e., is set equal to 0) once the number of competitors $x_j$ that win the competition with $x_i$ reaches a certain critical value $c$. This intuition has served as a guiding principle in the preceding formalization. However, for the sake of mathematical
simplicity, we have defined the outputs $r_i$ of soft-$\mathrm{WTA}_{n,k}$ not in terms of the number of competitors $x_j$ that win over $x_i$ (i.e., $x_j > x_i$), but rather in terms of the number of competitors $x_j$ that do not win over $x_i$ (i.e., $x_i \ge x_j$). Obviously one has $|\{j \in \{1, \ldots, n\} \colon x_i \ge x_j\}| = n - |\{j \in \{1, \ldots, n\} \colon x_j > x_i\}|$. Hence the value of $r_i$ goes down if more competitors $x_j$ win over $x_i$, as desired. The critical number $c$ of winning competitors that suffice to "annihilate" $r_i$ is formalized through the "threshold" $\frac{n}{2}$. Our subsequent result shows that in principle it suffices to set the annihilation threshold always equal to $\frac{n}{2}$; in applications one may, of course, choose other values for the annihilation threshold.

It is likely that an analog implementation of a soft winner-take-all gate will not be able to produce an output that is really linearly related to $|\{j \in \{1, \ldots, n\} \colon x_i \ge x_j\}| - \frac{n}{2}$ over a sufficiently large range of values. Instead, an analog implementation is more likely to output a value of the form

$$g\left(\frac{|\{j \in \{1, \ldots, n\} \colon x_i \ge x_j\}| - \frac{n}{2}}{k}\right),$$

where the piecewise linear function $\pi$ is replaced by some possibly rather complicated nonlinear warping $g$ of the target output values. The following result shows that such gates with any given nonlinear warping $g$ can serve just as well as the competitive stage (and as the only nonlinearity) in universal approximators. More precisely, let $g$ be any continuous function $g \colon \mathbb{R} \to [0, 1]$ that has value 1 for $x > 1$, value 0 for $x < 0$, and that is strictly increasing over the interval $[0, 1]$. We denote a soft winner-take-all gate that employs $g$ instead of $\pi$ by soft-$\mathrm{WTA}_{n,k}^g$.

Theorem 3. Assume that $h \colon D \to [0, 1]$ is an arbitrary continuous function with a bounded and closed domain $D \subseteq \mathbb{R}^m$ (for example, $D = [0, 1]^m$). Then for any $\varepsilon > 0$ and for any function $g$ (satisfying the above conditions) there exist natural numbers $k, n$, biases $\alpha_0^j \in \mathbb{R}$, and nonnegative coefficients $\alpha_i^j$ for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, so that the circuit consisting of the soft winner-take-all gate soft-$\mathrm{WTA}_{n,k}^g$ applied to the $n$ sums $\sum_{i=1}^{m} \alpha_i^j z_i + \alpha_0^j$ for $j = 1, \ldots, n$ computes a function³ $f \colon D \to [0, 1]$ so that $|f(z) - h(z)| < \varepsilon$ for all $z \in D$. Thus, circuits consisting of a single soft-WTA gate applied to positive weighted sums of the input variables are universal approximators for continuous functions.
Remark 3. The proof shows that the number $n$ of sums that arise in the circuit constructed in theorem 3 can essentially be bounded in terms of the number of hidden gates in a one-hidden-layer circuit $C$ of sigmoidal gates that approximates the given continuous function $h$ (more precisely,
³ More precisely, we set $f(z) := r_n$ for the $n$th output variable $r_n$ of soft-$\mathrm{WTA}_{n,k}^g$.
the function $g^{-1} \circ h$), the maximal size of the weights (expressed as a multiple of the smallest nonzero weight) on the second layer of $C$, and $1 / \varepsilon$ (where $\varepsilon$ is the desired approximation precision). Numerous practical applications of backprop suggest that at least the first of these three numbers can be kept fairly small for most functions $h$ that are of practical interest.

Proof of Theorem 3. The proof has the following structure. We first apply the regular universal approximation theorem for sigmoidal neural nets in order to approximate the continuous function $g^{-1}(h(z))$ by a weighted sum $\sum_{j=1}^{\tilde{n}} \tilde{b}_j G_j^1(z)$ of $\pi$-gates, that is, of sigmoidal gates $G_j^1$ that employ the piecewise linear activation function $\pi$ defined above. As the next step, we simplify the weights and approximate the weighted sum $\sum_{j=1}^{\tilde{n}} \tilde{b}_j G_j^1(z)$ by another weighted sum $\sum_{j=1}^{\hat{n}} \frac{a_j}{\hat{k}} G_j^2(z)$ of $\pi$ gates $G_j^2$ with $a_j \in \{-1, 1\}$ for $j = 1, \ldots, \hat{n}$ and some suitable $\hat{k} \in \mathbb{N}$. We then eliminate all negative weights by making use of the fact that one can rewrite $-G_j^2(z)$ as $G_j^3(z) - 1$ with another $\pi$ gate $G_j^3$, for all $j$ with $a_j = -1$. By setting $G_j^3 := G_j^2$ for all $j$ with $a_j = 1$, we then have $\sum_{j=1}^{\hat{n}} \frac{a_j}{\hat{k}} G_j^2(z) = \frac{\sum_{j=1}^{\hat{n}} G_j^3(z) - \hat{T}}{\hat{k}}$ for all $z \in D$, with some $\hat{T} \in \{0, \ldots, \hat{n}\}$. As the next step, we approximate each $\pi$ gate $G^3$ by a sum $\frac{1}{\ell} \sum_{i=1}^{\ell} G^{(i)}$ of $\ell$ threshold gates $G^{(i)}$. Thus, altogether we have constructed an approximation of the continuous function $g^{-1}(h(z))$ by a weighted sum $\frac{\sum_{j=1}^{n'} G_j(z) - T}{k}$ of threshold gates with a uniform value $\frac{1}{k}$ for all weights.⁴ By expanding our technique from the proof of theorem 2, we can replace this simple sum of threshold gates by a single soft-$\mathrm{WTA}_{n,k}^g$ gate applied to weighted sums in such a way that the resulting WTA circuit approximates the given function $h$.

We now describe the proof in detail. According to Leshno, Lin, Pinkus, and Schocken (1993) there exist $\tilde{n} \in \mathbb{N}$, $\tilde{b}_j \in \mathbb{R}$, and $\pi$ gates $G_j^1$ for $j = 1, \ldots, \tilde{n}$ so that

$$\left| \sum_{j=1}^{\tilde{n}} \tilde{b}_j G_j^1(z) - g^{-1}(h(z)) \right| < \frac{\tilde{\varepsilon}}{3} \quad \text{for all } z \in D,$$

where $g^{-1}$ is the inverse of the restriction of the given function $g$ to $[0, 1]$, and $\tilde{\varepsilon} > 0$ is chosen so that $|g(x) - g(y)| < \varepsilon$ for any $x, y \in \mathbb{R}$ with $|x - y| < \tilde{\varepsilon}$. We are relying here on the fact that $g^{-1} \circ h$ is continuous.

⁴ One could, of course, also apply the universal approximation theorem directly to threshold gates (instead of $\pi$ gates) in order to get an approximation of $g^{-1} \circ h$ by a weighted sum of threshold gates. But the elimination of negative weights would then give rise to an exception set $S$, as in theorem 2.
As the next step, we simplify the weights. It is obvious that there exist $\hat{k} \in \mathbb{N}$ and $b_1, \ldots, b_{\tilde{n}} \in \mathbb{Z}$ so that $\sum_{j=1}^{\tilde{n}} |\tilde{b}_j - \frac{b_j}{\hat{k}}| < \frac{\tilde{\varepsilon}}{3}$. We then have

$$\left| \sum_{j=1}^{\tilde{n}} \frac{b_j}{\hat{k}} G_j^1(z) - g^{-1}(h(z)) \right| < \frac{2}{3} \tilde{\varepsilon} \quad \text{for all } z \in D.$$

We set $\hat{n} := \sum_{j=1}^{\tilde{n}} |b_j|$, and for all $j \in \{1, \ldots, \tilde{n}\}$ we create $|b_j|$ copies of the $\pi$-gate $G_j^1$. Let $G_1^2, \ldots, G_{\hat{n}}^2$ be the resulting sequence of $\pi$ gates $G_j^2$. Then there exist $a_j \in \{-1, 1\}$ for $j = 1, \ldots, \hat{n}$ so that

$$\left| \sum_{j=1}^{\hat{n}} \frac{a_j}{\hat{k}} G_j^2(z) - g^{-1}(h(z)) \right| < \frac{2}{3} \tilde{\varepsilon} \quad \text{for all } z \in D.$$
We now eliminate all negative weights from this weighted sum. More precisely, we show that there are $\pi$ gates $G_1^3, \ldots, G_{\hat{n}}^3$ so that

$$\sum_{j=1}^{\hat{n}} \frac{a_j}{\hat{k}} G_j^2(z) = \frac{\sum_{j=1}^{\hat{n}} G_j^3(z) - \hat{T}}{\hat{k}} \quad \text{for all } z \in D, \tag{4.1}$$

where $\hat{T}$ is the number of $j \in \{1, \ldots, \hat{n}\}$ with $a_j = -1$. Consider some arbitrary $j \in \{1, \ldots, \hat{n}\}$ with $a_j = -1$. Assume that $G_j^2$ is defined by $G_j^2(z) = \pi(w \cdot z + w_0 + \frac{1}{2})$ for all $z \in \mathbb{R}^m$, with parameters $w \in \mathbb{R}^m$ and $w_0 \in \mathbb{R}$. Because of the properties of the function $\pi$, we have

$$-\pi\left(w \cdot z + w_0 + \frac{1}{2}\right) = -1 + \pi\left(-w \cdot z - w_0 + \frac{1}{2}\right)$$

for all $z \in \mathbb{R}^m$. Indeed, if $|w \cdot z + w_0| \le \frac{1}{2}$ we have $-\pi(w \cdot z + w_0 + \frac{1}{2}) = -w \cdot z - w_0 - \frac{1}{2} = -1 + \pi(-w \cdot z - w_0 + \frac{1}{2})$. If $w \cdot z + w_0 < -\frac{1}{2}$ then $-\pi(w \cdot z + w_0 + \frac{1}{2}) = 0 = -1 + \pi(-w \cdot z - w_0 + \frac{1}{2})$, and if $w \cdot z + w_0 > \frac{1}{2}$ then $-\pi(w \cdot z + w_0 + \frac{1}{2}) = -1 = -1 + \pi(-w \cdot z - w_0 + \frac{1}{2})$. Thus, if we define the $\pi$ gate $G_j^3$ by

$$G_j^3(z) = \pi\left(-w \cdot z - w_0 + \frac{1}{2}\right) \quad \text{for all } z \in \mathbb{R}^m,$$

we have $-G_j^2(z) = -1 + G_j^3(z)$ for all $z \in \mathbb{R}^m$. Besides transforming all $\pi$ gates $G_j^2$ with $a_j = -1$ in this fashion, we set $G_j^3 := G_j^2$ for all $j \in \{1, \ldots, \hat{n}\}$ with $a_j = 1$. Obviously equation 4.1 is satisfied with these definitions.

Since the activation function $\pi$ can be approximated arbitrarily closely by step functions, there exists for each $\pi$ gate $G^3$ a sequence $G^{(1)}, \ldots, G^{(\ell)}$ of threshold gates so that

$$\left| G^3(z) - \frac{\sum_{i=1}^{\ell} G^{(i)}(z)}{\ell} \right| < \frac{\tilde{\varepsilon} \cdot \hat{k}}{3 \cdot \hat{n}} \quad \text{for all } z \in \mathbb{R}^m.$$
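The article only asserts the existence of such a sequence of threshold gates. One concrete instantiation (my own choice of thresholds, not taken from the text) averages $\ell$ step functions with thresholds $(i - 0.5)/\ell$, which approximates $\pi$ uniformly within $1/(2\ell)$:

```python
def pi_fn(x):
    """The piecewise linear squashing function pi."""
    return min(1.0, max(0.0, x))

def pi_staircase(x, l):
    """Average of l threshold gates G_(i)(x) = 1 iff x >= (i - 0.5)/l.

    This rounds x to the nearest multiple of 1/l on [0, 1], so it
    approximates pi uniformly within 1/(2*l).
    """
    return sum(1 for i in range(1, l + 1) if x >= (i - 0.5) / l) / l

l = 50
xs = [j / 1000.0 - 0.5 for j in range(2001)]  # grid over [-0.5, 1.5]
assert max(abs(pi_fn(x) - pi_staircase(x, l)) for x in xs) <= 1 / (2 * l) + 1e-12
```

Choosing $\ell$ large enough therefore meets any prescribed bound $\tilde{\varepsilon} \hat{k} / (3 \hat{n})$.
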
By applying this transformation to all $\pi$-gates $G_1^3, \ldots, G_{\hat{n}}^3$, we arrive at a sequence $G_1, \ldots, G_{n'}$ of threshold gates with $n' := \hat{n} \cdot \ell$ so that one has, for $k := \hat{k} \cdot \ell$ and $T := \hat{T} \cdot \ell$,

$$\left| \frac{\sum_{j=1}^{n'} G_j(z) - T}{k} - g^{-1}(h(z)) \right| < \tilde{\varepsilon} \quad \text{for all } z \in D.$$

According to the choice of $\tilde{\varepsilon}$, this implies that

$$\left| g\left( \frac{\sum_{j=1}^{n'} G_j(z) - T}{k} \right) - h(z) \right| < \varepsilon \quad \text{for all } z \in D.$$

By adding dummy threshold gates $G_j$ that give constant output for all $z \in D$, one can adjust $n'$ and $T$ so that $n' \ge 2k$ and $T = \frac{n' - 1}{2}$, without changing the value of $\sum_{j=1}^{n'} G_j(z) - T$ for any $z \in D$. It remains to show that

$$g\left( \frac{\sum_{j=1}^{n'} G_j(z) - \frac{n' - 1}{2}}{k} \right)$$

can be computed for all $z \in D$ by some soft winner-take-all gate soft-$\mathrm{WTA}_{n,k}^g$ applied to $n$ weighted sums with positive coefficients. We set $n := n' + 1$. For $j = 1, \ldots, n'$ and $i = 1, \ldots, m$ let $w_i^j, \Theta^j \in \mathbb{R}$ be parameters so that

$$G_j(z) = 1 \Leftrightarrow \sum_{i \colon w_i^j > 0} |w_i^j| z_i - \sum_{i \colon w_i^j < 0} |w_i^j| z_i \ge \Theta^j.$$

Then $\sum_{j=1}^{n'} G_j(z) = |\{j \in \{1, \ldots, n'\} \colon S_{n'+1} \ge S_j\}|$ for

$$S_{n'+1} := \sum_{j=1}^{n'} \sum_{i \colon w_i^j > 0} |w_i^j| z_i$$

and

$$S_j := \sum_{i \colon w_i^j < 0} |w_i^j| z_i + \Theta^j + \sum_{\ell \ne j} \sum_{i \colon w_i^\ell > 0} |w_i^\ell| z_i.$$

This holds because we have for every $j \in \{1, \ldots, n'\}$:

$$S_{n'+1} \ge S_j \Leftrightarrow \sum_{i \colon w_i^j > 0} |w_i^j| z_i - \sum_{i \colon w_i^j < 0} |w_i^j| z_i \ge \Theta^j.$$

Since $n = n' + 1$ and therefore $\frac{n' - 1}{2} + 1 = \frac{n}{2}$, the preceding implies that the $n$th output variable $r_n$ of soft-$\mathrm{WTA}_{n,k}^g$ applied to $S_1, \ldots, S_n$ outputs

$$g\left( \frac{|\{j \in \{1, \ldots, n\} \colon S_n \ge S_j\}| - \frac{n}{2}}{k} \right) = g\left( \frac{\sum_{j=1}^{n'} G_j(z) - \frac{n' - 1}{2}}{k} \right)$$

for all $z \in D$. Note that all weights in the weighted sums $S_1, \ldots, S_n$ are positive.
5 Conclusions
We have established the first rigorous analytical results regarding the computational power of winner-take-all. The lower-bound result of section 2 shows that the computational power of hard winner-take-all is already quite large, even when compared with the arguably most powerful gate commonly studied in circuit complexity theory: the threshold gate (also referred to as a McCulloch-Pitts neuron or perceptron). Theorem 1 yields an optimal quadratic lower bound for computing hard winner-take-all on any feedforward circuit consisting of threshold gates. This implies that no circuit consisting of linearly many threshold gates with arbitrary (i.e., feedforward, lateral, and recurrent) connections can compute hard winner-take-all in sublinear time. Since approximate versions of winner-take-all can be computed very fast in linear-size analog VLSI chips (Lazzaro et al., 1989), this lower-bound result may be viewed as evidence for a possible gain in computational power that can be achieved by lateral connections in analog VLSI (and apparently also in cortical circuits).

It is well known (Minsky & Papert, 1969) that a single threshold gate is unable to compute certain important functions, whereas circuits of moderate (i.e., polynomial) size consisting of two layers of threshold gates with polynomial-size weights have remarkable computational power (Siu et al., 1995). We showed in section 3 that any such two-layer (i.e., one-hidden-layer) circuit can be simulated by a single competitive stage, applied to polynomially many weighted sums with positive integer weights of polynomial size.

In section 4, we analyzed the computational power of soft winner-take-all gates in the context of analog computation. It was shown that a single soft winner-take-all gate may serve as the only nonlinearity in a class of circuits that have universal computational power in the sense that they can approximate any continuous function.
In addition, we showed that this result is robust with regard to any continuous nonlinear warping of the output of such a soft winner-take-all gate, which is likely to occur in an analog implementation of soft winner-take-all in hardware or wetware. Furthermore, our novel universal approximators require only positive linear operations besides soft winner-take-all. This shows that, in principle, no computational power is lost if inhibition is used exclusively for unspecific lateral inhibition in neural circuits, and no flexibility is lost if synaptic plasticity (i.e., "learning") is restricted to excitatory synapses. This result appears to be of interest for understanding the function of biological neural systems, since typically only 15% of the synapses in the cortex are inhibitory, and plasticity of those synapses is somewhat dubious.

Our somewhat surprising results regarding the computational power and universality of winner-take-all point to further opportunities for low-power analog VLSI chips, since winner-take-all can be implemented very efficiently in this technology. In particular, our theoretical results of section 4 predict that efficient implementations of soft winner-take-all will be useful
in many contexts. Previous analog VLSI implementations of winner-take-all have primarily been used for special-purpose computational tasks. In contrast, our results show that a VLSI implementation of a single soft winner-take-all gate, in combination with circuitry for computing weighted sums of the input variables, yields devices with universal computational power for analog computation. A more qualitative account of the results of this article can be found in Maass (1999).

The theoretical results of this article support the viability of an alternative style of neural circuit design, in which complex multilayer perceptrons (feedforward circuits consisting of threshold gates or sigmoidal gates with positive and negative weights) are replaced by a single competitive stage applied to positive weighted sums. One may argue that this circuit design is more compatible with anatomical and physiological data from biological neural circuits.

Acknowledgments
I thank Moshe Abeles, Peter Auer, Rodney Douglas, Timmer Horiuchi, Giacomo Indiveri, and Shih-chii Liu for helpful discussions and the anonymous referees for helpful comments. Research for this article was partially supported by the ESPRIT Working Group NeuroCOLT, No. 8556, and the Fonds zur Förderung der wissenschaftlichen Forschung (FWF), Austria, project P12153.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Amari, S., & Arbib, M. (1977). Competition and cooperation in neural nets. In J. Metzler (Ed.), Systems neuroscience (pp. 119–165). San Diego: Academic Press.
Andreou, A. G., Boahen, K. A., Pouliquen, P. O., Pavasovic, A., Jenkins, R. E., & Strohbehn, K. (1991). Current-mode subthreshold MOS circuits for analog VLSI neural systems. IEEE Trans. on Neural Networks, 2, 205–213.
Arbib, M. A. (1995). The handbook of brain theory and neural networks. Cambridge, MA: MIT Press.
Brajovic, V., & Kanade, T. (1998). Computational sensor for visual tracking with attention. IEEE Journal of Solid-State Circuits, 33(8), 1199–1207.
Choi, J., & Sheu, B. J. (1993). A high-precision VLSI winner-take-all circuit for self-organizing neural networks. IEEE J. Solid-State Circuits, 28, 576–584.
Coultrip, R., Granger, R., & Lynch, G. (1992). A cortical model of winner-take-all competition via lateral inhibition. Neural Networks, 5, 47–54.
DeWeerth, S. P., & Morris, T. G. (1994). Analog VLSI circuits for primitive sensory attention. Proc. IEEE Int. Symp. Circuits and Systems, 6, 507–510.
DeYong, M., Findley, R., & Fields, C. (1992). The design, fabrication, and test of a new VLSI hybrid analog-digital neural processing element. IEEE Trans. on
Neural Networks, 3, 363–374.
Elias, S. A., & Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short term memory of shunting on-center off-surround networks. Biol. Cybern., 20, 69–98.
Fang, Y., Cohen, M., & Kincaid, M. (1996). Dynamics of a winner-take-all neural network. Neural Networks, 9, 1141–1154.
Horiuchi, T. K., Morris, T. G., Koch, C., & DeWeerth, S. P. (1997). Analog VLSI circuits for attention-based visual tracking. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems (pp. 706–712). Cambridge, MA: MIT Press.
Indiveri, G. (in press). Modeling selective attention using a neuromorphic analog-VLSI device. Neural Computation.
Lazzaro, J., Ryckebusch, S., Mahowald, M. A., & Mead, C. A. (1989). Winner-take-all networks of O(n) complexity. In D. Touretzky (Ed.), Advances in neural information processing systems, 1 (pp. 703–711). San Mateo, CA: Morgan Kaufmann.
Leshno, M., Lin, V., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6, 861–867.
Maass, W. (1999). Neural computation with winner-take-all as the only nonlinear operation. In Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Meador, J. L., & Hylander, P. D. (1994). Pulse coded winner-take-all networks. In M. E. Zaghloul, J. Meador, & R. W. Newcomb (Eds.), Silicon implementation of pulse coded neural networks (pp. 79–99). Boston: Kluwer Academic Publishers.
Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.
Niebur, E., & Koch, C. (1998). Computational architectures for attention. In R. Parasuraman (Ed.), The attentive brain (pp. 163–186). Cambridge, MA: MIT Press.
Savage, J. E. (1998). Models of computation: Exploring the power of computing. Reading, MA: Addison-Wesley.
Shepherd, G. M. (Ed.). (1998). The synaptic organization of the brain.
Oxford: Oxford University Press.
Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Englewood Cliffs, NJ: Prentice Hall.
Thorpe, S. J. (1990). Spike arrival times: A highly efficient coding scheme for neural networks. In R. Eckmiller, G. Hartmann, & G. Hauske (Eds.), Parallel processing in neural systems and computers (pp. 91–94). Amsterdam: Elsevier.
Urahama, K., & Nagao, T. (1995). k-winner-take-all circuit with O(N) complexity. IEEE Trans. on Neural Networks, 6, 776–778.
Yuille, A. L., & Grzywacz, N. M. (1989). A winner-take-all mechanism based on presynaptic inhibition. Neural Computation, 1, 334–347.
Wegener, I. (1987). The complexity of boolean functions. New York: Wiley.

Received May 21, 1999; accepted November 15, 1999.
NOTE
Communicated by Brendan J. Frey
Reclassification as Supervised Clustering

A. Sierra
F. Corbacho
Escuela Técnica Superior de Informática, Universidad Autónoma de Madrid, 28049 Madrid, Spain

In some branches of science, such as molecular biology, classes may be defined but not completely trusted. Sometimes posterior analysis proves them to be partially incorrect. Despite its relevance, this phenomenon has not received much attention within the neural computation community. We define reclassification as the task of redefining some given classes by maximum likelihood learning in a model that contains both supervised and unsupervised information. This approach leads to supervised clustering with an additional complexity-penalizing term on the number of new classes. As a proof of concept, a simple reclassification algorithm is designed and applied to a data set of gene sequences. To test the performance of the algorithm, two of the original classes are merged. The algorithm is capable of unraveling the original three-class hidden structure, in contrast to the unsupervised version (K-means); moreover, it predicts the subdivision of one of the original classes into two different ones.
1 Introduction
Learning involves extracting meaning from observation. For example, in reading handwritten zip codes, observations are images taken from envelopes, and meanings are the associated digits. Neural network learning has traditionally been subdivided into supervised and unsupervised learning, depending on the presence or absence of an external teaching signal (Hinton, 1989). Obviously, any classification problem can be approached in an unsupervised way. Supervision is feasible only provided a teacher is available. The teacher labels the training patterns with their corresponding meanings. Even in such cases we still have the possibility of using unsupervised learning. In fact, when the repertoire of meanings or classes is well known in advance, common practice dictates using unsupervised techniques as hidden steps (not directly connected to the output labels) of a leading supervised mechanism. Such is the case for counterpropagation networks (Hecht-Nielsen, 1987) and radial basis functions (RBFs) (Moody & Darken, 1989). In all of these cases, dispersion in the input space is not explicitly penalized by the cost function.

Neural Computation 12, 2537–2546 (2000) © 2000 Massachusetts Institute of Technology
Hiding the unsupervised component makes sense if we completely trust the output categorization, that is, the number and nature of the classes. However, there may be reason to suspect that the categories could be improved. Although not very frequent in machine learning, this is not as unusual a situation as might be initially thought. There are areas of active research in science, such as virology and genome research, where a major puzzle is the elucidation of the actual classes (Ripley, 1996; Compton, Culham, & Jury, 1998; Martin, Högberg, & Nylund, 1998). This is obviously a necessary step before there can be any serious attempt at classification. In these cases, unsupervised techniques do not suffice. It would be foolish to overlook the available knowledge about the nature and number of classes. What we need is a mechanism capable of redesigning the original labeling in the light of the input observations. Input and output spaces have to be placed at the same level. Reclassification takes both input measurements and output labels into account in order to find a partition (i.e., a new set of classes). This is what our new cost function is intended to do, since it implements a trade-off between the input variance and the output quadratic error. The unsupervised error is no longer hidden behind the supervised error; both work hand in hand. There is a compromise between both drives that leads to a reclassification of the training set, hopefully shedding new light on the problem at hand. In this sense, this is not a classification device but a clustering one in an enlarged input-output space.

2 Reclassification from a Statistical Point of View
Let X be a set of N data points, X = {x_1, ..., x_N}, and {d_1, ..., d_N} their corresponding labels within a set of J supervised classes. A new set Z = {z_1, ..., z_N} can be formed out of the original attributes plus extra ones coding the label, as in z_i = (x_i, d_i). Reclassification involves finding partitions {C_1, C_2, ..., C_K} of this extended set Z. A good partition or new class must embody the information contained in both X and the original classes.

Let S(N, K) denote the number of possible partitions of N patterns into K groups. It can be shown that S fulfills the following recursive equation,

    S(N, K) = K S(N − 1, K) + S(N − 1, K − 1),    (2.1)

whose solutions are the so-called Stirling numbers of the second kind (e.g., Liu, 1968):

    S(N, K) = (1/K!) Σ_{i=0}^{K} (−1)^{K−i} C(K, i) i^N.    (2.2)

For any reasonable N, S(N, K) is a prohibitively large number, thus precluding any attempt to perform an exhaustive brute-force examination of
the whole combinatorial space. We must hence introduce a framework in which reasonably good partitions (where "reasonably good" will later have a formal meaning) can be automatically found.

We shall now formalize reclassification as the task of finding partitions of the extended data set Z by maximum likelihood. Each partition of the data set Z is specified by a collection of pairs (z, θ), where θ is the representative associated with z. The likelihood of a partition with representatives (θ_1, ..., θ_K) can be expressed as

    L = Π_{z∈Z} p(z, θ) = Π_{z∈Z} p(z | θ) p(θ),    (2.3)

where we have assumed that each pair (z, θ) is drawn independently from the same distribution. It is more convenient to minimize the negative logarithm of the likelihood,

    E = −ln L = −Σ_{z∈Z} ln p(z | θ) − Σ_{z∈Z} ln p(θ),    (2.4)

where E is called the error function. The second term in equation 2.4 can be dropped if equal a priori probabilities are assumed for all possible partitions. However, if no other a priori knowledge is available, a more reasonable assumption might be a gaussian distribution N(J, ρ) around the number J of original classes,

    p(θ) = (1/(2πρ²)^{1/2}) exp(−(K − J)²/(2ρ²)).    (2.5)

After substituting equation 2.5 into 2.4 and eliminating constant terms, we are left with the following error, which explicitly penalizes the complexity of the model:

    E = −Σ_{z∈Z} ln p(z | θ) + N(K − J)²/(2ρ²).    (2.6)

We will further assume that each pattern z can be generated from its partition representative θ by adding gaussian noise ε,

    z = θ + ε,    (2.7)

so that the conditional probability can be written

    p(z | θ) = (1/(2πσ²)^{1/2}) exp(−(z − θ)²/(2σ²)).    (2.8)
Substituting equation 2.8 into 2.6 and again dropping the constant terms, we finally obtain the familiar expression for the sum-of-squares error function, plus an extra term that controls the complexity, for a partition (C_1, ..., C_K) with representatives (θ_1, ..., θ_K):

    E = (1/(2σ²)) Σ_{k=1}^{K} Σ_{z∈C_k} (z − θ_k)² + (N/(2ρ²)) (K − J)².    (2.9)
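Before moving on, the size of the search space in equations 2.1 and 2.2 can be checked numerically. The following is a minimal sketch of ours, not part of the original note; the function names are illustrative:

```python
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling_rec(n, k):
    """S(n, k) via the recursion S(n, k) = k*S(n-1, k) + S(n-1, k-1) (eq. 2.1)."""
    if n == k:
        return 1
    if k == 0 or k > n:
        return 0
    return k * stirling_rec(n - 1, k) + stirling_rec(n - 1, k - 1)

def stirling_sum(n, k):
    """S(n, k) via the closed form (1/k!) * sum_i (-1)^(k-i) C(k, i) i^n (eq. 2.2)."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** n
               for i in range(k + 1)) // factorial(k)

# The two routes agree:
assert stirling_rec(10, 3) == stirling_sum(10, 3) == 9330
```

Even for a modest data set, say N = 60 and K = 3, S(N, K) already runs to 28 digits, which is why an exhaustive search is hopeless.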
To summarize, in this section we have derived the sum-of-squares error function for reclassification from the principle of maximum likelihood, under the assumption of gaussian-distributed data around partition representatives. The minimization of this error function can be approached by different means. The next section presents a simple but effective algorithm.

3 Supervised Clustering for Reclassification
Our solution proceeds in the spirit of K-means (MacQueen, 1967), searching for locally optimal partitions, but in the expanded set Z instead of the original space X. Grouping in K-means is done on the basis of similarities among observations. Yet we are aiming at grouping patterns, taking into account both input similarities and output labels. This is accomplished by means of our new cost function, equation 2.9, with the joint input-output codification z = (x, √α d):

    E = Σ_{k=1}^{K} Σ_{x∈C_k} (x − x̄_k)² + α Σ_{k=1}^{K} Σ_{x∈C_k} (d − d̄_k)² + β (K − J)²,    (3.1)

where d is the original class of x and d̄_k is the representative of the newly assigned class. The parameter α controls the degree of supervision and β the model complexity. The first term in equation 3.1 measures the dispersion of patterns around a number K of centers or representatives in the input space. The second term measures the dispersion of the original categories around the newfound class representatives. The last term controls an unnecessary increase in the number of new classes. The coupled minimization of all of the terms yields a distribution of patterns around a collection of K pairs,

    θ_k = (x̄_k, d̄_k),    (3.2)

each of which consists of a representative x̄_k in the original input space and its corresponding label d̄_k.
Following is an algorithm for the incremental minimization of the first two terms in equation 3.1 that also searches for a natural number of new classes:

1. Initialize the number of new classes K to 1.

2. Initialize the supervision parameter α to 0.

3. Initialize clusters (x̄_k, d̄_k) randomly.

4. Repeat the following two steps for each pattern (x, d):

   (a) Calculate

       I_k = (x − x̄_k)² + α (d − d̄_k)²    (3.3)

       for all k = 1, ..., K and find the winning cluster as the one that minimizes I_k.

   (b) Update centers only if pattern z = (x, d) moved from one cluster (k) to another (m):

       x̄_k = (n_k x̄_k − x)/(n_k − 1),    d̄_k = (n_k d̄_k − d)/(n_k − 1)    (3.4)

       x̄_m = (n_m x̄_m + x)/(n_m + 1),    d̄_m = (n_m d̄_m + d)/(n_m + 1).    (3.5)

5. If any pattern movement took place in step 4b, go back to step 4.

6. Calculate the classification error by assigning each pattern (x, d) to the class most represented in its cluster.

7. Increment α and go back to step 3 while the classification error is different from zero.

8. Select the value of α for which the classification error begins to drop significantly.

9. Calculate the quadratic error (transformed likelihood). If the error decrease compared with the previous one is not significant, exit the algorithm; otherwise increment K and go back to step 2.

The selection of α in step 8 is somewhat critical in this system. Since the extended patterns contain the supervised label, the classification error eventually becomes zero. However, we do not want to reinforce the original labels; rather, we want to find the right compromise between them and the unsupervised information. This is why we use early stopping to avoid overfitting of the original labels. This will be made clearer in the following section.
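Steps 3 through 5 of the procedure above can be sketched in a few lines. This is our own minimal single-run illustration for fixed K and α (scalar labels are assumed, the representatives are recomputed in batch rather than by the incremental updates 3.4 and 3.5, and the outer loops over α and K are omitted); it is not the authors' code:

```python
import random

def supervised_kmeans(X, d, K, alpha, n_iter=100, seed=0):
    """Steps 3-5 for fixed K and alpha: cluster the extended patterns
    z = (x, d), assigning each to the cluster minimizing
    I_k = (x - x_k)^2 + alpha * (d - d_k)^2 (equation 3.3).
    X: list of feature vectors; d: list of numeric labels."""
    rng = random.Random(seed)
    assign = [rng.randrange(K) for _ in X]           # step 3: random clusters
    centers = []
    for _ in range(n_iter):
        # Recompute the representatives (x_k, d_k) as cluster means
        # (a batch simplification of the incremental updates 3.4-3.5).
        centers = []
        for k in range(K):
            members = [i for i, a in enumerate(assign) if a == k]
            if not members:                          # reseed an empty cluster
                members = [rng.randrange(len(X))]
            xk = [sum(X[i][j] for i in members) / len(members)
                  for j in range(len(X[0]))]
            dk = sum(d[i] for i in members) / len(members)
            centers.append((xk, dk))
        moved = False
        for i, (x, di) in enumerate(zip(X, d)):
            def cost(k):                             # step 4a: I_k of eq. 3.3
                xk, dk = centers[k]
                return (sum((a - b) ** 2 for a, b in zip(x, xk))
                        + alpha * (di - dk) ** 2)
            best = min(range(K), key=cost)
            if best != assign[i]:
                assign[i], moved = best, True
        if not moved:                                # step 5: convergence
            break
    return assign, centers
```

With α = 0 this reduces to ordinary K-means on X; increasing α penalizes placing patterns with different original labels in the same cluster.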
Figure 1: Cluster class distribution for two original classes of gene sequences (EI+NE, IE) and K = 3. The entries represent the number of patterns of each new class (A, B, C) present in each original class (EI, IE, NE). (Left) α = 0 (K-means). (Right) α = 4.2. Original classes EI and NE are shown decomposed, although the algorithm is fed only with classes EI+NE and IE.
4 Experiments
The main purpose of a reclassification algorithm is to determine the extent to which a given set of categories is consistent with all the available information. The results can no doubt be assessed only by expert knowledge. However, to show the validity of our reclassification approach, we note the following: any labeled data set can be used to validate a reclassification algorithm by merging two or more of the original classes (assigning them the same label) and checking whether the reclassification algorithm unravels the artificially hidden structure. In this way, the classification error on the original class space can be used to assess the performance of the reclassification algorithm.

We have applied our validation methodology to the Genbank 64.1 data set (Noordewier, Towell, & Shavlik, 1991). This data set contains 3175 patterns composed of 60 nucleotides, each coded by two binary inputs, for a total of 120 input attributes. The problem consists of assigning each sequence to one of the following categories: an intron/exon boundary (IE, an acceptor), an exon/intron boundary (EI, a donor), or none of the above (NE). Our algorithm is fed with supervised information, but classes EI and NE are merged to yield just one class, the other being IE. Our algorithm selects K = 4 and α = 3.4 as the most likely partition. It stops at K = 5 because the likelihood gain from K = 4 to K = 5 is comparatively small. In what follows, we will show only the highlights of the learning process.

The results for α = 0 (pure K-means) and K = 3, when no supervision is available, are shown in Figure 1 (left). The entries of this figure show the number of patterns of each new class (A, B, and C) present in each original class (EI+NE, IE). For convenience, the three original classes are shown. The clusters found do not correspond one to one with the original classes. We can see that most of the EI and IE patterns are assigned to the same cluster (C).
Cluster C does not have the majority of patterns from any of the classes. Hence, K-means, being incapable of separating EI and IE, does not unravel the existence of the three original classes.

For nonzero values of the supervision parameter, this situation is improved. The reclassification algorithm realizes that class EI+NE is in fact composed of two different ones. Figure 1 (right) shows the cluster class distribution for α = 4.2 and K = 3, where early stopping occurs, as the arrow in Figure 2 shows. This avoids reinforcing the supervised labels, since they are precisely what is being questioned. Now all of the clusters are associated with one and only one class—the one that is most represented within that cluster. However, the algorithm still continues with the next K value (K = 4).

Figure 2: Classification error versus supervision parameter for K = 3 (three new classes) and two original classes of gene sequences (EI+NE, IE). The arrow shows where early stopping occurs.

In general, the selection of the number of new classes, K, is a delicate issue, since likelihood 3.1 with β = 0 increases monotonically as a function of K, the complexity of the model. Obviously we do not want to increase complexity indefinitely, and that is why the algorithm stops at K = 5, for which the likelihood gain is comparatively small. Figure 3 shows the error (transformed likelihood) for K = 1, ..., 8. Although the likelihood gain for K = 3 and K = 4 is quite similar, it drops significantly for K = 5, as the dashed line shows. In fact, the original class NE may actually be subdivided for
Figure 3: Quadratic error versus number of new classes. The dashed line shows that K = 5 does not yield a significant increase in likelihood as compared with K = 4.
K = 4 into two different subclasses, bringing to four the total number of new classes found by our reclassification algorithm. This result is shown in Figure 4.

5 Conclusion
Although clustering techniques have been extensively used in the past to support pattern classification as a hidden step, and some work has been done recently on coupled algorithms (Bensaid & Bezdek, 1998; Augusteijn & Steck, 1998), not much effort has been devoted to assessing categories in the light of observations. Procedures of this kind are known to be of key importance in the life sciences. Hence, we believe they deserve some effort from the neural computation community.

Figure 4: Cluster class distribution for K = 4, α = 3.4, and two original classes of gene sequences. Clusters A and D show that class NE gets divided into two different subclasses.

We have approached reclassification as clustering in an enlarged space. Our algorithm, by no means the only possible option, is one of the simplest ways to redesign original categories in the light of both supervised and unsupervised information. It has been applied to the reclassification of a data set of gene sequences with artificially hidden structure. The algorithm unravels the third class and, moreover, predicts the subdivision of one of the original classes into two different ones. Other kinds of unsupervised algorithms, such as hierarchical clustering, can be used to search for partitions. This is part of our future work.

Acknowledgments
We are grateful to the referee for his very constructive criticism. This note has been sponsored by the Spanish Interdepartmental Commission of Science and Technology (CICYT), project number TIC98-0247.

References

Augusteijn, M. F., & Steck, U. J. (1998). Supervised adaptive clustering: A hybrid neural network clustering algorithm. Neural Computing and Applications, 7(1), 78–89.
Bensaid, A. M., & Bezdek, J. C. (1998). Semi-supervised point prototype clustering. International Journal of Pattern Recognition and Artificial Intelligence, 12(5), 625–643.
Compton, J. A., Culham, A., & Jury, S. L. (1998). Reclassification of Actaea to include Cimicifuga and Souliea. Taxon, 47(3), 593–634.
Hecht-Nielsen, R. (1987). Counterpropagation networks. Applied Optics, 26, 4979–4984.
Hinton, G. (1989). Connectionist learning procedures. Artificial Intelligence, 40, 185–234.
Liu, C. L. (1968). Introduction to combinatorial mathematics. New York: McGraw-Hill.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. Le Cam & J. Neyman (Eds.), Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1 (pp. 281–297). Berkeley, CA: University of California Press.
Martin, M. P., Högberg, N., & Nylund, J. E. (1998). Molecular analysis confirms morphological reclassification of Rhizopogon. Mycological Research, 102(7), 855–858.
Moody, J., & Darken, C. (1989). Fast learning in networks of locally tuned processing units. Neural Computation, 1(2), 281–294.
Noordewier, M. O., Towell, G. G., & Shavlik, J. W. (1991). Training knowledge-based neural networks to recognize genes in DNA sequences. In D. Touretzky & R. Lippmann (Eds.), Advances in neural information processing systems, 3. San Mateo, CA: Morgan Kaufmann.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.

Received November 16, 1998; accepted October 19, 1999.
LETTER
Communicated by Laurenz Wiskott
A Model of Invariant Object Recognition in the Visual System: Learning Rules, Activation Functions, Lateral Inhibition, and Information-Based Performance Measures

Edmund T. Rolls
T. Milward
Oxford University, Department of Experimental Psychology, Oxford OX1 3UD, England

VisNet2 is a model to investigate some aspects of invariant visual object recognition in the primate visual system. It is a four-layer feedforward network with convergence to each part of a layer from a small region of the preceding layer, with competition between the neurons within a layer, and with a trace learning rule to help it learn transform invariance. The trace rule is a modified Hebbian rule, which modifies synaptic weights according to both the current firing rates and the firing rates to recently seen stimuli. This enables neurons to learn to respond similarly to the gradually transforming inputs they receive, which over the short term are likely to be about the same object, given the statistics of normal visual inputs. First, we introduce for VisNet2 both single-neuron and multiple-neuron information-theoretic measures of its ability to respond to transformed stimuli. Second, using these measures, we show that quantitatively resetting the trace between stimuli is not necessary for good performance. Third, it is shown that the sigmoid activation functions used in VisNet2, which allow the sparseness of the representation to be controlled, allow good performance when using sparse distributed representations. Fourth, it is shown that VisNet2 operates well with medium-range lateral inhibition, with a radius of the same order of size as the region of the preceding layer from which neurons receive inputs. Fifth, in an investigation of different learning rules for learning transform invariance, it is shown that VisNet2 operates better with a trace rule that incorporates in the trace only activity from the preceding presentations of a given stimulus, with no contribution to the trace from the current presentation, and that this is related to temporal difference learning.

1 Introduction

1.1 Background. There is evidence that over a series of cortical processing stages, the visual system of primates produces a representation of objects that shows invariance with respect to, for example, translation, size, and view, as shown by recordings from single neurons in the temporal lobe

Neural Computation 12, 2547–2572 (2000) © 2000 Massachusetts Institute of Technology
(Desimone, 1991; Rolls, 1992; Rolls & Tovee, 1995; Tanaka, Saito, Fukada, & Moriya, 1991) (see Figure 2). Rolls (1992, 1994, 1995, 1997, 2000) has reviewed much of this neurophysiological work and has advanced a theory for how these neurons could acquire their transform-independent selectivity based on the known physiology of the visual cortex and self-organizing principles (see also Wallis & Rolls, 1997; Rolls & Treves, 1998). Rolls's hypothesis has the following fundamental elements:

- A series of competitive networks, organized in hierarchical layers, exhibiting mutual inhibition over a short range within each layer. These networks allow combinations of features or inputs that occur in a given spatial arrangement to be learned by neurons, ensuring that higher-order spatial properties of the input stimuli are represented in the network.

- A convergent series of connections from a localized population of cells in preceding layers to each cell of the following layer, thus allowing the receptive-field size of cells to increase through the visual processing areas or layers.

- A modified Hebb-like learning rule incorporating a temporal trace of each cell's previous activity, which, it is suggested, will enable the neurons to learn transform invariances (see also Földiák, 1991).

Based on these hypotheses, Wallis and Rolls produced a model of ventral stream cortical visual processing designed to investigate computational aspects of this processing (Wallis, Rolls, & Földiák, 1993; Wallis & Rolls, 1997). With this model, it has been shown that, provided a trace learning rule (to be defined below) is used, the network (VisNet) can learn transform-invariant representations. It can produce neurons that respond to some but not other stimuli with translation and view invariance. In investigations of translation invariance described by Wallis and Rolls (1997), the stimuli were simple stimuli such as T, L, and +, or more complex stimuli such as faces.
Relatively few (up to seven) different stimuli, such as different faces, were used for training. The work described here investigates several key issues that influence how the network operates, including new formulations of the trace rule that significantly improve the performance of the network; how local inhibition within a layer may improve the performance of the network; how neurons with a sigmoid activation function that allows the sparseness of the representation to be controlled can improve the performance of the network; and how the performance of the network scales up when a larger number of training locations and a larger number of stimuli are used. The model described previously by Wallis and Rolls (1997) was modified in the ways just described to produce the model described here, which is denoted VisNet2.

Before describing these new results, we first outline the architecture of the model. In section 2, we describe a way to measure the performance of
the network using information theory measures. This approach has a double advantage: it is based on information theory, which is an appropriate measure of how any information-processing system performs, and it uses the same measurement that has been applied recently to measure the performance of real neurons in the brain (Rolls, Treves, Tovee, & Panzeri, 1997; Rolls, Treves, & Tovee, 1997; Booth & Rolls, 1998), and thus allows direct comparisons, in the same units (bits of information), to be made between the data from real neurons and those from VisNet. We then introduce the new form of the trace rule. In section 3, we show, using the new information theory measure, how the new version of the trace rule operates in VisNet and how the parameters of the lateral inhibition influence the performance of the network; and we describe how VisNet operates when using larger numbers of stimuli than were used previously, together with values for the sparseness of the firing that are closer to those found in the brain (Wallis & Rolls, 1997; Rolls & Treves, 1998).

1.2 The Trace Rule. The learning rule implemented in the simulations uses the spatiotemporal constraints placed on the behavior of real-world objects to learn about natural object transformations. By presenting consistent sequences of transforming objects, the cells in the network can learn to respond to the same object through all of its naturally transformed states, as described by Földiák (1991) and Rolls (1992). The learning rule incorporates a decaying trace of previous cell activity and is henceforth referred to simply as the trace learning rule. The learning paradigm is intended in principle to enable learning of any of the transforms tolerated by inferior temporal cortex neurons (Rolls, 1992, 1995, 1997, 2000; Rolls & Tovee, 1995; Rolls & Treves, 1998; Wallis & Rolls, 1997).
The trace update rule used in the simulations by Wallis and Rolls (1997) is equivalent to that of Földiák and the earlier rule of Sutton and Barto (1981), and can be summarized as follows:

    Δw_j = α ȳ^τ x_j,    where    ȳ^τ = (1 − η) y^τ + η ȳ^{τ−1}

x_j: jth input to the neuron.
ȳ^τ: trace value of the output of the neuron at time step τ.
w_j: synaptic weight between the jth input and the neuron.
y: output from the neuron.
α: learning rate; annealed between unity and zero.
η: trace value; the optimal value varies with presentation sequence length.
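A single time step of this update can be sketched as follows. This is our own illustration in plain Python, not the authors' code; the explicit renormalization of the weight-vector length described below is included:

```python
import math

def trace_rule_update(w, x, y, y_trace_prev, alpha=0.1, eta=0.8):
    """One time step of the trace learning rule:
        y_trace = (1 - eta) * y + eta * y_trace_prev
        dw_j    = alpha * y_trace * x_j
    followed by explicit renormalization of the weight-vector length,
    as is common in competitive networks. Returns (new_weights, y_trace)."""
    y_trace = (1 - eta) * y + eta * y_trace_prev
    w = [wj + alpha * y_trace * xj for wj, xj in zip(w, x)]
    norm = math.sqrt(sum(wj * wj for wj in w)) or 1.0
    return [wj / norm for wj in w], y_trace
```

With η = 0 this collapses to a plain Hebbian update; larger η makes the weight change depend on the cell's activity during recent presentations, which, for temporally contiguous views of the same object, ties the different transforms of that object together.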
Figure 1: Stylized image of the VisNet four-layer network. Convergence through the network is designed to provide fourth-layer neurons with information from across the entire input retina.
To bound the growth of each cell's dendritic weight vector, its length is explicitly normalized (Wallis & Rolls, 1997), as is common in competitive networks (see Rolls & Treves, 1998).

1.3 The Network. Figure 1 shows the general convergent network architecture used. The architecture of VisNet2 is similar to that of VisNet, except that neurons with sigmoid activation functions are used, with explicit control of sparseness; the radius of lateral inhibition, and thus of competition, is larger, as specified below; different versions of the trace learning rule, as described below, are used; and an information-theoretic measure of performance replaces that used earlier by Wallis and Rolls (1997). The network itself is designed as a series of hierarchical, convergent, competitive networks with four layers, specified as shown in Table 1. The forward connections to a cell in one layer come from a small region of the preceding layer defined by the radius in Table 1, which will contain approximately 67% of the connections from the preceding layer.

VisNet is provided with a set of input filters that can be applied to an image to produce inputs to the network that correspond to those provided by simple cells in visual cortical area 1 (V1). The purpose is to enable within VisNet the more complicated response properties of cells between V1 and
Table 1: VisNet Dimensions.

             Dimensions       Number of Connections   Radius
  Layer 4    32 × 32          100                     12
  Layer 3    32 × 32          100                     9
  Layer 2    32 × 32          100                     6
  Layer 1    32 × 32          272                     6
  Retina     128 × 128 × 32   —                       —
[Figure 2 plots receptive field size (deg: 1.3, 3.2, 8.0, 20, 50) against eccentricity (deg: 0 to 50) for LGN, V1, V2, V4, TEO, and TE, with processing progressing from view-dependent, configuration-sensitive combinations of features to larger receptive fields and view independence.]

Figure 2: Convergence in the visual system. V1: visual cortex area V1; TEO: posterior inferior temporal cortex; TE: inferior temporal cortex (IT). (Adapted from Rolls, 1992.)
the inferior temporal cortex (IT) to be investigated, using as inputs natural stimuli such as those that could be applied to the retina of the real visual system. This is to facilitate comparisons between the activity of neurons in VisNet and those in the real visual system to the same stimuli. The elongated orientation-tuned input filters used accord with the general tuning profiles of simple cells in V1 (Hawken & Parker, 1987) and are computed by weighting the difference of two gaussians by a third orthogonal gaussian, as described by Wallis et al. (1993) and Wallis and Rolls (1997). Each individual filter is tuned to spatial frequency (0.0625 to 0.5 cycles pixel⁻¹ over four octaves), orientation (0–135 degrees in steps of 45 degrees), and sign (±1). Of the 272 layer 1 connections of each cell, the number from each group is as shown in Table 2.

Graded local competition is implemented in VisNet in principle by a lateral inhibition-like scheme. The two steps in VisNet were as follows. A local spatial filter was applied to the neuronal responses to implement
Edmund T. Rolls and T. Milward
Table 2: VisNet Layer 1 Connectivity.

Frequency (cycles per pixel)   0.0625   0.125   0.25   0.5
Number of connections          201      50      13     8
lateral inhibition. Contrast enhancement of the firing rates (as opposed to winner-take-all competition) was realized (Wallis and Rolls, 1997) by raising the firing rates r to a fixed power p and then renormalizing the rates (i.e., y_i = r_i^p / (Σ_j r_j^p), where y_i is the firing rate after the competition). The biological rationale for this is the greater-than-linear increase in neuronal firing as a function of the activation of the neuron. This characterizes the first steeply rising portion, close to threshold, of the sigmoid-like activation function of real neurons. In this article, we introduce for VisNet2 a sigmoid activation function. The general calculation of the response of a neuron is as described by Wallis and Rolls (1997) and Wallis et al. (1993). In brief, the postsynaptic firing is calculated as shown in equation 2.1, where in this article the function f incorporates a sigmoid activation function and lateral inhibition, as described in section 2. The measure of network performance used in VisNet, the Fisher metric, reflects how well a neuron discriminates between stimuli, compared to how well it discriminates between different locations (or more generally the images used rather than the objects, each of which is represented by a set of images, over which invariant stimulus or object representations must be learned). The Fisher measure is very similar to taking the ratio of the two F values in a two-way ANOVA, where one factor is the stimulus shown and the other factor is the position in which a stimulus is shown. The measure takes a value greater than 1.0 if a neuron has more different responses to the stimuli than to the locations. Further details of how the measure is calculated are given by Wallis and Rolls (1997). Measures of network performance based on information theory and similar to those used in the analysis of the firing of real neurons in the brain (see Rolls and Treves, 1998) are introduced in this article for VisNet2.
They are described in section 2, compared to the Fisher measure at the start of section 3, and then used for the remainder of the article. 2 Methods 2.1 Modified Trace Learning Rules. The trace rule that has been used in VisNet can be expressed by equation 2.4, where t indexes the current trial and t−1 the previous trial. (The superscript on w indicates the version of the learning rule.) This is similar to Sutton and Barto (1981) and one of the formulations of Földiák (1991). The postsynaptic term in the synaptic
modification is based on the trace from previous trials available at time t−1 and on the current activity (at time t), with η determining the relative proportion of these two. This is expressed in equation 2.3, where y is the postsynaptic firing rate calculated from the presynaptic firing rates x_j as summarized in equation 2.1, and ȳ^t is the postsynaptic trace at time t. In this article, we compare this type of rule with a different type in which the postsynaptic term depends on only the trace left from previous trials, available from t−1, without including a contribution from the current instantaneous activity y of the neuron. This is expressed in equation 2.5. In addition, we introduce a comparison of a trace incorporated in the presynaptic term (with the two variants for times t and t−1 shown in equations 2.6 and 2.7). The presynaptic trace, x̄_j^t, is calculated as shown in equation 2.2. We also introduce a comparison of traces incorporated in both the presynaptic and the postsynaptic terms (with the two variants for times t and t−1 shown in equations 2.8 and 2.9). We will demonstrate that using the trace at time t−1 is significantly better than the trace at time t, but that the type of trace (presynaptic, postsynaptic, or both) does not affect performance. What is achieved computationally by these different rules, and their biological plausibility, are considered in section 4. Following are these equations:

y = f( Σ_j w_j x_j )                      (2.1)

x̄_j^t = (1 − η) x_j^t + η x̄_j^{t−1}       (2.2)

ȳ^t = (1 − η) y^t + η ȳ^{t−1}             (2.3)

Δw_j^1 ∝ ȳ^t x_j^t                        (2.4)

Δw_j^2 ∝ ȳ^{t−1} x_j^t                    (2.5)

Δw_j^3 ∝ y^t x̄_j^t                        (2.6)

Δw_j^4 ∝ y^t x̄_j^{t−1}                    (2.7)

Δw_j^5 ∝ ȳ^t x̄_j^t                        (2.8)

Δw_j^6 ∝ ȳ^{t−1} x̄_j^{t−1}                (2.9)
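As a concrete illustration, the trace computations of equations 2.2 and 2.3 and the six weight-update variants of equations 2.4 through 2.9 can be sketched in a few lines. This is a minimal sketch of ours, not the VisNet source: the function names and the learning-rate constant alpha are illustrative, and the rule labels follow the pre/post/both naming of Figure 9.

```python
import numpy as np

def update_traces(y, x, y_bar_prev, x_bar_prev, eta=0.8):
    """One step of the exponential trace updates (equations 2.2 and 2.3):
    trace = (1 - eta) * current value + eta * previous trace."""
    y_bar = (1 - eta) * y + eta * y_bar_prev
    x_bar = (1 - eta) * x + eta * x_bar_prev
    return y_bar, x_bar

def delta_w(rule, y, x, y_bar, x_bar, y_bar_prev, x_bar_prev, alpha=0.1):
    """Weight change for the six trace-rule variants (equations 2.4-2.9),
    selected by the labels used in Figure 9."""
    post = {  # postsynaptic factor
        'post(t)':   y_bar,       # eq. 2.4
        'post(t-1)': y_bar_prev,  # eq. 2.5
        'pre(t)':    y,           # eq. 2.6
        'pre(t-1)':  y,           # eq. 2.7
        'both(t)':   y_bar,       # eq. 2.8
        'both(t-1)': y_bar_prev,  # eq. 2.9
    }[rule]
    pre = {   # presynaptic factor
        'post(t)':   x,
        'post(t-1)': x,
        'pre(t)':    x_bar,
        'pre(t-1)':  x_bar_prev,
        'both(t)':   x_bar,
        'both(t-1)': x_bar_prev,
    }[rule]
    return alpha * post * pre
```

Note that in the rules using t−1, the current trial's activity enters only through the traces carried forward to the next trial, which is the property the simulations below show to matter.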
2.2 Measures of Network Performance Based on Information Theory.
Two new performance measures based on information theory are introduced. The information-theoretic measures have the advantages that they provide a quantitative and principled way of measuring the performance of an information-processing system and are essentially identical to those that are being applied to analyze the responses of real neurons (see Rolls & Treves, 1998), thus allowing direct comparisons of real and model systems. The first measure is applied to a single cell of the output layer and measures how much information is available from a single response of the cell
to each stimulus s. Each response of a cell in VisNet to a particular stimulus is produced by the different transforms of the stimulus. For example, in a translation-invariance experiment, each stimulus image or object might be shown in each of nine locations, so that for each stimulus there would be nine responses of a cell. The set of responses R over all stimuli would thus consist of 9 N_S responses, where N_S is the number of stimuli. If there is little variation in the responses to a given stimulus and the responses to the different stimuli are different, then considerable information about which stimulus was shown is obtained on a single trial, from a single response. If the responses to all the stimuli overlap, then little information is obtained from a single response about which stimulus (or object) was shown. Thus, in general, the more information about a stimulus that is obtained, the better is the invariant representation. The information about each stimulus is calculated by the following formula, with the details of the procedure given by Rolls, Treves, Tovee, and Panzeri (1997) and background to the method given there and by Rolls and Treves (1998):

I(s, R) = Σ_{r∈R} P(r | s) log2 [ P(r | s) / P(r) ].    (2.10)
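A minimal sketch of how equation 2.10 can be computed from a cell's responses, using the equispaced binning adopted for VisNet2 (the function name and data conventions are ours, not from the original implementation; responses for each stimulus are taken across its transforms, e.g., the training locations):

```python
import numpy as np

def stimulus_specific_info(responses, n_bins=None):
    """responses[s] = 1-D array of a cell's firing rates to stimulus s
    across its transforms. Rates are binned into equispaced bins and the
    stimulus-specific information I(s, R) of equation 2.10 is returned
    for every stimulus, in bits."""
    all_r = np.concatenate(responses)
    n_bins = n_bins or min(len(r) for r in responses)
    # equispaced bin edges over the full range of observed rates
    edges = np.linspace(all_r.min(), all_r.max() + 1e-12, n_bins + 1)
    # P(r): distribution of binned responses over all stimuli and transforms
    p_r = np.histogram(all_r, bins=edges)[0] / len(all_r)
    info = []
    for r_s in responses:
        p_r_given_s = np.histogram(r_s, bins=edges)[0] / len(r_s)
        nz = p_r_given_s > 0
        info.append(np.sum(p_r_given_s[nz] * np.log2(p_r_given_s[nz] / p_r[nz])))
    return np.array(info)
```

For a perfectly discriminating cell with two stimuli, each response distribution falls entirely in its own bin, and each I(s, R) reaches the ceiling of log2 of the number of stimuli.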
The stimulus-specific information, I(s, R), is the amount of information that the set of responses, R, has about a specific stimulus, s. The mutual information between the whole set of stimuli S and of responses R is the average across stimuli of this stimulus-specific information (see equation 2.11). (Note that r is an individual response from the set of responses R.) The calculation procedure was identical to that described by Rolls, Treves, Tovee, and Panzeri (1997), with the following exceptions. First, no correction was made for the limited number of trials, because in VisNet2 (as in VisNet), each measurement of a response is exact, with no variation due to sampling on different trials. Second, the binning procedure was altered in such a way that the firing rates were binned into equispaced rather than equipopulated bins. This small modification was useful because the data provided by VisNet2 can produce perfectly discriminating responses with little trial-to-trial variability. Because the cells in VisNet2 can have bimodally distributed responses, equipopulated bins could fail to separate the two modes perfectly. (This is because one of the equipopulated bins might contain responses from both of the modes.) The number of bins used was equal to or less than the number of trials per stimulus, that is, for VisNet, the number of positions on the retina (Rolls, Treves, Tovee, & Panzeri, 1997). Because VisNet operates as a form of competitive net to perform categorization of the inputs received, good performance of a neuron will be characterized by large responses to one or a few stimuli regardless of their position on the retina (or other transform), and small responses to the other stimuli. We are thus interested in the maximum amount of information that a neuron provides about any of the stimuli rather than the average amount
of information it conveys about the whole set S of stimuli (known as the mutual information). Thus, for each cell, the performance measure was the maximum amount of information a cell conveyed about any one stimulus (with a check, in practice always satisfied, that the cell had a large response to that stimulus, as a large response is what a correctly operating competitive net should produce to an identified category). In many of the graphs in this article, the amount of information that each of the 100 most informative cells had about any stimulus is shown. If all the output cells of VisNet learned to respond to the same stimulus, then the information about the set of stimuli S would be very poor and would not reach its maximal value of log2 of the number of stimuli (in bits). A measure that is useful here is the information provided by a set of cells about the stimulus set. If the cells provide different information because they have become tuned to different stimuli or subsets of stimuli, then the amount of this multiple cell information should increase with the number of different cells used, up to the total amount of information needed to specify which of the N_S stimuli have been shown, that is, log2 N_S bits. Procedures for calculating the multiple cell information have been developed for multiple neuron data by Rolls, Treves, and Tovee (1997) (see also Rolls & Treves, 1998), and the same procedures were used for the responses of VisNet. In brief, what was calculated was the mutual information I(S, R), that is, the average amount of information that is obtained from a single presentation of a stimulus from the responses of all the cells. For multiple cell analysis, the set of responses, R, consists of response vectors comprising the responses from each cell. Ideally, we would like to calculate

I(S, R) = Σ_{s∈S} P(s) I(s, R).    (2.11)
However, the information cannot be measured directly from the probability table P(r, s) embodying the relationship between a stimulus s and the response rate vector r provided by the firing of the set of neurons to a presentation of that stimulus. (Note, as is made clear at the start of this article, that stimulus refers to an individual object that can occur with different transforms: for example, translation here, but elsewhere view and size transforms. See Wallis and Rolls, 1997.) This is because the dimensionality of the response vectors is too large to be adequately sampled by trials. Therefore a decoding procedure is used in which the stimulus s′ that gave rise to the particular firing-rate response vector on each trial is estimated. This involves, for example, maximum likelihood estimation or dot product decoding. (For example, given a response vector r to a single presentation of a stimulus, its similarity to the average response vector of each neuron to each stimulus is used to estimate, using a dot product comparison, which stimulus was shown. The probabilities of it being each of the stimuli can be estimated in this way. Details are provided by Rolls, Treves, and Tovee
(1997) and by Panzeri, Treves, Schultz, and Rolls (1999).) A probability table is then constructed of the real stimuli s and the decoded stimuli s′. From this probability table, the mutual information between the set of actual stimuli S and the decoded estimates S′ is calculated as

I(S, S′) = Σ_{s,s′} P(s, s′) log2 [ P(s, s′) / (P(s) P(s′)) ].    (2.12)
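The decoding-based estimate can be sketched as follows. This is an illustrative implementation of ours combining dot product decoding with equation 2.12; the function and variable names are hypothetical, and the true decoding used maximum likelihood estimation as well.

```python
import numpy as np

def multiple_cell_info(train_mean, trials):
    """Estimate I(S, S') (equation 2.12) via dot-product decoding.
    train_mean[s] = mean population response vector to stimulus s;
    trials = list of (s, r) pairs: the true stimulus index and the
    population response vector on one presentation."""
    n_s = len(train_mean)
    joint = np.zeros((n_s, n_s))
    for s, r in trials:
        # decode: the stimulus whose mean response vector best matches r
        s_hat = int(np.argmax([np.dot(r, m) for m in train_mean]))
        joint[s, s_hat] += 1
    joint /= joint.sum()                       # P(s, s')
    p_s, p_shat = joint.sum(1), joint.sum(0)   # marginals P(s), P(s')
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / np.outer(p_s, p_shat)[nz]))
```

With perfectly decodable trials and equiprobable stimuli, the estimate reaches log2 N_S bits, the value identified above as perfect performance.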
This was calculated for the subset of cells that had, as single cells, the most information about which stimulus was shown. Often five cells for each stimulus with high information values for that stimulus were used for this. 2.3 Lateral Inhibition, Competition, and the Neuronal Activation Function. As originally conceived (Wallis & Rolls, 1997), the essentially
competitive networks in VisNet had local lateral inhibition and a steeply rising activation function, which after renormalization resulted in the neurons with the higher activations having very much higher firing rates relative to other neurons. The idea behind the lateral inhibition, apart from this being a property of cortical architecture in the brain, was to prevent too many neurons that receive inputs from a similar part of the preceding layer responding to the same activity patterns. The purpose of the lateral inhibition was to ensure that different receiving neurons coded for different inputs. This is important in reducing redundancy (Rolls & Treves, 1998). The lateral inhibition was conceived as operating within a radius that was similar to that of the region within which a neuron received converging inputs from the preceding layer (because activity in one zone of topologically organized processing within a layer should not inhibit processing in another zone in the same layer, concerned perhaps with another part of the image). However, the extent of the lateral inhibition actually investigated by Wallis and Rolls (1997) in VisNet operated over adjacent pixels. Here we investigate in VisNet2 lateral inhibition implemented in a similar but not identical way, but operating over a larger region, set within a layer to approximately half of the radius of convergence from the preceding layer. The lateral inhibition and contrast enhancement just described are actually implemented in VisNet2 in two stages. Lateral inhibition is implemented in the experiments described here by convolving the activation of the neurons in a layer with a spatial filter, I, where δ controls the contrast and σ controls the width, and a and b index the distance away from the center of the filter:
I_{a,b} = −δ exp( −(a² + b²) / σ² )          if a ≠ 0 or b ≠ 0,

I_{a,b} = 1 − Σ_{a≠0 or b≠0} I_{a,b}          if a = 0 and b = 0.

This is a filter that leaves the average activity unchanged.
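A minimal sketch of constructing this filter, with a check that it sums to 1 so that convolving it with a layer's activations leaves the average activity unchanged (the function name and the truncation radius are ours; the δ and σ values used in the test are from Table 4):

```python
import numpy as np

def lateral_inhibition_filter(delta, sigma, radius):
    """Build the lateral inhibition filter I_{a,b}: off-center weights are
    -delta * exp(-(a^2 + b^2) / sigma^2), and the center weight is set so
    the whole (truncated) filter sums to 1, leaving the average activity
    unchanged under convolution."""
    a = np.arange(-radius, radius + 1)
    aa, bb = np.meshgrid(a, a, indexing='ij')
    filt = -delta * np.exp(-(aa**2 + bb**2) / sigma**2)
    # center weight: 1 minus the sum of all off-center weights
    off_center_sum = filt.sum() - filt[radius, radius]
    filt[radius, radius] = 1.0 - off_center_sum
    return filt
```

Because the surround weights are negative and the center weight exceeds 1, the filter suppresses a neuron's activation in proportion to the activity of its neighbors while preserving the layer mean.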
Table 3: Sigmoid Parameters for the 25-Location Runs.

Layer        1      2    3    4
Percentile   99.2   98   88   91
Slope β      190    40   75   26
The second stage involves contrast enhancement. In VisNet, this was implemented by raising the activations to a fixed power and normalizing the resulting firing within a layer to have an average firing rate equal to 1.0. Here in VisNet2, we compare this to a more biologically plausible form of the activation function, a sigmoid,

y = f_sigmoid(r) = 1 / (1 + e^{−2β(r − α)}),
where r is the activation (or firing rate) of the neuron after the lateral inhibition, y is the firing rate after the contrast enhancement produced by the activation function, β is the slope or gain, and α is the threshold or bias of the activation function. The sigmoid bounds the firing rate between 0 and 1, so global normalization is not required. The slope and threshold are held constant within each layer. The slope is constant throughout training, whereas the threshold is used to control the sparseness of firing rates within each layer. The sparseness of the firing within a layer is defined (Rolls & Treves, 1998) as

a = ( Σ_i y_i / n )² / ( Σ_i y_i² / n ),    (2.13)
where n is the number of neurons in the layer. To set the sparseness to a given value, for example, 5%, the threshold is set to the value of the 95th percentile point of the activations within the layer. (Unless otherwise stated here, the neurons used the sigmoid activation function as just described.) In this article we compare the results with short-range and longer-range lateral inhibition, and with the power and sigmoid activation functions. Unless otherwise stated, the sigmoid activation function was used with parameters (selected after a number of optimization runs) as shown in Table 3. The lateral inhibition parameters were as shown in Table 4. Where the power activation function was used, the power for layer 1 was 6 and for the other layers was 2 (Wallis & Rolls, 1997). 2.4 Training and Test Procedure. The network was trained by presenting a stimulus in one training location, calculating the activation of the neuron, then its firing rate, and then updating its synaptic weights. The sequence of locations for each stimulus was determined as follows. For a
Table 4: Lateral Inhibition Parameters for the 25-Location Runs.

Layer        1      2     3     4
Radius, σ    1.38   2.7   4.0   6.0
Contrast, δ  1.5    1.5   1.6   1.4
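As an illustrative sketch of the sigmoid stage with percentile-based sparseness control, and of the sparseness measure of equation 2.13 (the function names are ours; the percentile and slope values used in the test follow layer 1 of Table 3):

```python
import numpy as np

def sigmoid_with_sparseness(activations, percentile, beta):
    """Sigmoid activation y = 1 / (1 + exp(-2*beta*(r - alpha))), with the
    threshold alpha set to the given percentile of the activations within
    the layer, keeping the sparseness under explicit control."""
    alpha = np.percentile(activations, percentile)
    return 1.0 / (1.0 + np.exp(-2.0 * beta * (activations - alpha)))

def sparseness(y):
    """Sparseness a = (sum_i y_i / n)^2 / (sum_i y_i^2 / n), equation 2.13.
    Equals 1 for uniform firing and approaches 1/n for a single active cell."""
    n = y.size
    return (y.sum() / n) ** 2 / ((y ** 2).sum() / n)
```

With a high percentile and a steep slope, only the few neurons above threshold fire strongly, so the measured sparseness is low, as intended for the early layers in Table 3.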
Figure 3: (Left) Contiguous set of 25 training locations. (Right) Five blocks of five contiguous locations with random offset and direction.
value of η of 0.8, the effect of the trace decays to 50% after 3.1 training locations. Therefore, locations near the start of a fixed sequence of, for example, 25 locations would not occur in sufficiently close temporal proximity for the trace rule to work. The images were therefore presented in sets of, for example, 5 locations, with then a jump to, for example, 5 more locations, until all 25 locations had been covered. An analogy here is to a number of small saccades (or, alternatively, small movements of the stimulus) punctuated by a larger saccade to another part of the stimulus. Figure 3 shows the order in which (for 25 locations) the locations were chosen, by sequentially traversing blocks of 5 contiguous locations and visiting all 5 blocks in a random order. We also include random direction and offset components. After all the training locations had been visited (25 unless otherwise stated for the experiments described here), another stimulus was selected in a permutative sequence, and the procedure was repeated. Training in this way for all the stimuli in a set was one training epoch. The network was trained one layer at a time, starting with layer 1 (above the retina) through to layer 4. The rationale for this was that there was no point in training a layer if the preceding layer had a continually changing representation. The numbers of epochs per layer were as shown in Table 5. The learning rate was gradually decreased to zero during the training of a layer according to a cosine function in the range 0 to π/2, in a form of simulated annealing. The trace η value was set to 0.8 unless otherwise stated, and for the runs described in this article, the trace was artificially reset to 0 between different stimuli. Each image was 64 × 64 pixels and was shown at
Table 5: Number of Training Epochs per Layer.

Layer              1    2     3     4
Number of epochs   50   100   100   75
different positions in the 128 × 128 "retina" in the investigations described in this article on translation invariance. The number of pixels by which the image was translated was 32 for each move. With a grid of 25 possible locations for the center of the image as shown in Figure 3, the maximum possible shift of the center of the image was 64 pixels away from the central pixel of the retina in both horizontal and vertical directions (see Figure 3). Wrap-around in the image plane ensured that images did not move off the "retina". The stimuli used (for both training and testing) were either the same set of 7 faces used and illustrated in the previous investigations with VisNet (Wallis & Rolls, 1997) or an extended set of 17 faces. One training epoch consists of presenting each stimulus through each set of locations, as described. The trace was normally artificially reset to zero before starting each stimulus presentation sequence, but the effect of not resetting it was only mild, as investigated here and described below. We now summarize the differences between VisNet and VisNet2, partly to emphasize some of the new points addressed here and partly for clarification. VisNet (see Wallis & Rolls, 1997) used a power activation function, only short-range lateral inhibition, the learning rule shown in equation 2.4, primarily a Fisher measure of performance, a linearly decreasing learning rate across the number of training trials used for a particular layer, and training that always started at a fixed position in the set of exemplars of a stimulus. VisNet2 uses a sigmoid activation function that incorporates control of the sparseness of the representation, and longer-range lateral inhibition.
It allows comparison of the performance when trained with many versions of a trace learning rule shown in equations 2.5 through 2.9, includes information theory–based measures of performance, has a learning rate that decreases with a cosine bell taper over approximately the last 20% of the training trials in an epoch, and has training for a given stimulus that runs for several short lengths of the exemplars of that stimulus, each starting at random points among the exemplars, in order to facilitate trace learning among all exemplars of a given stimulus. 3 Results 3.1 Comparison of the Fisher and Information Measures of Network Performance. The results of a run of VisNet with the standard parameters
used in this article trained on 7 faces at each of 25 training locations are shown using the Fisher measure and the single-cell information measure in
Figure 4: Seven faces, 25 locations. Single cell analysis: Fisher metric and information measure for the best 100 cells, ranked by cell, for the trace, Hebb, and random conditions. Throughout this article, the single cell information was calculated with equation 2.10.
Figure 5: Seven faces, 25 locations. Single cell analysis: comparison of the Fisher metric and the information measure. The right panel shows an expanded part of the left panel.
Figure 4. In both cases the values for the 100 most invariant cells are shown (ranked in order of their invariance). (Throughout this article, the results are shown for cells in the top layer, designated layer 3. The types of cell response found in lower layers are described by Wallis & Rolls, 1997.) It is clear that the network performance is much better when trained with the trace rule than when trained with a Hebb rule (which is not expected to capture invariances) or when left untrained, with random synaptic connectivity. This is indicated by both the Fisher and the single-cell information measures. For further comparison, we show in Figure 5 the information value for different cells plotted against the log (because information is a log measure) of the Fisher measure. It is evident that the two measures are closely related, and indeed the correlation between them was 0.76. This means that the information measure can potentially replace the Fisher measure, because they provide similar indications about the network performance. An advantage of the information measure is brought out next. The information about the most effective stimulus for the best cells is seen (see Figure 4) to be in the region of 1.9 to 2.8 bits. This information measure
Figure 6: Seven faces, 25 locations. Response profiles (firing rate as a function of location index, for each of the seven faces) from two top layer cells. The locations correspond to those shown in Figure 3, with the numbering from top left to bottom right as indicated.
specifies what perfect performance by a cell would be if it had large responses to one stimulus at all locations and small responses to all the other stimuli at all locations. The maximum value of the information I(s, R) that could be provided by a cell in this way would be log2 N_S = 2.81 bits. Examples of the responses of individual cells from this run are shown in Figure 6. In these curves, the different responses to a particular stimulus are the responses at different locations. If the responses to one stimulus did not overlap at all with responses to the other stimuli, the cell would give unambiguous and invariant evidence about which stimulus was shown by its response on any one trial. The cell in the right panel of Figure 6 did discriminate perfectly, and its information value was 2.81 bits. For the cell shown in the left panel of Figure 6 the responses to the different stimuli overlap a little, and the information value was 2.13 bits. The ability to interpret the actual value of the information measure in the way just shown is an advantage of the information measure relative to the Fisher measure, the exact value of which is difficult to interpret. This, together with the fact that the information measure allows direct comparison of the results with measures obtained from real neurons in neurophysiological experiments (Rolls & Treves, 1998; Tovee, Rolls, & Azzopardi, 1994; Booth & Rolls, 1998), provides an advantage of the information measure, and from now on we use the information measure instead of the Fisher measure. We also note that the information measure is highly correlated (r = 0.93) with the index of invariance used for real neuronal data by Booth and Rolls (1998). The inverted form of this index (inverted to allow direct comparison with the information measure) is the variance of responses between stimuli divided by the variance of responses within the exemplars of a stimulus.
In its inverted form, it takes a high value if a neuron discriminates well between different stimuli and has similar responses to the different exemplars of each stimulus.
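This inverted index can be sketched as follows (a rough illustration of ours; the function name, the exact variance estimators, and the synthetic data in the test are not from the paper):

```python
import numpy as np

def invariance_index(responses):
    """Inverted invariance index in the spirit of Booth & Rolls (1998):
    variance of the mean responses between stimuli divided by the mean
    variance of responses within the exemplars (transforms) of each
    stimulus. High values indicate a neuron that discriminates between
    stimuli yet responds similarly across the transforms of each one."""
    means = np.array([np.mean(r) for r in responses])
    within = np.mean([np.var(r) for r in responses])
    return np.var(means) / within
```

A cell with nearly constant responses within each stimulus but very different responses across stimuli yields a large index, mirroring a high stimulus-specific information value.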
Figure 7: Seven faces, 25 locations. Multiple cell analysis: information and percentage correct as a function of the number of cells, for the trace, Hebb, and random conditions. The multiple cell information was calculated with equation 2.12.
Use of the multiple cell information measure I(S, R) to quantify the performance of VisNet2 further is illustrated for the same run in Figure 7. (A multiple cell assessment of the performance of this architecture has not been performed previously.) It is shown that the multiple cell information rises steadily as the number of cells in the sample is increased, with a plateau being reached at approximately 2.3 bits. This indicates that the network can, using its 10 to 20 best cells in the output layer, provide good but not perfect evidence about which of the 7 stimuli has been shown. (The measure does not reach the 2.8 bits that would indicate perfect performance, mainly because not all stimuli had cells that coded perfectly for them.) Consistent with this, the graph of percentage correct as a function of the number of cells, available from the output of the decoding algorithm, shows performance increasing steadily but not quite reaching 100%. Figure 7 also shows that there is more multiple cell information available with the trace rule than with the Hebb rule or when untrained, showing correct operation of the trace rule in helping to form invariant representations. The actual type of decoding used, maximum likelihood or dot product, does not make a great deal of difference to the values obtained (see Figure 8), so maximum likelihood decoding is used from now on. We also use the multiple cell measure (and percentage correct) from now on because there is no Fisher equivalent; because it is quantitative and shows whether the information required to identify every stimulus in a set of a given size (log2 N_S bits) is available; because the percentage correct is available; and because the value can be directly compared with recordings from populations of real neurons. (We note that the percentage correct is not expected to be linearly related to the information, due to the metric structure of the space. See Rolls & Treves, 1998, section A2.3.4.)
3.2 The Modified Trace Learning Rule. A comparison of the different trace rules is shown in Figure 9. The major and very interesting performance difference found was that for the three rules that calculate the synaptic
Figure 8: Seven faces, 25 locations. Multiple cell analysis: information and percentage correct as a function of the number of cells; comparison of optimal (maximum likelihood) and dot product decoding.
Figure 9: Seven faces, 25 locations. Comparison of trace rules. The three trace rules using the trace calculated at t−1 (labeled as pre(t−1), implementing equation 2.7; post(t−1), implementing equation 2.5; and both(t−1), implementing equation 2.9) produced better performance than the three trace rules calculated at t (labeled as pre(t), implementing equation 2.6; post(t), implementing equation 2.4; and both(t), implementing equation 2.8). Pre-, post-, and both specify whether the trace is present in the presynaptic term, the postsynaptic term, or both.
modification from the trace at time t−1 (see equations 2.5, 2.7, and 2.9), the performance was much better than for the three rules that calculate the synaptic modification from the trace at time t (see equations 2.4, 2.6, and 2.8). Within each of these two groups, there was no difference between the rule with the trace in the postsynaptic term (as used previously by Wallis & Rolls, 1997) and the rules with the trace in the presynaptic term or in both the presynaptic and postsynaptic terms. (Use of a trace of previous activity that includes no contribution from the current activity has also been shown to be effective by Peng, Sha, Gan, & Wei, 1998.) As expected, the performance with no training, that is, with random weights, was poor. The data shown in Figure 10 were for runs with a fixed value of η of 0.8 for layers 2 through 4. Because the rules with the trace calculated at t−1 might
Figure 10: Seven faces, 25 locations. Comparison of trace rules, with and without trace reset, and for various trace values η. The performance measure used is the maximum multiple cell information available from the best 30 cells.
have benefited from a slightly different value of η than the rules using t, we performed further runs with different values of η for the two types of rule. We show in Figure 10 that the better performance with the rules calculating the trace using t−1 occurred for a wide range of values of η. Thus, we conclude that the rules that calculate the trace using t−1 are indeed better than those that calculate it from t, and that the difference is not due to any interaction with the value of η chosen. Unless otherwise stated, the simulations in this article used a reset of the trace to zero between different stimuli. It is possible that in the brain, this could be achieved by a mechanism associated with large eye movements, or it is possible that the brain does not have such a trace reset mechanism. In our earlier work (Wallis & Rolls, 1997), for generality we did not use trace reset. To determine whether in practice this is important in the invariant representations produced by VisNet, in Figure 10 we explicitly compare performance with and without trace reset for a range of different values of η. It was found that for most values of η (0.6–0.9), whether trace reset was used made little difference. For higher values of η, such as 0.95, trace reset did perform better. The reason is that with a value of η as high as 0.95, the effect of the trace without trace reset will last from the end of one stimulus well into the presentations of the next stimulus, thereby producing interference through the association together of two different stimuli. Thus, with 25 presentations of each stimulus, trace reset is not necessary for good performance with values of η up to as high as 0.9. With 9 presentations of each stimulus, we expect performance with and without trace reset to be comparably good with values of η up to 0.8. Further discussion of the optimal value and time course of the trace is provided by Wallis and Baddeley (1997). 3.3 Activation Function and Lateral Inhibition.
The aim of this section is to compare performance with the sigmoid activation function introduced
Invariant Object Recognition in the Visual System
2565
for use with VisNet2 in this article with the power activation function as used previously (Wallis & Rolls, 1997). We demonstrate that the sigmoid activation function can produce better performance than the power activation function, especially when the sparseness of the representation in each layer is kept at reasonably high (nonsparse) values. As implemented, it has the additional advantage that the sparseness can be set and kept under precise control. We also show that if the lateral inhibition operates over a reasonably large region of cells, this can produce good performance. In the previous study (Wallis & Rolls, 1997), only short-range lateral inhibition (in addition to the nonlinear power activation function) was used, and the longer-range lateral inhibition used here will help to ensure that different neurons within the wider region of lateral inhibition respond to different inputs. This will help to reduce redundancy between nearby neurons. To apply the power activation function, powers of 6, 2, 2, 2 were used for layers 1 through 4, respectively, as these values were found to be appropriate and were used by Wallis and Rolls (1997). The nonlinearity power for layer 1 needs to be high; this is related to the need to reduce the sparseness below the very distributed representation produced by the input layer of VisNet. We aimed for sparseness in the region 0.01 through 0.15, partly because sparseness in the brain is not usually lower than this and partly because the aim of VisNet (1 and 2) is to represent each stimulus by a distributed representation in most layers, so that neurons in higher layers can learn about combinations of active neurons in a preceding layer. Although making the representation extremely sparse in layer 1, with only 1 of 1024 neurons active for each stimulus, can produce very good performance, this implies operation as a look-up table, with each stimulus in each position represented by a different neuron.
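The sparseness values quoted here can be made concrete with the population sparseness measure a = (Σᵢ yᵢ/n)² / (Σᵢ yᵢ²/n) of Rolls and Treves (1998). The following minimal Python sketch (function name ours) shows the two regimes discussed: the look-up-table limit of one active neuron out of 1024, and a fully distributed representation.

```python
import numpy as np

def sparseness(rates):
    """Population sparseness a = (sum(y)/n)**2 / (sum(y**2)/n).

    a = 1/n when a single neuron is active (look-up-table regime);
    a = 1 when all neurons fire equally (fully distributed).
    """
    y = np.asarray(rates, dtype=float)
    n = y.size
    return (y.sum() / n) ** 2 / (np.square(y).sum() / n)

one_hot = np.zeros(1024)
one_hot[0] = 1.0                  # 1 of 1024 neurons active, as in the text
print(sparseness(one_hot))        # 1/1024, i.e. about 0.001
print(sparseness(np.ones(1024)))  # 1.0
```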
This is not what the VisNet architecture is intended to model. Instead, VisNet is intended to produce at any one stage locally invariant representations of local feature combinations. Another experimental advantage of the sigmoid activation function was that the sparseness could be set to particular values (using the percentile parameter). The parameters for the sigmoid activation function were as in Table 3. It is shown in Figure 11 that with limited numbers of stimuli (7) and training locations (9), performance is excellent with both the power and the sigmoid activation functions. With a more difficult problem of 17 faces and 25 locations, the sigmoid activation function produced better performance than the power activation function (see Figure 12), while at the same time using a less sparse representation than the power activation function. (The average sparseness values for sigmoid versus power were: layer 1, 0.008 versus 0.002; layer 2, 0.02 versus 0.004; layers 3 and 4, 0.11 versus 0.015.) Similarly, with the even more difficult problem of 17 faces and 49 locations, the sigmoid activation function produced better performance than the power activation function (see Figure 13), again while using a less sparse representation than the power activation function, with sparseness values similar to those just given.
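The percentile mechanism can be sketched as follows: set the sigmoid threshold at a chosen percentile of the within-layer activations, so that a fixed fraction of neurons ends up in the high-rate region. This is an illustrative sketch, not VisNet2's actual code; `beta` and the 95th percentile are assumed example values.

```python
import numpy as np

def sigmoid_percentile(activations, percentile=95.0, beta=10.0):
    """Sigmoid with its threshold at the given percentile of the
    activation distribution, so about (100 - percentile)% of neurons
    sit in the high-rate half of the sigmoid."""
    threshold = np.percentile(activations, percentile)
    return 1.0 / (1.0 + np.exp(-beta * (activations - threshold)))

rng = np.random.default_rng(0)
h = rng.normal(size=1024)      # simulated activations of one layer
y = sigmoid_percentile(h)
print((y > 0.5).mean())        # close to 0.05, whatever the input distribution
```

Because the threshold tracks the activation distribution itself, the fraction of strongly firing neurons stays under precise control, which is the advantage described in the text.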
2566
Edmund T. Rolls and T. Milward

Figure 11: Seven faces, 9 locations. Comparison of the sigmoid and power activation functions. (Left panel: single cell information in bits against cell rank; right panel: multiple cell information in bits against the number of cells.)
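The single cell information plotted in this and the following figures is the mutual information between the stimulus shown and the cell's response. As an illustration of the quantity (not the bias-corrected estimation procedure of Rolls, Treves, Tovee, & Panzeri, 1997, used in the article), a plug-in estimate over binned responses can be sketched as:

```python
import numpy as np

def single_cell_information(responses, stimuli, n_bins=8):
    """Plug-in estimate of I(S; R) in bits from one cell's responses.

    Responses are discretized into n_bins equispaced bins; the sum is
    p(s) p(r|s) log2(p(r|s) / p(r)) over stimuli s and response bins r.
    """
    responses = np.asarray(responses, dtype=float)
    stimuli = np.asarray(stimuli)
    edges = np.histogram_bin_edges(responses, bins=n_bins)
    r = np.clip(np.digitize(responses, edges[1:-1]), 0, n_bins - 1)
    info = 0.0
    for s in np.unique(stimuli):
        mask = stimuli == s
        p_s = mask.mean()
        for rb in np.unique(r[mask]):
            p_r_given_s = (r[mask] == rb).mean()
            p_r = (r == rb).mean()
            info += p_s * p_r_given_s * np.log2(p_r_given_s / p_r)
    return info

# A cell that responds only to stimulus 1 carries exactly 1 bit
# about which of two equiprobable stimuli was shown.
stimuli = np.array([0] * 50 + [1] * 50)
responses = np.array([0.0] * 50 + [1.0] * 50)
print(single_cell_information(responses, stimuli))  # 1.0
```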
Figure 12: Seventeen faces, 25 locations. Comparison of the sigmoid and power activation functions. (Left panel: single cell information in bits against cell rank; right panel: multiple cell information in bits against the number of cells.)
The effects of different radii of lateral inhibition were most marked with difficult training sets, for example, the 17 faces and 49 locations shown in Figure 14. Here it is shown that the best performance was with intermediate-range lateral inhibition, using the parameters for σ shown in Table 4. These values of σ are set so that the lateral inhibition radius within a layer is approximately half that of the spread of the excitatory connections from the
Figure 13: Seventeen faces, 49 locations. Comparison of the sigmoid and power activation functions. (Left panel: single cell information in bits against cell rank; right panel: multiple cell information in bits against the number of cells.)
Figure 14: Seventeen faces, 49 locations. Comparison of inhibitory mechanisms (narrow, medium, and wide lateral inhibition). (Left panel: single cell information in bits against cell rank; right panel: multiple cell information in bits against the number of cells.)
preceding layer. (For these runs only, the activation function was binary threshold. The reason was to prevent the radius of lateral inhibition from affecting the performance partly by interacting with the slope of the sigmoid activation function.) Using a small value for the radius σ (set to one-fourth of the values given in Table 4) produced worse performance (presumably because similar feature analyzers can be formed quite nearby). This value of the lateral inhibition radius is very similar to that used by Wallis and Rolls (1997). Using a large value for the radius (set to four times the values given in Table 4) also produced worse performance (presumably because the inhibition was too global, preventing all the local features from being adequately represented). With simpler training sets (e.g., 7 faces and 25 locations), the differences between the extents of lateral inhibition were less marked, except that the more global inhibition again produced less good performance (not illustrated).

4 Discussion
We have introduced a more biologically realistic activation function for the neurons by replacing the power function with a sigmoid function. The sigmoid function models not only the threshold nonlinearity of neurons, but also the fact that eventually the firing rate of neurons saturates at a high rate. (Although neurons cannot fire for long at their maximal firing rate, recent work on the firing-rate distribution of inferior temporal cortex neurons shows that they only rarely reach high rates. The firing-rate distribution has an approximately exponential tail at high rates. See Rolls, Treves, Tovee, & Panzeri, 1997, and Treves, Panzeri, Rolls, Booth, & Wakeman, 1999.) The way in which the sigmoid was implemented did mean that some of the neurons would be close to saturation. For example, with the sigmoid bias a set at 95%, approximately 4% of the neurons would be in the high firing-rate saturation region. One advantage of the sigmoid was that it enabled the sparseness to be accurately controlled. A second advantage is that it enabled VisNet2 to achieve good performance with higher values of the
sparseness a than were possible with a power activation function. VisNet (1 and 2) is, of course, intended to operate without very low values for sparseness. The reason is that within every topologically organized region, defined, for example, as the size of the region in a preceding layer from which a neuron receives inputs, or as the area within a layer within which competition operates, some neurons should be active in order to represent information within that topological region from the preceding layer. With the power activation function, a few neurons will tend to have high firing rates, and given that normalization of activity is global across each layer of VisNet, the result may be that some topological regions have no neurons active. A potential improvement to VisNet2 is thus to make the normalization of activity, as well as the lateral inhibition, operate separately within each topological region, set to be approximately the size of the region of the preceding layer from which a neuron receives its inputs (the connection spread of Table 1). Another interesting difference between the sigmoid and the power activation functions is that the distribution of firing rates with the power activation function is monotonically decreasing, with very few neurons close to the highest rate. In contrast, the sigmoid activation function tends to produce an almost binary distribution of firing rates, with most firing rates 0 and, for example, 4% (set by the percentile) with rates close to 1. It is this that helps VisNet to have at least some neurons active in each of its topologically defined regions for each stimulus. Another design aspect of VisNet2, modeled on cortical architecture, which is intended to produce some neurons in each topologically defined region that are active for any stimulus, is the lateral inhibition. In the performance of VisNet analyzed previously (Wallis & Rolls, 1997), the lateral inhibition was set to a small radius of 1 pixel.
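The effect of the lateral inhibition radius can be illustrated with a toy subtractive scheme: each cell's activity is reduced by a Gaussian-weighted average of the activity around it, with the radius set by σ. This is a sketch under assumed forms only (a simple subtractive Gaussian surround applied by circular convolution), not the actual VisNet2 inhibition filter; `strength` is an illustrative parameter.

```python
import numpy as np

def lateral_inhibition(rates, sigma, strength=0.5):
    """Subtract a Gaussian-weighted local mean of activity from each cell.

    The spatial extent of the inhibition is set by sigma (in pixels);
    convolution is circular via the FFT, which is adequate for a sketch.
    """
    ny, nx = rates.shape
    y = np.arange(ny) - ny // 2
    x = np.arange(nx) - nx // 2
    kernel = np.exp(-(x[None, :] ** 2 + y[:, None] ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    local_mean = np.real(np.fft.ifft2(
        np.fft.fft2(rates) * np.fft.fft2(np.fft.ifftshift(kernel))))
    return rates - strength * local_mean

# Uniform activity: every cell loses exactly `strength` times the local
# mean; only structure narrower than sigma survives the subtraction.
out = lateral_inhibition(np.ones((8, 8)), sigma=2.0)
print(out[0, 0])  # approximately 0.5
```

A larger sigma suppresses similar feature analyzers over a wider neighborhood, which is the mechanism the text proposes for reducing redundancy between nearby neurons.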
In the analyses described here, it was shown, as predicted, that the operation of VisNet2 with longer-range lateral inhibition was better, with best performance with radii that were approximately the same as the region in the preceding layer from which a neuron received excitatory inputs. Using a smaller value for the radius (set to one-fourth of the normal values) produced worse performance (presumably because similar feature analyzers can be formed quite nearby). Using a large value for the radius (set to four times the normal values) also produced worse performance (presumably because the inhibition was too global, preventing all the local features from being adequately represented). The investigation of the form of the trace rule showed that in VisNet2, the use of a presynaptic trace, a postsynaptic trace (as used previously), and both presynaptic and postsynaptic traces produced similar performance. However, a very interesting finding was that VisNet2 produced very much better translation invariance if the rule used a trace calculated for t − 1, for example, equation 2.5, rather than at time t, for example, equation 2.4 (see Figure 9). One way to understand this is to note that the trace rule is trying to set up the synaptic weight on trial t based on whether the neuron, based on its previous history, is responding to that stimulus (in
other positions). Use of the trace rule at t − 1 does this; that is, it takes into account the firing of the neuron on previous trials, with no contribution from the firing being produced by the stimulus on the current trial. On the other hand, use of the trace at time t in the update takes into account the current firing of the neuron to the stimulus in that particular position, which is not a good estimate of whether that neuron should be allocated to represent that stimulus invariantly. Effectively, using the trace at time t introduces a Hebbian element into the update, which tends to build position-encoded analyzers rather than stimulus-encoded analyzers. (The argument has been phrased for a system learning translation invariance but applies to the learning of all types of invariance.) A particular advantage of using the trace at t − 1 is that the trace will then on different occasions (due to the randomness in the location sequences used) reflect previous histories with different sets of positions, enabling the learning of the neuron to be based on evidence from the stimulus present in many different positions. Using a term from the current firing in the trace (the trace calculated at time t) means that this desirable effect always has an undesirable element from the current firing of the neuron to the stimulus in its current position. This discussion of the trace rule shows that a good way to update weights in VisNet is to use evidence from previous trials but not the current trial. This may be compared with the powerful methods that use temporal difference (TD) learning (Sutton, 1988; Sutton & Barto, 1998). In TD learning, the synaptic weight update is based on the difference between the current estimate at time t and the previous estimate available at t − 1. This is a form of error correction learning rather than the associative use of a trace implemented in VisNet for biological plausibility. If cast in the notation of VisNet, a TD(λ) rule might appear as
\Delta w_j \propto (y^t - y^{t-1}) x_j^{t-1}   (4.1)
           \propto \delta^t x_j^{t-1},          (4.2)

where \delta^t = y^t - y^{t-1} is the difference between the current firing rate and the previous firing rate. This latter term can be thought of as the error of the prediction by the neuron of which stimulus is present. An alternative TD(0) rule might appear as

\Delta w_j \propto \overline{(y^t - y^{t-1})} x_j^{t-1}   (4.3)
           \propto \bar{\delta}^t x_j^{t-1},              (4.4)

where \bar{\delta}^t is a trace of \delta^t, the difference between the current firing rate and the previous firing rate. The interesting similarity between the trace rule at t − 1 and the TD rule is that neither updates the weights using a Hebbian component that is based on the firing at time t of both the presynaptic
and postsynaptic terms. It will be of interest in future work to investigate how much better VisNet may perform if an error-based synaptic update rule rather than the trace rule is used. However, we note that holding a trace from previous trials, or using the activity from previous trials to help generate an error for TD learning, is not immediately biologically plausible without a special implementation, and the fact that VisNet2 can operate well with the ordinary trace rules (equations at time t) is important and differentiates it from TD learning. The results described here have shown that VisNet2 can perform reasonably when set more difficult problems than those on which it has been tested previously by Wallis and Rolls (1997). In particular, the number of training and test locations was increased from 9 to 25 and 49. In order to allow the trace rule to learn over this many variations in the translation of the images, a method of presenting images with many small and some larger excursions was introduced. The rationale we have in mind is that eye movements, as well as movements of objects, could produce the appropriate movements of objects across the retina, so that the trace rule can help the learning of invariant representations of objects. In addition, in the work described here, for some simulations the number of face stimuli was increased from 7 to 17. Although performance was not perfect with very many different stimuli (17) and very many different locations (49), we note that this particular problem is hard and that ways to increase the capacity (such as limiting the sparseness in the later layers) have not been the subject of the investigations described here. We also note that an analytic approach to the capacity of a network with a different architecture in providing invariant representations has been developed (Parga & Rolls, 1998).
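The difference between the two trace rules, and their contrast with TD learning, can be made concrete in a schematic sketch. We assume the trace update ȳᵗ = (1 − η)yᵗ + ηȳᵗ⁻¹ and a weight update Δwⱼ = α ȳ xⱼᵗ with the trace taken either at t − 1 (cf. equation 2.5) or at t (cf. equation 2.4); α, η, and the function name are illustrative choices, not the article's exact parameters.

```python
import numpy as np

def trace_updates(y, x, eta=0.8, use_previous_trace=True, alpha=0.1):
    """Accumulate weight updates for one neuron under the two trace rules.

    use_previous_trace=True:  dw = alpha * y_bar[t-1] * x[t]
                              (no Hebbian term from the current firing)
    use_previous_trace=False: dw = alpha * y_bar[t]   * x[t]
                              (includes the current firing, a Hebbian element)
    """
    w = np.zeros(x.shape[1])
    y_bar = 0.0
    for t in range(len(y)):
        if use_previous_trace:
            w += alpha * y_bar * x[t]             # trace from t-1 only
        y_bar = (1.0 - eta) * y[t] + eta * y_bar  # update the trace
        if not use_previous_trace:
            w += alpha * y_bar * x[t]             # trace including time t
    return w

y = np.array([1.0, 1.0])                  # the neuron fires on both trials
x = np.array([[1.0, 0.0], [0.0, 1.0]])    # input moves between two positions
w_prev = trace_updates(y, x, use_previous_trace=True)
w_curr = trace_updates(y, x, use_previous_trace=False)
print(w_prev)  # first component 0: no update driven by the current trial alone
print(w_curr)  # both components nonzero: the Hebbian element is present
```

The rule at t − 1 strengthens only connections supported by the neuron's previous history, whereas the rule at t also rewards the current position, which is the position-encoded bias discussed above.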
A comparison of VisNet with other architectures for producing invariant representations of objects is provided elsewhere (Wallis & Rolls, 1997; Parga & Rolls, 1998; see also Bartlett & Sejnowski, 1997; Stone, 1996; Salinas & Abbott, 1997). The results have provided useful further evidence that the VisNet architecture does capture, and helps to explore, some aspects of invariant visual object recognition as performed by the primate visual system: they extend VisNet to use sigmoid activation functions that enable the sparseness of the representation to be set accurately, show that a more appropriate range of lateral inhibition can improve performance as predicted, introduce new measures of its performance, and identify an improved type of trace learning rule.

Acknowledgments
The research was supported by the Medical Research Council, grant PG8513790 to E. T. R. The authors acknowledge the valuable contribution of Guy Wallis, who worked on an earlier version of VisNet (see Wallis & Rolls, 1997), and are grateful to Roland Baddeley of the MRC Interdisciplinary Research Center for Cognitive Neuroscience at Oxford for many helpful
comments. The authors thank Martin Elliffe for assistance with code to prepare the output of VisNet2 for input to the information-theoretic routines developed and described by Rolls, Treves, Tovee, and Panzeri (1997) and Rolls, Treves, and Tovee (1997); Stefano Panzeri for providing equispaced bins for the multiple cell information analysis routines developed by Rolls, Treves, and Tovee (1997); and Simon Stringer for performing the comparison of the information measure with the invariance index used by Booth and Rolls (1998).
References

Bartlett, M. S., & Sejnowski, T. J. (1997). Viewpoint invariant face recognition using independent component analysis and attractor networks. In M. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Booth, M. C. A., & Rolls, E. T. (1998). View-invariant representations of familiar objects by neurons in the inferior temporal visual cortex. Cerebral Cortex, 8, 510–523.
Desimone, R. (1991). Face-selective cells in the temporal cortex of monkeys. Journal of Cognitive Neuroscience, 3, 1–8.
Földiák, P. (1991). Learning invariance from transformation sequences. Neural Computation, 3, 194–200.
Hawken, M. J., & Parker, A. J. (1987). Spatial properties of the monkey striate cortex. Proceedings of the Royal Society, London [B], 231, 251–288.
Panzeri, S., Treves, A., Schultz, S., & Rolls, E. T. (1999). On decoding the responses of a population of neurons from short time windows. Neural Computation, 11, 1553–1577.
Parga, N., & Rolls, E. T. (1998). Transform invariant recognition by association in a recurrent network. Neural Computation, 10, 1507–1525.
Peng, H. C., Sha, L. F., Gan, Q., & Wei, Y. (1998). Energy function for learning invariance in multilayer perceptron. Electronics Letters, 34(3), 292–294.
Rolls, E. T. (1992). Neurophysiological mechanisms underlying face processing within and beyond the temporal cortical areas. Philosophical Transactions of the Royal Society, London [B], 335, 11–21.
Rolls, E. T. (1994). Brain mechanisms for invariant visual recognition and learning. Behavioural Processes, 33, 113–138.
Rolls, E. T. (1995). Learning mechanisms in the temporal lobe visual cortex. Behavioural Brain Research, 66, 177–185.
Rolls, E. T. (1997). A neurophysiological and computational approach to the functions of the temporal lobe cortical visual areas in invariant object recognition. In L. Harris & M. Jenkin (Eds.), Computational and psychophysical mechanisms of visual coding (pp. 184–220). Cambridge: Cambridge University Press.
Rolls, E. T. (2000). Functions of the primate temporal lobe cortical visual areas in invariant visual object and face recognition. Neuron, 27, 1–20.
Rolls, E. T., & Tovee, M. J. (1995). Sparseness of the neuronal representation of stimuli in the primate temporal visual cortex. Journal of Neurophysiology, 73, 713–726.
Rolls, E. T., & Treves, A. (1998). Neural networks and brain function. Oxford: Oxford University Press.
Rolls, E. T., Treves, A., & Tovee, M. J. (1997). The representational capacity of the distributed encoding of information provided by populations of neurons in the primate temporal visual cortex. Experimental Brain Research, 114, 177–185.
Rolls, E. T., Treves, A., Tovee, M., & Panzeri, S. (1997). Information in the neuronal representation of individual stimuli in the primate temporal visual cortex. Journal of Computational Neuroscience, 4, 309–333.
Salinas, E., & Abbott, L. F. (1997). Invariant visual responses from attentional gain fields. Journal of Neurophysiology, 77, 3267–3272.
Stone, J. V. (1996). A canonical microfunction for learning perceptual invariances. Perception, 25, 207–220.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S., & Barto, A. G. (1981). Towards a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135–170.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning. Cambridge, MA: MIT Press.
Tanaka, K., Saito, H., Fukada, Y., & Moriya, M. (1991). Coding visual images of objects in the inferotemporal cortex of the macaque monkey. Journal of Neurophysiology, 66, 170–189.
Tovee, M. J., Rolls, E. T., & Azzopardi, P. (1994). Translation invariance and the responses of neurons in the temporal visual cortical areas of primates. Journal of Neurophysiology, 72, 1049–1060.
Treves, A., Panzeri, S., Rolls, E. T., Booth, M., & Wakeman, E. A. (1999). Firing rate distributions and efficiency of information transmission of inferior temporal cortex neurons to natural visual stimuli. Neural Computation, 11, 611–641.
Wallis, G., & Baddeley, R. (1997). Optimal unsupervised learning in invariant object recognition. Neural Computation, 9(4), 959–970. Available online at: ftp://ftp.mpik-tueb.mpg.de/pub/guy/nc.ps.
Wallis, G., & Rolls, E. T. (1997). A model of invariant object recognition in the visual system. Progress in Neurobiology, 51, 167–194.
Wallis, G., Rolls, E. T., & Földiák, P. (1993). Learning invariant responses to the natural transformations of objects. International Joint Conference on Neural Networks, 2, 1087–1090.

Accepted April 8, 1999; received November 24, 1998.
LETTER
Communicated by Geoffrey Goodhill
An Analysis of Orientation and Ocular Dominance Patterns in the Visual Cortex of Cats and Ferrets

T. Müller
M. Stetter
Department of Computer Science, TU Berlin, 10587 Berlin, Germany
M. Hübener
F. Sengpiel
T. Bonhoeffer
Max-Planck-Institute of Neurobiology, 82152 Martinsried, Germany
I. Gödecke
Actipac Biosystems GmbH, 82152 Martinsried, Germany
B. Chapman
Davis Center for Neuroscience, UC Davis, Davis, CA 95616, U.S.A.
S. Löwel
Leibniz-Institut für Neurobiologie, 39118 Magdeburg, Germany
K. Obermayer
Department of Computer Science, TU Berlin, 10587 Berlin, Germany

We report an analysis of orientation and ocular dominance maps that were recorded optically from area 17 of cats and ferrets. Similar to a recent study performed in primates (Obermayer & Blasdel, 1997), we find that 80% (for cats and ferrets) of orientation singularities that are nearest neighbors have opposite sign and that the spatial distribution of singularities deviates from a random distribution of points, because the average distances between nearest neighbors are significantly larger than expected for a random distribution. Orientation maps of normally raised cats and ferrets show approximately the same typical wavelength; however, the density of singularities is higher in ferrets than in cats. Also, we find the well-known overrepresentation of cardinal versus oblique orientations in young ferrets (Chapman & Bonhoeffer, 1998; Coppola, White, Fitzpatrick, & Purves, 1998) but only a weak, not quite significant overrepresentation of cardinal orientations in cats, as has been reported previously (Bonhoeffer & Grinvald, 1993). Orientation and ocular dominance slabs in cats exhibit a tendency of being orthogonal to each other (Hübener, Shoham, Grinvald, & Bonhoeffer, 1997), albeit less pronounced, as has been reported for primates (Obermayer & Blasdel, 1993). In chronic recordings from single animals, a decrease of the singularity density and an increase of the ocular dominance wavelength with age but no change of the orientation wavelengths were found. Orientation maps are compared with two pattern models for orientation preference maps: bandpass-filtered white noise and the field analogy model. Bandpass-filtered white noise predicts sign correlations between orientation singularities, but the correlations are significantly stronger (87% opposite sign pairs) than what we have found in the data. Also, bandpass-filtered noise predicts a deviation of the spatial distribution of singularities from a random dot pattern. The field analogy model can account for the structure of certain local patches but not for the whole orientation map. Differences between the predictions of the field analogy model and experimental data are smaller than what has been reported for primates (Obermayer & Blasdel, 1997), which can be explained by the smaller size of the imaged areas in cats and ferrets.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2573–2595 (2000)

1 Introduction
In a previous study (Obermayer & Blasdel, 1997), we analyzed orientation and ocular dominance maps obtained from the striate cortex of macaque and squirrel monkeys. We found a strong correlation between the properties and locations of orientation singularities, and we also found that this and several other properties of those maps can be well described by their second-order statistics. The spatial structure of the maps was fairly similar across animals, species, and ages for the measures used, except for a different overall alignment of orientation bands and different periods of repetition. Here we report a follow-up study using data from cats and ferrets, whose visual pathways differ from primates by a set of distinct features—for example, the number of primary visual areas receiving strong, direct geniculate input and the distribution of cells with different response properties over the different cortical layers. Both species show ocular dominance (LeVay, Stryker, & Shatz, 1978; Anderson, Olavarria, & Sluyters, 1988; Redies, Diksic, & Riml, 1990), albeit less pronounced than in macaque striate cortex. Ferrets and cats additionally exhibit clusters of strongly direction-selective cells in the upper layers (Shmuel & Grinvald, 1996; Weliky, Bosking, & Fitzpatrick, 1996), which could, in principle, interact with the spatial layout of orientation-selective cells. Our analysis was performed on 51 orientation and 10 ocular dominance maps obtained from 21 animals, including maps from chronic recordings in single animals over age (longitudinal studies) (Gödecke & Bonhoeffer, 1996), and maps from animals raised under strabismus (Sengpiel et al., 1998). The results are compared with two pattern models: (1) bandpass-filtered white noise (Swindale, 1982; Rojer & Schwartz, 1990; Cowan & Friedman, 1991; Shvartsman & Freund, 1994), which assumes that orientation maps can be fully described by their second-order
Visual Cortex of Cats and Ferrets
2575
statistics, and (2) the field analogy model (Wolf, Pawelzik, & Geisel, 1994; Wolf, Pawelzik, Geisel, Kim, & Bonhoeffer, 1994), which claims that orientation maps are optimally smooth given the locations and signs of their singularities. The letter is divided into five sections. In section 2 we describe the data and the methods for data analysis. In the third section, we briefly introduce both pattern models. The fourth section contains the results of our data analysis and our comparison between data and models, and the fifth section concludes with a brief discussion.

2 Data and Methods of Analysis

2.1 Data Collection and Preprocessing. The data were obtained by optical recording of intrinsic signals (Blasdel, 1992) using standard procedures (Bonhoeffer & Grinvald, 1996), but using a light wavelength of 707 nm for the illumination of the cortex. For the purpose of this article, each "raw" data set consists of 16 × 5 frames, 192 × 144 pixels in size, with a 2-byte number representing the pixel's intensity as recorded by the CCD camera. One stimulus condition corresponds to a block of five frames covering a time window of 3 s (600 ms per frame), which are recorded during stimulation of the animal. Stimuli are bar gratings of four orientations (0, 45, 90, and 135 degrees), which are presented four times each, giving rise to the 16 blocks of frames. Each stimulation is preceded by a 7 s presentation of a static bar grating, which is then moved for 3 s to form the actual stimulus. Afterwards, another randomly selected static bar grating is presented to prepare the next stimulation. If, in addition, ocular dominance is measured, eight conditions are presented only to the right eye and the other eight conditions only to the left eye, respectively.
Every block of five frames was processed by subtraction of the first captured frame from the four subsequent ones (first-frame analysis) and by normalization using a cocktail blank, as described in Bonhoeffer and Grinvald (1996), followed by a pixel-by-pixel average over frames 2 to 5. For the case of ocular dominance, difference images D_OD were generated by first averaging the frames for all left eye and right eye presentations separately and then subtracting the right eye average from the left eye average pixel by pixel. For the case of orientation selectivity, two difference images were obtained by first averaging all frames recorded during the presentations of an oriented grating independent of eye, and then subtracting the 90 degree (135 degree) from the 0 degree (45 degree) average pixel by pixel, generating the orientation difference images D_OR^{0-90} (D_OR^{45-135}).
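The preprocessing steps just described can be sketched as follows (a schematic re-implementation with illustrative array shapes; the cocktail-blank normalization step is omitted):

```python
import numpy as np

def process_block(frames):
    """First-frame analysis for one stimulus block: subtract the first
    frame from the four subsequent ones, then average frames 2 to 5."""
    frames = np.asarray(frames, dtype=float)
    return (frames[1:] - frames[0]).mean(axis=0)

def difference_image(blocks_a, blocks_b):
    """Average the processed blocks of two conditions and subtract pixel
    by pixel, e.g. 0-degree minus 90-degree gratings for D_OR^{0-90},
    or left eye minus right eye for D_OD."""
    mean_a = np.mean([process_block(b) for b in blocks_a], axis=0)
    mean_b = np.mean([process_block(b) for b in blocks_b], axis=0)
    return mean_a - mean_b

# Toy blocks of 5 frames of 4x4 pixels: condition A responds, B does not.
block_a = np.zeros((5, 4, 4)); block_a[1:] = 1.0
block_b = np.zeros((5, 4, 4))
d = difference_image([block_a], [block_b])
print(d.shape)  # (4, 4), every pixel equal to 1.0
```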
Figure 1: (Left) Differential maps for orientation (0–90 degrees, top; 45–135 degrees, bottom) of a cat (experiment C10b). (Middle) Ocular dominance map of the same experiment (C10b). (Right) Differential maps for orientation (0–90 degrees, top; 45–135 degrees, bottom) of a ferret (experiment F1a). The polygons specify the corresponding regions of interest before reduction in size.
drawn, which outlined a zone that avoids artifacts (see Figure 1). For the analysis, these regions of interest were reduced further by 30% in diameter in order to exclude pixels near the boundary that could be affected by artifacts. Table 1 summarizes the data sets used in our study, together with the animals' age, the size of the region of interest, estimates of the orientation and ocular dominance wavelengths, and an estimate of the mean and maximum signal-to-noise ratio of the difference images as a measure of data quality (see section 2.2). In order to enhance the signal-to-noise ratio of the difference images, a bandpass filter B,

B_{\vec{k}} = s\left( \frac{2\pi}{\lambda_{high}} - |\vec{k}| \right) s\left( |\vec{k}| - \frac{2\pi}{\lambda_{low}} \right),   (2.1)

is applied to the difference images in Fourier space, where \vec{k} is the wave vector and s(x),

s(x) = \frac{1}{1 + \exp(-\beta x)},   (2.2)
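Equations 2.1 and 2.2 translate directly into code. The sketch below assumes a known pixel size in millimeters (`pixel_mm` is our assumption, not a parameter given in the text) and applies the soft band-pass in Fourier space:

```python
import numpy as np

def logistic(x, beta=1.4):
    """Equation 2.2: s(x) = 1 / (1 + exp(-beta * x))."""
    return 1.0 / (1.0 + np.exp(-beta * x))

def bandpass_filter(image, pixel_mm, lam_high=0.25, lam_low=2.0, beta=1.4):
    """Equation 2.1: soft band-pass with logistic roll-offs, passing
    wave numbers |k| between 2*pi/lam_low and 2*pi/lam_high."""
    ny, nx = image.shape
    ky = 2.0 * np.pi * np.fft.fftfreq(ny, d=pixel_mm)
    kx = 2.0 * np.pi * np.fft.fftfreq(nx, d=pixel_mm)
    k = np.sqrt(kx[None, :] ** 2 + ky[:, None] ** 2)
    B = logistic(2.0 * np.pi / lam_high - k, beta) * \
        logistic(k - 2.0 * np.pi / lam_low, beta)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * B))

# A constant (DC-only) image lies outside the pass band and is
# strongly attenuated; the image shape is preserved.
out = bandpass_filter(np.ones((64, 64)), pixel_mm=0.05)
print(out.shape, abs(out.mean()) < 0.05)
```

The two logistic factors suppress, respectively, the high spatial frequency imaging noise and the low spatial frequency biological noise described in the text.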
Table 1: Summary of Data Sets Analyzed.

Ferret data:

Name   Age (d)  (m/s)mean  (m/s)max  ROI (mm2)  λ_OR (mm)
F1a*   38       0.46       1.77      5.32       0.64
F1b    38       0.50       3.54      9.81       0.71
F1c    43       0.58       2.52      17.64      0.73
F1d    56       0.27       0.83      9.81       0.76
F2a    44       0.55       1.79      10.13      0.73
F2b    47       0.62       1.89      34.80      0.71
F3a    35       0.82       2.52      13.40      0.68
F3b    39       0.69       1.69      13.40      0.69
F3c    42       0.57       1.74      13.40      0.70
F4a    37       0.71       2.06      23.84      0.81
F4b    39       0.57       1.85      23.84      0.74
F4c    42       0.65       1.79      23.84      0.78
F5a    40       0.63       1.75      18.12      0.70
F5b    43       0.54       1.72      18.12      0.65
F6a    38       0.71       1.99      19.46      0.71
F6b    41       0.67       1.65      19.46      0.68
F7*    100      0.70       1.82      7.20       0.63
F8     41       0.50       1.70      19.48      0.67
F9a    38       0.68       3.65      16.82      0.63
F9b    41       0.62       2.19      16.82      0.74
F9c    44       0.72       1.84      16.82      0.66
F9d*   118      0.46       1.52      7.60       0.61

Cat data:

Name   Age (d)  (m/s)mean  (m/s)max  ROI (mm2)  λ_OR (mm)  λ_OD (mm)
C1*    32       0.44       2.96      6.18       0.87       —
C2     65       0.31       2.25      11.43      0.83       —
C3     60       0.31       4.21      45.44      0.92       —
C4a    60       0.58       2.52      9.81       0.72       —
C4b*   60       0.31       1.97      7.95       0.76       —
C5*    50       0.36       3.39      15.02      0.63       —
C6*    50       0.30       1.12      7.51       0.65       —
C7a    54       0.29       2.94      32.12      0.72       —
C7b    54       0.46       1.64      9.55       0.72       0.83
C8a    52       0.42       1.54      30.74      0.71       —
C8b    52       0.47       1.63      15.37      0.71       0.76
C9     54       0.40       3.32      22.72      0.91       —
C10a*  50       0.43       1.75      9.58       0.72       —
C10b*  50       0.65       2.05      5.17       0.67       1.04
CS1a   27       0.27       6.20      8.10       0.75       0.65
CS1b   34       0.53       4.29      8.10       0.74       0.79
CS1c   42       0.41       10.63     8.10       0.78       0.80
CS1d   54       0.36       10.37     8.10       0.67       0.73
CS1e   61       0.41       6.71      2.24       0.80       1.02
CS2a   27       0.68       2.61      17.27      0.75       1.03
CS2b*  38       0.37       2.00      6.08       0.61       0.73

Notes: The first table shows ferret (F) data; the second shows data recorded from normal (C) and strabismic (CS) cats. Numbers within the names denote different animals; letters at the names' end indicate different recordings from the same animal and cortical area. Mean ((m/s)mean) and maximum ((m/s)max) signal-to-noise ratios, orientation wavelengths (λ_OR), and ocular dominance wavelengths (λ_OD) are measured as described in section 2.2. The region of interest (ROI) column lists the size of the cortical areas used for evaluation. Asterisks indicate animals that were recorded with higher resolution. Note that ocular dominance was recorded only in some of the experiments with cats, including normal and strabismic animals. Data with large regions of interest represent means over several experiments. Some of the data were already used in other publications.
Visual Cortex of Cats and Ferrets 2577
is the logistic function. The bandpass filter is designed to reduce the high spatial frequency noise induced by the imaging process (photon shot noise and camera noise) as well as the low spatial frequency biological noise (blood volume components, hemodynamics). The parameters were chosen as λ_high = 0.25 mm, λ_low = 2.0 mm, and β = 1.4 mm. In the following, spatially filtered images are used unless otherwise noted.

The orientation difference images $D_{OR}^{0-90}$ and $D_{OR}^{45-135}$ are sometimes combined to form the real and imaginary components of a complex orientation field $O(\vec{r})$ (Swindale, 1982),

$$O(\vec{r}) = D_{OR}^{0-90}(\vec{r}) + i D_{OR}^{45-135}(\vec{r}), \qquad (2.3)$$

where i is the imaginary unit. Using polar coordinates,

$$O(\vec{r}) = q(\vec{r}) \exp(2 i \varphi(\vec{r})), \qquad (2.4)$$

one then obtains a measure for the preferred orientation, $\varphi(\vec{r})$, and the orientation tuning strength, $q(\vec{r})$, as a function of cortical location (Blasdel & Salama, 1986). Figure 1 shows typical examples of an orientation map (Figure 1, left) and an ocular dominance map (Figure 1, middle) of a cat and an orientation map of a ferret (Figure 1, right).

2.2 Signal-to-Noise Ratios, Wavelengths, and Gradients. In order to obtain an estimate of the signal-to-noise ratio for the high-frequency photon shot noise as well as camera noise, every image is divided into blocks of 3 × 3 pixels. Mean values m and standard deviations s of the pixel values are calculated for every block, and the ratio m/s is averaged over all blocks within the region of interest to form the mean signal-to-noise ratio (m/s)mean, whereas (m/s)max stores the maximum of the ratio. The typical wavelengths of the orientation and ocular dominance maps are determined by first computing the spatial power spectra of the unfiltered difference image $D_{OD}$ and the unfiltered complex orientation map O, equation 2.3, using a fast Fourier transform. The power of all modes with similar wave number $|\vec{k}|$ is added, and the wave numbers $|\vec{k}_{OR/OD}|$ of the peaks of the summed spectra are determined. The wavelengths $\lambda_{OR/OD}$ are then calculated from $\lambda_{OR/OD} = 2\pi / |\vec{k}_{OR/OD}|$. Gradients of the difference images are estimated by fitting a plane $D = a + bx + cy$ to the pixel values of a 3 × 3 region (least-squares fit; Spiegel, 1975). The parameters a, b, and c are then given by

$$a = \frac{\sum_j D_j}{9}, \qquad b = \frac{\sum_j x_j D_j}{6}, \qquad c = \frac{\sum_j y_j D_j}{6}, \qquad (2.5)$$
where j runs over all pixels in the 3 × 3 region and $(x_j, y_j)$ are their spatial coordinates. The gradient, $\nabla D$, at the center pixel is then given by

$$\nabla D = (b, c)^T. \qquad (2.6)$$
Gradients of orientation preference angles are calculated in a similar fashion, except that D is replaced by the difference of the orientation angle at the considered pixel location and the orientation angle at the center pixel, corrected by multiples of π if necessary.

2.3 Positions and Signs of Orientation Singularities. Singularities in orientation maps correspond to the zeros of the function $O(\vec{r})$. When linearized, the orientation field around a singularity can be described by the location of the zero crossing and four additional parameters: amplitude, anisotropy, angle, and skewness (for details see Obermayer & Blasdel, 1997). The sign of a singularity is then given by the sign of its amplitude parameter. It is positive if orientation preferences increase in a clockwise direction around the singularity, and negative otherwise. In order to determine the positions of singularities in the experimental data, changes of orientation preference are added clockwise along small circles of radius 70 μm around every pixel in the image. If the sum along a closed loop is close to ±π (difference less than $10^{-6}$), then a singularity is enclosed by the circle, and the pixel is marked by either +1 or −1, depending on the sign of the enclosed singularity. Otherwise the pixel is marked by 0. Ideally, after this procedure, each singularity in the data is covered by a patch of radius 70 μm of pixels marked by ±1, which is surrounded by pixels marked with 0. Disks of radius 70 μm are then fitted to the regions marked +1 and −1. The position of the singularity is then given by the location of the center of the disk with maximum overlap with the corresponding region (best fit). We ignored regions with less than 33% overlap.

3 Models

3.1 Surrogate Data. We consider the hypothesis that the spatial patterns of orientation and ocular dominance can be fully described by their second-order statistics—that is, their power spectrum—but are otherwise random.
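The winding-number criterion of section 2.3 can be sketched as follows (a simplified version: the function name, the pixel-unit radius, and the 64 samples per circle are our assumptions; the paper works with a 70 μm radius and a ±π threshold with a 10⁻⁶ tolerance):

```python
import numpy as np

def winding_sum(angle_map, cy, cx, radius):
    """Sum of orientation changes along a closed circle around (cy, cx).

    Orientation angles live on [0, pi); each step difference is wrapped
    into [-pi/2, pi/2) before summing, as in section 2.3. The result is
    ~ +-pi if the circle encloses a singularity and ~ 0 otherwise.
    `radius` is in pixels (the micrometer conversion is assumed elsewhere).
    """
    t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
    ys = np.round(cy + radius * np.sin(t)).astype(int)
    xs = np.round(cx + radius * np.cos(t)).astype(int)
    angles = angle_map[ys, xs]
    d = np.diff(np.append(angles, angles[0]))     # close the loop
    d = (d + np.pi / 2) % np.pi - np.pi / 2       # wrap to [-pi/2, pi/2)
    return d.sum()
```

Because the raw differences around a closed loop telescope to zero, the wrapped sum is an exact multiple of π, so a synthetic pinwheel yields exactly ±π and a singularity-free region exactly 0.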
If the power spectrum is taken from experimentally measured maps, so-called surrogate data maps can be generated by randomizing the phases of the individual Fourier modes in the maps' Fourier spectra. Let $\tilde{D}(\vec{k})$ be the experimentally measured Fourier spectrum of a difference map $D(\vec{r})$, and let $\chi(\vec{k})$ be a random number from the interval [0, 2π]. A surrogate data map $D_{SD}$ is then calculated by

$$D_{SD} = F^{-1}\left( |\tilde{D}(\vec{k})| \exp\{ i \chi(\vec{k}) \} \right). \qquad (3.1)$$
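Equation 3.1 can be sketched in a few lines (NumPy; the function name and seeded generator are our assumptions; taking the real part at the end is a simplification, since a strictly real-valued surrogate would require Hermitian-symmetric random phases):

```python
import numpy as np

def surrogate_map(d_map, seed=0):
    """Phase-randomized surrogate of a difference map (equation 3.1):
    keep the Fourier amplitudes |D~(k)|, replace every phase by a random
    value chi(k) in [0, 2*pi), and transform back."""
    rng = np.random.default_rng(seed)
    amp = np.abs(np.fft.fft2(d_map))
    chi = rng.uniform(0.0, 2.0 * np.pi, size=d_map.shape)
    return np.real(np.fft.ifft2(amp * np.exp(1j * chi)))
```

For orientation maps, the text replaces $\tilde{D}(\vec{k})$ by the spectrum $\tilde{O}(\vec{k})$ of the complex orientation field, in which case the inverse transform is kept complex instead of taking the real part.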
In order to generate surrogate data for orientation preference maps, one has to take into account that both orientation difference maps, $D_{OR}^{0-90}$ and $D_{OR}^{45-135}$, are coupled, because they are generated by the orientation tuning curves of the same neurons. This implies that the phase relationship between both difference maps must be preserved; hence, phases must be randomized with respect to the Fourier spectrum $\tilde{O}(\vec{k})$ of the complex orientation map $O(\vec{r})$—that is, $\tilde{D}(\vec{k})$ has to be replaced by $\tilde{O}(\vec{k})$ in equation 3.1. Surrogate data maps, 192 × 144 pixels in size, were generated for each data set until the total number of singularities in all surrogate data maps was at least $10^5$ for every experimental map. Note that the above method of generating surrogate data is similar to the method of generating maps from bandpass-filtered noise models (Swindale, 1982; Rojer & Schwartz, 1990; Cowan & Friedman, 1991; Niebur & Wörgötter, 1994; Freund, 1994; Shvartsman & Freund, 1994; Schwartz, 1994; Erwin, Obermayer, & Schulten, 1995).

3.2 The Field Analogy Model. The field analogy model (FAM) (Wolf, Pawelzik, & Geisel, 1994; Wolf, Pawelzik, Geisel, Kim, & Bonhoeffer, 1994) is a pattern model that explains the spatial pattern of orientation maps by two underlying principles: (1) the orientation map is fully determined by the locations $\{x_j, y_j\}$ and signs $\{v_j\}$ of the singularities, and (2) the orientation map in between those singularities is optimally smooth. For every given set of locations and signs of singularities, an orientation map $\Psi(x, y)$ can be calculated as

$$\Psi(x, y) = \arg\left( \prod_j \{ (x - x_j) - i v_j (y - y_j) \} \right) = \arg\left( \prod_j O_j \right), \qquad (3.2)$$
which defines the FAM model. In order to test the FAM approach, locations and signs of singularities are determined from the experimental data, and the orientation map is reconstructed using equation 3.2. The error R between the original map $H(x, y)$ and its reconstruction $\Psi(x, y)$ is calculated by

$$R^2 = \frac{1}{N} \sum_{x,y} \left\{ g\left( \Psi(x, y) - H(x, y) + H_{ref} \right) \right\}^2, \qquad (3.3)$$

where (x, y) are pixel locations, N is the number of pixels in the sum, Ψ is given by equation 3.2, H denotes the original orientation preference, and $H_{ref}$ is chosen such that R is minimal. The function g is given by

$$g(\varphi) = \begin{cases} \varphi + \pi & \text{if } \varphi < -\pi/2 \\ \varphi & \text{if } -\pi/2 \le \varphi < \pi/2 \\ \varphi - \pi & \text{if } \varphi \ge \pi/2. \end{cases} \qquad (3.4)$$
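Equations 3.2 through 3.4 can be sketched as follows (NumPy; function names are ours). The arg of the product is computed as a sum of arguments, and we apply a factor 1/2 so that the result is an orientation in [0, π), which we assume is the intended angle convention; the reference angle $H_{ref}$ is minimized over a grid rather than analytically:

```python
import numpy as np

def fam_map(shape, centers, signs):
    """Equation 3.2: arg of the product over singularities j of
    (x - x_j) - i*v_j*(y - y_j), halved to give an orientation in [0, pi)."""
    yy, xx = np.mgrid[0:shape[0], 0:shape[1]].astype(float)
    total = np.zeros(shape)
    for (xj, yj), vj in zip(centers, signs):
        total += np.arctan2(-vj * (yy - yj), xx - xj)   # arg of one factor
    return (total % (2.0 * np.pi)) / 2.0

def reconstruction_error(psi, h, n_ref=180):
    """Equations 3.3 and 3.4: RMS of the pi-wrapped angle difference,
    minimized over the global reference angle H_ref on a grid."""
    def g(w):                        # equation 3.4, wraps into [-pi/2, pi/2)
        return (w + np.pi / 2.0) % np.pi - np.pi / 2.0
    refs = np.linspace(0.0, np.pi, n_ref, endpoint=False)
    return min(np.sqrt(np.mean(g(psi - h + r) ** 2)) for r in refs)
```

By construction the error of a map against itself is zero, and each singularity in `centers` contributes a ±π winding to the resulting orientation field.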
In order to avoid boundary effects, the maximum region of interest for the determination of the reconstruction error via equation 3.3 is 30% smaller in area than the original region of interest from Table 1.

4 Results

4.1 Distribution of Orientation Preferences. In the ferret, orientation maps consistently show an overrepresentation in area of orientation preferences close to 0 and 90 degrees (Chapman & Bonhoeffer, 1998). A significance test for this overrepresentation is not easy to perform, because the orientations of adjacent pixels are generally not statistically independent. Dependencies arise because of structure in the orientation map and a blur of the signal caused by light scattering in the tissue (Stetter & Obermayer, 1999). In order to avoid correlations due to tissue scattering, the orientation values were subsampled such that their distance exceeded the correlation length of about 100 μm (Bonhoeffer & Grinvald, 1996) caused by the blur. For the test, orientation values were sampled at the inverse Nyquist frequency of the low-pass-filtered images (125 μm). Figure 2 (left) shows a histogram of the percentage of pixels as a function of orientation preference for a typical example (ferret F2a). Angles are pooled into 12 bins of size 15 degrees and compared to the hypothesis that all orientation angles occur equally often. The deviation from uniformity is significant for all but three ferret data sets from Table 1 (χ² test, significance level of mean histogram: 34.1 > 19.7, α = 0.05). We also calculated the fractions of cortical area showing orientation preferences between 157.5 and 22.5 degrees and between 67.5 and 112.5 degrees. Typically, these cardinal-like orientations were found to occupy a fraction of 60% of the total area of the orientation map (compared to 50% expected for a random distribution). In some of the ferret data sets, it can be shown that the overrepresentation is largest shortly after eye opening (cf. Chapman & Bonhoeffer, 1998).
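The binned uniformity test can be sketched as follows (function name and interface are our assumptions; the resulting statistic is compared against the χ² critical value for 11 degrees of freedom, 19.7 at α = 0.05, as quoted in the text, and the subsampling to statistically independent pixels is assumed to have been done beforehand):

```python
import numpy as np

def chi2_uniform(angles_deg, n_bins=12):
    """Chi-square statistic for the hypothesis that orientation preferences
    are uniform over [0, 180) degrees, pooled into n_bins bins of 15 degrees
    as in section 4.1."""
    counts, _ = np.histogram(angles_deg % 180.0, bins=n_bins, range=(0.0, 180.0))
    expected = counts.sum() / n_bins
    return ((counts - expected) ** 2 / expected).sum()
```

A perfectly uniform sample gives a statistic of 0, while a sample concentrated near one orientation far exceeds the 19.7 threshold.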
In the cat, only a weak tendency for a preference of cardinal orientations can be observed. Some sets of data show an overrepresentation of angles close to 0 and 90 degrees; others show an overrepresentation of obliques; and for some sets of data, orientations are equally represented. Correspondingly, all but six cat orientation histograms are significantly nonuniform, yet with smaller significance levels than the ferret data. Moreover, the mean histogram does not significantly differ from uniformity (significance level 15.8 < 19.7, α = 0.05). Close to the border between areas 17 and 18, the cortical maps are arranged in an ordered way with respect to the border line (Chapman, Stryker, & Bonhoeffer, 1996). We therefore tested whether the preference for cardinal orientations is related to the proximity of the 17/18 border as well, by subdividing the regions of interest for all data sets into a lateral and a medial part of equal size, resulting in two parts near and far from the 17/18 bor-
Figure 2: Percentage of cortical area within the ROI as a function of its orientation preference (upper left) and of the angle between the preferred orientation and the local orientation gradient (upper right) for ferret F2. The lower two plots show the average distribution of orientation preference as error bars (2σ range) for ferrets (left) and cats (right). The diamonds above and below the error bars denote the minimum and maximum values found in our data.
der. The orientation histograms for both subdivisions were calculated for each animal, and the hypothesis that both histograms are drawn from the same distribution was tested on both the individual histograms and the histograms summed over all cats and all ferrets within each subdivision. The hypothesis could not be rejected in either case, and the extremely low significance values (summed histograms: χ² = 0.4 < 25, α = 0.05 for cats; χ² = 1.28 < 23.7, α = 0.05 for ferrets) showed that the histograms were very similar to each other. This suggests that the preference for cardinal orientations is not evoked by the proximity to the 17/18 border. However, the cortical areas examined represent only a small fraction of area 17 and do not contain parts very far from the border, which could contribute to the lack of significance.
Figure 2 (right) shows a histogram of the number of pixels as a function of the angle between the preferred orientation at a spatial location and the local orientation gradient for a typical data set (ferret F2). No significant correlations between the preferred orientation and the direction of the iso-orientation lines have been observed for the data sets of Table 1 or for the corresponding surrogate data maps. This finding is similar to what has been reported for macaque striate cortex (Obermayer & Blasdel, 1993). The average wavelength of the orientation pattern is 0.70 ± 0.01 mm for ferrets and 0.75 ± 0.03 mm for cats (see Table 1). Table 2 shows the density of singularities for all data sets listed in Table 1. The density of singularities was found to be (4.5 ± 0.2) mm⁻² for ferrets and (3.5 ± 0.2) mm⁻² for cats with ages between 50 and 61 days (error values are the standard error of the mean). Changes with age have been tested in a longitudinal study on the strabismic kitten CS1. For this animal, the singularity density was found to decrease significantly with age, the decrease rate being 2.1% per day (correlation coefficient r = −0.88). However, no significant change of the orientation wavelength with age could be detected (correlation coefficient r = 0.03) for this particular animal.

4.2 Correlations Between Locations and Signs of Singularities. Table 2 shows the percentages of nearest-neighbor pairs that have same and opposite signs. On average, 19.7% ± 1.5% (ferrets), 21.4% ± 2.0% (normal cats), and 22% (strabismic cats) of the pairs of singularities are of equal sign, where error margins denote the standard deviation of the mean. This has to be compared with the values for the surrogate data model (between 13.6% ± 0.2% and 12.4% ± 0.3%), which are significantly lower. The numbers increase for the second nearest neighbors (≈ 40%) and approach 50% for singularities further apart (data not shown). No significant tendencies have been observed with age or raising conditions.
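The nearest-neighbor sign statistic of Table 2 can be sketched as follows (brute-force pairwise distances; the function name and interface are our assumptions):

```python
import numpy as np

def sign_pair_fractions(points, signs):
    """Fractions of nearest-neighbor singularity pairs with equal and with
    opposite sign (cf. Table 2). points: (n, 2) array-like; signs: +-1 each."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = d.argmin(axis=1)                  # index of each nearest neighbor
    s = np.asarray(signs)
    same = np.mean(s == s[nn])
    return same, 1.0 - same
```

For a sign-independent Poisson process the equal-sign fraction would approach 50% at large separations, which is the baseline the text compares against.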
Figure 3 shows a typical histogram of the distances between singularities that are nearest neighbors, independent of sign. The results for the experimental maps are compared to the results for surrogate data (SD) and for a random distribution of points in the plane (RAN). Let ρ be the number density of points and $D = 1/\sqrt{\rho}$ be the square root of the average area per point. The probability density P(d) of distances d between nearest neighbors is then given by the Rayleigh distribution (Papoulis, 1965)

$$P(d) = 2\pi \frac{d}{D} \exp\left( -\pi \left( \frac{d}{D} \right)^2 \right). \qquad (4.1)$$
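Equation 4.1, read as a density in the normalized distance u = d/D, integrates to 1 and has mean 1/2, so the mean nearest-neighbor distance of a Poisson process is D/2. This can be checked against simulated random points (an illustrative sketch under those assumptions; the helper names are ours, not from the paper):

```python
import numpy as np

def rayleigh_nn(u):
    """Equation 4.1 as a density in the normalized distance u = d / D."""
    return 2.0 * np.pi * u * np.exp(-np.pi * u ** 2)

def mean_nn_distance(n=2000, seed=1):
    """Mean nearest-neighbor distance of n uniform points in the unit square.
    Equation 4.1 predicts a mean of 0.5 * D with D = 1 / sqrt(n); edge
    effects of the finite square raise the simulated value slightly."""
    pts = np.random.default_rng(seed).uniform(0.0, 1.0, size=(n, 2))
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()
```

This is the RAN baseline against which the experimental and surrogate histograms of Figure 3 are compared.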
Figure 3 shows that the peak of the experimental distribution under consideration of all pinwheel pairs is slightly shifted toward larger values compared to equation 4.1, with the average shift being 0.03 D for both cats
Table 2: Number Densities and Sign Statistics of Singularities for All Data Sets Listed in Table 1 and the Corresponding Surrogate Data (SD).

Ferret:
Name   #±  #+  ρ(mm⁻²)  OR: ++%  +−%  −−%  −+%  Σ=%  Σ≠%   SD: ++%  +−%  −−%  −+%  Σ=%  Σ≠%
F1a    42  20   7.89        10   42    3   45   13   87        6   44    6   44   12   88
F1b    44  21   4.48        14   32   19   35   32   68        6   43    7   43   13   87
F1c    74  42   4.20        24   35   14   27   38   62        7   43    7   43   14   86
F1d    39  20   3.97         3   42   12   42   15   85        7   43    7   43   15   85
F2a    41  21   4.05        14   39    3   44   17   83        7   43    7   43   14   86
F2b   115  62   3.31         8   42   12   38   20   80        7   43    7   43   14   86
F3a    75  41   5.6          5   47   11   38   15   85        6   43    7   43   13   87
F3b    59  31   4.41        10   42    8   40   17   83        6   43    7   44   13   87
F3c    48  29   3.58        21   44    5   31   26   74        6   44    6   43   13   87
F4a   125  63   5.24         8   43   10   39   18   82        6   44    6   44   12   88
F4b   108  56   4.53        13   39    8   41   20   80        7   43    7   44   13   87
F4c   116  63   4.87        14   39    6   40   20   80        6   43    7   43   13   87
F5a    69  36   3.84        17   37    8   38   25   75        8   42    8   42   15   85
F5b    77  41   4.29        19   34   11   36   30   70        7   43    7   43   14   86
F6a    96  52   4.95        14   41    7   38   21   79        6   44    6   44   12   88
F6b    85  39   4.39         5   41   11   43   16   84        7   43    7   43   13   87
F7     39  19   5.43         6   42   16   35   23   77        6   44    6   44   12   88
F8     71  35   3.71         9   39    8   44   17   83        7   43    7   43   14   86
F9a    77  38   4.61         9   42   13   36   22   78        6   44    6   44   13   87
F9b    69  32   4.13         2   44    8   46   10   90        7   43    7   43   14   86
F9c    68  32   4.07         6   39    9   46   15   85        6   44    7   43   13   87
F9d    52  26   6.9          2   46    5   46    7   93        5   44    5   45   11   89

Cat:
Name   #±  #+  ρ(mm⁻²)  OR: ++%  +−%  −−%  −+%  Σ=%  Σ≠%   SD: ++%  +−%  −−%  −+%  Σ=%  Σ≠%
C1     22   9   3.56         0   38   13   50   13   88        5   45    6   44   11   89
C2     23  13   2.01         6   47    6   41   12   88        6   44    6   44   12   88
C3    136  72   3.10        15   42   11   32   26   74        7   43    7   43   14   86
C4a    41  21   4.18        16   39   19   26   35   65        7   43    7   43   14   86
C4b    37  20   4.66         0   44   15   41   15   85        6   44    6   44   12   88
C5     59  31   3.93        13   44    0   43   13   87        5   45    6   44   11   89
C6     23  11   3.06         0   45    5   50    5   95        7   43    6   44   13   87
C7    106  51   3.30        10   39   10   41   20   80        7   43    7   43   14   86
C8a   107  56   3.48        10   40   13   37   23   77        7   43    6   44   13   87
C8b    44  25   2.86        13   45    5   38   18   83        6   43    6   44   13   87
C9     71  35   3.14         8   44    6   42   14   86        6   44    7   43   13   87
C10a   36  19   3.82        10   41    3   45   14   86        7   43    6   44   12   88
C10b   25  12   4.83        17   25   42   17   58   42        6   44    6   44   12   88
CS1a   48  26   5.93        12   44    5   39   17   83        6   44    6   44   12   88
CS1b   49  27   6.05        10   45    4   41   14   86        6   44    6   44   12   88
CS1c   28  14   3.46        17   30   22   30   39   61        7   43    6   43   13   87
CS1d   28  14   3.46        13   30    9   48   22   78        7   43    7   43   14   86
CS1e    7   3   3.13         0   40    0   60    0  100        7   43    6   44   13   87
CS2a   88  49   5.1         21   36    6   37   27   73        6   43    6   44   13   87
CS2b   52  27   8.6          9   41    9   41   17   83        5   44    6   45   11   89

Notes: The upper block shows ferret data; the lower block shows cat data. The columns show (left to right): name of the data set (see Table 1), total number of singularities within the ROI (#±), number of positive singularities (#+), the number density ρ, and the percentages of nearest neighbors with the same and opposite sign (the first sign denotes the selected pinwheel, the second sign its nearest neighbor; Σ= and Σ≠ denote the sums for equal-sign and opposite-sign pairs) for experimental (OR) and for surrogate (SD) data.
Figure 3: Percentage of pairs of singularities independent of sign (upper) as a function of their distance for ferret data (left) and for data from normally raised cats (right). All distances are normalized to the square root of the inverse number density of the singularities before averaging histograms over the different sets of data. Bins denote experimental data, diamonds the corresponding surrogate data maps (SD), and filled circles the results for a random distribution of points (RAN). The lower two plots show the distribution for same-sign singularities.
and ferrets. The distribution for surrogate data is shifted toward larger distances compared to the experimental data: 0.04D for ferrets and 0.03D for cats. Similar histograms have been constructed for all data sets and have been compared to the random hypothesis by a χ² test. Both the hypothesis RAN and the hypothesis SD have to be rejected for all sets of data (α = 0.05). There were no obvious trends between different species or between animals of different ages. For the distribution of distances between nearest neighbors of the same sign only, the shift to larger distances is more pronounced compared to pairs with opposite sign, for both ferrets and cats (0.1D on average). The results show the same tendency previously observed for primate orientation maps: that while singularities of opposite sign attract each other,
Figure 4: Average FAM reconstruction error as a function of region size for ferrets (left) and cats (right). The circles at the center of the vertical bars denote the average error over all data sets recorded with a low spatial resolution (no asterisks in Table 1), the bars denote the 2σ interval, and the outermost cross hairs denote the maximal and minimal errors observed.
singularities on the whole repel each other, compared to a random distribution. The sign correlations are of approximately equal strength in all three species; the shift of the distance distribution is stronger in cat and macaque and less strong in ferret. The sign correlations are reproduced by surrogate data maps; however, the sign correlations in those maps are significantly stronger than in the experimental maps. The distribution of distances between nearest neighbors, on the other hand, is well reproduced by the SD maps for cats and ferrets, in contrast to primates (Obermayer & Blasdel, 1997), for which the SD curve is shifted to larger distances.

4.3 Comparison of FAM Reconstructions with Data. Figure 4 shows the average reconstruction errors and their variation across all ferret (see Figure 4, left) and cat (see Figure 4, right) data sets as a function of the size of the reconstructed area. Reconstruction errors (see equation 3.3) are low for small areas and approach an average value of 0.5 (chance level is 0.907) for the largest maps. Reconstruction errors are on average smaller than the reconstruction errors previously obtained for primate maps (Obermayer & Blasdel, 1997); however, the size of the reconstructed area was considerably larger in that case, in μm as well as in units of D. FAM reconstructions of surrogate data maps show reconstruction errors that are slightly lower than the reconstruction errors of the original data.

4.4 Ocular Dominance Maps and Their Relationship with Orientation Preference Maps. Ocular dominance maps have been recorded from five sets of cat data, one of which is a longitudinal study from a stra-
Figure 5: (Left) Percentage of cortical area as a function of the local intersection angle of the orientation and ocular dominance slabs for animals C7 (upper) and C8 (lower). Iso-orientation and iso-ocular dominance lines tend to cross each other orthogonally. (Right) Distributions of the smallest distances between orientation singularities and ocular dominance zero crossings for the same animals. Bins denote experimental data and diamonds the results for a random distribution of points (RAN). In contrast to the hypothesis RAN, singularities in the experimental data significantly avoid very small distances to zero crossings, indicating their tendency to align along the ocular dominance centers (χ² test, significance level 30.7 > 12.6).
bismic kitten (CS1). The ocular dominance wavelengths, listed in Table 1, yield an average of 0.83 ± 0.06 mm for animals with ages between 50 and 61 days. For the longitudinal study (kitten CS1), the ocular dominance wavelength was found to increase with age by 0.8% per day (r = 0.74). Gradients have been calculated for the orientation and the ocular dominance maps as described in the previous section. The cortical area is plotted against the angle of intersection between the orientation and the ocular dominance slabs in Figure 5 (left). Two of the three cats raised under normal conditions show a preference for orthogonality (Figure 5, left) (Hübener et al., 1997), whereas the two strabismic kittens do not. Singularities, however, tend to lie in the centers of the ocular dominance bands for all of the maps investigated (see Figure 5, right) (Hübener et al., 1997).
5 Discussion

5.1 Technical Considerations. Optical imaging of intrinsic signals infers two-dimensional neuronal activation patterns from the local changes in blood oxygenation and in the light-scattering properties of the tissue (Frostig, Lieke, Ts'o, & Grinvald, 1990; Bonhoeffer & Grinvald, 1996; Malonek & Grinvald, 1996) caused by active neuron populations. The total optical intrinsic signal represents a mixture of highly localized components, spatially more global yet stimulus-related components, and stimulus-independent components. In the context of optical imaging, the highly resolving components form the mapping signal, while the components with low spatial frequencies contribute to the biological noise. Because the absorption and scattering of light in tissue strongly depend on the light wavelength, the relative sizes of the different components, the penetration depth from which the signals arise, and the effect of artifacts caused by superficial vasculature are functions of the light wavelength. At shorter wavelengths near 600 nm, intrinsic signals are dominated by the absorption behavior of hemoglobin, whereas at longer wavelengths beyond 700 nm, light-scattering effects increasingly dominate the intrinsic signals. Although a recent study (McLoughlin & Blasdel, 1997) has shown that maps recorded at 720 nm and 605 nm do not differ much from each other, light at the shorter wavelength was found to be more strongly absorbed by the superficial blood vessels, which weakens the mapping signal. The wavelength used in this study yields scattering-dominated intrinsic signals. Without further preprocessing, some of the measures addressed in this work, including the orientation histograms, singularity densities, and singularity locations, are affected by biological noise.
In order to optimize the separation of mapping signals and biological noise, the singularity measures and the orientation histograms were calculated for different cutoff frequencies of the high-pass filters, and the common optimal value was chosen as the center of a plateau in the plot of the measures versus cutoff frequency. Although this method is probably unable to perform a complete separation of the components, it represents the optimal separation achievable with bandpass filtering. Our data sets were preprocessed by normalization with a cocktail blank, which is the average of the recorded images over all stimulus conditions. This helps reduce the global components of the intrinsic signal (Bonhoeffer & Grinvald, 1996) and at the same time brings influences from all stimulus conditions into a single differential image. For example, if a differential image of horizontal and vertical orientations is divided by a cocktail blank, the resulting image implicitly contains contributions from the oblique orientations, because the cocktail blank does as well. Since this could affect the results of an analysis based on the latter images, we performed a second analysis of our orientation maps. This time we used first-frame analysis and a subsequent subtraction of orthogonal single-condition stacks, but omit-
ted the normalization by a cocktail blank. The orientation histograms and the positions of the singularities were then compared with those obtained using a cocktail blank. The orientation histograms did not change significantly, and both versions were visually indistinguishable from each other. Also, the numbers of singularities did not change, and their positions changed by one pixel at maximum. We conclude that normalization by the cocktail blank does not affect our results. A possible reason for this is that one major influence of the cocktail blank seems to be a reduction of the biological noise with low spatial frequency, which is subsequently removed again by bandpass filtering.

5.2 Statistical Significance and Comparison with Models. Ferret orientation maps showed a pronounced bimodal modulation caused by an overrepresentation of 0 degree and 90 degree preferred orientations. The modulation of cat orientation histograms was only weak and not always confined to the 0 degree and 90 degree orientation values. Statistical testing of these modulations against the hypothesis of a uniform distribution is difficult to perform, since the orientation values for neighboring pixels of the orientation map are not statistically independent. Correlations are introduced both by dependencies between the neuronal orientation preferences and by a blur of the intrinsic signals of about 100 to 200 μm (Bonhoeffer & Grinvald, 1996) due to light scattering within the tissue (Stetter & Obermayer, 1999). We therefore considered the mean orientation taken over disjoint square patches 125 μm in diameter (which is also the Nyquist frequency for the bandpass-filtered maps) to avoid statistical dependencies introduced by the blur, and performed the significance tests based on these samples. The method used for the automatic detection of pinwheel centers evaluates the sum of orientations around closed pathways.
Thus, it is directly related to the definition of singularities as sources of curl of a vector field. Because it evaluates only the sum over the orientations instead of requiring all orientations to be present around a singularity (Bonhoeffer, Kim, Malonek, Shoham, & Grinvald, 1995), the proposed method can also detect highly skewed singularities, around which orientations are strongly accumulated near some values. Other methods, which find singularities by detecting zeros of the complex orientation vectors (Löwel et al., 1998), are in turn sensitive to artifacts if the zero crossings of the vector components meet at small angles, that is, if the singularities are highly skewed. The proposed method is insensitive to the skewness of singularities and in addition provides their signs. Accordingly, our results for the singularity densities, for example, (3.5 ± 0.2) mm⁻² for cats, are somewhat higher than those found in the literature, which range between 2.1 and 2.7 mm⁻² (Bonhoeffer et al., 1995; Löwel et al., 1998). The comparison of orientation wavelengths and singularity densities between species and different deprivation stages was based on mean values and standard errors of the means over all animals of a given group.
This procedure neglects possible systematic interindividual variations of the quantities under consideration and treats maps from different animals as realizations of a single random process. However, this procedure provides an upper bound for the true error values, because the given errors now include both random and possible systematic contributions. As a consequence, findings that report quantities to be different from each other (singularity densities) can be regarded as significant; equality of data sets (in the case of orientation wavelengths) could be spurious, because systematic interindividual variations can mask a small yet existing trend. Orientation wavelengths and singularity densities are different in areas 17 and 18 (Bonhoeffer et al., 1995), and the cortical areas examined could possibly contain the 17/18 border and thus a small fraction of area 18. Hence, it cannot be ruled out a priori that our results are affected by contributions from area 18. We tested this point by subdividing the regions of interest for all data sets into a lateral and a medial part of equal size, such that one part could partially contain area 18, whereas the other part is purely area 17. We reanalyzed the orientation maps in both parts separately and compared the results with each other—once for individual animals and once for the sums over all animals. There was no statistically significant difference in the singularity densities, and the orientation wavelengths ranged within the general wavelength variability observed. One possible conclusion from this finding is that our data did not contain significant contributions from area 18. However, the areas and the total numbers of singularities in the subdivisions are rather small, which probably contributes to the lack of significance. One may nevertheless conclude that, based on the data available, our results are not significantly changed by the possible presence of small area 18 patches in the cortical area examined.
The distance and sign distributions of pairs of neighboring singularities were found to be significantly different from the numbers predicted by both the surrogate data (SD) and the two-dimensional Poisson (RAN) models. Real distributions showed fewer opposite-sign pairs than SD data, and the singularities repelled each other compared to both SD and RAN, the latter trend being extraordinarily strong for same-sign singularity pairs. The positions of singularities can be affected by both random white noise and low-spatial-frequency components of biological noise. The first contribution acts to randomize the singularity distributions and thus tends to equalize the observed and the random singularity distributions. In particular, white noise cannot artificially generate the observed systematic deviations from a random distribution. Low-spatial-frequency noise can in principle cause systematic shifts in singularity positions. However, a close coincidence of singularity positions deduced from optical imaging with those obtained from tetrode measurements (Maldonado, Gödecke, Gray, & Bonhoeffer, 1997) shows that careful bandpass filtering can remove positional artifacts to a considerable degree. A significant difference of the measured singularity statistics to that predicted by the surrogate
2592
Müller et al.
data model (Swindale, 1982; Rojer & Schwartz, 1990) indicates that the hypothesis that orientation maps originate from spatial smoothing of random orientations has to be rejected and that higher than second-order statistics are important for the structural description of cat and ferret orientation maps. The global structure of orientation maps between the singularities cannot be explained completely by an optimal smoothness criterion as implemented by field analogy models. The reconstruction error remains considerably below the error for random orientations, but increases systematically with the size of the map under consideration for both the measured and surrogate data. Therefore, the local behavior of orientation maps seems to follow an optimal smoothness criterion, while the global structure shows strong deviations from that principle. The drop of the error at very small map sizes is probably caused by smoothing due to both tissue scattering and the low-pass filter step during postprocessing. The relatively high reconstruction errors for larger maps can in part be caused by the neglect of anisotropies of the singularities (see also Obermayer & Blasdel, 1997, for monkey data).

5.3 Biological Implications. Our findings indicate that vertical and horizontal orientations are overrepresented in ferrets, whereas in cats only a weak tendency for overrepresentation is visible. One possible origin of overrepresentation is the more frequent occurrence of vertical and horizontal orientations of contrast lines in both artificial and purely natural visual scenes (Coppola, Purves, McCoy, & Purves, 1998; Field, 1987; van der Schaaf & van Hateren, 1996), which could have caused an overrepresentation of orientation preference during either ontogenetic or phylogenetic development. The wavelengths of orientation maps in cat area 17 do not seem to change significantly between animals raised under normal conditions and strabismic animals of similar age.
This finding gives rise to the hypothesis that the scale of the orientation map is only weakly dependent on the statistical structure of the visual input during development, and is mostly determined by factors intrinsic to the cortex, such as constraints based on the intrinsic cortical wiring patterns. It is consistent with the recent result that reverse occlusion during development leads to a precise restoration of the orientation map (Kim & Bonhoeffer, 1994; Gödecke & Bonhoeffer, 1996; Sengpiel et al., 1998), which also points toward the existence of intracortical factors determining the structures of orientation maps. The singularity density has been found to decrease with age in one strabismic animal, while the ocular dominance wavelength increased. Wider ocular dominance bands have been reported for strabismic animals than for normal cats (Löwel, 1994). As both maps interact with each other, as demonstrated, for example, by the angular correlations between iso-orientation and iso-ocular dominance lines (see Figure 5), a structural change in the ocular dominance pattern would be likely to evoke structural changes in the orientation map too. However, the observed increase in ocular dominance wavelength paralleled the isotropic growth of the primary visual cortex of cats during the same period (Sengpiel et al., 1998; Duffy, Murphy, & Jones, 1998). The fact that across all animals included in this study, neither the orientation wavelengths nor the density of singularities was found to change significantly with age may be attributable to the fact that most of the ferret and cat data represented very limited age ranges (37 to 47 days and 50 to 60 days, respectively).

Acknowledgments
This work was funded by DFG (Ob 102/3-1), the Max-Planck-Gesellschaft, and the EC Biotech Program.

References

Anderson, P. A., Olavarria, J., & Sluyters, R. C. V. (1988). The overall pattern of ocular dominance bands in cat visual cortex. J. Neurosci., 8, 2183–2200.
Blasdel, G. G. (1992). Differential imaging of ocular dominance and orientation selectivity in monkey striate cortex. J. Neurosci., 12, 3115–3138.
Blasdel, G. G., & Salama, G. (1986). Voltage-sensitive dyes reveal a modular organization in monkey striate cortex. Nature, 321, 579–585.
Bonhoeffer, T., & Grinvald, A. (1993). The layout of iso-orientation domains in area 18 of cat visual cortex: Optical imaging reveals a pinwheel-like organization. J. Neurosci., 13, 4157–4180.
Bonhoeffer, T., & Grinvald, A. (1996). Optical imaging based on intrinsic signals: The methodology. In A. Toga & J. C. Maziotta (Eds.), Brain mapping: The methods (pp. 55–97). Orlando, FL: Academic Press.
Bonhoeffer, T., Kim, D. S., Malonek, D., Shoham, D., & Grinvald, A. (1995). Optical imaging of the layout of functional domains in area 17 and across the area 17/18 border in cat visual cortex. Eur. J. Neurosci., 7, 1973–1988.
Chapman, B., & Bonhoeffer, T. (1998). Overrepresentation of horizontal and vertical orientation preferences in developing ferret area 17. Proc. Natl. Acad. Sci. USA, 95, 2609–2614.
Chapman, B., Stryker, M. P., & Bonhoeffer, T. (1996). Development of orientation preference maps in ferret primary visual cortex. J. Neurosci., 16, 6443–6453.
Coppola, D. M., Purves, H. R., McCoy, A. N., & Purves, D. (1998). The distribution of oriented contours in the real world. Proc. Natl. Acad. Sci. USA, 95, 4002–4006.
Coppola, D. M., White, L. E., Fitzpatrick, D., & Purves, D. (1998). Unequal representation of cardinal and oblique contours in ferret visual cortex. Proc. Natl. Acad. Sci. USA, 95, 2621–2623.
Cowan, J., & Friedman, A. (1991). Simple spin models for the development of ocular dominance columns and iso-orientation patches. In R. Lippman, J. Moody, & D. Touretzky (Eds.), Advances in neural information processing systems, 3 (pp. 26–31). San Mateo, CA: Morgan Kaufmann.
Duffy, K. R., Murphy, K. M., & Jones, D. G. (1998). Analysis of the postnatal growth of visual cortex. Vis. Neurosci., 15, 831–839.
Erwin, E., Obermayer, K., & Schulten, K. (1995). Models of orientation and ocular dominance columns in the visual cortex: A critical comparison. Neural Comput., 7, 425–468.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am. A, 4(12), 2379–2394.
Freund, I. (1994). Optical vortices in gaussian random fields: Statistical probability densities. J. Opt. Soc. Am. A, 11, 1644–1652.
Frostig, R. D., Lieke, E. E., Ts'o, D. Y., & Grinvald, A. (1990). Cortical functional architecture and local coupling between neuronal activity and the microcirculation revealed by in vivo high-resolution optical imaging of intrinsic signals. Proc. Natl. Acad. Sci. USA, 87, 6082–6086.
Gödecke, I., & Bonhoeffer, T. (1996). Development of identical orientation maps for two eyes without common visual experience. Nature, 379, 251–254.
Hübener, M., Shoham, D., Grinvald, A., & Bonhoeffer, T. (1997). Spatial relationships among three columnar systems in cat area 17. J. Neurosci., 17, 9270–9284.
Kim, D. S., & Bonhoeffer, T. (1994). Reverse occlusion leads to a precise restoration of orientation preference maps in visual cortex. Nature, 370, 370–372.
LeVay, S., Stryker, M. P., & Shatz, C. J. (1978). Ocular dominance columns and their development in layer IV of the cat's visual cortex: A quantitative study. J. Comp. Neur., 179, 223–244.
Löwel, S. (1994). Ocular dominance column development: Strabismus changes the spacing of adjacent columns in cat visual cortex. J. Neurosci., 14, 7451–7468.
Löwel, S., Schmidt, K. E., Kim, D. S., Wolf, F., Hoffsümmer, F., Singer, W., & Bonhoeffer, T. (1998). The layout of orientation and ocular dominance domains in area 17 of strabismic cats. Eur. J. Neurosci., 10, 2629–2643.
Maldonado, P. E., Gödecke, I., Gray, C. M., & Bonhoeffer, T. (1997). Orientation selectivity in pinwheel centers in cat striate cortex. Science, 276, 1551–1555.
Malonek, D., & Grinvald, A. (1996). Interactions between electrical activity and cortical microcirculation revealed by imaging spectroscopy: Implications for functional brain mapping. Science, 272, 551–554.
McLoughlin, N. P., & Blasdel, G. G. (1997). Effect of wavelength on differential images of ocular dominance and orientation in monkey striate cortex. Soc. Neurosci. Abstr., 23, 13.
Niebur, E., & Wörgötter, F. (1994). Design principles of columnar organization in visual cortex. Neural Comput., 6, 602–614.
Obermayer, K., & Blasdel, G. G. (1993). Geometry of orientation and ocular dominance columns in monkey striate cortex. J. Neurosci., 13, 4114–4129.
Obermayer, K., & Blasdel, G. G. (1997). Singularities in primate orientation maps. Neural Comput., 9, 555–567.
Papoulis, A. (1965). Probability, random variables, and stochastic processes. New York: McGraw-Hill.
Redies, C., Diksic, M., & Riml, H. (1990). Functional organization in the ferret visual cortex: A double-label 2-deoxyglucose study. J. Neurosci., 10, 2791–2803.
Rojer, A. S., & Schwartz, E. L. (1990). Cat and monkey cortical columnar pattern modeled by bandpass-filtered 2D white noise. Biol. Cybern., 62, 381–391.
Schwartz, E. L. (1994). Computational studies of the spatial architecture of primate visual cortex. In A. Peters & K. S. Rockland (Eds.), Cerebral cortex. New York: Plenum Press.
Sengpiel, F., Gödecke, I., Strawinski, P., Hübener, M., Löwel, S., & Bonhoeffer, T. (1998). Innate and environmental factors in the formation of functional maps in cat visual cortex. Neuropharmacology, 37, 607–621.
Shmuel, A., & Grinvald, A. (1996). Functional organization for direction of motion and its relationship to orientation maps in cat area 18. J. Neurosci., 16, 6945–6964.
Shvartsman, N., & Freund, I. (1994). Vortices in random wave fields: Nearest neighbor anticorrelations. Phys. Rev. Lett., 72, 1008–1011.
Spiegel, M. R. (1975). Probability and statistics. New York: McGraw-Hill.
Stetter, M., & Obermayer, K. (1999). Simulation of scanning laser techniques for optical imaging of blood-related intrinsic signals. J. Opt. Soc. Am. A, 16, 58–70.
Swindale, N. V. (1982). A model for the formation of orientation columns. Proc. R. Soc. Lond. B, 215, 211–230.
van der Schaaf, A., & van Hateren, J. H. (1996). Modelling the power spectra of natural images: Statistics and information. Vision Res., 36, 2759–2770.
Weliky, M., Bosking, W. H., & Fitzpatrick, D. (1996). A systematic map of direction preference in primary visual cortex. Nature, 379, 725–728.
Wolf, F., Pawelzik, K., & Geisel, T. (1994). Emergence of long-range order in maps of orientation preference. In M. Marinaro & P. G. Morasso (Eds.), Artificial neural networks I. Amsterdam: Elsevier Science.
Wolf, F., Pawelzik, K., Geisel, T., Kim, D. S., & Bonhoeffer, T. (1994). Optimal smoothness of orientation preference maps. In F. H. Eeckman (Ed.), Computation in neurons and neural systems. Norwell, MA: Kluwer.

Received February 17, 1999; accepted February 15, 2000.
LETTER
Communicated by Ad Aertsen
Improvements to the Sensitivity of Gravitational Clustering for Multiple Neuron Recordings

Stuart N. Baker
Department of Anatomy, University of Cambridge, Cambridge, CB2 3DY, U.K.

George L. Gerstein
Department of Neuroscience, University of Pennsylvania, Philadelphia, PA 19104, U.S.A.

We outline two improvements to the technique of gravitational clustering for detection of neuronal synchrony, which are capable of improving the method's detection of weak synchrony with limited data. The advantages of the enhancements are illustrated using data with known levels of synchrony and different interspike interval distributions. The novel simulation method described can easily generate such test data. An important dependence of the sensitivity of gravitational clustering on the interspike interval distribution of the analyzed spike trains is described.

1 Introduction
There is an increasing realization among neuroscientists that neurons can code information by more complex means than just the mean firing rate of their discharge (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). One important additional neural code appears to be the simultaneous discharge of two or more cells. Such short-term synchrony has been proposed to underlie processing in a wide range of neural systems (Espinosa & Gerstein, 1988; Lindsey, Hernandez, Morris, & Shannon, 1992; Singer & Gray, 1995; Vaadia et al., 1995; Baker, Olivier, & Lemon, 1997). The detection of above-chance synchrony between cells from data limited in recording length and number of spikes is therefore an important and challenging analysis issue (Strangman, 1997). Gravitational clustering (Gerstein, Perkel, & Dayhoff, 1985; Gerstein & Aertsen, 1985) has been proposed to permit rapid detection of synchronous cell assemblies in simultaneous recordings from large numbers of neurons. Each cell is represented as a particle in a higher-dimensional space. Action potentials produce transient "charges" on the particle corresponding to the cell that fired. The particles move according to forces caused by attraction or repulsion of the charges. Particles relating to cells that fire synchronously thereby aggregate together. Following a gravitational analysis, aggregated

Neural Computation 12, 2597–2620 (2000) © 2000 Massachusetts Institute of Technology
groups of particles indicate the presence of synchronous cell assemblies. The dynamic nature of the method allows it to visualize changes in groupings of neural synchronization over the duration of the recording. The gravitational clustering approach has been successfully applied to a range of experimental data (Gochin, Gerstein, & Kaltenbach, 1990; Lindsey, Morris, Shannon, & Gerstein, 1997; Lindsey, Shannon, & Gerstein, 1989; Aertsen, Bonhoeffer, & Kruger, 1987; Aertsen et al., 1991). When compared with conventional cross-correlation-based approaches, the principal advantage of gravitational clustering is that it scales simply with the number of neurons whose firing has been recorded, and multiple pairwise interactions can be detected simultaneously. The sensitivity of gravitational clustering has been shown to be similar to that of cross-correlation approaches in detecting synchrony with limited data (Strangman, 1997). An important difference between the two methodologies is that while cross-correlation represents a model-free approach to synchrony detection, gravitational clustering is more restricted. In particular, the time course of charge increment on a given particle following a spike limits the detection of synchrony to that which has a comparable jitter in synchronization. A similar limitation is present in the unitary event method (Grün, 1996; Pauluis & Baker, 2000). Such constraints prevent detection of synchronies that do not conform to the implied template. However, in theory, focusing on only those events deemed of interest a priori could improve the sensitivity of detection. In this article we present several modifications to the gravitational clustering approach, which are designed to improve its sensitivity in detecting neural synchronization. The desired improvement is demonstrated for simulated data having a magnitude and time course of synchrony in accord with that seen in experimental data.
In addition, an important dependence of the sensitivity of the gravitational approach on the interspike interval distribution of the analyzed units is described.

2 Simulation of Spike Trains with Different Interspike Interval Statistics and Fixed Synchrony Magnitude
In order to quantify our improvements, we require a means of producing simulated spike trains in which there is a prespecified level of synchronization above that expected by chance. This is relatively straightforward if the spikes are simulated as a Poisson process, which has an exponential distribution of interspike intervals (Aertsen & Gerstein, 1985; Strangman, 1997). However, real recordings of neurons in the central nervous system usually show a relative refractory period (Rodieck, Kiang, & Gerstein, 1962), and their interspike interval histograms can be better approximated by a gamma distribution (Kuffler, FitzHugh, & Barlow, 1957; Stein, 1965). The following is a method by which specified synchrony between spike trains with gamma-distributed intervals may be achieved.
Figure 1: Method of synchronous spike train simulation. (a) Poisson spike train S, which serves as the source of synchronous spikes. (b) These are copied with probability p to the destination train T, which also has independent spikes (dotted lines). (c) Decimation produces a spike train with interspike intervals following a gamma distribution. (d, e) The process can be repeated to produce another such spike train. The spike trains of c and e will show synchronization above the chance level (boxed spikes), since in some instances spikes may result from the same S spike. (f) Interspike interval histograms for simulated trains with Poisson and gamma (order 4) distributed intervals. Firing rate in each case was 25 Hz. (g) Cross-correlation histograms for two pairs of spike trains. One pair had Poisson interval statistics, the other gamma statistics (fourth order). The shape and magnitude of the cross-correlation peak can be fixed independently of the interval statistics.
We begin by simulating two event trains, denoted S and T, which are Poisson processes having rates $\lambda_S$ (see Figure 1a) and $\lambda_T$ (the dotted lines in Figure 1b). A randomly chosen fraction p of the events from the first process are copied to the second, producing a Poisson process that has a rate of $p\lambda_S + \lambda_T$. This composite process is then decimated by selecting every nth event (see Figure 1c). A gamma process of order n is defined as the waiting
time until the nth event of a Poisson process. The resultant simulated spike train $R_1$ of Figure 1c will therefore have an interspike interval distribution that follows a gamma distribution of order n, and a rate given by

\[
\lambda_R = \frac{p\lambda_S + \lambda_T}{n}. \tag{2.1}
\]
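The copy-and-decimate construction above can be sketched in a few lines of NumPy (a minimal illustration, not the authors' code; the function names and all parameter values are ours):

```python
import numpy as np

def poisson_train(rate, duration, rng):
    """Spike times of a homogeneous Poisson process (rate in Hz, duration in s)."""
    n_spikes = rng.poisson(rate * duration)
    return np.sort(rng.uniform(0.0, duration, n_spikes))

def copy_and_decimate(source, lam_t, p, n, duration, rng):
    """One simulated train R: copy each source (S) spike with probability p,
    merge with an independent Poisson train T of rate lam_t, then keep every
    n-th event of the composite train. The result has gamma(order n)
    interspike intervals at rate (p*lam_s + lam_t)/n (equation 2.1)."""
    copied = source[rng.random(source.size) < p]
    t_train = poisson_train(lam_t, duration, rng)
    composite = np.sort(np.concatenate([copied, t_train]))
    return composite[n - 1::n]  # n-fold decimation

rng = np.random.default_rng(1)
duration, lam_s, lam_t, p, n = 200.0, 150.0, 50.0, 0.5, 4  # illustrative values
source = poisson_train(lam_s, duration, rng)
r1 = copy_and_decimate(source, lam_t, p, n, duration, rng)  # R1 and R2 share
r2 = copy_and_decimate(source, lam_t, p, n, duration, rng)  # spikes from S
rate_r = (p * lam_s + lam_t) / n  # equation 2.1: 31.25 Hz for these values
print(rate_r, r1.size / duration, r2.size / duration)
```

The empirical rates of `r1` and `r2` should agree with equation 2.1 to within sampling noise.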
The process can then be repeated, using the same source spike train as previously (see Figure 1a), but with a different seed for the random number generator used to generate T and to determine which spikes will be copied from S (see Figure 1d). After decimation, this will result in a second spike train $R_2$, having the same rate and interspike interval statistics as $R_1$. Because $R_1$ and $R_2$ were generated from the same spike train S, they will show an excess of synchronous discharges above the level expected by chance for two independent spike trains firing at a rate $\lambda_R$, since on occasion the same spike will be copied into each spike train from S and also survive the n-fold decimation (boxed spikes in Figures 1c and 1e). This can be quantified as follows. Consider the central region of the cross-correlation histogram, with $R_2$ acting as the trigger and $R_1$ the target neuron. If a spike from $R_2$ occurs at a given time, we require the expected number of spikes from $R_1$ that will be observed in a time window $\Delta t$ wide around the time at which $R_2$ fired. This is given by:

\[
\begin{aligned}
E(R_1 \text{ within } \Delta t \mid R_2)
={}& P(R_2 \text{ originated from } S)\\
&\times \bigl[P(\text{same } S \text{ spike was copied to } R_1 \text{ and survived decimation})\\
&\quad + E(\text{other } S \text{ spikes occurred in } \Delta t, \text{ were copied to } R_1, \text{ and survived decimation})\\
&\quad + E(T_1 \text{ spikes occurred in } \Delta t \text{ and survived decimation})\bigr]\\
&+ P(R_2 \text{ originated from } T_2)\\
&\times \bigl[E(S \text{ spikes occurred in } \Delta t, \text{ were copied to } R_1, \text{ and survived decimation})\\
&\quad + E(T_1 \text{ spikes occurred in } \Delta t \text{ and survived decimation})\bigr]\\
={}& P(R_2 \text{ originated from } S)\, P(\text{same } S \text{ spike was copied to } R_1 \text{ and survived decimation})\\
&+ E(S \text{ spikes occurred in } \Delta t, \text{ were copied to } R_1, \text{ and survived decimation})\\
&+ E(T_1 \text{ spikes occurred in } \Delta t \text{ and survived decimation}).
\end{aligned}
\tag{2.2}
\]
Since

\[
P(R_2 \text{ originated from } S \mid R_2) + P(R_2 \text{ originated from } T_2 \mid R_2) = 1, \tag{2.3}
\]

the sum of the last two terms of equation 2.2 is simply the expected number of spikes from neuron $R_1$ over a time $\Delta t$. This is therefore the expected number of counts in the central part of the cross-correlation histogram that would be seen if $R_1$ and $R_2$ were independent. The first term is the excess counts in this part of the cross-correlation, due to the fact that $R_1$ and $R_2$ are synchronized more strongly than expected by chance. Both counts are normalized to the firing rate of $R_2$, the unit chosen to trigger the cross-correlation histogram. Denoting the strength of excess synchrony by s (Abeles, 1991):

\[
\begin{aligned}
s &= P(R_2 \text{ originated from } S)\, P(\text{same } S \text{ spike was copied to } R_1 \text{ and survived decimation})\\
&= \frac{\lambda_S p}{\lambda_S p + \lambda_T} \cdot \frac{p}{n}\\
&= \frac{\lambda_S p^2}{n(\lambda_S p + \lambda_T)}.
\end{aligned}
\tag{2.4}
\]
In general, we know the values of s and $\lambda_R$ that are desired for the simulated spike trains. The three unknowns (p, $\lambda_S$, and $\lambda_T$) may then be adjusted to produce these in accordance with equations 2.1 and 2.4. Note that the solution to this problem is not unique. Varying p, $\lambda_S$, and $\lambda_T$ over ranges that maintain the same s and $\lambda_R$ will change the correlation of the resultant spike trains R with S. This is of no consequence, since S will not be used in any subsequent analysis, and hence any valid combination of parameters can be used. Valid solutions to equations 2.1 and 2.4 will have $0 \le p \le 1$ and all firing rates positive. This leads to a constraint on the maximum level of synchrony that can be simulated for a particular gamma order n:

\[
s < 1/n. \tag{2.5}
\]
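Inverting equations 2.1 and 2.4 for the simulation parameters is straightforward: since $\lambda_S p + \lambda_T = n\lambda_R$, equation 2.4 reduces to $\lambda_S = s n^2 \lambda_R / p^2$ for any admissible choice of p, and $\lambda_T$ then follows from equation 2.1. A sketch (the helper name and the default choice p = 1 are ours; the values match the 25 Hz, s = 0.07, fourth-order example used below):

```python
def solve_rates(s, lam_r, n, p=1.0):
    """Choose lam_s and lam_t so that equations 2.1 and 2.4 give the desired
    excess synchrony s and output rate lam_r for gamma order n. The solution
    is not unique: any p in [s*n, 1] works; p = 1 is one convenient choice."""
    if not s * n < 1:
        raise ValueError("equation 2.5 requires s < 1/n")
    if not s * n <= p <= 1:
        raise ValueError("need s*n <= p <= 1 for nonnegative rates")
    lam_s = s * n**2 * lam_r / p**2  # from s = lam_s * p**2 / (n**2 * lam_r)
    lam_t = n * lam_r - p * lam_s    # from lam_r = (p*lam_s + lam_t) / n
    return lam_s, lam_t

lam_s, lam_t = solve_rates(s=0.07, lam_r=25.0, n=4)
print(lam_s, lam_t)  # approximately (28.0, 72.0)
# round-trip check against equations 2.1 and 2.4 with p = 1
assert abs((1.0 * lam_s + lam_t) / 4 - 25.0) < 1e-9
assert abs(lam_s * 1.0**2 / (4 * (lam_s * 1.0 + lam_t)) - 0.07) < 1e-9
```

The guard clauses encode equation 2.5 and the positivity constraints discussed above.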
In practice, when simulating realistic levels of physiological synchrony and interval distributions (typically s < 0.1, n < 10), this constraint is not a serious limitation. Excess synchrony between real neurons does not occur at exactly zero lag, but exhibits a jitter (typically some tens of milliseconds; Smith & Fetz, 1989). Such jitter can easily be incorporated into the simulation method described, by jittering the spikes copied from S to R by a time drawn from a given random distribution. This will not change the level of excess synchrony s produced, so long as it is quantified over a sufficiently long section of the cross-correlation histogram. The shape of the peak in
the cross-correlation histogram will be given by the autoconvolution of the jitter distribution function. Figure 1f shows the interspike interval histograms for two such simulated spike trains, having n = 1 (Poisson process) and n = 4. The rate $\lambda_R$ was 25 Hz for both simulations, and s = 0.07. The higher-order gamma process produced an interspike interval histogram with a realistic relative refractory period and therefore more closely approximated real neuron spiking. Figure 1g shows the correlation between pairs of simulated neurons for these two cases. The copying process included a jitter that was gaussian distributed with a standard deviation of 5 ms. The overlaid cross-correlation histograms are identical to within estimation noise. It was verified that the shape of the peak was equal to the autoconvolution of the gaussian jitter function. The peak area above baseline divided by the number of trigger spikes was very close to 0.07. The above method thereby allows the simulation of any number of spike trains with preset synchronization and interspike interval characteristics, using methods that are highly computationally efficient to implement. While the description above has assumed that the parameters of the spike trains (firing rate and synchrony strength) are stationary throughout the simulation, the method is not restricted to this case. Using equations 2.1 and 2.4, the firing rates $\lambda_S$ and $\lambda_T$ may be computed as a function of time once the desired rate $\lambda_R(t)$ and synchrony strength s(t) are known. Spike trains S and T can then be simulated as inhomogeneous Poisson processes, and mixed and decimated as described above to produce nonstationary simulated spike trains R.

3 Improvements to the Sensitivity of Gravitational Clustering

3.1 Formalism of Gravitational Technique. N neurons are assumed to be simultaneously recorded.
Following previous work on gravitational clustering (Gerstein et al., 1985; Aertsen & Gerstein, 1985), we represent these cells as N particles in N-dimensional space. The initial coordinate along the jth axis of the ith particle is given by:

\[
x_i^j = \begin{cases} 1 & i = j\\ 0 & i \ne j. \end{cases} \tag{3.1}
\]
The charge on particle i at time t is given by:

\[
q_i(t) = \sum_k K(t - T_k) - \lambda_i, \tag{3.2}
\]
where $T_k$ are the times of the spikes fired by this cell and $K(t)$ is a kernel function that is convolved with the spike train. The kernel function is assumed to have unit area. The firing rate of the ith neuron is given by $\lambda_i$; this
is subtracted to ensure that the mean charge on a given particle is zero. In this article, only data with stationary firing rates are considered. However, it is possible to extend the method to nonstationary rates, in which case a time-varying rate $\lambda_i(t)$ is subtracted in equation 3.2. The rate is normally estimated over a short window of data (Gerstein & Aertsen, 1985). With this modification, the enhancements to the gravity method proposed here should also be usable when neurons change their firing rates. In the usual gravitational clustering analysis, particle positions are adjusted according to the pairwise attractions between particles. Thus, using a discrete time step of $\delta t$, the position update rule is:

\[
x_i(t + \delta t) = x_i(t) + \kappa\, \delta t \sum_{j \ne i} Q_{i,j}\, \frac{x_j - x_i}{|x_j - x_i|}, \tag{3.3}
\]

where

\[
Q_{i,j} = q_i q_j. \tag{3.4}
\]
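The position update rule can be transcribed directly (a sketch, not the authors' implementation; the charge values and constants below are illustrative):

```python
import numpy as np

def gravity_step(x, q, kappa, dt):
    """One update of equation 3.3: particle i moves along the unit vector
    toward each other particle j, scaled by the charge product
    Q_ij = q_i * q_j (equation 3.4) and the step size kappa * dt."""
    x_new = x.copy()
    n = x.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = x[j] - x[i]
            dist = np.linalg.norm(d)
            if dist > 0.0:
                x_new[i] += kappa * dt * q[i] * q[j] * d / dist
    return x_new

N = 3
x = np.eye(N)                    # equation 3.1: particle i at the i-th unit vector
q = np.array([0.5, 0.5, -0.2])   # illustrative instantaneous charges
x1 = gravity_step(x, q, kappa=0.1, dt=0.001)
d01_before = np.linalg.norm(x[0] - x[1])
d01_after = np.linalg.norm(x1[0] - x1[1])
print(d01_before > d01_after)    # True: positive Q_01 pulls particles 0 and 1 together
```

Particles with a positive charge product (here 0 and 1) approach one another, while the negative product with particle 2 pushes it away.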
The constant κ controls the rate of aggregation of synchronized neurons. The time step δt was fixed at 1 ms throughout this article. The denominator term within the summation prevents a dependence on the current separation of particles; an alternative version, where this term is omitted, has been previously described (Dayhoff, 1994).

3.2 Effect of Moving-Window Smoothing on Charge Product. As a first potential improvement to the method of gravitational clustering, we propose increasing the sensitivity of the method to repeated instances of synchrony between the same two cells within a particular observation window. The inspiration for this approach comes from the intuitive consideration that whereas two neurons may fire once in synchrony simply by chance, two such occurrences within a short space of time are considerably less likely unless both are firing in bursts. Such an extension may be straightforwardly generated by calculating $Q_{i,j}$ as a moving average of the charge product over a window extending back in time for a duration w:
\[
Q_{i,j}(t) = \frac{1}{w} \int_{t-w}^{t} q_i(u)\, q_j(u)\, du. \tag{3.5}
\]
Figure 2a shows the effect such a moving-window smoothing approach has on the distribution of $Q_{i,j}$ computed between two independently firing neurons. The two spike trains were simulated to have a firing rate of 25 Hz and interspike intervals that followed a gamma distribution of order 4. The kernel function $K(t)$ was an exponential:

\[
K(t) = \begin{cases} \exp(-t/\tau) & t > 0\\ 0 & t \le 0. \end{cases} \tag{3.6}
\]
Calculation of $Q_{i,j}$ at single points, as in equation 3.4, led to a distribution that was sharply peaked. By contrast, using a window of length w = 1 s, the distribution was approximately gaussian. The distribution histograms of Figure 2a have been plotted with the x-axis scaled by the standard deviation of the distributions. In addition to changing the form of the charge product distribution function, the 1 s window reduced the standard deviation by a factor of 12.7. The conventional gravity analysis method achieves smoothing, and the consequent noise reduction, implicitly through the development of the particle trajectory. However, as will be shown below, it can be advantageous to perform this explicitly according to equation 3.5.
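The explicit smoothing of equation 3.5 can be sketched as follows. We discretize the charge of equation 3.2 with the exponential kernel of equation 3.6, dividing by τ so that the kernel has unit area as section 3.1 requires; all parameter values and function names are illustrative:

```python
import numpy as np

def charge(spike_times, rate, tau, dt, duration):
    """Discretized charge of equation 3.2: the spike train convolved with a
    unit-area exponential kernel (equation 3.6 divided by tau), minus the
    mean firing rate so that the mean charge is zero."""
    nbins = int(round(duration / dt))
    idx = np.minimum((spike_times / dt).astype(int), nbins - 1)
    density = np.bincount(idx, minlength=nbins) / dt
    u = np.arange(0.0, 5.0 * tau, dt)
    kernel = np.exp(-u / tau) / tau
    return np.convolve(density, kernel)[:nbins] * dt - rate

def windowed_product(qi, qj, w, dt):
    """Equation 3.5: trailing moving average of the charge product over a
    window of duration w, computed with a cumulative sum."""
    prod = qi * qj
    m = int(round(w / dt))
    c = np.cumsum(np.insert(prod, 0, 0.0))
    return (c[m:] - c[:-m]) / m

rng = np.random.default_rng(2)
duration, rate, tau, dt = 60.0, 25.0, 0.010, 0.001
s1 = np.sort(rng.uniform(0, duration, rng.poisson(rate * duration)))
s2 = np.sort(rng.uniform(0, duration, rng.poisson(rate * duration)))
q1, q2 = (charge(s, rate, tau, dt, duration) for s in (s1, s2))
raw = q1 * q2
smooth = windowed_product(q1, q2, w=1.0, dt=dt)
print(raw.std() / smooth.std())  # smoothing shrinks the spread substantially
```

For two independent trains, the windowed product has a much smaller standard deviation than the pointwise product, mirroring the narrowing shown in Figure 2a.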
3.3 Detection of Synchrony Using Charge Products. Figure 2b shows the effect of different firing rates on the variability of the charge products. The lines show a linear regression relating the standard deviation of the $Q_{i,j}$ calculated with a 1 s window to the cell firing rate; each line is calculated for runs using a different value of the time constant τ in the kernel function K(t). The variability of the charge product increases with the firing rate of the cells, as more transient increases in the individual charges occur each second. The slope of this relationship is much steeper when small values of τ are used. Small τ leads to a single charge that is flat except just after a spike, when a high-amplitude sharp transient occurs (the kernel amplitude will be large, since it is brief but constrained to have unit area). Large τ results in a much more even, and hence less variable, charge profile. These changes in single-charge variability are reflected in the variance of the charge product $Q_{i,j}$, as shown in Figure 2b. The variability in the charge product $Q_{i,j}$ shown in Figure 2a when two neurons fire independently means that detection of a mean increase in $Q_{i,j}$ due to slight excess synchrony between the neurons is not trivial. The extent to which the synchronous and independent cases are separated can be measured by the statistic d′ (Green & Swets, 1966). This is illustrated by Figure 2c and calculated according to
\[
d' = \frac{\bar{Q}^{\mathrm{SYNC}}_{i,j} - \bar{Q}^{\mathrm{INDEP}}_{i,j}}{\operatorname{stdev}\bigl(Q^{\mathrm{INDEP}}_{i,j}\bigr)}. \tag{3.7}
\]
The mean charge for an individual neuron is xed to zero by equation 3.2; hence for independent neurons, the charge product will have zero mean, and the term QINDEP can be ignored. Figure 2d plots d0 versus the amount i, j Figure 2: Facing page. (a) Distribution of the charge product Qi, j using no smoothing, or a window of 1 s. (b) Variation of the standard deviation of the charge product Qi, j for independently ring neurons with different ring rates. The lines indicate regression ts to the results using kernel time constants t of 2, 5, 10, 15, and 20 ms (from highest to lowest line). All results are based on simulated spike trains with Poisson interval distributions. (c) Denition of d0 . (d) Variation of discriminability d0 with strength of synchrony s, for spikes having a Poisson interval distribution. Different symbols mark different ring rates. The line is a linear regression t calculated across all data points. Regardless of ring rate, the measurements are approximately tted by this single line. (e) The same, but using spikes with a fourth-order gamma distribution of interspike intervals. Firing rate now has an important effect on discriminability. Firing rates of 5 Hz, plus 10–100 Hz in steps of 10 Hz were used in d and e. (f) Variation of d0 with s for different orders of the gamma distribution n. Regression lines are for n D 1, 2, 3, 4, and 8 (lines for n D 4 and n D 8 are indistinguishable).
2606
Stuart N. Baker and George L. Gerstein
of synchrony between the neurons, quantified by the measure s defined above. This is shown for simulated spike trains with Poisson distributed interspike intervals. Different superimposed symbols show the results for spike trains with firing rates varying from 5 to 100 Hz. The points appear to lie approximately on the same line, indicating that the detectability of synchrony depends only weakly on the unit firing rates in this case. This is in agreement with previous observations (Strangman, 1997). The rate of occurrence of excess coincidences will be directly proportional to the neuron firing rate (where the constant of proportionality is s as defined above). Hence, the numerator of equation 3.7 must show firing-rate dependence. The finding that the d′ measure is essentially independent of firing rate indicates that the changes in the denominator (the standard deviation; see Figure 2b) balance those in the numerator when working with spikes having Poisson distributed interspike intervals. Figure 2e shows a similar plot of d′ versus s, but now calculated using spike trains simulated to have interspike intervals distributed according to a fourth-order gamma process. The difference from Figure 2d is readily apparent. Firing rate now has a profound effect on the slope of the relationship between d′ and s, as shown by the fitted regression lines for each tested frequency. The changes in the numerator of equation 3.7 are independent of the interval statistics of the spike trains (not shown). However, gamma-distributed intervals have lower variance than a Poisson train of the same firing rate (Shadlen & Newsome, 1998); the standard deviation of Q_{i,j} in the denominator of equation 3.7 thus rises less steeply with firing rate than the increase in the mean of Q_{i,j} in the presence of excess synchrony. If the neurons have a refractory period with a gamma-like interval distribution, synchrony of a given strength is more easily detected when the cells fire at higher rates.
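As an illustration of equation 3.7, the following Python sketch (our own, not the authors' code; helper names such as `charge_product` are hypothetical, and the copy-with-probability-s synchrony scheme is an assumption) builds zero-mean exponential charges from simulated Poisson trains, smooths the charge product with a 1 s window, and computes d′ between a synchronized and an independent pair:

```python
# Minimal sketch of the d' computation of equation 3.7 (assumed details
# throughout; not the published implementation).
import numpy as np

rng = np.random.default_rng(0)
DT = 0.001          # time step, s
TAU = 0.010         # kernel time constant, s (10 ms)
T = 50.0            # duration, s

def poisson_train(rate, n=int(T / DT)):
    """0/1 spike train with Poisson statistics."""
    return (rng.random(n) < rate * DT).astype(float)

def charge(spikes):
    """Exponential kernel of unit area; charge fixed to zero mean (cf. eq. 3.2)."""
    t = np.arange(0, 5 * TAU, DT)
    kernel = np.exp(-t / TAU) / TAU            # unit area
    q = np.convolve(spikes, kernel)[: len(spikes)] * DT
    return q - q.mean()

def charge_product(si, sj, window=1.0):
    """Moving-window smoothed product Q_ij (cf. eq. 3.5)."""
    w = int(window / DT)
    return np.convolve(charge(si) * charge(sj), np.ones(w) / w, mode="same")

rate, s = 25.0, 0.1
# Independent pair:
q_indep = charge_product(poisson_train(rate), poisson_train(rate))
# Synchronized pair: neuron j copies a fraction s of i's spikes.
si = poisson_train(rate)
sj = poisson_train(rate * (1 - s))
sj = np.clip(sj + si * (rng.random(si.size) < s), 0, 1)
q_sync = charge_product(si, sj)

d_prime = (q_sync.mean() - q_indep.mean()) / q_indep.std()
print(round(d_prime, 2))
```

With s = 0.1 the synchronized pair should yield a positive d′, while the independent product has near-zero mean, as equation 3.2 requires.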
Figure 2f shows in more detail the effect of changing n, the order of the gamma process underlying the simulated spike trains, on the slope of d′ versus s. The simulations displayed in this example had a constant firing rate of 25 Hz. There is a sharp increase in the slope as the order is increased above n = 1; thereafter, the slope rises only slowly. For an order of n = 4, the fitted regression line is indistinguishable from n = 8. It can be seen that detection of synchrony is improved by more than twofold if neurons show a relative refractory period in their interspike interval histograms, compared with the detectability obtained from Poisson spike trains with the same-size cross-correlation peak. A similar result was obtained using spike trains constructed as Poisson processes with a dead time to model the refractory period (data not shown). The effect of interspike interval distribution on the detection of correlation has also been noted by Davey, Ellaway, and Stein (1986) in a different context. 3.4 Use of a Nonlinearity to Improve Detection. At high noise levels, when the measure d′ is small, detection of synchrony (i.e., a nonzero mean
Gravitational Clustering
2607
Q_{i,j}) can occur only by averaging over sufficient data to overcome the noise. However, at higher levels of d′, it is possible to enhance detection by making the forces between particles depend nonlinearly on Q_{i,j}, yielding the following update rule:
\[
x_i(t + \delta t) = x_i(t) + \kappa\,\delta t \sum_{\forall j \ne i} f(Q_{i,j})\, \frac{x_j - x_i}{|x_j - x_i|},
\tag{3.8}
\]
where f(u) is a nonlinear function whose slope increases in magnitude with increasing distance from u = 0. This will have the effect of suppressing small, noisy fluctuations and enhancing the large excursions in Q_{i,j}, which are unlikely to have been caused by chance. The simplest suitable form of nonlinearity is:
\[
f(u) = A\,|u|^m \operatorname{sign}(u),
\tag{3.9}
\]
where the power m can be varied to change the functional form; use of this form permits even and nonintegral m while ensuring that the sign of f(u) always equals the sign of u. The constant A permits scaling of the function. For practical application of the algorithm, we have found it convenient first to scale Q_{i,j} by its standard deviation:
\[
Q'_{i,j}(t) = \frac{Q_{i,j}(t)}{\operatorname{stdev}(Q_{i,j})}.
\tag{3.10}
\]
Such a scaling makes the variability of all charge products the same, regardless of cell firing rate or interspike interval statistics. By normalizing for this first, the choice of the constant κ in equation 3.8 can be standardized, and the same value used across different data sets. This avoids the need to tune performance of the algorithm by testing multiple values of κ. The standard deviation of the charge product can be calculated over the entire length of the data if the firing rate of the neurons is stationary, or by using a moving window if not (compare with the calculation of mean charge using a moving window in nonstationary data; Gerstein & Aertsen, 1985). Normalization by standard deviation as a means to correct for nonstationarity has also been used by Aertsen et al. (1987; Aertsen, Gerstein, Habib, & Palm, 1989). Assuming that Q'_{i,j} is normally distributed (a reasonable approximation if a moving-window smoothing process is used, due to the central limit theorem; see Figure 2a), the variance of f(Q'_{i,j}) will then be
\[
\sigma^2 = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} A^2 u^{2m} e^{-u^2/2}\, du.
\tag{3.11}
\]
The parameter A can be adjusted to make σ² = 1 for the chosen value of m. This permits experimentation with different m without causing the severe
changes to the excursions of the random walks undergone by particles that relate to independent spikes. A measure of the improvement of discrimination achieved by using f(Q'_{i,j}) rather than Q'_{i,j} is given by
\[
\frac{\overline{f(Q'_{i,j})}}{\overline{Q'_{i,j}}} = \frac{\int_{-\infty}^{\infty} f(u)\, e^{-(u-d')^2/2}\, du}{\int_{-\infty}^{\infty} u\, e^{-(u-d')^2/2}\, du}.
\tag{3.12}
\]
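The behavior of this measure can be checked numerically. The sketch below (ours, not the authors' code; it assumes simple quadrature on a truncated grid is adequate) first chooses A so that σ² = 1 in equation 3.11, then evaluates the ratio of equation 3.12 for a cubic nonlinearity:

```python
# Numerical check of equations 3.11 and 3.12 (our sketch; grid quadrature
# replaces the closed-form integrals).
import numpy as np

U = np.linspace(-10.0, 10.0, 20001)   # quadrature grid; Gaussian tails negligible
DU = U[1] - U[0]

def f(u, m, A):
    # Nonlinearity of equation 3.9.
    return A * np.abs(u) ** m * np.sign(u)

def normalize_A(m):
    # Choose A so that sigma^2 = 1 in equation 3.11.
    var = np.sum(U ** (2 * m) * np.exp(-U**2 / 2)) * DU / np.sqrt(2 * np.pi)
    return 1.0 / np.sqrt(var)

def ratio(d_prime, m):
    # Discriminability ratio of equation 3.12.
    w = np.exp(-(U - d_prime) ** 2 / 2)
    A = normalize_A(m)
    return np.sum(f(U, m, A) * w) * DU / (np.sum(U * w) * DU)

print(round(ratio(0.5, 3), 2))   # below one: cubic hurts at low d'
print(round(ratio(2.0, 3), 2))   # above one: cubic helps at high d'
```

For m = 3 this reproduces the crossover described next: the ratio sits below one at d′ = 0.5 and above one at d′ = 2.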
Figure 3a shows the behavior of this measure for different values of d′, using various powers m of nonlinearity. It can be seen that for d′ as low as 0.5, the ratio of equation 3.12 is below one for all powers higher than unity. This indicates that for such a low signal-to-noise ratio, no improvement in discriminability can be obtained above and beyond the linear case. Use of a nonlinearity simply enhances noise fluctuations and degrades discrimination. However, for larger d′, the curves rise above one, indicating that a nonlinearity can enhance performance. The peak in these curves, indicating the optimal power to use, occurs at greater powers the larger d′. Figure 3b shows this best power as a function of d′ (solid line). It can be seen that even for quite modest values of d′ (corresponding to a mean Q_{i,j} just over one standard deviation away from zero), use of a nonlinearity is expected to improve the performance of the gravitational clustering algorithm. The dotted line in Figure 3b indicates the point at which curves similar to those of Figure 3a recross an ordinate value of one. This indicates the highest power that can be safely used to yield performance that is no worse than the usual gravity algorithm with f(u) = u. From this perspective, use of a low-power function (e.g., cubic, m = 3) will enhance, or at least not degrade, the performance of gravitational clustering for all but the very weakest correlations. Moving-window smoothing of the charge product q_i q_j as described by equation 3.5 is an essential prerequisite for application of a nonlinearity as described above. Without it, the standard deviation of the charge product will be so great that d′ will never exceed the region of Figure 3b where no better performance than a linear f(u) can be achieved. Conversely, window smoothing cannot achieve any performance gains without a nonlinear f. This can readily be understood upon reexamination of equations 3.3 through 3.5.
With a linear f , the position of a particle at a given time after starting the analysis will depend on the integral of the products between its charge and its neighbors up to that time. This will be the same whether the integration is performed entirely implicitly by the movement of the particles themselves (equations 3.3 and 3.4), or effectively in two stages by equation 3.5 followed by particle movements. 3.5 Performance Improvement by Triple Pair Enhancement. An important feature of the gravity technique is its ability to detect multiple cells
Figure 3: (a) Effect of the power used in the nonlinearity of equations 3.8 and 3.9 on discriminability, relative to the discriminability with a power of one (linearity). Lines are shown for d′ = 0.5, 1, 1.5, 1.7, 2. (b) Optimal power of nonlinearity to use for the best discrimination as a function of d′ (solid line) and the highest power that will be no worse than linearity (dotted line).
that are members of the same synchronous assembly; such assembly membership is indicated by the clusters into which particles coalesce. This feature of the method suggests a further possible enhancement. Suppose three neurons, i, j, and k, are members of the same assembly. The pairwise synchronous firings (i, j), (i, k), and (j, k) will each occur at above-chance rates, and this will lead the particles to move toward each other in the ordinary gravity computation. However, since all three neurons are part of the same
assembly, we might expect an excess temporal clustering of the several pair coincidences within a moving window of length w. Such counts of multiple pair coincidences are highly unlikely to occur by chance. The occurrence of triple pairs will be a much more robust indicator of assembly membership than any one of the individual pairwise synchronies alone. The above observation inspires the following modification to the particle movement rule of equation 3.8:
\[
x_i(t + \delta t) = x_i(t) + \kappa\,\delta t \sum_{\forall j \ne i} f(Q'_{i,j})\; g\!\left( \sum_{\forall k \ne i,j} Q'_{i,k}\, Q'_{j,k} \right) \frac{x_j - x_i}{|x_j - x_i|}.
\tag{3.13}
\]
The function g(u) is intended to scale the force experienced by the particle. The following are desirable features of this function. When the argument of g is zero, no augmentation or diminution of the force should occur; hence g(0) = 1. We have found that a sigmoid function is not suitable, since the maximal slope around u = 0 enhances small chance fluctuations in u. The slope of the function g should instead be zero at u = 0 but increase in magnitude to either side. A suitable function among the many that are possible, and which we have found effective, is:
\[
g(u) = \begin{cases} 0, & u < -1 \\ 1 + u^3, & -1 \le u < \sqrt[3]{3} \\ 4, & u \ge \sqrt[3]{3}. \end{cases}
\tag{3.14}
\]
This function is thresholded at g = 0; it can thus never become negative, so that force directions cannot be reversed. It is likewise thresholded at g = 4; this prevents extreme outliers of the summed charge products from excessively disrupting the gravity condensation. If no other neuron fires synchronously with cells i and j during the window period w, the summation over k in equation 3.13 will be close to zero; g will be nearly one, and the algorithm will be unaffected. However, if another cell k does participate in the assembly in the period w, the summation will on average be above zero, since both Q'_{i,k} and Q'_{j,k} will have nonzero, positive mean. The force causing aggregation between i and j will therefore be enhanced. By symmetry, a similar effect will occur for the force between particle pairs (i, k) and (j, k), and all three particles will therefore condense together more rapidly than had the term in g(u) not been included. The practical success of this modification is demonstrated in the next section. It would be possible to extend this approach to groups larger than three neurons; however, this would increase the number of combinations that need to be examined and is expected to be helpful only if large assemblies are present. Use of groups of three is expected to cause assemblies with larger numbers of members to aggregate considerably faster, since such populations have several subgroups of size three. We therefore have not investigated further than the triple pair approach outlined.
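The modified movement rule can be sketched compactly. The code below is our reconstruction of equations 3.13 and 3.14 under the assumptions stated (g clipped to [0, 4] with a cubic core); the function `step` and the toy Q matrix are hypothetical, not from the article:

```python
# Sketch of the triple-pair-enhanced particle update of eqs. 3.13-3.14
# (our reconstruction; parameter values are illustrative only).
import numpy as np

def g(u):
    """Scaling function of eq. 3.14: g(0) = 1, zero slope at 0, clipped to [0, 4]."""
    return np.clip(1.0 + u**3, 0.0, 4.0)

def f(u, m=3, A=1.0):
    """Nonlinearity of eq. 3.9."""
    return A * np.abs(u) ** m * np.sign(u)

def step(x, Q, kappa=0.01, dt=1.0):
    """One update of particle positions x (n, d) by eq. 3.13; Q is the
    symmetric (n, n) array of scaled charge products at one time step."""
    n = len(x)
    new_x = x.copy()
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            # Summed charge products over all third particles k != i, j.
            k_mask = np.ones(n, bool)
            k_mask[[i, j]] = False
            triple = np.sum(Q[i, k_mask] * Q[j, k_mask])
            direction = (x[j] - x[i]) / np.linalg.norm(x[j] - x[i])
            new_x[i] += kappa * dt * f(Q[i, j]) * g(triple) * direction
    return new_x

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))        # 4 particles in 3 dimensions
Q = np.zeros((4, 4))
Q[0, 1] = Q[1, 0] = Q[0, 2] = Q[2, 0] = Q[1, 2] = Q[2, 1] = 2.0  # assembly (0,1,2)
x2 = step(x, Q)
# Assembly pair distance before and after one step; particle 3 is unmoved.
print(np.linalg.norm(x[0] - x[1]), np.linalg.norm(x2[0] - x2[1]))
```

Note that a particle with all-zero charge products (here, particle 3) experiences no force at all, since f(0) = 0.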
4 Quantification of Improvements to Gravitational Clustering
Figure 4 illustrates the results of gravitational analysis using the modifications described above. This was performed on simulated data. Ten neurons were used, all with a constant firing rate of 25 Hz. Three of these cells were synchronized together, with a synchrony strength s = 0.1; the copying process used to produce the synchronized spikes incorporated a temporal jitter that was normally distributed, with 5 ms standard deviation (as in Figure 1g). Figures 4a through 4c present the results when the simulated
spike trains were Poisson processes (order n = 1); Figures 4d through 4f are the results for a different data set, simulated so that the interspike intervals were gamma distributed, order n = 4. The figures are plots of the pairwise distances between particles as a function of time, for the 50 s duration of the simulation. A smoothing window w = 1 s was used, and the kernel time constant was τ = 10 ms. Figures 4a and 4d show the results for a conventional gravity analysis, as defined by equations 3.3 and 3.4. Three of the distance traces, corresponding to the distances between the particles representing the three cells synchronized together, condensed and became clearly separated from the other lines. (The crossed line to the right of each graph indicates the mean and standard deviation of distance trajectories for pairs where neither neuron is part of the synchronized assembly.) The separation occurs earlier, and is somewhat more pronounced at the end of the run, for the data with gamma order 4 interspike intervals (see Figure 4d) compared with the Poisson case (see Figure 4a); this agrees with the results of Figure 2, which suggested that synchrony is more easily detected in a gamma spike train. Figures 4b and 4e show the performance of the algorithm modified to use a cubic nonlinearity (m = 3 in equation 3.9). Particle condensation is considerably impaired for the Poisson spike trains, such that only two traces appear clearly separated from the noise excursions by the end of the analysis. Separation is adequate for the gamma-distributed spike train. Figures 4c and 4f show the performance when using the further modification described by equations 3.13 and 3.14. Once again, separation of the three distance traces relating to synchronized pairs is poorly achieved for the Poisson simulations (see Figure 4c), but good separation occurs for the spikes with gamma intervals (see Figure 4f).
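Simulated data of this kind can be generated along the following lines. This sketch is our reading of the procedure described above (copying a fraction s of a master train's spikes with gaussian jitter); the function names are our own, and gamma interspike intervals are drawn directly from numpy's gamma distribution:

```python
# Sketch of the spike-train simulation described in the text (assumed
# details; not the published code).
import numpy as np

rng = np.random.default_rng(2)

def gamma_train(rate, order, duration):
    """Spike times whose interspike intervals follow a gamma distribution
    of the given order, with mean interval 1/rate."""
    n_max = int(2 * rate * duration) + 100       # generous upper bound on count
    isis = rng.gamma(order, 1.0 / (order * rate), n_max)
    t = np.cumsum(isis)
    return t[t < duration]

def synchronized_pair(rate, order, duration, s=0.1, jitter=0.005):
    """Master train plus a partner that receives a jittered copy of a
    fraction s of the master's spikes (total partner rate stays ~rate)."""
    master = gamma_train(rate, order, duration)
    partner = gamma_train(rate * (1.0 - s), order, duration)
    copied = master[rng.random(master.size) < s]
    copied = copied + rng.normal(0.0, jitter, copied.size)
    return master, np.sort(np.concatenate([partner, copied]))

master, partner = synchronized_pair(25.0, 4, 50.0)
print(round(master.size / 50.0, 1), round(partner.size / 50.0, 1))  # rates, Hz
```

Both trains come out near the nominal 25 Hz, since the copied spikes replace the fraction of rate removed from the partner's independent process.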
Figure 4: Facing page. Results of gravitational analysis using different algorithms. (a–c) Pairwise distances between particles as a function of time, using Poisson distributed spike trains. (d–f) Using spike trains having a gamma distribution of interspike intervals (fourth order). (a,d) Conventional linear gravity method; (b,e) with cubic nonlinearity; (c,f) cubic nonlinearity and triple pair enhancement. The cross on the right of each plot shows the mean plus or minus standard deviation of the interpair distance at t = 50 s for all pairs where neither neuron was part of the synchronized assembly. The ordinate scales have been chosen such that the height of this marker is the same in a–c and d–f, permitting visual comparison of the magnitude of the particle aggregation relative to the aggregation noise. (g,h) Relative performance of the linear, cubic, and cubic plus triple enhancement algorithms, shown by solid, dotted, and dashed lines, respectively. The lines show the mean of 20 runs using different simulated data in each case. The bars to the right indicate the standard error of the mean computed over the 20 runs at time 50 s. Spike trains were Poisson in g and had fourth-order gamma interval statistics in h.
While some differences between the performance of the different algorithms can be assessed by eye, it is desirable to develop a quantitative measurement of the degree to which particles representing synchronized neurons aggregate. We have chosen to use the following measure for this purpose:
\[
Z = \frac{\overline{d}_{i,j \in I} - \overline{d}_{i,j \in S}}{\operatorname{stdev}\left(d_{i,j \in I}\right)},
\tag{4.1}
\]
where d_{i,j} is the distance between the ith and jth particle, S represents the set of synchronized neurons, and I is the set of independently firing neurons in the same simulation. This measure Z has analogies with d′ as defined in equation 3.7; it is the difference between the mean separation of the particles representing synchronous and independently firing neurons, expressed in terms of the variability seen in the independently firing interparticle distances. A similar measure would be appropriate for real, nonsimulated data where neuron synchrony properties are not known a priori. Here we would identify the synchronized neurons in terms of condensing trajectories. Figures 4g and 4h show how Z varies during the gravity analysis for the different situations illustrated in the earlier part of the figure. The lines represent the mean of Z calculated over 20 data sets simulated with different random number seeds. The variability is indicated by the bars to the right of each graph, which represent one standard error either side of the mean, assessed from the 20 separate analyses at time 50 s. The solid lines show the behavior of Z using conventional gravity analysis (see Figures 4a and 4d), the dotted line using a cubic nonlinearity (see Figures 4b and 4e), and the dashed line using the triple pair enhancement (see Figures 4c and 4f). Figure 4g presents the results for analysis of Poisson spike trains. Conventional gravity analysis without modification here clearly gives the best result, with a discriminability some 50% higher than either of the two more complex algorithms. By contrast, Figure 4h shows the opposite effect when gamma-distributed spike trains are analyzed. All three lines are consistently above their corresponding plots in Figure 4g, confirming the increased detectability of synchrony for a spike train with gamma compared with Poisson interval statistics. A modest but worthwhile improvement over the conventional algorithm is achieved by using the cubic nonlinearity.
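The Z measure of equation 4.1 is straightforward to compute from a matrix of pairwise particle distances. The sketch below is our own code (function and variable names are hypothetical), applied to a toy distance matrix in which particles 0–2 have condensed:

```python
# Sketch of the aggregation measure Z of equation 4.1 (our code, toy data).
import numpy as np

def z_measure(dist, sync):
    """Z = (mean indep distance - mean sync distance) / stdev(indep distances)."""
    n = dist.shape[0]
    sync = set(sync)
    d_sync, d_indep = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if i in sync and j in sync:
                d_sync.append(dist[i, j])
            elif i not in sync and j not in sync:
                d_indep.append(dist[i, j])
            # mixed pairs belong to neither set, as in eq. 4.1
    d_indep = np.asarray(d_indep)
    return (d_indep.mean() - np.mean(d_sync)) / d_indep.std()

# Toy example: particles 0-2 have condensed to small mutual distances.
rng = np.random.default_rng(3)
dist = rng.uniform(8.0, 12.0, size=(10, 10))
dist = np.triu(dist, 1) + np.triu(dist, 1).T      # symmetric, zero diagonal
for i in (0, 1, 2):
    for j in (0, 1, 2):
        if i != j:
            dist[i, j] = 1.0
print(round(z_measure(dist, [0, 1, 2]), 1))        # large positive: clear aggregation
```

Pairs mixing one synchronized and one independent neuron contribute to neither term, matching the definition above.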
A considerable improvement (almost a twofold rise in Z) can be achieved by combining the nonlinear and triple pair enhancement approaches. Since experimentally recorded spike trains will usually have interspike intervals closer to a gamma than a Poisson distribution, a similar improvement should be seen in experimental data. Figure 5 presents data showing the robustness of this result. Figure 5a plots the final value of Z after a 50 s analysis, similar to that in Figure 4, for the three algorithms, as a function of the order of the gamma distribution of interspike intervals of the simulated spike trains. The same line style
Figure 5: (a) Efficiency of gravitational algorithms (measured by Z at the end of a 50 s analysis run) depending on the order n of the gamma distribution that their interspike intervals follow. Results with conventional linear gravity, a cubic nonlinearity, and the nonlinearity combined with triple pair enhancement are shown by solid, dotted, and dashed lines, respectively. Each point marks the mean from the analysis of 20 runs; error bars mark the standard error calculated over the same sample. (b) Variation of algorithm efficiency with the number of neurons in the synchronized assembly. Display conventions as in a.
conventions as in Figures 4g and 4h have been used to denote the three algorithm variants under investigation. The usual gravitational method is superior only for Poisson spike trains (n = 1). Performance of the three techniques is the same for n = 2; at all higher orders, there is a clear advantage in using the modifications detailed in this article.
The triple pair enhancement algorithm was specifically designed to enhance assemblies with three members; it is important to be certain that its performance in other situations is adequate. Figure 5b accordingly indicates the effect of the size of the assembly on the relative performance of the three methods. Use of a cubic nonlinearity provided a fairly consistent increment over the usual, linear algorithm at all assembly sizes tested. Values of Z using triple pair enhancement overlap those produced without it when only two cells are synchronized, indicating that this method introduces no appreciable extra noise in the small assembly case. However, the benefits of triple pair enhancement grow rapidly with larger assemblies, presumably due to the combinatorial rise in the number of triple pairs that can be formed. 5 Gravitational Clustering Averaged Relative to an External Event
The techniques described allow detection of synchronous cell assemblies when analysis can be performed over a relatively long period of time. The enhancements described rely on moving-window smoothing. The width of this window must be sufficient to produce a reasonable probability of more than one excess coincidence between a given pair of cells within it. At the firing rate (25 Hz) and strength of synchrony (s = 0.1) considered above, there is a probability of 0.71 of two excess synchronous events occurring within a 1 s window (computed from the Poisson counting distribution). Reducing the window duration would decrease the probability of more than one excess synchronous event falling within the window. The improvements in performance described above would thus be considerably degraded, due to the reduction in d′. However, in many experiments, the aim is to investigate the modulation of synchronous firing over much shorter timescales (typically tens to hundreds of milliseconds). The small number of spikes in such short sections means that reliable conclusions cannot be reached on the basis of a single trial. The usual solution is to average the time course of synchrony over multiple trials with reference to an external event such as a stimulus occurrence or behavioral marker (Aertsen et al., 1989; Grün, 1996; Pauluis & Baker, 2000). The increased amount of data then available permits increased statistical power. The gravitational clustering approach is also amenable to such across-trial averaging. In order to use the enhancements described above, this averaging is performed as follows. The pairwise, smoothed charge products Q_{i,j}(t) are found for each available trial using equation 3.5. These are then averaged across trials, to obtain a set of mean charge product time courses \overline{Q}_{i,j}(t). The standard error of the mean, computed across trials, can then be used to scale the charge product time courses according to equation 3.10, obtaining Q'_{i,j}(t).
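Both the 0.71 probability and the trial-averaging step can be checked with a short sketch (our code; the fake trial matrix Q stands in for real charge-product time courses). With s = 0.1 and 25 Hz, excess coincidences occur at 2.5 per second, and P(N ≥ 2) for a Poisson count in a 1 s window is 1 − e^{−2.5}(1 + 2.5):

```python
# Check of the Poisson counting probability, plus a sketch of the
# trial-averaged scaling of the charge products (fake data throughout).
import numpy as np

lam = 0.1 * 25.0 * 1.0                      # expected excess coincidences per 1 s window
p_two_or_more = 1.0 - np.exp(-lam) * (1.0 + lam)
print(round(p_two_or_more, 2))              # 0.71

def trial_average(Q):
    """Average Q_ij(t) across trials (rows) and scale by the standard
    error of the mean, in the manner of equation 3.10."""
    mean = Q.mean(axis=0)
    sem = Q.std(axis=0, ddof=1) / np.sqrt(Q.shape[0])
    return mean / sem                       # Q'_ij(t), one value per time step

rng = np.random.default_rng(4)
Q = rng.normal(0.0, 1.0, size=(100, 300))   # 100 trials of noise-only product
Q[:, 150:] += 0.5                           # synchrony appears mid-trial
Qp = trial_average(Q)
print(Qp[150:].mean() > Qp[:150].mean())    # synchrony epoch stands out
```

Averaging 100 trials reduces the noise in Q' by a factor of ten, which is what licenses the steeper nonlinearity used below.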
The gravitational analysis is carried out by using these values in equations 3.3, 3.8, or 3.13. By averaging over multiple trials, the
noise level of Q'_{i,j} is reduced, increasing d′ and permitting the use of a nonlinearity and triple pair enhancement. Figure 6 shows an example of the application of such trial-averaged gravitational clustering. One hundred trials, each 3 s long, were simulated in which 10 neurons discharged at 25 Hz with interspike intervals following a fourth-order gamma process. For the first 1.5 s of each trial, neurons 1, 2, and 3 were synchronized together; for the last 1.5 s of each trial, these cells ceased to show excess correlation, and cells 4, 5, and 6 became synchronized instead (s = 0.1 in all cases). Figures 6a, 6b, and 6c show the results of gravitational clustering using the linear, nonlinear (quintic, m = 5 in equation 3.9), and triple-pair enhancement algorithms, respectively. A considerably shorter smoothing window, w = 100 ms, was used in comparison to previous analyses to retain temporal resolution. The higher power of nonlinearity compared with that used earlier (quintic compared with cubic) takes advantage of the improved signal-to-noise ratio available by averaging across trials (see Figure 3b). The pairwise distances plotted here have been adjusted by subtracting from each distance its value at t = 1.5 s. This renders the changes that occur on either side of this point more easily seen. In each case, the three pairs relating to the synchronized cells show clear separation from the remaining pairs. It should be noted that while the plots appear symmetrical, this is in fact not the case, since a different set of pairs condenses before and after the normalization point in the middle of the analysis. In addition to a clustering of the distances between particles representing the three synchronized neurons, a second group can be seen to separate from the remaining distances to a lesser extent. These are distance pairs that include one of the synchronized cells and one of the independently firing ones.
This small condensation of distance is expected from the geometry of the situation, as illustrated schematically in Figures 6d and 6e. If the two filled particles condense together, but there are no net attractive forces between them and the white particle, this will still result in a narrowing of the distance between the white and black particles. Such a small apparent condensation of pairs including only one synchronized neuron is often not noticeable in gravity analysis due to a high noise level (e.g., see Figure 4). The improved algorithms described in this article clearly provide for a reduced level of noise in the distance plots of Figures 6b and 6c compared with Figure 6a. This is quantified in Figures 6f and 6g using the measure Z defined in equation 4.1. Figure 6f presents the Z calculation for the (1,2,3) assembly, and Figure 6g for the (4,5,6) one. The calculation of Z has again been carried out over 20 separate simulations (each of 100 trials); the thin lines show the mean, plus or minus one standard error. The same line styles have been used as in Figures 4 and 5 to code the different analysis types. The order of relative efficiency of the three versions of the algorithm is as expected from the previous analysis (linear worse than nonlinear, which is
Figure 6: Results using trial-averaged gravitational analysis. (a–c) The results using conventional linear gravity, a quintic (fifth power) nonlinearity, and the nonlinearity combined with triple pair enhancement, respectively. Traces are the interpair distances as a function of time relative to the trial onset. Distances have been normalized by subtraction of their value at time 1.5 s for clarity of display. Distance plots have been labeled in b according to whether they relate to pairs in which both neurons were part of the synchronized assembly (S-S), only one was in the assembly (S-S̄), or neither was (S̄-S̄). Note that the two different assemblies S1 and S2, which were active before and after time 1.5 s, respectively, are separated before and after this time. (d) Schematic showing how apparent condensation can occur simply due to the geometry of the system. The black particles attract each other due to excess synchronization; the white particle, by contrast, represents an independently firing cell and experiences no net forces. After the black particles have condensed together (e), the distance between them and the white particle is nevertheless reduced. This explains the apparent condensation of the S-S̄ pairs in a–c. (f) Efficiency of the three algorithms computed for assembly (1,2,3), which was synchronized for the first half of each trial. Line-style conventions are as in Figures 4g and 4h. (g) The same for assembly (4,5,6), which was active in the second half of each trial. Traces show the mean and standard deviation of Z computed over 20 simulation runs.
worse than nonlinear with triple pair enhancement); the magnitude of the performance enhancement (greater than a 17-fold increase in Z) is, however, considerably greater than in the previous examples, which did not use trial averaging.
6 Conclusion
The two enhancements to the gravitational clustering algorithm presented in this article are capable of a considerable improvement in its sensitivity for detecting synchronous cell assemblies. While not universally applicable, the range of parameters over which it becomes worthwhile to use the two adjustments is wide and encompasses much of that seen physiologically. Implementation of the improvements to the gravity analysis will carry with it the cost of somewhat lengthened computation time. While this was of some concern only a few years ago (Dayhoff, 1994), rapid advances in computer power have rendered this of less consequence. An interesting observation made in the course of the work presented here is that a given level of synchrony is harder to detect (i.e., requires more data before reaching significance) for a Poisson event train than for events having a relative refractory period, as simulated by the gamma distribution. Detection of synchrony by neurons has some analogies with the gravity method (e.g., compare exponentially decaying kernels with cortical excitatory postsynaptic potentials), suggesting that this result may also hold for synchrony detection by neurons. If true, it would imply that the relative refractory period can be viewed as an important adaptation to permit accurate synchrony detection. This hypothesis, currently speculative, clearly requires further work. Acknowledgments
This work was supported by NIH MH46428, DC01249, and NSF to G. L. G. S. N. B. was funded by Christ's College, Cambridge. References Abeles, M. (1991). Corticonics. Cambridge: Cambridge University Press. Aertsen, A. M. H. J., Bonhoeffer, T., & Krüger, J. (1987). In E. R. Caianiello (Ed.), Physics of cognitive processes (pp. 1–34). Singapore: World Sciences Publishers. Aertsen, A. M. H. J., & Gerstein, G. L. (1985). Evaluation of neuronal connectivity: Sensitivity of cross correlation. Brain Research, 340, 341–354. Aertsen, A. M. H. J., Gerstein, G. L., Habib, M. K., & Palm, G. (1989). Dynamics of neuronal firing correlation: Modulation of effective connectivity. Journal of Neurophysiology, 61, 900–917. Aertsen, A., Vaadia, E., Abeles, M., Ahissar, E., Bergman, H., Karmon, B., Lavner, Y., Margalit, E., Nelken, I., & Rotter, S. (1991). Neural interactions in the frontal cortex of a behaving monkey: Signs of dependence on stimulus context and behavioral state. Journal für Hirnforschung, 32, 735–743. Baker, S. N., Olivier, E., & Lemon, R. N. (1997). Task dependent coherent oscillations recorded in monkey motor cortex and hand muscle EMG. Journal of Physiology, 501, 225–241.
Davey, N. J., Ellaway, P. H., & Stein, R. B. (1986). Statistical limits for detecting change in the cumulative sum derivative of the peristimulus time histogram. Journal of Neuroscience Methods, 17, 153–166. Dayhoff, J. E. (1994). Synchrony detection in neural assemblies. Biological Cybernetics, 71, 263–270. Espinosa, I. E., & Gerstein, G. L. (1988). Cortical auditory neuron interactions during presentation of 3-tone sequences: Effective connectivity. Brain Research, 450, 39–50. Gerstein, G. L., & Aertsen, A. M. H. J. (1985). Representation of cooperative firing activity among simultaneously recorded neurons. Journal of Neurophysiology, 54, 1513–1528. Gerstein, G. L., Perkel, D. H., & Dayhoff, J. E. (1985). Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5, 881–889. Gochin, P. M., Gerstein, G. L., & Kaltenbach, J. A. (1990). Dynamic temporal properties of effective connections in rat dorsal cochlear nucleus. Brain Research, 510, 195–202. Green, D., & Swets, J. (1966). Signal detection theory and psychophysics. New York: Wiley. Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity. Thun, Frankfurt am Main: Harri Deutsch. Kuffler, S. W., FitzHugh, R., & Barlow, H. B. (1957). Maintained activity in the cat's retina in light and darkness. Journal of General Physiology, 40, 683–702. Lindsey, B. G., Hernandez, Y. M., Morris, K. F., & Shannon, R. (1992). Functional connectivity between brain stem midline neurons with respiratory-modulated firing rates. Journal of Neurophysiology, 67, 890–904. Lindsey, B. G., Morris, K. F., Shannon, R., & Gerstein, G. L. (1997). Repeated patterns of distributed synchrony in neuronal assemblies. Journal of Neurophysiology, 78, 1714–1719. Lindsey, B. G., Shannon, R., & Gerstein, G. L. (1989). Gravitational representation of simultaneously recorded brain-stem respiratory neuron spike trains. Brain Research, 483, 373–378. Pauluis, Q., & Baker, S. N.
(2000). An accurate measure of the instantaneous discharge probability, with application to unitary joint-event analysis. Neural Computation, 12, 647–669. Rieke, F., Warland, D., de Ruyter van Steveninck, R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press. Rodieck, R. W., Kiang, N. Y.-S., & Gerstein, G. L. (1962). Some quantitative methods for the study of spontaneous activity of single neurons. BiophysicalJournal, 2, 351–368. Shadlen, M. N., & Newsome, W. T. (1998). The variable discharge of cortical neurons: Implications for connectivity, computation, and information coding. Journal of Neuroscience, 18, 3870–3896. Singer, W., & Gray, C. M. (1995). Visual feature integration and the temporal correlation hypothesis. Annual Review of Neuroscience, 18, 555–586.
Stuart N. Baker and George L. Gerstein
Smith, W. S., & Fetz, E. E. (1989). Effects of synchrony between corticomotoneuronal cells on post-spike facilitation of primate muscles and motor units. Neuroscience Letters, 96, 76–81.
Stein, R. B. (1965). A theoretical analysis of neuronal variability. Biophysical Journal, 5, 173–194.
Strangman, G. (1997). Detecting synchronous cell assemblies with limited data and overlapping assemblies. Neural Computation, 9, 51–76.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey motor cortex in relation to behavioural events. Nature, 373, 515–518.

Received May 14, 1999; accepted February 2, 2000.
LETTER
Communicated by Ad Aertsen
Neural Coding: Higher-Order Temporal Patterns in the Neurostatistics of Cell Assemblies

Laura Martignon
Max Planck Institute for Human Development, Lentzeallee 94, 14195 Berlin, Germany

Gustavo Deco
Siemens AG, Corporate Technology ZT, 81739 Munich, Germany

Kathryn Laskey
Department of Systems Engineering, George Mason University, Fairfax, VA 22030, U.S.A.

Mathew Diamond
Cognitive Neuroscience Sector, International School of Advanced Studies, 34014 Trieste, Italy

Winrich Freiwald
Center for Cognitive Science, University of Bremen, 28359 Bremen, Germany

Eilon Vaadia
Department of Physiology, Hebrew University of Jerusalem, Jerusalem 91010, Israel

Recent advances in the technology of multiunit recordings make it possible to test Hebb's hypothesis that neurons do not function in isolation but are organized in assemblies. This has created the need for statistical approaches to detecting the presence of spatiotemporal patterns of more than two neurons in neuron spike train data. We mention three possible measures for the presence of higher-order patterns of neural activation (coefficients of log-linear models, connected cumulants, and redundancies) and present arguments in favor of the coefficients of log-linear models. We present test statistics for detecting the presence of higher-order interactions in spike train data by parameterizing these interactions in terms of coefficients of log-linear models. We also present a Bayesian approach for inferring the existence or absence of interactions and estimating their strength. The two methods, the frequentist and the Bayesian one, are shown to be consistent in the sense that interactions that are detected by either method also tend to be detected by the other. A heuristic for the analysis of temporal patterns is also proposed. Finally, a Bayesian test is presented that establishes stochastic differences between recorded segments of data.
The methods are applied to experimental data and synthetic data drawn from our statistical models. Our experimental data are drawn from multiunit recordings in the prefrontal cortex of behaving monkeys, the somatosensory cortex of anesthetized rats, and multiunit recordings in the visual cortex of behaving monkeys.

Neural Computation 12, 2621–2653 (2000) © 2000 Massachusetts Institute of Technology

1 Introduction
Hebb (1949) conjectured that information processing in the brain is achieved through the collective action of groups of neurons, which he called cell assemblies. One of the assumptions on which he based his argument was his other famous hypothesis: that excitatory synapses are strengthened when the involved neurons are frequently active in synchrony. Evidence in support of this so-called Hebbian learning hypothesis has been provided by a variety of experimental findings over the last three decades (e.g., Rauschecker, 1991). Evidence for collective phenomena confirming the cell assembly hypothesis has only recently begun to emerge as a result of the progress achieved by multiunit recording technology. Hebb's followers were left with a twofold challenge: to provide an unambiguous definition of cell assemblies and to conceive and carry out the experiments that demonstrate their existence.

Cell assemblies have been defined in terms of both anatomy and shared function. One persistent approach characterizes the cell assembly by near simultaneity or some other specific timing relation in the firing of the involved neurons. If two neurons converge on a third one, their synaptic influence is much larger for near-coincident firing, due to the spatiotemporal summation in the dendrite (Abeles, 1991; Abeles, Bergman, Margalit, & Vaadia, 1993). Thus synfiring and other specific temporal relationships between active neurons have been posited as mechanisms by which the brain codes information (Gray, König, Engel, & Singer, 1989; Singer, 1994; Abeles & Gerstein, 1988; Abeles, Prut, Bergman, & Vaadia, 1994; Prut et al., 1998). In pursuit of experimental evidence for cell assembly activity in the brain, physiologists thus seek to analyze the activation of many separate neurons simultaneously, preferably in awake, behaving animals. These multineuron activities are then inspected for possible signs of interactions among neurons.
Results of such analyses may be used to draw inferences regarding the processes taking place within and between hypothetical cell assemblies. The conventional approach is based on the use of cross-correlation techniques, usually applied to the activity of pairs (sometimes triplets) of neurons recorded under appropriate stimulus conditions. The result is a time-averaged measure of the temporal correlation among the spiking events of the observed neurons under those conditions. Application of these measures has revealed interesting instances of time- and context-dependent synchronization dynamics in different cortical areas. For interactions among more than two neurons, Gerstein, Perkel, and Dayhoff (1985) devised the so-called gravity method. Their method views neurons as particles that attract each other and measures simultaneously the attractions
between all possible pairs of neurons in the set. The method, although it allows one to look at several neurons globally, does not account for nonlinear synapses or for synchronization of more than two neurons. Recent investigations have focused on the detection of individual instances of synchronized activity, called unitary events (Grün, 1996; Grün, Aertsen, Vaadia, & Riehle, 1995; Riehle, Grün, Diesmann, & Aertsen, 1997), between groups of two and more neurons. Of special interest are patterns involving three or more neurons, which cannot be described in terms of pair correlations. Such patterns are genuine higher-order phenomena. The method developed by Prut et al. (1998) for detecting precise firing sequences of up to three units adopts this view, subtracting from three-way correlations all possible pair correlations.

The models reported in this article were developed for the purpose of describing and detecting correlations of any order in a unified way. In the data we analyze, the spiking events (in the 1 ms range) are encoded as sequences of 0s and 1s, and the activity of the whole group is described as a sequence of binary configurations. This article presents a family of statistical models for analyzing such data. In our models, the parameters represent spatiotemporal firing patterns. We generalize the spatial correlation models developed by Martignon, von Hasseln, Grün, Aertsen, and Palm (1995) to include a temporal dimension. We develop statistical tests for detecting the presence of a genuine order-n correlation and distinguishing it from an artifact that can be explained by lower-order interactions. The tests compare observed firing frequencies of the involved neurons with frequencies predicted by a distribution that maximizes entropy among all distributions consistent with observed information on synchrony of lower order. We also present a Bayesian approach in which hypotheses about interactions on a large number of subsets can be compared simultaneously.
Furthermore, we introduce a Bayesian test to establish essential differences (i.e., differences that are not to be attributed to noise) between different phases of recording (e.g., prestimulus phase versus stimulus phase). We compare our log-linear models with two other candidate approaches to measuring the presence of higher-order correlations. One of these candidates, drawn from statistical physics, is the connected cumulant. The other, drawn from information theory, is mutual information or redundancy. We argue that the coefficients of log-linear models, or effects, provide a more natural measure of higher-order phenomena than either of these alternative approaches. We present the results of analyzing synthetic data generated from our models to test the performance of our statistical methods on data of known distribution. Results are presented from applying the models to multiunit recordings obtained by Vaadia from the frontal cortex of monkeys, to multiunit recordings obtained by Diamond from the somatosensory cortex of rats, and to multiunit recordings obtained by Freiwald from the visual cortex of behaving monkeys.
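Before turning to the models, the encoding just described (spike trains binned into a sequence of binary configurations) can be sketched in a few lines. The function name, bin width default, and array layout below are our own illustrative choices, not taken from the article:

```python
import numpy as np

def bin_spike_trains(spike_times, n_neurons, t_start, t_stop, bin_ms=1.0):
    """Encode spike trains as a sequence of binary configurations.

    spike_times: list of arrays; spike_times[i] holds spike times (ms) of neuron i.
    Returns an (n_bins, n_neurons) 0/1 array whose row t is the configuration x_t.
    """
    n_bins = int(np.ceil((t_stop - t_start) / bin_ms))
    X = np.zeros((n_bins, n_neurons), dtype=np.uint8)
    for i, times in enumerate(spike_times):
        idx = ((np.asarray(times, dtype=float) - t_start) // bin_ms).astype(int)
        idx = idx[(idx >= 0) & (idx < n_bins)]
        X[idx, i] = 1  # a bin is 1 if the neuron fired at least once in it
    return X

# Three neurons over 4 ms with 1 ms bins: neuron 0 fires at 0.4 and 2.7 ms,
# neuron 1 at 2.9 ms, neuron 2 is silent.
X = bin_spike_trains([[0.4, 2.7], [2.9], []], 3, 0.0, 4.0)
```

With this encoding, the configuration in the third bin is (1, 1, 0): neurons 0 and 1 fire "synchronously" at the 1 ms resolution of the binning.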
2 Measures for Higher-Order Synchronization

2.1 Effects of Log-Linear Models. The term spatial correlation has been used to denote synchronous firing of a group of neurons, while the term temporal correlation has been used to indicate chains of firing events at specific temporal intervals. Terms like couple and triplet have been used to denote spatiotemporal patterns of two or three neurons (Abeles et al., 1993; Grün, 1996) firing simultaneously or in sequence. Establishing the presence of such patterns is not straightforward. For example, three neurons may fire together more often than expected by chance¹ without exhibiting an authentic third-order interaction. In particular, if a neuron participates in two couples, such that each pair fires together more often than by chance, then the three involved neurons will fire together more often than the independence hypothesis would predict. This is not, however, a genuine third-order phenomenon. Authentic triplets and, in general, authentic nth-order correlations must therefore be distinguished from correlations that can be explained in terms of lower-order correlations.

We introduce log-linear models for representing firing frequencies on a set of neurons and show that nonzero coefficients, or effects, of these log-linear models are a natural measure for synchronous firing. We argue that the effects of log-linear models are superior to other candidate approaches drawn from statistical physics and information theory. In this section we consider only models for synchronous firing. Generalization to temporal and spatiotemporal effects is treated in section 3.

Consider a set of n neurons. Each neuron is modeled as a binary unit that can take on one of two states: 1 (firing) or 0 (silent). The state of the n units is represented by the vector x = (x_1, ..., x_n), where each x_i can take on the value zero or one. There are 2^n possible states for the n neurons.
If all neurons fire independently of one another, the probability of configuration x is given by
p(x_1, ..., x_n) = p(x_1) ··· p(x_n).  (2.1)
Methods for detecting correlations look for departures from this model of independence. Following a well-established tradition, we model neurons as Bernoulli random variables. Two neurons are said to be correlated if they do not fire independently. A correlation between two binary neurons, labeled 1 and 2, is expressed mathematically as

p(x_1, x_2) ≠ p(x_1) p(x_2).  (2.2)
Extending this idea to larger sets of neurons introduces complications. It is not sufficient simply to compare the joint probability p(x_1, x_2, x_3) with

¹ That is, more often than predicted by the hypothesis of independence.
Figure 1: Overlapping doublets and authentic triplet.
the product p(x_1) p(x_2) p(x_3) and declare the existence of a triplet when the two are not equal. This would confuse authentic triplets with overlapping doublets or combinations of doublets and singlets. Thus, neurons 1, 2, and 3 may fire together more often than the independence model, equation 2.1, would predict because neurons 1 and 2, and neurons 2 and 3, are each involved in a binary interaction (see Figure 1).

Equation 2.2 expresses in mathematical form the idea that two neurons are correlated if the joint probability distribution for their states cannot be determined from the two individual probability distributions p(x_1) and p(x_2). Now consider a set of three neurons. The probability distributions for the three pair configurations are p(x_1, x_2), p(x_2, x_3), and p(x_1, x_3).² A genuine third-order interaction among the three neurons would occur if it were impossible to determine the joint distribution for the three-neuron configuration using only information on these pair distributions. A canonical way to construct a joint distribution on three neurons from the pair distributions is to maximize entropy subject to the constraint that the two-neuron marginal distributions are given by p(x_1, x_2), p(x_2, x_3), and p(x_1, x_3). This is the distribution that adds the least information beyond the information contained in the pair distributions. It is well known³ that the joint distribution thus obtained can be written as

p(x_1, x_2, x_3) = exp{θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + θ_12 x_1 x_2 + θ_13 x_1 x_3 + θ_23 x_2 x_3},  (2.3)

where the θs are real-valued parameters and θ_0 is determined from the other θs and the constraint that the probabilities of all configurations sum to 1. In this model, there is a parameter for each individual neuron and each pair of neurons. Each of these parameters, which in the statistics literature are called effects, can be thought of as measuring the tendency to be active of the neuron(s) with which it is associated.
Increasing the value of θ_i without

² Note that these pairwise distributions also determine the single-neuron firing frequencies p(x_1), p(x_2), and p(x_3), which can be expressed as their linear combinations.
³ See, for example, Good (1963) or Bishop, Fienberg, and Holland (1975).
changing any other θs increases the probability that neuron i is in its "on" state. Increasing the value of θ_ij without changing any other θs increases the probability of simultaneous firing of neurons i and j. It is instructive to note that there is a second-order correlation between neurons i and j in the sense of equation 2.2 precisely when θ_ij ≠ 0.

Not all joint probability distributions on three neurons can be written in the form of equation 2.3. A general joint distribution for three neurons can be expressed by including one additional parameter:⁴

p(x_1, x_2, x_3) = exp{θ_0 + θ_1 x_1 + θ_2 x_2 + θ_3 x_3 + θ_12 x_1 x_2 + θ_13 x_1 x_3 + θ_23 x_2 x_3 + θ_123 x_1 x_2 x_3}.  (2.4)
Holding the other parameters fixed and increasing the value of θ_123 increases the probability that all three neurons fire simultaneously. Equation 2.3 corresponds to the special case of equation 2.4 in which θ_123 is equal to zero, just as independence corresponds to the special case of equation 2.3 in which all second-order terms θ_ij are equal to zero. It seems natural, then, to define a genuine order-3 correlation as a joint probability distribution that cannot be expressed in the form of equation 2.3 because θ_123 ≠ 0. As will be shown in section 3, this idea can be extended naturally to larger sets of neurons and temporal patterns.

Parameterizations of the general form of equations 2.3 and 2.4 are called log-linear models because the logarithm of the probabilities can be expressed as a linear sum of functions of the configuration values. They correspond to the coefficients of the energy expansions of generalized Boltzmann machines in statistical mechanics. The parameters of the log-linear model are called effects. A joint probability distribution on n neurons is represented as a log-linear model in which effects are associated with subsets of neurons. A genuine nth-order correlation on a set A of n neurons is defined as the presence of a nonzero effect θ_A associated with the set A in the log-linear model representing the joint distribution of configuration frequencies.

The information-theoretic justification for our proposed definition of higher-order interaction is theoretically satisfying. But if the idea is to have practical utility, we require a method for measuring the presence of genuine correlations and distinguishing them from noise due to finite sample size. Fortunately, there is a large statistical literature on estimation and hypothesis testing using log-linear models. In the following sections, we discuss ways to detect and quantify the presence of higher-order correlations. We also apply the methods to real and synthetic data.
⁴ This can be seen by noting that log p(x_1, x_2, x_3) can be expressed as a set of linear equations relating the configuration probabilities to the θs. These equations give a unique set of θs for any given probability distribution.
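To make the definition concrete, here is a small sketch (function names are ours, not from the article) that builds a joint distribution of the form of equation 2.4 and recovers its effects from the configuration probabilities, using the closed-form expression θ_A = Σ_{B⊆A} (−1)^{|A|−|B|} ln p(χ_B) that appears later as equation 4.2:

```python
from itertools import product
import math

def joint_from_effects(theta, n):
    """Joint distribution p(x) proportional to exp{sum_A theta_A prod_{i in A} x_i}."""
    unnorm = {x: math.exp(sum(t for A, t in theta.items() if all(x[i] for i in A)))
              for x in product((0, 1), repeat=n)}
    Z = sum(unnorm.values())
    return {x: v / Z for x, v in unnorm.items()}

def effect(p, A):
    """theta_A = sum over B subset of A of (-1)^{|A|-|B|} ln p(chi_B),
    where chi_B is the configuration with 1s exactly on B (cf. eq. 4.2)."""
    n = len(next(iter(p)))
    total = 0.0
    for bits in product((0, 1), repeat=len(A)):
        B = [a for a, b in zip(A, bits) if b]
        chi = tuple(1 if i in B else 0 for i in range(n))
        total += (-1) ** (len(A) - len(B)) * math.log(p[chi])
    return total

# Two pair effects but no triplet effect: theta_123 = 0 in the sense of eq. 2.4.
theta = {(0,): -1.0, (1,): -1.0, (2,): -1.0, (0, 1): 0.8, (1, 2): 0.5}
p = joint_from_effects(theta, 3)
# effect(p, (0, 1)) recovers 0.8; effect(p, (0, 1, 2)) is zero: the neurons
# fire together more often than independence predicts, yet there is no
# genuine third-order correlation.
```

The normalizing constant θ_0 cancels in the alternating sum, so the effects for nonempty A are recovered exactly from the configuration probabilities.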
2.2 Other Candidate Measures of Higher-Order Interactions. Before closing this section, we mention two further approaches to measuring the existence of higher-order correlations in a set of binary neurons. Connected cumulants are measures of higher-order correlations used in statistical mechanics as well as in nonlinear system theory. They were introduced by Aertsen (1992) as a candidate parameter for spatiotemporal patterns of neural activation. Their introduction was motivated by their role in the Wiener-Volterra theory of nonlinear systems. For three neurons, for instance, the connected cumulant is defined by

C_123 = ⟨x_1 x_2 x_3⟩ − ⟨x_1⟩⟨x_2 x_3⟩ − ⟨x_2⟩⟨x_1 x_3⟩ − ⟨x_3⟩⟨x_1 x_2⟩ + 2⟨x_1⟩⟨x_2⟩⟨x_3⟩.  (2.5)
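As a quick numerical sanity check (helper names are ours), equation 2.5 can be evaluated directly from a joint distribution over 0/1 configurations; for independent neurons the cumulant vanishes:

```python
from itertools import product
import math

def expectation(p, idx):
    """Expectation of the product of x_i over i in idx, under joint distribution p."""
    return sum(prob for x, prob in p.items() if all(x[i] for i in idx))

def connected_cumulant_3(p):
    """C_123 of equation 2.5 for a joint distribution p on three binary neurons."""
    e = expectation
    return (e(p, (0, 1, 2))
            - e(p, (0,)) * e(p, (1, 2))
            - e(p, (1,)) * e(p, (0, 2))
            - e(p, (2,)) * e(p, (0, 1))
            + 2 * e(p, (0,)) * e(p, (1,)) * e(p, (2,)))

# Independent neurons with hypothetical firing probabilities 0.2, 0.3, 0.5:
m = (0.2, 0.3, 0.5)
p = {x: math.prod(m[i] if x[i] else 1 - m[i] for i in range(3))
     for x in product((0, 1), repeat=3)}
# connected_cumulant_3(p) is zero (up to rounding) under independence.
```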
In these expressions, the symbol ⟨·⟩ stands for expectation. Thus, the symbol ⟨x_i⟩ denotes the expected firing frequency of the ith neuron. This formula can easily be generalized to any number of neurons.

Another approach to measuring higher-order correlations uses the concept of mutual information drawn from information theory. This approach was originally suggested by Grün (1996). For three neurons, the parameter defined by

I_3(x_1, x_2, x_3) = H(x_1) + H(x_2) + H(x_3) − H(x_1, x_2) − H(x_1, x_3) − H(x_2, x_3) + H(x_1, x_2, x_3),  (2.6)

where H denotes entropy (see Cover & Thomas, 1991, for details), is a measure of higher-order correlation. This formula can easily be generalized to any number n of neurons, for n > 1. Defining second-order interactions as either C_12 = ⟨x_1 x_2⟩ − ⟨x_1⟩⟨x_2⟩ or I_2 = H(x_1, x_2) − H(x_1) − H(x_2) is equivalent to the above definition as a nonzero effect of the appropriate log-linear model. However, extending these measures to correlations of order 3 or higher does not yield equivalent characterizations of the set of distributions exhibiting no interaction of the given order. A detailed discussion of both measures as well as a presentation of statistical methods for their detection is given in Deco, Martignon, and Laskey (1998).

In this article we argue that an interaction among a set of neurons should be modeled as a nonzero effect in a log-linear model for the joint distribution of spatiotemporal configurations of the involved neurons. There are several reasons for this choice. First, log-linear models are natural extensions of the Boltzmann machine. As we already mentioned, defining interactions as nonzero effects of log-linear models has a compelling information-theoretic justification. Second, methods exist for estimating effects and testing hypotheses about the degree of interaction using data from multiunit recordings. Third, the models can be used to examine behavioral correlates of neuronal activation patterns by estimating, comparing, and testing hypotheses about similarities and differences in log-linear models for data segments recorded at different times under differing environmental stimuli.

3 A Mathematical Model for Spatiotemporal Firing Patterns
The previous section described several approaches to generalizing the concept of pairwise correlations to larger numbers of neurons. Parameters of log-linear models were proposed as a measure of higher-order correlations and justified by their ability to distinguish true higher-order correlations from apparent correlations that can be explained in terms of lower-order correlations. For clarity of exposition, the discussion was limited to synchronous firing on groups of two and three neurons. In this section, we generalize the approach to arbitrary numbers of neurons and temporal correlations.

Consider a set L of n binary neurons and denote by p the probability distribution on sequences of binary configurations of L. Assume that the sequence of configurations x^t = (x_1^t, ..., x_n^t) of the n neurons forms a Markov chain of order r. Let δ be the time step, and denote the conditional distribution for x^t given previous configurations by p(x^t | x^(t−δ), x^(t−2δ), ..., x^(t−rδ)). We assume that all transition probabilities are strictly positive and expand the logarithm of the conditional distribution as

p(x^t | x^(t−δ), x^(t−2δ), ..., x^(t−rδ)) = exp{ θ_0 + Σ_{A∈J} θ_A x_A }.  (3.1)
In this expression, each A is a subset of pairs of nonnegative indices (i, m), where at least one value of m is equal to zero. The set A represents a possible spatiotemporal pattern. The index i indicates a neuron, and the index m indicates a time delay in units of δ. The variable x_A = Π_{1≤j≤k} x_(i_j, m_j) is equal to 1 in the event that all neurons represented by indices in A are active at the indicated time delays, and is equal to zero otherwise. For example, if there are 10 instances in which neuron 4 fires 30 ms after neuron 2, and if the time step is 10 ms, then x_A = 1 for A = {(2, 0), (4, 3)} in those instances. The set J of subsets A for which θ_A ≠ 0 is called the interaction structure for the distribution p. The parameter θ_A is called the interaction strength for the interaction on subset A. Clearly, θ_A = 0 means A ∉ J and is taken to indicate absence of an order-|A| interaction among the neurons in A. We denote the structure-specific vector of nonzero interaction strengths by θ_J.

Definition 1. We say that neurons {i_1, i_2, ..., i_k} exhibit a spatiotemporal pattern if there is a set of time intervals m_1δ, m_2δ, ..., m_kδ with r ≥ m_i ≥ 0 and at least one m_i = 0, such that θ_A ≠ 0 in equation 3.1, where A = {(i_1, m_1), ..., (i_k, m_k)}.
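On binned data, the indicator x_A can be evaluated with a short sketch. The function name and the exact delay convention are our own reading of the definition above: x_(i, m) is taken to mean that neuron i is active m bins before the pattern's reference bin:

```python
import numpy as np

def pattern_indicator(X, A):
    """Evaluate x_A(t) for each admissible reference bin t.

    X: (n_bins, n_neurons) 0/1 array of binned spike trains.
    A: iterable of (neuron, delay) pairs; delay m means "m bins earlier".
    Returns a 0/1 array over the reference bins where all delays fit.
    """
    A = list(A)
    max_m = max(m for _, m in A)
    n_bins = X.shape[0]
    out = np.ones(n_bins - max_m, dtype=np.uint8)
    for i, m in A:
        # neuron i must be active m bins before each reference bin
        out &= X[max_m - m : n_bins - m, i]
    return out

# Toy raster: 4 bins, 2 neurons; neuron 1 fires in bin 0, neuron 0 in bin 3,
# so the pattern "neuron 0 fires 3 bins after neuron 1" occurs once.
X = np.array([[0, 1], [0, 0], [0, 0], [1, 0]], dtype=np.uint8)
```

Summing the returned array counts the occurrences of the pattern, which is the raw quantity entering the frequency comparisons of section 4.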
Figure 2: Nonzero θ_123 due to common hidden inputs H and/or K.
Definition 2. A subset {i_1, i_2, ..., i_k} of neurons exhibits a synchronization or spatial correlation if θ_A ≠ 0 for A = {(i_1, 0), ..., (i_k, 0)}.

In the case of absence of any temporal dependencies, the configurations at different times are independent, and we drop the time index:

p(x) = exp{ θ_0 + Σ_{A∈J} θ_A x_A },  (3.2)

where each A is a nonempty subset of L and x_A = Π_{i∈A} x_i. Of course, we expect temporal correlation of some kind to be present, one such example being the refractory period after firing. Nevertheless, equation 3.2 may be adequate in cases of weak temporal correlation, and it generalizes naturally to the situation of temporal correlation. Increasing the bin width can result in temporal correlations over short time intervals manifesting as synchronization. It is clear that equations 2.3 and 2.4 are special cases of equation 3.2 for the case of three neurons. The interaction structure J for equation 2.4 consists of all nonempty subsets of neurons. For equation 2.3, the interaction structure consists of all single-neuron and two-neuron subsets.

Although the models of equations 3.1 and 3.2 are statistical and not physiological, one would naturally expect a synaptic connection between two neurons to manifest as a nonzero θ_A for the set A composed of these neurons with the appropriate delays. One example leading to nonzero θ_A in equation 3.2 would be simultaneous activation of the neurons in A due to common input (see Figure 2a). Another example is simultaneous activation of overlapping subsets covering A, each by a different unobserved input (see Figure 2b).
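The common-input case of Figure 2a can be checked numerically. In the sketch below (all names and parameter values are ours, chosen only for illustration), a hidden unit H drives three observed neurons through a single fourth-order effect θ_H123; marginalizing H out and re-expanding the observed three-neuron distribution in log-linear form yields a nonzero triplet effect θ_123 while every pair effect remains zero:

```python
from itertools import product, combinations
import math

def joint(theta, n):
    """p(x) proportional to exp{sum_A theta_A prod_{i in A} x_i} over n binary units."""
    un = {x: math.exp(sum(t for A, t in theta.items() if all(x[i] for i in A)))
          for x in product((0, 1), repeat=n)}
    Z = sum(un.values())
    return {x: v / Z for x, v in un.items()}

def effect(p, A):
    """Recover theta_A from configuration probabilities (cf. equation 4.2)."""
    n = len(next(iter(p)))
    s = 0.0
    for r in range(len(A) + 1):
        for B in combinations(A, r):
            chi = tuple(1 if i in B else 0 for i in range(n))
            s += (-1) ** (len(A) - r) * math.log(p[chi])
    return s

# Figure 2a scenario: hidden unit H (index 0), observed neurons at indices 1-3,
# coupled only through individual effects and the fourth-order effect theta_H123.
theta = {(0,): -0.5, (1,): -1.0, (2,): -1.0, (3,): -1.0, (0, 1, 2, 3): 2.0}
p4 = joint(theta, 4)

# Marginalize out H: sum the joint over its two states.
p3 = {x: p4[(0,) + x] + p4[(1,) + x] for x in product((0, 1), repeat=3)}

# The observed distribution acquires a genuine triplet effect theta_123 > 0,
# while all pair effects stay at zero, as claimed for Figure 2a.
```

This is exactly the signature discussed next: a nonzero θ_A with vanishing lower-order effects points to a correlation possibly mediated by unobserved neurons.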
An attractive feature of our models is that the two cases of Figure 2 can be distinguished when the neurons in A exhibit no true lower-order interactions. Suppose a model such as depicted in Figure 2 is constructed for all neurons, both observed and unobserved. The model of Figure 2a corresponds to a log-linear expansion with only individual neuron coefficients and the fourth-order effect θ_H123, where H is a hidden unit, as in Figure 2. The model of Figure 2b contains individual neuron effects and the triplet effects θ_H12 and θ_K23, where K is another hidden unit different from H. When the probabilities are summed over the unobserved neurons to obtain the distribution over the three observed neurons, both models will contain third-order coefficients θ_123. However, the model of Figure 2a will contain no nonzero second-order coefficients, and the model of Figure 2b will contain nonzero coefficients of all orders up to and including order 3. Thus log-linear models have the desirable feature that the existence of a nonzero θ_A coupled with θ_B = 0 for B ⊂ A indicates a correlation among the neurons in A possibly involving unobserved neurons. On the other hand, a nonzero θ_A together with θ_B ≠ 0 for B ⊂ A may indicate an interaction that can be explained away by interactions involving subsets of A and unobserved neurons.

4 Estimation and Testing Based on Maximum Likelihood Theory

4.1 Construction of a Test Statistic for the Presence of Interactions.
We are interested in the problem of detecting nonzero θ_A for subsets A of neurons in a set of n neurons whose firing frequencies are governed by model 3.1 or 3.2. That is, for a given set A, we are interested in testing which of the two hypotheses,

H_0: θ_A = 0
H_1: θ_A ≠ 0,

is the case. We wish to declare the existence of a higher-order interaction only when the observed correlations are unlikely to be a chance result of a model in which θ_A = 0. We apply the theory of statistical hypothesis testing (Neyman & Pearson, 1928) to construct a test statistic whose distribution is known under the null hypothesis. Then we choose a critical range of values of the test statistic that are highly unlikely under the null hypothesis but are likely to be observed under plausible deviations from the null hypothesis. If the observed value of the test statistic falls into the critical range, we reject the null hypothesis in favor of the alternative hypothesis.

Consider the problem of testing H_0: θ_A = 0 against H_1: θ_A ≠ 0 in the log-linear model, equation 3.2, for synchronous firing in observations with no temporal correlation. A reasonable test statistic is the likelihood ratio statistic, denoted by G². To construct the G² test statistic, we first obtain the maximum
likelihood estimate of the parameters θ_J under the null and alternative hypotheses. We construct our test statistic by conditioning on the silence of neurons outside A. We compute estimates p_1(x) and p_0(x) for the vector of configuration probabilities under the alternative (H_1) and null (H_0) hypotheses, respectively. Using these probability estimates, we compute the likelihood ratio test statistic,

G² = 2N Σ_x p_1(x) ( ln p_1(x) − ln p_0(x) ),  (4.1)
where N is the sample size for the test (the number of configurations of 0s and 1s in which neurons outside of A are silent). As the sample size for the test tends to infinity, the distribution of this test statistic under H_0 tends to a chi-squared distribution, where the number of degrees of freedom is the difference in the number of parameters in the two models being compared. In this case, the larger model has a single extra parameter, so the reference distribution has one degree of freedom. The interpretation of test results is as follows. If the value of the test statistic is highly unusual for a chi-squared distribution with one degree of freedom, this casts doubt on the null hypothesis H_0 and is therefore regarded as evidence that the additional parameter θ_A is nonzero.

4.2 A Caveat on Statistical Power. An explicit expression for θ_A in terms of the probabilities of configurations is given by
θ_A = Σ_{B⊆A} (−1)^|A−B| ln p(χ_B),  (4.2)
where χ_B represents the configuration with 1s on all elements of B and 0s elsewhere. Thus, θ_A is fully determined by the probabilities of configurations that have 0s outside A. Conditioning on the silence of neurons outside A, and thus discarding observations for which neurons outside A are active, simplifies estimation of θ_A and construction of a test statistic for H_1 against H_0. The distribution of the test statistic does not depend on whether there are nonzero coefficients associated with sets containing neurons outside A.

Ignoring all observations for which neurons outside A are firing, as the above test does, reduces statistical power as compared with a test based on all the observations.⁵ This is not a major difficulty when overall firing frequencies are low and the number of neurons is small, but it can make the approach infeasible for large numbers of neurons. An alternative approach would be to construct a test using all the data. To construct such a test,

⁵ A statistical hypothesis test has high power if the probability of correctly rejecting a false null hypothesis is high, and high significance if the probability of incorrectly rejecting a true null hypothesis is low.
it is necessary to estimate parameters for an interaction structure over all neurons, which allows θ_B ≠ 0 for all sets B for which it is reasonable to expect that interactions may be present (typically, all subsets of neurons less than or equal to a given order). Unfortunately, neither of these approaches scales to large numbers of neurons. As the number of neurons grows, the probability that some neuron outside the set A is firing also grows, increasing the fraction of the data that needs to be discarded in the first approach. The second approach suffers from the problem that the number of simultaneously estimated parameters grows as an order-|A| polynomial in the number of neurons, resulting in unstable parameter estimates when the number of neurons is large. In contrast, as we will argue, the natural Occam's razor (Smith & Spiegelhalter, 1980) associated with the Bayesian mixture model introduced in section 5 permits that method to be applied usefully even with very large sets of neurons. Although we do regard statistical significance of a coefficient as a useful indicator of the existence of a correlation among the associated neurons, the Bayesian approach described below makes use of all the data and is well suited to problems involving large numbers of parameters.

4.3 Estimating Parameters Under Null and Alternative Hypotheses.
The maximum likelihood estimate of θ_J under the alternative hypothesis H_1: J = 2^A is easily obtained in closed form by setting the frequency distribution p_1(x) over configurations of neurons in the set A to the sample frequencies and solving for θ_{2^A}. The maximum likelihood estimate for θ_J under the null hypothesis H_0: J = 2^A − {A} may be obtained by a standard likelihood maximization procedure such as iterative proportional fitting (IPF) (see, for instance, Bishop et al., 1975). As an alternative to IPF, we have developed a simple one-dimensional optimization procedure, described in the appendix, which we call the constrained perturbation procedure (CPP). We have found the computation time for CPP to be much faster than for IPF. Once the estimate θ_J has been obtained, one can solve for the estimated configuration probabilities p_0(x) for the null hypothesis.

5 Statistical Tests Based on Bayesian Model Comparison

5.1 Detecting Synchronizations Using the Bayesian Approach. An alternative to the Neyman-Pearson hypothesis testing described in section 4 is Bayesian estimation. Standard frequentist methods may run into difficulties on high-dimensional problems such as the one treated in this article. In contrast, high-dimensional problems pose no essential difficulty in the Bayesian approach as long as prior distributions are chosen appropriately. For this reason, and also because of advances in computational methods, the Bayesian approach has been gaining favor as a theoretically sound and practical way to treat high-dimensional problems.
Neural Coding
2633
In this section we focus on the problem of determining synchronization structure in the absence of temporal correlation, as represented by equation 3.2. A heuristic treatment of temporal correlations is discussed in section 5.3, and a full treatment of spatiotemporal interactions will appear in a future article. To apply the Bayesian approach, we assign a prior probability to each interaction structure J and a continuous prior distribution to the nonzero parameters hJ. A sample of observations from multiunit recordings is used to update the prior distribution and obtain a posterior distribution. The posterior distribution also assigns a probability to each interaction structure and a continuous probability distribution to the nonzero hA given the interaction structure. The prior distribution can be thought of as a bias that keeps the model from jumping to inappropriately strong conclusions. In our problem, it is natural to assign a prior distribution that introduces a bias toward models in which most hA are zero. We use a prior distribution in which all hA are independent of each other and each is zero with high probability. We then assume a continuous probability density for hA conditional on its being nonzero, in which values very different from zero are less likely than values near zero. Because of the bias introduced by the prior distribution, a large posterior probability that hA ≠ 0 indicates reasonably strong observational evidence for a nonzero parameter. A formal specification of the prior distribution is given in the appendix. We estimate posterior distributions for both structure and parameters in a unified Bayesian framework. A Markov chain Monte Carlo model composition algorithm (MC3) is used to search over structures (Madigan & York, 1993). Laplace's method (Tierney & Kadane, 1986; Kass & Raftery, 1995) is used to estimate the posterior probability of structures.
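Once posterior probabilities of structures are available, per-coefficient summaries follow by model averaging. A minimal sketch (the helper and its input format are our own, not the article's notation; the article's actual formulas are in its appendix):

```python
def model_average(structures):
    """structures: list of (posterior_prob, estimate_or_None) pairs, one per
    interaction structure; the estimate is None when the structure excludes
    h_A (i.e., fixes h_A = 0). Returns P(h_A != 0) and the model-averaged
    estimate of h_A given that it is nonzero."""
    p_nonzero = sum(w for w, est in structures if est is not None)
    if p_nonzero == 0:
        return 0.0, None
    # weight each structure-specific estimate by its posterior probability,
    # then renormalize over the structures in which h_A is nonzero
    avg = sum(w * est for w, est in structures if est is not None) / p_nonzero
    return p_nonzero, avg
```

For example, three structures with posterior weights 0.5, 0.3, and 0.2, the last excluding hA, give P(hA ≠ 0) = 0.8 and a conditional estimate of (0.5·0.4 + 0.3·0.2)/0.8.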
For a given interaction structure, hA is estimated by the mode of the posterior distribution for that structure, and its standard deviation is estimated by the appropriate diagonal element of the inverse Fisher information matrix. The posterior probability that hA ≠ 0 is the sum of the posterior probabilities of all interaction structures in which hA ≠ 0. An overall estimate for hA given that it is nonzero is obtained as a weighted average of structure-specific estimates, where each structure is weighted by its posterior probability and the result is normalized over structures in which hA ≠ 0. Formulas for the estimates may be found in the appendix.

5.2 Detecting Changes in Synchronization Patterns. The Bayesian approach can also be used to infer whether there are systematic differences in firing rates and interactions during different time segments. A fully Bayesian treatment of this problem would embed the model 3.3 in a larger model including all time segments under consideration and would explicitly model hypotheses in which the hA are identical or different in different time segments. We instead apply a simpler approach, in which we choose a single
2634 L. Martignon, G. Deco, K. Laskey, M. Diamond, W. Freiwald, & E. Vaadia
interaction structure J rich enough to provide a reasonable fit to both segments of data. We compute separate posterior distributions hJ1 and hJ2 under the hypothesis that the two segments are governed by different distributions and a combined posterior distribution hJc that treats both segments as a single data set. Using a method described in detail in the appendix, we use these estimates to compute a posterior probability for the hypothesis that the two segments are governed by the same distribution.

5.3 A Heuristic for Detecting Temporal Patterns. The correct approach to modeling temporal structure is the Markov chain defined in equation 3.2. As currently implemented, our methods are unable to handle temporal correlation for even very small sets of neurons. In this section we describe a heuristic that can be useful as an approximation. We view a temporal pattern as a shifted synchrony. Suppose we observe that neuron 1 frequently fires exactly two bins after neuron 2. When do we declare this phenomenon significant? As a heuristic, we shift the data of neuron 2 two bins ahead of neuron 1 and neglect the first two bins of neuron 1 and the last two of neuron 2. We now have an artificially created parallel process for these two neurons such that the temporal pattern has become a synchrony. We declare a temporal pattern significant if the synchrony obtained by performing the corresponding shifts is significant. For clusters of three neurons, three-dimensional displays are easily obtained. For clusters with more than three neurons, graphical displays can still be provided by using projection techniques (Martignon, 1999). Figure 3 illustrates the histogram of the correlations between two neurons
Figure 3: Histogram of estimated interactions of two neurons (1,2 of Table 4).
recorded by Freiwald in the visual cortex of a macaque (see section 7.2). This histogram, as one would expect, is equivalent to the one obtained by means of the method proposed by Palm, Aertsen, and Gerstein (1988) based on the concept of surprise.

5.4 Discussion of the Bayesian Approach. The frequentist procedures described in section 4 are designed to address the question of whether a particular set A of neurons, fixed in advance of performing the test, exhibits an interaction, as evidenced by hA ≠ 0. The problem we face is different: we wish to discover which sets of neurons are involved in spatiotemporal patterns. The Bayesian approach is designed for just such problems. Our Bayesian procedure automatically tests all hypotheses in which hA ≠ 0 against all hypotheses in which hA = 0, for every A we consider. A major advantage for our application is that we have no need to choose a specific null hypothesis against which to test the alternative hypothesis each time. The Bayesian approach is often more conservative than the frequentist approach. We have sometimes obtained a low posterior probability that hA ≠ 0 even when a frequentist test rejects hA = 0. We consider this conservatism an advantage. Another advantage of the Bayesian approach is its ability to treat situations in which the data do not clearly distinguish between alternate models. Suppose each of two hypotheses, hA = 0 and hB = 0, can be rejected individually by a standard frequentist test, but neither can be rejected when the other nonzero effect is in the model. This might occur when the sets A and B overlap, the model containing only the interaction on A ∩ B is inadequate to explain the data, but the data are not sufficient to determine which neurons outside A ∩ B are involved in the interaction. Because the two models (hA ≠ 0, hB = 0) and (hA = 0, hB ≠ 0) are nonnested, constructing a frequentist test to determine which effect to include in the model is not straightforward.
The Bayesian mixture approach handles this situation naturally. A mixture model will assign a moderately high probability to the hypothesis that each effect individually is nonzero and a very small probability to the hypothesis that both are nonzero. The theoretical arguments in favor of the Bayesian approach amount to little unless the method is practical to apply. Until recently, computational constraints limited the application of Bayesian methods to very simple problems. Naive application of the model described is clearly intractable. With n neurons and considering only synchronicity with no temporal correlation, there are 2^(2^n) interaction structures to consider. Clearly it is infeasible to enumerate all of these for even moderately sized sets of neurons. We performed explicit enumeration for problems of four or fewer neurons. For larger sets, we sampled interaction structures using MC3. We have found that our MC3 approach discovers good structures after a few hundred iterations and that a few thousand iterations are sufficient to obtain accurate parameter estimates. However, the algorithm we currently use estimates the entire set of probabilities, one for each configuration, for each structure considered. The algorithm thus scales as 2^(nr) and becomes intractable for large numbers of neurons or long temporal windows.6 Our current implementation in Lisp-Stat (Tierney, 1990) on a 133 MHz Macintosh Power PC enumerates the 2048 interaction structures for four neurons in about 1 hour and samples 5000 structures on six neurons in about 8 hours. We reimplemented our sampling algorithm in C++ on a 300 MHz Pentium computer and achieved a speedup of more than an order of magnitude. Nevertheless, it is clearly necessary to develop faster estimation procedures. We are working on approximation methods that scale as a low-order polynomial in the number of neurons and the length of the time window. With such approximation methods, the Bayesian approach will become tractable for more complex problems. In this article our objective is to demonstrate the value of the methods on data sets for which our current approach is feasible. Future work will extend the reach of our methods to larger data sets and problems of temporal correlation. In summary, the Bayesian approach is naturally suited to simultaneous testing of multiple hypotheses. With appropriate choice of prior distribution, it handles simultaneous estimation of many parameters with less danger of overfitting than frequentist methods because of its bias toward models with a small number of adjustable parameters. It treats nonnested hypotheses with no special difficulty. It provides a natural way to account for situations in which the data indicate the existence of higher-order effects but are ambiguous about the specific neurons involved in those effects. The approach described here is thus quite attractive from a theoretical perspective, but it is currently tractable only for small sets of neurons exhibiting no temporal correlation.
Further work is needed to develop computationally tractable estimation procedures for larger sets of neurons exhibiting temporal correlation. 6 Applying the Models to Simulated Data
The first simulation we present describes four neurons modeled as integrate-and-fire units. Three neurons, labeled 1, 2, and 3, are excited by a fourth one, labeled 0, which receives an input of 2. The resolution is 1 msec and the simulation runs for 200,000 msec. The units' thresholds are 10 mV (reset of 0 mV), τ = 25 ms, and the refractory time is 1 msec. Gaussian noise is added to the potential (σ = 0.5 mV). Figure 4 describes three different situations.
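A minimal sketch of such a simulation, assuming the first situation of Figure 4 (neuron 0 drives neurons 1-3). The synaptic weight w is our own guess, since the article does not state it, and the run length is shortened for illustration:

```python
import random

def simulate(T=5000, dt=1.0, tau=25.0, v_th=10.0, v_reset=0.0,
             sigma=0.5, drive=2.0, w=3.0, seed=0):
    """Leaky integrate-and-fire sketch: neuron 0 receives a constant
    input (drive) and excites neurons 1-3 with weight w (hypothetical).
    With dt = 1 ms, the one-step reset stands in for the 1 ms refractory
    period. Returns the spike times (in steps) of each neuron."""
    rng = random.Random(seed)
    v = [0.0] * 4
    spikes = [[] for _ in range(4)]
    for t in range(int(T / dt)):
        spiked0 = False
        for i in range(4):
            inp = drive if i == 0 else 0.0
            # Euler step of the leaky membrane plus gaussian noise
            v[i] += dt * (-v[i] / tau + inp) + rng.gauss(0.0, sigma)
            if v[i] >= v_th:
                v[i] = v_reset
                spikes[i].append(t)
                if i == 0:
                    spiked0 = True
        if spiked0:  # neuron 0's spike depolarizes neurons 1-3
            for i in (1, 2, 3):
                v[i] += w
    return spikes
```

With these parameters, neuron 0 reaches threshold roughly every 6 ms (its noise-free steady state, τ·I = 50 mV, is well above threshold), and the resulting barrage recruits neurons 1-3.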
6 This intractability also plagues naive application of the frequentist tests described in section 4 above, which also rely on computing the entire vector of configuration probabilities for the null and alternative hypotheses.
Figure 4: Three connectivity situations.
In the first two graphs of Figure 4 (from left to right) we depict situations in which neuron 0 excites neurons 1, 2, and 3. In the second graph there are two additional interactions. In the third graph we depict a situation in which neuron 0 excites only one of the remaining three neurons and there are other interactions between neurons 1, 2, and 3. Data were generated from simulations of the three situations, and both the frequentist and the Bayesian methods were applied to the three sets of data. For data corresponding to the situation described by the first graph, the triplet (1,2,3) was detected as significant at the 0.0001 level by the frequentist method and was assigned a probability of 0.99 by the Bayesian method. The triplet was also significant at the 0.0001 level and had a probability of 0.97 for data corresponding to the second situation. In data corresponding to the third situation, neither method detected the triplet. To test the ability of our models to identify known interaction structures, we applied them to synthetically generated data. We generated data randomly from a four-neuron model of the form 3.3, where we specified J and the hA. Table 1 shows the hA for |A| ≥ 2 in the model we used to generate the data. With four neurons and with all the single-neuron effects constrained to be in the model, there are 11 effects to be estimated. Thus, there are 2^11 = 2048 different interaction structures to be considered. We were able to enumerate all interaction structures for the Bayesian model average, eliminating the need for the MCMC search. Table 1 shows the results of the Bayesian analysis for simulated segments of length 10,000 up to 640,000. It is interesting to note the increasing ability to detect interactions as the segment length increases, as well as the increasing accuracy of the estimates of the interaction strengths.
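Generating synthetic data of this kind is straightforward once configuration probabilities are computed from the model. A minimal sketch of the exponential-family form described in the text (the coefficient values below are illustrative, not those of Table 1):

```python
import math
import random
from itertools import product

def config_probs(n, h):
    """Configuration probabilities of a binary log-linear model: log p(x)
    is, up to normalization, the sum of h_A over subsets A whose neurons
    all fire in x. h: dict mapping frozensets of neuron indices to values."""
    configs = list(product([0, 1], repeat=n))
    w = [math.exp(sum(v for A, v in h.items() if all(x[i] for i in A)))
         for x in configs]
    Z = sum(w)
    return configs, [wi / Z for wi in w]

def sample_data(n, h, n_samples, seed=0):
    """Draw synthetic configurations from the model."""
    rng = random.Random(seed)
    configs, probs = config_probs(n, h)
    return rng.choices(configs, weights=probs, k=n_samples)

# Illustrative four-neuron model with one pair and one triplet effect.
h = {frozenset([0]): -1.0, frozenset([1]): -1.0, frozenset([2]): -1.0,
     frozenset([3]): -1.0, frozenset([2, 3]): 0.5, frozenset([0, 1, 2]): 0.3}
data = sample_data(4, h, 10000)
```

The positive pair coefficient on {2, 3} makes those two neurons fire together more often than independence would predict, which is the signature the detection methods look for.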
It is also interesting to note that in the third data set a 45% probability is assigned to an effect for A = {1, 3, 4}, a set for which hA = 0 but which overlaps considerably with sets for which there are true interactions. In general, it may be possible to determine that certain neurons are involved in interactions with certain other neurons even when it is not possible to determine precisely which other neurons are involved in specific interactions.
Table 1: Results from Simulated Synchronization Patterns.

                      10,000 Samples      40,000 Samples      160,000 Samples     640,000 Samples
Cluster A   hA        P(hA≠0)  MAP Est.   P(hA≠0)  MAP Est.   P(hA≠0)  MAP Est.   P(hA≠0)  MAP Est.
1,2         0         0.01     0.02       0.01     0.06       0.00     0.00       0.00     0.01
1,3         0.05      0.02     0.12       0.00     0.03       0.27     0.06       1.00     0.05
1,4         0.10      0.01     0.05       0.72     0.13       1.00     0.12       1.00     0.11
2,3         0         0.01    -0.03       0.00    -0.02       0.00    -0.01       0.00    -0.01
2,4         0.30      0.52     0.21       1.00     0.33       1.00     0.30       1.00     0.30
3,4         0.50      1.00     0.57       1.00     0.52       1.00     0.51       1.00     0.50
1,2,3       0.30      0.03     0.20       1.00     0.39       1.00     0.36       1.00     0.30
1,2,4       0         0.01     0.10       0.21     0.20       0.00     0.04       0.00     0.00
1,3,4       0         0.04     0.20       0.01     0.09       0.45     0.11       0.00     0.01
2,3,4       0         0.03     0.19       0.00    -0.03       0.00     0.12       0.00     0.01
1,2,3,4     0.20      0.10     0.43       0.02     0.18       0.03     0.12       1.00     0.21
7 Detecting Synchronization in Experimental Data

7.1 Recordings from the Frontal Cortex of Rhesus Monkeys. In this section we present results from fitting our model to data obtained by Vaadia and colleagues. The recording and behavioral methodologies were described in detail in Prut et al. (1998). Briefly, spike trains of several neurons were recorded in frontal cortical areas of rhesus monkeys. The monkeys were trained to localize a source of light and, after a delay, to touch the target from which the light was presented. At the beginning of each trial, the monkeys touched a "ready key," and then a central red light was turned on (ready signal). Three to 6 seconds later, a visual cue was given in the form of a 200 ms light blink coming from the left or the right. After a delay of 1 to 32 seconds, the color of the ready signal changed from red to yellow, and the monkeys had to release the ready key and touch the target from which the cue was given. The spikes of each neuron were encoded as discrete events by a sequence of zeros and ones with a time resolution of 1 millisecond. The activity of the whole group of six simultaneously recorded neurons was described as a sequence of configurations, or vectors, of these binary states. Since the method presented in this article for detecting synchronizations does not take nonstationarities into account, the data were selected by cutting segments of time during which the cells did not show significant modulations of their firing rate. This selection was performed by the experimenter by means of eyeballing heuristics. The segments were taken from 1000 ms before cue onset to 1000 ms thereafter. The duration of each segment was 2 seconds. We used 94 such segments (a total of 188 seconds) for the analysis presented
Table 2: Results for Preready Signal Data, Coefficients with Posterior Probability > 0.20.

Cluster A   G2      Posterior Probability of A   MAP Estimate of hA   Standard Deviation of hA
4,6         0.001   0.99                          0.49                0.11
3,4         0.232   0.18                          0.59                0.25
3,4,6       0.257   0.38                          0.78                0.29
2,3,4,5     0.13    0.48                          2.34                0.82
1,4,5,6     0.98    0.29                         -1.85                1.20
below. The data were then binned in time windows of 40 milliseconds. The choice of bin length was determined in discussion with the experimenter (see also Vaadia et al., 1995, for the rationale). The frequencies of configurations of zeros and ones in these windows are the data used for analysis in this article. We analyzed recordings prior to the ready signal separately from data recorded after the ready signal. Each of these data sets is assumed to consist of independent trials from a model of the form 2.20. Tables 2 and 3 present results from applying our method to these data sets. Likelihood ratio statistics (G2), posterior probabilities of nonzero effects, point estimates of effect magnitudes, and standard deviations are shown for all interactions with posterior probability of at least 20%. There was a high-probability second-order interaction in each data set: (4,6) in the preready data and (3,4) in the postready data. In the preready data, a fourth-order interaction (2,3,4,5) had posterior probability near 50% (nearly five times the prior probability of 0.1). Several data sets from this type of experiment were analyzed by means of the methods presented in this article under the guidance of the sixth author, who conducted the experiments. In certain cases we compared our results on pairwise correlations with results that the experimenter and coworkers had obtained by traditional methods, and, as we expected, our detected interactions coincided with theirs. With rare exceptions, interactions of order 2 or more were all characterized by positive effects.

Table 3: Results for Postready Signal Data: Effects with Posterior Probability > 0.20.

Cluster A   G2      Posterior Probability of A   MAP Estimate of hA   Standard Deviation of hA
3,4         0.03    0.95                          1.00                0.27
2,5         0.091   0.44                          1.06                0.36
1,4,6       0.752   0.26                          0.38                0.15
2,4,5,6     1.454   0.22                         -1.53                1.33
1,4,5,6     0.945   0.35                          1.12                0.46
Comparison of Tables 2 and 3 reveals some similarities and some important differences. The preready data show strong evidence for the (4,6) interaction. Although the posterior probability of a pair interaction between neurons 4 and 6 in the postready data was very small (0.03), there are several higher-order terms with moderately high posterior probability that involve these two neurons. Overall, we estimate a 99.9% probability that neurons 4 and 6 are involved in some interaction in the preready data and a 62.0% probability in the postready data. The postready data show strong evidence for the (3,4) interaction; this two-way interaction has probability only 18% in the preready data. Again, there are several moderately probable interactions in the preready data involving these two neurons. We estimate a 91.2% chance that neurons 3 and 4 are involved in some interaction in the preready data and a 99.0% chance in the postready data. It thus appears plausible that the pairs (4,6) and (3,4) are involved in interactions, possibly involving other neurons in the set, in both pre- and postready segments. We might speculate that a model involving pair interactions (4,6) and (3,4) is a plausible structure for both pre- and postready data. We might then ask whether the differences between the estimates in Tables 2 and 3 are merely artifacts or whether there are statistically detectable differences between pre- and postready data. To answer this question, we applied the test described in sections 5.2 and A.1.2. We fit a model containing all pair interactions to both pre- and postready data and to the combined data sets. We computed a posterior probability of only 1.7 × 10^-7 that both pre- and postready segments come from the same distribution, using a 50% prior probability that they came from the same distribution. Thus, there is extremely strong evidence of detectable differences between the pre- and postready segments.
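The flavor of this same-distribution comparison can be sketched with BIC-approximated marginal likelihoods standing in for the Laplace approximations actually used. All function names and the BIC shortcut are our own simplifications, not the article's method:

```python
import math

def p_same_distribution(loglik_combined, loglik_seg1, loglik_seg2,
                        k, n_combined, n1, n2, prior_same=0.5):
    """Posterior probability that two data segments share one parameter
    vector, with each marginal likelihood approximated by the BIC:
    log m ~ max log-likelihood - (k/2) log N, where k is the number of
    free parameters in the chosen interaction structure."""
    log_m_same = loglik_combined - 0.5 * k * math.log(n_combined)
    log_m_diff = (loglik_seg1 - 0.5 * k * math.log(n1)
                  + loglik_seg2 - 0.5 * k * math.log(n2))
    log_bf = log_m_same - log_m_diff  # Bayes factor for "same distribution"
    odds = (prior_same / (1 - prior_same)) * math.exp(log_bf)
    return odds / (1 + odds)
```

When the combined fit is nearly as good as the two separate fits, the extra k parameters of the "different" hypothesis are penalized and the "same" hypothesis wins; a large drop in combined log-likelihood pushes the posterior the other way.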
Both of the second-order interactions we detected are positive. Results of this type (that is, of varying second-order structure across the phases of the experiment) have been obtained previously by several groups of researchers (Grün, 1996; Abeles et al., 1993; Riehle et al., 1997) who used a frequentist method based on rejecting the null hypothesis that the distribution of the joint process is binomial. Both this method and that of Palm et al. (1988) are essentially consistent with the method presented here in the second-order case. A difference between the frequentist approach introduced by Palm et al. and ours is the test used for determining significance. We use G2, while Palm et al. use a Fisher significance test comparing the data with the binomial distribution determined by the success probabilities fixed by the first-order model.

7.2 Higher-Order Synchronization in the Visual Cortex of the Behaving Macaque. Precise patterns of neural activity of groups of neurons in the visual cortex of behaving macaques were detected by analyzing multiunit recordings obtained by Freiwald. The experiment consisted of a simple fixation task: the monkey was supposed to learn to
detect a change in the intensity of a fixated light point on a screen and react with a motor response. The fixation task paradigm is the following: A fixation point appears on the screen and stays there for 3 seconds. The monkey fixates this point and then presses a bar. After the bar is pressed, an interval of approximately 5 seconds begins, during which the monkey fixates the point and continues to press the bar. After 5 seconds, the point darkens and disappears from the screen, and the monkey releases the bar. During the described interval, a short visual stimulus is presented on the screen that is not relevant for behavior. Its function is to teach the monkey that the disappearance of the fixation point can occur only after the stimulus presentation; thus, the monkey is not forced to concentrate totally on the fixation point. The data presented here were obtained by recording the activity of four neurons in the inferotemporal cortex (IT) on repeated trials of 5975 ms each. We decomposed each trial into a first chunk of 1000 ms, considered stationary, a nonstationary interval of 800 ms, and then four segments of 1000 ms each, followed by a short last segment of 175 ms. We chose a bin width of 10 ms. We present a comparison between the activity during the first fixation segment and the activity of the segments after the stimulus. We performed tests of synchronization with both the frequentist and the Bayesian methods and obtained consistent results. The results of the Bayesian method are displayed in Tables 4 and 5. We first performed pairwise comparisons of fully saturated models for the four poststimulus segments, as described in sections 5.2 and A.1.2, and determined that none of these four segments was statistically distinguishable from the others. We therefore combined these segments into a single data set. We then compared the fixation phase with the four poststimulus segments and found that it differed from them.
We therefore show the results from the fixation phase in Table 4 and the combined poststimulus phase in Table 5. Table 4 exhibits extremely strong excitatory couplings for all pairs of neurons and a few triplets. The comparison of these interactions with those of the poststimulus segments (see Table 5) is quite interesting. All pairs of neurons exhibit positive pair interactions in both data sets, and the magnitudes of most of the interaction strengths are similar. However, the (1,4) interaction is nearly twice as strong in the fixation phase as in the poststimulus phase, and the (1,2) interaction is larger in the fixation phase by a factor of about 1.5. The three-way interaction found in the fixation phase also appears in the poststimulus phase. The poststimulus data show two additional three-way interactions that are improbable in the fixation phase.
7.3 Neuronal Interactions in Anesthetized Rat Somatosensory Cortex.
We also applied the models to data collected from the vibrissal region of the rat cortex. The vibrissae are arranged in five rows on the snout (named A, B, C, D, and E) with four to seven vibrissae in each row. Neurons in the vibrissal area of the somatosensory cortex are organized into columns; layer
Table 4: Synchronies with Probability Larger Than 0.1 in the First 1000 ms of the Fixation Task.

Cluster A   Posterior Probability of A   MAP Estimate of hA   Standard Deviation of hA
3,4         0.99                          0.69                0.15
2,4         1.00                          0.56                0.11
2,3         1.00                          1.54                0.15
1,4         1.00                          0.70                0.09
1,3         0.99                          1.72                0.12
1,2         0.99                          1.45                0.09
2,3,4       0.22                          0.66                0.28
1,3,4       0.03                         -0.34                0.25
1,2,3       0.99                         -1.26                0.23
IV of the column is composed of a tightly packed cluster of neurons called a barrel (Woolsey & Van der Loos, 1970). Each barrel column is topologically related to a single mystacial vibrissa on the contralateral snout; thus vibrissa D2 (second whisker in row D) is linked to column D2 (Welker, 1971). Our wish was to find out whether interacting neurons are distributed across cortical columns or restricted to single columns. Furthermore, we examined whether the spatial distribution of interacting neurons varies as a function of the stimulus site. The data analyzed here were recorded from the cortex of a rat lightly anesthetized with urethane (1.5 g/kg) and held in a stereotaxic apparatus. Cortical neuronal activity was recorded through an array of six tungsten microelectrodes inserted in cortical columns D1, D2, D3, and C3, as well as in the septum located between columns D1 and D2 (see Figure 5). Using spike-sorting methods, events from 15 separate neurons were discriminated.

Table 5: Synchronies with Probability Larger Than 0.1 in the 4000 ms After the Visual Stimulus.

Cluster A   Posterior Probability of A   MAP Estimate of hA   Standard Deviation of hA
3,4         1.00                          0.70                0.10
2,4         1.00                          0.53                0.05
2,3         1.00                          1.51                0.06
1,4         1.00                          0.39                0.06
1,3         1.00                          1.45                0.06
1,2         1.00                          0.94                0.04
2,3,4       0.99                         -0.59                0.12
1,3,4       0.85                         -0.43                0.12
1,2,3       1.00                         -0.68                0.09
Figure 5: Seven neurons located between and in four barrels corresponding to four vibrissae.
We analyzed data for 7 of these 15 neurons; their locations are shown in Figure 5. During the recording session, individual whiskers were deflected across 50 to 55 trials by a computer-controlled piezoelectric wafer that delivered an up-down movement lasting 100 ms; the stimulus was repeated once per second (Diamond, Armstrong-James, & Ebner, 1993). We restricted the analysis to data collected during stimulation of vibrissae D1, D2, and D3 and only during the 200 ms peristimulus period (100 ms after the up movement and 100 ms after the down movement). For each stimulus site, then, there were 10,000 to 11,000 ms of data. As in the previous section, the spiking events (at 1 ms resolution) of each neuron were encoded as a sequence of zeros and ones, and the activity of the whole group was described as a sequence of configurations, or vectors, of these binary states. The interactions illustrated in Table 6 are those for which the likelihood ratio test statistic is significant at the 0.0001 level. We report the probabilities of these interactions obtained by means of the Bayesian approach and the estimated hA. Table 6 shows the interactions occurring during stimulation of whiskers D1, D2, and D3. From analysis of this limited data set, several interesting observations can be made: (1) only positive interactions were detected; (2) interactions occurred both within a cortical columnar unit and across columns; (3) interactions were stimulus dependent (neurons were not continuously correlated with one another but became grouped together as a function of the stimulus parameter); and (4) during different stimulus conditions, a single neuron may take part in different groups.

8 Discussion
This article proposes a parameterization of synchronization and, more generally, of temporal correlation of the neural activation of more than two neurons. This
Table 6: Interactions Significant at the 0.0001 Level with High Probability of Occurring.

            D1                     D2                     D3
Cluster     P(hA≠0)   Est. hA      P(hA≠0)   Est. hA      P(hA≠0)   Est. hA
4,5,6       0.63      3.52         —         —            —         —
4,7         —         —            0.68      2.19         —         —
3,4,6       —         —            0.75      3.50         —         —
2,7         —         —            0.76      2.22         —         —
2,5         —         —            0.52      1.55         —         —
1,4,5       —         —            0.54      3.01         —         —
3,7         —         —            —         —            0.58      1.55

Note: Blanks correspond to low probabilities and high significance levels.
parameterization has a sound information-theoretic justification. Recent work by Amari (1999) on the information geometry of hierarchical decompositions of stochastic interactions provides mathematical support for our thesis that the effects of log-linear models are the adequate parameterization of pure higher-order interactions. If the problem is to analyze a group of neurons and establish whether they fire in synchrony, the statistical problem of estimating the significance of the corresponding effect being nonzero has a straightforward, reliable treatment. If the problem is to detect the whole interaction structure in a group of neurons (i.e., which subsets have interactions and how strong they are), complexity becomes an inevitable feature of any candidate treatment. We present both a frequentist and a Bayesian approach to the problem and argue that the Bayesian approach has major advantages over the frequentist approach: it is naturally suited to simultaneous estimation of many parameters and simultaneous testing of multiple hypotheses. Yet the Bayesian approach is of great complexity when the temporal interaction structure of, say, eight neurons is to be assessed. Even the heuristic shift method briefly discussed in section 5.3 becomes very complex for the Bayesian approach in the form described here. New techniques for speeding up the Bayesian search across structures are presented in a forthcoming article. An important underlying assumption in our investigations has been that the data are basically stationary. This is seldom the case. As we know from the important work by Abeles et al. (1995), cortical activity flips from one quasistationary state to the next. These authors developed a hidden Markov model technique to detect these flips. A combination of their methods with the ones presented here would allow the analysis of structure changes across quasistationary segments.
Another advance in the global structure analysis of spike activity has been obtained by de Sa, deCharms, and Merzenich (1997) by means of a Helmholtz machine that determines highly probable patterns of activation
based on spike train data. Here again a combination of methods would allow analysis of the correlation structure in probable patterns. Our reason for presenting both the frequentist and the Bayesian approach to detecting correlation structure is twofold. On the one hand, we wish to reinforce the use of the frequentist method, in spite of all its shortcomings, because it is quick and can be used for large numbers of neurons and long time delays. Our experience with the systematic comparison of the two methods has enhanced our confidence that a careful frequentist strategy is reliable. On the other hand, we promote the Bayesian approach, which, from the point of view of experimenters, may seem cumbersome or too complex but which, in the long run, will probably become an accepted tool. As a final remark, let us observe that higher-order interactions did not seem to be frequent in the empirical data analyzed in this article.

Appendix

A.1 The Constrained Perturbation Procedure. In this appendix we construct the null hypotheses for the tests proposed in section X. Assume that p is a probability distribution defined on the configurations of a set of neurons. Consider the following problem: Find a distribution p* such that the marginals of p* coincide with those of p and p* maximizes entropy among all distributions whose marginals of order less than |A| coincide with those of p. It can be proved (Whittaker, 1989) that there is a unique distribution p** on the configurations of A whose marginals of order less than |A| coincide with those of p and such that h**_A = 0, where h**_A is the coefficient corresponding to A in the log expansion of p**. Let us combine this fact with Good's (1963) famous theorem, which states that the distribution p* maximizing entropy in the manifold of distributions with the same marginals of order less than |A| as p necessarily satisfies h*_A = 0. We conclude that the distribution p** has to coincide with p*.
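The construction sketched next can be implemented directly. A sketch for three neurons with our own function names; we use bisection rather than Newton's method, which is valid because the third-order coefficient is strictly decreasing in the perturbation (its derivative is minus the sum of the reciprocals of the perturbed probabilities):

```python
import math
from itertools import product

def cpp(p, tol=1e-12):
    """Constrained perturbation procedure for three binary neurons:
    perturb p by p*(chi_B) = p(chi_B) + (-1)**|B| * Delta (with |B| the
    number of ones in the configuration), which preserves all order-2
    marginals, and choose Delta so that the third-order coefficient
    h*_{123} vanishes. p: dict mapping (x1, x2, x3) to probabilities."""
    def h123(delta):
        # alternating sum of log-probabilities = third-order coefficient
        return sum((-1) ** (3 - sum(x)) * math.log(p[x] + (-1) ** sum(x) * delta)
                   for x in product([0, 1], repeat=3))
    # bracket Delta so every perturbed probability stays strictly positive
    lo = -min(p[x] for x in p if sum(x) % 2 == 0) + 1e-12
    hi = min(p[x] for x in p if sum(x) % 2 == 1) - 1e-12
    # h123 is strictly decreasing in Delta, so bisect for its root
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h123(mid) > 0:
            lo = mid
        else:
            hi = mid
    d = 0.5 * (lo + hi)
    return {x: p[x] + (-1) ** sum(x) * d for x in p}
```

Because the signed perturbations over all eight configurations sum to zero, the result remains a probability distribution, and every pairwise marginal is untouched.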
We sketch the construction of such a distribution p* for the simple case of three neurons. Suppose that A = {1, 2, 3} and that p is a strictly positive distribution on the configurations of A. We want to construct the distribution p* that maximizes entropy among all those consistent with the marginals of order 2, denoted by Marg({{1, 2}, {1, 3}, {2, 3}}). Define
p*(1, 1, 1) = p(1, 1, 1) − Δ
p*(1, 1, 0) = p(1, 1, 0) + Δ
p*(1, 0, 1) = p(1, 0, 1) + Δ
p*(1, 0, 0) = p(1, 0, 0) − Δ
p*(0, 1, 1) = p(0, 1, 1) + Δ
2646 L. Martignon, G. Deco, K. Laskey, M. Diamond, W. Freiwald, & E. Vaadia
p*(0, 1, 0) = p(0, 1, 0) − Δ
p*(0, 0, 1) = p(0, 0, 1) − Δ
p*(0, 0, 0) = 1 − Σ_{B ⊆ {1,2,3}, B ≠ ∅} p*(χ_B).

Let us compute the marginal p*(x1 = 1, x2 = 1). We have p*(x1 = 1, x2 = 1) = p(1, 1, 1) − Δ + p(1, 1, 0) + Δ = p(1, 1, 1) + p(1, 1, 0) = p(x1 = 1, x2 = 1). Now we have to solve for Δ in the equation h*_A = 0. Observe that our construction contains only one step where a numerical approximation becomes necessary: solving h*_A = 0 in terms of Δ. This can be done by using any one-dimensional procedure such as Newton's approximation method. Here Newton's method has to be used in combination with the condition that the perturbation Δ leaves all candidate solutions valid probability distributions, that is, with values between 0 and 1.

Let us now sketch the construction in the general case of N neurons. If B is a nonempty subset of A, denote by χ_B the configuration that has a component 1 for every index in B and 0 elsewhere. For each nonempty subset B, define p*(χ_B) = p(χ_B) + (−1)^{|B|} Δ, where, again, Δ is to be determined by solving h*_A ≡ 0. Every marginal of order N − 1 is the sum of the probabilities of two configurations that coincide on N − 1 entries and differ on the remaining one. Thus, the sign of Δ will be different in the two summands, and Δ will cancel out. Therefore, each marginal of p* will coincide with the corresponding marginal of p, as condition 1 requires.

A.2 Bayesian Estimation and Hypothesis Testing Methods.
A.2.1 Inferring Structure and Parameters from Observations. We are concerned with the problem of inferring the interaction structure J and interaction strengths h_J from a sample x1, …, xN of N independent observations from the distribution p(x | h_J, J) given by equation 3.3. Initial information about h_J prior to obtaining the data is expressed as a prior probability distribution over structures and interaction strengths. We use a mixed discrete-continuous prior distribution denoted by g(h_J | J) p_J, where p_J is the prior probability of structure J and g(h_J | J) is a continuous density function for h_J. We chose a prior distribution that encodes the assumptions that all singleton effects are nonzero, most effects of order 2 or greater are zero, and most nonzero effects are not too large. Specifically, the prior distribution assumes all singleton effects are included with certainty, and each interaction A of order 2 or greater is included with probability 0.1, independent of the other interactions.⁷ Thus, an interaction structure J that includes all
7 We varied the prior probability between 0.025 and 0.40. As expected, extreme posterior probabilities remained extreme. Manipulation of the prior probabilities had the most effect when the evidence for or against the presence of an interaction was not overwhelming (i.e., posterior probabilities roughly in the range between 0.1 and 0.9).
singleton sets and k interactions of order 2 or higher has prior probability p_J = 10^(−k). Conditional on an interaction structure J, the interaction strength h_A corresponding to any absent interaction A ∉ J is assumed to be identically zero. We assume that the nonzero h_A are independent of each other and are normally distributed with zero mean and standard deviation σ = 2. This prior distribution encodes a prior assumption that the h_A are symmetrically distributed about zero; that is, excitatory and inhibitory interactions are equally likely. The standard deviation σ = 2 (that is, most h_A lie between −4 and 4) is based on previous experience applying this class of models (Martignon et al., 1995) to other data, as well as discussions with neuroscientists. We initially regarded the symmetry of the normal distribution as unrealistic because we expected most interaction parameters to be positive. However, we felt that the convenience of the normal distribution outweighed this disadvantage. We found in our data analyses that many estimated interactions, especially those of order higher than 2, were negative (we attribute these negative higher-order interactions to overlapping pair interactions with external neurons). The joint probability of the observed set of configurations for a fixed interaction structure, viewed as a function of h_J, is called the likelihood function for h_J given J. From equation 3.3, it follows that the likelihood function can be written

L(h_J) = p(x1 | h_J, J) ··· p(xN | h_J, J) = exp( N Σ_{A ∈ J} h_A t̄_A ),   (A.1)
where t̄_A = (1/N) Σ_i t_A(x_i) is the frequency of simultaneous activation of the nodes in the set A and is referred to as the marginal frequency for A. The random vector⁸ T̄ of marginal frequencies is called a sufficient statistic for h_J because the likelihood function, and therefore the posterior distribution of h_J, depends on the data only through T̄.⁹ The expected value of T̄_A is the probability that t_A = 1 and is simply the marginal function for the set A. The posterior distribution is also a mixed discrete-continuous distribution g*(h_J | J) p*_J. The posterior density for h_J conditional on structure J is given by

g(h_J | J, x1, …, xN) = K p(x1 | h_J, J) ··· p(xN | h_J, J) g(h_J | J),   (A.2)

8 We use the standard convention of denoting random variables by uppercase letters and their observed values by lowercase letters. We denote vectors by underscores.
9 The sufficient statistic T̄ contains T̄_A only for those clusters A ∈ J. For notational simplicity, the dependence of T̄ on the structure J is suppressed.
where the constant of proportionality K is chosen so that equation A.2 integrates to 1:

K = ( ∫ p(x1 | h_J, J) ··· p(xN | h_J, J) g(h_J | J) dh_J )^(−1).   (A.3)

In this expression, the range of integration is over h_{J′}, where J′ = {A : A ∈ J, A ≠ ∅}. That is, the joint density is integrated only over those parameters that vary independently. Because the probabilities are constrained to sum to 1, h_∅ is a function of the other h_A:

h_∅ = −log( Σ_x exp( Σ_{A ∈ J, A ≠ ∅} h_A t_A(x) ) ).   (A.4)
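For a small group of neurons, the normalizer h_∅ of equation A.4 can be evaluated by brute force over all 2^n configurations. The following sketch is our own illustration in Python (the authors worked in Lisp-Stat); the cluster statistic t_A(x) is taken to be 1 exactly when every neuron in the cluster A fires:

```python
import itertools
import math

def config_stat(x, A):
    """t_A(x) = 1 if all neurons indexed by A fire in configuration x, else 0."""
    return int(all(x[i] == 1 for i in A))

def h_empty(h, n):
    """Normalizing coefficient h_emptyset of equation A.4 for n binary neurons.
    `h` maps each nonempty cluster (a tuple of neuron indices) to its
    interaction strength h_A; clusters absent from `h` have strength zero."""
    total = sum(
        math.exp(sum(hA * config_stat(x, A) for A, hA in h.items()))
        for x in itertools.product([0, 1], repeat=n)
    )
    return -math.log(total)
```

With all h_A = 0 the model is uniform over the 2^n configurations, so h_∅ reduces to −log 2^n, which gives a quick sanity check on the enumeration.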
A.2.2 Approximating the Posterior Distribution. Using data to infer the distribution involves two subproblems: (1) given an interaction structure J, infer the parameters h_A for A ∈ J, and (2) determine the posterior probability of an interaction structure J. Neither of these subproblems has a tractable closed-form solution. We discuss approximation methods for each in turn. Options for approximating the structure-specific posterior density function g*(h_J | J) = g(h_J | J, x1, …, xN) include analytical or sampling-based methods. Because we must estimate posterior distributions for a large number of structures, we expected Monte Carlo methods to be infeasibly slow. We therefore chose to use the standard large-sample normal approximation. The approximate posterior mean is given by the mode ĥ_J of the posterior distribution. We found the posterior mode by using Newton's method to maximize the logarithm of the joint mass-density function:

ĥ_J = argmax_{h_J} log{ p(x1 | h_J, J) ··· p(xN | h_J, J) g(h_J | J) }.   (A.5)
When we used arbitrary starting values, we encountered occasional instabilities. This problem was alleviated by running an iterative proportional fitting algorithm on a "pseudosample" consisting of a small, positive constant added to the sample counts (the positive constant bounded the starting probabilities away from zero for configurations with no observations). We then iterated using Newton's method to find the maximum a posteriori estimate. The posterior covariance is approximated by the inverse of the Hessian matrix of second partial derivatives evaluated at the maximum a posteriori estimate. This matrix can be obtained in closed form:

Σ̂_J = [ D²_h log{ p(x1 | h_J, J) ··· p(xN | h_J, J) g(h_J | J) } ]^(−1).   (A.6)
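Reduced to a single interaction strength, the mode-finding step of equation A.5 is ordinary one-dimensional Newton iteration. The sketch below is our own illustration with a hypothetical toy log posterior (a gaussian likelihood term plus the N(0, s²) prior of the text); in the actual model the gradient and Hessian would come from equation 3.3:

```python
def newton_map(grad, hess, h0=0.0, tol=1e-10, max_iter=100):
    """Newton iteration for a one-parameter posterior mode (equation A.5):
    repeatedly maximize the local quadratic approximation of the log posterior."""
    h = h0
    for _ in range(max_iter):
        step = grad(h) / hess(h)
        h -= step
        if abs(step) < tol:
            break
    return h

# Toy log posterior l(h) = -(h - m)^2/2 - h^2/(2*s**2), whose mode is
# m * s^2 / (s^2 + 1); here m and s are illustrative values only.
m, s = 1.0, 2.0
mode = newton_map(grad=lambda h: -(h - m) - h / s**2,
                  hess=lambda h: -1.0 - 1.0 / s**2)
```

Because this toy log posterior is exactly quadratic, Newton's method converges in a single step; for the true log posterior the iteration, started from the iterative-proportional-fitting values as described above, takes several steps.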
The posterior probability of a structure J is given by

p*_J = p(J | x) ∝ p(x | J) p_J,   (A.7)

where the first term on the right is obtained by integrating h_J out of the joint parameter/data density function:

p(x | J) = ∫ p(x1 | h_J, J) ··· p(xN | h_J, J) g(h_J | J) dh_J.   (A.8)
There is no closed-form solution for this integral. We use Laplace's method (Tierney & Kadane, 1986; Kass & Raftery, 1995), which again relies on the normal approximation to the posterior distribution. Assuming that p(x | h_J, J) is approximately normal with mean A.5 and covariance A.6, and integrating the resulting normal density as in A.8, we obtain

p(x | J) ≈ (2π)^(d/2) |Σ̂_J|^(1/2) p(x1 | ĥ_J, J) ··· p(xN | ĥ_J, J) g(ĥ_J | J).   (A.9)
The normal approximation is accurate for large sample sizes (De Groot, 1970). The posterior density function is always unimodal, and we have observed that the conditional density for h_A given the other h's is not too asymmetric even when the corresponding marginal frequency T̄_A is small. We have also found that very small frequencies typically give rise to very small posterior probabilities of the associated effects, in which case the accuracy of the approximation is not crucial. Nevertheless, the quality of the Laplace approximation is a question worthy of further investigation. Equation A.7 gives the posterior probability of a structure only up to a proportionality constant. The posterior probability is obtained by computing equation A.7 for all structures and then normalizing. The total number of structures grows doubly exponentially with the number of neurons k. For more than four neurons, this number is much too large to enumerate explicitly. For data sets with more than four neurons, we used an MC³ algorithm to search over structures (Madigan & York, 1993). The algorithm works as follows:

1. Begin at an initial structure J. Compute p(x | J) p_J.¹⁰

2. Nominate a candidate interaction A to add to or delete from J to obtain a new structure J′. The candidate is nominated with probability r(A | J).¹¹

10 To avoid numeric underflow, we compute the logarithm of the required probability. However, for clarity of exposition, we present the algorithm in terms of probabilities.
11 To improve acceptance probabilities, we modified the distribution r(A | J) as sampling progressed. The sampling probabilities were bounded away from zero to ensure ergodicity.
3. Compute p(x | J′) p_{J′}.

4. Accept the candidate structure J′ with probability

p_Accept = min{ 1, [ p(x | J′) p_{J′} r(A | J′) ] / [ p(x | J) p_J r(A | J) ] }.   (A.10)
5. Unless the stopping criterion is met, return to step 2; else stop.

We discarded a burn-in run of 500 samples (quite sufficient to move the sampler into a range of high-probability models on the data sets we tried) and based our estimates on a run of 5000 samples. We implemented our algorithm in Lisp-Stat running on a 133 MHz Macintosh PowerBook. For the six-neuron data sets described in section 7, a run of 5000 samples took about 8 hours. This was reduced to about 2 hours on a 266 MHz PowerBook G3. We computed posterior probabilities for each effect as the frequency of occurrence of that effect in the sample run. We estimate the conditional expected value as

ĥ_A = (1 / #[A ∈ J]) Σ_{J: A ∈ J} ĥ_{A|J},   (A.11)

where the structure-specific estimate ĥ_{A|J} is obtained from equation A.5. Similarly, we compute a variance estimate as the sum of within- and between-structure components. The within-structure variance component is

σ̂²_{A,w} = (1 / #[A ∈ J]) Σ_{J: A ∈ J} σ̂²_{A|J},   (A.12)

where the summands of equation A.12 are the appropriate diagonal elements from equation A.6. The between-structure component is given by

σ̂²_{A,b} = (1 / #[A ∈ J]) Σ_{J: A ∈ J} ( ĥ_{A|J} − ĥ_A )².   (A.13)
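The MC³ search and the frequency-based effect probabilities can be sketched as follows. This is our own Python illustration rather than the authors' Lisp-Stat implementation; `log_score` is a hypothetical stand-in for log p(x | J) + log p_J, and the proposal here picks a cluster uniformly, so the nomination probabilities r(· | ·) cancel in the acceptance ratio of equation A.10:

```python
import math
import random

def mc3(log_score, clusters, n_samples=5000, burn_in=500, seed=0):
    """Metropolis search over interaction structures (MC^3, Madigan & York):
    propose adding or deleting one interaction, accept with probability
    min(1, posterior ratio); a structure is a frozenset of clusters."""
    rng = random.Random(seed)
    J = frozenset()
    samples = []
    for step in range(burn_in + n_samples):
        A = rng.choice(clusters)                      # symmetric proposal
        J_new = J - {A} if A in J else J | {A}
        # Work in logs to avoid numeric underflow (cf. footnote 10).
        if math.log(rng.random()) < log_score(J_new) - log_score(J):
            J = J_new
        if step >= burn_in:
            samples.append(J)
    # Posterior probability of each effect = frequency of occurrence in the run.
    return {A: sum(A in J for J in samples) / len(samples) for A in clusters}
```

With a score that strongly favors one interaction, that interaction appears in nearly every sampled structure, while a score-neutral interaction is present about half the time.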
A.2.3 Detecting Changes Between Different Time Segments. Suppose we are concerned with detecting whether an environmental stimulus produces detectable changes in neuronal activity. We might examine this question by comparing models of the form of equation 3.3 for segments recorded before and after occurrence of the stimulus. Of course, estimates for the two segments will differ. The question of interest is whether these differences can be explained by random noise or whether they represent real differences in activation patterns that can be associated with the stimulus. One way to address this question would be to construct a supermodel encompassing both pre- and post-ready segments and extend the methods
described to include hypothesis tests for which interactions were the same as or different for the two data sets. We took a simpler approach, which could be performed using the models we have described. Suppose we have two segments of data, labeled x1 and x2, of the same neurons recorded at different times. We begin by fitting models of the form of equation 3.3 for each data set individually and for the combined data set. We then select a model structure J that includes effects for any set A that had high probability in either an individual model or the combined model. We fit this single structure for the individual and combined data sets. We then compute predictive probabilities as in equation A.9 for each of the three data sets. We denote these by p_C(x1, x2 | J) for the combined model, p_1(x1 | J) for the first segment, and p_2(x2 | J) for the second segment. Assuming that the individual and combined models are equally likely a priori, the posterior odds are computed using the formula

p_Combined / p_Separate = p_C(x1, x2 | J) / ( p_1(x1 | J) p_2(x2 | J) ).   (A.14)
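Under equal prior odds, the ratio of equation A.14 converts directly into a posterior probability for the combined model. A minimal sketch (our illustration, with the separate-segment evidence taken as the product of the two segment evidences), working in logs to avoid the numeric underflow noted earlier:

```python
import math

def p_combined(log_pC, log_p1, log_p2):
    """Posterior probability that one model explains both segments, assuming
    equal prior odds: BF = pC(x1,x2|J) / (p1(x1|J) * p2(x2|J)) as in
    equation A.14, and p(Combined | data) = BF / (1 + BF)."""
    log_bf = log_pC - (log_p1 + log_p2)
    return 1.0 / (1.0 + math.exp(-log_bf))
```

For instance, if the combined model predicts both segments as well as each separate model predicts its own segment, the Bayes factor favors the combined model simply because it spends no probability on a second set of parameters.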
Acknowledgments
We gratefully acknowledge the anonymous reviewers of an earlier version of this article for useful comments that have greatly improved the article. K. L. was supported by a Career Development Fellowship with the Krasnow Institute for the Study of Cognitive Science at George Mason University. The experiments by W. F. were conducted at the Max Planck Institute for Brain Research in Frankfurt/Main together with Wolf Singer and Andreas K. Kreiter. This work was supported by HFSP grant RG-20/95 B and SFB 517.

References

Abeles, M. (1991). Corticonics: Neural circuits of the cerebral cortex. Cambridge: Cambridge University Press.
Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidenmann, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi-stationary states. Proc. Natl. Acad. Sci., 92, 8616–8620.
Abeles, M., Bergman, H., Margalit, E., & Vaadia, E. (1993). Spatiotemporal firing patterns in the frontal cortex of behaving monkeys. Journal of Neurophysiology, 70, 1629–1638.
Abeles, M., & Gerstein, G. (1988). Detecting spatiotemporal firing patterns among simultaneously recorded single neurons. J. Neurophysiol., 60, 909–924.
Abeles, M., Prut, Y., Bergman, H., & Vaadia, E. (1994). Synchronization in neural transmission and its importance for information processing. Prog. Brain Res., 102, 395–404.
Aertsen, A. (1992). Detecting triple activity. Preprint.
Amari, S. (1999). Information geometry on hierarchical decomposition of stochastic interactions. Preprint.
Bishop, Y., Fienberg, S., & Holland, P. (1975). Discrete multivariate analysis. Cambridge, MA: MIT Press.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Deco, G., Martignon, L., & Laskey, K. (1998). Neural coding: Measures for synfire detection. In W. Gerstner (Ed.), Advances in artificial neural networks (pp. 78–83). New York: Springer-Verlag.
De Groot, M. H. (1970). Optimal statistical decisions. New York: McGraw-Hill.
de Sa', R., de Charms, R., & Merzenich, M. (1997). Using Helmholtz machines to analyze multi-channel neuronal recordings. Neural Information Processing Systems, 10, 131–136.
Diamond, M. E., Armstrong-Jones, M. A., & Ebner, F. F. (1993). Experience-dependent plasticity in the barrel cortex of adult rats. Proceedings of the National Academy of Sciences, 90, 2602–2606.
Gerstein, G., Perkel, D., & Dayhoff, J. (1985). Cooperative firing activity in simultaneously recorded populations of neurons: Detection and measurement. Journal of Neuroscience, 5, 881–889.
Good, I. (1963). On the population frequencies of species and the estimation of population parameters. Biometrika, 40, 237–264.
Gray, C. M., König, P., Engel, A. K., & Singer, W. (1989). Oscillatory responses in cat visual cortex exhibit inter-columnar synchronization which reflects global stimulus properties. Nature, 338, 334–337.
Grün, S. (1996). Unitary joint-events in multiple-neuron spiking activity: Detection, significance and interpretation. Frankfurt: Verlag Harry Deutsch.
Grün, S., Aertsen, A., Vaadia, E., & Riehle, A. (1995). Behavior-related unitary events in cortical activity. Poster presentation at Learning and Memory, 23rd Göttingen Neurobiology Conference.
Hebb, D. (1949). The organization of behavior. New York: Wiley.
Kass, R., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795.
Madigan, D., & York, J. (1993).
Bayesian graphical models for discrete data (Tech. Rep. No. 239). Seattle: Department of Statistics, University of Washington.
Martignon, L. (1999). Visual representation of spatio-temporal patterns in neural activation (Tech. Rep.). Berlin: Max Planck Institute for Human Development.
Martignon, L., von Hasseln, H., Grün, S., Aertsen, A., & Palm, G. (1995). Detecting higher-order interactions among the spiking events of a group of neurons. Biol. Cyb., 73, 69–81.
Neyman, J., & Pearson, E. (1928). On the use and the interpretation of certain test criteria for purposes of statistical inference. Biometrika, 20, 175–240, 263–294.
Palm, G., Aertsen, A., & Gerstein, G. (1988). On the significance of correlations among neuronal spike trains. Biol. Cyb., 59, 1–11.
Prut, Y., Vaadia, E., Bergman, H., Haalman, I., Slovin, H., & Abeles, M. (1998). Spatiotemporal structure of cortical activity: Properties and behavioral relevance. Journal of Neurophysiology, 79, 2857–2874.
Rauschecker, J. (1991). Mechanisms of visual plasticity: Hebb synapses, NMDA receptors and beyond. Physiol. Rev., 71, 587–615.
Riehle, A., Grün, S., Diesmann, M., & Aertsen, A. (1997). Spike synchronization and rate modulation differentially involved in motor cortical function. Science, 278, 1950–1953.
Singer, W. (1994). Time as coding space in neocortical processing: A hypothesis. In G. Buzsáki et al. (Eds.), Temporal coding in the brain. Berlin: Springer-Verlag.
Smith, A. F. M., & Spiegelhalter, D. J. (1980). Bayes factors and choice criteria for linear models. Journal of the Royal Statistical Society, Series B, 42, 213–220.
Tierney, L. (1990). Lisp-Stat. New York: Wiley.
Tierney, L., & Kadane, J. (1986). Accurate approximations of posterior moments and marginal densities. Journal of the American Statistical Association, 81, 82–86.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioral events. Nature, 373, 515–518.
Welker, C. (1971). Microelectrode delineation of fine grain somatotopic organization of SmI cerebral neocortex in albino rat. Brain Res., 26, 259–275.
Whittaker, J. (1989). Graphical models in applied multivariate statistics. New York: Wiley.
Woolsey, T. A., & Van der Loos, H. (1970). The structural organization of layer IV in the somatosensory region (SI) of mouse cerebral cortex. Brain Res., 17, 205–242.

Received August 7, 1997; accepted February 16, 2000.
LETTER
Communicated by Radford Neal
Gaussian Processes for Classification: Mean-Field Algorithms

Manfred Opper
Neural Computing Research Group, Department of Computer Science and Applied Mathematics, Aston University, Birmingham B4 7ET, U.K.

Ole Winther
Theoretical Physics, Lund University, S-223 62 Lund, Sweden, and CONNECT, Niels Bohr Institute, 2100 Copenhagen Ø, Denmark

We derive a mean-field algorithm for binary classification with gaussian processes that is based on the TAP approach originally proposed in the statistical physics of disordered systems. The theory also yields an approximate leave-one-out estimator for the generalization error, which is computed with no extra computational cost. We show that from the TAP approach, it is possible to derive both a simpler "naive" mean-field theory and support vector machines (SVMs) as limiting cases. For both mean-field algorithms and support vector machines, simulation results for three small benchmark data sets are presented. They show that one may get state-of-the-art performance by using the leave-one-out estimator for model selection, and that the built-in leave-one-out estimators are extremely precise when compared to the exact leave-one-out estimate. The second result is taken as strong support for the internal consistency of the mean-field approach.

1 Introduction
There has been a great deal of interest in nonparametric Bayesian approaches to regression and classification based on the concept of gaussian processes (Mackay, 1997; Williams, 1997; Williams & Rasmussen, 1997; Neal, 1997; Gibbs & Mackay, 1997; Barber & Williams, 1996; Williams & Barber, 1998; Opper & Winther, 1999a, 1999b). The underlying idea is conceptually very simple. Instead of defining prior distributions over parameters of a learning machine (e.g., weights and biases in the case of a neural net), one directly defines a gaussian prior distribution over the function that the machine computes. For regression problems, this approach allows for an explicit statistical inference by analytical methods. On the other hand, for nonlinear statistical models like the ones used for classification, the high-dimensional integrals that occur in performing posterior averages can only be treated by approximate methods. An approximation to these integrations can be based on

Neural Computation 12, 2655–2684 (2000) © 2000 Massachusetts Institute of Technology
2656
Manfred Opper and Ole Winther
Monte Carlo sampling (Neal, 1997), which for a large number of data may be time-consuming. Other, semianalytic approaches use Laplace's method, that is, the approximation of the posterior by a multivariate gaussian (Barber & Williams, 1997; Williams & Barber, 1998), or a tractable variational bound on the data likelihood (Gibbs & Mackay, 1997). This article deals with a different approach, which has its origin in the statistical physics of disordered media. Their thermodynamic properties can be calculated from high-dimensional expectations over Gibbs distributions, which contain a large set of random quantities, similar to the input data in the posterior distribution of Bayesian analysis. Our method is based on a TAP (Thouless, Anderson, & Palmer, 1977; Mézard, Parisi, & Virasoro, 1987) type of mean-field approximation, which goes beyond a variational approximation of the posterior by a product distribution. The TAP approach is a controlled approximation in the sense that it becomes exact for simple distributions of the disorder (the input data) in the limit where the number of integration variables approaches infinity (Mézard et al., 1987; Opper & Winther, 1996). A second advantage is that the method, by its construction, automatically computes a leave-one-out estimate of its generalization error. For real data, the distribution of the disorder is unknown, and one cannot assess the validity of the mean-field approach directly. We will therefore (in contrast to most prior applications of the TAP method) not make any explicit assumptions about this distribution. Instead, we base our approach on a gaussian assumption for the so-called cavity fields (Mézard et al., 1987), which we believe to be reasonably good for a large class of distributions, and from which a closed set of nonlinear equations for posterior averages can be derived.
An alternative derivation of these equations was given in previous work (Opper & Winther, 1999a), which used a less intuitive approximation for the Gibbs free energy (Parisi & Potters, 1995). We also show that the internal consistency of the mean-field approximation can be checked indirectly by comparing the leave-one-out estimate for the generalization error provided by the TAP theory with the exact leave-one-out estimate. Section 2 gives the basic definitions for classification with gaussian processes. In section 3, we derive the TAP mean-field algorithm. A recipe for solving the mean-field equations by iteration is given in section 4. A leave-one-out (loo) estimator for the generalization error of the TAP mean-field algorithm is proposed in section 5. In section 6, we derive a naive mean-field algorithm. In section 7, support vector machines (SVMs) are obtained from TAP mean-field theory by a limiting procedure in the prior and a change in the likelihood (which ruins the probabilistic interpretation). In section 8, we present simulation results for three small benchmark data sets: Sonar (Mines versus Rocks), Leptograpsus Crabs, and Pima Indians Diabetes. The article is concluded in section 9. Appendixes A and B discuss, respectively, the noise model and the covariance functions used.
Gaussian Processes for Classification
2657
2 Gaussian Process Models
We consider the following supervised learning problem. A training set D_N = {(x_i, t_i) | i = 1, …, N} of input vectors x_i and associated outputs t_i is given, and we want to infer the output t for a new input x, where we restrict ourselves to binary classification, t = ±1. The probability of the output t at a given point x, p(t | h(x)), is assumed to depend on the input x through a scalar activation function h(x), which must be inferred by the learning process. The simplest example of such a likelihood is the noise-free case,

p(t | h(x)) = H(t sgn h(x)) = H(t h(x)),   (2.1)

where H(x) = 1 for x > 0 and 0 otherwise. Thus, only functions that have the same sign as the training label have nonzero probability. A more common choice for the likelihood, which corresponds to noisy or ambiguous classifications, would replace equation 2.1 with a sigmoidal function of t h(x). Since our approach requires integrations of likelihood functions over gaussian distributions, we will restrict ourselves to likelihoods corresponding to a probit model,

p(t | h) = Φ( t h / √(v²) ),   (2.2)

where Φ is the error function (or, more precisely, the gaussian cumulative distribution function)

Φ(x) = ∫_{−∞}^{x} (dy / √(2π)) e^{−y²/2},   (2.3)
for which such integrations can be performed exactly. However, as discussed in appendix A, the Bayesian classification for such a choice can also be derived from a model of the form of equation 2.1 by a simple redefinition, which amounts to modifying the prior over h(x) such that it includes the noise. In the subsequent discussion, we will therefore consider only the likelihood equation 2.1. There are many possible parametric approaches to modeling the function h(x). A popular one is to represent h by a neural network and try to estimate the parameters from the data. In the Bayesian approach to learning, a prior probability distribution over the parameters (e.g., the weights of the network) that determine h is specified. To go one step further, one may skip the parameters and introduce a prior distribution over the entire space of functions h. The choice of a gaussian probability measure over functions has been motivated by a study of the limiting prior distribution in the neural network case, when the number of hidden units grows large (Neal,
2658
Manfred Opper and Ole Winther
1996; Williams, 1997). In such a case, a typical function h drawn at random from the prior distribution is a realization of a gaussian process. A gaussian process is a family of random variables h(x), x ∈ T, where T denotes a possibly uncountable index set, such that any finite collection h = (h(x_1), …, h(x_N)) of random variables has a joint gaussian distribution,

p(h) = (1 / √((2π)^N det C)) exp( −(1/2)(h − m)ᵀ C⁻¹ (h − m) ),   (2.4)
where m is the mean, m = E(h) (which we will set to zero in the following), and C ≡ E(h hᵀ) − m mᵀ is the covariance matrix of the input set, with elements C(x_i, x_j). The function C(x, x′) explicitly determines how the activations at different points are correlated, which should reflect our prior beliefs about the variability of the function h(x). In principle, we may choose any positive semidefinite function for C. Several parametric choices, some of them related to neural network models, are discussed in appendix B. Formally, we may represent a gaussian process model by a single-layer neural network with (possibly infinitely many) weights w_r, r = 1, 2, …, with a spherical gaussian prior distribution, E[w_r w_{r′}] = δ_{rr′}, by performing a Karhunen-Loève expansion (see, e.g., Papoulis, 1984),

h(x) = Σ_r w_r √(λ_r) φ_r(x),   (2.5)
where the φ_r(x) are eigenfunctions of the covariance function (kernel) C with eigenvalues λ_r. From equation 2.5, we get the kernel representation C(x, x′) = Σ_r λ_r φ_r(x) φ_r(x′). At first glance, the infinite-dimensional nature of the gaussian process h(x) seems a bit awkward. However, since we are interested in making inference on a single input x based on a discrete set of training data inputs x_i, one needs only the (N + 1)-dimensional gaussian prior distribution p(h, h), where h = (h(x_1), …, h(x_N)) and h = h(x) is the random variable associated with the new input x. Using Bayes' rule together with the likelihood equation 2.1, one may now form the joint posterior distribution of h and h,

p(h, h | t) = (1 / p(t)) Π_{i=1}^{N} p(t_i | h(x_i)) p(h, h) = (1 / p(t)) p(t | h) p(h, h),   (2.6)

where t = (t_1, …, t_N) is shorthand for the training set outputs and

p(t) = ∫ dh dh p(t | h) p(h, h) = ∫ dh p(t | h) p(h)   (2.7)
Gaussian Processes for Classication
2659
is a normalization constant. To find p(h | t), formally all we need is to marginalize over h:

p(h | t) = ∫ dh p(h, h | t).   (2.8)

We will call p(h(x) | t) the predictive posterior of h. The term predictive is used because x is not in the training set, and this posterior may, as will be shown below, be used for making Bayesian predictions. The predictive posterior is also a central concept in our mean-field theory, since the mean-field approximation amounts to assuming a special simple parametric form of this distribution.

2.1 Bayesian Classification. The predictive posterior p(h(x) | t) summarizes the knowledge about h(x) after having observed the training set. We may calculate the predictive probability of the output label
p(t | t) = ⟨p(t | h)⟩ ≡ ∫ dh p(t | h) p(h | t),   (2.9)
where we have introduced the notation ⟨…⟩ for a posterior average. Bayes' algorithm for classification selects the output with the highest probability,

t_Bayes(D_N, x) = argmax_t p(t | t),

in order to minimize the posterior probability of an error. For binary classification, Bayes' algorithm becomes

t_Bayes(D_N, x) = sgn[ p(+1 | t) − p(−1 | t) ] = sgn⟨sgn h⟩.   (2.10)
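Equation 2.10 needs only the posterior average ⟨sgn h⟩. A small Monte Carlo sketch of that average (our own illustration, not part of the paper), taking the predictive posterior of h(x) to be gaussian as the mean-field ansatz of section 3 will assume:

```python
import math
import random

def bayes_classify(mean, var, n_samples=10000, seed=0):
    """Bayes classifier of equation 2.10, t = sgn<sgn h>, estimating the
    posterior average <sgn h> by sampling h from a gaussian predictive
    posterior with the given mean and variance."""
    rng = random.Random(seed)
    avg_sgn = sum(
        math.copysign(1.0, rng.gauss(mean, math.sqrt(var)))
        for _ in range(n_samples)
    ) / n_samples
    return 1 if avg_sgn >= 0 else -1
```

For a gaussian predictive posterior, ⟨sgn h⟩ = 2Φ(⟨h⟩/√λ) − 1 has the same sign as ⟨h⟩, so sampling is not actually needed in that case; the sketch only makes the posterior average ⟨·⟩ concrete.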
In the Bayesian framework, we may also quantify the uncertainty of the prediction of Bayes' algorithm. According to our posterior belief, the probability for the output t to be correct is given by the predictive probability p(t | t). The estimate of the probability that Bayes' algorithm gives the wrong answer is therefore p(−t_Bayes(D_N, x) | t).

3 Mean-Field Theory
The posterior average needed to derive Bayes' algorithm is in most cases of interest not analytically tractable. For the likelihood equation 2.1, one has to resort to approximate techniques. In this article, we introduce an advanced mean-field approach based on ideas of statistical mechanics.
Manfred Opper and Ole Winther
The basic idea of mean-field theories is to approximate the statistics of a random variable that is correlated with other random variables by assuming that the influence of the other variables can be compressed into a single effective mean "field" with a rather simple distribution. Often such an approximation is based on a variational product approximation for the distribution of interest, which neglects the correlation between random variables (Parisi, 1988). For applications to gaussian processes, see Csató, Fokoué, Opper, Schottky, and Winther (in press); for graphical models, see Jordan (1999). Our approach goes beyond such a simple mean-field theory and is equivalent to the so-called TAP mean-field theory known from the statistical mechanics of disordered systems. We assume that the posterior distribution of h(x), the predictive posterior, can be approximated by a gaussian with mean ⟨h⟩ and variance λ = ⟨h²⟩ − ⟨h⟩²; that is,

    p(h(x)|\mathbf{t}) \approx \frac{1}{\sqrt{2\pi\lambda}} \exp\!\left( -\frac{(h - \langle h \rangle)^2}{2\lambda} \right).    (3.1)
To motivate such an approximation, one may look at the expansion equation 2.5. Clearly, the posterior distribution of the weight vector, p(w_1, w_2, …|t), will not be gaussian for a nongaussian likelihood. However, one might justify a gaussian approximation by the central limit theorem if (1) the weights w_r are sufficiently weakly dependent and (2) the fluctuations of the sum, equation 2.5, are dominated by a large number of terms of roughly the same magnitude. One may argue that condition 2 will not be fulfilled when the eigenvalues λ_r decrease rapidly with r (e.g., exponentially fast). However, when the dimension of the input space is large and we assume a kernel that is permutation invariant with respect to the components of the input vector, there will be many features (e.g., the linear ones) that have the same eigenvalues and may thus give similar contributions to the fluctuations of equation 2.5. The quality of the gaussian approximation to h(x) will depend strongly on the input x. If we take a point close to one of the training inputs, x ≈ x_i, and consider the likelihood equation 2.1, which imposes the strict inequality t_i h(x_i) > 0, the distribution of h(x) should be rather different from a gaussian (see the discussion after equation 3.5). However, as we will see in a moment, the gaussian approximation will be applied only to test points x and to training inputs x_i for which the label t_i and the corresponding likelihood factor have been discarded. In such a case, we expect the approximation to be fairly good when input points are typically not close to each other. In fact, for specific models (Opper & Winther, 1996), the approximation becomes exact when the dimension of the input space approaches infinity and the input vectors are drawn at random from a probability distribution with independent components. In this limit, with probability one, input vectors will come out uncorrelated.
In the theory of disordered media, similar approximations are made for the distribution of the magnetic field in the cavity that is left when an atom (corresponding to the data label) is removed from the system. Hence, we will also speak of cavity fields or cavity distributions. (For a more detailed discussion of the validity of the approximation for specific models in the statistical physics of disordered media, see Mézard et al., 1987.) We will show that means and variances for the gaussian approximation, equation 3.1, can be calculated in a self-consistent way. Before doing so, we use the gaussian approximation to calculate the predictive probability, equation 2.9,

    p(t|\mathbf{t}) = \Phi\!\left( t\, \frac{\langle h(x) \rangle}{\sqrt{\lambda}} \right),    (3.2)

where Φ is the error function of equation 2.3. The Bayes classifier becomes

    t_{\rm Bayes}(x, D_N) = {\rm sgn}\,\langle h(x) \rangle.    (3.3)

Note that this result holds for any distribution that is symmetric around the mean value. The approximation to the error probability may also be calculated:

    p(-t_{\rm Bayes}(x, D_N)|\mathbf{t}) = \Phi\!\left( -\frac{|\langle h \rangle|}{\sqrt{\lambda}} \right).    (3.4)
It is important to note that the gaussian mean-field approximation to the predictive posterior is not equivalent to a gaussian approximation to the full posterior. The latter is commonly used in Bayesian modeling for parametric models in the limit where the number of data points is much larger than the number of parameters. It would not be justified in our nonparametric case, nor would it be justified for the nonsmooth likelihood of equation 2.1. To understand the difference from this simpler approximation, consider the case where an (N+1)th example (x, t) has been observed. Then, according to Bayes' rule, the posterior distribution is given by

    p(h(x)|t, \mathbf{t}) = \frac{p(t|h(x))\, p(h(x)|\mathbf{t})}{\int dh\; p(t|h)\, p(h|\mathbf{t})}.    (3.5)
For the likelihood of equation 2.1, this posterior distribution is far from gaussian, as illustrated in Figure 1. To derive the mean-field algorithm, we shall exploit the fact that under the gaussian approximation, one can derive an analytical relation between the mean of the predictive distribution, ⟨h⟩, and the mean of p(h(x)|t, t). Within the mean-field approximation, we thus need to derive expressions for ⟨h⟩ and λ = ⟨h²⟩ − ⟨h⟩² to make Bayesian predictions and quantify
Figure 1: The predictive posterior of h(x), p(h(x)|t) (dashed line), and the posterior p(h(x)|t, t) (solid line) after the addition of the (N+1)th example (x, t). In the figure, t = 1. The lines crossing the x-axis indicate the mean values of the two distributions.
the uncertainty of the prediction. This will be the task of the remainder of this section. We will start by deriving exact expressions for the two first moments of the fields. The exact expressions are written in terms of averages over cavity distributions (for which one of the training examples is left out of the likelihood). These averages may be carried out within the mean-field approximation, and the second moments of the cavity distributions are determined self-consistently. The final result is a set of 2N nonlinear mean-field equations that has to be solved to find ⟨h⟩ and λ. In simulations, this will be done by iteration.

The gaussian assumption for the distribution of the cavity field is only one possible route to the TAP mean-field theory. Another derivation, based on a perturbation expansion of the Gibbs free energy (Opper & Winther, 1999a), is less intuitive and requires a somewhat deeper background in specific techniques of statistical physics. Such alternative derivations will be discussed in more detail elsewhere.

3.1 Exact Expressions for ⟨h⟩ and ⟨h(x_i)⟩. The starting point of our approximative treatment is the exact relation for the posterior mean,

    \langle h \rangle = \frac{1}{p(\mathbf{t})} \int dh\, d\mathbf{h}\; h\, p(\mathbf{t}|\mathbf{h})\, p(h, \mathbf{h}),

and its counterpart for the posterior variance. The exact relations are derived using

    \mathbf{h}\, p(\mathbf{h}) = C C^{-1} \mathbf{h}\, p(\mathbf{h}) = -C\, \frac{\partial}{\partial \mathbf{h}}\, p(\mathbf{h}),    (3.6)
which follows from equation 2.4 (with m = 0). We may now find the exact relation for the posterior mean ⟨h(x)⟩ at an arbitrary point x by extending the prior to include the new point x (h(x) is the (N+1)th field) and applying integration by parts to shift the differentiation from the prior to the likelihood:

    \langle h(x) \rangle = -\frac{1}{p(\mathbf{t})} \int dh\, d\mathbf{h}\; p(\mathbf{t}|\mathbf{h}) \sum_{i=1}^{N} C(x, x_i)\, \frac{\partial}{\partial h_i}\, p(h, \mathbf{h})
                        = \frac{1}{p(\mathbf{t})} \sum_{i=1}^{N} C(x, x_i) \int dh\, d\mathbf{h}\; p(h, \mathbf{h})\, \frac{\partial}{\partial h_i}\, p(\mathbf{t}|\mathbf{h})
                        = \sum_{i=1}^{N} C(x, x_i)\, t_i\, \alpha_i.    (3.7)

Here we have introduced the notation h_i ≡ h(x_i) and defined the "embedding strength,"

    \alpha_i \equiv \frac{t_i}{p(\mathbf{t})} \int d\mathbf{h}\; p(\mathbf{h})\, \frac{\partial}{\partial h_i}\, p(\mathbf{t}|\mathbf{h}).    (3.8)
Introducing the further shorthand notation C_ij = C(x_i, x_j) and applying equation 3.7 to the training data points x = x_i, we have

    \langle h_i \rangle = \sum_j C_{ij}\, t_j\, \alpha_j.    (3.9)
These exact expressions are very interesting because they allow us to express the expected activation ⟨h(x)⟩ as a weighted average over the examples of the training set, as is also the case in the support vector learning approach (Vapnik, 1995) and in the MAP approach to gaussian processes (Williams & Barber, 1998). In calculating the activation at a point x, the training data are weighted according to how close they are to x (in an appropriate metric) by the kernel function, which defines the relevant neighborhood for each point in the input space. The overall importance of a training point x_i in the learning problem is measured by the variable α_i, which, as may be seen from equation 3.8, is always nonnegative when p(t|h) is an increasing function of th. For our cavity derivation, it is useful to introduce a new set of predictive posteriors, one for each example:

    p(h_i | \mathbf{t} \setminus t_i) = \frac{\int \prod_{j \neq i} dh_j\; \prod_{j \neq i} p(t_j|h_j)\; p(\mathbf{h})}{\int d\mathbf{h}\; \prod_{j \neq i} p(t_j|h_j)\; p(\mathbf{h})},    (3.10)
where t∖t_i = (t_1, …, t_{i−1}, t_{i+1}, …, t_N) denotes the training set without the ith example. Denoting an average over this distribution by

    \langle \cdots \rangle_{\backslash i} = \int dh_i \cdots p(h_i | \mathbf{t} \setminus t_i),

we can rewrite equation 3.8 as

    \alpha_i = t_i\, \frac{\left\langle \frac{\partial}{\partial h_i} p(t_i|h_i) \right\rangle_{\backslash i}}{\left\langle p(t_i|h_i) \right\rangle_{\backslash i}}.    (3.11)
3.2 Mean-Field Expression for α_i. The main reason that it is possible to calculate averages over p(h_i|t∖t_i) is that it is a predictive posterior of the field at an input x_i: t_i is not included in the data set. We can therefore apply the same gaussian approximation as in equation 3.1,

    p(h_i | \mathbf{t} \setminus t_i) \approx \frac{1}{\sqrt{2\pi\lambda_i}} \exp\!\left( -\frac{(h_i - \langle h_i \rangle_{\backslash i})^2}{2\lambda_i} \right),    (3.12)

with the variance defined as λ_i = ⟨h_i²⟩_{∖i} − ⟨h_i⟩_{∖i}². Inserting equation 3.12 into 3.11, we derive the explicit expression

    \alpha_i \approx \frac{1}{\sqrt{\lambda_i}}\; \frac{D\!\left( \frac{\langle h_i \rangle_{\backslash i}}{\sqrt{\lambda_i}} \right)}{\Phi\!\left( t_i \frac{\langle h_i \rangle_{\backslash i}}{\sqrt{\lambda_i}} \right)},    (3.13)
where we have introduced the gaussian measure

    D(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}.

So far we have written α_i in terms of new unknown quantities, the two first moments of the predictive distribution for h_i: ⟨h_i⟩_{∖i} and λ_i. We therefore need to express these in terms of known quantities. For the first moments, we may exploit the following exact relation between the average over the posterior and the average over the predictive posterior:

    \langle \cdots \rangle = \frac{\langle \cdots\, p(t_i|h_i) \rangle_{\backslash i}}{\langle p(t_i|h_i) \rangle_{\backslash i}}.    (3.14)

Since p(h_i|t∖t_i) is (assumed to be) gaussian, we get, following the same reasoning as for equation 3.6,

    h_i\, p(h_i|\mathbf{t}\setminus t_i) \approx \langle h_i \rangle_{\backslash i}\, p(h_i|\mathbf{t}\setminus t_i) - \lambda_i\, \frac{\partial}{\partial h_i} p(h_i|\mathbf{t}\setminus t_i)

and

    \langle h_i \rangle = \langle h_i \rangle_{\backslash i} - \lambda_i\, \frac{\int dh_i\; p(t_i|h_i)\, \frac{\partial}{\partial h_i} p(h_i|\mathbf{t}\setminus t_i)}{\langle p(t_i|h_i) \rangle_{\backslash i}} = \langle h_i \rangle_{\backslash i} + \lambda_i\, \frac{\left\langle \frac{\partial}{\partial h_i} p(t_i|h_i) \right\rangle_{\backslash i}}{\langle p(t_i|h_i) \rangle_{\backslash i}}.    (3.15)
Comparing with equation 3.11, we see that

    \langle h_i \rangle \approx \langle h_i \rangle_{\backslash i} + \lambda_i\, t_i\, \alpha_i.    (3.16)
This relation shows that adding the ith example to the posterior shifts the mean of the field in the direction of the target, as may also be seen from Figure 1. Note from equations 3.13 and 3.16 that the equations for the α_i are nonlinear and thus have to be solved numerically. In section 5, we will exploit the fact that we have calculated ⟨h_i⟩_{∖i} for every training example i = 1, …, N to propose a mean-field leave-one-out estimator for the generalization error.

Mean-Field Expressions for λ and λ_i. To complete our mean-field theory, we have to derive expressions for the second moments, λ = ⟨h²(x)⟩ − ⟨h(x)⟩² and λ_i ≡ ⟨h_i²⟩_{∖i} − ⟨h_i⟩_{∖i}². The derivation is given in appendix C. The final results are

    \lambda \approx C(x, x) - \sum_{jk} C(x, x_j) \left[ (\Lambda + C)^{-1} \right]_{jk} C(x_k, x)    (3.17)

    \lambda_i \approx \frac{1}{\left[ (\Lambda + C)^{-1} \right]_{ii}} - \Lambda_i,    (3.18)

with Λ = diag(Λ_1, …, Λ_N) and

    \Lambda_i \equiv -\lambda_i - \left( t_i\, \frac{\partial \alpha_i}{\partial \langle h_i \rangle_{\backslash i}} \right)^{-1}.    (3.19)

An explicit expression for ∂α_i/∂⟨h_i⟩_{∖i} is obtained, with some algebra, from equation 3.13:

    \frac{\partial \alpha_i}{\partial \langle h_i \rangle_{\backslash i}} \approx -\frac{\alpha_i}{\lambda_i}\, \langle h_i \rangle.    (3.20)
Equation 3.18, together with equations 3.19 and 3.20, thus defines a set of nonlinear equations for the λ_i, i = 1, …, N. For gaussian process regression with a gaussian noise model, one finds the same expressions for λ and λ_i, with Λ_i equal to the noise variance (Williams & Rasmussen, 1996). It is therefore tempting to interpret Λ_i as an effective noise variance for the ith example.

3.3 Summary of Mean-Field Equations. We conclude the derivation of the TAP mean-field equations with a summary of the results. The mean-field prediction for a new data point x follows from equations 3.3 and 3.7:

    t_{\rm Bayes}(x, D_N) \approx {\rm sgn} \sum_i C(x, x_i)\, t_i\, \alpha_i.    (3.21)
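Given solved embedding strengths, the prediction rule of equation 3.21 is a kernel-weighted vote over the training points. A small illustration follows; the gaussian kernel and all names here are our own choices for the sketch (appendix B lists the covariance functions actually used in the paper):

```python
import math

def rbf_kernel(x, y, sigma2=1.0):
    """An illustrative gaussian covariance C(x, y) between two input tuples."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma2))

def predict(x, train_x, train_t, alpha, kernel=rbf_kernel):
    """Equation 3.21: t_Bayes(x) = sgn sum_i C(x, x_i) t_i alpha_i.
    Each training point votes with its label t_i, weighted by its
    embedding strength alpha_i and its kernel proximity to x."""
    h = sum(kernel(x, xi) * ti * ai
            for xi, ti, ai in zip(train_x, train_t, alpha))
    return 1 if h >= 0 else -1
```

Because the α_i are nonnegative, each training example can only pull the decision toward its own label, with the kernel defining the neighborhood in which it has influence.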
To obtain the α_i, i = 1, …, N, one has to solve the set of 2N nonlinear mean-field equations for α_i and λ_i, where α_i is given by equation 3.13 (with equations 3.9 and 3.16) and λ_i is given by equation 3.18 (with equations 3.19 and 3.20). In the next section we give a recipe for solving the mean-field equations.

4 Solving the Mean-Field Equations

The mean-field equations are solved by iteration. The parallel iterative scheme is described below in pseudocode, where we also indicate which equations are used. The iterative scheme for the naive mean field (see section 6) is obtained by omitting the statements for Λ_i and λ_i in the iteration step. Computationally, the most expensive step of the TAP mean-field approach is the inversion of the matrix Λ + C, which is O(N³), compared with O(N²) for the other operations. It turns out that solving the equations for the α_i requires many more iterations than solving those for the λ_i, which seem to be somewhat insensitive to the precise values of the α_i. We therefore choose to make a (greedy) update of the λ_i only every twentieth iteration step. For the problems studied, convergence to the required accuracy could be obtained in fewer than 100 iteration steps and takes about 1 CPU second on an Alpha 433au for the largest problem studied (N = 200).¹

¹ An important contributing factor to learning speed is the use of an adaptive learning rate. We set η := 1.1η if "the error" Σ_i |δα_i|² decreases in the update step and η := η/2 otherwise.

Initialization:
1. Learning rate η := 0.05 and fault tolerance ftol := 10⁻⁴.
2. Calculate the covariance matrix C (see appendix B).
3. For every training example i: α_i := 0 and λ_i := C_ii.

Iterate while max_i |δα_i|² > ftol:
1. For every i (equations 3.9, 3.13, and 3.16):
   ⟨h_i⟩ := Σ_j C_ij t_j α_j
   δα_i := (1/√λ_i) D(z_i)/Φ(z_i) − α_i,   with z_i ≡ t_i (⟨h_i⟩ − λ_i t_i α_i)/√λ_i.
2. For every i: α_i := α_i + η δα_i.
3. For every twentieth iteration:
   (a) For every i: Λ_i := λ_i (1/(t_i ⟨h_i⟩ α_i) − 1) (equations 3.19 and 3.20).
   (b) For every i: λ_i := 1/[(Λ + C)⁻¹]_ii − Λ_i (equation 3.18).
5 Leave-One-Out Estimator
Once we have solved the mean-field equations, the cavity averages ⟨h_i⟩_{∖i} of the activations will be known. We may use these to define a leave-one-out (loo) estimator for the generalization error of the algorithm, since sgn⟨h_i⟩_{∖i} is a mean-field estimate of the prediction of the algorithm on pattern i when trained without that example. We therefore get, as a bonus of the TAP approach, the following leave-one-out estimator for the generalization error:

    \epsilon_{\rm loo} = \frac{1}{N} \sum_{i=1}^{N} H(-t_i \langle h_i \rangle_{\backslash i}).    (5.1)
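Once the mean-field equations are solved, the estimator of equation 5.1 costs essentially nothing, since the cavity means follow directly from equation 3.16. A sketch (our own, with hypothetical names):

```python
import numpy as np

def loo_error(C, t, alpha, lam):
    """Leave-one-out estimate of equation 5.1: the fraction of training
    points whose cavity activation t_i * (cavity mean of h_i) is negative.
    The cavity mean is <h_i> - lam_i * t_i * alpha_i by equation 3.16."""
    h = C @ (t * alpha)              # posterior means <h_i>, equation 3.9
    h_cav = h - lam * t * alpha      # cavity means, equation 3.16
    return float(np.mean(t * h_cav < 0.0))
```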
If ε_loo were exact, that is, equal to the exact loo estimator ε_loo^exact obtained by N-fold cross-validation, this estimator would be (almost) unbiased.² In appendix D, we show that the expression ⟨h_i⟩_{∖i} = ⟨h_i⟩ − λ_i t_i α_i may also be obtained by treating the perturbation of the TAP equations when the ith example is removed from the training set with a linear response approach. This gives a self-consistency check of the result for ⟨h_i⟩_{∖i} obtained within the TAP approximation. In appendix D, it is furthermore shown how this technique can be generalized to derive loo estimators for other algorithms, such as the naive mean-field algorithm (section 6) and support vector machines (section 7). Other work on approximate loo estimators for smoothing splines (Xiang & Wahba, 1996) and neural networks (Larsen & Hansen, 1996) is close in spirit to the linear response approach.

² ε_loo^exact is an unbiased estimator of the generalization error for a training set of size N − 1. We may also define an additional loo estimator from the error probability, equation 3.4: ε_pred = (1/N) Σ_i Φ(−|⟨h_i⟩_{∖i}|/√λ_i). However, this estimator is unbiased only when the model assumptions are correct and will therefore in many cases be misleading. This is also observed in simulations.

6 Naive Mean-Field Theory

There are many different ways to define and derive mean-field theories (see, e.g., Parisi, 1988). Perhaps the simplest one is to derive an exact relation between the expectations of the random variables of interest and the expectation of a nonlinear function of these variables. For spin systems, such expressions are known as Callen identities (see Parisi, 1988). At the end of
the calculation, the expectation is naively exchanged with the nonlinearity in order to obtain a closed, self-consistent approximation for the averages. For our problem, we start from the definition of α_i, equation 3.8, and first apply a standard gaussian transform, which introduces the imaginary unit i = √−1 (not to be confused with the i appearing as an index):

    \alpha_i = \frac{t_i}{p(\mathbf{t})} \int d\mathbf{h}\; p(\mathbf{h})\, \frac{\partial}{\partial h_i} p(\mathbf{t}|\mathbf{h})
             = \frac{t_i}{p(\mathbf{t})} \int \frac{d\mathbf{h}\, d\mathbf{y}}{(2\pi)^N} \exp\!\left(-\tfrac{1}{2}\mathbf{y}^T C \mathbf{y} + i\mathbf{y}^T\mathbf{h}\right) \frac{\partial}{\partial h_i} p(\mathbf{t}|\mathbf{h}).    (6.1)

Second, integration by parts is applied:

    \alpha_i = \frac{t_i}{p(\mathbf{t})} \int \frac{d\mathbf{h}\, d\mathbf{y}}{(2\pi)^N} (-i y_i) \exp\!\left(-\tfrac{1}{2}\mathbf{y}^T C \mathbf{y} + i\mathbf{y}^T\mathbf{h}\right) p(\mathbf{t}|\mathbf{h}) = -i\, t_i\, \langle y_i \rangle.

In the last equality, the bracket is understood as a formal average over the joint complex measure of the variables h and y. Next, we separate the integrations over h_i and y_i from the rest of the variables to show that

    \alpha_i = t_i\, \frac{\left\langle \int dh_i\, dy_i\, \exp\!\left(-\tfrac{1}{2}C_{ii} y_i^2 + (-iy_i)\bigl(\textstyle\sum_{j\neq i} C_{ij}(-iy_j) - h_i\bigr)\right) \frac{\partial p(t_i|h_i)}{\partial h_i} \right\rangle}{\left\langle \int dh_i\, dy_i\, \exp\!\left(-\tfrac{1}{2}C_{ii} y_i^2 + (-iy_i)\bigl(\textstyle\sum_{j\neq i} C_{ij}(-iy_j) - h_i\bigr)\right) p(t_i|h_i) \right\rangle}
             = t_i\, \frac{\left\langle \int dh_i\, \exp\!\left(-\frac{\bigl(h_i - \sum_{j\neq i} C_{ij}(-iy_j)\bigr)^2}{2C_{ii}}\right) \frac{\partial p(t_i|h_i)}{\partial h_i} \right\rangle}{\left\langle \int dh_i\, \exp\!\left(-\frac{\bigl(h_i - \sum_{j\neq i} C_{ij}(-iy_j)\bigr)^2}{2C_{ii}}\right) p(t_i|h_i) \right\rangle}
             = \frac{1}{\sqrt{C_{ii}}}\, \left\langle \frac{D\!\left(\sum_{j\neq i} C_{ij}(-iy_j)\big/\sqrt{C_{ii}}\right)}{\Phi\!\left(t_i \sum_{j\neq i} C_{ij}(-iy_j)\big/\sqrt{C_{ii}}\right)} \right\rangle.    (6.2)
If we simply neglect the fluctuations of the field Σ_{j≠i} C_ij(−iy_j) and move the expectation through the nonlinearities, we get a self-consistent set of equations for α_i = −i t_i ⟨y_i⟩. We may call this a "naive" mean-field theory.³

³ The same result can be obtained by a saddle-point approximation of a suitable partition function (Csató et al., in press).
The explicit expression for α_i becomes

    \alpha_i \approx \frac{1}{\sqrt{C_{ii}}}\; \frac{D\!\left( \frac{\sum_{j \neq i} C_{ij} t_j \alpha_j}{\sqrt{C_{ii}}} \right)}{\Phi\!\left( t_i \frac{\sum_{j \neq i} C_{ij} t_j \alpha_j}{\sqrt{C_{ii}}} \right)}.    (6.3)
This can be seen as a simplified version of the TAP equations, in which (cf. equation 3.13) the complicated expression for λ_i, equation 3.18, has been replaced with C_ii. This mean-field theory is therefore of lower computational complexity, since we avoid the matrix inversion needed in the TAP approach to obtain λ_i. We may obtain the leave-one-out estimator for naive mean-field theory from equation D.2, where δ⟨h_i⟩ is given by equation D.5 and V_i, equation D.3, becomes

    V_i = C_{ii} \left( \frac{1}{t_i\, \alpha_i\, \langle h_i \rangle} - 1 \right).
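Since equation 6.3 only ever needs the diagonal elements C_ii in place of λ_i, one sweep of the naive iteration avoids the matrix inversion entirely. A sketch of one damped parallel update (our own transcription, with illustrative names):

```python
import math
import numpy as np

def naive_step(C, t, alpha, eta=0.05):
    """One damped parallel sweep of equation 6.3: the TAP update with
    lambda_i replaced by the diagonal element C_ii, so the cost per
    sweep is O(N^2).  Returns the new alphas and the largest change."""
    d = np.diag(C)
    h = C @ (t * alpha)
    z = t * (h - d * t * alpha) / np.sqrt(d)     # the sum runs over j != i only
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    dalpha = pdf / (np.sqrt(d) * cdf) - alpha
    return alpha + eta * dalpha, float(np.max(np.abs(dalpha)))
```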
Again, we observe that the only difference from TAP is that λ_i has been exchanged with C_ii (compare V_i, which corresponds to Λ_i, equation 3.19, for TAP).

7 Support Vector Machine Digression
The relation between variational problems in reproducing kernel Hilbert spaces (such as learning in support vector machines) and Bayesian gaussian processes is well known and has been pointed out by various authors (see, e.g., Wahba, 1990; Jaakkola & Haussler, 1999). We give a simple formulation of support vector learning in the realizable case, neglecting the bias (Boser, Guyon, & Vapnik, 1992; Vapnik, 1995; Schölkopf, Burges, & Smola, 1998). Based on the decomposition of the activation function h(x), equation 2.5, one tries to find weights w_r such that t_i h(x_i) ≥ 1 for all examples and Σ_r w_r² is minimal. This is therefore an optimization problem with no direct probabilistic interpretation. However, it fits into the gaussian process framework by the following limiting procedure. To introduce the margin, we replace p(t|h) = H(th) by the nonnormalized "likelihood" H(th − 1), such that absolute values of activations smaller than 1 are excluded. Second, the prior is rescaled by introducing a parameter b:

    p(\mathbf{h}) = \sqrt{\frac{b^N}{(2\pi)^N \det C}}\, \exp\!\left( -\frac{b}{2}\, \mathbf{h}^T C^{-1} \mathbf{h} \right).    (7.1)

In the independent variables w_r, this prior is simply proportional to exp(−(b/2) Σ_r w_r²).
Hence, in the limit b → ∞, the "posterior" is dominated by the solution of the SV problem. Interestingly, although the TAP equations give in general only approximations to posterior averages, it is shown in appendix E that they reduce to the exact Kuhn-Tucker conditions that determine the embedding strengths α̂_i for SVMs:

    (\hat{\alpha}_i = 0 \ \text{and}\ t_i \langle h_i \rangle \geq 1) \quad \text{or} \quad (\hat{\alpha}_i > 0 \ \text{and}\ t_i \langle h_i \rangle = 1),    (7.2)
where ⟨h_i⟩ = Σ_j C_ij t_j α̂_j. The prediction of the support vector machine is given by t_SVM(x, D_N) = sgn⟨h(x)⟩ = sgn Σ_j C(x, x_j) t_j α̂_j. In appendix E, we show that by taking the SVM limit of the loo estimator, equation 5.1, a simple approximation to the loo estimator for SVMs is obtained:

    \epsilon_{\rm loo}^{\rm SVM} \approx \frac{1}{N} \sum_{i}^{\rm SV} H\!\left( 1 - \frac{\left[ C_{\rm SV}^{-1} \right]_{ii}}{\hat{\alpha}_i} \right),    (7.3)
where the sum runs over the support vectors only and C_SV denotes the covariance matrix restricted to the support vector examples. An alternative derivation of equation 7.3, and generalizations, are possible along the lines of appendix D (Opper & Winther, 1999b). A similar type of loo estimator for SVMs (for a different loss function) has previously been derived by Wahba (1998).

8 Simulations
In this section, we study the performance of the TAP mean-field, naive mean-field, and SVM algorithms on three publicly available and widely tested data sets: Sonar (Mines versus Rocks) (Gorman & Sejnowski, 1988) and Leptograpsus Crabs and Pima Indians Diabetes (both Ripley, 1996). For SVMs, we used the Adatron algorithm without bias (Anlauf & Biehl, 1989). The input data were preprocessed by linear rescaling such that, over the training set, each input variable has zero mean and unit variance. We tested several different covariance functions, all listed in appendix B. In almost all cases, the gaussian covariance function turned out to give the best performance, so we report results only for that covariance function. The free parameters of the algorithms are the hyperparameters of the covariance functions (described in appendix B). These should be determined either from prior knowledge about the problem or, lacking that, from the training data. In the simulations presented here, we have reduced the number of free hyperparameters either by setting all the w-hyperparameters (see
Table 1: Results for Sonar Data.

Algorithm                       ε_test   ε_loo^exact   ε_loo
TAP mean field                  0.077    0.212         0.212
Naive mean field                0.077    0.221         0.221
SVM                             0.096    0.202         0.202
Simple perceptron               0.269    —             —
Best two-layer (12 hidden)      0.096    —             —
The data set contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions and 97 patterns obtained from rocks under similar conditions. Here we have used the same training–test data split as Gorman and Sejnowski.4 4 May be obtained online from http://www.boltz.cs.cmu.edu/benchmarks/sonar. html.
Table 2: Results for Leptograpsus Crabs and Pima Indians Diabetes.

                      Leptograpsus Crabs                     Pima Indians Diabetes
Algorithm             ε_test   ε_loo^exact   ε_loo   Error   ε_test   ε_loo^exact   ε_loo   Error
TAP mean field        0.033    0.037         0.037   4       0.190    0.250         0.255   63
Naive mean field      0.017    0.025         0.025   2       0.193    0.245         0.245   64
SVM                   0.050    0.037         0.037   6       0.199    0.250         0.250   66

Comparison methods (test error counts): Laplace GP, hybrid Monte Carlo, Maximum Likelihood-II, best two-layer network, simple perceptron, logistic regression, MARS (degree = 1), projection pursuit (4 ridge functions), and gaussian mixture; Crabs: 3, 4, 4, 8, 4, 8, 3; Pima: 68, 69, 75, 67, 66, 75, 75, 64.
8.1.1 Results. For all algorithms, the number of hyperparameters was strongly reduced by setting w_l = 1/(σ²d) for l = 1, …, d, where d is the input dimension. It turned out that the performance is quite robust against the choice of σ², as may be seen from Figure 2. The simulation results are shown in Table 1; the comparison values are taken from Gorman and Sejnowski (1988). One may observe that the performance of the mean-field algorithms is slightly better than that of the SVM and the best two-layer network, although it is doubtful that this difference is statistically significant. Figure 2 shows that the test set is significantly "easier" than the training set. The solution with lowest test error is almost trivial, with all α_i ∼ t⟨h_i⟩ ∼ constant. Exchanging the training and test sets, we find, for example, for σ² = 0.05, ε_test = 0.173 and ε_loo = 0.058. One may thus conclude that this training-test split is very biased. Running the algorithms on other splits gave a nontrivial solution, with a much better correspondence between the test and leave-one-out errors and an error rate somewhere between the two extremes. In Figure 3, we give an example of the "embedding strengths" α_i and the "stabilities" t_i⟨h_i⟩ found for the TAP mean-field, naive mean-field, and SVM algorithms used under exactly the same conditions, that is, the same training set and covariance matrix. The results are normalized such that the squared length of the weight vector, Σ_ij α_i t_i C_ij t_j α_j, is the same in all three cases. The results show a clear correlation between the solutions found by the three algorithms. However, the mean-field algorithms give rise to both smaller and larger stabilities (margins) than found for the SVM. This, however, has no major effect on performance.
Figure 2: Test error ε_test (solid line), leave-one-out estimator ε_loo (dashed line), and ε_loo^exact (dash-dotted line) as a function of σ² for the Sonar data set. (Top) TAP. (Bottom) SVM.
Figure 3: (Top) "Embedding strengths" α_i for the different algorithms on the Sonar data. (Bottom) Mean activations t_i⟨h_i⟩ for the different algorithms (same ordering as the top plot). Triangles denote the support vector solution, circles naive mean-field theory, and stars TAP mean-field theory. Patterns are sorted in ascending order of their support vector α_i value, and the TAP and naive mean-field solutions are rescaled to the length of the support vector solution.

8.2 Leptograpsus Crabs and Pima Indians Diabetes. Both the Leptograpsus Crabs and Pima Indians Diabetes data sets have been made available by Ripley (1996; online at http://www.stats.ox.ac.uk/~ripley/PRbook/). For Leptograpsus Crabs, the task is to classify the sex of crabs on the basis of five anatomical attributes and a color attribute. There are 50 examples available for each color and sex, making a total of 200 examples. Twenty examples for each color and sex in the data file are used for training (making 80 examples in total), and the rest are used for testing. We use the same training-test split as Williams and Barber (1998). The Pima Indians Diabetes data set is used as in previous studies, with a training-test split of 200 and 332 examples, respectively (Ripley, 1996). The task is to diagnose diabetes in a population of women of Pima Indian heritage based on the following information: number of pregnancies, plasma
glucose concentration in an oral glucose tolerance test, diastolic blood pressure, triceps skin-fold thickness, serum insulin, body mass index, diabetes pedigree function, and age. The baseline error rate, obtained by simply classifying each test input as positive, is 0.33. 8.2.1 Results. The w -hyperparameters were chosen to have values found by Williams and Barber (1998) based on the Laplace approximation to the integral over the gaussian elds and ML-II estimation for the hyperparameters. The results are given in Table 2. The comparisons are taken from Williams and Barber (1998) for Laplace GP and all others are from Ripley (1994, 1996). Our results are found to be close to the state of the art. The performance is quite sensitive to the choice of w -hyperparameters, so one may hopefully improve on these results when the hyperparameters are tuned to the specic algorithm. For Pima Indians Diabetes, the differences in performance between algorithms are small, and, contrary to the other data sets, all covariance functions tested have similar performance. 9 Conclusion
We have presented a mean-eld approach to binary classication with gaussian processes. Our method is an extension of the TAP mean-eld theory known in statistical physics. However, we have avoided making special assumptions about the distribution of the randomness of input data. The derivation of the resulting nonlinear equations for posterior means is new and may also be applied to other problems in probabilistic modeling. Our mean-eld method can be applied to statistical models that are nonsmooth functions of their parameters (like noise-free classication), where approaches based on Taylor expansions of the likelihood or the posterior would fail. A naive mean-eld algorithm, which can be considered a simplied version of the TAP algorithm with a lower computational complexity, has also been derived. We have further shown how SVMs can be obtained from a limiting procedure of the TAP approach. The built-in leave-one-out estimator for the generalization error provided by the TAP mean-eld method naturally extends to the SVM and naive mean-eld case. Simulation results on three small benchmark data sets—Sonar—Mines versus Rocks, Leptograpsus Crabs, and Pima Indians Diabetes—all give promising results matching the state-of-the-art performance. However, there are still a number of unresolved questions that need to be addressed before these new algorithms may be regarded as practical tools. We will discuss these in the following. 9.1 Model Selection. So far we have not dealt with automatic determination of hyperparameters. After having reduced the number of hyperparameters, the remaining ones have been found by minimizing the leave-one-
2676
Manfred Opper and Ole Winther
out estimate over a range of trial values. To be able to work with a higher number of hyperparameters, the search should be automated, preferably by a gradient-descent method on a suitable differentiable function. This rules out the discrete leave-one-out error count. A well-established candidate for such a function, which exploits the full power of the Bayesian approach, is the total probability of the data given the model. This is termed the evidence in Bayesian terminology (Mackay, 1992), or stochastic complexity in Rissanen’s MDL approach. The negative log-evidence corresponds to the so-called free energy in the language of statistical physics for which approximations are also available within mean-eld theory (Opper & Winther, 1999a; Csat o´ et al., in press). Preliminary results of this approach are very promising and will be reported in a forthcoming paper. 9.2 Assessing the Validity of the Mean-Field Approximation. When the iterative scheme for solving the mean-eld equations fails, we have so far no way of telling whether the equations have actually no solution or whether our numerical method is simply not appropriate for the specic choice of parameters. A possible way to approach this problem is its conversion into the minimization of a suitable cost function like the TAP free energy in Thouless et al. (1977) and Opper and Winther (1999a). 9.3 Quality of Mean-Field Approximations. The quality of the meaneld approximations may be directly assessed by a comparison with Monte Carlo sampling (Neal, 1997), which can approximate posterior averages with arbitrary accuracy when enough computer time is invested. On the other hand, for some toy models with high-dimensional input spaces, it should be possible to derive exact Bayesian learning curves by the replica method of statistical mechanics against which our mean-eld methods can be tested (Opper and Winther, 1996). 9.4 Computational Complexity . 
The computational complexity of the TAP mean-field algorithm is O(N³), where N is the number of training examples, since it requires the inversion of an N × N matrix. This makes the algorithm impractical for data sets larger than a few thousand examples. The complexity of the naive mean-field algorithm, on the other hand, should not scale worse than O(N²), since one iteration update requires only N inner products of N terms, and the iterative process is expected to converge in a finite number of steps. However, its leave-one-out estimate still requires a matrix inversion, which is O(N³). If instead the evidence framework is used, which is only O(N²) for this algorithm (Csató et al., in press), naive mean-field theory could be an interesting alternative to SVMs also for larger data sets. The similarity between the results of the mean-field theory and the SVM algorithm suggests further simplifications. In cases where SVMs lead to sparse representations based on a small number of support vectors, the
Gaussian Processes for Classification
2677
mean-field algorithms are expected to converge also to a small number of large embedding strengths. Setting the remaining ones to zero and leaving the corresponding examples out of the training set at an early stage of the iteration may speed up the algorithm significantly. Finally, one may think about hybrid approaches that combine the strengths of both the SVM and the Bayesian approaches. One could, for example, calculate the embedding strengths directly from the SVM algorithm, exploiting the sparseness directly, but use a mean-field approximation to the Bayesian evidence as a yardstick for estimating the hyperparameters in the kernel.

9.5 Extension to Other Data Models. It would be interesting to try to apply our TAP approach for gaussian process models to other types of likelihoods. A natural choice would be the problem of classification with more than two classes. A likelihood for this problem can be obtained simply from a softmax function, where each class has its own gaussian process. Our TAP approach requires the integration of such a likelihood over gaussian distributions, a task that unfortunately can no longer be done exactly. Hence, further approximations for treating such (usually) low-dimensional integrations will be necessary.

Appendix A: Input Noise
In this appendix, we discuss the inclusion of input noise within the probit model and the modification of the mean-field algorithms that it implies. We define the input noise model as the addition of an independent random variable to the activation function in the likelihood equation, 2.1:

\[ p(t \mid h(x), \xi(x)) = \Theta\big( t\,( h(x) + \xi(x) ) \big). \tag{A.1} \]

We will assume that the noise is gaussian with zero mean and variance v², that is, probit regression (Neal, 1997). With gaussian noise, the original step-function likelihood is modified to a sigmoidal-shaped (error) function whose gain is set by the variance of the noise:

\[ p(t \mid h) = \int d\xi\; p(t \mid h, \xi)\, p(\xi) = \Phi\!\left( \frac{t h}{\sqrt{v^2}} \right). \]
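In code, the noisy likelihood is simply the gaussian cumulative distribution Φ applied to the rescaled activation. The sketch below is an illustration, not part of the article; it writes Φ via the standard error function and shows that the noise-free step function is recovered as v² goes to zero:

```python
import math

def Phi(z):
    """Standard gaussian cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_likelihood(t, h, v2):
    """p(t | h) = Phi(t * h / sqrt(v^2)): the gaussian-smoothed step likelihood."""
    return Phi(t * h / math.sqrt(v2))

p_at_margin = probit_likelihood(+1, 0.0, 1.0)    # exactly 0.5: activation carries no information
p_low_noise = probit_likelihood(+1, 0.5, 1e-12)  # close to 1: approaches the step function
```

Note that p(+1 | h) + p(−1 | h) = 1 for any noise level, as required of a binary likelihood.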
Alternatively, and completely equivalently, one may shift variables to a noisy process h(x) + ξ(x). This process is also gaussian, with the modified covariance matrix C_ij + v²δ_ij. The likelihood (and the mean-field equations) for this process is therefore the same as for the noise-free case; the noise is simply included as an extra term in the covariance matrix. In the simulations, we will stick to the latter formulation. Importantly, the two formulations are equivalent for both TAP and naive mean-field theory; the Bayes prediction and the leave-one-out estimate ⟨h_i⟩_{\i} are the same in both formulations.⁶ Some of the quantities in the theory are not invariant, however; for example, ⟨h_i⟩ in the original process is equivalent to ⟨h_i⟩ + v²t_iα_i in the noisy process, whereas others, such as the α_i, remain unchanged.

Appendix B: Covariance Functions
A thorough discussion of covariance functions and their properties is given in Abrahamsen (1997). We have tested the following covariance functions:

Simple perceptron:
\[ C(x, x') = \sum_l w_l x_l x'_l + v_1. \]

Infinite net:
\[ C(x, x') = v_0\, \frac{2}{\pi} \arcsin\!\left( \frac{\sum_l w_l x_l x'_l}{\sqrt{\big(1 + \sum_l w_l x_l x_l\big)\big(1 + \sum_l w_l x'_l x'_l\big)}} \right) + v_1. \]

Gaussian:
\[ C(x, x') = v_0 \exp\!\left( -\tfrac{1}{2} \sum_l w_l (x_l - x'_l)^2 \right) + v_1. \]

Cauchy:
\[ C(x, x') = \frac{v_0}{\big(1 + \sum_l w_l (x_l - x'_l)^2\big)^d} + v_1. \]
The first two are obtained from feedforward neural network models, which converge to gaussian processes. In both cases, the prior on the weights and thresholds is a spherical gaussian. The infinite-net covariance function is obtained from a two-layer network with a linear output unit in the limit of an infinite number of hidden units with a sigmoidal (error function) activation function given by g(x) = 2Φ(x) − 1 (Neal, 1996; Williams, 1997). The sign activation function g(x) = sgn x may also be treated in the infinite-net limit and simply amounts to removing the terms of 1 in the denominator of the infinite-net covariance function. The gaussian covariance function is a well-known example of an exponential covariance function. It may be obtained from a radial basis network with an infinite number of hidden units (Williams, 1997). The Cauchy or rational quadratic covariance function is defined for all d > 0 (Abrahamsen, 1997). In the simulations, we have simply chosen to work with d = 1. Since the properties of the solution are independent of the scale of the gaussian fields, we may fix v₀ to, say, v₀ = 1. The length scale for input dimension l is defined by 1/√w_l. Important inputs will be well defined and thus associated with a relatively short length scale, and vice versa for unimportant inputs. v₁ corresponds to the variance of the output threshold. To include the effect of gaussian noise on the fields, a term v² is, as discussed in appendix A, added to the diagonal of the covariance matrix.
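The four covariance functions above can be written down directly. The following sketch (a minimal pure-Python rendering with made-up weight and variance settings, not part of the article) evaluates each of them:

```python
import math

def dot_w(w, x, xp):
    """Weighted inner product: sum_l w_l * x_l * x'_l."""
    return sum(wl * a * b for wl, a, b in zip(w, x, xp))

def c_perceptron(x, xp, w, v1):
    """Simple perceptron: C(x, x') = sum_l w_l x_l x'_l + v1."""
    return dot_w(w, x, xp) + v1

def c_infinite_net(x, xp, w, v0, v1):
    """Infinite net: v0*(2/pi)*arcsin(s_xx' / sqrt((1+s_xx)(1+s_x'x'))) + v1."""
    num = dot_w(w, x, xp)
    den = math.sqrt((1.0 + dot_w(w, x, x)) * (1.0 + dot_w(w, xp, xp)))
    return v0 * (2.0 / math.pi) * math.asin(num / den) + v1

def c_gaussian(x, xp, w, v0, v1):
    """Gaussian: v0 * exp(-0.5 * sum_l w_l (x_l - x'_l)^2) + v1."""
    return v0 * math.exp(-0.5 * sum(wl * (a - b) ** 2 for wl, a, b in zip(w, x, xp))) + v1

def c_cauchy(x, xp, w, v0, v1, d=1.0):
    """Cauchy (rational quadratic): v0 / (1 + sum_l w_l (x_l - x'_l)^2)^d + v1."""
    return v0 / (1.0 + sum(wl * (a - b) ** 2 for wl, a, b in zip(w, x, xp))) ** d + v1
```

At x = x', the gaussian and Cauchy functions both return v₀ + v₁, and each function is symmetric in its two arguments, as a covariance function must be.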
⁶ Note that in an MAP approach (Williams & Barber, 1998), the two formulations are not equivalent.
Appendix C: Second Moments λ and λ_i

In this appendix, we derive mean-field expressions for the second moments,

\[ \lambda = \langle h^2(x) \rangle - \langle h(x) \rangle^2, \qquad \lambda_i = \langle h_i^2 \rangle_{\setminus i} - \langle h_i \rangle_{\setminus i}^2, \]

using a linear response method. We first derive λ in some detail and thereafter indicate how to generalize this result to λ_i. By integrating by parts using equation 3.6, we find

\[ \langle h^2(x) \rangle = C(x, x) + \sum_k C(x, x_k)\, \frac{1}{p(t)} \int dh\, d\mathbf{h}\; p(h, \mathbf{h})\, h(x)\, \frac{\partial}{\partial h_k} p(t \mid \mathbf{h}). \tag{C.1} \]

To rewrite the last term so that we may apply a linear response argument, it should be expressed as a derivative of a known quantity, in this case ⟨h(x)⟩. This is achieved by introducing fields ξ = (ξ₁, …, ξ_N) added to the random activations in the likelihood term, taking the derivative with respect to this field, and setting ξ = 0 afterward, that is, ∂p(t | 𝐡)/∂h_k = ∂p(t | 𝐡 + ξ)/∂ξ_k |_{ξ=0}:

\[ \lambda = C(x, x) + \sum_k C(x_k, x)\, \frac{\partial}{\partial \xi_k} \left( \frac{1}{p(t \mid \xi)} \int dh\, d\mathbf{h}\; p(h, \mathbf{h})\, h(x)\, p(t \mid \mathbf{h} + \xi) \right)_{\xi = 0} = C(x, x) + \sum_k C(x_k, x) \left. \frac{\partial \langle h(x) \rangle}{\partial \xi_k} \right|_{\xi = 0}, \tag{C.2} \]

where p(t | ξ) = ∫ d𝐡 p(𝐡) p(t | 𝐡 + ξ). Note that moving the derivative to the left of the normalization constant gives rise to an additional factor of −⟨h(x)⟩² compared to equation C.1.

To calculate the linear responses ∂⟨h(x)⟩/∂ξ_k within the gaussian approximations of the TAP approach, we make a further approximation and assume that on adding small fields we need only consider the changes in the means of the fields and can neglect changes in the variances. Using equation 3.7, we get

\[ \delta \langle h(x) \rangle = \sum_j C(x, x_j)\, t_j\, \delta \alpha_j, \tag{C.3} \]

where δ⟨h(x)⟩ ≡ ⟨h(x)⟩|_{ξ≠0} − ⟨h(x)⟩|_{ξ=0}. We therefore have to solve for the δα_j in terms of ξ. The shifts of the α_j are related to the shifts in the means of the fields via equation 3.9,

\[ \delta \langle h_i \rangle = \xi_i + \sum_j C_{ij}\, t_j\, \delta \alpha_j, \tag{C.4} \]

where the shift δα_j is found from equation 3.16,

\[ \delta \langle h_i \rangle \approx -\Lambda_i\, t_i\, \delta \alpha_i, \tag{C.5} \]

with Λ_i defined by equation 3.19. Substituting δ⟨h_i⟩ into equation C.4 and solving for δα_i, we finally arrive at equation 3.17.

We can repeat the same steps to derive equations for λ_i. The only difference is that all posterior distributions must be replaced by the corresponding posteriors in which the ith pattern is absent. Analogous to equation 3.17, we find

\[ \lambda_i \approx C_{ii} - \sum_{jk \neq i} C_{ij} \left[ \big( \Lambda^{\setminus i} + C^{\setminus i} \big)^{-1} \right]_{jk} C_{ki}, \tag{C.6} \]

where Λ^{\i} and C^{\i} denote the (N − 1) × (N − 1) reduced matrices in which the ith row and column are deleted. Here, we have assumed that the change in Λ_j (j ≠ i) is negligible when the ith pattern is removed. Using the following identity for the partitioned inverse (see, e.g., Press, Teukolsky, Vetterling, & Flannery, 1992),

\[ \left( \left[ A^{-1} \right]_{ii} \right)^{-1} = A_{ii} - \sum_{jk \neq i} A_{ij} \left[ \big( A^{\setminus i} \big)^{-1} \right]_{jk} A_{ki}, \tag{C.7} \]
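The partitioned-inverse identity of equation C.7 is easy to check numerically. The following sketch (pure Python with a small Gauss-Jordan inverse; not part of the article, and the test matrix is arbitrary) verifies it for a symmetric 3 × 3 matrix:

```python
def inverse(A):
    """Invert a small matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [row[n:] for row in M]

def minor(A, i):
    """Delete row i and column i: the reduced matrix A^{\\i} of equation C.7."""
    return [[A[r][c] for c in range(len(A)) if c != i] for r in range(len(A)) if r != i]

def check_identity(A, i):
    """Compare 1/[A^{-1}]_ii with A_ii - sum_{jk != i} A_ij [(A^{\\i})^{-1}]_jk A_ki."""
    lhs = 1.0 / inverse(A)[i][i]
    idx = [r for r in range(len(A)) if r != i]
    Ainv_red = inverse(minor(A, i))
    rhs = A[i][i] - sum(A[i][j] * Ainv_red[a][b] * A[k][i]
                        for a, j in enumerate(idx) for b, k in enumerate(idx))
    return lhs, rhs

A = [[4.0, 1.0, 0.5],
     [1.0, 3.0, 0.2],
     [0.5, 0.2, 2.0]]
lhs, rhs = check_identity(A, 1)
```

The two sides agree to machine precision for any index i, which is exactly the step that turns equation C.6 into the compact final result.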
the expression may be simplified to the final result, equation 3.18.

Appendix D: Leave-One-Out from Linear Response
The subsequent derivation of an approximate leave-one-out (loo) estimator is close in spirit to other similar procedures that rely on first-order expansions. These may be motivated by the assumption of small changes in appropriate variables when an example is removed from the training set. (See, e.g., the work of Xiang & Wahba, 1996, who developed an approximate loo estimator for smoothing splines; for applications to neural networks, see, e.g., Larsen & Hansen, 1996.) For simplicity, we will use equality (=) in all expressions and expect that the reader will understand which of them are approximations.

Assume we remove example j from the training set and the algorithm is retrained on the remaining ones. Then the means of the fields of the remaining inputs change according to equation 3.9:

\[ \delta \langle h_i \rangle = \sum_{k \neq j} C_{ik}\, t_k\, \delta \alpha_k - C_{ij}\, t_j\, \alpha_j. \tag{D.1} \]

The general leave-one-out estimator is then simply

\[ \epsilon_{\mathrm{loo}} = \frac{1}{N} \sum_{i=1}^{N} \Theta\big( -t_i \left( \langle h_i \rangle + \delta \langle h_i \rangle \right) \big), \tag{D.2} \]

with j set equal to i in equation D.1. To make the following derivation as general as possible, we will assume that the algorithm produces embedding strengths α_i that are found as the solution of an equation α_i = f(⟨h_i⟩, α_i). Examples are equations 3.13 and 6.3 for, respectively, TAP and naive mean-field theory. There may be further dependencies on other quantities as well, but we will assume that their response to the change of ⟨h_i⟩ or α_i is negligible. Assuming that all changes are small within linear response, the change δ⟨h_i⟩ can be expressed by δα_i through δα_i = (∂f/∂⟨h_i⟩) δ⟨h_i⟩ + (∂f/∂α_i) δα_i, where we, analogous to equation C.5, get δ⟨h_i⟩ = −V_i t_i δα_i, with the definition

\[ V_i \equiv t_i \left( \frac{\partial f / \partial \langle h_i \rangle}{1 - \partial f / \partial \alpha_i} \right)^{-1}. \tag{D.3} \]

Hence, we may solve for δα_k:

\[ -t_k\, \delta \alpha_k = \sum_{i \neq j} \left[ \big( V^{\setminus j} + C^{\setminus j} \big)^{-1} \right]_{ki} C_{ij}\, t_j\, \alpha_j. \tag{D.4} \]

Finally, in order to find δ⟨h_i⟩ when example i itself is removed from the training set, we set j = i in equation D.1 with δα_k from equation D.4. Using the matrix identity, equation C.7, we get

\[ \delta \langle h_i \rangle = -\left( \frac{1}{\left[ (V + C)^{-1} \right]_{ii}} - V_i \right) t_i\, \alpha_i. \tag{D.5} \]

To make a self-consistency check for TAP mean-field theory, we note that δ⟨h_i⟩ should be equal to ⟨h_i⟩_{\i} − ⟨h_i⟩ = −λ_i t_i α_i. This is indeed found to be the case: comparing with equation 3.18 for λ_i, V_i of equation D.3 is identified with Λ_i of equation 3.19.

Appendix E: Support Vector Machines
In this appendix, we consider the SVM limit of the TAP mean-field equations. TAP equations for SVM learning in the simple perceptron, for a spherical input distribution and large input dimensionality, have previously been derived by Wong (1995). We first rederive the Kuhn-Tucker condition from the TAP equations and then show that the leave-one-out (loo) estimator becomes especially simple in this limit.

We can show self-consistently that the limit β → ∞ in equation 7.1 is equivalent to sending the variances λ_i → 0 in the TAP approach. Having the factor β in the prior is the same as rescaling the covariance matrix by a factor of 1/β; thus equation 3.9 becomes

\[ \langle h_i \rangle = \sum_j C_{ij}\, t_j\, \hat{\alpha}_j, \]

where we have introduced the rescaled (and finite) embedding strength α̂_i ≡ α_i/β. Using the asymptotic expression for the error function, D(z)/Φ(z) → −zΘ(−z) for |z| → ∞, in equation 3.13 (with an extra minus sign for the margin), we obtain

\[ \hat{\alpha}_i = \frac{1 - t_i \langle h_i \rangle_{\setminus i}}{\lambda_i\, \beta}\; \Theta\big( 1 - t_i \langle h_i \rangle_{\setminus i} \big). \]

Inserting this back into equation 3.16, we get

\[ t_i \langle h_i \rangle = 1 + \big( t_i \langle h_i \rangle_{\setminus i} - 1 \big)\, \Theta\big( t_i \langle h_i \rangle_{\setminus i} - 1 \big). \]

Putting the expressions for α̂_i and t_i⟨h_i⟩ together, we arrive at the Kuhn-Tucker condition, equation 7.2.

For the loo estimator, equation 5.1, we need to compute the cavity average of the activation, ⟨h_i⟩_{\i}. When t_i⟨h_i⟩_{\i} > 1, we see immediately that t_i⟨h_i⟩ = t_i⟨h_i⟩_{\i} > 1, so nonsupport vectors do not contribute to the loo error. We need only consider support vectors, for which t_i⟨h_i⟩ = 1 and thus t_i⟨h_i⟩_{\i} = 1 − λ_i β α̂_i. We therefore have to determine λ_i β, which is done by taking the λ_i → 0 limit in the expression for Λ_i, equation 3.19, and thereafter in the expression for λ_i, equation 3.18, giving

\[ \lambda_i = \begin{cases} \dfrac{1}{\beta\, [C_{\mathrm{SV}}^{-1}]_{ii}} & \text{for } t_i \langle h_i \rangle_{\setminus i} < 1 \\[2mm] \dfrac{C_{ii}}{\beta} & \text{for } t_i \langle h_i \rangle_{\setminus i} > 1, \end{cases} \]

where C_SV denotes the covariance matrix for the support vector examples. The result for λ_i proves our self-consistent assumption: λ_i → 0 for β → ∞. Finally, inserting the result for t_i⟨h_i⟩_{\i} into equation 5.1 gives the loo estimator for SVMs, equation 7.3.

Acknowledgments
We are grateful to Junbin Gao, Pál Ruján, Sara A. Solla, and Grace Wahba for discussions and to Chris Williams and two anonymous referees for very constructive comments and suggestions. This research is supported by the Swedish Foundation for Strategic Research.
References

Abrahamsen, P. (1997). A review of gaussian random fields and correlation functions (Tech. Rep. No. 917). Oslo, Norway: Norwegian Computing Center.
Anlauf, J. K., & Biehl, M. (1989). The AdaTron: An adaptive perceptron algorithm. Europhys. Lett., 10, 687.
Barber, D., & Williams, C. K. I. (1997). Gaussian processes for Bayesian classification via hybrid Monte Carlo. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9 (pp. 340–346). Cambridge, MA: MIT Press.
Boser, B., Guyon, I., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual Workshop on Computational Learning Theory (pp. 144–152). Pittsburgh: ACM.
Csató, L., Fokoué, E., Opper, M., Schottky, B., & Winther, O. (in press). Efficient approaches to gaussian process classification. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12. Cambridge, MA: MIT Press.
Gibbs, M. N., & Mackay, D. J. C. (1997). Variational gaussian process classifiers. Preprint, Cambridge University.
Gorman, R. P., & Sejnowski, T. J. (1988). Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks, 1, 75.
Jaakkola, T. S., & Haussler, D. (1999). Probabilistic kernel regression models. In Proceedings of the 1999 Conference on AI and Statistics.
Jordan, M. I. (Ed.). (1999). Learning in graphical models. Cambridge, MA: MIT Press.
Larsen, J., & Hansen, L. K. (1996). Linear unlearning for cross-validation. Advances in Computational Mathematics, 5, 269–280.
Mackay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4, 415–447.
Mackay, D. J. C. (1997). Gaussian processes: A replacement for neural networks. NIPS tutorial. Available online at: http://wol.ra.phy.cam.ac.uk/pub/mackay/.
Mézard, M., Parisi, G., & Virasoro, M. A. (1987). Spin glass theory and beyond. Singapore: World Scientific.
Neal, R. (1996). Bayesian learning for neural networks. Berlin: Springer-Verlag.
Neal, R. M. (1997). Monte Carlo implementation of gaussian process models for Bayesian regression and classification (Tech. Rep. No. 9702). Toronto: Department of Statistics, University of Toronto.
Opper, M., & Winther, O. (1996). A mean field approach to Bayes learning in feed-forward neural networks. Phys. Rev. Lett., 76, 1964.
Opper, M., & Winther, O. (1999a). Mean field methods for classification with gaussian processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11. Cambridge, MA: MIT Press.
Opper, M., & Winther, O. (1999b). Gaussian processes and SVM: Mean field results and leave-one-out. In P. Bartlett, B. Schölkopf, D. Schuurmans, & A. Smola (Eds.), Large margin classifiers. Cambridge, MA: MIT Press.
Papoulis, A. (1984). Probability, random variables, and stochastic processes (2nd ed.). New York: McGraw-Hill.
Parisi, G. (1988). Statistical field theory. Reading, MA: Addison-Wesley.
Parisi, G., & Potters, M. (1995). Mean-field equations for spin models with orthogonal interaction matrices. J. Phys. A: Math. Gen., 28, 5267.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. Cambridge: Cambridge University Press.
Ripley, B. D. (1994). Flexible non-linear approaches to classification. In V. Cherkassky, J. H. Friedman, & H. Wechsler (Eds.), From statistics to neural networks. Berlin: Springer-Verlag.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge: Cambridge University Press.
Schölkopf, B., Burges, C. J. C., & Smola, A. J. (Eds.). (1998). Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Thouless, D. J., Anderson, P. W., & Palmer, R. G. (1977). Solution of a "solvable model of a spin glass." Phil. Mag., 35, 593.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wahba, G. (1990). Spline models for observational data. Philadelphia: SIAM.
Wahba, G. (1998). Support vector machines: Reproducing kernel Hilbert spaces and the randomized GACV. In B. Schölkopf, C. J. C. Burges, & A. J. Smola (Eds.), Advances in kernel methods: Support vector learning. Cambridge, MA: MIT Press.
Williams, C. K. I. (1997). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Neural information processing systems, 9 (pp. 295–301). Cambridge, MA: MIT Press.
Williams, C. K. I., & Barber, D. (1998). Bayesian classification with gaussian processes. IEEE Trans. Pattern Analysis and Machine Intelligence, 20, 1342–1351.
Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Neural information processing systems, 8 (pp. 514–520). Cambridge, MA: MIT Press.
Wong, K. Y. M. (1995). Microscopic equations and stability conditions in optimal neural networks. Europhys. Lett., 30, 245.
Xiang, D., & Wahba, G. (1996). A generalized approximate cross validation for smoothing splines with non-gaussian data. Statistica Sinica, 6, 675–692.

Received February 22, 1999; accepted January 5, 2000.
LETTER
Communicated by Christopher Bishop
The Bayesian Evidence Scheme for Regularizing Probability-Density Estimating Neural Networks

Dirk Husmeier
Biomathematics and Statistics Scotland, Scottish Crop Research Institute, Invergowrie, Dundee DD2 5DA, U.K.

Training probability-density estimating neural networks with the expectation-maximization (EM) algorithm aims to maximize the likelihood of the training set and therefore leads to overfitting for sparse data. In this article, a regularization method for mixture models with generalized linear kernel centers is proposed, which adopts the Bayesian evidence approach and optimizes the hyperparameters of the prior by type II maximum likelihood. This includes a marginalization over the parameters, which is done by Laplace approximation and requires the derivation of the Hessian of the log-likelihood function. The incorporation of this approach into the standard training scheme leads to a modified form of the EM algorithm, which includes a regularization term and adapts the hyperparameters on-line after each EM cycle. The article presents applications of this scheme to classification problems, the prediction of stochastic time series, and latent space models.

1 Introduction and Notation
Consider the problem of inferring the probability density P(y_t | x_t) of some m-dimensional target vector y_t conditional on an n-dimensional vector of explanatory variables x_t. A common approach is to approximate the unknown true distribution by a mixture model (Jacobs, Jordan, Nowlan, & Hinton, 1991; Jordan & Jacobs, 1994; Bishop, 1995), for which, in this article, the functional form of Husmeier and Taylor (1998) will be chosen. Let g(x_t) ∈ R^ñ denote a (fixed and in general nonlinear) transformation of the explanatory variables, and define the following generalized linear function,

\[ f_{ki}(x_t) := f(x_t; w_{ki}) := w_{ki}^{\dagger}\, g(x_t), \tag{1.1} \]

in which w_ki is an ñ-dimensional parameter vector. (Note that the dimensions of g(x_t) and x_t are in general different: ñ ≠ n.) Also, introduce the positive and normalized mixing coefficients p_k (p_k ≥ 0, Σ_k p_k = 1) and the positive precisions (inverse variances) β_k > 0. Then a multivariate generalization of the model discussed in Husmeier and Taylor (1998) has the

Neural Computation 12, 2685–2717 (2000) © 2000 Massachusetts Institute of Technology
Figure 1: Gaussian mixture network for predicting conditional probability densities. The network contains two hidden layers, where units in the first hidden layer have a sigmoidal and those in the second hidden layer a gaussian transfer function. The precisions of the gaussian bumps, β_k, are treated as adaptable parameters, and the output weights p_k are positive and normalized. Arrows symbolize weights with a constant value of one; bold lines represent wires with adaptable weights. When applying the random vector functional link net approach (discussed in the text), the weights between the input and the first hidden layer, plotted as narrow lines, are drawn at random from some appropriate distribution and then remain fixed during training.
following form:

\[ P(y_t \mid x_t) = \sum_{k=1}^{K} p_k\, P(y_t \mid x_t, k) \tag{1.2} \]

\[ P(y_t \mid x_t, k) = \sqrt{ \prod_{i=1}^{m} \frac{\beta_{ki}}{2\pi} }\; \exp\!\left( -\sum_{i=1}^{m} \frac{\beta_{ki}}{2} \big[ y_{ti} - f_{ki}(x_t) \big]^2 \right). \tag{1.3} \]
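The mixture density of equations 1.2 and 1.3 can be evaluated directly. The sketch below is an illustration (not part of the article) using made-up parameter values for a one-dimensional, two-kernel case:

```python
import math

def kernel_density(y, f, beta):
    """Equation 1.3: product of independent gaussians with means f_ki and precisions beta_ki."""
    norm = math.sqrt(math.prod(b / (2.0 * math.pi) for b in beta))
    quad = sum(b / 2.0 * (yi - fi) ** 2 for b, yi, fi in zip(beta, y, f))
    return norm * math.exp(-quad)

def mixture_density(y, f, beta, p):
    """Equation 1.2: mixture over K kernels; f[k] and beta[k] are per-kernel lists."""
    return sum(pk * kernel_density(y, fk, bk) for pk, fk, bk in zip(p, f, beta))

# A one-dimensional (m = 1), two-kernel (K = 2) example with hypothetical parameters.
p = [0.3, 0.7]
f = [[0.0], [2.0]]       # kernel centers f_k(x_t) for one fixed input x_t
beta = [[1.0], [4.0]]    # kernel precisions
density_at_1 = mixture_density([1.0], f, beta, p)
```

Summed over a fine grid, the density integrates to one, as a conditional density must.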
A possible neural network realization is shown in Figure 1. Note that the restriction implied by equation 1.1 is essentially a generalized linear model. This allows the model parameters

\[ \theta := \{ p_1, \ldots, p_k, \ldots, p_{K-1},\; \beta_{11}, \ldots, \beta_{ki}, \ldots, \beta_{Km},\; w_{11}, \ldots, w_{ki}, \ldots, w_{Km} \} \tag{1.4} \]

to be adapted with the expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977), which is known to be a fast (superlinear) algorithm to maximize the likelihood P(D | θ) of the training set D := {x_t, y_t}_{t=1}^N (see, e.g., Xu & Jordan, 1996).

Here and in what follows, lowercase boldface letters denote column vectors, uppercase boldface letters denote matrices, and the dagger (†) indicates matrix transposition. The index k ∈ {1, …, K} will be used to label kernels, t ∈ {1, …, N} labels training exemplars, and the subscript i ∈ {1, …, m} in y_ti represents the ith coordinate of the target vector y_t. The mixing coefficients and kernel precisions are collected in the vectors p := (p_1, …, p_{K−1})† and β := (β_11, …, β_Km)†.¹ The vector w_k := (w_k1†, …, w_km†)† comprises all the weights feeding into the kth kernel. Alternatively, the matrix notation W_k := (w_k1, …, w_km)† will occasionally be used (see equation 2.18).² Finally, w denotes the vector of all weights in the network: w := (w_1†, …, w_K†)†.

2 Regularization with the Bayesian Evidence Scheme

2.1 Summary of the Concept. It is well known that for sparse training data, a maximum likelihood approach usually leads to severe overfitting. The objective of this article, therefore, is to generalize the Bayesian evidence scheme (MacKay, 1992) to regularizing probability-estimating neural networks. First, introduce a gaussian prior on the weights w, which depends on a vector of hyperparameters α = (a_1, …, a_K)†:

\[ P(w \mid \alpha) := \prod_{k=1}^{K} \prod_{i=1}^{m} P(w_{ki} \mid a_k) := \prod_{k=1}^{K} \prod_{i=1}^{m} \left( \frac{a_k}{2\pi} \right)^{\tilde{n}/2} \exp\!\left( -\frac{a_k}{2}\, w_{ki}^{\dagger} w_{ki} \right). \tag{2.1} \]
This is equivalent to the common regularization method of linear weight decay, by which large values of w_ki are penalized. Note that equation 2.1 implies a division of the weights w into several weight groups w_ki, with weights feeding into the same kernel sharing a common weight-decay hyperparameter a_k. Also note that the hierarchical structure of the model, α → w → y, implies conditional independence of y on α given w. On observing the training data D = {y_t, x_t}_{t=1}^N, the most probable parameter values are those that maximize the posterior probability,

\[ P(w, \beta, p, \alpha \mid D) \propto P(D \mid w, \beta, p)\, P(w \mid \alpha), \tag{2.2} \]

where the indicated proportionality holds for a metric in which the prior

¹ Note that due to the constraint Σ_k p_k = 1, only K − 1 mixing coefficients are adaptable parameters.
² Note that w_k is an mñ-dimensional vector, while W_k is an m-by-ñ matrix.
P(β, p, α) is uniform. Define

\[ E_o(w, \beta, p) := -\ln P(D \mid w, \beta, p) \tag{2.3} \]
\[ R(w; \alpha) := -\ln P(w \mid \alpha) \tag{2.4} \]
\[ E(w, \beta, p; \alpha) := E_o(w, \beta, p) + R(w; \alpha). \tag{2.5} \]

Note that unregularized training corresponds to minimizing the cost function E_o, whereas regularized training with fixed weight-decay hyperparameters α is effected by minimizing the total cost function E. Since, in general, we do not have sufficient prior knowledge to decide on the values for α in advance, a straightforward approach seems to be the maximization of the posterior probability in equation 2.2 with respect to all parameters. However, as pointed out by MacKay (1992), this is likely to lead to suboptimal results: the joint posterior distribution tends to be strongly skewed, with a mode that is not representative of the distribution as a whole (see also Bishop & Qazaz, 1995, and MacKay, 1993, for further discussion). This article therefore follows the idea in MacKay (1992) to find the mode of P(β, p, α | D) after integrating out the weight parameters w. Assuming again a metric in which P(β, p, α) is uniform, this is equivalent to maximizing the likelihood

\[ P(D \mid \beta, p, \alpha) = \int P(D, w \mid \beta, p, \alpha)\, dw = \int P(D \mid w, \beta, p)\, P(w \mid \alpha)\, dw = \int \exp[-E(w; \beta, p, \alpha)]\, dw, \tag{2.6} \]
where in the last step, equations 2.3 through 2.5 have been applied. Note that MacKay's approach is equivalent to the type II maximum likelihood method of conventional statistics. The integral in equation 2.6 is solved by Laplace approximation, which is equivalent to a Taylor series expansion of E up to second order:

\[ E(w; \beta, p, \alpha) = E(\hat{w}; \beta, p, \alpha) + \frac{1}{2} (w - \hat{w})^{\dagger} H (w - \hat{w}) \tag{2.7} \]
\[ \hat{w} := \operatorname{argmin}_w \{ E(w; \beta, p, \alpha) \} \tag{2.8} \]
\[ H := \left[ \nabla_w \nabla_w^{\dagger} E \right]_{w = \hat{w}}. \tag{2.9} \]

Inserting equation 2.7 into 2.6 and defining M := dim w = mñ gives

\[ P(D \mid \beta, p, \alpha) = \sqrt{ \frac{(2\pi)^M}{\det H} }\; \exp\!\left[ -E(\hat{w}; \beta, p, \alpha) \right] \tag{2.10} \]
\[ E^{*}(\hat{w}; \beta, p, \alpha) := -\ln P(D \mid \beta, p, \alpha) = E(\hat{w}; \beta, p, \alpha) + \frac{1}{2} \ln \det H - \frac{M}{2} \ln(2\pi). \tag{2.11} \]
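Equations 2.6 through 2.10 can be checked numerically in one dimension. The sketch below is illustrative only (the cost functions are made up, not from the article); it compares the Laplace estimate √(2π/H) exp(−E(ŵ)) with a brute-force quadrature of ∫ exp(−E(w)) dw:

```python
import math

def laplace_integral(E, H, w_hat):
    """One-dimensional Laplace approximation of the integral of exp(-E(w)) (equation 2.10, M = 1)."""
    return math.sqrt(2.0 * math.pi / H) * math.exp(-E(w_hat))

def numeric_integral(E, lo=-8.0, hi=8.0, n=16000):
    """Trapezoidal quadrature of the integral of exp(-E(w))."""
    h = (hi - lo) / n
    s = 0.5 * (math.exp(-E(lo)) + math.exp(-E(hi)))
    s += sum(math.exp(-E(lo + i * h)) for i in range(1, n))
    return s * h

# Purely quadratic cost: the Laplace approximation is exact.
E_quad = lambda w: 0.5 * w * w                   # minimum at w_hat = 0, Hessian H = 1
exact = laplace_integral(E_quad, 1.0, 0.0)       # equals sqrt(2*pi)

# Mildly non-gaussian cost: Laplace overestimates, since E grows faster than its quadratic expansion.
E_quart = lambda w: 0.5 * w * w + 0.05 * w ** 4  # same quadratic expansion at w_hat = 0
approx = laplace_integral(E_quart, 1.0, 0.0)
```

The gap in the non-gaussian case is the price paid for truncating the Taylor expansion at second order, which is the approximation the whole evidence scheme rests on.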
This leads to the following iterative optimization scheme:

1. Given preliminary values for the hyperparameters³ β, p, α, minimize the cost function E in equation 2.5 with respect to w. From equation 2.5, this is equivalent to

\[ -\nabla_{w_{ki}} E_o = \nabla_{w_{ki}} R. \tag{2.12} \]

2. Given the new values of the weight parameters, w = ŵ, minimize the cost function E* in equation 2.11 with respect to β, p, α. Making use of equations 2.5 and 2.11, this gives

\[ \frac{\partial E_o}{\partial r} + \frac{\partial R}{\partial r} = -\frac{1}{2} \frac{\partial \ln \det H}{\partial r}, \tag{2.13} \]

where r represents any of the hyperparameters p_k, β_ki, and a_k. This scheme is iterated until a self-consistent solution for w, β, p, α has been found.

2.2 The Regularized EM Algorithm. Updating the parameters and hyperparameters according to equations 2.12 and 2.13 can be accomplished with a modified version of the EM algorithm. This section gives an algorithmic description of this scheme; the detailed derivation is relegated to the appendix. First, define the posterior probability for the generation of data point (x_t, y_t) by the kth kernel of the mixture model, equation 1.2:
\[ \pi_k(t) := P(k \mid y_t, x_t) = \frac{P(y_t \mid x_t, k)\, p_k}{\sum_{k'} P(y_t \mid x_t, k')\, p_{k'}}, \tag{2.14} \]

where Bayes' rule has been applied. Moreover, the following definitions are introduced:

\[ G := \big( g(x_1), \ldots, g(x_N) \big) \qquad \text{(an ñ-by-N matrix)} \tag{2.15} \]

\[ \Pi_k := \text{diagonal } N\text{-by-}N \text{ matrix with } (\Pi_k)_{tt'} := \pi_k(t)\, \delta_{tt'} \tag{2.16} \]

\[ I := \text{unit matrix}, \quad I_{ij} := \delta_{ij} \tag{2.17} \]

\[ W_k := (w_{k1}, \ldots, w_{km})^{\dagger} \tag{2.18} \]

\[ y_{t\cdot} := \begin{pmatrix} y_{t,\,i=1} \\ \vdots \\ y_{t,\,i=m} \end{pmatrix}, \qquad y_{\cdot i} := \begin{pmatrix} y_{t=1,\,i} \\ \vdots \\ y_{t=N,\,i} \end{pmatrix}, \qquad Y := (y_{t=1}, \ldots, y_{t=N}) = \begin{pmatrix} y_{i=1}^{\dagger} \\ \vdots \\ y_{i=m}^{\dagger} \end{pmatrix}. \tag{2.19} \]

³ Strictly speaking, β and p are parameters treated like hyperparameters.
(2.19)
Let ε_ki^ν denote the eigenvalues of the matrix A_ki := [∇_{w_ki} ∇_{w_ki}† E_o]_{w=ŵ} (for which an explicit expression will be derived later, in equation A.28), and introduce, similarly to MacKay (1992), the number of well-determined parameters:

\[ \gamma_{ki} := \sum_{\nu=1}^{\tilde{n}} \frac{\varepsilon_{ki}^{\nu}}{a_k + \varepsilon_{ki}^{\nu}}, \qquad \gamma_k := \sum_{i=1}^{m} \gamma_{ki}. \tag{2.20} \]
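In code form (an illustration, not from the article), equation 2.20 is a one-liner. The quantity γ_ki interpolates between the total number of parameters ñ (no regularization, a_k → 0) and zero (an infinitely strong prior):

```python
def n_well_determined(eigenvalues, a_k):
    """Equation 2.20: gamma_ki = sum over eigenvalues eps of eps / (a_k + eps)."""
    return sum(e / (a_k + e) for e in eigenvalues)

eigs = [0.1, 1.0, 10.0]                    # hypothetical eigenvalues of A_ki
g_weak = n_well_determined(eigs, 0.0)      # = 3.0: every parameter well determined
g_strong = n_well_determined(eigs, 1e6)    # near 0: the prior dominates every direction
```

Because γ decreases monotonically in a_k, the scheme recomputes the Hessian eigenvalues and γ after every EM cycle rather than solving for a_k in closed form.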
Now, the new regularized EM algorithm is given by the following iterative update equations:

\[ \hat{p}_k = \frac{N_k}{N} \tag{2.21} \]

\[ G\, \Pi_k\, y_{\cdot i} = \left( G\, \Pi_k\, G^{\dagger} + \frac{a_k}{\beta_{ki}} I \right) \hat{w}_{ki} \tag{2.22} \]

\[ \frac{1}{\hat{\beta}_{ki}} = \frac{ \sum_t \pi_k(t) \left[ f(x_t; \hat{w}_{ki}) - y_{ti} \right]^2 }{ N_k - \gamma_{ki} } \tag{2.23} \]

\[ \frac{1}{\hat{a}_k} = \frac{1}{\gamma_k}\, \mathrm{tr}\!\left[ \hat{W}_k \hat{W}_k^{\dagger} \right], \tag{2.24} \]

where

\[ N_k := \sum_{t=1}^{N} \pi_k(t) \tag{2.25} \]

and the posterior probabilities π_k(t) (defined in equation 2.14) are computed on the basis of the old parameter values. This scheme is iterated according to the standard EM paradigm, where at each iteration the Hessian and its eigenvalues are recalculated, thereby obtaining new values for the numbers of well-determined parameters γ_ki according to equation 2.20. Note that the standard unregularized EM algorithm is regained by setting a_k ≡ 0 and γ_ki ≡ 0, as seen from a comparison with the respective update equations in Husmeier and Taylor (1998).

Since N_k = Σ_{t=1}^N π_k(t) occurs in the denominator of the update equation 2.23, the algorithm becomes unstable for small N_k, that is, for small mixing coefficients p_k = N_k/N. This typically happens when the number of training exemplars N is too small for the given network complexity. It is therefore reasonable to introduce an explicit pruning scheme and discard kernels for which N_k ≤ γ_ki + ε, where in this study ε := 10⁻⁶ was chosen.
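The E-step of equation 2.14 and the mixing-coefficient update of equations 2.21 and 2.25 are easy to state in code. The sketch below is a hypothetical one-dimensional example (not from the article) that computes responsibilities and the updated p̂_k:

```python
import math

def gauss(y, mean, beta):
    """One-dimensional gaussian kernel with precision beta (equation 1.3 with m = 1)."""
    return math.sqrt(beta / (2.0 * math.pi)) * math.exp(-0.5 * beta * (y - mean) ** 2)

def responsibilities(y, means, betas, p):
    """Equation 2.14: posterior probability pi_k(t) that kernel k generated observation y."""
    joint = [pk * gauss(y, mk, bk) for pk, mk, bk in zip(p, means, betas)]
    z = sum(joint)
    return [j / z for j in joint]

def update_mixing(ys, means, betas, p):
    """Equations 2.21 and 2.25: N_k = sum_t pi_k(t), then p_k_hat = N_k / N."""
    N = len(ys)
    Nk = [0.0] * len(p)
    for y in ys:
        for k, r in enumerate(responsibilities(y, means, betas, p)):
            Nk[k] += r
    return [nk / N for nk in Nk]

ys = [-0.2, 0.1, 1.9, 2.2, 2.0]
p_new = update_mixing(ys, means=[0.0, 2.0], betas=[4.0, 4.0], p=[0.5, 0.5])
```

The responsibilities for each point sum to one across kernels, and so do the updated mixing coefficients, which is what makes the pruning test on N_k meaningful.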
3 Application to Latent Space Models
Recently latent space models have become very popular, in which dependencies between observations in a high-dimensional data space are explained by a smaller number of so-called latent degrees of freedom. This allows the probabilistic reformulation and reinterpretation of several well-established machine learning algorithms. Examples are probabilistic principal component analysis (Tipping & Bishop, 1999), independent factor analysis (Attias, 1999), and the generative topographic map (Bishop, Svensen, & Williams, 1998). This study will show how the Bayesian evidence scheme can be applied to mixtures of probabilistic principal component analyzers (MPPCA), as introduced in Tipping and Bishop (1999). Two simplifications are inherent in the MPPCA approach. First, the function g in equation 1.1 is linear, that is,

f_k(x_t) := W_k x_t + μ_k,   (3.1)
where μ_k is a vector of bias parameters.4 Second, the gaussian kernels P(y_t | x_t, k) are isotropic, that is, β_ki = β_k ∀i. Consequently, equation 1.3 simplifies to

P(y_t | x_t, k) = (β_k/2π)^{m/2} exp(−(β_k/2) ||y_t − W_k x_t − μ_k||²),   (3.2)
and the regularized EM update equations, 2.22 and 2.23, can be written in the more compact form (recall the definitions 2.15–2.20, and see the appendix for a derivation):

G Π_k Y† = (G Π_k G† + (α_k/β_k) I) Ŵ_k†,   (3.3)

1/β̂_k = Σ_{t=1}^{N} p_k(t) ||y_t − f_k(x_t)||² / (m N_k − γ_k).   (3.4)
The crucial assumption of the MPPCA model is a normal prior on the latent variables, defined by

P(x_t) = (1/2π)^{n/2} exp(−(1/2) x_t† x_t).   (3.5)
4 Note that this need not be made explicit in equation 1.1, since a bias is easily included by defining one component of g(x_t) to be the constant function, g_i(x_t) ≡ 1.
This allows the integration over the latent space to be carried out analytically and leads to a marginal distribution of y_t in the form

P(y_t) = Σ_{k=1}^{K} p_k P(y_t | k),   (3.6)

P(y_t | k) = ∫ P(y_t | x_t, k) P(x_t) dx_t = [(2π)^m det C_k]^{−1/2} exp(−(1/2)(y_t − μ_k)† C_k^{−1} (y_t − μ_k)),   (3.7)

where C_k is the so-called model covariance matrix, defined by

C_k := β_k^{−1} I + W_k W_k†.   (3.8)
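In code, the marginal of equations 3.7 and 3.8 is just a gaussian density with a structured covariance; the following NumPy sketch (the function name is mine, not from the paper) evaluates its logarithm:

```python
import numpy as np

def mppca_log_marginal(y, mu, W, beta):
    """log P(y | k) for one PPCA kernel, equations 3.7-3.8:
    a gaussian with mean mu and covariance C_k = beta^-1 I + W W^T."""
    m = y.shape[0]
    C = np.eye(m) / beta + W @ W.T          # equation 3.8
    diff = y - mu
    sign, logdet = np.linalg.slogdet(C)
    quad = diff @ np.linalg.solve(C, diff)
    return -0.5 * (m * np.log(2.0 * np.pi) + logdet + quad)
```

For large m, det C_k and C_k^{−1} can be obtained more cheaply from the n-by-n matrix M_k of equation 3.10 via the matrix inversion lemma; the direct form above is only meant to mirror equations 3.7 and 3.8.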
The probability of the latent variables x_t conditional on the data vectors y_t (needed for data compression) is given by Tipping and Bishop (1999),

P(x_t | y_t, k) = [(β_k/2π)^n det M_k]^{1/2} exp(−(β_k/2) {x_t − M_k^{−1} W_k† [y_t − μ_k]}† M_k {x_t − M_k^{−1} W_k† [y_t − μ_k]}),   (3.9)

where

M_k := β_k^{−1} I + W_k† W_k.   (3.10)
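The posterior of equations 3.9 and 3.10 is gaussian with mean M_k^{−1} W_k† (y_t − μ_k) and covariance β_k^{−1} M_k^{−1}, so its first two moments can be sketched as follows (an illustrative sketch; the helper name is mine):

```python
import numpy as np

def latent_posterior_moments(y, mu, W, beta):
    """First and second posterior moments of x_t for one kernel:
    M = beta^-1 I + W^T W (equation 3.10); <x> = M^-1 W^T (y - mu);
    <x x^T> = <x><x>^T + beta^-1 M^-1 (anticipating equations 3.17-3.18)."""
    n = W.shape[1]
    M = np.eye(n) / beta + W.T @ W
    M_inv = np.linalg.inv(M)
    mean = M_inv @ W.T @ (y - mu)
    second = np.outer(mean, mean) + M_inv / beta
    return mean, second
```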
Following Tipping and Bishop (1999), we can extend the EM approach of the previous section and consider the latent variables {x_t} to be missing data. If their values were known, the standard EM update equations, 2.21–2.24, 3.3, and 3.4, could be applied. Since {x_t} are actually unknown, we take the expectation value with respect to the posterior distribution, equation 3.9, denoted by angle brackets, ⟨...⟩. With the update equation

μ̂_k = Σ_t p_k(t) y_t / Σ_t p_k(t),   (3.11)

and the definition

Ỹ_k := (y_{t=1} − μ̂_k, ..., y_{t=N} − μ̂_k),   (3.12)
the new update equations become of the following form:

p̂_k = N_k / N,   (3.13)

⟨G⟩ Π_k Ỹ_k† = (⟨G Π_k G†⟩ + (α_k/β_k) I) Ŵ_k†,   (3.14)

1/β̂_k = (1/(m N_k − γ_k)) Σ_{t=1}^{N} p_k(t) ⟨(y_t − Ŵ_k x_t − μ_k)²⟩,   (3.15)

1/α̂_k = (1/γ_k) tr[W_k W_k†].   (3.16)
There are three main differences from the standard update equations, 2.21–2.24, 3.3, and 3.4:

1. The angle brackets ⟨...⟩ indicate that an expectation value with respect to the distribution of equation 3.9 has been taken.

2. This expectation value also applies to the Hessian, from which the numbers of well-determined parameters, γ_k, are obtained. A derivation will be given in the appendix.

3. Y is replaced by Ỹ_k, which includes a subtraction of the already updated parameters μ̂_k. This, in fact, follows from a two-level EM scheme, in which the μ-parameters are updated in a separate M-step. A discussion of the advantages of this approach is given in Tipping and Bishop (1999).

From equation 3.9 we obtain:

⟨x_t x_t†⟩ = ⟨x_t⟩⟨x_t⟩† + β_k^{−1} M_k^{−1},   (3.17)

⟨x_t⟩ = M_k^{−1} W_k† (y_t − μ_k),   (3.18)
where M_k was defined in equation 3.10. Now recall that g(x_t) = x_t, and apply equations 2.15, 2.16, and 3.12:

⟨G⟩ Π_k Ỹ_k† = Σ_{t=1}^{N} p_k(t) ⟨x_t⟩ (y_t − μ_k)†,   (3.19)

which by inserting equation 3.18 yields:

⟨G⟩ Π_k Ỹ_k† = Σ_{t=1}^{N} p_k(t) M_k^{−1} W_k† (y_t − μ_k)(y_t − μ_k)†   (3.20)
             = N_k M_k^{−1} W_k† S_k,   (3.21)
where the definition of the empirical covariance matrix,

S_k := (1/N_k) Σ_{t=1}^{N} p_k(t) (y_t − μ_k)(y_t − μ_k)†,   (3.22)

has been introduced (recall that N_k = Σ_t p_k(t)). From equation 3.17, we get

⟨G Π_k G†⟩ = Σ_{t=1}^{N} p_k(t) ⟨x_t x_t†⟩ = N_k [M_k^{−1} W_k† S_k W_k + β_k^{−1} I] M_k^{−1},   (3.23)
and inserting equations 3.23 and 3.21 into 3.14 leads to:

(β_k^{−1} I + W_k† S_k W_k M_k^{−1} + (α_k/(β_k N_k)) M_k) Ŵ_k† = W_k† S_k.   (3.24)
Taking the transpose of both sides and solving for Ŵ_k gives5

Ŵ_k = S_k W_k (β_k^{−1} I + M_k^{−1} W_k† S_k W_k + (α_k/(β_k N_k)) M_k)^{−1}.   (3.25)
Note the distinction between the old parameters, W_k, and the new parameters, Ŵ_k. Also, note that for α_k ≡ 0, equation 3.25 reduces to equation C.14 in Tipping and Bishop (1999), so the effect of the Bayesian regularization scheme is the addition of the extra term (α_k/(β_k N_k)) M_k. Inserting equations 3.18 and 3.17 into 3.15 and applying equations 3.22 and 3.23 leads to:

1/β̂_k = (1/(m N_k − γ_k)) Σ_{t=1}^{N} p_k(t) ⟨(y_t − Ŵ_k x_t − μ_k)²⟩
      = (1/(m N_k − γ_k)) Σ_{t=1}^{N} p_k(t) { ||y_t − μ_k||² − 2⟨x_t†⟩ Ŵ_k† (y_t − μ_k) + tr[Ŵ_k ⟨x_t x_t†⟩ Ŵ_k†] }

      = (1/(m N_k − γ_k)) { Σ_{t=1}^{N} p_k(t) [ ||y_t − μ_k||² − 2(y_t − μ_k)† W_k M_k^{−1} Ŵ_k† (y_t − μ_k) ] + N_k tr[Ŵ_k (M_k^{−1} W_k† S_k W_k + β_k^{−1} I) M_k^{−1} Ŵ_k†] }

      = (N_k/(m N_k − γ_k)) tr[ S_k − 2 S_k W_k M_k^{−1} Ŵ_k† + Ŵ_k (M_k^{−1} W_k† S_k W_k + β_k^{−1} I) M_k^{−1} Ŵ_k† ].

By making use of equation 3.25, this gives

1/β̂_k = (N_k/(m N_k − γ_k)) tr[ S_k − S_k W_k M_k^{−1} Ŵ_k† − (α_k/(N_k β_k)) Ŵ_k Ŵ_k† ],   (3.26)

5 The following expression is derived to get the update equation into a form equivalent to that stated in Tipping and Bishop (1999). For a practical implementation, the matrix inversion would not be carried out explicitly.
where kernels with N_k ≤ γ_k/m are pruned. For α_k ≡ 0 and γ_k ≡ 0 (that is, the unregularized case), this reduces to equation C.15 in Tipping and Bishop (1999).

4 Experiments

4.1 Unconditional Probability Densities. When conditioning the data vectors y_t on a constant element g(x_t) ≡ 1, equation 1.2 reduces to the modeling of unconditional probability densities, P(y_t | x_t) = P(y_t | 1), where the weights w_k exiting the constant unit are equivalent to the centers of the gaussian kernels. The prior on the weights is slightly modified from the form of equation 2.1 in that it is centered on the mean of the data rather than the origin.6 This leads to a small modification of the update equations, 2.21–2.24, as discussed in the appendix (equations A.38 and A.39). This study focuses on binary classification problems, where the class-conditional distributions P(y_t | class = 1) and P(y_t | class = 2) are modeled separately with two different networks. The generalization errors of these separate density estimates are measured in terms of the negative normalized log-likelihood for the test data,
E_c := −(1/N_c) Σ_{y_t ∈ D_c^test} ln P(y_t | class = c),   (4.1)

where D_c^test is an independent test set of cardinality N_c for class c. The networks are then combined into a modular structure to predict the probability for a given class, conditional on the input vector y_t, by Bayes’ rule:

P(class = c | y_t) = P(y_t | class = c) P(class = c) / P(y_t),   (4.2)
where P(class = 1) + P(class = 2) = 1, and P(y_t) = Σ_{c=1}^{2} P(y_t | class = c) P(class = c). The further adjustable parameter P(class = 1) represents the

6 This is equivalent to a force that pulls the kernel centers toward the mean of the distribution, whereby kernels are discouraged from homing in on remote outliers.
Figure 2: Two classification problems chosen for the numerical experiments. (Left) Ripley’s synthetic data. (Right) Tremor data.
prior for class 1 and can (if no actual prior knowledge is available) easily be optimized so as to minimize the misclassification on the training set. A straightforward approach is to follow a hard classification scheme and assign exemplar y_t to class c if P(class = c | y_t) > 0.5. The generalization performance can then be measured on an independent test set in terms of the percentage misclassification (E_class).

The Bayesian regularization scheme was tested on the following three data sets. Ripley is a synthetic data set taken from Ripley (1994). There are two features and two classes, where each class has a bimodal distribution, as seen from Figure 2 (left). The class distributions are given by equal mixtures of two normal distributions, whose overlap is chosen to allow a best-possible error rate (Bayes limit) of about 8%. The networks were trained on N_train = 250 training exemplars (125 for each class) and tested on an independent set of size N_test = 1000. Tremor is a real-world data set, plotted in Figure 2 (right), where the objective is to identify Parkinson’s disease on the basis of muscle tremor. The data set, collected by Spyers-Ashby (1996), consists of two input features derived from measurements of arm muscle tremor and a class label representing patient or nonpatient: N_train = 179, N_test = 178. The task of the Kaposi problem is to predict whether an AIDS patient is likely to contract Kaposi’s sarcoma, a vascular tumor that is often aggressive in patients with underlying immunosuppression. The classification is done on the basis of four immunological input variables measuring the concentrations of various lymphocytes, N_train = 263, N_test = 39, with two different partitions into training and test sets.

Figure 3: Evolution of E_1, the normalized negative log-likelihood for class 1, when training a GM network with K = 10 kernels on Ripley’s data set. Dashed lines: training set performance; solid lines: test set performance. (Left) Results of unregularized training. (Right) Networks regularized with the evidence scheme. The three rows refer to different initializations of the parameters. Abscissa: training time; ordinate: E_1.

Figure 3 shows the evolution of E_1, the normalized negative test set log-likelihood for class 1 (see equation 4.1), for training three differently initialized networks (all with K = 10 kernels) on Ripley’s data set. Without
regularization7 (left column) we observe the typical behavior of overfitting. While E_1 naturally decreases on the training set (dashed line), the performance on the test set (solid line) deteriorates from a certain (unknown) optimum number of adaptation steps on. This deficiency is significantly improved when the evidence method for regularization is applied (right column). In this case, E_1 on the test set reaches a constant plateau and does not vary much in response to changes in the training time, thus considerably reducing the risk of overfitting.

The simulations were then repeated 24 times for four initial kernel numbers (K = 5, 10, 15, 20), two initial kernel widths (σ_0 = 1/√β_0 = 0.5, 1), and three initializations of the kernel centroids,8 where in each case training was carried out over a fixed number of T = 20 adaptation steps. (The output weights were always initialized uniformly: p_k = 1/K ∀k.) The left column of Figure 4 shows three scatterplots (for E_1, top, E_2, middle, and E_class, bottom), where for corresponding simulations (starting from the same initialization) results obtained without regularization (abscissa) are plotted against those obtained with the Bayesian regularization scheme (ordinate). It is clearly seen that unregularized training leads to large variations in both performance measures (E_1 and E_2 indicating the accuracy of modeling the class-conditional distributions, and E_class showing the classification error), with large kernel numbers K usually causing drastic overfitting. This is effectively prevented with the Bayesian regularization scheme, which results in a rather constant performance irrespective of the model complexity K and therefore considerably reduces the need for any sophisticated model selection methods. Also, note that the achieved classification performance with regularization is always close to the Bayes limit. Similar simulations were carried out on the other two data sets.9

For the Tremor data, the results are shown on the right of Figure 4, which shows the same scatter diagrams as for Ripley’s data. Again, it is seen that the evidence scheme leads to a considerable stabilization of the generalization performance with respect to changes in the model complexity, although unregularized training occasionally results in a better modeling of the probability density of class 1 (top right) and, consequently, a better classification performance (bottom right). In order to test the statistical significance of the improvement achieved with the evidence method, a matched-pairs t-test was carried out (as de-
7 Recall that training without regularization is equivalent to equations 2.21–2.24 with α_k ≡ 0, γ_ki ≡ 0.
8 The {μ_k} were drawn from N(M_c, 1), where M_c is the unconditional mean for the cth class.
9 For the Tremor data, only two rather than three different initializations of the kernel centers μ_k were chosen (leading to 16 simulations), whereas the simulations on the Kaposi data were repeated for two different partitions of the data into training and test sets (giving 48 simulations).
Figure 4: Comparison between unregularized training (abscissa) and the evidence scheme (ordinate), applied to Ripley’s synthetic data (left) and the Tremor data (right). (Top) E_1 (normalized negative log-likelihood for class 1, test set). (Middle) E_2 (normalized negative log-likelihood for class 2, test set). (Bottom) E_class (percentage misclassification, test set). The symbols refer to different numbers of kernels in the network. X-marks: K = 5; circles: K = 10; crosses: K = 15; diamonds: K = 20. The dashed line indicates equal performance. Symbols below that line point to a performance improvement as a result of regularization with the evidence method.
Table 1: Matched Pairs t-Test for a Comparison Between the Evidence Scheme and Unregularized Training over a Fixed Number of T = 20 Training Steps.

Data     Measure    t-Statistic   Critical Value   Evidence Method
Ripley   E_1           4.02          ±1.71           Better
         E_2           6.24          ±1.71           Better
         E_class       5.97          ±1.71           Better
Tremor   E_1           0.99          ±1.75           Equal
         E_2           3.97          ±1.75           Better
         E_class       1.94          ±1.75           Better
Kaposi   E_1           8.84          ±1.68           Better
         E_2           5.90          ±1.68           Better
         E_class       2.34          ±1.68           Better

Notes: E_1 and E_2 represent the negative normalized test set log-likelihoods for the two classes (see equation 4.1) and are a measure of how well the class-conditional distributions are modeled. E_class is the percentage misclassification. Positive values indicate that the evidence method leads to a better performance; for negative values, the alternative method is superior. The deviation is significant (at a 95% significance level) if the modulus of the statistic is larger than the critical value.
scribed, e.g., in Hoel, 1984). For corresponding simulations, the differences in the performance measures E_1, E_2, and E_class between regularized and unregularized training were calculated.10 Positive values indicate that the evidence method leads to a better performance; for negative values, the alternative method is superior. The results are listed in Table 1, together with the critical value11 for rejecting the null hypothesis of equal performance. The Bayesian regularization scheme consistently leads to an improvement in all three performance measures, which, except for E_1 of the Tremor data, is always significant.

In a further study, the results of the evidence approach were compared with the best test set performance obtained with the unregularized EM algorithm, that is, with the minimum of test set learning curves like those of Figure 3 (left). This is akin to the method of early stopping, except that the selection is done on the basis of the test set rather than a separate valida-
10 In more detail, let z_i denote the results obtained without regularization and z'_i the results obtained with regularization. Now consider the differences d_i := z_i − z'_i and calculate the empirical variance S² := (1/(n−1)) Σ_{i=1}^{n} (d_i − d̄)², in which d̄ denotes the empirical mean: d̄ := (1/n) Σ_{i=1}^{n} d_i. The t-statistic can now be calculated according to t = (√n/S) d̄, which is compared with the critical value for (n − 1) degrees of freedom and a given (in this case, 95%) significance level.
11 The critical value depends on the number of degrees of freedom, which is the number of different simulations minus one (Ripley: 23, Tremor: 15, Kaposi: 47). See, for instance, Hoel (1984).
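The statistic of footnote 10 is straightforward to reproduce; the following sketch (the function name is mine) computes it for paired results:

```python
import numpy as np

def matched_pairs_t(z_unreg, z_reg):
    """t-statistic of footnote 10: d_i = z_i - z'_i, S the empirical
    standard deviation of the differences, t = sqrt(n) * mean(d) / S."""
    d = np.asarray(z_unreg, dtype=float) - np.asarray(z_reg, dtype=float)
    n = d.size
    S = np.sqrt(np.sum((d - d.mean()) ** 2) / (n - 1))
    return np.sqrt(n) * d.mean() / S
```

The returned value is then compared against the critical value for n − 1 degrees of freedom at the chosen significance level.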
Table 2: Matched Pairs t-Test for a Comparison Between the Evidence Scheme and Optimal Early Stopping.

Data     Measure    t-Statistic   Critical Value   Evidence Method
Ripley   E_1           3.40          ±1.71           Better
         E_2           2.74          ±1.71           Better
         E_class       3.66          ±1.71           Better
Tremor   E_1          −1.97          ±1.75           Worse
         E_2          −4.31          ±1.75           Worse
         E_class      −2.33          ±1.75           Worse
Kaposi   E_1           3.02          ±1.68           Better
         E_2          −1.46          ±1.68           Equal
         E_class      −1.62          ±1.68           Equal

Note: For details, see the notes to Table 1.
tion set. Since the latter would reduce the number of training exemplars and, consequently, degrade the performance of the prediction model, the alternative results are better than what could be obtained in real applications. The results are listed in Table 2. On Ripley’s data, the evidence method achieves a consistent improvement in terms of all three performance measures, and for the Kaposi data, the modeling of one of the class-conditional probability distributions is significantly improved. However, on the Tremor data, the evidence method turns out to be inferior to the alternative approaches. A possible explanation for this deficiency will be given in section 5.

4.2 Conditional Probability Densities. The objective of the second part of this study is the prediction of the conditional probability density of a noisy time series, for which the synthetic benchmark problem of Husmeier and Taylor (1997) was chosen. Two stochastic dynamical systems are coupled stochastically according to

y_{t+1} = H(ξ_t − θ) [a_t y_t (1 − y_t)] + [1 − H(ξ_t − θ)] [1 − y_t^{k_t}],   (4.3)

where a_t ∈ [3, 4], k_t ∈ [0.5, 1.25], and ξ_t ∈ [0, 1] are random variables uniformly distributed in the respective interval, θ = 1/3, and H(.) denotes the Heaviside function. The resulting time series is a first-order Markov process with a bimodal conditional probability distribution in state-space. The networks were trained on a time-series segment of length N_train = 200, and the generalization performance was tested on N_test = 1000 independent exemplars.

The chosen network architecture was of the form depicted in Figure 1, where the weights on connections between the input and the first hidden layer were fixed at randomly selected values. Note that this corresponds
to a random vector functional link network, whose universal approximation power is discussed in Igelnik and Pao (1995). Training with the Bayesian evidence scheme followed equations 2.21–2.24. Recall that this reduces to the unregularized EM algorithm for α_k ≡ 0 and γ_ki ≡ 0. The following alternative regularization schemes were employed for a comparison:

Eigenvalue (EV) cutoff. This is equivalent to the unregularized EM algorithm except that when updating the weights w_ki according to equation 2.22 (with α_k = 0), the matrix on the left, G Π_k G†, is singular-value decomposed, and small eigenvalues12 are set to infinity. When solving for w_ki, the contributions along the corresponding eigenvectors thus disappear. This is equivalent to a projection of the full solution vector onto a subspace perpendicular to the domain spanned by the eigenvectors belonging to small eigenvalues or, put differently, components of w_ki that are poorly determined by the data are discarded. (For a detailed exposition of this method, see Press, Teukolsky, Vetterling, & Flannery, 1992, chap. 2.)
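A sketch of this cutoff for a symmetric system matrix, in the spirit of the Press et al. (1992) recipe (the function name and the relative-cutoff argument are mine; the cutoff 10^{−6} e_max mentioned in footnote 12 is used as the default):

```python
import numpy as np

def ev_cutoff_solve(A, b, rel_cutoff=1e-6):
    """Solve A w = b while discarding poorly determined directions:
    eigendirections of the symmetric matrix A with eigenvalues below
    rel_cutoff * e_max contribute nothing to the solution, which is
    equivalent to setting those eigenvalues to infinity."""
    e, V = np.linalg.eigh(A)                 # A = V diag(e) V^T
    keep = e > rel_cutoff * e.max()
    # Reciprocal eigenvalues, zeroed for the discarded directions.
    inv_e = np.where(keep, 1.0 / np.where(keep, e, 1.0), 0.0)
    return V @ (inv_e * (V.T @ b))
```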
Weight decay. The weights w_ki are updated according to equation 2.22 with fixed values for α_k. This is equivalent to a simple weight decay scheme. Two different values of the weight-decay hyperparameter were used: α_k = 0.01 and α_k = 0.1 (the same for all kernels k).
Naive Bayes. The kernel variances are updated according to

1/β̂_ki = (Σ_t p_k(t) [y_ti − f_ki]² + 2r) / (N_k + 2r),   (4.4)

where the hyperparameter r is set to the value that gave the best generalization performance in Ormoneit and Tresp (1996): r = 0.1. Effectively this imposes a lower bound on 1/β_ki. The adaptation of the weights w_ki follows from equation 2.22, with α_k ≡ 0.1 ∀k. Note that this is equivalent to a Bayesian maximum a posteriori approach with a gamma prior on the inverse variances β_ki and a gaussian prior on the weights w_ki.

For each method, an ensemble of 80 networks was created in the following way. The random weights in the network were drawn from four different gaussian distributions N(0, σ_rand) with ln σ_rand ∈ {−1, 0, 1, 2}. For each of the four distribution widths σ_rand, 20 weight configurations were drawn, based on different random number generator seeds. Each set of net-
12 In this study, an eigenvalue was defined as small when e_ν < 10^{−6} e_max, where e_max is the maximum eigenvalue in that weight group. The same cutoff value is recommended in Press et al. (1992).
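Returning briefly to the benchmark itself: the coupled stochastic maps of equation 4.3 are easy to simulate, for example with the following sketch (function and argument names are mine). With probability 1 − θ the logistic branch a_t y_t (1 − y_t) is applied, otherwise the branch 1 − y_t^{k_t}:

```python
import numpy as np

def generate_series(N, seed=0, theta=1.0 / 3.0):
    """Sample N transitions of the stochastically coupled maps of
    equation 4.3, starting from a random initial state in (0, 1)."""
    rng = np.random.default_rng(seed)
    y = np.empty(N + 1)
    y[0] = rng.uniform(0.1, 0.9)
    for t in range(N):
        a = rng.uniform(3.0, 4.0)
        k = rng.uniform(0.5, 1.25)
        xi = rng.uniform(0.0, 1.0)
        if xi >= theta:                  # Heaviside H(xi - theta) = 1
            y[t + 1] = a * y[t] * (1.0 - y[t])
        else:
            y[t + 1] = 1.0 - y[t] ** k
    return y
```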
Figure 5: Dependence of the generalization performance on the regularization method and the training time for the time-series prediction problem. (Left) Evolution of the average generalization “error,” measured in terms of the mean negative normalized test-set log-likelihood E_test. (Right) Evolution of the best generalization “error,” measured in terms of the minimum of E_test. The abscissa represents the number of adaptation steps. Five regularization methods were tested: dotted line: EV cutoff; dashed-dotted line: weight decay, α = 0.01; dashed line: weight decay, α = 0.1; thin solid line: naive Bayes; thick solid line: Bayesian evidence method.
works with identical σ_rand was further subdivided into four subsets, which differed with respect to the initialization of the adaptable weights w_ki, drawn from (1) N(0, 0.1), (2) N(0, 0.25), (3) N(0, 0.5), or (4) N(0, 1.0). The remaining parameters, β_k and p_k, were initialized as before: p_k = 1/K ∀k, β_k = 1 ∀k. Each network contained 10 tanh units in the first hidden layer and K = 10 gaussian kernels in the second hidden layer.

Figure 5 (left) shows the dependence of the average generalization performance on the regularization method and the training time. The graphs represent the mean of the negative normalized test set log-likelihood, E_test = −(1/N_test) Σ_{t=1}^{N_test} ln P(y_{t+1} | y_t). The best results are obtained with the evidence scheme, but the difference is not statistically significant. A clear improve-
ment, however, is achieved with respect to variations in the length of training time. The simulations suggest that no overfitting occurs for the evidence scheme. Compare this with the first three regularization methods (EV cutoff, weight decay with α = 0.01, weight decay with α = 0.1), which show strong overfitting after about 20 to 30 adaptation steps. This calls for model selection by cross-validation, which, however, would reduce the amount of training data and is therefore likely to incur a loss of modeling accuracy. The naive Bayesian method, where the hyperparameter r was set to the optimal value in Ormoneit and Tresp (1996), is strongly overregularized, with an increase of the mean of E_test by about 1.7 standard deviations.

Figure 5 (right) shows the dependence of the best generalization performance on the regularization method and the training time; that is, the graphs represent the minimum of E_test. Again, the evidence scheme leads to a considerable stabilization of the training process with respect to changes in the training time, whereas the first three methods (EV cutoff, α = 0.01, α = 0.1) show drastic overfitting. For small training times, the best underregularized models are slightly better than the best model obtained with the evidence scheme, but this can, in general, not be exploited in practice (since model selection requires a cross-validation set, thereby incurring the problems discussed before). The naive Bayesian method turns out to be strongly overregularized again.

4.3 Latent Space Model (MPPCA). The last application is taken from sleep research, where the objective is to distinguish different sleep phases. The data consisted of two classes, deep sleep and rapid eye movement, where each class contained N = 960 10-dimensional feature vectors.13 The data were randomly split into two sets of equal size for training and testing. An MPPCA model of latent dimension n = 2 was trained over a fixed number of 60 adaptation steps.
For the Bayesian evidence scheme, the weights W_k and kernel precisions β_k were updated according to equations 3.25 and 3.26. Unregularized training followed the same update rules, but set α_k ≡ 0 and γ_k ≡ 0. (Note that this is equivalent to equations C.14 and C.15 in Tipping & Bishop, 1999.) The top graphs of Figure 6 show a typical evolution of the training process. When training with the unregularized maximum likelihood scheme, the negative log-likelihood E = −(1/N) ln P(D | class) continually decreases on the training set (narrow dashed lines), but soon increases on the test set (bold dashed lines) as a result of overfitting. When training with the Bayesian evidence method, the training set performance (narrow solid lines) becomes (naturally) worse, but the test set performance (bold solid lines) is improved, and overfitting is effectively prevented.
13 These vectors were obtained by fitting a tenth-order autoregressive model to a moving fixed-size window of an EEG signal.
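The pair of updates in equations 3.25 and 3.26 can be sketched in NumPy as follows. This is a hedged sketch, not the author's implementation: γ_k is assumed to be computed separately from the Hessian, and the explicit matrix inversion that footnote 5 advises against is kept for readability.

```python
import numpy as np

def mppca_m_step(S, W, beta, alpha, Nk, gamma, m):
    """Regularized MPPCA M-step, equations 3.25-3.26.
    S: weighted empirical covariance (equation 3.22); W, beta: old
    parameters; alpha: hyperparameter; Nk: effective sample size;
    gamma: number of well-determined parameters (assumed given)."""
    n = W.shape[1]
    M = np.eye(n) / beta + W.T @ W                    # equation 3.10
    SW = S @ W
    # Bracketed matrix of equation 3.25.
    A = (np.eye(n) / beta + np.linalg.inv(M) @ W.T @ SW
         + (alpha / (beta * Nk)) * M)
    W_new = SW @ np.linalg.inv(A)                     # equation 3.25
    # Equation 3.26 for the new inverse variance.
    inv_beta_new = (Nk / (m * Nk - gamma)) * np.trace(
        S - SW @ np.linalg.inv(M) @ W_new.T
        - (alpha / (Nk * beta)) * (W_new @ W_new.T))
    return W_new, 1.0 / inv_beta_new
```

For α_k ≡ 0 and γ_k ≡ 0 this should coincide with equations C.14 and C.15 in Tipping and Bishop (1999), as noted in the text.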
Figure 6: The Bayesian evidence scheme applied to MPPCA (mixture of probabilistic principal component analyzers). The top row shows the evolution of E, the normalized negative log-likelihood, on the training set (narrow lines) and the test set (bold lines) without regularization (dashed lines) and with the Bayesian evidence scheme (solid lines). The scatterplots in the bottom row compare the Bayesian evidence scheme with unregularized training for a total of 24 simulations, where the figure on the left shows the negative test set log-likelihood for classes 1 (crosses) and 2 (circles) and the figure on the right shows the percentage misclassification scores. The diagonal dashed lines indicate equal performance; symbols below the dashed line point to a performance improvement as a result of applying the Bayesian evidence scheme. Note the small but consistent improvement in the plot on the bottom left. The respective t-statistics are shown in Table 3.
To test the statistical significance of the results, the simulations were repeated 24 times for two kernel numbers (K = 4, 6), three different partitions of the data into training and test sets, and four different initializations.14

14 Two initial kernel widths were chosen (β_k^{−1} = 0.5 and β_k^{−1} = 1.0), and the simulations started from two different random number generator seeds. The initial weights were set to w_ki = 0, and the components of the kernel centers, μ_k, were drawn from a normal distribution centered at the mean of the respective class. The first five adaptation steps
Table 3: Mixture of Probabilistic Principal Component Analyzers.

                                Unregularized            Optimal
                                (Maximum Likelihood)     Early Stopping
E_1 = −ln P(D | class = 1)            9.65                   4.58
E_2 = −ln P(D | class = 2)            7.93                   1.40
Misclassification                     2.36                   0.74

Notes: The table shows the t-statistics for a matched-pairs test with 23 degrees of freedom and a critical value of 1.71. Positive values indicate that the Bayesian evidence scheme gives better results; boldface figures indicate that this improvement is significant (at a 95% significance level). Left column: comparison between the Bayesian evidence scheme and unregularized training. Right column: comparison between the Bayesian evidence scheme and optimal early stopping (described in the text).
The results are plotted in the bottom row of Figure 6, which shows scatter diagrams similar to those discussed in section 4.1. The figure on the bottom left shows a comparison between unregularized (maximum likelihood) training and the Bayesian evidence scheme for modeling the two class-conditional distributions. Note that the dashed diagonal line indicates equal performance of the two methods, whereas symbols below the dashed line point to a performance improvement as a result of applying the evidence scheme. This is, in fact, observed in the figure, which suggests that although the improvement achieved with the evidence approach is relatively small, it is consistent in that it occurred in every simulation. A matched-pairs t-test gives a value of 9.65 (see Table 3), which is larger than the critical value of 1.71 (for a 95% significance level) and therefore suggests that the improvement is significant. Similarly, the figure on the bottom right shows a comparison between unregularized training and the evidence scheme with respect to the actual classification scores. The respective t-statistic, shown in Table 3, is 2.36; hence the improvement achieved with Bayesian regularization is less dramatic but still statistically significant.

Finally, the evidence scheme was also compared with the method of optimal early stopping. (Recall that this is an idealized version of early stopping that gives better results than can be achieved in practice.) The results of a matched-pairs t-test are given in Table 3, which suggests that the evidence scheme still gives better results, although this is less pronounced and statistically significant for only one (out of three) performance measure.
were unregularized and followed the (faster) direct update scheme of equations 3.12 and 3.13 in Tipping and Bishop (1999).
Regularizing Probability-Density Estimating Neural Networks
2707
5 Discussion
This study has applied the Bayesian evidence scheme to the regularization of probability-density estimating neural networks. This is to be distinguished from Roberts, Husmeier, Rezek, and Penny (1998), where the evidence scheme was applied to model selection, whereas the actual training process was unregularized. Also, note that in Roberts et al. (1998), only the lowest-order approximation to the Hessian, equivalent to the first term in the expansion of equations A.28 and A.37, was applied. The numerical experiments suggest that the evidence method derived in this study leads to a considerable stabilization of the generalization performance with respect to variations in the model complexity (see Figure 4) and training time (see Figures 3 and 5). A comparison with alternative regularizers gave, in general, very favorable results. The following restrictions, however, are to be noted. When modeling unconditional probability densities according to section 4.1, a gaussian prior is introduced on the kernel centers $f_k = w_k$ themselves (rather than on parameters determining the functional mapping onto the kernel centers). This explains the comparatively poor performance on the tremor data, where one of the class-conditional distributions is strongly skewed and considerably deviates from a gaussian. Moreover, for classification problems per se, a gaussian prior might be suboptimal, since one would prefer to deploy more kernels near the classification boundary rather than in the interior of the distribution. Consequently, although the modeling of the actual distribution tends to be improved by the evidence scheme, this is not necessarily reflected in the classification performance (as seen for the Kaposi data; see Table 2).
These restrictions do not apply to the modeling of conditional densities, for which the kernel centers are modeled by generalized linear functions according to equation 1.1, and the chosen form of the prior is equivalent to the standard complexity regularization of linear weight decay. Modeling the kernel centers as generalized linear functions implies that for a given approximation accuracy, the model complexity will usually be larger than that of an equivalent one-hidden-layer neural network. However, Igelnik and Pao (1995) claim that for an appropriate stochastic choice of the basis functions (the functions $g(x_t)$ in equation 1.1), this increase in the complexity may not be dramatic. Probabilistic principal component analysis is based on a generalized linear model for predicting conditional densities $P(y \,|\, x)$, but in fact integrates the latent variables $x$ out so as to model unconditional densities. This is done according to the EM algorithm, which requires taking the expectation value of the Hessian with respect to the posterior distribution $P(x \,|\, y, k)$ of equation 3.9. In the work presented here, this has been done only for the lowest-order approximation to the Hessian, as discussed in the appendix. The resulting update algorithm was found to achieve a significant improvement
2708
Husmeier
over training with the maximum likelihood method for both the modeling of the class-conditional distributions and the classification performance. In a comparison with an idealized form of early stopping, the evidence method still tended to achieve better results, although this was less pronounced and not always statistically significant. The derivation of the expectation value of a higher-order approximation to the Hessian, which takes the second term in equation A.28 into account, is the subject of future research.

6 Conclusion
Training probability-density estimating neural networks with the EM algorithm aims to maximize the likelihood of the training set and therefore leads to severe overfitting for sparse data. The proposed regularization method adopts the Bayesian evidence scheme, whereby the hyperparameters of the prior are optimized by type II maximum likelihood, that is, after marginalizing over the parameters. This is done by a Laplace approximation, which requires the derivation of the Hessian of the log-likelihood function. The incorporation of this approach into the standard training scheme leads to a modified form of the EM algorithm, which includes extra regularization terms whose hyperparameters are adapted on-line after each EM cycle. The method has been applied to the direct modeling of unconditional densities, the indirect modeling of unconditional densities via a latent-space model, and the prediction of conditional probability densities. Simulations on four classification problems and one time-series prediction task suggest that the generalization performance is significantly improved over unregularized maximum likelihood training and that the evidence scheme tends to outperform the alternative regularization methods employed in this study. Moreover, most simulation results suggest that the training process is stabilized with respect to changes in the training time and the model complexity, which prevents overfitting and greatly reduces the need for model selection.

Appendix

A.1 Derivation of the Regularized EM Algorithm. Consider a density-estimating network with parameters $\theta = \{w, \pi, \beta\}$, and define the $K$ by $N$ matrix of binary variables $\lambda_k(t)$,

$$\Lambda := \big(\lambda(1), \ldots, \lambda(N)\big), \qquad \lambda(t) := \big(\lambda_1(t), \ldots, \lambda_K(t)\big)^{T}, \qquad \lambda_k(t) \in \{0, 1\}, \qquad \|\lambda(t)\| = 1 \;\; \forall t, \qquad (A.1)$$

where $\lambda_k(t) = 1$ indicates that the $t$th exemplar $y_t$ has been generated from the $k$th component of the mixture distribution, equation 1.2, and the last
equation on the right implies that at any particular time $t$, one and only one $\lambda_k(t)$ is set to 1. Moreover, define

$$\Psi(\theta, \Lambda) := -\ln P(D, \Lambda \,|\, \theta) \qquad (A.2)$$

$$U(\theta \,|\, \theta') := \big\langle \Psi(\theta, \Lambda) \big\rangle_{\Lambda | D, \theta'} \qquad (A.3)$$

$$S(\theta \,|\, \theta') := \big\langle \ln P(\Lambda \,|\, D, \theta) \big\rangle_{\Lambda | D, \theta'}, \qquad (A.4)$$

where the abbreviation $\langle f \rangle_{\Lambda | D, \theta'} := \int f(\Lambda)\, P(\Lambda \,|\, D, \theta')\, d\Lambda$ has been applied. Note the distinction between $\theta$ and $\theta'$, where the latter denotes the current or old parameters, which are kept fixed, and the former the new parameters, which are adapted (see below). From definition 2.3 and equations A.2 through A.4 we obtain

$$E_o(\theta) = -\ln P(D \,|\, \theta) = \ln P(\Lambda \,|\, D, \theta) - \ln P(D, \Lambda \,|\, \theta) = U(\theta \,|\, \theta') + S(\theta \,|\, \theta') \qquad (A.5)$$

$$\big[\nabla E_o(\theta)\big]_{\theta = \theta'} = \big[\nabla U(\theta \,|\, \theta')\big]_{\theta = \theta'}, \qquad (A.6)$$

where equation A.6 follows from the fact that $S(\theta \,|\, \theta')$ has its maximum at $\theta = \theta'$ (e.g., Papoulis, 1991). The joint probability for the class labels and the data is given by$^{15}$

$$P(D, \Lambda \,|\, \theta) = \prod_{t=1}^{N} \prod_{k=1}^{K} \big(\pi_k\, P(y_t \,|\, \theta, k)\big)^{\lambda_k(t)} \qquad (A.7)$$

$$= \prod_{t=1}^{N} \prod_{k=1}^{K} \prod_{i=1}^{m} \left[\sqrt[m]{\pi_k}\, \sqrt{\frac{\beta_{ki}}{2\pi}}\; \exp\!\left(-\frac{\beta_{ki}}{2}\big(y_{ti} - f_{ki}\big)^2\right)\right]^{\lambda_k(t)} \qquad (A.8)$$

and, with equation A.2,

$$\Psi(\theta, \Lambda) = \sum_{t=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{m} \lambda_k(t) \left[-\frac{1}{m}\ln \pi_k - \frac{1}{2}\ln\frac{\beta_{ki}}{2\pi} + \frac{\beta_{ki}}{2}\big(y_{ti} - f_{ki}\big)^2\right] \qquad (A.9)$$

Since $\lambda_k(t)$ is binary and thus $\langle \lambda_k(t) \rangle_{\Lambda | D, \theta'} = P\big(\lambda_k(t) = 1 \,|\, D, \theta'\big) = p_k(t)$, equations A.3 and A.9 lead to

$$U(\theta \,|\, \theta') = \sum_{t=1}^{N} \sum_{k=1}^{K} \sum_{i=1}^{m} p_k(t) \left[-\frac{1}{m}\ln \pi_k - \frac{1}{2}\ln\frac{\beta_{ki}}{2\pi} + \frac{\beta_{ki}}{2}\big(y_{ti} - f_{ki}\big)^2\right]. \qquad (A.10)$$

$^{15}$ See, for instance, Bishop (1995), Dempster et al. (1977), and Husmeier (1998).
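For readers who want to reproduce the quantities entering the derivation, the posteriors $p_k(t)$ and an expected complete-data cost of the form of equation A.10 can be computed as follows for diagonal-gaussian kernels. This is our own illustrative sketch, not the author's code, and all array names and the example data are hypothetical:

```python
import numpy as np

def e_step(Y, pi, F, beta):
    """Posteriors p_k(t) and expected cost U = <Psi> (cf. equations A.9-A.10).

    Y: (N, m) data; pi: (K,) mixing coefficients; F: (K, m) kernel centres
    f_ki; beta: (K, m) inverse variances.  Diagonal-gaussian kernels.
    """
    sq = (Y[:, None, :] - F[None, :, :]) ** 2                  # (N, K, m)
    # ln[ pi_k P(y_t | theta, k) ] for every exemplar t and kernel k
    log_joint = (np.log(pi)[None, :]
                 + 0.5 * np.log(beta / (2 * np.pi)).sum(axis=1)[None, :]
                 - 0.5 * (beta[None, :, :] * sq).sum(axis=2))  # (N, K)
    p = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
    U = -(p * log_joint).sum()   # U = -<ln P(D, Lambda | theta)>
    return p, U

# hypothetical two-kernel example
rng = np.random.default_rng(1)
Y = rng.normal(size=(20, 2))
p, U = e_step(Y, np.array([0.5, 0.5]),
              np.array([[-1.0, 0.0], [1.0, 0.0]]), np.ones((2, 2)))
```

Working with log-probabilities and `logaddexp` avoids underflow when kernels are narrow, which matters precisely in the regime discussed in footnote 18 below.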
Taking the derivatives of $U$ gives$^{16}$

$$\frac{\partial U}{\partial \alpha_k} = 0 \qquad (A.11)$$

$$\frac{\partial U}{\partial \beta_{ki}} = \sum_{t=1}^{N} \frac{p_k(t)}{2}\left[\big(y_{ti} - f_{ki}\big)^2 - \frac{1}{\beta_{ki}}\right] \qquad (A.12)$$

$$\frac{\partial U}{\partial \pi_k} = -\frac{1}{\pi_k}\sum_{t=1}^{N} p_k(t) + \frac{1}{\pi_K}\sum_{t=1}^{N} p_K(t) \qquad (A.13)$$

$$-\nabla_{w_{ki}} U = \beta_{ki} \sum_{t=1}^{N} p_k(t)\big[y_{ti} - f(x_t; w_{ki})\big]\, \nabla_{w_{ki}} f(x_t; w_{ki})$$
$$= \beta_{ki} \left(\sum_{t=1}^{N} p_k(t)\, y_{ti}\, g(x_t) - \sum_{t=1}^{N} p_k(t)\, g(x_t)\big[g(x_t)\big]^{T} w_{ki}\right)$$
$$= \beta_{ki} \left[G\, \Pi_k\, y_{.i} - G\, \Pi_k\, G^{T} w_{ki}\right], \qquad (A.14)$$

where in the last step, the definitions 2.15, 2.16, and 2.19 have been applied. On inserting the expression for the prior on $w$, equation 2.1, into the definition of the regularization term, equation 2.4, we obtain

$$R(w; \alpha) = \sum_{k=1}^{K} \sum_{i=1}^{m} \frac{\alpha_k}{2}\, w_{ki}^{T} w_{ki} - \sum_{k=1}^{K} \frac{m\tilde{n}}{2} \ln \frac{\alpha_k}{2\pi}, \qquad (A.15)$$

and for the derivatives,

$$\nabla_{w_{ki}} R = \alpha_k\, w_{ki}, \qquad \frac{\partial R}{\partial \alpha_k} = \frac{1}{2}\sum_{i=1}^{m} w_{ki}^{T} w_{ki} - \frac{m\tilde{n}}{2\alpha_k}, \qquad \frac{\partial R}{\partial \pi_k} = \frac{\partial R}{\partial \beta_{ki}} = 0. \qquad (A.16)$$

The derivatives of $\ln \det H$ will be derived in the next section: equations A.30, A.31, A.33, and A.34. Inserting this into equations 2.12 and 2.13 and applying equation A.6 leads to the regularized EM algorithm, equations 2.21 through 2.24 and equations 3.3 and 3.4.
$^{16}$ The second term on the right of equation A.13 stems from the constraint $\sum_{k=1}^{K} \pi_k = 1 \;\Rightarrow\; \pi_K = 1 - \sum_{k=1}^{K-1} \pi_k$.
A.2 The Hessian. The derivations presented in this section are based on the following identity:

Lemma. If $\theta = \theta'$, then

$$\frac{\partial^2 S(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} = -\left\langle \frac{\partial \Psi}{\partial \theta_i} \frac{\partial \Psi}{\partial \theta_k} \right\rangle_{\Lambda | D, \theta'} + \frac{\partial E_o}{\partial \theta_i} \frac{\partial E_o}{\partial \theta_k}. \qquad (A.17)$$

Proof. From the definition of $S$, equation A.4, we obtain

$$\frac{\partial^2 S(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} = \int \frac{\partial^2 \big(\ln P(\Lambda \,|\, D, \theta)\big)}{\partial \theta_i\, \partial \theta_k}\, P(\Lambda \,|\, D, \theta')\, d\Lambda$$
$$= \int \frac{1}{P(\Lambda \,|\, D, \theta)} \frac{\partial^2 P(\Lambda \,|\, D, \theta)}{\partial \theta_i\, \partial \theta_k}\, P(\Lambda \,|\, D, \theta')\, d\Lambda - \int \frac{1}{\big(P(\Lambda \,|\, D, \theta)\big)^2} \frac{\partial P(\Lambda \,|\, D, \theta)}{\partial \theta_i} \frac{\partial P(\Lambda \,|\, D, \theta)}{\partial \theta_k}\, P(\Lambda \,|\, D, \theta')\, d\Lambda.$$

For $\theta = \theta'$, the first integral in the last equation is zero and therefore

$$\frac{\partial^2 S(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} = -\int \frac{\partial \ln P(\Lambda \,|\, D, \theta)}{\partial \theta_i} \frac{\partial \ln P(\Lambda \,|\, D, \theta)}{\partial \theta_k}\, P(\Lambda \,|\, D, \theta')\, d\Lambda.$$

Now use (i) $P(\Lambda \,|\, D, \theta) = \frac{P(\Lambda, D \,|\, \theta)}{P(D \,|\, \theta)} \;\Rightarrow\; \ln P(\Lambda \,|\, D, \theta) = \ln P(\Lambda, D \,|\, \theta) - \ln P(D \,|\, \theta)$ and (ii) the fact that for $\theta = \theta'$, the following identity holds:

$$\int \frac{\partial \ln P(\Lambda, D \,|\, \theta)}{\partial \theta_i}\, P(\Lambda \,|\, D, \theta')\, d\Lambda = \int \frac{1}{P(\Lambda \,|\, D, \theta)\, P(D \,|\, \theta)} \frac{\partial P(\Lambda, D \,|\, \theta)}{\partial \theta_i}\, P(\Lambda \,|\, D, \theta')\, d\Lambda = \frac{1}{P(D \,|\, \theta)} \frac{\partial P(D \,|\, \theta)}{\partial \theta_i} = \frac{\partial \ln P(D \,|\, \theta)}{\partial \theta_i}.$$

This gives

$$\frac{\partial^2 S(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} = -\int \frac{\partial \ln P(\Lambda, D \,|\, \theta)}{\partial \theta_i} \frac{\partial \ln P(\Lambda, D \,|\, \theta)}{\partial \theta_k}\, P(\Lambda \,|\, D, \theta')\, d\Lambda + \frac{\partial \ln P(D \,|\, \theta)}{\partial \theta_i} \frac{\partial \ln P(D \,|\, \theta)}{\partial \theta_k},$$

which leads, with definitions 2.3 and A.2, to A.17.
It is now easy to prove the following relation, which allows the calculation of the Hessian of the total cost function $E$ from $U$:

Theorem. If $\theta = \theta'$, then

$$\frac{\partial^2 E}{\partial \theta_i\, \partial \theta_k} = \frac{\partial^2 U}{\partial \theta_i\, \partial \theta_k} + \frac{\partial U}{\partial \theta_i}\frac{\partial U}{\partial \theta_k} - \left\langle \frac{\partial \Psi}{\partial \theta_i}\frac{\partial \Psi}{\partial \theta_k} \right\rangle_{\Lambda | D, \theta'} + \frac{\partial^2 R}{\partial \theta_i\, \partial \theta_k}. \qquad (A.18)$$

Proof. From equations 2.5 and A.5, we get

$$\frac{\partial^2 E(\theta)}{\partial \theta_i\, \partial \theta_k} = \frac{\partial^2 U(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} + \frac{\partial^2 S(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} + \frac{\partial^2 R(\theta)}{\partial \theta_i\, \partial \theta_k}. \qquad (A.19)$$

Now making use of the above lemma, equation A.17, this leads to

$$\frac{\partial^2 E(\theta)}{\partial \theta_i\, \partial \theta_k} = \frac{\partial^2 U(\theta \,|\, \theta')}{\partial \theta_i\, \partial \theta_k} - \left\langle \frac{\partial \Psi}{\partial \theta_i}\frac{\partial \Psi}{\partial \theta_k} \right\rangle_{\Lambda | D, \theta'} + \frac{\partial E_o}{\partial \theta_i}\frac{\partial E_o}{\partial \theta_k} + \frac{\partial^2 R(\theta)}{\partial \theta_i\, \partial \theta_k}. \qquad (A.20)$$

Since $\theta = \theta'$, $\frac{\partial E_o}{\partial \theta_i}$ can be replaced by $\frac{\partial U(\theta | \theta')}{\partial \theta_i}$ due to equation A.6. This completes the proof.
From equations A.14 and A.16, we obtain

$$\nabla_{w_{ki}} \nabla^{T}_{w_{k'i'}} U = \delta_{kk'}\, \delta_{ii'}\, \beta_{ki}\, G\, \Pi_k\, G^{T} \qquad (A.21)$$

$$\nabla_{w_{ki}} \nabla^{T}_{w_{k'i'}} R = \delta_{kk'}\, \delta_{ii'}\, \alpha_k\, I. \qquad (A.22)$$

To calculate the outer product $\langle \nabla_w \Psi\, (\nabla_w \Psi)^{T} \rangle$, take the gradient of the expression in equation A.9 and apply equation 1.1:

$$\nabla_{w_{ki}} \Psi = -\sum_{t=1}^{N} \lambda_k(t)\, \beta_{ki}\, \big(y_{ti} - f(x_t; w_{ki})\big)\, g(x_t). \qquad (A.23)$$

In what follows, we will find expressions of the form $\big\langle \big[\sum_t \lambda_k(t)\, W_{ki}(t)\big]\big[\sum_{t'} \lambda_{k'}(t')\, W_{k'i'}(t')\big] \big\rangle$. For independent training data, events at different times $t \neq t'$ are independent. Therefore $\langle \lambda_k(t)\, \lambda_{k'}(t') \rangle = \langle \lambda_k(t) \rangle \langle \lambda_{k'}(t') \rangle = p_k(t)\, p_{k'}(t')$. For $t = t'$, however, the events are not independent since at any particular time, only one of the indicator variables $\lambda_k(t)$ is "switched on." Moreover, since the $\lambda_k(t)$ are binary, $[\lambda_k(t)]^2 = \lambda_k(t)$. This can be summarized as $\lambda_k(t)\, \lambda_{k'}(t) = \delta_{kk'}\, \lambda_k(t)$ ($\delta_{kk'}$ denotes the Kronecker delta), yielding the overall relation

$$\big\langle \lambda_k(t)\, \lambda_{k'}(t') \big\rangle = (1 - \delta_{tt'})\, p_k(t)\, p_{k'}(t') + \delta_{tt'}\, \delta_{kk'}\, p_k(t). \qquad (A.24)$$
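The indicator-variable relation can be confirmed by a short Monte Carlo experiment of our own: draw one-hot vectors from the posteriors and compare the empirical second moments with the right-hand side of equation A.24. The posterior values below are arbitrary illustrative numbers:

```python
import numpy as np

# Empirical check of equation A.24 for t = t' and t != t'.
rng = np.random.default_rng(0)
S = 200_000                                  # Monte Carlo samples
p_t  = np.array([0.2, 0.3, 0.5])             # p_k(t)
p_t2 = np.array([0.6, 0.1, 0.3])             # p_k(t'), t' != t
lam_t  = rng.multinomial(1, p_t,  size=S)    # one-hot lambda(t), shape (S, K)
lam_t2 = rng.multinomial(1, p_t2, size=S)    # independent draw at t' != t
# empirical second moments <lambda_k(t) lambda_k'(t')>
same_time = (lam_t[:, :, None] * lam_t[:, None, :]).mean(axis=0)
diff_time = (lam_t[:, :, None] * lam_t2[:, None, :]).mean(axis=0)
# A.24 predicts diag(p_t) for t = t' and the outer product for t != t'
```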
Using this result, we obtain

$$\left\langle \left[\sum_{t} \lambda_k(t)\, W_{ki}(t)\right]\left[\sum_{t'} \lambda_{k'}(t')\, W_{k'i'}(t')\right] \right\rangle = \sum_{t}\sum_{t'} \big\langle \lambda_k(t)\, \lambda_{k'}(t') \big\rangle\, W_{ki}(t)\, W_{k'i'}(t')$$
$$= \sum_{t}\sum_{t'} p_k(t)\, p_{k'}(t')\, W_{ki}(t)\, W_{k'i'}(t') - \sum_{t} p_k(t)\, p_{k'}(t)\, W_{ki}(t)\, W_{k'i'}(t) + \delta_{kk'} \sum_{t} p_k(t)\, W_{ki}(t)\, W_{ki'}(t)$$
$$= \left\langle \sum_{t} \lambda_k(t)\, W_{ki}(t) \right\rangle \left\langle \sum_{t} \lambda_{k'}(t)\, W_{k'i'}(t) \right\rangle - \sum_{t} p_k(t)\, p_{k'}(t)\, W_{ki}(t)\, W_{k'i'}(t) + \delta_{kk'} \sum_{t} p_k(t)\, W_{ki}(t)\, W_{ki'}(t),$$

and, after adding and subtracting a term $\delta_{kk'} \sum_t \big(p_k(t)\big)^2\, W_{ki}\, W_{ki'}$,

$$= \left\langle \sum_{t} \lambda_k(t)\, W_{ki}(t) \right\rangle \left\langle \sum_{t} \lambda_{k'}(t)\, W_{k'i'}(t) \right\rangle - (1 - \delta_{kk'}) \sum_{t} p_k(t)\, p_{k'}(t)\, W_{ki}(t)\, W_{k'i'}(t) + \delta_{kk'} \sum_{t} p_k(t)\big[1 - p_k(t)\big]\, W_{ki}(t)\, W_{ki'}(t). \qquad (A.25)$$

With equations A.23 and A.25, the definition $W_{ki}(t) := \beta_{ki}\big(y_{ti} - f(x_t; w_{ki})\big)\, g(x_t)$, and the identity $\langle \Psi \rangle = U$ (from equation A.3), this leads to

$$\left\langle (\nabla_{w_{ki}} \Psi)(\nabla_{w_{k'i'}} \Psi)^{T} \right\rangle \approx (\nabla_{w_{ki}} U)(\nabla_{w_{k'i'}} U)^{T} + \delta_{kk'}\, \delta_{ii'}\, \beta_{ki}^2 \sum_{t=1}^{N} p_k(t)\big[1 - p_k(t)\big] \big(y_{ti} - f(x_t; w_{ki})\big)^2\, g(x_t)\, g^{T}(x_t), \qquad (A.26)$$

where the sums over cross-kernel terms, $k \neq k'$, and cross-coordinate terms, $i \neq i'$, have been neglected.$^{17}$

$^{17}$ These terms can be both positive and negative and therefore tend to cancel each other out. Also note that for a distinct partitioning of the data among the kernels, the product $p_k(t)\, p_{k'}(t)$ for $k \neq k'$ will usually be small.

Inserting equations A.21, A.22, and A.26 into
A.18 leads to the following expression for the Hessian $H = [\nabla_w \nabla_w^{T} E]_{w = \hat{w}}$:

$$H_{kk'ii'} = \delta_{kk'}\, \delta_{ii'}\, (A_{ki} + \alpha_k I), \qquad (A.27)$$

where the $\tilde{n}$-by-$\tilde{n}$ matrix

$$A_{ki} := \beta_{ki}\, G\, \Pi_k\, G^{T} - \beta_{ki}^2 \sum_{t=1}^{N} p_k(t)\big[1 - p_k(t)\big] \big(y_{ti} - f(x_t; w_{ki})\big)^2\, g(x_t)\, g^{T}(x_t) \qquad (A.28)$$

has been introduced. This matrix has a diagonal block structure, so for the logarithm of the determinant we obtain

$$\ln \det H = \sum_{k=1}^{K} \sum_{i=1}^{m} \ln \det(A_{ki} + \alpha_k I), \qquad (A.29)$$

which gives the following derivatives:

$$\frac{\partial \ln \det H}{\partial \pi_k} = 0 \qquad (A.30)$$

$$\frac{\partial \ln \det H}{\partial \alpha_k} = \frac{\partial}{\partial \alpha_k} \sum_{i=1}^{m} \ln \det(A_{ki} + \alpha_k I) = \frac{\partial}{\partial \alpha_k} \sum_{i=1}^{m} \sum_{\nu=1}^{\tilde{n}} \ln(e^{\nu}_{ki} + \alpha_k) = \sum_{i=1}^{m} \sum_{\nu=1}^{\tilde{n}} \frac{1}{e^{\nu}_{ki} + \alpha_k} = \frac{1}{\alpha_k} \sum_{i=1}^{m} \sum_{\nu=1}^{\tilde{n}} \frac{\alpha_k + e^{\nu}_{ki} - e^{\nu}_{ki}}{e^{\nu}_{ki} + \alpha_k} = \frac{m\tilde{n} - \gamma_k}{\alpha_k} \qquad (A.31)$$

$$\frac{\partial \ln \det H}{\partial \beta_{ki}} = \frac{\partial}{\partial \beta_{ki}} \sum_{\nu=1}^{\tilde{n}} \ln(e^{\nu}_{ki} + \alpha_k) = \sum_{\nu=1}^{\tilde{n}} \frac{1}{e^{\nu}_{ki} + \alpha_k} \frac{\partial e^{\nu}_{ki}}{\partial \beta_{ki}}, \qquad (A.32)$$

where $e^{\nu}_{ki}$ is the $\nu$th eigenvalue of $A_{ki}$, and $\gamma_k$ was defined in equation 2.20. The last expression can be simplified by assuming that the eigenvalues are homogeneous$^{18}$ in $\beta_{ki}$, $e^{\nu}_{ki} \propto \beta_{ki} \;\Rightarrow\; \frac{\partial e^{\nu}_{ki}}{\partial \beta_{ki}} = \frac{e^{\nu}_{ki}}{\beta_{ki}}$:

$$\frac{\partial \ln \det H}{\partial \beta_{ki}} \approx \sum_{\nu=1}^{\tilde{n}} \frac{1}{e^{\nu}_{ki} + \alpha_k} \frac{e^{\nu}_{ki}}{\beta_{ki}} = \frac{\gamma_{ki}}{\beta_{ki}}. \qquad (A.33)$$

$^{18}$ This is exact for $K = 1$, where the second contribution on the right of equation A.28 disappears. For $K > 1$, note that the terms in the sum on the right of equation A.28 scale like $\beta_{ki}^2\, p_k(t)[1 - p_k(t)]$. For large kernel widths, $\beta_{ki} \ll 1$, they can obviously be neglected. For narrow kernels, the posteriors $p_k(t)$ will also be narrow, leading to a steep transition from $p_k(t) = 1$ to $p_k(t) = 0$. Because of the factor $p_k(t)[1 - p_k(t)]$, only exemplars with $p_k(t) \approx 0.5$ will give significant contributions, leaving the sum of order $O(1)$, whereas the other term is of order $O(N)$.
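In practice, $\gamma$, $\ln \det H$, and the derivative in equation A.31 are conveniently obtained from one eigendecomposition per kernel. A minimal sketch of our own, using a random symmetric stand-in for $A_{ki}$:

```python
import numpy as np

def gamma_and_logdet(A_ki, alpha_k):
    """gamma = sum_nu e_nu / (e_nu + alpha_k) and ln det(A_ki + alpha_k I)
    from one eigendecomposition (cf. equations A.29, A.31, A.33).
    A_ki is assumed symmetric, as in equation A.28."""
    e = np.linalg.eigvalsh(A_ki)
    gamma = np.sum(e / (e + alpha_k))
    logdet = np.sum(np.log(e + alpha_k))
    return gamma, logdet

# random symmetric positive semidefinite stand-in for A_ki
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B @ B.T
alpha = 0.7
gamma, logdet = gamma_and_logdet(A, alpha)
# in this notation, equation A.31 reads d(ln det)/d(alpha) = (n - gamma)/alpha
```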
For MPPCA, where $\beta_{ki} = \beta_k \;\forall i$, we get

$$\frac{\partial \ln \det H}{\partial \beta_k} = \frac{\partial}{\partial \beta_k} \sum_{i=1}^{m} \sum_{\nu=1}^{\tilde{n}} \ln(e^{\nu}_{ki} + \alpha_k) \approx \sum_{i=1}^{m} \frac{\gamma_{ki}}{\beta_k} = \frac{\gamma_k}{\beta_k}. \qquad (A.34)$$

A.3 The Hessian for MPPCA. For MPPCA, we need to take the expectation value of the Hessian with respect to the distribution 3.9. In this study, the second term in the expansion of the Hessian, equation A.28, has been neglected, which is equivalent to the approximation made in Roberts et al. (1998). We then get

$$\big\langle A_{ki} \big\rangle = \beta_k \big\langle G\, \Pi_k\, G^{T} \big\rangle = N_k\, \beta_k\, M_k^{-1} \big[W_k^{T} S_k W_k + I\big]\, M_k^{-1}, \qquad (A.35)$$
where in the second step, equation 3.23 has been applied.

A.4 A Special Case: Unconditional Densities. As mentioned in section 4.1, the modeling of unconditional densities is given by the special case $g(x_t) \equiv 1$. This allows a more accurate approximation to the Hessian, where cross-coordinate terms, $i \neq i'$, are not neglected, and equation A.27 is replaced by

$$H_{kk'ii'} = \delta_{kk'}\, (A_{kii'} + \delta_{ii'}\, \alpha_k) \qquad (A.36)$$

$$A_{kii'} := \delta_{ii'}\, \beta_{ki}\, N_k - \beta_{ki}\, \beta_{ki'} \sum_{t=1}^{N} p_k(t)\big[1 - p_k(t)\big] \big(y_{ti} - w_{ki}\big)\big(y_{ti'} - w_{ki'}\big). \qquad (A.37)$$

With the number of well-determined parameters, $\gamma_k = \sum_{\nu} \frac{e^{\nu}_k}{\alpha_k + e^{\nu}_k}$, where the $e^{\nu}_k$ are the eigenvalues of $A$ in equation A.37, this leads to the following update rules (replacing equations 2.21–2.24):

$$\hat{\pi}_k = \frac{N_k}{N}, \qquad \hat{w}_{ki} = \frac{\sum_t p_k(t)\, y_{ti} + (\alpha_k / \beta_{ki})\, M_i}{N_k + (\alpha_k / \beta_{ki})} \qquad (A.38)$$

$$\frac{1}{\hat{\beta}_{ki}} = \frac{\sum_t p_k(t)\big[y_{ti} - \hat{w}_{ki}\big]^2}{N_k - \gamma_k}, \qquad \frac{1}{\hat{\alpha}_k} = \frac{1}{\gamma_k}\, \hat{w}_k^{T} \hat{w}_k, \qquad (A.39)$$

where $M$ is the unconditional mean (this term stems from the modified prior mentioned in section 4.1), and $N_k$ was defined in equation 2.25. Kernels with $N_k \leq \gamma_k$ are pruned.
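A minimal sketch of one pass of these update rules for the unconditional-density case is given below. It follows equations A.37 through A.39 under our own array conventions and is not the author's implementation; the pruning step is only indicated in a comment:

```python
import numpy as np

def m_step_unconditional(Y, p, alpha, beta, M_mean):
    """One pass of the update rules A.38-A.39 for unconditional densities.

    Y: (N, m) data; p: (N, K) posteriors p_k(t); alpha: (K,) hyperparameters;
    beta: (K, m) inverse variances; M_mean: (m,) unconditional data mean.
    """
    N, m = Y.shape
    K = p.shape[1]
    Nk = p.sum(axis=0)
    pi_new = Nk / N                                           # equation A.38
    W_new = np.empty((K, m))
    beta_new = np.empty((K, m))
    alpha_new = np.empty(K)
    for k in range(K):
        # regularized centre, shrunk toward the unconditional mean M
        W_new[k] = ((p[:, k] @ Y + (alpha[k] / beta[k]) * M_mean)
                    / (Nk[k] + alpha[k] / beta[k]))           # equation A.38
        R = Y - W_new[k]
        w = p[:, k] * (1.0 - p[:, k])
        # matrix A of equation A.37 and its eigenvalues
        A = np.diag(beta[k] * Nk[k]) - np.outer(beta[k], beta[k]) * ((R.T * w) @ R)
        e = np.linalg.eigvalsh(A)
        gamma_k = np.sum(e / (alpha[k] + e))                  # well-determined parameters
        beta_new[k] = (Nk[k] - gamma_k) / (p[:, k] @ (R ** 2))  # equation A.39
        alpha_new[k] = gamma_k / (W_new[k] @ W_new[k])          # equation A.39
        # kernels with N_k <= gamma_k would be pruned (not shown)
    return pi_new, W_new, beta_new, alpha_new

# hypothetical two-cluster data with soft assignments
rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(-2.0, 1.0, (100, 2)), rng.normal(2.0, 1.0, (100, 2))])
d0 = ((Y + 2.0) ** 2).sum(axis=1)
d1 = ((Y - 2.0) ** 2).sum(axis=1)
p = np.stack([np.exp(-d0 / 2), np.exp(-d1 / 2)], axis=1)
p /= p.sum(axis=1, keepdims=True)
pi, W, beta, alpha = m_step_unconditional(Y, p, np.array([0.1, 0.1]),
                                          np.ones((2, 2)), Y.mean(axis=0))
```

Alternating this step with an E-step for the posteriors $p_k(t)$ yields the regularized EM cycle described in the main text.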
Acknowledgments
Major parts of this work were carried out at King’s College London and at Imperial College London, where I was supported by a Postgraduate Knight Studentship and by the Jefferiss Research Trust, respectively. I thank John G. Taylor, William D. Penny, and Stephen J. Roberts for stimulating discussions and helpful comments on earlier versions of the manuscript.
References

Attias, H. (1999). Independent factor analysis. Neural Computation, 11(4), 803–851.
Bishop, C. M. (1995). Neural networks for pattern recognition. New York: Oxford University Press.
Bishop, C. M., & Qazaz, C. S. (1995). Bayesian inference of noise levels in regression. In Proceedings ICANN 95 (pp. 59–64).
Bishop, C. M., Svensen, M., & Williams, C. K. I. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1), 215–234.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39(1), 1–38.
Hoel, P. G. (1984). Introduction to mathematical statistics. New York: Wiley.
Husmeier, D. (1998). Modelling conditional probability densities with neural networks. Unpublished doctoral dissertation, King's College London. Available online at: http://www.bioss.sari.ac.uk/~dirk/My publications.html.
Husmeier, D., & Taylor, J. G. (1997). Predicting conditional probability densities of stationary stochastic time series. Neural Networks, 10(3), 479–497.
Husmeier, D., & Taylor, J. G. (1998). Neural networks for predicting conditional probability densities: Improved training scheme combining EM and RVFL. Neural Networks, 11(1), 89–116.
Igelnik, B., & Pao, Y. H. (1995). Stochastic choice of basis functions in adaptive function approximation and the functional-link net. IEEE Transactions on Neural Networks, 6, 1320–1329.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4, 415–447.
MacKay, D. J. C. (1993). Hyperparameters: Optimize, or integrate out? In G. Heidbreder (Ed.), Maximum entropy and Bayesian methods (pp. 43–59). Norwell, MA: Kluwer.
Ormoneit, D., & Tresp, V. (1996). Improved gaussian mixture density estimates using Bayesian penalty terms and network averaging. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 542–548). Cambridge, MA: MIT Press.
Papoulis, A. (1991). Probability, random variables, and stochastic processes (3rd ed.). New York: McGraw-Hill.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical recipes in C. New York: Cambridge University Press.
Ripley, B. D. (1994). Neural networks and related methods for classification. Journal of the Royal Statistical Society B, 56(3), 409–456.
Roberts, S. J., Husmeier, D., Rezek, I., & Penny, W. (1998). Bayesian approaches to gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20, 1133–1142.
Spyers-Ashby, J. M. (1996). The recording and analysis of tremor in neurological disorders. Unpublished doctoral dissertation, Imperial College, London.
Tipping, M. E., & Bishop, C. M. (1999). Mixtures of probabilistic principal component analyzers. Neural Computation, 11(2), 443–482.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129–151.

Received May 27, 1998; accepted November 12, 1999.
LETTER
Communicated by Radford Neal
A Bayesian Committee Machine

Volker Tresp

Siemens AG, Corporate Technology, Department of Information and Communications, 81730 Munich, Germany

The Bayesian committee machine (BCM) is a novel approach to combining estimators that were trained on different data sets. Although the BCM can be applied to the combination of any kind of estimators, the main foci are gaussian process regression and related systems such as regularization networks and smoothing splines, for which the degrees of freedom increase with the number of training data. Somewhat surprisingly, we find that the performance of the BCM improves if several test points are queried at the same time and is optimal if the number of test points is at least as large as the degrees of freedom of the estimator. The BCM also provides a new solution for on-line learning with potential applications to data mining. We apply the BCM to systems with fixed basis functions and discuss its relationship to gaussian process regression. Finally, we show how the ideas behind the BCM can be applied in a non-Bayesian setting to extend the input-dependent combination of estimators.

1 Introduction
For reasons typically associated with their architectures and learning algorithms, some learning systems are limited in their capability to handle large data sets and to perform on-line learning. Typical examples are kernel-based approaches such as smoothing splines, regularization networks, kriging, and gaussian process regression, where it is necessary to invert matrices of the size of the training data set and where the number of kernels grows proportionally to the number of training data. All four systems are closely related and are considered biologically relevant models for information processing (Poggio & Girosi, 1990). Gaussian process regression is an approach recently introduced into the machine learning community (Neal, 1996; Rasmussen, 1996; Williams & Rasmussen, 1996; Williams, 1997, 1998a; Gibbs & MacKay, 1997; MacKay, 1998) and is considered suitable for applications with only up to a few hundred training samples. The reason is that the computational cost of the required matrix inversion scales as $O(K^3)$, where $K$ is the number of training data. In this article we introduce an approximate solution to regression whose computational cost increases only linearly with the number of training patterns and is applicable in particular to kernel-based regression systems but also to other regression models. The idea is

Neural Computation 12, 2719–2741 (2000) © 2000 Massachusetts Institute of Technology
to split up the data set into $M$ data sets and train $M$ systems on the data sets. We then combine the predictions of the individual systems using a new weighting scheme in the form of a Bayesian committee machine (BCM). This scheme is an extension to the input-dependent averaging of estimators, a procedure that was introduced by Tresp and Taniguchi (1995) and Taniguchi and Tresp (1997). A particular application of our solution is on-line learning where data arrive sequentially and training must be performed sequentially as well. We present an implementation of our approach that is capable of on-line learning and requires the storage of only one matrix of the size of the number of query points. In section 2 we develop the BCM in a general setting and in section 3 apply the BCM to kernel regression systems—in particular, gaussian process regression. In section 4 we apply the BCM to regression with fixed basis functions. We show that in the special case that we have at least as many query points as there are linearly independent basis functions, we obtain the optimal prediction. In section 5 we explain the principle behind the BCM. We show that the dimension and the effective degrees of freedom $P^{\mathrm{Data}}_{\mathrm{eff}}$ of an estimator are important quantities for determining the optimal number of query points. In section 6 we give a non-Bayesian version of the BCM and show how it can be used to improve the input-dependent combination of estimators by Tresp and Taniguchi (1995). In section 7 we derive on-line Kalman filter versions of our approach and show how predictions at additional query points can be calculated. In section 8 we demonstrate the effectiveness of our approach using several data sets. In section 9 we present our conclusions.

2 Derivation of the BCM

2.1 Combining Bayesian Estimators. Let us assume an input variable $x$ and a one-dimensional real response variable $f(x)$. We assume that we can measure only a noisy version of $f$,
$$g(x) = f(x) + \epsilon(x),$$

where $\epsilon(x)$ is independent gaussian distributed noise with zero mean and variance $\sigma_y^2(x)$. Let $X^m = \{x_1^m, \ldots, x_K^m\}$ denote the input vectors of the training data set of size $K$ (superscript $m$ for measurement), and let $g^m = (g_1, \ldots, g_K)^{T}$ be the vector of the corresponding target measurements. Let $D = \{X^m, g^m\}$ denote both inputs and targets. Furthermore, let $X^q = \{x_1^q, \ldots, x_{N_Q}^q\}$ denote a set of $N_Q$ query or test points (superscript $q$ for query), and let $f^q = (f_1, \ldots, f_{N_Q})$ be the vector of the corresponding unknown response variables. We assume a setting where instead of training one estimator using all the data, we split up the data into $M$ data sets $D = \{D^1, \ldots, D^M\}$ (of typically
approximately the same size) and train $M$ estimators separately on each training data set. Gaussian process regression, which will be introduced in the next sections, is one example where this procedure is useful. Correspondingly, we partition inputs $X^m = \{X_1^m, \ldots, X_M^m\}$ and targets $g^m = \{g_1^m, \ldots, g_M^m\}$. Let $\bar{D}^i = \{D^1, \ldots, D^i\}$ denote the data sets with indices smaller than or equal to $i$, with $i = 1, \ldots, M$. Let $P(f^q \,|\, D^i)$ be the posterior predictive probability density at the $N_Q$ query points for the $i$th estimator.$^1$ Then we have in general$^2$

$$P(f^q \,|\, \bar{D}^{i-1}, D^i) \propto P(f^q)\, P(\bar{D}^{i-1} \,|\, f^q)\, P(D^i \,|\, \bar{D}^{i-1}, f^q).$$

Now we would like to approximate

$$P(D^i \,|\, \bar{D}^{i-1}, f^q) \approx P(D^i \,|\, f^q). \qquad (2.1)$$

This is not true in general unless $f^q \equiv f$: only conditioned on the complete function $f$ are targets independent. The approximation might still be reasonable when, first, $N_Q$ is large—since then $f^q$ determines $f$ everywhere—and, second, the correlation between the targets in $\bar{D}^{i-1}$ and $D^i$ is small, for example, if inputs in those sets are spatially separated from each other. Using the approximation and applying Bayes' formula, we obtain

$$P(f^q \,|\, \bar{D}^{i-1}, D^i) \approx \mathrm{const} \times \frac{P(f^q \,|\, \bar{D}^{i-1})\, P(f^q \,|\, D^i)}{P(f^q)} \qquad (2.2)$$

such that we can achieve an approximate predictive density

$$\hat{P}(f^q \,|\, D) = \mathrm{const} \times \frac{\prod_{i=1}^{M} P(f^q \,|\, D^i)}{P(f^q)^{M-1}}, \qquad (2.3)$$

where const is a normalizing constant. The posterior predictive probability densities are simply multiplied. Note that since we multiply posterior probability densities, we have to divide by the priors $M - 1$ times. This general formula can be applied to the combination of any Bayesian regression estimator.

$^1$ As an example, for an estimator parameterized by a weight vector $w$, the posterior predictive probability density is $P(f^q \,|\, D^i) = \int P(f^q \,|\, w)\, P(w \,|\, D^i)\, dw$ (Bernardo & Smith, 1994).
$^2$ In this article we assume that inputs are given, and an expression such as $P(D \,|\, f^q)$ denotes the probability density of only the targets $g^m$ conditioned on both $f^q$ and inputs $X^m$.

2.2 The BCM. In case the predictive densities are gaussian (or can be approximated reasonably well by a gaussian), equation 2.2 takes on an
especially simple form. Let us assume that the a priori predictive density at the $N_Q$ query points is a gaussian with zero mean and covariance $\Sigma^{qq}$ and the posterior predictive density for each module is a gaussian with mean $E(f^q \,|\, D^i)$ and covariance $\mathrm{cov}(f^q \,|\, D^i)$. In this case we achieve (see appendix A.1; $\hat{E}$ and $\widehat{\mathrm{cov}}$ are calculated with respect to the approximate density $\hat{P}$)

$$\hat{E}(f^q \,|\, D) = C^{-1} \sum_{i=1}^{M} \mathrm{cov}(f^q \,|\, D^i)^{-1}\, E(f^q \,|\, D^i), \qquad (2.4)$$

with

$$C = \widehat{\mathrm{cov}}(f^q \,|\, D)^{-1} = -(M - 1)\,(\Sigma^{qq})^{-1} + \sum_{i=1}^{M} \mathrm{cov}(f^q \,|\, D^i)^{-1}. \qquad (2.5)$$

We recognize that the prediction of each module $i$ is weighted by the inverse covariance of its prediction. But note that we do not compute the covariance between the modules but the covariance between the $N_Q$ query points! We have named this way of combining the predictions of the modules the Bayesian committee machine. Modules that are uncertain about their predictions are automatically weighted less than modules that are certain about their prediction. In the next sections we will apply the BCM to two important special cases with gaussian posterior predictive densities—first to gaussian process regression and then to systems with fixed basis functions.

3 Gaussian Process Regression and the BCM
An important application of the BCM is kernel-based regression in the form of regularization networks, smoothing splines, kriging, and gaussian process regression. The reasons are that, first, the degrees of freedom in such systems increase with the number of training data points such that it is not suitable to update posterior parameter probabilities and, second, the approaches require the inversion of matrices of the dimension of the number of data points, which is clearly unsuitable for large data sets. Since it is well known that all four approaches can give equivalent solutions, we will focus on only one of them: the gaussian process regression framework. In the next section, we briefly review gaussian process regression and in the following section discuss the BCM in context with gaussian process regression.

3.1 A Short Review of Gaussian Process Regression. In contrast to the usual parameterized approach to regression, in gaussian process regression, we specify the prior model directly in function space. In particular, we assume that a priori $f(x)$ is gaussian distributed (in fact, it would be an infinite-dimensional gaussian) with zero mean and a covariance $\mathrm{cov}(f(x_1), f(x_2)) = \sigma_{x_1, x_2}$. As before, we assume that we can measure only a noisy version of $f$, $g(x) = f(x) + \epsilon(x)$, where $\epsilon(x)$ is independent gaussian distributed noise with zero mean and variance $\sigma_y^2(x)$.

Let $(\Sigma^{mm})_{ij} = \sigma_{x_i^m, x_j^m}$ be the covariance matrix of the measurements, $\Psi^{mm} = \sigma_y^2(x)\, I$ be the noise-variance matrix, $(\Sigma^{qq})_{ij} = \sigma_{x_i^q, x_j^q}$ be the covariance matrix of the query points, and $(\Sigma^{qm})_{ij} = \sigma_{x_i^q, x_j^m}$ be the covariance matrix between training data and query data. $I$ is the $K$-dimensional unit matrix. Under these assumptions, the conditional density of the response variables at the query points is gaussian distributed with mean

$$E(f^q \,|\, D) = \Sigma^{qm}\,(\Psi^{mm} + \Sigma^{mm})^{-1}\, g^m \qquad (3.1)$$

and covariance

$$\mathrm{cov}(f^q \,|\, D) = \Sigma^{qq} - \Sigma^{qm}\,(\Psi^{mm} + \Sigma^{mm})^{-1}\,(\Sigma^{qm})^{T}. \qquad (3.2)$$

Note that for the $i$th query point we obtain

$$E(f_i^q \,|\, D) = \sum_{j=1}^{K} \sigma_{x_i^q, x_j^m}\, v_j, \qquad (3.3)$$

where $v_j$ is the $j$th component of the vector $(\Psi^{mm} + \Sigma^{mm})^{-1} g^m$. The last equation describes the weighted superposition of kernel functions $b_j(x_i^q) = \sigma_{x_i^q, x_j^m}$, which are defined for each training data point and are equivalent to some solutions obtained for kriging, regularization networks, and smoothing splines (Wahba, 1990; Poggio & Girosi, 1990; Williams & Rasmussen, 1996; MacKay, 1998; Hastie & Tibshirani, 1990). A common assumption is that $\sigma_{x_i, x_j} = A \exp\!\big(-\frac{1}{2c^2}\|x_i - x_j\|^2\big)$ with $A, c > 0$ such that we obtain gaussian basis functions.

3.2 The BCM for Gaussian Process Regression. Since the computational cost of the matrix inversion in equation 3.1 scales as $O(K^3)$, where $K$ is the number of training data, the cost becomes prohibitive if $K$ is large. There have been a few attempts at finding more efficient approximate solutions. Gibbs and MacKay (1997) have applied an algorithm developed by Skilling (1993) to gaussian process regression. This iterative algorithm scales as $O(K^2)$. In the regression filter algorithm by Zhu and Rohwer (1996) and Zhu, Williams, Rohwer, and Morciniec (1998), the functional space is
2724
Volker Tresp
projected into a lower-dimensional basis function space where the Kalman filter algorithm can be used for training the coefficients of the basis functions. The latter approach scales as O(KN²), where N is the number of basis functions. The big advantage of applying the BCM (see equation 2.4) to gaussian process regression (see equations 3.1 and 3.2) is that instead of having to invert a K-dimensional matrix, we now only have to invert approximately (K/M)-dimensional and N_Q-dimensional matrices. If we set M = K/a, where a is a constant, the BCM scales linearly in K. We found that M = K/N_Q is a good choice to optimize the number of computations, since then all matrices that have to be inverted have the same size. In this case, the number of elements in each module is identical to N_Q, the number of query points. In section 7 we will describe on-line versions of the algorithm, which are well suited for data mining applications. If N_Q is large, it can be more efficient to work with another formulation of the BCM, which is described in appendix A.2.

4 The BCM and Regression with Fixed Basis Functions

4.1 Regression with Fixed Basis Functions. In the previous section, the BCM was applied to a kernel-based system, that is, gaussian process regression. For kernel-based systems, the BCM can be very useful for on-line learning and when the training data set is large. For regression with fixed basis functions, on the other hand, sufficient statistics exist such that large data sets and on-line learning are not problematic. Still, there are two good reasons to study the BCM in the context of systems with fixed basis functions. First, the system under consideration might have a large number, or even an infinite number, of basis functions, such that working with posterior weight densities is infeasible. Second, the study of these systems provides insights into kernel-based systems.
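The combination rule itself (equations 2.4 and 2.5) is the same in both settings. A minimal numpy sketch, assuming each module i has already produced its posterior mean E(f^q|D_i) and covariance cov(f^q|D_i) at the N_Q query points (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def bcm_combine(means, covs, Sqq):
    # eq. 2.5: combined precision. The prior enters each module posterior
    # once, so M - 1 redundant copies of the prior precision are removed.
    M = len(means)
    precs = [np.linalg.inv(C) for C in covs]
    prec_bcm = -(M - 1) * np.linalg.inv(Sqq) + sum(precs)
    cov_bcm = np.linalg.inv(prec_bcm)
    # eq. 2.4: precision-weighted combination of the module means
    mean_bcm = cov_bcm @ sum(P @ m for P, m in zip(precs, means))
    return mean_bcm, cov_bcm
```

Only matrices of size roughly K/M (inside each module) and N_Q (here) are inverted, never a K × K one.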
Consider the same situation as in section 2, only that f(x) is a superposition of N fixed basis functions, that is,

f(x) = ∑_{i=1}^{N} w_i φ_i(x),

where w = (w₁, ..., w_N)′ is the N-dimensional weight vector with a gaussian weight prior with mean 0 and covariance Σ_w. Let (Φ)_{ki} = φ_i(x_k^m) be the design matrix, that is, the matrix of the activations of the basis functions at the training data. Furthermore, let (Φ^q)_{ki} = φ_i(x_k^q) be the corresponding matrix for the query points. We obtain for the posterior weights a gaussian density with mean

E(w|D) = (Σ_w^{-1} + Φ′ (Ψ^{mm})^{-1} Φ)^{-1} Φ′ (Ψ^{mm})^{-1} g^m    (4.1)
A Bayesian Committee Machine
2725
and covariance

cov(w|D) = (Σ_w^{-1} + Φ′ (Ψ^{mm})^{-1} Φ)^{-1}.

The posterior density at the query points is then also gaussian, with mean

E(f^q|D) = Φ^q E(w|D) = Φ^q (Σ_w^{-1} + Φ′ (Ψ^{mm})^{-1} Φ)^{-1} Φ′ (Ψ^{mm})^{-1} g^m    (4.2)

and covariance

cov(f^q|D) = Φ^q (Σ_w^{-1} + Φ′ (Ψ^{mm})^{-1} Φ)^{-1} (Φ^q)′.
By substituting these expressions into equations 2.4 and 2.5, we obtain the BCM solution. Note that here Σ^{qq} = Φ^q Σ_w (Φ^q)′. As a special case, consider that Φ^q is a nonsingular square matrix; that is, there are as many query points as there are parameters N. After some manipulations, we obtain Ê(f^q|D) = E(f^q|D); that is, the BCM approximation is identical to the solution we would have obtained if we had done Bayesian parameter updates. Assuming the model assumptions are correct, this is the Bayes optimal prediction. This should be no surprise, since N appropriate data points uniquely define the weights w; therefore f is also uniquely defined, and the approximation of equation 2.1 becomes an equality.³ We can see this also from equation 4.2: if Φ^q is a square matrix with full rank, the optimal weight vector can be recovered from the expected value of f^q by matrix inversion.

4.2 Fixed Basis Functions and Gaussian Process Regression. Note that by using basic matrix manipulations, we can write the expectation in equation 4.2 also as

E(f^q|D) = Φ^q Σ_w Φ′ (Ψ^{mm} + Φ Σ_w Φ′)^{-1} g^m    (4.3)

and, equivalently, the covariance as

cov(f^q|D) = Φ^q Σ_w (Φ^q)′ − Φ^q Σ_w Φ′ (Ψ^{mm} + Φ Σ_w Φ′)^{-1} Φ Σ_w (Φ^q)′.

This is mathematically equivalent to equations 3.1 and 3.2 if we substitute

Σ^{mm} = Φ Σ_w Φ′,   Σ^{qm} = Φ^q Σ_w Φ′,   Σ^{qq} = Φ^q Σ_w (Φ^q)′.    (4.4)
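This equivalence is easy to verify numerically. The sketch below (numpy; the radial basis functions and all names are illustrative, not from the paper) checks that the weight-space mean of equation 4.2 matches the function-space mean of equation 4.3 under the substitutions 4.4:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, NQ, noise = 4, 30, 6, 0.01   # basis functions, training size, query size, sigma^2

centers = np.linspace(-1.0, 1.0, N)
def phi(X):
    # illustrative fixed radial basis functions phi_i(x)
    return np.exp(-(X[:, None] - centers[None, :]) ** 2 / 0.5)

Xm, Xq = rng.uniform(-1, 1, K), rng.uniform(-1, 1, NQ)
Phi, Phi_q = phi(Xm), phi(Xq)      # design matrices at training and query inputs
Sw = np.eye(N)                     # weight prior covariance Sigma_w
Psi_inv = np.eye(K) / noise        # (Psi^{mm})^{-1}
gm = rng.normal(size=K)            # arbitrary targets

# weight-space route (eqs. 4.1 and 4.2)
post_w_cov = np.linalg.inv(np.linalg.inv(Sw) + Phi.T @ Psi_inv @ Phi)
mean_42 = Phi_q @ post_w_cov @ Phi.T @ Psi_inv @ gm

# function-space route (eq. 4.3) with Smm = Phi Sw Phi' (eq. 4.4)
mean_43 = Phi_q @ Sw @ Phi.T @ np.linalg.inv(noise * np.eye(K) + Phi @ Sw @ Phi.T) @ gm

assert np.allclose(mean_42, mean_43)
```

Note which route is cheaper: the first inverts an N × N matrix, the second a K × K one.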
3 For nonlinear systems we can replace the design matrix Φ by the matrix of first derivatives of the map at the inputs with respect to the weights (Tresp & Taniguchi, 1995). The equations in this section are then valid within the linearized approximation.
This means that the solution we have obtained for fixed basis functions is identical to the solution for gaussian process regression with kernel functions

b_j(x_i^q) = σ_{x_i^q,x_j^m} = φ(x_i^q)′ Σ_w φ(x_j^m),    (4.5)

where φ(x) = (φ₁(x), ..., φ_N(x))′ is the activation of the fixed basis functions at x. From this it is obvious that the kernel functions are simply linear combinations of the fixed basis functions and "live in" the same functional space. The maximum dimension of this space is equal to the number of fixed basis functions (assuming these are linearly independent).⁴

4 An interesting case is when the number of fixed basis functions is infinite. Consider the case that we have an infinite number of uniformly distributed fixed radial gaussian basis functions φ_i(x) = G(x; x_i, σ²) and Σ_w = I, where I is the unit matrix. Then b_j(x_i^q) = ∫ G(x_i^q; x, σ²) G(x_j^m; x, σ²) dx ∝ G(x_i^q; x_j^m, 2σ²), which is a gaussian with twice the variance. We will use gaussian kernel functions in the experiments. G(x; c, Σ) is our notation for a normal density function with mean c and covariance Σ evaluated at x.

In the context of smoothing splines, the relationship between fixed basis functions and kernel regression is explored by Wahba (1990), and in a Bayesian setting by Neal (1996, 1997), Williams and Rasmussen (1996), Williams (1996, 1997, 1998a, 1998b), and MacKay (1998). Note that according to equation 4.5, the covariance function of the response variables, and therefore also the kernel function, is defined by inner products in feature space. In the work on support vector machines (Vapnik, 1995), this relationship is also explored. There the question is posed: given the kernel function b_j(x), is there a fixed-basis-function expansion that corresponds to these kernel functions? One answer is given by Mercer's theorem, which states that for symmetric and positive definite kernels, a corresponding set of fixed basis functions exists. Note also that if the kernel is known and if the number of basis functions is much larger than the number of training data points, the calculations in equation 4.2 are much more expensive than the calculations in equation 4.3 (resp. 3.1). The relationship between support vector machines, kernel-based regression, and basis function approaches was recently explored by Poggio and Girosi (1998), Girosi (1998), and Smola, Schölkopf, and Müller (1998). Note that we can define the priors either in weight space with covariance Σ_w or in function space with covariance Σ^{mm}. This duality is also explored by Moody and Rögnvaldsson (1997) and in the regression-filter algorithm by Zhu and Rohwer (1996).

5 Why the BCM Works

The central question is under which conditions equation 2.1 is a good approximation. One case is when the data sets are unconditionally independent, that is, P(D_i | D^{i−1}) ≈ P(D_i). This might be achieved by first clustering the data and then assigning the data of each cluster to a separate module. A second case where the approximation is valid is if the data sets are independent conditioned on f^q. In the previous section, we saw an example where we even have the equality P(D_i | D^{i−1}, f^q) = P(D_i | f^q). We have shown that for a system with N fixed basis functions, N query points are sufficient such that the BCM provides the Bayes optimal prediction. Most kernel-based systems, though, have infinitely many degrees of freedom, which would require an infinite number of query points to make the previous equality valid. We saw one exception in the previous section: a (degenerate) gaussian process regression system that has the same number of degrees of freedom as the equivalent system with fixed basis functions. In this section we show that even for gaussian process regression with infinite degrees of freedom, in typical cases only a limited number of degrees of freedom are really used, such that even with a small number of query points, the BCM approximation is reasonable.

5.1 The Dimension and the Effective Degrees of Freedom. In gaussian process regression, the regression function lives in the space spanned by the kernel functions (see equations 3.1 and 3.3). Since the number of kernel functions is equal to the number of data points, the dimension of the subspace spanned by the kernel functions is upper bounded by the number of data points K. For two reasons, the dimension might be smaller than K. The first reason is that the data points are degenerate; for example, some input training data points might appear several times in the data set. The second reason is that the kernel functions themselves are degenerate, such that even for an arbitrary selection of input data, the dimension of the functional space represented by the kernel functions is upper bounded by dim_max.
In section 4 we gave an example of a finite dim_max: for the system of fixed basis functions and the equivalent gaussian process regression system, we have dim_max = N. In addition to the dimension, we can also define the effective degrees of freedom of kernel regression (Hastie & Tibshirani, 1990). A reasonable definition for gaussian process regression is

P_eff^Data = trace( Σ^{mm} (Ψ^{mm} + Σ^{mm})^{-1} ),

which measures how many degrees of freedom are used by the given data. This is analogous to the definition of the effective number of parameters by Wahba (1983), Hastie and Tibshirani (1990), Moody (1992), and MacKay (1992).⁵ Let {λ₁, ..., λ_K} be the eigenvalues of Σ^{mm}. Then we obtain

P_eff^Data = ∑_{i=1}^{K} λ_i / (λ_i + σ_ψ²).

[Figure 1: three panels (γ = 0.01, γ = 0.03, γ = 0.1), each plotting values between 0 and 100 against the number of training data K.]

Figure 1: Shown are the effective degrees of freedom P_eff^Data (continuous line) and the rank of Σ^{mm} (dashed line) as a function of the number of training data K. The rank of Σ^{mm} is a lower bound on the dimension of the space spanned by the kernel functions. The rank is here calculated as the number of singular values that are larger than tol = K × norm(Σ^{mm}) × eps, where eps = 2.2204 × 10⁻¹⁶ and norm(Σ^{mm}) is the largest singular value of Σ^{mm} (default Matlab settings). As in the experiments in section 8, for two locations in input space x_i and x_j, we set σ_{x_i,x_j} = exp(−(1/(2γ²)) ‖x_i − x_j‖²), where γ is a positive constant. The locations of the input training data were chosen randomly in the interval [0, 1]. For the noise variance, we have σ_ψ = 0.1. Note that P_eff^Data is always smaller than K, especially for large K. P_eff^Data also decreases when we assume a strong correlation (i.e., a large γ). The rank is larger than P_eff^Data but also smaller than K, in particular for large γ and K. For this covariance structure, dim_max is theoretically infinite. As the plots show, if the input data are always drawn from the same distribution, then for practical purposes dim_max might be considered finite if γ is large.

The effective degrees of freedom are therefore approximately equal to the number of eigenvalues that are greater than the noise variance. If σ_ψ² is small, the system uses all degrees of freedom to fit the data. For a large σ_ψ², P_eff^Data → 0 and the data barely influence the solution (see Figure 1). We can therefore conclude that only P_eff^Data query points are necessary to fix the relevant degrees of freedom of the system. We have to select the query points such that they fix these degrees of freedom.⁶
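The trace form and the eigenvalue form of P_eff^Data agree, which can be checked in a few lines (numpy sketch; names are illustrative, not from the paper):

```python
import numpy as np

def effective_dof(Smm, noise_var):
    # P_eff = sum_i lambda_i / (lambda_i + sigma^2)
    #       = trace(Smm (Psi + Smm)^{-1})  for  Psi = sigma^2 I
    lam = np.linalg.eigvalsh(Smm)
    return float((lam / (lam + noise_var)).sum())
```

For a covariance matrix built from a wide gaussian kernel, effective_dof stays far below K, which is why a moderate number of query points suffices.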
5 Note that Σ^{mm}(Ψ^{mm} + Σ^{mm})^{-1} maps training targets into predictions at the same locations. This expression is therefore equivalent to the so-called hat matrix in linear regression. There, analogously, the trace of the hat matrix is defined as the effective number of parameters.

6 An impractical solution would be to include all training data in the set of query data, that is, f^m ⊆ f^q. Although this would guarantee the conditional independence of the measurements, the set of query points would obviously be too large.

Independently, Ferrari-Trecate, Williams, and Opper (1999) made a similar observation. They found that for small amounts of data, the coefficients of the eigenfunctions with small eigenvalues are not well determined by the data and thus can effectively be ignored.

5.2 Dependency Graph. Another way of looking at the issue is to use a dependency graph, which will provide us with a more general view of the BCM. Consider again the case that a set of past observations D is partitioned into M smaller data sets D = {D₁, ..., D_M}. In a typical statistical model, the data sets are independent given all the parameters in a model w. Furthermore, a future prediction f^* is independent of the training data sets given the parameters. The corresponding dependency graph is shown in Figure 2 (left). Let us assume that there is another (typically vectorial) variable f^q that is independent of f^* and the data sets given w. In the previous discussion we always assumed that f^* ∈ f^q, but in section 7 we will also be interested in the case where this does not hold. If we now consider a probabilistic model without w, that is, if we marginalize out the parameters w, then f^*, f^q, and all the data sets will in general become fully dependent. On the other hand, if we can assume that w is almost uniquely determined by f^q, the approximation that, given f^q, the data sets and f^* are all mutually independent might still be reasonable (see Figure 2, right).

[Figure 2: two dependency graphs, the left rooted in w, the right rooted in f^q, each with children f^*, D₁, D₂, ..., D_M.]

Figure 2: (Left) The dependency structure of parameterized Bayesian learning. (Right) The approximate dependency structure we assume for the BCM.

Explicitly, we assume, first, that

P(D_i | f^q, f^*, D ∖ D_i) ≈ P(D_i | f^q),

which means that if f^q is fixed, the different data sets are mutually independent and also independent of f^*. One case where this is true is if w is deterministically determined by f^q. Second, we assume that

P(f^* | D, f^q) ≈ P(f^* | f^q).

This says that, given f^q, D does not provide additional information about f^*. This is guaranteed, for example, if f^* ∈ f^q as in section 2, or if, as before, w is deterministically determined by f^q.⁷ Under these two assumptions,

P̂(f^q|D) = const × P(f^q) ∏_{i=1}^{M} P(D_i | f^q) ∝ (1 / P(f^q)^{M−1}) ∏_{i=1}^{M} P(f^q | D_i)    (5.1)

and

P̂(f^*|D) = ∫ P(f^* | f^q) P̂(f^q|D) df^q.    (5.2)
Equation 5.1 is the basis for the BCM (compare equation 2.3). Equation 5.2 shows how a new prediction of f^* can be derived from P̂(f^q|D). The dependency structure can be determined experimentally for gaussian process regression by analyzing the inverse covariance structure (see section A.3).

6 The Idea Behind the BCM Applied to a Non-Bayesian Setting
So far we have developed the BCM in a Bayesian setting. In this section we apply the basic idea behind the BCM to a non-Bayesian setting. The basic difference is that here we have to form expectations over data rather than parameters. In Tresp and Taniguchi (1995) and Taniguchi and Tresp (1997), an input-dependent weighting scheme was introduced, which takes into account the input-dependent certainty of an estimator. This approach is now extended to include the ideas developed in this article. In contrast to the previous work, we allow estimates at other query points to be used for improving the estimate at a given point x. As before, we use N_Q query points. Let there be M estimators, and let f̂_i(x) denote the estimate of the ith estimator at input x. Let f̂_i^q = (f̂_i(x₁^q), ..., f̂_i(x_{N_Q}^q))′ denote the vector of responses of the ith estimator to all query points, and let f̂^q = ((f̂₁^q)′, ..., (f̂_M^q)′)′ denote the vector of all responses of all estimators. We assume that all estimators are unbiased. Then our goal is to form a linear combination of the predictions at the query points,

f̂_comb^q = A f̂^q,

7 In section 7.1 on on-line learning, we will also consider the case where f^* ∉ f^q.
where A is an N_Q × (N_Q M) matrix of unknown coefficients, chosen such that the expected error E[(t^q − A f̂^q)′ (t^q − A f̂^q)] is minimized, where t^q is the (unknown) vector of true expectations at the query points. Using the bias-variance decomposition for linearly combined systems, we obtain⁸

E[(t^q − A f̂^q)′ (t^q − A f̂^q)]
  = E[(E(A f̂^q) − A f̂^q)′ (E(A f̂^q) − A f̂^q)] + (E(A f̂^q) − t^q)′ (E(A f̂^q) − t^q)
  = trace(A V A′) + (A E[f̂^q] − t^q)′ (A E[f̂^q] − t^q).

The term in the trace is the covariance of the combined estimator, the second term is the bias of the combined estimator, and

(V)_{ij} = cov((f̂^q)_i, (f̂^q)_j).

Note that at this point we include the case that different estimators are correlated. We want to enforce the constraint that the combined system is unbiased. This can be enforced if we require that A I = I_{N_Q}, where I_{N_Q} is an N_Q-dimensional unit matrix and I = (I_{N_Q}, ..., I_{N_Q})′ concatenates M unit matrices. Under these constraints, it can easily be shown that E(f̂_comb^q) = A E[f̂^q] = t^q. Using Lagrange multipliers to enforce the constraint, we obtain for the optimal coefficients

A = (I′ V^{-1} I)^{-1} I′ V^{-1}.

This is the most general result, allowing for correlations between both estimators and query points. In our case, we assume that the individual estimators were trained on different data sets, which means that their respective prediction errors are uncorrelated. In this case, V is block diagonal:

V = blockdiag(cov₁^{qq}, cov₂^{qq}, ..., cov_M^{qq}),

8 In this section, expectations are taken with respect to different data sets.
where cov_i^{qq} is the covariance matrix of the query points for the ith estimator. Then

f̂_comb^q = ( ∑_{j=1}^{M} (cov_j^{qq})^{-1} )^{-1} ∑_{i=1}^{M} (cov_i^{qq})^{-1} f̂_i^q.    (6.1)
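For this block-diagonal V, the optimal combination reduces to the precision-weighted average of equation 6.1. A minimal sketch (numpy; names are illustrative, not from the paper):

```python
import numpy as np

def combine_estimators(preds, covs):
    # eq. 6.1: each covariance is per estimator, across its query-point
    # responses; errors of different estimators are assumed uncorrelated
    precs = [np.linalg.inv(C) for C in covs]
    return np.linalg.inv(sum(precs)) @ sum(P @ f for P, f in zip(precs, preds))
```

With equal covariances this is a plain average; otherwise more certain estimators get more weight.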
Note the similarity to equation 2.4 if (Σ^{qq})^{-1} → 0, that is, if we have a very "weak" prior. Note again that the covariance matrices are with respect to responses of the same estimator, not between estimators. For linear systems with fixed basis functions, cov_j^{qq} can be calculated using well-known formulas from linear regression (Sen & Srivastava, 1990). Tresp and Taniguchi (1995) and Taniguchi and Tresp (1997) describe how the covariance of the prediction of estimators can be calculated for nonlinear systems.

7 Extensions

7.1 On-line BCM. In this section we derive an on-line version of the BCM. We assume that data arrive continuously in time. Let D_k denote the data points collected between time t(k−1) and t(k), and let D^k = {D₁, ..., D_k} denote the set of all data collected up to time t(k). We can derive an algorithm that directly adapts the posterior probabilities of f^q. The iterations are based on the Kalman filter and yield, for k = 1, 2, ...,

A_k = A_{k−1} + K_k (E(f^q | D_k) − A_{k−1})
K_k = S_{k−1} (S_{k−1} + cov(f^q | D_k))^{-1}
S_k = S_{k−1} − K_k (S_{k−1} + cov(f^q | D_k)) K_k′

with S₀ = Σ^{qq} and A₀ = (0, ..., 0)′. At any time,

Ê(f^q | D^k) = Σ^{qq} (Σ^{qq} − k S_k)^{-1} A_k.
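The recursion can be sketched as follows (numpy; names are illustrative, not from the paper). The last line applies the closed-form readout above, and by construction the result coincides with the batch BCM of equations 2.4 and 2.5:

```python
import numpy as np

def online_bcm(module_stats, Sqq):
    # module_stats: iterable of (E(f^q|D_k), cov(f^q|D_k)) pairs arriving over time
    NQ = Sqq.shape[0]
    A, S, k = np.zeros(NQ), Sqq.copy(), 0
    for mean_k, cov_k in module_stats:
        K = S @ np.linalg.inv(S + cov_k)     # Kalman gain K_k
        A = A + K @ (mean_k - A)             # update A_k
        S = S - K @ (S + cov_k) @ K.T        # update S_k
        k += 1
    # closed-form readout: E(f^q|D^k) = Sqq (Sqq - k S_k)^{-1} A_k
    return Sqq @ np.linalg.inv(Sqq - k * S) @ A
```

Only one N_Q × N_Q matrix and one N_Q vector persist between updates, regardless of how much data has been seen.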
Note that only one matrix and one vector of the size of the number of query points need to be stored, which makes this on-line version of the BCM applicable to data mining applications. If N_Q is large, the Kalman filter described in section A.2 might be more suitable.

7.2 Varying Query Points. In some cases, particularly in on-line learning, we might be interested in the responses at additional query points. The goal is then to infer from the estimated density at the query points f^q the density at other query points f^*. Using the approximation in section 5, P̂(f^*|D) is gaussian distributed with mean

Ê(f^*|D) = Σ^{*q} (Σ^{qq})^{-1} Ê(f^q | D^k).    (7.1)
Note that this equation describes the superposition of basis functions, defined for every query point in f^q and evaluated at f^*, with weights (Σ^{qq})^{-1} Ê(f^q | D^k). Note also that these operations are relatively inexpensive. Furthermore, the covariance becomes

ĉov(f^*|D) = Σ^{**} − Σ^{*q} (Σ^{qq})^{-1} (Σ^{*q})′ + Σ^{*q} (Σ^{qq})^{-1} ĉov(f^q|D) (Σ^{*q} (Σ^{qq})^{-1})′.

The covariance of the prediction is small if the covariance ĉov(f^q|D) has small elements and if f^* can be well determined from f^q.
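Equation 7.1 is a plain linear map of the already-computed estimate; a one-line sketch (names illustrative, not from the paper):

```python
import numpy as np

def predict_at_new_points(S_star_q, Sqq, Ehat_q):
    # eq. 7.1: E(f^*|D) = Sigma^{*q} (Sigma^{qq})^{-1} E(f^q|D^k)
    return S_star_q @ np.linalg.inv(Sqq) @ Ehat_q
```

If the new points coincide with the query points (Σ^{*q} = Σ^{qq}), the estimate is returned unchanged.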
8 Experiments

In the experiments we set, for two locations in input space x_i and x_j, σ_{x_i,x_j} = A exp(−(1/(2γ²)) ‖x_i − x_j‖²), where A and γ are positive constants. In the first experiments, we used an artificial data set. The input space is Dim = 5-dimensional, and the inputs were randomly chosen in x ∈ [−1, 1]^{Dim}. Targets were generated by adding independent gaussian noise with a standard deviation of 0.1 to a map defined by five normalized gaussian basis functions (see section A.4). The width parameter γ and the noise-to-prior factor σ_ψ²/A were optimized using a validation set of size 100 and were then left fixed for all experiments. Alternative approaches to estimate (or to integrate out) these hyperparameters are described by Gibbs and MacKay (1997), Neal (1996), and Williams and Rasmussen (1996).

Figure 3 (left) shows the CPU time for the different algorithms as a function of the number of training samples. Clearly the CPU time for gaussian process regression scales as K³. The two versions of the BCM scale linearly. Also note that for K/M = 100 ≈ K/N_Q (in the experiment, N_Q = 128), we obtain the least computational cost. Figure 3 (right) shows the mean squared query set error for simple gaussian process regression and for the BCM. As can be seen, the BCM is not significantly worse than the simple gaussian process regression system with only one module. Note that the average error of the individual modules is considerably worse than the BCM solution, and that simple averaging of the predictions of the modules is a reasonable approach but considerably worse than the BCM. Figure 4 (left) shows the mean squared query set error of the BCM as a function of the query set size N_Q. As confirmed by the theory, for small N_Q the performance deteriorates. We obtain better results when we query more points at the same time.

In the second set of experiments, we tested the performance using a number of real and artificial data sets. Here we found it more useful to normalize the error. The relative error is defined as mse_a^r = (mse_a − mse_f) / mse_f, where mse_a is the mean squared query set error of algorithm a, and mse_f is the mean squared query set error of gaussian process regression (M = 1).
Figure 3: (Left) CPU time in seconds as a function of the training set size K for M = 1 (gaussian process regression, continuous line) and the BCM with module size K/M = 10 (dash-dotted line) and module size K/M = 100 (dashed line). In these experiments the number of query points was N_Q = 128. (Right) The mean squared query set error as a function of the number of training data. The continuous line shows the performance of the simple gaussian process regression system with only one module. For the other experiments, the training data set is divided into K/100 modules of size 100. Shown are the average errors of the individual modules (dash-dotted line), the error achieved by averaging the predictions of the modules (dotted line), and the error of the BCM (dashed line). Note that the error of the BCM is not significantly worse than the error of simple gaussian process regression.
Tables 1 and 2 summarize the results. The excellent performance of the BCM is well demonstrated. The results show that P_eff^Data is indeed a good indication of how many query points are required: if the number of query points is larger than P_eff^Data, the performance of the BCM solution is hardly distinguishable from the performance of gaussian process regression with only one module. For the real-world data (see Table 1), we obtain excellent performance for the BUPA and DIABETES data sets, with a P_eff^Data of 16 and 64, respectively, and slightly worse performance for the WAVEFORM and HOUSING data, with a P_eff^Data of 223 and 138, respectively. A possible explanation for the slightly worse performance on the Boston housing data set might be the well-known heteroscedasticity (unequal target noise variance) exhibited by this data set. For the artificial data sets (see Table 2), a large P_eff^Data is obtained for a data set with no noise and for a high-dimensional data set with 50 inputs. In contrast, a small P_eff^Data is obtained for a noisy data set and for a low-dimensional data set. In our third experiment (see Figure 4, right), we tested the on-line BCM described in section 7.1. Note that the BCM with K = 60,000 achieves results
Figure 4: (Left) Mean squared query set error as a function of the number of query points N_Q for a module size K/M = 100 and with K = 1000. Note that for small N_Q, the performance deteriorates. (Right) Test of the on-line learning algorithm of section 7.1 using the artificial data set. The continuous line is the mean squared query set error for N_Q = 1000 and a module size of K/M = 1000 as a function of the number of training data K. The width parameter γ and the noise-to-prior factor σ_ψ²/A were optimized for the large data set (K = 60,000). The dash-dotted line shows the error of gaussian process regression (M = 1) with K = 1000, approximately the largest training data size suitable for gaussian process regression. Here, the width parameter γ and the noise-to-prior factor σ_ψ²/A were optimized using a validation set of size K = 1000. Note the great improvement with the BCM, which can employ the larger data sets. The dashed line is the performance of the BCM with N_Q = 100. Note that the full benefit of the larger data set is exploited only by the larger N_Q. For K = 60,000, gaussian process regression would need an estimated one year of CPU time on a typical workstation, whereas the BCM with N_Q = 100 required only a few minutes, and with N_Q = 1000 a few hours.
unobtainable with gaussian process regression, which is limited to a training data set of approximately K = 1000. Considering the excellent results obtained by the BCM for K = 60,000 as displayed in the figure, one might ask how close the BCM solution comes to a gaussian process regression system trained on such a large data set. We expect that the latter should give better results, since the width parameter γ can then be optimized for the large training data set of size K = 60,000. Typically, the optimal γ decreases with K (Hastie & Tibshirani, 1990). For the BCM, γ is chosen such that the effective degrees of freedom approximately match the number of query points N_Q, so the optimal γ should be more or less independent of K. Experimentally, we found that scaling γ down with K also improves performance for the BCM, but that γ cannot be scaled down as much as in the case of the gaussian process regression model. Recall, though, that for K = 60,000 the gaussian process regression system would require one year of CPU time.
Table 1: Results from Real-World Data.

                   BUPA                  DIABETES           WAVEFORM           HOUSING
Dim                6                     8                  21                 13
K (training size)  200                   600                600                400
GPR                0.828 ± 0.028         0.681 ± 0.032      0.379 ± 0.019      0.1075 ± 0.0065
BCM(10, 50)        (−2.6 ± 1.8) × 10⁻⁴   0.0095 ± 0.0030    0.0566 ± 0.0243    0.338 ± 0.132
BCM(100, 50)       (1.6 ± 1.8) × 10⁻⁴    −0.0027 ± 0.0023   0.0295 ± 0.0213    0.117 ± 0.038
BCM(10, 100)       (0.1 ± 0.2) × 10⁻⁴    −0.0001 ± 0.0015   0.0037 ± 0.0095    0.196 ± 0.053
BCM(100, 100)      (0.0 ± 0.1) × 10⁻⁴    −0.0011 ± 0.0010   0.0163 ± 0.0086    0.1138 ± 0.0568
rank               159                   600                600                400
P_eff^Data         16                    43                 223                138

Notes: These data sets can be retrieved from the UCI repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. All data sets were normalized to a mean of zero and a standard deviation of one. Shown are the input dimension Dim, the number of training data used K, and the mean squared query set error of the various algorithms. Shown are averages over 20 experiments, from which error bars could be derived. In each experiment, data were assigned to training and query data randomly. The width parameter γ and the noise-to-prior factor σ_ψ²/A were optimized using a validation set of size 100. The first row of results (GPR) shows the mean squared query set error of gaussian process regression (M = 1). The other rows show relative errors (see text). BCM(i, j) shows the relative error with module size K/M = i and query set size N_Q = j. Also shown are the rank of Σ^{mm} (calculated as described in Figure 1) and the effective number of parameters.
Table 2: Results from Artificial Data.

                 ART (σ_ψ = 0)     ART (σ_ψ = 0.5)    ART (σ_ψ = 0.1)    ART (σ_ψ = 0.1)
Dim              5                 5                  50                 2
K                600               600                600                600
GPR              0.0185 ± 0.0009   0.0529 ± 0.0026    0.0360 ± 0.0017    0.0022 ± 0.00012
BCM(10, 50)      0.3848 ± 0.0442   0.0021 ± 0.0013    0.4135 ± 0.0860    0.0809 ± 0.0221
BCM(100, 50)     0.3066 ± 0.0282   0.0011 ± 0.0012    0.0994 ± 0.0541    0.0928 ± 0.0157
BCM(10, 100)     0.1574 ± 0.0135   0.0001 ± 0.0000    −0.0923 ± 0.0181   −0.0007 ± 0.0008
BCM(100, 100)    0.0595 ± 0.0114   0.0000 ± 0.0000    −0.0527 ± 0.0166   0.0002 ± 0.0011
rank             600               600                600                242
P_eff^Data       121               22                 600                49

Notes: The data are generated as described in the beginning of section 8. As before, the width parameter γ and the noise-to-prior factor σ_ψ²/A were optimized using a validation set of size 100. In columns from left to right, the experiments are characterized as: no noise (σ_ψ = 0), large noise (σ_ψ = 0.5), high input dimension (Dim = 50), and low input dimension (Dim = 2). The large relative error for the "no noise" experiment must be seen in connection with the small absolute error.

9 Conclusions

We have introduced the BCM and have applied it to gaussian process regression and to systems with fixed basis functions. We found experimentally that the BCM provides excellent predictions on several data sets. The CPU time of the BCM scales linearly in the number of training data, and the BCM can therefore be used in applications with large data sets, as in data mining and in applications requiring on-line learning. Maybe the most surprising result is that the quality of the prediction of the BCM improves if several query points are calculated at the same time.

Appendix

A.1 Product and Quotient of Gaussian Densities. For the product of two gaussian densities we obtain
G(x; c₁, Σ₁) G(x; c₂, Σ₂) ∝ G(x; (Σ₁^{-1} + Σ₂^{-1})^{-1} (Σ₁^{-1} c₁ + Σ₂^{-1} c₂), (Σ₁^{-1} + Σ₂^{-1})^{-1}),

and for the quotient of two gaussian densities we obtain

G(x; c₁, Σ₁) / G(x; c₂, Σ₂) ∝ G(x; (Σ₁^{-1} − Σ₂^{-1})^{-1} (Σ₁^{-1} c₁ − Σ₂^{-1} c₂), (Σ₁^{-1} − Σ₂^{-1})^{-1}).

G(x; c, Σ) is our notation for a normal density function with mean c and covariance Σ evaluated at x. Substituting gaussian densities in equation 2.3, we obtain

P̂(f^q|D) = const × ∏_{i=1}^{M} G[f^q; E(f^q|D_i), cov(f^q|D_i)] / G[f^q; 0, Σ^{qq}]^{M−1}

= const × G[f^q; (∑_{i=1}^{M} cov(f^q|D_i)^{-1})^{-1} ∑_{i=1}^{M} cov(f^q|D_i)^{-1} E(f^q|D_i), (∑_{i=1}^{M} cov(f^q|D_i)^{-1})^{-1}] / G[f^q; 0, Σ^{qq}/(M−1)]

= const × G[f^q; C^{-1} ∑_{i=1}^{M} cov(f^q|D_i)^{-1} E(f^q|D_i), C^{-1}],

with

C = −(M−1)(Σ^{qq})^{-1} + ∑_{i=1}^{M} cov(f^q|D_i)^{-1},

such that equations 2.4 and 2.5 follow.
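The product formula can be checked numerically: the pointwise product of two gaussians, divided by the gaussian with the combined moments, is constant in x. A sketch (numpy; names are illustrative, not from the paper):

```python
import numpy as np

def gauss(x, c, S):
    # normal density G(x; c, S)
    d = x - c
    norm = np.sqrt((2.0 * np.pi) ** len(c) * np.linalg.det(S))
    return float(np.exp(-0.5 * d @ np.linalg.inv(S) @ d) / norm)

def product_moments(c1, S1, c2, S2):
    # G(x;c1,S1) G(x;c2,S2) is proportional to G(x;c,S) with these moments
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)
    S = np.linalg.inv(P1 + P2)
    return S @ (P1 @ c1 + P2 @ c2), S
```

The quotient formula follows the same pattern with the second precision subtracted instead of added.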
A.2 Another Formulation of the BCM. Based on our gaussian assumptions, we can also calculate the probability density of the measurements given the query points, P(g^m | f^q). This density is also gaussian with mean

\[
E(g^m \mid f^q) = (\Sigma^{qm})' (\Sigma^{qq})^{-1} f^q, \tag{A.1}
\]

and covariance

\[
\mathrm{cov}(g^m \mid f^q) = \Psi^{mm} + \Sigma^{mm} - (\Sigma^{qm})' (\Sigma^{qq})^{-1} \Sigma^{qm}. \tag{A.2}
\]

Now, the BCM approximation can also be written as

\[
\hat{P}(f^q \mid D) \propto P(f^q) \prod_{i=1}^{M} P(g^m_i \mid f^q).
\]

Then we obtain, with A_i = (Σ^{qm,i})' (Σ^{qq})^{-1},

\[
\hat{E}(f^q \mid D) = C^{-1} \sum_{i=1}^{M} A_i' \, \mathrm{cov}(g^m_i \mid f^q)^{-1} g^m_i, \tag{A.3}
\]

with

\[
C = \widehat{\mathrm{cov}}(f^q \mid D)^{-1} = (\Sigma^{qq})^{-1} + \sum_{i=1}^{M} A_i' \, \mathrm{cov}(g^m_i \mid f^q)^{-1} A_i. \tag{A.4}
\]

Σ^{qm,i} is the covariance matrix between the training data for the ith module and the query points. An on-line version can be obtained using the Kalman filter, which yields the iterative set of equations, for k = 1, 2, ...,

\[
\hat{E}(f^q \mid D^k) = \hat{E}(f^q \mid D^{k-1}) + K_k \big( g^m_k - A_k \hat{E}(f^q \mid D^{k-1}) \big),
\]
\[
K_k = \Sigma_{k-1} A_k' \big( A_k \Sigma_{k-1} A_k' + \mathrm{cov}(g^m_k \mid f^q) \big)^{-1},
\]
\[
\Sigma_k = \Sigma_{k-1} - K_k \big( A_k \Sigma_{k-1} A_k' + \mathrm{cov}(g^m_k \mid f^q) \big) K_k',
\]
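As a sanity check on this recursion (started from the prior, Σ_0 = Σ^{qq}), the following sketch (our own code; all variable names and the random test data are assumptions) verifies on random data that the Kalman iteration reproduces the batch solution of equations A.3 and A.4:

```python
import numpy as np

rng = np.random.default_rng(0)
nq = 3                                   # number of query points
Sqq = np.eye(nq) + 0.5                   # prior covariance Sigma^qq (PSD)
modules = []
for _ in range(4):                       # M = 4 modules
    A = rng.standard_normal((2, nq))     # plays the role of A_i
    B = rng.standard_normal((2, 2))
    cov = B @ B.T + 0.1 * np.eye(2)      # cov(g^m_i | f^q), kept PSD
    g = rng.standard_normal(2)           # measurements g^m_i
    modules.append((A, cov, g))

# Batch solution, equations A.3 and A.4 (prior mean 0)
C = np.linalg.inv(Sqq)
rhs = np.zeros(nq)
for A, cov, g in modules:
    P = A.T @ np.linalg.inv(cov)
    C = C + P @ A
    rhs = rhs + P @ g
E_batch = np.linalg.solve(C, rhs)

# Kalman-filter recursion, starting from S_0 = Sigma^qq, Ehat_0 = 0
E, S = np.zeros(nq), Sqq.copy()
for A, cov, g in modules:
    R = A @ S @ A.T + cov                # innovation covariance
    K = S @ A.T @ np.linalg.inv(R)       # Kalman gain
    E = E + K @ (g - A @ E)
    S = S - K @ R @ K.T
```

After all modules are processed, E equals the batch mean and S equals C⁻¹, the batch covariance.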
with Σ_0 = Σ^{qq}. In the iterations, no matrix of size N_Q needs to be inverted.

A.3 Independencies in the Inverse Covariance Matrix. Figure 5 (left) shows the absolute values of the covariance matrix for K = 300 training data points (indices 1–300) and N_Q = 100 query points (indices 301–400). The covariance matrix is calculated as

\[
\begin{pmatrix} \Sigma^{mm} + \Psi^{mm} & (\Sigma^{qm})' \\ \Sigma^{qm} & \Sigma^{qq} \end{pmatrix}.
\]
Figure 5: (Left) Absolute values of the covariance matrix. White pixels are entries larger than 0.8. (Right) Absolute values of the inverse covariance matrix. White pixels are larger than 1.
It is apparent that there are strong correlations in all dimensions. Figure 5 (right) shows the absolute values of the inverse covariance matrix. Here, white pixels indicate strong dependencies. Apparent are the strong dependencies among the query points. Also apparent is the fact that there are strong dependencies between the training data and the query data, but not between the training data (given the query data), confirming that the approximation in equation 2.1 is reasonable. Readers not familiar with the relationship between the inverse covariance matrix and independencies are referred to Whittaker (1990).

A.4 Artificial Data. Explicitly, the response is calculated as

\[
f(x) = \frac{\sum_{i=1}^{5} a_i \exp\!\big(-\frac{\| x - \mathrm{center}_i \|^2}{2 \sigma_a^2}\big)}{\sum_{i=1}^{5} \exp\!\big(-\frac{\| x - \mathrm{center}_i \|^2}{2 \sigma_a^2}\big)}.
\]
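This normalized-RBF response can be sketched as follows, using the constants given in the text; the function name `make_target` and its arguments are ours, not the paper's:

```python
import numpy as np

def make_target(dim, seed=0):
    """Normalized-RBF response of section A.4, with the constants from the
    text (sigma_a = 0.36, five fixed a_i, centers uniform in the unit
    hypercube).  Function/argument names are our own."""
    rng = np.random.default_rng(seed)
    a = np.array([1.16, 0.63, 0.08, 0.35, -0.70])
    sigma_a = 0.36
    centers = rng.uniform(0.0, 1.0, size=(5, dim))

    def f(x):
        d2 = ((np.asarray(x) - centers) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2.0 * sigma_a ** 2))
        return float((a * w).sum() / w.sum())

    return f
```

Since f(x) is a convex combination of the a_i, its value always lies between min_i a_i and max_i a_i.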
In most experiments, σ_a = 0.36, a = (1.16, 0.63, 0.08, 0.35, −0.70)', and center_i is generated randomly according to a uniform density in the Dim-dimensional unit hypercube.

Acknowledgments
Fruitful discussions with Harald Steck, Michael Haft, Reimar Hofmann, and Thomas Briegel and the comments of two anonymous reviewers are gratefully acknowledged.
References

Bernardo, J. M., & Smith, A. F. M. (1994). Bayesian theory. New York: Wiley.
Ferrari-Trecate, G., Williams, C. K. I., & Opper, M. (1999). Finite-dimensional approximation of gaussian processes. In M. S. Kearns, S. A. Solla, & D. A. Cohn (Eds.), Advances in neural information processing systems, 11 (pp. 218–224). Cambridge, MA: MIT Press.
Gibbs, M., & MacKay, D. J. C. (1997). Efficient implementation of gaussian processes. Available online at http://wol.ra.phy.cam.ac.uk/mackay/homepage.html.
Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural Computation, 10(6), 1455–1480.
Hastie, T., & Tibshirani, R. J. (1990). Generalized additive models. London: Chapman & Hall.
MacKay, D. J. C. (1992). Bayesian model comparison and backprop nets. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 839–846). San Mateo, CA: Morgan Kaufmann.
MacKay, D. J. C. (1998). Introduction to gaussian processes. In C. Bishop (Ed.), Neural networks and machine learning. Berlin: Springer-Verlag.
Moody, J. E. (1992). The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in neural information processing systems, 4 (pp. 847–854). San Mateo, CA: Morgan Kaufmann.
Moody, J. E., & Rögnvaldsson, T. S. (1997). Smoothing regularizers for projective basis function networks. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Neal, R. M. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Neal, R. M. (1997). Monte Carlo implementation of gaussian process models for Bayesian regression and classification (Tech. Rep. No. 9702). Toronto: Department of Statistics, University of Toronto.
Poggio, T., & Girosi, F. (1990). Networks for approximation and learning. Proceedings of the IEEE, 78, 1481–1497.
Poggio, T., & Girosi, F. (1998). A sparse representation for function approximation. Neural Computation, 10(6), 1445–1454.
Rasmussen, C. E. (1996). Evaluation of gaussian processes and other methods for nonlinear regression. Unpublished doctoral dissertation, University of Toronto. Available online at http://www.cs.utoronto.ca/~carl/.
Sen, A., & Srivastava, M. (1990). Regression analysis. New York: Springer-Verlag.
Skilling, J. (1993). Bayesian numerical analysis. In W. T. Grandy, Jr., & P. Milonni (Eds.), Physics and probability. Cambridge: Cambridge University Press.
Smola, A. J., Schölkopf, B., & Müller, K.-R. (1998). The connection between regularization operators and support vector kernels. Neural Networks, 11, 637–649.
Taniguchi, M., & Tresp, V. (1997). Averaging regularized estimators. Neural Computation, 9(5), 1163–1178.
Tresp, V., & Taniguchi, M. (1995). Combining estimators using non-constant weighting functions. In G. Tesauro, D. S. Touretzky, & T. K. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 419–426). Cambridge, MA: MIT Press.
Vapnik, V. N. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. J. Roy. Stat. Soc. Ser. B, 45(1), 133–150.
Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.
Whittaker, J. (1990). Graphical models in applied multivariate statistics. New York: Wiley.
Williams, C. K. I. (1997). Computing with infinite networks. In M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9. Cambridge, MA: MIT Press.
Williams, C. K. I. (1998a). Computing with infinite neural networks. Neural Computation, 10, 1203–1216.
Williams, C. K. I. (1998b). Prediction with gaussian processes: From linear regression to linear prediction and beyond. In M. I. Jordan (Ed.), Learning in graphical models (pp. 599–621). Boston: Kluwer.
Williams, C. K. I., & Rasmussen, C. E. (1996). Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 514–520). Cambridge, MA: MIT Press.
Zhu, H., & Rohwer, R. (1996). Bayesian regression filters and the issue of priors. Neural Comp. Appl., 4, 130–142.
Zhu, H., Williams, C. K. I., Rohwer, R., & Morciniec, M. (1998). Gaussian regression and optimal finite dimensional linear models. In C. M. Bishop (Ed.), Neural networks and machine learning (pp. 315–332). Berlin: Springer-Verlag.

Received August 24, 1998; accepted December 6, 1999.
Neural Computation, Volume 12, Number 12
CONTENTS Article
Evolution of Cooperative Problem Solving in an Artificial Economy Eric B. Baum and Igor Durdanovic
2743
View
Cortical Potential Distributions and Information Processing Henry C. Tuckwell
2777
Note
Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials Jonathan D. Victor
2797
Letters
Relating Macroscopic Measures of Brain Activity to Fast, Dynamic Neuronal Interactions D. Chawla, E. D. Lumer, and K. J. Friston
2805
Analysis of Pointing Errors Reveals Properties of Data Representations and Coordinate Transformations Within the Central Nervous System J. McIntyre, F. Stratta, J. Droulez, and F. Lacquaniti
2823
Modeling Selective Attention Using a Neuromorphic Analog VLSI Device Giacomo Indiveri
2857
Asymptotic Convergence Rate of the EM Algorithm for Gaussian Mixtures Jinwen Ma, Lei Xu, and Michael I. Jordan
2881
Incremental Active Learning for Optimal Generalization Masashi Sugiyama and Hidemitsu Ogawa
2909
A Quantitative Study of Fault Tolerance, Noise Immunity, and Generalization Ability of MLPs J. L. Bernier, J. Ortega, E. Ros, I. Rojas, and A. Prieto
2941
On the Computational Complexity of Binary and Analog Symmetric Hopfield Nets Jiří Šíma, Pekka Orponen, and Teemu Antti-Poika
2965
ARTICLE
Communicated by Terrence Sejnowski
Evolution of Cooperative Problem Solving in an Artificial Economy

Eric B. Baum
Igor Durdanovic
NEC Research Institute, Princeton, NJ 08540

We address the problem of reinforcement learning in ultracomplex environments, with huge state-spaces, where one must learn to exploit a compact structure of the problem domain. The approach we propose is to simulate the evolution of an artificial economy of computer programs. The economy is constructed based on two simple principles so as to assign credit to the individual programs for collaborating on problem solutions. We find empirically that, starting from programs that are random computer code, we can develop systems that solve hard problems. In particular, our economy learned to solve almost all random Blocks World problems with goal stacks that are 200 blocks high. Competing methods solve such problems only up to goal stacks of at most 8 blocks. Our economy has also learned to unscramble about half a randomly scrambled Rubik's cube and to solve several commercially sold puzzles.

1 Introduction
Reinforcement learning (cf. Sutton & Barto, 1998) formalizes the problem of learning from interaction with an environment. The two standard approaches are value iteration and policy iteration. At the current state of the art, however, neither appears to offer much hope for addressing ultracomplex problems. Standard approaches to value iteration depend on enumerating the state-space and are thus problematic when the state-space is huge. Moreover, they depend on finding an evaluation function showing progress after a single action, and such a function may be extremely hard to learn or even to represent. We believe that what is necessary in hard domains is to learn and exploit the compact structure of the state-space, for which powerful methods are yet to be discovered. Policy iteration, on the other hand, suffers from the fact that the space of policies is also enormous and has a bumpy fitness landscape. Standard methods for policy iteration, among which we would include the various kinds of evolutionary and genetic programming, thus grind to a halt on most complex problems. The approach we propose here is to evolve an artificial economy of modules, constructed so as to assign credit to the individual modules for their contribution. Evolution can then progress by finding effective modules.

Neural Computation 12, 2743–2775 (2000) © 2000 Massachusetts Institute of Technology
This divide-and-conquer approach greatly expedites the search of program (a.k.a. policy) space and can result in a compact solution exploiting the structure of the problem. Holland's (1986) seminal classifier systems previously pursued a similar approach, using an economic model to assign credit to modules. Unfortunately, classifier systems have never succeeded in their goal of dynamically chaining modules together to solve interesting problems (Wilson & Goldberg, 1989). We believe we have understood the problems with Holland's proposal, and other multiagent proposals, and how to correct them in terms of simple but fundamental principles. Our picture applies widely to the evolution of multiagent systems, including ecologies and economies. We discuss here evidence for our picture, evolving artificial economies in three different representations and for three problems that dynamically chain huge sequences of learned modules to solve problems too vast for competing methods to address. Control experiments show that if our system is modified to violate our simple principles, even in ways suggested as desirable in the classifier system literature, performance immediately breaks, as we would predict.

We present results on three problems: Blocks World, Rubik's cube, and a commercially sold set of puzzles called "Rush Hour." Blocks World has a lengthy history as a benchmark problem, which we survey in section 3. Methods like TD learning using a neural net evaluation, genetic programming, inductive logic programming, and others are able to solve problems directly comparable to the ones we experiment with only when they are very small and contain a handful of blocks—that is, before the exponential growth of the state-space bites in. By contrast, we report the solution of problems with hundreds of blocks, and state-spaces containing on the order of 10^100 states, with a single reward state.
On Rubik’s cube we have been able to evolve an economy able to unscramble about half of a randomly scrambled cube, comparable to the performance of many humans. On Rush Hour, our approach has solved 15 different commercially sold puzzles, including one dubbed “advanced” by the manufacturer. Genetic programming (GP) was unable to make headway on either Rubik or Rush Hour in our experiments, and we know of no other learning approach capable of interesting results here. Note that we are attempting to learn a program capable of solving a class of problems. For example, we are attempting to learn a program capable of unscrambling a randomly scrambled Rubik’s cube. We conjecture that for truly complex problems, learning to solve a class is often the most effective way to solve an individual problem. No one can learn to unscramble a Rubik’s cube, for example, without learning how to unscramble any Rubik’s cube. And the planning approach to Blocks World has attempted a huge search to solve individual Blocks World problems, but what is needed is to learn to exploit the structure of Blocks World. In our comparison experiments, we attempted to use alternative methods such as GP to learn to solve
a class of problems, but we failed. We train by presenting small instances, and our method learns to exploit the underlying compact structure of the problem domain. We conjecture that learning to exploit underlying compact structure is closely related to "understanding," a human ability little explained.

Section 2 describes our artificial economy and the principles that it embodies. Section 3 describes the Blocks World problem and surveys alternative approaches. Section 4 describes the syntax of our programs. Section 5 describes our results on Blocks World. Section 6 describes a more complex economic model involving metalearning. Section 7 describes our results with Rubik's cube. Section 8 describes our results with the Rush Hour problem. Section 9 sums up what we have learned from these experiments and suggests avenues for further exploration. Appendix A discusses results on Blocks World with alternative methods such as TD learning and genetic programming. Appendix B gives pseudocode. Appendix C discusses parameter settings.

2 The Artificial Economy
This section describes and motivates our artificial economy, which we call Hayek. Hayek interacts with a world that it may sense, where it may take actions, and that makes a payoff when it is put in an appropriate state. Blocks World (see Figure 1) is an example world. Hayek is a collection of modules, each consisting of a computer program with an associated numeric "wealth." We call the modules agents¹ because we allow them to sense features of the world, compute, and take actions on the world.

The system acts in a series of auctions. In each auction, each agent simulates the execution of its program on the current world and returns a nonnegative number. This number can be thought of as the agent's estimate of the value of the state its execution would reach. The agent bids an amount equal to the minimum of its wealth and the returned number. The agent with the highest bid wins the auction. It pays this bid to the winner of the previous auction, executes its actions on the current world, and collects any reward paid by the world, as well as the winning bid in the subsequent auction. Evolutionary pressure pushes agents to reach highly valued states and to bid accurately, lest they be outbid.

In each auction, each agent that has wealth more than a fixed sum 10·W_init creates a new agent that is a mutation of itself. Like an investor, the creator endows its child with initial wealth W_init and takes a share of the child's profit. Typically we run with each agent paying one-tenth of its profit plus a small constant sum to its creator, but performance is little affected if this
¹ This is in the spirit of the currently best-selling AI textbook (Russell & Norvig, 1995), which defines an agent as "just something that perceives and acts" (p. 7).
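The auction chain just described can be sketched in a few lines. This is a deliberately stripped-down toy of ours, not the authors' system: agent creation, removal, mutation, and the resource tax are omitted, and the constant per-round `world_reward` and the `escrow` bookkeeping for the very first bid are our own simplifications.

```python
class Agent:
    def __init__(self, wealth, estimate):
        self.wealth = wealth
        self.estimate = estimate   # stands in for the value returned by
                                   # simulating the agent's program

    def bid(self):
        # an agent bids the minimum of its wealth and its value estimate
        return min(self.wealth, self.estimate)

def run_auctions(agents, rounds, world_reward=0.0):
    """Simplified Hayek auction chain: the highest bidder pays its bid to
    the winner of the previous auction and collects any world reward.
    The first bid has no previous winner, so it is parked in `escrow`."""
    prev_winner, escrow = None, 0.0
    for _ in range(rounds):
        winner = max(agents, key=lambda a: a.bid())
        bid = winner.bid()
        winner.wealth -= bid
        if prev_winner is None:
            escrow += bid                 # no previous owner to pay
        else:
            prev_winner.wealth += bid     # conserved interagent transfer
        winner.wealth += world_reward     # money enters only from the world
        prev_winner = winner
    return escrow
```

Because interagent payments cancel, total agent wealth changes only by what the world pays in and what the first bid pays out.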
share is anywhere between 0 and .25. Each agent also pays a resource tax proportional to the number of instructions it executes. This forces agent evolution to be sensitive to computational cost. Agents are removed if, and only if, their wealth falls below their initial capital, with any remaining wealth returned to their creator. Thus, the number of agents in the system varies, with agents remaining as long as they have been profitable.

This structure of payments and capital allocations is based on simple principles (Baum, 1998). The system is set up so that everything is owned by some agent, interagent transactions are voluntary, and money is conserved in interagent transactions (i.e., what one pays, another receives) (Miller & Drexler, 1988). Under those conditions, if the agents are rational in that they choose to make only profitable transactions, a new agent can earn money only by increasing total payment to the system from the world. But irrational agents are exploited and go broke. In the limit, the only agents that survive are those that collaborate with others to extract money from the world. Hayek guarantees that everything is owned by auctioning the whole world to one agent, which then can alone act on the world, sell it, and receive reward from it. Agent interactions are voluntary in that they respect property rights. For example, the agent owning the world can refuse to sell by outbidding other agents, which costs her nothing since she pays herself. The guide for all ad hoc choices (e.g., the one-tenth profit fraction) was that the property holder might reasonably make a similar choice if given the option. In principle, all such constants could be learned.

By contrast, such property rights are not enforced in real ecologies (Miller & Drexler, 1988), in simulated ecologies like Tierra (Ray, 1991) and Avida (Lenski, Ofria, Collier, & Adami, 1999), or in other multiagent learning systems such as Lenat's Eurisko (Lenat, 1983; Miller & Drexler, 1988) or Holland's classifier systems (Holland, 1986). When not everything is owned, or money is not conserved, or property rights are not enforced, agents can earn money while harming the system, even if other agents maximize their profits. The overall problem, then, cannot be factored, because a local optimum of the system will not be a local optimum of the individual agents. For example, because in Holland classifiers many agents are active simultaneously, there is no clear title to reward, which is usually shared by active agents. A tragedy of the commons (Hardin, 1968) can then ensue, in which any agent can profit by being active when reward is paid, even if its action harms system performance (Baum, 1996). Such systems evolve complex behavior but not accurate credit assignment (Miller & Drexler, 1988; Baum, 1998).

We have done runs with conservation of money broken in various ways. The system learns to exploit any way of "creating money" without solving the hard problems posed by the world (cf. Baum, 1996). Conversely, if money leaks out of the system too fast (the resource tax is a small leak, but must be kept very small, at less than 10^-5 per instruction), the economy collapses to essentially random agents. We have also experimented with violations of property rights. For example, we did runs identical to Hayek in every
Figure 1: A Blocks World instance with four blocks and three colors. (a) Initial state. (b) The position just before solution. Hatching represents color. When the hand drops the white block on stack 1, the instance will be solved, and Hayek will see another random instance.
way, except using the procedure advocated in zeroth-level classifier systems (Wilson, 1994) of choosing the winning bidder with probability proportional to its bid. These runs never solved more than six-block problems, even if given NumCorrect, or made any progress on Rubik's cube.

3 Blocks World
We have trained Hayek by presenting Blocks World (BW) problems of gradually increasing size (see Figure 1). Each problem contains 4 stacks of colored blocks, with 2n total blocks and k colors. The leftmost stack, stack 0, serves as a template only and is of height n. The other three stacks contain, between them, the same multiset of colored blocks as stack 0. The learner can pick up the top block on any stack but stack 0 and place the block on top of any stack but 0. The learner takes actions until it asserts "Done" or exceeds 10n log₃(k) actions. If the learner copies stack 0 to stack 1 and states Done, it receives a reward of n. If it uses 10n log₃(k) actions or states Done without copying stack 0, it terminates activity with no reward. Note that the goal is to discover an algorithm capable of solving random new instances. The best human-generated algorithm of which we are aware (a recursive algorithm similar to that solving the Tower of Hanoi problem, which we developed in conversation with Manfred Warmuth) is capable of solving arbitrary BW problems in 4n log₂(k) grabs and drops.

Blocks World has been well studied (cf. Korf, 1987; Whitehead & Ballard, 1991; Koza, 1992; Koehler, 1998; Bacchus & Kabanza, 1996; Dzeroski, Blockeel, & De Raedt, 1998). Although it is easy for humans to solve or program, its exponential-size state-space and the complexity of code that would
solve it have stymied fully autonomous methods. One line of research has involved planning programs, which are told the goal and do an extensive search to attempt to solve an individual instance. Until about 1998, state-of-the-art planning programs such as Prodigy 4.0 could solve only problems involving about five blocks before succumbing to the combinatorial explosion (Bacchus & Kabanza, 1996). More recently, it has been discovered that the Blum-Furst graphplan heuristic (Blum & Furst, 1997) can be effectively used to constrain the search, postponing, although not preventing, the combinatorial explosion. The best results are by Koehler (1998), who solved a related 80-block stacking problem in 5 minutes and a 100-block problem in 11 minutes. (Note the exponential increase.) Koehler's problems are simpler than those discussed here in that all blocks start on the table, with no block on top of them, avoiding any necessity for the long, Tower-of-Hanoi-like chains of stacks and unstacks required by a three-block-wide table.

No standard method appears able to address the BW reinforcement learning problem. Previous attempts to reinforcement-learn in BW (e.g., Whitehead & Ballard, 1991; Koza, 1992; Birk & Paul, 1994; Dzeroski et al., 1998) addressed only much simpler versions not involving abstract goal discovery (they attempted to stack a fixed order of colors rather than match an arbitrary target stack); nor did they have a fixed small table size; nor did they involve goal stacks of more than a handful of blocks. Q-learning (Watkins, 1989) is inapplicable to this problem because there are combinatorial numbers of states; there are almost 10^100 distinct possible 200-block goal stacks, and for each of these, a search space of some 10^100 states, among which there is a single goal state. The standard approach to dealing with large state-spaces is to train a neural net evaluator with TD(λ) learning (cf. Sutton & Barto, 1998).
This has been effective in some domains, such as backgammon (Tesauro, 1995). However, the success at backgammon was attributed to the fact that a linear function is a good evaluator (Tesauro, 1995). TD(λ) using a neural net evaluation function is not suitable here because the goal is so nonlinear and abstract, and the size of the problem description is variable.² We nonetheless attempted to train a neural net.³ This succeeded in solving four-block problems from a raw representation or eight-block problems using a powerful hand-coded feature, NumCorrect. NumCorrect returns the current number of correct blocks on stack 1, the largest integer l such that the bottom l blocks of stacks 0 and 1 agree in color. A recent attempt at inductive logic programming solved only two-block problems (Dzeroski et al., 1998). We have experimented extensively with genetic programming (Baum & Durdanovic, 1999) and also

² For a discussion of various other difficulties, see Whitehead & Ballard (1991).
³ We report in this paragraph results on a substantially simpler version of the problem where, rather than demanding that the learner say "done" when the stack is correct, we externally supplied the "done" statement. Run on the full problem, these standard methods did worse.
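A minimal encoding of the task and the NumCorrect feature may make the setup concrete. This is our own sketch, not the authors' code; stacks are represented as bottom-first lists of colors:

```python
def num_correct(stacks):
    """NumCorrect: the largest l such that the bottom l blocks of stack 0
    (the template) and stack 1 agree in color.
    (Our own minimal encoding of the task, not the authors' code.)"""
    l = 0
    for a, b in zip(stacks[0], stacks[1]):
        if a != b:
            break
        l += 1
    return l

def grab_drop(stacks, src, dst):
    """One grab plus one drop: move the top block of stack src onto stack
    dst.  Stack 0 is the template and may not be touched."""
    assert src != 0 and dst != 0 and stacks[src]
    stacks[dst].append(stacks[src].pop())
```

For example, with template ['w','r','g'] and stack 1 = ['w','r','b'], NumCorrect is 2; clearing the wrong 'b' and dropping a 'g' on stack 1 raises it to 3.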
Figure 2: Data from Hayek3 runs with NumCorrect. The horizontal axis is in millions of instances presented. The right vertical axis is in units of agents. The left vertical axis is in units of numbers of blocks or, equivalently, reward. (A) The moving average over the last 100 instances of winning first and last bid and payoff from the world. (B) The moving average (solid line) of the score, computed as \(\sqrt{2 \sum_{i=1}^{200} p(i)\, i}\) for p(i) = fraction of instances of size i solved. If all instances of every size up to S were solved and no instances above size S were solved, the score would be about S. The score is computed only over instances with no new agents introduced. This approximates the average size of instance being solved if agent creation is suspended. The moving average of payoff is lower than the score because new agents frequently lead to failures. A system performing at the level of the score can, however, be achieved at any time by suspending creation. The dashed line shows the moving average of the number of agents in the population (measured against the right-hand axis). Effective solving coincides with a split of first and last bids, when the system learns to estimate the value of states.
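Reading the caption's score as sqrt(2 Σ_{i=1}^{200} p(i)·i), a quick check (our own code, not the authors') that solving everything up to size S indeed gives a score of about S:

```python
import math

def score(p):
    """Score as in the Figure 2 caption: sqrt(2 * sum_i p(i) * i), where p
    maps instance size i to the fraction of such instances solved."""
    return math.sqrt(2.0 * sum(frac * size for size, frac in p.items()))
```

With p(i) = 1 for i ≤ S and 0 above, the sum is S(S+1)/2, so the score is sqrt(S(S+1)) ≈ S.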
some less standard approaches: a hill-climbing approach on computer programs, previous economic models, and an attempt to induce an evaluation function using S-expressions. These approaches all solved at most four- or five-block problems reliably.⁴ (See appendix A for a fuller discussion of these results.)

4 Representation Language
This section discusses the syntax of the agent programs. The programs of the agents discussed in this article are typed S-expressions (Montana, 1994), recursively defined as a symbolic expression, as in Lisp, consisting of either a symbol or a list structure whose components are S-expressions. S-expressions are isomorphic to parse trees. A simple example is shown in Figure 6a. All our expressions are typed, taking either integer, void, color, or boolean values. All operations respect types so that colors are never compared to

⁴ A simpler economic model learned to solve arbitrary BW problems, but only if provided intermediate reward in training whenever an action was taken partially advancing solution (Baum, 1996).
Figure 3: Example of solution of an eight-block instance.
integers, for example. The use of typing semantically constrains mutations and thus improves the likelihood of randomly generating meaningful and useful expressions. Our S-expressions are built out of constants, arithmetic functions, conditional tests, loop controls, and four interface functions: Look(i,j), which returns the color of the block at location i,j; Grab(i) and Drop(i), which act on stack i; and Done, which ends the instance. Some experiments also contain the function NumCorrect, and some contain a random node R(i,j), which simply executes branch i with probability 1/2 and j with probability 1/2.

Figure 4: Data from Hayek3 runs without NumCorrect or R nodes. For further information, see the caption of Figure 2, which presents identically formatted data for a different run.

The system starts with a single, special, hand-coded agent, called Seed, with zero wealth. Seed does not bid, but simply creates children as random S-expressions in the same way that expressions are initiated in genetic programming (Koza, 1992; Montana, 1994), choosing each instruction in the
S-expression at random from the instruction set, with increasing probability of choosing a terminal expression so the tree terminates. W_init is initially set to 0, so all agents can create freely. Eventually one of Seed's descendants earns reward. Thereafter, W_init is set equal to the size of the largest reward earned, and agents can no longer create unless they earn money.

We report elsewhere experiments using a similar economic model, but with three other representation languages. Hayek1 (Baum, 1996) used simple productions of a certain form, in some ways similar to the language used by classifier systems (Holland, 1986). In Hayek, and also in classifier systems, a configuration is stable only if agents are on average followed by agents of higher bid, else agents will go broke. Conversely, a human writing programs in such a language will naturally use higher bids to prioritize agents, which may result in high-bidding agents preceding lower-bidding ones. This means that many programs that a human could write in a given language may not be dynamically stable. Although the classifier system language has been proved universal (Forrest, 1985), it is not clear whether it remains universal when restricted to stable configurations. Baum (1996) and Lettau and Uhlig (1999) independently pointed this problem out. Similarly, because of this problem, Hayek1 was able to solve large BW problems only when given an intermediate reward for partial progress. Hayek2 used an assembler-like language inspired by Tierra (Ray, 1991). This proved both ineffective and incomprehensible to humans. We call the system experimented with here, using typed S-expressions, Hayek3. Hayek4, reported on in a companion paper, uses postproduction systems. This language is Turing complete, and Hayek4 succeeds in learning a program capable of solving arbitrary BW problems (of our type). By comparison, we do not believe the S-expression language used in Hayek3 is Turing complete, and so we will see that Hayek3 succeeds in building only systems capable of solving large, but finite, BW problems.

5 Blocks World Behaviors
We now report on experiments running the Hayek3 system. We trained Hayek on the following distribution of problems: one-block problems if there was no money in the system (e.g., until the first instance is solved); otherwise, we presented problems with size uniformly distributed between 1 and 2 + m + m/5, where m was set as follows. We maintained a table S_p(i) of the fraction of the last 100 presented instances of size i that were solved. m was increased by 1 whenever S_p(m) > 75% and S_p(m+1) > 30%. m was decreased by 1 whenever S_p(m) < 25% and S_p(m−1) < 70%.
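This curriculum rule can be sketched directly (our own encoding, not the authors' code; `S_p` is passed as a dict from instance size to solved fraction):

```python
def update_m(m, S_p):
    """Curriculum rule from section 5.  S_p[i] is the fraction of the last
    100 presented instances of size i that were solved.  Missing entries
    default so that neither rule fires spuriously."""
    if S_p.get(m, 0.0) > 0.75 and S_p.get(m + 1, 0.0) > 0.30:
        return m + 1
    if S_p.get(m, 1.0) < 0.25 and S_p.get(m - 1, 1.0) < 0.70:
        return m - 1
    return m
```

The asymmetric thresholds mean m moves up only when the current level is well mastered and moves down only when it is clearly failing, which is what prevents rapid oscillation between neighboring values.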
This distribution was chosen ad hoc, with a small amount of experimentation, to present larger instances smoothly as Hayek learned. The reason that we used different conditions for increasing and decreasing m was to prevent rapid oscillation between two neighboring values.

We first describe experiments with NumCorrect in the language. Within a few hundred thousand training instances, Hayek3 evolves a set of agents that cooperate to solve 100-block goals virtually 100% of the time and problems involving much higher goal stacks over 90% of the time (see Figure 2). Our best run to date solved almost 90% of 200-block goals, the largest stacks among our tests, in a problem with k = 3 colors. Such problems have about 10^100 states. Hundreds of agents act in sequence, each executing tens of actions, with a different sequence depending on the instance. The system is stable even though reward comes only at the end of such long sequences. The learning runs lasted about a week on a 450 MHz Pentium II processor running Linux, but the trained system solved new 200-block problems in a few seconds.

Empirically we find Hayek3 evolves a system with the following strategy. The population contains 1000 or more agents, each of which bids according to a complex S-expression that can be understood, using Maple, to be effectively equal to A·NumCorrect + B, where A and B are complex S-expressions that vary across agents but evaluate, approximately, to constants. The agents come in three recognizable types. A few, which we call "cleaners," unstack several blocks from stack 1, stacking them elsewhere, and have a positive constant B. The vast majority (about 1000), which we call "stackers," have similar positive A values to each other, small or negative B, shuffle blocks around on stacks 2 and 3, and stack several blocks on stack 1. "Closers" bid similarly to stackers but with a slightly more positive B, and say Done.

At the beginning of each instance, blocks are stacked randomly.
2754
Eric B. Baum and Igor Durdanovic

Thus, stack 1 contains about n/3 blocks, and one of its lower blocks is incorrect. All agents bid low since NumCorrect is small, and a cleaner, whose B is positive, thus wins the auction and clears some blocks. This repeats for several auctions until the incorrect blocks are cleared. Then a stacker typically wins the next auction. Since there are hundreds of stackers, each exploring a different stacking, usually at least one succeeds in adding correct blocks. Since bids are proportional to NumCorrect, the stacker that most increases NumCorrect wins the auction. This repeats until all blocks are correctly stacked on stack 1. Then a closer wins, either because of its higher B or because all other agents act to decrease the number of blocks on stack 1 and thereby reduce NumCorrect. The instance ends successfully when this closer says Done. A schematic of this procedure is shown in Figure 3. This evolved strategy does not solve all instances. It can fail, for example, when the next block to place is buried under 5 or 10 other blocks and no agent can find a way to make progress. The next block needed is more likely to be buried deep for higher numbers of colors, so runs with k = 10 or k = 20 colors evolve more agents, up to 3000 in some experiments, and hence run slower than runs with k = 3. Nonetheless, they follow a similar learning curve and have learned to solve BW instances with up to 50 blocks consistently.

Classical planning programs search a large space of possible actions. By contrast, Hayek3 learns to search only a few hundred macroactions and learns to break BW into a sequence of subgoals, long an open problem in the planning community (Korf, 1987). Note that the macroactions are tuned to the problem domain. For example, the macroaction of the cleaner involves unstacking from 1 and stacking elsewhere, and the macroaction of each stacker involves unstacking some block (or blocks) and placing it (or them) on stack 1. Standard RL approaches, as well as classifier systems, try to recognize progress after a single action. Such an evaluation requires not only understanding NumCorrect, but also concepts such as "number of blocks on top of the last correct block," "next block needed," and "number of blocks on top of the next block needed." It is too complicated to hope to learn in one piece. Hayek3 instead succeeds by simultaneously learning macroactions and evaluation. Its macroactions make sufficient progress that progress can be evaluated.

We also ran Hayek3 without the function NumCorrect. NumCorrect can still be computed as For(And(EQ(Look(0,h), Look(1,h)), Not(EQ(Look(0,h), Empty)))). Hayek3 learned to approximate NumCorrect as For((EQ(Look(0,h), Look(1,h)))) and, employing a similar strategy as above, solved problems involving goal stacks of about 50 blocks (see Figure 4). To improve For((EQ(Look(0,h), Look(1,h)))) to the full NumCorrect, Hayek3 has to make a large jump at once. We have discovered that adding a node R(a,b) to the language greatly improves the evolvability.
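A minimal sketch of the semantics that NumCorrect (and the For expression above) encodes: count blocks on stack 1 that match the goal, reading from the bottom up and stopping at the first mismatch. The list representation of stacks is our own assumption, not Hayek3's state encoding.

```python
def num_correct(stack1, goal):
    """Count correctly placed blocks on stack 1: matches against the goal
    stack, bottom to top, stopping at the first mismatch (sketch only)."""
    n = 0
    for placed, wanted in zip(stack1, goal):
        if placed != wanted:
            break
        n += 1
    return n

assert num_correct(["r", "g", "b"], ["r", "g", "b"]) == 3
assert num_correct(["r", "b", "b"], ["r", "g", "b"]) == 1  # stops at first mismatch
assert num_correct([], ["r"]) == 0
```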
R(a,b) simply returns the subexpression a with probability 1/2 and the subexpression b with probability 1/2. In addition, we add mutations changing R(a,b) to R(a,a) or R(b,b). The profitability of an S-expression containing an R node interpolates between that of the two S-expressions with the R node simply replaced by each of its two arguments. This seems to smooth the fitness landscape. If one of the two alternatives evolves to a useful expression, the mutation allows it to be selected permanently. With R added to the language, Hayek3 consistently succeeds in discovering the exact expression for NumCorrect (see Figure 5). The runs then follow a strategy identical to that with NumCorrect included as a primitive. Unfortunately, the runs are slower by a factor of perhaps 100 because the single built-in instruction NumCorrect becomes a complex For loop with an execution cost in the hundreds of slow instructions. Accordingly, after a week of computation, the system has learned to solve only 20-block problems consistently. Nevertheless, this system stably follows a learning curve similar to that with a primitive NumCorrect supplied, apparently learning only a constant factor slower, which could be made up with a faster computer. By contrast, standard strategies typically incur an exponentially increasing cost for solving larger problems. The 50- and 20-block problems solved by Hayek3 without NumCorrect are each several times bigger than those solved by competing methods with NumCorrect.

Figure 5: Data from Hayek3 runs without NumCorrect or R nodes. For further information, see the caption of Figure 2, which presents identically formatted data for a different run.

6 Metalearning
This section describes experiments with a more complex approach in which new agents are created by existing agents rather than as simple mutations of existing agents. The point of this is metalearning. By assigning credit to agents that create other good agents, we hope to learn how to create good agents and thus expedite the learning process greatly.

In this scheme there are two kinds of agents: solvers and creation agents. An agent is a creation agent if the instruction at the root of its S-expression is a modify or a create; otherwise it is a solver. Solvers behave like the agents we discussed in previous sections. Creation agents do not bid. Instead, in each auction, all the creators that have wealth more than Winit (a fixed sum) are allowed to create. The creator endows its new child with an initial wealth Winit. Creators are compensated for creating profitable solvers, profitable creators, or agents modified into profitable agents, as follows. In each instance, all agents pay one-tenth their profit (if any) plus a small constant sum toward this compensation. For agents created de novo, this payment goes to their creator (which then recursively passes one-tenth on to its creator). For agents created by a "modify" command, the payoff is split equally between their creator and the holders of intellectual property used to create them. Intellectual property rights in an agent A are deemed shared equally by its creator and, recursively, the holder of intellectual property rights in the agent (if any) modified to create A. Creators that profit create more agents. Creators with no surviving sons and wealth less than Winit are removed. Creators are thus removed when their wealth, plus that of all surviving descendants (which they are viewed as "owning"), is less than Winit, so that creators, like solvers, are removed whenever their net impact has been negative. Thus, there is evolutionary pressure toward creators that are effective at creating profitable progeny.
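The de novo compensation rule, one-tenth of profit plus a small constant routed up the creation chain with each creator in turn passing one-tenth onward, can be sketched as follows. The class, the fee constant, and the treatment of the top of the chain are our assumptions, and the intellectual-property split for modified agents is omitted.

```python
class EconAgent:
    """Illustrative agent with wealth and an optional creator (sketch only)."""
    def __init__(self, name, creator=None):
        self.name, self.creator, self.wealth = name, creator, 0.0

def pay_creators(agent, profit, fee=0.01, share=0.1):
    """Route share * profit + fee up the creation chain.  Each creator keeps
    nine-tenths of what it receives and passes one-tenth onward, conserving
    money; how the top of the chain is settled is our assumption."""
    payment = share * max(profit, 0.0) + fee
    parent = agent.creator
    while parent is not None:
        if parent.creator is None:
            parent.wealth += payment       # top of the chain keeps the remainder
            break
        passed = share * payment
        parent.wealth += payment - passed  # keep nine-tenths
        payment = passed
        parent = parent.creator

seed = EconAgent("Seed")
creator = EconAgent("creator", creator=seed)
solver = EconAgent("solver", creator=creator)
pay_creators(solver, profit=10.0)  # solver pays 0.1 * 10 + 0.01 = 1.01 upward
```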
The system is initiated with a single creation agent, called Seed, with zero capital. Seed is written (by us) to create children that consist of random code. These agents can create other agents if their code so indicates. As in the simpler scheme described in section 2, Winit is initiated as zero but raised when an agent first earns money from the world. Creators endow their children with Winit capital. In addition to the instruction set described in section 4, creators employ one "wildcard" * of each type; four binding symbols $1, $2, $3, $4 of each type; and two functions, create and modify. Wildcards are treated as symbols of appropriate type when they appear in creators. When a wildcard (or an unbound binding symbol) appears in an S-expression of a solver, however, it is expanded into a random subtree.
This is done similarly to the general approach in strongly typed genetic programming (Montana, 1994), by growing from the top down. At each node, one randomly chooses an expression of the correct type. Unless this expression is a constant (i.e., takes zero arguments), we must iterate to the next level, choosing random expressions for the children (subexpressions). At each step, we multiplicatively decrease the probability of generating a nonterminal symbol so that the expansion terminates with a small tree.5

Create takes one argument. Create creates a new agent whose expression is simply identical to its argument. The modify instruction has two arguments. Modify chooses a random agent in the population6 and attempts to match the expression in its first argument with a subexpression of either the randomly chosen agent's void or integer expressions. If it fails to find a match, no agent is created. If it succeeds, it creates a new agent identical to the matched one except that modify's second argument is substituted in place of the matched expression. (See Figure 6.) A match must be exact, except that the arguments of the modify expression may contain binding symbols and wildcards. Binding symbols and wildcards can match arbitrary expressions of the appropriate type. The difference between them is that binding symbols in the first argument are bound when it matches, and if the same binding symbol also occurs in the second argument, the binding symbol is replaced by the expression it was bound to. Wildcards match independently and do not bind. Every tenth new agent, however created, is further mutated by having a randomly chosen node or nodes replaced by random trees. This helps to ensure the system will not get stuck in a configuration from which it cannot further evolve. Note that a modify operation may use code from a solver in the population to create another creator. This creator may then use code from a second solver in creating another solver.
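The matching and substitution performed by modify, exact match up to binding symbols and wildcards, can be sketched over tuple-encoded S-expressions. The tuple encoding is our own; the example mirrors the match described in the caption of Figure 6. Expansion of wildcards into random subtrees is not modeled.

```python
def match(pattern, expr, bindings):
    """Match `pattern` against `expr`.  '$1'..'$4' bind (and must bind
    consistently across repeats); '*' matches anything without binding."""
    if pattern == "*":
        return True
    if isinstance(pattern, str) and pattern.startswith("$"):
        if pattern in bindings:
            return bindings[pattern] == expr
        bindings[pattern] = expr
        return True
    if isinstance(pattern, tuple) and isinstance(expr, tuple):
        return (len(pattern) == len(expr)
                and all(match(p, e, bindings) for p, e in zip(pattern, expr)))
    return pattern == expr

def substitute(template, bindings):
    """Replace bound symbols in `template` by the expressions they matched."""
    if isinstance(template, tuple):
        return tuple(substitute(t, bindings) for t in template)
    return bindings.get(template, template)

# The 'and' expression of Figure 6a, rewritten via the pattern of Figure 6b.
expr = ("and", ("eq", ("look", 0, 0), ("look", 1, 0)),
               ("eq", "E", ("look", 3, 0)))
bindings = {}
assert match(("and", "$1", "$2"), expr, bindings)
assert substitute(("or", "$1", "$2"), bindings)[0] == "or"
```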
In this way, it is possible for code fragments from different agents to be combined. When we ran on BW with creation agents, the system learned a collection of solvers employing a similar strategy to that described in section 5. The behavior of creation agents was less clear. Some evolved to modify sequences of grabs and drops in ways that create useful stackers or cleaners. Others were not transparent to us.

We ran control experiments to discover whether we were in fact able to learn effective ways to create. We first compared a run with the creation agents to a control in which creation agents attempted to create, but we discarded the agent they created in favor of a mutated agent. Hayek3 with creation agents performed 30% to 50% better after a week of execution than this control. This indicates that the creation agents were learning useful techniques. However, the pattern matching in the creation agents was expensive in time, and the system with creation agents performed no better than the less complicated system reported in previous sections. The conclusion of our experience with metalearning is thus mixed. With our current techniques, metalearning is probably not a practical alternative. It remains plausible that with a more powerful creation language and for yet more complex problem domains, metalearning may reemerge as a useful technique.

Figure 6: (a) An integer expression. This is equivalent to if[(look(0,0)=look(1,0)) and (look(3,0)=E)] then 0 else −3. It returns 0 if the bottom block of the 0 stack is the same color as the bottom block of the 1 stack and there are no blocks on the 3 stack; else it returns −3. (b) The void expression of a creator, with modify at its root. This can match with the expression of (a) by matching the and expressions, binding $1 to eq(look(0,0),look(1,0)) and binding $2 to eq(E,look(3,0)). Substituting the second argument of the modify (the subtree rooted by or) results in (c). Finally, the wildcard * is expanded into a random tree, resulting in the expression of (d).

5 Let P_expand(d) be the probability of choosing a nonterminal at depth d. We set P_expand(d+1) = P_expand(d) · C. We imposed the limit conditions P_expand(0) = initial_expansion and P_expand(average_depth) = 0.5. We chose (ad hoc, without experimentation) an average depth of 3 and an initial expansion of 0.9. The constant C is then computed as C = exp((log(0.5) − log(initial_expansion)) / average_depth).

6 Note that there is no "selection" operator per se. Modify chooses a random agent in the population. Selection occurs because unprofitable agents are removed from the population. Modifiers can only sense which agent they modify in that they must match a pattern. We have not yet explored allowing modifiers other sensations and options, such as choosing to modify a wealthy agent or even one recently active.
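Footnote 5's termination recipe can be checked numerically; the function names below are ours.

```python
import math

def expansion_schedule(initial_expansion=0.9, average_depth=3):
    """Per-level decay C chosen so that P_expand(0) = initial_expansion,
    P_expand(average_depth) = 0.5, and P_expand(d+1) = P_expand(d) * C."""
    c = math.exp((math.log(0.5) - math.log(initial_expansion)) / average_depth)
    return lambda d: initial_expansion * c ** d

p_expand = expansion_schedule()
assert abs(p_expand(0) - 0.9) < 1e-9
assert abs(p_expand(3) - 0.5) < 1e-9  # reaches 0.5 at the chosen average depth
```

Since the expansion probability decays geometrically with depth, a randomly grown tree terminates with a small expected size.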
7 Rubik's Cube
We tried to learn a Hayek capable of unscrambling a randomly scrambled Rubik's cube. This section will assume some familiarity with Rubik's cube. For the uninitiated, a "cubie" is one of the 26 physical little cubes out of which a Rubik's cube is constructed. We call one face of a cubie a "square." Rubik's cube is a complex problem. We experimented with a variety of instruction sets, including several versions of NumCorrect functions, several presentation schemes of increasingly harder instances, and several reward schemes. All of the versions we experimented with used creation and modify operators, as discussed in the previous section. We used instruction sets with several types. For example, one set of runs used six types: boolean, integer, coordinate, face, square, and void (i.e., actions). To describe a particular square, you could write s(i, x, y), where i is a coordinate of type "face" specifying a face of the cube (e.g., the face with the red center), and x and y are coordinates taking values 0, 1, or 2, specifying the location on the face. Having specified two squares s1 and s2, it is possible to compare their colors with the boolean-valued function EQCs(s1, s2), which returns true if and only if they have the same color. We supplied an instruction Sum taking a boolean argument, which summed any constructed boolean function over all cube faces. Hayek evolved evaluation functions, which used sums to estimate the number of correct cubies. We realized, however, that it was impossible for Hayek to express the physical concept of a cubie in this language and so also experimented with a language based around cubies, in which we supplied a function NumCorrect describing the number of correct cubies. We also worked with two different presentation schemes.
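A sketch of summing a boolean function over all cube squares, in the spirit of the Sum instruction and EQCs. The cube encoding here, a dict of 3 × 3 color grids keyed by face, is entirely our own assumption and not the representation used in the experiments.

```python
# Hypothetical solved-cube state: six faces, each a 3x3 grid of color letters.
solved = {f: [[f] * 3 for _ in range(3)] for f in "RGBWYO"}

def Sum(pred, cube):
    """Sum a boolean predicate over all 54 squares, as the Sum instruction
    does over constructed boolean functions (sketch only)."""
    return sum(bool(pred(cube, face, x, y))
               for face in cube for x in range(3) for y in range(3))

def matches_center(cube, face, x, y):
    # An EQCs-style comparison: does this square match its face's center color?
    return cube[face][x][y] == cube[face][1][1]

assert Sum(matches_center, solved) == 54  # every square matches on a solved cube
```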
In Presentation Scheme 1, we initially presented cubes scrambled with one rotation, and as Hayek learned to master cubes at a given level of scrambling, presented cubes scrambled with increasing numbers of rotations. We then gave reward only for completely unscrambling the cube. In Presentation Scheme 2, we presented a completely scrambled cube—a cube scrambled with 100 random rotations—and presented Hayek with reward proportional to its partial progress at the time it said "done." For this purpose we measured partial progress according to three different metrics (in different runs): (1) the number of cubies it got right, (2) the number of cubies it got right according to a fixed sequence (i.e., we numbered the cubies starting with the top face and gave Hayek reward equal to one less than the first cubie in the sequence that was wrong), and (3) the number of cube faces it got right (maximum 54). We also worked with two different cube models. Cube model A allowed three actions: F, a one-quarter move on the front face, and X and Y rotations of the whole cube. Cube model B held the x-, y-, and z-axes of the cube fixed but allowed a one-quarter move on each of six faces (L, R, F, B, T, D).

Some representative results are as follows. In a run using Presentation Scheme 1, with Hayek given an instruction NumCorrect that we set equal to the total number of cubie faces correct relative to the center cubie face, Hayek learned after a few days to descramble cubes scrambled with 20 quarter-moves 80% of the time and cubes scrambled with 30 quarter-moves 40% of the time. In another experiment, using Presentation Scheme 2, with reward equal to the total number of cubie faces it left correct, Hayek learned to improve scrambled positions by about 20 cubie faces (the average number of correct cubie faces at the end minus the number at the beginning was about 20). In runs with Presentation Scheme 2, with Hayek given a reward for partial success defined as the number of correct cubies in a particular (x, y, z) order, stopping at the first incorrect cubie (and not counting the center cubies, which are always correct by definition), Hayek learned to solve about 10 cubies, or the whole first slice and 2 more cubies of the middle slice. This is further than many humans get.

For each of the schemes we tried, Hayek made significant progress, but after a few days of crunching, it would plateau and cease further improvement. Actually, this describes as well the progress of many humans, because as the cube becomes more solved, it becomes increasingly difficult to find operators that will make progress without destroying the progress already achieved. Such operators become increasingly long chunks of code, and there are no evident incremental properties of the chunks, so it becomes exponentially hard to find them unless some deep understanding can be achieved. Hayek's strategy in these runs was fundamentally similar to its strategy in BW: it produced a collection of 300 to 700 agents that took various clever sequences of actions and bid an estimate of the value of the state they reached. Typical instances involved a dozen or so auctions, the winner in each making partial progress.
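The second partial-progress metric, counting cubies correct along a fixed numbering and stopping at the first wrong one, can be sketched as follows. The flag-list representation is our simplification; the real metric is computed from the cube state.

```python
def fixed_sequence_progress(correct_flags):
    """Metric 2: count cubies correct in a fixed numbering, stopping at the
    first incorrect cubie.  `correct_flags[i]` says whether cubie i is placed
    correctly (equivalently: one less than the index of the first wrong
    cubie, with 1-based numbering)."""
    n = 0
    for ok in correct_flags:
        if not ok:
            break
        n += 1
    return n

assert fixed_sequence_progress([True, True, False, True]) == 2
assert fixed_sequence_progress([True] * 20) == 20
```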
Hayek made progress in all of the different schemes, and it was not obvious that any scheme was particularly preferable to the others. It is unclear whether Hayek, at least using the S-expression representation here, can exploit the compact structure of the problem domain in more powerful ways. It is worth noting, for example, that within the S-expression languages used here, it does not seem that high-level concepts such as "cubies" can be represented (never mind learned), to the extent that we do not simply insert them by hand. However, it also seems likely there is fundamentally less structure to be exploited in Rubik's cube than in BW. The smallest program a human could write that will solve Rubik's cube is much, much larger than the smallest human-crafted program to solve BW, while at the same time the state space in BW is huge (in fact infinite) and the state space in Rubik is merely large. Indeed, Rubik can be solved by brute search (Korf, 1997).
8 Rush Hour
This section describes experiments applying Hayek to the commercially available game Rush Hour. The game comes with 40 different puzzle instances. Figure 7 shows instance 35. Each instance consists of several cars placed on a 6 × 6 grid. Cars are of width 1 and come in two different lengths, 2 or 3. A car can only be moved forward or backward, not turned. The goal of the game is to move the cross-marked car out the exit. The game was honored for excellence by Mensa, and humans find the problems challenging. Moreover, the game is provably hard. Flake and Baum (1999) show that the generalization to an n × n grid is PSPACE-complete, equivalent to a reversible Turing machine, and that solution may require a number of moves exponential in n.

Figure 7: Rush Hour problem 35.

We trained Hayek, and also a genetic program, by presenting problems of gradually increasing difficulty, including easy training problems generated from the original 40 by randomly removing cars. Rush Hour presents new challenges, particularly in the selection of a representation language for the agents, since agents need flexibility in specifying actions and must be able naturally to express relations such as "the car that is blocking this car." The version of the program discussed here used integer-type S-expressions and creation operators. We call this Hayek3.1. Agents were limited to expressions of up to 20,000 nodes. The language consists of constants (0, 1, 2), arithmetic functions (add, subtract, multiply), if-then-else, and some additional functions, as follows. There is a pointer that is initiated pointing to the marked car. The function PUSH moves the car pointer to either the car blocking forward motion of the pointed car or the car blocking its backward motion, depending on whether its argument is zero or not. (If there is no such blocking car—motion in that direction is blocked only by a wall or, alternatively, it is possible to drive off the board—then PUSH fails and takes no action.) As a side effect, the previously pointed car is pushed onto a car stack. Cars can be popped back, giving our language a weak backtracking ability. Function PUSH2 behaves like PUSH, except that it moves the car pointer to the second blocking car (the car that would block next if the first blocking car were removed). Function MOVE moves the pointed car forward or backward. Each of these functions returns 1 if it succeeds and 0 if it fails. Creators employ additional statements: MODIFY, a "wildcard" *, and four binding symbols $1, $2, $3, $4.

To get a rough estimate of the difficulty of the problems, we applied random search to them. Table 1 shows the problems ordered by the number of moves used to solve them by a random search algorithm. We call the ranking according to Table 1 "order no." and the number supplied by the manufacturer "task no." For each task we allowed a total number of car moves bounded by 100 + #random-moves/100, where #random-moves was the average number of moves taken to solve that particular instance by random search. In each instance we allowed 10 + #random-moves/1000 auctions. If the move limit or the auction limit is exceeded, the instance ends with no reward.

Table 1: Ordering of the Tasks by Random Moves.

Order  Task  Moves      Order  Task  Moves
1      9*    1,394      21     32    25,813
2      13*   3,181      22     3*    26,077
3      28    5,909      23     5*    27,572
4      14*   6,063      24     38    27,828
5      23    6,069      25     37    28,127
6      10*   6,854      26     11    36,892
7      7*    7,487      27     15    41,421
8      22    8,931      28     19    57,901
9      2*    9,919      29     29    58,439
10     12*   10,204     30     8*    70,804
11     4*    11,718     31     18    79,533
12     6*    13,134     32     26    87,473
13     20*   13,532     33     24    100,016
14     27    14,281     34     40    111,243
15     16    14,451     35     39    123,884
16     1*    20,916     36     34    157,495
17     21*   21,968     37     31    160,507
18     25    22,348     38     33    171,687
19     30    23,255     39     36    184,341
20     17    24,336     40     35    209,923

Notes: Task is the number supplied by the manufacturer. Moves is the number of moves taken to solve the problem using random search. Problems marked with an asterisk were solved by Hayek3.1.

Hayek3.1 was trained using random problems drawn from a distribution crafted to present easier problems first, as follows. We first select a problem randomly and uniformly from the 40 supplied with the commercial game. The instance is then subsampled by removing cars as follows. For each problem number, we maintain a count of the fraction of instances of that problem number solved out of the last 100 times this instance was presented.7 We remove a number of cars so that the fraction of cars remaining, called p, is equal to the fraction of the last 100 problems of that number solved (up to round-off, since we must present an integer number of cars). This allows Hayek3.1 to learn from simpler examples, and as it learns on each example, we gradually ramp up p until we are presenting the full example. The reward R given for solving an instance was R = (order no.) · e^(4.06p − 4.06). Thus more reward was given for harder instances, and reward is exponentially increased for solving instances with no subsampling. The coefficients in the exponent were determined by fitting e^(ap + b) to be equal to 0.01 for p = 0 and 1.0 for p = 1.

After training, Hayek contains a collection of hundreds of agents that collaborate to solve some of the problems. Figure 8 shows a sequence of agents acting to solve task no. 20 (a.k.a. order no. 13). Table 2 shows the sequence of bids and number of actions taken. Notice that the agents recognize as they close in on a solution and bid higher. Note that the final bid is quite accurate (although in this case a slight overbid). (This agent was profiting on other instances but lost money on this one.) In instances where the problem is ultimately not solved, bids stay low, as agents realize they are not close to a solution. The ability of agents to recognize distance from the solution is critical to Hayek's performance. In each of a series of auctions, it chooses the agent that in its own estimation makes the most progress, eventually converging on a solution. This breaks down a lengthy search into a sequence of operations, effectively achieving subgoals.

Figure 8: Sequence of positions solving problem 20.
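The reward schedule and car subsampling described above can be sketched as follows. The helper names are ours, the constants are those printed in the text, and details such as always keeping the marked car are not modeled.

```python
import math
import random

def reward(order_no, p):
    """R = (order no.) * e^(4.06p - 4.06): small reward for heavily
    subsampled instances (small p), full reward at p = 1."""
    return order_no * math.exp(4.06 * p - 4.06)

def subsample(cars, p, rng=random.Random(0)):
    """Keep roughly a fraction p of the cars, removing the rest at random
    (rounded, since an integer number of cars must be presented)."""
    keep = max(1, round(p * len(cars)))
    return rng.sample(list(cars), keep)

assert abs(reward(13, 1.0) - 13.0) < 1e-9       # order no. 13 is task no. 20
assert reward(13, 0.0) < reward(13, 0.5) < reward(13, 1.0)
```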
Figures 9 and 10 show the time evolution of statistics for this typical Hayek3.1 run. The data points are a sample every 10,000 instances of the average over 100 instances. Figure 9 shows the winning bid in the first auction, the winning bid in the last auction, and the final payoff; the last bid accurately estimates the payoff, and the first bid is lower. Figure 10a shows the reward earned and Figure 10b the linearly weighted sum of fully solved problems. In each run, Hayek has learned to solve a number of unsubsampled problems. A typical run creates a collection of agents that solves about five unsubsampled problems at a time. The best run solved nine unsubsampled problems simultaneously. In one run or another, Hayek has solved problems 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 20, and 21 according to task number (i.e., the manufacturer's ordering). When Hayek solves an unsubsampled problem, it generally involves a collaboration of two to five agents. Subsampled problems are much easier, of course, and are often solved by a single agent.

Figure 9: Bid as discriminator between positions.

7 Actually, the bins record data from runs only where a new agent did not win the bidding and immediately die. Such failures are deemed irrelevant for the purpose of determining the distribution, as this dead agent is no longer part of Hayek3.1 and thus does not affect future performance.
Table 2: Auctions for a Solution of Problem 20.

Auction #  Winning Agent  Winning Bid  Reward  Actions
0          a1052          6.749        0       14
1          a1011          7.369        0       15
2          a1006          9.208        0       20
3          a1198          13.741       0       22
4          a1198          13.853       13      35
Figure 10: (A) Reward per instance. Horizontal axis times 100 is instances presented to Hayek. Horizontal axis times 10 is GP generations. The first point shown is after 100 GP generations. GP at generation 1 scores about 2 on this scale and learns rapidly at first. Each GP generation involves presentation of 100 instances of each type (4000 instances total) to each program in the population. The scales were chosen to agree roughly in wall-clock time. GP shows reward per instance of the best program in the population. (B) Score as linearly weighted sum of fully solved instances.
We have tried several approaches to getting a genetic program to solve these problems, varying the instance presentation scheme and other parameters. In each case, our GP used the same instruction set as Hayek3.1. The runs shown used a population size of 100 and an instance presentation scheme as similar as possible to what Hayek saw.8 As the graphs show, although GP learns to solve subsampled problems, and thus to earn reward, it has never learned to solve any of the original problem set. It reaches a plateau where it is solving only subsampled problems after fewer than 100 generations. Even when trained solely on Problem 1, we have not been able to learn a program solving the full problem. Possibly some better scheme for ramping up problem difficulty would improve performance.

9 Discussion
We have created an artificial economy that assigns credit to useful agents. When we enforce property rights to everything and conservation of money, and when agents are rational in the sense that they accept only transactions on which they profit, new agents can enter if and only if they improve system performance. Agents are initiated far from rational, but we hoped exploitation of irrational agents would lead to the evolution of rational agents and a smoothly functioning system that divides and conquers hard problems. Conversely, we argued that when property rights or conservation of money are not enforced, agents will be able to profit at the expense of the system. Under those conditions, a local optimum of the system will not be evolutionarily stable, and there is no reason to expect evolution to produce a useful system. We have now performed experiments on BW using three different representation languages for the agents (one in Baum, 1996; one here; and a very different one reported in Baum & Durdanovic, 2000, which adopts a radically different solution strategy), and also with and without creation agents. We have performed experiments on Rubik's cube using two different representation languages (here and in Baum & Durdanovic, 2000). We have reported experiments here on Rush Hour. In each of these cases, we have stably evolved systems with agents collaborating to solve hard problems. In the BW examples in particular, we are evolving systems with stable sequences of agents hundreds long, receiving reward only at the end. This contrasts with classifier systems, which are rarely able to achieve stable chains more than a few agents long (Wilson & Goldberg, 1989). Our dynamics have also been largely robust to parameter variations.
8 For a GP, there is a question as to how to compute the fraction of problems of type i solved in recent generations. The runs shown used the average over the last 10 generations of the fraction of problems of that type solved by the upper half scoring members of the population, which worked as well as any of the other approaches we tried.
All of this dynamic stability vanishes when conservation of money is broken. If the tax rate is raised too high, so that money leaks out, the system dies. If, on the other hand, money is nonconserved in a positive fashion, for example by introducing new agents with externally supplied money (Baum, 1996), or in other ways as has happened more than once due to bugs in the code, the system immediately learns to exploit the ability to create money, and cooperation in solving problems from the world disappears. If property rights are broken, for example by choosing the winner of the auction probabilistically, as sometimes suggested in the classifier system literature (Wilson, 1994), again cooperation immediately breaks down and the system can no longer function. We believe that all of this provides powerful evidence for our economic picture. As we have discussed elsewhere (Baum, 1998), we believe that many of the dynamic phenomena in natural complex systems such as ecologies, as well as many problems with other learning programs, can be profitably viewed as manifestations of violations of property rights or conservation of money. We also propose that Hayek is of interest as an artificial life. In particular, our experiments (particularly with metalearning) answer the open question (Thearling & Ray, 1997) of how to get a collection of self-reproducing programs to cooperate like a multicellular being in solving problems of interacting with an external world. We further conjecture that progress in reinforcement learning in ultracomplex environments requires learning to exploit the compact structure of the problem. This means that learning in complex environments will essentially involve automatic programming. We have learned to exploit the structure from small problems and generalized the knowledge to produce programs solving large problems.
The problem with automatic programming methods, including ours, is that the space of programs is enormous and has a very bumpy tness landscape, making it inherently difcult to search. Our proposal to divide and conquer automatically using an economy helps, but the problem remains. This raises the key question of what language will be evolvable. We have made progress using S-expressions, which have previously been used in genetic programming (Koza, 1992) precisely because, intuitively, their structure makes mutations more likely to create semantically meaningful expressions. Typing (Montana, 1994) has also been critical here in cutting down the search space. We found that using a certain generic random node smoothed the tness landscape in BW and increased the evolvability of complex expressions. We found that Hayek was able to solve problems so long as the longest chunk of code it needed as a component was not too long to nd. Hayek has been able to create very complex programs made of relatively small chunks of code. Nonetheless, our ability to evolve complex code has been limited. For example, we were not able to progress on Rubik’s cube because after a
2768
Eric B. Baum and Igor Durdanovic
point, the next chunk of code needed was too long to evolve. Metalearning was proposed as a possible solution to this problem, and our economy was able to support some degree of metalearning, but to date we have not been able to extend the overall abilities of Hayek using metalearning.

A related language question is the power of the language. We believe the S-expression language used here is not computationally universal, and accordingly Hayek has not been able to evolve a program solving, for example, arbitrary BW problems. By contrast, we report elsewhere (Baum & Durdanovic, 2000) a Hayek employing a Post production system language, which is computationally universal and succeeds in evolving a simple universal solver for BW. However, this system also founders in the search problem for Rubik's cube. More generally, for truly complex problems, one might imagine somehow evolving an information economy, with agents buying and selling information from one another. We have not solved the problem of how to make this work. Also, it would be interesting to have a single, universal language able to address many problem domains and see if it is possible to transfer knowledge from one domain (e.g., BW) to another (e.g., Rubik's cube).

Appendix A: Other Methods Applied to Blocks World
For comparison, we tested TD(λ) using a neural net evaluator. (For a description of the method, see Sutton & Barto, 1998, or Tesauro, 1995.) We tested several values of λ, several discount rates,9 and several neural net topologies. We report results on only three-color problems. As will be evident, 10-color problems would have further stressed the encoding, requiring much larger nets if a unary coding were desired. We also report results only on an easier version of our BW where, rather than requiring the learner to recognize when it is done, we externally supply that information. We considered a single move to be a grab-and-drop combination, which improved performance. Neural nets have a fixed input dimension, so it is unclear how to encode a problem like BW with varying size. An evaluator that does not see the whole space, however, will cause aliasing, turning a Markov problem into a non-Markovian one and causing great difficulties. (For a discussion of such problems, see Whitehead & Ballard, 1991.) After some experimentation, we used neural nets with inputs for each of the top five blocks in each stack. We tried both unary and analog encoding for the block color. In unary encoding, each block was represented by three inputs to the net, with one having value
9 A discount rate of 1 (no discounting), as used by Tesauro in his backgammon program (Tesauro, 1995), is the most logical in an episodic presentation scheme like ours. However, empirically, discount rates of about 0.9 stabilized learning in some runs.
Cooperative Problem Solving
2769
1 and the others value 0 to indicate which color the block had, and all value 0 indicating no such block (because the stack was not five blocks high). In analog encoding, the situation was indicated by a single integer input value from 0 to 3. We ran experiments both with and without NumCorrect supplied as an input. We presented increasingly larger instances as the system learned, as in our experiments with Hayek3. However, to keep learning stable, we found we had to slow the increase in instance size greatly. We presented instances chosen uniformly over the sizes from 1 up to l + 1, where l was the largest size being solved 80% of the time.

In experiments without NumCorrect, using a 61-20-1 feedforward net, with 60 inputs giving unary encoding of the colors of the top five blocks in each of four stacks and one "on" input to realize a threshold, we were able to learn stably and solve problems with n = 4 after several days of learning. No other system without NumCorrect supplied did better. In experiments with NumCorrect supplied as an input, the system learned fairly rapidly to solve problems up to about n = 8. Once it got up to this size, its learning would destabilize and it would crash to much worse performance. Then it would rapidly learn again. The system was never able to learn beyond n = 8.

These results are perhaps disappointing but not surprising. While neural nets are good at associative memory and approximating some functions, there is no evidence to indicate that they are suited to solving symbolic search problems or inducing programs as necessary for solving BW. Tesauro's success with backgammon, for example, is attributable to the fact that a linear function provides a reasonable evaluator for backgammon (Tesauro, 1995). We also extensively tested genetic programming. Results are reported in Baum and Durdanovic (1999).
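The unary encoding just described can be made concrete. A minimal sketch, assuming stacks are represented as lists of color indices (0-2) with the topmost block last; the function name and state representation are illustrative, not the authors' code:

```python
# Sketch of the unary (one-hot) input encoding for the 61-20-1 net: for
# each of the top five blocks in each of four stacks, three inputs, one
# set to 1 for the block's color, all zero if there is no block at that
# depth; a final always-on input realizes a threshold.

def encode_state(stacks, n_colors=3, depth=5):
    x = []
    for stack in stacks:                      # each stack: colors 0..2, top last
        top = stack[-depth:][::-1]            # up to five topmost blocks
        for i in range(depth):
            unary = [0] * n_colors
            if i < len(top):
                unary[top[i]] = 1             # one-hot color of the i-th block down
            x.extend(unary)
    x.append(1)                               # the "on" (threshold) input
    return x

stacks = [[0, 1, 2], [2], [], [1, 1, 0, 2, 0, 1]]
x = encode_state(stacks)                      # len(x) == 4*5*3 + 1 == 61
```

Note that a stack deeper than five blocks is truncated to its top five, which is exactly the aliasing discussed above: the net never sees the whole state.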
Since this publication, we have done further extensive tests using population sizes up to 1000, various selection methods, and various combinations of instruction sets, including NumCorrect. As discussed in Baum and Durdanovic (1999), our results were of some independent interest as bearing on the hotly debated question within the genetic programming community of whether genetic programming benefits from crossover, or would be just as effective using the "headless chicken" macromutation where, rather than swapping random subtrees between trees in the population, random subtrees are simply replaced with new random subtrees. A recent textbook (Banzhaf, Nordin, Keller, & Francone, 1998) reports several studies showing these methods essentially equivalent and none showing a big advantage for crossover. BW is thus in some sense an unusually suitable problem for genetic programming, as our BW results represent the only published comparison of which we are aware showing crossover much superior to headless chicken. Nonetheless, genetic programming had only limited success on this problem. Run with a learner having to say done, our best genetic programming run solved 70%
of size 3 and 30% of size 4. Run with done externally supplied, GP produced at best S-expressions capable of solving 80% of size 4 problems and 30% of size 5. These results are arguably directly comparable to our Hayek results, as both GP and Hayek used the same language for their S-expressions.

We also extensively tested hill-climbing approaches using an assembler-like language. Results, reported in Baum and Durdanovic (1999), were that only about four-block instances could be solved.

We also tested a system that searched for an evaluator as the maximum of a set of S-expressions in Hayek's language. This system differed from Hayek3 in that we did not learn agents capable of sequences of actions. Rather, as in TD(λ), we considered all pairs of a grab followed by a drop, evaluated the resulting state, and chose the move leading to the highest evaluated state. We modified our evaluator to better estimate the value of the next state reached. There is no standard method for training complex functional expressions as evaluators. We tried to approximate the evaluation from below, by using the maximum of a number of S-expressions, and removing an S-expression from our population when it overestimated the evaluation, replacing it with another S-expression, generated either randomly or as a mutation of an existing S-expression. This method was also ineffective, solving only single-digit BW problems. We believe that it is critical to learn macroactions, as it will be very difficult to learn any evaluator capable of showing progress after only a single grab-drop pair. Baum (1999) discusses several other ineffective approaches we experimented with.
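The "approximate the evaluation from below by a max of expressions, culling overestimators" scheme just described can be sketched abstractly. A minimal sketch in which random linear functions stand in for S-expressions and the target values are synthetic; every name and value here is illustrative, not the authors' system:

```python
import random

random.seed(0)
POP_SIZE = 20

def random_member():
    # Stand-in for a random S-expression: a random affine function of the state.
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    return lambda s: a * s + b

population = [random_member() for _ in range(POP_SIZE)]

def evaluate(state):
    # The evaluator is the maximum over the population.
    return max(f(state) for f in population)

def observe(state, target):
    """Cull members that overestimate the observed target at this state,
    replacing them with fresh random members, until none overestimate."""
    global population
    while True:
        keep = [f for f in population if f(state) <= target]
        if len(keep) == len(population):
            return
        population = keep + [random_member() for _ in range(POP_SIZE - len(keep))]

for state in range(10):            # synthetic "true value" targets
    observe(state, target=0.5 * state)
```

After observing a state, the evaluator's value there is a lower bound on the target, which is the sense in which the maximum approximates the evaluation from below.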
Appendix B: Pseudocode for Hayek3

hayek() [
    for ( ;; ) [
        task.new()    // create an instance
        solve()       // try to solve it
        payment()     // perform money transactions
        tax()         // collect taxes
    ]
]

solve() [
    record.clean()
    for ( Agent A = Solvers.begin(); A < Solvers.end(); A++ ) [
        A.old_wealth = A.wealth
    ]
    while ( task.state() == RUN ) [
        create()      // creates new agents
        auction()     // performs an auction
    ]
]

create() [
    for ( Agent A = Creators.begin(); A < Creators.end(); A++ )
        if ( A.wealth > InitialWealth() )
            A.create()
]

auction() [
    best_bid = -1
    best_state = task
    best_agent = nil
    for ( Agent A = Solvers.begin(); A < Solvers.end(); A++ ) [
        if ( A.wealth > best_bid ) [
            A.move( task, state )
            bid = min( A.eval( state ), A.wealth )
            if ( bid > best_bid ) [
                best_bid = bid
                best_state = state
                best_agent = A
            ]
        ]
    ]
    if ( best_agent != nil ) [
        // the winner takes the world into a new state
        task = best_state
        update_agent_wealth( best_agent, best_bid )
        record.add( best_agent, best_bid, task.reward() )
    ]
]

update_agent_wealth( agent, bid ) [
    agent.wealth -= bid
    if ( record.last == nil ) [
        pay_to_world += bid
    ] else [
        record.last.wealth += bid
    ]
]

payment() [
    for ( AuctionRecord R = record.begin(); R < record.end(); R++ ) [
        delta = R.agent.wealth - R.agent.old_wealth
        if ( delta > 0 ) [
            R.agent.wealth += 0.9 * delta
            copyright_payment( R.agent, 0.1 * delta )
        ] else [
            R.agent.wealth += delta
        ]
    ]
]

copyright_payment( Agent A, Money M ) [
    if ( A.father == nil ) [
        A.wealth += M
    ] else [
        A.wealth += 0.5 * M
        copyright_payment( A.father, 0.5 * M )
    ]
]

tax() [
    for ( Agent A = Agents.begin(); A < Agents.end(); A++ ) [
        A.wealth -= A.executed_instructions * 1E-6
        A.executed_instructions = 0
        if ( A.wealth < A.initial_wealth AND A.no_living_sons == 0 ) [
            Agents.remove( A )
        ]
    ]
]
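The auction dynamic in the pseudocode above can be sketched as a small runnable program. This is a loose, minimal Python sketch, not the authors' implementation: a toy integer "world" stands in for Blocks World, a single shared evaluator replaces per-agent S-expressions, and creation, profit sharing, and taxes are omitted. All names are illustrative.

```python
# Minimal sketch of the Hayek auction dynamic: each agent bids its
# evaluation of the state its move would produce (capped by its wealth);
# the winner pays its bid to the previous owner of the world (property
# rights) and takes over; the world pays reward to the final owner.

class Agent:
    def __init__(self, move, wealth):
        self.move = move            # action: maps state -> state
        self.wealth = wealth

def eval_state(state, goal):
    # Shared evaluator: states closer to the goal are worth more.
    return 10 - abs(goal - state)

def run_auction(agents, state, goal, last_winner):
    best_bid, best_state, best_agent = -1, state, None
    for a in agents:
        nxt = a.move(state)
        bid = min(eval_state(nxt, goal), a.wealth)   # rational bid cap
        if bid > best_bid:
            best_bid, best_state, best_agent = bid, nxt, a
    if best_agent is not None:
        best_agent.wealth -= best_bid
        if last_winner is not None:
            last_winner.wealth += best_bid           # pay the previous owner
    return best_state, best_agent

def solve(agents, state, goal, reward=10, max_steps=20):
    last = None
    for _ in range(max_steps):
        state, last = run_auction(agents, state, goal, last)
        if state == goal:
            last.wealth += reward                    # the world pays out
            return True, state
    return False, state

agents = [Agent(lambda s: s + 1, wealth=20), Agent(lambda s: s - 1, wealth=20)]
solved, final = solve(agents, state=0, goal=3)
```

Even in this toy form, the key property survives: an agent profits only if a later owner (or the world) values the state it produced more than it paid for the right to act.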
Appendix C: Parameters
There are a number of constants in this system chosen in an ad hoc way, such as the one-tenth profit sharing, the resource tax, Winit, and, in versions with creation agents, the equal split between creators and IP holders. These parameters were not subject to evolution within the system. We attempted only a small amount of empirical tuning. Within reasonable bands in each parameter, the behavior of the system was not found empirically to be qualitatively sensitive to these choices, and we did not establish quantitative differences. Runs of the system require a day or more to learn, and there is considerable random variation from run to run, so empirical tuning is quite difficult unless a qualitatively different behavior can be observed. We discuss our choices of these values in turn.

Since creators are intuitively considered to own their children, in principle the creator might be allowed to set the profit-sharing fraction as it pleases, say by writing appropriate code in the child, and indeed this was done in an earlier version (Baum & Durdanovic, 1999). For the work reported here, just for simplicity, we fixed the profit-sharing fraction at one-tenth. We also experimented with one-half. Our subjective impression was that passing one-half of the profit passed too much money to the creation agents, causing an increase in creation, which slowed the system. However, the
increase, if any, was not huge, and we have no statistically significant claim that one-tenth is better than one-half.

Winit was initially set equal to zero. Once there was money in the system because instances were solved, Winit was simply set equal to the maximum payoff possible in the run. Thus, if we were going to run up to a maximum instance size of 200, we set Winit to 200. This ensured that the created agent would have enough money to make any possible rational bid. If the newly created agent's money ever went below its initial capital, it was removed, with its remaining money transferred back to its creator. Thus a creator that invested 200 in a child would not lose the full 200 if the child proved to be unprofitable. We did not experiment with any other choices.

We experimented with several values of the resource tax. A resource tax above 10^-4 per instruction sucked too much money out of the system, which then could not get started. We did not establish a difference between resource taxes of 10^-5, 10^-6, and 10^-7, although subjectively runs at 10^-7 ran slower. Our best runs used simply 10^-6.

The runs here with creation agents used an equal split of the passed profit between creators and IP holders, but we also experimented with not using IP at all.10 The equal split of passed profit with IP holders subjectively worked marginally better, but again we do not claim to have established a statistically significant difference.

A more principled approach to all of these choices is a subject for future work. Fixing these parameters by fiat has the feel of fixing prices, which typically can cause inefficiencies in economies.

References

Bacchus, F., & Kabanza, F. (1996). Using temporal logic to control search in a forward chaining planner. In M. M. Ghallab & A. Milani (Eds.), New directions in planning (pp. 141-153). Hillsdale, NJ: IOS Press.

Banzhaf, W., Nordin, P., Keller, R. E., & Francone, F. D. (1998). Genetic programming: An introduction. San Mateo, CA: Morgan Kaufmann.
Baum, E. B. (1996). Toward a model of mind as a laissez-faire economy of idiots, extended abstract. In L. Saitta (Ed.), Proc. 13th ICML '96. San Mateo, CA: Morgan Kaufmann.

Baum, E. B. (1998). Manifesto for an evolutionary economics of intelligence. In C. M. Bishop (Ed.), Neural networks and machine learning. Berlin: Springer-Verlag.

Baum, E. B. (1999). Toward a model of intelligence as an economy of agents. Machine Learning, 35(2), 155-185.
10 Although popular opinion seems to hold that enforcing intellectual property rights is important to progress in real economies, the scientific literature on this subject is wide open (cf. Breyer, 1976; Machlup, 1958).
Baum, E. B., & Durdanovic, I. (1999). Toward code evolution by artificial economies. In L. F. Landweber & E. Winfree (Eds.), Evolution as computation. Berlin: Springer-Verlag.

Baum, E., & Durdanovic, I. (2000). An evolutionary post production system. Submitted. Available online at: http://www.neci.nj.nec.com:80/homepages/eric/eric.html.

Birk, A., & Paul, W. J. (1994). Schemas and genetic programming. In Conference on Integration of Elementary Functions into Complex Behavior. Bielefeld.

Blum, A., & Furst, M. (1997). Fast planning through planning graph analysis. Artificial Intelligence, 90, 281.

Breyer, S. (1976). The uneasy case for copyright: A study of copyright in books, photocopies, and computer programs. Harvard Law Review, 284, 281.

Dzeroski, S., Blockeel, H., & De Raedt, L. (1998). Relational reinforcement learning. In J. Shavlik (Ed.), Proc. 12th ICML. San Mateo, CA: Morgan Kaufmann.

Flake, G., & Baum, E. B. (1999). Rush Hour is PSPACE-complete. Submitted. Available online at: http://www.neci.nj.nec.com/homepages/flake/.

Forrest, S. (1985). Implementing semantic network structures using the classifier system. In Proc. First International Conference on Genetic Algorithms (pp. 188-196). Hillsdale, NJ: Erlbaum.

Hardin, G. (1968). The tragedy of the commons. Science, 162, 1243-1248.

Holland, J. H. (1986). Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning (Vol. 2). San Mateo, CA: Morgan Kaufmann.

Koehler, J. (1998). Solving complex planning tasks through extraction of subproblems. In J. Allen (Ed.), Proceedings of the 4th International Conference on Artificial Intelligence Planning Systems (pp. 62-69). Menlo Park, CA: AAAI Press.

Korf, R. E. (1987). Planning as search: A quantitative approach. Artificial Intelligence, 33, 65.

Korf, R. E. (1997). Finding optimal solutions to Rubik's cube using pattern databases.
In Proceedings of the National Conference on Artificial Intelligence (pp. 700-705).

Koza, J. (1992). Genetic programming. Cambridge, MA: MIT Press.

Lenat, D. B. (1983). EURISKO: A program that learns new heuristics and domain concepts, the nature of heuristics III: Program design and results. Artificial Intelligence, 21(1,2), 61-98.

Lenski, R., Ofria, C., Collier, T., & Adami, C. (1999). Genome complexity, robustness and genetic interactions in digital organisms. Nature, 400, 661-664.

Lettau, M., & Uhlig, H. (1999). Rules of thumb versus dynamic programming. American Economic Review, 89, 148-174.

Machlup, F. (1958). An economic review of the patent system. Study No. 15, Subcommittee on Patents, Trademarks and Copyrights, Senate Committee on the Judiciary, 85th Cong., 2d Sess.

Miller, M. S., & Drexler, K. E. (1988). Markets and computation: Agoric open systems. In B. A. Huberman (Ed.), The ecology of computation (p. 133). Amsterdam: North-Holland.

Montana, D. J. (1994). Strongly typed genetic programming. Evolutionary Computation, 3(2), 199-230.
Ray, T. S. (1991). An approach to the synthesis of life. In Artificial life II (Vol. 11). Redwood City, CA: Addison-Wesley.

Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Englewood Cliffs, NJ: Prentice Hall.

Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. CACM, 38(3), 58.

Thearling, K., & Ray, T. (1997). Evolving parallel computation. Complex Systems, 10(3), 229-237.

Watkins, C. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, Cambridge University.

Whitehead, S., & Ballard, D. (1991). Learning to perceive and act. Machine Learning, 7(1), 45.

Wilson, S. (1994). ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1), 1-18.

Wilson, S., & Goldberg, D. (1989). A critical review of classifier systems. In Proc. 3rd ICGA (p. 244). San Mateo, CA: Morgan Kaufmann.

Received February 23, 2000; accepted March 15, 2000.
VIEW
Communicated by Terrence J. Sejnowski
Cortical Potential Distributions and Information Processing

Henry C. Tuckwell
Épidémiologie et Sciences de l'Information, Université Paris 6, 75012 Paris, France

The use of sets of spatiotemporal cortical potential distributions (CPDs) as the basis for cognitive information processing results in a very large space of cognitive elements with natural metrics. Results obtained from current source density (CSD) analysis suggest that in the CPD picture, action potentials may make only a relatively minor contribution to the brain's code. In order to establish if two CPDs are close, we consider standard metrics in spaces of continuous functions, and these may be employed to ascertain if two stimuli will be identified as the same. The correspondence between the electrical activity within brain regions, including not only action potentials but all postsynaptic potentials (PSPs), and CPDs is considered. We examine the possibility of using the CSD approach to find potential distributions using the descriptive approach, in which precise sets of times are ascribed to the occurrence of action potentials and PSPs. Using metrics in the multidimensional space of paths of collections of point processes, we show that closeness of CPDs is implied by closeness of sets of spike times and PSP times if a certain metric is used but not others. We also set forth a dynamical model consisting of a system of reaction-diffusion equations for ionic concentrations coupled with nerve membrane potential equations and active transport systems. Making the approximation of a descriptive approach, the correspondence between sets of spike times and PSP times and CPDs is obtained as with the CSD method. However, since it is not possible to ascribe precise times to the occurrence of PSPs and action potentials, the descriptive approach cannot be used to describe the configuration of electrical activity in cortical regions accurately.
We also discuss how the CPD framework relates to the binding problem and submillisecond timing.

1 Introduction
Early theories of coding in the nervous system (Sherrington, 1906; Eccles, 1953) focused on rate coding, in which the average frequency of action potentials carries information about the stimulus. This type of coding, which was established in the firing responses of such cells as motoneurons (Granit, Kernell, & Lamarre, 1966) and sensory receptors (Terzuolo & Washizu, 1962), has become associated with the integrate-and-fire mode of operation of a nerve cell (Konig, Engel, & Singer, 1996), where timing is not a key factor.

© 2000 Massachusetts Institute of Technology. Neural Computation 12, 2777-2795 (2000)
The idea has also been developed in the past few decades that time intervals between impulses in patterns of action potentials play a key role; this manner of information delivery has become known as temporal coding (Barlow, 1963; Villa & Fuster, 1992; Abeles et al., 1995; see Fujii, Ito, Aihara, Ichinose, & Tsukada, 1996, for a review). Nearly all theories of information processing thus focus attention on properties of trains of action potentials, which are usually considered as temporally discrete events (McCulloch & Pitts, 1943; Hubel & Wiesel, 1962; Hopfield, 1982; Tuckwell, 1998). This includes synchronization, oscillation, correlation, and repeating patterns (Toyama & Tanaka, 1984; Kreiter & Singer, 1992; Ghose & Freeman, 1992; Deppisch et al., 1993; Usher, Schuster, & Niebur, 1993; Hopfield & Herz, 1995; Maass, 1995; Bush & Sejnowski, 1996; Kempter, Gerstner, van Hemmen, & Wagner, 1998; Lestienne & Tuckwell, 1998; Singer, 1998; Golomb, 1998).

The approach in which attention is focused on these discrete events (times of occurrence of spikes) leads to difficulties in the construction of metrics for neocortical activity, because common metrics for sequences (of time points) often give large distances if seemingly minor differences occur between spike trains. This can be seen by considering two finite sequences of spike times, s = {s_k} and t = {t_k}, where k = 1, 2, ..., n, and employing, for example, the standard metric

$$\rho_2(s, t) = \left[ \sum_{k=1}^{n} |s_k - t_k|^2 \right]^{1/2}.$$
Two spike trains can be very similar, and yet such a metric will return a large value for the distance between the sequences of spikes. To illustrate, with time in milliseconds, first let s = {0, 5, 6, 10, 11, 15, 20, 21} and let t = {0, 5, 10, 11, 15, 20, 21, 22}. Then the distance between the trains is ρ₂(s, t) = √60. If now the second spike train is replaced by the similar one t = {0, 5, 6, 10, 11, 15, 21, 22}, then the distance using this metric is only ρ₂(s, t) = √2.

Furthermore, focusing on spike trains as the carriers of information or the determinants of cognitive states appears to be an expedient approach, perhaps due to experimental emphasis on unit recordings. Since there is a large amount of electrophysiological activity in addition to the occurrence of spikes, it seems that the contribution from this other activity (mainly in the form of postsynaptic potentials) should be included in a description of cortical neurophysiological states and cannot be dismissed when considering cognitive information processing. That is, one has to admit the possibility that theories that concentrate only on the occurrence of action potentials might be omitting significant contributions from nonspiking neurophysiological activity; action potentials constitute perhaps only part, or even a small part, of the picture as far as cognitive experiences are concerned. Action potentials are a means of transmitting a signal from one part of the nervous system to another (or to another kind of organ), but their
arrival at a target manifests itself primarily as postsynaptic potentials, not further action potentials.

A quantification of global neurocognitive or brain states is considered to be an important component of a neurophysiological theory of information processing. Whether one focuses on spikes or broader classes of neurophysiological activity, it is important to be able to ascertain whether two cognitive processes (assumed to be neurophysiologically based) are close to each other, not only from the point of view of a satisfactory theory but also with respect to the neurophysiological activities themselves; that is, with such a quantification there must be a measure of distance between such states, which leads to the introduction of metrics. Without a suitable metric to ascertain if physiological brain states are similar, it is not possible to understand when and why similar states lead to similar perceptions or possibly more general cognitive ("mental") processes. This is also motivated by the need to compare cognitive processes in different individuals, between whom there must be considerable variability in anatomy, physiology, and chemistry.

We adopt the viewpoint that collections of sequences of spike times in populations of neurons probably do not by themselves provide an accurate or complete basis for describing cognitive states, despite the fact that some models of discrete neural networks are capable of emulating features of gross cortical activity such as alpha rhythms (Liley, Alexander, Wright, & Aldous, 1999). This view springs from the observations that global or field potential distributions are only partially dependent on spikes and that they may also depend significantly on collections of sequences of postsynaptic potentials at various synapses. That is, many different types of ionic current within certain brain regions may contribute to some extent to the overall cognitive state of an individual.
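The numerical claims in the spike-train example above can be checked directly; a minimal sketch:

```python
import math

def rho2(s, t):
    """Euclidean metric between two equal-length spike-time sequences (ms)."""
    return math.sqrt(sum((sk - tk) ** 2 for sk, tk in zip(s, t)))

s  = [0, 5, 6, 10, 11, 15, 20, 21]
t1 = [0, 5, 10, 11, 15, 20, 21, 22]   # very similar train, shifted alignment
t2 = [0, 5, 6, 10, 11, 15, 21, 22]    # differs only in the last two spikes

d1 = rho2(s, t1)   # sqrt(60): large, despite near-identical trains
d2 = rho2(s, t2)   # sqrt(2): small
```

The large value of the first distance comes entirely from the index-by-index alignment: a single inserted or deleted spike shifts all subsequent comparisons, which is exactly the difficulty with such metrics noted above.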
This idea leads naturally to the description of cognitive states by means of spatiotemporal distributions of field potentials. These are taken in what follows to be electrical potentials, although the role of magnetic fields is not negligible (Okada, Lauritzen, & Nicholson, 1988; Bowyer et al., 1999; Mackert et al., 1999), so that in a broader framework one might use an electromagnetic field description for cortical states (classically, the pair {E(x, t), H(x, t)} of electric and magnetic field strengths throughout the relevant time and space domains).

2 Potential Distribution Coding
Rather than focus on spikes and, in particular, their times of occurrence as a point process, we concentrate on the spatiotemporal distribution, stochastic or deterministic, of the electrical potential as it changes from instant to instant. A three-dimensional system of space coordinates, with a fixed origin and fixed axes, can be chosen within a brain or brain region. Similarly, an arbitrary but fixed time origin can be ascribed. At any space point x and time t, there is an associated electrical potential, V(x, t), relative to that at some reference point. If M ⊂ R³ is a three-dimensional region of the cerebral
cortex or other brain region and T = [t₁, t₂] ⊂ R⁺ is a time interval, then for fixed t and x ∈ M, V(x, t) is a real-valued continuous function varying in space, or a spatial distribution of potential. We wish to consider the set of actual electrical potential distributions V = V_{M,T} = {V(x, t); x ∈ M, t ∈ T} throughout the space-time region. For convenience, only extracellular points of M may be considered; we will assume that measurements of (field) potentials are influenced only by ions and their associated currents in this extracellular compartment, which is a topologically connected region. Although only extracellular points of M will be considered, we will continue to use the same symbol M for the spatial region under consideration. (The amount of extracellular space within the cortex has been estimated as about 0.2 of the total volume; Lux & Neher, 1973.)

Note that V can be considered as a four-parameter stochastic process {V(x, t; ω); x ∈ M, t ∈ T} in which the parameters are the three spatial coordinates and the time, the sample-space variable ω being usually suppressed. Different realizations correspond to different values of ω. (For further explanation, see Parzen, 1962, or Tuckwell, 1995.) For fixed ω, the sample paths of V are supposed to describe, or correspond to, the neurocognitive processes that occur within M in the time interval T. Although M is not fixed during the life of an individual, it is convenient to regard it as fixed in this context over relatively small time intervals. We also note that the exact potential V(x, t) cannot be measured directly but can be estimated by an approximate experimental field potential over a region A_x surrounding the space point x (and probably also a small time interval surrounding t),

$$V_E(x, t) = \frac{1}{|A_x|} \iiint_{A_x} V(x', t)\, w(x', t)\, dx', \tag{2.1}$$
where |·| denotes volume. The region A_x reflects the size of a recording electrode, and the weight function w(x, t) reflects its electrical properties.

2.1 Metrics. The set of all possible potential distributions V(x, t) on M × T, which we denote by C_V(M × T), is a set of bounded continuous functions and a subset of the space C(M × T) of all bounded continuous functions on that product space. For such a space there are suitable, commonly used metrics (Simmons, 1963), or distance functions, which are natural and hence useful in describing the magnitudes of the distances between cortical states. Let V¹ and V² be two points in C_V(M × T) (i.e., spatiotemporal potential distributions). One such metric is provided by the uniform or sup norm,

$$D_1(V^1, V^2) = \sup_{M \times T} |V_1(x, t) - V_2(x, t)|. \tag{2.2}$$
Alternatively, and perhaps more satisfactorily, we may consider C_V as a subset of the space of square-integrable functions on M × T, usually denoted by L²[M × T], in which case we may use the metric

$$D_2(V^1, V^2) = \left[ \int_M \int_T \left( V_1(x, t) - V_2(x, t) \right)^2 dt\, dx \right]^{1/2}. \tag{2.3}$$
However, if one is interested only in comparing spatial potential distributions at a given time t₁, and thus considered only as functions of x, then the following corresponding metrics will be useful:

$$d_1(V^1, V^2) = \sup_{M} |V_1(x, t) - V_2(x, t)| \tag{2.4}$$

and

$$d_2(V^1, V^2) = \left[ \int_M \left( V_1(x, t) - V_2(x, t) \right)^2 dx \right]^{1/2}. \tag{2.5}$$
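On a discretized space-time grid, D₁ reduces to a maximum over samples and D₂ to a Riemann-sum approximation of the L² integral. A minimal sketch with synthetic data; the grid sizes and spacings are illustrative:

```python
import numpy as np

# Two potential distributions sampled on a common grid.
# Axes are (x, y, z, t); values in volts.  Synthetic data for illustration.
rng = np.random.default_rng(0)
shape = (8, 8, 8, 50)                     # 8^3 spatial points, 50 time samples
V1 = rng.normal(size=shape)
V2 = V1 + 0.01 * rng.normal(size=shape)   # a nearby distribution

dx = dy = dz = 1e-4                       # grid spacings (m), illustrative
dt = 1e-3                                 # time step (s), illustrative

# D1: uniform (sup) metric, equation 2.2
D1 = np.max(np.abs(V1 - V2))

# D2: L2 metric, equation 2.3, as a Riemann sum
D2 = np.sqrt(np.sum((V1 - V2) ** 2) * dx * dy * dz * dt)

# d1, d2: spatial metrics at a fixed time sample, equations 2.4-2.5
k = 25
d1 = np.max(np.abs(V1[..., k] - V2[..., k]))
d2 = np.sqrt(np.sum((V1[..., k] - V2[..., k]) ** 2) * dx * dy * dz)
```

Note that the discretized D₁ is always at least as large as d₁ at any fixed time, since the supremum over M × T dominates the supremum over M at one t.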
D₁ and d₁ are in volts, whereas D₂ is in volts sec^{1/2} meters^{3/2} and d₂ is in volts meters^{3/2}.

3 Neurophysiological Activity

3.1 Stimuli. Consider a stimulus S₀, of extrinsic or intrinsic origin, or a combination of both. With this stimulus will be associated a set of spike trains, postsynaptic potentials, and other neurophysiological events that can be considered as constituting a point in a multidimensional metric space. There will also be an associated potential distribution, which we assume is defined on the region M × T. However, it is very unlikely that the set of spike trains, postsynaptic potentials, or corresponding potential distribution is uniquely determined by S₀, because the response to the same stimulus is never exactly the same at different presentations. Thus, there will be an average potential distribution associated with S₀, which we denote by V⁰. If a stimulus S occurs, it will be identified with S₀ if the potential distribution V it elicits satisfies

$$D_1(V, V^0) < \epsilon_1 \tag{3.1}$$

or

$$D_2(V, V^0) < \epsilon_2, \tag{3.2}$$

where the positive constants ε₁ and ε₂ are measures of the discriminatory ability of neurocognitive processes. Now M may be partitioned into w disjoint subsets M_k such that

$$\bigcup_k M_k = M, \qquad M_j \cap M_k = \emptyset$$
for j, k = 1, 2, ..., w, where ∅ is the null (empty) set. Then any potential distribution V can be decomposed into w subpotential distributions,

$$V^k = \{V(x, t);\ x \in M_k,\ t \in T\}, \quad k = 1, 2, \ldots, w,$$

with a similar decomposition for V⁰. Metrics D_{1,k} can be defined on C[M_k × T], and S will be identified with S₀ if, for example,

$$\max_k D_{1,k}(V^k, V^{0,k}) < \epsilon_3.$$
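The partitioned identification rule above can be sketched numerically: split the spatial grid into blocks M_k, compute the sup metric D_{1,k} on each block, and compare the worst block distance to a threshold ε₃. Synthetic data; the partition width and threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (8, 8, 8, 50)                     # (x, y, z, t) samples
V0 = rng.normal(size=shape)               # average distribution for stimulus S0
V  = V0 + 0.05 * rng.normal(size=shape)   # distribution elicited by stimulus S

def block_sup_distances(A, B, w=2):
    """Sup metric D_{1,k} on each of w^3 spatial blocks (the partition M_k),
    with the supremum taken over the block and all time samples."""
    n = A.shape[0] // w
    dists = []
    for i in range(w):
        for j in range(w):
            for k in range(w):
                sl = (slice(i * n, (i + 1) * n),
                      slice(j * n, (j + 1) * n),
                      slice(k * n, (k + 1) * n))
                dists.append(np.max(np.abs(A[sl] - B[sl])))
    return dists

eps3 = 0.5                                # discrimination threshold, illustrative
identified = max(block_sup_distances(V0, V)) < eps3   # identify S with S0?
```

The max over blocks equals the global sup metric D₁ here; the point of the decomposition is that the per-block distances D_{1,k} localize where two distributions disagree.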
3.2 General Neurophysiological Processes. The determinants of the electrical potential (or the entire electromagnetic field) throughout brain regions are the fixed distributions of charges and the currents due to the movements of charged particles, including various ions as they course into and out of neurons and glia and the extracellular compartment. Then, as a first approximation, one may assume that the methods of current source density (CSD) analysis (Rappelsberger, Pockberger, & Petsche, 1993, for example; see Mitzdorf, 1985, for a review) are valid, though the usefulness of this approach remains to be fully explored. This formalism, which is based on Maxwell's equations, is usually expressed through the equation
r ¢ [¾ r V] D ¡I.x ; t/;
(3.3)
where $\sigma$ is the conductivity tensor and $I(x, t)$ is the current density. Equation 3.3 is usually employed to determine the current density $I$ from the measured field potentials, which may be an ill-posed problem. However, here we note that given $I$, one may in principle solve equation 3.3 with appropriate boundary and initial conditions to determine the potential distribution throughout a brain region over a given time interval. It is possible to deduce from this that the potential distribution $V$ should be uniquely determined by the current density. If, as hypothesized, $V_{M,T}$ determines the cognitive processes in an individual over the time interval $T$, then, on the assumption of the validity of equation 3.3, there is a one-to-one correspondence between such processes and current density distributions $I = \{I(x, t);\ x \in M,\ t \in T\}$. One may assume also that $I$ is in $C[M \times T]$.

4 Spike Trains and Postsynaptic Potentials: A Descriptive Approach
It is important to relate the distance function in the cortical potential distribution (CPD) framework to the ongoing neurophysiological activity in the brain. In order to incorporate the contributions from spike trains and activated synapses within $M$, we adopt a descriptive approach at first, attempt to develop a dynamical approach later, and also consider a combination of the two approaches. In the descriptive approach, we are given the sequences of firing times and times of activation of synapses. For simplicity, it is assumed that the neurophysiological effects of an action potential or postsynaptic potential are always the same. This is clearly an approximation, as the effects of spikes and postsynaptic potentials depend on the previous history of events, and more particularly on the immediate previous history. For example, after a rapid succession of spikes, there is often sufficient potassium accumulation in the extracellular space to cause a substantial shift in membrane potential, which can have a dampening effect on subsequent spiking (Lebovitz, 1970). Similarly, rapid volleys of impulses at a synapse can lead to a reduced postsynaptic potential amplitude (Curtis & Eccles, 1960). In addition, even in adult brains, the anatomy and physiology, including synaptic distributions and synaptic strengths at both pre- and postsynaptic locations, are very susceptible to dynamical changes (Hess & Creese, 1987; Dawirs, Teuchert-Noodt, & Busse, 1991; Liao, Jones, & Malinow, 1992; Manabe, Wyllie, Perkel, & Nicoll, 1993).

Cortical Potential Distributions and Information Processing
2783

4.1 Impulsive Currents and Point Neurons and Synapses. The following approach is probably the simplest, but it can lead, as shown in the appendix, to problems with boundary conditions. As a first step, assume that all the currents associated with an action potential of a given neuron can be collected at a single space point and that the current flows are relatively rapid and can therefore be considered instantaneous. Let there be $n$ neurons located at space points $x_k$, $k = 1, 2, \dots, n$, and let their corresponding spike times be $t_{kj}$, $j = 1, 2, \dots, n_k(t)$, where $n_k(t)$ labels the largest spike time of neuron $k$ that occurs in the time interval $(0, t]$. When neuron $k$ emits a spike, suppose there are $p_k$ contributions to the current density from the various ion species, of amplitudes $A_{ki}$, $i = 1, 2, \dots, p_k$. $A_{ki}$ is actually the total charge delivered for ion species $i$ whenever neuron $k$ fires.
Now let the synapses be localizable at the space points $y_l$, $l = 1, 2, \dots, m$, and assume again that their associated net current flows are impulsive and have times of occurrence $s_{li}$, $i = 1, 2, \dots, m_l(t)$, where $m_l(t)$ labels the largest postsynaptic potential (PSP) time of synapse $l$ that occurs in the time interval $(0, t]$. The corresponding strengths of the currents generated are denoted by $B_{lr}$, $r = 1, 2, \dots, q_l$. Thus, $B_{lr}$ is the charge delivered for ion species $r$ at an activation of synapse $l$. $A_{ki}$ and $B_{lr}$ are not fixed; they may be random, time dependent, dependent on several state variables (ion concentrations, transmitter concentrations, etc.), or history dependent. For convenience we put
$$\alpha_k = \sum_{i=1}^{p_k} A_{ki}, \qquad \beta_l = \sum_{r=1}^{q_l} B_{lr}. \tag{4.1}$$
Then in this descriptive approach, we have
$$\nabla \cdot [\sigma \nabla V] = -\left[ \sum_{k=1}^{n} \alpha_k \delta(x - x_k) \sum_{j=1}^{n_k(t)} \delta(t - t_{kj}) + \sum_{l=1}^{m} \beta_l \delta(x - y_l) \sum_{i=1}^{m_l(t)} \delta(t - s_{li}) \right]. \tag{4.2}$$
Here distances are in meters, $V$ is in volts, $\alpha_k$ and $\beta_l$ are in coulombs, and $\sigma$ has units of [ohms meters]$^{-1}$. Equation 4.2 may be written in another way. We let $N_k(t)$ be the number of spikes of the neuron at $x_k$ in the time interval $(0, t]$ and let $M_l(t)$ be the number of postsynaptic potentials that occur at the synapse located at $y_l$ in the same time interval. The quantities $N_k = \{N_k(t),\ t \ge 0\}$ and $M_l = \{M_l(t),\ t \ge 0\}$ are stochastic point processes, which leads to a stochastic partial differential equation for the cortical potential $V(x, t)$:
$$\nabla \cdot [\sigma \nabla V] = -\left[ \sum_{k=1}^{n} \alpha_k \delta(x - x_k) \frac{dN_k}{dt} + \sum_{l=1}^{m} \beta_l \delta(x - y_l) \frac{dM_l}{dt} \right]. \tag{4.3}$$
The sample paths of the $N_k$ and $M_l$ are step functions, so their derivatives, as they appear in equation 4.3, are random sequences of impulse (Dirac delta) functions. The processes $N_k$ and $M_l$ are not in general Poisson processes, but sometimes they can be approximated locally (in time) by temporally inhomogeneous Poisson processes.

4.2 One Space Dimension. In the case of one space dimension, assuming the conductivity does not vary with $x$, we have, for a finite space interval $0 \le x \le L$,
" # n m X @ 2 V.x; t/ dNk X dMl C ¾ D ¡ ®k ±.x ¡ xk / ¯l ±.x ¡ yl / : @x2 dt dt kD 1 lD 1
(4.4)
However, it is seen in the appendix that this leads to difficulties with boundary conditions. Hence we turn to spatially distributed neurons and temporally distributed spikes and postsynaptic potentials.

4.3 Spatially Distributed Neurons and Synapses and Spatiotemporally Distributed Action Potentials and Postsynaptic Potentials. Since neurons, and to a lesser extent synapses, and their corresponding current sources and sinks are distributed over nonnegligible spatial regions, and the associated current flows have nonzero durations, we obtain a more realistic equation for $V$ by taking these factors into account. We again let the neurons be labeled $k = 1, 2, \dots, n$ and the synapses carry indices $l = 1, 2, \dots, m$. The contribution to the current density per unit volume from the $k$th neuron for an action potential elicited at time $t = 0$ is $u_k(x, t; x_k)$, and that for a PSP arising at the $l$th synapse is $v_l(x, t; y_l)$. The emission of spikes and the activation of synapses are assumed to occur at the same times as above, so the equation
for $V$ is
$$\nabla \cdot [\sigma \nabla V] = -\left[ \sum_{k=1}^{n} \sum_{j=1}^{n_k(t)} H(t - t_{kj})\, u_k(x, t - t_{kj}; x_k) + \sum_{l=1}^{m} \sum_{i=1}^{m_l(t)} H(t - s_{li})\, v_l(x, t - s_{li}; y_l) \right]. \tag{4.5}$$
Here $H(\cdot)$ is a unit step function, being unity for nonnegative arguments and zero otherwise. $V$ is a random process, as the sequences of occurrence times of action potentials and postsynaptic potentials constitute random point processes. In one space dimension, if one uses the CSD approach, the electric field is now given as
$$V_x(x, t) = -\frac{1}{\sigma} \int_0^x \left[ \sum_{k=1}^{n} \sum_{j=1}^{n_k(t)} H(t - t_{kj})\, u_k(x_1, t - t_{kj}; x_k) + \sum_{l=1}^{m} \sum_{i=1}^{m_l(t)} H(t - s_{li})\, v_l(x_1, t - s_{li}; y_l) \right] dx_1 + f(t), \tag{4.6}$$
where the subscript notation has been used for a partial derivative with respect to the subscripted variable and $f(t)$ is a function to be determined from initial and boundary conditions. Another partial integration gives the potential, which we designate $V_1$:
$$V_1(x, t) = -\frac{1}{\sigma} \int_0^x \int_0^{x_1} \left[ \sum_{k=1}^{n} \sum_{j=1}^{n_k(t)} H(t - t_{kj})\, u_k(x_2, t - t_{kj}; x_k) + \sum_{l=1}^{m} \sum_{i=1}^{m_l(t)} H(t - s_{li})\, v_l(x_2, t - s_{li}; y_l) \right] dx_2\, dx_1 + g(x, t), \tag{4.7}$$
where $g$ is obtained from boundary and initial conditions. Now consider possibly different sequences of spike times $\{t^*_{kj}\}$ and PSP times $\{s^*_{li}\}$ at the same neurons and synapses, which give rise to a potential $V_2$. Then, using the metric $D_2$, we have for a measure of the distance between the corresponding space-time distributions of potential $V^1$ and $V^2$
$$D_2(V^1, V^2) = \left\{ \frac{1}{\sigma^2} \int_0^L \int_T \left( \int_0^x \int_0^{x_1} \left[ \sum_{k=1}^{n} \left\{ \sum_{j=1}^{n_k(t)} H(t - t_{kj})\, u_k(x_2, t - t_{kj}; x_k) - \sum_{j=1}^{n_k^*(t)} H(t - t^*_{kj})\, u_k(x_2, t - t^*_{kj}; x_k) \right\} + \sum_{l=1}^{m} \left\{ \sum_{i=1}^{m_l(t)} H(t - s_{li})\, v_l(x_2, t - s_{li}; y_l) - \sum_{i=1}^{m_l^*(t)} H(t - s^*_{li})\, v_l(x_2, t - s^*_{li}; y_l) \right\} \right] dx_2\, dx_1 \right)^2 dt\, dx \right\}^{1/2}. \tag{4.8}$$
Here, $n_k(t)$ and $n_k^*(t)$ are the indices of the largest spike times for neuron $k$ in $(0, t]$, and $m_l(t)$ and $m_l^*(t)$ are the corresponding indices for the postsynaptic potential times at synapse $l$.

4.4 Relation Between Spike and PSP Times and Potential Distributions. We will show that if we adopt the CSD analysis approach, then certain standard distances between sets of spike and PSP times cannot be used to establish distances between potential distributions, whereas another, finer metric can be so employed. Consider the number $N_k(t)$ of spikes from neuron $k$ in $(0, t]$ and the number $M_l(t)$ of PSPs at synapse $l$ in $(0, t]$, with corresponding quantities $N_k^*(t)$ and $M_l^*(t)$ for a second set of events. The functions (fixing the sample paths) $N_k$, $M_l$, $N_k^*$, $M_l^*$ are all in the space $D[0, t]$ of real-valued functions that possess left-hand limits and are right-continuous at each point of the interval $[0, t]$ (càdlàg; see, for example, Pollard, 1984). A standard metric for $D[0, t]$ is the uniform metric, which gives the distance between $N_k$ and $N_k^*$ as
$$\rho(N_k, N_k^*) = \sup_{0 \le s \le t} |N_k(s) - N_k^*(s)|, \qquad k = 1, \dots, n, \tag{4.9}$$
and similarly for the PSP sequences,
$$\rho(M_l, M_l^*) = \sup_{0 \le s \le t} |M_l(s) - M_l^*(s)|, \qquad l = 1, \dots, m. \tag{4.10}$$
The entire collection of spike trains from the $n$ neurons and the PSP times from the $m$ synapses is in the product space
$$D^{m+n}[0, t] = D[0, t] \times \cdots \times D[0, t].$$
By a well-known result in analysis, one possible metric for this (metric) product space is obtained by adding the individual metrics. Hence, a measure of the distance between the whole set of functions $\mathcal{L} = \{N_1, N_2, \dots, N_n, M_1, M_2, \dots, M_m\}$ and the corresponding starred quantity $\mathcal{L}^*$ is
$$\rho_1(\mathcal{L}, \mathcal{L}^*) = \sum_{k=1}^{n} \rho(N_k, N_k^*) + \sum_{l=1}^{m} \rho(M_l, M_l^*). \tag{4.11}$$
Now if integrands are close, usually so too are the corresponding integrals. However, it may be concluded that
$$\rho_1(\mathcal{L}, \mathcal{L}^*) \le \delta \tag{4.12}$$
does not imply that the corresponding space-time potential distributions satisfy
$$D_2(V^1, V^2) \le \epsilon, \tag{4.13}$$
where the smaller $\epsilon$ is, the smaller the corresponding value of $\delta$ is. That is, closeness of the sequences of spike and PSP times does not imply that the integrands in equation 4.8 themselves are close if the metric $\rho_1$ is used. This can be established by an example such as the following. Suppose $n_1 \ge 1$ excitatory synapses are firing at a regular rate and that this is the only activity; let the corresponding potential distribution be $V^e$. Consider also that $n_1$ inhibitory synapses are firing at that same rate, with a resulting potential distribution $V^i$. Then we have $\rho_1(\mathcal{L}^e, \mathcal{L}^i) = 0$, but $D_2(V^e, V^i)$ is large. On the other hand, if the set of all the individual distances $\rho(N_k, N_k^*)$, $k = 1, \dots, n$, and $\rho(M_l, M_l^*)$, $l = 1, \dots, m$, is uniformly small, which is gauged by the metric
$$\rho_2(\mathcal{L}, \mathcal{L}^*) = \max_{k,l}\left\{ \{\rho(N_k, N_k^*)\} \cup \{\rho(M_l, M_l^*)\} \right\}, \tag{4.14}$$
then, when $\rho_2 \le \delta_1$ for some small enough $\delta_1$, the corresponding potential distributions will also be close.

5 A Dynamical Approach
One clear disadvantage of the CSD analysis method is that it is not evolutionary in the dynamical sense; that is, it does not provide a means of determining the future distribution of potential from a previous one in conjunction with data on external stimuli. In order to overcome this difficulty, one possibility is to turn to reaction-diffusion equations similar to those used to describe fluctuations of ionic concentrations around neurons and glia in modeling cortical spreading depression (Tuckwell & Hermansen, 1981). Here we will sketch briefly a mathematical model for determining potential distributions using differential equations, with no claim to a complete model. The number of variables in such a construction is very large, as the following reduction to a one-space-dimensional consideration shows. The treatment is deterministic, but it is possible that fluctuations, whose origin is in the complexity of the connections and the variety of nerve cells and synapses, will occur in the solutions and appear random. The latter two components will involve parameters that are probably reasonably fitted with normal distributions.
Let the concentration of the $\nu$th ion species be $C_\nu(x, t)$, its valence be $z_\nu$, $\nu = 1, \dots, p$, and its effective diffusion coefficient be $D_\nu$. Assume that (space-clamped or point) neurons are located at single space points as above, with the $k$th at $x_k$, $k = 1, \dots, n$, having membrane potential $\phi_k$. Similarly, assume synapses are localizable at points $y_l$, $l = 1, \dots, m$, with pre- and postsynaptic membrane potentials $\psi_{l1}$ and $\psi_{l2}$, respectively. As a starting point we use the equation of continuity (see, for example, Tuckwell, 1988, chap. 2),
$$\frac{\partial \rho}{\partial t} = -\frac{\partial j}{\partial x},$$
where $\rho(x, t)$ is the charge density and $j(x, t)$ is the current density. Allowing for diffusion, movement of ions due to the electric field, and the contributions to the current density from movements in and out of neurons during action potentials and postsynaptic potentials, we have, letting the electric potential be $V(x, t)$:
$$\frac{\partial^2 V}{\partial x^2} = -\frac{1}{\epsilon \epsilon_o} \sum_{\nu=1}^{p} z_\nu C_\nu, \tag{5.1}$$
$$\sum_{\nu=1}^{p} z_\nu \frac{\partial C_\nu}{\partial t} = \sum_{\nu=1}^{p} z_\nu \left\{ D_\nu \frac{\partial^2 C_\nu}{\partial x^2} + \frac{z_\nu D_\nu F}{RT} \frac{\partial}{\partial x}\left[ C_\nu \frac{\partial V}{\partial x} \right] + \sum_{k=1}^{n} \delta(x - x_k) F_{\nu k}(\phi_k, \mathbf{C}) + \sum_{l=1}^{m} \delta(x - y_l) G_{\nu l}(\psi_{l1}, \psi_{l2}, \mathbf{C}) + P_\nu(\mathbf{C}) \right\}, \tag{5.2}$$
$$\frac{\partial \phi_k}{\partial t} = \sum_{\nu=1}^{p} g_{\nu k}(V_{\nu k} - \phi_k), \tag{5.3}$$
$$\frac{\partial \psi_{lj}}{\partial t} = \sum_{\nu=1}^{p} g_{\nu, lj}(V_{\nu, lj} - \psi_{lj}), \qquad j = 1, 2. \tag{5.4}$$
The physical constants $F$, $R$, $T$, $\epsilon$, and $\epsilon_o$ have their usual meanings, and the vector $\mathbf{C}$ represents not only the collection of extracellular ionic concentrations $C_\nu$ but also intracellular concentrations, including those of pre- and postsynaptic elements. $V_{\nu k}$ is the Nernst potential for ion species $\nu$ at the $k$th neuron, $g_{\nu k}$ is the conductance of ion species $\nu$ at neuron $k$, and $g_{\nu, lj}$, $j = 1, 2$, is the conductance for ion species $\nu$ at the pre- and postsynaptic membranes of synapse $l$, with $V_{\nu, lj}$ as the corresponding Nernst potentials. $P_\nu$ is the ionic pump or active transport rate for ion species $\nu$, usually operating to restore resting conditions. $F_{\nu k}$ gives the contribution to current density through ion species $\nu$ due to the action potentials in neuron $k$, whereas $G_{\nu l}$ gives those contributions arising due to activity at synapse $l$. Here equations 5.3 and 5.4 are based on standard "conductance-based" models similar to the Hodgkin-Huxley model, with appropriate modifications for the neurons in the cortical region under consideration, such as pyramidal cells (see, for example, Mainen & Sejnowski, 1995). Subsidiary equations for the ionic activation and inactivation variables are omitted for brevity. Included in the conductances $g_{\nu k}$ are contributions from synaptic activation due to all neurons that connect with neuron $k$. The network details have also been omitted for simplicity. Furthermore, there may be synaptic inputs from cells outside the cortical region under consideration. Equation 5.2 does not imply (except when there is only one species of ion present) the following set of equations, which can, however, be employed if it can be assumed that ions of each species are independent:
$$\frac{\partial C_\nu}{\partial t} = D_\nu \frac{\partial^2 C_\nu}{\partial x^2} + \frac{z_\nu D_\nu F}{RT} \frac{\partial}{\partial x}\left[ C_\nu \frac{\partial V}{\partial x} \right] + \sum_{k=1}^{n} \delta(x - x_k) F_{\nu k}(\phi_k, \mathbf{C}) + \sum_{l=1}^{m} \delta(x - y_l) G_{\nu l}(\psi_{l1}, \psi_{l2}, \mathbf{C}) + P_\nu(\mathbf{C}), \qquad \nu = 1, \dots, p. \tag{5.5}$$
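As a numerical illustration of the transport part of equation 5.5, the following sketch drops the point-source and pump terms and uses a simple explicit scheme (the paper favors implicit schemes; everything here, including grid sizes and parameter values, is an illustrative assumption):

```python
import numpy as np

def step_concentration(C, V, D, z, dx, dt, F_RT=1.0):
    """One explicit Euler step of dC/dt = D C_xx + z D (F/RT) (C V_x)_x,
    the diffusion-plus-drift part of equation 5.5, with endpoint values
    crudely pinned to their neighbors (approximate no-flux boundaries)."""
    lap = (np.roll(C, -1) - 2.0 * C + np.roll(C, 1)) / dx**2
    drift = np.gradient(C * np.gradient(V, dx), dx)
    C_new = C + dt * (D * lap + z * D * F_RT * drift)
    C_new[0], C_new[-1] = C_new[1], C_new[-2]
    return C_new

x = np.linspace(0.0, 1.0, 101)
C = np.exp(-((x - 0.5) ** 2) / 0.01)   # initial ion concentration profile
V = 0.1 * x                            # fixed, linear potential for the demo
for _ in range(100):
    C = step_concentration(C, V, D=1e-3, z=1, dx=x[1] - x[0], dt=1e-3)
print(C.max())   # diffusion spreads the peak: maximum drops slightly below 1
```

The time step satisfies the explicit stability condition $D\,\Delta t/\Delta x^2 \ll 1/2$; a production scheme would couple this to equation 5.1 for $V$ and treat the stiff terms implicitly.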
These equations, in conjunction with equations 5.1, 5.3, and 5.4, are more amenable than equations 5.1 through 5.4 to numerical solution by implicit schemes for parabolic systems. The parameters required to obtain solutions for the concentrations of the ions and the potential are obtainable from known anatomical and physiological data on cortical systems. We shall present numerical solutions in a later publication. It is also possible to use the descriptive approach, in which occurrence times of action potentials and postsynaptic potentials are considered to be given, in conjunction with the dynamical approach of this section. This would lead to a simplification of the above system in that we would retain equation 5.1, dispense with the systems 5.3 and 5.4, and replace equation 5.2 by
$$\sum_{\nu=1}^{p} z_\nu \frac{\partial C_\nu}{\partial t} = \sum_{\nu=1}^{p} z_\nu \left\{ D_\nu \frac{\partial^2 C_\nu}{\partial x^2} + \frac{z_\nu D_\nu F}{RT} \frac{\partial}{\partial x}\left[ C_\nu \frac{\partial V}{\partial x} \right] + \sum_{k=1}^{n} \delta(x - x_k)\, a_{k\nu} \frac{dN_k}{dt} + \sum_{l=1}^{m} \delta(x - y_l)\, b_{l\nu} \frac{dM_l}{dt} + P_\nu(\mathbf{C}) \right\}, \tag{5.6}$$
where $N_k(t)$ is the number of action potentials issuing forth from neuron $k$ in $(0, t]$ and $M_l(t)$ is the corresponding number of postsynaptic potentials from the $l$th synapse. The quantity $a_{k\nu}$ is the contribution to the current density through ion species $\nu$ when an action potential is emitted by the $k$th neuron, and $b_{l\nu}$ is the corresponding quantity for activation of the $l$th synapse. Spatially and temporally distributed current waveforms can be used to replace these impulsive forms, as was done in equation 4.5. Furthermore, it is clear, though not so immediately provable, that if one uses the descriptive approach to obtain the nonhomogeneous terms as in equation 5.6, then the same results will obtain as in section 4 on the correspondence between spike and PSP time sequences and potential distributions.

6 Discussion
We have suggested that the mammalian brain operates in a space of potential distributions that constitute a subset of the space of continuous functions of three space variables and one time variable. That is, neither rate codes nor temporal coding provides a complete description of the ongoing information processing by the central nervous system. The concept is not far removed from that of distributed or vector coding (Churchland & Sejnowski, 1992, chap. 4). Stimuli are identified with the potential distributions they elicit in space-time, and closeness of stimuli can be established using certain standard metrics for continuous function spaces. Such metrics seem natural and may coincide with the manner in which a brain distinguishes different perceptual or cognitive states. It is argued that since CSD analysis has purportedly demonstrated that action potentials contribute only to a minor extent to CPDs, it follows that if one adopts the CPD coding picture, then spikes per se make a lesser contribution to the brain's code than in either the rate coding or temporal coding pictures. We note that such conclusions from CSD analysis cannot be used to establish that the brain is using a CPD code. To establish the latter, experimental observations of potential distributions would have to be found to correlate strongly with cognitive states, and perhaps more strongly than spatiotemporal patterns of action potentials alone. There is already some evidence for this from electroencephalogram studies (attention, desynchronization, and sleep-related phenomena), evoked potentials, event-related potentials, and readiness potentials (Eccles, 1984; Hiraiawa, Shimohara, & Tokunaga, 1990). The descriptive approach has been presented as that in which it is assumed that there are well-defined time points at which action potentials and PSPs commence and that the current flows associated with them have well-defined functional forms, approximated, possibly by impulsive functions, in space-time.

Using this descriptive approach in conjunction with either the CSD analysis method or a dynamical system obtained from the equation of continuity and involving ion concentrations and nerve membrane potentials along with subsidiary variables, the use of certain metrics has been considered. This was done to see how closeness of potential distributions corresponded with closeness of sets of spike trains and times of occurrence of PSPs. PSPs have been included because they make the greatest contribution to cortical potentials. It was found that, when employing a standard metric in multidimensional event-time space, closeness of spike trains and PSP times does not imply closeness of potential distributions. This could, however, be remedied if a metric was employed that measured the simultaneous closeness of all components of the multidimensional event-time paths. However, the assumption that action potential activity and PSPs can be described by sets of single time points is clearly not valid, because there are always (relative) continua of graded potential changes preceding, leading into, and following both action potentials and postsynaptic potentials. Thus, whether one adopts the CSD analysis method or the dynamical systems approach of equations 5.1 through 5.4, it is not possible to obtain a complete description of CPDs using the descriptive approach. It might be questioned, in the climate of current theories of information processing, with their emphasis on spikes, why a seemingly more complicated scheme might be needed in which regional space-time potential distributions constitute the elements of the brain's code. However, the exclusive use of action potentials has probably been partially fortuitous, because these are the most noticeable and easily observed electrophysiological events in the brain. Sampling field potentials over wide regions should be feasible experimentally at the same time as unit recordings are made. It is unlikely that the myriad of other ongoing electrophysiological processes is ignored in taking into account an individual's cognitive state, and these are reflected in the overall space-time distribution of potential. The use of CPDs seems also to contribute to providing a solution to the binding problem, which addresses the issue as to how activities in diverse structures are woven together to obtain a global picture (Engel, Fries, Konig, & Singer, 1999).
This is because, in the suggested framework, a cognitive state is determined by the global distribution of potential, and a metric that operates on such global distributions automatically takes into account the various components from disparate regions or activities. Furthermore, changes in a $V$-distribution can occur rapidly, and if the distribution at time $t$ is noticeably different from the distribution at time $t + \Delta t$, an informational change will occur that is in the spirit of quests to demonstrate submillisecond timing (Singer, 1999). That is, if the CPD determines the cognitive state, the latter may change in time faster than the limits imposed by the frequencies of action potentials. To illustrate, consider $T_1 = [0, \delta t)$ and $T_2 = [\delta t, 2\delta t)$, so that $T_1$ and $T_2$ are disjoint successive small time intervals. Then define $V^1 = \{V(x, t);\ x \in M,\ t \in T_1\}$ and $V^2 = \{V(x, t + \delta t);\ x \in M,\ t \in T_1\}$. The distance between these two potential distributions in the same region over successive time intervals may be quantified as
$$D_2(V^1, V^2) = \left[ \int_M \int_{T_1} (V_1(x, t) - V_2(x, t))^2\, dt\, dx \right]^{1/2}.$$
As long as $D_2$ is large enough to lead to a perceived change in cognitive state, there is no limit as to how small $\delta t$ may be.

Appendix
An integration in equation 4.4 gives
$$\sigma[V_x(x, t) - V_x(0, t)] = -\left[ \sum_{k=1}^{n} \alpha_k H(x - x_k) \frac{dN_k}{dt} + \sum_{l=1}^{m} \beta_l H(x - y_l) \frac{dM_l}{dt} \right] + f(t), \tag{A.1}$$
where $f(t)$ is a function of time to be determined by boundary and initial conditions. A zero electric field condition at $x = 0$ and at $x = L$ gives
$$\sigma V_x(x, t) = -\left[ \sum_{k=1}^{n} \alpha_k H(x - x_k) \frac{dN_k}{dt} + \sum_{l=1}^{m} \beta_l H(x - y_l) \frac{dM_l}{dt} \right]. \tag{A.2}$$
Since this in fact gives the electric field strength on $[0, L]$, we see that this quantity has impulses at the times of occurrence of action potentials and postsynaptic potentials. For example, if there is an action potential in neuron $k$ in the interval $(t, t + \delta t]$, then the electric field impulsively jumps for all $x \ge x_k$ at time $t$ and returns instantaneously to zero. The potential, obtained by another integration, also clearly acts anomalously. This seems to point to a general deficiency in the CSD method, or indicates that the use of impulsive current densities is not suitable with that method.

Acknowledgments
I appreciate correspondence with Valery Sbitnev of St. Petersburg Nuclear Physics Institute and the financial support of the Institut National de la Santé et de la Recherche Médicale, Paris. Comments from one of the referees were most useful, and one of his remarks has appeared as part of the revised text.

References

Abeles, M., Bergman, H., Gat, I., Meilijson, I., Seidemann, E., Tishby, N., & Vaadia, E. (1995). Cortical activity flips among quasi-stationary states. Proc. Natl. Acad. Sci. USA, 92, 8616–8620.
Barlow, H. B. (1963). The information capacity of nervous transmission. Biol. Cybern., 2, 1.
Bowyer, S. M., et al. (1999). Analysis of MEG signals of spreading cortical depression with propagation constrained to a rectangular cortical strip. I. Lissencephalic rabbit model. Brain Res., 843, 71–78.
Bush, P., & Sejnowski, T. (1996). Inhibition synchronizes sparsely connected cortical neurons within and between columns in realistic network models. J. Comp. Neurosci., 3, 91–110.
Churchland, P. S., & Sejnowski, T. J. (1992). The computational brain. Cambridge, MA: MIT Press.
Curtis, D. R., & Eccles, J. C. (1960). Synaptic action during and after repetitive stimulation. J. Physiol., 150, 374–398.
Dawirs, R. R., Teuchert-Noodt, G., & Busse, M. (1991). Single doses of methamphetamine cause changes in the density of dendritic spines in the prefrontal cortex of gerbils (Meriones unguiculatus). Neuropharmacology, 30, 275–282.
Deppisch, J., Bauer, H.-U., Schillen, T., Konig, P., Pawelzik, K., & Geisel, T. (1993). Alternating oscillatory and stochastic states in a network of spiking neurons. Network, 4, 243–257.
Eccles, J. C. (1953). The neurophysiology of mind. Oxford: Clarendon.
Eccles, J. C. (1984). The human mystery. London: Routledge.
Engel, A. K., Fries, P., Konig, P., & Singer, W. (1999). Temporal binding, binocular rivalry, and consciousness. Conscious. Cogn., 8, 128–151.
Fujii, H., Ito, H., Aihara, K., Ichinose, N., & Tsukada, M. (1996). Dynamical cell assembly hypothesis—theoretical possibility of spatiotemporal coding in the cortex. Neural Networks, 9, 1303–1350.
Ghose, G. M., & Freeman, R. D. (1992). Oscillatory discharge in the visual system: Does it have a role? J. Neurophysiol., 68, 1558–1574.
Golomb, D. (1998). Models of neuronal transient synchrony during propagation of activity through neocortical circuitry. J. Neurophysiol., 79, 1–12.
Granit, R., Kernell, D., & Lamarre, Y. (1966). Algebraic summation in synaptic activation of motoneurones firing in the "primary range" to injected currents. J. Physiol., 187, 401–415.
Hess, E. J., & Creese, I. (1987). Biochemical characterization of dopamine receptors. In I. Creese & C. M. Fraser (Eds.), Dopamine receptors (pp. 1–27). New York: Liss.
Hiraiawa, A., Shimohara, K., & Tokunaga, Y. (1990). EEG topography recognition by neural networks. IEEE Eng. Med. & Biol. Mag., 9, 39–42.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. USA, 79, 2554–2559.
Hopfield, J. J., & Herz, A. V. M. (1995). Rapid synchronization of action potentials: Toward computation with coupled integrate-and-fire neurons. Proc. Natl. Acad. Sci. USA, 92, 6655–6662.
Hubel, D. H., & Wiesel, T. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. Lond., 160, 106–154.
Kempter, R., Gerstner, W., van Hemmen, J. L., & Wagner, H. (1998). Extracting oscillations: Neuronal coincidence detection with noisy periodic spike input. Neural Computation, 10, 1987–2017.
Konig, P., Engel, A. K., & Singer, W. (1996). Integrator or coincidence detector? The role of the cortical neuron revisited. Trends Neurosci., 19, 130–137.
Kreiter, A. K., & Singer, W. (1992). Oscillatory neuronal responses in the visual cortex of the awake macaque monkey. Eur. J. Neurosci., 4, 369–375.
Lebovitz, R. M. (1970). A theoretical examination of ionic interactions between neural and non-neural membranes. Biophys. J., 10, 423–444.
Lestienne, R., & Tuckwell, H. C. (1998). The significance of precisely replicating patterns in mammalian CNS spike trains. Neuroscience, 82, 315–336.
Liao, D., Jones, A., & Malinow, R. (1992). Direct measurement of quantal changes underlying long-term potentiation in CA1 hippocampus. Neuron, 9, 1089–1097.
Liley, D. T., Alexander, D. M., Wright, J. J., & Aldous, M. D. (1999). Alpha rhythm emerges from large-scale networks of realistically coupled multicompartmental model cortical neurons. Network, 10, 79–92.
Lux, H. D., & Neher, E. (1973). The equilibrium time course of [K+]o in cat cortex. Exp. Brain Res., 17, 190–198.
Maass, W. (1995). On the computational power of networks of spiking neurons. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems, 7 (pp. 183–190). Cambridge, MA: MIT Press.
Mackert, B. M., et al. (1999). Non-invasive long-term recordings of direct current (DC) activity in humans using magnetoencephalography. Neurosci. Lett., 273, 159–162.
Mainen, Z., & Sejnowski, T. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Manabe, T., Wyllie, D. J., Perkel, D. J., & Nicoll, R. A. (1993). Modulation of synaptic transmission and long-term potentiation: Effects on paired pulse facilitation and EPSC variance in the CA1 region of the hippocampus. J. Neurophysiol., 70, 1451–1459.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys., 7, 89–93.
Mitzdorf, U. (1985). Current source-density method and application in cat cerebral cortex: Investigation of evoked potentials and EEG phenomena. Physiol. Rev., 65, 37–100.
Okada, Y. C., Lauritzen, M., & Nicholson, C. (1988). Magnetic field associated with spreading depression: A model for the detection of migraine. Brain Res., 442, 185–190.
Parzen, E. (1962). Stochastic processes. San Francisco: Holden-Day.
Pollard, D. (1984). Convergence of stochastic processes. New York: Springer-Verlag.
Rappelsberger, P., Pockberger, H., & Petsche, H. (1993). Sources of electric brain activity: Intracortical current dipoles. Physiol. Meas., 14 (Suppl. 4A), 17–20.
Sherrington, C. S. (1906). Integrative action of the nervous system. New Haven, CT: Yale University Press.
Simmons, G. F. (1963). Introduction to topology and modern analysis. New York: McGraw-Hill.
Singer, W. (1998). Consciousness and the structure of neuronal representations. Philos. Trans. R. Soc. Lond. B Biol. Sci., 353, 1829–1840.
Singer, W. (1999). Time as coding space. Curr. Opin. Neurobiol., 9, 189–194.
Terzuolo, C. A., & Washizu, Y. (1962). Relation between stimulus strength, generator potential and impulse frequency in stretch receptor of crustacea. J. Neurophysiol., 25, 56–66.
Toyama, K., & Tanaka, K. (1984). Visual cortical functions studied by cross-correlation analysis. In G. M. Edelman et al. (Eds.), Dynamic aspects of cortical function. New York: Wiley.
Tuckwell, H. C. (1988). Introduction to theoretical neurobiology (Vol. 1). New York: Cambridge University Press.
Tuckwell, H. C. (1995). Elementary applications of probability theory: With an introduction to stochastic differential equations. London: Chapman and Hall.
Tuckwell, H. C. (1998). Continuum models in neurobiology and information processing. BioSystems, 48, 223–228.
Tuckwell, H. C., & Hermansen, C. E. (1981). Ion and transmitter movements in spreading cortical depression. Int. J. Neurosci., 12, 109–135.
Usher, M., Schuster, H. G., & Niebur, E. (1993). Dynamics of populations of integrate-and-fire neurons, partial synchronization and memory. Neural Computation, 5, 570–586.
Villa, A. E. P., & Fuster, J. M. (1992). Temporal correlates of information processing during visual short-term memory. NeuroReport, 3, 113–116.

Received July 28, 1999; accepted March 7, 2000.
NOTE
Communicated by Alessandro Treves
Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials

Jonathan D. Victor
Department of Neurology and Neuroscience, Weill Medical College of Cornell University, New York City, New York 10021, U.S.A.

We present a new derivation of the asymptotic correction for bias in the estimate of information from a finite sample. The new derivation reveals a relationship between information estimates and a sequence of polynomials with combinatorial significance, the exponential (Bell) polynomials, and helps to provide an understanding of the form and behavior of the asymptotic correction for bias.

1 Introduction
In its most basic form, application of the tools of information theory to laboratory data relies on the estimation of the information in a process consisting of independent occurrences of K kinds of mutually exclusive events, each of which occurs with a probability $q_j$ $(j = 1, \ldots, K)$ (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). The quantity

$$H = -\frac{1}{\ln 2} \sum_{j=1}^{K} q_j \ln q_j \tag{1.1}$$
is the information (in bits) associated with a single observation. Typically, the probabilities $q_j$ are not known and must be estimated from a finite set of observations. It is well known that the naive estimate for H, based on replacement of the exact probabilities $q_j$ by their empirical probabilities observed from N observations, downwardly biases the estimate of equation 1.1. Essentially this is because equation 1.1 is a concave-downward function, so an average estimate for H derived from a range of estimates of the true probabilities $q_j$ is less than the value of H given by equation 1.1 at the center of this range. Several authors (Carlton, 1969; Miller, 1955) have derived asymptotic estimates for this bias in the limit of large N. The leading term in the asymptotic estimate of the bias depends on K, the number of kinds of events, and N, the number of observations but, remarkably, is independent of the probabilities $q_j$ of the events. These calculations are readily extended to estimates of mutual information (Miller, 1955; Treves & Panzeri, 1995), since mutual information in a table is the sum of the information in the distributions of

Neural Computation 12, 2797–2804 (2000) © 2000 Massachusetts Institute of Technology
2798
Jonathan D. Victor
the marginal probabilities, minus the information in the distribution of table entries. Because the number of rows (say, $K_R$) and columns (say, $K_C$) of a nontrivial table is fewer than the number of entries in the table ($K_R K_C$), the downward bias in simple estimates of information (see equation 1.1) translates into an upward bias in estimates of mutual information. This article presents a new and concise derivation of the bias estimate. The new derivation clarifies the basis for the lack of dependence of the bias on the probabilities $q_j$, and reveals a relationship between information estimates and the exponential (Bell) polynomials (Bell, 1934b), a sequence of integer polynomials with a well-known combinatorial interpretation.

2 Results
We consider an estimate of information from a set of N events, each of which can independently have one of K possible mutually exclusive outcomes. The probability of the jth outcome is denoted $q_j$, and $\sum_{j=1}^{K} q_j = 1$. We define

$$U(N, s) = \left\langle \sum_{j=1}^{K} p_j^s \right\rangle_N. \tag{2.1}$$
The quantities $p_j$ are estimates of the probabilities $q_j$, which are considered to be definite but unknown. That is, $p_j = n_j/N$ is the empirical probability of outcome j, as estimated from a set of N observations in which this outcome occurred $n_j$ times. $\langle \cdot \rangle_N$ denotes an average over all sets of N observations drawn from this universe. The expected value $\langle H \rangle_N$ for the estimate of the information from N observations is thus

$$\langle H \rangle_N = -\frac{1}{\ln 2} \frac{\partial}{\partial s} U(N, s) \Big|_{s=1}. \tag{2.2}$$
We form the generating function

$$\tilde U(z, s) = \sum_{N=0}^{\infty} \frac{N^s z^N}{N!}\, U(N, s). \tag{2.3}$$
To evaluate the expectation implicit in equations 2.1 and 2.3, we assign to each possible set of observations (say, $n_j$ occurrences of each outcome j, with $N = \sum_{j=1}^{K} n_j$) its corresponding multinomial probability,

$$\frac{N!}{\prod_{j=1}^{K} n_j!} \prod_{j=1}^{K} (q_j)^{n_j}.$$
In this expression, the factorial terms count the number of ways of ordering $n_j$ occurrences of each outcome j into a sequence of N observations, and
Information Estimates and Bell Polynomials
2799
the terms involving $q_j$ specify the probability of any one such sequence of observations. Inclusion of the multinomial probability in equation 2.1 converts it to

$$U(N, s) = \sum_{\{n_j\}} \left( \sum_{J=1}^{K} n_J^s \right) \frac{N!}{N^s} \prod_{j=1}^{K} \frac{(q_j)^{n_j}}{n_j!}, \tag{2.4}$$

and equation 2.3 to
$$\tilde U(z, s) = \sum_{\{n_j\}} \left( \sum_{J=1}^{K} n_J^s \right) \prod_{j=1}^{K} \frac{(q_j z)^{n_j}}{n_j!}, \tag{2.5}$$
where the outer sum is over all sets of nonnegative integers $n_j$, satisfying $\sum_{j=1}^{K} n_j = N$ in equation 2.4, but unrestricted in equation 2.5. Term-by-term consideration of the J sum in equation 2.5 and rearrangement lead to

$$\tilde U(z, s) = \sum_{J=1}^{K} \left( \sum_{n_J=0}^{\infty} n_J^s\, \frac{(q_J z)^{n_J}}{n_J!} \right) \left[ \prod_{j \neq J} \left( \sum_{n_j=0}^{\infty} \frac{(q_j z)^{n_j}}{n_j!} \right) \right] = e^z \sum_{J=1}^{K} b(q_J z, s), \tag{2.6}$$
where we have defined

$$b(z, s) = e^{-z} \sum_{n=0}^{\infty} n^s\, \frac{z^n}{n!}, \tag{2.7}$$
and made use of $\sum_{J=1}^{K} q_J = 1$. For nonnegative integer values of s, $b(z, s)$ are polynomials and have the generating function

$$\sum_{s=0}^{\infty} b(z, s)\, \frac{u^s}{s!} = e^{z(e^u - 1)}. \tag{2.8}$$
Thus, $b(z, s)$ are a simple example of the exponential (Bell) polynomials in z (the quantities $\phi_n$ in Bell, 1934b), and $b(1, s)$ are the exponential (Bell) numbers (Bell, 1934a; history reviewed in Rota, 1964). For all (not necessarily integral) $s \geq 0$, $b(z, s)$ satisfies the recurrence relation

$$b(z, s+1) = z \left[ \frac{\partial}{\partial z} b(z, s) + b(z, s) \right], \tag{2.9}$$
as can be verified from equation 2.7 or 2.8. Also from equation 2.7, $b(z, 0) = 1$, and for integer $s > 0$, the leading terms of $b(z, s)$ are

$$b(z, s) = z^s + \frac{s(s-1)}{2}\, z^{s-1} + \frac{s(s-1)(s-2)(3s-5)}{24}\, z^{s-2} + \frac{s(s-1)(s-2)^2(s-3)^2}{48}\, z^{s-3} + \cdots. \tag{2.10}$$
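The recurrence in equation 2.9 gives a direct way to generate these polynomials and to check the combinatorial statements quoted here. The following sketch (illustrative code, not part of the article) represents $b(z, s)$ as a list of coefficients of powers of z:

```python
def next_bell_poly(coeffs):
    """One step of equation 2.9: b(z, s+1) = z * (d/dz b(z, s) + b(z, s)).
    coeffs[t] is the (integer) coefficient of z**t in b(z, s)."""
    deriv = [t * c for t, c in enumerate(coeffs)][1:]   # d/dz drops the constant
    deriv += [0] * (len(coeffs) - len(deriv))           # pad to the same length
    inner = [d + c for d, c in zip(deriv, coeffs)]      # d/dz b + b
    return [0] + inner                                  # multiplying by z shifts

def bell_poly(s):
    """b(z, s) for integer s >= 0, built up from b(z, 0) = 1."""
    coeffs = [1]
    for _ in range(s):
        coeffs = next_bell_poly(coeffs)
    return coeffs
```

Evaluating at z = 1 (summing the coefficients) reproduces the Bell numbers 1, 1, 2, 5, 15, 52, ..., and the coefficient of $z^{s-1}$ comes out to $s(s-1)/2$, in agreement with the combinatorial interpretation discussed in the text.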
The Bell polynomials have a combinatorial interpretation (Bell, 1934a, 1934b; Rota, 1964): the coefficient of $z^t$ in $b(z, s)$ is the number of ways of placing s distinguishable objects into t indistinguishable containers. In particular, the coefficient of $z^{s-1}$ is $s(s-1)/2$, the number of ways of choosing one pair of the s objects to share a container. From equation 2.3,

$$-\frac{1}{\ln 2}\frac{\partial}{\partial s}\tilde U(z, s)\Big|_{s=1} = -\frac{1}{\ln 2}\sum_{N=0}^{\infty}\frac{N \ln N}{N!}\, z^N + \sum_{N=0}^{\infty}\frac{N}{N!}\, z^N \langle H \rangle_N = -\frac{e^z}{\ln 2}\,\frac{\partial}{\partial s} b(z, s)\Big|_{s=1} + \sum_{N=0}^{\infty}\frac{z^N}{(N-1)!}\,\langle H \rangle_N, \tag{2.11}$$
where we have used $U(N, 1) = 1$ (from equation 2.1) and equation 2.2 in the first step, and equation 2.7 in the second step. Combining this with equation 2.6 yields

$$\sum_{N=0}^{\infty}\frac{z^N}{(N-1)!}\,\langle H \rangle_N = \frac{e^z}{\ln 2}\left[\frac{\partial}{\partial s} b(z, s)\Big|_{s=1} - \sum_{J=1}^{K}\frac{\partial}{\partial s} b(q_J z, s)\Big|_{s=1}\right]. \tag{2.12}$$
Equation 2.12 is exact. To derive an asymptotic estimate, we estimate the partial derivatives in equation 2.12 by assuming that formula 2.10, a finite series for integer values of s, is a useful approximation at noninteger values as well. That is, we use the approximation

$$\frac{\partial}{\partial s} b(z, s)\Big|_{s=1} = z \ln z + \frac{1}{2} + \frac{1}{12z} + \frac{1}{12z^2} + \cdots. \tag{2.13}$$
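Because $b(z, s) = e^{-z}\sum_n n^s z^n/n!$ is the sth moment of a Poisson variable with mean z, the left-hand side of equation 2.13 can also be evaluated directly by truncating the sum, which gives a numerical check on the series (an illustrative check, not from the article):

```python
import math

def exact_derivative(z, n_max=200):
    """d/ds b(z, s) at s = 1, directly from equation 2.7:
    e**(-z) * sum over n of n * ln(n) * z**n / n!  (n = 0, 1 contribute zero)."""
    total, term = 0.0, 1.0                    # term tracks z**n / n!
    for n in range(1, n_max + 1):
        term *= z / n
        if n >= 2:
            total += n * math.log(n) * term
    return math.exp(-z) * total

def series_approximation(z):
    """The four terms of equation 2.13."""
    return z * math.log(z) + 0.5 + 1.0 / (12.0 * z) + 1.0 / (12.0 * z * z)
```

For z of a few or more the two agree closely, while for z well below 1 the series departs badly from the exact value, which is the behavior shown in Figure 1.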
The behavior of this approximation is illustrated in Figure 1. Note that as terms beyond the constant term are added, the improvement in the approximation for large values of z is accompanied by a worsening for values of z < 1. Inserting this approximation into equation 2.12 and identifying corresponding coefficients of $z^N$ on its two sides leads directly to

$$H = \langle H \rangle_N + \frac{1}{\ln 2}\left[\frac{K-1}{2N} + \frac{1}{12N(N+1)}\left(\sum_{J=1}^{K}\frac{1}{q_J} - 1\right) + \cdots\right]. \tag{2.14}$$
Figure 1: Behavior of an asymptotic estimate for $\frac{\partial}{\partial s} b(z, s)|_{s=1}$. (Left) Approximations provided by one and two terms of equation 2.13. (Right) Approximations provided by three and four terms of equation 2.13.
The first correction term corresponds to the previous results of several authors (Carlton, 1969; Miller, 1955; Treves & Panzeri, 1995). The fact that it is independent of the probabilities $q_J$ reflects the constant value (1/2) of the first correction term in equation 2.13. The mth correction term derived from our approach has a denominator $N(N+1)\cdots(N+m-1)$. This is different from the asymptotic series derived by Treves and Panzeri (1995), which is strictly in inverse powers of N. Nevertheless, the approaches agree. For example, the second correction term in equation 2.14, whose value depends on the probabilities $q_J$, differs from the results of Treves and Panzeri (1995), but the difference is third order (i.e., $O(N^{-3})$), and thus is subsumed in the third correction term.

Implementation of this bias correction is not completely straightforward. In the laboratory, one has access only to estimates of the event probabilities $q_J$, the quantities we have denoted $p_J$. Equation 2.14 states that the naive estimate of information $\langle H \rangle_N$, obtained by replacing the $q_j$ in equation 1.1 by $p_j$, is too low. The first correction term in equation 2.14 requires knowing the number of possible kinds of events, K, but not their probabilities. But even K may not be known in advance. One can be sure that an event is possible only if one has observed it, but one does not know that additional kinds of events are impossible, merely that they have not been observed in a sample of size N. Panzeri and Treves (1996) suggest a sophisticated approach for modifying the count of the observed number of kinds of events. Here we initially consider the simple strategy of setting K equal to the number of kinds of events that were actually observed within N trials.

The left panels of Figure 2 demonstrate the effects of this strategy for an experiment in which there are two kinds of events. The initial correction
Figure 2: Comparison of strategies for the adjustment of empirical estimates of information in a simulation with two kinds of events (K) and 4, 16, or 64 trials (N). Exact value (see equation 1.1), naive estimate $\langle H \rangle_N$, and corrections of the naive estimate for bias. Abscissa: probability q of the first kind of event ($q_1 = q$, $q_2 = 1 - q$). Ordinate: information H. (Left panels) Number of kinds of events K and their probabilities estimated from the data. The two curves for the corrected estimates are virtually indistinguishable. (Right panels) Number of kinds of events K and their probabilities known in advance. On the right, the two-term correction leads to higher estimates of information (as labeled in the upper panel). All functions are symmetric about q = 0.5, but are plotted for q in [0.0001, 0.9] on a logarithmic scale to emphasize the behavior for extreme values of q.
closes most of the gap between the naive estimate $\langle H \rangle_N$ and the true value H, provided that the number of trials is sufficiently large so that each kind of event has a reasonable chance of occurring ($Nq \gg 1$). When the number of trials is so small that one of the events will probably not be observed ($Nq \ll 1$), the correction is ineffective, as would be expected, since there is no empirical evidence for more than one kind of event. The second-order correction is very small in both regimes (nearly superimposed on the initial correction).
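A small simulation in the spirit of Figure 2 illustrates both the downward bias of the naive estimate and the simple strategy above, with K set to the number of event kinds actually observed. This is an illustrative sketch with invented function names, not the article's code:

```python
import math
import random

def info_bits(probs):
    """Equation 1.1: the information (in bits) of a discrete distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0.0) / math.log(2.0)

def naive_estimate(counts):
    """Plug-in estimate: replace each q_j by the empirical p_j = n_j / N."""
    n = sum(counts)
    return info_bits([c / n for c in counts])

def corrected_estimate(counts):
    """Equation 2.14 through the second correction term, with K taken as the
    number of kinds observed and the empirical p_j standing in for the q_J."""
    n = sum(counts)
    observed = [c / n for c in counts if c > 0]
    k = len(observed)
    first = (k - 1) / (2.0 * n)
    second = (sum(1.0 / p for p in observed) - 1.0) / (12.0 * n * (n + 1))
    return naive_estimate(counts) + (first + second) / math.log(2.0)

def sample_counts(probs, n_obs, rng):
    """Draw n_obs events and tally how often each kind occurred."""
    counts = [0] * len(probs)
    for _ in range(n_obs):
        r, acc = rng.random(), 0.0
        for j, q in enumerate(probs):
            acc += q
            if r < acc:
                counts[j] += 1
                break
    return counts

# Monte Carlo comparison for K = 2 equiprobable events and N = 16 trials.
rng = random.Random(0)
q = [0.5, 0.5]
true_h = info_bits(q)                                        # 1 bit
samples = [sample_counts(q, 16, rng) for _ in range(2000)]
mean_naive = sum(naive_estimate(c) for c in samples) / len(samples)
mean_corrected = sum(corrected_estimate(c) for c in samples) / len(samples)
```

On average the naive estimate falls below 1 bit by roughly $(K-1)/(2N \ln 2) \approx 0.045$ bits here, and the corrected estimate recovers most of that gap, consistent with the left panels of Figure 2.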
One might expect that the estimate of information could be improved if there were a better way to estimate either K or the event probabilities $q_J$. For illustrative purposes, the right panels of Figure 2 consider the extreme situation: that both are known exactly (but only the empirical values $p_J$ are used to estimate information). The first correction is not as helpful. In the regime in which it is significant ($Nq \ll 1$), it amounts to an overcorrection, because the "correct" value K = 2 is always used, even though two kinds of events were not typically observed. Surprisingly, the second-order correction is even worse, resulting in large overestimates, because of the terms involving reciprocals of small probabilities $q_J$. In the first strategy considered, since the $q_J$ are taken from empirical estimates, $1/q_J$ is limited by N, thus bounding the second-order correction. But there is no such limit here, since even smaller values of $q_J$, based on a priori knowledge, may occur. A hybrid strategy (using a priori knowledge of K, but not of the event probabilities $q_J$) results in performance that is worse than either of the two considered above (not shown). This pattern of behavior was also seen in numerical experiments involving several kinds of events and a wide range of event probabilities.

In sum, the most straightforward application of equation 2.14, making use of only what was observed, appears to be both conservative and effective. It fails under appropriate circumstances: when observation is so limited that possible modes of behavior have not been observed. Under these circumstances, higher-order corrections are not helpful.

3 Discussion
We derived an exact expression, equation 2.12, for the expected information estimate from an N-trial data set in terms of the partial derivative of the Bell polynomials with respect to their order, $\frac{\partial}{\partial s} b(z, s)|_{s=1}$. The asymptotic expansion for this derivative, equation 2.13, corresponds in a term-by-term fashion to an asymptotic expansion, equation 2.14, of the expected information estimate. The leading term in the expansion of $\frac{\partial}{\partial s} b(z, s)|_{s=1}$, namely $z \ln z$, corresponds to the (unbiased) estimate of information from an unlimited sample. The second term in the expansion of $\frac{\partial}{\partial s} b(z, s)|_{s=1}$, namely 1/2, corresponds to the initial correction due to finite sample size (the estimate of the bias), as derived by previous authors (Carlton, 1969; Miller, 1955; Treves & Panzeri, 1995). Since this term is a constant, the initial term in the asymptotic form for the bias depends on only the number of terms in the sum on the right-hand side of equation 2.12: the number of possible kinds of events K, and not on their probabilities $q_J$.

This analysis helps to explain why the high-order correction terms of equation 2.14 are not useful in practice. The higher-order correction terms reflect the successive approximations to the Taylor expansion of $\frac{\partial}{\partial s} b(z, s)|_{s=1}$ for small z. As is seen in Figure 1, the asymptotic series, equation 2.13, converges rapidly for large z, but diverges in the neighborhood of z = 0.
Thus, in the regime in which the higher-order correction terms might matter (low N and estimates of q below 1=N), they worsen the estimate of bias (see the upper two right panels of Figure 2). When N is large, the second-order corrections do indeed improve the estimate of bias for some values of q, but the size of the correction is minuscule (see the lower panels of Figure 2). Acknowledgments
This work was supported in part by EY9314. I thank Jeff Tsai and Danny Reich for comments on the manuscript.

References

Bell, E. T. (1934a). Exponential numbers. Amer. Math. Monthly, 41, 411–419.
Bell, E. T. (1934b). Exponential polynomials. Ann. Math., 35, 258–277.
Carlton, A. G. (1969). On the bias of information estimates. Psychological Bulletin, 71, 108–109.
Miller, G. A. (1955). Note on the bias of information estimates. Information Theory in Psychology: Problems and Methods, II-B, 95–100.
Panzeri, S., & Treves, A. (1996). Analytical estimates of limited sampling biases in different information measures. Network, 7, 87–107.
Rieke, F., Warland, D., de Ruyter van Steveninck, R. R., & Bialek, W. (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
Rota, G.-C. (1964). The number of partitions of a set. Amer. Math. Monthly, 71, 498–504.
Treves, A., & Panzeri, S. (1995). The upward bias in measures of information derived from limited data samples. Neural Computation, 7, 399–407.

Received September 27, 1999; accepted January 21, 2000.
LETTER
Communicated by William Lytton
Relating Macroscopic Measures of Brain Activity to Fast, Dynamic Neuronal Interactions

D. Chawla
E. D. Lumer
K. J. Friston
Wellcome Department of Cognitive Neurology, Institute of Neurology, Queen Square, London WC1N 3BG, U.K.

In this article we used biologically plausible simulations of coupled neuronal populations to address the relationship between phasic and fast coherent neuronal interactions and macroscopic measures of activity that are integrated over time, such as the BOLD response in functional magnetic resonance imaging. Event-related, dynamic correlations were assessed using joint peristimulus time histograms and, in particular, the mutual information between stimulus-induced transients in two populations. This mutual information can be considered as an index of functional connectivity. Our simulations showed that functional connectivity or dynamic integration between two populations increases with mean background activity and stimulus-related rate modulation. Furthermore, as the background activity increases, the populations become increasingly sensitive to the intensity of the stimulus in terms of a predisposition to transient phase locking. This reflects an interaction between background activity and stimulus intensity in producing dynamic correlations, in that background activity augments stimulus-induced coherence modulation. This is interesting from a computational perspective because background activity establishes a context that may have a profound effect on event-related interactions or functional connectivity between neuronal populations. Finally, total firing rates, which subsume both background activity and stimulus-related rate modulation, were almost linearly related to the expression of dynamic correlations over large ranges of activities. These observations show that under the assumptions implicit in our model, rate-specific metrics based on rate or coherence modulation may be different perspectives on the same underlying dynamics.
This suggests that activity (averaged over all peristimulus times), as measured in neuroimaging, may be tightly coupled to the expression of dynamic correlations.

1 Introduction
Neural Computation 12, 2805–2821 (2000) © 2000 Massachusetts Institute of Technology

Previously (Chawla, Lumer, & Friston, 1998), we found, using computer simulations of coupled neuronal populations, that mean activity and synchronization were tightly coupled during relatively steady-state dynamics. This allowed us to make inferences about the degree of phase locking or synchronization among, or within, neuronal populations given macroscopic measures of activity such as those provided by neuroimaging. This article is about the relationship between fast dynamic interactions among neurons, as characterized by multiunit electrode recordings of separable spike trains, and measures of neural activity that are integrated over time (e.g., functional neuroimaging). In particular we address the question, Can anything be inferred about fast coherent or phasic interactions based on averaged macroscopic observations of cortical activity? This question is important because a definitive answer would point to ways in which electrophysiological findings (in basic neuroscience) might inform functional neuroimaging studies that employ a train of stimulus or task events to detect changes in time-integrated activity.

Our basic hypothesis is that fast, dynamic interactions between two neuronal populations are a strong function of their background activity. This hypothesis derives from a series of compelling computational studies (e.g., Boven & Aertsen, 1990; Aertsen & Preissl, 1991; Aertsen, Erb, & Palm, 1994). In other words, the dynamic coupling between two populations, reflected in changes in their coherent activity over a timescale of milliseconds, cannot be separated from the context in which these interactions occur. This context is shaped by the population dynamics expressed over extended periods of time and, in particular, the overall level of activity. This is based on the commonsense observation that the responsiveness of one unit, to the presynaptic input of another distant unit, will depend on the postsynaptic depolarization extant at the time the presynaptic input arrives.
In a previous modeling study, using relatively steady-state dynamics (i.e., in the absence of induced transients), we showed that the mean firing rate and average phase locking between two populations were tightly coupled in all regions of the model's parameter space. There could therefore be a link between mean activity and the emergence of dynamic correlations over a timescale of milliseconds. Previous modeling work has shown that functional and effective connectivity vary strongly with background population activity (Boven & Aertsen, 1990; Aertsen & Preissl, 1991; Aertsen et al., 1994). In this article, we pursue this same question but with a more biologically motivated neuronal model, which uses Hodgkin-Huxley-like dynamics, and a more refined analysis of dynamic correlations that is statistically grounded and uses mutual information. We expected that the emergence of phasic coherent interactions between two populations is both facilitated by and results in high mean population activity, suggesting that high background population activity levels may be a necessary condition for the emergence of faster interactions. In order to examine this, we measured the short-term correlation structures between two simulated time series as characterized by the joint peristimulus time histogram (J-PSTH). The advantages of this characterization include a proper assessment of phasic and stochastic interactions over peristimulus time, where these interactions are referred to a stimulus or behavioral event.

Using simulations, we show that the expression of dynamic correlations is a strong function of the mean activity (averaged over time) extant in the two populations at the time that these interactions are expressed. Furthermore, we show an interaction between background and evoked firing-rate changes that is mediated by activity-dependent changes in functional coupling. Although previous work has established that the effective connectivity among neurons is sensitive to mean levels of population activity, the specific issue we wanted to address in this work is how this activity-dependent change in functional coupling would be expressed in terms of integrated firing rates. This is important from the point of view of neuroimaging, where only time-integrated measures of activity are available. These averages include a number of components, including the background activity and stimulus-related rate modulation. The latter component may be a strong function of the effective connectivity within and among neuronal populations and, consequently, of the background activity itself. The interaction between background and evoked rate modulation is therefore an important phenomenon when trying to interpret responses observed with functional neuroimaging. For example, consider the cortical responses to a train of stimuli measured when the subject was attending and not attending to these stimuli. Increased time-integrated responses may be due to attentional modulation of background activity, increased stimulus-related rate modulation, or both. Demonstrating an obligatory increase in rate modulation with background activity in neuronal simulations would greatly simplify the interpretation of imaging results because it would suggest that both mechanisms were being expressed.
In order to address the interaction between background activity and stimulus intensity in modulating event-related responses, we varied both while measuring the total integrated activity and dynamic correlations. Section 2 describes the synthetic neural model on which our simulations were based. Section 3 describes a characterization of the model dynamics in terms of short-term interactions using J-PSTHs and mutual information. Section 4 establishes a relationship between the expression of short-term interactions (dynamic correlations) and macroscopic descriptors of the population dynamics (mean activity), revealed by varying the strength of the simulated stimulus and the background tonic activity levels. On the basis of these simulations we were able to characterize the specific form of the relationship between fast dynamic interactions and mean activity in two neuronal populations and look at the interaction between background activity and stimulus intensity in mediating changes in these measures.
2 The Neural Model
Individual neurons, both excitatory and inhibitory, were modeled as single-compartment units. Spike generation in these units was implemented according to the Hodgkin-Huxley formalism for the activation of sodium and potassium transmembrane channels. (Specific equations governing these channel dynamics can be found in appendix A.) In addition, synaptic channels provided fast excitation and inhibition. These synaptic influences were modeled using exponential functions, with the time constants and reversal potentials for AMPA (excitation) and GABAa (inhibition) receptor channels taken from the experimental literature (see Lumer, Edelman, & Tononi, 1997a, 1997b, for the use and justification of similar parameters to those used in the present model). Intrinsic (intra-area) connections were 20% inhibitory and 80% excitatory (Beaulieu, Kisvarday, Somogyi, & Cynader, 1992). Extrinsic (inter-area) connections were all excitatory. Transmission delays for individual connections were sampled from a noncentral gaussian distribution: intra-area delays had a mean of 2 ms and a standard deviation of 1 ms, and inter-area delays had a mean and standard deviation of 5 ms and 1 ms, respectively.

We modeled two areas that were reciprocally connected. Both consisted of a hundred cells that were 90% intrinsically connected and 5% extrinsically connected. Excitatory NMDA synaptic channels were incorporated in the model (see appendix A), in addition to the excitatory AMPA and inhibitory GABAa synaptic channels. These NMDA channels were used only in the feedback connections. Transient dynamics were evoked by providing a burst of noise to population 1. The simulated spike trains from units in both populations were averaged over the population, binned into 4 ms bins, and then smoothed using a gaussian kernel with a full width at half height of 16 ms. These spike trains were then analyzed using the J-PSTH (Gerstein & Perkel, 1969, 1972; Gerstein, Bedenbaugh, & Aertsen, 1989; Aertsen et al., 1991).
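The connectivity just described can be sketched as follows. Parameter values are taken from the text, but all names are invented, and the article's actual implementation (including the Hodgkin-Huxley channel dynamics of appendix A) is not reproduced:

```python
import random

def build_two_area_network(n_cells=100, p_intra=0.9, p_inter=0.05, seed=1):
    """Connection list (pre, post, receptor, delay_ms) for two reciprocally
    connected areas: intrinsic connections 80% excitatory (AMPA) / 20%
    inhibitory (GABAa) with delays ~ gaussian(2 ms, 1 ms); inter-area
    connections all excitatory with delays ~ gaussian(5 ms, 1 ms); feedback
    connections (area 1 -> area 0) use NMDA receptors, as in the text."""
    rng = random.Random(seed)
    conns = []
    for area in (0, 1):
        cells = range(area * n_cells, (area + 1) * n_cells)
        for pre in cells:
            for post in cells:
                if pre != post and rng.random() < p_intra:
                    receptor = "GABAa" if rng.random() < 0.2 else "AMPA"
                    conns.append((pre, post, receptor, max(0.1, rng.gauss(2.0, 1.0))))
    for pre in range(n_cells):                   # feedforward: area 0 -> area 1
        for post in range(n_cells, 2 * n_cells):
            if rng.random() < p_inter:
                conns.append((pre, post, "AMPA", max(0.1, rng.gauss(5.0, 1.0))))
    for pre in range(n_cells, 2 * n_cells):      # feedback: area 1 -> area 0
        for post in range(n_cells):
            if rng.random() < p_inter:
                conns.append((pre, post, "NMDA", max(0.1, rng.gauss(5.0, 1.0))))
    return conns
```

Delays are clipped at 0.1 ms here so that none are negative; whether the article handled the gaussian tail this way is not stated.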
The stimulus was provided to population 1 for a duration of 35 ms at intervals of 500 ms. For each analysis, the model was run for a total of 64 seconds of simulated time. This was repeated under different levels of background noise and stimulus intensity, using either AMPA or NMDA feedback receptors.

3 Characterizing Dynamic Correlations with the J-PSTH

3.1 Peristimulus Time Histograms. The display format (see Figure 2) has three components. Plotted along each side of the square matrix (the J-PSTH) are ordinary peristimulus time (PST) histograms, which represent the stimulus time-locked, average-rate modulation. As an index of the total background activity and stimulus-induced rate modulation, we measured the integral under the PSTH of the first population. This served as our macroscopic measure of neural activity that would be measured by, for example,
functional magnetic resonance imaging (fMRI) or positron emission tomography (PET), and represents the first dependent variable in our characterization. The second dependent variable was a measure of the dynamic correlations based on the J-PSTH or cross-correlation matrix, expressed in terms of mutual information (see below). (A detailed explanation of how to read J-PSTHs is given in appendix B.)

3.2 Coincidence Time Histogram and Cross Correlogram. The component of the analysis in the right panel of Figure 2 is the PST coincidence histogram or coincidence time histogram (CTH). This represents the stimulus time-locked average of near-coincident firing (which is simply the leading diagonal of the J-PSTH). This graph thus shows how the level of coherent firing or synchrony (plotted vertically) varies with PST (plotted horizontally). The cross correlogram, the third component, characterizes the degree of coherence, averaged over all PSTs, at some time lag. Because it is not sensitive to dynamic modulation of coherence, it is not used further in this article. Similarly, we do not use the CTH because, being a metric of coincident firing at near-zero time lags, it is an impoverished metric of dynamic correlations that could be expressed at nonzero time lags.
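Assuming trial-by-bin matrices of binned population spike counts (one row per stimulus repetition), the three components can be sketched numerically as follows; the normalization is the standard correction for stimulus-locked rate modulation, and this is illustrative code, not the authors':

```python
import numpy as np

def jpsth(x, y):
    """Normalized J-PSTH: entry [u, v] is the across-trial correlation between
    population 1 in PST bin u and population 2 in PST bin v.  Subtracting the
    PSTHs first means pure rate modulation alone contributes nothing."""
    x = x - x.mean(axis=0)                 # remove PSTH of population 1
    y = y - y.mean(axis=0)                 # remove PSTH of population 2
    raw = x.T @ y / x.shape[0]             # trial-averaged cross products
    norm = np.outer(x.std(axis=0), y.std(axis=0))
    return np.divide(raw, norm, out=np.zeros_like(raw), where=norm > 0)

def coincidence_time_histogram(j):
    """The CTH is the leading diagonal of the J-PSTH: coherence vs. PST."""
    return np.diag(j)

def cross_correlogram(j):
    """Average each diagonal over PST: coherence vs. time lag only."""
    n = j.shape[0]
    return np.array([np.mean(np.diag(j, k)) for k in range(-n + 1, n)])
```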
3.3 A Mutual Information Measure of Dynamic Correlations. We were interested in how stimulus-induced dynamic correlations varied as a function of background noise, stimulus strength, and the interaction between these two factors. As a measure of the dynamic correlations induced between our two simulated populations, we used the mutual information between the stimulus-induced transients, after correcting for mean rate modulation. The calculation and interpretation of mutual information are described in appendix C. In what follows we examine the way in which the mutual information or functional connectivity changes with integrated firing rate. We did this by manipulating the strength of the stimulus under different levels of background activity. This enabled us not only to assess the effects of changing background activity and stimulus strength on mutual information (and integrated rate) but also to characterize any interaction between these two manipulations. The background noise levels were characterized in terms of the average depolarization produced: −65.9, −63.4, −63.2, and −61.2 mV. These values were calculated after applying a given noise level to both populations, running the simulation for 64 seconds, and computing the mean membrane potential over units and time. The stimulus intensities used were 10, 25, 50, 75, 125, 150, 175, 200, 225, and 250 Hz. Noise level and stimulus intensity represent our two independent variables that were expected to produce changes in the two dependent variables (integrated rate and mutual information).
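Appendix C is not part of this excerpt, so the exact calculation used in the article is not reproduced here. As an illustrative stand-in, a mutual information between PSTH-corrected transients can be obtained from a discretized joint histogram of the single-trial residuals:

```python
import numpy as np

def mutual_information(x, y, n_levels=8):
    """Histogram-based mutual information (in bits) between the residuals of
    two (n_trials, n_bins) response arrays after subtracting each PSTH.
    This is a generic estimator, not the appendix C procedure."""
    rx = (x - x.mean(axis=0)).ravel()      # remove rate modulation, pool bins
    ry = (y - y.mean(axis=0)).ravel()
    hx = np.digitize(rx, np.linspace(rx.min(), rx.max(), n_levels)[1:-1])
    hy = np.digitize(ry, np.linspace(ry.min(), ry.max(), n_levels)[1:-1])
    joint = np.zeros((n_levels, n_levels))
    for a, b in zip(hx, hy):
        joint[a, b] += 1.0
    p = joint / joint.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / np.outer(px, py)[nz])).sum())
```

Note that such plug-in estimates inherit the upward bias discussed in the Victor note earlier in this issue, on the order of $(K_x - 1)(K_y - 1)/(2N \ln 2)$ bits for an N-sample, $K_x$-by-$K_y$ histogram.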
4 Results

4.1 Relationship Between Dynamic Correlations and Integrated Rate.
We found that increases in either background noise or the strength of the stimulus were universally associated with increases in both integrated rate and mutual information. Furthermore, both dependent variables (rate and mutual information) were highly coupled in an almost linear fashion. Figure 1b shows plots of mutual information against integrated rate for three different levels of background noise, demonstrating the coupling among these measures, irrespective of how different levels of either were elicited. Because of this tight relationship, we focused on how the independent manipulations of background noise and stimulus intensity affect mutual information (equivalent effects were observed on integrated rate). Two examples of the mean time course of activity (local field potential, LFP) in population 1 under two levels of background activity are illustrated in Figure 1a.

4.2 Effect of Background Activity and Stimulus Intensity on Dynamic Correlations. Figure 1c shows mutual information as a function of stimulus intensity for the three levels of background activity. As background noise increased, the gradient of the mutual information versus stimulus intensity plot also increased. A formal test for the differences in regression slopes confirmed the significance of this effect (t-statistic = 2.2, residual degrees of freedom = 53, p-value = 0.016 for the average increase from low- to high-background activities over NMDA and AMPA simulations, using multiple regression and the appropriate contrast). This is clear evidence of an interaction between tonic background activity and stimulus-induced rate modulation in the genesis of dynamic correlations. In other words, high background activity increased the sensitivity of evoked dynamic correlations to increased stimulus intensity. This is demonstrated more clearly below. These phenomena were evident irrespective of whether we used NMDA (upper panels) or AMPA-like (lower panels) feedback receptors.
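A slope comparison of this kind can be set up as a multiple regression with a group-by-intensity interaction term, whose t-statistic tests the difference in slopes. The following is a sketch of that construction for two conditions only; the article's actual contrast, averaging over NMDA and AMPA simulations, is not reproduced:

```python
import numpy as np

def slope_difference_t(x_low, y_low, x_high, y_high):
    """t-statistic (and residual dof) for the interaction coefficient in
    y = b0 + b1*x + b2*group + b3*(group*x), where group codes low (0)
    vs. high (1) background; b3 is the difference in regression slopes."""
    x = np.concatenate([x_low, x_high])
    y = np.concatenate([y_low, y_high])
    g = np.concatenate([np.zeros(len(x_low)), np.ones(len(x_high))])
    X = np.column_stack([np.ones_like(x), x, g, g * x])   # design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    sigma2 = resid @ resid / dof                          # error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)                 # parameter covariance
    return beta[3] / np.sqrt(cov[3, 3]), dof
```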
Figure 1: Facing page. (a) Mean membrane potential for one stimulus strength at two of the background noise levels. This graph shows the mean membrane potential over three interstimulus intervals (1500 ms). (b) Plots of stimulus-induced mutual information between the two populations against integrated rate (as indexed by the integral under the PSTH) for three different levels of background activity. The stimulus intensity was varied through 10, 25, 50, 75, 125, 150, 175, 200, 225, and 250 Hz. The upper panel shows the results when the model was implemented with NMDA feedback receptors and the lower panel with AMPA feedback receptors. (c) The same as in b, but now mutual information is plotted against the stimulus intensities used. In c, the regression slopes of mutual information on stimulus intensity are plotted.
Neuronal Activity and Dynamic Correlations
2811
2812
D. Chawla, E. D. Lumer, and K. J. Friston
Figure 2: (a) J-PSTH for the simulated populations at the lowest background noise level (see Figure 1) and similarly with a very weak stimulus applied every 500 ms for 35 ms. (b) J-PSTH at the highest background noise level (the same as in Figure 3) and with the low intensity stimulus as in Figure 2a. (c) J-PSTH for the neuronal populations at the low background activity as in Figure 2a and with a high-intensity stimulus. (d) J-PSTH at a high background activity level as in Figure 2b and with the high-intensity stimulus as in Figure 2c.
4.3 Examples of These Effects Demonstrated with J-PSTHs. Figure 2 presents J-PSTHs between the two populations at two different levels of background activity and with two different stimulus intensities. It can be seen that when the stimulus was very weak and the background noise was low, the presence of the stimulus had almost no effect on the synchronous interactions between the populations, because it was not strong enough to enable the populations to entrain each other to any extent (see CTH of Figure 2a). However, when the background noise was increased, this same stimulus had a definite effect on the dynamic correlations, as can be seen in Figure 2b (the background activity level in Figure 2b is higher than any of the background levels in Figure 1). At the low background activity level of Figure 2a, when the stimulus was very strong, extremely significant dynamic correlations occurred (see Figure 2c), in contrast to when the stimulus was weak and induced minimal dynamic correlations (see Figure 2a). At the high background noise level of Figure 2b, when the stimulus was very strong as in Figure 2c, the synchronization induced never died away, and high levels of synchrony were maintained (see Figure 2d). Figure 3 shows mutual information as a function of integrated rate at the very high background activity level of Figures 2b and 2d. At such high background levels, the plot of mutual information versus integrated rate eventually levels off, at which point increasing the stimulus intensity no longer facilitates an increase in the mutual information between the two populations. This demonstrates the saturation phenomenon seen in Figure 2d: with a very high background noise, the stimulus intensity may reach a level at which the induced synchronization never dies away, and increasing the stimulus intensity further has little or no effect.

Figure 3: Mutual information plotted against integrated rate for the highest noise level, as depicted in Figures 2b and 2d, under both AMPA and NMDA feedback receptors. The stimulus intensity is varied through the same values as in Figure 1.
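The J-PSTH normalization underlying Figure 2 (detailed in Appendix B) amounts to a bin-by-bin correlation over stimulus epochs: the cross product of the individual PSTHs is subtracted from the raw J-PSTH, and the result is divided by the cross product of the PSTH standard deviations. A minimal sketch, not the authors' code:

```python
import numpy as np

def normalized_jpsth(x, y):
    """Normalized joint PSTH for two populations.

    x, y: arrays of shape (epochs, time_bins) of binned activity.
    Returns a (time_bins, time_bins) matrix: the raw J-PSTH minus the
    cross product of the individual PSTHs, divided bin by bin by the
    cross product of the PSTH standard deviations, i.e., the
    correlation between the two populations' activities over epochs.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    n = x.shape[0]
    xm, ym = x.mean(0), y.mean(0)      # individual PSTHs
    raw = x.T @ y / n                  # raw J-PSTH (mean product per bin pair)
    cov = raw - np.outer(xm, ym)       # remove the rate-modulation term
    sx, sy = x.std(0), y.std(0)
    return cov / np.outer(sx, sy)      # bin-by-bin normalization

# Hypothetical check: feeding identical epochs to both populations
# should give correlation 1 along the leading diagonal.
rng = np.random.default_rng(1)
a = rng.normal(size=(200, 10))
J = normalized_jpsth(a, a)
```

Because rate-locked modulation is subtracted out, only dynamic correlations survive in the normalized matrix.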
5 Discussion
Several lines of evidence support our finding that a systematic relationship exists between fast dynamic interactions (measured here in terms of mutual information or functional connectivity) and macroscopic measures. Aertsen and Preissl (1991) investigated the behavior of artificial networks, analytically and using simulations. They concluded that short-term effective connectivity varies strongly with, or is modulated by, pool activity. Pool activity is defined as the product of the number of neurons and their mean firing rate. Effective connectivity is defined as the influence one neural system exerts over another at a synaptic level. The mechanism is simple: the efficacy of subthreshold excitatory postsynaptic potentials (EPSPs) in establishing dynamic interactions is a function of postsynaptic depolarization, which in turn depends on the tonic background activity. This idea can be elucidated in the following way. If the network activity is very low, the inputs to a single neuron (say, neuron j) will cause only a very subthreshold EPSP in that neuron. If some presynaptic neuron (say, neuron i) fires so that it provides input to neuron j, this input will be insufficient to cause neuron j to fire. However, if the pool activity is high enough to maintain a slightly subthreshold transmembrane potential in neuron j, then an input from neuron i to neuron j is more likely to push the membrane potential of neuron j over the firing threshold and elicit an action potential. In previous work, we demonstrated that sustained synchrony shows a monotonic relationship with mean activity (Chawla et al., 1998). As mean activity in the network increases, the mean instantaneous membrane time constants decrease, giving rise to a higher level of synchrony. The decrease in time constants is a natural consequence of conjointly increasing membrane conductances through excitatory and inhibitory channels at high levels of activity.
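The mechanism described above (a subthreshold EPSP that becomes effective only when tonic background activity holds the membrane near threshold) can be illustrated with a toy leaky integrator. All units and parameters below are hypothetical and unrelated to the model in Appendix A:

```python
def lif_fires(background_current, epsp_current=0.4,
              tau=10.0, threshold=1.0, dt=0.1, t_epsp=50.0, t_end=100.0):
    """Leaky integrator dV/dt = (-V + I(t)) / tau, forward Euler.

    A brief EPSP-like pulse of amplitude epsp_current is added to the
    tonic background_current at t_epsp for 2 ms. Returns True if the
    membrane variable V ever crosses threshold.
    """
    v = 0.0
    t = 0.0
    while t < t_end:
        i = background_current + (epsp_current if t_epsp <= t < t_epsp + 2.0 else 0.0)
        v += dt * (-v + i) / tau
        if v >= threshold:
            return True
        t += dt
    return False
```

With low background drive the same EPSP stays subthreshold, whereas a background that depolarizes the unit close to threshold lets the identical input trigger a spike, which is the gain effect invoked above.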
Hence, as activity level increases, smaller membrane time constants increase the synchronous gain in the network; that is, individual neurons become more sensitive to temporal coincidences in their synaptic inputs, responding with a higher firing rate to synchronous rather than asynchronous inputs. Therefore, in the event-related context, as background noise increases, the network should become more prone to stimulus-induced synchronous transients. This is reflected both in the time that the poststimulus synchronization endures and in a progressive increase in the mutual information versus stimulus intensity regression slope. The latter effect constitutes an interaction and can be viewed as a stimulus-dependent effect that is context sensitive. In this instance, the context is set by the tonic background of activity and could be mediated through a progressive diminution of the effective membrane time constants. Our findings provide the basis for two fairly important conclusions. The first is that increasing the tonic or background activity can potentiate the transient or dynamic correlations induced by a salient stimulus or behavioral event. This simple phenomenon might provide a useful mechanism
in the brain for exerting control over functional integration between neuronal populations in a context-dependent fashion. For example, attentional modulation of background activity in distinct sensory neuronal populations could be used selectively to enhance neuronal interactions in a topographically constrained way (Frith & Friston, 1996). The fact that a higher mutual information between our simulated neuronal populations was induced under conditions of higher noise is presumably very similar to stochastic resonance (Wiesenfeld & Moss, 1995), wherein small amounts of stochastic noise facilitate nonlinear transformations, in this instance effected by interacting neuronal populations. This phenomenon is interesting because of its almost counterintuitive nature. Indeed, it might be thought that increased background noise would lead to greater difficulty in distinguishing a transient signal from noise. However, this is not necessarily the case. Mainen and Sejnowski (1995) have shown that noisier neuronal input can increase the precision of the cell spikes, thus increasing the sensitivity of the cells to their inputs. Their data suggest “a low intrinsic noise level in spike generation, which could allow cortical neurons to accurately transform synaptic input into spike sequences, supporting a possible role for spike timing in the processing of cortical information by the neocortex” (Mainen & Sejnowski, 1995). The second conclusion has a more practical importance and pertains to the interpretation of neuroimaging studies, in which only macroscopic observations of rate modulation are generally allowed.
By demonstrating a systematic and consistent relationship between rate modulation (integrated rate over PST) and coherence modulation (the mutual information associated with dynamic correlations), one can be reasonably confident that functional neuroimaging is not totally insensitive to event-related fast coherent interactions of the sort mediated by dynamic correlations. This is important given the possible role that dynamic correlations may play in sensorimotor and cognitive operations (Vaadia et al., 1995). This conclusion leads to the more general point that specific metrics based on rate or coherence modulation may be different perspectives on the same underlying dynamics. In this view, synchronized, mutually entrained signals enhance overall firing levels and can be thought of as mediating an increase in the effective connectivity between the two areas. Equivalently, high levels of discharge rates increase the effective connectivity between two populations and augment the fast, synchronous exchange of signals. In this sense there is an almost circular causality in the relationship between rate and transient synchronization. Although we manipulated mean background activity in our simulations, much of the variability in integrated rates over PST can be accounted for by the dynamic correlations induced by the stimulus. In other words, a high mean level of activity facilitates transient coherent interactions, above and beyond those predicted by dynamic rate modulation itself. These dynamic correlations in turn cause a mutual entrainment of the interacting populations and augment activity levels for
a period of time. At very high levels of activity, any stimulus-evoked transient might ignite the system, leading to high levels of synchrony and mean activity that are self-maintaining (see Figures 2 and 3). The statement that the distinction between temporal and rate coding is simply a matter of perspective is a strong one that is made with some expectation of being refuted. Although it is clearly possible that the information conveyed by the precise timing of spikes is very different from that conveyed by discharge rates, from the point of view of population dynamics, it may be the case that changes in spike timing cannot be divorced from changes in firing rate, given the neuronal infrastructure employed by the brain. The point being made here is that owing to the intimate relationship between the temporal patterning of presynaptic events (in terms of either phase locking, as discussed in Chawla et al., 1998, or dynamic correlations, as considered in this article) and postsynaptic discharge probabilities, an increase in synchronized input will inevitably result in higher population discharge rates. The mechanisms that underlie this relationship may involve increased membrane conductances, decreased effective membrane time constants, and an increase in synchronous gain mediated by impoverished temporal integration. Put simply, under the constraints imposed by the emergent nonlinear dynamics of neuronal circuits, one cannot change the fine temporal structure of discharge patterns without changing population activity (this point is made by the results in Figure 1, which show a monotonic relationship between integrated firing rate and mutual information). If changes in one metric of neuronal dynamics, such as spike timing, are universally associated with changes in another metric, such as population activity, then the two metrics are mutually redundant and reflect different measures of the same underlying dynamics.
It should be noted that these observations pertain only to population codes. In summary, we have shown that background activity levels in simulated neuronal populations facilitate, and are facilitated by, the expression of stimulus-induced dynamic correlations. These findings have implications for the context-dependent aspects of stimulus-related neuronal interactions and also inform the interpretation of neuroimaging measures of neurophysiology.

Appendix A: Modeling Neuronal Dynamics
The neuronal dynamics were produced by simulations based on the equations from the Yamada, Koch, and Adams (1989) single-neuron model, using the Hodgkin and Huxley formalism:

dV/dt = -(1/C_M) {g_Na m^2 h (V - V_Na) + g_K n^2 y (V - V_K) + g_l (V - V_l) + g_AMPA (V - V_AMPA) + g_GABA (V - V_GABA)}

dm/dt = α_m (1 - m) - β_m m
Table 1: Parameter Values of the Neuronal Model.

Receptor/Channel    g_peak (mS)    τ (ms)    V_j (mV)
AMPA                0.05           3           0
GABA_a              0.175          7         -70
NMDA                0.01           100         0
Na+                 200            —          50
K+                  170            —         -90
Leak                1              —         -60
dn/dt = α_n (1 - n) - β_n n
dg_AMPA/dt = -g_AMPA / τ_AMPA
dh/dt = α_h (1 - h) - β_h h
dy/dt = α_y (1 - y) - β_y y
dg_GABA/dt = -g_GABA / τ_GABA

V represents the membrane potential of the neuron, C_M represents the membrane capacitance (1 μF), and g_Na, g_K, and g_l represent the maximum Na+ channel, K+ channel, and leakage conductances, respectively. V_Na represents the Na+ equilibrium potential, and similarly for V_K and V_l. m, h, n, and y are the fractions of Na+ and K+ channel gates that are open. g_AMPA and g_GABA are the conductances of the excitatory (AMPA) and inhibitory (GABA_a) synaptic channels, respectively. The peak conductances are constant for each receptor type and are given in Table 1. τ represents the excitatory and inhibitory decay time constants. α_n, β_n, α_m, β_m, α_h, β_h, α_y, β_y are nonnegative functions of V that model voltage-dependent rates of channel configuration transitions. The implementation of NMDA channels was based on Traub, Wong, Miles, and Michelson (1991):

I_NMDA = g_NMDA(t) M (V - V_NMDA)
dg_NMDA/dt = -g_NMDA / τ_2
M = 1 / (1 + ([Mg2+]/3) exp[-0.07 (V - ξ)])

I_NMDA is the current that enters linearly into the equation for dV/dt above. g_NMDA is a ligand-gated virtual conductance. M is a modulatory term that mimics the voltage-dependent affinity of the Mg2+ channel pore. ξ is -10 mV, and [Mg2+] is the external concentration of Mg2+ often used in hippocampal slice experiments (2 mM). These and other parameters are given in Table 1.
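The synaptic terms above can be sketched in a few lines: first-order decay of the AMPA, GABA, and NMDA conductances, and the voltage-dependent Mg2+ block of the NMDA channel. This is an illustration assuming the Table 1 time constants and the Appendix A constants (ξ = -10 mV, [Mg2+] = 2 mM), not the authors' simulation code:

```python
import numpy as np

TAU = {"AMPA": 3.0, "GABA": 7.0, "NMDA": 100.0}   # decay constants (ms), Table 1

def decay_step(g, receptor, dt=0.1):
    """One forward-Euler step of dg/dt = -g / tau for a synaptic conductance."""
    return g + dt * (-g / TAU[receptor])

def nmda_block(V, mg=2.0, xi=-10.0):
    """Modulatory term M = 1 / (1 + ([Mg]/3) exp(-0.07 (V - xi))),
    mimicking the voltage-dependent Mg2+ block of the NMDA pore."""
    return 1.0 / (1.0 + (mg / 3.0) * np.exp(-0.07 * (V - xi)))
```

As expected from the form of M, the block is nearly complete near rest and relieved by depolarization, so the NMDA current contributes mainly when the membrane is already driven by other inputs.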
Appendix B: How to Read J-PSTHs
The J-PSTH is a raster plot of the two PSTHs plotted against each other; as such, coincident firings are shown along the leading diagonal of the J-PSTH. Time-lagged synchronized firings are shown as diagonal bands that are shifted relative to this diagonal. The displacement of the band is therefore a measure of the latency of phase locking. The width and structure of the band depend on the details of the coherent interactions. Direct synaptic connections from population 1 to population 2 will produce a 45 degree band of differing density lying below the principal diagonal at a distance proportional to the latency of the interaction. Connections from population 2 to population 1 will produce a similar band lying above the principal diagonal. If the interactions between the two populations are affected by the stimulus, then the diagonal band will show changes in density along its length, evidencing dynamic correlations or coherence modulation as a function of PST. Correlations due purely to rate modulation evoked by the stimulus are removed by subtracting the cross-product matrix of the individual PSTHs from the raw J-PSTH and then dividing the resulting difference matrix (bin by bin) by the cross-product matrix of the standard deviations of the PSTHs. This is mathematically the same as computing the cross-correlation matrix between the binned activities from both populations over stimulus epochs.

Appendix C: Explanation of Mutual Information
Our measure of mutual information is equivalent to testing the null hypothesis that all the elements of the J-PSTH are jointly zero. The mutual information can be thought of as the generalization of a correlation for multivariate data (in this case, the activity over different PSTs). As such, it serves as a measure of functional connectivity. The critical idea in this instance is that by predicating our measure of functional connectivity on the J-PSTH, we properly include dynamic correlations that would otherwise be missed if we simply looked at average correlations, as implicit in the cross correlogram. The importance of using the J-PSTH in the context of this article is discussed fully in Aertsen et al. (1994). A simple example makes the distinction between these two approaches clear: imagine a stimulus-induced strong, positive correlation for 100 milliseconds followed, systematically, by equally strong negative correlations for the subsequent 100 milliseconds. The average correlation over all PSTs, as measured by the cross correlogram, would be zero. However, the mutual information mediated by the dynamic correlations in the J-PSTH would be extremely significant. The reason that the mutual information is an implicit test of the null hypothesis that the dynamic correlations are jointly zero follows from the fact that the maximum likelihood statistic for the latter test is Wilks’ lambda. Under gaussian assumptions, the log of this statistic is proportional to the mutual information between the stimulus-induced transients that produce
the dynamic correlations. In this article, we restricted ourselves to using the mutual information and refer interested readers to Chatfield and Collins (1982) for a discussion of Wilks’ lambda in the context of multivariate analysis of covariance (ManCova).

C.1 Mutual Information and Wilks’ Lambda. The J-PSTH is given by X^T Y (T denotes transposition), where X is a mean-corrected and normalized data matrix from unit or population 1 with a row for every stimulus epoch and a column for every PST bin. Y is the corresponding matrix from the second population. The mutual information between X and Y is given by:

I(X; Y) = H(X) + H(Y) - H(X, Y),

where H(X) is the entropy of X, H(Y) is the entropy of Y, and H(X, Y) is the entropy of X and Y considered jointly. Under gaussian assumptions:

H(X) = ln((2πe)^n |X^T X|) / 2,

where |X^T X| is the determinant of X^T X (i.e., the autocovariance matrix of X) and n is the number of columns. Let:

A = [X Y]^T [X Y] = | X^T X   X^T Y |
                    | Y^T X   Y^T Y |

Then:

|A| = |X^T X| |Y^T Y - Y^T X (X^T X)^{-1} X^T Y|,

giving:

I(X; Y) = ln(|Y^T Y| / |Y^T Y - Y^T X (X^T X)^{-1} X^T Y|).
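The determinant formula for I(X; Y) can be computed directly. A minimal sketch with hypothetical data (slogdet is used instead of raw determinants for numerical stability):

```python
import numpy as np

def mutual_information(X, Y):
    """Gaussian mutual information between two data matrices X, Y
    (epochs x time bins), via
    I(X; Y) = ln(|Y'Y| / |Y'Y - Y'X (X'X)^{-1} X'Y|),
    which equals -ln(Wilks' lambda) for the regression Y = X b + e."""
    X = X - X.mean(0)                              # mean-correct, as in the text
    Y = Y - Y.mean(0)
    Syy = Y.T @ Y
    Sxy = X.T @ Y
    Sxx = X.T @ X
    # Residual sum of squares and products under the alternate hypothesis.
    R = Syy - Sxy.T @ np.linalg.solve(Sxx, Sxy)
    _, logdet_full = np.linalg.slogdet(Syy)
    _, logdet_resid = np.linalg.slogdet(R)
    return logdet_full - logdet_resid

# Hypothetical check: independent data give I near zero; linearly
# dependent data give a large I.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
Y_indep = rng.normal(size=(500, 4))
Y_dep = X + 0.1 * rng.normal(size=(500, 4))
```

Because the residual determinant can never exceed the total determinant, the measure is nonnegative, consistent with its interpretation as -ln(λ) below.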
An alternative (statistical) perspective, mathematically equivalent to testing for dynamic correlations, is to test the null hypothesis that X^T Y = 0. This can be effected in the context of multivariate analysis using Wilks’ maximum likelihood ratio, or Wilks’ lambda, λ = |R| / |R_0|, where R and R_0 are the residual sums of squares and products under the alternate and null hypotheses, respectively. In this instance, we treat the test for statistical dependencies between X and Y as a multiple regression problem under the general linear model,

Y = Xβ + ε,

where β are the regression coefficients and ε are errors with a multinormal distribution. The parameter estimates are given by:

Y* = Xβ,    β = (X^T X)^{-1} X^T Y,

where Y* is the fitted data.
Under the null hypothesis, β = 0 and therefore R_0 = Y^T Y. Similarly, under the alternate hypothesis:

R = [Y - Y*]^T [Y - Y*] = Y^T Y - Y^T Y*;

therefore:

λ = |R| / |R_0| = |Y^T Y - Y^T X (X^T X)^{-1} X^T Y| / |Y^T Y|

and

I(X; Y) = -ln(λ).

The negative log of Wilks’ lambda is the mutual information. Under the null hypothesis of no dynamic correlations, ln(λ) has approximately a chi-squared distribution, where -(r - 1/2) ln(λ) ~ χ²(n²) and r is the residual degrees of freedom (number of epochs minus time bins) (Chatfield & Collins, 1982). In practice the number of time bins may exceed the number of epochs. In this instance, one generally reduces the dimensionality of the data X (and Y) using singular value decomposition.

Acknowledgments
This work was supported by the Wellcome Trust.

References

Aertsen, A., Erb, M., & Palm, G. (1994). Dynamics of functional coupling in the cerebral cortex: An attempt at a model-based interpretation. Physica D, 75, 103–128.
Aertsen, A., & Preissl, H. (1991). Dynamics of activity and connectivity in physiological neuronal networks. In H. G. Schuster (Ed.), Nonlinear dynamics and neuronal networks (pp. 281–302). New York: VCH Publishers.
Beaulieu, C., Kisvarday, Z., Somogyi, P., & Cynader, M. (1992). Quantitative distribution of GABA-immunopositive and -immunonegative neurons and synapses in the monkey striate cortex (area 17). Cerebral Cortex, 2, 295–309.
Boven, K. H., & Aertsen, A. M. H. J. (1990). Dynamics of activity in neuronal networks give rise to fast modulations of functional connectivity. In Parallel processing and neural systems and computers (pp. 53–56). Amsterdam: Elsevier.
Chatfield, C., & Collins, A. J. (1982). Introduction to multivariate analysis. London: Chapman and Hall.
Chawla, D., Lumer, E. D., & Friston, K. J. (1998). The relationship between synchronization among neuronal populations and their mean activity levels. Neural Computation, 11, 1389–1411.
Frith, C. D., & Friston, K. J. (1996). The role of the thalamus in “top down” modulation of attention to sound. NeuroImage, 4, 210–215.
Gerstein, G. L., Bedenbaugh, P., & Aertsen, A. M. H. J. (1989). Neuronal assemblies. IEEE Trans. on Biomed. Engineering, 36, 4–14.
Gerstein, G. L., & Perkel, D. H. (1969). Simultaneously recorded trains of action potentials: Analysis and functional interpretation. Science, 164, 828–830.
Gerstein, G. L., & Perkel, D. H. (1972). Mutual temporal relationships among neuronal spike trains. Biophysical Journal, 12, 453–473.
Lumer, E. D., Edelman, G. M., & Tononi, G. (1997a). Neural dynamics in a model of the thalamocortical system I. Layers, loops and the emergence of fast synchronous rhythms. Cerebral Cortex, 7, 207–227.
Lumer, E. D., Edelman, G. M., & Tononi, G. (1997b). Neural dynamics in a model of the thalamocortical system II. The role of neural synchrony tested through perturbations of spike timing. Cerebral Cortex, 7, 228–236.
Mainen, Z. F., & Sejnowski, T. J. (1995). Reliability of spike timing in neocortical neurons. Science, 268, 1503–1506.
Traub, R. D., Wong, R. K., Miles, R., & Michelson, H. (1991). A model of a CA3 hippocampal pyramidal neuron incorporating voltage-clamp data on intrinsic conductances. J. Neurophysiol., 66, 635–650.
Vaadia, E., Haalman, I., Abeles, M., Bergman, H., Prut, Y., Slovin, H., & Aertsen, A. (1995). Dynamics of neuronal interactions in monkey cortex in relation to behavioural events. Nature, 373, 515–518.
Wiesenfeld, K., & Moss, F. (1995). Stochastic resonance and the benefits of noise: From ice ages to crayfish and SQUIDs. Nature, 373, 33–36.
Yamada, W. M., Koch, C., & Adams, P. R. (1989). Multiple channels and calcium dynamics. In C. Koch & I. Segev (Eds.), Methods in neuronal modeling (pp. 97–134). Cambridge, MA: The MIT Press.

Received August 3, 1998; accepted February 9, 2000.
LETTER
Communicated by Alexandre Pouget
Analysis of Pointing Errors Reveals Properties of Data Representations and Coordinate Transformations Within the Central Nervous System

J. McIntyre
Scientific Institute S. Lucia, National Research Council, University of Rome Tor Vergata, Rome, Italy, and Laboratoire de Physiologie de la Perception et de l’Action, Centre National de la Recherche Scientifique, Collège de France, Paris, France

F. Stratta
Scientific Institute S. Lucia, National Research Council, University of Rome Tor Vergata, Rome, Italy

J. Droulez
Laboratoire de Physiologie de la Perception et de l’Action, Centre National de la Recherche Scientifique, Collège de France, Paris, France

F. Lacquaniti
Scientific Institute S. Lucia, National Research Council, University of Rome Tor Vergata, Rome, Italy

The execution of a simple pointing task invokes a chain of processing that includes visual acquisition of the target, coordination of multimodal proprioceptive signals, and ultimately the generation of a motor command that will drive the finger to the desired target location. These processes in the sensorimotor chain can be described in terms of internal representations of the target or limb positions and coordinate transformations between different internal reference frames. In this article we first describe how different types of error analysis can be used to identify properties of the internal representations and coordinate transformations within the central nervous system. We then describe a series of experiments in which subjects pointed to remembered 3D visual targets under two lighting conditions (dim light and total darkness) and after two different memory delays (0.5 and 5.0 s) and report results in terms of variable error, constant error, and local distortion. Finally, we present a set of simulations to help explain the patterns of errors produced in this pointing task. These analyses and experiments provide insight into the structure of the underlying sensorimotor processes employed by the central nervous system.
Neural Computation 12, 2823–2855 (2000) © 2000 Massachusetts Institute of Technology
2824
J. McIntyre, F. Stratta, J. Droulez, and F. Lacquaniti
1 Introduction
Assessment of errors in pointing toward visual targets has long been used to study the neural control of movement. Sherrington (1918), and later Merton (1961), used observations about eye movement accuracy to argue for or against a role of eye muscle proprioception in oculomotor control. Merton recognized the importance of measuring response variability (variable error), rather than mean error (constant error), to assess the performance of the sensorimotor chain. More recently, analyses of constant and variable pointing errors have been used to identify the sources of information that contribute to the internal representation of a memorized target position (Foley & Held, 1972; Prablanc, Echallier, Komilis, & Jeannerod, 1979; Poulton, 1981; Soechting & Flanders, 1989a; Darling & Miller, 1993; Berkinblit, Fookson, Smetanin, Adamovich, & Poizner, 1995; Desmurget, Jordan, Prablanc, & Jeannerod, 1997; Baud-Bovy & Viviani, 1998; McIntyre, Stratta, & Lacquaniti, 1997, 1998; Lacquaniti, 1997). Recently, we have proposed a third measure of pointing errors, called the local distortion, which we have used to characterize neural mechanisms involved in pointing to visually presented 3D targets. In this article, we set out to formalize the relationship of these three different measures of pointing error. We then apply these measures to the analysis of errors observed in a 3D pointing experiment (the data have been previously reported in McIntyre et al., 1998). Finally, through a set of simulations, we show how the analysis of pointing errors can contribute to the understanding of the neural processes involved in pointing.

2 Local Distortion
As a complement to the standard measures of constant and variable errors, we propose an additional measure, which we call local distortion. Local distortion describes the mapping of spatial relationships between nearby points as data are transformed between coordinate systems. Consider a circular array of targets (see Figure 1a, open circles). The mean position of repeated movements to each target creates an array of final pointing positions (filled circles in Figure 1). If the sensorimotor transformation is accurate, the array of mean final positions should reproduce exactly the spatial relationships between the individual targets of the array. The measurement of local distortion refers to the fidelity with which the relative spatial organization of the targets is maintained in the configuration of final pointing positions, independent of displacement of the entire array. If the constant error in the mapping from target to end point position is the same for all eight targets, the array of final pointing positions will be an undistorted replica of the target array, even though it may be displaced by a large, constant error common to all targets (see Figure 1b). On the other hand, differences in the mapping of individual targets can result in a distorted representation of the target array within the pattern of final pointing positions.

Figure 1: Definition of local distortion. An undistorted representation of the relative target positions is reproduced in the final pointing positions (closed circles) if errors in transformations are absent (A) or restricted to biases common to all targets (B). Distorted end point configurations may be produced by inaccurate transformations, resulting in expansion (C), contraction (D), or anisotropic expansion and/or contraction (E, F). Rotations (G) or reflections (H) may also appear, with or without accompanying expansion or contraction.

This distortion can manifest itself as an expansion or contraction of the local space (see Figures 1c and 1d), and the expansion or contraction may be unequal for different dimensions, resulting in an anisotropic distortion of the target array (see Figures 1e and 1f). The transformation from target to end point position might also include a rotation or reflection of the local space, either alone or in combination with local expansion or contraction (see Figures 1g and 1h). The transformation from target to final pointing position will in general be a nonlinear process in which the binocularly acquired target position is transformed into an appropriate joint posture. For a small area of the workspace, however, one would expect the transformation to be continuous and smooth. In this case, local distortions of the spatial organization of the targets can be approximated by a linear transformation from target to end point position. Such linear approximations can be described by a transformation matrix, which can be presented graphically as an oriented ellipse (ellipsoid in 3D). Figures 1a–1h show the representation of each type
of distortion as an ellipse. The ellipse encompasses the area that would be occupied by a circular array of targets after passage through the transformation. Note that an ellipse cannot be used to indicate or assess the presence of rotations or mirror reflections within the local transformation. The ellipses in Figures 1f and 1g, for instance, are identical, although they result from two different distorted transformations. However, if the absence of rotations or reflections in the local transformation can be demonstrated independently, the graphical presentation of the local transformation as an ellipse provides an intuitive and unambiguous visual representation of local distortions in the mapping from target to end point positions.

3 Sources of Error
Conceptually, one can identify two distinct components in the processes leading from target localization to end point specification during a reaching or pointing task: (1) internal representations of the target location and (2) transformations of information between internal representations. For instance, the encoding of a target’s projection on the retina, the eyes’ position in the orbit, and the head’s position with respect to the trunk all constitute potential internal representations, while the combination of these three sources of information into a (hypothetical) representation of the target location with respect to the body would involve a coordinate transformation. The three measures of error defined above can be used to identify the characteristics of these different components within a sensorimotor pathway. Consider, for example, an idealized system that converts a 2D Cartesian target location into an internal polar representation, then back into a 2D Cartesian end point position (see Figure 2). In this hypothetical system, we have three representations of the target position (the Cartesian input and output plus the polar intermediate representation) and two coordinate transformations. By tracking the effects of bias, noise, and distortions as information passes through this system, one can see how each of these effects can be observed in measurements of error at the output and, conversely, how measurements of error at the output can be used to identify the internal structure of the system.

3.1 Bias. The observation of constant errors at the output of a sensorimotor transformation suggests the presence of bias in an internal representation of the target position or motor command. Patterns of constant error might therefore indicate the coordinate system of the biased internal representation.
Analysis of Pointing Errors
2827

2828
J. McIntyre, F. Stratta, J. Droulez, and F. Lacquaniti

If the bias is assumed to occur in only one channel of a given coordinate system (an unlikely assumption), the patterns of constant error seen for different workspace regions would indicate the orientation of that internal coordinate system with respect to the world. For example, a constant bias in the estimated distance of the target position would generate a radial pattern of constant errors in the final pointing positions (see Figure 3A). The existence of such a pattern of errors would therefore provide evidence for the internal polar representation of our model system. Furthermore, the constant error vectors would point to the origin of the polar coordinate frame. However, bias is not likely to be restricted to a single coordinate axis, and biases in different coordinate frames can add, resulting in a potentially confusing array of constant errors in the workspace (see Figure 3B). Nevertheless, if biases in the system depend only on the workspace location, rotation of the constant error vectors around a single workspace axis would still argue for an internal polar representation of the target position. Parallel constant errors, on the other hand, would indicate a Cartesian representation, which for our model system would correspond to bias in either the input or output stages.

Figure 2: Model system for demonstrating the relationship between variable error, constant error, and local distortions. The model converts a 2D target position into an internal polar representation, then back to a 2D Cartesian pointing position.

Figure 3: (A) Pattern of constant error (undershoot) for pointing to six different targets (closed circles) induced by a fixed bias in the estimate of target distance. Constant errors (arrows) point toward the origin (open circle) of the polar coordinate system. (B) Pattern of constant error induced by a fixed bias in both the target distance and along the x-axis in the output stage. In this case, the constant error vectors do not point directly toward the origin of the internal polar representation, but rotate around the y-axis.

3.2 Random Noise. The measurement of the variable error can detect the presence of random noise added at any of the stages of a sensorimotor transformation and thus reveal features of the internal representation of the target location or intended movement. Random noise should be statistically independent between two truly separate channels of information. If the level of noise is greater in one channel than another, the eigenvectors of the covariance matrix will indicate the directions of maximal and minimal variance, and if the two channels are orthogonal, they will thus reveal the local spatial orientation of the underlying independent channels. Changing patterns of variable error across different workspace regions can indicate the existence and the origin of an internal data representation. For instance, in our model system, a greater level of variability in the estimate of target distance versus direction¹ would result in a radial pattern of variable error eigenvectors, as seen in Figure 4a. In a multistage transformation, noise added at different stages in the sensorimotor chain may add to form complex patterns of variability at the output. The orientation of anisotropic noise at the output will depend on the relative magnitude of noise at the different intermediate stages.
For instance, the radial pattern of variable error produced by noise in the polar representation will become progressively more parallel in the vertical direction as more and more noise is added to the Y channel in the output stage (see Figures 4d and 4g). Adding noise to the X output channel will cause the variable error ellipses to widen, ultimately resulting in ellipses that are horizontally oriented (see Figure 4c). This occurs after passing through a point where anisotropy in the polar and Cartesian representations cancels to generate isotropic output noise (see Figure 4b, near center position). Adding isotropic noise at the output will cause the ellipses to expand in all directions while having a minimal effect on the orientation of the major eigenvector (see Figure 4e), until the isotropic output noise dominates and the output variability becomes essentially circular (see Figure 4i).

¹ Although one cannot in general compare variability of target distance, measured in mm, with the variable error of target direction, measured in degrees, one can compare the effect of a given noise level in each channel on the Cartesian position of the output at a given nominal distance.
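The reshaping of channel noise by the model system can be checked numerically. The following Python sketch is not the authors' code: the target location and channel noise levels are illustrative assumptions. It passes a target through the Cartesian-to-polar-to-Cartesian chain of Figure 2 and verifies that noise confined to the distance channel produces a radially oriented variable-error ellipse:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_endpoints(target_xy, sigma_d=5.0, sigma_a=0.0,
                       sigma_x=0.0, sigma_y=0.0, n_trials=2000):
    """Cartesian -> polar -> Cartesian, with independent noise per channel."""
    x, y = target_xy
    d = np.hypot(x, y) + rng.normal(0.0, sigma_d, n_trials)    # distance channel
    a = np.arctan2(y, x) + rng.normal(0.0, sigma_a, n_trials)  # direction channel
    out_x = d * np.cos(a) + rng.normal(0.0, sigma_x, n_trials) # Cartesian output noise
    out_y = d * np.sin(a) + rng.normal(0.0, sigma_y, n_trials)
    return np.column_stack([out_x, out_y])

# Noise restricted to the distance channel yields a variable-error ellipse
# whose major axis is radial, pointing toward the origin of the polar frame.
pts = simulate_endpoints((300.0, 400.0))
w, V = np.linalg.eigh(np.cov(pts.T))
major = V[:, np.argmax(w)]                   # axis of maximum variability
radial = np.array([300.0, 400.0]) / 500.0    # unit vector from origin to target
print(abs(major @ radial))                   # ~1.0: radially aligned
```

Increasing `sigma_x` or `sigma_y` reorients the ellipse toward the Cartesian axes, reproducing the progression described for Figure 4.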
Figure 4: Patterns of variable error expected from an independent estimate of target distance and direction, plus variable levels of noise in X and Y at the output. Black bars in plots of variable error indicate the axis of maximum variability. Anisotropy is apparent in the variable-error ellipse, despite accurate transformations between coordinate systems. Noise in the polar representation of distance and direction generates a radial pattern of variable errors (a). Increasing variability in X (b, c) or Y (d, g) in addition to the noise in distance and direction causes variable errors to align gradually with the x- or y-axis, respectively. The addition of isotropic noise (σx = σy; e, i) changes the size but not the orientation of the variable-error ellipsoids.
3.3 Distortion. Errors in transformations between coordinate systems can lead to distortions in the mapping from target locations to pointing positions. Consider, for example, an output transformation in which the distance to the target is improperly computed:

\[ P^x_f = f(d)\cos\alpha, \qquad P^y_f = f(d)\sin\alpha, \tag{3.1} \]

where P^x_f and P^y_f represent the Cartesian coordinates of the end point, d and α are the internal representations of the distance and direction to the target, respectively, and f(d) represents the distorted computation of distance in the transformation from polar to Cartesian coordinates. If the distorted distance
is simply a scaled representation of the true distance,

\[ f(d) = \lambda d, \tag{3.2} \]
the output pointing positions will result in a systematic pattern of radial constant errors that increase with the distance from the origin, and an isotropic contraction (λ < 1) or expansion (λ > 1) of relative pointing positions that can be quantified by the measurement of local distortion (see Figure 5A). Bias, which we have previously considered to be a property of a data representation, may in fact be due to an erroneous coordinate transformation. The addition of bias in the output of a transformation will not in itself change the local spatial organization of the output and as such should not be considered as a case of local distortion. In our model system, the addition of bias in the Cartesian output would generate a shift (constant error) of all output positions, but would not affect the relative spatial organization between targets. On the other hand, the addition of bias to an intermediate
representation, followed by a second nonlinear transformation, can lead to local distortion in the output. For instance, in our model system, bias added to the internal representation of target distance,

\[ f(d) = d + \gamma, \tag{3.3} \]

creates not only a radial pattern of constant error but also an anisotropic contraction (γ < 0) or expansion (γ > 0) of the relative pointing positions that depends on the distance from the origin (see Figure 5B). The combination of both scaling and offset errors,

\[ f(d) = \lambda d + \gamma, \tag{3.4} \]
leads to a more complicated pattern of constant error and local distortion at the output. Both undershoot and overshoot may appear in the radial pattern of constant errors. For λ < 1, this type of error will also manifest itself as an anisotropic contraction along radially oriented axes (see Figure 5C), but there will be lateral expansion or contraction depending on whether there is overshoot or undershoot for a particular workspace region. In this example, the orientation of both the constant error and the local distortion ellipse would indicate a property of the erroneous coordinate transformation. Of course, other nonlinear distortions of the estimated target distance might also introduce anisotropic distortions at the output. But to the extent that well-behaved (i.e., continuous and smooth) transformations can be locally approximated by a linear function, the ellipses in Figure 5C are representative of the types of local distortion that might be observed in a real sensorimotor chain.

Figure 5: Facing page. Patterns of constant error and local distortion induced by incorrect estimates of target distance in the transformation from polar to Cartesian coordinates. Dotted circles indicate an undistorted transformation for reference, while solid circles and ellipses indicate the local distortion for a given workspace region. (A) For a proportional underestimation of target distance (see equation 3.2), constant errors point inward, and local distortion indicates isotropic contraction. (B) A constant undershoot in the estimate of target distance (see equation 3.3) will evoke both a radial pattern of constant errors and anisotropic contraction perpendicular to the radial axis (black bars). (C) Contraction and bias in estimated target distances or contraction toward an average distance (see equation 3.4) may yield both overshoot and undershoot in measures of constant error.
Measurements of local distortion show anisotropic contraction along a radial axis toward the origin of the polar coordinate system, as indicated by the dark bars (eigenvector corresponding to the smallest eigenvalue). Local distortion may indicate lateral expansion or contraction, depending on whether the target distance is over- or underestimated in a given region.
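The qualitative behavior described for equations 3.2 through 3.4 can be checked with a few lines of arithmetic. In this Python sketch, the parameter values λ = 0.8 and γ = 30 are illustrative assumptions, not fitted values; the distorted output stage f(d) = λd + γ produces overshoot for near targets, undershoot for far targets, and anisotropic local scaling:

```python
import numpy as np

lam, gamma = 0.8, 30.0   # scaling and offset of equation 3.4 (assumed values)

def endpoint(target_xy):
    """Output stage with a distorted distance estimate f(d) = lam*d + gamma."""
    d = np.hypot(*target_xy)
    a = np.arctan2(target_xy[1], target_xy[0])
    f = lam * d + gamma
    return np.array([f * np.cos(a), f * np.sin(a)])

near, far = np.array([0.0, 100.0]), np.array([0.0, 400.0])
print(endpoint(near) - near)   # [0, +10]: radial overshoot for the near target
print(endpoint(far) - far)     # [0, -50]: radial undershoot for the far target

# Local scaling at d = 400: lam radially, f(d)/d laterally, so the local
# distortion is anisotropic except where f(d)/d happens to equal lam.
d = 400.0
print(lam, (lam * d + gamma) / d)   # 0.8 radial vs 0.875 lateral
```

The mixed overshoot/undershoot and the radial-versus-lateral scaling mismatch are exactly the signatures described for Figure 5C.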
One can see from these examples that measurements of constant error and local distortion are closely related. Both arise from errors in transformations. Figure 5C demonstrates, however, that the measurement of local distortion complements the classical assessment of constant error; computing both may lead to a better understanding of the underlying neural mechanisms. In the next section we describe how measurements of local distortion can also be useful in interpreting measurements of variable error. 3.4 Transformed Variable Error. Distortions introduced in coordinate transformations will also affect patterns of variable error as the error is passed from one coordinate frame to another. Noise at the input will be reshaped by an inaccurate coordinate transformation such that a distorted transformation may introduce anisotropy to otherwise balanced input noise. In our model system, the improper computation of target distance would create laterally oriented variable-error ellipsoids from isotropic input noise (see Figure 6A). Furthermore, the combination of bias, noise, and local distortions can create diverse patterns of orientation in each of the three measures of error. Figure 6B indicates the patterns of error seen for our model system in the combined presence of (1) radially oriented input noise that increases with the square of the target distance, as would be expected from a binocular estimate of target distance and direction, (2) rightward bias in the internal representation of target direction, and (3) distortion in the polar-to-Cartesian transformation as dened by equation 3.4. One can see from this example that local distortions later in the transformation can reshape anisotropic input noise in a workspace-dependent fashion. To understand the effect of a given transformation on the shape of a variable-error distribution, one must know the Jacobian matrix of the transformation in question (McIntyre et al., 1997):
\[ S_{out} = J^T S_{in} J, \tag{3.5} \]

where S_in and S_out are the input and output covariance matrices. The linear estimation of local distortion provides a direct estimate of this Jacobian matrix. The calculation of both variable error and local distortion can therefore be useful in identifying the steps in a sensorimotor transformation. Consider, for example, hypothetical measurements of variable error in the input and output of our model system. Differences between the input and output distributions are ambiguously related either to a distortion introduced in a coordinate transformation or to additive noise at an intermediate step along the pathway. In the former case, the transformation error will also be evident in the measurement of local distortion. In fact, by combining measurements of variable error and local distortion, one may
Figure 6: (A) Patterns of constant error, variable error, and local distortion induced by an incorrect scaling and offset in target distance for the transformation from polar to Cartesian coordinates at the output. Isotropic input noise is reshaped by the radial contraction, resulting in a lateral pattern of variable errors. For this simple and unique distortion, all three measurements of error provide congruent evidence for the internal polar coordinate system. (B) Combined effects of three different sources of error: noise in the estimate of distance that increases with the square of the distance, linear distortion and bias in the transformation from polar to Cartesian coordinates, and bias in X and Y. The three different measurements may show three different spatial patterns. The presence of bias can tilt the constant error vectors away from the origin of the intrinsic coordinate system. The anisotropic variability of the input is reshaped by the local distortion in the erroneous coordinate transformation. The measurement of local distortion is relatively independent of bias and noise that are injected into the system.
be able to identify specific representations of the target position that are not apparent in the output. By passing the known or measured input variance through the measured local distortion transformation, a prediction of the resulting output variability can be made. Features of the measured output variability that are seen in the predicted distribution can be attributed to coordinate transformations along the sensorimotor pathway. Patterns of variability not accounted for in the prediction would reflect noise added to the system at intermediate steps, giving evidence for an additional internal reference frame.
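Equation 3.5 can be applied numerically by estimating the Jacobian of a stage with finite differences. This is a minimal sketch, not the authors' implementation: the polar-to-Cartesian stage and the noise magnitudes are illustrative assumptions, and J here holds the partial derivatives of output with respect to input column-wise, so the quadratic form is written with the transpose on the right:

```python
import numpy as np

def polar_to_cart(p):
    d, a = p
    return np.array([d * np.cos(a), d * np.sin(a)])

def jacobian(f, p, eps=1e-6):
    """Central-difference estimate of the Jacobian of f at p."""
    p = np.asarray(p, dtype=float)
    cols = []
    for k in range(p.size):
        dp = np.zeros_like(p); dp[k] = eps
        cols.append((f(p + dp) - f(p - dp)) / (2 * eps))
    return np.column_stack(cols)

p0 = np.array([500.0, np.pi / 4])   # nominal distance (mm) and direction (rad)
J = jacobian(polar_to_cart, p0)

# Input covariance: large variance in distance, small variance in direction.
S_in = np.diag([25.0, 1e-6])
# Propagate through the linearized transformation (the quadratic form of
# equation 3.5, with J as d(output)/d(input)).
S_out = J @ S_in @ J.T

# The major axis of the output ellipse is radial, as in Figure 4a.
w, V = np.linalg.eigh(S_out)
major = V[:, np.argmax(w)]
radial = np.array([np.cos(np.pi / 4), np.sin(np.pi / 4)])
print(abs(major @ radial))   # ~1.0
```

Replacing `polar_to_cart` with a distorted stage such as f(d) = λd + γ makes `J` the measured local distortion, which is the prediction step described in the text.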
3.5 Neural Implementations. We have presented an idealized conception of sensorimotor processing in which data representations and coordinate transformations are distinct. Under this idealized framework, we have associated output variability with noise added to any or all of the data representations or in the execution of the motor output. For instance, noise could arise from the variability of the frequency of discharge of action potentials or from the variability of postsynaptic potentials. Conversely, we have attributed patterns of constant error or distortion to errors in transformations. In a neural circuit, distortions may arise from inaccuracies in the transmission or fusion of signals arising from separate neural pathways. These inaccuracies may, for example, arise from inappropriate tuning of synaptic weights within the circuitry, or they may reflect inherent limitations on the complexity of transformations that may be represented by a neuronal network. It is likely, however, that many or all neural mechanisms manifest properties of both internal representations and transformations within the same circuitry. Memory storage provides a good example of this mix. Logically, the storage of a target position in memory would be described as an internal representation. One can expect that noise will increase over time as the memory of the target position fades. Drift might also occur, which would show up as time-dependent increases in measurements of constant error. Memory storage may also exhibit properties of a transformation or distortion. For instance, the subject’s responses may converge toward a nominal or average pointing location as the quality of information in memory storage diminishes. Such a decay would appear as a gradually increasing contraction in measurements of local distortion.
Thus, memory storage may best be described as a combination of internal representations and data transformations, although an explicit transformation between coordinate systems may not exist. The evolution of all three types of error (variable error, constant error, and local distortion) for different imposed delays may provide insight into the neural mechanisms that underlie the memory processes. The conceptual divide between representations and transformations can be a useful tool for describing these mechanisms even though neither of these constructs may exist in its pure form within the nervous system. The methods described for analyzing errors must be used appropriately, so as not to ignore the assumptions underlying the analysis. For instance, the measurement of local distortion is precisely that: a local approximation of a global function. In essence, we are estimating the spatial derivative of the global transformation. The global function cannot be uniquely determined from a given derivative, and different global transformations might lead to similar patterns of local distortion. In addition, distortions at different levels along the sensorimotor chain will add. Local distortion eigenvectors computed blindly will not necessarily align with any of the internal reference frames. A similar caution applies to the analysis of variable error through the use of principal component analysis (PCA). As illustrated in
Figure 4, different sources of noise may add, leading to eigenvectors of the covariance matrix that do not align with any of the internal data representations. Furthermore, PCA assumes that the underlying coordinate axes are orthogonal. Violation of this assumption will also generate principal axes in the covariance matrix that would be misleading if blindly applied. The backward identification of a sensorimotor process based on measurements of error is thus ill posed. A more appropriate use of these methods is to test specific hypotheses about the neuronal process. Given a model of the transformations involved, one can predict the patterns of errors that will be seen at the output and compare these predictions with experimental results.

4 Experimental Methods
Subjects performed a pointing task to remembered visual targets presented by a robot in 3D space. The subject sat in a dimly lit room in front of a black table and backboard. The subject placed the index finger at a specified starting position, indicated by an upraised bump on the tabletop. A target LED was positioned by a robot at a specific 3D location, illuminated for 2 s, and then removed. After a programmed delay, an audible beep signaled to the subject to perform the pointing movement. The subject was instructed to place the tip of the index finger so as to touch the remembered position of the target LED. The subject was allowed 2 s to execute the pointing movement and hold the finger at the remembered target position. After a second beep, the subject returned the finger to the starting position in preparation for the next trial. Movements were recorded by an Elite 3D tracking system with a resolution of approximately 0.5 mm RMS. The final pointing position for each trial was calculated as the average final position over the last 10 samples (100 msec). Subjects performed trials in blocks of 45 trials, each block lasting about 15 min. In a single block of trials, target locations were restricted to a small region of the workspace. Eight targets were distributed uniformly on a 22 mm radius sphere, with a ninth target position located at the center. Each of the nine targets was presented five times within a given block, and subjects performed four blocks of trials to each workspace region, for a total of 180 trials per region. Three workspace regions were tested, located slightly above shoulder level, 60 cm in front of the subject. The center workspace region was located on the midline, while the left and right regions were located 38 cm to each side of the center. Two starting positions, left and right, were tested, located on the tabletop approximately 20 cm in front of the subject and 20 cm to either side of the midline.
Subjects were tested with two different delay durations, 0.5 and 5.0 s, and two lighting conditions. In the lighted condition, dim room lights allowed the subject barely to see the fingertip against the black background. For the dark condition, the target was presented with the dim lights on, but at
the moment of target extinction, the room lights were extinguished, leaving the subject to wait during the memory delay and perform the pointing movement in total darkness. Five right-handed subjects performed a complete set of pointing trials after a 0.5 s delay, using the right hand and starting from the right position, to all three workspace regions in dim light (light-short). The same ve subjects performed trials in darkness for a 0.5 s (dark-short) and 5.0 s (dark-long) memory delay. Two right-handed subjects pointed to targets in all three workspace regions after a 5.0 s delay, using the right arm and starting from the right side (light-long). Five different right-handed subjects pointed in darkness after a 5.0 s delay with the right hand starting from the left side (right-left). A third set of eight right-handed subjects pointed to the left and right target regions only in darkness after a 5.0 s delay, using the left arm and starting from the left side (left-left). The dark-long condition is also referred to as the right-right condition, indicating use of the right hand from the right starting position. 4.1 Data Analysis. Constant error vectors, computed as the average error over all nine targets within a single workspace region, were calculated for each workspace region as described in McIntyre et al. (1997):
\[ e = \frac{1}{n} \sum_{i=1}^{9} \sum_{j=1}^{n_i} \left( p^i_j - t_i \right), \tag{4.1} \]

where t_i is the 3D vector location of target i, p^i_j is the final pointing position for trial j to target i, and n_i and n are the number of valid trials to target i and the total number of valid trials to all nine targets, respectively. The 3D covariance estimated from data over all k = 9 targets is computed following Morrison (1990):

\[ S = \frac{ \sum_{i=1}^{k} \sum_{j=1}^{n_i} \delta^i_j \left( \delta^i_j \right)^T }{ n - k }, \tag{4.2} \]

where the deviation δ^i_j = p^i_j − p̄_i for trial j to target i is computed relative to the mean p̄_i of trials to target i, not to the overall mean for all targets. The 3D covariance matrix S can be scaled to compute the matrix describing the 95% tolerance ellipsoid, based on the total number of trials n:

\[ T_{0.95} = \frac{ q\,(n+1)(n-k) }{ n\,(n-q-k+1) }\; F_{0.05;\,q,\,n-q-k+1}\; S, \tag{4.3} \]

where q = 3 is the dimensionality of the Cartesian vector space and k = 9 is the number of targets.
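Equations 4.1 and 4.2 can be sketched directly on synthetic trial data. In this Python example the injected bias, noise level, and trial counts are illustrative assumptions; the point is that the grand constant error recovers the injected bias, and deviations are pooled about each target's own mean before dividing by n − k:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 9
targets = rng.uniform(-20.0, 20.0, size=(k, 3))   # nine target locations (mm)
bias = np.array([2.0, 0.0, -1.0])                 # a common bias (assumed)
trials = [targets[i] + bias + rng.normal(0.0, 1.5, size=(20, 3))
          for i in range(k)]                      # 20 valid trials per target

n = sum(len(p) for p in trials)                   # total valid trials (180)

# Equation 4.1: average error over all valid trials to all targets.
e = sum((trials[i] - targets[i]).sum(axis=0) for i in range(k)) / n

# Equation 4.2: deviations about each target's own mean, pooled over targets.
dev = np.vstack([p - p.mean(axis=0) for p in trials])
S = dev.T @ dev / (n - k)

print(np.round(e, 1))   # close to the injected bias [2, 0, -1]
```

Because the deviations are taken about per-target means, the bias does not inflate `S`; its diagonal stays near the injected noise variance (1.5² = 2.25).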
The local transformation matrix M̃ was computed as the linear relationship that best describes the transformation between a target position (relative to the average target position) and the displacement of end point positions for pointing trials to that target (relative to the overall average end point position for all trials), as computed by linear least-squares methods. The linear estimation of the local transformation may contain rotations or reflections as well as a local expansion and/or contraction. The overall local transformation can thus be represented as the cascade of two components: a symmetric matrix A, representing the local distortion, and an orthogonal matrix R,

\[ \tilde{M} = R A. \tag{4.4} \]
The symmetric component A was computed from the eigenvectors and eigenvalues of the quantity M̃ᵀM̃ as follows:

\[ A = W \begin{bmatrix} \pm\sqrt{\lambda_1} & 0 & 0 \\ 0 & \pm\sqrt{\lambda_2} & 0 \\ 0 & 0 & \pm\sqrt{\lambda_3} \end{bmatrix} W^T, \tag{4.5} \]

where

\[ \tilde{M}^T \tilde{M} = W \begin{bmatrix} \lambda_1 & 0 & 0 \\ 0 & \lambda_2 & 0 \\ 0 & 0 & \lambda_3 \end{bmatrix} W^T, \tag{4.6} \]

and W is thus the matrix of eigenvectors for the matrix M̃ᵀM̃. By judiciously selecting the signs of the roots in equation 4.5 so as to match the signs of the corresponding eigenvalues of M̃, reflections in the local transformation, if any, were included in the symmetric component A, and the orthogonal matrix R represents a rotation around a single axis. For A nonsingular, the matrix R is given by

\[ R = \tilde{M} A^{-1}. \tag{4.7} \]
Expansion or contraction in the local transformation can be described by the matrix we call the local distortion, defined as

\[ \Lambda = M^T M, \tag{4.8} \]

where the square root of the eigenvalues of Λ indicates the amount of expansion or contraction of output vectors along the eigenvectors of Λ (an eigenvalue equal to 1 implies no local distortion in that direction). The local distortion matrix can be plotted as an oriented 3D ellipsoid, where the major and minor axes indicate the directions of maximal and minimal expansion.
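The decomposition of equations 4.4 through 4.8 amounts to a polar decomposition of the estimated local transformation. A minimal Python sketch follows; the example matrix is an illustrative assumption, and positive roots are taken in equation 4.5 (i.e., no reflections):

```python
import numpy as np

# An assumed local transformation: a mild planar rotation combined with
# contraction (0.9 in the x-y plane, 0.8 along z).
M = np.array([[ 0.9, 0.1, 0.0],
              [-0.1, 0.9, 0.0],
              [ 0.0, 0.0, 0.8]])

# Equation 4.6: eigendecomposition of M^T M.
lam, W = np.linalg.eigh(M.T @ M)

# Equation 4.5: symmetric component, taking positive square roots.
A = W @ np.diag(np.sqrt(lam)) @ W.T

# Equation 4.7: orthogonal component; for A nonsingular, R = M A^{-1}.
R = M @ np.linalg.inv(A)
print(np.allclose(R @ R.T, np.eye(3)))   # True: R is orthogonal
print(np.allclose(R @ A, M))             # True: M = R A, as in equation 4.4

# Equation 4.8: local distortion; square roots of its eigenvalues give the
# expansion/contraction along each eigenvector (1 means no distortion).
Lam = M.T @ M
print(np.round(np.sqrt(np.linalg.eigvalsh(Lam)), 3))   # [0.8, 0.906, 0.906]
```

Handling reflections would require flipping the sign of the root paired with any negative eigenvalue of M, as the text describes; `scipy.linalg.polar` offers an equivalent decomposition in one call.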
5 Experimental Results
Measurements of variable error and local distortion showed anisotropic patterns of error that varied systematically as a function of workspace location. Figure 7 summarizes the patterns of variable error (A) and local distortion (B) observed for two memory delays and two lighting conditions. The data in this figure represent the average responses computed across subjects for each memory delay and lighting condition. In Figure 7A, each ellipsoid represents the combined 95% tolerance ellipsoid for pointing to
all nine targets within each workspace region. For pointing in the light, the axes of maximum variability for the computed variable-error ellipsoids, as indicated by dark bars passing along the major axis of each ellipsoid, converge toward the head of the subject. These data indicate a viewer-centered reference frame for the representation of the final pointing position when vision of the fingertip is allowed. A more detailed examination (data not shown) showed that this reference frame was independent of the starting hand position and the effector arm (McIntyre et al., 1997). In darkness, the major axes of the variable-error ellipsoids do not converge, and thus do not indicate a well-defined reference frame for the final pointing position when the fingertip is no longer visible. Average local distortion ellipsoids for all subjects are presented in Figure 7B, where the axis of maximum contraction (third eigenvector) is indicated by dark bars. Local distortion measurements show a spatial organization that is common to both lighting conditions and memory delays and is more consistent than the pattern of variable errors seen for movements in the dark. For both lighting conditions, the minor axes of the local distortion ellipsoids point toward the subject, to a location situated somewhere between the eyes and the right shoulder. To test for factors affecting the spatial orientation of pointing errors in the dark, we varied the starting position of the hand and the hand (left or right) used to perform the pointing task. Figure 8 shows the results of experiments with different starting positions and effector arms for pointing with a long delay in the dark. It can be seen that patterns of variable error differ according to the workspace region, the effector arm, and the starting hand position. Viewed from above, the variable-error ellipsoids indicate a lateral distribution that rotates as the workspace location shifts from left to right. When viewed in the plane of movement (the plane containing the two starting positions and the center target location), the variable-error ellipsoids tend to rotate away from the line of movement in a pattern that is mirror symmetric for the right versus left starting position. In contrast, the minor axes for the measurements of local distortion are relatively unaffected by changes in the starting hand position. However, the axes of maximum contraction are biased toward the shoulder of the effector arm, independent of the starting position of the hand.

Figure 7: Facing page. Average variable errors (A) and local distortion (B) across subjects for two lighting conditions and two delays, viewed from above and perpendicular to the plane of movement. Small crosses indicate the starting position of the hand. (A) Ellipsoids represent the 95% tolerance region calculated from the matrix of variance and covariance, and the black bars indicate the axis of maximum variability. Dark line segments indicate the direction of the major eigenvector computed for the tolerance ellipsoid. For movements in the dark, a pattern of major eigenvector rotations upward and away from the starting position emerges in the ensemble averages, in comparison to the head-centered eigenvector directions seen in the light. (B) Ellipsoids indicate the local distortions induced by the sensorimotor transformation, as estimated by a linear approximation to the local transformation. The unit sphere indicates the ellipsoid corresponding to an ideal, distortion-free local transformation. Dark bars indicate the direction of the third (minor) eigenvector, indicating the axis of maximum local contraction. Under all lighting conditions, axes of maximal contraction point toward the subject. Note the change of scale for ellipsoids viewed in the plane of movement.

Figure 8: Effects of workspace region and movement starting position on estimates of variable error (A) and local distortion (B). The orientation of the variable-error ellipsoids is affected by the relative starting position of the hand but not by the hand used to perform the pointing. Axes of maximum contraction are biased slightly toward the side of the effector hand, independent of the starting hand position.

6 Simulations
The experimental results demonstrate a pattern of errors that at first glance does not point to a clearly defined reference frame for movements in the dark. The major axes of variable errors do not converge to a single origin, in contrast to the clearly viewer-centered errors seen in the light. Furthermore, the minor axes for measurements of local distortion, while clearly body-centered, do not point to a logically defined anatomical origin. However, distortions introduced at different stages in a sensorimotor chain combine to produce an overall local transformation in a manner that is not always intuitively obvious. Furthermore, distributions of errors that arise early in the sensorimotor pathway will be reshaped by distortions that occur in later steps of the transformation. We hypothesized that the reshaping and reorientation of variable-error ellipsoids observed for long delays in the dark may have been due entirely to the anisotropic distortions that develop in the sensorimotor transformation during the memory delay period. To test this hypothesis, we performed a set of simulations to see (1) whether a contraction of data along a shoulder-centered or a head-centered axis, or a combination of the two, can explain the observed patterns of local distortions and (2) whether local distortions in the target-to-end-point mapping can explain the changes seen in the variable-error ellipsoids when movements are performed in the dark. To this end we synthesized local transformation matrices Mv and Ma corresponding to local distortions aligned with the eyes and the shoulder, respectively. Each distortion was characterized by the unit vector v0 parallel to the line from the designated origin to the center target position for a given workspace region and the amount of distortion (λv and λa) along this axis. The off-axis distortion eigenvalues were set to unity, and the simulated local transformations had no rotational component or reflections.
The simulated local transformation can thus be computed as

M = W diag(λon, λoff, λoff) W^T    (6.1)
J. McIntyre, F. Stratta, J. Droulez, and F. Lacquaniti
Figure 9: Simulation of net visuomotor transformations resulting from a cascade of two local distortions: one with an anisotropic contraction of end point positions along a head-centered direction and the second with a similar contraction along a shoulder-centered axis. In panel A, a–i represent the net local distortion at the output for different levels of contraction in each of the component steps. Starting from no distortion in either component for the upper left panel, shoulder-centered contraction increases from 1.0 on the left to 0.6 on the right, while head-centered distortion increases from 1.0 on the top to 0.6 on the bottom. To facilitate comparisons, the measured local distortion for pointing in the dark (see Figure 7) is presented in panel B.
where

W = [ v0  v1  v2 ]    (6.2)

is an orthogonal matrix formed from the desired major or minor axis v0 and two mutually orthogonal unit vectors v1 and v2 lying in the plane perpendicular to v0. The net local transformation for a serial organization of eye-centered and then shoulder-centered distortion is given by

Mnet = Ma Mv.    (6.3)
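Equations 6.1 and 6.2 can be implemented directly. The following is a minimal numpy sketch; the function name and the particular choice of the perpendicular basis vectors v1 and v2 are ours, since the construction fixes only v0 and the eigenvalues:

```python
import numpy as np

def local_distortion(v0, lam_on, lam_off=1.0):
    """Build the local distortion matrix M = W diag(lam_on, lam_off, lam_off) W^T
    (equations 6.1-6.2), where W = [v0 v1 v2] is orthonormal and v0 points from
    the distortion origin (eye or shoulder) to the target."""
    v0 = np.asarray(v0, dtype=float)
    v0 = v0 / np.linalg.norm(v0)
    # Pick any two unit vectors spanning the plane perpendicular to v0,
    # avoiding a helper vector nearly parallel to v0.
    helper = np.array([1.0, 0.0, 0.0])
    if abs(v0 @ helper) > 0.9:
        helper = np.array([0.0, 1.0, 0.0])
    v1 = np.cross(v0, helper)
    v1 /= np.linalg.norm(v1)
    v2 = np.cross(v0, v1)
    W = np.column_stack([v0, v1, v2])
    return W @ np.diag([lam_on, lam_off, lam_off]) @ W.T

# Example: contraction of 0.6 along a sight line toward the target (the +y axis
# here is an illustrative direction, not a fitted value from the experiments).
Mv = local_distortion([0.0, 1.0, 0.0], lam_on=0.6)
```

Note that the result does not depend on which perpendicular pair v1, v2 is chosen, because the two off-axis eigenvalues are equal.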
Figure 9A shows the simulated distortion ellipsoids for different combinations of head-centered and shoulder-centered contraction. In the grid, shoulder-centered contraction increases from left to right (λa varies from
Analysis of Pointing Errors
Figure 10: (A) Simulation of variable-error ellipsoids for each of the computed local transformations shown in Figure 9, viewed from above. Input variance for each workspace is taken from measurements of variable errors for pointing in the light with a short memory delay. (B) Average measured variable error for pointing in the dark (from Figure 7). (C) Simulated viewer-centered input noise transformed by the average local distortion measured for movements in the dark. Simulations e, f, and h capture the qualitative features of the measured and predicted variable-error ellipsoids for a 5.0 s delay, but not for a 0.5 s delay.
1.0 to 0.6), while head-centered contraction increases from top to bottom (λv varies from 1.0 to 0.6). It is evident, as expected, that neither source of distortion alone can predict the actual measured transformation shown in Figure 9B. Axes of maximum contraction would point directly at the shoulder for a shoulder-centered-only distortion (top row), while viewer-centered distortions (left column) would produce a strictly head-centered pattern of orientations. The combined effect of both types of distortion, however, can create an intermediate center of rotation, with Figure 9h showing reasonably good correspondence with the average measured values. Figures 10 through 12 show the predicted variable-error ellipsoids for each of the simulated local distortions shown in Figure 9, assuming a viewer-centered input distribution. A purely viewer-centered input variance Sv was computed using eigenvalues (√λon = 25.0, √λoff = 12.0) similar to those found for movements in the light with a 5.0 s delay (see Figure 7). The predicted end point variable error for each simulated distortion from
Figure 11: The same simulated and measured data as in Figure 10, viewed from the side. Simulations f and h show features qualitatively similar to the measured data in panel C, with variable-error ellipsoids pointing above the head of the subject.
Figure 9 was computed by inserting Sv and Mnet into equation 3.5:

S = Mnet^T Sv Mnet    (6.4)

S = Mv^T Ma^T Sv Ma Mv.    (6.5)
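As a numerical illustration of equations 6.3 and 6.4, the following sketch cascades a head-centered and a shoulder-centered contraction and propagates a viewer-centered input covariance through the result. The contraction factors, axis directions, and 20-degree tilt below are illustrative values of ours, not the fitted parameters from the experiments:

```python
import numpy as np

# Two symmetric local distortions of the form W diag(...) W^T (equation 6.1).
Mv = np.diag([1.0, 0.6, 1.0])            # head-centered: contraction in depth (y)

theta = np.radians(20)                   # shoulder-centered axis tilted in the x-y plane
c, s = np.cos(theta), np.sin(theta)
W = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])
Ma = W @ np.diag([0.6, 1.0, 1.0]) @ W.T  # shoulder-centered contraction

Mnet = Ma @ Mv                           # equation 6.3: eye- then shoulder-centered

# Viewer-centered input covariance: SD 25 mm along the sight line (y),
# 12 mm off-axis, as stated in the text.
Sv = np.diag([12.0**2, 25.0**2, 12.0**2])

S = Mnet.T @ Sv @ Mnet                   # equation 6.4: predicted end point variance
```

Because both component distortions only contract (all eigenvalues at most 1), the total predicted variance can never exceed the input variance; the interesting effect is the reorientation of the ellipsoid's axes.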
For comparison, the average variable-error ellipsoids for movements in the dark (also from Figure 7) are included in panel B of each figure. In addition, a second simulation is included in panel C, in which the viewer-centered variable-error ellipsoid Sv was transformed by the average measured local transformation M̃ for delayed movements in the dark (from Figure 7). Viewed from above (see Figure 10) and from the side (see Figure 11), transformations of the viewer-centered variable error by either the simulated or the measured local distortion capture two of the main qualitative features of the measured variable error. Viewed from above, there is a shift from a converging pattern of variable-error axes toward the subject to a lateral pattern of variability, as both the combined head- and shoulder-centered contraction increases in the simulation, and in the measured and predicted variable errors for the long delay in the dark. Viewed from the
Figure 12: The same simulated and measured data as in Figure 10, viewed perpendicular to the movement plane. Neither the simulated data in A nor the computed variance distributions in C reflect the starting-point dependence evident in B and in Figure 8.
side, the simulated head and shoulder contraction also predicts the upward rotation of the major eigenvectors toward the vertical. The simulations do not, however, reproduce all of the qualitative features of the measured variable-error ellipsoids. The major eigenvectors of variable errors measured for a short delay do not seem to converge, in contrast to the simulated patterns seen in Figure 10. Furthermore, when observed from a viewpoint perpendicular to the movement plane (see Figure 12), the simulations that best reproduce the data in the other planes do not reproduce the characteristic rotation of the variable-error ellipsoids away from the starting hand position. On the other hand, neither does the transformation of the viewer-centered input variance by the measured distortion, shown in Figure 12C. Thus, the effect of starting hand position (or movement direction) reflects an additional source of noise added to the sensorimotor pathway, rather than a distorted transformation into a hand-centered representation of the upcoming movement. Figures 13 and 14 show the effects on the simulated variable-error ellipsoids of adding increased noise along the axis of movement of the hand (which we will refer to as hand-centered noise). In these figures, head-centered contraction is held constant at λv = 0.8 while shoulder-centered
Figure 13: Simulation of variable-error ellipsoids for each of the computed local transformations, with the addition of anisotropic noise along the line from the initial hand position to the final end point position. Viewer-centered contraction is held constant at 0.8. Shoulder-centered contraction increases from 0.0 in A to 0.6 in C. Additive hand-centered noise is 0 mm (S.D.) in the first row of each group (a–c) and 6 mm in the second (d–f). For up to 6 mm of hand-centered noise, the patterns of variable error are dominated by the viewer-centered input noise and the head- and shoulder-centered contraction, resulting in the lateral pattern of major eigenvectors seen for human subjects.
Figure 14: Same as Figure 13, viewed perpendicular to the movement plane. When hand-centered noise is absent (a–c in each group), the pattern of variable errors depends on the effector hand, not the starting hand position. Increasing levels of hand-centered noise (d–f) can generate patterns that depend on hand starting position, as seen in Figure 8.
contraction increases from 0.0 (A) to 0.6 (C). Variance along the line from starting to final pointing position is σh = 0 for panels a through c in each group and σh = 6 mm for panels d through f. Viewed in the horizontal plane, the pattern of major eigenvectors for variable error is dominated by
viewer-centered input noise and the head- and shoulder-centered contractions, resulting in a lateral pattern of major eigenvectors as the shoulder-centered contraction increases. One can see how the parallel variable-error axis seen for the 0.5 s delay in the dark might result from a moderate amount of head-centered and shoulder-centered contraction plus additional noise along the hand movement axis. With the shoulder-centered distortion fixed at λa = 0.8 (see Figure 13B), increasing the amount of hand-centered noise (panel f versus c) causes the major eigenvector to swing toward the left shoulder, and the convergence of the eigenvectors decreases. Figure 14 demonstrates how the added hand-centered noise can cause the variable-error major axes to point away from the starting hand position when viewed in the plane of movement. The simulation of noise along the line from the left starting position produces a mirror effect of starting position when viewed in the same plane. Note that the effects of hand-centered noise not seen in the orientation of the variable major axis (see Figure 14A, panels d–f) can be revealed by increasing levels of body-centered local contraction (see Figures 14B and 14C, panels d–f). Thus, a given level of hand-centered noise will have a greater effect on the direction of the variable-error major eigenvector as local contraction increases. It is therefore possible that the effects of starting hand position on variable error may in fact be present for pointing in the dark for all memory delays, but that these effects are not revealed in the measurement of variable error until a longer memory delay generates a sufficiently high level of contraction in depth. The simulations depicted in Figures 10 through 14 do not reproduce exactly the measured patterns of variable error.
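One simple way to realize the hand-centered noise used in these simulations is as zero-mean additive noise of standard deviation σh along the start-to-end-point line, which adds a rank-one term to the predicted end point covariance. This is a sketch under that assumption; the function and variable names are ours:

```python
import numpy as np

def add_hand_noise(S, u, sigma_h):
    """Add isotropic-in-1D noise of SD sigma_h (mm) along the unit movement
    direction u to an end point covariance S: S + sigma_h^2 * u u^T."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    return S + sigma_h**2 * np.outer(u, u)

# Baseline end point covariance (mm^2; illustrative diagonal values).
S = np.diag([12.0**2, 25.0**2, 12.0**2])
# Line from the starting hand position to the final pointing position.
u = np.array([1.0, 1.0, 0.0])
S_total = add_hand_noise(S, u, sigma_h=6.0)   # 6 mm hand-centered noise
```

Only the variance along the movement line increases; variance in directions perpendicular to the movement is unchanged, which is why the effect appears as a rotation of the major axis toward the movement direction.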
For example, there is not a single combination of viewer-centered contraction, shoulder-centered contraction, and hand-centered noise that can match precisely the features of the measured variable error from all viewpoints simultaneously. Nevertheless, these simulations demonstrate how the different qualitative features can be achieved by varying the weight of the different sources of error. Discrepancies between the simulations and the measured data may arise from varying relative weights for the different error sources among different subjects included in the average. On the other hand, some additional differences between the simulations and measurements, such as the counterclockwise rotation of the variable-error major axis viewed in the movement plane (see Figure 8), may be due to additional features of the sensorimotor process that have yet to be accounted for.

7 Discussion
In this article we have formally described how various sources of error added along a sensorimotor pathway can combine to generate patterns of errors in the motor output of a pointing task. Conversely, we have shown how measurements of errors at the output can reveal properties of the underlying neural circuitry. We have used these concepts to model the behavior
of human subjects for pointing to remembered 3D targets. Although this particular model may not provide the only explanation of the observed behavior, the results of this study demonstrate how the cumulative effects of noise and distortion lead to patterns of variable error that might not be otherwise intuitively obvious. We propose that computational analyses of the type described here may be useful in understanding electrophysiological and psychophysical data and will lead to experimentally verifiable models of biological sensorimotor systems, as we will discuss below. The measure of local distortion described in this article provides a useful complement to the measures of variable and constant error. First, because it is based on local rather than global linearization methods, it is much less sensitive to nonlinearity in the sensorimotor pathways. The validity of the measurement of local distortion requires only that the overall transformation be smooth. Thus, the calculation of local distortion will be appropriate even if a global linearization is not (Bookstein, 1992). Furthermore, spatial patterns of local distortion for these experiments have proved to be more consistent across subjects than are measures of constant error (McIntyre et al., 1998). This indicates that local contraction, rather than global bias, is an invariant property of pointing behavior. Finally, the calculation of local distortion allows for a better understanding of observed patterns of variable error. As shown by the simulations presented above, patterns of measured variable errors may be due to inaccurate coordinate transformations between reference frames rather than to noise added to additional intermediate internal representations. In the experiments described here, the consideration of local distortion in addition to measures of constant and variable error provided a clearer picture of the underlying neural mechanisms.
By comparing the experimental results and the simulations presented here, one can conclude that much of the difference in variable errors observed for pointing in the dark versus the light can be attributed to distortions introduced in the transformation of the visually acquired target location to the final finger position. This transformation includes effects contributed by the storage of the desired pointing position, as indicated by the increasing effect of local distortion for longer memory delays (McIntyre et al., 1998). However, an additional source of noise is required to account fully for the dependence of variable-error orientation on the starting hand position, which is likely to be related to the direction of movement. In the following paragraphs we consider the significance of the identified viewer-centered noise, head- and shoulder-centered distortion, and hand-centered variability. The compression of responses toward the mean is a general phenomenon known as a central tendency or range effect (Poulton, 1981). A general compression of the data toward the center would not in itself be very interesting; however, when the central tendency is anisotropic, as in this study, the anisotropy can indicate the separation of spatial information into independent channels, each of which exhibits its own range effect. In our
experiments, the direction of maximum contraction is almost, but not quite, aligned with the viewer-centered axes of maximum variability observed for pointing in the light. This behavior suggests a Kalman filter type of processing in which the pointing response is the weighted sum of input signals and an “average” response, and in which the weight given to each input channel is determined by the expected noise in that channel. Because binocular estimates of distance are noisier than estimates of target direction, one would expect that the contraction toward the central value would be stronger in a radial direction. Furthermore, as noise increases in the stored representation of the target position, contraction should also increase. This appears to be the case for increasing memory delays in the dark (McIntyre et al., 1998). It might seem therefore that the anisotropic contraction can be attributed to processing of the visual input. Foley (1980) found that the slope of perceived versus actual target distance is consistently less than unity for a variety of different estimation tasks. Gogel (1973) proposed that the central nervous system computes target distance as a weighted sum of different inputs (e.g., stereodisparity, vergence, accommodation), including a “specific distance” toward which the estimate is biased in the absence of adequate sensory cues. Contractions of pointing positions may reflect the bias toward the specific distance postulated by Gogel, and we propose further that the weighting of different visual cues may be determined by the expected variability in each channel. For pointing, however, one cannot predict the rotation of the local distortion ellipsoid toward the effector arm by the filtering of viewer-centered noise mapped onto an isotropic internal representation of the target position. Noise in the binocular acquisition of the target should not depend on the effector arm.
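The Kalman-filter-style weighting described above can be illustrated with a one-dimensional shrinkage toward a central value, where a noisier channel receives a smaller weight and is therefore contracted more strongly. This is a toy sketch with made-up numbers and names of ours, not the authors' fitted model:

```python
import numpy as np

def shrunk_response(x, center, sigma_channel, sigma_prior):
    """Reliability-weighted estimate: the response is pulled toward a central
    value, more strongly when the input channel is noisier (smaller weight w)."""
    w = sigma_prior**2 / (sigma_prior**2 + sigma_channel**2)
    return center + w * (x - center)

center = 0.0
# Direction channel: low sensory noise -> weak contraction toward the center.
direction = shrunk_response(100.0, center, sigma_channel=12.0, sigma_prior=25.0)
# Depth channel: high sensory noise -> strong contraction toward the center.
depth = shrunk_response(100.0, center, sigma_channel=25.0, sigma_prior=25.0)
```

With equal inputs, the noisier depth channel ends up closer to the central value than the direction channel, which is the anisotropic range effect described in the text.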
Observed patterns of local distortion would, on the other hand, be consistent with a mapping of the target position into a shoulder-centered reference frame, followed by a Kalman-type filter applied in this transformed coordinate system. There is, in fact, also evidence for arm-centered local contraction in pure motor tasks. When pointing to kinesthetically presented targets, subjects tend to underestimate the perceived target distance from the shoulder (Baud-Bovy & Viviani, 1998). These data indicate a polar representation of the target position in separate channels of shoulder-centered distance, azimuth, and elevation, as has previously been proposed (Soechting & Flanders, 1989a; Lacquaniti et al., 1995). Local distortion data from the current experiment are, however, also consistent with a distorted transformation into arm segment orientations (arm and forearm azimuth and elevation; Soechting & Flanders, 1989b) or intrinsic shoulder and elbow angle components. This is because the direction to the fingertip is determined primarily by the two shoulder angles, while the distance from shoulder to finger is largely determined by the degree of elbow extension. Note that it is difficult to differentiate between these models because they are intrinsically related. An underestimation of shoulder-centered target distance would generate an underestimation of elbow extension and vice versa (Bennett & Loeb, 1992).
Indeed, Baud-Bovy and Viviani (1998) observed slopes consistently less than unity in the regression of perceived versus actual arm elevation for pointing to a kinesthetically presented target. Nevertheless, these authors have argued for a spherical target position representation centered at the shoulder, rather than a limb orientation or joint angle representation, based on the analysis of noise correlation for different hypothesized coordinate frames. Thus, a number of studies have identified a viewer-centered reference frame for visually guided pointing to visually presented targets (Soechting, Helms Tillery, & Flanders, 1990; McIntyre et al., 1997). Conversely, an arm-centered reference frame has been identified for pointing to kinesthetically presented targets (Baud-Bovy & Viviani, 1998). In the experiments presented here for pointing to visually presented targets without visual guidance, our measurements of local distortions indicate an intermediate reference frame with an origin located between the eyes and shoulder. Using different methods, Soechting and colleagues (Soechting et al., 1990; Helms Tillery, Flanders, & Soechting, 1991; Flanders, Helms Tillery, & Soechting, 1992) have also identified an intermediate reference frame related to both the visual target and the effector arm. Our analysis provides a more explicit explanation for this intermediate representation. Simulations show how the cascading effects of two transformations, one centered at the eyes and the second at the shoulder, can predict the intermediate center of rotation seen in the data. Data from three different experiments support the hypothesis of a two-stage transformation, with a gradual passage from viewer-centered to shoulder-centered coordinates. For pointing to continuously visible targets, without vision of the hand, Carrozzo, McIntyre, Zago, and Lacquaniti (1999) found local contraction axes that point toward the subject but that are not necessarily biased toward the shoulder.
Although Baud-Bovy and Viviani (1998) did not explicitly measure local distortion, in their experiments on pointing to kinesthetic targets, the observed convergence of the minor axes for the variable-error ellipsoids toward the shoulder of the effector arm is consistent with a local distortion (local contraction) in a shoulder reference frame. In our experiments for pointing to remembered visual targets, we found significant local distortion for both lighting conditions and memory delays, but the bias toward the effector arm is most readily apparent for the longer delay in the dark. Nevertheless, the bias toward the shoulder for pointing to visual targets (see Figure 7) does not appear to be as strong as that seen for pointing to kinesthetic targets (see Baud-Bovy & Viviani, 1998, Figure 7). The ensemble of these results suggests that a visual target is first represented in a viewer-centered reference frame, with a contraction of perceived distances along the sight line, and then transformed into an arm-centered representation, with an associated distortion in the arm-linked reference frame. As the memory delay increases, the internal representation of the intended hand movement appears to decay in this arm-linked reference frame.
7.1 Implications for Neural Circuits. Anisotropies in the transformation process, particularly those arising from the decay in the memory of the target position, have meaningful implications for the neural mechanisms involved in sensorimotor transformations and memory. Anisotropic compression along an egocentric axis is a behavioral characteristic that one can search for within the neural networks of the brain, in much the same way that behavioral phenomena such as mental rotation (Georgopoulos, Lurito, Petrides, Schwartz, & Massey, 1989) and the two-thirds power law (Schwartz, 1994) have been observed in population vectors measured in M1. Evidence for egocentric compression within a specific neural representation of the target position would help to identify the structures that implement the sensorimotor transformations and short-term working memory. Note that the tuning curves of neurons in the superior parietal lobule of the monkey recorded during pointing to visual targets exhibit a body-centered contraction reminiscent of the radial contraction described here (Lacquaniti, Guigon, Bianchi, Ferraina, & Caminiti, 1995), suggesting a neural manifestation of the observed behavioral effects. In addition, neurological patients affected by optic ataxia, a disturbance of visuomotor coordination due to lesions of the intraparietal sulcus and the superior parietal lobule, exhibit a specific deficit in pointing to a visual target (Perenin & Vighetto, 1988; Ratcliff & Davies-Jones, 1972) that consists of a pattern of local distortion qualitatively akin to the one we have reported here, although of much greater magnitude. Thus, this particular deficit may be an exaggeration of a relatively normal physiological process. Known properties of a variety of different cortical areas provide a number of potential loci for the transformations and representations proposed here.
Parietal cortex is replete with cells that combine information about eye, neck, arm, and hand positions, as is the case for premotor cortex (Andersen, Snyder, Bradley, & Xing, 1997; Boussaoud, 1995). Cells in LIP seem to code 3D space through the mechanism of gain fields, in which radial 2D receptive fields are modulated multiplicatively by a sigmoid function of depth. This representation mechanism could account for the anisotropic distortion between depth and direction observed for the final positions in our pointing task. The anisotropic modification of the memorized target location (or intended movement end point) over the delay period argues for a parcellation of target position information into separate channels, each represented by a relatively independent set of memory-related neurons. In this light, end point–position sensitive neurons, such as those found in parietal cortex area 5, become good candidates for the locus of the target position memory (Lacquaniti et al., 1995). On the other hand, uniform encoding of hand displacement direction, as has been postulated for area M1 (Georgopoulos et al., 1988), would be inconsistent with the anisotropic memory decay observed for movements in the dark. Although the vector coding scheme proposed for M1 separates movement distance from direction, the reference frame for
this system is the hand’s initial position, not a body-centered origin. This is not to say that alternative representations situated upstream or downstream from the egocentric working memory would have no effect on the evolution of the stored target position over time. On the contrary, both the fact that hand-centered M1 neurons are active during memory delay periods (Smyrnis, Taira, Ashe, & Georgopoulos, 1992) and that noise tends to be greater along the axis of movement (Gordon, Ghilardi, & Ghez, 1994; Desmurget et al., 1997; McIntyre et al., 1998) indicate that a hand-centered representation of the intended limb displacement plays a detectable role in the representation of the upcoming movement. Furthermore, undistorted pointing when vision of the fingertip is allowed suggests that the visual representation of the target persists in memory during the delay period. Likewise, shifts in gaze direction during the memory delay period can affect the final pointing position, even when vision of the fingertip is prevented (Enright, 1995; Henriques, Klier, Smith, Lowy, & Crawford, 1998). These results suggest an ongoing influence of visuomotor inputs to the short-term memory of the target position or intended movement. Theoretical models suggest how neural networks might represent target locations simultaneously in different reference frames linked to different sensory modalities and/or motor outputs, and how coherence may be maintained between these different reference frames (Droulez & Berthoz, 1991; Pouget & Sejnowski, 1997). Nevertheless, in our experiments, distortion of the final end point positions along a head-shoulder axis increased in a time-dependent fashion during the memory delay period when there was no modification of the visual input, eye position, or body posture. We therefore postulate that the primary storage of the target position or intended movement during the memory delay is carried out in an egocentric representation of the final end point position.
Acknowledgments
This work was carried out under the auspices of the European Laboratory for the Neuroscience of Action. This work was supported in part by grants from the Italian Health Ministry, the Italian Space Agency (ASI), the French space agency (CNES), the Ministero dell’Università e della Ricerca Scientifica e Tecnologica, and Telethon-Italy.

References

Andersen, R. A., Snyder, L. H., Bradley, D. C., & Xing, J. (1997). Multimodal representation of space in the posterior parietal cortex and its use in planning movements. Ann Rev Neurosci, 20, 303–330.
Baud-Bovy, G., & Viviani, P. (1998). Pointing to kinesthetic targets in space. J Neurosci, 18, 1528–1545.
Bennett, D. J., & Loeb, E. (1992). Error analysis, regression and coordinate systems. Behav Br Sci, 15, 326.
Berkinblit, M. B., Fookson, O. I., Smetanin, B., Adamovich, S. V., & Poizner, H. (1995). The interaction of visual and proprioceptive inputs in pointing to actual and remembered targets. Exp Brain Res, 107, 326–330.
Bookstein, F. L. (1992). Error analysis, regression and coordinate systems. Behav Br Sci, 15, 327–328.
Boussaoud, D. (1995). Primate premotor cortex: Modulation of preparatory neuronal activity by gaze angle. J Neurophysiol, 73, 886–890.
Carrozzo, M., McIntyre, J., Zago, M., & Lacquaniti, F. (1999). Viewer-centered and body-centered frames of reference in direct visuo-motor transformations. Exp Brain Res, 129, 201–210.
Darling, W. G., & Miller, G. F. (1993). Transformations between visual and kinesthetic coordinate systems in reaches to remembered object locations and orientations. Exp Brain Res, 93, 534–547.
Desmurget, M., Jordan, M., Prablanc, C., & Jeannerod, M. (1997). Constrained and unconstrained movements involve different control strategies. J Neurophysiol, 77, 1644–1650.
Droulez, J., & Berthoz, A. (1991). The concept of dynamic memory in sensorimotor control. In D. R. Humphrey & H. J. Freund (Eds.), Motor control: Concepts and issues (pp. 137–161). New York: Wiley.
Enright, J. T. (1995). The non-visual impact of eye orientation on eye-hand coordination. Vis Res, 35, 1611–1618.
Flanders, M., Helms Tillery, S. I., & Soechting, J. F. (1992). Early stages in a sensorimotor transformation. Behav Br Sci, 15, 309–362.
Foley, J. M. (1980). Binocular distance perception. Psychol Rev, 87, 411–435.
Foley, J. M., & Held, R. (1972). Visually directed pointing as a function of target distance, direction and available cues. Percept Psychophys, 12, 263–268.
Georgopoulos, A. P., Kettner, R. E., & Schwartz, A. B. (1988). Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population. J Neurosci, 8, 2928–2937.
Georgopoulos, A. P., Lurito, J. T., Petrides, M., Schwartz, A. B., & Massey, J. T. (1989). Mental rotation of the neuronal population vector. Science, 243, 234–236.
Gogel, W. C. (1973). Absolute motion, parallax and the specific distance tendency. Percept Psychophys, 13, 284–292.
Gordon, J., Ghilardi, M. F., & Ghez, C. (1994). Accuracy of planar reaching movements. I. Independence of direction and extent variability. Exp Brain Res, 99, 97–111.
Helms Tillery, S. I., Flanders, M., & Soechting, J. F. (1991). A coordinate system for the synthesis of visual and kinesthetic information. J Neurosci, 11, 770–778.
Henriques, D. P., Klier, E. M., Smith, M. A., Lowy, D., & Crawford, J. D. (1998). Gaze-centered remapping of remembered visual space in an open-loop pointing task. J Neurosci, 18, 1583–1594.
Lacquaniti, F. (1997). Frames of reference in sensorimotor coordination. In M. Jeannerod & J. Grafman (Eds.), Handbook of neuropsychology (Vol. 2, pp. 27–64). Amsterdam: Elsevier.
Lacquaniti, F., Guigon, E., Bianchi, L., Ferraina, S., & Caminiti, R. (1995). Representing spatial information for limb movement: Role of area 5 in the monkey. Cereb Cortex, 5, 391–409.
McIntyre, J., Stratta, F., & Lacquaniti, F. (1997). A viewer-centered reference frame for pointing to memorized targets in three-dimensional space. J Neurophysiol, 78, 1601–1618.
McIntyre, J., Stratta, F., & Lacquaniti, F. (1998). Short-term memory for reaching to visual targets: Psychophysical evidence for body-centered reference frames. J Neurosci, 18, 8423–8435.
Merton, P. A. (1961). The accuracy of directing the eyes and the hand in the dark. J Physiol, 156, 555–577.
Morrison, D. F. (1990). Multivariate statistical methods. New York: McGraw-Hill.
Perenin, M. T., & Vighetto, A. (1988). Optic ataxia: A specific disruption in visuomotor mechanisms. I. Different aspects of the deficit in reaching for objects. Brain, 111, 643–674.
Poulton, E. C. (1981). Human manual control. In V. B. Brooks (Ed.), Motor control (pp. 1337–1389). Bethesda, MD: American Physiological Society.
Pouget, A., & Sejnowski, T. J. (1997). Spatial transformations in the parietal cortex using basis functions. J Cog Neurosci, 9, 222–237.
Prablanc, C., Echallier, J. F., Komilis, E., & Jeannerod, M. (1979). Optimal response of eye and hand motor systems in pointing at a visual target. I. Spatiotemporal characteristics of eye and hand movements and their relationship when varying the amount of visual information. Biol Cybern, 35, 113–124.
Ratcliff, G., & Davies-Jones, G. A. (1972). Defective visual localization in focal brain wounds. Brain, 95, 49–60.
Schwartz, A. B. (1994). Direct cortical representation of drawing. Science, 265, 540–542.
Sherrington, C. S. (1918). Observations on the sensual role of the proprioceptive nerve-supply of the extrinsic ocular muscles. Brain, 41, 332–343.
Smyrnis, N., Taira, M., Ashe, J., & Georgopoulos, A. P. (1992). Motor cortical activity in a memorized delay task. Exp Brain Res, 92, 139–151.
Soechting, J. F., & Flanders, M. (1989a). Errors in pointing are due to approximations in sensorimotor transformations. J Neurophysiol, 62, 595–608.
Soechting, J. F., & Flanders, M. (1989b). Sensorimotor representations for pointing to targets in three-dimensional space. J Neurophysiol, 62, 582–594.
Soechting, J. F., Helms Tillery, S. I., & Flanders, M. (1990). Transformation from head- to shoulder-centered representation of target direction in arm movements. J Cog Neurosci, 2, 32–43.

Received December 2, 1998; accepted March 7, 2000.
LETTER
Communicated by Stephen DeWeerth
Modeling Selective Attention Using a Neuromorphic Analog VLSI Device

Giacomo Indiveri
Institute of Neuroinformatics, University/ETH Zürich, Switzerland

Attentional mechanisms are required to overcome the problem of flooding a limited processing capacity system with information. They are present in biological sensory systems and can be a useful engineering tool for artificial visual systems. In this article we present a hardware model of a selective attention mechanism implemented on a very large-scale integration (VLSI) chip, using analog neuromorphic circuits. The chip exploits a spike-based representation to receive, process, and transmit signals. It can be used as a transceiver module for building multichip neuromorphic vision systems. We describe the circuits that carry out the main processing stages of the selective attention mechanism and provide experimental data for each circuit. We demonstrate the expected behavior of the model at the system level by stimulating the chip with both artificially generated control signals and signals obtained from a saliency map, computed from an image containing several salient features.

1 Introduction
Processing detailed sensory information is a computationally demanding task for biological and artificial systems. If the amount of information provided by the sensors exceeds the parallel processing capabilities of the system, as is usually the case with biological and artificial vision systems, an effective strategy is to select subregions of the input and process them, shifting from one subregion to another, in a serial fashion (Desimone & Duncan, 1995; Mozer & Sitton, 1998). In biology this strategy, commonly referred to as selective attention, is used by a wide variety of systems, from insects (Pollack, 1988; Bernays, 1996) to humans (Kastner, De Weerd, Desimone, & Ungerleider, 1998; Culham et al., 1999). In primates selective attention plays a major role in determining where to center the high-resolution central foveal region of the retina (Miller & Bockisch, 1997) by biasing the planning and production of saccadic eye movements (Hoffman & Subramaniam, 1995; Behrmann & Haimson, 1999). In general, though, visual regions being attended by the focus of attention do not always correspond to the regions being analyzed by the fovea. Recent findings even suggest that attention can be used to keep track of multiple targets of interest simultaneously if the visual task requires a low attentional cost (Braun & Julesz, 1998; Culham et al., 1999).

Neural Computation 12, 2857–2880 (2000) © 2000 Massachusetts Institute of Technology
Psychophysical evidence indicates that visual attention mechanisms have two main types of dynamics: a transient, rapid, bottom-up, task-independent one and a slower, sustained, task-dependent one, which acts under voluntary control (Nakayama & Mackeben, 1989). In this article we present a hardware implementation of a saliency-based computational model of the fast, transient, stimulus-driven, dynamical form of selective attention (Niebur & Koch, 1998). The model is implemented in VLSI technology, using analog neuromorphic circuits, such as winner-take-all networks and silicon integrate-and-fire neurons, and employs a spike-based representation for both receiving input signals and transmitting output signals to actuators or further processing stages. Its input signals are expected to arrive from a saliency map (Koch & Ullman, 1985), topographically encoding local conspicuousness over the entire visual scene. Its output signals can be used in real time to drive motors of active vision systems or select subregions of images captured from wide field-of-view cameras. The analog, continuous-time response characteristics of the device presented, its ability to interact with the real world in real time, the adjustable properties of many of its circuits, and the spike-based communication infrastructure adopted allow us to investigate different architectural and computational hypotheses of selective attention models and possibly to validate current theories of selective attention mechanisms. The VLSI model presented embodies the basic principles of neuromorphic engineering (Douglas, Mahowald, & Mead, 1995) and was designed in the same spirit as recently proposed VLSI neuromorphic systems (Fragnière, van Schaik, & Vittoz, 1997; Horiuchi & Koch, 1999; Harrison & Koch, 2000). In the next section we describe the general structure of the selective attention chip, pointing out its analogies to biology.
We confine the detailed descriptions of the circuits implementing the various computational blocks of the system to section 3, so that readers interested only in the functional aspect of the VLSI model can go directly to section 4, in which we demonstrate the model's behavior at the system level and provide experimental results. In section 5 we point out the relevance of the approach followed, suggest an architecture for implementing an active multichip selective attention system, and draw conclusions. 2 Selective Attention Chip
The selective attention chip presented here contains a one-dimensional architecture of 32 locally coupled elements that compete globally for saliency. Global competition is achieved using a hysteretic winner-take-all (WTA) network that processes all its inputs in parallel and selects the strongest one as the winner. The dynamics of the selective attention model are implemented exploiting the temporal properties of integrate-and-fire neurons and synaptic circuits connected to the WTA cells. The synaptic circuits and the integrate-and-fire neurons are part of the chip's I/O interfacing circuits,
which allow the chip to communicate with other devices using the address-event representation (AER) (Lazzaro, Wawrzynek, Mahowald, Sivilotti, & Gillespie, 1993; Boahen, 1998; Deiss, Douglas, & Whatley, 1998). In AER, signals are represented as groups of parallel digital pulses. Analog information is carried in the temporal structure of the intersignal intervals, very much as it is believed to be conveyed in natural spike trains in real neural systems. The silicon area of the chip designed is approximately 2 mm × 2 mm. Each processing column is 36 μm wide and 240 μm long. The technology used to fabricate the chip is a standard 1.2 μm CMOS technology. 2.1 Relation to Previous Work. Several VLSI systems for implementing visual selective attention mechanisms already have been presented (Wilson, Morris, & DeWeerth, 1999; Horiuchi & Koch, 1999; Morris, Horiuchi, & DeWeerth, 1998; Brajovic & Kanade, 1998). These systems contain photosensing elements and processing elements on the same focal plane and typically apply the competitive selection process to visual stimuli sensed and processed by the focal plane processor itself. Unlike these systems, the device proposed here is able to receive input signals from any type of AER device. Therefore, input signals need not arrive only from visual sensors but could represent a wide variety of sensory stimuli obtained from different sources. The selective attention chip proposed is also one of the first of its kind able not only to receive AER signals but also to transmit the result of its computation using the AER. With both input and output AER interfacing circuits, the chip can be thought of as a VLSI "cortical" module able to receive and transmit spike trains.
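As an illustrative sketch only (the function names and data structures here are invented for exposition, not taken from the chip's actual interface logic), the AER idea of serializing spikes from many senders as (time, address) events on a shared bus can be written as:

```python
# Toy sketch of the address-event representation (AER): spikes from many
# senders are serialized as (timestamp, address) events on a shared bus,
# and the receiver routes each event back to the matching node.
# All names here are illustrative, not taken from the chip's interface.

def encode_aer(spike_trains):
    """Merge per-node spike-time lists into one time-ordered event stream."""
    events = [(t, addr) for addr, times in spike_trains.items() for t in times]
    return sorted(events)  # the bus carries events in temporal order

def decode_aer(events):
    """Route each address-event back to the receiving node's spike list."""
    trains = {}
    for t, addr in events:
        trains.setdefault(addr, []).append(t)
    return trains

# Three sending nodes; analog information lives in the inter-event intervals.
spikes = {0: [1.0, 5.0], 1: [2.0], 2: [0.5, 3.0]}
bus = encode_aer(spikes)
assert bus[0] == (0.5, 2)           # earliest event wins the bus first
assert decode_aer(bus) == spikes    # lossless round trip
```

The analog content of each spike train survives the multiplexing because only the event times, not any amplitude, need to cross the bus.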
In general, decoupling the sensing stage from the processing stage and using the AER to transmit and receive signals has several advantages: a multichip AER attention system could use multiple sensors to construct a saliency map; visual input sensors could be relatively high-resolution silicon retinas (Boahen, 1999) and would not have the small fill factors that single-chip two-dimensional (2D) attention systems are troubled with; top-down modulating signals could be fused with the bottom-up-generated saliency map to bias the selection process; multiple instances of the same selective attention chip could be used to construct hierarchical selective attention architectures; and sensors could be distributed across different peripheral regions of the neuromorphic system, as is the case for real biological systems. 2.2 System Overview. The cells of the hysteretic WTA network have local excitatory and inhibitory connections, recurrent excitatory connections, and global inhibitory connections (see section 3.3). It has been argued that neural circuits with these types of connectivity patterns are valuable models of cortical processing and can account for many response properties of cortical neurons (Hahnloser, Douglas, Mahowald, & Hepp, 1999; Salinas & Abbott, 1996). The analog circuits of the WTA network implement
Figure 1: Biologically equivalent architecture of selective attention model. Input spike trains arrive from the bottom onto excitatory synapses. The populations of cells in the middle part of the figure are modeled by a hysteretic WTA network with local lateral connectivity. Inhibitory neurons, in the top part of the figure, locally inhibit the populations of excitatory cells by projecting their activity to the inhibitory synapses in the bottom part of the figure.
a simplified abstract model of these types of neural networks in which each element of the WTA network can be regarded as a local population of excitatory neurons interconnected among each other with lateral nearest-neighbor connections. From this point of view, the architecture of the selective attention chip is equivalent to the neural network diagram depicted in Figure 1. The input excitatory synapses shown in the bottom part of the figure receive spike trains from external devices and provide an excitatory current to the local populations of neurons. These populations compete among each other by means of recurrent interactions with a global inhibitory cell (not shown in the figure) and reach a steady state in which typically all populations except the one receiving the strongest net excitation are silent. Each local population projects to one of the output inhibitory neurons shown in the top row of Figure 1. For typical operating conditions, only the inhibitory neuron connected to the winning population of cells will be active at any given time. The output neuron projects its spikes to both AER interfacing circuits, for transmitting the result of the computation to further processing stages, and to local on-chip inhibitory synapses (equivalent to those shown in the lower part of Figure 1). The resulting inhibitory current is subtracted from its corresponding input excitatory current. This negative feedback loop implements the so-called inhibition of return (IOR) mechanism (Gibson & Egeth, 1994; Tanaka & Shimojo, 1996): after selecting a salient stimulus, the WTA network stimulates the output neuron connected to the winning population of cells. The spikes that the output neuron generates are integrated by the corresponding inhibitory synapse. As the inhibitory current increases
in amplitude, the effect of the input excitatory current is diminished, and eventually the WTA network switches stable state, selecting a different cell as the winner. Note how the integrate-and-fire neurons, necessary for the address-event I/O interface, allowed us to implement the IOR mechanism by simply including an additional inhibitory synaptic circuit. This solution is quite elegant and compact in comparison with previously proposed alternatives (Morris & DeWeerth, 1996). Depending on the dynamics of the IOR mechanism, the WTA network will continuously switch the selection of the winner between the strongest input and the second strongest, or between the strongest and more inputs of successively decreasing strength, thus generating focus-of-attention scanpaths, analogous to eye movement scanpaths (Yarbus, 1967). The dynamics of the IOR mechanism depend on the time constants of the excitatory and inhibitory synapses, their relative synaptic strengths, the input stimuli, and the frequency of the output inhibitory neuron. 3 Circuit Details
In this section we describe the circuits implementing each computational stage of the model, characterize them, and present experimental data measured from the chip. 3.1 Address-Event I/O Interface. Input and output signals are sent to (and from) the chip using digital pulses that encode the address of the sending node and carry the analog information in their temporal structure. The name of the representation used to convey this type of information is address-event representation (AER). Address-events are stereotyped nonclocked digital pulses; the time intervals between address-events are analog (see Figure 2). In the case of single-sender/single-receiver communication, a simple handshaking mechanism ensures that all events generated at the sender side arrive at the receiver. The address of the sending element is conveyed as a parallel word of sufficient length, while the handshaking control signals require only two lines. The circuits used to implement the AER protocol, for both the receiving and the sending side, are the ones described by Boahen (1998). 3.2 Excitatory and Inhibitory Synapses. The input pulses being received by the chip reach, at each cell location, a current-mirror integrator (Boahen, 1998), which models an excitatory synapse. This circuit, shown in Figure 3a, uses only four transistors and one capacitor. The input pulse is applied to transistor M1, which acts as a digital switch. Transistor M2 is biased by the analog voltage Vw to set the weight of the synaptic strength. Similarly, the voltage Ve on the source of transistor M3 can be used to set the time constant of the synapse. With each input pulse, a fixed amount of charge is removed from the capacitor, and the amplitude of the output current Iex is increased.
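A numerical caricature of this pulse-integration behavior, including the 1/t-style decay described in the text, can be sketched as follows (all constants and names are arbitrary illustrations, not fits to the circuit):

```python
# Caricature of the current-mirror integrator synapse: each input spike bumps
# the output current by a fixed weight, and between spikes the current decays
# with a 1/t profile (dI/dt = -I^2/Q gives I(t) = I0 / (1 + I0*t/Q)).
# Constants are illustrative only, not fitted to the chip.

def synapse_response(spike_times, t_end, w=1.0, Q=5.0, dt=0.001):
    """Return a list of (time, output current) samples."""
    i_out, t, trace = 0.0, 0.0, []
    spikes, k = sorted(spike_times), 0
    while t < t_end:
        while k < len(spikes) and spikes[k] <= t:
            i_out += w                        # fixed charge packet per spike
            k += 1
        trace.append((t, i_out))
        i_out -= dt * i_out * i_out / Q       # 1/t-style decay between spikes
        t += dt
    return trace

trace = synapse_response([0.1, 0.12, 0.14], t_end=1.0)
peak = max(i for _, i in trace)
assert peak <= 3.0                  # three spikes of weight 1, minus decay
assert trace[-1][1] < peak          # the current decays after the burst
```

A steady input spike train drives this model toward a mean steady-state level set by the rate and the weight, qualitatively matching the measurements in Figure 4.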
Figure 2: Schematic diagram of an AER chip-to-chip communication example. As soon as a sending node on the source chip generates an event, its address is written on the address-event bus. The destination chip decodes the address-events as they arrive and routes them to the corresponding receiving nodes.
If no input is applied (i.e., no current is allowed to flow through M3), the output current Iex decays with a 1/t profile. Similarly, the inhibitory synapse integrates the spikes generated by the output neurons. As with the excitatory synapse, we implemented the inhibitory synapse using a current-mirror integrator circuit (see Figure 3b). The principle of operation of this circuit is very similar to the one described for the excitatory synapse, with the difference that the output current of the circuit is of opposite polarity. Every time the local output neuron projecting to this synapse generates a spike, its output voltage Vout rises to the positive power supply rail (see section 3.4), allowing the transistor M1 to charge the capacitor connected to it. The amount of charge passed can be controlled by Vq (which thus determines the strength of the synaptic weight). As for the circuit of Figure 3a, the voltage at the source node of the diode-connected transistor (Vi on the source of M2 in Figure 3b) controls the time constant and the gain of the synapse. Transistor M4 is a decoupling (cascode) element, used to decrease second-order effects. Specifically, it is used to minimize the effect of the Miller capacitance of transistor M3. The voltage Vca on the capacitor is a measure of the spiking history of the neuron projecting to the inhibitory synapse. It determines (with an exponential relationship) the amplitude of the output current Iinh. Inhibitory currents Iinh are subtracted
Figure 3: (a) Excitatory synapse circuit. Input spikes are applied to M1, and transistor M4 outputs the integrated excitatory current Iex . (b) Inhibitory synapse circuit. Spikes from the local output neurons are integrated into an inhibitory current Iinh .
from excitatory currents Iex coming from the input synapses to provide the net input current Iin to the WTA network (see Figure 5). We characterized the excitatory synapse of Figure 3a by applying single pulses (see Figures 4a and 4b) and sequences of pulses (spikes) at constant rates (see Figures 4c and 4d). Figure 4a shows the response of the excitatory synapse to a single spike for different values of Vw. Similarly, Figure 4b shows the response of the excitatory synapse for different values of Ve. Changes in Ve modify both the gain and the time constant of the synapse. To better visualize the effects of Ve on the time evolution of the circuit's response, we normalized the different traces, neglecting the circuit's gain variations. Figure 4c shows the response of the excitatory synapse to a constant 50 Hz spike train for different synaptic strength values. As shown, the circuit integrates the spikes up to the point at which the output current reaches a mean steady-state analog value, the amplitude of which depends on the frequency of the input spike train, the synaptic strength value Vw, and Ve. Figure 4d shows the response of the circuit to spike train sequences of four rates for a fixed synaptic strength value. 3.3 The Hysteretic Winner-Take-All Network. The basic cell of the hysteretic winner-take-all network is based on the current-mode WTA circuit originally proposed by Lazzaro, Ryckebusch, Mahowald, and Mead (1989), but contains an additional local positive feedback circuit and lateral excita-
Figure 4: (a) Response of an excitatory synapse to single spikes, for different values of the synaptic strength Vw (with Ve = 4.60 V). (b) Normalized response to single spikes for different time constant settings Ve (with Vw = 1.150 V). (c) Response of an excitatory synapse to a 50 Hz spike train for increasing values of Vw (0.6 V, 0.625 V, 0.65 V, and 0.7 V, from bottom to top trace, respectively). (d) Response of excitatory synapse to spike trains of increasing rate for Vw = 0.65 V and Ve = 4.6 V (12 Hz, 25 Hz, 50 Hz, and 100 Hz, from bottom to top trace, respectively).
tory and inhibitory coupling transistors. Similar WTA networks with positive feedback and lateral coupling have already been proposed (Starzyk & Fang, 1993; DeWeerth & Morris, 1995). A detailed comparison between the circuits described in this section and previously proposed hysteretic WTA networks can be found in Indiveri (1997). The additional features introduced by improving the original basic WTA cell allow these types of networks not only to select the input with strongest amplitude, but also to lock onto it and eventually track it as it shifts smoothly from one pixel to its neighbor (Morris & DeWeerth, 1997; Indiveri, 1999).
Figure 5: Schematic diagram of the WTA network. An example of three neighboring cells connected together.
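The hysteresis produced by the positive feedback can be illustrated with a toy selection rule (a sketch only: the fixed margin stands in for the hysteretic current, and all numbers are invented):

```python
# Sketch of hysteretic winner-take-all selection: the current winner gets a
# constant "hysteretic" bonus added to its input, so it keeps winning until
# another input exceeds it by more than that margin.
# The margin value is illustrative, not the chip's bias current.

def hysteretic_wta(inputs_over_time, hysteresis=0.3):
    winner, selections = None, []
    for frame in inputs_over_time:
        net = list(frame)
        if winner is not None:
            net[winner] += hysteresis     # positive feedback to current winner
        winner = max(range(len(net)), key=lambda i: net[i])
        selections.append(winner)
    return selections

frames = [
    [1.0, 0.8, 0.2],   # input 0 wins outright
    [0.9, 1.0, 0.2],   # input 1 is larger, but not by the hysteresis margin
    [0.5, 1.0, 0.2],   # input 1 finally overcomes input 0's held advantage
]
assert hysteretic_wta(frames) == [0, 0, 1]
```

This locking behavior is what lets the network hold a winner as the stimulus shifts smoothly between pixels, as described in the text.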
The current-mode WTA network, even with the additional circuit elements recently introduced, remains a very elegant, compact, and powerful circuit. It collectively processes all its input signals using strictly local interconnections, operates in parallel, in continuous time, and uses only 10 transistors per cell. Figure 5 shows three cells of the hysteretic WTA network connected together. Input is applied to each node of the network through the currents Iin, corresponding to the sum of Iex, coming from the excitatory synapse (see Figure 3a), and of Iinh, going to the inhibitory synapse (see Figure 3b). If (and only if) the cell considered is the winning one, the p-type current mirror in the top part of the circuit produces, at the same time, the output current Iinj and a hysteretic current, summed through a positive feedback loop (indicated by a dashed arrow) back into the input node. The hysteretic current is a copy of the WTA bias current (set by Vb on the bias transistors in the lower part of the circuit). The amplitude of the output current Iinj is independent of the input current Iin and can be modulated by the control voltage Vinj at the source of the output transistor. The n-type current mirror in the lower half of the figure is used both to produce an output current Inet and to enhance the response of the WTA network by producing source degeneration of the input transistor. The source degeneration technique consists of converting the current flowing through the input transistor into a voltage by dropping it across a diode and feeding this voltage back to the source of the input transistor, to increase its gate voltage. At the network level, source degeneration of the input transistor has the effect of increasing the circuit's winner selectivity. As the current Inet represents the sum of all of the currents converging into the same cell (the input current Iin, the current being spread to or from the left and right
Figure 6: Net WTA input current Inet values at each pixel location for a static control input. Pixels 5 through 13 have input currents slightly lower than pixel 21. All other pixels receive weaker input stimuli. (a) In the absence of lateral coupling (Vex = 0 V) the network selects pixel 21 as the winner. (b) In the presence of lateral coupling (Vex = 1.5 V), the network smoothes the input distribution spatially and selects pixel 9 as the winner.
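The effect measured in Figure 6 can be mimicked with a toy computation (a sketch only: the chip's diffusive lateral coupling is idealized here as a moving-average kernel, and the current values are 0-indexed stand-ins for the ~nA measurements, not the data themselves):

```python
# Toy model of the Figure 6 measurement: without smoothing the single
# strongest input wins; with lateral coupling approximated by a moving
# average, a plateau of slightly weaker inputs wins instead.
# Indices are 0-based; values are idealized stand-ins for the nA currents.

def winner(inputs, smooth=False):
    x = list(inputs)
    if smooth:  # crude stand-in for lateral excitatory (diffusive) coupling
        x = [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, len(x) - 1)]) / 3.0
             for i in range(len(x))]
    return max(range(len(x)), key=lambda i: x[i])

currents = [200.0] * 32
for p in range(9, 14):          # plateau of slightly weaker inputs
    currents[p] = 300.0
currents[21] = 350.0            # isolated strongest pixel

assert winner(currents) == 21                     # no coupling: isolated peak wins
assert 9 <= winner(currents, smooth=True) <= 13   # coupling favors the plateau
```

Smoothing penalizes the isolated peak (its weak neighbors drag its average down) while leaving the plateau nearly unchanged, which is the selection reversal seen on the chip.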
nearest neighbors, and the hysteretic current coming from the top p-type current mirror), it is a useful measure for visualizing the state of the WTA network. Each WTA cell is connected to its immediate neighbors through pass transistors controlled by Vex (in the upper half of Figure 5). These pass transistors implement lateral excitation. Laterally connected pass transistors in the lower part of the figure, controlled by Vinh, implement local inhibition. In the two extreme cases they either completely decouple the network, allowing each individual cell to be a winner (Vinh = 0 V), or they globally connect all the cells, forcing the network to choose only one winner (Vinh = Vdd). In intermediate cases, modulation of Vinh determines the spatial extent of the local regions over which competition takes place, thus allowing the network to select multiple winners. In Figure 6 we show examples of scanned Inet measurements, having applied constant input currents to the WTA nodes. We generated a spatial input stimulus such that pixel 21 received an input current of approximately 350 nA, pixels 9 through 13 received input currents of approximately 300 nA, and the remaining pixels received currents ranging from 150 nA to 250 nA. Small variations across neighboring pixels are mainly due to device mismatch effects. Figure 6a shows the state of the network in the case in which no lateral excitatory coupling is present (Vex = 0 V). The network selects pixel 21 as the winner and supplies the hysteretic current to it (which has an amplitude of approximately 300 nA with the current bias settings). Figure 6b shows the status of the network in the case in which lateral excitatory coupling is applied (Vex = 1.5 V). Lateral coupling effectively smoothes spa-
Figure 7: Circuit diagram of the local inhibitory integrate-and-fire neuron.
tially the input currents, thus decreasing the net input current to pixel 21. In this condition, the WTA network selects pixel 9 as the winner (and sums the hysteretic current to it). This example points out the two main characteristic features of the spatially coupled hysteretic WTA network: spatial smoothing and positive feedback. Spatial smoothing is implemented by modulating the gate voltage Vex of the excitatory pass transistors. It can be used to reduce the effect of noise and offsets in the input transistors and to favor the selection of spatial regions with high average activity (as opposed to strongly activated isolated pixels). By adding a copy of the WTA bias current to the winning cell's input node, the positive feedback loop effectively produces a hysteretic behavior. Hysteresis is used to enforce the selection of the winner and is induced by summing a constant hysteretic current to the input of the winning pixel, through the positive feedback loop implemented with the p-type current mirrors, in the top part of Figure 5. 3.4 Output Inhibitory Integrate-and-Fire Neuron. We implemented the output inhibitory neurons with circuits of the type shown in Figure 7. They are nonleaky integrate-and-fire neurons based on circuits proposed by Mead (1989) and by van Schaik, Fragnière, and Vittoz (1996). Input is applied to this circuit by injecting a constant DC current Iinj, sourced from the WTA network (see Figure 5), into the membrane capacitance Cm. A comparator circuit compares the membrane voltage Vmem (which increases linearly with time if the injection current is applied) with a fixed threshold voltage Vthr. As long as Vmem is below Vthr, the output of the comparator is low, and the neuron's output voltage Vout sits at 0 V.
As Vmem increases above threshold, though, the comparator output voltage rises to the positive power supply rail and, via the two inverters, also brings Vout to the rail. A positive feedback loop, implemented with the capacitive divider Cfb-Cm, ensures that as soon as the membrane voltage Vmem reaches Vthr, it is increased by an amount proportional to Vdd · Cfb/(Cm + Cfb) (Mead, 1989). In this way we avoid the problems that could arise with small fluctuations of Vmem around Vthr. When Vout is high, the reset transistor at the bottom left of Figure 7 is switched on, and the capacitor Cm is discharged at a rate controlled by Vpw, which effectively sets the output pulse width (the width of the spike). The membrane voltage thus decreases linearly with time, and as soon as it falls below Vthr, the comparator brings its output voltage to zero. As a consequence, the first inverter sets its output high and switches on the n-type transistor of the second inverter, allowing the capacitor Cr to be discharged at a rate controlled by Vrfr. This bias voltage controls the length of the neuron's refractory period: the current flowing into the node Vmem is discharged to ground, and the membrane voltage does not increase for as long as the voltage on Cr (Vout) is high enough. Figures 8a and 8b show traces of Vmem for different amplitudes of the input injection current Iinj and for different settings of the refractory period control voltage Vrfr. The threshold voltage Vthr was set at 2 V and the bias voltage Vpw was set at 0.5 V, such that the width of a spike was approximately 1 ms. Figures 8c and 8d show how the firing rate of the neuron depends on the injection current amplitude. These plots are typically referred to as FI curves. We can control the saturation properties of the FI curves by changing the length of the neuron's refractory period. The error bars show how reliable the neuron is when stimulated with the same injection current. We changed the injection current amplitude by modulating the control voltage Vinj (see Figure 5). As the injection current changes exponentially with the control voltage Vinj, the firing rate of the neuron follows the same relationship.
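The FI behavior just described can be summarized with an idealized formula for a nonleaky integrate-and-fire neuron: the membrane charges to threshold in Cm·Vthr/Iinj seconds, and a refractory period adds a fixed dead time per spike. A sketch with illustrative parameter values (not the chip's actual biases):

```python
# Idealized f-I curve of a nonleaky integrate-and-fire neuron: the membrane
# charges linearly to threshold in T = Cm*Vthr/Iinj, then a refractory period
# t_ref is added, so f = 1 / (Cm*Vthr/Iinj + t_ref).
# Parameter values below are illustrative, not the chip's actual biases.

def firing_rate(i_inj, c_m=1e-12, v_thr=2.0, t_ref=0.0):
    if i_inj <= 0.0:
        return 0.0
    t_int = c_m * v_thr / i_inj          # time to integrate up to Vthr
    return 1.0 / (t_int + t_ref)

# Without a refractory period the rate is linear in the injection current...
assert abs(firing_rate(2e-9) - 2 * firing_rate(1e-9)) < 1e-6
# ...while a nonzero refractory period makes the f-I curve saturate at 1/t_ref.
assert firing_rate(1e-6, t_ref=0.01) < 1.0 / 0.01
```

This is why lengthening the refractory period (via Vrfr) controls the saturation of the measured FI curves.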
To verify that the firing rate is linear with the injection current, we can view the same data using a log scale on the ordinate axis (see Figure 8d). 4 Experimental Results
As mentioned in section 2, the chip can receive signals analogous to spike trains at its input interface, which can represent sensory information in their temporal structure. In the first instance, to stimulate and test the selective attention model with well-controlled input signals, we interfaced the chip to a workstation. In a more general scenario, the chip can be interfaced to analog VLSI neuromorphic sensors that use the same AER interfacing circuitry to construct more elaborate multichip systems (Indiveri, Whatley, & Kramer, 1999; Indiveri, Mürer, & Kramer, 2000).
Figure 8: Integrate-and-fire neuron characteristics. (a) Membrane voltage for two different DC current-injection values (set by the control voltage Vinj). (b) Membrane voltage for two different refractory period settings. (c) Firing rates of the neuron as a function of current-injection control voltage Vinj plotted on a linear scale. (d) Firing rates of the neuron as a function of Vinj plotted on a log scale (the injection current increases exponentially with Vinj).
4.1 Providing Input Spikes to the Chip. To provide inputs to all the synapses of the chip, we developed a program on a workstation that continuously addressed all the pixels in a serial fashion, exciting them at the times specified by a look-up table. The rate used to address the pixels sequentially was fast compared to the typical firing rates of input signals (chosen in a range between 10 Hz and 80 Hz). The input synapses, which have time constants on the order of milliseconds, thus appeared to be receiving spikes in parallel. The I/O card, in conjunction with the software we used, was able to cycle through all 32 pixels of the network at a rate of
Figure 9: Scanned net input currents to the WTA network Inet (top traces) and inhibitory currents Iinh (bottom traces) measured, by means of an off-chip current sense amplifier, at every pixel location. (a) Response of the system to the onset of the stimulation, with a display persistence setting of 3 s. (b) Response of the system after a few seconds of stimulation, with a display persistence setting of 250 ms.
500 Hz. 4.2 Control Inputs. In our control experiment we stimulated input synapses 1 through 9, 11 through 19, and 21 through 32 with spike trains at a constant rate of 10 Hz. Synapses at pixels 10 and 22 were stimulated with spike trains at rates of 50 Hz and 80 Hz, respectively. In Figure 9 we show oscilloscope traces representing the scanned net input currents to the WTA network Inet (see section 3.3) of all 32 pixels in the top traces and the inhibitory currents Iinh (see section 3.2) of all 32 inhibitory synapses in the bottom traces. To illustrate the circuit's dynamics, we increased the persistence of the oscilloscope's display. Figure 9a shows the response of the system to the onset of the stimulation. As the input spike trains start to arrive and the excitatory synapses integrate them, the net input current of each pixel increases from zero (corresponding approximately to the level at the central axis of the display) to a mean maximum steady-state value (see also Figure 4). The net input currents at pixels 10 and 22 increase more rapidly, compared to all other pixels, in accordance with the rate of the spike trains arriving at their input synapses (top trace). At the onset of the stimulation, all output inhibitory neurons are silent, and the inhibitory synapses receiving spikes from the output neurons do not generate any increase in the amplitude of the inhibitory synaptic currents Iinh (bottom trace). Ideally all synapses receiving the same inputs should integrate the spike trains to the same mean steady-state value. As shown in the figure, this is not the case; the traces in Figure 9 show the amount of variability present in the
synaptic currents due to device mismatches created in the chip's fabrication process. The offsets introduced by device mismatches are different from chip to chip but do not change over time. Figure 9b shows the response of the system after a few seconds of stimulation. As expected, all excitatory synapses reached a mean steady-state value, but the network keeps on switching from selecting pixel 10 as the winner, to selecting pixel 22 as the winner, and back again. Specifically, Figure 9b shows the situation in which pixel 22 has just been deselected and pixel 10 selected. At pixel position 22, the inhibitory synaptic current is in the process of decreasing back to zero (bottom trace), while at the tenth pixel position, the WTA hysteretic current has just been added to the net input (top trace), the output neuron has been activated, and the current of the inhibitory synapse is increasing with every output neuron's spike (bottom trace). In a second experiment we measured the membrane potential of single neurons rather than using time multiplexing to scan all of the pixels' outputs. The input stimulus was the same one used in the first experiment: all pixels were excited with 10 Hz spike trains except for pixels 10 and 22, which were receiving spikes at rates of 50 Hz and 80 Hz, respectively. Given that our system behaves in real time, we were able to make multiple recordings without having to wait for the long simulation and computation times that are typical of software algorithms modeling networks of spiking neurons. To measure statistical properties of the system and observe the variability of the system's response to the same stimulus, we repeated the experiment 100 times. In the first 50 trials we measured the membrane voltage of output neuron 10, and in the remaining 50 trials we measured the membrane voltage of output neuron 22.
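The continuous switching between the two most strongly stimulated pixels can be reproduced qualitatively with a toy discrete-time model of the WTA/IOR loop (a sketch with arbitrary constants and first-order synapse dynamics, not the circuit equations; the inputs below stand in for a strong, a weak, and a runner-up stimulus):

```python
# Toy WTA + inhibition-of-return (IOR) loop: the current winner's inhibitory
# synapse charges up while it wins and leaks otherwise, so the selection
# alternates between the strongest inputs, as observed on the chip.
# Gains and decay rates are arbitrary illustrative constants.

def simulate_ior(excite, steps=2000, gain=0.02, decay=0.995):
    inhib = [0.0] * len(excite)
    winners = []
    for _ in range(steps):
        net = [e - i for e, i in zip(excite, inhib)]
        w = max(range(len(net)), key=lambda i: net[i])
        winners.append(w)
        inhib = [i * decay for i in inhib]   # inhibitory currents leak away
        inhib[w] += gain * excite[w]         # winner's spikes charge its synapse
    return winners

winners = simulate_ior([1.0, 0.2, 0.9])
switches = sum(1 for a, b in zip(winners, winners[1:]) if a != b)
assert winners[0] == 0     # the strongest input is selected first
assert 2 in winners        # IOR then hands selection to the runner-up
assert switches > 2        # the selection keeps alternating
```

With slower IOR dynamics the same loop visits progressively weaker inputs as well, producing the scanpath-like behavior mentioned in section 2.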
The input stimulus was applied for 3 seconds per trial, and there was a delay of 30 seconds between each trial to allow the system to return to its initial "resting" state. Figure 10 shows raster plots describing the responses of these two neurons to the two consecutive sessions of 50 trials each. (Figure 10a represents the activity of neuron 10, and Figure 10b represents the activity of neuron 22.) In the first 500 ms to 800 ms of stimulation, the input synapses have not reached their mean steady-state value (see also Figure 4). As the excitatory synaptic currents reach their steady-state value, the WTA network selects either pixel 10 or pixel 22 as the winner and excites the corresponding output neuron, continuously switching between the two. The offsets introduced in the circuits are constant over time, and the single circuit elements of the system, such as the synapses and the neurons, are highly reliable (see the error bars of Figure 8c). Yet the traces of Figure 10 show a significant amount of intertrial variability for the same input stimulus and the same output neuron. Small variations in the input stimulus (the stimulation software executes in a multiplexing environment) and thermal noise in the circuit elements are not sufficient to explain such a large amount of variability. This variability is therefore due to network effects. A possible (and probable) explanation for this phenomenon could lie in the recurrent
Figure 10: (a) Raster plots of neuron 10 in response to the control stimulus (see the text for an explanation). (b) Raster plots of neuron 22. (c) Peristimulus time histogram of neuron 10 (solid line) and neuron 22 (dashed line). (d) Interspike interval distribution of neurons 10 (front bars) and 22 (rear bars).
nature of the competition mechanism that takes place in the WTA network. Computer simulations of completely deterministic models have already demonstrated that recurrent networks of reliable (software) spiking neurons can produce highly irregular firing patterns (Usher, Stemmler, Koch, & Olami, 1994; Hansel & Sompolinsky, 1996; van Vreeswijk & Sompolinsky, 1996; Ritz, Gerstner, Gaudoin, & van Hemmen, 1997). Carrying out a detailed analysis of the dynamics of the VLSI system would prove to be extremely difficult and would go beyond the scope of this article. But this experiment is a real-time demonstration that reliable VLSI spiking neurons, such as the integrate-and-fire neurons used here, can produce highly irregular spike trains if embedded in a WTA network containing recurrent excitatory and inhibitory pathways. Despite the highly irregular firing patterns produced by the chip's output neurons, the overall response of the system is consistent with its input: on average, as expected from the input stimulus distribution, the network selects pixel 22 more often than pixel 10. This can also be seen in the peristimulus time histogram in Figure 10c. Figure 10d shows the interspike interval histograms for the two neurons. Both histograms show the same type of bimodal distribution. This distribution can be approximated by a superposition of two gaussians: one centered around 7 ms and the other around 12 ms. The bimodal distribution arises from the fact that the output neuron is either being driven by a constant injection current Iinj (in the case in which that pixel is the winner) or is not receiving any input (in the case in which the WTA network selects a different pixel as the winner). If the pixel considered is the winner, the interspike interval (which is inversely proportional to the amplitude of the injection current Iinj ) is constant and approximately equal to 7 ms. This explains why the gaussian centered around 7 ms has a small standard deviation.
On the other hand, if the neuron is not receiving any input, the interspike interval is not constant and depends on the time the network takes to switch from the other winning pixel back to the one considered. This time is not just a function of one parameter (as is the case for Iinj ), but depends on many factors, ranging from the values of the bias voltages in the circuits, to the frequencies of the input spike trains, to the frequency of the output neuron. Therefore the gaussian centered around 12 ms has a larger standard deviation. The control experiments were useful to verify the expected behavior of the circuits at the system level. To test the behavior of the system in the more general context of selective visual attention, we used data obtained from real-world images, processed on the workstation and sent to the chip using the same method described in section 4.1. 4.3 Saliency Map Inputs. We processed static color images with an algorithm that generates a saliency map, a feature map that topographically codes for local conspicuousness over the entire scene (Itti, Koch, & Niebur, 1998). Specifically, given the digitized input image, the algorithm computes
2874
Giacomo Indiveri
Figure 11: Test image with salient features. (a) Original figure. (b) Corresponding saliency map. (c) Input spike frequencies obtained from the injective mapping described in the text (upper trace) and distribution of the output neurons' spike counts recorded over a period of 3 s (lower histogram). (d) Position of the attended pixel recorded over time.
a set of multiscale feature maps, responding to orientation, color, and intensity contrast, and after appropriately normalizing them, combines them in a bottom-up fashion. Figure 11b shows an example of a saliency map generated from the image shown in Figure 11a. The algorithm has no a priori knowledge of what is salient and what is not, so the fact that the four neurons illustrated in the image appear to be salient (and the "Neural Systems" title appears to be not salient at all) is simply due to the color, spatial scale, and contrast properties of those local regions in the image. As the VLSI architecture used in this work is one-dimensional and the saliency map is a 2D data array, we had to choose an appropriate operator to map the 2D saliency map space onto a 1D input vector. We applied the max(·) operator to the pixels of each column of the saliency map and mapped the location of the selected pixel to the 1D input vector. In case of multiple maxima with the same value, we selected the pixel associated with the first occurrence of
the maximum, while scanning the column from the top part of the image to its bottom part. This type of mapping is injective: any value of the 1D input vector is associated with only one pixel of the saliency map. The top trace of Figure 11c shows the values of the 1D vector obtained by applying the mapping described above to the saliency map of Figure 11b. Each component of the vector corresponds to the brightest pixel of the column in the saliency map with the matching index and determines the frequency of the input spike train (e.g., the excitatory synapse at pixel 1 receives approximately 12 spikes per second, the one at pixel 2 receives approximately 19 spikes per second, the one at pixel 9 approximately 85 spikes per second, and so on). We applied this stimulus to the chip, using the method described in section 4.1, for a period of 3 seconds and recorded spike trains from the chip's output neurons. The histogram in the lower part of Figure 11c represents the activity of these output neurons. As shown, on average the system attends pixels 9 and 10 most of the time, shifting its focus of attention to regions centered around pixels 19 and 28 quite frequently and to other regions of the image less frequently. To show the dynamical aspects of the focus of attention, we plotted, in Figure 11d, the address of the pixel attended, over time. As the mapping performed from the 2D data of Figure 11b to the 1D vector of Figure 11c is injective, we can replot the 1D data of Figure 11d onto the 2D saliency map. Figure 12 shows such a plot: the white dots superimposed on the saliency map represent the locations attended by the chip, and the white solid lines join successive attended locations. Figure 12 resembles the visual scanpaths recorded from human subjects (Yarbus, 1967; Noton & Stark, 1971); yet the figure represents the movements of the focus of attention, which do not necessarily match on a one-to-one basis the saccadic eye movements measured in scanpaths.
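The column-wise mapping just described can be sketched in a few lines of NumPy. This is an illustration, not the authors' code; the array names are hypothetical, and the linear scaling from saliency value to input spike rate is an assumption (the article specifies the rates only by example):

```python
import numpy as np

def saliency_to_1d(smap, f_min=10.0, f_max=100.0):
    """Map a 2D saliency map to a 1D input vector, one value per column.

    For each column, take the maximum saliency value; ties are broken by
    the first occurrence while scanning from the top of the image, which
    makes the mapping injective (each 1D entry corresponds to exactly one
    pixel of the map).  Returns assumed input spike frequencies (Hz) and
    the row index of the selected pixel in each column.
    """
    rows = np.argmax(smap, axis=0)               # first max per column, top-down
    values = smap[rows, np.arange(smap.shape[1])]
    # Assumed linear scaling of saliency to input spike rate.
    freqs = f_min + (f_max - f_min) * values / values.max()
    return freqs, rows

smap = np.random.rand(400, 32)                   # toy 2D saliency map
freqs, rows = saliency_to_1d(smap)
assert freqs.shape == (32,) and rows.shape == (32,)
```

`np.argmax` returns the first occurrence of the maximum along the scanned axis, which matches the top-to-bottom tie-breaking rule that makes the mapping injective.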
The focus of attention in our VLSI system tends to shift more frequently between locations that are spatially close to each other. This property, which appears to be characteristic also of human subjects (Niebur & Koch, 1998), has been explicitly engineered into our system by implementing excitatory lateral connections between winner-take-all cells (see section 3.3). Furthermore, the average time spent by the system at each location is approximately 50 ms. This measurement, also in accordance with data measured from psychophysical experiments performed on human subjects, is an emergent property of the system and has not been explicitly engineered. Mapping the 2D saliency map onto a 1D vector and the 1D output of the chip back onto the 2D saliency map introduces some artifacts that do not allow us to make a fair comparison between the scanpaths obtained from the chip and scanpaths recorded from human observers. For example, the region around pixel coordinates (260, 475) of Figure 12 is never selected by the system, precisely due to the injective mapping described above. The example of focus-of-attention scanpaths shown in Figure 12 is only illustrative. Although the 1D selective attention chip can be used in several application
Figure 12: Mapping of the 1D data of Figure 11d onto the resampled 2D saliency map data of Figure 11b. Shifts along the horizontal axis are due to the selective attention chip's response. Shifts along the vertical axis are introduced artificially by the injective mapping described in the text.
domains (Indiveri, 1999), 2D saliency maps containing both horizontal and vertical salient features are best processed using 2D selective attention chips (Indiveri, 2000; Indiveri et al., 2000).

5 Conclusions and Outlook
We have presented a VLSI device containing analog neuromorphic circuits for modeling selective attention mechanisms. The device does not implement a complete sensorimotor system (e.g., for generating saccadic eye movements); rather, it represents a basic building block for the construction of multichip selective attention models. Furthermore, it enables us to verify whether architectures of the type shown in Figure 1, implemented using integrate-and-fire neurons and biological-like signal representations, and stimulated with realistic stimuli, have the same computational properties as real selective attention mechanisms. The model proposed, containing a linear array of 32 processing elements, uses a spike-based representation to receive, process, and transmit signals. It represents one of the first examples
of neuromorphic devices able to work as a transceiver, receiving sensory input signals and transmitting the result of its computation using the same AER as the input signals. The architecture described in this article is one-dimensional, but its extension to two dimensions is straightforward: a 2D selective attention chip has already been fabricated, using the circuits presented in section 3, and is being used in a two-chip active vision system (Indiveri et al., 2000). Multiple instances of both 1D and 2D selective attention chips could be used to design hierarchical attention systems. At the low levels of the hierarchy, the selective attention chips would have the main role of filtering out nonsalient information, allowing multiple cells to win simultaneously, and normalizing all the system's sensory inputs to a common signal representation. At the higher levels of the hierarchy, the selective attention chips would merge inputs coming from the different sensory modalities, select the most salient location across multiple sensory modalities, and project their outputs either to actuators (e.g., for orienting visual sensors) or to motor control modules. The analog, continuous, real-time response properties of these devices, the possibility of exploiting the AER to connect multiple chips with adjustable connectivity mappings, and the adaptation properties of many of these neuromorphic circuits are leading us to the design of increasingly complex models of sensorimotor systems that can interact with the real world in real time.
Acknowledgments
I am grateful to Laurent Itti for allowing me to use his software visual attention model to generate the saliency map of Figure 11b, to Timmer Horiuchi and Adrian Whatley for their constructive comments on the manuscript, and to Rodney Douglas and Kevan Martin for providing an outstanding intellectual environment at the Institute of Neuroinformatics in Zurich. Some of the ideas that led to the design and implementation of the chip proposed here were inspired by the Telluride Workshop on Neuromorphic Engineering (http://www.ini.unizh.ch/telluride99). Fabrication of the integrated circuits was provided by MOSIS. This work was supported by the Swiss National Science Foundation SPP Grant and the U.S. Office of Naval Research.
References

Behrmann, M., & Haimson, C. (1999). The cognitive neuroscience of visual attention. Current Opinion in Neurobiology, 9, 158–163. Bernays, E. (1996, July). Selective attention and host-plant specialization. Entomologia Experimentalis et Applicata, 80(1), 125–131.
Boahen, K. (1998). Communicating neuronal ensembles between neuromorphic chips. In T. S. Lande (Ed.), Neuromorphic systems engineering (pp. 229–259). Norwell, MA: Kluwer Academic. Boahen, K. (1999, April). Multiple pathways: Retinomorphic chips that see quadruple images. In Proceedings of the Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems; Microneuro '99 (pp. 12–20). Los Alamitos, CA: IEEE Computer Society. Brajovic, V., & Kanade, T. (1998, August). Computational sensor for visual tracking with attention. IEEE Journal of Solid State Circuits, 33(8), 1199–1207. Braun, J., & Julesz, B. (1998). Dividing attention at little cost: Detection and discrimination tasks. Perception and Psychophysics, 60, 1–23. Culham, J., Brandt, S., Cavanagh, P., Kanwisher, N., Dale, A., & Tootell, P. (1999). Cortical fMRI activation produced by attentive tracking of moving targets. J. Neurophysiol., 81, 388–393. Deiss, S. R., Douglas, R. J., & Whatley, A. M. (1998). A pulse-coded communications infrastructure for neuromorphic systems. In W. Maass & C. M. Bishop (Eds.), Pulsed neural networks (pp. 157–178). Cambridge, MA: MIT Press. Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annu. Rev. Neurosci., 18, 193–222. DeWeerth, S., & Morris, T. (1995, June). CMOS current mode winner-take-all circuit with distributed hysteresis. Electronics Letters, 31(13), 1051–1053. Douglas, R., Mahowald, M., & Mead, C. (1995). Neuromorphic analogue VLSI. Annu. Rev. Neurosci., 18, 255–281. Fragnière, E., van Schaik, A., & Vittoz, E. (1997, May). Design of an analogue VLSI model of an active cochlea. Analog Integrated Circuits and Signal Processing, 13(1/2), 19–35. Gibson, B., & Egeth, H. (1994). Inhibition of return to object-based and environment-based locations. Percept. Psychophys., 55, 323–339. Hahnloser, R., Douglas, R. J., Mahowald, M., & Hepp, K. (1999).
Feedback interactions between neuronal pointers and maps for attentional processing. Nature Neuroscience, 2, 746–752. Hansel, D., & Sompolinsky, H. (1996, March). Chaos and synchrony in a model of a hypercolumn in visual cortex. J. Computat. Neuroscience, 3(1), 7–34. Harrison, R., & Koch, C. (2000). A silicon implementation of the fly's optomotor control system. Neural Computation. Hoffman, J., & Subramaniam, B. (1995). The role of visual attention in saccadic eye movements. Perception and Psychophysics, 57(6), 787–795. Horiuchi, T., & Koch, C. (1999). Analog VLSI-based modeling of the primate oculomotor system. Neural Computation, 11, 243–265. Indiveri, G. (1997). Winner-take-all networks with lateral excitation. Analog Integrated Circuits and Signal Processing, 13(1/2), 185–193. Indiveri, G. (1999, November). Neuromorphic analog VLSI sensor for visual tracking: Circuits and application examples. IEEE Trans. on Circuits and Systems II, 46(11), 1337–1347. Indiveri, G. (2000). A 2D neuromorphic VLSI architecture for modeling selective attention. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks; IJCNN2000.
Indiveri, G., Mürer, R., & Kramer, J. (2000). Active vision using an analog VLSI model of selective attention. Submitted. Indiveri, G., Whatley, A., & Kramer, J. (1999, April). A reconfigurable neuromorphic VLSI multi-chip system applied to visual motion computation. In Proceedings of the Seventh International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems; Microneuro '99 (pp. 37–44). Los Alamitos, CA: IEEE Computer Society. Itti, L., Koch, C., & Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 20(11), 1254–1259. Kastner, S., De Weerd, P., Desimone, R., & Ungerleider, L. G. (1998, October). Mechanisms of directed attention in the human extrastriate cortex as revealed by functional MRI. Science, 282(2), 108–111. Koch, C., & Ullman, S. (1985). Shifts in selective visual attention: Towards the underlying neural circuitry. Human Neurobiology, 4(4), 219–227. Lazzaro, J., Ryckebusch, S., Mahowald, M., & Mead, C. (1989). Winner-take-all networks of O(n) complexity. In D. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 703–711). San Mateo, CA: Morgan Kaufmann. Lazzaro, J., Wawrzynek, J., Mahowald, M., Sivilotti, M., & Gillespie, D. (1993). Silicon auditory processors as computer peripherals. IEEE Trans. on Neural Networks, 4, 523–528. Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley. Miller, M. J., & Bockisch, C. (1997). Where are the things we see? Nature, 386(10), 550–551. Morris, T., & DeWeerth, S. (1996, February). Analog VLSI circuits for covert attentional shifts. In Proceedings of the Fifth International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems; Microneuro '96 (pp. 30–37). Los Alamitos, CA: IEEE Computer Society Press. Morris, T. G., & DeWeerth, S. P. (1997, May–June). Analog VLSI excitatory feedback circuits for attentional shifts and tracking.
Analog Integrated Circuits and Signal Processing, 13(1/2), 79–92. Morris, T. G., Horiuchi, T. K., & DeWeerth, S. P. (1998). Object-based selection within an analog VLSI visual attention system. IEEE Trans. on Circuits and Systems II, 45(12), 1564–1572. Mozer, M., & Sitton, M. (1998). Computational modeling of spatial attention. In H. Pashler (Ed.), Attention (pp. 341–395). East Sussex: Psychology Press. Nakayama, K., & Mackeben, M. (1989). Sustained and transient components of focal visual attention. Vision Research, 29, 1631–1647. Niebur, E., & Koch, C. (1998). Computational architectures for attention. In R. Parasuraman (Ed.), The attentive brain (pp. 163–186). Cambridge, MA: MIT Press. Noton, D., & Stark, L. (1971). Scanpaths in eye movements during pattern perception. Science, 171, 308–311. Pollack, G. (1988). Selective attention in an insect auditory neuron. J. Neurosci., 8, 2635–2639.
Ritz, R., Gerstner, W., Gaudoin, R., & van Hemmen, J. L. (1997). Poisson-like neuronal firing due to multiple synfire chains in simultaneous action. In J. Bower (Ed.), Computational neuroscience: Trends in research 1997 (pp. 801–806). New York: Plenum Press. Salinas, E., & Abbott, L. (1996, October). A model of multiplicative neural responses in parietal cortex. Proc. Natl. Acad. Sci., 93, 11956–11961. Starzyk, J. A., & Fang, X. (1993, May). CMOS current mode winner-take-all circuit with both excitatory and inhibitory feedback. Electronics Letters, 29(10), 908–910. Tanaka, Y., & Shimojo, S. (1996, July). Location vs. feature: Reaction time reveals dissociation between two visual functions. Vision Research, 36(14), 2125–2140. Usher, M., Stemmler, M., Koch, C., & Olami, Z. (1994, September). Network amplification of local fluctuations causes high spike rate variability, fractal firing patterns and oscillatory local-field potentials. Neural Computation, 6(5), 795–836. van Schaik, A., Fragnière, E., & Vittoz, E. (1996, February). An analogue electronic model of ventral cochlear nucleus neurons. In Proceedings of the Fifth International Conference on Microelectronics for Neural, Fuzzy and Bio-inspired Systems; Microneuro '96 (pp. 52–59). Los Alamitos, CA: IEEE Computer Society Press. van Vreeswijk, C., & Sompolinsky, H. (1996, December). Chaos in neuronal networks with balanced excitatory and inhibitory activity. Science, 274(5293), 1724–1726. Wilson, C., Morris, T., & DeWeerth, S. (1999). A two-dimensional, object-based analog VLSI visual attention system. In S. P. DeWeerth, S. Wills, & A. Ishii (Eds.), Conference on advanced research in VLSI, 20. Los Alamitos, CA: IEEE Computer Society Press. Yarbus, A. L. (1967). Eye movements and vision. New York: Plenum Press. Received May 12, 1999; Accepted March 10, 2000.
LETTER
Communicated by Chris Williams
Asymptotic Convergence Rate of the EM Algorithm for Gaussian Mixtures Jinwen Ma Department of Computer Science & Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, and Institute of Mathematics, Shantou University, Shantou, Guangdong, 515063, People's Republic of China Lei Xu Department of Computer Science & Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong, People's Republic of China Michael I. Jordan Department of Computer Science and Department of Statistics, University of California at Berkeley, Berkeley, CA 94720, U.S.A. It is well known that the convergence rate of the expectation-maximization (EM) algorithm can be faster than those of conventional first-order iterative algorithms when the overlap in the given mixture is small. But this argument has not been mathematically proved yet. This article studies this problem asymptotically in the setting of gaussian mixtures under the theoretical framework of Xu and Jordan (1996). It is proved that the asymptotic convergence rate of the EM algorithm for gaussian mixtures locally around the true solution Θ* is o(e^{0.5−ε}(Θ*)), where ε > 0 is an arbitrarily small number, o(x) denotes a higher-order infinitesimal as x → 0, and e(Θ*) is a measure of the average overlap of the gaussians in the mixture. In other words, the large-sample local convergence rate for the EM algorithm tends to be asymptotically superlinear when e(Θ*) tends to zero.

1 Introduction
The expectation-maximization (EM) algorithm is a general methodology for maximum likelihood (ML) or maximum a posteriori (MAP) estimation (Dempster, Laird, & Rubin, 1977). A substantial literature has been devoted to the study of the convergence of EM and related methods (e.g., Wu, 1983; Redner & Walker, 1984; Meng, 1994; Liu & Rubin, 1994; Lange, 1995a; Meng & van Dyk, 1997; Delyon, Lavielle, & Moulines, 1999). The starting point for many of these studies is the fact that EM is generally a first-order or linearly convergent algorithm, as can readily be seen by considering EM as a mapping Θ^(k+1) = M(Θ^(k)) with fixed point Θ* = M(Θ*). Results on

Neural Computation 12, 2881–2907 (2000) © 2000 Massachusetts Institute of Technology
2882
Jinwen Ma, Lei Xu, & Michael I. Jordan
the rate of convergence of EM are obtained by calculating the information matrices for the missing data and the observed data (Dempster et al., 1977; Meng, 1994). The theoretical convergence results obtained to date are of satisfying generality but of limited value in understanding why EM may converge slowly or rapidly in a particular problem. The existing methods to enhance the convergence rate of EM are generally based on conventional superlinear optimization theory (e.g., Lange, 1995b; Jamshidian & Jennrich, 1997; Meng & van Dyk, 1998), and are usually rather more complex than EM. Moreover, they are prey to other disadvantages, including an awkwardness at handling constraints on parameters. More fundamentally, it is not clear what the underlying factors are that slow EM's convergence. It is also worth noting that EM has been successfully applied to large-scale problems such as hidden Markov models (Rabiner, 1989), probabilistic decision trees (Jordan & Jacobs, 1994), and mixtures of experts architectures (Jordan & Xu, 1995), where its empirical convergence rate can be significantly faster than those of conventional first-order iterative algorithms (e.g., gradient ascent). These empirical studies show that EM can be slow if the overlap in the given mixture is large but rapid if the overlap in the given mixture is small. A recent analysis by Xu and Jordan (1996) provides some insight into the convergence rate of EM in the setting of gaussian mixtures. For the convenience of mathematical analyses, they studied a variant of the original EM algorithm for gaussian mixtures and showed that the condition number associated with this variant EM algorithm is guaranteed to be smaller than the condition number associated with gradient ascent, providing a general guarantee of the dominance of this variant EM algorithm over the gradient algorithm.
Moreover, in cases in which the mixture components are well separated, they showed that the condition number for this EM algorithm approximately converges to one, corresponding to a local superlinear convergence rate. Thus, in this restrictive case, this type of EM algorithm has the favorable property of showing quasi-Newton behavior as it nears the ML or MAP solution. Xu (1997) further showed that the original EM algorithm has the same convergence properties as this variant EM algorithm. We have further found experimentally that the convergence rate of the EM algorithm for gaussian mixtures is dominated by a measure of the average overlap of the gaussians in the mixture. In particular, when the average overlap measure becomes small, the EM algorithm converges quickly as it nears the true solution. As the measure of the average overlap tends to zero, the EM algorithm tends to demonstrate quasi-Newton behavior. In this article, based on the mathematical connection between the variant EM algorithm and gradient algorithms, on one of the intermediate results on the convergence rate in Xu and Jordan (1996), and on the matching convergence result for the original EM algorithm (Xu, 1997), we present a further theoretical analysis of the asymptotic convergence rate of the EM algorithm
Asymptotic Convergence Rate of the EM Algorithm
2883
locally around the true solution Θ* with respect to e(Θ*), a measure of the average overlap of the gaussians in the mixture. We prove a theorem showing that the asymptotic convergence rate is a higher-order infinitesimal than e^{0.5−ε}(Θ*) as e(Θ*) tends to zero under certain conditions, where ε > 0 is an arbitrarily small number. Thus, we see that the large-sample local convergence rate for the EM algorithm tends to be asymptotically superlinear when e(Θ*) tends to zero. Section 2 presents the EM algorithm and the Hessian matrix for the gaussian mixture model. In section 3, we intuitively introduce our theorem on the asymptotic convergence rate of EM. Section 4 describes several lemmas needed for the proof; the proof is contained in section 5. Section 6 presents our conclusions.

2 The EM Algorithm and the Hessian Matrix of the Log-Likelihood
We consider the following gaussian mixture model:

$$
P(x \mid \Theta) = \sum_{j=1}^{K} \alpha_j P(x \mid m_j, \Sigma_j), \qquad \alpha_j \ge 0, \quad \sum_{j=1}^{K} \alpha_j = 1, \tag{2.1}
$$

where

$$
P(x \mid m_j, \Sigma_j) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_j|^{1/2}}\; e^{-\frac{1}{2}(x - m_j)^T \Sigma_j^{-1} (x - m_j)} \tag{2.2}
$$

and where K is the number of mixture components, x denotes a sample vector, and d is the dimensionality of x. The parameter vector Θ consists of the mixing proportions α_j, the mean vectors m_j, and the covariance matrices Σ_j = (σ_pq^{(j)})_{d×d}, which are assumed positive definite. Given K and given independently and identically distributed (i.i.d.) samples {x^{(t)}}_{t=1}^{N}, we estimate Θ by maximizing the log-likelihood:

$$
l(\Theta) = \log \prod_{t=1}^{N} P(x^{(t)} \mid \Theta) = \sum_{t=1}^{N} \log P(x^{(t)} \mid \Theta). \tag{2.3}
$$
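As a concrete illustration (mine, not from the article), equations 2.1 through 2.3 can be evaluated directly; the following sketch computes the log-likelihood of a sample set under a gaussian mixture:

```python
import numpy as np

def log_likelihood(X, alpha, means, covs):
    """Log-likelihood of samples X (N x d) under a gaussian mixture with
    mixing proportions alpha (K,), means (K x d), and covariances
    covs (K x d x d), following equations 2.1-2.3."""
    N, d = X.shape
    dens = np.zeros((N, len(alpha)))
    for j, (m, S) in enumerate(zip(means, covs)):
        diff = X - m
        Sinv = np.linalg.inv(S)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(S))
        quad = np.einsum('ni,ij,nj->n', diff, Sinv, diff)  # (x-m)^T S^-1 (x-m)
        dens[:, j] = alpha[j] * np.exp(-0.5 * quad) / norm  # alpha_j P(x|m_j,S_j)
    return np.sum(np.log(dens.sum(axis=1)))                 # eq. 2.3
```

For example, with two identical standard-normal components the mixture density collapses to a single gaussian, so the result must agree with the closed-form standard-normal log-likelihood.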
This log-likelihood can be optimized iteratively via the EM algorithm as follows:

$$
\alpha_j^{(k+1)} = \frac{1}{N} \sum_{t=1}^{N} h_j^{(k)}(t) \tag{2.4}
$$

$$
m_j^{(k+1)} = \frac{1}{\sum_{t=1}^{N} h_j^{(k)}(t)} \sum_{t=1}^{N} h_j^{(k)}(t)\, x^{(t)} \tag{2.5}
$$
$$
\Sigma_j^{(k+1)} = \frac{1}{\sum_{t=1}^{N} h_j^{(k)}(t)} \sum_{t=1}^{N} h_j^{(k)}(t)\, \big(x^{(t)} - m_j^{(k+1)}\big)\big(x^{(t)} - m_j^{(k+1)}\big)^T, \tag{2.6}
$$

where the posterior probabilities h_j^{(k)}(t) are given by

$$
h_j^{(k)}(t) = \frac{\alpha_j^{(k)} P\big(x^{(t)} \mid m_j^{(k)}, \Sigma_j^{(k)}\big)}{\sum_{i=1}^{K} \alpha_i^{(k)} P\big(x^{(t)} \mid m_i^{(k)}, \Sigma_i^{(k)}\big)}. \tag{2.7}
$$
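A minimal NumPy sketch of one EM iteration (equations 2.4 through 2.7) may help fix the notation; this is my own illustration with hypothetical variable names, not the authors' code:

```python
import numpy as np

def em_step(X, alpha, means, covs):
    """One EM iteration for a gaussian mixture (equations 2.4-2.7)."""
    N, d = X.shape
    K = len(alpha)
    # E-step: posterior probabilities h_j(t), equation 2.7.
    h = np.zeros((N, K))
    for j in range(K):
        diff = X - means[j]
        quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(covs[j]), diff)
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(covs[j]))
        h[:, j] = alpha[j] * np.exp(-0.5 * quad) / norm
    h /= h.sum(axis=1, keepdims=True)
    # M-step: equations 2.4-2.6.
    Nj = h.sum(axis=0)
    alpha_new = Nj / N
    means_new = (h.T @ X) / Nj[:, None]
    covs_new = np.empty_like(covs)
    for j in range(K):
        diff = X - means_new[j]          # uses the new mean m_j^(k+1), eq. 2.6
        covs_new[j] = (h[:, j, None] * diff).T @ diff / Nj[j]
    return alpha_new, means_new, covs_new
```

Note that the variant introduced below (equation 2.8) would instead use the previous mean `means[j]` in the covariance update.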
This iterative procedure converges to a local maximum of the log-likelihood (Dempster et al., 1977). We suppose that Θ̂ is a local solution to the likelihood equation 2.3 and that the EM algorithm converges to it. We now analyze the local convergence rate around this solution. For the convenience of mathematical analysis, Xu and Jordan (1996) studied a variant of the EM algorithm in which equation 2.6 is replaced by

$$
\Sigma_j^{(k+1)} = \frac{1}{\sum_{t=1}^{N} h_j^{(k)}(t)} \sum_{t=1}^{N} h_j^{(k)}(t)\, \big(x^{(t)} - m_j^{(k)}\big)\big(x^{(t)} - m_j^{(k)}\big)^T; \tag{2.8}
$$
that is, the update of Σ_j^{(k+1)} is based on the previous mean m_j^{(k)} instead of m_j^{(k+1)} as in equation 2.6. For clarity, we denote this variant of EM by VEM in this article. Xu and Jordan showed that at each iteration, the following relationship holds between the gradient of the log-likelihood and the VEM update step:

$$
A^{(k+1)} - A^{(k)} = P_A(k) \left. \frac{\partial l}{\partial A} \right|_{A = A^{(k)}} \tag{2.9}
$$

$$
m_j^{(k+1)} - m_j^{(k)} = P_{m_j}(k) \left. \frac{\partial l}{\partial m_j} \right|_{m_j = m_j^{(k)}} \tag{2.10}
$$

$$
\mathrm{vec}\big[\Sigma_j^{(k+1)}\big] - \mathrm{vec}\big[\Sigma_j^{(k)}\big] = P_{\Sigma_j}(k) \left. \frac{\partial l}{\partial\, \mathrm{vec}[\Sigma_j]} \right|_{\Sigma_j = \Sigma_j^{(k)}}, \tag{2.11}
$$
where

$$
P_A(k) = \frac{1}{N} \left( \mathrm{diag}\big[\alpha_1^{(k)}, \ldots, \alpha_K^{(k)}\big] - A^{(k)} A^{(k)T} \right) \tag{2.12}
$$

$$
P_{m_j}(k) = \frac{1}{\sum_{t=1}^{N} h_j^{(k)}(t)}\, \Sigma_j^{(k)} \tag{2.13}
$$

$$
P_{\Sigma_j}(k) = \frac{2}{\sum_{t=1}^{N} h_j^{(k)}(t)} \left( \Sigma_j^{(k)} \otimes \Sigma_j^{(k)} \right), \tag{2.14}
$$
and where A denotes the vector of mixing proportions [α_1, …, α_K]^T, j indexes the mixture components (j = 1, …, K), k denotes the iteration number, and vec[B] denotes the vector obtained by stacking the column vectors of
the matrix B, vec[B]T D .vec[B]/T , and denotes the Kronecker product. P Moreover, given the constraints jKD1 ®j.k/ D 1 and ®j.k/ ¸ 0; PA .k/ is a positive denite matrix and the matrices Pmj .k/ and P6j .k/ are positive denite with probability one for N sufciently large. Assembling these matrices into a single matrix P.2/ D diag[PA ; Pm1 ; : : : ; PmK ; P61 ; : : : ; P6K ]; we have the following equation for the VEM update step given by equation 3.7 in Xu and Jordan (1996): ± ² @l | D .k/ ; 2.k C 1/ D 2.k/ C P 2.k/ @2 2 2
(2.15)
where 2 is the collection of mixture parameters: 2 D [AT ; mT1 ; : : : ; mTK ; vec[61 ]T ; : : : ; vec[6K ]T ]T : In order to represent 2 to a set of independent variables for derivation, we introduce the following subspace, 8 9 < X K = .j/ .j/ R1 D 2: ®j D 0; ¾pq D ¾qp ; for all j; p; q ; : ; jD 1
which is obtained from 8 9 < X K = .j/ .j/ R2 D 2: ®j D 1; ¾pq D ¾qp ; for all j; p; q : ; D1 j
by the constant shift 20 . For the gaussian mixture, the constraint that all 6j are positive denite should also be added to R2 and thus R1 . It can be easily veried that this constraint makes R1 be an open convex set of it. Since we will consider only the local differential properties of log-likelihood function at an interior point of the open convex set, we can set a new coordinate system for the parameter vector 2 via a set of the unit basis vectors E D [e1 ; : : : ; em ], where m is the dimension of R1 . In fact, for each 20 D 2 ¡ 20 2 R1 , let its coordinates under the bases e1 ; : : : ; em be denoted by 2c ; we have 2 ¡ 20 D E2c ; or 2 D E2c C 20 :
(2.16)
Multiplying its both sides by ET , it follows from ET E D I that ET 2 D ET E2c C ET 20 ; or 2c D ET 2 ¡ ET 20 :
(2.17)
Putting it into equation 2.16, we have 20 D 2 ¡ 20 D ET E.2 ¡ 20 / D EET 20 for 20 2 R1 :
(2.18)
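As a concrete illustration of the coordinate construction above, the sketch below (NumPy is an assumed dependency; the value of $K$ is illustrative) builds an orthonormal basis for the sum-to-zero subspace used for the mixing proportions and checks the two facts behind equations 2.16 through 2.18: the basis has orthonormal columns, and the projection $EE^T$ reproduces any vector of the subspace.

```python
import numpy as np

K = 4
# Orthonormal basis for {x in R^K : sum(x) = 0}: take the null space
# of the row vector [1, ..., 1] via the SVD.
ones = np.ones((1, K))
_, _, Vt = np.linalg.svd(ones)
E1 = Vt[1:].T          # K x (K-1); rows 1..K-1 of Vt span the null space

# E1^T E1 = I (orthonormal columns, so ||E1^T|| = 1 as in the text)
print(np.allclose(E1.T @ E1, np.eye(K - 1)))      # True
# Every basis vector sums to zero
print(np.allclose(ones @ E1, 0))                  # True
# The projection E1 E1^T reproduces any sum-to-zero vector (equation 2.18)
v = np.array([1.0, -2.0, 0.5, 0.5])               # sums to zero
print(np.allclose(E1 @ E1.T @ v, v))              # True
```

The same construction, applied blockwise, yields the matrix $E$ used throughout the article.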
2886
Jinwen Ma, Lei Xu, & Michael I. Jordan
Thus, for a matrix $A$, let $\|A\| = \max_{\|\Theta'\| = 1,\ \Theta' \in R_1} \|A\Theta'\|$ be its norm constrained on $R_1$. From equation 2.18, we have the equality $\|A\| = \|A E E^T\|$. We use the Euclidean norm for vectors in this article, and have

\[ \|E^T\| = \max_{\|\Theta\| = 1,\ \Theta \in R_1} \|E^T\Theta\| = \max_{\|\Theta\| = 1,\ \Theta \in R_1} \bigl(\Theta^T E E^T \Theta\bigr)^{1/2} = \max_{\|\Theta\| = 1,\ \Theta \in R_1} \bigl(\Theta^T\Theta\bigr)^{1/2} = \max_{\|\Theta\| = 1,\ \Theta \in R_1} \|\Theta\| = 1. \]

Following the last inequality on p. 137 of Xu and Jordan (1996), the local convergence rate around $\hat\Theta$ is bounded by

\[ r = \lim_{k\to\infty} \frac{\|\Theta^{(k+1)} - \hat\Theta\|}{\|\Theta^{(k)} - \hat\Theta\|} \le \bigl\|E^T\bigl(I + P(\hat\Theta)H(\hat\Theta)\bigr)\bigr\| = \bigl\|E^T\bigl(I + P(\hat\Theta)H(\hat\Theta)\bigr)E E^T\bigr\| \le \bigl\|E^T\bigl(I + P(\hat\Theta)H(\hat\Theta)\bigr)E\bigr\|\,\|E^T\|. \tag{2.19} \]

By the fact $\|E^T\| = 1$, we have

\[ r \le \bigl\|I + E^T P(\hat\Theta)H(\hat\Theta)E\bigr\|. \tag{2.20} \]
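The contraction ratio bounded in equation 2.20 can be observed directly by running EM and measuring successive distances to the converged point. The sketch below (NumPy assumed; the two-component 1-D mixture, the initialization, and all constants are illustrative, and the updates are the standard gaussian-mixture EM steps rather than a verbatim copy of the paper's equations, which are not reproduced in this excerpt):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative 1-D mixture with well-separated components (means -5 and 5).
N = 20000
z = rng.random(N) < 0.5
x = np.where(z, rng.normal(-5.0, 1.0, N), rng.normal(5.0, 1.0, N))

def em_step(alpha, m, s2):
    # E-step: posterior responsibilities h_j(t)
    p = alpha * np.exp(-0.5 * (x[:, None] - m) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    h = p / p.sum(axis=1, keepdims=True)
    # M-step: reestimate mixing proportions, means, variances
    Nj = h.sum(axis=0)
    alpha, m = Nj / N, (h * x[:, None]).sum(axis=0) / Nj
    s2 = (h * (x[:, None] - m) ** 2).sum(axis=0) / Nj
    return alpha, m, s2

pack = lambda t: np.concatenate(t)
theta = (np.array([0.4, 0.6]), np.array([-3.0, 3.0]), np.array([2.0, 2.0]))
trace = [pack(theta)]
for _ in range(30):
    theta = em_step(*theta)
    trace.append(pack(theta))

hat = trace[-1]                     # converged point, standing in for theta-hat
d = [np.linalg.norm(t - hat) for t in trace]
rate0 = d[1] / d[0]                 # first contraction ratio toward theta-hat
print(rate0)                        # far below 1 when the overlap is small
```

With heavily overlapping components the same experiment produces ratios close to 1, which is the regime the rest of the article quantifies.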
For the original EM algorithm, Xu (1997) further showed that the convergence rates of the original EM algorithm and the VEM algorithm are the same. Therefore, equation 2.20 also holds for the original EM algorithm, and the following analyses and results apply to both the EM and VEM algorithms.

Suppose that the samples $\{x^{(t)}\}_1^N$ are randomly drawn from the gaussian mixture with parameter $\Theta^*$. When the gaussian mixture model satisfies certain regularity conditions, the EM iterations arrive at a consistent solution maximizing the log-likelihood equation 2.3 (De Veaux, 1986; Redner & Walker, 1984). In this article, we assume that the EM algorithm asymptotically converges to this true parameter (i.e., when $N$ is large, for the sample data $\{x^{(t)}\}_1^N$, the EM algorithm converges to $\hat\Theta$ with $\lim_{N\to\infty}\hat\Theta = \Theta^*$), and we analyze the local convergence rate around this consistent solution in the limit. It follows from equation 2.20 that an upper bound on the asymptotic convergence rate is given by

\[ r \le \lim_{N\to\infty} \bigl\|I + E^T P(\hat\Theta)H(\hat\Theta)E\bigr\| = \bigl\|I + E^T \lim_{N\to\infty} P(\hat\Theta)H(\hat\Theta)\,E\bigr\| = \bigl\|I + E^T \lim_{N\to\infty} P(\Theta^*)H(\Theta^*)\,E\bigr\|. \tag{2.21} \]
In order to estimate this bound, the rest of this article analyzes the convergence behavior of $P(\Theta^*)H(\Theta^*)$ as $N$ increases to infinity. First, we give the Hessian of the log-likelihood equation 2.3 in block form. The detailed derivation is omitted (for an example of such a derivation, see Xu and Jordan, 1996; for general methods of matrix derivatives, see Gerald, 1980):

\[ H_{A,A^T} = \frac{\partial^2 l}{\partial A\,\partial A^T} = -\sum_{t=1}^N H_A^{-1}(t)\bigl(H_A^{-1}(t)\bigr)^T; \]
\[ H_{A,m_j^T} = \frac{\partial^2 l}{\partial A\,\partial m_j^T} = \sum_{t=1}^N R_{A_j}^{-1}(t)\,\bigl(x^{(t)} - m_j\bigr)^T \Sigma_j^{-1}; \]
\[ H_{A,\Sigma_j^T} = \frac{\partial^2 l}{\partial A\,\partial \Sigma_j^T} = -\frac{1}{2}\sum_{t=1}^N R_{A_j}^{-1}(t)\,\mathrm{vec}\bigl[\Sigma_j^{-1} - U_j(t)\bigr]^T; \]
\[ H_{m_i,m_j^T} = \frac{\partial^2 l}{\partial m_i\,\partial m_j^T} = -\Sigma_i^{-1}\,\delta_{ij}\sum_{t=1}^N h_i(t) + \sum_{t=1}^N \gamma_{ij}(t)\,\Sigma_i^{-1}\bigl(x^{(t)} - m_i\bigr)\bigl(x^{(t)} - m_j\bigr)^T \Sigma_j^{-1}; \]
\[ H_{m_i,\Sigma_j^T} = \frac{\partial^2 l}{\partial m_i\,\partial\,\mathrm{vec}[\Sigma_j]^T} = -\frac{1}{2}\sum_{t=1}^N \gamma_{ij}(t)\,\bigl(\Sigma_i^{-1}(x^{(t)} - m_i)\bigr)\,\mathrm{vec}\bigl[\Sigma_j^{-1} - U_j(t)\bigr]^T - \sum_{t=1}^N \delta_{ij}\,h_i(t)\,\bigl(\Sigma_i^{-1}(x^{(t)} - m_i)\bigr)\,\mathrm{vec}\bigl[\Sigma_i^{-1}\bigr]^T; \]
\[ H_{\Sigma_i,\Sigma_j^T} = \frac{\partial^2 l}{\partial\,\mathrm{vec}[\Sigma_i]\,\partial\,\mathrm{vec}[\Sigma_j]^T} = -\frac{1}{4}\sum_{t=1}^N \gamma_{ij}(t)\,\mathrm{vec}\bigl[\Sigma_i^{-1} - U_i(t)\bigr]\,\mathrm{vec}\bigl[\Sigma_j^{-1} - U_j(t)\bigr]^T - \frac{1}{2}\sum_{t=1}^N \delta_{ij}\,h_i(t)\bigl(-\Sigma_i^{-1}\otimes\Sigma_i^{-1} + \Sigma_i^{-1}\otimes U_i(t) + U_i(t)\otimes\Sigma_i^{-1}\bigr), \]

where $\Theta = \{A, m_j, \Sigma_j,\ j = 1,\ldots,K\}$, $\delta_{ij}$ is the Kronecker delta, and

\[ H_A^{-1}(t) = \bigl[h_1(t)/\alpha_1,\ldots,h_K(t)/\alpha_K\bigr]^T, \qquad \gamma_{ij}(t) = \bigl(\delta_{ij} - h_i(t)\bigr)h_j(t), \]
\[ R_{A_j}^{-1}(t) = \bigl[\gamma_{1j}(t)/\alpha_1,\ldots,\gamma_{Kj}(t)/\alpha_K\bigr]^T, \qquad U_i(t) = U_i\bigl(x^{(t)}\bigr) = \Sigma_i^{-1}\bigl(x^{(t)} - m_i\bigr)\bigl(x^{(t)} - m_i\bigr)^T \Sigma_i^{-1}. \]
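The quantities $h_j(t)$, $\gamma_{ij}(t)$, and $U_i(t)$ above are all cheap to compute from the data. The following sketch (NumPy assumed; the small 2-D three-component mixture is illustrative) evaluates the responsibilities and $\gamma_{ij}(t)$ and checks two identities used repeatedly below: $\sum_j h_j(t) = 1$ and $\gamma_{ii}(t) = h_i(t)(1 - h_i(t))$.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, N = 2, 3, 500
alpha = np.array([0.2, 0.3, 0.5])
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
covs = np.stack([np.eye(d)] * K)

comp = rng.choice(K, size=N, p=alpha)
x = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comp])

def gauss(x, m, S):
    # Multivariate gaussian density evaluated at the rows of x
    diff = x - m
    quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(S), diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

# Posterior responsibilities h_j(t)
p = np.stack([alpha[j] * gauss(x, means[j], covs[j]) for j in range(K)], axis=1)
h = p / p.sum(axis=1, keepdims=True)

# gamma_ij(t) = (delta_ij - h_i(t)) h_j(t)
gamma = (np.eye(K)[None] - h[:, :, None]) * h[:, None, :]

print(np.allclose(h.sum(axis=1), 1.0))                        # True
print(np.allclose(gamma[:, 1, 1], h[:, 1] * (1 - h[:, 1])))   # True
```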
When the $K$ gaussian distributions in the mixture are sufficiently separated, that is, the posterior probabilities $h_j(t)$ are approximately zero or one at $\Theta^*$, it is easy to verify from the above expression of the Hessian matrix that $\lim_{N\to\infty} P(\Theta^*)H(\Theta^*)$ becomes the following block diagonal matrix:

\[ F = \begin{pmatrix} -I_K + A^*[1,1,\ldots,1] & 0 \\ 0 & -I_{K\times(d+d^2)} \end{pmatrix}, \tag{2.22} \]

where $A^* = [\alpha_1^*,\ldots,\alpha_K^*]^T$ and $I_n$ is the identity matrix of order $n$. Furthermore, we rearrange the matrix $E$ formed from the unit basis vectors that span $R_1$ into two parts:

\[ E = \begin{pmatrix} E_1 & 0 \\ 0 & E_2 \end{pmatrix}, \tag{2.23} \]

with first part $E_1 = [e_1^1,\ldots,e_{K-1}^1]$, where $e_1^1,\ldots,e_{K-1}^1$ are unit basis vectors of $R_1^1 = \{x \in R^K: \sum_{j=1}^K x_j = 0\}$. From equations 2.22 and 2.23, we have

\[ E^T(I + F)E = \begin{pmatrix} E_1^T A^*[1,1,\ldots,1]E_1 & 0 \\ 0 & 0 \end{pmatrix} = 0, \tag{2.24} \]

since $[1,\ldots,1]E_1 = 0$. Substituting this result into the upper bound in equation 2.21, we have

\[ r \le \bigl\|I + E^T \lim_{N\to\infty} P(\Theta^*)H(\Theta^*)\,E\bigr\| = \|E^T(I + F)E\| = 0. \]
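The cancellation in equation 2.24 is easy to check numerically. The sketch below (NumPy assumed; $K$, $d$, and $\alpha^*$ are illustrative) builds $F$ from equation 2.22, assembles $E$ with $E_1$ spanning the sum-to-zero subspace, and confirms that the product vanishes. For brevity the sketch takes $E_2$ to be the identity, ignoring the symmetry constraint on the $\Sigma_j$; this does not affect the check, because the lower block of $I + F$ is zero anyway.

```python
import numpy as np

K, d = 3, 2
alpha = np.array([0.2, 0.3, 0.5])          # illustrative mixing proportions

# F from equation 2.22
top = -np.eye(K) + np.outer(alpha, np.ones(K))
F = np.block([
    [top, np.zeros((K, K * (d + d * d)))],
    [np.zeros((K * (d + d * d), K)), -np.eye(K * (d + d * d))],
])

# E1: orthonormal basis of the sum-to-zero subspace of R^K
_, _, Vt = np.linalg.svd(np.ones((1, K)))
E1 = Vt[1:].T                              # K x (K-1)
E2 = np.eye(K * (d + d * d))               # symmetry of the Sigma_j ignored here
E = np.block([
    [E1, np.zeros((K, E2.shape[1]))],
    [np.zeros((E2.shape[0], K - 1)), E2],
])

M = E.T @ (np.eye(F.shape[0]) + F) @ E
print(np.allclose(M, 0))                   # True: equation 2.24
```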
Thus, the EM algorithm for gaussian mixtures has a superlinear convergence rate in this case. A special case of this result was first given in Xu and Jordan (1996). In practice, however, the gaussian distributions in the mixture cannot be strictly separated. In the rest of this article, we show that $\lim_{N\to\infty} P(\Theta^*)H(\Theta^*)$ approximately converges to $F$ as the overlap of any two gaussian distributions in the mixture tends to zero.

3 The Main Theorem
We begin by introducing a set of quantities measuring the overlap of the component gaussian distributions:

\[ e_{ij}(\Theta^*) = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N |\gamma_{ij}(t)| = \int |\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx, \quad \text{for } i, j = 1,\ldots,K, \]
where $\gamma_{ij}(x) = (\delta_{ij} - h_i(x))h_j(x)$ and

\[ h_l(x) = \frac{\alpha_l^*\,P(x|m_l^*, \Sigma_l^*)}{\sum_{k=1}^K \alpha_k^*\,P(x|m_k^*, \Sigma_k^*)}, \quad \text{for } l = 1,\ldots,K. \]

Since $|\gamma_{ij}(x)| \le 1$, we have $e_{ij}(\Theta^*) \le 1$. If $i \ne j$, then $e_{ij}(\Theta^*) = \int h_i(x)h_j(x)P(x|\Theta^*)\,dx$, which can be considered a measure of the average overlap between gaussians $i$ and $j$. In fact, if $P(x|m_i^*, \Sigma_i^*)$ and $P(x|m_j^*, \Sigma_j^*)$ overlap to a high degree at a point $x$, then $h_i(x)h_j(x)$ takes a high value; otherwise it takes a low value. When the two densities are well separated at $x$, $h_i(x)h_j(x)$ vanishes. Thus, the product $h_i(x)h_j(x)$ represents the degree of overlap between $P(x|m_i^*, \Sigma_i^*)$ and $P(x|m_j^*, \Sigma_j^*)$ at $x$ in the mixture environment, and $e_{ij}(\Theta^*)$ is the average overlap between gaussians $i$ and $j$ in the mixture. Since two gaussian densities always have some overlap, we always have $e_{ij}(\Theta^*) > 0$. Moreover, by the fact $\sum_{j=1}^K h_j(x) = 1$, we also have

\[ e_{ii}(\Theta^*) = \int |\gamma_{ii}(x)|\,P(x|\Theta^*)\,dx = \int h_i(x)\bigl(1 - h_i(x)\bigr)P(x|\Theta^*)\,dx = \sum_{j\ne i} \int h_i(x)h_j(x)P(x|\Theta^*)\,dx = \sum_{j\ne i} e_{ij}(\Theta^*) > 0. \]

Clearly, $e_{ii}(\Theta^*)$ represents the overlap of the $i$th gaussian distribution with all the other gaussians in the mixture. We consider the worst case and define

\[ e(\Theta^*) = \max_{ij} e_{ij}(\Theta^*) \le 1. \tag{3.1} \]
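Because $e_{ij}(\Theta^*)$ is an integral against the mixture density, it can be estimated by simple Monte Carlo, exactly as in the sample-average form of its definition. A sketch (NumPy assumed; the equal-weight 1-D two-component mixture with unit variances is illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def overlap_e(delta, N=200_000):
    """Monte Carlo estimate of e_12 for an equal-weight 1-D mixture
    with unit variances and means 0 and delta."""
    z = rng.random(N) < 0.5
    x = np.where(z, rng.normal(0.0, 1.0, N), rng.normal(delta, 1.0, N))
    p1 = 0.5 * np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)
    p2 = 0.5 * np.exp(-0.5 * (x - delta) ** 2) / np.sqrt(2 * np.pi)
    h1 = p1 / (p1 + p2)
    return np.mean(h1 * (1.0 - h1))     # (1/N) sum_t h_1(t) h_2(t)

e_vals = {delta: overlap_e(delta) for delta in (1.0, 3.0, 6.0)}
print(e_vals)    # the overlap shrinks quickly as the means separate
```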
Experimentally, one finds that as the gaussian distributions in the mixture become more separated, the EM algorithm asymptotically converges to the true parameter $\Theta^*$ at a better rate. Here, we study mathematically the asymptotic convergence properties of the EM algorithm with respect to the average overlap measure $e(\Theta^*)$.

First, we introduce an alternative quantity $\eta(\Theta^*)$ that describes the overall overlap of a gaussian mixture through its geometric structure. For illustration, we define the characteristic contour of the component gaussian density $P_i(x|m_i^*, \Sigma_i^*)$ by

\[ (x - m_i^*)^T \Sigma_i^{*-1} (x - m_i^*) = 1. \tag{3.2} \]

Since $\Sigma_i^*$ is a positive definite matrix, there certainly exists an orthonormal matrix $U_i$ such that

\[ \Sigma_i^* = U_i^T\,\mathrm{diag}\bigl[\lambda_{i1}, \lambda_{i2},\ldots,\lambda_{id}\bigr]\,U_i, \tag{3.3} \]

where $\lambda_{i1},\ldots,\lambda_{id}$ are the eigenvalues of $\Sigma_i^*$. By the transformation $y = U_i(x - m_i^*)$, equation 3.2 becomes

\[ \sum_{j=1}^d \lambda_{ij}^{-1}\,y_j^2 = 1, \]

which is a hyperellipsoid centered at the origin in the transformed space. It follows from the inverse transformation $x = m_i^* + U_i^T y$ that the characteristic contour is a hyperellipsoid centered at $m_i^*$ in the original space. In fact, the axes of this characteristic hyperellipsoid describe the attenuation rates of the component density along the coordinate basis directions in the transformed space (equivalently, along the $d$ principal directions from $m_i^*$ in the original space), since they are the square roots of the eigenvalues, which are the variances of the $i$th component distribution in those directions. In other words, the density attenuates rapidly as a point moves away from the center along a basis direction whose corresponding axis is small; otherwise it attenuates slowly.

For simplicity, instead of the characteristic contour we use the minimum-radius hypersphere that contains it,

\[ (x - m_i^*)^T (x - m_i^*) = \lambda_{\max}^i, \]

whose radius is $(\lambda_{\max}^i)^{1/2}$. In the same way, we obtain the characteristic contour and minimum-radius hypersphere of the $j$th component distribution. Since the overlap between the two densities is approximately proportional to the overlap between the two characteristic contours or minimum-radius hyperspheres, we define the overlap $\eta_{ij}(\Theta^*)$ between the densities of components $i$ and $j$ by

\[ \eta_{ij}(\Theta^*) = \frac{(\lambda_{\max}^i)^{1/2}\,(\lambda_{\max}^j)^{1/2}}{\|m_i^* - m_j^*\|}. \]

Therefore, for a gaussian mixture with true parameter $\Theta^*$, we define

\[ \eta(\Theta^*) = \max_{i\ne j}\,\eta_{ij}(\Theta^*) \tag{3.4} \]
as the overall overlap degree of the gaussians in the mixture from the worst-case point of view. Obviously, $\eta(\Theta^*) \to 0$ is equivalent to $e(\Theta^*) \to 0$.

Second, we consider some assumptions that regularize the manner in which $e(\Theta^*)$ tends to zero. We first assume that $\Theta^*$ satisfies

Condition 1: $\alpha_i^* \ge \alpha$, for $i = 1,\ldots,K$,

where $\alpha$ is a positive number. When some mixing proportion $\alpha_i^*$ tends to zero, the corresponding gaussian withdraws from the mixture, which then degenerates to a mixture with fewer components. This assumption excludes that degeneracy.

Our second assumption is that the eigenvalues of all the covariance matrices satisfy

Condition 2: $\beta\lambda(\Theta^*) \le \lambda_{ik} \le \lambda(\Theta^*)$, for $i = 1,\ldots,K$, $k = 1,\ldots,d$,

where $\beta$ is a positive number and $\lambda(\Theta^*)$ is defined to be the maximum eigenvalue of the covariance matrices $\Sigma_1^*,\ldots,\Sigma_K^*$, that is, $\lambda(\Theta^*) = \max_{i,k} \lambda_{ik}$, which is always upper bounded by a positive number $B$. That is, all the eigenvalues attenuate uniformly when they tend to zero. It follows from condition 2 that the condition numbers of all the covariance matrices are uniformly upper bounded, that is, $1 \le \kappa(\Sigma_i^*) \le B'$ for $i = 1,\ldots,K$, where $\kappa(\Sigma_i^*)$ is the condition number of $\Sigma_i^*$ and $B'$ is a positive number.

The third assumption is that the mean vectors of the component densities in the mixture satisfy

Condition 3: $\nu D_{\max}(\Theta^*) \le D_{\min}(\Theta^*) \le \|m_i^* - m_j^*\| \le D_{\max}(\Theta^*)$, for $i \ne j$,

where $D_{\max}(\Theta^*) = \max_{i\ne j}\|m_i^* - m_j^*\|$, $D_{\min}(\Theta^*) = \min_{i\ne j}\|m_i^* - m_j^*\|$, and $\nu$ is a positive number. That is, all the distances between two mean vectors are infinitely large quantities of the same order as they tend to infinity. Moreover, when the overlap of the densities in the mixture reduces to zero, no two means $m_i^*, m_j^*$ can become arbitrarily close; there should be a positive value $T$ such that $\|m_i^* - m_j^*\| \ge T$ for $i \ne j$. It is also natural to assume that the mean vectors take different directions when they diverge to infinity.

With the above preparations, we are ready to introduce our main theorem.
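Before stating the theorem, note that $\eta(\Theta^*)$ is cheap to evaluate: it needs only the largest eigenvalue of each covariance and the pairwise mean distances. A sketch (NumPy assumed; the three-component 2-D parameters are illustrative):

```python
import numpy as np

def eta(means, covs):
    """eta(Theta*) = max_{i != j} sqrt(lam_max^i * lam_max^j) / ||m_i - m_j||."""
    K = len(means)
    lam_max = [np.linalg.eigvalsh(S)[-1] for S in covs]   # largest eigenvalue
    best = 0.0
    for i in range(K):
        for j in range(K):
            if i != j:
                r = np.sqrt(lam_max[i] * lam_max[j]) / np.linalg.norm(means[i] - means[j])
                best = max(best, r)
    return best

means = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
covs = [np.eye(2), 2.0 * np.eye(2), np.diag([1.0, 0.5])]
print(eta(means, covs))                        # sqrt(2)/10 for these parameters
# Shrinking every covariance drives eta (and hence the overlap e) to zero:
print(eta(means, [0.01 * S for S in covs]))
```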
Theorem 1. Given i.i.d. samples $\{x^{(t)}\}_1^N$ from a mixture of $K$ gaussian distributions with parameter $\Theta^*$ satisfying conditions 1 through 3, when $e(\Theta^*)$ is considered an infinitesimal tending to zero, we have

\[ \lim_{N\to\infty} P(\Theta^*)H(\Theta^*) = F + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr), \tag{3.5} \]
where $F$ is given by equation 2.22 and $\varepsilon$ is an arbitrarily small positive number.

According to this theorem, as the overlap of the distributions in the mixture becomes small or, more precisely, as $e(\Theta^*) \to 0$, the asymptotic value of $P(\Theta^*)H(\Theta^*)$ becomes $F$ plus a matrix in which each element is a higher-order infinitesimal of $e^{0.5-\varepsilon}(\Theta^*)$. It further follows from equation 2.24 that $E^T(I + F)E = 0$. Thus, we have

\[ I + E^T \lim_{N\to\infty} P(\Theta^*)H(\Theta^*)\,E = E^T\Bigl(I + \lim_{N\to\infty} P(\Theta^*)H(\Theta^*)\Bigr)E = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]

That is, each element of $I + E^T \lim_{N\to\infty} P(\Theta^*)H(\Theta^*)E$ is a higher-order infinitesimal of $e^{0.5-\varepsilon}(\Theta^*)$. Because the norm of a real matrix $A = (a_{ij})_{m\times m}$ is never larger than $\sqrt{m}\,\max_i \sum_{j=1}^m |a_{ij}|$, the norm $\|I + E^T P(\Theta^*)H(\Theta^*)E\|$ is certainly a higher-order infinitesimal of $e^{0.5-\varepsilon}(\Theta^*)$. Therefore, it follows from equation 2.21 that the asymptotic convergence rate of the EM algorithm locally around $\Theta^*$ is a higher-order infinitesimal of $e^{0.5-\varepsilon}(\Theta^*)$. That is, when $e(\Theta^*)$ is small and $N$ is large enough, the convergence rate of the EM algorithm approaches zero. In other words, the EM algorithm in this case exhibits quasi-Newton-type convergence behavior, with its asymptotic convergence rate dominated by the infinitesimal $e^{0.5-\varepsilon}(\Theta^*)$.

From this theorem, we can also see that, restricted to $R_1$, $P(\Theta^*)$ tends to $-H(\Theta^*)^{-1}$ as $e(\Theta^*)$ tends to zero. Thus, when the overlap of the gaussian distributions in the mixture becomes small, the EM algorithm approximates a Newton algorithm. Indeed, it is the matrix $P(\Theta^*)$ that distinguishes the EM algorithm for the gaussian mixture from conventional first-order iterative algorithms such as gradient ascent. This may be the basic underlying factor that speeds up the convergence of the EM algorithm when the overlap of the gaussians in the mixture is low.

4 Lemmas
In this section, we present several lemmas that will be used in proving the main theorem. In what follows, the norm of a vector or matrix is always the Euclidean norm. We first define three kinds of special polynomial functions that arise repeatedly in the subsequent analyses:

Definition 1. $g(x, \Theta^*)$ is called a regular function if it satisfies:

(i) If $\Theta^*$ is fixed, $g(x, \Theta^*)$ is a polynomial function of the component variables $x_1,\ldots,x_d$ of $x$.

(ii) If $x$ is fixed, $g(x, \Theta^*)$ is a polynomial function of the elements of $m_1^*,\ldots,m_K^*$, $\Sigma_1^*,\ldots,\Sigma_K^*$, $\Sigma_1^{*-1},\ldots,\Sigma_K^{*-1}$, as well as $A^*$ and $A^{*-1} = [\alpha_1^{*-1},\ldots,\alpha_K^{*-1}]^T$.

Definition 2. $g(x, \Theta^*)$ is called a balanced function if it satisfies (i) and

(iii) If $x$ is fixed, $g(x, \Theta^*)$ is a polynomial function of the elements of $A^*$, $A^{*-1}$, $m_1^*,\ldots,m_K^*$, $\Sigma_1^*,\ldots,\Sigma_K^*$, $\lambda(\Theta^*)\Sigma_1^{*-1},\ldots,\lambda(\Theta^*)\Sigma_K^{*-1}$.

Definition 3. For a regular function $g(x, \Theta^*)$, if there is a positive integer $q$ such that $\lambda^q(\Theta^*)g(x, \Theta^*)$ is a balanced function, then $g(x, \Theta^*)$ is called a regular and convertible function.

Lemma 1. Suppose that $g(x, \Theta^*)$ is a balanced function and that $\Theta^*$ satisfies conditions 1 through 3. As $e(\Theta^*)$ tends to zero, we have

\[ \int |g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx < \mu\,m^p(\Theta^*)\,\sqrt{e(\Theta^*)}, \tag{4.1} \]

where $m(\Theta^*) = \max_i \|m_i^*\|$, $\mu$ is a positive number, and $p$ is a positive integer.

Proof. We begin by deriving an upper bound for $E(\|X\|^k)$, where $k$ is any positive integer. By the definition of $P(x|\Theta^*)$, we have

\[ E(\|X\|^k) = \sum_{j=1}^K \alpha_j^*\,E(\|X\|^k\,|\,m_j^*, \Sigma_j^*), \tag{4.2} \]

where

\[ E(\|X\|^k\,|\,m_j^*, \Sigma_j^*) = \int \|x\|^k\,P(x|m_j^*, \Sigma_j^*)\,dx \le \sum_{i=0}^k \binom{k}{i}\,\|m_j^*\|^{k-i} \int \|x - m_j^*\|^i\,P(x|m_j^*, \Sigma_j^*)\,dx, \]

and $\binom{k}{i}$ is the binomial coefficient. By making the transformation $y = \Lambda_j^{-1/2} U_j (x - m_j^*)$, we get

\[ E(\|X - m_j^*\|^i\,|\,m_j^*, \Sigma_j^*) \le \lambda^{i/2}(\Theta^*)\,E(\|Y\|^i\,|\,0, I_d), \tag{4.3} \]

where $U_j$ is defined in equation 3.3 and $\Lambda_j^{-1/2} = \mathrm{diag}\bigl[\lambda_{j1}^{-1/2},\ldots,\lambda_{jd}^{-1/2}\bigr]$.
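The whitening transformation used in equation 4.3 is standard: it maps $N(m, \Sigma)$ samples to samples with approximately identity covariance. A sketch (NumPy assumed; the 2-D $m$ and $\Sigma$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m = np.array([1.0, -2.0])
S = np.array([[3.0, 1.0], [1.0, 2.0]])

# S = U^T diag(lam) U with orthonormal U (rows of U are eigenvectors)
lam, V = np.linalg.eigh(S)         # S = V diag(lam) V^T, so U = V^T
U = V.T

x = rng.multivariate_normal(m, S, size=100_000)
y = (x - m) @ U.T / np.sqrt(lam)   # y_t = diag(lam)^{-1/2} U (x_t - m)

C = np.cov(y.T)
print(np.round(C, 2))              # approximately the 2 x 2 identity
```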
As $E(\|Y\|^i\,|\,0, I_d)$ is certainly finite and $\lambda(\Theta^*)$ is upper bounded, $E(\|X - m_j^*\|^i\,|\,m_j^*, \Sigma_j^*)$ is upper bounded. Since $\alpha_j^* < 1$ and $m(\Theta^*)$ is lower bounded by $T/2$ under the constraint $\|m_i^* - m_j^*\| \ge T$, it follows from equation 4.2 that there exist a positive number $\mu$ and a positive integer $q$ such that

\[ E(\|X\|^k) = \int \|x\|^k\,P(x|\Theta^*)\,dx \le \mu\,m^q(\Theta^*). \tag{4.4} \]

We next derive an upper bound for $E(g^2(X, \Theta^*))$. In the process of $e(\Theta^*)$ tending to zero under the assumptions, the elements of $A^*$ and $A^{*-1}$ are bounded. Since $\|\Sigma_j^*\| = \lambda_{\max}^j \le \lambda(\Theta^*) \le B$, the elements of each $\Sigma_j^*$ are bounded. Moreover, since $\lambda_{j1},\ldots,\lambda_{jd}$ are the eigenvalues of $\Sigma_j^*$, $\lambda_{j1}^{-1},\ldots,\lambda_{jd}^{-1}$ are the eigenvalues of $\Sigma_j^{*-1}$, and we have

\[ \|\lambda(\Theta^*)\Sigma_j^{*-1}\| \le \lambda(\Theta^*)\,\frac{1}{\beta\lambda(\Theta^*)} = \frac{1}{\beta}. \]

That is, the elements of $\lambda(\Theta^*)\Sigma_j^{*-1}$ are also bounded. Now, $g(x, \Theta^*)$ is a polynomial function of $x_1,\ldots,x_d$, and the coefficient of each term in $g(x, \Theta^*)$ is a polynomial function, with constant coefficients, of the elements of $A^*$, $A^{*-1}$, $m_1^*,\ldots,m_K^*$, $\Sigma_1^*,\ldots,\Sigma_K^*$, $\lambda(\Theta^*)\Sigma_1^{*-1},\ldots,\lambda(\Theta^*)\Sigma_K^{*-1}$. Because the elements of these matrices and vectors, except $m_1^*,\ldots,m_K^*$, are bounded, the absolute value of each coefficient is upper bounded by a positive-order power function of $m(\Theta^*)$. Therefore, $|g(x, \Theta^*)|$ is upper bounded by a positive polynomial function of $\|x\|$ whose coefficients are polynomial functions of $m(\Theta^*)$ with positive constant coefficients. As $g^2(x, \Theta^*)$ is certainly a balanced function, it too is upper bounded by such a polynomial function of $\|x\|$. According to equation 4.4 and the property that $m(\Theta^*)$ is lower bounded by $T/2$, there certainly exist a positive number $\mu$ and a positive integer $p$ such that

\[ E\bigl(g^2(X, \Theta^*)\bigr) \le \mu^2\,m^{2p}(\Theta^*). \]

Finally, it follows from the Cauchy-Schwarz inequality and the fact $|\gamma_{ij}(x)| \le 1$ that

\[ \int |g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx = E\bigl(|g(X, \Theta^*)|\,|\gamma_{ij}(X)|\bigr) \le E^{1/2}\bigl(g^2(X, \Theta^*)\bigr)\,E^{1/2}\bigl(|\gamma_{ij}(X)|\bigr) \le \mu\,m^p(\Theta^*)\sqrt{e_{ij}(\Theta^*)} \le \mu\,m^p(\Theta^*)\sqrt{e(\Theta^*)}. \]
Since $\eta(\Theta^*)$ is not an invertible function of $\Theta^*$ (there may be many $\Theta^*$'s for one value of $\eta(\Theta^*)$), we further define

\[ f(\eta) = \sup_{\eta(\Theta^*) = \eta} e(\Theta^*). \tag{4.5} \]

Because $e(\Theta^*)$ is never larger than 1, $f(\eta)$ is well defined, and by this definition we certainly have

\[ e_{ij}(\Theta^*) \le e(\Theta^*) \le f\bigl(\eta(\Theta^*)\bigr). \tag{4.6} \]

Lemma 2. Suppose that $\Theta^*$ satisfies conditions 1 through 3. As $\eta(\Theta^*)$ tends to zero, we have:

(i) $\eta(\Theta^*)$, $\eta_{ij}(\Theta^*)$, and $\lambda_{\max}^i/\|m_i^* - m_j^*\|$ are equivalent infinitesimals.

(ii) For $i \ne j$, we have

\[ \|m_i^*\| \le T'\,\|m_i^* - m_j^*\|, \tag{4.7} \]

where $T'$ is a positive number.

(iii) For any two nonnegative numbers $p$ and $q$ with $p + q > 0$, we have

\[ \|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-q} \le O\bigl(\eta^{-p\vee q}(\Theta^*)\bigr), \tag{4.8} \]
\[ \|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-q} \ge O\bigl(\eta^{-p\wedge q}(\Theta^*)\bigr), \tag{4.9} \]

where $p \vee q = \max\{p, q\}$ and $p \wedge q = \min\{p, q\}$. See the appendix for the proof.

Lemma 3. When $\Theta^*$ satisfies conditions 1 through 3 and $\eta(\Theta^*) \to 0$ as an infinitesimal, we have

\[ f^\varepsilon\bigl(\eta(\Theta^*)\bigr) = o\bigl(\eta^p(\Theta^*)\bigr), \tag{4.10} \]

where $\varepsilon > 0$, $p$ is any positive number, and $o(x)$ denotes a higher-order infinitesimal as $x \to 0$. See the appendix for the proof.

Lemma 4. Suppose that $g(x, \Theta^*)$ is a regular and convertible function and that $\Theta^*$ satisfies conditions 1 through 3. As $e(\Theta^*) \to 0$ is considered an infinitesimal, we have

\[ \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N \gamma_{ij}(t)\,g\bigl(x^{(t)}, \Theta^*\bigr) = \int g(x, \Theta^*)\gamma_{ij}(x)P(x|\Theta^*)\,dx = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr), \tag{4.11} \]

where $\varepsilon$ is an arbitrarily small positive number.
Proof. Because $g(x, \Theta^*)$ is a regular and convertible function, there is a positive integer $q$ such that $\lambda^q(\Theta^*)g(x, \Theta^*)$ is a balanced function. By lemmas 1 and 2, we have

\[ \int g(x, \Theta^*)\gamma_{ij}(x)P(x|\Theta^*)\,dx \le \int |g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx = \lambda^{-q}(\Theta^*) \int |\lambda^q(\Theta^*)g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx \]
\[ \le \mu\,\lambda^{-q}(\Theta^*)\,m^p(\Theta^*)\,e^{0.5}(\Theta^*) = \mu\,\lambda^{-q}(\Theta^*)\,\|m_i^*\|^p\,e^{0.5}(\Theta^*) \le \mu'\,(\lambda_{\max}^i)^{-q}\,\|m_i^* - m_j^*\|^p\,e^{0.5}(\Theta^*) \le \nu\,\eta^{-n}(\Theta^*)\,e^{0.5}(\Theta^*), \]

where we let $m(\Theta^*) = \|m_i^*\|$, $\mu'$ and $\nu$ are positive numbers, and $n = p \vee q$. By lemma 3, we further have

\[ \lim_{e(\Theta^*)\to 0} \frac{\int |g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx}{e^{0.5-\varepsilon}(\Theta^*)} = \lim_{\eta(\Theta^*)\to 0} \frac{\int |g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx}{e^{0.5-\varepsilon}(\Theta^*)} \le \nu \lim_{\eta(\Theta^*)\to 0} \frac{e^\varepsilon(\Theta^*)}{\eta^n(\Theta^*)} \le \nu \lim_{\eta(\Theta^*)\to 0} \frac{f^\varepsilon(\eta(\Theta^*))}{\eta^n(\Theta^*)} = 0. \]

Therefore, we have

\[ \int g(x, \Theta^*)\gamma_{ij}(x)P(x|\Theta^*)\,dx = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{4.12} \]
Corollary 1. Suppose that $\Theta^*$ satisfies conditions 1 through 3 and that $e(\Theta^*) \to 0$ is considered an infinitesimal. If $g(x, \Theta^*) = \sum_{l=1}^n g_l(x, \Theta^*)$ and each $g_l(x, \Theta^*)$ is a regular and convertible function, then

\[ \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N \gamma_{ij}(t)\,g\bigl(x^{(t)}, \Theta^*\bigr) = \int g(x, \Theta^*)\gamma_{ij}(x)P(x|\Theta^*)\,dx = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr), \tag{4.13} \]

where $\varepsilon$ is an arbitrarily small positive number.

Proof. By the fact that

\[ \int |g(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx \le \sum_{l=1}^n \int |g_l(x, \Theta^*)\gamma_{ij}(x)|\,P(x|\Theta^*)\,dx, \]

the corollary is obviously proved by applying lemma 4 to each $g_l$.
5 Proof of Theorem 1

Proof of Theorem 1. We are now ready to prove the main theorem. We are interested in the matrix $P(\Theta^*)H(\Theta^*)$, which determines the local convergence rate of EM. The explicit expressions of $P(\Theta^*)$ and $H(\Theta^*)$ allow us to obtain formulas for the blocks of $P(\Theta^*)H(\Theta^*)$. To clarify our notation, we begin by writing out these blocks:

\[ P(\Theta^*)H(\Theta^*) = \mathrm{diag}\bigl[P_A, P_{m_1},\ldots,P_{m_K}, P_{\Sigma_1},\ldots,P_{\Sigma_K}\bigr] \begin{pmatrix} H_{A,A^T} & H_{A,m_1^T} & \cdots & H_{A,\Sigma_K^T} \\ H_{m_1,A^T} & H_{m_1,m_1^T} & \cdots & H_{m_1,\Sigma_K^T} \\ \vdots & \vdots & \ddots & \vdots \\ H_{\Sigma_K,A^T} & H_{\Sigma_K,m_1^T} & \cdots & H_{\Sigma_K,\Sigma_K^T} \end{pmatrix} \]
\[ = \begin{pmatrix} P_A H_{A,A^T} & P_A H_{A,m_1^T} & \cdots & P_A H_{A,\Sigma_K^T} \\ P_{m_1} H_{m_1,A^T} & P_{m_1} H_{m_1,m_1^T} & \cdots & P_{m_1} H_{m_1,\Sigma_K^T} \\ \vdots & \vdots & \ddots & \vdots \\ P_{\Sigma_K} H_{\Sigma_K,A^T} & P_{\Sigma_K} H_{\Sigma_K,m_1^T} & \cdots & P_{\Sigma_K} H_{\Sigma_K,\Sigma_K^T} \end{pmatrix}; \]

that is, each block row of $H(\Theta^*)$ is premultiplied by the corresponding diagonal block of $P(\Theta^*)$.
Based on the expressions of the Hessian blocks, we have

\[ P_A H_{A,A^T} = \frac{1}{N}\bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\Bigl(-\sum_{t=1}^N H_A^{-1}(t)\bigl(H_A^{-1}(t)\bigr)^T\Bigr) = -\bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\Bigl(\frac{1}{N}\sum_{t=1}^N H_A^{-1}(t)\bigl(H_A^{-1}(t)\bigr)^T\Bigr). \]

Taking the limit as $N \to \infty$, we obtain

\[ \lim_{N\to\infty} P_A H_{A,A^T} = -\bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\,\lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N H_A^{-1}(t)\bigl(H_A^{-1}(t)\bigr)^T. \]

We further let

\[ I(\Theta^*) = \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N H_A^{-1}(t)\bigl(H_A^{-1}(t)\bigr)^T \tag{5.1} \]
and compute this matrix element by element. When $i \ne j$, we have

\[ I(\Theta^*)(i, j) = \frac{1}{\alpha_i^*\alpha_j^*}\,\lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N h_i(t)h_j(t) = \frac{1}{\alpha_i^*\alpha_j^*}\,e_{ij}(\Theta^*) \le \frac{1}{\alpha^2}\,e(\Theta^*) = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]

When $i = j$, we further have

\[ I(\Theta^*)(i, i) = \frac{1}{(\alpha_i^*)^2}\,\lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N h_i(t)\bigl[1 - (1 - h_i(t))\bigr] = \frac{1}{(\alpha_i^*)^2} \int h_i(x)P(x|\Theta^*)\,dx - \frac{1}{(\alpha_i^*)^2} \int |\gamma_{ii}(x)|\,P(x|\Theta^*)\,dx \]
\[ \ge \alpha_i^{*-1} - \frac{1}{\alpha^2}\,e(\Theta^*) = \alpha_i^{*-1} + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]

Then we have

\[ I(\Theta^*) = \mathrm{diag}\bigl[\alpha_1^{*-1},\ldots,\alpha_K^{*-1}\bigr] + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.2} \]
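Equation 5.2 can be checked by simulation: for a well-separated mixture, the sample average of $H_A^{-1}(t)(H_A^{-1}(t))^T$ should be close to $\mathrm{diag}[1/\alpha_1^*,\ldots,1/\alpha_K^*]$, with the off-diagonal entries of the size of the overlap $e_{12}$. A sketch (NumPy assumed; the 1-D two-component mixture is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = np.array([0.3, 0.7])
means = np.array([-8.0, 8.0])        # well separated, unit variances
N = 100_000

comp = rng.choice(2, size=N, p=alpha)
x = rng.normal(means[comp], 1.0)

p = alpha * np.exp(-0.5 * (x[:, None] - means) ** 2) / np.sqrt(2 * np.pi)
h = p / p.sum(axis=1, keepdims=True)

HA = h / alpha                        # H_A^{-1}(t) = [h_1(t)/alpha_1, h_2(t)/alpha_2]^T
I_est = (HA[:, :, None] * HA[:, None, :]).mean(axis=0)

print(np.round(I_est, 2))
# close to diag(1/0.3, 1/0.7); the off-diagonal terms are e_12-sized
```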
Therefore, we get

\[ \lim_{N\to\infty} P_A H_{A,A^T} = -\bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\,I(\Theta^*) = -\bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\bigl(\mathrm{diag}[\alpha_1^{*-1},\ldots,\alpha_K^{*-1}] + o(e^{0.5-\varepsilon}(\Theta^*))\bigr) \]
\[ = -I_K + A^*[1, 1,\ldots,1] + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]
For $j = 1,\ldots,K$, we have

\[ P_A H_{A,m_j^T} = \frac{1}{N}\bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\sum_{t=1}^N R_{A_j}^{-1}(t)\bigl(x^{(t)} - m_j^*\bigr)^T\Sigma_j^{*-1} = \bigl(\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T\bigr)\,\frac{1}{N}\sum_{t=1}^N R_{A_j}^{-1}(t)\bigl(x^{(t)} - m_j^*\bigr)^T\Sigma_j^{*-1}. \]

Since $R_{A_j}^{-1}(t) = [\gamma_{1j}(t)/\alpha_1^*,\ldots,\gamma_{Kj}(t)/\alpha_K^*]^T$, we let $g(x, \Theta^*)$ be any element of $A^{*-1}(x - m_j^*)^T\Sigma_j^{*-1}$. Obviously, this kind of $g(x, \Theta^*)$ is a regular and convertible function with $q = 1$. By lemma 4, we have

\[ \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N R_{A_j}^{-1}(t)\bigl(x^{(t)} - m_j^*\bigr)^T\Sigma_j^{*-1} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.3} \]

Because the elements of the matrix $\mathrm{diag}[\alpha_1^*,\ldots,\alpha_K^*] - A^*(A^*)^T$ are bounded, we further have

\[ \lim_{N\to\infty} P_A H_{A,m_j^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.4} \]

Because the Hessian matrix has the property

\[ H_{m_j,A^T} = H_{A,m_j^T}^T, \tag{5.5} \]

we have

\[ P_{m_j} H_{m_j,A^T} = \frac{N}{\sum_{t=1}^N h_j(t)}\,\Bigl(\frac{1}{N}\sum_{t=1}^N R_{A_j}^{-1}(t)\bigl(x^{(t)} - m_j^*\bigr)^T\Bigr)^T \]

for $j = 1,\ldots,K$. By lemma 4, letting $g(x, \Theta^*)$ be any element of the matrix $(x - m_j^*)(A^{*-1})^T$, and noting that $N/\sum_t h_j(t) \to 1/\alpha_j^*$ is bounded, we obtain

\[ \lim_{N\to\infty} P_{m_j} H_{m_j,A^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]

Similarly, we can prove

\[ \lim_{N\to\infty} P_A H_{A,\Sigma_j^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr), \qquad \lim_{N\to\infty} P_{\Sigma_j} H_{\Sigma_j,A^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.6} \]
Next, we consider $P_{m_i} H_{m_i,m_j^T}$ and have

\[ P_{m_i} H_{m_i,m_j^T} = -\delta_{ij}\,I_d + \frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N \gamma_{ij}(t)\bigl(x^{(t)} - m_i^*\bigr)\bigl(x^{(t)} - m_j^*\bigr)^T\Sigma_j^{*-1}. \]

By lemma 4, letting $g(x, \Theta^*)$ be any element of the matrix $(x - m_i^*)(x - m_j^*)^T\Sigma_j^{*-1}$, we get

\[ \lim_{N\to\infty} P_{m_i} H_{m_i,m_j^T} = -\delta_{ij}\,I_d + \frac{1}{\alpha_i^*}\,\lim_{N\to\infty}\frac{1}{N}\sum_{t=1}^N \gamma_{ij}(t)\bigl(x^{(t)} - m_i^*\bigr)\bigl(x^{(t)} - m_j^*\bigr)^T\Sigma_j^{*-1} = -\delta_{ij}\,I_d + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]
We further consider $P_{m_i} H_{m_i,\Sigma_j^T}$ and have

\[ P_{m_i} H_{m_i,\Sigma_j^T} = -\frac{1}{2}\,\frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N \gamma_{ij}(t)\bigl(x^{(t)} - m_i^*\bigr)\,\mathrm{vec}\bigl[\Sigma_j^{*-1} - U_j(t)\bigr]^T - \frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N \delta_{ij}\,h_i(t)\bigl(x^{(t)} - m_i^*\bigr)\,\mathrm{vec}\bigl[\Sigma_i^{*-1}\bigr]^T. \]

By corollary 1, letting $g(x, \Theta^*)$ be any element of the matrix $(x - m_i^*)\,\mathrm{vec}[\Sigma_j^{*-1} - U_j(x)]^T$, we have

\[ \lim_{N\to\infty} \frac{1}{N}\sum_{t=1}^N \gamma_{ij}(t)\bigl(x^{(t)} - m_i^*\bigr)\,\mathrm{vec}\bigl[\Sigma_j^{*-1} - U_j(t)\bigr]^T = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.7} \]

As for the second part of the expression, with $j = i$ we have

\[ \lim_{N\to\infty} \frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N h_i(t)\bigl(x^{(t)} - m_i^*\bigr)\,\mathrm{vec}\bigl[\Sigma_i^{*-1}\bigr]^T = 0, \]

because

\[ \lim_{N\to\infty} \frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N h_i(t)\bigl(x^{(t)} - m_i^*\bigr) = 0 \tag{5.8} \]

holds by the law of large numbers. Combining the two results, we obtain

\[ \lim_{N\to\infty} P_{m_i} H_{m_i,\Sigma_j^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.9} \]
On the other hand, we have

\[ P_{\Sigma_i} H_{\Sigma_i,m_j^T} = \frac{2}{\sum_{t=1}^N h_i(t)}\,\bigl(\Sigma_i^* \otimes \Sigma_i^*\bigr)\,H_{m_j,\Sigma_i^T}^T = -\bigl(\Sigma_i^* \otimes \Sigma_i^*\bigr)\Bigl\{\frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N \gamma_{ij}(t)\,\mathrm{vec}\bigl[\Sigma_i^{*-1} - U_i(t)\bigr]\,\bigl\{\Sigma_j^{*-1}\bigl(x^{(t)} - m_j^*\bigr)\bigr\}^T\Bigr\} \]
\[ \qquad - 2\bigl(\Sigma_i^* \otimes \Sigma_i^*\bigr)\Bigl\{\frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N \delta_{ij}\,h_i(t)\,\mathrm{vec}\bigl[\Sigma_i^{*-1}\bigr]\,\bigl\{\Sigma_i^{*-1}\bigl(x^{(t)} - m_i^*\bigr)\bigr\}^T\Bigr\}, \]

and by the same argument used for $P_{m_i} H_{m_i,\Sigma_j^T}$, we have

\[ \lim_{N\to\infty} P_{\Sigma_i} H_{\Sigma_i,m_j^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \tag{5.10} \]
Now we turn to the remaining case, $P_{\Sigma_i} H_{\Sigma_i,\Sigma_j^T}$, and have

\[ P_{\Sigma_i} H_{\Sigma_i,\Sigma_j^T} = -\frac{1}{2}\bigl(\Sigma_i^* \otimes \Sigma_i^*\bigr)\Bigl\{\frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N \gamma_{ij}(t)\,\mathrm{vec}\bigl[\Sigma_i^{*-1} - U_i(t)\bigr]\,\mathrm{vec}\bigl[\Sigma_j^{*-1} - U_j(t)\bigr]^T\Bigr\} \]
\[ \qquad - \delta_{ij}\bigl(\Sigma_i^* \otimes \Sigma_i^*\bigr)\Bigl\{-\Sigma_i^{*-1} \otimes \Sigma_i^{*-1} + \Sigma_i^{*-1} \otimes \Bigl(\frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N h_i(t)U_i(t)\Bigr) + \Bigl(\frac{1}{\sum_{t=1}^N h_i(t)}\sum_{t=1}^N h_i(t)U_i(t)\Bigr) \otimes \Sigma_i^{*-1}\Bigr\}. \]

By corollary 1 and the law of large numbers, under which $\bigl(1/\sum_t h_i(t)\bigr)\sum_t h_i(t)U_i(t) \to \Sigma_i^{*-1}$, we have

\[ \lim_{N\to\infty} P_{\Sigma_i} H_{\Sigma_i,\Sigma_j^T} = o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr) - \delta_{ij}\bigl(\Sigma_i^* \otimes \Sigma_i^*\bigr)\bigl(\Sigma_i^{*-1} \otimes \Sigma_i^{*-1}\bigr) = -\delta_{ij}\,I_{d^2} + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr). \]
Summing up all the results, we obtain

\[ \lim_{N\to\infty} P(\Theta^*)H(\Theta^*) = \begin{pmatrix} -I_K + A^*[1, 1,\ldots,1] & 0 \\ 0 & -I_{K\times(d+d^2)} \end{pmatrix} + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr) = F + o\bigl(e^{0.5-\varepsilon}(\Theta^*)\bigr), \]

and the proof is completed.

6 Conclusions
We have presented an analysis of the asymptotic convergence rate of the EM algorithm for gaussian mixtures. Our analysis shows that when the overlap of any two gaussian distributions in the mixture is small enough, the large-sample local convergence behavior of the EM algorithm is similar to that of a quasi-Newton algorithm. Moreover, the large-sample convergence rate is dominated by an average overlap measure of the gaussian mixture, with the rate and the overlap tending to zero together.

Appendix
Proof of Lemma 2. We begin by proving (i). Let $\eta(\Theta^*) = \eta_{i_0 j_0}(\Theta^*)$. According to conditions 2 and 3, there exist three pairs of positive numbers $a_1, a_2$; $b_1, b_2$; and $c_1, c_2$ such that

\[ a_1\,(\lambda_{\max}^{i_0})^{1/2}(\lambda_{\max}^{j_0})^{1/2} \le (\lambda_{\max}^i)^{1/2}(\lambda_{\max}^j)^{1/2} \le a_2\,(\lambda_{\max}^{i_0})^{1/2}(\lambda_{\max}^{j_0})^{1/2}, \tag{A.1} \]
\[ b_1\,(\lambda_{\max}^i)^{1/2}(\lambda_{\max}^j)^{1/2} \le \lambda_{\max}^i \le b_2\,(\lambda_{\max}^i)^{1/2}(\lambda_{\max}^j)^{1/2}, \tag{A.2} \]
\[ c_1\,\|m_{i_0}^* - m_{j_0}^*\| \le \|m_i^* - m_j^*\| \le c_2\,\|m_{i_0}^* - m_{j_0}^*\|. \tag{A.3} \]

Comparing equations A.1 and A.2 with equation A.3, there exist two other pairs of positive numbers, $a_1', a_2'$ and $b_1', b_2'$, such that

\[ a_1'\,\eta(\Theta^*) \le \eta_{ij}(\Theta^*) \le a_2'\,\eta(\Theta^*), \qquad b_1'\,\eta_{ij}(\Theta^*) \le \frac{\lambda_{\max}^i}{\|m_i^* - m_j^*\|} \le b_2'\,\eta_{ij}(\Theta^*). \]

Therefore, $\eta(\Theta^*)$, $\eta_{ij}(\Theta^*)$, and $\lambda_{\max}^i/\|m_i^* - m_j^*\|$ are equivalent infinitesimals.

We then consider (ii). If $\|m_i^* - m_j^*\|$ is upper bounded as $\eta(\Theta^*) \to 0$, equation 4.7 obviously holds. If $\|m_i^* - m_j^*\|$ increases to infinity as $\eta(\Theta^*) \to 0$, then by the inequality $\|m_i^*\| \le \|m_i^* - m_j^*\| + \|m_j^*\|$, equation 4.7 certainly holds if $\|m_i^*\|$ or $\|m_j^*\|$ is upper bounded. Otherwise, if $\|m_i^*\|$ and $\|m_j^*\|$ both increase to infinity, then since $m_i^*$ and $m_j^*$ take different directions, the order of the infinitely large quantity $\|m_i^*\|$ is certainly lower than or equal to that of $\|m_i^* - m_j^*\|$, which also leads to equation 4.7. Therefore, (ii) holds under the assumptions.

Finally, we turn to (iii). For the first inequality, equation 4.8, we consider three cases. In the simple case $p = q > 0$, according to (i), we have

\[ \|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-q} = \|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-p} = \Bigl(\frac{\lambda_{\max}^i}{\|m_i^* - m_j^*\|}\Bigr)^{-p} = O\bigl(\eta^{-p}(\Theta^*)\bigr) = O\bigl(\eta^{-p\vee q}(\Theta^*)\bigr). \]

If $p > q$, then since $\lambda_{\max}^i$ is upper bounded, according to (i) we have

\[ \|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-q} \le O\bigl(\eta^{-p}(\Theta^*)\bigr) = O\bigl(\eta^{-p\vee q}(\Theta^*)\bigr). \]

If $p < q$, then as $\|m_i^* - m_j^*\| \ge T$, we have

\[ \|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-q} \le O\bigl(\eta^{-q}(\Theta^*)\bigr) = O\bigl(\eta^{-p\vee q}(\Theta^*)\bigr). \]

Summing up the three cases, we obtain $\|m_i^* - m_j^*\|^p\,(\lambda_{\max}^i)^{-q} \le O(\eta^{-p\vee q}(\Theta^*))$. In a similar way, we can prove the second inequality, equation 4.9.

Proof of Lemma 3.
We rst prove that
f .´/ D o.´ p /; as ´ ! 0, where p is an arbitrarily positive number p. We consider the mixture of K gaussian densities of the parameter 2¤ 6 under the relation ´.2 ¤ / D ´. When i D j, for a small enough ´, there is certainly a point m¤ij on the line between m¤i and mj¤ such that ®i¤ Pi .m¤ij |m¤i ; 6i¤ / D ®j¤ Pj .m¤ij |mj¤ ; 6j¤ /: We further dene Ei D fx: ®i¤ Pi .x|m¤i ; 6i¤ / ¸ ®j¤ Pj .x|mj¤ ; 6j¤ /gI Ej D fx: ®j¤ Pj .x|mj¤ ; 6j¤ / > ®i¤ Pi .x|m¤i ; 6i¤ /g:
2904
Jinwen Ma, Lei Xu, & Michael I. Jordan
As ´.2¤ / tends to zero, ·.6i¤ /
¸imax km¤i ¡mj¤ k
and
·.6j¤ /
¸imax km¤i ¡mj¤ k
are the same-order innites-
imals. Moreover, and are both upper bounded. Thus, there certainly exist some neighborhoods of m¤i (or mj¤ ) in Ei (or Ej ). For clarity, we let N ri .m¤i / and N rj .mj¤ / be the largest neighborhood in Ei and Ej , respectively, where ri and rj are their radiuses. Obviously ri and rj are both proportional to km¤i ¡ mj¤ k when km¤i ¡ mj¤ k either tends to innity or is upper bounded. So there exist a pair of positive numbers b1 and b2 such that ri ¸ bi km¤i ¡ mj¤ k; and rj ¸ bj km¤i ¡ mj¤ k: We further dene
Di D N Dj D N
¤ c ri .mi /
D fx: kx ¡ m¤i k ¸ ri gI
¤ c rj .mj /
D fx: kx ¡ mj¤ k ¸ rj g;
and thus E i ½ Dj ;
Ej ½ Di :
Moreover, from the denitions of eij .2¤ / and hk .x/, we have Z eij .2¤ / D
hi .x/hj .x/P.x|2¤ / dx Z
D
Z ·
Z hi .x/hj .x/P.x|2¤ / dx C
Ei
hi .x/hj .x/P.x|2¤ / dx
Ej
Z ¤
D
j
hi .x/hj .x/P.x|2 / dx C
Z
D ®j¤
hi .x/hj .x/P.x|2¤ / dx
D
Z D
We now consider
j
Pj .x|mj¤ ; 6j¤ / dx C ®i¤
R D
i
i
D
Pi .x|m¤i ; 6j¤ / dx: i
Pi .x|m¤i ; 6i¤ / dx. Since ri ¸ bi km¤i ¡ mj¤ k,
Z
Z D
i
Pi .x|m¤i ; 6i¤ / dx ·
kx¡m¤i k¸bi km¤i ¡mj¤ k
Pi .x|m¤i ; 6i¤ / dx:
When ´.2¤ / is sufciently small, by the transformation y D Ui .x¡m¤i /=km¤i ¡
Asymptotic Convergence Rate of the EM Algorithm
2905
mj¤ k, we have Z
Z D
Pi .x|m¤i ; 6i¤ / dx
·
i
kyk¸bi
º
km¤i ¡ mj¤ k .¸imax /
d 2
e
¡ 12
km¤ ¡m ¤ k2 i j .¸imax /d
kyk2
dy;
(A.4)
where º is a positive number. By lemma 2, we have d
km¤i ¡ mj¤ k.¸imax / ¡ 2 · O.´ ¡c1 .2¤ //; km¤i ¡ mj¤ k2 .¸imax / ¡d ¸ O.´ ¡c2 .2¤ //; where c1 D 1 _ d2 ; c2 D 2 ^ d. According to these results, we have from equation A.4 that Z
Z D
i
Pi .x|m¤i ; 6i¤ / dx · ¹
Bi
1 ¡½ c12 kyk2 e ´ dy; ´ c1
(A.5)
where \(B_i = \{y : \|y\| \ge b_i\}\), and \(\mu\) and \(\rho\) are positive numbers. Furthermore, we let
\[ F_i(\eta) = \int_{B_i} P(y \mid \eta)\,dy, \qquad P(y \mid \eta) = \frac{1}{\eta^{c_1}}\, e^{-\frac{\rho}{\eta^{c_2}} \|y\|^2}, \]
and consider the limit of \(F_i(\eta)/\eta^p\) as \(\eta\) tends to zero. For each \(y \in B_i\), setting \(\zeta = 1/\eta\), we have
\[ \lim_{\eta \to 0} \frac{P(y \mid \eta)}{\eta^p} = \lim_{\zeta \to \infty} \frac{\zeta^{c_1 + p}}{e^{\zeta^{c_2} \rho \|y\|^2}} = 0 \]
uniformly in \(B_i\), which leads to
\[ \lim_{\eta \to 0} \frac{F_i(\eta)}{\eta^p} = \int_{B_i} \lim_{\eta \to 0} \frac{P(y \mid \eta)}{\eta^p}\,dy = 0, \]
and thus \(F_i(\eta) = o(\eta^p)\). It further follows from equation A.5 that
\[ \sup_{\eta(\Theta^*) = \eta} \int_{D_i} P_i(x \mid m_i^*, \Sigma_i^*)\,dx = o(\eta^p). \]
Similarly, we can also prove
\[ \sup_{\eta(\Theta^*) = \eta} \int_{D_j} P_j(x \mid m_j^*, \Sigma_j^*)\,dx = o(\eta^p). \tag{A.6} \]
Jinwen Ma, Lei Xu, & Michael I. Jordan
As a result, we have
\begin{align*}
f_{ij}(\eta) &= \sup_{\eta(\Theta^*) = \eta} e_{ij}(\Theta^*) \\
&\le \sup_{\eta(\Theta^*) = \eta} \left( \alpha_j^* \int_{D_j} P_j(x \mid m_j^*, \Sigma_j^*)\,dx + \alpha_i^* \int_{D_i} P_i(x \mid m_i^*, \Sigma_i^*)\,dx \right) \\
&\le \sup_{\eta(\Theta^*) = \eta} \int_{D_j} P_j(x \mid m_j^*, \Sigma_j^*)\,dx + \sup_{\eta(\Theta^*) = \eta} \int_{D_i} P_i(x \mid m_i^*, \Sigma_i^*)\,dx \\
&= o(\eta^p).
\end{align*}
In the case \(i = j\), we also have
\begin{align*}
f_{ii}(\eta) &= \sup_{\eta(\Theta^*) = \eta} \int |\gamma_{ii}(x)|\, P(x \mid \Theta^*)\,dx \\
&= \sup_{\eta(\Theta^*) = \eta} \int h_i(x) (1 - h_i(x))\, P(x \mid \Theta^*)\,dx \\
&\le \sum_{j \ne i} \sup_{\eta(\Theta^*) = \eta} \int h_i(x) h_j(x) P(x \mid \Theta^*)\,dx = \sum_{j \ne i} f_{ij}(\eta) \\
&= o(\eta^p).
\end{align*}
Summing up the results, we have
\[ f(\eta) \le \max_{i,j} f_{ij}(\eta) = o(\eta^p). \]
Moreover, because
\[ \lim_{\eta \to 0} \frac{f^{\varepsilon}(\eta)}{\eta^{\varepsilon p}} = \lim_{\eta \to 0} \left( \frac{f(\eta)}{\eta^p} \right)^{\varepsilon} = 0, \]
we finally have \(f^{\varepsilon}(\eta) = o(\eta^{\varepsilon p})\) and thus \(f^{\varepsilon}(\eta(\Theta^*)) = o(\eta^{\varepsilon p}(\Theta^*))\).

Acknowledgments

This work was supported by HK RGC Earmarked Grant 4297/98E. We thank Professor Jiong Ruan and Dr. Yong Gao for valuable discussions.

References

Delyon, B., Lavielle, M., & Moulines, E. (1999). Convergence of a stochastic approximation version of the EM algorithm. Annals of Statistics, 27, 94–128.
Asymptotic Convergence Rate of the EM Algorithm
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc., B, 39, 1–38.
Rogers, G. S. (1980). Matrix derivatives. New York: Dekker.
Jamshidian, M., & Jennrich, R. I. (1997). Acceleration of the EM algorithm by using quasi-Newton methods. J. R. Statist. Soc., B, 59, 569–587.
Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.
Jordan, M. I., & Xu, L. (1995). Convergence results for the EM approach to mixtures of experts architectures. Neural Networks, 8(9), 1409–1431.
Lange, K. (1995a). A gradient algorithm locally equivalent to the EM algorithm. J. R. Statist. Soc., B, 57, 425–437.
Lange, K. (1995b). A quasi-Newton acceleration of the EM algorithm. Statist. Sinica, 5.
Liu, C., & Rubin, D. B. (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika, 81, 633–648.
Meng, X. L. (1994). On the rate of convergence of the ECM algorithm. Ann. Statist., 22, 326–339.
Meng, X. L., & van Dyk, D. (1997). The EM algorithm—An old folk-song sung to a fast new tune. J. R. Statist. Soc., B, 59, 511–567.
Meng, X. L., & van Dyk, D. (1998). Fast EM-type implementations for mixed effects models. J. R. Statist. Soc., B, 60, 559–578.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. of IEEE, 77, 257–286.
Redner, R. A., & Walker, H. F. (1984). Mixture densities, maximum likelihood, and the EM algorithm. SIAM Review, 26, 195–239.
De Veaux, R. D. (1986). Parameter estimation for a mixture of linear regressions. Unpublished doctoral dissertation, Stanford University.
Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11, 95–103.
Xu, L. (1997). Comparative analysis on convergence rates of the EM algorithm and its two modifications for gaussian mixtures. Neural Processing Letters, 6, 69–76.
Xu, L., & Jordan, M. I. (1996). On convergence properties of the EM algorithm for gaussian mixtures. Neural Computation, 8, 129–151.

Received November 8, 1999; accepted February 29, 2000.
LETTER

Communicated by Stefan Schaal

Incremental Active Learning for Optimal Generalization

Masashi Sugiyama
Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8552, Japan

The problem of designing input signals for optimal generalization is called active learning. In this article, we give a two-stage sampling scheme for reducing both the bias and variance, and based on this scheme, we propose two active learning methods. One is the multipoint search method, which is applicable to arbitrary models; its effectiveness is shown through computer simulations. The other is the optimal sampling method in trigonometric polynomial models, which precisely specifies the optimal sampling locations.
1 Introduction
Supervised learning is the problem of obtaining an underlying rule from sampled information. Depending on the type of sampling, supervised learning can be classified into two categories. One is the case where information is given unilaterally by the environment. For example, in time-series prediction, sample points are fixed to regular intervals, and learners cannot change the interval. The other is the case where learners can design input signals by themselves and sample corresponding output signals. For example, it is possible to design input signals in many scientific experiments or in learning the sensorimotor maps of multijoint robot arms. Learning can be performed more efficiently if we can actively design input signals.

The problem of designing input signals for optimal generalization is called active learning (Cohn, Ghahramani, & Jordan, 1996; Fukumizu, 1996; Vijayakumar & Ogawa, 1999). It is also referred to as experimental design (Kiefer, 1959; Fedorov, 1972; Cohn, 1994) or query construction (Sollich, 1994). Reinforcement learning (Kaelbling, 1996), which has recently been studied extensively in the field of machine learning, can be regarded as another form of active learning.

Neural Computation 12, 2909–2940 (2000) © 2000 Massachusetts Institute of Technology

In mathematical statistics, an active learning criterion called the D-optimal design has been thoroughly studied (Kiefer, 1959; Kiefer & Wolfowitz, 1960; Fedorov, 1972). The D-optimal design minimizes the determinant of the
dispersion matrix of the estimator. One of the advantages of the D-optimal design is that it is invariant under all affine transformations in the input space (Kiefer, 1959). Kiefer and Wolfowitz (1960) showed that the D-optimal design agrees with the minimax design when the noise variance is of the same magnitude all over the domain. The minimax design is aimed at finding the sample points \(\{x_j\}\) minimizing the maximum of the noise variance:
\[ \operatorname*{argmin}_{\{x_j\}} \max_{x}\, E_n | f_0(x) - f(x) |^2, \tag{1.1} \]
where \(E_n\), \(f_0\), and \(f\) are the ensemble average over the noise, a learning result, and the learning target function, respectively.

In order to find the optimal design of sample points, the learning criterion prescribing the mapping from training examples to a learning result has to be determined. In a general approach to the D-optimal design, best linear unbiased estimation is adopted as a learning criterion, which is aimed at minimizing the mean noise variance over the domain under the constraint of unbiasedness. This implies that the criterion for the D-optimal design is inconsistent with the criterion for best linear unbiased estimation, causing a crucial problem for acquiring the optimal generalization capability.

Within the framework of Bayesian statistics, MacKay (1992) derived a criterion for selecting the most informative training data for specifying the parameters of neural networks. Cohn (1994, 1996) and Cohn et al. (1996) gave an active learning criterion for minimizing the variance of the estimator. Fukumizu (1996) proposed an active learning method for multilayer perceptrons using asymptotic approximation for estimating the generalization error. Essentially, the criteria derived in these articles are equivalent to the A-optimal design shown in Fedorov (1972). However, it is generally intractable to calculate the value of these criteria and difficult to find the optimal solution. MacKay (1992) proposed using a fixed set of reference points for estimating the value of the criterion. Cohn et al. (1996) recommended using Monte Carlo sampling for estimating the value of the criterion, and Cohn (1996) used the gradient method for finding a local optimum of the criterion. Fukumizu (1996) introduced a parametric family of density functions for generating sampling locations. In the above approaches, the active learning criteria are aimed at minimizing the variance of the estimator.

However, as Geman, Bienenstock, and Doursat (1992) showed, the generalization error consists of the bias and variance. This implies that these methods assume that the bias is zero or small enough to be neglected. Yue and Hickernell (1999) showed an upper bound of the generalization error and derived an active learning criterion for minimizing the upper bound. However, this criterion includes an unknown controlling parameter for the trade-off between the bias and variance, so the optimal solution cannot be obtained. Cohn (1997) used resampling methods such as bootstrapping (Efron & Tibshirani, 1993) and cross-validation (Stone, 1974) for estimating the bias and proposed an active learning method for reducing the bias. His experiments showed that the bias-only approach outperforms the variance-only approach.

From the functional analytic point of view, Vijayakumar and Ogawa (1999) gave a necessary and sufficient condition on the sampling locations to provide the optimal generalization capability in the absence of noise. In this condition, the bias is explicitly evaluated by using knowledge of the distribution of the learning target functions. Vijayakumar, Sugiyama, and Ogawa (1998) extended the condition to the noisy case by dividing the sampling scheme into two stages. The first stage is for reducing the bias, and the second stage is for reducing the variance while maintaining the small bias attained in the first stage.

In this article, we propose two active learning methods in the presence of noise. One is the multipoint search method, applicable to arbitrary models; its effectiveness is shown through computer simulations. The other is the optimal sampling method in trigonometric polynomial models; this method precisely specifies the optimal sampling locations. Both methods are based on the idea of the two-stage sampling scheme proposed by Vijayakumar et al. (1998). The difference is that a priori knowledge of the distribution of the target functions is not required in this article.

This article is organized as follows. In section 2, the supervised learning problem is formulated. Section 3 describes a general learning process and requirements for acquiring the optimal generalization capability. Section 4 gives a basic sampling strategy. Based on this strategy, section 5 gives the multipoint search method, and section 6 gives the optimal sampling method in trigonometric polynomial models. Finally, computer simulations are performed in section 7, demonstrating the effectiveness of the proposed methods.

2 Formulation of Supervised Learning Problem
In this section, the supervised learning problem is formulated from the functional analytic point of view (see Ogawa, 1989, 1992).

Let us consider a supervised learning problem of obtaining the optimal approximation to a target function \(f(x)\) of \(L\) variables from a set of \(m\) training examples. The training examples are made up of input signals \(x_j\) in \(D\), where \(D\) is a subset of the \(L\)-dimensional Euclidean space \(R^L\), and corresponding output signals \(y_j\) in the unitary space \(C\):
\[ \{(x_j, y_j) \mid y_j = f(x_j) + n_j\}_{j=1}^{m}, \tag{2.1} \]
where \(y_j\) is degraded by zero-mean additive noise \(n_j\). Let \(n^{(m)}\) and \(y^{(m)}\) be \(m\)-dimensional vectors whose \(j\)th elements are \(n_j\) and \(y_j\), respectively. \(y^{(m)}\) is called a sample value vector, and the space to which \(y^{(m)}\) belongs is called the sample value space. In this article, the target function \(f(x)\) is assumed to belong to a reproducing kernel Hilbert space \(H\) (Aronszajn, 1950; Bergman, 1970; Saitoh, 1988, 1997; Wahba, 1990). If \(H\) is unknown, it can be estimated by model selection methods (e.g., Akaike, 1974; Sugiyama & Ogawa, 1999d). The reproducing kernel \(K(x, x_0)\) is a bivariate function defined on \(D \times D\) that satisfies the following conditions:

For any fixed \(x_0\) in \(D\), \(K(x, x_0)\) is a function of \(x\) in \(H\).

For any function \(f\) in \(H\) and for any \(x_0\) in \(D\), it holds that
\[ \langle f(\cdot), K(\cdot, x_0) \rangle = f(x_0), \tag{2.2} \]
where \(\langle \cdot, \cdot \rangle\) denotes the inner product.
Note that the reproducing kernel is unique if it exists. In the theory of Hilbert spaces, arguments are developed by regarding a function as a point in the space. Thus, the value of a function at a point cannot be discussed within the general framework of Hilbert spaces. However, if the Hilbert space has a reproducing kernel, then it is possible to deal with the value of a function at a point. Indeed, if a function \(\psi_j(x)\) is defined as
\[ \psi_j(x) = K(x, x_j), \tag{2.3} \]
then the value of \(f\) at a sample point \(x_j\) is expressed as
\[ f(x_j) = \langle f, \psi_j \rangle. \tag{2.4} \]
For this reason, \(\psi_j\) is called a sampling function. Let \(A_m\) be an operator mapping \(f\) to an \(m\)-dimensional vector whose \(j\)th element is \(f(x_j)\):
\[ A_m f = (f(x_1), f(x_2), \ldots, f(x_m))^\top, \tag{2.5} \]
where \(\top\) denotes the transpose of a vector. We call \(A_m\) a sampling operator. Note that \(A_m\) is always a linear operator, even when we are concerned with a nonlinear function \(f(x)\). Indeed, \(A_m\) can be expressed as
\[ A_m = \sum_{j=1}^{m} \left( e_j^{(m)} \otimes \psi_j \right), \tag{2.6} \]
where \(e_j^{(m)}\) is the \(j\)th vector of the so-called standard basis in \(C^m\) and \(\otimes\) stands for the Neumann-Schatten product.¹ Then the relationship between \(f\) and \(y^{(m)}\) can be expressed as
\[ y^{(m)} = A_m f + n^{(m)}. \tag{2.7} \]

¹ For any fixed \(g\) in a Hilbert space \(H_1\) and any fixed \(f\) in a Hilbert space \(H_2\), the Neumann-Schatten product \(f \otimes g\) is an operator from \(H_1\) to \(H_2\) defined by using any \(h \in H_1\) as (Schatten, 1970) \((f \otimes g) h = \langle h, g \rangle f\).
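In a finite dimensional case the sampling operator can be written down directly. The following NumPy sketch is our own illustration (not from the article): it assumes the one-dimensional trigonometric model of section 6 with order \(N = 2\), builds \(A_m\) row by row from the basis values \(\varphi_k(x_j)\), and checks that \(A_m f\) reproduces the point values \(f(x_j)\), as equations 2.4 through 2.7 require.

```python
import numpy as np

N, m = 2, 5                                  # trig order and number of samples
ns = np.arange(-N, N + 1)                    # frequencies; dimension mu = 2N + 1
rng = np.random.default_rng(0)
# Coefficients of an arbitrary f in the trigonometric model (toy example).
a = rng.standard_normal(2 * N + 1) + 1j * rng.standard_normal(2 * N + 1)
xs = rng.uniform(-np.pi, np.pi, m)           # sampling locations x_j

# Sampling operator A_m: the (j, k) element is phi_k(x_j) = exp(i n_k x_j).
A = np.exp(1j * np.outer(xs, ns))

# Direct evaluation of f at the sample points ...
f_direct = np.array([np.sum(a * np.exp(1j * ns * x)) for x in xs])
# ... agrees with A_m f, i.e. f(x_j) = <f, psi_j>.
assert np.allclose(A @ a, f_direct)
```

The same matrix reappears below as the design matrix of equation 3.18.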
Figure 1: Supervised learning as an inverse problem.
Let \(f_m\) be a learning result obtained from \(m\) training examples, and let us denote the mapping from \(y^{(m)}\) to \(f_m\) by \(X_m\):
\[ f_m = X_m y^{(m)}, \tag{2.8} \]
where \(X_m\) is called a learning operator. Then the supervised learning problem is reformulated as an inverse problem of obtaining \(X_m\) providing the best approximation \(f_m\) to \(f\) under a certain learning criterion (see Figure 1).

3 Learning Process
In this section, we show a general process for supervised learning and describe the requirements for optimal generalization.

3.1 Requirements for Optimal Generalization. Supervised learning generally proceeds as illustrated in Figure 2. First, the learning criterion is determined in accordance with the purpose of learning. Then, what data to gather is decided, and sample values are gathered at the decided locations. Using the gathered training examples, a learning procedure is carried out, and the obtained learning result is evaluated. If the learning result is satisfactory, the learning process is completed. Otherwise, training examples are added to improve the learning result until it becomes satisfactory. In this article, training examples are sampled and added one by one along with this process.

In practical situations, the number of training examples is always finite. Hence, for acquiring the optimal generalization capability through
Figure 2: General process for supervised learning.
this learning process, the following requirements should be met in the nonasymptotic sense:

1. The criterion for active learning is consistent with the purpose of learning.
2. The active learning method precisely specifies the optimal sample points.
3. The incremental learning method provides exactly the same generalization capability as that obtained by batch learning with all training examples.

Strictly speaking, "optimal" in requirement 2 has two meanings: the globally optimal, where the set of all training examples is optimal (e.g., Sugiyama & Ogawa, 2000), and the greedy optimal, where the next training example to sample is optimal in each step (e.g., MacKay, 1992; Cohn, 1994, 1996; Fukumizu, 1996). In this article, we focus on the latter, greedy case and devise an incremental active learning method meeting the above requirements. In the rest of this section, we review projection learning and a method of incremental projection learning, which meets requirement 3.
3.2 Projection Learning. Function approximation is performed on the basis of a learning criterion. Our purpose of learning in this article is to minimize the generalization error of the learning result \(f_m\), measured by
\[ J_G = E_n \| f_m - f \|^2. \tag{3.1} \]
Then the following proposition holds.

Proposition 1 (Takemura, 1991). It holds that
\[ J_G = \| E_n f_m - f \|^2 + E_n \| f_m - E_n f_m \|^2. \tag{3.2} \]
The first and second terms of equation 3.2 are called the bias and variance of \(f_m\), respectively. Let us restrict our discussion to the case where the learning operator \(X_m\) in equation 2.8 is linear. Then it follows from equations 2.8 and 2.7 that the learning result \(f_m\) can be decomposed as
\[ f_m = X_m A_m f + X_m n^{(m)}. \tag{3.3} \]
In this case, it follows from equation 3.3 that
\[ E_n f_m = X_m A_m f, \tag{3.4} \]
and hence the mean of \(f_m\) over the noise belongs to \(R(X_m A_m)\), where \(R(\cdot)\) denotes the range of an operator. Let \(P_S\) be the orthogonal projection operator onto a subspace \(S\). In order to minimize the bias of \(f_m\), \(X_m A_m f\) should agree with the orthogonal projection of \(f\) onto \(R(X_m A_m)\):
\[ X_m A_m f = P_{R(X_m A_m)}\, f. \tag{3.5} \]
From Albert (1972), the operator equation
\[ X_m A_m = P_S \tag{3.6} \]
has a solution if and only if \(S \subset R(A_m^*)\), where \(A_m^*\) denotes the adjoint operator of \(A_m\). Since a bigger \(R(X_m A_m)\) provides a better approximation, we adopt the largest one:
\[ R(X_m A_m) = R(A_m^*). \tag{3.7} \]
For this reason, \(R(A_m^*)\) is called the approximation space. In order to reduce the generalization error, the variance of \(f_m\) should be minimized. This learning method is called projection learning:
Definition 1 (Projection Learning) (Ogawa, 1987). An operator \(X_m\) is called the projection learning operator if \(X_m\) minimizes the functional
\[ J_P[X_m] = E_n \| X_m n^{(m)} \|^2 \tag{3.8} \]
under the constraint
\[ X_m A_m = P_{R(A_m^*)}. \tag{3.9} \]

Let \(A_m^{\dagger}\) be the Moore-Penrose generalized inverse of \(A_m\).² Then the following proposition holds:

Proposition 2 (Ogawa, 1987). A general form of the projection learning operator is expressed as
\[ X_m = V_m^{\dagger} A_m^* U_m^{\dagger} + Y_m (I_m - U_m U_m^{\dagger}), \tag{3.10} \]
where \(Y_m\) is an arbitrary operator from \(C^m\) to \(H\) and
\[ Q_m = E_n \left( n^{(m)} \otimes n^{(m)} \right), \tag{3.11} \]
\[ U_m = A_m A_m^* + Q_m, \tag{3.12} \]
\[ V_m = A_m^* U_m^{\dagger} A_m. \tag{3.13} \]

Substituting equations 3.3, 3.4, and 3.9 into equation 3.2, we have
\[ J_G = \| P_{R(A_m^*)} f - f \|^2 + E_n \| X_m n^{(m)} \|^2. \tag{3.14} \]
Equation 3.14 implies that projection learning reduces the bias of \(f_m\) to a certain level and minimizes the variance of \(f_m\).

There are various methods for calculating the projection learning operator \(X_m\) and the projection learning result \(f_m\) by matrix operations. Here we show one of the simplest methods, valid for all finite dimensional Hilbert spaces \(H\). When the dimension of \(H\), denoted by \(\mu\), is finite, functions in \(H\) can be expressed in the form
\[ f(x) = \sum_{k=1}^{\mu} a_k \varphi_k(x), \tag{3.15} \]
² An operator \(X\) is called the Moore-Penrose generalized inverse of an operator \(A\) if \(X\) satisfies the following four conditions (Albert, 1972; Ben-Israel & Greville, 1974): \(AXA = A\), \(XAX = X\), \((AX)^* = AX\), and \((XA)^* = XA\). Note that the Moore-Penrose generalized inverse is unique and denoted as \(A^{\dagger}\).
where \(\{\varphi_k\}_{k=1}^{\mu}\) is an orthonormal basis in \(H\) and \(\{a_k\}_{k=1}^{\mu}\) are its coefficients. Let us consider a \(\mu\)-dimensional parameter space in which functions in \(H\) are expressed as
\[ f = (a_1, a_2, \ldots, a_\mu)^\top. \tag{3.16} \]
If we regard this parameter space as \(H\), then the sampling function \(\psi_j\) is expressed as
\[ \psi_j = (\varphi_1(x_j), \varphi_2(x_j), \ldots, \varphi_\mu(x_j))^*, \tag{3.17} \]
where \((a_1, a_2, \ldots, a_\mu)^*\) denotes the complex conjugate of the transpose of \((a_1, a_2, \ldots, a_\mu)\). Hence, the sampling operator \(A_m\) becomes an \(m \times \mu\) matrix whose \((j, k)\) element is
\[ [A_m]_{jk} = \varphi_k(x_j). \tag{3.18} \]
This \(A_m\) is sometimes called the design matrix (Efron & Tibshirani, 1993). Then the projection learning operator obtained by equation 3.10 becomes a \(\mu \times m\) matrix, and the projection learning result \(f_m\) obtained by equation 2.8 becomes a \(\mu\)-dimensional vector:
\[ f_m = (b_1, b_2, \ldots, b_\mu)^\top. \tag{3.19} \]
From this, we have the learning result function \(f_m(x)\) as
\[ f_m(x) = \sum_{k=1}^{\mu} b_k \varphi_k(x). \tag{3.20} \]
In practice, the calculation of the Moore-Penrose generalized inverse is sometimes unstable. To overcome this instability, we recommend using Tikhonov's regularization (Tikhonov & Arsenin, 1977),
\[ A_m^{\dagger} \leftarrow A_m^* (A_m A_m^* + \epsilon I)^{-1}, \tag{3.21} \]
where \(\epsilon\) is a small constant, say \(\epsilon = 10^{-4}\). It has been shown that learning results obtained by projection learning are invariant under the choice of inner product in the sample value space (Yamashita & Ogawa, 1992). Hence, without loss of generality, the Euclidean inner product is adopted in the sample value space. When the noise covariance matrix \(Q_m\) is of the form
\[ Q_m = \sigma^2 I_m \tag{3.22} \]
with \(\sigma^2 > 0\), the projection learning operator is given as (Ogawa, 1987)
\[ X_m = A_m^{\dagger}. \tag{3.23} \]
This implies that projection learning agrees with the usual least mean squares learning aimed at minimizing the training error,
\[ \sum_{j=1}^{m} | f_m(x_j) - y_j |^2. \tag{3.24} \]
3.3 Incremental Projection Learning. Let us consider the case where a new training example \((x_{m+1}, y_{m+1})\) is added after a learning result \(f_m\) has been obtained from \(\{(x_j, y_j)\}_{j=1}^{m}\). It follows from equation 2.8 that a learning result \(f_{m+1}\) obtained from \(\{(x_j, y_j)\}_{j=1}^{m+1}\) in a batch manner can be expressed as
\[ f_{m+1} = X_{m+1} y^{(m+1)}. \tag{3.25} \]
In order to devise an exact incremental learning method meeting requirement 3 shown in section 3.1, let us calculate \(f_{m+1}\) in equation 3.25 by using \(f_m\) and \((x_{m+1}, y_{m+1})\). Let the noise characteristics of \((x_{m+1}, y_{m+1})\) be
\[ q_{m+1} = E_n \left( \overline{n_{m+1}}\, n^{(m)} \right), \tag{3.26} \]
\[ \sigma_{m+1} = E_n | n_{m+1} |^2, \tag{3.27} \]
where \(\overline{n_{m+1}}\) denotes the complex conjugate of \(n_{m+1}\). Note that \(q_{m+1}\) is an \(m\)-dimensional vector while \(\sigma_{m+1}\) is a scalar. Let \(N(A_m)\) be the null space of \(A_m\), and let the following notation be defined.

Vectors:
\[ s_{m+1} = A_m \psi_{m+1} + q_{m+1}, \tag{3.28} \]
\[ t_{m+1} = U_m^{\dagger} s_{m+1}. \tag{3.29} \]

Scalars:
\[ \alpha_{m+1} = \psi_{m+1}(x_{m+1}) + \sigma_{m+1} - \langle t_{m+1}, s_{m+1} \rangle, \tag{3.30} \]
\[ \beta_{m+1} = y_{m+1} - f_m(x_{m+1}) - \langle y^{(m)} - A_m f_m,\, t_{m+1} \rangle. \tag{3.31} \]

Functions:
\[ \tilde{\psi}_{m+1} = P_{N(A_m)} \psi_{m+1}\ (= \psi_{m+1} - A_m^{\dagger} A_m \psi_{m+1}), \tag{3.32} \]
\[ \xi_{m+1} = \psi_{m+1} - A_m^* t_{m+1}, \tag{3.33} \]
\[ \tilde{\xi}_{m+1} = V_m^{\dagger} \xi_{m+1}. \tag{3.34} \]

From equation 2.4, \(\psi_{m+1}(x_{m+1})\) in equation 3.30 agrees with \(\|\psi_{m+1}\|^2\). As shown in Sugiyama and Ogawa (1999a, 1999c), additional training examples such that \(\alpha_{m+1} = 0\) can be rejected, since they have no effect on
learning results. Hence, from here on, we focus on training examples such that \(\alpha_{m+1} \ne 0\). Then the exact incremental learning method, called incremental projection learning (IPL), is given as follows:

Proposition 3 (Incremental Projection Learning) (Sugiyama & Ogawa, 1999a, 1999b). When \(\alpha_{m+1}\), defined by equation 3.30, is not zero, a posterior projection learning result \(f_{m+1}\) can be obtained by using the prior results \(f_m\), \(A_m\), \(U_m^{\dagger}\), \(V_m^{\dagger}\), and \(y^{(m)}\) as follows:

1. When \(\psi_{m+1} \notin R(A_m^*)\),
\[ f_{m+1} = f_m + \frac{\beta_{m+1}}{\tilde{\psi}_{m+1}(x_{m+1})}\, \tilde{\psi}_{m+1}. \tag{3.35} \]

2. When \(\psi_{m+1} \in R(A_m^*)\),
\[ f_{m+1} = f_m + \frac{\beta_{m+1}}{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}\, \tilde{\xi}_{m+1}. \tag{3.36} \]

Note that \(f_{m+1}\) obtained by proposition 3 exactly agrees with the learning result obtained by batch projection learning (see equation 3.25) with \(\{(x_j, y_j)\}_{j=1}^{m+1}\). \(\tilde{\psi}_{m+1}(x_{m+1})\) in equation 3.35 is equivalent to \(\|\tilde{\psi}_{m+1}\|^2\) (see equations 2.4 and 3.32). The condition \(\psi_{m+1} \notin R(A_m^*)\) means that \(\psi_{m+1}\) is linearly independent of \(\{\psi_j\}_{j=1}^{m}\); that is, the approximation space \(R(A_{m+1}^*)\) becomes wider than \(R(A_m^*)\). In contrast, \(\psi_{m+1} \in R(A_m^*)\) means that \(\psi_{m+1}\) is linearly dependent on \(\{\psi_j\}_{j=1}^{m}\), and hence the approximation space \(R(A_{m+1}^*)\) is equal to \(R(A_m^*)\).

Let us consider the case where the noise covariance matrix \(Q_{m+1}\) is positive definite³ and diagonal, that is,
\[ Q_{m+1} = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_{m+1}), \tag{3.37} \]
where \(\sigma_j > 0\) for all \(j\). Let \(\beta_{m+1}'\) and \(V_m'\) be
\[ \beta_{m+1}' = y_{m+1} - f_m(x_{m+1}), \tag{3.38} \]
\[ V_m' = A_m^* Q_m^{-1} A_m. \tag{3.39} \]
In \(\beta_{m+1}'\), the third term of \(\beta_{m+1}\) vanishes. In \(V_m'\), \(U_m^{\dagger}\) in \(V_m\) is replaced with \(Q_m^{-1}\). Then IPL is reduced to the following simpler form:

Proposition 3′ (Sugiyama & Ogawa, 1999a, 1999c). If \(Q_{m+1}\) is given by equation 3.37 with \(\sigma_j > 0\) for all \(j\), then a posterior projection learning result \(f_{m+1}\) can

³ An operator \(T\) is said to be positive definite if \(\langle Tf, f \rangle > 0\) for any \(f \ne 0\).
be obtained by using the prior results \(f_m\) and \(V_m'^{\dagger}\) as follows:

1. When \(\psi_{m+1} \notin R(A_m^*)\),
\[ f_{m+1} = f_m + \frac{\beta_{m+1}'}{\tilde{\psi}_{m+1}(x_{m+1})}\, \tilde{\psi}_{m+1}. \tag{3.40} \]

2. When \(\psi_{m+1} \in R(A_m^*)\),
\[ f_{m+1} = f_m + \frac{\beta_{m+1}'}{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}\, V_m'^{\dagger} \psi_{m+1}. \tag{3.41} \]

Compared with proposition 3, \(\beta_{m+1}\) is replaced with \(\beta_{m+1}'\), and \(\alpha_{m+1}\), \(\xi_{m+1}\), and \(\tilde{\xi}_{m+1}\) are not required in proposition 3′. Again, \(f_{m+1}\) obtained by proposition 3′ exactly agrees with the learning result obtained by batch projection learning with \(\{(x_j, y_j)\}_{j=1}^{m+1}\). If \(\sigma_1 = \sigma_2 = \cdots = \sigma_{m+1} = \sigma^2\ (> 0)\), that is,
\[ Q_{m+1} = \sigma^2 I_{m+1}, \tag{3.42} \]
then equation 3.41 yields
\[ f_{m+1} = f_m + \frac{\beta_{m+1}'}{1 + \langle (A_m^* A_m)^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}\, (A_m^* A_m)^{\dagger} \psi_{m+1}. \tag{3.43} \]
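Equation 3.43 can be sanity checked numerically. The sketch below is our own illustration (a generic real design matrix, not the trigonometric model; with \(m > \mu\) and \(A_m\) of full column rank, \(\psi_{m+1} \in R(A_m^*)\) always holds, so the rank-one update case applies). It verifies that one IPL step reproduces batch projection learning on all \(m + 1\) examples, which is exactly requirement 3.

```python
import numpy as np

rng = np.random.default_rng(2)
m, mu = 8, 4
A = rng.standard_normal((m, mu))             # sampling operator for m points
y = rng.standard_normal(m)                   # sample value vector y^(m)
f_m = np.linalg.pinv(A) @ y                  # batch projection learning, Q = s^2 I

# One IPL step (eq. 3.43); psi is the sampling function of the new point.
psi = rng.standard_normal(mu)
y_new = 0.3
G = np.linalg.pinv(A.T @ A)                  # (A_m^* A_m)^dagger
beta = y_new - psi @ f_m                     # beta'_{m+1} = y_{m+1} - f_m(x_{m+1})
f_inc = f_m + beta / (1.0 + psi @ G @ psi) * (G @ psi)

# Requirement 3: the incremental result equals batch learning on m+1 examples.
f_batch = np.linalg.pinv(np.vstack([A, psi])) @ np.append(y, y_new)
assert np.allclose(f_inc, f_batch)
```

The agreement follows from the Sherman-Morrison identity applied to \(A_m^* A_m + \psi \psi^*\), which is what the rank-one IPL update exploits.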
Equations 3.40 and 3.43 imply that the value of \(\sigma^2\) is not required for IPL when \(Q_{m+1}\) is in the form of equation 3.42.

4 Basic Sampling Strategy
This section is devoted to giving a sampling strategy that is the basis for devising active learning methods in the following sections. Let \(J_b\) and \(J_v\) be the changes in the bias and variance of \(f_m\) through the addition of a training example \((x_{m+1}, y_{m+1})\), respectively, that is,
\[ J_b = \| P_{R(A_{m+1}^*)} f - f \|^2 - \| P_{R(A_m^*)} f - f \|^2, \tag{4.1} \]
\[ J_v = E_n \| X_{m+1} n^{(m+1)} \|^2 - E_n \| X_m n^{(m)} \|^2. \tag{4.2} \]
Then the following proposition holds:

Proposition 4 (Sugiyama & Ogawa, 1999a, 1999c). For any additional training example \((x_{m+1}, y_{m+1})\) such that \(\alpha_{m+1} \ne 0\), the following relations hold:

1. When \(\psi_{m+1} \notin R(A_m^*)\),
\[ J_b \le 0 \quad \text{and} \quad J_v \ge 0. \tag{4.3} \]
2. When \(\psi_{m+1} \in R(A_m^*)\),
\[ J_b = 0 \quad \text{and} \quad J_v \le 0. \tag{4.4} \]

Proposition 4 states that an additional training example such that \(\psi_{m+1} \notin R(A_m^*)\) reduces or maintains the bias while it increases or maintains the variance. In contrast, an additional training example such that \(\psi_{m+1} \in R(A_m^*)\) maintains the bias while it reduces or maintains the variance.

Let us consider the case where the dimension of the Hilbert space \(H\) is finite, and the total number \(M\) of training examples to sample is larger than or equal to the dimension of \(H\). In this case, it follows from equation 3.14 that the bias of learning results is zero for any \(f\) in \(H\) if and only if \(N(A_m) = \{0\}\). For this reason, we comply with the two-stage sampling scheme shown in Figure 3. We start from \(m = 0\). In stage 1, training examples such that \(\psi_{m+1} \notin R(A_m^*)\) are added to reduce the bias until it reaches zero. Let \(\mu\) be the dimension of \(H\). Stage 1 ends once a training example such that \(\psi_{m+1} \notin R(A_m^*)\) has been added \(\mu\) times, by which \(N(A_\mu) = \{0\}\) can be attained. In stage 2, training examples such that \(\psi_{m+1} \in R(A_m^*)\) are added to reduce the variance until the number of added training examples becomes \(M\). Note that additional training examples such that \(\psi_{m+1} \in R(A_m^*)\) maintain the bias (see proposition 4(2)). Hence, the bias remains zero throughout stage 2.

The criterion for active learning should be consistent with the purpose of learning; that is, it should be aimed at minimizing the generalization error. Therefore, our active learning problems in the two stages become as follows:

Stage 1: Find a sample point minimizing \(J_v\) under the constraint \(\psi_{m+1} \notin R(A_m^*)\).

Stage 2: Find a sample point minimizing \(J_v\) under the constraint \(\psi_{m+1} \in R(A_m^*)\).

Note that all additional training examples in stage 2 satisfy \(\psi_{m+1} \in R(A_m^*)\). This means that in stage 2, the constraint \(\psi_{m+1} \in R(A_m^*)\) does not have to be taken into account. The condition \(\psi_{m+1} \notin R(A_m^*)\) in stage 1 can be easily verified, since \(\psi_{m+1} \notin R(A_m^*)\) if and only if
\[ P_{N(A_m)} \psi_{m+1} = \tilde{\psi}_{m+1} \ne 0. \]
In practice, we recommend using the following criterion:
\[ \text{if } \|\tilde{\psi}_{m+1}\|^2 > \epsilon, \text{ then } \psi_{m+1} \notin R(A_m^*), \tag{4.5} \]
where \(\epsilon\) is a small constant, say, \(\epsilon = 10^{-4}\).
Figure 3: Two-stage sampling scheme.
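The range test of equation 4.5 used in stage 1 of this scheme is cheap to implement. A minimal sketch (our own; the matrices are toy assumptions, not from the article) computes \(\tilde{\psi}_{m+1} = \psi_{m+1} - A_m^{\dagger} A_m \psi_{m+1}\) and thresholds its squared norm:

```python
import numpy as np

def in_range_of_adjoint(A, psi, eps=1e-4):
    """Test whether psi lies in R(A_m^*), via eq. 4.5: psi_tilde is the
    projection of psi onto the null space N(A_m)."""
    psi_tilde = psi - np.linalg.pinv(A) @ (A @ psi)
    return np.dot(psi_tilde, psi_tilde) <= eps

A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])             # R(A^*) = span{e1, e2}
assert in_range_of_adjoint(A, np.array([1.0, 2.0, 0.0]))      # already spanned
assert not in_range_of_adjoint(A, np.array([0.0, 0.0, 1.0]))  # new direction
```

A candidate failing this test widens the approximation space and so belongs to stage 1; a candidate passing it is a stage-2 example.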
In the statistical active learning methods devised so far, the bias of the estimator is assumed to be zero (MacKay, 1992; Cohn, 1994; Fukumizu, 1996). The interpretation of this assumption is illustrated in Figure 4. Let \(\hat{f}\) be the function that is the best approximation to \(f\) in \(H\). Then the assumption of zero bias is equivalent to \(f = \hat{f}\) and \(\hat{f} = E_n f_m\); namely, \(f\) belongs to \(H\), and the mean of \(f_m\) over the noise agrees with \(\hat{f}\). In contrast, the condition
Figure 4: The interpretation of the assumptions in statistical active learning methods and our method. Let \(\hat{f}\) be the function that is the best approximation to \(f\) in \(H\). Statistical active learning methods assume that \(f = \hat{f}\) and \(\hat{f} = E_n f_m\); namely, \(f\) belongs to \(H\), and the mean of \(f_m\) over the noise agrees with \(\hat{f}\). In contrast, our method assumes only that \(f \in H\). The difference between \(\hat{f}\) and \(E_n f_m\) is explicitly evaluated in stage 1.
assumed in our framework is only \(f \in H\). The difference between \(\hat{f}\) and \(E_n f_m\) is explicitly evaluated in stage 1.

Based on the two-stage sampling scheme shown in Figure 3, we propose two active learning methods in the following sections.

5 Multipoint Search Active Learning
In this section, we propose an active learning method based on multipoint search. In the derivation of the multipoint search method, the following theorem plays a central role:

Theorem 1. \(J_v\), defined by equation 4.2, can be expressed as follows:

1. When \(\psi_{m+1} \notin R(A_m^*)\),
\[ J_v = \frac{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})} - 1. \tag{5.1} \]

2. When \(\psi_{m+1} \in R(A_m^*)\),
\[ J_v = - \frac{\|\tilde{\xi}_{m+1}\|^2}{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}. \tag{5.2} \]
Proofs of all theorems are given in the appendices. Theorem 1 implies that \(J_v\) can be calculated without \(y_{m+1}\): we can evaluate the quality of additional training examples by using only their sampling locations. It should be noted that equation 5.2 is conceptually similar to the criteria shown in Fedorov (1972), MacKay (1992), and Cohn (1994). The difference is that the noise is assumed to be independently and identically distributed in their methods, while correlated noise can be treated in our method if the noise covariance matrix is available. When the noise is uncorrelated, theorem 1 reduces to the following simpler form:

Theorem 2. If \(Q_{m+1}\) is given by equation 3.37 with \(\sigma_j > 0\) for all \(j\), then \(J_v\) can be expressed as follows:
1. When \(\psi_{m+1} \notin R(A_m^*)\),
\[ J_v = \frac{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})}. \tag{5.3} \]

2. When \(\psi_{m+1} \in R(A_m^*)\),
\[ J_v = - \frac{\|V_m'^{\dagger} \psi_{m+1}\|^2}{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}. \tag{5.4} \]
Compared with theorem 1, \(\alpha_{m+1}\), \(\xi_{m+1}\), and \(\tilde{\xi}_{m+1}\) are not required in theorem 2. If \(\sigma_1 = \sigma_2 = \cdots = \sigma_{m+1} = \sigma^2\ (> 0)\), that is,
\[ Q_{m+1} = \sigma^2 I_{m+1}, \tag{5.5} \]
then theorem 2 becomes as follows:

Corollary 1. If \(Q_{m+1}\) is given by equation 5.5, then \(J_v\) can be expressed as follows:

1. When \(\psi_{m+1} \notin R(A_m^*)\),
\[ J_v = \sigma^2\, \frac{1 + \langle (A_m^* A_m)^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})}. \tag{5.6} \]

2. When \(\psi_{m+1} \in R(A_m^*)\),
\[ J_v = -\sigma^2\, \frac{\|(A_m^* A_m)^{\dagger} \psi_{m+1}\|^2}{1 + \langle (A_m^* A_m)^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}. \tag{5.7} \]

Corollary 1 implies that when \(Q_{m+1}\) is given by equation 5.5 with \(\sigma^2 > 0\), the value of \(\sigma^2\) is not required for the minimization of \(J_v\). Based on the above theorems and corollary, the algorithm of the multipoint search active learning method is described in Figure 5.
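A single greedy step of the multipoint search in stage 2 can be sketched as follows (our own illustration, not the authors' code; the order-2 trigonometric model, the candidate count, and the dropped constant factor \(\sigma^2\) are all assumptions). Each candidate location is scored by equation 5.7, and the minimizer of \(J_v\), i.e., the largest variance reduction, is sampled next:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2
ns = np.arange(-N, N + 1)                   # frequencies of the trig basis
phi = lambda x: np.exp(1j * ns * x)         # basis functions evaluated at x

# Stage-2 situation: 8 points already sampled, so N(A_m) = {0} and every
# candidate satisfies psi_{m+1} in R(A_m^*).
xs = rng.uniform(-np.pi, np.pi, 8)
A = np.vstack([phi(x) for x in xs])         # design matrix (eq. 3.18)
G = np.linalg.pinv(A.conj().T @ A)          # (A_m^* A_m)^dagger

def Jv(x):
    """Variance change of eq. 5.7; the constant factor sigma^2 is dropped
    because it does not affect the argmin."""
    psi = phi(x).conj()                     # sampling function (eq. 3.17)
    g = G @ psi
    return -np.linalg.norm(g) ** 2 / (1.0 + np.real(np.vdot(g, psi)))

candidates = rng.uniform(-np.pi, np.pi, 20) # randomly proposed locations
x_next = min(candidates, key=Jv)            # greedy choice minimizing J_v
assert Jv(x_next) < 0                       # the variance is strictly reduced
```

Thanks to corollary 1, the noise level never enters the computation: only the already-sampled locations and the candidate location are needed.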
Incremental Active Learning for Optimal Generalization
Figure 5: Algorithm of the multipoint search active learning method.
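As a concrete illustration of the selection loop in Figure 5, the sketch below applies the Corollary 1 criterion in a toy one-dimensional trigonometric space of order 1 (so $\mu = 3$); the toy setup, variable names, and candidate count are our own illustrative assumptions, not part of the original algorithm specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (ours, for illustration): an order-1 trigonometric space
# (mu = 3) with orthonormal basis {1, sqrt(2) cos x, sqrt(2) sin x}; the
# sampling function psi_x has coordinates phi(x) in this basis.
def phi(x):
    return np.array([1.0, np.sqrt(2) * np.cos(x), np.sqrt(2) * np.sin(x)])

def J_v(A, u, sigma2=1.0, tol=1e-9):
    """Corollary 1 criterion for a candidate with coordinate vector u.

    The rows of A are the coordinate vectors of the m points sampled so
    far, so A.T @ A represents A*_m A_m, and (I - pinv(A) A) u gives the
    coordinates of psi~_{m+1}.
    """
    G_pinv = np.linalg.pinv(A.T @ A)
    u_tilde = u - np.linalg.pinv(A) @ (A @ u)
    q = float(u @ G_pinv @ u)
    if float(u_tilde @ u_tilde) > tol:   # psi_{m+1} not in R(A*_m): eq. 5.6
        return sigma2 * (1.0 + q) / float(u_tilde @ u)
    w = G_pinv @ u                       # psi_{m+1} in R(A*_m): eq. 5.7
    return -sigma2 * float(w @ w) / (1.0 + q)

# Multipoint search (Figure 5): at each step, draw c = 3 random candidate
# locations and add the one minimizing J_v.
A = phi(0.0)[None, :]                    # one initial training location
for _ in range(6):
    candidates = rng.uniform(-np.pi, np.pi, size=3)
    best = min(candidates, key=lambda x: J_v(A, phi(x)))
    A = np.vstack([A, phi(best)])
```

Because equation 5.6 rewards directions not yet covered by $\mathcal{R}(A_m^*)$, the loop soon reaches $\mathcal{N}(A_m) = \{0\}$, after which equation 5.7 applies and $J_v$ becomes negative.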
Strictly, the algorithm shown in Figure 5 does not meet requirement 2 mentioned in section 3.1. However, it will be experimentally shown through computer simulations in section 7 that the multipoint search method specifies a better sampling location. If the dimension of the input $x$ is very large, many candidates may be required for finding a better sampling location. One remedy is to employ the gradient method for finding a local minimum of $J_v$; that is, starting from some initial value, $x_{m+1}$ is updated until a certain convergence criterion holds:
$$x_{m+1} \leftarrow x_{m+1} - \varepsilon \nabla J_v(x_{m+1}), \qquad (5.8)$$
where $\varepsilon$ is a small, positive constant and $\nabla J_v(x_{m+1})$ is the gradient of $J_v$ at $x_{m+1}$.

6 Optimal Sampling in the Trigonometric Polynomial Space
In the previous section, we gave an active learning method for general finite-dimensional Hilbert spaces. In this section, we focus on the trigonometric polynomial space and devise a more effective active learning method. This method strictly meets requirement 2 described in section 3.1.
First, our model, the trigonometric polynomial space, is defined as follows:

Definition 2 (Trigonometric Polynomial Space). Let $x = (\xi^{(1)}, \xi^{(2)}, \ldots, \xi^{(L)})^{\top}$. For $1 \le l \le L$, let $N_l$ be a nonnegative integer and $D_l = [-\pi, \pi]$. Then a function space $H$ is called a trigonometric polynomial space of order $(N_1, N_2, \ldots, N_L)$ if $H$ is spanned by
$$\left\{ \prod_{l=1}^{L} \exp\left( i n_l \xi^{(l)} \right) \right\}_{n_1 = -N_1,\, n_2 = -N_2,\, \ldots,\, n_L = -N_L}^{N_1,\, N_2,\, \ldots,\, N_L} \qquad (6.1)$$
defined on $D_1 \times D_2 \times \cdots \times D_L$, and the inner product in $H$ is defined as
$$\langle f, g \rangle = \frac{1}{(2\pi)^L} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \cdots \int_{-\pi}^{\pi} f(x) \overline{g(x)} \, d\xi^{(1)} d\xi^{(2)} \cdots d\xi^{(L)}. \qquad (6.2)$$
Note that the function space spanned by $\{\exp(i n \xi)\}_{n=-N}^{N}$ is equal to the function space spanned by $\{1, \cos n\xi, \sin n\xi\}_{n=1}^{N}$. The dimension $\mu$ of a trigonometric polynomial space of order $(N_1, N_2, \ldots, N_L)$ is
$$\mu = \prod_{l=1}^{L} (2N_l + 1). \qquad (6.3)$$
The reproducing kernel of a trigonometric polynomial space of order $(N_1, N_2, \ldots, N_L)$ is expressed as
$$K(x, x') = \prod_{l=1}^{L} K_l(\xi^{(l)}, \xi^{(l)\prime}), \qquad (6.4)$$
where
$$K_l(\xi^{(l)}, \xi^{(l)\prime}) =
\begin{cases}
\dfrac{\sin \frac{(2N_l + 1)(\xi^{(l)} - \xi^{(l)\prime})}{2}}{\sin \frac{\xi^{(l)} - \xi^{(l)\prime}}{2}} & \text{if } \xi^{(l)} \ne \xi^{(l)\prime}, \\[2ex]
2N_l + 1 & \text{if } \xi^{(l)} = \xi^{(l)\prime}.
\end{cases} \qquad (6.5)$$
The profile of equation 6.4 is illustrated in Figure 6. Then we have the following theorem:

Theorem 3. Suppose that the noise covariance matrix $Q_M$ is
$$Q_M = \sigma^2 I_M \qquad (6.6)$$
Figure 6: Profile of the reproducing kernel of a trigonometric polynomial space of order $(5, 3)$ with $x' = (0, 0)^{\top}$.
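The factorized kernel of equations 6.4 and 6.5 is simple to implement directly; the following minimal NumPy rendering is our own sketch. It also makes the peak value visible: $K(x, x) = \prod_l (2N_l + 1) = \mu$, which for order $(5, 3)$ equals 77, matching the height of the peak in Figure 6.

```python
import numpy as np

def K_l(xi, xi_p, N_l):
    """One factor of the kernel, equation 6.5 (a Dirichlet kernel)."""
    d = xi - xi_p
    if np.isclose(np.sin(d / 2.0), 0.0):
        return 2 * N_l + 1               # removable singularity: xi = xi'
    return np.sin((2 * N_l + 1) * d / 2.0) / np.sin(d / 2.0)

def K(x, x_p, orders):
    """Product kernel of equation 6.4 for order (N_1, ..., N_L)."""
    return float(np.prod([K_l(a, b, n) for a, b, n in zip(x, x_p, orders)]))
```

The `isclose` guard covers the removable singularity at $\xi^{(l)} = \xi^{(l)\prime}$ (and at shifts by $2\pi$, where by periodicity the same limit applies).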
with $\sigma^2 > 0$. For $0 \le k \le \lfloor \frac{M-1}{\mu} \rfloor$ and $1 \le l \le L$, let $c_k^{(l)}$ be an arbitrary constant such that $-\pi \le c_k^{(l)} \le -\pi + \frac{2\pi}{2N_l + 1}$, where $\lfloor c \rfloor$ denotes the maximum integer less than or equal to $c$. If we put
$$x_{m+1} = \left( \xi_{m+1}^{(1)}, \xi_{m+1}^{(2)}, \ldots, \xi_{m+1}^{(L)} \right)^{\top}: \quad \xi_{m+1}^{(l)} = c_p^{(l)} + \frac{2\pi q_l}{2N_l + 1}, \qquad (6.7)$$
where
$$p = \left\lfloor \frac{m}{\mu} \right\rfloor, \qquad (6.8)$$
$$q_l =
\begin{cases}
m \bmod (2N_1 + 1) & \text{if } l = 1, \\
\left\lfloor m \Big/ \prod_{r=1}^{l-1} (2N_r + 1) \right\rfloor \bmod (2N_l + 1) & \text{if } 2 \le l \le L,
\end{cases} \qquad (6.9)$$
then $x_{m+1}$ minimizes $J_v$ under the constraint that $\{x_j\}_{j=1}^{m}$ are successively determined by equation 6.7.

Theorem 3 states that for each $p$, $\mu$ locations are fixed at regular intervals. Figure 7 illustrates a set of 63 sample points determined by theorem 3 when $L = 2$, $N_1 = 3$, and $N_2 = 1$. For each $p$, the base point $(c_p^{(1)}, c_p^{(2)})^{\top}$ is fixed in the dark region, and 21 sample points are fixed at regular intervals. From equation 6.4, the set $\{\frac{1}{\sqrt{\mu}} \psi_{p\mu + q}\}_{q=1}^{\mu}$ of sampling functions forms an orthonormal basis in $H$ for each $p$. Although this sampling scheme is derived as the greedy optimal scheme, it is in fact also globally optimal when $(M \bmod \mu) = 0$ (see Sugiyama & Ogawa, 2000).
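The regular grid of equations 6.7 to 6.9 and the orthogonality relation it induces (equation C.3 in appendix C) can be checked numerically. The order $(3, 1)$ and the base point below are illustrative choices taken from Figure 7, while the code itself is our own sketch:

```python
import numpy as np

N = (3, 1)                               # order (N_1, N_2); L = 2
mu = (2 * N[0] + 1) * (2 * N[1] + 1)     # mu = 7 * 3 = 21

def sample_point(m, base):
    # Equations 6.7-6.9 with p fixed (a single sweep of mu points):
    # q_1 = m mod (2N_1+1), q_2 = floor(m / (2N_1+1)) mod (2N_2+1).
    q1 = m % (2 * N[0] + 1)
    q2 = (m // (2 * N[0] + 1)) % (2 * N[1] + 1)
    return (base[0] + 2 * np.pi * q1 / (2 * N[0] + 1),
            base[1] + 2 * np.pi * q2 / (2 * N[1] + 1))

def K(x, x_p):
    """Reproducing kernel of equations 6.4 and 6.5 for order N."""
    out = 1.0
    for a, b, n in zip(x, x_p, N):
        d = a - b
        out *= (2 * n + 1) if np.isclose(np.sin(d / 2), 0.0) \
            else np.sin((2 * n + 1) * d / 2) / np.sin(d / 2)
    return out

base = (-np.pi, -np.pi)                  # an admissible base point c_0
pts = [sample_point(m, base) for m in range(mu)]
gram = np.array([[K(p, q) for q in pts] for p in pts])
```

Within one sweep of $\mu$ points, the Gram matrix of the sampling functions is $\mu I$, which is exactly the condition $\langle \psi_{j'}, \psi_j \rangle = \mu \delta_{tt'}$ exploited in the proof of theorem 3.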
Figure 7: Optimal sample points in a trigonometric polynomial space of order $(3, 1)$. The number $M$ of training examples is $7 \times 3 \times 3 = 63$.

7 Computer Simulations
In this section, the effectiveness of the proposed active learning methods is demonstrated through computer simulations.

7.1 Illustrative Simulation. Let us consider learning in a trigonometric polynomial space of order 3. Let the target function $f(x)$ be
$$f(x) = \sqrt{2} \sin x + \sqrt{2} \cos x + \frac{1}{\sqrt{2}} \sin 2x + \frac{1}{\sqrt{2}} \cos 2x - \sqrt{2} \sin 3x + \sqrt{2} \cos 3x. \qquad (7.1)$$
Let the noise covariance matrix be
$$Q_M = I_M. \qquad (7.2)$$
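The setting of equations 7.1 and 7.2 is easy to reproduce: with $M = 21$ samples taken at the theorem 3 locations, a least-squares fit in the seven-dimensional trigonometric basis recovers $f$ up to noise. The sketch below is our own reconstruction of this setup (the seed, grid resolution, and error estimate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """Target function of equation 7.1."""
    return (np.sqrt(2) * np.sin(x) + np.sqrt(2) * np.cos(x)
            + np.sin(2 * x) / np.sqrt(2) + np.cos(2 * x) / np.sqrt(2)
            - np.sqrt(2) * np.sin(3 * x) + np.sqrt(2) * np.cos(3 * x))

def design(x):
    """Orthonormal basis of the order-3 trigonometric space (mu = 7)."""
    return np.column_stack([np.ones_like(x)]
                           + [np.sqrt(2) * np.cos(n * x) for n in (1, 2, 3)]
                           + [np.sqrt(2) * np.sin(n * x) for n in (1, 2, 3)])

# Theorem 3 sampling: three sweeps (M = 21) over mu = 7 equispaced points.
c = -np.pi + np.pi / 7
x = np.concatenate([c + 2 * np.pi * np.arange(7) / 7 for _ in range(3)])
y = f(x) + rng.standard_normal(x.size)   # noise with Q_21 = I_21 (eq. 7.2)

coef, *_ = np.linalg.lstsq(design(x), y, rcond=None)

# Generalization error: mean squared deviation from f over the domain.
grid = np.linspace(-np.pi, np.pi, 2001)
J_G = float(np.mean((design(grid) @ coef - f(grid)) ** 2))
```

With noise variance 1 and 21 examples, the expected generalization error is roughly $\sigma^2 \mu / M = 7/21 \approx 0.33$, consistent with the values reported below for Figure 8.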
We shall compare the performance of the following sampling schemes:

1. Optimal sampling. Training examples are sampled following theorem 3 with $c_k^{(1)} = -\pi + \frac{\pi}{2 \times 3 + 1}$ for all $k$.

2. Multipoint search. Training examples are sampled following the multipoint search method shown in Figure 5. Let the number $c$ of candidates be three, and let them be randomly generated in the domain $[-\pi, \pi]$.
Figure 8: Results of the learning simulation in a trigonometric polynomial space of order 3 with the noise covariance matrix $Q_{21} = I_{21}$. The solid line shows the target function $f(x)$. Dashed lines in A–D show the learning results obtained by the optimal sampling method (see theorem 3), the multipoint search method (see Figure 5), the experimental design method, and passive learning, respectively. $\circ$ denotes a training example. The generalization error $J_G$ is measured by equation 3.1.
3. Experiment design. Equation 2 in Cohn (1994) is adopted as the active learning criterion. The value of this criterion is evaluated by Monte Carlo sampling with 30 reference points. The next sampling location is determined by the multipoint search with three candidates; namely, three locations are randomly created in the domain, and the one minimizing the criterion is selected.

4. Passive learning. Training examples are randomly supplied from the domain.

Learning results obtained by the above sampling schemes with 21 training examples are shown in Figure 8. The solid and dashed lines show the target function $f(x)$ and the learning results, respectively. $\circ$ denotes a training example. Figures 8A–8D show the learning results obtained by sampling schemes 1–4, where the generalization errors measured by equation 3.1 are 0.333, 0.342, 0.358, and 0.807, respectively. These results show that sampling schemes 1–3 give 58.7%, 57.6%, and 55.7% reductions in the generalization error, respectively, compared with sampling scheme 4. The generalization capability acquired by sampling schemes 2 and 3 is close to that obtained by sampling scheme 1. This means that sampling schemes 2 and 3 work quite well with a small number of candidates, since sampling scheme 1 gives the optimal generalization capability (see section 6).

Figure 9: Relation between the number of training examples and the noise variance in a trigonometric polynomial space of order 3 with the noise covariance matrix $Q_M = I_M$. The horizontal axis denotes the number of training examples, while the vertical axis denotes the noise variance measured by equation 3.8. The vertical axis can be regarded as the generalization error when the number of training examples is larger than or equal to 7.

The changes in the variance through the addition of training examples are shown in Figure 9. The horizontal axis denotes the number of training examples, and the vertical axis denotes the variance measured by equation 3.8. The solid line shows sampling scheme 1. The dashed, dash-dotted, and dotted lines denote the means of 100 trials by sampling schemes 2, 3, and 4, respectively. In sampling schemes 1 and 2, it always holds that $\mathcal{N}(A_7) = \{0\}$ because the dimension of $H$ is 7 (see section 4). In sampling schemes 3 and 4, $\mathcal{N}(A_7) = \{0\}$ was attained in all 100 trials in this simulation. Hence, it follows from equation 3.14 that the vertical axis in Figure 9 can be regarded as the generalization error when $m \ge 7$. This graph shows that when $m \le 7$, the variance of all sampling schemes increases; this phenomenon is in good agreement with proposition 4. In this case, sampling schemes 1 and 2 give much lower variance than sampling schemes 3 and 4. It should be noted that sampling scheme 3 gives higher variance than sampling scheme 4, passive learning. This may be caused by the fact that the bias is not zero, so the criterion for the optimal experiment design is no longer valid. When $m > 7$, the variances of all sampling schemes decrease, as shown in proposition 4. Sampling schemes 1–3 suppress the variance more efficiently than sampling scheme 4. This simulation shows that sampling schemes 1 and 2 outperform sampling scheme 3, especially when the number of training examples is small, thanks to the two-stage sampling scheme. Also, the multipoint search method with a small number of candidates is shown to work well.

Figure 10: Two-joint robot arm.

7.2 Learning of the Sensorimotor Map of a Two-Joint Robot Arm. In this subsection, we apply the multipoint search active learning method proposed in section 5 to a real-world problem. Let us consider learning of the sensorimotor maps of the two-joint robot arm shown in Figure 10. The sensorimotor map of the $k$th joint ($k = 1, 2$) is a mapping from the joint angles $\theta_k$, angular velocities $\dot{\theta}_k$, and angular accelerations $\ddot{\theta}_k$ to the torque $\tau_k$ that should be applied to the $k$th joint:
$$\tau_k = f^{(k)}(\theta_1, \theta_2, \dot{\theta}_1, \dot{\theta}_2, \ddot{\theta}_1, \ddot{\theta}_2) \quad \text{for } k = 1, 2, \qquad (7.3)$$
where $-\pi \le \theta_1, \theta_2 \le \pi$, $-a_1 \le \dot{\theta}_1 \le a_1$, $-a_2 \le \dot{\theta}_2 \le a_2$, $-b_1 \le \ddot{\theta}_1 \le b_1$, and $-b_2 \le \ddot{\theta}_2 \le b_2$.
The function spaces $H_k$ to which the $f^{(k)}$ belong are given as follows (Vijayakumar, 1998):
$$H_1 = L(\ddot{\theta}_1, \ddot{\theta}_2, \ddot{\theta}_1 \cos \theta_2, \ddot{\theta}_2 \cos \theta_2, \dot{\theta}_2^2 \sin \theta_2, \dot{\theta}_1 \dot{\theta}_2 \sin \theta_2, \sin \theta_1, \sin \theta_1 \cos \theta_2, \sin \theta_2 \cos \theta_1), \qquad (7.4)$$
$$H_2 = L(\ddot{\theta}_1, \ddot{\theta}_2, \ddot{\theta}_1 \cos \theta_2, \dot{\theta}_1^2 \sin \theta_2, \sin \theta_1, \sin \theta_1 \cos \theta_2, \sin \theta_2 \cos \theta_1), \qquad (7.5)$$
where $H = L(\varphi_1, \varphi_2, \ldots, \varphi_k)$ means that $H$ is spanned by $\varphi_1, \varphi_2, \ldots, \varphi_k$. The inner product in $H_k$ is defined as
$$\langle f, g \rangle = \frac{1}{64 \pi^2 a_1 a_2 b_1 b_2} \int_{-b_2}^{b_2} \int_{-b_1}^{b_1} \int_{-a_2}^{a_2} \int_{-a_1}^{a_1} \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} f(x) g(x) \, d\theta_1 \, d\theta_2 \, d\dot{\theta}_1 \, d\dot{\theta}_2 \, d\ddot{\theta}_1 \, d\ddot{\theta}_2. \qquad (7.6)$$
We shall perform a learning simulation of the sensorimotor map $f^{(1)}$. From equations 7.4 and 7.6, an orthonormal basis $\{\varphi_j^{(1)}\}_{j=1}^{9}$ in $H_1$ is given as follows:
$$\varphi_1^{(1)} = \frac{\sqrt{3}}{b_1} \ddot{\theta}_1, \quad
\varphi_2^{(1)} = \frac{\sqrt{3}}{b_2} \ddot{\theta}_2, \quad
\varphi_3^{(1)} = \frac{\sqrt{6}}{b_1} \ddot{\theta}_1 \cos \theta_2,$$
$$\varphi_4^{(1)} = \frac{\sqrt{6}}{b_2} \ddot{\theta}_2 \cos \theta_2, \quad
\varphi_5^{(1)} = \frac{\sqrt{10}}{a_2^2} \dot{\theta}_2^2 \sin \theta_2, \quad
\varphi_6^{(1)} = \frac{\sqrt{18}}{a_1 a_2} \dot{\theta}_1 \dot{\theta}_2 \sin \theta_2,$$
$$\varphi_7^{(1)} = \sqrt{2} \sin \theta_1, \quad
\varphi_8^{(1)} = 2 \sin \theta_1 \cos \theta_2, \quad
\varphi_9^{(1)} = 2 \sin \theta_2 \cos \theta_1.$$
Hence, the reproducing kernel of $H_1$ is given as
$$K_1(x, x') = \sum_{j=1}^{9} \varphi_j^{(1)}(x) \varphi_j^{(1)}(x'). \qquad (7.7)$$
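The orthonormality of the nine basis functions above, and hence the validity of equation 7.7 as a reproducing kernel, can be checked by Monte Carlo integration of the inner product 7.6; the bounds $a_1 = a_2 = b_1 = b_2 = 1$ and the sample size are our own illustrative assumptions:

```python
import numpy as np

a1 = a2 = b1 = b2 = 1.0       # assumed bounds on velocities and accelerations

def basis(t1, t2, dt1, dt2, ddt1, ddt2):
    """The nine orthonormal basis functions of H_1 (cf. equations 7.4, 7.6)."""
    return np.array([
        np.sqrt(3) / b1 * ddt1,
        np.sqrt(3) / b2 * ddt2,
        np.sqrt(6) / b1 * ddt1 * np.cos(t2),
        np.sqrt(6) / b2 * ddt2 * np.cos(t2),
        np.sqrt(10) / a2 ** 2 * dt2 ** 2 * np.sin(t2),
        np.sqrt(18) / (a1 * a2) * dt1 * dt2 * np.sin(t2),
        np.sqrt(2) * np.sin(t1),
        2 * np.sin(t1) * np.cos(t2),
        2 * np.sin(t2) * np.cos(t1),
    ])

def K1(x, xp):
    """Reproducing kernel of H_1, equation 7.7."""
    return float(basis(*x) @ basis(*xp))

# Monte Carlo estimate of <phi_i, phi_j>: the inner product 7.6 is a
# uniform average over the domain.
rng = np.random.default_rng(0)
n = 200_000
B = basis(rng.uniform(-np.pi, np.pi, n), rng.uniform(-np.pi, np.pi, n),
          rng.uniform(-a1, a1, n), rng.uniform(-a2, a2, n),
          rng.uniform(-b1, b1, n), rng.uniform(-b2, b2, n))
gram = B @ B.T / n
```

The Gram matrix converges to the identity, so $K_1(x, x') = \sum_j \varphi_j^{(1)}(x) \varphi_j^{(1)}(x')$ behaves as a reproducing kernel for $H_1$.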
Suppose that training examples are degraded by additive noise. Let us consider the following two sampling schemes:

1. Multipoint search. Training examples are sampled following the multipoint search method shown in Figure 5. Let the number $c$ of candidates be three, and let them be randomly generated in the domain.

2. Passive learning. Training examples are randomly supplied from the domain.

Figure 11: Results of learning of the sensorimotor map $f^{(1)}$. The horizontal axis denotes the number of training examples, and the vertical axis denotes the noise variance measured by equation 3.8. The solid and dotted lines show the means of 100 trials by the multipoint search method and passive learning, respectively. The vertical axis can be regarded as the generalization error when the number of training examples is larger than or equal to 10.

Figure 11 shows the simulation results. The horizontal axis denotes the number of training examples, and the vertical axis denotes the variance measured by equation 3.8. The solid and dashed lines show the mean variances of 100 trials by sampling schemes 1 and 2, respectively. In sampling scheme 1, it holds that $\mathcal{N}(A_9) = \{0\}$ because the dimension of $H$ is equal to 9. In sampling scheme 2, $\mathcal{N}(A_{10}) = \{0\}$ was attained in all 100 trials in this simulation. Hence, it follows from equation 3.14 that the vertical axis in Figure 11 can be regarded as the generalization error when $m \ge 10$. This graph shows that the multipoint search method with three candidates provides better generalization capability than passive learning. However, the performance of the multipoint search method in this simulation is not as good as that in the previous one. The reason may be that the number $c$ of candidates is small in contrast to the size of the domain.

8 Conclusion
In this article, we gave a basic sampling strategy, called the two-stage sampling scheme, for reducing both the bias and the variance, and based on this scheme, we proposed two active learning methods: the multipoint search method, applicable to arbitrary models, and the optimal sampling method in the trigonometric polynomial space. The effectiveness of the proposed methods was demonstrated through computer simulations. Like the usual active learning methods devised so far, our methods assume that a model to which the learning target belongs is available. If the model is unknown, it should be estimated by model selection methods. Evaluating the robustness of the presented methods when they are combined with model selection is important future work.

Appendix A: Proof of Theorem 1
Let $\operatorname{tr}(\cdot)$ stand for the trace of an operator. Then, for $f, g \in H$, it holds that
$$\operatorname{tr}\left( f \otimes \bar{g} \right) = \langle f, g \rangle. \qquad (\text{A.1})$$
It follows from equations 3.11, 3.12, 3.10, 3.9, and 3.13 that
$$E_n \|X_m n^{(m)}\|^2 = E_n \operatorname{tr}\left( X_m \, n^{(m)} (n^{(m)})^* X_m^* \right) = \operatorname{tr}\left( X_m Q_m X_m^* \right) \qquad (\text{A.2})$$
$$= \operatorname{tr}\left( X_m U_m X_m^* \right) - \operatorname{tr}\left( X_m A_m A_m^* X_m^* \right)$$
$$= \operatorname{tr}\left( V_m^{\dagger} A_m^* U_m^{\dagger} U_m U_m^{\dagger} A_m V_m^{\dagger} \right) - \operatorname{tr}\left( P_{\mathcal{R}(A_m^*)} P_{\mathcal{R}(A_m^*)} \right)$$
$$= \operatorname{tr}\left( V_m^{\dagger} \right) - \operatorname{tr}\left( P_{\mathcal{R}(A_m^*)} \right). \qquad (\text{A.3})$$
Equations 4.2 and A.3 yield
$$J_v = \operatorname{tr}\left( V_{m+1}^{\dagger} \right) - \operatorname{tr}\left( P_{\mathcal{R}(A_{m+1}^*)} \right) + \operatorname{tr}\left( P_{\mathcal{R}(A_m^*)} \right) - \operatorname{tr}\left( V_m^{\dagger} \right). \qquad (\text{A.4})$$
First, we shall prove the case where $\psi_{m+1} \notin \mathcal{R}(A_m^*)$. It follows from equation 2.6 that
$$\operatorname{tr}\left( P_{\mathcal{R}(A_{m+1}^*)} \right) = \operatorname{tr}\left( P_{\mathcal{R}(A_m^*)} \right) + 1. \qquad (\text{A.5})$$
From Sugiyama and Ogawa (1999b), $V_{m+1}^{\dagger}$ is expressed by using $V_m^{\dagger}$ as
$$V_{m+1}^{\dagger} = V_m^{\dagger} + \frac{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})^2} \left( \tilde{\psi}_{m+1} \otimes \tilde{\psi}_{m+1} \right) - \frac{\tilde{\xi}_{m+1} \otimes \tilde{\psi}_{m+1} + \tilde{\psi}_{m+1} \otimes \tilde{\xi}_{m+1}}{\tilde{\psi}_{m+1}(x_{m+1})}. \qquad (\text{A.6})$$
Since it holds from Ogawa (1987) that
$$\mathcal{R}(V_m^{\dagger}) = \mathcal{R}(A_m^*), \qquad (\text{A.7})$$
equations 3.32 and 3.34 yield
$$\langle \tilde{\psi}_{m+1}, \tilde{\xi}_{m+1} \rangle = \langle P_{\mathcal{N}(A_m)} \psi_{m+1}, V_m^{\dagger} \xi_{m+1} \rangle = 0. \qquad (\text{A.8})$$
It follows from equations A.6, A.8, 3.32, and 2.4 that
$$\operatorname{tr}\left( V_{m+1}^{\dagger} \right) = \operatorname{tr}\left( V_m^{\dagger} \right) + \frac{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})^2} \|\tilde{\psi}_{m+1}\|^2 = \operatorname{tr}\left( V_m^{\dagger} \right) + \frac{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})}. \qquad (\text{A.9})$$
Substituting equations A.5 and A.9 into equation A.4, we have equation 5.1.

Next, we shall prove the case where $\psi_{m+1} \in \mathcal{R}(A_m^*)$. It follows from equation 2.6 that
$$\mathcal{R}(A_{m+1}^*) = \mathcal{R}(A_m^*). \qquad (\text{A.10})$$
From Sugiyama and Ogawa (1999b), it holds that
$$V_{m+1}^{\dagger} = V_m^{\dagger} - \frac{\tilde{\xi}_{m+1} \otimes \tilde{\xi}_{m+1}}{\alpha_{m+1} + \langle \tilde{\xi}_{m+1}, \xi_{m+1} \rangle}. \qquad (\text{A.11})$$
Substituting equations A.10 and A.11 into equation A.4, we have equation 5.2.

Appendix B: Proof of Theorem 2
When $Q_{m+1}$ is given by equation 3.37 with $\sigma_j > 0$ for all $j$, it holds from Ogawa (1987) that the projection learning operator is expressed as
$$X_m = V_m'^{\dagger} A_m^* Q_m^{-1}. \qquad (\text{B.1})$$
Equations A.2, B.1, and 3.39 yield
$$E_n \|X_m n^{(m)}\|^2 = \operatorname{tr}\left( X_m Q_m X_m^* \right) = \operatorname{tr}\left( V_m'^{\dagger} A_m^* Q_m^{-1} Q_m Q_m^{-1} A_m V_m'^{\dagger} \right) = \operatorname{tr}\left( V_m'^{\dagger} \right). \qquad (\text{B.2})$$
It follows from equations 4.2 and B.2 that
$$J_v = \operatorname{tr}\left( V_{m+1}'^{\dagger} \right) - \operatorname{tr}\left( V_m'^{\dagger} \right). \qquad (\text{B.3})$$
First, we shall prove the case where $\psi_{m+1} \notin \mathcal{R}(A_m^*)$. It follows from Sugiyama and Ogawa (1999c) that
$$V_{m+1}'^{\dagger} = V_m'^{\dagger} + \frac{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})^2} \left( \tilde{\psi}_{m+1} \otimes \tilde{\psi}_{m+1} \right) - \frac{V_m'^{\dagger} \psi_{m+1} \otimes \tilde{\psi}_{m+1} + \tilde{\psi}_{m+1} \otimes V_m'^{\dagger} \psi_{m+1}}{\tilde{\psi}_{m+1}(x_{m+1})}. \qquad (\text{B.4})$$
Since $\mathcal{R}(V_m'^{\dagger}) = \mathcal{R}(A_m^*)$, it holds from equation 3.32 that
$$\langle V_m'^{\dagger} \psi_{m+1}, \tilde{\psi}_{m+1} \rangle = \langle V_m'^{\dagger} \psi_{m+1}, P_{\mathcal{N}(A_m)} \psi_{m+1} \rangle = 0. \qquad (\text{B.5})$$
It follows from equations B.4, B.5, 3.32, and 2.4 that
$$\operatorname{tr}\left( V_{m+1}'^{\dagger} \right) = \operatorname{tr}\left( V_m'^{\dagger} \right) + \frac{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})^2} \|\tilde{\psi}_{m+1}\|^2 = \operatorname{tr}\left( V_m'^{\dagger} \right) + \frac{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}{\tilde{\psi}_{m+1}(x_{m+1})}. \qquad (\text{B.6})$$
Substituting equation B.6 into equation B.3, we have equation 5.3.

Next, we shall prove the case where $\psi_{m+1} \in \mathcal{R}(A_m^*)$. It follows from Sugiyama and Ogawa (1999c) that
$$V_{m+1}'^{\dagger} = V_m'^{\dagger} - \frac{V_m'^{\dagger} \psi_{m+1} \otimes V_m'^{\dagger} \psi_{m+1}}{\sigma_{m+1} + \langle V_m'^{\dagger} \psi_{m+1}, \psi_{m+1} \rangle}. \qquad (\text{B.7})$$
Equations B.3 and B.7 yield equation 5.4.

Appendix C: Proof of Theorem 3
If $m = 0$, then it follows from equation 5.6 that
$$J_v = \frac{\sigma^2}{\|\tilde{\psi}_{m+1}\|^2} = \frac{\sigma^2}{\|\psi_{m+1}\|^2} = \frac{\sigma^2}{\mu}. \qquad (\text{C.1})$$
Hence, $J_v$ is minimized by any $x_1$. Now we focus on the case where $m \ge 1$.

Suppose that the sample points $\{x_j\}_{j=1}^{m}$ are successively determined by equation 6.7. For a fixed integer $s$ such that $0 \le s \le \lfloor \frac{m-1}{\mu} \rfloor$, let us consider the sample points $x_j$ and $x_{j'}$ such that
$$s\mu + 1 \le j, j' \le
\begin{cases}
(s+1)\mu & \text{if } (s+1)\mu \le m, \\
m & \text{if } s\mu + 1 \le m < (s+1)\mu.
\end{cases} \qquad (\text{C.2})$$
Let $t = (j \bmod \mu)$ and $t' = (j' \bmod \mu)$. Then it follows from equations 6.4 and 6.5 that
$$\langle \psi_{j'}, \psi_j \rangle = \psi_{j'}(x_j) = K(x_j, x_{j'}) = \mu \delta_{tt'}, \qquad (\text{C.3})$$
where $\delta_{tt'}$ denotes Kronecker's delta.

First, we shall prove the case where $1 \le m \le \mu - 1$. It follows from equations 2.6 and C.3 that
$$A_m^* A_m = \sum_{j=1}^{m} \left( \psi_j \otimes \psi_j \right) = \mu \sum_{j=1}^{m} \left( \frac{\psi_j}{\sqrt{\mu}} \otimes \frac{\psi_j}{\sqrt{\mu}} \right) = \mu P_{\mathcal{R}(A_m^*)}, \qquad (\text{C.4})$$
which yields
$$(A_m^* A_m)^{\dagger} \psi_{m+1} = \frac{1}{\mu} P_{\mathcal{R}(A_m^*)} \psi_{m+1} = \frac{1}{\mu} \left( \psi_{m+1} - \tilde{\psi}_{m+1} \right). \qquad (\text{C.5})$$
By using the fact that $\|\tilde{\psi}_{m+1}\|^2 \le \|\psi_{m+1}\|^2 = \mu$, it follows from equations 5.6, C.3, and C.5 that
$$J_v = \sigma^2 \, \frac{1 + \frac{1}{\mu} \langle \psi_{m+1} - \tilde{\psi}_{m+1}, \psi_{m+1} \rangle}{\|\tilde{\psi}_{m+1}\|^2} = \sigma^2 \, \frac{1 + \frac{1}{\mu} \left( \|\psi_{m+1}\|^2 - \|\tilde{\psi}_{m+1}\|^2 \right)}{\|\tilde{\psi}_{m+1}\|^2} \ge \sigma^2 \, \frac{1 + \frac{1}{\mu} \left( \|\psi_{m+1}\|^2 - \|\psi_{m+1}\|^2 \right)}{\|\psi_{m+1}\|^2} = \frac{\sigma^2}{\mu}, \qquad (\text{C.6})$$
where equality holds if and only if
$$\psi_{m+1} = \tilde{\psi}_{m+1} \qquad (\text{C.7})$$
because of equation 3.32. Since equation C.3 implies equation C.7, $J_v$ is minimized.

Next, we shall prove the case where $\mu \le m \le M - 1$. Let $p = \lfloor \frac{m}{\mu} \rfloor$. It follows from equations 2.6 and C.3 that
$$A_m^* A_m = \mu \left[ \sum_{j=1}^{p\mu} \left( \frac{\psi_j}{\sqrt{\mu}} \otimes \frac{\psi_j}{\sqrt{\mu}} \right) + \sum_{j=p\mu+1}^{m} \left( \frac{\psi_j}{\sqrt{\mu}} \otimes \frac{\psi_j}{\sqrt{\mu}} \right) \right] = p\mu \left( I_H + \frac{1}{p} P \right), \qquad (\text{C.8})$$
where $I_H$ and $P$ denote the identity operator on $H$ and the orthogonal projection operator onto $L(\{\psi_j\}_{j=p\mu+1}^{m})$, respectively. Then it holds that
$$(A_m^* A_m)^{-1} = \frac{1}{p\mu} \left( I_H - \frac{1}{p+1} P \right). \qquad (\text{C.9})$$
It follows from equations 5.7, C.9, and C.3 that
$$J_v = -\sigma^2 \, \frac{\left\| \frac{1}{p\mu} \psi_{m+1} - \frac{1}{p(p+1)\mu} P \psi_{m+1} \right\|^2}{1 + \left\langle \frac{1}{p\mu} \psi_{m+1} - \frac{1}{p(p+1)\mu} P \psi_{m+1}, \, \psi_{m+1} \right\rangle} = -\frac{\sigma^2 \left( 1 - \frac{(2p+1) \|P \psi_{m+1}\|^2}{(p+1)^2 \mu} \right)}{p(p+1)\mu - \frac{p}{p+1} \|P \psi_{m+1}\|^2}. \qquad (\text{C.10})$$
Let a function $g(t)$ be
$$g(t) = -\frac{\sigma^2 \left( 1 - \frac{(2p+1) t}{(p+1)^2 \mu} \right)}{p(p+1)\mu - \frac{p}{p+1} t}, \qquad (\text{C.11})$$
where $\|P \psi_{m+1}\|^2$ in equation C.10 is replaced with $t$. It follows from equation C.3 that
$$0 \le \|P \psi_{m+1}\|^2 \le \|\psi_{m+1}\|^2 = \mu. \qquad (\text{C.12})$$
Hence, we focus on $[0, \mu]$ as the domain of $g(t)$. Since the derivative of $g$ with respect to $t$ is
$$\frac{dg}{dt} = \frac{\frac{2 \sigma^2 p^2}{p+1}}{\left( p(p+1)\mu - \frac{p}{p+1} t \right)^2} > 0 \qquad (\text{C.13})$$
and $g(t)$ is continuous, it is minimized if and only if $t = 0$. This implies that $J_v$ is minimized if and only if $\|P \psi_{m+1}\|^2 = 0$, which is attained by equation C.3.

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, AC-19(6), 716–723.

Albert, A. (1972). Regression and the Moore-Penrose pseudoinverse. New York: Academic Press.

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 337–404.
Ben-Israel, A., & Greville, T. N. E. (1974). Generalized inverses: Theory and applications. New York: Wiley.

Bergman, S. (1970). The kernel function and conformal mapping. Providence, RI: American Mathematical Society.

Cohn, D. A. (1994). Neural network exploration using optimal experiment design. In J. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in neural information processing systems, 6 (pp. 679–686). San Mateo, CA: Morgan Kaufmann.

Cohn, D. A. (1996). Neural network exploration using optimal experiment design. Neural Networks, 9(6), 1071–1083.

Cohn, D. A. (1997). Minimizing statistical bias with queries. In M. C. Mozer, M. Jordan, & T. Petsche (Eds.), Advances in neural information processing systems, 9 (pp. 417–423). Cambridge, MA: MIT Press.

Cohn, D. A., Ghahramani, Z., & Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research, 4, 129–145.

Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman and Hall.

Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press.

Fukumizu, K. (1996). Active learning in multilayer perceptrons. In D. Touretzky, M. Mozer, & M. Hasselmo (Eds.), Advances in neural information processing systems, 8 (pp. 295–301). Cambridge, MA: MIT Press.

Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.

Kaelbling, L. P. (Ed.). (1996). Special issue on reinforcement learning. Machine Learning, 22(1/2/3).

Kiefer, J. (1959). Optimum experimental designs. Journal of the Royal Statistical Society, Series B, 21, 272–304.

Kiefer, J., & Wolfowitz, J. (1960). The equivalence of two extremum problems. Annals of Mathematical Statistics, 32, 298.

MacKay, D. J. C. (1992). Information-based objective functions for active data selection. Neural Computation, 4(4), 590–604.

Ogawa, H. (1987). Projection filter regularization of ill-conditioned problem. Proceedings of SPIE, Inverse Problems in Optics, 808, 189–196.

Ogawa, H. (1989). Inverse problem and neural networks. In Proceedings of the IEICE 2nd Karuizawa Workshop on Circuits and Systems (pp. 262–268). Karuizawa, Japan (in Japanese).

Ogawa, H. (1992). Neural network learning, generalization and over-learning. In Proceedings of the ICIIPS'92, International Conference on Intelligent Information Processing and Systems, 2 (pp. 1–6). Beijing, China.

Saitoh, S. (1988). Theory of reproducing kernels and its applications. London: Longman Scientific & Technical.

Saitoh, S. (1997). Integral transform, reproducing kernels and their applications. London: Longman.

Schatten, R. (1970). Norm ideals of completely continuous operators. Berlin: Springer-Verlag.

Sollich, P. (1994). Query construction, entropy and generalization in neural network models. Physical Review E, 49, 4637–4651.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B, 36, 111–133.

Sugiyama, M., & Ogawa, H. (1999a). Exact incremental projection learning in the presence of noise. In Proceedings of the 11th Scandinavian Conference on Image Analysis (pp. 747–754). Kangerlussuaq, Greenland.

Sugiyama, M., & Ogawa, H. (1999b). Incremental projection learning for optimal generalization (Tech. Rep. No. TR99-0007). Tokyo: Department of Computer Science, Tokyo Institute of Technology, Japan. Available online at: ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99-0007.ps.gz.

Sugiyama, M., & Ogawa, H. (1999c). Properties of incremental projection learning (Tech. Rep. No. TR99-0008). Tokyo: Department of Computer Science, Tokyo Institute of Technology, Japan. Available online at: ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99-0008.ps.gz.

Sugiyama, M., & Ogawa, H. (1999d). Functional analytic approach to model selection—subspace information criterion. In Proceedings of the 1999 Workshop on Information-Based Induction Sciences (IBIS'99) (pp. 93–98). Izu, Japan. Available online at: ftp://ftp.cs.titech.ac.jp/pub/TR/99/TR99-0009.ps.gz.

Sugiyama, M., & Ogawa, H. (2000). Training data selection for optimal generalization in trigonometric polynomial networks. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems, 12 (pp. 624–630). Cambridge, MA: MIT Press.

Takemura, A. (1991). Modern mathematical statistics. Tokyo: Sobunsya (in Japanese).

Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. Washington, DC: V. H. Winston.

Vijayakumar, S. (1998). Computational theory of incremental and active learning for optimal generalization. Unpublished doctoral dissertation, Tokyo Institute of Technology, Japan.

Vijayakumar, S., & Ogawa, H. (1999). Improving generalization ability through active learning. IEICE Transactions on Information and Systems, E82-D(2), 480–487.

Vijayakumar, S., Sugiyama, M., & Ogawa, H. (1998). Training data selection for optimal generalization with noise variance reduction in neural networks. In M. Marinaro & R. Tagliaferri (Eds.), Neural nets WIRN Vietri-98 (pp. 153–166). Berlin: Springer-Verlag.

Wahba, G. (1990). Spline models for observational data. Philadelphia: Society for Industrial and Applied Mathematics.

Yamashita, Y., & Ogawa, H. (1992). Optimum image restoration and topological invariance. The Transactions of the IEICE D-II, J75-D-II(2), 306–313 (in Japanese).

Yue, R. X., & Hickernell, F. J. (1999). Robust designs for fitting linear models with misspecification. Statistica Sinica, 9, 1053–1069.

Received June 2, 1999; accepted March 17, 2000.
LETTER
Communicated by Peter Edwards
A Quantitative Study of Fault Tolerance, Noise Immunity, and Generalization Ability of MLPs

J. L. Bernier, J. Ortega, E. Ros, I. Rojas, A. Prieto
Departamento de Arquitectura y Tecnologia de Computadores, Universidad de Granada, Spain

An analysis of the influence of weight and input perturbations in a multilayer perceptron (MLP) is made in this article. Quantitative measurements of fault tolerance, noise immunity, and generalization ability are provided. From the expressions obtained, it is possible to justify some previously reported conjectures and experimentally obtained results (e.g., the influence of weight magnitudes, the relation between training with noise and the generalization ability, the relation between fault tolerance and the generalization ability). The measurements introduced here are explicitly related to the mean squared error degradation in the presence of perturbations, thus constituting a selection criterion between different alternatives of weight configurations. Moreover, they allow us to predict the degradation of the learning performance of an MLP when its weights or inputs deviate from their nominal values; thus, the behavior of a physical implementation can be evaluated before the weights are mapped onto it according to its accuracy.

1 Introduction
Artificial neural networks (ANNs) are not inherently fault tolerant (Segee & Carter, 1994; Pathak & Koren, 1995). In the case of multilayer perceptrons (MLPs), it can be observed that for a fixed structure (a particular number of layers and neurons per layer), different sets of weights may be obtained if the training process is applied using different initial conditions or parameters of the learning rule (Choi & Choi, 1992). These solutions may present a similar performance with respect to learning (similar mean squared error or classification error) but differ with respect to fault tolerance. In this way, some configurations of weights present a higher tolerance against weight perturbations than others, and similar conclusions can be obtained with respect to tolerance to input noise or the generalization ability of the MLP.

Neural Computation 12, 2941–2964 (2000) © 2000 Massachusetts Institute of Technology
Moreover, when the training process is carried out in a conventional computer and the weights obtained must be mapped onto a physical implementation, the differences between the precision used during training and the accuracy of the implementation can seriously affect the learning performance simulated in the computer. Moreover, in the case of analog implementations, the weight values may vary within a tolerance margin, which also affects the expected performance (Edwards & Murray, 1998c). It would be useful to possess measurements of fault tolerance and noise immunity to establish a selection criterion between different sets of weights that present a similar learning performance. Most articles related to fault tolerance evaluate it experimentally, via simulation, as the learning degradation with respect to the nominal performance when faults are present (Segee & Carter, 1994; Neti, Schneider, & Young, 1992), although various measurement parameters have been provided. The probability of hard errors is used in Madaline structures to study the fault tolerance of MLPs that use single-step (Stevenson, Winter, & Widrow, 1990) and multiple-step (Alippi, Piuri, & Sami, 1994) activation functions. Segee and Carter (1994) propose measuring the fault tolerance from simulations of stuck-at faults using the worst-case hypothesis. Stuck-at faults are considered unrealistic in Edwards and Murray (1998a), and so a measurement called saliency is used to evaluate the fault tolerance against weight deviations. Saliency is computed from the Hessian matrix of the output error with respect to the weight values (Bishop, 1995b; Edwards & Murray, 1998b), with a low value indicating a smoother error surface; the lower the saliency, the smaller the changes of the MLP outputs in the presence of weight deviations.
Fault tolerance and the influence of input noise have been studied in several articles, in which the following conclusions have been reached:

- Input noise during training improves generalization ability (Bishop, 1995a).
- Synaptic noise during training improves fault tolerance (Murray & Edwards, 1994).
- Fault tolerance and generalization ability are related. When fault tolerance is improved, the generalization ability is usually better (Edwards & Murray, 1998a; Hochreiter & Schmidhuber, 1997), and vice versa (Minnix, 1992).
- The lower the weight magnitude, the higher the fault tolerance (Chiu, Mehrotra, Mohan, & Ranka, 1994) and the generalization ability (Krogh & Hertz, 1992).

Also, it has been observed that arbitrary augmentation of the network size improves its potential for fault tolerance, but this does not necessarily imply an effective increment (Edwards & Murray, 1998a). Fault tolerance is related to a uniform distribution of the learning among the different neurons, but the classical backpropagation algorithm does not guarantee this good distribution (Edwards & Murray, 1998a). For this reason, different learning algorithms that provide better performance with respect to fault tolerance without the necessity of increasing the network size have been proposed (Edwards & Murray, 1995; Chiu et al., 1994). In Murray and Edwards (1994), the weights are perturbed during training; the effect of this perturbation is to smooth the error surface with respect to the weight values, and it can be considered as a regularization that is inherent to the algorithm (Mao & Jain, 1993). Choi and Choi (1992) proposed statistical sensitivity as a measurement of fault tolerance (when weight deviations are considered) or noise immunity (when input deviations are considered) for MLPs with only one output. Statistical sensitivity should not be confused with output sensitivity. Output sensitivity (or simply sensitivity) refers to the first derivatives of the MLP output with respect to the weight values (Edwards & Murray, 1995); statistical sensitivity evaluates the output range variation of a neuron when its weights or inputs are perturbed. In fact, as we will show in this article, this is equivalent to considering second derivatives of the mean squared error with respect to weight or input values.

The work reported here presents a comprehensive study and unification of the results of previous articles (Bernier, Ortega, Rodriguez, Rojas, & Prieto, 1999a; Bernier, Ortega, Ros, Rojas, & Prieto, 1999b; Bernier, Ortega, Rojas, & Prieto, 2000), including new results and conclusions. In the next section, the concept of statistical sensitivity is introduced, and expressions are obtained that consider two types of perturbations, weight and input deviations, for each of which an additive and a multiplicative model is considered.
In section 3, based on statistical sensitivity, a new measurement that we have termed mean squared sensitivity (MSS) is derived in order to evaluate the fault tolerance and the noise immunity of an MLP. The MSS is directly related to the mean squared error (MSE) degradation against perturbations, and so it enables, for example, the prediction of this degradation for a particular value of the tolerance margin of the components in an analog implementation of an MLP, as is shown in section 4. From the expressions obtained, it is possible to justify mathematically some of the conjectures made by other authors. Moreover, the MSSs to input noise and to weight deviations are related to the generalization ability (section 5) and to saliency (section 6), respectively. A new learning rule that incorporates the squared sensitivity as another goal to minimize jointly with the squared error is used in section 7, where the results provided for a benchmarking problem demonstrate that more robust weight configurations against input and weight perturbations are obtained in comparison with classical backpropagation. Finally, the conclusions are summarized in section 8.
2944
J. L. Bernier, J. Ortega, E. Ros, I. Rojas, and A. Prieto
2 Statistical Sensitivity
As is well known, an MLP is composed of M layers, where each layer m (m = 1, ..., M) has N_m neurons. The neuron i of layer m is connected to the N_{m-1} neurons of the previous layer by a set of weights w_{ij}^m (j = 1, ..., N_{m-1}). The output of neuron i in layer m, y_i^m, is determined by the activation z_i^m, defined as the weighted sum of the outputs coming from the neurons in the previous layer, and a function f_i^m called the activation function, which, like the sigmoidal or hyperbolic tangent functions, is continuous, differentiable, and nonlinear in z_i^m:

y_i^m = f_i^m(z_i^m) = f_i^m\left(\sum_{j=1}^{N_{m-1}} w_{ij}^m y_j^{m-1}\right) \equiv f_i^m,    (2.1)
where y_j^{m-1} are the outputs of the N_{m-1} neurons of the previous layer. Specifically, y_j^0 (j = 1, ..., N_0) are the inputs to the network.

The neuron output will change if there are any deviations in the values of the weights or inputs. If these deviations are assumed to be sufficiently small, the following approximation can be used:

\Delta y_i^m \approx \sum_{j=1}^{N_{m-1}} \left( \frac{\partial y_i^m}{\partial w_{ij}^m}\,\Delta w_{ij}^m + \frac{\partial y_i^m}{\partial y_j^{m-1}}\,\Delta y_j^{m-1} \right) = \partial f_i^m \sum_{j=1}^{N_{m-1}} \left( y_j^{m-1}\,\Delta w_{ij}^m + w_{ij}^m\,\Delta y_j^{m-1} \right),    (2.2)

where \partial f_i^m \equiv df_i^m / dz_i^m.

The statistical sensitivity of neuron i in layer m is defined in Choi and Choi (1992) by the following expression:

S_i^m = \lim_{\sigma \to 0} \frac{\sqrt{\mathrm{var}(\Delta y_i^m)}}{\sigma},    (2.3)
where σ represents the standard deviation of the perturbation and var(·) is the variance. Statistical sensitivity measures the expected variation of the output against perturbations. These perturbations can come from either inputs or weights, and for a given range of perturbations (specified by the corresponding standard deviation), the greater the statistical sensitivity, the greater the expected output variation. If weight perturbations are assumed, statistical sensitivity evaluates the fault tolerance of the neuron. On the other hand, when input perturbations are considered, the statistical sensitivity constitutes a measurement of the neuron's noise immunity.
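To make the definition concrete, the following sketch (a hypothetical NumPy example, not part of the original article; it assumes a single bias-free tanh neuron) estimates equation 2.3 by direct Monte Carlo simulation and compares the result with the first-order value implied by equation 2.2:

```python
import numpy as np

def statistical_sensitivity_mc(w, x, sigma=1e-3, n_trials=50000, seed=0):
    """Monte Carlo estimate of eq. 2.3, S = sqrt(var(dy)) / sigma,
    for additive weight deviations of standard deviation sigma."""
    rng = np.random.default_rng(seed)
    y0 = np.tanh(w @ x)                       # nominal neuron output
    dw = rng.normal(0.0, sigma, size=(n_trials, w.size))
    dy = np.tanh((w + dw) @ x) - y0           # output deviations
    return np.sqrt(np.var(dy)) / sigma

w = np.array([0.5, -0.3])
x = np.array([0.2, 0.8])
s_mc = statistical_sensitivity_mc(w, x)

# First-order prediction from eq. 2.2: dy ~ f'(z) * sum_j x_j dw_j,
# so S = f'(z) * ||x|| for a single neuron with noise-free inputs.
s_lin = (1.0 - np.tanh(w @ x) ** 2) * np.linalg.norm(x)
```

For small σ the two values agree closely, which is exactly the limit taken in equation 2.3.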
Statistical sensitivity is not the same as output sensitivity. The output sensitivity is computed from the Jacobian matrix of the error with respect to the weight values and has been used in previous works as a fault tolerance measurement (Chiu et al., 1994) or as a regularizer (Edwards & Murray, 1995). As the output sensitivity measures the output dependence with respect to the weight values (it is related to the first derivatives), a value of output sensitivity equal to zero implies that the output is independent of the weight values; that is, the MLP has not learned the input pattern. In fact, results in Edwards and Murray (1995) show that learning performance is degraded when output sensitivity is used as a regularizer, and the fault tolerance improvement is negligible compared to classical backpropagation. Instead, statistical sensitivity, as shown in section 6, can be related to the Hessian matrix of the MSE with respect to the weights, thus constituting a measurement of the smoothness of the error surface; a low value of statistical sensitivity then implies a low output variation when weight perturbations occur.

The statistical sensitivity to weight and input deviations is computed below. For this purpose, it is necessary to determine var(\Delta y_i^m), which is defined as E[(\Delta y_i^m)^2] - (E[\Delta y_i^m])^2, with E[·] being the expected value. We then require some hypotheses about the kind of deviations that can affect the weights or the inputs. Deviation faults are considered more realistic for most current implementations than stuck-at faults (Edwards & Murray, 1998a). In this work, two types of deviations frequently used in previous works are considered: additive and multiplicative (Choi & Choi, 1992; Edwards & Murray, 1998b; Hochreiter & Schmidhuber, 1997).
While the additive model is often used to describe quantization effects in digital implementations or the offset effects of transistors, the multiplicative model is often used to describe the tolerance margin of analog components.

2.1 Statistical Sensitivity to Weight Deviations. First, weight perturbations are considered for the two models of deviations previously cited: the additive model, which assumes that the weights are modified from their nominal values by an additive quantity with mean equal to zero and standard deviation equal to σ, and the multiplicative model, which assumes that the perturbations are proportional to the weight magnitudes. In both cases, a diagonal covariance is assumed; that is, perturbations of different weights are considered to be uncorrelated.
2.1.1 Additive Weight Deviations. The model considered for additive perturbations satisfies the following conditions:

w.1. E[\Delta w_{ij}^m] = 0 \quad \forall i, j, m.

w.2. E[\Delta w_{ij}^m \Delta w_{i'j'}^{m'}] = \sigma^2 \delta_{ii'} \delta_{jj'} \delta_{mm'}, where \delta is the Kronecker delta.

Proposition 1. E[\Delta y_i^m] = 0 if E[\Delta w_{ij}^m] = 0 \quad \forall i, j, m.
Proof. It will be proved by induction over m. Assuming that the inputs are free of errors, that is, \Delta y_j^0 = 0, and condition w.1, the expected value of equation 2.2 for m = 1 is

E[\Delta y_i^1] = E\left[\partial f_i^1 \sum_{j=1}^{N_0} y_j^0\,\Delta w_{ij}^1\right] = \partial f_i^1 \sum_{j=1}^{N_0} y_j^0\,E[\Delta w_{ij}^1] = 0.    (2.4)

Assuming that E[\Delta y_i^{m-1}] = 0, then for layer m:

E[\Delta y_i^m] = E\left[\partial f_i^m \sum_{j=1}^{N_{m-1}} \left(y_j^{m-1}\,\Delta w_{ij}^m + w_{ij}^m\,\Delta y_j^{m-1}\right)\right] = \partial f_i^m \sum_{j=1}^{N_{m-1}} \left(y_j^{m-1}\,E[\Delta w_{ij}^m] + w_{ij}^m\,E[\Delta y_j^{m-1}]\right) = 0.    (2.5)
From proposition 1, var(\Delta y_i^m) can be obtained as E[(\Delta y_i^m)^2].

Proposition 2. The statistical sensitivity to additive weight deviations of a neuron i in layer m can be expressed as

S_i^m = \partial f_i^m \sqrt{ \sum_{j=1}^{N_{m-1}} \left( (y_j^{m-1})^2 + \sum_{k=1}^{N_{m-1}} w_{ij}^m w_{ik}^m C_{jk}^{m-1} \right) },    (2.6)

where the terms C_{jk}^m can be recursively computed as

C_{jk}^m = \partial f_j^m\,\partial f_k^m \sum_{r=1}^{N_{m-1}} \left( (y_r^{m-1})^2\,\delta_{jk} + w_{jr}^m \sum_{s=1}^{N_{m-1}} w_{ks}^m\,C_{rs}^{m-1} \right).    (2.7)
Proof. If the term E[\Delta y_j^m\,\Delta y_k^m] is expanded:

E[\Delta y_j^m\,\Delta y_k^m] = E\left[ \partial f_j^m\,\partial f_k^m \sum_{r=1}^{N_{m-1}} \left(y_r^{m-1}\,\Delta w_{jr}^m + w_{jr}^m\,\Delta y_r^{m-1}\right) \times \sum_{s=1}^{N_{m-1}} \left(y_s^{m-1}\,\Delta w_{ks}^m + w_{ks}^m\,\Delta y_s^{m-1}\right) \right]

= \partial f_j^m\,\partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} \left( y_r^{m-1} y_s^{m-1}\,E[\Delta w_{jr}^m\,\Delta w_{ks}^m] + w_{jr}^m w_{ks}^m\,E[\Delta y_r^{m-1}\,\Delta y_s^{m-1}] + y_r^{m-1} w_{ks}^m\,E[\Delta w_{jr}^m\,\Delta y_s^{m-1}] + y_s^{m-1} w_{jr}^m\,E[\Delta w_{ks}^m\,\Delta y_r^{m-1}] \right)

= \partial f_j^m\,\partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} \left( y_r^{m-1} y_s^{m-1}\,E[\Delta w_{jr}^m\,\Delta w_{ks}^m] + w_{jr}^m w_{ks}^m\,E[\Delta y_r^{m-1}\,\Delta y_s^{m-1}] + 0 + 0 \right).    (2.8)

If C_{jk}^m is defined as C_{jk}^m \equiv E[\Delta y_j^m\,\Delta y_k^m]/\sigma^2 and condition w.2 for additive deviations is applied in equation 2.8, then it is straightforward to obtain equation 2.7. The initial condition for equation 2.7 is that C_{jk}^0 = 0 \ \forall j, k, as the inputs y_j^0 are assumed to be free of errors.

At this point, taking into account that E[(\Delta y_i^m)^2] = \sigma^2 C_{ii}^m and substituting in equation 2.3, proposition 2 is proved.
Corollary 1. In the particular case of MLPs with only one hidden layer, expression 2.6 can be computed in the simpler form

S_i^m = \partial f_i^m \sqrt{ \sum_{j=1}^{N_{m-1}} \left( (y_j^{m-1})^2 + (w_{ij}^m S_j^{m-1})^2 \right) } \quad \forall i, \ \forall m = 1, 2,    (2.9)

taking into account that the statistical sensitivity of the input neurons is zero, that is, S_i^0 = 0 \ \forall i.

Proof. Considering the values of C_{jk}^m for m = 0, 1 obtained from equation 2.7 and substituting them in 2.6, corollary 1 is proved.
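Corollary 1 gives a simple layer-by-layer recursion. The sketch below (hypothetical NumPy code, not from the article; it assumes bias-free tanh neurons, and the function name is ours) propagates the sensitivities of equation 2.9 from the inputs to the output of a one-hidden-layer network:

```python
import numpy as np

def weight_noise_sensitivities(weights, x):
    """Statistical sensitivities to additive weight deviations for a
    one-hidden-layer tanh MLP, using corollary 1 (eq. 2.9).
    `weights` = [W1, W2], with W_m of shape (N_m, N_{m-1})."""
    y, S = x, np.zeros(x.size)            # S^0_j = 0: noise-free inputs
    out = []
    for W in weights:
        z = W @ y
        df = 1.0 - np.tanh(z) ** 2        # f'(z) for tanh
        # eq. 2.9: S^m_i = f'_i * sqrt( sum_j (y_j)^2 + (w_ij S_j)^2 )
        S = df * np.sqrt(np.sum(y ** 2) + (W ** 2) @ (S ** 2))
        out.append(S)
        y = np.tanh(z)
    return out

rng = np.random.default_rng(1)
W1 = rng.normal(0.0, 0.5, size=(5, 3))
W2 = rng.normal(0.0, 0.5, size=(1, 5))
x = np.array([0.4, -0.2, 0.8])
S_hidden, S_out = weight_noise_sensitivities([W1, W2], x)
```

Because S_j^0 = 0, the hidden-layer sensitivities reduce to \partial f_i^1 \lVert y^0 \rVert, and the output sensitivities combine them with the second-layer weights exactly as equation 2.9 prescribes.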
2.1.2 Multiplicative Weight Deviations. Multiplicative perturbations are proportional to the weight magnitudes. The model considered satisfies condition w.1 (the mean of the perturbation is zero) but differs from the additive case with respect to w.2. For multiplicative perturbations, condition w.2 becomes:

w.2. E[\Delta w_{ij}^m \Delta w_{i'j'}^{m'}] = (\sigma w_{ij}^m)^2\,\delta_{ii'} \delta_{jj'} \delta_{mm'}.

Note that proposition 1 is also satisfied here, as condition w.1 is the same; E[\Delta y_i^m] = 0 \ \forall i, m when multiplicative weight deviations are considered.

Proposition 3. The statistical sensitivity to multiplicative weight deviations of a neuron i in layer m can be expressed as

S_i^m = \partial f_i^m \sqrt{ \sum_{j=1}^{N_{m-1}} \left( (w_{ij}^m y_j^{m-1})^2 + \sum_{k=1}^{N_{m-1}} w_{ij}^m w_{ik}^m C_{jk}^{m-1} \right) },    (2.10)
where the terms C_{jk}^m can be recursively computed as

C_{jk}^m = \partial f_j^m\,\partial f_k^m \sum_{r=1}^{N_{m-1}} \left( (w_{jr}^m y_r^{m-1})^2\,\delta_{jk} + w_{jr}^m \sum_{s=1}^{N_{m-1}} w_{ks}^m\,C_{rs}^{m-1} \right).    (2.11)
Proof. The proof of proposition 3 is similar to that of proposition 2; the only difference with respect to the additive model is condition w.2. A detailed proof for the multiplicative weight model appears in Bernier et al. (1999a).

Corollary 2. In the particular case of MLPs with only one hidden layer, expression 2.10 can be computed in the simpler form

S_i^m = \partial f_i^m \sqrt{ \sum_{j=1}^{N_{m-1}} (w_{ij}^m)^2 \left( (y_j^{m-1})^2 + (S_j^{m-1})^2 \right) } \quad \forall i, \ \forall m = 1, 2,    (2.12)

taking into account that the statistical sensitivity of the input neurons is zero, that is, S_i^0 = 0 \ \forall i.

Proof. The proof is similar to the proof of corollary 1.
As can be deduced from equations 2.6 and 2.10, statistical sensitivity depends on the weight magnitude. Thus, lower magnitudes imply lower statistical sensitivities, and so the neuron output is more stable against weight perturbations. From experimental evidence of this fact, Chiu et al. (1994) proposed a learning algorithm that coerces weights to have low magnitudes in order to maximize fault tolerance. However, to obtain a low value of statistical sensitivity, an appropriate combination of the terms C_{jk}^{m-1} is also necessary; that is, an adequate distribution of learning among the different weights is required in order to reduce statistical sensitivity, thus improving the tolerance against perturbations (Edwards & Murray, 1998a). Equations 2.9 and 2.12 also show that statistical sensitivity is propagated through the MLP from the inputs to the outputs.

2.2 Statistical Sensitivity to Input Noise. The expressions of statistical sensitivity are now obtained for input deviations. Two perturbation models are again considered: an additive and a multiplicative one. In the case of additive deviations, the MLP inputs y_i^0 are affected by an additive deviation from their nominal values. This deviation has a mean equal to zero and a standard deviation equal to σ, that is:
n.1. E[\Delta y_i^0] = 0 \quad \forall i.

n.2. E[\Delta y_i^0\,\Delta y_{i'}^0] = \sigma^2\,\delta_{ii'}, with \delta being the Kronecker delta.

On the other hand, for multiplicative input deviations it is assumed that the input perturbations are proportional to their nominal values. In this way, condition n.1 is satisfied by this model, but condition n.2 is changed to:

n.2. E[\Delta y_i^0\,\Delta y_{i'}^0] = (\sigma y_i^0)^2\,\delta_{ii'}.

The following propositions hold:

Proposition 4. E[\Delta y_i^m] = 0 \ \forall i, m \ \text{if} \ E[\Delta y_i^0] = 0 \ \forall i.

Proposition 5. The statistical sensitivity to input noise (additive or multiplicative) of a neuron i in layer m can be expressed as

S_i^m = \partial f_i^m \sqrt{ \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} w_{ir}^m w_{is}^m\,C_{rs}^{m-1} },    (2.13)

where the terms C_{jk}^m can be recursively computed as

C_{jk}^m = \partial f_j^m\,\partial f_k^m \sum_{r=1}^{N_{m-1}} \sum_{s=1}^{N_{m-1}} w_{jr}^m w_{ks}^m\,C_{rs}^{m-1}.    (2.14)

The only difference between the two models is the initial case: C_{jk}^0 = \delta_{jk} for additive noise,
while the initial case is C_{jk}^0 = (y_j^0)^2\,\delta_{jk} in the case of multiplicative noise, due to the difference with respect to condition n.2. (A full and detailed proof of both propositions is given in Bernier et al., 1999b.)

In this section, the expressions of the statistical sensitivities to weight and input deviations have been obtained by considering two different models of deviations in each case. However, statistical sensitivity measures only the output stability provided by a particular neuron for a specific input pattern when perturbations in the weights or inputs are present. In the next section we propose a new figure of merit, termed mean squared sensitivity, which enables us to evaluate the overall sensitivity of the MLP against perturbations.
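The recursion of equation 2.14 runs naturally on the matrix C. The following sketch (hypothetical NumPy code, not from the article; bias-free tanh layers) computes the input-noise sensitivities of proposition 5 for a two-layer network and checks them against a Monte Carlo estimate of definition 2.3:

```python
import numpy as np

def input_noise_sensitivities(weights, x):
    """Layer-wise statistical sensitivities to additive input noise
    (proposition 5, eqs. 2.13 and 2.14) for a bias-free tanh MLP.
    `weights` is a list of matrices W_m of shape (N_m, N_{m-1})."""
    C = np.eye(x.size)               # initial case C^0_{jk} = delta_{jk}
    y = x
    for W in weights:
        z = W @ y
        df = 1.0 - np.tanh(z) ** 2   # f'(z) for tanh
        C = np.outer(df, df) * (W @ C @ W.T)   # eq. 2.14
        y = np.tanh(z)
    return np.sqrt(np.diag(C))       # S^M_i = sqrt(C^M_{ii})

rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, size=(4, 3))
W2 = rng.normal(0.0, 0.5, size=(2, 4))
x = np.array([0.2, -0.5, 0.7])
S = input_noise_sensitivities([W1, W2], x)

# Monte Carlo check of definition 2.3 with small additive input noise.
sigma, n = 1e-3, 40000
dx = rng.normal(0.0, sigma, size=(n, 3))
out = np.tanh(W2 @ np.tanh(W1 @ (x + dx).T))   # perturbed outputs, shape (2, n)
S_mc = np.std(out, axis=1) / sigma
```

For the multiplicative model, the only change would be the initial case, C = np.diag(x ** 2), as stated above.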
3 Mean Squared Sensitivity
The performance of the learning process is usually evaluated using the MSE of the output obtained with respect to the desired output, for a given set of input patterns. In fact, the backpropagation algorithm modifies the weight values in order to reduce the MSE, which can be expressed as

\mathrm{MSE} = \frac{1}{N_p} \sum_{p=1}^{N_p} \varepsilon(p) = \frac{1}{2N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \left(d_i(p) - y_i^M(p)\right)^2,    (3.1)
where \varepsilon(p) is the squared error, N_p is the number of input patterns considered, N_M is the number of neurons in the output layer, and d_i(p) and y_i^M(p) are the desired and obtained outputs of neuron i of the output layer for the input pattern p, respectively.

If the MLP suffers any deviation due to weight or input perturbations, its outputs will change, and so the MSE is degraded. By a Taylor expansion of expression 3.1 around the nominal MSE found after learning (MSE_0) with respect to output changes, it is obtained that

\mathrm{MSE}' = \mathrm{MSE}_0 + \frac{1}{2N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \frac{\partial \varepsilon(p)}{\partial y_i^M}\,\Delta y_i^M(p) + \frac{1}{4N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \sum_{j=1}^{N_M} \frac{\partial^2 \varepsilon(p)}{\partial y_i^M\,\partial y_j^M}\,\Delta y_i^M(p)\,\Delta y_j^M(p) + \cdots

= \mathrm{MSE}_0 + \frac{1}{N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \left(y_i^M(p) - d_i^M(p)\right)\Delta y_i^M(p) + \frac{1}{2N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} \left(\Delta y_i^M(p)\right)^2.    (3.2)

If the expected value of MSE' is computed, taking into account that E[\Delta y_i^M] = 0, as was proved in proposition 1 (for weight perturbations) and proposition 4 (for input perturbations), and considering that from equation 2.3 it can be derived that E[(\Delta y_i^M)^2] \equiv \sigma^2 (S_i^M)^2, the following expression is obtained:

E[\mathrm{MSE}'] = \mathrm{MSE}_0 + \frac{\sigma^2}{2N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} (S_i^M)^2,    (3.3)
where S_i^M is the statistical sensitivity of the output neuron i, which can be computed from propositions 2, 3, or 5, depending on the kind of perturbation considered.

By analogy with the definition of MSE, we define the mean squared sensitivity (MSS) as

\mathrm{MSS} \equiv \frac{1}{N_p} \sum_{p=1}^{N_p} SS(p) = \frac{1}{2N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_M} (S_i^M)^2,    (3.4)

where SS(p) constitutes the squared sensitivity for pattern p. In this way, the expected MSE when perturbations with standard deviation equal to σ are present can be computed as

E[\mathrm{MSE}'] = \mathrm{MSE}_0 + \sigma^2\,\mathrm{MSS}.    (3.5)
Table 1: MSE_0 and MSS Computed After Training the Predictor.

Type of Deviation        MSE_0     MSS
Additive weight          0.0001    0.0717
Multiplicative weight    0.0001    0.2254
Additive noise           0.0001    0.8478
Multiplicative noise     0.0001    0.3202
Thus, expression 3.5 shows a direct relation between the MSE degradation and the MSS. Considering that MSE_0 and MSS can be evaluated over a set of patterns after training, it is possible to predict the value of MSE' when the inputs or weights deviate from their nominal values within a range with standard deviation equal to σ. Moreover, as expression 3.5 shows, the lower the MSS, the lower the MSE degradation. Thus, we propose using the MSS as a suitable measurement of the MLP tolerance against perturbations. If weight deviations are used to compute S_i^M, then MSS constitutes a measurement of fault tolerance, while if input perturbations are considered, MSS evaluates the noise immunity.

4 Validation of MSS
We compared the results obtained when the MSE is computed for the MLP subject to deviations with the value predicted by expression 3.5. An MLP that implements a predictor of the Mackey-Glass temporal series (Wang, 1994) is considered. It consists of 3 inputs, 11 neurons in the hidden layer, and 1 neuron in the output layer, with all neurons containing a bias input. Table 1 shows the values of MSE_0 and MSS obtained after training, computed with the test patterns (different from those used for training). To compute MSS, we considered additive and multiplicative weight and input deviations. Figure 1 represents the values predicted by equation 3.5 and the values experimentally obtained in the case of additive weight deviations. Each experimental point has been averaged over 100 tests, where each test consisted of a random additive perturbation of all the weights of the MLP; that is, for each value of σ, each weight w_{ij}^m is changed to a value w_{ij}^m ± δ_{ij}^m, where δ_{ij}^m is a random variable that follows a normal distribution with mean zero and standard deviation equal to σ. The confidence interval at 95% is also shown for each experimental point. The absolute values of the weights were within the range [0.046, 5.78], and additive perturbations in the range [0, 0.25] were considered, which means that when σ = 0.25 the smallest weights suffer perturbations of up to 543.48% of their nominal values and the largest weights of up to 4.35%.
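A miniature version of this validation can be scripted directly. The sketch below (hypothetical NumPy code, not the authors' experiment; a single bias-free tanh neuron replaces the 3-11-1 predictor) perturbs the weights additively many times per σ and compares the average measured MSE with the prediction of equation 3.5:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(30, 5))      # 30 patterns, 5 inputs
w = rng.normal(0.0, 0.6, size=5)
y0 = np.tanh(X @ w)
d = y0 + rng.normal(0.0, 0.05, size=30)       # targets near the nominal fit
mse0 = np.mean((d - y0) ** 2) / 2.0

S = (1.0 - y0 ** 2) * np.linalg.norm(X, axis=1)   # corollary 1 sensitivities
mss = np.mean(S ** 2) / 2.0                       # eq. 3.4

for sigma in (0.02, 0.05, 0.10):
    dw = rng.normal(0.0, sigma, size=(20000, 5))  # 20,000 random perturbations
    Y = np.tanh(X @ (w + dw).T)                   # perturbed outputs, (30, 20000)
    mse_exp = np.mean((d[:, None] - Y) ** 2) / 2.0
    mse_pred = mse0 + sigma ** 2 * mss            # eq. 3.5
    print(f"sigma={sigma:.2f}  predicted={mse_pred:.5f}  measured={mse_exp:.5f}")
```

As with Figures 1 and 2, the match should be close for small σ and gradually worsen as σ grows, since the derivation assumes small deviations.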
[Figure: mean squared error, experimental versus predicted, as a function of the standard deviation of the weight perturbations (0 to 0.25) for the Mackey-Glass predictor.]

Figure 1: Additive weight deviations.
Multiplicative weight deviations have also been tested, and the results are shown in Figure 2. In this case, each weight w_{ij}^m is changed to a value w_{ij}^m(1 ± δ_{ij}^m), with δ_{ij}^m being a random variable that follows a normal distribution with mean equal to zero and standard deviation equal to σ. Note that the effect of the analog tolerance of a circuit can be modeled by multiplicative deviations, and thus σ represents the tolerance margin of analog components. In particular, the point where the standard deviation of the weight perturbations is equal to 0.25 implies that each weight suffers perturbations of up to 25% with respect to its nominal value.

Figures 1 and 2 show good matching between the predicted and the experimental MSE. However, when σ increases, the approximation worsens, due to the assumption of small deviations. In any case, equation 3.5 seems to provide a good approximation of the MSE degradation for moderate perturbations. In this way, expression 3.5 is useful for predicting the MSE degradation of a particular weight configuration against weight deviations. For example, if a set of weights must be transferred to an analog implementation whose tolerance margin is known, the learning performance of this implementation can be predicted by evaluating the MSS to multiplicative deviations. On the other hand, MSS to weight deviations constitutes a measurement of the fault tolerance of an MLP: the lower the MSS, the lower the MSE degradation. Thus, MSS can be used as a selection criterion among different sets of weights that present similar learning performance, to choose the one that presents the best MSE in the presence of weight perturbations.

[Figure: mean squared error, experimental versus predicted, as a function of the standard deviation of the weight perturbations (0 to 0.25) for the Mackey-Glass predictor.]

Figure 2: Multiplicative weight deviations.

Figures 3 and 4 show the MSE predicted and experimentally obtained when additive and multiplicative perturbations, respectively, affect each input y_i^0(p). Each experimental point has also been averaged over 100 tests, and the perturbations are similar to those described above but related to the inputs instead of the weights. The input values were in the range [0, 1], so additive deviations of 0.1 imply variations of about 10% for inputs with nominal values near 1, but very high ratios (|\Delta y_i^0(p)/y_i^0(p)|) for inputs with values near 0. For multiplicative input perturbations, the standard deviation indicates the maximum percentage (if multiplied by 100) of the variation of each input. Similar conclusions to the case of weight perturbations are obtained. The MSS to input noise allows us to predict the expected MSE degradation presented by an MLP when its inputs deviate from their nominal values; in this way, MSS can be considered a measurement of the noise immunity of an MLP. In particular, when the MLP receives discrete inputs, MSS to input noise can be used as a selection criterion among different sets of weights, as the lower the MSS to input noise, the higher the output stability against this noise.

From expressions 2.6, 2.10, and 2.13, it can be deduced that low weight magnitudes produce low values of MSS in general. Then a low value of MSS
[Figure: mean squared error, experimental versus predicted, as a function of the standard deviation of the input perturbations (0 to 0.1 for the additive case, 0 to 0.2 for the multiplicative case) for the Mackey-Glass predictor.]

Figure 3: Additive input deviations.

Figure 4: Multiplicative input deviations.
to weight perturbations may imply a low value of MSS to input noise, and vice versa. However, the expressions are different and their relation is not linear. Moreover, it can be deduced that to minimize MSS, it is necessary not only to reduce the weight magnitudes but also to achieve an adequate learning distribution among the different units in order to minimize the cross dependencies related to the terms C_{jk}^m. Finally, note that MSS to weight deviations and MSS to input noise can be added to take into account the simultaneous effect of these perturbations. Moreover, additive and multiplicative deviations can be considered simultaneously.

5 MSS to Input Noise as a Measurement of the Generalization Ability
The injection of input perturbations during learning has been proposed as an effective technique to increase the generalization ability (Mao & Jain, 1993; Bishop, 1995b). The goal is to smooth the error surface with respect to the inputs. From a Taylor expansion of the MSE, equation 3.1, when the inputs of the MLP are perturbed, we obtain

\mathrm{MSE}' = \mathrm{MSE}_0 + \frac{1}{2N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_0} \frac{\partial \varepsilon(p)}{\partial y_i^0}\,\Delta y_i^0 + \frac{1}{4N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_0} \sum_{j=1}^{N_0} \frac{\partial^2 \varepsilon(p)}{\partial y_i^0\,\partial y_j^0}\,\Delta y_i^0\,\Delta y_j^0 + \cdots,    (5.1)

where y_i^0 are the nominal inputs, N_0 the number of input neurons, \varepsilon(p) the squared error for pattern p, and MSE_0 the nominal MSE. If the expected value of MSE' is computed taking into account the additive model for input perturbations, the following expression is obtained:

E[\mathrm{MSE}'] = \mathrm{MSE}_0 + \frac{\sigma^2}{4N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_0} \frac{\partial^2 \varepsilon(p)}{\partial (y_i^0)^2}.    (5.2)

The above expression is formally similar to equation 3.5, so equating expressions 3.5 and 5.2, the following relation is obtained:

\mathrm{MSS}_{ai} = \frac{1}{4N_p} \sum_{p=1}^{N_p} \sum_{i=1}^{N_0} \frac{\partial^2 \varepsilon(p)}{\partial (y_i^0)^2}.    (5.3)

In this way, the MSS to additive input noise (MSS_ai) measures the curvature of the squared error surface with respect to the inputs. A low value of
[Figure: squared error as a function of the input for two weight configurations, with MSS = 1.049 and MSS = 0.455.]

Figure 5: Squared error versus input.
MSS_ai implies a similar squared error for any input pattern; that is, there is no specialization of the MLP in the recognition of a few input patterns. So, among weight configurations that present similar MSE, the one that presents the lowest value of MSS_ai has the highest generalization ability. To provide experimental evidence of this conclusion, an approximator of the sine function (Sudkamp & Hammell, 1994) was considered. It consisted of 1 input neuron, 11 neurons in the hidden layer, and 1 output neuron, with all neurons presenting a bias input. This MLP was trained so as to experience the overfitting phenomenon, using only 13 training patterns and 4000 training epochs. Two different weight configurations that presented similar MSE after training (= 0.011) but different values of MSS_ai were selected for comparison. Figure 5 shows the squared error (with respect to the desired output) as a function of the input. It can be seen that the sample with the higher MSS_ai presents steeper slopes; that is, it produces large errors for some inputs and small errors for others. On the other hand, the sample that presents the lower value of MSS produces a moderate error for any input. In this way, a lower value of MSS_ai implies a higher generalization ability, in the sense that there is less specialization of the MLP in a limited part of the input space. If multiplicative noise is considered, an expression similar to equation 5.3 is obtained; the only difference is that the curvature is weighted by the input magnitude (each term of the sum is multiplied by (y_i^0)^2).
Table 2: Improvement of the Generalization Ability.

                  MSE
γ        Average    Standard Deviation
0        0.0112     0.0027
0.02     0.0047     0.0001
0.04     0.0058     0.0001
0.06     0.0067     0.0003
0.08     0.0076     0.0006
0.10     0.0084     0.0006
When noise is added during the training process, the backpropagation algorithm modifies the weights in order to minimize the MSE subject to such perturbations (MSE'). In this way, the MSS to input noise is implicitly minimized jointly with the nominal MSE, and so the generalization ability is enhanced. As the analytical expressions of MSS have been obtained, they can be used by the backpropagation algorithm to make an explicit minimization of MSS (as is done in Bernier et al., 2000, using the average statistical sensitivity as the regularizer), so constituting an explicitly specified regularization technique (Mao & Jain, 1993). In this way, we have developed an algorithm that minimizes equation 3.5 instead of 3.1. Thus, a gradient-descent algorithm is used to modify the weights in order to minimize \varepsilon(p) + \gamma\,SS(p), where \gamma is a value that is decreased from an initial value at the end of each iteration of the training process in order to enable the learning convergence. The new algorithm, which we have termed the robust backpropagation algorithm (RBA), has been compared to a classical backpropagation algorithm (CBA), considering additive input deviations to compute SS(p). Table 2 shows the MSE values obtained employing RBA to train the sine approximator under the same conditions as previously mentioned, considering different initial values for the \gamma parameter. In comparison with the results obtained with CBA (the row with \gamma = 0), we can observe that by using RBA the MSE is clearly improved, as the new learning rule counteracts specialization, thus avoiding the overfitting phenomenon and allowing a better generalization. The performance of RBA applied to a more realistic problem is shown in section 7, where multiplicative weight and input perturbations are considered in order to maximize the tolerance against these kinds of perturbations.

6 MSS to Weight Perturbations and Saliency
Saliency has been proposed as a measurement of fault tolerance to weight deviations (Murray & Edwards, 1994; Edwards & Murray, 1998b). This measurement is computed from the diagonal elements of the Hessian matrix of the error with respect to the weight values. Thus, it evaluates the smoothness of the error surface with respect to the weights, such that a low value of saliency implies a higher fault tolerance. Just as input noise is introduced to enhance the generalization ability, the perturbation of weights during training has been proposed to increase the fault tolerance of MLPs (Murray & Edwards, 1994; Edwards & Murray, 1995, 1998a, 1998b). From a Taylor expansion of the MSE, equation 3.1, when the MLP is subject to weight deviations, it is obtained that

\mathrm{MSE}' = \mathrm{MSE}_0 + \frac{1}{2N_p} \sum_{p=1}^{N_p} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \sum_{j=1}^{N_{m-1}} \frac{\partial \varepsilon(p)}{\partial w_{ij}^m}\,\Delta w_{ij}^m + \frac{1}{4N_p} \sum_{p=1}^{N_p} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \sum_{j=1}^{N_{m-1}} \sum_{m'=1}^{M} \sum_{i'=1}^{N_{m'}} \sum_{j'=1}^{N_{m'-1}} \frac{\partial^2 \varepsilon(p)}{\partial w_{ij}^m\,\partial w_{i'j'}^{m'}}\,\Delta w_{ij}^m\,\Delta w_{i'j'}^{m'} + \cdots,    (6.1)

and taking into account the additive model of weight deviations to compute the expected value of equation 6.1, it is obtained that

E[\mathrm{MSE}'] = \mathrm{MSE}_0 + \frac{\sigma^2}{4N_p} \sum_{p=1}^{N_p} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \sum_{j=1}^{N_{m-1}} \frac{\partial^2 \varepsilon(p)}{\partial (w_{ij}^m)^2}.    (6.2)

Note that the above expression is similar to equation 3.5, and from equating both expressions it is obtained that

\mathrm{MSS}_{aw} = \frac{1}{4N_p} \sum_{p=1}^{N_p} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \sum_{j=1}^{N_{m-1}} \frac{\partial^2 \varepsilon(p)}{\partial (w_{ij}^m)^2};    (6.3)

that is, the MSS to additive weight perturbations (MSS_aw) can be identified with saliency. Indeed, they are the same, and MSS_aw evaluates the smoothness of the error surface with respect to the weights. However, if multiplicative perturbations are considered, then the following relation is obtained by applying the multiplicative model in equation 6.1 and equating it to equation 3.5:

\mathrm{MSS}_{mw} = \frac{1}{4N_p} \sum_{p=1}^{N_p} \sum_{m=1}^{M} \sum_{i=1}^{N_m} \sum_{j=1}^{N_{m-1}} (w_{ij}^m)^2\,\frac{\partial^2 \varepsilon(p)}{\partial (w_{ij}^m)^2}.    (6.4)

In this way, MSS to multiplicative perturbations (MSS_mw) is not identical to saliency. To evaluate fault tolerance to multiplicative deviations, it is necessary to take into account the weight magnitudes, as equation 6.4 shows; it is not sufficient to evaluate just the Hessian matrix.
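The difference between equations 6.3 and 6.4 is easy to see numerically. In the sketch below (hypothetical NumPy code, not from the article; a single linear neuron and one pattern are used so that the diagonal Hessian terms are exactly 2x_j^2), saliency weighs all diagonal Hessian terms equally, while MSS_mw scales each one by (w_{ij}^m)^2:

```python
import numpy as np

# Single linear neuron, one pattern (Np = 1): eps(p) = (d - w.x)^2,
# whose diagonal Hessian entries are exactly 2 x_j^2.
x = np.array([0.5, -0.8, 0.3])
w = np.array([0.4, 0.2, -0.6])
d = 0.1

def eps(v):
    return (d - v @ x) ** 2

# Diagonal of the Hessian by central differences.
h = 1e-4
hess_diag = np.array([
    (eps(w + h * e) - 2 * eps(w) + eps(w - h * e)) / h ** 2
    for e in np.eye(3)
])

mss_aw = hess_diag.sum() / 4.0              # eq. 6.3: saliency
mss_mw = (w ** 2 * hess_diag).sum() / 4.0   # eq. 6.4: weighted by w_j^2
```

With these numbers the two quantities differ by an order of magnitude, so a configuration ranked as tolerant by saliency need not be equally tolerant to multiplicative deviations.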
The algorithm proposed in Murray and Edwards (1994) and Edwards and Murray (1995) introduces multiplicative weight perturbations during learning, thus producing an implicit minimization of MSS_mw, but the authors evaluate fault tolerance using saliency (i.e., MSS_aw). Obviously, as expressions 2.6 and 2.10 show, the statistical sensitivities to additive and multiplicative perturbations present similar dependencies. Thus, a low value of MSS_aw may imply a low value of MSS_mw, and vice versa. However, this relation is not linear, and so MSS must be particularized for the perturbation model considered (additive or multiplicative, input or weight), as shown in the next section. The same considerations can be made by comparing MSS to input noise and MSS to weight deviations. Expressions 2.6, 2.10, and 2.13 present some similarities (for example, low weight magnitudes imply low values of MSS in every case). In this way, when fault tolerance is increased (MSS to weight deviations is decreased), the generalization ability or noise immunity may also be increased (MSS to input deviations is decreased), and vice versa. To test this, the next section shows the results obtained using RBA compared to CBA for a benchmarking problem.

7 An Algorithm to Reduce MSS
In section 5 we mentioned a learning algorithm that we termed RBA, which minimizes the MSS jointly with the MSE during the training process. In that section, we used RBA in order to minimize the MSS against additive input perturbations (MSS_ai), thus maximizing the generalization ability. In this section, we test the performance of RBA by applying it to a benchmarking problem belonging to PROBEN1 (Prechelt, 1994), considering the minimization of MSS against multiplicative weight (MSS_mw) and multiplicative input perturbations (MSS_mi). A classical backpropagation algorithm (CBA) is used for comparison. This algorithm uses the following learning rule:

(w_{ij}^m)^{t+1} = (w_{ij}^m)^t + \alpha\,(-\nabla \varepsilon_{ij}^m)^t,    (7.1)

where \alpha is the learning rate and \nabla \varepsilon_{ij}^m is the gradient of the squared error, \varepsilon(p), with respect to the weight w_{ij}^m. On the other hand, the rule used by RBA is

(w_{ij}^m)^{t+1} = (w_{ij}^m)^t + \alpha\,(-\nabla \varepsilon_{ij}^m)^t + \gamma\,(-\nabla SS_{ij}^m)^t,    (7.2)

where \nabla SS_{ij}^m constitutes the gradient of the squared sensitivity, SS(p), with respect to w_{ij}^m. Depending on the particular expression of SS(p) (SS_aw, SS_mw, SS_ai, or SS_mi), the RBA algorithm improves fault tolerance (additive or multiplicative) or noise immunity (additive or multiplicative).
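One RBA update can be sketched as follows (hypothetical NumPy code, not the authors' implementation: a single bias-free tanh neuron, one pattern, and numerical gradients in place of the analytical ones; SS(p) here is the squared sensitivity to additive weight noise from corollary 1):

```python
import numpy as np

x = np.array([0.5, 0.4, -0.6])
d = 0.8

def eps(w):                       # squared error for the pattern
    return (d - np.tanh(w @ x)) ** 2

def ss(w):                        # squared sensitivity SS(p), additive weight noise
    y = np.tanh(w @ x)
    S = (1 - y ** 2) * np.linalg.norm(x)     # corollary 1, no hidden layer
    return 0.5 * S ** 2

def num_grad(f, w, h=1e-6):
    """Central-difference gradient, standing in for the analytical gradient."""
    return np.array([(f(w + h * e) - f(w - h * e)) / (2 * h)
                     for e in np.eye(w.size)])

def rba_step(w, alpha=0.05, gamma=0.01):
    """One update of eq. 7.2: w <- w + alpha(-grad eps) + gamma(-grad SS)."""
    return w - alpha * num_grad(eps, w) - gamma * num_grad(ss, w)

w0 = np.array([0.2, -0.1, 0.3])
w1 = rba_step(w0)
```

Each step moves downhill on \varepsilon(p) + (\gamma/\alpha)\,SS(p), so the error is reduced while the weight configuration is pushed toward a flatter, less sensitive region; in the article, \gamma is decreased during training.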
2960
J. L. Bernier, J. Ortega, E. Ros, I. Rojas, and A. Prieto
Table 3: CBA Applied to the Breast Cancer Classifier.

                 cancer1              cancer2              cancer3
            Average  Std. Dev.   Average  Std. Dev.   Average  Std. Dev.
MSE          0.010    0.004       0.032    0.002       0.024    0.002
CE (%)       2.237    0.501       1.868    0.249       3.377    0.190
MSSmw        0.178    0.2151      0.102    0.0237      0.109    0.045
MSSmi        0.072    0.064       0.054    0.011       0.058    0.032
MSSaw        0.016    0.003       0.021    0.002       0.019    0.002
MSSai        0.572    0.557       0.391    0.092       0.362    0.219
The problem selected for comparison is a breast cancer classifier. PROBEN1 provides three disjoint sets of patterns: the training set (used for modifying the weights during the training process), the test set (used for evaluating the results), and the validation set (used during training for testing and selecting the optimal weight configuration). The MLP structure consisted of 9 inputs, 5 hidden neurons, and 2 output neurons, with all the neurons containing a bias input. The training, test, and validation sets consisted of 350, 174, and 175 patterns, respectively; 1000 training epochs were applied with a learning rate of 0.9. In the case of RBA, the parameter γ was reduced during training from 0.4 to 0.01. For each algorithm, the following parameters were computed using the test set: the MSE, the classification error (CE), MSSmw, MSSaw, MSSmi, and MSSai. The values presented were averaged over 40 training runs. The weights used for computing these magnitudes were those that provided the lowest MSE evaluated over the validation set. PROBEN1 provides three different orderings of the pattern sets, which we have labeled cancer1, cancer2, and cancer3. Table 3 shows the results obtained with the CBA algorithm, and Tables 4 and 5 refer to the RBA algorithm particularized for multiplicative weight deviations (RBAmw) and multiplicative input deviations (RBAmi), respectively. From these tables we can observe the following: Learning performance is similar for CBA, RBAmw, and RBAmi (similar MSE and similar CE). The value of MSSaw is also similar for the three algorithms. RBAmw obtains the best results for MSSmw, as expected. Weight configurations obtained with RBAmw are up to 3.0 times more tolerant to multiplicative weight deviations than those of CBA and up to 1.5 times more than those of RBAmi. Also, by comparison with CBA, a slight improvement of MSSmi is obtained, while results with respect to MSSai (generalization ability) are similar.
Fault Tolerance of MLPs
2961
Table 4: RBAmw Applied to the Breast Cancer Classifier.

                 cancer1              cancer2              cancer3
            Average  Std. Dev.   Average  Std. Dev.   Average  Std. Dev.
MSE          0.010    0.001       0.036    0.001       0.024    0.001
CE (%)       2.888    0.157       2.213    0.205       3.448    0.000
MSSmw        0.060    0.013       0.067    0.011       0.066    0.018
MSSmi        0.040    0.009       0.052    0.006       0.046    0.009
MSSaw        0.016    0.001       0.018    0.002       0.017    0.002
MSSai        0.418    0.099       0.451    0.054       0.325    0.054
Table 5: RBAmi Applied to the Breast Cancer Classifier.

                 cancer1              cancer2              cancer3
            Average  Std. Dev.   Average  Std. Dev.   Average  Std. Dev.
MSE          0.014    0.001       0.033    0.001       0.028    0.001
CE (%)       1.438    0.287       4.239    0.420       3.693    0.986
MSSmw        0.095    0.026       0.086    0.018       0.081    0.019
MSSmi        0.038    0.008       0.036    0.019       0.031    0.003
MSSaw        0.019    0.002       0.024    0.004       0.025    0.001
MSSai        0.298    0.087       0.266    0.138       0.184    0.017
RBAmi, as expected, obtains the best results for MSSmi (over 2.0 times better than CBA and 1.5 times better than RBAmw). MSSai has also been improved (over 1.9 times better than for CBA and RBAmw), and MSSmw is also clearly improved with respect to CBA. We can conclude that RBA can be particularized to minimize the target we are interested in (MSSmw, MSSaw, MSSai, or MSSmi) jointly with the MSE. Note that the MSS values for different kinds of perturbations can be combined in order to minimize them simultaneously. When applying RBA to minimize a particular aspect, it is possible to obtain collateral benefits in other aspects, but this is not guaranteed a priori; an analytical study would be interesting to gain insight into the dependencies among the different expressions of MSS. In any case, the performance of RBA has been shown experimentally. In effect, RBA searches for flat minima (Hochreiter & Schmidhuber, 1997) of the weight values: for example, RBAmw looks for weight configurations that are relatively stable against multiplicative weight perturbations, while RBAmi looks for those that are stable against multiplicative input perturbations (and similarly with respect to additive perturbations when RBAaw or RBAai is used).
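The training protocol used in these experiments (1000 epochs, with γ reduced from 0.4 to 0.01 during training) could be sketched as follows. The exact decay schedule is not specified in the text, so the linear interpolation here is an assumption of ours:

```python
def gamma_schedule(epoch, n_epochs=1000, gamma_start=0.4, gamma_end=0.01):
    """Rate of the sensitivity term in the RBA rule at a given epoch.
    ASSUMPTION: linear decay; the text only gives the endpoints 0.4 and 0.01."""
    frac = epoch / (n_epochs - 1)
    return gamma_start + (gamma_end - gamma_start) * frac
```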
8 Conclusions
In this article, a quantitative measurement to evaluate the fault tolerance and the noise immunity of an MLP has been obtained. This measurement, termed mean squared sensitivity, has been derived from an explicit relation between the degradation of the mean squared error of the MLP in the presence of perturbations and the statistical sensitivity. Different perturbation models for computing MSS have been considered: if weight deviations are considered, MSS constitutes a measurement of fault tolerance, while for input perturbations, MSS evaluates the noise immunity of the MLP. In this way, MSS constitutes a useful criterion for selecting between different weight configurations that present similar performance with respect to learning but different performance with respect to fault tolerance or noise immunity. The expressions have been validated by experimental results, showing that the expected MSE degradation can be accurately predicted from the value of MSS when perturbations affect the MLP. In the case of additive input perturbations, we have shown that MSS measures the smoothness of the error surface with respect to the inputs, thus enabling the evaluation of the generalization ability provided by the perceptron. In a similar way, we have related the MSS to weight deviations to saliency, another previously proposed measurement of fault tolerance, showing that MSS is more suitable than saliency for evaluating fault tolerance, at least in the case of multiplicative deviations.
The new training algorithm proposed here, which can be considered an explicitly specified regularization technique, adds the gradient of the squared sensitivity for the input pattern p (SS(p)) to the backpropagation rule in order to minimize MSS jointly with MSE; thus, depending on the kind of perturbation considered for computing SS(p) (additive or multiplicative, in weights or inputs), the corresponding aspect is improved (fault tolerance, noise immunity, or generalization ability). This algorithm searches for wider minima of the learning performance with respect to deviations, such that perturbations in weights or inputs produce small changes in the MLP output. Moreover, several conjectures observed experimentally in previous work are deduced from the expressions obtained in this article: for example, that lower weight magnitudes imply higher fault tolerance, the relation between fault tolerance and generalization ability, and the relation between noise and generalization. The methodology applied here can also be used to study the tolerance to perturbations of other layered neural network architectures, such as the now extensively used radial basis function networks.

Acknowledgments
We thank the referees for their very helpful comments and suggestions.
The article was partially supported by project TIC97-1149 (CICYT, Spain).

References

Alippi, C., Piuri, V., & Sami, M. (1994). Sensitivity to errors in artificial neural networks: A behavioral approach. In Proc. IEEE Int. Symp. on Circuits & Systems (pp. 459–462).
Bernier, J. L., Ortega, J., Rodriguez, M. M., Rojas, I., & Prieto, A. (1999a). An accurate measure for multilayer perceptron tolerance to weight deviations. Neural Processing Letters, 10(2), 121–130.
Bernier, J. L., Ortega, J., Rojas, I., & Prieto, A. (2000). Improving the tolerance of multilayer perceptrons by minimizing the statistical sensitivity to weight deviations. Neurocomputing, 31, 87–103.
Bernier, J. L., Ortega, J., Ros, E., Rojas, I., & Prieto, A. (1999b). A new measurement of noise immunity and generalization ability for MLPs. International Journal of Neural Systems, 9(6), 511–522.
Bishop, C. (1995a). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
Bishop, C. M. (1995b). Neural networks for pattern recognition. New York: Oxford University Press.
Chiu, C., Mehrotra, K., Mohan, C. K., & Ranka, S. (1994). Modifying training algorithms for improved fault tolerance. Proc. IEEE Int. Conf. on Neural Networks, 1, 333–338.
Choi, J. Y., & Choi, C. (1992). Sensitivity analysis of multilayer perceptron with differentiable activation functions. IEEE Trans. on Neural Networks, 3(1), 101–107.
Edwards, P. J., & Murray, A. F. (1995). Can deterministic penalty terms model the effects of synaptic weight noise on network fault-tolerance? Int. Journal of Neural Systems, 6(4), 401–416.
Edwards, P. J., & Murray, A. F. (1998a). Towards optimally distributed computation. Neural Computation, 10, 997–1015.
Edwards, P. J., & Murray, A. F. (1998b). Weight saliency regularisation in augmented networks. In European Symp. on Artificial Neural Networks, ESANN (pp. 261–266).
Edwards, P. J., & Murray, A. F. (1998c). Fault tolerance via weight noise in analog VLSI implementations of MLPs—A case study with EPSILON. IEEE Trans. on Circuits and Systems II: Analog and Digital Signal Processing, 45(9), 1255–1262.
Hochreiter, S., & Schmidhuber, J. (1997). Flat minima. Neural Computation, 9, 1–42.
Krogh, A., & Hertz, J. A. (1992). A simple weight decay can improve generalization. In Proc. Neural Information Processing Systems (NIPS) Conference (pp. 950–957). San Mateo, CA: Morgan Kaufmann.
Mao, J., & Jain, A. K. (1993). Regularization techniques in artificial neural networks. In World Congress on Neural Networks (pp. 75–79).
Minnix, J. I. (1992). Fault tolerance of the backpropagation neural network trained on noisy inputs. In Int. Joint Conference on Neural Networks (pp. 75–79).
Murray, A. F., & Edwards, P. J. (1994). Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training. IEEE Transactions on Neural Networks, 5(5), 792–802.
Neti, C., Schneider, M. H., & Young, E. D. (1992). Maximally fault tolerant neural networks. IEEE Transactions on Neural Networks, 3(1), 14–23.
Pathak, D. S., & Koren, I. (1995). Complete and partial fault tolerance of feedforward neural nets. IEEE Transactions on Neural Networks, 6(2), 446–456.
Prechelt, L. (1994). PROBEN1—A set of neural network benchmark problems and benchmarking rules (Tech. Rep. No. 21/94). Universität Karlsruhe, Germany.
Segee, B. E., & Carter, M. J. (1994). Comparative fault tolerance of parallel distributed processing networks. IEEE Trans. on Computers, 43(11), 1323–1329.
Stevenson, M., Winter, R., & Widrow, B. (1990). Sensitivity of neural networks to weight errors. IEEE Trans. on Neural Networks, 1(1), 71–80.
Sudkamp, T., & Hammell, R. (1994). Interpolation, completion, and learning fuzzy rules. IEEE Trans. on Systems, Man & Cybernetics, 24(2), 332–342.
Wang, L. (1994). Adaptive fuzzy systems and control: Design and stability analysis. Englewood Cliffs, NJ: Prentice Hall.

Received June 23, 1999; accepted February 24, 2000.
LETTER
Communicated by Bill Horne
On the Computational Complexity of Binary and Analog Symmetric Hopfield Nets

Jiří Šíma
Department of Theoretical Computer Science, Institute of Computer Science, Academy of Sciences of the Czech Republic, P.O. Box 5, 182 07 Prague 8, Czech Republic

Pekka Orponen
Teemu Antti-Poika
Department of Mathematics, University of Jyväskylä, FIN-40351 Jyväskylä, Finland

We investigate the computational properties of finite binary- and analog-state discrete-time symmetric Hopfield nets. For binary networks, we obtain a simulation of convergent asymmetric networks by symmetric networks with only a linear increase in network size and computation time. Then we analyze the convergence time of Hopfield nets in terms of the length of their bit representations. Here we construct an analog symmetric network whose convergence time exceeds the convergence time of any binary Hopfield net with the same representation length. Further, we prove that the MIN ENERGY problem for analog Hopfield nets is NP-hard and provide a polynomial time approximation algorithm for this problem in the case of binary nets. Finally, we show that symmetric analog nets with an external clock are computationally Turing universal.

1 Introduction
In his 1982 paper (Hopfield, 1982), John Hopfield introduced an influential associative memory model that has since come to be known as the discrete-time Hopfield (or symmetric recurrent) network. The fundamental characteristic of this model is its well-constrained convergence behavior as compared to arbitrary asymmetric networks. Part of the appeal of Hopfield nets also stems from their connection to the much-studied Ising spin glass model in statistical physics (Barahona, 1982) and their natural hardware implementations using electrical networks (Hopfield, 1984) or optical computers (Farhat, Psaltis, Prata, & Paek, 1985). Hopfield originally proposed that this type of network could be used for removing noise from large binary patterns (Chiueh & Goodman, 1988; Gold, 1986; Park, Cho, & Park, 1999; Pawlicki, Lee, Hull, & Srihari, 1988), but they have also been applied to, for example, the fast approximate solution of combinatorial optimization problems (Hopfield & Tank, 1985; Aarts & Korst, 1989; Liao, 1999; Mańdziuk, 2000; Sheu, Lee, & Chang, 1991; Wu & Tam, 1999). Because of their simple
Neural Computation 12, 2965–2989 (2000)  © 2000 Massachusetts Institute of Technology
2966
J. Šíma, P. Orponen, and T. Antti-Poika
structure, Hopfield nets are also often used as a reference model for investigating new analytical or computational ideas, in the same way as Ising spin systems are used in statistical physics or Turing machines in theoretical computer science. Although the basic Hopfield model per se is of limited practical applicability, it has inspired such other important neural network architectures as the bidirectional associative memory (BAM) (Kosko, 1988), and Boltzmann machines and their further stochastic variations (Ackley, Hinton, & Sejnowski, 1985; Haykin, 1999; Rojas, 1996). The accumulated knowledge concerning Hopfield nets has often been an essential prerequisite for understanding the capabilities of these other models. (For instance, the convergence behavior of the BAM model was analyzed in Kosko, 1988, along the lines first established for the Hopfield model in Hopfield, 1982.) In this article, we investigate a number of issues in the computational analysis of Hopfield networks, complementing the existing literature in this area (Floréen & Orponen, 1994; Parberry, 1994; Siegelmann, 1999; Siu, Roychowdhury, & Kailath, 1995; Wiedermann, 1994). After a brief review of the basic definitions in section 2, our first result in section 3 establishes a size- and time-optimal (up to constant factors) simulation of arbitrary discrete-time binary-state neural networks (with in general asymmetric interconnections) by symmetric binary-state Hopfield nets. It is a fundamental property of symmetric Hopfield nets that their computations always lead to a stable network state (Hopfield, 1982) or possibly an oscillation between two different network states when the neurons are updated in parallel (Goles, Fogelman, & Pellegrin, 1985; Poljak & Sůra, 1983). On the other hand, general asymmetric networks can have arbitrarily complicated limit behavior. Thus, the asymptotic dynamics of symmetric Hopfield nets are considerably more constrained than those of arbitrary networks.
However, it was established in Orponen (1996) that when only convergent computations are considered, an arbitrary network of n discrete-time binary neurons can be simulated by a Hopfield net with O(n^2) symmetrically interconnected binary units. In section 3 we strengthen this result by showing that, in fact, a convergent computation that takes t steps on a general network of n neurons can be simulated in 4t steps on a Hopfield network consisting of only 6n + 2 symmetrically interconnected units. This tight converse to Hopfield's (1982) convergence theorem thus shows that in binary networks it holds in a quite strong sense that "convergence ≡ symmetry": not only do all symmetric networks converge, but all convergent computations can be implemented efficiently in symmetric networks. The result also has some practical implications, because it guarantees that recurrent networks with arbitrary asymmetric interconnections can always be replaced, without much overhead in network size or computation time, by symmetric networks, with their guaranteed convergence properties and, in some technologies, more efficient implementations. In section 4 we study the convergence time of both binary- and analog-state Hopfield networks in terms of the length of their bit representations.
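The quantitative content of this simulation result is easy to state in code; the following trivial sketch (our own) just records the size and time overheads claimed above:

```python
def symmetric_simulation_cost(n, t):
    """Resources of the simulating symmetric Hopfield net per section 3:
    a convergent t-step computation of n asymmetrically interconnected
    binary neurons is carried out by 6n + 2 symmetric units in 4t steps
    (versus the O(n**2) units of the earlier construction in Orponen, 1996)."""
    return {"neurons": 6 * n + 2, "steps": 4 * t}
```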
Computational Complexity of Hopfield Nets
2967
This is to our knowledge the first analysis that takes into account the actual representation size of the networks. We obtain the curious result that there exist analog-state symmetric networks whose convergence time is greater than that of any binary-state symmetric network with the same representation size. The result shows that although analog-state networks with limited-precision states are not computationally more powerful than binary-state ones (Casey, 1996; Maass & Orponen, 1998), they may still be more efficient. In section 5 we investigate the NP-complete MIN ENERGY or GROUND STATE problem of finding a network state with minimal energy for a given symmetric network. (For precise definitions, see section 2.) This problem forms the basis of all applications of Hopfield nets to combinatorial optimization (Hopfield & Tank, 1985; Aarts & Korst, 1989; Liao, 1999; Mańdziuk, 2000; Sheu, Lee, & Chang, 1991; Wu & Tam, 1999) and is also of importance in the Ising spin glass model (Barahona, 1982). In section 5.1 we derive a new polynomial time approximation algorithm for this problem in the case of binary-state networks. In section 5.2 we provide the first rigorous proof that the MIN ENERGY problem is NP-hard also for analog-state Hopfield nets, the model actually used in practical optimization applications. Section 6 deals with the computational power of finite analog-state recurrent neural networks, a topic that has recently attracted considerable attention (Siegelmann, 1999). At a general level, it is known that finite asymmetric analog networks are computationally universal, that is, equivalent to Turing machines (Siegelmann & Sontag, 1995), but that any amount of analog noise reduces their computational power to that of finite automata (Casey, 1996; Maass & Orponen, 1998). For reasons discussed more fully in section 6, one cannot reasonably expect finite symmetric analog networks to be computationally universal.
However, we show that universality can be achieved even in symmetric networks, provided that they are augmented with an external oscillator that produces an appropriately sequenced infinite stream of binary pulses. Thus, we obtain a full characterization of the computational power of finite analog-state discrete-time networks in the form "Turing universality ≡ asymmetric network ≡ symmetric network + oscillator." Moreover, we give in section 6 necessary and sufficient conditions that the external oscillator needs to satisfy in order to qualify for this equivalence. An extended abstract of this article appeared in Šíma, Orponen, and Antti-Poika (1999).

2 Basic Notions
A finite discrete recurrent neural network consists of n simple computational units, or neurons, indexed 1, …, n, which are connected into a generally cyclic, oriented graph, or architecture, in which each edge (i, j) leading from neuron i to neuron j is labeled with an integer weight w(i, j) = w_ji. The absence of
a connection within the architecture corresponds to a zero weight between the respective neurons. Special attention will be paid to Hopfield (symmetric) networks, whose architecture is an undirected graph with symmetric weights w(i, j) = w(j, i) for every i, j. We shall mostly consider the synchronous computational dynamics of the network, working in fully parallel mode, which determines the evolution of the network state y^(t) = (y_1^(t), …, y_n^(t)) ∈ {0, 1}^n for all discrete time instants t = 0, 1, … as follows. At the beginning of the computation, the network is placed in an initial state y^(0), which may include an external input. At discrete time t ≥ 0, each neuron j = 1, …, n collects its binary inputs from the states (outputs) y_i^(t) ∈ {0, 1} of incident neurons i. Then its integer excitation ξ_j^(t) = Σ_{i=0..n} w_ji y_i^(t) (j = 1, …, n) is computed as the respective weighted sum of inputs, including an integer bias w_j0, which can be viewed as the weight of the formal constant unit input y_0^(t) = 1, t ≥ 0. At the next instant t + 1, an activation function σ is applied to ξ_j^(t) for all neurons j = 1, …, n in order to determine the new network state y^(t+1) as follows:

y_j^(t+1) = σ(ξ_j^(t)),  j = 1, …, n,  (2.1)

where a binary-state neural network employs the hard limiter (or threshold) activation function

σ(ξ) = 1 for ξ ≥ 0;  σ(ξ) = 0 for ξ < 0.  (2.2)

Alternative computational dynamics are possible in Hopfield nets. For example, under sequential mode, only one neuron updates its state according to equation 2.1 at each time instant, while the remaining neurons do not change their outputs. We shall also deal with finite analog-state discrete-time recurrent neural networks, which, instead of the threshold activation function of equation 2.2, employ some continuous sigmoid function, for example, the saturated-linear function

σ(ξ) = 1 for ξ > 1;  σ(ξ) = ξ for 0 ≤ ξ ≤ 1;  σ(ξ) = 0 for ξ < 0.  (2.3)
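For concreteness, the fully parallel binary dynamics of equations 2.1 and 2.2 can be sketched as follows (a minimal illustration of ours; W holds the weights w_ji, and b the biases w_j0):

```python
import numpy as np

def parallel_step(W, b, y):
    """One synchronous update (eq. 2.1) with the hard limiter of eq. 2.2:
    every neuron j computes its excitation xi_j = sum_i w_ji * y_i + w_j0
    and fires iff xi_j >= 0."""
    xi = W @ y + b
    return (xi >= 0).astype(int)

# A 2-neuron symmetric net that oscillates between two states under
# parallel updates -- the length-two cycle discussed at the end of section 2.
W = np.array([[0, 1], [1, 0]])
b = np.array([-1, -1])
y0 = np.array([1, 0])
y1 = parallel_step(W, b, y0)   # -> [0, 1]
y2 = parallel_step(W, b, y1)   # -> [1, 0], back to y0
```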
Hence the states of analog neurons are real numbers within the interval [0; 1], and similarly the weights (including biases) are allowed to be reals. The fundamental property of symmetric nets is that on their state-space, a bounded Lyapunov, or energy function can be dened, whose values are properly decreasing along any nonconstant computation path (productive computation) of such a network. For example, consider a sequential computation of a binary symmetric Hopeld net with, for simplicity, zero biases
w_j0 = 0 and zero self-feedbacks w_jj = 0 (negative feedbacks are actually not allowed if sequential Hopfield nets are to have the Lyapunov property), and without loss of generality (Parberry, 1994) also assume nonzero excitations ξ_j^(t) ≠ 0, j = 1, …, n. Then one can associate an energy with state y^(t), t ≥ 0, as follows:

E(y^(t)) = E(t) = −(1/2) Σ_{j=1..n} Σ_{i=1..n} w_ji y_i^(t) y_j^(t).  (2.4)

Hopfield (1982) showed that for this energy function, E(t) ≤ E(t − 1) − 1 for every t ≥ 1 of a productive computation. Moreover, the energy function is bounded, that is, |E(t)| ≤ W, where

W = (1/2) Σ_{j=1..n} Σ_{i=1..n} |w_ji|  (2.5)

is called the weight of the network. Hence, the computation must converge to a stable state within time O(W). An analogous result can be shown for parallel updates, where a cycle of length at most two different states may appear (Goles, Fogelman, & Pellegrin, 1985; Poljak & Sůra, 1983).

3 An Optimal Simulation of Asymmetric Networks
The computational power of symmetric Hopfield nets is properly less than that of asymmetric networks due to their different asymptotic behavior. Because of the Lyapunov property, Hopfield networks always converge to a stable state (or a cycle of length two for parallel updates), whereas asymmetric networks can have limit cycles of arbitrary length. However, Orponen (1996) showed that this is the only feature that cannot be reproduced, in the sense that any converging fully parallel computation by a network of n discrete-time binary neurons with in general asymmetric interconnections can be simulated by a Hopfield net of size O(n^2). More precisely, given an asymmetric network to be simulated, there exists a subset of neurons in the respective Hopfield net whose states correspond to the original convergent asymmetric computation in the course of the simulation, possibly with some constant time overhead per original update. The idea behind this simulation is that each asymmetric edge to be simulated is implemented by a small symmetric subnetwork that receives "energy support" from a symmetric clock subnetwork (a binary counter; Goles & Martínez, 1989) in order to propagate a signal in the correct direction. In the context of infinite families of neural networks, which contain one network for each input length (a similar model is used in the study of Boolean circuit complexity; Wegener, 1987), this simulation result implies that infinite sequences of discrete symmetric networks with a polynomially increasing number of binary neurons are computationally equivalent
to (nonuniform) polynomially space-bounded Turing machines. Stated in standard complexity-theoretic notation (Balcázar, Díaz, & Gabarró, 1995), such sequences of networks thus compute the complexity class PSPACE/poly, or P/poly when the weights in the networks are restricted to be polynomial in the input size (Orponen, 1996). In the following theorem, the construction from Orponen (1996) is improved by reducing the number of neurons in the simulating symmetric network from quadratic to linear in the size of the asymmetric network; this result is asymptotically optimal. The improvement is achieved by simulating the individual neurons (as opposed to edges, as in Orponen, 1996), whose states are updated by means of the clock technique. A similar idea was used for a corresponding continuous-time simulation in Šíma and Orponen (2000). A quantitatively similar result has also been achieved for sequential Hopfield nets with negative feedbacks in Goles and Matamala (1996). However, such networks lose the Lyapunov property, and thus they can simulate any, in general nonconvergent, asymmetric network relatively easily.

Theorem 1. Any fully parallel computation by a recurrent neural network of n binary neurons with asymmetric weights, converging within t* discrete update steps, can be simulated by a Hopfield net with 6n + 2 neurons, converging within 4t* discrete-time steps.
Proof. Observe first that any converging computation by an asymmetric network of n binary neurons must terminate within t* ≤ 2^n steps. A basic
technique used in our proof is the exploitation of an (n + 1)-bit symmetric clock subnetwork (a binary counter), which, using 3n + 1 units initially in the zero state, produces a sequence of 2^n well-controlled oscillations before it converges. This sequence of clock pulses will be used to drive the underlying simulation of an asymmetric network in the remaining part of the Hopfield net. The construction of the (n + 1)-bit binary counter will be described by induction on n. An example of a 3-bit counter network is presented in Figure 1, where the symmetric connections between neurons are labeled with corresponding weights and the biases are indicated by edges drawn without an originating unit. In the sequel, the symmetric weights in the Hopfield net will be denoted by w, whereas w′ denotes the original asymmetric weights. The induction starts with the least significant counter unit c_0. Its bias w(0, c_0) = (5U + 4)n + 1 is also denoted by B in Figure 1, and the parameter U is defined as follows:

U = max_{j=1..n} Σ_{i=0..n} |w′_ji|.  (3.1)
As will be seen later, this large bias prevents the rest of the network from
Figure 1: A 3-bit counter network.
affecting the counter computation. Neuron c_0 is initially passive (its state is 0). However, at the next time instant, c_0 will fire, or be active (its state is 1), according to equation 2.2, since its excitation is positive. Hence, c_0 simply implements counting from 0 to 1. For the induction step, suppose that the counter has been constructed up to the first k (0 < k < n + 1) counter bits c_0, …, c_{k−1}, and denote by V_k the set of all its n_k = 3(k − 1) + 1 neurons, including the auxiliary ones labeled a_ℓ, b_ℓ for ℓ = 1, …, k − 1. Then the counter unit c_k is connected to all n_k neurons v ∈ V_k via unit weights w(v, c_k) = 1, which, together with its bias w(0, c_k) = −n_k, make c_k fire if and only if all these units are active. This includes the first k active counter bits c_0, …, c_{k−1}, which means that counting from 0 to 2^k − 1 has been accomplished, and hence the next counter bit c_k must fire. In addition, the unit c_k is connected to a_k, which is further linked to b_k, so that these auxiliary neurons are, one by one, activated after c_k fires. This is implemented by the weights w(c_k, a_k) = W_k, w(a_k, b_k) = W_k − n_k and the biases w(0, a_k) = −1, w(0, b_k) = n_k − W_k, where W_k > 0 is a sufficiently large positive parameter whose value will be optimally determined below so that neurons a_k, b_k are not directly influenced by a computation of units from V_k except via c_k. The neuron a_k resets all the units in V_k to their initial zero states, which is consistent with the correct counter computation when c_k fires. For this purpose, a_k is further connected to each v ∈ V_k via a sufficiently large negative weight w(a_k, v) < 0 such that −w(a_k, v) > 1 + Σ_{u ∈ V_k ∪ {0}: w(u,v)>0} w(u, v) = S_kv exceeds their mutual positive influence (including the weight w(c_k, v) = 1)
and its possibly positive bias w(0, v) > 0. Thus, we can choose the integer weight w(a_k, v) = −S_kv − 1, smaller by only 1 than the negated right-hand side of the preceding inequality suggests, in order to optimize the weight size. This also determines the minimal value of the large positive weight parameter W_k = 1 − Σ_{v ∈ V_k} w(a_k, v), which makes the state of a_k (similarly for b_k below) independent of the outputs from v ∈ V_k. Finally, the unit b_k balances the negative influence of a_k on V_k so that the first k counter bits can again count from 0 to 2^k − 1, but now with c_k active. This is achieved by the exact weight w(b_k, v) = −w(a_k, v) − 1 for each v ∈ V_k, in which −w(a_k, v) eliminates the influence w(a_k, v) of a_k on v, and −1 compensates w(c_k, v) = 1. Again, neurons v ∈ V_k cannot reversely affect b_k, since their maximal contribution Σ_{v ∈ V_k} w(v, b_k) = −n_k − Σ_{v ∈ V_k} w(a_k, v) = W_k − n_k − 1 to the excitation of b_k cannot beat its bias w(0, b_k) = n_k − W_k. This completes the induction step. It follows from the preceding description that the sizes of the integer weights in the counter network are as small as this construction allows. Now the symmetric clock subnetwork (in particular, the counter unit c_0, which outputs the state sequence (0111)^(2^n) during the clock operation) will be used to drive the underlying simulation of an asymmetric network. Note that the corresponding weights in the counter have been chosen sufficiently large that the clock is not influenced by the subnetwork it drives. In addition, a neuron c̄_0 is added, which computes the negation of the c_0 output. Then, for each neuron j of the asymmetric network, three units p_j, q_j, r_j are introduced in the Hopfield net, so that p_j represents the new (current) state y_j^(t) of j at time t ≥ 1, q_j stores the old state y_j^(t−1) of j from the preceding time instant t − 1, and r_j is an auxiliary neuron realizing the update of the old state.
The corresponding symmetric subnetwork simulating one neuron j is depicted in Figure 2. The total number of units simulating the asymmetric network is 3n + 1 (including c̄_0), which, together with the clock size 3n + 1, gives a total of 6n + 2 neurons in the Hopfield net. At the beginning of the simulation, all the neurons in the Hopfield net are passive, except for those units q_j corresponding to the originally active neurons j, that is, y_j^(0) = 1. Then an asymmetric network update at time t ≥ 1 is simulated by a cycle of four steps in the Hopfield net, as follows. In the first step, unit c_0 fires and remains active until its state is changed by the clock, since its large positive bias makes it independent of all the n neurons p_j. The unit c̄_0 also fires, because it computes the negation of c_0, which was initially passive. At the same time, each neuron p_j computes its new state y_j^(t) from the old states y_i^(t−1), which are stored in the corresponding units q_i. Thus, each neuron p_j is connected with units q_i via the original weights w(q_i, p_j) = w′(i, j), and its bias w(0, p_j) = w′(0, j) is also preserved. So far, unit q_j keeps the old state y_j^(t−1) due to its feedback.
Computational Complexity of Hopfield Nets
Figure 2: Symmetric simulation of neuron j.
In the second step, the new state y_j^(t) is copied from p_j to r_j, and the active neuron c_0 makes each neuron p_j passive by means of a large, negative weight, which exceeds the positive influence from the units q_i (i = 1, …, n), including its bias w(0, p_j), according to equation 3.1. Similarly, the active neuron c̄_0 erases the old state y_j^(t−1) from each neuron q_j by making it passive with the help of a large, negative weight, which, together with the negative bias, exceeds its feedback and the positive influence from the units p_i (i = 1, …, n). Finally, neuron c̄_0 also becomes passive, since c_0 was active. In the third step, the current state y_j^(t) is copied from r_j to q_j, since all the remaining incident neurons p_i and c̄_0 are and remain passive due to c_0 being active. Therefore unit r_j also becomes passive. In the fourth step, c_0 becomes passive, and the state y_j^(t), called old from now on, is stored in q_j. Thus, the Hopfield net finds itself at the starting condition of the simulation step for time t + 1 of the asymmetric network, which proceeds in the same way. Altogether, the whole simulation is achieved within 4t* discrete-time steps.

4 Convergence Time Analysis
In this section, the convergence time of Hopfield networks—the number of discrete updates required before the network converges—will be analyzed. We shall consider only worst-case bounds (an average-case analysis can be found in Komlós & Paturi, 1988). Since a network with n binary neurons has 2^n different states, a trivial 2^n upper bound on the convergence time of symmetric networks of size n holds. On the other hand, the symmetric clock
J. Šíma, P. Orponen, and T. Antti-Poika
network (Goles & Martínez, 1989), which is used in the proof of theorem 1, provides an explicit example of a Hopfield net whose convergence time is exponential with respect to n. More precisely, this network yields an Ω(2^{n/3}) lower bound on the convergence time of Hopfield nets, since a (k + 1)-bit binary counter can be implemented using n = 3k + 1 neurons. However, these bounds do not take into account the size of the weights in the network. An upper bound of O(W) follows from the characteristics of the energy function (see section 2), and this estimate can be made more accurate by using a slightly different energy function (Floréen, 1991). This yields a polynomial upper bound on the convergence time of Hopfield nets with polynomial weights. Similar arguments can be used for fully parallel updates. In theorem 2, these results will be translated into convergence-time bounds with respect to the length of bit representations of Hopfield nets. More precisely, for a binary-state symmetric network described within M bits, convergence-time lower and upper bounds 2^{Ω(M^{1/3})} and 2^{O(M^{1/2})}, respectively, will be shown. It is an open problem whether these bounds can be improved. This is an important issue, since the convergence-time results for binary-state Hopfield nets can be compared with those for analog-state symmetric networks, in which the precision of the real weight parameters (i.e., the representation length) plays an important role. For example, in theorem 2 we shall also introduce an analog-state version of the symmetric clock network from theorem 1 whose computation terminates later than that of any other binary-state Hopfield net of the same representation size. To the best of our knowledge, this provides the first known lower bound on the convergence time of analog-state Hopfield nets. A similar result can be achieved even for continuous-time symmetric networks (Šíma & Orponen, 2000).
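The O(W) bound rests on the fact that a state-changing sequential update never increases the energy, so with integer weights the number of strict decreases is bounded by the energy range. The following minimal sanity check of this monotonicity is a sketch in hypothetical Python; the energy and update rule follow the standard binary Hopfield definitions of section 2, with zero diagonal and the convention that zero excitation activates a unit.

```python
import random

def energy(w, bias, y):
    # E(y) = -(1/2) sum_{i,j} w(i,j) y_i y_j - sum_j w(0,j) y_j
    n = len(y)
    return (-0.5 * sum(w[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
            - sum(bias[j] * y[j] for j in range(n)))

random.seed(0)
n = 8
# random symmetric integer weights with zero diagonal
w = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = random.randint(-3, 3)
bias = [random.randint(-3, 3) for _ in range(n)]
y = [random.randint(0, 1) for _ in range(n)]

energies = [energy(w, bias, y)]
changed = True
while changed:                      # sequential (asynchronous) update mode
    changed = False
    for j in range(n):
        new = 1 if sum(w[i][j] * y[i] for i in range(n)) + bias[j] >= 0 else 0
        if new != y[j]:
            y[j] = new
            energies.append(energy(w, bias, y))
            changed = True

# the energy never increases along the trajectory, so the net converges
assert all(b <= a for a, b in zip(energies, energies[1:]))
```

With integer weights, every strict decrease lowers the energy by at least 1, which is how the energy range yields the O(W) convergence-time estimate.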
Note also that in this approach, we express the convergence time with respect to the full descriptional complexity of the Hopfield net, not just the usual measure of the number of neurons, which captures its computational resources only partially. This suggests that analog Hopfield nets whose parameters are limited in size may gain efficiency over binary ones.

Theorem 2. There exist both binary- and analog-state Hopfield nets with encoding size of M bits that converge after 2^{Ω(M^{1/3})} and 2^{Ω(g(M))} updates, respectively, where g(M) is an arbitrary continuous function such that g(M) = Ω(M^{2/3}), g(M) = o(M), and M/g(M) is increasing. On the other hand, any computation of a symmetric binary-state network with a binary representation of M bits terminates within 2^{O(M^{1/2})} discrete computational steps.
Proof. For the underlying lower bounds, the clock network with parameter B = 1 from the proof of theorem 1 can again be exploited. It is sufficient to estimate its representation length. It can be shown by induction on k that the
maximum integer weight in the (k + 1)-bit counter with n = 3k + 1 neurons is of order 2^{O(n)}. This corresponds to O(n) bits per weight, which is repeated O(n²) times, and thus yields at most M = O(n³) bits in the representation. Hence, the convergence-time lower bound 2^{Ω(n)} of the binary-state clock network can be expressed as 2^{Ω(M^{1/3})}.

In the analog-state case, the underlying weights of the clock with the parameter B = 1 will be adjusted further, as follows. For every analog unit v in the counter network, a feedback weight w(v, v) = 1 + ε is introduced, and also each bias is increased by ε, except for neuron c_0, whose bias is explicitly set to w(0, c_0) = ε, where ε > 0 is a small (e.g., ε < 0.1) optional parameter controlling the convergence rate of the analog clock network. Now, starting the counter computation at the zero initial state, the state of unit c_0 gradually grows toward saturation at value 1 because of its bias w(0, c_0) = ε and feedback w(c_0, c_0) = 1 + ε. It can be verified that at the least discrete-time instant greater than log(2 − ε)/log(1 + ε), its output is greater than 1 − ε, which makes the excitation of the next counter unit c_1 positive due to its bias w(0, c_1) = −1 + ε. Now c_1 starts to grow because of its feedback w(c_1, c_1) = 1 + ε, which shortly completes the saturation of c_0 at 1. This trick of gradual state transition from 0 to 1 is applied repeatedly throughout the analog clock computation by using the introduced feedbacks 1 + ε. It follows that the respective transition for c_0 takes at least Ω(1/log(1 + ε)) discrete-time steps, which, together with the fact that the unit c_0 fires 2^k times before the (k + 1)-bit clock converges, provides the lower bound Ω(2^k/log(1 + ε)) ≥ Ω(2^{n/3}/ε) on the convergence time of analog symmetric networks with n neurons. We shall express this bound in terms of the size M in bits of the network representation.
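The claimed growth rate of c_0 can be replayed directly: before saturation, a unit with bias ε and feedback 1 + ε started at 0 satisfies y_t = (1 + ε)^t − 1, so its output first exceeds 1 − ε at the least integer t greater than log(2 − ε)/log(1 + ε). A quick check with the saturated-linear activation (hypothetical Python):

```python
import math

def saturated_linear(x):
    # the saturated-linear activation of equation 2.3, clipped to [0, 1]
    return min(1.0, max(0.0, x))

eps = 0.1
y, t = 0.0, 0
while y <= 1 - eps:               # iterate y <- sigma(eps + (1 + eps) * y)
    y = saturated_linear(eps + (1 + eps) * y)
    t += 1

bound = math.log(2 - eps) / math.log(1 + eps)   # ~6.73 for eps = 0.1
assert t == math.floor(bound) + 1               # first crossing right after the bound
print(t)  # 7
```

Shrinking ε slows this transition as Ω(1/log(1 + ε)) = Ω(1/ε), which is exactly the knob the lower-bound construction turns.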
The length of the integer part of the weight parameter representation, excluding the fractions ε, has already been upper-bounded by O(n³) bits in the binary-state case considered above. In addition, the biases and feedbacks of the n units include the fraction ε, and taking this into account requires Θ(n log(1/ε)) additional bits, say, at least κn log(1/ε) bits for some constant κ > 0. By choosing an appropriate ε = 2^{−f(n)/(κn)}, in which f is a continuous increasing function whose inverse is defined as f^{−1}(μ) = μ/g(μ), where g is an arbitrary function such that g(μ) = Ω(μ^{2/3}) (implying f(n) = Ω(n³)) and g(μ) = o(μ), we get M = Θ(f(n)), and especially M ≥ f(n) from M ≥ κn log(1/ε). Finally, the convergence time Ω(2^{n/3}/ε) can be translated to Ω(2^{f(n)/(κn) + n/3}) = 2^{Ω(f(n)/n)}, which can be rewritten as 2^{Ω(M/f^{−1}(M))} = 2^{Ω(g(M))}, since f(n) = Ω(M) from M = Θ(f(n)) and f^{−1}(M) ≥ n from M ≥ f(n).

On the other hand, for the upper bound, consider a binary-state Hopfield network with an M-bit representation that converges after T(M) updates. A major part of this M-bit representation consists of m binary encodings of weights w_1, …, w_m of the corresponding lengths M_1, …, M_m, where Σ_{r=1}^m M_r = Θ(M). Clearly there must be at least T(M) different energy
levels corresponding to the states visited during the computation. Thus, the underlying weights must produce at least S ≥ T(M) different sums Σ_{r∈A} w_r for A ⊆ {1, …, m}, where w_r for r ∈ A agrees with w_{ji} for y_i = y_j = 1 in equation 2.4. So it is sufficient to upper-bound the number of different sums over m weights whose binary representations form a Θ(M)-bit string altogether. Observe that all the S = 2^m sums over subsets of the m weights 1, 2, 4, 8, …, 2^{m−1} are different. At the same time, these weights altogether have the least binary representation length, m(m + 1)/2, over all the sets of weights that generate 2^m different sums. Since the representation length is at most M, we get an m = O(M^{1/2}) upper bound on the number of weights, which yields T(M) ≤ 2^{O(M^{1/2})}.

5 The Minimum Energy Problem
In this section we investigate the important MIN ENERGY or GROUND STATE problem of finding a network state with minimal energy for a given symmetric network. This problem is of special interest because many hard combinatorial optimization problems have been heuristically solved by minimizing the energy in Hopfield nets (Hopfield & Tank, 1985; Aarts & Korst, 1989; Liao, 1999; Mańdziuk, 2000; Sheu, Lee, & Chang, 1991; Wu & Tam, 1999). This issue is also important for the Ising spin glass model in statistical physics (Barahona, 1982). It follows from these heuristic reductions of optimization problems to MIN ENERGY that the problem is computationally hard. In particular, for binary networks, the decision version of MIN ENERGY—that is, the question of whether there exists a network state having energy less than a prescribed value—is NP-complete (Barahona, 1982). On the other hand, it is known that MIN ENERGY has polynomial time algorithms for binary Hopfield nets whose architectures are planar lattices (Bieche, Maynard, Rammal, & Uhry, 1980) or planar graphs (Barahona, 1982).

We address here two aspects of the MIN ENERGY problem. In section 5.1 we provide a polynomial time approximation algorithm for the problem in the case of binary nets. In section 5.2 we analyze the complexity of the MIN ENERGY problem for analog Hopfield nets, which is the model whose continuous-time version is usually used for practical optimizations. Surprisingly, to the best of our knowledge this problem has been unresolved so far. This is possibly because there is a slight technical complication in the issue, related to the fact that, unlike binary networks, analog networks can also converge to an interior point of their state-space. To prove the NP-hardness result, we need to ensure that the minimum energy values of the constructed networks are actually reached close to extremum points of their respective state-spaces.
Our proof applies to all continuous nondecreasing sigmoidal neuron activation functions of practical interest.
5.1 Polynomial Time Approximation for Binary Nets. Recall that in definition 2.4 of the energy function for binary nets, we assumed, for simplicity, that w_jj = 0 and w_j0 = 0 for j = 1, …, n. In addition, without loss of generality (Parberry, 1994), we shall work throughout this section with bipolar states {−1, 1} of neurons instead of the binary ones {0, 1} from equation 2.2, where 0 is now replaced by −1. Perhaps the most direct and frequently used reduction to MIN ENERGY is from the MAX CUT problem (see, e.g., Bertoni & Campadelli, 1994). In MAX CUT we are given an undirected graph G = (V, A) with an integer edge cost function c: A → Z, and we want to find a cut V_1 ⊆ V that maximizes the cut size,

    c(V_1) = Σ_{{i,j}∈A; i∈V_1, j∉V_1} c({i,j}) − Σ_{{i,j}∈A; c({i,j})<0} c({i,j}).   (5.1)
Note that the standard MAX CUT problem is generalized here by also allowing negative edge weights, which are needed for the opposite reduction from MIN ENERGY to MAX CUT. Recently, a new randomized approximation algorithm with a high performance guarantee α = 0.87856 for this MAX CUT formulation has been proposed (Goemans & Williamson, 1995) and later derandomized (Mahajan & Ramesh, 1995), a fact we shall exploit for approximating the MIN ENERGY problem. We shall observe that MIN ENERGY can be approximated in polynomial time within absolute error less than 0.243W, where W is the network weight (see equation 2.5). For W = O(n²)—for example, in Hopfield nets with n neurons and constant weights—this result matches the lower bound Ω(n^{2−ε}), an absolute error that no approximate polynomial time MIN ENERGY algorithm can guarantee for any ε > 0 (Bertoni & Campadelli, 1994), unless P = NP. In addition, an approximate polynomial time MIN ENERGY algorithm with absolute error O(n/log n) is also known in the special case of Hopfield nets whose architectures are two-level grids (Bertoni, Campadelli, Gangai, & Posenato, 1997).

Theorem 3. The MIN ENERGY problem for binary Hopfield nets can be approximated in polynomial time within absolute error less than 0.243W, where W is the network weight (see equation 2.5).
Proof. We first recall the well-known simple reduction between the MIN ENERGY and MAX CUT problems. For a binary Hopfield network with architecture G and weights w(i, j), we can easily define the corresponding instance G = (V, A), c of MAX CUT with edge costs c({i,j}) = −w(i, j) for {i,j} ∈ A. We shall show that any cut V_1 ⊆ V of G corresponds to a Hopfield net state y ∈ {−1, 1}^n, where y_i = 1 if i ∈ V_1 and y_i = −1 for i ∈ V \ V_1, so that the respective cut size c(V_1) is related to the underlying energy E(y). Thus, the energy function of equation 2.4 can be expressed in terms of the
cut size of equation 5.1 as follows:

    E(y) = −(1/2) Σ_{i,j∈V} w(i,j) y_i y_j
         = −(1/2) Σ_{y_i=y_j} w(i,j) + (1/2) Σ_{y_i≠y_j} w(i,j)
         = −(1/2) Σ_{i,j∈V} w(i,j) + Σ_{y_i≠y_j} w(i,j)
         = −(1/2) Σ_{w(i,j)<0} w(i,j) − (1/2) Σ_{w(i,j)>0} w(i,j) + Σ_{y_i≠y_j} w(i,j)
         = −(1/2) Σ_{w(i,j)<0} w(i,j) + (1/2) Σ_{w(i,j)>0} w(i,j) − Σ_{w(i,j)>0} w(i,j) + Σ_{y_i≠y_j} w(i,j)
         = W + 2 Σ_{{i,j}∈A; c({i,j})<0} c({i,j}) − 2 Σ_{{i,j}∈A; i∈V_1, j∉V_1} c({i,j})
         = W − 2c(V_1).   (5.2)
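Equation 5.2 can be checked by brute force on a small random instance. The sketch below is hypothetical Python; W is computed as the sum of |w(i,j)| over unordered pairs, which is our reading of equation 2.5.

```python
import itertools
import random

random.seed(3)
n = 6
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
edges = [e for e in pairs if random.random() < 0.6]
w = {e: random.choice([-3, -2, -1, 1, 2, 3]) for e in edges}   # symmetric weights
c = {e: -w[e] for e in edges}                                  # edge costs c = -w
W = sum(abs(v) for v in w.values())                            # network weight

def energy(y):
    # E(y) = -(1/2) sum over ordered pairs = -(sum over edges) of w(i,j) y_i y_j
    return -sum(w[(i, j)] * y[i] * y[j] for (i, j) in edges)

def cut_size(V1):
    # cut size of equation 5.1, with negative costs subtracted off
    cut = sum(c[(i, j)] for (i, j) in edges if (i in V1) != (j in V1))
    return cut - sum(v for v in c.values() if v < 0)

for y in itertools.product([-1, 1], repeat=n):
    V1 = {i for i in range(n) if y[i] == 1}
    assert energy(y) == W - 2 * cut_size(V1)                   # equation 5.2
```

Since every bipolar state passes the assertion, minimizing E over {−1, 1}^n and maximizing c(V_1) over cuts are the same search, which is the content of the reduction.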
It follows from equation 5.2 that the minimum energy state corresponds to the maximum cut. Now, the approximate polynomial time algorithm from Goemans and Williamson (1995) can be employed to solve an instance G = (V, A), c of the MAX CUT problem, which provides a cut V_1 whose size c(V_1) ≥ αc* is guaranteed to be at least α = 0.87856 times the maximum cut size c*. Let cut V_1 correspond to the Hopfield network state y, which implies c(V_1) = (1/2)(W − E(y)) from equation 5.2. Hence, we get the guarantee W − E(y) ≥ α(W − E*), where E* is the minimum energy corresponding to the maximum cut c*, which leads to E(y) − E* ≤ (1 − α)(W − E*). Since |E*| ≤ W, we obtain the desired guarantee for the absolute error: E(y) − E* ≤ (1 − α)2W < 0.243W.

5.2 Hardness Result for Analog Nets. In this section we consider the MIN ENERGY(σ) problem for analog Hopfield nets with a general activation function σ. The energy function of equation 2.4 can be generalized for analog networks with nonnegative feedbacks w_jj ≥ 0 (j = 1, …, n) as follows (Koiran, 1994):
    E_σ(y) = −(1/2) Σ_{j=1}^n Σ_{i=1}^n w_ji y_i y_j − Σ_{j=1}^n w_j0 y_j + Σ_{j=1}^n ∫_0^{y_j} σ^{−1}(y) dy,   (5.3)
where the activation function σ is continuous and strictly increasing on an interval [α, β] (α < β, with possibly α = −∞ or β = +∞), and constant outside, that is,

    σ(ξ) = a for ξ ≤ α,   σ(ξ) = b for ξ ≥ β.   (5.4)
For α = −∞ or β = +∞, we require that lim_{ξ→−∞} σ(ξ) = a or lim_{ξ→+∞} σ(ξ) = b, respectively. The inverse function σ^{−1} in equation 5.3 may not be defined on the whole interval [0, y_j]; in that case we set it to zero outside of (a, b). In the sequel, we shall also assume that σ is differentiable on (α, β) (thus, we know that σ′(ξ) > 0 for ξ ∈ (α, β)) and that the respective integrals in equation 5.3 are bounded, that is,

    sup_{y_j∈[a,b]} ∫_0^{y_j} σ^{−1}(y) dy < I_σ,   (5.5)
where 0 < I_σ < +∞. These conditions are satisfied by all the commonly used continuous activation functions (e.g., the hyperbolic tangent and the saturated-linear map). The decision version of the MIN ENERGY(σ) problem for analog Hopfield nets is the question of whether a given analog network possesses a state y ∈ [a, b]^n for which the energy E_σ(y) has value less than a given constant K. Similarly as in the proof of theorem 3, we shall show this problem to be NP-hard by reduction from the SIMPLE MAX CUT problem, which is known to be NP-complete (Garey, Johnson, & Stockmeyer, 1976). In SIMPLE MAX CUT, we are given an undirected graph G = (V, A) and a positive integer k, and we want to decide whether there exists a cut V_1 ⊆ V whose size c(V_1)—the number of edges with exactly one endpoint in V_1 (corresponding to constant unit cost in MAX CUT)—is at least k.

The idea of the proof is to exploit the reduction from the bipolar case by forcing the analog energy of equation 5.3 to achieve its minimum at a state close to one of the extremal points {a, b}^n of the state-space. For this purpose we prove the following lemma concerning a saturation tendency of analog symmetric networks with large positive feedbacks.

Lemma 1. Let (η, θ) be an interval on which δ = inf_{ξ∈(η,θ)} σ′(ξ) > 0 (i.e., σ is certainly not saturated on (η, θ)), and let w_jj > 1/δ for every j = 1, …, n. If y* = (y*_1, …, y*_n) ∈ [a, b]^n is a local minimum of the energy function of equation 5.3, then y*_j ∉ (σ(η), σ(θ)) for every j = 1, …, n.
Proof. Suppose that y* ∈ [a, b]^n is a local minimum of the energy function of equation 5.3. Clearly, y*_j ∉ (σ(η), σ(θ)) for y*_j ∈ {a, b}. For y*_j ∈ (a, b), consider the function E_j(y) = E_σ(y*_1, …, y*_{j−1}, y, y*_{j+1}, …, y*_n) of one variable y. Thus, E_j′(y*_j) = 0 and

    E_j″(y*_j) = (∂²E_σ/∂y_j²)(y*) = −w_jj + 1/σ′(σ^{−1}(y*_j)) ≥ 0,   (5.6)
since y* is a local minimum of E_σ. It follows from equation 5.6 that w_jj ≤ 1/σ′(σ^{−1}(y*_j)), and hence y*_j ∉ (σ(η), σ(θ)).

Now we shall prove the hardness result for the analog MIN ENERGY(σ) problem, which seems to be the first formal verification of this fact, although (continuous-time) analog Hopfield nets have often been exploited to solve hard optimization problems heuristically (Hopfield & Tank, 1985; Aarts & Korst, 1989; Liao, 1999; Mańdziuk, 2000; Sheu, Lee, & Chang, 1991; Wu & Tam, 1999).

Theorem 4. The MIN ENERGY(σ) problem for analog Hopfield nets is NP-hard.

Proof. We reduce the SIMPLE MAX CUT problem to MIN ENERGY(σ).
Given a SIMPLE MAX CUT instance G = (V, A), k with n = |V| vertices and m = |A| edges, a corresponding MIN ENERGY(σ) instance—that is, an analog Hopfield net N with activation function σ and a prescribed energy level K—is constructed in polynomial time. The architecture of N is G, except that each neuron in N has an additional feedback connection. The weights and biases of N are defined as follows:

    w_ji = −8C/(b − a)²  for {i,j} ∈ A,   w_jj = u  for j = 1, …, n,   (5.7)
    w_j0 = 4C(a + b) deg_G(j)/(b − a)²,   j = 1, …, n,   (5.8)

where deg_G(j) = |{i ∈ V; {i,j} ∈ A}| is the degree of vertex j in G,

    C = n(I_σ + max(|a|², |b|²) u/2),   (5.9)
and u is chosen so that

    u > 1/δ,  where  δ = inf_{ξ∈(σ^{−1}(a+ε), σ^{−1}(b−ε))} σ′(ξ) > 0,   (5.10)

and ε > 0 satisfies

    ε < (b − a)/(8m).   (5.11)
Finally, the energy level is prescribed as

    K = 2C(−2k + 1 − 4abm/(b − a)²).   (5.12)
In the proof of theorem 3, the reduction from MIN ENERGY to MAX CUT is actually a one-to-one correspondence, and the respective inverse reduction from MAX CUT to MIN ENERGY—restricted, for example, even to SIMPLE MAX CUT instances—provides an NP-completeness proof for bipolar Hopfield nets. As we have mentioned, we exploit this approach by reducing the underlying analog minimum energy problem to the bipolar case. For this purpose, the following linear transformation of the analog neuron state y_j ∈ [a, b] to y′_j ∈ [−1, 1], which preserves the network weights except for biases (Parberry, 1994), is employed:

    y′_j = 2(y_j − a)/(b − a) − 1.   (5.13)
With this transformation, the energy function of equation 5.3 can be rewritten as

    E_σ(y) = 2C(−(1/2) Σ_{j=1}^n Σ_{i=1, i≠j}^n w′_ji y′_i y′_j − ((a + b)/(b − a))² m)
             + Σ_{j=1}^n ∫_0^{y_j} σ^{−1}(y) dy − (1/2) Σ_{j=1}^n w_jj y_j²,   (5.14)

where the introduced weights

    w′_ji = −1 for {i,j} ∈ A,  and  w′_ji = 0 otherwise,   (5.15)
coincide with those from the bipolar case. Furthermore, consider a local minimum y* = (y*_1, …, y*_n) ∈ [a, b]^n of the energy E_σ. It follows from lemma 1, whose assumptions are satisfied by equation 5.10, that y*_j ∈ [a, a + ε] ∪ [b − ε, b] for every j = 1, …, n. Hence, y′*_j ∈ [−1, −1 + 2ε/(b − a)] ∪ [1 − 2ε/(b − a), 1], which implies

    |y′*_i y′*_j| > 1 − 4ε/(b − a).   (5.16)
We introduce bipolar states ȳ*_j ∈ {−1, 1} into equation 5.14 by rounding the corresponding analog ones, that is, ȳ*_j = −1 for y′*_j ∈ [−1, −1 + 2ε/(b − a)] and ȳ*_j = 1 for y′*_j ∈ [1 − 2ε/(b − a), 1] (j = 1, …, n). By using equations 5.16, 5.15, and 5.11, the rounding error for the underlying term in equation 5.14 can be estimated at a local minimum y* as follows:

    |(1/2) Σ_{j=1}^n Σ_{i=1, i≠j}^n w′_ji (ȳ*_i ȳ*_j − y′*_i y′*_j)| < 4εm/(b − a) < 1/2.   (5.17)
Also, the absolute value of the second row of equation 5.14 can be upper-bounded using equations 5.5 and 5.9:

    |Σ_{j=1}^n ∫_0^{y_j} σ^{−1}(y) dy − (1/2) Σ_{j=1}^n w_jj y_j²| < C.   (5.18)
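For the saturated-linear activation on [a, b] = [0, 1], σ^{−1}(y) = y on the relevant range, so the integral in equation 5.18 is y_j²/2, and any I_σ > 1/2 satisfies equation 5.5. Under these assumptions the bound 5.18 follows from the choice of C in equation 5.9, as the following spot check illustrates (hypothetical Python; the concrete values of I_σ, n, and u are ours, not the paper's):

```python
import random

random.seed(7)
a, b = 0.0, 1.0
I_sigma = 0.51                    # any I_sigma > 1/2 satisfies equation 5.5 here
n, u = 10, 5.0                    # u is the common feedback weight w_jj
C = n * (I_sigma + max(abs(a) ** 2, abs(b) ** 2) * u / 2)   # equation 5.9

for _ in range(100):
    y = [random.uniform(a, b) for _ in range(n)]
    integral_part = sum(t * t / 2 for t in y)        # sum_j of y_j^2 / 2
    feedback_part = 0.5 * sum(u * t * t for t in y)  # (1/2) sum_j w_jj y_j^2
    assert abs(integral_part - feedback_part) < C    # equation 5.18
```

In fact, in this instance the left-hand side is ((u − 1)/2) Σ_j y_j² ≤ n(u − 1)/2, which sits strictly below C = n(I_σ + u/2); the bound holds for every y ∈ [a, b]^n, not only at local minima.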
Moreover, the result of equation 5.2 for the bipolar case can be applied here, in which the network weight W now equals m and V*_1 is the cut corresponding to the bipolar network state ȳ* = (ȳ*_1, …, ȳ*_n) ∈ {−1, 1}^n. Thus, it follows from equations 5.17 and 5.18 that the value of the energy function given by equation 5.14 at the local minimum y* belongs to the following interval:
    2C(m − 2c(V*_1) − 1/2 − ((a + b)/(b − a))² m) − C
        < E_σ(y*) <
    2C(m − 2c(V*_1) + 1/2 − ((a + b)/(b − a))² m) + C,   (5.19)
which can be rewritten as

    2C(−2c(V*_1) − 1 − 4abm/(b − a)²) < E_σ(y*) < 2C(−2c(V*_1) + 1 − 4abm/(b − a)²).   (5.20)
According to equation 5.20, the analog energy value E_σ(y*) at a local minimum y* uniquely determines the integer cut size c(V*_1) of the cut V*_1 associated with ȳ*. Now, the correctness of the reduction from SIMPLE MAX CUT to MIN ENERGY(σ) can easily be verified. Suppose there exists a cut V_1 of G whose size is at least k. Then the energy (see equation 5.14) of the corresponding analog state y ∈ {a, b}^n of N, which is defined as y_j = b for j ∈ V_1 and y_j = a for j ∈ V \ V_1, can be upper-bounded by using equations 5.2 and 5.18 as follows:

    E_σ(y) < 2C(m − 2k − ((a + b)/(b − a))² m) + C < K.   (5.21)
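The final step of inequality 5.21 uses the identity m − ((a + b)/(b − a))² m = −4abm/(b − a)², which leaves a margin of exactly C below the threshold K of equation 5.12. A numerical spot check of this algebra (hypothetical Python; the sampled ranges are arbitrary):

```python
import random

random.seed(2)
for _ in range(100):
    a = random.uniform(-3, 3)
    b = a + random.uniform(0.5, 3)          # ensure a < b
    m = random.randint(1, 50)               # number of edges
    k = random.randint(1, m)                # required cut size
    C = random.uniform(1, 100)              # C > 0, as in equation 5.9

    lhs = 2 * C * (m - 2 * k - ((a + b) / (b - a)) ** 2 * m) + C   # bound 5.21
    K = 2 * C * (-2 * k + 1 - 4 * a * b * m / (b - a) ** 2)        # equation 5.12

    # (a+b)^2 - 4ab = (b-a)^2, so K - lhs collapses to exactly C
    assert abs((K - lhs) - C) < 1e-7 * (1 + abs(K) + abs(lhs))
    assert lhs < K
```

Since the gap K − lhs equals C for every admissible choice of a, b, m, k, and C, a cut of size at least k always pushes the constructed network's energy strictly below K.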
On the other hand, assume that an analog state y ∈ [a, b]^n of N with energy E_σ(y) < K exists. Then there is also a local minimum y* ∈ [a, b]^n of E_σ with energy E_σ(y*) ≤ E_σ(y) < K. It follows from equations 5.12 and 5.20 that c(V*_1) ≥ k.
6 Turing Universality of Finite Analog Hopfield Nets
In this section we deal with the computational power of finite analog-state discrete-time recurrent neural networks with the saturated-linear activation function of equation 2.3. For asymmetric analog networks, the computational power is known to increase with the Kolmogorov complexity of their real weights (Balcázar, Gavaldà, & Siegelmann, 1997). With integer weights, such networks are equivalent to finite automata (Horne & Hush, 1996; Indyk, 1995; Šíma & Wiedermann, 1998), while with rational weights, arbitrary Turing machines can be simulated (Indyk, 1995; Siegelmann & Sontag, 1995). With arbitrary real weights, such networks can even have "super-Turing" computational capabilities, so that, for example, polynomial time computations correspond to the complexity class P/poly, and all languages can be recognized within exponential time (Siegelmann & Sontag, 1994). On the other hand, any amount of analog noise reduces the computational power of this model to that of finite automata (Casey, 1996; Maass & Orponen, 1998).

For finite symmetric networks, only the computational power of binary-state Hopfield nets has been fully characterized. They recognize the so-called Hopfield languages (Šíma, 1995), which form a proper subclass of the regular languages; hence such networks are less powerful than finite automata. Hopfield languages can also be faithfully recognized by analog symmetric neural networks (Maass & Orponen, 1998; Šíma, 1997), and this provides a lower bound on the computational power of such networks. A natural question then concerns improvements of this lower bound. Could this model, too, be Turing universal; that is, can a Turing machine simulation be achieved with symmetric networks and rational weights, similarly as in the asymmetric case (Indyk, 1995; Siegelmann & Sontag, 1995)? The main obstacle here is that under fully parallel updates, any analog Hopfield net with rational weights converges to a limit cycle of length at most two (Koiran, 1994).
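Koiran's convergence result can be illustrated on a toy instance: a two-neuron symmetric analog net with the saturated-linear activation that settles into a limit cycle of length exactly two. The sketch below is hypothetical Python with weights of our own choosing; it demonstrates the behavior, not the general proof.

```python
def sat(x):
    # saturated-linear activation of equation 2.3
    return min(1.0, max(0.0, x))

# symmetric analog net: w(1,2) = w(2,1) = -1, biases 0.5, fully parallel updates
y = (0.0, 0.0)
traj = [y]
for _ in range(20):
    y = (sat(0.5 - y[1]), sat(0.5 - y[0]))
    traj.append(y)

# the trajectory alternates between (0, 0) and (0.5, 0.5): a length-two limit cycle
assert traj[-1] == traj[-3] and traj[-1] != traj[-2]
```

It is exactly this guaranteed settling that rules out a direct Turing machine simulation and motivates the external clock introduced below.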
Thus, the only possibility for simulating Turing machines would be to exploit finer and finer distinctions among a sequence of rational network states converging to a limit cycle. Such a simulation seems tricky at best, if possible at all. A more reasonable approach is to augment the network with an external clock that produces an infinite sequence of binary pulses, thus providing it with an "energy source" that can be used, for example, for simulating an asymmetric analog network similarly as in theorem 1. Indeed, we now prove that the computational power of analog Hopfield nets with an external clock is the same as that of asymmetric analog networks. Especially for rational weights, this implies that such networks are Turing universal. The following theorem also fully characterizes those infinite binary sequences produced by the external clock that prevent the Hopfield network from converging. In this way, we obtain a complete theoretical characterization of the symmetry of weights in analog Hopfield nets, in the sense that the computational
power of analog asymmetric networks equals that of symmetric ones plus an oscillator of a certain type.

Theorem 5. Let N′ be an analog-state recurrent neural network with real asymmetric weights and n neurons working in fully parallel mode. Then there exists an analog Hopfield net N with 3n + 8 units whose real weights have the same maximum Kolmogorov complexity as the weights in N′, and which simulates N′, provided it receives as an additional external input any infinite binary sequence that contains infinitely many substrings of the form b x b̄ ∈ {0, 1}³, where b̄ ≠ b. Moreover, the external input must have this property to prevent N from converging.
Proof. First observe that an infinite binary sequence produced by the external input (clock) c that does not satisfy the assumption of the theorem must be of the form u(b₁b₂)^ω, where b₁, b₂ ∈ {0, 1} and u ∈ {0, 1}* is a finite prefix, which clearly cannot prevent the network from converging to a limit cycle of length two. On the other hand, we shall prove that if the sequence meets the respective condition, then it must contain infinitely many substrings of the form 1x0 (x ∈ {0, 1}), since these strings necessarily accompany infinitely many substrings 0x1. Thus, consider two subsequent occurrences of 0x1. For x = 1, after 011, possibly followed by several 1s, at least one 0 must appear due to the next occurrence of 0x1, which gives 110 as required. For x = 0, the string 001 is followed either by a 1, which means the previous case with 011 applies, or by a 0, possibly succeeded by several occurrences of 10, which is followed either by a desired 0 or by 11, which again leads to 011.

The symmetric simulation of an asymmetric network N′ consisting of n analog neurons is very similar to the binary one (see theorem 1). Thus, each neuron j in N′ is simulated by three units p_j, q_j, r_j, which are controlled by three neurons f, g, h (corresponding to c_0, c̄_0 in the binary case) whose binary states are generated by a small, symmetric subnetwork of five auxiliary units a, d, e, r, s, transforming the binary external input signal from c into a well-controlled binary sequence. This gives the desired 3n + 8 units of the simulating analog Hopfield net N. The situation is depicted in Figure 3, including the definition of the symmetric weights, where U is introduced in equation 3.1 and B = (4W + 1)n + 7 in this case. Every binary external input bit is copied from clock c to r and further to s, and therefore the states of the neurons s, r, c store the last three bits of the input sequence, respectively.
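The combinatorial claim of the first paragraph of the proof—that infinitely many substrings 0x1 force infinitely many substrings 1x0—can be probed on finite windows with a small script (hypothetical Python; `occurrences` is an illustrative helper, not part of the construction):

```python
def occurrences(s, first, last):
    # positions i with s[i] == first and s[i + 2] == last (the middle bit is free)
    return [i for i in range(len(s) - 2) if s[i] == first and s[i + 2] == last]

# in a sequence of the form u(b1 b2)^omega, substrings b x b-bar can occur
# only inside the finite prefix u: a 2-periodic tail has s[i] == s[i + 2]
assert occurrences("0010" + "01" * 20, "0", "1")[-1] < 4
assert occurrences("01" * 20, "1", "0") == []

# a sequence that keeps producing 0x1 also keeps producing 1x0
s = "001110" * 30
rises = occurrences(s, "0", "1")   # substrings of the form 0x1
falls = occurrences(s, "1", "0")   # substrings of the form 1x0
assert len(rises) > 20 and len(falls) > 20
```

The 1x0 occurrences are exactly what the detector neuron a of the construction below looks for in the last three input bits.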
Neuron a detects the underlying strings of the form 1x0 (x ∈ {0, 1}) at the input; that is, it is active iff s is active and c is passive. It may happen that two such strings immediately follow each other (also notice that it is impossible for such a string to appear again at the next step but one). This case is indicated by the activity of both neurons a and d, where d copies the state from a. Thus, unit e copies the state 1 from a iff d is passive. Hence, the neuron e produces a "noiseless" sequence of zeros containing infinitely many substrings 100, each being exploited for
Figure 3: Analog symmetric simulation of analog neuron j.
the simulation of one computational step of N′, similarly as in the proof of theorem 1. Namely, q_j stores the old analog state y_j^(t−1) by its unit feedback according to equation 2.3, and also h remains initially active, preserving the stability of p_j, q_j, r_j. After e fires, the unit h becomes passive and f becomes active, which controls the computation of the new analog state y_j^(t) by p_j via the original weights w(q_i, p_j) = w′(i, j) and the bias w(f, p_j) = w′(0, j). Then the control signal is further copied from f to g, which enables r_j to receive the state y_j^(t). Finally, h becomes active again, and the state of q_j is updated to y_j^(t), which is kept until e fires again to proceed to the next simulation step.

7 Conclusions
With the results presented in this article, we now seem to have developed quite a good understanding of the computational properties of discrete-time symmetric Hopfield nets. It is somewhat surprising that these networks turn out to be computationally essentially as powerful as general asymmetric networks, despite their Lyapunov-function-constrained dynamics. As we have seen, in the binary-state case, symmetric networks can simulate asymmetric ones with only a linear increase in network size (see section 3), and in the analog-state case, even finite symmetric networks are Turing universal, provided they are supplied with a simple external clock that prevents them from converging (see section 6). In addition, we have verified that the MIN ENERGY problem remains NP-hard also for analog networks (see section 5), a fact that is not immediately obvious because of the possibility of analog networks converging to states in the interior of the state-space, which complicates the combinatorial coding necessary for NP-hardness proofs. It is also interesting to note that analog networks can in some cases be more efficient with respect to their encoding size than binary ones (see section 4). This result suggests that analog models of computation may be worth investigating more for their efficiency gains than for their (theoretical) capability of arbitrary-precision real number computation. Very little is currently known about the computational complexity aspects of analog computation, and the area clearly merits more intensive study.

Acknowledgments
J. Š. was supported by GA AS CR grant no. B2030007, and T. A.-P. was supported by Academy of Finland grant no. 37115/96.

References

Aarts, E., & Korst, J. (1989). Simulated annealing and Boltzmann machines. Chichester: Wiley.
Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147–169.
Balcázar, J. L., Díaz, J., & Gabarró, J. (1995). Structural complexity (Vol. 1, 2nd ed.). Berlin: Springer-Verlag.
Balcázar, J. L., Gavaldà, R., & Siegelmann, H. T. (1997). Computational power of neural networks: A characterization in terms of Kolmogorov complexity. IEEE Transactions on Information Theory, 43, 1175–1183.
Barahona, F. (1982). On the computational complexity of Ising spin glass models. Journal of Physics A, 15, 3241–3253.
Bertoni, A., & Campadelli, P. (1994). On the approximability of the energy function. In Proceedings of the ICANN'94 Conference (pp. 1157–1160). Berlin: Springer-Verlag.
Bertoni, A., Campadelli, P., Gangai, C., & Posenato, R. (1997). Approximability of the ground state problem for certain Ising spin glasses. Journal of Complexity, 13, 323–339.
Bieche, I., Maynard, R., Rammal, R., & Uhry, J. P. (1980). On the ground states of the frustration model of a spin glass by a matching method of graph theory. Journal of Physics A, 13, 2553–2576.
Casey, M. (1996). The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8, 1135–1178.
Chiueh, T.-D., & Goodman, R. (1988). A neural network classifier based on coding theory. In D. Anderson (Ed.), Proceedings of the 1987 IEEE Conference on Neural Information Processing Systems—Natural and Synthetic (pp. 174–183). New York: American Institute of Physics.
Computational Complexity of Hopfield Nets
2987
Farhat, N. H., Psaltis, D., Prata, A., & Paek, E. (1985). Optical implementation of the Hopfield model. Applied Optics, 24, 1469–1475.
Floréen, P. (1991). Worst-case convergence times for Hopfield memories. IEEE Transactions on Neural Networks, 2, 533–535.
Floréen, P., & Orponen, P. (1994). Complexity issues in discrete Hopfield networks (Res. Rep. No. A-1994-4). Helsinki: Department of Computer Science, University of Helsinki.
Garey, M. R., Johnson, D. S., & Stockmeyer, L. (1976). Some simplified NP-complete graph problems. Theoretical Computer Science, 1, 237–267.
Goemans, M. X., & Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42, 1115–1145.
Gold, B. (1986). Hopfield model applied to vowel and consonant discrimination. In J. Denker (Ed.), Neural networks for computing, AIP Conf. Proc. 151 (pp. 158–164). New York: American Institute of Physics.
Goles, E., Fogelman, F., & Pellegrin, D. (1985). Decreasing energy functions as a tool for studying threshold networks. Discrete Applied Mathematics, 12, 261–277.
Goles, E., & Martínez, S. (1989). Exponential transient classes of symmetric neural networks for synchronous and sequential updating. Complex Systems, 3, 589–597.
Goles, E., & Matamala, M. (1996). Symmetric discrete universal neural networks. Theoretical Computer Science, 168, 405–416.
Haykin, S. (1999). Neural networks: A comprehensive foundation (2nd ed.). Upper Saddle River, NJ: Prentice Hall.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Sciences (Vol. 79, pp. 2554–2558).
Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. In Proceedings of the National Academy of Sciences (Vol. 81, pp. 3088–3092).
Hopfield, J. J., & Tank, D. W. (1985). "Neural" computation of decisions in optimization problems. Biological Cybernetics, 52, 141–152.
Horne, B. G., & Hush, D. R. (1996). Bounds on the complexity of recurrent neural network implementations of finite state machines. Neural Networks, 9, 243–252.
Indyk, P. (1995). Optimal simulation of automata by neural nets. In Proceedings of the 12th Annual Symposium on Theoretical Aspects of Computer Science, Lecture Notes in Comp. Sci. 900 (pp. 337–348). Berlin: Springer-Verlag.
Koiran, P. (1994). Dynamics of discrete time, continuous state Hopfield networks. Neural Computation, 6, 459–468.
Komlós, J., & Paturi, R. (1988). Convergence results in an associative memory model. Neural Networks, 1, 239–250.
Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, 18, 49–60.
Liao, L.-Z. (1999). A recurrent neural network for N-stage optimal control problems. Neural Processing Letters, 10, 195–200.
J. Šíma, P. Orponen, and T. Antti-Poika
Maass, W., & Orponen, P. (1998). On the effect of analog noise in discrete-time analog computations. Neural Computation, 10, 1071–1095.
Mahajan, S., & Ramesh, H. (1995). Derandomizing semidefinite programming based approximation algorithms. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science (pp. 162–163). Los Alamitos, CA: IEEE.
Mańdziuk, J. (2000). Optimization with the Hopfield network based on correlated noises: Experimental approach. Neurocomputing, 30, 301–321.
Orponen, P. (1996). The computational power of discrete Hopfield nets with hidden units. Neural Computation, 8, 403–415.
Parberry, I. (1994). Circuit complexity and neural networks. Cambridge, MA: MIT Press.
Park, J., Cho, H., & Park, D. (1999). On the design of BSB neural associative memories using semidefinite programming. Neural Computation, 11, 1985–1994.
Pawlicki, T., Lee, D.-S., Hull, J., & Srihari, S. (1988). Neural networks and their application to handwritten digit recognition. In Proceedings of the IEEE International Conference on Neural Networks (Vol. II, pp. 63–70).
Poljak, S., & Sůra, M. (1983). On periodical behaviour in societies with symmetric influences. Combinatorica, 3, 119–121.
Rojas, R. (1996). Neural networks: A systematic introduction. Berlin: Springer-Verlag.
Sheu, B., Lee, B., & Chang, C.-F. (1991). Hardware annealing for fast retrieval of optimal solutions in Hopfield neural networks. In Proceedings of the IEEE International Joint Conference on Neural Networks (Vol. II, pp. 327–332).
Siegelmann, H. T. (1999). Neural networks and analog computation: Beyond the Turing limit. Boston: Birkhäuser.
Siegelmann, H. T., & Sontag, E. D. (1994). Analog computation via neural networks. Theoretical Computer Science, 131, 331–360.
Siegelmann, H. T., & Sontag, E. D. (1995). Computational power of neural networks. Journal of Computer and System Sciences, 50, 132–150.
Siu, K.-Y., Roychowdhury, V., & Kailath, T. (1995). Discrete neural computation: A theoretical foundation. Englewood Cliffs, NJ: Prentice Hall.
Šíma, J. (1995). Hopfield languages. In Proceedings of the 22nd SOFSEM Seminar on Current Trends in Theory and Practice of Informatics, Lecture Notes in Comp. Sci. 1012 (pp. 461–468). Berlin: Springer-Verlag.
Šíma, J. (1997). Analog stable simulation of discrete neural networks. Neural Network World, 7, 679–686.
Šíma, J., & Orponen, P. (2000). A continuous-time Hopfield net simulation of discrete neural networks. In Proceedings of NC'2000: Second International ICSC Symposium on Neural Computing (pp. 36–42). Wetaskiwin, Canada: ICSC Academic Press.
Šíma, J., Orponen, P., & Antti-Poika, T. (1999). Some afterthoughts on Hopfield networks. In Proceedings of the 26th SOFSEM Seminar on Current Trends in Theory and Practice of Informatics, Lecture Notes in Comp. Sci. 1725 (pp. 459–469). Berlin: Springer-Verlag.
Šíma, J., & Wiedermann, J. (1998). Theory of neuromata. Journal of the ACM, 45, 155–178.
Wegener, I. (1987). The complexity of Boolean functions. Chichester: Wiley/Teubner.
Wiedermann, J. (1994). Complexity issues in discrete neurocomputing. Neural Network World, 4, 99–119.
Wu, A., & Tam, P. K. S. (1999). A neural network methodology of quadratic optimization. International Journal of Neural Systems, 9, 87–93.

Received September 20, 1999; accepted February 28, 2000.